Full text search with Sphinx and MySQL InnoDB

First get sphinx download package. Then get some other packages we may need:

apt-get install libmysqlclient-dev
apt-get install g++
apt-get install libexpat-dev

Config, make, install

./configure --with-mysql
make
make install

All the results should be in ./src - you will find 3 executable files that will be needed, they are:

* indexer
* search
* searchd

I did this in /opt/sphinx (Note: I wasn’t too certain I would continue to use Sphinx when I put it on my system I didn’t do anything very permanent or integrated- hence installing under /opt/). Next we will need to add a config file- I will add a file called sphinx.conf to “/opt/sphinx/src/sphinx.conf”.

source src1
{
    type  = mysql

    sql_host=localhost
    sql_port=3306
    sql_db=REDACTED
    sql_user=REDACTED
    sql_pass=REDACTED

    sql_query_range = SELECT MIN(id),MAX(id) FROM table_name_here
    sql_range_step = 1000
    sql_query = SELECT * FROM table_name_here WHERE id>=$start AND id<=$end

}

# Define the index.
index members
{
    # Which source to use
    source = src1

    # Path where to store the index data
    path = /opt/sphinx/var/data/members

    # Charset of the data
    charset_type = utf-8

    # Minimum lenght of a word to be indexed.
    min_word_len = 3

}

# Indexer definition
indexer
{
    # Memory limit for the indexer
    mem_limit = 32M
}

# Searchd settings
searchd
{
    # Port to listen on
    listen = 3312

    # Next few are the paths to log files
    log           = /opt/sphinx/var/log/searchd.log
    query_log = /opt/sphinx/var/log/query.log

    # Maximum amount of concurrent searches to run - 0 for unlimited
    max_children = 30

    # Path to pid file
    pid_file = /opt/sphinx/var/log/searchd.pid
}

We will add a few more directories for the log file and test file:

mkdir /opt/sphinx/var
mkdir /opt/sphinx/var/log
mkdir /opt/sphinx/var/data
mkdir /opt/sphinx-php

Sphinx api docs are located here.

The next problem to solve is the issue of an ever-growing database. We cannot re-index a huge db every time now can we? For this we must use the “Delta Index Updates” feature.

The steps are:

Create a sphinx counter table (so we can keep track of our last id updated).
You will need to add two sources- a main and a delta.
In the main you will pull all the records less than the max and update the record of the max id
In the delta you will pull all the records greater than the max id, then for your post query you should re-update with new max ID.

CREATE TABLE sph_counter
(
    counter_id INTEGER PRIMARY KEY NOT NULL,
    max_doc_id INTEGER NOT NULL
);

source main
{
    # We will use the xmlpipe2 datasource
    type  = mysql

    # Command which should be executed to get the xml. The following php script outputs the required xml to stdout
    #xmlpipe_command  = php /opt/sphinx-php/makeindex.php
    sql_host=localhost
    sql_port=3306
    sql_db=REDACTED
    sql_user=REDACTED
    sql_pass=REDACTED

    #sql_query_range = SELECT 3500000,MAX(id) FROM table_name_here
    sql_range_step = 1000
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM table_name_here
    #sql_query = SELECT * FROM table_name_here WHERE id >= $start AND id<=$end
    sql_query = SELECT * FROM table_name_here WHERE id <= (select max_doc_id FROM sph_counter where counter_id=1)

}

source maindelta : main
{
        sql_query_pre = SET NAMES utf8
        sql_query = SELECT * from table_name_here where id > (select max_doc_id FROM sph_counter WHERE counter_id=1)
}



# Define the index.
index members
{
    # Which source to use
    source = main

    # Path where to store the index data
    path = /opt/sphinx/var/data/members

    # Charset of the data
    charset_type = utf-8

    # Minimum lenght of a word to be indexed.
    min_word_len = 3

}

index membersdelta
{
        source = maindelta
        path = /opt/sphinx/var/data/membersdelta
        charset_type = utf-8
        min_word_len = 3
}


# Indexer definition
indexer
{
    # Memory limit for the indexer
    mem_limit = 32M
}

# Searchd settings
searchd
{
    # Port to listen on
    listen = 3312

    # Next few are the paths to log files
    log           = /opt/sphinx/var/log/searchd.log
    query_log = /opt/sphinx/var/log/query.log

    # Maximum amount of concurrent searches to run - 0 for unlimited
    max_children = 30

    # Path to pid file
    pid_file = /opt/sphinx/var/log/searchd.pid
}

Notice how we do “source maindelta: main”? This causes us to inherit settings from the main source. Next we will go over the relevant index commands. First we have the big one- index all:

# it seems that this command re-indexes all of the sources every time.
./indexer --all

DO this when searchd is running.

./indexer --rotate

Once we have both indexes built we may want to merge them:

./indexer --merge members membersdelta

Now lets do a test search- I have indexed about a gigabyte of chat records from a popular bitcoin-related website.

./search bitcoin

displaying matches:
document=14559545, weight=1617
document=23588637, weight=1612
document=25349448, weight=1607
document=25349548, weight=1607
document=25349655, weight=1607
document=28222881, weight=1607
document=2946423, weight=1604
document=8347522, weight=1604
document=21633204, weight=1604
document=21633278, weight=1604
document=21781013, weight=1604
document=23107998, weight=1604
document=23129561, weight=1604
document=23129608, weight=1604
document=32941392, weight=1604
document=35398549, weight=1604
document=1394556, weight=1599
document=4642739, weight=1599
document=6456179, weight=1599
document=7595116, weight=1599

words:
'bitcoin': 84622 documents, 91520 hits

index 'membersdelta': query 'bitcoin ': returned 0 matches of 0 total in 0.000 sec

words:
'bitcoin': 0 documents, 0 hits

Excellent so basic search is working. What next? We need to get the indexer pulling in our changes. I will start ./searchd and then run the indexer only on our delta.

./searchd

#user --rotate when searchd has a lock on the index.
./indexer --rotate membersdelta

Next I will put the deltaindexer on a 5 minute cron job:

# make a sym link for config, it has a default location:
ln -s /opt/sphinx/src/sphinx.conf /usr/local/etc/sphinx.conf

#put in cron:
*/5 * * * * /opt/sphinx/src/indexer --rotate membersdelta > /tmp/cronlog.txt

In the next post I will show you a PHP script to interact with Sphinx + MySQL to get textual searching on my large InnoDB database.

Written on January 22, 2014