Sphinx is a free, dual-licensed search server. Sphinx is written in C++, and focuses on query performance and search relevance.
The primary client API is currently SphinxQL, a dialect of SQL. Almost any MySQL connector should work. Additionally, basic HTTP/JSON API and native APIs for a number of languages (PHP, Python, Ruby, C, Java) are provided.
This document is an effort to build better documentation for Sphinx v.3.x and up. Think of it as a book or a tutorial which you could actually read; think of the previous “reference manual” as a “dictionary” where you look up specific syntax features. The two might (and should) eventually converge.
Top-level picture: what does Sphinx offer?
Other things that seem worth mentioning (this list is probably incomplete at all times, and definitely in random order):
- tokenization settings (charset_table and exceptions)
- morphology and stemming (morphology)
- wordforms and keyword remapping (core 2 duo => c2d)

And, of course, there is always stuff that we know we currently lack!
This section is supposed to provide a bit more detail on all the available features; to cover them more or less fully; and give you some further pointers into the specific reference sections (on the related config directives and SphinxQL statements).
Full-text searches are done via the SELECT ... WHERE MATCH('this') SphinxQL statement, and the full-text query syntax supports a number of operators:
- boolean operators: (one two) | (three !four)
- boolean optimizations, with OPTION boolean_simplify=1 in the SELECT statement
- field limits: @title hello world, negated @!title hello, multiple fields @(title,body) any of the two etc
- field position limits: @title[50] hello
- optional keywords: cat MAYBE dog
- phrases: "roses are red"
- quorums: "pick any 3 keywords out of this entire set"/3
- proximity: "within 10 positions all terms in yoda order"~10, or hello NEAR/3 world NEAR/4 "my test"
- strict order: (bag of words) << "exact phrase" << this|that
- sentence matching: all SENTENCE words SENTENCE "in one sentence"
- paragraph matching: "Bill Gates" PARAGRAPH "Steve Jobs"
- zones: ZONE:(h3,h4) in any of these title tags, and ZONESPAN:(h2) only in a single instance
- exact form keywords: raining =cats and =dogs
- field start/end markers: ^hello world$
- keyword boosts: boosted^1.234
- prefix and infix wildcards, enabled with the min_prefix_len and min_infix_len directives: th?se three keyword% wild*cards *verywher* (? = 1 char exactly; % = 0 or 1 char; * = 0 or more chars)

TODO: describe more, add links!
That should now be rather simple. No magic installation required! On any platform, the sufficient thing to do is:
Download and unpack the archive, then run searchd.

See more details on that (running in config-less mode) just below.
Or alternatively, you can ETL your data offline, using the indexer tool:
1. Create a sphinx.conf config file, with at least 1 index section.
2. Run indexer --all once, to initially create the indexes.
3. Run searchd.
4. Run indexer --all --rotate regularly, to update the indexes.

Note that instead of inserting the data into indexes online, the indexer tool creates a shadow copy of the specified index(es) offline, and then sends a signal to searchd to pick up that copy. So you should never get a partially populated index with indexer; it’s always all-or-nothing.
More details on running with configs are also below, refer to the “Writing your first config” section.
Versions and file names will vary, and you most likely will want to configure Sphinx at least a little, but for an immediate quickstart:
$ wget -q http://sphinxsearch.com/files/sphinx-3.0.2-2592786-linux-amd64.tar.gz
$ tar zxf sphinx-3.0.2-2592786-linux-amd64.tar.gz
$ cd sphinx-3.0.2-2592786-linux-amd64/bin
$ ./searchd
Sphinx 3.0.2 (commit 2592786)
Copyright (c) 2001-2018, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
listening on all interfaces, port=9312
listening on all interfaces, port=9306
WARNING: No extra index definitions found in data folder
$

That’s it; the daemon should now be running and accepting connections on port 9306. And you can connect to it using the MySQL CLI (see below for more details, or just try mysql -P9306 right away).
Pretty much the same story, except that on Windows searchd will not automatically go into background:
C:\Sphinx\>searchd.exe
Sphinx 3.0-dev (c3c241f)
...
accepting connections
prereaded 0 indexes in 0.000 sec
This is alright. Do not kill it. Just switch to a separate session and start querying.
Run the MySQL CLI and point it to port 9306. For example on Windows:
C:\>mysql -h127.0.0.1 -P9306
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 3.0-dev (c3c241f)
...
I have intentionally used 127.0.0.1 in this example for two reasons (both caused by MySQL CLI quirks, not Sphinx):
- MySQL CLI may try to connect over a Unix socket (and ignore the -P9306 switch) when given localhost, so we use 127.0.0.1, not localhost;
- localhost works, but causes a connection delay.

But in the simplest case even just mysql -P9306 should work fine.
And from there, just run SphinxQL queries:
mysql> CREATE TABLE test (gid integer, title field stored,
-> content field stored);
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO test (id, title) VALUES (123, 'hello world');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO test (id, gid, content) VALUES (234, 345, 'empty title');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT * FROM test;
+------+------+-------------+-------------+
| id | gid | title | content |
+------+------+-------------+-------------+
| 123 | 0 | hello world | |
| 234 | 345 | | empty title |
+------+------+-------------+-------------+
2 rows in set (0.00 sec)
mysql> SELECT * FROM test WHERE MATCH('hello');
+------+------+-------------+---------+
| id | gid | title | content |
+------+------+-------------+---------+
| 123 | 0 | hello world | |
+------+------+-------------+---------+
1 row in set (0.00 sec)
mysql> SELECT * FROM test WHERE MATCH('@content hello');
Empty set (0.00 sec)

SphinxQL is our own SQL dialect, described in more detail in the respective SphinxQL Reference section.
All those easy examples above were utilizing a config-less mode, where searchd stores and manages all the data and settings in the ./sphinxdata data folder, and you have to manage everything via searchd itself.
However, because as of v.3.x the config-less mode is not yet complete (certain settings and tools are still missing), many instances would likely still want to specify their indexes and settings via ye good olde sphinx.conf config file.
You can begin with the etc/sphinx-min.conf.dist example as a starting point. That example shows how to use both RT indexes that you would populate on the fly, and plain indexes that you would populate from your database. For the latter, there also is a tiny matching etc/example.sql MySQL dump with a few sample tables and rows that the test1 index from sphinx-min.conf.dist would access.
However, even that small sample is not really minimal. The minimal config only requires you to specify 2 paths for log and pid files, and define 1 index:
searchd
{
log = ./data/searchd.log
pid_file = ./data/searchd.pid
}
index mydocs
{
type = rt
path = ./data/mydocs
rt_field = title
rt_attr_json = j
}

And with just that, you can start searchd in config mode and start firing queries right away. (Do not forget to create the data directory though.)
$ mkdir data
$ ./searchd
Sphinx 3.3.1-dev (commit 00642f6c)
Copyright (c) 2001-2020, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file './sphinx.conf'...
listening on all interfaces, port=9306
precaching index 'mydocs'
precached 1 indexes using 1 threads in 0.0 sec

$ mysql -h127.0.0.1 -P9306
mysql> insert into mydocs values(111, 'hello world', '{"foo":"bar"}');
Query OK, 1 row affected (0.01 sec)
mysql> select * from mydocs where match('hello');
+------+---------------+
| id | j |
+------+---------------+
| 111 | {"foo":"bar"} |
+------+---------------+
1 row in set (0.00 sec)

<?php
$conn = mysqli_connect("127.0.0.1:9306", "", "", "");
if (mysqli_connect_errno())
die("failed to connect to Sphinx: " . mysqli_connect_error());
$res = mysqli_query($conn, "SHOW VARIABLES");
while ($row = mysqli_fetch_row($res))
print "$row[0]: $row[1]\n";
TODO: examples
indexer SQL drivers on Linux

This only affects the indexer ETL tool. If you never bulk load data from SQL sources using it (of course CSV and XML sources are still fine), you can safely skip this section. (And on Windows all the drivers come with the package.)
Depending on your OS, the required package names may vary. Here are some current (as of Mar 2018) package names for Ubuntu and CentOS:
ubuntu$ apt-get install libmysqlclient-dev libpq-dev unixodbc-dev
ubuntu$ apt-get install libmariadb-client-lgpl-dev-compat
centos$ yum install mariadb-devel postgresql-devel unixODBC-devel

Why might these be needed, and how do they work?
indexer natively supports MySQL (or MariaDB), PostgreSQL, and UnixODBC drivers. Meaning it can natively connect to those databases, run SQL queries, extract results, and create full-text indexes from that. Binaries now always come with that support enabled.
However, you still need to have a specific driver library installed on your system, so that indexer could dynamically load it, and access the database. Depending on the specific database and OS you use, the package names might be different, but here are a few common pointers.
The driver libraries are loaded by name. The following names are attempted:
- libmysqlclient.so or libmariadb.so for MySQL and MariaDB;
- libpq.so for PostgreSQL;
- libodbc.so for ODBC sources.

To support MacOS, the .dylib extension (in addition to .so) is also tried.
Alas, many projects tend to reinvent their own dictionary, and Sphinx is no exception. Sometimes that probably creates confusion for no apparent reason. For one, what SQL guys call “tables” (or even “relations” if they are old enough to remember Edgar Codd), and MongoDB guys call “collections”, we the text search guys tend to call “indexes”, and not really out of mischief and malice either, but just because for us, those things are primarily FT (full-text) indexes. Thankfully, most of the concepts are close enough, so our personal little Sphinx dictionary is tiny. Let’s see.
Short cheat sheet!
| Sphinx | Closest SQL equivalent |
|---|---|
| Index | Table |
| Document | Row |
| Field or attribute | Column and/or a FULLTEXT index |
| Indexed field | Just a FULLTEXT index on a text column |
| Stored field | Text column and a FULLTEXT index on it |
| Attribute | Column |
| MVA | Column with an INT_SET type |
| JSON attribute | Column with a JSON type |
| Attribute index | Index |
| Document ID, docid | Column called “id”, with a BIGINT type |
| Row ID, rowid | Internal Sphinx row number |
And now for a little more elaborate explanation.
Sphinx indexes are semi-structured collections of documents. They may seem closer to SQL tables than to Mongo collections, but in their core, they really are neither. The primary, foundational data structure here is a full-text index. It is a special structure that lets us respond very quickly to a query like “give me the (internal) identifiers of all the documents that mention This or That keyword”. And everything else (any extra attributes, or document storage, or even the SQL or HTTP querying dialects, and so on) that Sphinx provides is essentially some kind of an addition on top of that base data structure. Well, hence the “index” name.
Schema-wise, Sphinx indexes try to combine the best of schemaful and schemaless worlds. For “columns” where you know the type upfront, you can use the statically typed attributes, and get the absolute efficiency. For more dynamic data, you can put it all into a JSON attribute, and still get quite decent performance.
So in a sense, Sphinx indexes == SQL tables, except (a) full-text searches are fast and come with a lot of full-text-search specific tweaking options; (b) JSON “columns” (attributes) are quite natively supported, so you can go schemaless; and (c) for full-text indexed fields, you can choose to store just the full-text index and ditch the original values.
Documents are essentially just a list of named text fields, and arbitrary-typed attributes. Quite similar to SQL rows; almost indistinguishable, actually.
As of v.3.0.1, Sphinx still requires a unique id attribute, and implicitly injects an id BIGINT column into indexes (as you probably noticed in the Getting started section). We still use those docids to identify specific rows in DELETE and other statements. However, unlike in v.2.x, we no longer use docids to identify documents internally. Thus, zero and negative docids are already allowed.
Fields are the texts that Sphinx indexes and makes keyword-searchable. They always are indexed, as in full-text indexed. Their original, unindexed contents can also be stored into the index for later retrieval. By default, they are not, and Sphinx is going to return attributes only, and not the contents. However, if you explicitly mark them as stored (either with a stored flag in CREATE TABLE or in the ETL config file using stored_fields directive), you can also fetch the fields back:
mysql> CREATE TABLE test1 (title field);
mysql> INSERT INTO test1 VALUES (123, 'hello');
mysql> SELECT * FROM test1 WHERE MATCH('hello');
+------+
| id |
+------+
| 123 |
+------+
1 row in set (0.00 sec)
mysql> CREATE TABLE test2 (title field stored);
mysql> INSERT INTO test2 VALUES (123, 'hello');
mysql> SELECT * FROM test2 WHERE MATCH('hello');
+------+-------+
| id | title |
+------+-------+
| 123 | hello |
+------+-------+
1 row in set (0.00 sec)

Stored field contents are kept in a special index component called document storage, or DocStore for short.
Sphinx supports the following attribute types: uint, bigint, bool, float, timestamp, string, json, and the multi-value multi and multi_64 types. (Fixed-size arrays are also available; see the Arrays section below.)
All of these should be pretty straightforward. However, there are a couple Sphinx specific JSON performance tricks worth mentioning:
- Sphinx auto-detects consistently typed JSON arrays, and stores them in an optimized packed format;
- there is a 123.45f syntax extension to mark 32-bit floats (by default all floating point values in JSON are 64-bit doubles).

For example, when the following document is stored into a JSON column in Sphinx:
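{"title":"test one", "year":2017, "tags":[101,102,103,104,105,106]}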
Sphinx detects that the “tags” array consists of integers only, and stores the array data using 24 bytes exactly, using just 4 bytes per each of the 6 values. Of course, there still are the overheads of storing the JSON keys, and the general document structure, so the entire document will take more than that. Still, when it comes to storing bulk data into a Sphinx index for later use, just provide a consistently typed JSON array, and that data will be stored - and processed! - with maximum efficiency.
Attributes are supposed to fit into RAM, and Sphinx is optimized towards that case. Ideally, of course, all your index data should fit into RAM, while being backed by a fast enough SSD for persistence.
Now, there are fixed-width and variable-width attributes among the supported types. Naturally, scalars like INTEGER and FLOAT will always occupy exactly 4 bytes each, while STRING and JSON types can be as short as, well, empty; or as long as several megabytes. How does that work internally? Or in other words, why don’t I just save everything as JSON?
The answer is performance. Internally, Sphinx has two separate storages for those row parts. Fixed-width attributes, including hidden system ones, are essentially stored in a big static NxM matrix, where N is the number of rows, and M is the number of fixed-width attributes. Any accesses to those are very quick. All the variable-width attributes for a single row are grouped together, and stored in a separate storage. A single offset into that second storage (or “vrow” storage, short for “variable-width row part” storage) is stored as a hidden fixed-width attribute. Thus, as you see, accessing a string or a JSON or an MVA value, let alone a JSON key, is somewhat more complicated. For example, to access that year JSON key from the example just above, Sphinx would need to:
1. read the vrow_offset from a hidden integer attribute;
2. locate the row’s vrow data, and find the year key start;
3. decode the year value.

Of course, optimizations are done on every step here, but still, if you access a lot of those values (for sorting or filtering the query results), there will be a performance impact. Also, the deeper the key is buried into that JSON, the worse. For example, using a tiny test with 1,000,000 rows and just 4 integer attributes plus exactly the same 4 values stored in a JSON, computing a sum yields the following:
| Attribute | Time | Slowdown |
|---|---|---|
| Any INT | 0.032 sec | - |
| 1st JSON key | 0.045 sec | 1.4x |
| 2nd JSON key | 0.052 sec | 1.6x |
| 3rd JSON key | 0.059 sec | 1.8x |
| 4th JSON key | 0.065 sec | 2.0x |
And with more attributes it would eventually slow down even more than 2x, especially if we also throw in more complicated attributes, like strings or nested objects.
So bottom line, why not JSON everything? As long as your queries only touch a handful of rows each, that is fine, actually! However, if you have a lot of data, you should try to identify some of the “busiest” columns for your queries, and store them as “regular” typed columns; that somewhat improves performance.
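For instance, a minimal sketch of that approach (index and column names here are just samples): a frequently filtered year value gets its own typed attribute, while the rest of the document stays schemaless in JSON.

index mydocs
{
    type = rt
    path = ./data/mydocs
    rt_field = title
    rt_attr_uint = year # the “busy” column, promoted to a typed attribute
    rt_attr_json = j # everything else stays in JSON
}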
Storing fields into your indexes is easy, just list those fields in a stored_fields directive and you’re all set:
index mytest
{
type = rt
path = data/mytest
rt_field = title
rt_field = content
stored_fields = title, content
# hl_fields = title, content
rt_attr_uint = gid
}
Let’s check how that worked:
mysql> desc mytest;
+---------+--------+-----------------+------+
| Field | Type | Properties | Key |
+---------+--------+-----------------+------+
| id | bigint | | |
| title | field | indexed, stored | |
| content | field | indexed, stored | |
| gid | uint | | |
+---------+--------+-----------------+------+
4 rows in set (0.00 sec)
mysql> insert into mytest (id, title) values (123, 'hello world');
Query OK, 1 row affected (0.00 sec)
mysql> select * from mytest where match('hello');
+------+------+-------------+---------+
| id | gid | title | content |
+------+------+-------------+---------+
| 123 | 0 | hello world | |
+------+------+-------------+---------+
1 row in set (0.00 sec)

Yay, original document contents! Not a huge step generally, not for a database anyway; but a nice improvement for Sphinx, which was initially designed “for searching only” (oh, the mistakes of youth). And DocStore can do more than that, namely:
- store original field contents, with the stored_fields directive;
- store extra fields without full-text indexing them, with the stored_only_fields directive;
- store precomputed highlighting data, with the hl_fields directive;
- fine-tune the storage, with the docstore_type, docstore_comp, and docstore_block directives.

So DocStore can effectively replace the existing rt_attr_string directive. What are the differences, and when to use each?
rt_attr_string creates an attribute, uncompressed, and stored in RAM. Attributes are supposed to be small, and suitable for filtering (WHERE), sorting (ORDER BY), and other operations like that, by the millions. So if you really need to run queries like … WHERE title=‘abc’, or in case you want to update those strings on the fly, you will still need attributes.
But complete original document contents are rather rarely accessed in that way! Instead, you usually need just a handful of those, in the order of 10s to 100s, to have them displayed in the final search results, and/or create snippets. DocStore is designed exactly for that. It compresses all the data it receives (by default), and tries to keep most of the resulting “archive” on disk, only fetching a few documents at a time, in the very end.
Snippets become pretty interesting with DocStore. You can generate snippets from either specific stored fields, or the entire document, or a subdocument, respectively:
SELECT id, SNIPPET(title, QUERY()) FROM mytest WHERE MATCH('hello')
SELECT id, SNIPPET(DOCUMENT(), QUERY()) FROM mytest WHERE MATCH('hello')
SELECT id, SNIPPET(DOCUMENT({title}), QUERY()) FROM mytest WHERE MATCH('hello')

Using hl_fields can accelerate highlighting where possible, sometimes making snippets several times faster. If your documents are big enough (as in, a little bigger than tweets), try it! Without hl_fields, the SNIPPET() function will have to reparse the document contents every time. With it, the parsed representation is compressed and stored into the index upfront, trading off a not-insignificant amount of CPU work for more disk space, and a few extra disk reads.
And speaking of disk space vs CPU tradeoff, these tweaking knobs let you fine-tune DocStore for specific indexes:
- docstore_type = vblock_solid (default) groups small documents into a single compressed block, up to a given size limit: better compression, slower access;
- docstore_type = vblock stores every document separately: worse compression, faster access;
- docstore_block = 16k (default) lets you tweak the block size limit;
- docstore_comp = lz4hc (default) uses the LZ4HC algorithm for compression: better compression, but slower;
- docstore_comp = lz4 uses the LZ4 algorithm: worse compression, but faster;
- docstore_comp = none disables compression.

Quick kickoff: we now have the CREATE INDEX statement, and sometimes it does make your queries faster!
CREATE INDEX i1 ON mytest(group_id)
DESC mytest
SELECT * FROM mytest WHERE group_id=1
SELECT * FROM mytest WHERE group_id BETWEEN 10 and 20
SELECT * FROM mytest WHERE MATCH('hello world') AND group_id=23
DROP INDEX i1 ON mytest

Point reads, range reads, and intersections between MATCH() and index reads are all intended to work. Moreover, GEODIST() can also automatically use indexes (see more below). One of the goals is to completely eliminate the need to insert “fake keywords” into your index. (Also, it’s possible to update attribute indexes on the fly, as opposed to indexed text.)
Indexes on JSON keys should also work, but you might need to cast them to a specific type when creating the index:
CREATE INDEX j1 ON mytest(j.group_id)
CREATE INDEX j2 ON mytest(INTEGER(j.year))
CREATE INDEX j3 ON mytest(FLOAT(j.latitude))

As of v.3.0, attribute indexes can only be created on RT indexes. However, you can (almost) instantly convert your plain indexes to RT by using ATTACH ... WITH TRUNCATE, and run CREATE INDEX after that, along these lines (index names here are placeholders):
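ATTACH INDEX myplain TO RTINDEX myrt WITH TRUNCATE
CREATE INDEX i1 ON myrt(group_id)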
Geosearches with GEODIST() can also benefit quite a lot from attribute indexes. They can automatically compute a bounding box (or boxes) around a static reference point, and then process only a fraction of data using index reads. Refer to Geosearches section for more details.
TODO: describe more!
Query optimizer is the mechanism that decides, on a per-query basis, whether to use or to ignore specific indexes to compute the current query.
The optimizer can usually choose any combination of any applicable indexes. The specific index combination gets chosen based on cost estimates. Curiously, that choice is not exactly completely obvious even when we have just 2 indexes.
For instance, assume that we are doing a geosearch, filtering on a pair of lat/lon columns, something like this:
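SELECT * FROM mytest
WHERE lat BETWEEN 55.5 AND 56.0
AND lon BETWEEN 37.3 AND 37.9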
Assume that we have indexes on both lat and lon columns, and can use them. More, we can get an exact final result set out of that index pair, without any extra checks needed. But should we? Instead of using both indexes it is actually sometimes more efficient to use just one! Because with 2 indexes, we have to:
1. perform a lat range index read, get X lat candidate rowids;
2. perform a lon range index read, get Y lon candidate rowids;
3. intersect the two candidate lists, and look up the resulting rows.

While when using 1 index on lat we only have to:
1. perform a lat range index read, get X lat candidate rowids;
2. look up those rows, and check the lon range, get N matching rows.

Now, lat and lon frequently are somewhat correlated. Meaning that X, Y, and N values can all be pretty close. For example, let’s assume we have 11K matches in that specific latitude range, 12K matches in longitude range, and 10K final matches, ie. X = 11000, Y = 12000, N = 10000. Then using just 1 index means that we can avoid reading 12K lon rowids and then intersecting 23K rowids, introducing, however, 2K extra row lookups and 12K lon checks instead. Guess what, row lookups and extra checks are actually cheaper operations, and we are doing less of them. So with a few quick estimates, using only 1 index out of 2 applicable ones suddenly looks like a better bet. That can indeed be confirmed on real queries, too.
And that’s exactly how the optimizer works. Basically, it checks multiple possible index combinations, tries to estimate the associated query costs, and then picks the best one it finds.
However, the number of possible combinations grows explosively with the attribute index count. Consider a rather crazy (but possible) case with as many as 20 applicable indexes. That means more than 1 million possible “on/off” combinations. Even quick estimates for all of them would take too much time. There are internal limits in the optimizer to prevent that. Which in turn means that eventually some “ideal” index set might not get selected. (But, of course, that is a rare situation. Normally there are just a few applicable indexes, say from 1 to 10, so the optimizer can afford “brute forcing” up to 1024 possible index combinations, and does so.)
Now, perhaps even worse, both the count and cost estimates are just that, ie. only estimates. They might be slightly off, or way off. The actual query costs might be somewhat different than estimated when we execute the query.
For those reasons, the optimizer might occasionally pick a suboptimal query plan. In that event, or perhaps just for testing purposes, you can tweak its behavior with SELECT hints, and make it forcibly use or ignore specific attribute indexes. For a reference on the exact syntax and behavior, refer to “Index hints clause”.
K-batches (“kill batches”) let you bulk delete older versions of the documents (rows) when bulk loading new data into Sphinx, for example, adding a new delta index on top of an older main archive index.
K-batches in Sphinx v.3.x replace k-lists (“kill lists”) from v.2.x and before. The major differences are that:
- k-batches are applied just once, when the new index gets loaded, rather than on every search;
- k-batches are not anonymous; the target indexes to delete rows from must be named explicitly.
“Not anonymous” means that when loading a new index with an associated k-batch into searchd, you now have to explicitly specify target indexes that it should delete the rows from. In other words, “deltas” now must explicitly specify all the “main” indexes that they want to erase old documents from, at index-time.
The effect of applying a k-batch is equivalent to running (just once) a bunch of DELETE FROM X WHERE id=Y queries, for every index X listed in kbatch directive, and every document id Y stored in the k-batch. With the index format updates this is now both possible, even in “plain” indexes, and quite efficient too.
K-batch only gets applied once. After a successful application to all the target indexes, the batch gets cleared.
So, for example, when you load an index called delta with the following settings:
index delta
{
...
sql_query_kbatch = SELECT 12 UNION SELECT 13 UNION SELECT 14
kbatch = main1, main2
}
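Given these sample settings, loading delta would be roughly equivalent to running:

DELETE FROM main1 WHERE id IN (12,13,14)
DELETE FROM main2 WHERE id IN (12,13,14)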
The following (normally) happens:
1. the delta kbatch file is loaded;
2. the stored docids get deleted from main1;
3. the stored docids get deleted from main2;
4. main1 and main2 save those deletions to disk;
5. the delta kbatch file is cleared.

All these operations are pretty fast, because deletions are now internally implemented using a bitmap. So deleting a given document by id results in a hash lookup and a bit flip. In plain speak, very quick.
“Loading” can happen via a restart, a rotation, or any other means; in every case, k-batches should still try to apply themselves.
Last but not least, you can also use kbatch_source to avoid explicitly storing all newly added document ids into a k-batch. Instead, you can use kbatch_source = kl, id or just kbatch_source = id; this will automatically add all the document ids from the index to its k-batch. The default value is kbatch_source = kl, that is, to use explicitly provided docids only.
TODO: describe rotations (legacy), RELOAD, ATTACH, etc.
For the most part using JSON in Sphinx should be very simple. You just put pretty much arbitrary JSON in a proper column (aka attribute). Then you just access the necessary keys using a col1.key1.subkey2.subkey3 syntax. Or, you access the array values using col1.key1[123] syntax. And that’s it.
For a literally 30-second kickoff, you can configure a test RT index like this:
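index jsontest
{
    type = rt
    path = ./data/jsontest
    rt_field = title
    rt_attr_json = j
}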
Then restart searchd or reload the config, and fire away a few test queries:
mysql> INSERT INTO jsontest (id, j) VALUES (1, '{"foo":"bar", "year":2019,
"arr":[1,2,3,"yarr"], "address":{"city":"Moscow", "country":"Russia"}}');
Query OK, 1 row affected (0.00 sec)
mysql> SELECT j.foo FROM jsontest;
+-------+
| j.foo |
+-------+
| bar |
+-------+
1 row in set (0.00 sec)
mysql> SELECT j.year+10, j.arr[3], j.address.city FROM jsontest;
+-----------+----------+----------------+
| j.year+10 | j.arr[3] | j.address.city |
+-----------+----------+----------------+
| 2029.0 | yarr | Moscow |
+-----------+----------+----------------+
1 row in set (0.00 sec)

However, sometimes that is not quite enough (mostly for performance reasons), and thus we have both several Sphinx-specific JSON syntax extensions, and several important internal implementation details to discuss, including a few Sphinx-specific limits. Briefly, those are as follows:
optimized scalar storage (for int8, int32, int64, bool, float, and NULL types)
optimized array storage (for int8, int32, int64, float, double, and string types)
optimized key name storage with key compression (optional, with json_packed_keys = 1 directive)
0.0f syntax extension for 32-bit float values
int8[] and float[] syntax extensions for 8-bit integer and 32-bit float arrays, respectively
Optimized storage means that usually Sphinx auto-detects the actual value types, both for standalone values and for arrays, and then uses the smallest storage type that works.
So when a 32-bit (4-byte) integer is enough for a numeric value, Sphinx would automatically store just that. If that overflows, no need to worry, Sphinx would just automatically switch to 8-byte integer values, or even double values (still 8-byte).
Ditto for arrays. When your arrays contain a mix of actual types, Sphinx handles that just fine, and stores a generic array, where every element has its own type attached to it. However, when your array only actually contains one very specific type (for example, regular 32-bit integers only), Sphinx auto-detects that fact, and stores that array in an optimized manner, using just 4 bytes per value, and skipping the repeated types. All the builtin functions support all such optimized array types, and have a special fast codepath to handle them, in a transparent fashion.
As of v.3.2, array value types that might get optimized that way are int8, int32, int64, float, double, and string. This covers pretty much all the usual numeric types, and therefore all you have to do to ensure that the optimizations kick in is, well, to only use one actual type in your data.
So everything is on autopilot, mostly. However, there are several exceptions to that rule that still require a tiny bit of effort from you!
First, there is a catch with float vs double types. The default storage type for floating point values in Sphinx is the high-precision, 64-bit double type, just as (kinda) specified by JSON.
So the regular {"scale":1.23} syntax would have Sphinx store an 8-byte double value. However, that just might be overkill precision for some (if not most!) tasks, and to save on storage, you might want to store your values using the 32-bit (4-byte) float type instead.
Unfortunately, at the moment there is no standard way to specify that, and so Sphinx offers a custom extension: attach f suffix to your value, ie. use {"scale":1.23f} syntax instead. That works in Sphinx, and that lets it know that a float type is enough.
Second, int8 arrays must be explicit. Even though Sphinx can auto-detect the fact that all your array values are integers in the -128 to 127 range, and can be stored efficiently using just 1 byte per value, it does not just make that assumption, and uses int32 type instead.
And this happens because there is no way for Sphinx to tell by looking at just those values whether you really wanted an optimized int8 vector, or the intent was to just have a placeholder (filled with either 0, or -1, or what have you) int32 vector for future updates. Given that JSON updates are currently in-place, at this decision point Sphinx chooses to go with the more conservative but flexible route, and store an int32 vector even for something that could be stored more efficiently, like [0, 0, 0, 0].
To force that vector into super-slim 1-byte values, you have to use a syntax extension, and use int8[0, 0, 0, 0] as your value.
Third, watch out for integer vs float mixes. The auto-detection happens on a per-value basis. Meaning that an array value like [1, 2, 3.0] will be marked as mixing two different types, int32 and double. So neither the int32 nor (worse) double array storage optimization can kick in for this particular array.
You can enforce any JSON-standard type on Sphinx here using regular JSON syntax. To store it as integers, you should simply get rid of that pesky dot that triggers doubles, and use [1, 2, 3] syntax. For doubles, on the contrary, the dot should be everywhere, ie. you should use [1.0, 2.0, 3.0] syntax.
Finally, for the non-standard float type extension, you can also use the f suffix, ie. [1.0f, 2.0f, 3.0f] syntax. But that might be inconvenient, so you can also use the float[1, 2, 3.0] syntax instead. Either of these two forms enables Sphinx to auto-convert your vector to nice and fast optimized floats.
That was all about the values though. What about the keys?
Normally, keys are stored as is. Meaning that if you have a superLongKey in (almost) every single document, that key will be stored as a plain old text string, and repeated as many times as there are documents. And all those repetitions would consume some RAM bytes. Flexible, but not really efficient.
So the rule of thumb is, super-long key names are, well, okay, but not really great. Just as with regular JSON. Of course, for smaller indexes the savings might just be negligible. But for bigger ones, you might want to consider shorter key names.
Alternatively, the packed JSON keys feature alleviates that automatically. It keeps track of the most frequently mentioned keys, and stores the top keys using just 1 byte per key value. You can enable it by setting the json_packed_keys = 1 flag in your index config (and rebuilding the index where necessary). That way, you can use key names as long as you want, and Sphinx will kinda “zip” the JSON internally.
Beware and benchmark, though, that currently there can occasionally be some associated performance impact here (even though not huge), both at indexing and at search time. (Otherwise, the packing would just be always on!)
Comparisons with JSON can be a little tricky when it comes to value types. Especially the numeric ones, because of all the integer vs float vs double jazz. (And, mind you, by default the floating-point values will be stored as double.) Briefly, beware that:
String comparisons are strict, and require the string type.
Meaning that WHERE j.str1='abc' check must only pass when all the following conditions are true: 1) str1 key exists; 2) str1 value type is exactly string; 3) the value matches.
Therefore, for a sudden integer value compared against a string constant, for example, {"str1":123} value against a WHERE j.str1='123' condition, the check will fail. As it should, there are no implicit conversions here.
Numeric comparisons against integers match any numeric type, not just integers.
Meaning that both {"key1":123} and {"key1":123.0} values must pass the WHERE j.key1=123 check. Again, as expected.
Numeric comparisons against floats forcibly convert double values to (single-precision) floats, and roundoff issues may arise.
Meaning that when you store something like {"key1":123.0000001} into your index, then the WHERE j.key1=123.0 check will pass, because roundoff to float loses that fractional part. However, at the same time the WHERE j.key1=123 check will not pass, because that check will use the original double value and compare it against the integer constant.
This might be a bit confusing, but otherwise (without roundoff) the situation would be arguably worse: in an even more counter-intuitive fashion, {"key1":2.22} does not pass the WHERE j.key1>=2.22 check, because the reference constant here is float(2.22), and then because of rounding, double(2.22) < float(2.22)!
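To make that roundoff behavior concrete, here is the same scenario as a quick query sequence (using the jsontest index from above; the comments describe the expected outcomes):

mysql> INSERT INTO jsontest (id, j) VALUES (2, '{"key1":123.0000001}');
mysql> SELECT id FROM jsontest WHERE j.key1=123.0; # matches: both sides get rounded to float
mysql> SELECT id FROM jsontest WHERE j.key1=123;   # no match: original double vs integer constant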
TODO: describe limits, json_xxx settings, our syntax extensions, etc.
Array attributes let you store a fixed number of integer or float values in your index. The supported types are:
- attr_int_array that stores signed 32-bit integers;
- attr_int8_array that stores signed 8-bit integers (-128 to 127 range);
- attr_float_array that stores 32-bit floats.

To declare an array attribute, use the following syntax:
{rt|sql|xmlpipe|csvpipe|tsvpipe}_attr_{int|int8|float}_array = NAME[SIZE]
Where NAME is the attribute name, and SIZE is the array size, in elements. For example:
index rt
{
type = rt
rt_field = title
rt_field = content
rt_attr_uint = gid # regular attribute
rt_attr_float_array = vec1[5] # 5D array of floats
rt_attr_int8_array = vec2[3] # 3D array of small 8-bit integers
# ...
}
source test1
{
type = mysql
sql_attr_int8_array = vec1[17] # 17D array of small 8-bit integers
# ...
}

The array dimensions must be between 2 and 8192, inclusive.
The array gets aligned to the nearest 4 bytes. This means that an int8_array with 17 elements will actually use 20 bytes for storage.
The expected input array value for both INSERT queries and source indexing is either a comma or space-separated string with the values, or an empty string:
INSERT INTO rt (id, vec1) VALUES (123, '3.14, -1, 2.718, 2019, 100500');
INSERT INTO rt (id, vec1) VALUES (124, '');

Empty strings will zero-fill the array. Non-empty strings are subject to strict validation. First, there must be exactly as many values as the array can hold. So you can not store 3 or 7 values into a 5-element array. Second, the value ranges are also validated. So you will not be able to store a value of 1000 into an int8_array, because it’s out of the -128..127 range.
Attempting to INSERT an invalid array value will fail. For example:
mysql> INSERT INTO rt (id, vec1) VALUES (200, '1 2 3');
ERROR 1064 (42000): bad array value
mysql> INSERT INTO rt (id, vec1) VALUES (200, '1 2 3 4 5 6');
ERROR 1064 (42000): bad array value
mysql> INSERT INTO rt (id, vec2) VALUES (200, '0, 1, 2345');
ERROR 1064 (42000): bad array value

However, when batch indexing with indexer, an invalid array value will be reported as a warning, and zero-fill the array, but it will not fail the entire indexing batch.
At the moment, the only function that supports arrays is DOT(), so you can compute a dot product between an array and a constant vector:
mysql> SELECT id, DOT(vec1,FVEC(1,2,3,4,5)) d FROM rt;
+------+--------------+
| id | d |
+------+--------------+
| 123 | 510585.28125 |
| 124 | 0 |
+------+--------------+
2 rows in set (0.00 sec)

Sphinx supports several different keyword remapping types (usually applied both when indexing and when searching), all currently called wordforms. The principal types are as follows.
Additionally, two of these types can be applied at indexing time only.
Wordforms are usually all listed in a separate text file (or a set of files), and added to the index with a wordforms directive. The wordforms file syntax for all these types is as follows.
geese => goose # type 1, morph-repl 1:1
core 2 duo => c2d # type 2, pre-morph M:N
e8400 => c2d e8400 # type 2, pre-morph M:N
~vehicle => car # type 3, post-morph 1:1
!foo => bar # type 4, doc-only morph-repl 1:1
!grrm => grrm george martin # type 5, doc-only pre-morph M:N
So generally wordforms are just two lists of keywords (source and destination mappings respectively), separated by the => special token.
Post-morphology wordforms are marked with ~ in the first column.
Document-only wordforms are marked with ! in the first column.
Comments are marked with #, and everything from # to the end of a line is considered a comment.
Briefly, all these different types work as follows.
Morphology-replacing wordforms effectively are the top-priority morphology alternative. They engage before any morphology, and they fully substitute any other morphology processing. So use them with extreme care (or better yet, avoid using them at all) when you have any other morphology processing enabled.
For instance, that geese => goose example above is utterly wrong when you use morphology = stem_en. No, it does properly normalize that plural geese to singular goose, so that part is correct. Problem is, the Porter stemmer does not normalize goose to itself, it will emit the goos stem instead. So in order to fix up the Porter stemmer specifically you would have to map to that stem, ie. use a geese => goos wordform. Extreme care.
Pre-morphology wordforms remap the source keywords before morphology, and then the destination keywords are additionally passed through the morphology. So, for example, six geese => 6 goose would not be subject to stem_en issue above, as both 6 and goose would also be stemmed.
At the moment only the M:N wordforms can be pre-morphology, not 1:1, so you have to have either multiple source keywords, or multiple destination keywords.
Post-morphology wordforms remap the source keywords after morphology. Note that at the moment (and rather counter-intuitively) the destination keywords will not be passed through morphology one more time, so the stem_en issue arises once again.
Document-only wordforms are only applied at indexing time, and never at query time. This is pretty useful for indexing time expansions, and that’s why that example with grrm maps it to itself too, and not just george martin.
In the “expansion” use case, they are more efficient when searching, compared to similar regular wordforms.
Indeed, when searching for a source mapping, regular wordforms would expand to all keywords (in our example, to all 3 keywords, grrm george martin), fetch and intersect them, and do all that work for… nothing! Because we can obtain exactly the same result much more efficiently by simply fetching just the source keywords (just grrm in our example). And that’s exactly how document-only wordforms work, they would just skip the expansion altogether.
And when searching for (a part of) a destination mapping, nothing would change. In that case both document-only and regular wordforms would just execute the query completely identically.
Bottom line, use document-only wordforms when you’re doing expansions, in order to avoid that unnecessary performance hit.
Sphinx supports User Defined Functions (or UDFs for short) that let you extend its expression engine:
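SELECT id, attr1, myudf(attr2, attr3+attr4) FROM myindex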
You can load and unload UDFs into searchd dynamically, ie. without having to restart the daemon itself, and then use them in most expressions when searching and ranking. Quick summary of the UDF features is as follows.
- UDFs can take most of the internal Sphinx types as arguments, including FACTORS(), ie. special blobs with ranking signals;
- UDFs can return numeric or string values.

UDFs have a wide variety of uses, for instance custom ranking formulas, non-trivial filtering expressions, or applying ML models (see batched calls below).
UDFs reside in the external dynamic libraries (.so files on UNIX and .dll on Windows systems). Library files need to reside in a trusted folder specified by plugin_dir directive, for obvious security reasons: securing a single folder is easy; letting anyone install arbitrary code into searchd is a risk. You can load and unload them dynamically into searchd with CREATE FUNCTION and DROP FUNCTION SphinxQL statements, respectively. Also, you can seamlessly reload UDFs (and other plugins) with RELOAD PLUGINS statement. Sphinx keeps track of the currently loaded functions, that is, every time you create or drop an UDF, searchd writes its state to the sphinxql_state file as a plain good old SQL script.
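For example, on Linux, building and loading the example library might go like this (paths here are illustrative):

$ gcc -fPIC -shared -I src -o udfexample.so src/udfexample.c
$ cp udfexample.so /usr/local/sphinx/plugins/ # must match plugin_dir
mysql> CREATE FUNCTION testfunc RETURNS BIGINT SONAME 'udfexample.so';
mysql> DROP FUNCTION testfunc;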
Once you successfully load an UDF, you can use it in your SELECT or other statements just as any of the builtin functions:
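SELECT id, testfunc(gid) FROM mytest WHERE MATCH('hello')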
Multiple UDFs (and other plugins) may reside in a single library. That library will only be loaded once. It gets automatically unloaded once all the UDFs and plugins from it are dropped.
Aggregation functions are not supported just yet. In other words, your UDFs will be called for just a single document at a time and are expected to return some value for that document. Writing a function that can compute an aggregate value like AVG() over the entire group of documents that share the same GROUP BY key is not yet possible. However, you can use UDFs within the builtin aggregate functions: that is, even though MYCUSTOMAVG() is not supported yet, AVG(MYCUSTOMFUNC()) should work alright!
UDFs are local. In order to use them on a cluster, you have to put the same library on all its nodes and run proper CREATE FUNCTION statements on all the nodes too. This might change in the future versions.
The UDF interface is plain C. So you would usually write your UDF in C or C++. (Even though in theory it might be possible to use other languages.)
Your very first starting point should be src/udfexample.c, our example UDF library. That library implements several different functions, to demonstrate how to use several different techniques (stateless and stateful UDFs, different argument types, batched calls, etc).
The files that provide the UDF interface are:
- src/sphinxudf.h that declares the essential types and helper functions;
- src/sphinxudf.c that implements those functions.

For UDFs that do not implement ranking, and therefore do not need to handle FACTORS() arguments, simply including the sphinxudf.h header is sufficient.
To be able to parse the FACTORS() blobs from your UDF, however, you will also need to compile and link with sphinxudf.c source file.
Both sphinxudf.h and sphinxudf.c are standalone. So you can copy around those files only. They do not depend on any other bits of Sphinx source code.
Within your UDF, you should literally implement and export just two functions.
First, you must define int LIBRARYNAME_ver() { return SPH_UDF_VERSION; } in order to implement UDF interface version control. LIBRARYNAME should be replaced with the name of your library. Here’s a more complete example:
#include <sphinxudf.h>
// our library will be called udfexample.so, so it must define
// a version function named udfexample_ver()
int udfexample_ver()
{
return SPH_UDF_VERSION;
}

This version checker protects you from accidentally loading libraries with mismatching UDF interface versions. (Which would in turn usually cause either incorrect behavior or crashes.)
Second, you must implement the actual function, too. For example:
sphinx_int64_t testfunc(SPH_UDF_INIT * init, SPH_UDF_ARGS * args,
char * error_message)
{
return 123;
}

UDF function names in SphinxQL are case insensitive. However, the respective C/C++ function names must be all lower-case, or the UDF will fail to load.
More importantly, it is vital that:
- your functions use the default C calling convention (__cdecl);
- the actual argument and return types match the ones declared in CREATE FUNCTION;

Unfortunately, there is no (easy) way for searchd to automatically check for those mistakes when loading the function, and they could crash the server and/or result in unexpected results.
Let’s discuss the simple testfunc() example in a bit more detail.
The first argument, a pointer to SPH_UDF_INIT structure, is essentially just a pointer to our function state. Using that state is optional. In this example, the function is stateless, it simply returns 123 every time it gets called. So we do not have to define an initialization function, and we can simply ignore that argument.
The second argument, a pointer to SPH_UDF_ARGS, is the most important one. All the actual call arguments are passed to your UDF via this structure. It contains the call argument count, names, types, etc. So whether your function gets called with simple constants, like this:
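SELECT id, testfunc(1) FROM test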
or with a bunch of subexpressions as its arguments, like this:
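SELECT id, testfunc(gid, id*2, 'hello') FROM test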
or anyhow else, it will receive the very same SPH_UDF_ARGS structure, in all of these cases. However, the data passed in the args structure can be a little different.
In the testfunc(1) call example, args->arg_count will be set to 1, because, naturally, we have just one argument. In the second example, arg_count will be equal to 3. Also, the args->arg_types array will contain different type data for these two calls. And so on.
Finally, the third argument, char * error_message, serves both as an error flag, and as a way to report a human-readable message (if any). UDFs should only raise that flag/message to indicate unrecoverable internal errors; ones that would prevent any subsequent attempts to evaluate that instance of the UDF call from continuing.
You must not use this flag for argument type checks, or for any other error reporting that is likely to happen during “normal” use. This flag is designed to report sudden critical runtime errors only, such as running out of memory.
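For instance, raising such an error from the main call might look like this (a sketch; as described in the calling sequence section below, error_message can be used either as a 1-byte flag or as a small string buffer):

if (!init->func_data) // e.g. our init-time allocation failed
{
    error_message[0] = 1; // either just raise the flag...
    // snprintf(error_message, 128, "testfunc: no state"); // ...or write a short message (buffer size assumed)
    return 0; // the returned value is not used once the error is raised
}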
If we need to, say, allocate temporary storage for our function to use, or check upfront whether the arguments are of the supported types, then we need to add two more functions, with UDF initialization and deinitialization, respectively.
int testfunc_init ( SPH_UDF_INIT * init, SPH_UDF_ARGS * args,
char * error_message )
{
// allocate and initialize a little bit of temporary storage
init->func_data = malloc(sizeof(int));
*(int*)init->func_data = 123;
// return a success code
return 0;
}
void testfunc_deinit(SPH_UDF_INIT * init)
{
// free up our temporary storage
free(init->func_data);
}

Note how testfunc_init() also receives the call arguments structure. At that point in time we do not yet have any actual per-row values though, so the args->arg_values will be NULL. But the argument names and types are already known, and will be passed. You can check them in the initialization function and return an error if they are of an unsupported type.
UDFs can receive arguments of pretty much any valid internal Sphinx type. When in doubt, refer to sphinx_udf_argtype enum in sphinxudf.h for a full list. For convenience, here’s a short reference table:
| UDF arg type | C/C++ type, and a short description | Len |
|---|---|---|
| UINT32 | uint32_t, unsigned 32-bit integer | - |
| INT64 | int64_t, signed 64-bit integer | - |
| FLOAT | float, single-precision (32-bit) IEEE754 float | - |
| STRING | char *, non-ASCIIZ string, with a separate length | Yes |
| UINT32SET | uint32_t *, sorted set of u32 integers | Yes |
| INT64SET | int64_t *, sorted set of i64 integers | Yes |
| FACTORS | void *, special blob with ranking signals | - |
| JSON | char *, JSON (sub)object or field in a string format | - |
| FLOAT_VEC | float *, an unsorted array of floats | Yes |
The Len column in this table means that the argument length is passed separately via args->str_lengths[i] in addition to the argument value args->arg_values[i] itself.
For STRING arguments, the length contains the string length, in bytes. For all other types, it contains the number of elements.
As for the return types, UDFs can currently return numeric or string values. The respective types are as follows:
| Sphinx type | Regular return type | Batched output arg type |
|---|---|---|
| INTEGER | sphinx_int64_t | int * |
| BIGINT | sphinx_int64_t | sphinx_int64_t * |
| FLOAT | double | float * |
| STRING | char * | - |
Batched calls are discussed below.
We still define our own sphinx_int64_t type in sphinxudf.h for clarity and convenience, but these days, any standard 64-bit integer type like int64_t or long long should also suffice, and can be safely used in your UDF code.
Also note that for STRING types you must use args->fn_malloc function to allocate the returned values.
Note how internally in your UDF you can use whatever allocator you want, so the testfunc_init() example above is correct even though it uses malloc() directly. You manage that pointer yourself, it gets freed up using a matching free() call, and all is well. However, the returned string values will be managed by Sphinx, and we have our own allocator. So for the return values specifically, you need to use it too.
Since v.3.3 Sphinx supports two types of the “main” UDF call with a numeric return type:
- regular calls, which compute a value for a single row at a time;
- batched calls, which compute values for a whole batch of rows at once.
These two types have different C/C++ signatures, for example:
/// regular call that RETURNS INTEGER
/// note the `sphinx_int64_t` ret type
sphinx_int64_t foo(SPH_UDF_INIT * init, SPH_UDF_ARGS * args,
char * error);
/// batched call that RETURNS INTEGER
/// note the `int *` out arg type
void foo_batch(SPH_UDF_INIT * init, SPH_UDF_ARGS * args,
int * results, int batch_size, char * error);

A UDF must define at least 1 of these two functions. As of v.3.3, a UDF can define both functions, but batched calls take priority. So when both foo_batch() and foo() are defined, the engine will only use foo_batch(), and completely ignore foo().
Batched calls are needed for performance. For instance, processing multiple documents at once with certain CatBoost ML models could be more than 5x faster.
As mentioned a little earlier, return types for batched calls differ from regular ones, again for performance reasons. So yes, the types in the example above are correct. The regular, single-row foo() call must use sphinx_int64_t for its return type, whether the function was created with RETURNS INTEGER or RETURNS BIGINT, for simplicity. However, the batched multi-row foo_batch() call must use an output buffer typed as int * when created with RETURNS INTEGER, or a buffer typed as sphinx_int64_t * when created with RETURNS BIGINT, just as mentioned in the types table earlier.
Current target batch size is 128, but that size may change in either direction in the future. Assume little about batch_size, and very definitely do not hardcode the current limit anywhere. (Say, it is reasonably safe to assume that batches will always be in 1 to 65536 range, though.)
The engine should accumulate matches up to the target size, so that most UDF calls receive complete batches. However, trailing batches will be sized arbitrarily. For example, for 397 matches there should be 4 calls to foo_batch(), with 128, 128, 128, and 13 matches per batch respectively.
Arguments (and their sizes where applicable) are stored into arg_values (and str_lengths) sequentially for every match in the batch. For example, you can access them as follows:
for (int row = 0; row < batch_size; row++)
for (int arg = 0; arg < args->arg_count; arg++)
{
int index = row * args->arg_count + arg;
use_arg(args->arg_values[index], args->str_lengths[index]);
}

A batched UDF must fill the entire results array with some sane default value, even if it decides to fail with an unrecoverable error in the middle of the batch. It must never return garbage results.
On error, the engine will stop calling the batched UDF for the rest of the current SELECT query (just as it does with regular UDFs), and automatically zero out the rest of the values. However, it is the UDF's responsibility to completely fill the failed batch anyway.
Batched calls are currently only supported for numeric UDFs, ie. functions that return INTEGER, BIGINT, or FLOAT; batching is not yet supported for STRING functions. That may change in the future.
FACTORS() in UDFs

Most of the types map straightforwardly to the respective C types. The most notable exception is the SPH_UDF_TYPE_FACTORS argument type. You get that type by passing a FACTORS() expression as an argument to your UDF. The value that the UDF will receive is a binary blob in a special internal format.
To extract individual ranking signals from that blob, you need to use either of the two sphinx_factors_XXX() or sphinx_get_YYY_factor() function families.
The first family consists of just 3 functions:
- sphinx_factors_init() that initializes the unpacked SPH_UDF_FACTORS structure;
- sphinx_factors_unpack() that unpacks a binary blob value into it;
- sphinx_factors_deinit() that cleans up and deallocates SPH_UDF_FACTORS.

So you need to call init() and unpack() first, then you can use the fields within the SPH_UDF_FACTORS structure, and then you have to call deinit() for cleanup. The resulting code would be rather simple, like this:
// init!
SPH_UDF_FACTORS F;
sphinx_factors_init(&F);
if (sphinx_factors_unpack((const unsigned int *)args->arg_values[0], &F))
{
sphinx_factors_deinit(&F); // no leaks please
return -1;
}
// process!
int result = F.field[3].hit_count;
// ... maybe more math here ...
// cleanup!
sphinx_factors_deinit(&F);
return result;

However, this access simplicity has an obvious drawback. It will cause several memory allocations per each processed document (made by init() and unpack() and later freed by deinit() respectively), which might be slow.
So there is another interface to access FACTORS() that consists of a bunch of sphinx_get_YYY_factor() functions. It is more verbose, but it accesses the blob data directly, and it guarantees zero allocations and zero copying. So for top-notch ranking UDF performance, you want that one. Here goes the matching example code that also accesses just 1 signal from just 1 field:
// init!
const unsigned int * F = (const unsigned int *)args->arg_values[0];
const unsigned int * field3 = sphinx_get_field_factors(F, 3);
// process!
int result = sphinx_get_field_factor_int(field3, SPH_FIELDF_HIT_COUNT);
// ... maybe more math here ...
// done! no cleanup needed
return result;

Depending on how your UDFs are used in the query, the main function call (testfunc() in our running example) might get called in a rather different volume and order. Specifically,
UDFs referenced in WHERE, ORDER BY, or GROUP BY clauses must and will be evaluated for every matched document. They will be called in the natural matching order.
without subselects, UDFs that can be evaluated at the very last stage over the final result set will be evaluated that way, but before applying the LIMIT clause. They will be called in the result set order.
with subselects, such UDFs will also be evaluated after applying the inner LIMIT clause.
The calling sequence of the other functions is fixed, though. Namely,
- testfunc_init() is called once when initializing the query. It can return a non-zero code to indicate a failure; in that case, the query gets terminated early, and the error message from the error_message buffer is returned.
- testfunc() or testfunc_batch() is called for every eligible row batch (see above), whenever Sphinx needs to compute the UDF value(s). This call can indicate an unrecoverable error by writing either a value of 1, or some human-readable message, to the error_message argument. (In other words, you can use error_message either as a boolean flag, or as a string buffer.)
- After getting a non-zero error_message from the main UDF call, the engine guarantees to stop calling that UDF for subsequent rows for the rest of the query. A default return value of 0 for numerics and an empty string for strings will be used instead. Sphinx might or might not choose to terminate such queries early; neither behavior is currently guaranteed.
- testfunc_deinit() is called once when the query processing (in a given index shard) ends. It is guaranteed to be called even if the main call reported an unrecoverable error earlier.
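For reference, here is a minimal sketch of that calling convention for an INTEGER-returning UDF, assuming the standard sphinxudf.h interface (export boilerplate and real computations omitted):

#include "sphinxudf.h"

// called once per query; return non-zero (and fill error_message) to fail it
int testfunc_init(SPH_UDF_INIT * init, SPH_UDF_ARGS * args, char * error_message)
{
    init->func_data = NULL; // any per-query state could be set up here
    return 0;
}

// the main per-row call; set error_flag on an unrecoverable error
sphinx_int64_t testfunc(SPH_UDF_INIT * init, SPH_UDF_ARGS * args, char * error_flag)
{
    return 123; // compute the actual value here
}

// called once when query processing (in a given index shard) ends,
// even if the main call reported an error earlier
void testfunc_deinit(SPH_UDF_INIT * init)
{
}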
indexer supports indexing data in both CSV and TSV formats, via the csvpipe and tsvpipe source types, respectively. Here’s a brief cheat sheet on the respective source directives.
- csvpipe_command = ... specifies a command to run (for instance, csvpipe_command = cat mydata.csv in the simplest case).
- csvpipe_header = 1 tells the indexer to pick the column list from the first row (otherwise, by default, the column list has to be specified in the config file).
- csvpipe_attr_XXX (where XXX is an attribute type, i.e. one of bigint, bool, float, json, multi, multi_64, string, timestamp, or uint) specifies an attribute type for a given column.
- csvpipe_field and csvpipe_field_string specify a regular full-text field, and a full-text field that should also be stored as a string attribute, respectively.
- csvpipe_delimiter changes the column delimiter to a given character (this is csvpipe only; tsvpipe naturally uses tabs).

When working with TSV, you would use the very same directives, but starting with the tsvpipe prefix (i.e. tsvpipe_command, tsvpipe_header, etc).
The first column is currently always treated as id, and must be a unique document identifier.
The first row can either be treated as a named list of columns (when csvpipe_header = 1), or as a first row of actual data. By default it’s treated as data. The column names are trimmed, so a bit of extra whitespace should not hurt.
csvpipe_header affects how CSV input columns are matched to Sphinx attributes and fields.
With csvpipe_header = 0 the input file only contains data, so the order of columns is taken from the config file. Thus, the order of csvpipe_attr_XXX and csvpipe_field directives is very important in this case. You will have to explicitly declare all the fields and attributes (except the leading id), and in exactly the same order they appear in the CSV file. indexer will warn if there were unmatched or extraneous columns.
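For instance, a minimal header-less setup could look like this (a sketch; the file name and columns are illustrative, and the directive order mirrors the CSV column order):

source csv0
{
type = csvpipe
csvpipe_command = cat 0.csv
# 0.csv columns: id, gid, title, content
csvpipe_attr_uint = gid
csvpipe_field = title
csvpipe_field = content
}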
With csvpipe_header = 1 the input file starts with the column names list, so the declarations from the config file are only used to adjust the types. So in this case, the order of csvpipe_attr_XXX and csvpipe_field directives does not matter any more. Also, by default all the input CSV columns will be considered as fields, so you only need to explicitly configure attributes, not fields. For example, the following should work nicely, and index title and content as fields automatically:
1.csv:
id, gid, title, content
123, 11, hello world, document number one
124, 12, hello again, document number two
sphinx.conf:
source csv1
{
type = csvpipe
csvpipe_command = cat 1.csv
csvpipe_header = 1
csvpipe_attr_uint = gid
}
Sphinx provides tools to help you better index (and then later search):
- tokens with special characters in them, like @Rihanna, or Procter&Gamble, or U.S.A, etc;
- mixed codes, ie. terms that mix letters and digits, like UE53N5740AU.

The general approach, so-called “blending”, is the same in both cases.
So in the examples just above Sphinx can:
- match rihanna or ue53n5740au;
- match @rihanna;
- match both ue 53 and ue53.

To index blended tokens, i.e. tokens with special characters in them, you should:
- list those special characters in the blend_chars directive;
- optionally pick the extra token variants to index via the blend_mode directive.

Blended characters are going to be indexed both as separators, and at the same time as valid characters. They are considered separators when generating the base tokenization (or “base split” for short). But in addition they also are processed as valid characters when generating extra tokens.
For instance, when you set blend_chars = @, &, . and index the text @Rihanna Procter&Gamble U.S.A, the base split stores the following six tokens into the final index: rihanna, procter, gamble, u, s, and a. Exactly like it would without the blend_chars, based on just the charset_table.
And because of blend_chars settings, the following three extra tokens get stored: @rihanna, procter&gamble, and u.s.a. Regular characters are still case-folded according to charset_table, but those special blended characters are now preserved. As opposed to being treated as whitespace, like they were in the base split. So far so good.
But why not just add @, &, . to charset_table then? Because that way we would completely lose the base split. Only the three “magic” tokens like @rihanna would be stored. And then searching for their “parts” (for example, for just rihanna or just gamble) would not work. Meh.
Last but not least, the in-field token positions are adjusted accordingly, and shared between the base and extra tokens:
- position 1: rihanna and @rihanna;
- position 2: procter and procter&gamble;
- position 3: gamble;
- position 4: u and u.s.a;
- position 5: s;
- position 6: a.

Bottom line, blend_chars lets you enrich the index and store extra tokens with special characters in them. That might be a handy addition to your regular tokenization based on charset_table.
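When in doubt, you can use CALL KEYWORDS to check how your current settings tokenize any given text (a sketch; myindex stands for any index with the blend_chars setup above):

mysql> CALL KEYWORDS('@Rihanna Procter&Gamble U.S.A', 'myindex');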
To index mixed codes, i.e. terms that mix letters and digits, you need to enable blend_mixed_codes = 1 setting (and reindex).
That way Sphinx adds extra spaces on letter-digit boundaries when making the base split, but still stores the full original token as an extra. For example, UE53N5740AU gets broken down to as much as 5 parts:
- ue and ue53n5740au (both at position 1);
- 53;
- n;
- 5740;
- au.

Besides the “full” split and the “original” code, it is also possible to store prefixes and suffixes. See the blend_mode discussion just below.
Also note that on certain input data mixed codes indexing can generate a lot of undesired noise tokens. So when you have a number of fields with special terms that do not need to be processed as mixed codes (consider either terms like _category1234, or just long URLs), you can use the mixed_codes_fields directive and limit mixed codes indexing to human-readable text fields only. For instance:
blend_mixed_codes = 1
mixed_codes_fields = title, content
That could save you a noticeable amount of both index size and indexing time.
There’s somewhat more than one way to generate extra tokens. So there is a directive to control that. It’s called blend_mode and it lets you list all the different processing variants that you require:
- trim_none, store the full token with all the blended characters;
- trim_head, store the token with heading blended characters trimmed;
- trim_tail, store the token with trailing blended characters trimmed;
- trim_both, store the token with both heading and trailing blended characters trimmed;
- skip_pure, do not store tokens that only contain blended characters;
- prefix_tokens, store all possible prefix tokens;
- suffix_tokens, store all possible suffix tokens.

To visualize all those trims a bit, consider the following setup:
blend_chars = @, !
blend_mode = trim_none, trim_head, trim_tail, trim_both
doc_title = @someone!
Quite a bunch of extra tokens will be indexed in this case:
- someone for the base split;
- @someone! for trim_none;
- someone! for trim_head;
- @someone for trim_tail;
- someone (yes, again) for trim_both.

The trim_both option might seem redundant here for a moment. But do consider a slightly more complicated term like &U.S.A! where all the special characters are blended. Its base split is three tokens (u, s, and a); its original full form (stored for trim_none) is the lower-cased &u.s.a!; and so for this term trim_both is the only way to still generate the cleaned-up u.s.a variant.
prefix_tokens and suffix_tokens actually begin to generate something non-trivial on that very same &U.S.A! example, too. For the record, that’s because its base split is long enough, 3 or more tokens. prefix_tokens would be the only way to store the (useful) u.s prefix; and suffix_tokens would in turn store the (questionable) s.a suffix.
But prefix_tokens and suffix_tokens modes are, of course, especially useful for indexing mixed codes. The following gets stored with blend_mode = prefix_tokens in our running example:
- ue, ue53, ue53n, ue53n5740, and ue53n5740au (all at position 1);
- 53;
- n;
- 5740;
- au.

And with blend_mode = suffix_tokens respectively:
- ue and ue53n5740au;
- 53 and 53n5740au;
- n and n5740au;
- 5740 and 5740au;
- au.

Of course, there still can be missing combinations. For instance, a ue 53n query will still not match any of that. However, for now we intentionally decided against indexing all the possible base token subsequences, as that seemed to produce way too much noise.
The rule of thumb is quite simple. All the extra tokens are indexing-only. And in queries, all tokens are treated “as is”.
Blended characters are going to be handled as valid characters in the queries, and require matching.
For example, querying for "@rihanna" will not match Robyn Rihanna Fenty is a Barbadian-born singer document. However, querying for just rihanna will match both that document, and @rihanna doesn't tweet all that much document.
Mixed codes are not going to be automatically “sliced” in the queries.
For example, querying for UE53 will not automatically match either UE 53 or UE 37 53 documents. You need to manually add extra whitespace into your query term for that.
By default, full-text queries in Sphinx are treated as simple “bags of words”, and all keywords are required in a document to match. In other words, by default we perform a strict boolean AND over all keywords.
However, text queries are much more flexible than just that, and Sphinx has its own full-text query language to expose that flexibility.
You essentially use that language within the MATCH() clause in your SELECT statements. So in this section, when we refer to just the hello world (text) query for brevity, the actual complete SphinxQL statement that you would run is something like SELECT *, WEIGHT() FROM myindex WHERE MATCH('hello world').
That said, let’s begin with a couple key concepts, and a cheat sheet.
Operators generally work on arbitrary subexpressions. For instance, you can combine keywords using operators AND and OR (and brackets) as needed, and build any boolean expression that way.
However, there are a number of exceptions. Not all operators are universally compatible. For instance, the phrase operator (double quotes) naturally only works on keywords. You can’t build a “phrase” from arbitrary boolean expressions.
Some of the operators use special characters, like the phrase operator uses double quotes: "this is phrase". Thus, sometimes you might have to filter out a few special characters from end-user queries, to avoid unintentionally triggering those operators.
Other ones are literal, and their syntax is an all-caps keyword. For example, MAYBE operator would quite literally be used as (rick MAYBE morty) in a query. To avoid triggering those operators, it should be sufficient to lower-case the query: rick maybe morty is again just a regular bag-of-words query that just requires all 3 keywords to match.
Modifiers are attached to individual keywords, and they must work at all times, and must be allowed within any operator. So no compatibility issues there!
A couple examples would be the exact form modifier or the field start modifier, =exact ^start. They limit matching of “their” keyword to either its exact morphological form, or at the very start of (any) field, respectively.
As of v.3.2, there are just 4 per-keyword modifiers.
| Modifier | Example | Description |
|---|---|---|
| exact form | =cats | Only match this exact form; needs index_exact_words |
| field start | ^hello | Only match at the very start of (any) field |
| field end | world$ | Only match at the very end of (any) field |
| IDF boost | boost^1.23 | Multiply keyword IDF by a given value when ranking |
The operators are a bit more interesting!
| Operator | Example | Description |
|---|---|---|
| brackets | (one two) | Group a subexpression |
| AND | one two | Match both args |
| OR | one \| two | Match any arg |
| term-OR | one \|\| two | Match any keyword, and reuse in-query position |
| NOT | one -two | Match 1st arg, but exclude matches of 2nd arg |
| NOT | one !two | Match 1st arg, but exclude matches of 2nd arg |
| MAYBE | one MAYBE two | Match 1st arg, but include 2nd arg when ranking |
| field limit | @title one @body two | Limit matching to a given field |
| fields limit | @(title,body) test | Limit matching to given fields |
| fields limit | @!(phone,year) test | Limit matching to all but given fields |
| fields limit | @* test | Reset any previous field limits |
| position limit | @title[50] test | Limit matching to N first positions in a field |
| phrase | "one two" | Match all keywords as an (exact) phrase |
| phrase | "one * * four" | Match all keywords as an (exact) phrase |
| proximity | "one two"~3 | Match all keywords within a proximity window |
| quorum | "uno due tre"/2 | Match any N out of all keywords |
| quorum | "uno due tre"/0.7 | Match any given fraction of all keywords |
| BEFORE | one << two | Match args in this specific order only |
| NEAR | one NEAR/3 "two three" | Match args in any order within a given distance |
| SENTENCE | one SENTENCE "two three" | Match args in one sentence; needs index_sp |
| PARAGRAPH | one PARAGRAPH two | Match args in one paragraph; needs index_sp |
| ZONE | ZONE:(h3,h4) one two | Match in given zones only; needs index_zones |
| ZONESPAN | ZONESPAN:(h3,h4) one two | Match in contiguous spans only; needs index_zones |
Now let’s discuss all these modifiers and operators in a bit more detail.
Exact form modifier is only applicable when morphology (ie. either stemming or lemmatization) is enabled. With morphology on, Sphinx searches for normalized keywords by default. This modifier lets you search for an exact original form instead. It requires the index_exact_words setting to be enabled.
The syntax is = at the keyword start.
=exact
For the sake of an example, assume that English stemming is enabled, ie. that the index was configured with morphology = stem_en setting. Also assume that we have these three sample documents:
id, content
1, run
2, runs
3, running
Without index_exact_words, only the normalized form, namely run, is stored into the index for every document. Even with the modifier, it is impossible to differentiate between them.
With index_exact_words = 1, both the normalized and original keyword forms are stored into the index. However, by default the keywords are also normalized when searching. So a query runs will get normalized to run, and will still match all 3 documents.
And finally, with index_exact_words = 1 and with the exact form modifier, a query like =runs will be able to match just the original form, and return just the document #2.
For convenience, you can also apply this particular modifier to an entire phrase operator, and it will propagate down to all keywords.
="runs down the hills"
"=runs =down =the =hills"
Field start modifier makes the keyword match if and only if it occurred at the very beginning of (any) full-text field. (Technically, it will only match postings with an in-field position of 1.)
The syntax is ^ at the keyword start, mimicked after regexps.
^fieldstart
Field end modifier makes the keyword match if and only if it occurred at the very end of (any) full-text field. (Technically, it will only match postings with a special internal “end-of-field” flag.)
The syntax is $ at the keyword end, mimicked after regexps.
fieldend$
IDF boost modifier lets you adjust the keyword IDF value (used for ranking); it multiplies the IDF by a given constant. That affects a number of ranking factors that build upon the IDF, and that in turn also affects the default ranking.
The syntax is ^ followed by a scale constant.
boostme^1.23
These let you implement grouping (with brackets) and classic boolean logic. The respective formal syntax is as follows:
- (expr1)
- expr1 expr2
- expr1 | expr2
- -expr1 or !expr1

Where expr1 and expr2 are either keywords, or any other computable text query expressions. Here go a few query examples showing all of the operators.
(shaken !stirred)
"barack obama" (alaska | california | texas | "new york")
one -(two | (three -four))
Nothing too exciting to see here. But still there are a few quirks worth a quick mention. Here they go, in no particular order.
OR operator precedence is higher than AND.
In other words, ORs take priority: they are evaluated first, and ANDs are then evaluated on top of the ORs. Thus, the query looking for cat | dog | mouse is equivalent to looking for (cat | dog | mouse), and not to (looking for cat) | dog | mouse.
ANDs are implicit.
There isn’t any explicit syntax for them in Sphinx. Just put two expressions right next to each other, and that’s it.
No all-caps versions for AND/OR/NOT, those are valid keywords.
So something like rick AND morty is equivalent to rick and morty, and both these queries require all 3 keywords to match, including that literal and.
Notice the difference in behavior between this, and, say, rick MAYBE morty, where the syntax for operator MAYBE is that all-caps keyword.
Field and zone limits affect the entire (sub)expression.
Meaning that @title limit in a @title hello world query applies to all keywords, not just a keyword or expression immediately after the limit operator. Both keywords in this example would need to match in the title field, not only the first hello. An explicit way to write this query, with an explicit field limit for every keyword, would be (@title hello) (@title world).
Brackets push and pop field and zone limits.
For example, (@title hello) world query requires hello to be matched in title only. But that limit ends on a closing bracket, and world can then match anywhere in the document again. Therefore this query is equivalent to something like (@title hello) (@* world).
Even more curiously, but quite predictably, @body (@title hello) world query would in turn be equivalent to (@title hello) (@body world). The first @body limit gets pushed on an opening bracket, and then restored on a closing one.
The same rules apply to zones; see the ZONE and ZONESPAN operators below.
In-query positions in boolean operators are sequential.
And while those do not affect matching (aka text based filtering), they do noticeably affect ranking. For example, even if you splice a phrase with ORs, a rather important “phrase match degree” ranking factor (the one called ‘lcs’) does not change at all, even though matching changes quite a lot:
mysql> select id, weight(), title from test1
where match('@title little black dress');
+--------+----------+--------------------+
| id | weight() | title |
+--------+----------+--------------------+
| 334757 | 3582 | Little black dress |
+--------+----------+--------------------+
1 row in set (0.01 sec)
mysql> select id, weight(), title from test1
where match('@title little | black | dress');
+--------+----------+------------------------+
| id | weight() | title |
+--------+----------+------------------------+
| 334757 | 3582 | Little black dress |
| 420209 | 2549 | Little Black Backpack. |
...

So in a sense, everything you construct using brackets and operators still looks like a single huge “phrase” (bag of words, really) to the ranking code. As if there were no brackets and no operators.
Operator NOT is really operator ANDNOT.
While a query like -something technically can be computed, more often than not such a query is just a programming error. And a potentially expensive one at that, because an implicit list of all the documents in the index could be quite big. Here go a few examples.
// correct query, computable at every level
aaa -(bbb -(ccc ddd))
// non-computable queries
-aaa
aaa | -bbb
(On a side note, that might also raise the philosophical question of ranking documents that contain zero matched keywords; thankfully, from an engineering perspective it would be extremely easy to brutally cut that Gordian knot by merely setting the weight to zero, too.)
For that reason, the NOT operator requires something computable to its left. An isolated NOT will raise a query error. In case you absolutely must, you can append some special magic keyword (something like __allmydocs, to your taste) to all your documents when indexing. The two example non-computable queries just above would then become:
(__allmydocs -aaa)
aaa | (__allmydocs -bbb)
Operator NOT only works at term start.
In order to trigger, it must be preceded with a whitespace, or a bracket, or other clear keyword boundary. For instance, cat-dog is by default actually equivalent to merely cat dog, while cat -dog with a space does apply the operator NOT to dog.
Phrase operator uses the de-facto standard double quotes syntax and basically lets you search for an exact phrase, ie. several keywords in this exact order, without any gaps between them. For example.
"mary had a little lamb"
Yep, boring. But of course there is a bit more even to this simple operator.
Exact form modifier works on the entire operator. Of course, any modifiers must work within a phrase, that’s what modifiers are all about. But with exact form modifiers there’s extra syntax sugar that lets you apply it to the entire phrase at once: ="runs down the hills" form is a bit easier to write than "=runs =down =the =hills".
Standalone star “matches” any keyword. Or rather, it skips that position when matching the phrase. Text queries do not really work with document texts. They work with just the specified keywords, and analyze their in-document and in-query positions. Now, a special star token within a phrase operator will not actually match anything, it will simply adjust the query position when parsing the query. So there will be no impact on search performance at all, but the phrase keyword positions will be shifted. For example.
"mary had * * lamb"
Stopwords “match” any keyword. The very same logic applies to stopwords. Stopwords are not even stored in the index, so we have nothing to match. But even on stopwords, we still need to adjust both the in-document positions when indexing, and the in-query positions when matching.
This sometimes causes a little counter-intuitive and unexpected (but inevitable!) matching behavior. Consider the following set of documents:
id, content
1, Microsoft Office 2016
2, we are using a lot of software from Microsoft in the office
3, Microsoft opens another office in the UK
Assume that in and the are our only stopwords. What documents would be matched by the following two phrase queries?
"microsoft office""microsoft in the office"Query #1 only matches document #1, no big surprise there. However, as we just discussed, query #2 is in fact equivalent to "microsoft * * office", because of stopwords. And so it matches both documents #2 and #3.
Operator MAYBE is occasionally needed for ranking. It takes two arbitrary expressions, and only requires the first one to match, but uses the (optional) matches of the second expression for ranking.
expr1 MAYBE expr2
For instance, rick MAYBE morty query matches exactly the same documents as just rick, but with that extra MAYBE, documents that mention both rick and morty will get ranked higher.
Arbitrary expressions are supported, so this is also valid:
rick MAYBE morty MAYBE (season (one || two || three) -four)
Term-OR operator (double pipe) essentially lets you specify “properly ranked” per-keyword synonyms at query time.
Matching-wise, it just does regular boolean OR over several keywords, but ranking-wise (and unlike the regular OR operator), it does not increment their in-query positions. That keeps any positional ranking factors intact.
Naturally, it only accepts individual keywords, you can not term-OR a keyword and a phrase or any other expression. Also, term-OR is currently not supported within phrase or proximity operators, though that is an interesting possibility.
It should be easiest to illustrate it with a simple example. Assume we are still searching for that little black dress, as we did in our example on the regular OR operator.
mysql> select id, weight(), title from rt
where match('little black dress');
+------+----------+-----------------------------------------------+
| id | weight() | title |
+------+----------+-----------------------------------------------+
| 1 | 3566 | little black dress |
| 3 | 1566 | huge black/charcoal dress with a little white |
+------+----------+-----------------------------------------------+
2 rows in set (0.00 sec)

So far so good. But it looks like charcoal is a synonym that we could use here. Let’s try adding it with the regular OR operator.
mysql> select id, weight(), title from rt
where match('little black|charcoal dress');
+------+----------+-----------------------------------------------+
| id | weight() | title |
+------+----------+-----------------------------------------------+
| 3 | 3632 | huge black/charcoal dress with a little white |
| 1 | 2566 | little black dress |
| 2 | 2566 | little charcoal dress |
+------+----------+-----------------------------------------------+
3 rows in set (0.00 sec)

Oops, what just happened? We now also match document #2, which is good, but why is document #3 ranked so high all of a sudden?
That’s because with regular ORs ranking would, basically, look for the entire query as if without any operators, ie. the ideal phrase match would be not just "little black dress", but the entire "little black charcoal dress" query with all special operators removed.
There is no such “perfect” 4-keyword full phrase match in our small test database. (If there was, it would get the top rank.) From the phrase ranking point of view, the next kinda-best thing to it is the "black/charcoal dress" part, where a 3-keyword subphrase matches the query. And that’s why it gets ranked higher than "little black dress", where the longest common subphrase between the document and the query is "little black", only 2 keywords long, not 3.
But that’s not what we wanted in this case at all; we just wanted to introduce a synonym for black, rather than break ranking! And that’s exactly what term-OR operator is for.
mysql> select id, weight(), title from rt
where match('little black||charcoal dress');
+------+----------+-----------------------------------------------+
| id | weight() | title |
+------+----------+-----------------------------------------------+
| 1 | 3566 | little black dress |
| 2 | 3566 | little charcoal dress |
| 3 | 2632 | huge black/charcoal dress with a little white |
+------+----------+-----------------------------------------------+
3 rows in set (0.00 sec)

Good, the ranking is back to what we expected. Both the original exact match "little black dress" and the synonymous "little charcoal dress" are now at the top again, because of a perfect phrase match (which is favored by the default ranker).
Note that while all the examples above revolved around a single positional factor lcs (which is used in the default ranker), there are more positional factors than just that. See the section on Ranking factors for more details.
Field limit operator limits matching of the subsequent expressions to a given field, or a set of fields. Field names must exist in the index, otherwise the query will fail with an error.
There are several syntax forms available.
@field limits matching to a single given field. This is the simplest form. @(field) is also valid.
@(f1,f2,f3) limits matching to multiple given fields. Note that the match might happen just partially in one of the fields. For example, @(title,body) hello world does not require that both keywords match in the very same field! Document like {"id":123, "title":"hello", "body":"world"} (pardon my JSON) does match this query.
@!(f1,f2,f3) limits matching to all the fields except given ones. This can be useful to avoid matching end-user queries against some internal system fields, for one. @!f1 is also valid syntax in case you want to skip just the one field.
@* syntax resets any previous limits, and re-enables matching all fields.
In addition, all forms except @* can be followed by an optional [N] clause, which limits the matching to N first tokens (keywords) within a field. All of the examples below are valid:
- @title[50] test
- @(title,body)[50] test
- @!title[50] test

To reiterate, field limits are “contained” by brackets; or more formally, any current limits are stored on an opening bracket, and restored on a closing one.
When in doubt, use SHOW PLAN to figure out what limits are actually used:
mysql> set profiling=1;
select * from rt where match('(@title[50] hello) world') limit 0;
show plan \G
...
*************************** 1. row ***************************
Variable: transformed_tree
Value: AND(
AND(fields=(title), max_field_pos=50, KEYWORD(hello, querypos=1)),
AND(KEYWORD(world, querypos=2)))
1 row in set (0.00 sec)

We can see that the @title limit was only applied to hello, and reset back to matching all fields (and positions) on the closing bracket, as expected.
Proximity operator matches all the specified keywords, in any order, and allows for a number of gaps between those keywords. The formal syntax is as follows:
"keyword1 keyword2 ... keywordM"~N
Where N has a slightly unusual meaning: it is the allowed number of gaps (other keywords) that can occur between the M specified keywords, incremented by 1.

For example, consider a document that reads "Mary had a little lamb whose fleece was white as snow", and consider two queries: "lamb fleece mary"~4 and "lamb fleece mary"~5. We have exactly 4 extra words between mary, lamb, and fleece, namely had, a, little, and whose. This means that the first query with N = 4 will not match, because with N = 4 the proximity operator actually allows for 3 gaps only, not 4. The second query will match, as with N = 5 it allows for 4 gaps.
NEAR operator is a generalized version of proximity operator. Its syntax is:
expr1 NEAR/N expr2
Where N has the same meaning as in the proximity operator, the number of allowed gaps plus one. But with NEAR we can use arbitrary expressions, not just individual keywords.
(binary | "red black") NEAR/2 tree
Left and right expressions can still match in any order. For example, a query progress NEAR/2 bar would match both these documents:
- progress bar
- a bar called Progress

NEAR is left associative, meaning that arg1 NEAR/X arg2 NEAR/Y arg3 will be evaluated as (arg1 NEAR/X arg2) NEAR/Y arg3. It has the same (lowest) precedence as BEFORE.
Note that while with just 2 keywords proximity and NEAR operators are identical (eg. "one two"~N and one NEAR/N two should behave exactly the same), with more keywords that is not the case.
Because when you stack multiple keywords with NEAR, then up to N - 1 gaps are allowed per each keyword in the stack. Consider this example with two stacked NEAR operators: one NEAR/3 two NEAR/3 three. It allows for up to 2 gaps between one and two, and then for 2 more gaps between two and three. That’s less restrictive than the proximity operator with the same N ("one two three"~3), as the proximity operator will only allow 2 gaps total. So a document with the one aaa two bbb ccc three text will match the NEAR query, but not the proximity query.
And vice versa, what if we bump the limit in proximity to match the total limit allowed by all NEARs? We get "one two three"~5 (4 gaps allowed, plus that magic 1), so that anything that matches the NEARs variant would also match the proximity variant. But now a document one two aaa bbb ccc ddd three ceases to match the NEARs, because the gap between two and three is too big. And now the proximity operator becomes less restrictive.
Bottom line is, the proximity operator and a stack of NEARs are not really interchangeable, they match a bit different things.
Quorum matching operator essentially lets you perform fuzzy matching. It’s less strict than matching all the argument keywords. It will match all documents with at least N keywords present out of M total specified. Just like with proximity (or with AND), those N can occur in any order.
"keyword1 keyword2 ... keywordM"/N
"keyword1 keyword2 ... keywordM"/fraction
For a specific example, "the world is a wonderful place"/3 will match all documents that have any 3 of the specified words, or more.
Naturally, N must be less than or equal to M. Also, M must be anywhere from 1 to 256 keywords, inclusive. (Even though a quorum with just 1 keyword makes little sense, that is allowed.)
Fraction must be from 0.0 to 1.0, more details below.
Quorum with N = 1 is effectively equivalent to a stack of ORs, and can be used as syntax sugar to replace that. For instance, these two queries are equivalent:
red | orange | yellow | green | blue | indigo | violet
"red orange yellow green blue indigo violet"/1
Instead of an absolute number N, you can also specify a fraction, a floating point number between 0.0 and 1.0. In this case Sphinx will automatically compute N based on the number of keywords in the operator. This is useful when you don’t or can’t know the keyword count in advance. The example above can be rewritten as "the world is a wonderful place"/0.5, meaning that we want to match at least 50% of the keywords. As there are 6 words in this query, the autocomputed match threshold would also be 3.
Fractional threshold is rounded up. So with 3 keywords and a fraction of 0.5 we would get a final threshold of 2 keywords, as 3 * 0.5 = 1.5 rounds up to 2. There’s also a lower safety limit of 1 keyword, as matching zero keywords makes zero sense.
When the quorum threshold is too restrictive (ie. when N is greater than M), the operator gets automatically replaced with an AND operator. The same fallback happens when there are more than 256 keywords.
This operator enforces a strict “left to right” order (ie. the query order) on its arguments. The arguments can be arbitrary expressions. The syntax is <<, and there is no all-caps version.
expr1 << expr2
For instance, black << cat query will match a black and white cat document but not a that cat was black document.
Strict order operator has the lowest priority, same as NEAR operator.
It can be applied both to just keywords and more complex expressions, so the following is a valid query:
(bag of words) << "exact phrase" << red|green|blue
These operators match the document when both their arguments are within the same sentence or the same paragraph of text, respectively. The arguments can be either keywords, or phrases, or the instances of the same operator. (That is, you can stack several SENTENCE operators or PARAGRAPH operators. Mixing them is however not supported.) Here are a few examples:
one SENTENCE two
one SENTENCE "two three"
one SENTENCE "two three" SENTENCE four
The order of the arguments within the sentence or paragraph does not matter.
These operators require indexes built with index_sp setting (sentence and paragraph indexing feature) enabled, and revert to a mere AND otherwise. You can refer to documentation on index_sp for additional details on what’s considered a sentence or a paragraph.
Zone limit operator is a bit similar to field limit operator, but restricts matching to a given in-field zone (or a list of zones). The following syntax variants are supported:
ZONE:h1 test
ZONE:(h2,h3) test
ZONESPAN:h1 test
ZONESPAN:(h2,h3) test
Zones are named regions within a field. Essentially they map to HTML (or XML) markup. Everything between <h1> and </h1> is in a zone called h1 and could be matched by that ZONE:h1 test query.
Note that ZONE and ZONESPAN limits will get reset not only on a closing bracket, or on the next zone limit operator, but on a next field limit operator too! So make sure to specify zones explicitly for every field. Also, this makes operator @* a full reset, ie. it should reset both field and zone limits.
Zone limits require indexes built with zones support (see documentation on index_zones for a bit more details).
The difference between ZONE and ZONESPAN limit is that the former allows its arguments to match in multiple disconnected spans of the same zone, and the latter requires that all matching occurs within a single contiguous span.
For instance, (ZONE:th hello world) query will match this example document.
<th>Table 1. Local awareness of Hello Kitty brand.</th>
.. some table data goes here ..
<th>Table 2. World-wide brand awareness.</th>
In this example we have 2 spans of th zone, hello will match in the first one, and world in the second one. So in a sense ZONE works on a concatenation of all the zone spans.
And if you need to further limit matching to any of the individual contiguous spans, you should use the ZONESPAN operator. (ZONESPAN:th hello world) query does not match the document above. (ZONESPAN:th hello kitty) however does!
Efficient geosearches are possible with Sphinx, and the related features are:
- the GEODIST() function that computes a distance between two geopoints;
- the CONTAINS() function that checks if a geopoint is inside a geopolygon;
- attribute indexes for geosearches.
When you create indexes on your latitude and longitude columns (and you should), query optimizer can utilize those in a few important GEODIST() usecases:
SELECT
GEODIST(lat,lon,$lat1,$lon1) dist1,
GEODIST(lat,lon,$lat2,$lon2) dist2,
GEODIST(lat,lon,$lat3,$lon3) dist3,
...,
(dist1 < $radius1 OR dist2 < $radius2 OR dist3 < $radius3 ...) ok
WHERE ok=1

These cases are known to the query optimizer, and once it detects them, it can choose to perform an approximate attribute index read (or reads) first, instead of scanning the entire index. When the quick approximate read is selective enough, which frequently happens with small enough search distances, the savings can be huge.
Case #1 handles your typical “give me everything close enough to a certain point” search. When the anchor point and radius are all constant, Sphinx will automatically precompute a bounding box that fully covers a “circle” with a required radius around that anchor point, ie. find some two internal min/max values for latitude and longitude, respectively. It will then quickly check attribute indexes statistics, and if the bounding box condition is selective enough, it will switch to attribute index reads instead of a full scan.
Here’s a working single-anchor query sketch (myindex, the anchor point, and the 1 km radius are all illustrative):
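SELECT id, GEODIST(lat, lon, 55.777, 37.585, {in=deg, out=km}) d
FROM myindex WHERE d < 1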
Case #2 handles multi-anchor search, ie. “give me documents that are either close enough to point number 1, or to point number 2, etc”. The base approach is exactly the same, but multiple boundboxes are generated, multiple index reads are performed, and their results are all merged together.
Here’s another example:
SELECT id,
GEODIST(lat, lon, 55.777, 37.585, {in=deg,out=km}) d1,
GEODIST(lat, lon, 55.569, 37.576, {in=deg,out=km}) d2,
geodist(lat, lon, 56.860, 35.912, {in=deg,out=km}) d3,
(d1<1 OR d2<1 OR d3<1) ok
FROM myindex WHERE ok=1

Note that if we reformulate the queries a little, so that the optimizer does not recognize the eligible cases any more, the optimization will not trigger.
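For example, here is a hypothetical rewrite of a two-anchor search that computes the same result, but folds the per-anchor checks into a single MIN() expression, a shape the optimizer may not recognize:

SELECT id,
    MIN(GEODIST(lat, lon, 55.777, 37.585, {in=deg,out=km}),
        GEODIST(lat, lon, 55.569, 37.576, {in=deg,out=km})) d
FROM myindex WHERE d < 1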
Obviously, “the boundbox optimization” is actually still feasible in this case, but the optimizer will not recognize that, and will fall back to a full scan.
To check whether these optimizations are working for you, use EXPLAIN on your query. Also, make sure the radius is small enough when doing those checks.
Another interesting bit is that the optimizer can sometimes quite properly choose to use only one index instead of two, or to avoid using the indexes at all.

Say, what if our radius covers the entire country? All our documents will be within the boundbox anyway, and a simple full scan will indeed be faster. That’s why you should use some “small enough” test radius with EXPLAIN.

Or say, what if we have another, super-selective AND id=1234 condition in our query? Doing the index reads would then be superfluous; the optimizer will choose to perform a lookup by id instead.
You can implement vector searches with Sphinx and there are several different features intended for that, namely:
- fixed-size array attributes, eg. rt_attr_int8_array = vec1[128];
- array values in JSON, eg. {"vec2": int8[1,2,3,4]};
- the DOT() function to compute dot products;
- the FVEC() function to specify vector constants.

Let’s see how all these parts connect together.
First, storage. You can store your per-document vectors using any of the following options:
- fixed arrays, via the XXX_attr_YYY_array directives;
- generic [1,2,3,4] values in JSON;
- optimized int8[1,2,3,4] or float[1,2,3,4] values in JSON, via our syntax extensions.

Fixed arrays are the fastest to access, but not flexible at all. Also, they require some RAM per every document. For instance, a fixed array with 32 floats (rt_attr_float_array = test1[32]) will consume 128 bytes per every row, whether or not it contains any actual data (and arrays without any explicit data will be filled with zeroes).
JSON arrays are slower to access, and consume a bit more memory per row, but that memory is only consumed per used row. Meaning that when your vectors are defined sparsely (for, say, just 1M documents out of the entire 10M collection), then it might make sense to use JSON anyway to save some RAM.
JSON arrays are also “mixed” by default, that is, can contain values with arbitrary different types. With vector searches however you would normally want to use optimized arrays, with a single type attached to all values. Sphinx can auto-detect integer arrays in JSON, with values that fit into either int32 or int64 range, and store and later process them efficiently. However, to enforce either int8 or float type on a JSON array, you have to explicitly use our JSON syntax extensions.
To store an array of float values in JSON, you have to:
- either mark the float type in each value with the 1.234f syntax (because by default 1.234 gets a double type in JSON), eg: [1.0f, 2.0f, 3.0f];
- or use the float[...] syntax, eg: float[1,2,3].

To store an array of int8 values (ie. from -128 to 127 inclusive) in JSON, the only option is to:
- use the int8[...] syntax, eg: int8[1,2,3].

In both these cases, we require an explicit type to differentiate between the two possible options (float vs double, or int8 vs int), and by default, we choose to use higher precision rather than save space.
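For instance, here is a hypothetical RT insert that stores an int8 vector into a JSON attribute called j (the index and column names are illustrative):

INSERT INTO mydocuments (id, j) VALUES (123, '{"vec2": int8[1,2,3,4]}');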
Second, calculations. The workhorse here is the DOT() function that computes a dot product between the two vector arguments, ie. a sum of the products of the corresponding vector components.
The most frequent usecase is, of course, computing a DOT() between some per-document array (stored either as an attribute or in JSON) and a constant. The latter should be specified with FVEC():
SELECT id, DOT(vec1, FVEC(1,2,3,4)) FROM mydocuments
SELECT id, DOT(json.vec2, FVEC(1,2,3,4)) FROM mydocuments

Note that DOT() internally optimizes its execution depending on the actual argument types (ie. float vectors, or integer vectors, etc). That is why the two following queries perform very differently:
mysql> SELECT id, DOT(vec1, FVEC(1,2,3,4,...)) d
FROM mydocuments ORDER BY d DESC LIMIT 3;
...
3 rows in set (0.047 sec)
mysql> SELECT id, DOT(vec1, FVEC(1.0,2,3,4,...)) d
FROM mydocuments ORDER BY d DESC LIMIT 3;
...
3 rows in set (0.073 sec)

In this example, vec1 is an integer array, and we DOT() it against either an integer constant vector, or a float constant vector. Obviously, int-by-int vs int-by-float multiplications are a bit different, and hence the performance difference.
Sphinx lets you specify custom ranking formulas for weight() calculations, and tailor text-based relevance ranking to your needs.
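For instance, here is a sketch that mimics a proximity-plus-BM15 style ranker (the exact formula and the myindex name are illustrative):

SELECT id, weight() FROM myindex WHERE MATCH('hello world')
OPTION ranker=expr('sum(lcs*user_weight)*1000 + bm15');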
This mechanism is called the expression ranker and its ranking formulas (expressions) can access a few more special variables, called ranking factors, than a regular expression. (Of course, all the per-document attributes and all the math and other functions are still accessible to these formulas, too.)
Ranking factors (aka ranking signals) are, basically, a bunch of different values computed for every document (or even field), based on the current search query. They essentially describe various aspects of the specific document match, and so they are used as input variables in a ranking formula, or a ML model.
There are three types (or levels) of factors that determine when exactly a given factor can and will be computed:
- query-level factors, like query_word_count;
- document-level factors, like doc_word_count or bm15;
- field-level factors, like word_count or lcs.
Document factors additionally depend on the document text, and so they get computed for every matched document. You can use them anywhere in the ranking formula, too. Of these, a few variants of the classic bm25() function are arguably the most important for relevance ranking.
Finally, field factors are even more granular, they get computed for every single field. And thus they then have to be aggregated into a singular value by some factor aggregation function (as of v.3.2, the supported functions are either SUM() or TOP()).
And before we discuss every specific factor in a bit more detail, here goes the obligatory factors cheat sheet.
| Name | Level | Type | Summary |
|---|---|---|---|
| has_digit_words | query | int | number of has_digit words that contain [0-9] chars (but may also contain other chars) |
| is_latin_words | query | int | number of is_latin words, ie. words with [a-zA-Z] chars only |
| is_noun_words | query | int | number of is_noun words, ie. tagged as nouns (by the lemmatizer) |
| is_number_words | query | int | number of is_number words, ie. integers with [0-9] chars only |
| max_lcs | query | int | maximum possible LCS value for the current query |
| query_word_count | query | int | number of unique inclusive keywords in a query |
| bm15 | doc | int | quick estimate of BM25(1.2, 0) without query syntax support |
| bm25a(k1, b) | doc | int | precise BM25() value with configurable K1, B constants and syntax support |
| bm25f(k1, b, …) | doc | int | precise BM25F() value with extra configurable field weights |
| doc_word_count | doc | int | number of unique keywords matched in the document |
| field_mask | doc | int | bit mask of the matched fields |
| atc | field | float | Aggregate Term Closeness, log(1+sum(idf1*idf2*pow(distance, -1.75))) over “best” term pairs |
| exact_field_hit | field | bool | whether field is fully covered by the query, in the query term order |
| exact_hit | field | bool | whether query == field |
| exact_order | field | bool | whether all query keywords were a) matched and b) in query order |
| full_field_hit | field | bool | whether field is fully covered by the query, in arbitrary term order |
| has_digit_hits | field | int | number of has_digit keyword hits |
| hit_count | field | int | total number of any-keyword hits |
| is_latin_hits | field | int | number of is_latin keyword hits |
| is_noun_hits | field | int | number of is_noun keyword hits |
| is_number_hits | field | int | number of is_number keyword hits |
| lccs | field | int | Longest Common Contiguous Subsequence between query and document, in words |
| lcs | field | int | Longest Common Subsequence between query and document, in words |
| max_idf | field | float | max(idf) over keywords matched in this field |
| max_window_hits(n) | field | int | max(window_hit_count) computed over all N-word windows in the current field |
| min_best_span_pos | field | int | first maximum LCS span position, in words, 1-based |
| min_gaps | field | int | min number of gaps between the matched keywords over the matching spans |
| min_hit_pos | field | int | first matched occurrence position, in words, 1-based |
| min_idf | field | float | min(idf) over keywords matched in this field |
| sum_idf | field | float | sum(idf) over keywords matched in this field |
| sum_idf_boost | field | float | sum(idf_boost) over keywords matched in this field |
| tf_idf | field | float | sum(tf*idf) over matched keywords, ie. sum(idf) over all occurrences |
| user_weight | field | int | user-specified field weight (via OPTION field_weights) |
| wlccs | field | float | Weighted LCCS, sum(idf) over contiguous keyword spans |
| word_count | field | int | number of unique keywords matched in this field |
Formally, a (field) factor aggregation function is a single argument function that takes an expression with field-level factors, iterates it over all the matched fields, and computes the final result over the individual per-field values.
Currently supported aggregation functions are:
- SUM(), sums the argument expression over all matched fields. For instance, sum(1) should return the number of matched fields.
- TOP(), returns the greatest value of the argument over all matched fields. For instance, top(max_idf) should return the maximum per-keyword IDF over the entire document.

Naturally, these are only needed over expressions with field-level factors; query-level and document-level factors can be used in the formulas “as is”.
When searching and ranking, Sphinx classifies every query keyword with regards to a few classes of interest. That is, it flags a keyword with a “noun” class when the keyword is a (known) noun, or flags it with a “number” class when it is an integer, etc.
At the moment we identify 4 keyword classes and assign the respective flags. Those 4 flags in turn generate 8 ranking factors: 4 query-level per-flag keyword counts, and 4 field-level per-class hit counts. The flags are described in a bit more detail just below.
It’s important to understand that all the flags are essentially assigned at query parsing time, without looking into any actual index data (tokenization and morphology settings still apply, though). Also, query processing rules apply, meaning that any valid keyword modifiers are effectively stripped before assigning the flags.
has_digit flag
Keyword is flagged as has_digit when there is at least one digit character, ie. from the [0-9] range, in that keyword.
Other characters are allowed, meaning that l33t is a has_digit keyword.
But they are not required, and thus, any is_number keyword is by definition a has_digit keyword.
is_latin flag
Keyword is flagged as is_latin when it completely consists of Latin letters, ie. any of the [a-zA-Z] characters. No other characters are allowed.
For instance, hello is flagged as is_latin, but l33t is not, because of the digits.
Also note that wildcards like abc* are not flagged as is_latin, even if all the actual expansions are latin-only. Technically, query keyword flagging only looks at the query itself, and not the index data, and can not know anything about the actual expansions yet. (And even if it did, then inserting a new row with a new expansion could suddenly break the is_latin property.)
At the same time, as query keyword modifiers like ^abc or =abc still get properly processed, these keywords are flagged as is_latin alright.
is_noun flag
Keyword is flagged as is_noun when (a) there is at least one lemmatizer enabled for the index, and (b) that lemmatizer classifies that standalone keyword as a noun.
For example, with morphology = lemmatize_en configured in our example index, we get the following:
mysql> CALL KEYWORDS('deadly mortal sin', 'en', 1 AS stats);
+------+-----------+------------+------+------+-----------+------------+----------------+----------+---------+-----------+-----------+
| qpos | tokenized | normalized | docs | hits | plain_idf | global_idf | has_global_idf | is_latin | is_noun | is_number | has_digit |
+------+-----------+------------+------+------+-----------+------------+----------------+----------+---------+-----------+-----------+
| 1 | deadly | deadly | 0 | 0 | 0.000000 | 0.000000 | 0 | 1 | 0 | 0 | 0 |
| 2 | mortal | mortal | 0 | 0 | 0.000000 | 0.000000 | 0 | 1 | 1 | 0 | 0 |
| 3 | sin | sin | 0 | 0 | 0.000000 | 0.000000 | 0 | 1 | 1 | 0 | 0 |
+------+-----------+------------+------+------+-----------+------------+----------------+----------+---------+-----------+-----------+
3 rows in set (0.00 sec)
However, as you can see from this very example, is_noun POS tagging is not completely precise.
For now it works on individual words rather than contexts. So even though in this particular query context we could technically guess that “mortal” is not a noun, in general it sometimes is. Hence the is_noun flags in this example are 0/1/1, though ideally they would be 0/0/1 respectively.
Also, at the moment the tagger prefers to overtag. That is, when “in doubt”, ie. when the lemmatizer reports that a given wordform can either be a noun or not, we do not (yet) analyze the probabilities, and just always set the flag.
Another tricky bit is the handling of non-dictionary forms. As of v.3.2 the lemmatizer reports all such predictions as nouns.
So use with care; this can be a noisy signal.
is_number flag
Keyword is flagged as is_number when all its characters are digits from the [0-9] range. Other characters are not allowed.
So, for example, 123 will be flagged is_number, but neither 0.123 nor 0x123 will be flagged.
To nitpick on this particular example a bit more, note that . does not even get parsed as a valid character by default. So with the default charset_table, the 0.123 query text will not produce a single 0.123 keyword at all. Instead, by default it gets tokenized as two tokens (keywords), 0 and 123, and those tokens in turn are flagged is_number.
These are perhaps the simplest factors. They are entirely independent from the documents being ranked; they only describe the query. So they only get computed once, at the very start of query processing.
has_digit_words
Query-level, the number of unique has_digit keywords in the query. Duplicates are only counted once.
is_latin_words
Query-level, the number of unique is_latin keywords in the query. Duplicates are only counted once.
is_noun_words
Query-level, the number of unique is_noun keywords in the query. Duplicates are only counted once.
is_number_words
Query-level, the number of unique is_number keywords in the query. Duplicates are only counted once.
max_lcs
Query-level, the maximum possible value that the sum(lcs*user_weight) expression can take. This can be useful for weight boost scaling. For instance, the (legacy) MATCHANY ranker formula uses this factor to guarantee that a full phrase match in any individual field ranks higher than any combination of partial matches in all fields.
query_word_count
Query-level, the number of unique and inclusive keywords in a query. “Inclusive” means that it’s additionally adjusted for the number of excluded keywords. For example, both one one one one and (one !two) queries should assign a value of 1 to this factor, because there is just one unique non-excluded keyword.
These are a few factors that “look” at both the query and the (entire) matching document being ranked. The most useful among these are several variants of the classic BM-family factors (as in Okapi BM25).
bm15
Document-level, a quick estimate of the classic BM15(1.2) value. It is computed without keyword occurrence filtering (ie. over all the term postings rather than just the matched ones). Also, it ignores the document and field lengths.
For example, if you search for an exact phrase like "foo bar", and both foo and bar keywords occur 10 times each in the document, but the phrase only occurs once, then this bm15 estimate will still use 10 as TF (Term Frequency) values for both these keywords, ie. account all the term occurrences (postings), instead of “accounting” just 1 actual matching posting.
So bm15 uses pre-computed document TFs, rather than computing actual matched TFs on the fly. By design, that makes zero difference at all when running a simple bag-of-words query against the entire document. However, once you start using pretty much any query syntax, the differences become obvious.
To discuss one, what if you limit all your searches to a single field, and the query is @title foo bar? Should the weights really depend on the contents of any other fields, when we clearly intended to limit our searches to titles? They should not. However, with the bm15 approximation they will. But this really is just a performance vs quality tradeoff.
Last but not least, this factor was not-quite-correctly named bm25 for quite a while, until v.3.0.2. (It can be argued that in a way it did compute the BM25 value, for a very specific k1 = 1.2 and b = 0 case. But come on. There is a special name for that b = 0 family of cases, and it is bm15.)
bm25a(k1, b)
Document-level, parametrized, computes the value of the classic BM25(k1, b) function with the two given (required) parameters.
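For example, with the textbook k1 = 1.2 and b = 0.75 constants (the index name is illustrative):

SELECT id, weight() FROM myindex WHERE MATCH('hello world')
OPTION ranker=expr('bm25a(1.2, 0.75)');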
Unlike bm15, this factor only accounts for the matching occurrences (postings) when computing TFs. It also requires the index_field_lengths = 1 setting to be on, in order to compute the current and average document lengths (which are in turn required by the BM25 function with non-zero b parameters).
It is called bm25a only because bm25 was initially taken (mistakenly) by that BM25(1.2, 0) value estimate that we now (properly) call bm15; no other hidden meaning in that a suffix.
Document-level, parametrized, computes the value of an extended BM25F(k1,b) function with the two given (required) parameters, and an extra set of named per-field weights. For example:
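SELECT id, weight() FROM test1 WHERE MATCH('hello world')
OPTION ranker=expr('sum(lcs)*1000 + bm25f(1.2, 0.75, {title=3})');

(Again just a sketch; the {field=weight} map syntax follows the legacy v.2.x expressions reference, and title is assumed to be one of the indexed fields.)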
Unlike bm15, this factor only accounts for the matching occurrences (postings) when computing TFs. It also requires the index_field_lengths = 1 setting to be on.
The BM25F extension lets you assign bigger weights to certain fields. Internally those weights simply pre-scale the TFs before plugging them into the original BM25 formula. For the original TR, see the Zaragoza et al (2004) paper, "Microsoft Cambridge at TREC-13: Web and HARD tracks".
Document-level, a number of unique keywords matched in the entire document.
Document-level, a 32-bit mask of matched fields. Fields with numbers 33 and up are ignored in this mask.
Generally, a field-level factor is just some numeric value computed by the ranking engine for every matched in-document text field, with regards to the current query, describing this or that aspect of the actual match.
As a query can match multiple fields, but the final weight needs to be a single value, these per-field values need to be folded into a single one. Meaning that, unlike query-level and document-level factors, you can’t use them directly in your ranking formulas:
mysql> SELECT id, weight() FROM test1 WHERE MATCH('hello world')
OPTION ranker=expr('lcs');
ERROR 1064 (42000): index 'test1': field factors must only
occur within field aggregates in a ranking expressionThe correct syntax should use one of the aggregation functions. Multiple different aggregations are allowed:
mysql> SELECT id, weight() FROM test1 WHERE MATCH('hello world')
OPTION ranker=expr('sum(lcs) + top(max_idf) * 1000');

Now let's discuss the individual factors in a bit more detail.
Field-level, Aggregate Term Closeness. This is a proximity based measure that grows higher when the document contains more groups of more closely located and more important (rare) query keywords.
WARNING: you should use ATC with OPTION idf='plain,tfidf_unnormalized'; otherwise you could get rather unexpected results.
ATC basically works as follows. For every keyword occurrence in the document, we compute the so-called term closeness. For that, we examine all the other closest occurrences of all the query keywords (including the subject keyword itself), both to the left and to the right of the subject occurrence. We then compute a distance dampening coefficient as k = pow(distance, -1.75) for all those occurrences, and sum the dampened IDFs. Thus for every occurrence of every keyword, we get a "closeness" value that describes the "neighbors" of that occurrence. We then multiply those per-occurrence closenesses by their respective subject keyword IDF, sum them all, and finally compute a logarithm of that sum.
Or in other words, we process the best (closest) matched keyword pairs in the document, and compute pairwise “closenesses” as the product of their IDFs scaled by the distance coefficient:
pair_tc = idf(pair_word1) * idf(pair_word2) * pow(pair_distance, -1.75)
We then sum such closenesses, and compute the final, log-dampened ATC value:
atc = log(1 + sum(pair_tc))
Note that this final dampening logarithm is exactly the reason you should use OPTION idf=plain, because without it, the expression inside the log() could be negative.
Having closer keyword occurrences actually contributes much more to ATC than having more frequent keywords. Indeed, when the keywords are right next to each other, we get distance = 1 and k = 1; and when there is only one extra word between them, we get distance = 2 and k = 0.297; and with two extra words in-between, we get distance = 3 and k = 0.146, and so on.
At the same time IDF attenuates somewhat slower. For example, in a 1 million document collection, the IDF values for 3 example keywords that are found in 10, 100, and 1000 documents would be 0.833, 0.667, and 0.500, respectively.
So a keyword pair with two rather rare keywords that occur in just 10 documents each but with 2 other words in between would yield pair_tc = 0.101 and thus just barely outweigh a pair with a 100-doc and a 1000-doc keyword with 1 other word between them and pair_tc = 0.099.
Moreover, a pair of two unique, 1-document keywords with ideal IDFs, and with just 3 words between them would fetch a pair_tc = 0.088 and lose to a pair of two 1000-doc keywords located right next to each other, with a pair_tc = 0.25.
So, basically, while ATC does combine both keyword frequency and proximity, it is still heavily favoring the proximity.
Field-level, boolean, whether the current field was (seemingly) fully covered by the query, and in the right (query) term order, too.
This flag should be set when the field is basically either “equal” to the entire query, or equal to a query with a few terms thrown away. Note that term order matters, and it must match, too.
For example, if our query is one two three, then either one two three, or just one three, or two three should all have exact_field_hit = 1, because in these examples all the field keywords are matched by the query, and they are in the right order. However, three one should get exact_field_hit = 0, because of the wrong (non-query) term order. And then if we throw in any extra terms, one four three field should also get exact_field_hit = 0, because four was not matched by the query, ie. this field is not covered fully.
Also, beware that stopwords and other text processing tools might “break” this factor.
For example, when the field is one stop three, where stop is a stopword, we would still get 0 instead of 1, even though intuitively it should be ignored, and the field should be kinda equal to one three, and we get a 1 for that. How come?
This is because stopwords are not really ignored completely. They do still affect positions (and that’s intentional, so that matching operators and other ranking factors would work as expected, just in some other example cases).
Therefore, this field gets indexed as one * three, where star marks a skipped position. So when matching the one two three query, the engine knows that positions number 1 and 3 were matched alright. But there is no (efficient) way for it to tell what exactly was in that missed position 2 in the original field; ie. was there a stopword, or was there any regular word that the query simply did not mention (like in the one four three example). So when computing this factor, we see that there was an unmatched position, therefore we assume that the field was not covered fully (by the query terms), and set the factor to 0.
Field-level, boolean, whether a query was a full and exact match of the entire current field (that is, after normalization, morphology, etc). Used in the SPH04 ranker.
Field-level, boolean, whether all of the query keywords were matched in the current field in the exact query order. (In other words, whether our field “covers” the entire query, and in the right order, too.)
For example, a (microsoft office) query would yield exact_order = 1 in a field with the content We use Microsoft software in our office.
However, the very same query in a field with the text Our office is Microsoft free. would yield exact_order = 0 because, while the coverage is there (all the words are matched), the order is wrong.
Field-level, boolean, whether the current field was (seemingly) fully covered by the query.
This flag should be set when all the field keywords are matched by the query, in whatever order. In other words, this factor requires "full coverage" of the field by the query, but "allows" the words to be reordered.
For example, a field three one should get full_field_hit = 1 against a query one two three. Both keywords were “covered” (matched), and the order does not matter.
Note that all documents where exact_field_hit = 1 (which is even more strict) must also get full_field_hit = 1, but not vice versa.
Also, beware that stopwords and other text processing tools might "break" this factor, for exactly the same reasons that we discussed a little earlier in exact_field_hit.
Field-level, total matched field hits count over just the has_digit keywords.
Field-level, total field hits count over all keywords. In other words, total number of keyword occurrences that were matched in the current field.
Note that a single keyword may occur (and match!) multiple times. For example, if hello occurs 3 times in a field and world occurs 5 times, hit_count will be 8.
Field-level, total matched field hits count over just the is_noun keywords.
Field-level, total matched field hits count over just the is_latin keywords.
Field-level, total matched field hits count over just the is_number keywords.
Field-level, Longest Common Contiguous Subsequence. A length of the longest contiguous subphrase between the query and the document, computed in keywords.
LCCS factor is rather similar to LCS but, in a sense, more restrictive. While LCS could be greater than 1 even though no two query words are matched right next to each other, LCCS would only get greater than 1 if there are exact, contiguous query subphrases in the document.
For example, one two three four five query vs one hundred three hundred five hundred document would yield lcs = 3, but lccs = 1, because even though the mutual dispositions of 3 matched keywords (one, three, and five) do match between the query and the document, none of the occurrences are actually next to each other.
Note that LCCS still does not differentiate between the frequent and rare keywords; for that, see WLCCS factor.
Field-level, Longest Common Subsequence. This is the length of a maximum “verbatim” match between the document and the query, counted in words.
By construction, it takes a minimum value of 1 when only "stray" keywords were matched in a field, and a maximum value of the query length (in keywords) when the entire query was matched in a field "as is", in the exact query order.
For example, if the query is hello world and the field contains these two words as a subphrase anywhere in the field, lcs will be 2. Another example: this works on subsets of the query too, ie. with a hello world program query, a field that only contains the hello world subphrase also gets an lcs value of 2.
Note that any non-contiguous subset of the query keyword works here, not just a subset of adjacent keywords. For example, with hello world program query and hello (test program) field contents, lcs will be 2 just as well, because both hello and program matched in the same respective positions as they were in the query. In other words, both the query and field match a non-contiguous 2-keyword subset hello * program here, hence the value of 2 of lcs.
However, if we keep the hello world program query but our field changes to hello (test computer program), then the longest matching subset is now only 1-keyword long (two subsets match here actually, either hello or program), and lcs is therefore 1.
Finally, if the query is hello world program and the field contains an exact match hello world program, lcs will be 3. (Hopefully that is unsurprising at this point.)
Field-level, max(idf) over all keywords that were matched in the field.
Field-level, parametrized, computes max(window_hit_count) over all N-keyword windows (where N is the parameter). For example:
mysql> SELECT *, weight() FROM test1 WHERE MATCH('one two')
-> OPTION ranker=expr('sum(max_window_hits(3))');
+------+-------------------+----------+
| id | title | weight() |
+------+-------------------+----------+
| 1 | one two | 2 |
| 2 | one aa two | 2 |
| 4 | one one aa bb two | 1 |
| 3 | one aa bb two | 1 |
+------+-------------------+----------+
4 rows in set (0.00 sec)

So in this example we are looking at rather short 3-keyword windows, and in document number 3 our matched keywords are too far apart, so the factor is 1. However, in document number 4 the one one aa window has 2 occurrences (even though of just one keyword), so the factor is 2 there. Documents number 1 and 2 are straightforward.
Field-level, the position of the first maximum LCS keyword span.
For example, assume that our query was hello world program, and that the hello world subphrase was matched twice in the current field, in positions 13 and 21. Now assume that hello and world additionally occurred elsewhere in the field (say, in positions 5, 8, and 34), but as those occurrences were not next to each other, they did not count as a subphrase match. In this example, min_best_span_pos will be 13, ie. the position of the first occurrence of the longest (maximum) match, LCS-wise.
Note how for the single keyword queries min_best_span_pos must always equal min_hit_pos.
Field-level, the minimum number of positional gaps between (just) the keywords matched in the field. Always 0 when less than 2 keywords match; always greater than or equal to 0 otherwise.
For example, with a big wolf query, a big bad wolf field would yield min_gaps = 1; a big bad hairy wolf field would yield min_gaps = 2; a the wolf was scary and big field would yield min_gaps = 3; etc. However, a field like i heard a wolf howl would yield min_gaps = 0, because only one keyword would match in that field, and, naturally, there would be no gaps between matched keywords.
Therefore, this is a rather low-level, “raw” factor that you would most likely want to adjust before actually using for ranking.
Specific adjustments depend heavily on your data and the resulting formula, but here are a few ideas you can start with:
- min_gaps based boosts could be simply ignored when word_count < 2;
- non-trivial min_gaps values (ie. when word_count >= 2) could be clamped with a certain "worst case" constant, while trivial values (ie. when min_gaps = 0 and word_count < 2) could be replaced by that constant;
- a transfer function like 1 / (1 + min_gaps) could be applied (so that better, smaller min_gaps values would maximize it and worse, bigger min_gaps values would fall off slowly).

Field-level, the position of the first matched keyword occurrence, counted in words. Positions begin from 1, so min_hit_pos = 0 must be impossible in an actually matched field.
Field-level, min(idf) over all keywords (not occurrences!) that were matched in the field.
Field-level, sum(idf) over all keywords (not occurrences!) that were matched in the field.
Field-level, sum(idf_boost) over all keywords (not occurrences!) that were matched in the field.
Field-level, a sum of tf*idf over all the keywords matched in the field. (Or, naturally, a sum of idf over all the matched postings.)
For the record, TF is the Term Frequency, aka the number of (matched) keyword occurrences in the current field.
And IDF is the Inverse Document Frequency, a floating point value between 0 and 1 that describes how frequent this keyword is in the index.
Basically, frequent (and therefore not really interesting) words get lower IDFs, hitting the minimum value of 0 when the keyword is present in all of the indexed documents. And vice versa, rare, unique, and therefore interesting words get higher IDFs, maxing out at 1 for unique keywords that occur in just a single document.
Field-level, a user specified per-field weight (for a bit more details on how to set those, refer to OPTION field_weights section). By default all these weights are set to 1.
Field-level, Weighted Longest Common Contiguous Subsequence. A sum of IDFs over the keywords of the longest contiguous subphrase between the current query and the field.
WLCCS is computed very similarly to LCCS, but every "suitable" keyword occurrence increases it by the keyword IDF rather than just by 1 (which is the case with both LCS and LCCS). That lets us rank sequences of more rare and important keywords higher than sequences of frequent keywords, even if the latter are longer. For example, a query Zanzibar bed and breakfast would yield lccs = 1 against a hotels of Zanzibar field, but lccs = 3 against a London bed and breakfast field, even though Zanzibar could actually be somewhat more rare than the entire bed and breakfast phrase. The WLCCS factor alleviates that problem (to a certain extent) by accounting for the keyword frequencies.
Field-level, the number of unique keywords matched in the field. For example, if both hello and world occur in the current field, word_count will be 2, regardless of how many times either keyword occurs.
All of the built-in Sphinx rankers can be emulated with the expression based ranker. You just need to pass a proper formula using the OPTION ranker clause.
Such emulation is, of course, going to be slower than using the built-in, pre-compiled rankers. But it still might be of interest if you want to start fine-tuning your ranking formula from an existing built-in ranker baseline. (Also, these formulas kinda define the nitty-gritty built-in ranker details in a nicely readable fashion.)
| Ranker | Formula |
|---|---|
| PROXIMITY_BM25 | sum(lcs*user_weight)*1000 + bm25 |
| BM25 | bm25 |
| NONE | 1 |
| WORDCOUNT | sum(hit_count*user_weight) |
| PROXIMITY | sum(lcs*user_weight) |
| MATCHANY | sum((word_count + (lcs - 1)*max_lcs)*user_weight) |
| FIELDMASK | field_mask |
| SPH04 | sum((4*lcs + 2*(min_hit_pos==1) + exact_hit)*user_weight)*1000 + bm25 |
And here goes a complete example query:
SELECT id, weight() FROM test1
WHERE MATCH('hello world')
OPTION ranker=expr('sum(lcs*user_weight)*1000 + bm25')

Sphinx supports several different IDF (Inverse Document Frequency) calculation options. Those can affect your relevance ranking (aka scoring) in a number of cases.
By default, term IDFs are (a) per-shard, and (b) computed online. So they might fluctuate significantly when ranking. And several other ranking factors rely on them, so the entire rank might change a lot in a seemingly random fashion. The reasons are twofold.
First, IDFs usually differ across shards (i.e. individual indexes that make up a bigger combined index). This means that a completely identical document might rank differently depending on a specific shard it ends up in. Not great.
Second, IDFs might change from query to query, as you update the index data. That instability in time might or might not be a desired effect.
To help alleviate these quirks (if they affect your use case), Sphinx offers two features:
- local_df option to aggregate sharded IDFs.
- global_idf feature to enforce prebuilt static IDFs.

local_df syntax is SELECT ... OPTION local_df=1, and enabling that option tells the query to compute IDFs (more) precisely, i.e. over the entire index rather than the individual shards. The default value is 0 (off), for performance reasons.
global_idf feature is more complicated and includes several components:
- indextool --dumpdict --stats switch that generates the source data, i.e. the per-shard dictionary dumps;
- indextool --buildidf switch that builds a static IDF file from those;
- global_idf config directive that lets you assign a static IDF file to your shards;
- OPTION global_idf=1 that forces the query to use that file.

Both these features affect the input variables used for IDF calculations. More specifically:

- let n be the DF, document frequency (for a given term);
- let N be the corpus size, total number of documents;
- by default, n and N are per-shard;
- with local_df, they both are summed across shards;
- with global_idf, they both are taken from a static IDF file.

The static global_idf file actually stores a bunch of n values for every individual term, and the N value for the entire corpus, summed over all the source files that were available during the --buildidf stage. For terms that are not present in the static global_idf file, their current (dynamic) DF values will be used. local_df should also still affect those.
To avoid overflows, N is adjusted up to the actual corpus size. Meaning that, for example, if the global_idf file says there were 1000 documents, but your index carries 3000 documents, then N is set to the bigger value, i.e. 3000. Therefore, you should avoid using too small data slices for the dictionary dumps, and/or manually adjust the frequencies; otherwise your static IDFs might be quite off.
To keep the global_idf file reasonably compact, you can use the additional --skip-uniq switch when doing the --buildidf stage. That switch will filter out all terms that only occur once. That usually reduces the .idf file size greatly, while still yielding exact or almost-exact results.
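Putting the components together, a typical build could look something like the following sketch. This is a reconstruction: the exact indextool switch spelling (in particular how the output file is passed to --buildidf) may vary by version, so double-check with indextool --help:

$ indextool --config sphinx.conf --dumpdict --stats shard1 > shard1.dict
$ indextool --config sphinx.conf --dumpdict --stats shard2 > shard2.dict
$ indextool --buildidf shard1.dict shard2.dict --skip-uniq --out global.idf

You would then assign the resulting file to every shard via the global_idf config directive, and query with OPTION global_idf=1.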
Starting with v.3.3 we removed several legacy IDF calculation methods, and now Sphinx always uses the following formula to compute IDF from n (the document frequency) and N (the corpus size).
idf = log(N/n) / (2*log(N+1)) * term_idf_boost

So we start with the de-facto standard raw_idf = log(N/n); then we normalize that for the corpus size; and further compress the idf into the [0.0, 0.5) range.
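For instance, in a 1-million document collection, a term that occurs in 10 documents (and carries no boost) gets:

idf = log(1000000/10) / (2*log(1000001)) ≈ 11.51 / 27.63 ≈ 0.417

The intermediate values here use natural logarithms, but the final ratio is the same in any log base.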
Then we apply per-term user boosts from the query, if any. term_idf_boost defaults to 1.0 but can be arbitrarily changed for individual query terms by using the respective modifier, eg. ... WHERE MATCH('cat^1.2 dog').
For the record, both this IDF normalization itself and the specific [0, 0.5) range are now mostly historical. The normalization is being considered for removal.
rank_fields

When your indexes and queries contain any special "fake" keywords (usually used to speed up matching), it makes sense to exclude those from ranking. That can be achieved by putting such keywords into special fields, and then using the OPTION rank_fields clause in the SELECT statement to pick the fields with actual text for ranking. For example:
SELECT id, weight(), title FROM myindex
WHERE MATCH('hello world @sys _category1234')
OPTION rank_fields='title content'

rank_fields is designed to work as follows. Only the keyword occurrences in the ranked fields get processed when computing ranking factors. Any other occurrences are ignored (by ranking, that is).
Note a slight caveat here: for query-level factors, only the query itself can be analyzed, not the index data.
This means that when you do not explicitly specify the fields in the query, the query parser must assume that the keyword can actually occur anywhere in the document. And, for example, MATCH('hello world _category1234') will compute query_word_count=3 for that reason. This query does indeed have 3 keywords, even if _category1234 never actually occurs anywhere except sys field.
Other than that, rank_fields is pretty straightforward. Matching will still work as usual. But for ranking purposes, any occurrences (hits) from the “system” fields can be ignored and hidden.
Sphinx searchd now has a so-called “siege mode” that temporarily imposes server-wide limits on all the incoming SELECT queries, for a given amount of time. This is useful when some client is flooding searchd with heavy requests and, for whatever reason, stopping those requests at other levels is complicated.
Siege mode is controlled via a few global server variables. The example just below will introduce a siege mode for 15 seconds, and impose limits of at most 1000 processed documents and at most 0.3 seconds (wall clock) per query:
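SET GLOBAL siege = 15;
SET GLOBAL siege_max_fetched_docs = 1000;
SET GLOBAL siege_max_query_msec = 300;

(A reconstruction: the statement syntax is per the "Server variables reference" section, and the variable names are the ones reported by SHOW STATUS just below.)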
Once the timeout reaches zero, the siege mode will be automatically lifted.
There also are intentionally hardcoded limits you can’t change, namely:
- the maximum siege duration (the siege timeout itself) is 300 seconds, i.e. 5 minutes
- the maximum siege_max_fetched_docs is 1,000,000 documents
- the maximum siege_max_query_msec is 1 second, i.e. 1000 msec

Note that the current siege limits are reset when the siege stops. So in the example above, if you start another siege in 20 seconds, then that next siege will be restarted with the 1M docs and 1000 msec limits, and not the 1000 docs and 300 msec limits from the previous one.
Siege mode can be turned off at any moment by zeroing out the timeout:
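SET GLOBAL siege = 0;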
The current siege status is reported in SHOW STATUS:
mysql> show status like 'siege%';
+------------------------+---------+
| Counter | Value |
+------------------------+---------+
| siege_sec_left | 296 |
| siege_max_query_msec | 1000 |
| siege_max_fetched_docs | 1000000 |
+------------------------+---------+
3 rows in set (0.00 sec)

Next order of business, the document limit has a couple interesting details that require explanation.
First, the fetched_docs counter is calculated a bit differently for term and non-term searches. For term searches, it counts all the (non-unique!) rows that were fetched by full-text term readers, batch by batch. For non-term searches, it counts all the (unique) alive rows that were matched (either by an attribute index read, or by a full scan).
Second, for multi-index searches, the siege_max_fetched_docs limit will be split across the local indexes (shards), weighted by their document count.
If you’re really curious, let’s discuss those bits in more detail.
The non-term search case is rather easy. All the actually stored rows (whether coming from a full scan or from attribute index reads) will first be checked for liveness, then accounted in the fetched_docs counter, then processed further (with extra calculations, filters, etc). Bottom line, a query limited this way will run "hard" calculations, filter checks, etc on at most N rows. So best case scenario (if all WHERE filters pass), the query will return N rows, and never even a single row more.
Now, the term search case is more interesting. The lowest-level term readers will also emit individual rows, but as opposed to the “scan” case, either the terms or the rows might be duplicated. The fetched_docs counter merely counts those emitted rows, as it needs to limit the total amount of work done. So, for example, with a 2-term query like (foo bar) the processing will stop when both terms fetch N documents total from the full-text index… even if not a single document was matched just yet! If a term is duplicated, for example, like in a (foo foo) query, then both the occurrences will contribute to the counter. Thus, for a query with M required terms all AND-ed together, the upper limit on the matched documents should be roughly equal to N/M, because every matched document will be counted as “processed” M times in every term reader. So either (foo bar) or (foo foo) example queries with a limit of 1000 should result in roughly 500 matches tops.
That “roughly” just above means that, occasionally, there might be slightly more matches. As for performance reasons the term readers work in batches, the actual fetched_docs counter might get slightly bigger than the imposed limit, by the batch size at the most. But that must be insignificant as processing just a single small batch is very quick.
And as for splitting the limit between the indexes, it’s simply pro-rata, based on the per-index document count. For example, assume that siege_max_fetched_docs is set to 1000, and that you have 2 local indexes in your query, one with 1400K docs and one with 600K docs respectively. (It does not matter whether those are referenced directly or via a distributed index.) Then the per-index limits will be set to 700 and 300 documents respectively. Easy.
Last but not least, beware that the entire point of the "siege mode" is to intentionally degrade the search results for too complex searches! Use with extreme care; essentially, only use it to stomp out cluster fires that cannot be quickly alleviated any other way; and at this point we recommend to only ever use it manually.
This section should eventually contain the complete SphinxQL reference. If the statement you’re looking for is not yet documented here, please refer to legacy SphinxQL v.2.x reference document.
Here’s a complete list of SphinxQL statements.
BULK UPDATE ftindex (id, col1 [, col2 [, col3 ...]]) VALUES
(id1, val1_1 [, val1_2 [, val1_3 ...]]),
(id2, val2_1 [, val2_2 [, val2_3 ...]]),
...
(idN, valN_1 [, valN_2 [, valN_3 ...]])

BULK UPDATE lets you update multiple rows with a single statement. Compared to running N individual statements, bulk updates provide both cleaner syntax and better performance.
Overall they are quite similar to regular updates. To summarize quickly:
First column in the list must always be the id column. Rows are uniquely identified by document ids.
Other columns to update can either be regular attributes, or individual JSON keys, also just as with regular UPDATE queries. Here are a couple examples:
BULK UPDATE test1 (id, price) VALUES (1, 100.00), (2, 123.45), (3, 299.99)
BULK UPDATE test2 (id, json.price) VALUES (1, 100.00), (2, 123.45), (3, 299.99)

All value types (numerics, strings, JSON, MVA) are supported.
Bulk updates of existing values must keep the type. This is a natural restriction for regular attributes, but it also applies to JSON values. For example, if you update an integer JSON value with a float, then that float will get converted (truncated) to the current integer type.
Compatible value type conversions will happen. Truncations are allowed.
Incompatible conversions will fail. For example, strings will not be auto-converted to numeric values.
Attempts to update non-existent JSON keys will fail.
KILL lets you forcibly terminate long-running statements based either on thread ID, or on their current running time.
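Piecing together the examples in this section and the v.3.3 changelog, the two statement forms should look like this (the first one kills by thread ID, the second one by running time):

KILL <thread_id>
KILL SLOW <min_msec> MSEC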
For the first version, you can obtain the thread IDs using the SHOW THREADS statement.
Note that forcibly killed queries are going to return almost as if they completed OK rather than raise an error. They will return a partial result set accumulated so far, and raise a “query was killed” warning. For example:
mysql> SELECT * FROM rt LIMIT 3;
+------+------+
| id | gid |
+------+------+
| 27 | 123 |
| 28 | 123 |
| 29 | 123 |
+------+------+
3 rows in set, 1 warning (0.54 sec)
mysql> SHOW WARNINGS;
+---------+------+------------------+
| Level | Code | Message |
+---------+------+------------------+
| warning | 1000 | query was killed |
+---------+------+------------------+
1 row in set (0.00 sec)

The respective network connections are not going to be forcibly closed.
At the moment, the only statements that can be killed are SELECT, UPDATE, and DELETE. Additional statement types might begin to support KILL in the future.
In both versions, KILL returns the number of threads marked for termination via the affected rows count:
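For example (a hypothetical session; the "rows affected" line is just the standard MySQL client rendering of that count):

mysql> KILL SLOW 1000 MSEC;
Query OK, 2 rows affected (0.00 sec)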
Threads that were already marked will not be marked again, nor reported this way.
There are no limits on the <min_msec> parameter for the second version, and therefore, KILL SLOW 0 MSEC is perfectly legal syntax. That specific statement is going to kill all the currently running queries. So please use with a pinch of care.
SELECT <expr> [[AS] <alias>] [, ...]
FROM <ftindex> [, ...]
[{USE | IGNORE | FORCE} INDEX (<attr_index> [, ...]) [...]]
[WHERE
[MATCH('<text_query>') [AND]]
[<where_condition> [AND <where_condition> [...]]]]
[GROUP [<N>] BY <column> [, ...]
[WITHIN GROUP ORDER BY <column> {ASC | DESC} [, ...]]
[HAVING <having_condition>]]
[ORDER BY <column> {ASC | DESC} [, ...]]
[LIMIT [<offset>,] <row_count>]
[OPTION <opt_name> = <opt_value> [, ...]]
[FACET <facet_options> [...]]

SELECT is the main querying workhorse, and as such, comes with a rather extensive (and perhaps a little complicated) syntax. There are many different parts (aka clauses) in that syntax. Thankfully, most of them are optional.
Briefly, they are as follows:
- SELECT columns list (aka items list, aka expressions list)
- FROM clause, with the full-text index list
- <hint> INDEX clauses, with the attribute index usage hints
- WHERE condition clause, with the row filtering conditions
- GROUP BY clause, with the row grouping conditions
- ORDER BY clause, with the row sorting conditions
- LIMIT clause, with the result set size and offset
- OPTION clause, with all the special options
- FACET clauses, with a list of requested additional facets

The most notable differences from regular SQL are these:

- FROM list is NOT an implicit JOIN, but more like a UNION
- ORDER BY is always present, default is ORDER BY WEIGHT() DESC, id ASC
- LIMIT is always present, default is LIMIT 0,20
- GROUP BY always picks a specific "best" row to represent the group

Index hints can be used to tweak query optimizer behavior and attribute index usage, for either performance or debugging reasons. Note that usually you should not have to use them.
Multiple hints can be used, and multiple attribute indexes can be listed, in any order. For example, the following syntax is legal:
SELECT id FROM test1
USE INDEX (idx_lat)
FORCE INDEX (idx_price)
IGNORE INDEX (idx_time)
USE INDEX (idx_lon) ...

All flavors of the <hint> INDEX clause take an index list as their argument, for example:
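USE INDEX (idx_lat, idx_lon)

(Using the same hypothetical index names as in the example above.)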
Summarily, hints work this way:
- USE INDEX limits the optimizer to only use a subset of the given indexes;
- IGNORE INDEX strictly forbids the given indexes from being used;
- FORCE INDEX strictly forces the given indexes to be used.

USE INDEX tells the optimizer that it must only consider the given indexes, rather than all the applicable ones. In other words, in the absence of the USE clause, all indexes are fair game. In its presence, only those that were mentioned in the USE list are. The optimizer still decides whether to actually use or ignore any specific index. In the example above it still might choose to use idx_lat only, but it must never use idx_time, on the grounds that it was not mentioned explicitly.
IGNORE INDEX completely forbids the optimizer from using the given indexes. Ignores take priority, they override both USE INDEX and FORCE INDEX. Thus, while it is legal to USE INDEX (foo, bar) IGNORE INDEX (bar), it is way too verbose. Simple USE INDEX (foo) achieves exactly the same result.
FORCE INDEX makes the optimizer forcibly use the given indexes (that is, if they are applicable at all) despite the query cost estimates.
For more discussion and details on attribute indexes and hints, refer to "Using attribute indexes".
SHOW INDEX AGENT STATUS lets you examine a number of internal per-agent counters, associated with every agent (and then with every mirror host of an agent) in a given distributed index.
The agents are numbered in the config order. The mirrors within each agent are also numbered in the config order. All timers internally have microsecond precision, but are displayed as floats, in milliseconds, for example:
mysql> SHOW INDEX dist1 AGENT STATUS LIKE '%que%';
+--------------------------------+-------+
| Variable_name | Value |
+--------------------------------+-------+
| agent1_host1_query_timeouts | 0 |
| agent1_host1_succeeded_queries | 1 |
| agent1_host1_total_query_msec | 2.943 |
| agent2_host1_query_timeouts | 0 |
| agent2_host1_succeeded_queries | 1 |
| agent2_host1_total_query_msec | 3.586 |
+--------------------------------+-------+
6 rows in set (0.00 sec)

As we can see from the output, there was just 1 query sent to each agent since searchd start, that query went well on both agents, and it took approx 2.9 ms and 3.6 ms respectively. The specific agent addresses are intentionally not part of this status output, to avoid clutter; they can in turn be examined using the DESCRIBE statement:
mysql> DESC dist1
+---------------------+----------+
| Agent | Type |
+---------------------+----------+
| 127.0.0.1:7013:loc1 | remote_1 |
| 127.0.0.1:7015:loc2 | remote_2 |
+---------------------+----------+
2 rows in set (0.00 sec)

In this case (ie. without mirrors) the mapping is straightforward: we can see that we only have two agents, agent1 on port 7013 and agent2 on port 7015, and we now know which statistics are associated with which agent exactly. Easy.
SHOW VARIABLES statement serves two very different purposes:
- compatibility with certain MySQL clients;
- examining the current searchd server variables.

Compatibility mode is required to support connections from certain MySQL clients that automatically run SHOW VARIABLES on connection, and fail if that statement raises an error.
Optional GLOBAL or SESSION scope condition is for compatibility only at the moment, and the scope is ignored. All variables, both global and per-session, are always displayed.
WHERE variable_name ... condition is also for compatibility only, and also ignored.
LIKE '<varmask>' condition is supported and functional, for instance:
mysql> show variables like '%comm%';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| autocommit | 1 |
+---------------+-------+
1 row in set (0.00 sec)

Most of the variables displayed in SHOW VARIABLES are mutable (ie. possible to change on the fly). Some are read-only. Some are constant.
mysql> show variables;
+--------------------------+-----------------------------+
| Variable_name | Value |
+--------------------------+-----------------------------+
| attrindex_thresh | 1024 |
| autocommit | 1 |
| character_set_client | utf8 |
| character_set_connection | utf8 |
| collation_connection | libc_ci |
| log_debug_filter | |
| log_level | info |
| max_allowed_packet | 8388608 |
| net_spin_msec | 10 |
| qcache_max_bytes | 0 |
| qcache_thresh_msec | 3000 |
| qcache_ttl_sec | 60 |
| query_log_format | sphinxql |
| query_log_min_msec | 0 |
| siege | 0 |
| siege_max_fetched_docs | 1000000 |
| siege_max_query_msec | 1000 |
| sql_fail_filter | |
| sql_log_file | |
| version | 3.3.1-dev (commit fc5fb556) |
+--------------------------+-----------------------------+
20 rows in set (0.00 sec)

For instance, log_level is mutable; max_allowed_packet is read-only (you can change it in sphinx.conf, but that requires a restart); and version is constant.
Specific per-variable documentation can be found in the “Server variables reference” section.
This section should eventually contain the complete reference on functions that are supported in SELECT and other applicable places. If the function you’re looking for is not yet documented here, please refer to legacy Sphinx v.2.x expressions reference document.
Here’s a complete list of builtin Sphinx functions.
DOT() function

DOT() function computes a dot product over two vector arguments.
Vectors can be taken either from JSON, or from array attributes, or specified as constants using FVEC() function. All combinations should generally work.
The result type is always float for consistency and simplicity. (According to our benchmarks, performance gain from using integer or bigint for the result type, where applicable, is pretty much nonexistent anyway.)
Note that internal calculations are optimized for specific input argument types anyway. For instance, int8 by int8 vectors should be quite noticeably faster than float by double vectors containing the same data, both because integer multiplication is less expensive, and because the int8 pairs utilize 6x less memory.
So as a rule of thumb, use the narrowest possible type, that yields both better RAM use and better performance.
When one of the arguments is either NULL, or not a numeric vector (that can very well happen with JSON), or when both arguments are vectors of different sizes, DOT() returns 0.
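For instance, a similarity-scoring query could look like this sketch, where j.vec is a hypothetical JSON array attribute and the FVEC() constant plays the role of the "query" vector:

SELECT id, DOT(j.vec, FVEC(0.5, 1.5, 2.5)) AS score
FROM test1
ORDER BY score DESC;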
FVEC() function

FVEC() function lets you define a vector of floats. The two current use cases are:

- constant vector arguments, primarily for DOT();
- UDF argument wrappers.

Constant vector form.
In the first form, the arguments are a list of numeric constants. And note that there can be a difference whether we use integers or floats here!
When both arguments to DOT() are integer vectors, DOT() can use an optimized integer implementation, and to define such a vector using FVEC(), you should only use integers.
The rule of thumb with vectors generally is: just use the narrowest possible type. Because that way, extra optimizations just might kick in. And the other way, they very definitely will not.
For instance, the optimizer is allowed to widen FVEC(1,2,3,4) from integers to floats alright, no surprise there. Now, in this case it is also allowed to narrow the resulting float vector back to integers where applicable, because we can know that all the original values were integers before widening.
And narrowing down from the floating-point form like FVEC(1.0, 2.0, 3.0, 4.0) to integers is strictly prohibited. So even though the values actually are the same, in the first case additional integer-only optimizations can be engaged, and in the second case they can't.
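To illustrate, assuming myvec is an integer array attribute (hypothetical; e.g. declared with sql_attr_int8_array), the first query below may engage the integer path, while the second one forces floats, even though the values are identical:

SELECT id, DOT(myvec, FVEC(1, 2, 3, 4)) FROM test1;
SELECT id, DOT(myvec, FVEC(1.0, 2.0, 3.0, 4.0)) FROM test1;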
UDF argument wrapper form.
In the second form, the only argument must be a JSON key, and the output is only intended for UDF functions (because otherwise this FVEC() wrapper should not be needed and you would just use the key itself). The associated value type gets checked, optimized float vectors get wrapped and passed to UDF, and any other types are replaced with a null vector (zero length and no data pointer) in the UDF call. The respective UDF type is SPH_UDF_TYPE_FLOAT_VEC.
Note that this case is intentionally designed as a fast accessor for UDFs that just passes float vectors to them, and avoids any data copying and conversion.
So if you attempt to wrap and pass anything else, a null vector will be passed to the UDF. Could be a generic mixed vector with numeric values of different types, could be an optimized int8 vector, could be a double vector - but in all these cases, despite the fact that they are compatible and could technically be converted to some temporary float vector and then passed down, that kind of a conversion just does not happen. Intentionally, for performance reasons.
PP() function

PP() function pretty-prints JSON output (which by default would be compact rather than prettified). When applied to the FACTORS() function, it will also automatically force JSON output. For example:
mysql> select id, j from lj limit 1 \G
*************************** 1. row ***************************
id: 1
j: {"gid":1107024,"urlcrc":2557061282}
1 row in set (0.01 sec)
mysql> select id, pp(j) from lj limit 1 \G
*************************** 1. row ***************************
id: 1
pp(j): {
"gid": 1107024,
"urlcrc": 2557061282
}
1 row in set (0.01 sec)
mysql> select id, factors() from lj where match('hello world')
-> limit 1 option ranker=expr('1') \G
*************************** 1. row ***************************
id: 5332
factors(): bm15=735, bm25a=0.898329, field_mask=2, doc_word_count=2,
field1=(lcs=2, hit_count=2, word_count=2, tf_idf=0.517828, ...
1 row in set (0.00 sec)
mysql> select id, pp(factors()) from lj where match('hello world')
-> limit 1 option ranker=expr('1') \G
*************************** 1. row ***************************
id: 5332
pp(factors()): {
"bm15": 735,
"bm25a": 0.898329,
"field_mask": 2,
"doc_word_count": 2,
"fields": [
{
"field": 1,
"lcs": 2,
"hit_count": 2,
"word_count": 2,
...
1 row in set (0.00 sec)

SLICEAVG(json.key, min_index, sup_index)
SLICEMAX(json.key, min_index, sup_index)
SLICEMIN(json.key, min_index, sup_index)

| Function call example | Info |
|---|---|
| SLICEAVG(j.prices, 3, 7) | Computes the average value in a slice |
| SLICEMAX(j.prices, 3, 7) | Computes the maximum value in a slice |
| SLICEMIN(j.prices, 3, 7) | Computes the minimum value in a slice |
Slice functions (SLICEAVG, SLICEMAX, and SLICEMIN) expect a JSON array as their 1st argument, and two constant integer indexes A and B as their 2nd and 3rd arguments, respectively. Then they compute an aggregate value over the array elements in the respective slice, that is, from index A inclusive to index B exclusive (just like in Python and Golang). For instance, in the example above elements 3, 4, 5, and 6 will be processed, but not element 7. The indexes are, of course, 0-based.
The returned value is float, even when all the input values are actually integer.
Non-arrays and slices with non-numeric items will return a value of 0.0 (subject to change to NULL eventually).
STRPOS() function

STRPOS() returns the index of the first occurrence of its second argument ("needle") in its first argument ("haystack"), or -1 if there are no occurrences.
The index is counted in bytes (rather than Unicode codepoints).
At the moment, needle must be a constant string. If needle is an empty string, then 0 will be returned.
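For instance, assuming title is a string attribute (hypothetical), and given a 'hello world' value, the following should return 6, because the needle starts at byte offset 6:

SELECT id, STRPOS(title, 'world') FROM test1;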
searchd has a number of server variables that can be changed on the fly using the SET GLOBAL var = value statement. This section provides a reference on all those variables.
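For example, the following statement (using one of the variables documented below) starts logging only the queries that took 1 second or longer:

SET GLOBAL query_log_min_msec = 1000;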
- attrindex_thresh
- log_debug_filter
- log_level
- net_spin_msec
- qcache_max_bytes
- qcache_thresh_msec
- qcache_ttl_sec
- query_log_format
- query_log_min_msec
- siege
- siege_max_fetched_docs
- sql_fail_filter
- sql_log_file

attrindex_thresh variable

Minimum segment size required to enable building the attribute indexes, counted in rows. Default is 1024.
Sphinx will only create attribute indexes for “large enough” segments (be those RAM or disk segments). As a corollary, if the entire FT index is small enough, ie. under this threshold, attribute indexes will not be engaged at all.
At the moment, this setting seems useful for testing and debugging only, and normally you should not need to tweak it in production.

log_debug_filter variable

Suppresses debug-level log entries that start with a given prefix. Default is an empty string, ie. do not suppress any entries.
This makes searchd less chatty at debug and higher log_level levels.
At the moment, this setting seems useful for testing and debugging only, and normally you should not need to tweak it in production.

log_level variable

Sets the current logging level. Default (and minimum) level is info.
This variable is useful to temporarily enable debug logging in searchd, with this or that verboseness level.
At the moment, this setting seems useful for testing and debugging only, and normally you should not need to tweak it in production.

net_spin_msec variable

Sets the poller spinning period in the network thread. Default is 10 msec.
The usual thread CPU slice is basically in 5-10 msec range. (For the really curious, a rather good starting point are the lines mentioning “targeted preemption latency” and “minimal preemption granularity” in kernel/sched/fair.c sources.)
Therefore, if a heavily loaded network thread calls epoll_wait() with even a seemingly tiny 1 msec timeout, that thread could occasionally get preempted and waste precious microseconds. According to an ancient internal benchmark that we can neither easily reproduce nor disavow these days (or in other words: under certain circumstances), that can result in quite a significant difference. More specifically, internal notes report ~3000 rps without spinning (ie. with net_spin_msec = 0) vs ~5000 rps with spinning.
Therefore, by default we choose to call epoll_wait() with zero timeouts for the duration of net_spin_msec, so that our “actual” slice for network thread is closer to those 10 msec, just in case we get a lot of incoming queries.
query_log_format variable

Changes the search query logging format on the fly. Default is plain, and the other option is sphinxql.

query_log_min_msec variable

Changes the minimum elapsed time threshold for the search queries to get logged. Default is 0 msec, ie. log all queries.

sql_fail_filter variable

The "fail filter" is a simple early-stage filter imposed on all the incoming SphinxQL queries. Any incoming queries that match a given non-empty substring will immediately fail with an error.
This is useful for emergency maintenance, just like siege mode. The two mechanisms are independent of each other, ie. both the fail filter and siege mode can be turned on simultaneously.
As of v.3.2, the matching is simple, case-sensitive and bytewise. This is likely to change in the future.
To remove the filter, set the value to an empty string.
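For instance, the first statement below blocks any query mentioning an (arbitrary, hypothetical) substring, and the second one lifts the filter:

SET GLOBAL sql_fail_filter = 'FROM hugetable';
SET GLOBAL sql_fail_filter = '';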
sql_log_file variable

SQL log lets you (temporarily) enable logging of all the incoming SphinxQL queries, in (almost) raw form. Compared to the query_log directive, this logger stores the queries as received, with a hardcoded ; /* EOQ */ separator and then a newline after every query, for parsing convenience. It's useful to capture and later replay a stream of all client SphinxQL queries.
For performance reasons, SQL logging uses a rather big buffer (to the tune of a few megabytes), so don't be alarmed when tail does not immediately display anything after you start this log.
To stop SQL logging (and close and flush the log file), set the value to an empty string.
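A quick session sketch (the log path here is hypothetical):

SET GLOBAL sql_log_file = '/tmp/sphinxql.log';
SET GLOBAL sql_log_file = '';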
We do not recommend keeping SQL logging on for prolonged periods on loaded systems, as it might use a lot of disk space.
New features:
- PP() pretty-printing function for FACTORS() and JSON values
- KILL <tid> SphinxQL statement
- SHOW INDEX <idx> AGENT STATUS SphinxQL statement, and moved the per-agent counters there from SHOW STATUS

Minor new additions:

- new variables in SHOW VARIABLES, namely log_debug_filter, net_spin_msec, query_log_min_msec, sql_fail_filter, and sql_log_file
- moved attrindex_thresh, siege_max_fetched_docs, siege_max_query_msec, qcache_max_bytes, qcache_thresh_msec, and qcache_ttl_sec from SHOW STATUS
- SET GLOBAL server_var support in the sphinxql_state startup script

Changes and improvements:
- removed timestamp columns support, use uint type instead (existing indexes are still supported; timestamp should automatically work as uint in those)
- removed OPTION idf and unified the IDF calculations, see "How Sphinx computes IDF"
- changed WEIGHT() from integer to float
- changed global_idf behavior; now missing terms get local IDF instead of zero
- changed OPTION cutoff to properly account all processed matches
- optimized DOT() over int8 vectors, up to 1.3x faster
- improved UPDATE handling, updates can now execute in parallel (again)
- raised SHOW THREADS query limit from 512 to 2048 bytes

Fixes:

- fixed FACTORS() argument handling, and optimized that case a little
- fixed sql_log_file race that caused (rare-ish) crashes under high query load
- fixed that CALL KEYWORDS did not use the normalized term on global_idf lookup
- fixed that INSERT only checked RAM segments for duplicate docids
- fixed COUNT(*) vs empty RT index

New features:
- (red || green || blue) pixel query syntax
- !indexme => differently exceptions syntax
- sql_attr_int8_array = myvec[128] style array attribute directives
- DOT() support for all those new array types
- int8[] and float[] JSON syntax extensions
- FVEC(json.field) support in expressions, and the respective SPH_UDF_TYPE_FLOAT_VEC support in UDFs
- BULK UPDATE SphinxQL statement
- SET GLOBAL siege (siege mode)
- sum_idf_boost, is_noun_hits, is_latin_hits, is_number_hits, and has_digit_hits per-field ranking factors
- is_noun, is_latin, is_number, and has_digit per-term flags; added the respective is_noun_words, is_latin_words, is_number_words, and has_digit_words per-query ranking factors; and added query factors support to UDFs (see sphinxudf.h)
- SET GLOBAL sql_fail_filter
- SET GLOBAL sql_log_file
- SLICEAVG, SLICEMAX, SLICEMIN functions, and STRPOS(str,conststr) function

Minor new additions:

- exceptions files
- --dummy <arg> switch to searchd (useful to quickly identify specific instances in the process list)
- JSON output in CALL KEYWORDS (for JSON output, call it with CALL KEYWORDS(..., 1 AS json))
- IS NULL and IS NOT NULL checks to ALL() and ANY() JSON iterators
- last_good_id to TSV indexing error reporting
- ram_segments counter to SHOW INDEX STATUS, and renamed two counters (ram_chunk to ram_segments_bytes, disk_chunks to disk_segments)
- sql_query_kbatch directive, deprecated sql_query_killlist directive
- <sphinx:kbatch> support to XML source
- new server variables (net_spin_msec, for example)

Changes and improvements:
- thread_stack now
- stopwords handling, fixed the hash collisions issue
- stopwords directive, made it multi-valued
- global_idf handling, made global IDFs totally independent from per-index DFs
- EXPLAIN, ensured that it always reports the real query plan and stats
- expression issues (GEODIST() vs JSON, crash in COALESCE() args check, etc)
- FACET handling, single-search optimization must now always engage
- indexer --nohup to rename index files to .new on success
- query_time metric behaviour for distributed indexes, now it will account wall time
- searchd.log

Major optimizations:

- optimized ORDER BY clauses, up to 1.1x speedup
- optimized DOT() for a few cases like int8 vectors, up to 2x+ speedup

Fixes:
- ORDER BY RAND() was breaking WEIGHT() (also, enabled it for grouping queries)
- ATTACH
- max_window_hits() and exact_order factors
- SHOW META after index-less SELECT
- ALL() and ANY() vs optimized JSON vectors, and fixed optimized int64 JSON vector accessor
- SHOW THREADS ... OPTION columns=X limit permanently clipped the thread descriptions
- /searchd HTTP endpoint error format
- cutoff and other query limits
- json_packed_keys issues
- INSERT
- UPDATE and INSERT
- SNIPPET(field,QUERY()) case to some extent (we now filter out query syntax and treat QUERY() as a bag of words in this case)
- WHERE conditions from the query
- regexp_filter vs ATTACH
- indexer --dump-rows-tsv switch, and renamed --dump-rows to --dump-rows-sql
- COALESCE() function support for JSONs (beware that it will compute everything in floats!)
- !=, IN, and NOT IN syntax to expressions
- prefix_tokens and suffix_tokens options to blend_mode directive
- OPTION rank_fields, lets you specify fields to use for ranking with either expression or ML (UDF) rankers
- indexer
- batch_size variable to SHOW META
- csvpipe_header and tsvpipe_header directives
- sql_xxx counters to SHOW STATUS, generally cleaned up counters
- blend_mixed_codes and mixed_codes_fields directives
- OPTION inner_limit_per_index to explicitly control reordering in a nested sharded select
- max_matches (must be under 100M)
- FACET queries with expressions and simple by-attribute (no aliases!) facets; multi-sort optimization now works in that case
- id lookups (queries like UPDATE ... WHERE id=123 should now be much faster)
- PACKEDFACTORS() storage a lot (up to 60x speedup with max_matches=50000)
- LIMIT anymore)
- searchd --listen switch, multiple --listen instances are now allowed, and --console is not required anymore
- @count, @weight, @expr, @geodist syntax support
- SetWeights(), SetMatchMode(), SetOverride(), SetGeoAnchor() calls, SPH_MATCH_xxx constants, and SPH_SORT_EXPR sorting mode from APIs
- spelldump utility
- .sha index files

Major fixes:
Other fixes:
- min_best_span_pos was sometimes off
- global_idf file
- indextool --check vs string attributes, and vs empty JSONs
- OPTIMIZE vs UPDATE race; UPDATE can now fail with a timeout
- indexer --merge --rotate vs kbatches
- BITCOUNT() function and bitwise-NOT operator, eg SELECT BITCOUNT(~3)
- searchd config section completely optional
- min_infix_len behavior, required 2-char minimum is now enforced
- json_packed_keys in RT
- uptime counter in SHOW STATUS
- PACKEDFACTORS()
- full_field_hit ranking factor
- bm15 ranking factor name (legacy bm25 name misleading, to be removed)
- exact_field_hit ranking factor, impact now negligible (approx 2-4%)
- indexer output, less visual noise
- searchd --safetrace option, now skips addr2line to avoid occasional freezes
- indexer MySQL driver lookup, now also checking for libmariadb.so
- searchd crash caused by attribute indexes
- indexer crash on missing SQL drivers, and improved error reporting
- searchd crash on multi-index searches with docstore
- BM25F() weights map
- ALTER failed on field-shadowing attributes vs index_field_lengths case
- searchd startup failures (threading related)
indexerwill not be going anywhere. Moreover, despite being fully and truly RT, the new format is actually already faster at batch indexing.
The biggest changes since Sphinx v.2.x are:
WHERE gid=123 queries can now utilize A-indexesWHERE MATCH('hello') AND gid=123 queries can now efficiently intersect FT-indexes and A-indexesAnother two big changes that are already available but still in pre-alpha are:
./sphinxdata folder)The additional smaller niceties are:
blend_mode=prefix_tokens, and enabled empty blend_modekbatch_source directive, to auto-generate k-batches from source docids (in addition to explicit queries)SHOW OPTIMIZE STATUS statementexact_field_hit ranking factor123.45f value syntax in JSON, optimized support for float32 vectors, and FVEC() and DOT() functionsSNIPPETS() (via hl_fields directive)A bunch of legacy things were removed:
dict, docinfo, infix_fields, prefix_fields directivesattr_flush_period, hit_format, hitless_words, inplace_XXX, max_substring_len, mva_updates_pool, phrase_boundary_XXX, sql_joined_field, subtree_XXX directivesindexer --keep-attrs switchAnd last but not least, the new config directives to play with are:
docstore_type, docstore_block, docstore_comp, docstore_cache_size (per-index) let you generally configure DocStorestored_fields, stored_only_fields, hl_fields (per-index) let you configure what to put in DocStorekbatch, kbatch_source (per-index) update the legacy k-lists-related directivesupdates_pool (per-index) sets vrow file growth stepjson_packed_keys (common section) enables the JSON keys compressionbinlog_flush_mode (searchd section) changes the per-op flushing mode (0=none, 1=fsync, 2=fwrite)Quick update caveats:
sql_query_killlist then you now must explicitly specify kbatch and list all the indexes that the k-batch should be applied to:sql_query_killlist = SELECT deleted_id FROM my_deletes_log
kbatch = main
# or perhaps:
# kbatch = shard1,shard2,shard3,shard4This documentation is copyright (c) 2017-2020, Andrew Aksyonoff. The author hereby grants you the right to redistribute it in a verbatim form, along with the respective copy of Sphinx it came bundled with. All other rights are reserved.