This is a review of O’Reilly’s excellent new (April 2011) book
Introduction to Search with Sphinx. This review is also a bit of
a reading guide, if you don’t mind. I am not a computer science
guru, as you’ll notice. I just want a search box on my website that
finds the right stuff. I have actually read the book. Sphinx is
open source. ISBN: 9780596809539
In the Preface, the author gives us this heads up about the audience
he expects: “I assume that readers have a basic familiarity with
tools for system administrators and programmers, including the
command line and simple SQL.” This sounds good to me–it doesn’t
say I need to know anything about search, which is good, because
I don’t. The more experience with Linux, MySQL, SQL and programming
you bring, the easier this book will be to read. If BM25 and NLP
don’t mean much to you, you’ll have to get a CompSci PhD for a few
of the paragraphs in this book.
Chapter 1, “The World of Text Search”, is awesome fun. Who knew that
Sphinx (and Google) searches use AND by default but that Lucene uses
OR by default? Chapter 1 is mostly a refreshing breeze, but we do
get precision about the meaning of many terms including document,
index, match, result set. A beautiful equivalency is presented
that database people will love and should memorize:
Database table ≈ Sphinx index
Database rows ≈ Sphinx documents
Database columns ≈ Sphinx fields and attributes
This equivalency is conceptually accurate and makes Sphinx less
intimidating to those of us not intimidated by an SQL database.
Chapter 1’s rougher stuff, the Linguistics Crash Course, is
inherently rough but well presented, and you’ll get a clue about
things like homonyms, NLP, part-of-speech, word-sense-disambiguation,
stemming, lemmatization, and other stars of morphology processing.
You’ll feel much smarter after reading these few short pages in
only a few minutes. You’ll also wish you knew what a predicate was
like you once did back in fourth grade. There is heavy math, AI and
theory down there where the flesh meets the bone of linguistics and
text search. This very readable and exciting dip into text search
is a masterpiece of brevity. (Oh, I’m sorry, would you prefer
a bookshelf or two of this stuff?) Chapter 1 is generally very
smooth reading but does drop one term without explaining it,
“statistical text rankings such as BM25” (you’ll get your
explanation in Chapter 6). Chapter 1’s introduction to the nature
of full-text indexes and the inverted file is easy to understand.
I was surprised to learn that Sphinx indexes are a whopping-large
60-70% of the size of the indexed text. When done with this section,
you’ll know the key features of an index and what the two Sphinx
index types are like. Chapter 1 finishes with a discussion of
Search Workflows, a practical section that discusses how to get
Sphinx to find your stuff. The Search Workflows section renders
Sphinx very approachable. Be sure to understand the
illustrations and concepts in this section. The reading here
is easy enough to tempt you into skimming mode, but taking your
time will pay off later when other concepts build on the Sphinx basics.
In the Index Approaches section, we learn the difference between
batch-indexing and incremental-indexing, and we start to see that
Sphinx is super-powerful. The author may have gone too far in this
section when he says that handling really big jobs that need multiple
machines and data sharding “isn’t fully automatic with Sphinx,
but it’s pretty easy to set up.” (“Billy! Honey, pick up your room,
finish your algebra homework and when you’re done, shard that data
and set up a Sphinx search cluster for me please.”) We learn about
the two main types of searches: full-text queries and full scans.
Full-text queries search all the text in a document. Full scans
search attributes attached to the document, like author_id or
creation_date. The last section of Chapter 1, Kinds of Results,
teases us with possibilities involving rewritten queries, query
suggestions, snippets (excerpts) and facets. Unfortunately, we
never return to these topics in this book for in-depth treatment.
Even so, it’s nice to know that Sphinx is snippets-and-facets ready.
If you’ve ever worked with Solr and its Lucene index and searching
core, you will see that Sphinx is a different take on a similar problem.
Chapter 2 gets to the red meat of running Sphinx. You will need
to brush up on any Linux and MySQL basics that have gotten rusty.
You should also read the book next to a printout of the example
sphinx.conf and example.sql files (or have them open on a computer
screen) that are discussed in this chapter.
Data sources, index configuration and other Sphinx basics get
good coverage in this chapter. One thing that was not explained
thoroughly enough was the sql_file_field directive. Sphinx is
great at indexing the contents of SQL tables and XML data. The
sql_file_field directive goes a step further: it tells Sphinx to
index the file that is named in that field. For example, if the
column contents are “/storage/123.txt”, then sql_file_field tells
Sphinx to go get that file and index that file’s contents. Sounds
exciting to me. But this seemingly very important feature gets glossed
over too quickly. I want to know more about how to use sql_file_field.
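As a rough sketch of how the directive is wired up: the column is selected in sql_query as usual, and sql_file_field then tells indexer to treat its value as a path and index the file it points to. The credentials, table, column and index names below are made up for illustration and are not taken from the book. According to the Sphinx documentation, files larger than the indexer's max_file_field_buffer setting are skipped with a warning rather than stopping the run.

```
# sphinx.conf fragment (hypothetical names throughout)
source files_src
{
    type           = mysql
    sql_host       = localhost
    sql_user       = sphinx_user
    sql_pass       = secret
    sql_db         = mydocs

    # content_path holds values such as /storage/123.txt
    sql_query      = SELECT id, title, content_path FROM documents

    # index the contents of the file named in content_path,
    # rather than the path string itself
    sql_file_field = content_path
}

index files_idx
{
    source = files_src
    path   = /var/lib/sphinx/files_idx
}
```

The files have to be readable by indexer on the machine where it runs, and they have to be plain text already; nothing here extracts text from PDFs or office documents.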
On page 32, we see our first SphinxAPI PHP program. Your favorite
programming language works with Sphinx, but the examples (all
very simple) are in PHP. On page 32, though, the PHP code
doesn’t import any libraries–working code would have been nicer,
especially since the O’Reilly page for the book does not have a link
to code downloads. This book is best read by going along with it
using a computer. I had to search elsewhere to find out which
PHP imports were required to make the code in the book work.
That said, when I was searching at sphinxsearch.com, I was very
pleased with the online Sphinx documentation and its archived help
from fellow sphinxers. Two good reasons to use Sphinx are: 1)
the O’Reilly book exists and 2) the online documentation is good.
Searching for less than one minute at sphinxsearch.com, I learned
that I need to include sphinxapi.php. My package manager put it at
/usr/share/doc/libsphinxclient-0.9.9/sphinxapi.php. After finding
it and copying it to the same directory as the script, this worked:
include("sphinxapi.php"); // load the SphinxAPI client library
$cl = new SphinxClient(); // create the client object used for queries
The SphinxAPI is a pleasure to use. I was feeling luxurious using
it until he mentioned that “code should, of course, also add error
handling”. Yes, it should, and authors should remind us of that
I suppose. Except for error-handling, code using the SphinxAPI
is easy on the eyes and explains what is happening, “filter by
this, sort by that”. The SphinxAPI is clearly our friend, and
saves us from “building the SQL statement string from pieces”.
But sometimes you need straight SQL to describe what you want,
and so we learn about SphinxQL, an SQL-like query language that
searchd understands. Because Sphinx supports the MySQL wire
protocol, your shipped-with-Linux mysql client can talk to searchd.
That’s nice. Just revert to your old MySQL-from-scratch habits.
The whole searchd-is-almost-mysql and SphinxQL-is-almost-SQL business
is intriguing, but it’s best not to think too hard about it–just
move along. The online documentation for the
SphinxAPI is excellent–but it is a little bit strange if you’ve
never read API documentation designed to be used with multiple
programming languages. For example, this documentation might tell
you that a function returns a hash. In PHP, that is fine and good,
but how easily will it translate to another programming language
where a method returns some other data structure? Using the book
and the online API documentation will be easiest for PHP programmers.
Sphinx and its documentation are usable with Perl, PHP, Python, Java,
Ruby, C# and C/C++. To quote from the book on this topic: “When you
send a query to Sphinx using a programming API, the result combines
row data and metadata into a single structure. The specific structure
used varies depending on the language you’re using (an associative
array in PHP, Perl, and Python; a struct in pure C; a class in Java;
etc.), but the structure of member names and their meanings stay
the same across APIs.” Another encouraging quote: “We will use PHP
for the examples, but the API is consistent across the supported
languages, so converting the samples to your language of choice,
such as Python or Java, should be straightforward.” The bottom
line is that the Sphinx API for your language is unmysterious and
its documentation is usable (with a little translation if you’re
not using PHP).
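For the SphinxQL route described above, a session might look roughly like this. It assumes searchd has a MySQL-protocol listener configured (a line such as listen = 9306:mysql41 in the searchd section) and that you have connected with the stock client, for example mysql -h 127.0.0.1 -P 9306. The test1 index is the one built in Chapter 2; the group_id and date_added attributes are assumed to match the sample data shipped with Sphinx, so treat the exact query as an illustration only.

```sql
-- keyword search in test1, filtered and sorted by attributes
SELECT * FROM test1
WHERE MATCH('test') AND group_id = 2
ORDER BY date_added DESC
LIMIT 5;

-- metadata about the previous query (match totals, time, per-keyword stats)
SHOW META;
```

The MATCH() clause carries the full-text part of the query, while the rest of the WHERE clause filters on attributes, which is the same “filter by this, sort by that” split the SphinxAPI calls express.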
It’s nice to see the author and O’Reilly devote a few pages to
compiling from source. Package management is great, except when
it isn’t. It’s nice to compile from source because you’ll sleep
better knowing that you can survive a bad update. Don’t be a noob,
compile it yourself! While reading or after finishing Chapter 2
is a great time to install and fool around with Sphinx. Build the
test1 index as described and then build another one from your own
source of data. Sphinx is easy. No, seriously, install it and
fool around. I’ll wait for you. Seriously–install Sphinx and do
the test examples as shown in Chapter 2. It takes 5 minutes, do it.
Chapter 3, Basic Indexing, is not very basic, but it covers the
details of important indexing concepts and techniques. It covers
SQL indexes and XML indexes. The SQL section is straightforward
and demonstrates the tools you’ll need when dealing with large
amounts of data. The “Indexing XML Data” section reminds me of that quip:
“your XML problems can only be solved with more XML”. I was hoping
to learn that Sphinx would suck in all the XML data on my server
and make it searchable (kind of like how Google’s Picasa finds
all of your photos). Our author gets to the ugly truths quickly:
“Every index has a schema—that is, a list of fields and attributes
that are in it” and “We need to know what data we’re going to
index before we start processing our first document.” Throw in
some character sets and encodings and grab the Pepto-Bismol.
Sphinx will index your XML data, but if you want it to be easily
found, then you’ll need to do some work. The good news is that
if your XML data is relatively uniform and you know its structure,
then indexing it with Sphinx won’t be hard. For example, if you’ve
got tons of WordPress blog posts to index, you’ll be able to do it.
The Sphinx-related details of making your XML searchable are simple.
This simple configuration can go in the sphinx.conf configuration
file or in your XML stream. The discussion about stopwords–very
common words like “the”, “and” and “or”–and how to deal with them is
fascinating. You probably won’t know what to do about stopwords
in your data, but you will learn how to make Sphinx deal with them
the way you want. Chapter 3, Basic Indexing, covers not-so-fun
material but it is presented clearly. Reading this chapter is
necessary and will build confidence in Sphinx and our able author.
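To make the schema point concrete, here is a minimal sketch of an xmlpipe2 stream that declares its schema inline, in the style shown in the Sphinx documentation. The field and attribute names (title, content, published) and the sample values are placeholders chosen for a blog post, not something from the book.

```xml
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>

  <!-- the schema: which tags are full-text fields, which are attributes -->
  <sphinx:schema>
    <sphinx:field name="title"/>
    <sphinx:field name="content"/>
    <sphinx:attr name="published" type="timestamp"/>
  </sphinx:schema>

  <!-- one document per post; id must be a unique positive integer -->
  <sphinx:document id="1">
    <title>Hello world</title>
    <content>First post, stripped of markup.</content>
    <published>1301616000</published>
  </sphinx:document>

</sphinx:docset>
```

The matching source block in sphinx.conf would use type = xmlpipe2 together with an xmlpipe_command that prints this stream to standard output. Tags that are not declared as a field or an attribute do not get indexed, which is why the structure has to be known up front.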
In Chapter 4, Basic Searching, we learn that Sphinx has an interesting
development history. This history is required reading
because Sphinx and its configuration are tied to that legacy.
Along the way we see a literary side of our author with peaches
like the “ranking function was nailed onto its matching mode
with nine-inch titanium nails” and “The milkman’s lot isn’t
as sought after as it once was…” O’Reilly’s editors are good
to stay out of the way of sparse poetry and especially titanium
milkmen. This chapter is necessarily long. We learn about the
three cornerstones of search–KEYWORDS (e.g. titanium milkmen),
OPERATORS (AND/OR/NOT, e.g. titanium OR milkmen) and MODIFIERS
(^=$, e.g. “^Once upon a time”). We also get the guts of result
sets (results themselves and metadata about the search itself),
searching multiple indexes simultaneously (for performance),
result set processing, filtering and sorting. Fully digesting
Chapters 3 and 4 will allow you to understand your searches and
their results. You’ll feel confident trying to (or even be able to)
fine-tune your results to your slightest whims. After Chapter 2,
I felt like Sphinx was a ladle I could use with my pot of soup.
After Chapter 4, Sphinx feels more like tweezers I can use to pull
out exactly the needles I want.
Chapter 5, Managing Indexes, gets us back on the learn-by-doing
track. Sphinx is oriented to deal with very large document
repositories. Eventually your search will get too slow and your
indexing strategy will break down. You’ll need more machines,
more disk IO, multiple indexes, more CPU. You’ll need an indexing
strategy that can get you an updated index in less than 3000 hours.
To keep it going, and to keep adding new content and deleting
old content, you’ll need another sysadmin. Finding something
in terabytes of text in a few seconds will make you feel like
a mini-Google. So get your cluster going and just do it.
This chapter will be most useful to experienced sysadmins and
benchmarkers. This book and especially this chapter, I suspect,
will be very interesting to teams currently supporting a large search
implementation (at a large library or corporation). Teams currently
supporting search implementations that are based on a huge pile
of Java, Lucene, XSLT and gobs of XML with ten different 50-stage
pipelines might blush after reading this book. Perhaps the search
at oreilly.com is due for an assessment. (For the record, searching
http://www.oreillynet.com/search/ for Andrew Aksyonoff gives 73325
results and “Andrew Aksyonoff” gives 1 result–not so great).
After reading the indexing and searching chapters, Chapter 6,
Relevance and Ranking, seems unnecessary. After all, getting
relevant results is all about generating the right indexes from
your data and searching them with the right queries. But then I
(tried to) read Chapter 6. To be honest, it left me knocked out on
the floor. Reading about the black art of relevance assessment made
me wonder if our author is a human being. Yes, he is a Linux admin,
a MySQL admin, a great programmer and a good writer. Like the best
O’Reilly authors, this one has top-shelf computer science chops.
Andrew Aksyonoff is from planet Sebastopol. I will update this
review of Chapter 6 in ten years after I’ve had time to get the
background material required to fully understand it. To be fair,
this chapter does provide very useful tips for rigging your search
results (for example, to make sure your boss’s document always comes
up first–“Your paper must be the most important one, Mr. Johnson”).
You still have to read Chapter 6–take what you can, leave what
you can’t and move along. If nothing else, learn how to rig your
boss’s paper so it comes up first in the search results.
Introduction to Search with Sphinx is what you’d expect from
O’Reilly–a book you can use to do real work after a very short amount
of reading. Sphinx itself, from what I can tell, is ready for an
O’Reilly book because Sphinx works as advertised and fills a huge gap
in software: medium-to-large-sized search power that’s easy
to work with. I recommend stopping after Chapter 2 and actually
setting up and playing with Sphinx. The other chapters should
also be read, in order. The whole book can be read by a decent
Linux-MySQL sysadmin and programmer in a couple of days.
If you are actually going to deploy, run and depend on Sphinx,
then you’ll need to spend time on the excellent Sphinx site at
http://sphinxsearch.com/. You’ll also want to grab the other
book from Packt (disclaimer: I don’t have the book and haven’t
read it). If you are new to MySQL, O’Reilly’s “Learning MySQL”
is by far the best place to start. After that you’ll want to grab
O’Reilly’s High Performance MySQL (whose Appendix C is also
written by Andrew Aksyonoff).
This book has no index in the back of the book! This is like a
bad joke–a search book without an index. I’m serious about the
index thing–I don’t think it’s funny. Another problem (in my view)
is that there is no mention of Apache Tika. This seems to be an
omission, given that Sphinx eats text from XML documents, while Joe
and Jane Website want it to search rich documents like PDFs, MSWord,
MSExcel, etc. The Apache Tika parser seems like it would be very close
friends with Sphinx because Tika can spit out XML from rich documents
like PDF. In Chapter 1, our author mentions “DOC, PDF, MP3, AVI”
but says “Sphinx is not able to automatically identify types,
extract text based on type, and index that text.” Instead he
could have introduced us to Sphinx’s friend Apache Tika, a vampire
that sucks text from PDF and other unsuspecting rich-text snobs.
Another problem (in my view) is the scant information about what
to do with gobzoodles of XML data structured in too many ways.
He says we need a schema. (Funny how Solr and Lucene tell us
the same thing). But even without a silver bullet, some general
advice would be nice. For example, we cannot just put these seven
lines around every XML stream we have, because content inside of
undeclared-XML-tags is ignored:
WARNING: THIS IS NOT GOOD ADVICE:
<?xml version="1.0" encoding="utf-8"?>
Awesome XML stream content, perhaps even stripped of all tags, dude.
I want to know what to do with all my bad XML data. And I want
a different answer than “create fifty schemas for it.” OK, sorry
about the ignorance, whining and ranting.
Introduction to Search with Sphinx and Sphinx itself are great
achievements. If you need to implement your own search or if you’re
currently hoisting a broken behemoth of a search implementation, get
yourself a copy. My message to the author and the Sphinx team is:
Congratulations and thanks!
O’Reilly’s page for Introduction to Search with Sphinx is at: