Installing Drupal 7 on Mac OSX 10.6 Snow Leopard

Installing Drupal on a Mac using MacPorts:
First steps:
1. Install Appleʼs “Development Tools” (Xcode), which youʼll find on the “Applications”
DVD-ROM that came with your Apple machine.
2. Run “Software Update” in “System Preferences” on your Mac to get the updates from
Apple for the Development Tools you just installed.
3. Download and install Macports from macports.org. This is a .dmg file that you
double-click on and go–easy.
To install Drupal7 on your Apple, we are going to use MacPorts for everything. Everything
will be done on the command-line using Macʼs Terminal program (which youʼll find in
“Applications–>Utilities–>Terminal”). Many of the commands we type require
superuser permissions, so weʼll be prefixing them with “sudo”, and sometimes it will
make you type a password for a user with admin-level permissions on the Mac.
Instead of macports, you can also use the default Apache that comes with every Apple,
and install a double-clickable mysql-server .dmg file from mysql.com and manually
install Drupal7 from drupal.org. Or you could instead install Drupal7 using the MAMP
method. Macports is a good option for two important reasons:
A. Macports software is compiled from source on your own Mac.
B. It is often used by experts, and therefore the online help youʼll find is usually good.
4. Update Macports. In Terminal, type this and wait (it can take a while):
sudo port selfupdate
5. Install apache2, php, mysql client, mysql-server, drupal7:
sudo port install apache2
sudo port install mysql5
sudo port install mysql5-server
sudo port install php52
cd /opt/local/apache2/modules
sudo /opt/local/apache2/bin/apxs -a -e -n "php5" libphp5.so
sudo port install drupal7
sudo port installed (lists the ports installed so far)
sudo port list (lists every available port)
sudo port list | grep agick (example search, e.g. for ImageMagick-related ports)
6. Setup mysql-server:
sudo port load mysql5-server
cd /opt/local/bin
sudo -u _mysql mysql_install_db5
cd /opt/local/; sudo /opt/local/lib/mysql5/bin/mysqld_safe &
/opt/local/lib/mysql5/bin/mysqladmin -u root password 's3kr3t'
cd /opt/local/bin/
mysql5 -u root -p
mysql> quit
7. Copy the Drupal files to apache2ʼs document directory:
cd /opt/local/www/data/drupal7/
sudo cp -a ./ /opt/local/apache2/htdocs/
sudo cp /opt/local/etc/php5/php.ini-dist /opt/local/etc/php5/php.ini
8. Restart apache2 and make changes to fix apache2:
sudo /opt/local/apache2/bin/apachectl -k restart
links http://localhost/ (gives you “It works” from index.html).
links http://localhost/index.php (gives you php source code)
….at this point two problems still exist:
1. Apache is returning index.html as default document, not index.php
2. Apache is not parsing PHP code, just returning the source code as text.
…to fix the index.html-vs-index.php problem we change DirectoryIndex ordering.
…to fix the PHP not parsing problem, add an “AddType PHP” directive to httpd.conf
Fix both problems like this:
******************************************************
sudo cp /opt/local/apache2/conf/httpd.conf /opt/local/apache2/conf/httpd.conf.ORIG
sudo vi /opt/local/apache2/conf/httpd.conf
sudo /opt/local/apache2/bin/apachectl -k restart
sudo cp /opt/local/apache2/conf/httpd.conf /opt/local/apache2/conf/httpd.conf.070611-WORKS
links http://localhost/index.php (gives you Drupal page YAY!).
******************************************************
..if you donʼt know vi, try nano or pico, which are easier text editors.
..during our vi editing session, we commented-out one line and added two lines:
#DirectoryIndex index.html
DirectoryIndex index.php index.html
AddType application/x-httpd-php .php
…and now we have a working setup and a backup httpd.conf. Notice that index.php
comes before index.html…this makes apache serve the PHP file by default when both are
in the requested directory. The AddType line makes PHP work (that is, it causes
apache2 to execute the PHP source code instead of returning it verbatim to the user).
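If you would rather script these two httpd.conf edits than make them in vi, sed can do both. This is only a sketch: it works on a copy of the file (fabricating the one relevant line if the MacPorts file isnʼt present) so you can inspect the result before installing it with sudo and restarting apache2 as above.

```shell
# Sketch: apply the two httpd.conf fixes to a working copy.
SRC=/opt/local/apache2/conf/httpd.conf
TMP=/tmp/httpd.conf.new

# If the MacPorts file isn't present, fabricate the one line we edit,
# purely so the sketch can be followed anywhere.
[ -f "$SRC" ] || { SRC=/tmp/httpd.conf.demo; echo "DirectoryIndex index.html" > "$SRC"; }

# 1) Serve index.php before index.html when both are present.
sed 's/^\([[:space:]]*\)DirectoryIndex index.html/\1DirectoryIndex index.php index.html/' "$SRC" > "$TMP"

# 2) Make apache2 execute .php files instead of returning the source.
echo 'AddType application/x-httpd-php .php' >> "$TMP"

grep -E 'DirectoryIndex|AddType' "$TMP"
```

If the output looks right, install the copy with sudo cp /tmp/httpd.conf.new /opt/local/apache2/conf/httpd.conf and restart apache2.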
9. One file that came with Drupal and is worth looking over is the .htaccess file in the
Drupal root. Apache applies (or should be applying) the directives in this file to the Drupal directories.
10. At this point, Drupal is working but is not setup. There are a few additional pieces of
administration necessary. What still needs to be fixed is not obvious. But because
Drupalʼs php code can execute, letʼs run it and see that it offers helpful
troubleshooting. When it complains about a problem, we will address and fix it until
it doesnʼt complain. At this point, Step #10 is the last “step”. From now on, we will
be clicking on links and filling out forms in our web browser and also running
command-line operations. These are easier to explain by discussion instead of
listing discrete steps.
On our fresh new site, when we visit http://localhost/ we get index.php, but it
in turn sends us to install.php (presumably because Drupal knows it hasn’t
been setup yet). The questions on the install.php page are:
———————–
Choose profile(active)
Choose language
Verify requirements
Set up database
Install profile
Configure site
Finished
———————–
…we should document our deployment choices. The first choice (profile)
asks us if we want “standard” or “minimal” and lets us click a button that
says “Save and Continue”.
…Now is a good time to read up on these choices so that we make good
choices…and then we’ll document them so we know which choices were made.
Continuing along, we are now at this Drupal location in our web browser:
http://localhost/install.php
…and we are making choices:
Choose Profile—–“Standard”
Choose Language—-“English”
Verify Requirements—NOT GOOD! These problems exist:
1. Database Support _NO
2. File System:”The directory sites/default/files does not exist”
3. Settings File:”./sites/default/settings.php file does not exist.”
…this is not too bad. To fix problems 2 and 3, we will make the “files” directory and will
copy in a default “settings.php” file to the proper location. We will also make the whole
directory owned by the apache2 user so that it has the permissions necessary to create
new files (so that users can upload files…and so that Drupal admins can add Drupal
modules). Run these commands in your command-line terminal:
******************************************************
cd /opt/local/apache2/htdocs/
ps -ef | egrep "PID|http"
cat /etc/passwd | grep 70
sudo mkdir sites/default/files
sudo cp sites/default/default.settings.php sites/default/settings.php
sudo chown -R _www ./
sudo /opt/local/apache2/bin/apachectl -k restart
******************************************************
…notice that we check (again) and see that apache is running as userid 70,
which on this MacOSX10.6 box’s /etc/passwd file indicates user “_www”. We
thus change the ownership of the htdocs directory recursively (all files/dirs below)
to new owner “_www”. Changing ownership is the only permissions change that we
should need to make because Drupal ships the files with the proper permissions by
default–for example–the “files” and “modules” and “themes” directories all ship
owner-writable so that users can upload files and admins can install modules and
themes. We also copy the default settings template into place as settings.php and restart apache.
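As an aside, grepping /etc/passwd for “70” can match the wrong line, since 70 might appear in any field. A safer sketch matches the UID field exactly with awk; it is demonstrated here on a fabricated copy of the OSX _www line (point awk at the real /etc/passwd in practice):

```shell
# Sketch: map a numeric UID to a username by matching passwd's third
# field exactly, instead of grepping for "70" anywhere in the line.
# Demonstrated on a fabricated passwd file; use /etc/passwd for real.
printf '_www:*:70:70:World Wide Web Server:/Library/WebServer:/usr/bin/false\n' > /tmp/passwd.demo
awk -F: '$3 == 70 { print $1 }' /tmp/passwd.demo   # → _www
```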
…after performing these steps and reloading our web browser on this page:
http://localhost/install.php
…both problems #2 and #3 went away.
…the first problem, “No Database Support”, was fixed like this:
******************************************************
sudo port uninstall php52
sudo port install php52 +apache2 +mysql5 +pear
cd /opt/local/apache2/modules
sudo /opt/local/apache2/bin/apxs -a -e -n "php5" libphp5.so
sudo /opt/local/apache2/bin/apachectl -k restart
******************************************************
…first we reinstalled php52 but added some extra “extensions”. The
macports php52 comes with several extensions, including:
curl, freetype, jpeg, libpng, libmcrypt, libxml2, libxslt, mhash, and tiff
…but we’re also going to want apache2, mysql5 and pear.
…after running the commands shown and reloading this Drupal setup page:
http://localhost/install.php
…the “Database Support” problem was fixed.
…at this point in our Drupal7 setup, the “Verify Requirements” page reports “No
problems”–yay.
…continuing to go through the steps on the Drupal7 setup page at:
http://localhost/install.php
…we are now on the “Database Configuration” page. It wants to know:
1. Database name (and this database must already exist).
2. Database username
3. Database password
…this page won’t be successfully completed until we create the database
and a user/pass that has permissions on it. These steps were run to set that
up:
******************************************************
cd /opt/local/bin/
./mysql5 -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.1.57 Source distribution
Copyright (c) 2000, 2010, Oracle and/or its affiliates. All rights reserved.
This software comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to modify and redistribute it under the GPL v2 license
Type ‘help;’ or ‘\h’ for help. Type ‘\c’ to clear the current input statement.
mysql> create database drupalseven;
Query OK, 1 row affected (0.00 sec)
mysql> GRANT ALL ON drupalseven.* TO 'drupaluser'@'localhost' IDENTIFIED BY 'QW1ET';
Query OK, 0 rows affected (0.00 sec)
mysql> quit
Bye
******************************************************
…we may need to run FLUSH PRIVILEGES to get mysql to re-read its permission
tables, or maybe the new grants work out of the box (grants made with GRANT
normally take effect immediately). Restarting mysqld does not appear to be
easy on this macports/osx10.6 box. To make mysqld reload, these steps were
run:
******************************************************
cd /opt/local/bin/
./mysql5 -u root -p
ps -ef | grep mysql (OUTPUT SHOWS mysqld_safe has PID 2424)
sudo kill -HUP 2424
./mysql5 -u drupaluser -p (WORKS)
******************************************************
…and with that we have a database named “drupalseven” and a user named
“drupaluser” whose password is “QW1ET”. User “drupaluser” was granted
all permissions on every table in the “drupalseven” database. We can now
continue by filling in these items on the “Database Configuration” portion
of the Drupal7 setup at:
http://localhost/install.php
DBNAME: drupalseven
DBUSER: drupaluser
DBPASS: QW1ET
…after entering this information, and clicking to continue, the
next page looked good, there was a status bar, and 28 of 28 things
got done…most of the text that went by said things about installing
modules, like this: “Installing the Toolbar Module”.
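Incidentally, the interactive mysql session above can also be scripted for repeatable rebuilds. A sketch, reusing this walkthroughʼs example names and password (drupalseven, drupaluser, QW1ET); the trailing FLUSH PRIVILEGES is harmless insurance in case the grants do not take effect immediately:

```shell
# Sketch: the same database setup as the interactive session above,
# written to a file so it can be replayed. Example names/password only.
cat > /tmp/drupal-db.sql <<'EOF'
CREATE DATABASE IF NOT EXISTS drupalseven;
GRANT ALL ON drupalseven.* TO 'drupaluser'@'localhost' IDENTIFIED BY 'QW1ET';
FLUSH PRIVILEGES;
EOF

# Feed it to mysql (prompts for the root password set earlier):
#   /opt/local/bin/mysql5 -u root -p < /tmp/drupal-db.sql
cat /tmp/drupal-db.sql
```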
…The next page of the Drupal7 setup page at:
http://localhost/install.php
…is called “Configure Site” and asks for these things to be entered:
Site Name:
Site Email:
Site Maintenance Account:
—-user
—-email
—-pass
—-re-enter pass
Server Settings:
Default Country:
…at this point a good question is how the mail is going to work. If I paste
in a gmail address, how will I know if my machine and network will let it all
out? OSX10.6 does not ship with a running mail server: the mail command below
just hangs, and telnet shows nothing listening on port 25:
******************************************************
mail -s "Testing RU There" `whoami`@`hostname`
telnet 127.0.0.1 25
Trying 127.0.0.1…
telnet: connect to address 127.0.0.1: Connection refused
telnet: Unable to connect to remote host
******************************************************
…try adding a gmail/yahoo address and see if it lets you continue. Otherwise, these
steps might work:
******************************************************
sudo launchctl start org.postfix.master
telnet 127.0.0.1 25 (if the connection is refused, start postfix again and retry)
sudo launchctl start org.postfix.master
telnet 127.0.0.1 25
mail localusername (send yourself a local test message)
less /var/mail/localusername (confirm the message arrived)
hostname
******************************************************
…whatever you get from the hostname command, write it down. Otherwise, you can
disable all of your networking (airport off, ethernet disabled, etc.) and type the
“hostname” command again to get a hostname like “Local-Usernames-Imac.local”.
When entering the Drupal-setup-email information, try one of these possibly-valid local
email addresses:
localuser@hostname.pitt.edu
localuser@Local-Usernames-Imac.local
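The possibly-valid local address can be assembled straight from the whoami and hostname commands; a sketch:

```shell
# Sketch: assemble the possibly-valid local address suggested above.
# Remember that hostname can change when you change networks.
ADDR="$(whoami)@$(hostname)"
echo "$ADDR"
```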
…with any luck Drupal7 says:
“Drupal installation complete”
YAY!
…it prompts me with a link to “visit your new site”, that goes here:
http://localhost/
The Drupal7 setup will annoyingly not complete unless you can get past the Drupal7-
email-setup page. And the only way to do that is to enter a valid email address. This is
a drag because:
1. Your network might not let Drupal send out email (to yahoo for example).
2. Getting your local Mac to properly run an email server is not so simple especially
because your hostname is apple-unusual. Further, even if you do get it working
during drupal setup, most apple-hostnames change often. Your Mac laptopʼs
hostname this morning might have been “unpaiduser112@dhcp-forbes-
Pgh.starbucks.com” but later today it will be different and your Drupal7 email settings
will be wrong.
3. Drupal7 disallows real and useable email addresses including these:
localuser@127.0.0.1
localuser@localhost
A good strategy is to just get past this portion of the setup process and worry about
fixing it later (or just look at logging information online instead of reading emails sent to
you by Drupal).
At this point, Drupal7 setup is complete. Very little will be necessary on the command-
line from now on. (Almost) everything Drupal is done via its web interface. Hence, this
is a good time to backup your Drupal site. Backing up Drupal can be done in many
ways including by using Drupalʼs “Backup & Migrate” Drupal module. Instead, we will
take a backup of the databases and the htdocs directory. All changes in Drupal will
manifest themselves (on this machine) at /opt/local/apache2/htdocs and /opt/local/var/
db/mysql5/. Notice that the htdocs directory might contain more web content than just
your Drupal stuff. Also notice that the mysql5 directory contains things other than your
Drupal databases–it contains all of the mysql databases. After taking this backup, we
will have an archive that can be used to restore our Drupal deployment to its exact state
at the time the backup was taken. Run these commands to take a backup:
******************************************************
ps -ef | egrep "PID|mysql"
sudo kill -9 560 (560 is the mysqld_safe PID)
sudo kill -9 561 (561 is the mysqld PID)
cd /opt/local/apache2/
sudo ./bin/apachectl stop
sudo tar -czf ./htdocsBackup070611.tar.gz ./htdocs/
cd /opt/local/var/db/
sudo tar -czf ./mysql5Backup070611.tar.gz ./mysql5/
******************************************************
…those two .tar.gz files contain our entire Drupal7 deployment so far and can be used to
restore the site to its current condition. To restart mysql and apache2, run these
commands:
******************************************************
cd /opt/local/; sudo /opt/local/lib/mysql5/bin/mysqld_safe &
cd /opt/local/apache2/
sudo ./bin/apachectl start
******************************************************
…your Drupal site should be running normally at http://localhost/.
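These backup steps are easy to wrap into a date-stamped sequence. The sketch below demonstrates the idea on stand-in directories so it can be tried safely; for a real backup, point the paths at /opt/local/apache2/htdocs and /opt/local/var/db/mysql5, stop apache2 and mysqld first, and run the tar commands with sudo:

```shell
# Sketch: date-stamped backups of the web root and mysql data dir.
# Stand-in paths for demonstration; see the note above for real use.
STAMP=$(date +%y%m%d)
SITE=/tmp/demo/htdocs
DB=/tmp/demo/mysql5
mkdir -p "$SITE" "$DB"
echo '<?php // placeholder' > "$SITE/index.php"

# -C keeps the archive paths relative (htdocs/..., mysql5/...).
tar -czf "/tmp/demo/htdocsBackup$STAMP.tar.gz" -C /tmp/demo htdocs
tar -czf "/tmp/demo/mysql5Backup$STAMP.tar.gz" -C /tmp/demo mysql5

ls /tmp/demo/*.tar.gz
```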
When restoring a backup, typically you want to preserve the existing site. We will do
that. To restore your backups, run these commands:
******************************************************
ps -ef | egrep "PID|mysql"
sudo kill -9 544 (544 is the mysqld_safe PID)
sudo kill -9 578 (578 is the mysqld PID)
cd /opt/local/apache2/
sudo ./bin/apachectl stop
cd /opt/local/apache2/
sudo mv htdocs htdocs_BrokenSite_070711
sudo tar -tzf ./htdocsBackup070611.tar.gz
sudo tar -xzf ./htdocsBackup070611.tar.gz
cd /opt/local/var/db/
sudo mv mysql5 mysql5_BrokenSite_070711
sudo tar -tzf ./mysql5Backup070611.tar.gz
sudo tar -xzf ./mysql5Backup070611.tar.gz
cd /opt/local/; sudo /opt/local/lib/mysql5/bin/mysqld_safe &
cd /opt/local/apache2/
sudo ./bin/apachectl start
******************************************************
…the “mv” command is the rename command. Both times we ran it to preserve the
existing site by renaming the folder to something besides htdocs or mysql5. When we run
the tar commands, we will be overwriting anything currently at htdocs or mysql5. tar
was run with -tzf, which lists the files in the archives, and -xzf, which extracts the files
from the archives. Itʼs always a good idea to run tar with the -t option first to make sure
the archive contains what you expect it does. If your site stays small, these backups
wonʼt be big and can be taken often. While under development, you can take a backup, then
do some risky/unknown Drupal development without worry. If things go badly, you can
just untar your trusty backup and start that risky part over again. Because Drupal and
mysql are relatively platform-independent, you should be able to take these two
archives to another machine (even a Linux machine) and untar them there and get your
site running elsewhere without too much trouble. Another reason to take backups is to
track development with documentation. Because most Drupal “administration” takes
place via its web interface, it is very easy to lose track of what administration has been
performed. Documenting Drupal administration will get very tedious because constantly
writing down what you are doing will result in dense, unreadable prose. A better way to
document Drupal administration might be like this:
1. Take a backup of Drupal.
2. Perform Drupal administration…module installation, etc. until you get it right, jot down
notes.
3. Restore backup taken in Step 1.
4. Redo successful steps taken in Step 2 (minus all the false leads/mistakes).
5. Write down the successful operations performed in Step 4.
…in this way, you can keep track of your Drupal site setup without having to remember
anything.
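Because this workflow leans on restoring archives, here is a throwaway-archive sketch of the list-before-extract habit (tar -t before tar -x) used in the restore steps above:

```shell
# Sketch: always list (-t) an archive before extracting (-x) it.
# Demonstrated on a throwaway archive.
mkdir -p /tmp/tardemo/htdocs /tmp/tardemo/restored
echo hello > /tmp/tardemo/htdocs/index.php
tar -czf /tmp/tardemo/backup.tar.gz -C /tmp/tardemo htdocs

# List first; make sure it contains what you expect:
tar -tzf /tmp/tardemo/backup.tar.gz

# Only then extract (here into a scratch dir, not over a live site):
tar -xzf /tmp/tardemo/backup.tar.gz -C /tmp/tardemo/restored
cat /tmp/tardemo/restored/htdocs/index.php   # → hello
```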
Backups like these are so easy to take and so useful that it never makes sense not to
get in the habit of doing it. If you only want to capture the Drupal mysql database (and
not the whole mysql directory), run this command sequence instead:
******************************************************
cd /opt/local/var/db/mysql5/
sudo tar -czf ./mysql5_drupalsevenBackup070611.tar.gz ./drupalseven
******************************************************


Book Review: Virtualization: A Manager’s Guide

Dan Kusnetzky’s new book “Virtualization: A Manager’s Guide” is subtitled “Big picture of the who, what, and where of virtualization”. The book’s ISBN is 9781449306458 and its website is at http://oreilly.com/catalog/9781449306458/. I have read the book. My background is programming, system, network and database administration. I have some virtualization software experience, but I am not nearly an expert. I was attracted to the book because it’s an O’Reilly and because I liked my time spent on the author’s blog at http://www.kusnetzky.net/analysts-take/.

O’Reilly books usually contain a warning section, something like “Expected Audience”, that I always read to find out if the book is right for me (is it too easy or too hard?). This book’s “About This Book” section says: “This book is intended to introduce managers…to the concepts behind virtualization technology” and “It is not meant as a how-to guide for IT analysts, developers or administrators”. That warning is about right–this book is not for me and it’s not for techies. But the book can be read in a very short amount of time (a great thing, in my opinion) and so I read it anyway. I read the whole book in less than a day. There are not many good books on virtualization for managers, and so this might be the best one available. We still need a good virtualization book somewhere between this one (too vague) and O’Reilly’s “VMware Cookbook” (too specific). This book is an easy-to-read primer for managers who need to know more about virtualization. Cloudy reading (like what you’ll find at the author’s blog at http://www.kusnetzky.net/analysts-take/) will be clearer after reading this book. Technology people will find this book too simplistic and should avoid it.

Chapter 1 discusses all of the layers of virtualization. The biggest problem in virtualization is terminology. It is the Pacific Ocean of buzzwords. Sometimes the book helps clear up the mess and other times it doesn’t. A glossary would have been nice so that PaaS, SaaS, etc. could get pinned down. Chapter 1 shows us Fig. 1-1, “The Kusnetzky model of virtualization”, which shows the layers of virtualization:

Access
Application
Processing
Network
Storage
Security
Management

The model may be okay but I don’t like the graphic in Fig 1-1. It’s not clear why the layers are stacked, why they are in the order they are in, and why two layers, security and management, are shown as vertical layers. (Wait until Chapter 7, when the author tells us that the vertical layers, security and management, are vertical because they “protect and manage all the other layers”.) Perhaps I’m trying to read too much into the graphic, or perhaps the author should offer more explanation. It’s not made clear why we should think of and break down virtualization into “layers”. What the author refers to as “layers” of virtualization seem (to me) more like types of virtualization. In any case, the strength of this model (and its graphic) is that it fixes a framework for the whole book. Each layer gets its own chapter, starting with “Access Virtualization”. The framework is a good thing. It cleverly compartmentalizes virtualization, which otherwise tends to get minestrone-soup-like in your head.

Chapter 2, “Access Virtualization: Providing Universal Access”, has the same structure as every chapter. Each chapter concentrates on one layer of virtualization. And each chapter has these sections:

What is _______ virtualization?
What does _______ virtualization do?
When is _______ virtualization the right choice?
Players in the _______ virtualization world.
A few examples of _______ virtualization in use.

By breaking each layer of virtualization down into these categories, the author has done us a great service. Because of this rigidity of structure, the book can be used as a reference. Do you want to know who the big players are in storage virtualization? No problem, just find the chapter on Storage Virtualization and find the “Players in” section. This structure also makes for easy reading and easier remembering. Imposing this organization on the book was a great idea. When the author tries to define terms, like “access virtualization”, he usually misses. The first section of Chapter 2 is called “What is Access Virtualization”. But we never get an answer via a definition. The closest we get is “Access virtualization hardware and software are designed to place access to applications and workloads in a virtual environment.” There are too many fuzzy-wuzzy definitions of terms throughout the book. The author does answer the question “What is Access Virtualization” with examples. Most chapters tend to be weak on definitions of terms but strong on examples. Chapter 2 offers a classic example: “X-Windows…is another early example of access virtualization”. Anyone who has used X to display a program locally that is actually running on another PC instantly knows what access virtualization is. I was surprised to see several access virtualization technologies not get any mention in this chapter including: Windows’ Remote Desktop Client, VNC, Apple’s Screen Sharing, even ssh/telnet fit the basic access virtualization idea of type-and-mouse-here/run-elsewhere.

Chapter 3, “Application Virtualization: Application Isolation, Delivery and Performance”, cleanly chops the concept into two varieties:

Client-side
Server-side

We also get a good jump conceptually by building on what we already know from Chapter 2 when the author tells us, “The major difference between access virtualization and application virtualization is that a portion, or perhaps all, of the application actually runs on the local device rather than on a remote server”. We hear only a little bit about thin clients, unfortunately. Many IT managers have bright dreams about the future of thin clients and many also have bitter memories of the many thin client disasters of the past. Of disasters, I can remember Sun Microsystems’ Sun Ray. These little guys were going to be great. They weren’t.

Chapter 4, “Processing Virtualization: Doing System Tricks”, again fails to define what processing virtualization actually is. The description in the section “What is Processing Virtualization” tries to lasso too many things under one umbrella. We would probably be better served if we just stuck with saying that processing virtualization is “Making one machine appear to be many”. It is curious that the author spends time talking about “Making many systems appear to be one” in this chapter. Most deployments using many physical systems to appear as one system are not using virtualization software–they’re using DNS, routing, tiers and light-footprint high-availability/load-distribution/clustering software or hardware. Most readers of “Virtualization, A Manager’s Guide” are probably trying to run many virtual machines on a few physical machines. Machines nowadays are too powerful to give every server its own physical machine. Managers want to replace racks of machines with a couple of PCs hooked up to a speedy and spacious disk array. This segment of the virtualization scene should have received more coverage in this chapter.

Chapter 5, “Network Virtualization: Controlling the View of the Network”, should be skipped entirely. According to our author, Network Virtualization “provides the following functions”:

Network routing
Network address translation
Network isolation

Really? All this time I thought that was done by switches, routers, firewalls and TCP/IP. There may be such a thing as network virtualization, but you won’t learn about it in this chapter.

Chapter 6, “Storage Virtualization: Where are your files and applications”, gives us very basic (good for managers) information about storage hardware technologies. The author only mentions five “players” in storage virtualization, all giants with no mention of Dell. There are dozens of hot storage hardware and software players that managers-without-billions will want to look at including: Tintri, Arkeia, AC&NC, nexenta, dothill, ramsan, storform, Amazon, CDNs, JBODs-with-LVM, Cassandra, Hadoop, MongoDB. The author might have reminded managers that the worst performance/price mistakes are made in the area of storage management.

Chapter 7, “Security for Virtual Environments: Guarding the Treasure”, and Chapter 8, “Management for Virtual Environments”, are mostly aimed at enterprise-level (NASA/Exxon) managers with tons of servers, applications and hardware. In both cases, the list of “players” is again missing some of the smaller vendors.

Chapter 9, “Using Virtualization: The Right Tool for the Job”, and Chapter 10, “Summary”, consolidate and mix in lots of the technologies and concepts from the previous chapters.

If you are a manager of a data center and you need to know more about virtualization, this book is an easy read that will help you communicate with your system administrators. It will also help you understand more of the articles on the author’s excellent blog.

Posted in Book Reviews

Book Review: Introduction to Search with Sphinx, by Andrew Aksyonoff

This is a review of O’Reilly’s excellent new (April 2011) book
Introduction to Search with Sphinx.  This review is also a bit of
a reading guide, if you don’t mind.  I am not a computer science
guru, as you’ll notice.  I just want a search box on my website that
finds the right stuff.  I have actually read the book.  Sphinx is
open source.  ISBN: 9780596809539

In the Preface, the author gives us this heads up about the audience
he expects:  “I assume that readers have a basic familiarity with
tools for system administrators and programmers, including the
command line and simple SQL.”  This sounds good to me–it doesn’t
say I need to know anything about search, which is good, because
I don’t.  The more experience with linux, mysql, SQL and programming
you bring, the easier this book will be to read.  If BM25 and NLP
don’t mean much to you, you’ll have to get a CompSci PhD for a few
of the paragraphs in this book.

Chapter 1 “The World of Text Search” is awesome fun.  Who knew that
Sphinx (and Google) searches use AND by default but that Lucene uses
OR by default.  Chapter 1 is mostly a refreshing breeze, but we do
get precision about the meaning of many terms including document,
index, match, result set.  A beautiful equivalency is presented
that database people will love and should memorize:

Database table ≈ Sphinx index
Database rows ≈ Sphinx documents
Database columns ≈ Sphinx fields and attributes

This equivalency is conceptually accurate and makes Sphinx less
intimidating to those of us not intimidated by an SQL database.
Chapter 1’s rougher stuff, the Linguistics Crash Course, is
inherently rough but well presented, and you’ll get a clue about
things like homonyms, NLP, part-of-speech, word-sense-disambiguation,
stemming, lemmatization, and other stars of morphology processing.
You’ll feel much smarter after reading these few short pages in
only a few minutes.  You’ll also wish you knew what a predicate was
like you once did back in fourth grade.  There is heavy math, AI and
theory down there where the flesh meets the bone of linguistics and
text search.  This very readable and exciting dip into text search
is a masterpiece of brevity.  (Oh, I’m sorry, would you prefer
a bookshelf or two of this stuff?).  Chapter 1 is generally very
smooth reading but does have one unexplained word-drop, “statistical
text rankings such as BM25” (you’ll get your
explanation in Chapter 6).  Chapter 1’s introduction to the nature
of full text indexes and the inverted file is easy to understand.
I was surprised to learn that Sphinx indexes are a whopping-large
60-70% of the size of the indexed text.  When done with this section,
you’ll know the key features of an index and what the two Sphinx
index types are like.  Chapter 1 finishes with a discussion of
Search Workflows, a practical section that discusses how to get
Sphinx to find your stuff.  The Search Workflows section renders
Sphinx very approachable.  Be sure to grasp and understand the
illustrations and concepts in this section.  The reading here
is easy enough to slip into skimming mode, but taking your time will
pay off later when other concepts build on the Sphinx basics.
In the Index Approaches section, we learn the difference between
batch-indexing and incremental-indexing, and we start to see that
Sphinx is super-powerful.  The author may have gone too far in this
section when he mentions that really big jobs that need multiple
machines and data sharding “isn’t fully automatic with Sphinx,
but it’s pretty easy to set up.”  (“Billy! Honey, pick up your room,
finish your algebra homework and when you’re done, shard that data
and set up a Sphinx search cluster for me please.”)  We learn about
the two main types of searches: Full-text queries and full scans.
Full text queries search all the text in a document.  Full scans
search attributes attached to the document, like author_id or
creation_date.  The last section of Chapter 1, Kinds of Results,
teases us with possibilities involving rewritten queries, query
suggestions, snippets (excerpts) and facets.  Unfortunately, we
never return to these topics in this book for in-depth treatment.
Even so, it’s nice to know that Sphinx is snippets-and-facets ready.
If you’ve ever worked with Solr and its Lucene index and searching
core, you will see that Sphinx is a different take on a similar
problem.

Chapter 2 gets to the red meat of running Sphinx.  You will need
to brush up on any linux and mysql basics that have gotten rusty.
You should also read the book next to a printout of the example
sphinx.conf and example.sql files (or have them open on a computer
screen) that are discussed in this chapter.
Data sources, index configuration and other Sphinx basics get
good coverage in this chapter.  One thing that was not explained
thoroughly enough was the sql_file_field directive.  Sphinx is
great at indexing the contents of SQL tables and XML data.  But the
sql_file_field directive tells Sphinx to index the file that is
named in that field.  For example, if the column contents are
“/storage/123.txt”, then the sql_file_field tells Sphinx to go get
that file and index that file’s contents.  Sounds exciting to me.
But this seemingly very important feature gets glossed over too
quickly.  I want to know more about how to use sql_file_field
intelligently.
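For the curious, the Sphinx documentation shows that the directive lives in the data source definition in sphinx.conf.  A minimal sketch (the table, column and connection details are my own invented examples):

```
source documents
{
    type            = mysql
    sql_host        = localhost
    sql_user        = sphinx
    sql_pass        = s3kr3t
    sql_db          = docs
    sql_query       = SELECT id, title, file_path FROM documents
    # instead of indexing the path string itself, Sphinx opens the
    # file named in file_path and indexes that file's contents
    sql_file_field  = file_path
}
```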
On page 32, we see our first SphinxAPI PHP program.  Your favorite
programming language works with Sphinx, but the examples (all
very simple) are in PHP.  Anyway, on page 32, the PHP code
doesn’t import any libraries–working code would have been nicer,
especially since the O’Reilly page for the book does not have a link
to code downloads.  This book is best read by going along with it
using a computer.  I had to search elsewhere to find out which
PHP imports were required to make the code in the book work.
That said, when I was searching at sphinxsearch.com, I was very
pleased with the online Sphinx documentation and its archived help
from fellow sphinxers.  Two good reasons to use Sphinx are: 1)
The O’Reilly book exists and 2) The online documentation is good.
Searching for less than one minute at sphinxsearch.com, I learned
that I need to include sphinxapi.php.  My package manager put it at
/usr/share/doc/libsphinxclient-0.9.9/sphinxapi.php.  After finding
it and copying it to the same directory as the script, this
include works:

<?php
include("sphinxapi.php"); // load the SphinxAPI client class
$cl = new SphinxClient();
?>

The SphinxAPI is a pleasure to use.  I was feeling luxurious using
it until he mentioned that “code should, of course, also add error
handling”.  Yes, it should, and authors should remind us of that,
I suppose.  Except for error handling, code using the SphinxAPI
is easy on the eyes and explains what is happening, “filter by
this, sort by that”.  The SphinxAPI is clearly our friend, and
saves us from “building the SQL statement string from pieces”.
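The “filter by this, sort by that” style looks roughly like this (my own sketch, not code from the book; the attribute names are invented, and you still need sphinxapi.php on hand):

```php
<?php
include("sphinxapi.php");
$cl = new SphinxClient();
$cl->SetServer("localhost", 9312);                      // default searchd API port
$cl->SetFilter("author_id", array(42));                 // filter by this
$cl->SetSortMode(SPH_SORT_ATTR_DESC, "creation_date");  // sort by that
$result = $cl->Query("titanium milkmen", "posts");
?>
```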
But sometimes you need straight SQL to describe what you want,
and so we learn about using SphinxQL.  Because Sphinx supports the
MySQL wire protocol, your shipped-with-Linux mysql client can talk
to searchd.  That’s nice.  Just revert to your old MySQL-from-scratch
habits.  We learn about SphinxQL, an SQL-like query language
that searchd understands.  The whole searchd-is-almost-mysql and
SphinxQL-is-almost-SQL is intriguing but it’s best not to think too
hard about it–just move along.  The online documentation for the
SphinxAPI is excellent–but it is a little bit strange if you’ve
never read API documentation designed to be used with multiple
programming languages.  For example, this documentation might tell
you that a function returns a hash.  In PHP, that is fine and good,
but how easily will it translate to another programming language
where a method returns some other data structure?  Using the book
and the online API documentation will be easiest for PHP programmers.
Sphinx and its documentation are usable with Perl, PHP, Python, Java,
Ruby, C# and C/C++.  To quote from the book on this topic: “When you
send a query to Sphinx using a programming API, the result combines
row data and metadata into a single structure. The specific structure
used varies depending on the language you’re using (an associative
array in PHP, Perl, and Python; a struct in pure C; a class in Java;
etc.), but the structure of member names and their meanings stay
the same across APIs.”  Another encouraging quote: “We will use PHP
for the examples, but the API is consistent across the supported
languages, so converting the samples to your language of choice,
such as Python or Java, should be straightforward.”  The bottom
line is that the Sphinx API for your language is unmysterious and
its documentation is usable (with a little translation if you’re
not using PHP).
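To make the searchd-speaks-MySQL trick concrete, here is a sketch of my own (assuming sphinx.conf has a listen = 9306:mysql41 line, which tells searchd to accept MySQL-protocol clients on port 9306; the index name is invented):

```
$ mysql -h 127.0.0.1 -P 9306
mysql> SELECT * FROM myindex WHERE MATCH('titanium') LIMIT 5;
mysql> SHOW META;
```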
It’s nice to see the author and O’Reilly devote a few pages to
compiling from source.  Package management is great, except when
it isn’t.  It’s nice to compile from source because you’ll sleep
better knowing that you can survive a bad update.  Don’t be a noob,
compile it yourself!  While reading or after finishing Chapter 2
is a great time to install and fool around with Sphinx.  Build the
test1 index as described and then build another one from your own
source of data.  Sphinx is easy.  No, seriously, install it and
fool around.  I’ll wait for you.  Seriously–install Sphinx and do
the test examples as shown in Chapter 2.  It takes 5 minutes, do it.
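For the impatient, the classic quickstart that ships with the Sphinx source looks roughly like this (a sketch from memory, not the book’s exact commands; paths and the test script invocation vary by install):

```shell
# load the sample documents table that ships with Sphinx
mysql -u root test < example.sql
# copy the sample config, fill in your MySQL credentials, then index
cp sphinx.conf.dist sphinx.conf
indexer --all
# start the daemon and run a query through the bundled test script
searchd
php test.php -i test1 "my document"
```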

Chapter 3, Basic Indexing, is not very basic, but it covers the
details of important indexing concepts and techniques.  It covers
SQL indexes and XML indexes.  The SQL section is straightforward
and demonstrates the tools you’ll need when dealing with large
amounts of data.  The “Indexing XML Data” section reminds me of that quip:
“your XML problems can only be solved with more XML”.  I was hoping
to learn that Sphinx would suck in all the XML data on my server
and make it searchable (kind of like how Google’s Picasa finds
all of your photos).  Our author gets to the ugly truths quickly:
“Every index has a schema—that is, a list of fields and attributes
that are in it” and “We need to know what data we’re going to
index before we start processing our first document.”  Throw in
some character sets and encodings and grab the Pepto Bismol.
Sphinx will index your XML data, but if you want it to be easily
found, then you’ll need to do some work.  The good news is that
if your XML data is relatively uniform and you know its structure,
then indexing it with Sphinx won’t be hard.  For example, if you’ve
got tons of WordPress blog posts to index, you’ll be able to do it.
The Sphinx-related details of making your XML searchable are simple.
This simple configuration can go in the sphinx.conf configuration
file or in your XML stream.  The discussion about stopwords–very
common words like “the” “and” “or”–and how to deal with them is
fascinating.  You probably won’t know what to do about stopwords
in your data, but you will learn how to make Sphinx deal with them
the way you want.  Chapter 3, Basic Indexing, covers not-so-fun
material but it is presented clearly.  Reading this chapter is
necessary and will build confidence in Sphinx and our able author.
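Mechanically, at least, pointing Sphinx at a stopword list is a one-liner in the index definition.  A sketch (the names and paths are my own examples):

```
index posts
{
    source      = posts_src
    path        = /var/data/sphinx/posts
    # plain text file, one stopword per line
    stopwords   = /usr/local/etc/sphinx_stopwords.txt
}
```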

In Chapter 4, Basic Searching, we learn about Sphinx’s interesting
development history.  This history is required reading because Sphinx
and its configuration are tied to that legacy.  Along the way we see
a literary side of our author, with peaches like the “ranking function
was nailed onto its matching mode with nine-inch titanium nails” and
“The milkman’s lot isn’t as sought after as it once was…”  O’Reilly’s
editors are wise to stay out of the way of sparse poetry and especially
titanium milkmen.  This chapter is necessarily long.  We learn about the
three cornerstones of search–KEYWORDS (e.g. titanium milkmen),
OPERATORS (AND/OR/NOT….e.g. titanium OR milkmen) and MODIFIERS
(^=$, e.g. “^Once upon a time”).  We also get the guts of result
sets (results themselves and metadata about the search itself),
searching multiple indexes simultaneously (for performance),
result set processing, filtering and sorting.  Fully digesting
chapters 3 and 4 will allow you to understand your searches and
their results.  You’ll feel confident trying to (or even be able to)
fine-tune your results to your slightest whims.  After Chapter 2,
I felt like Sphinx was a ladle I could use with my pot of soup.
After Chapter 4, Sphinx feels more like tweezers I can use to pull
out exactly the needles I want.
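A cheat-sheet sketch of those three cornerstones in Sphinx’s extended query syntax (the @title field is my own example):

```
titanium milkmen        keywords, implicit AND
titanium | milkmen      OR operator
titanium -milkmen       NOT operator
"once upon a time"      phrase operator
^Once upon a time$      field-start and field-end modifiers
@title titanium         limit matching to the title field
```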

Chapter 5, Managing Indexes, gets us back on the learn-by-doing
track.  Sphinx is oriented to deal with very large document
repositories.  Eventually your search will get too slow and your
indexing strategy will break down.  You’ll need more machines,
more disk IO, multiple indexes, more CPU.  You’ll need an indexing
strategy that can get you an updated index in less than 3,000 hours.
To keep it going, and to keep adding new content and deleting
old content, you’ll need another sysadmin.  Finding something
in terabytes of text in a few seconds will make you feel like
a mini-Google.  So get your cluster going and just do it.
This chapter will be most useful to experienced sysadmins and
benchmarkers.  This book and especially this chapter, I suspect,
will be very interesting to teams currently supporting a large search
implementation (at a large library or corporation).  Teams currently
supporting search implementations that are based on a huge pile
of Java, Lucene, XSLT and gobs of XML with ten different 50-stage
pipelines might blush after reading this book.  Perhaps the search
at oreilly.com is due for an assessment.  (For the record, searching
http://www.oreillynet.com/search/ for Andrew Aksyonoff gives 73325
results and “Andrew Aksyonoff” gives 1 result–not so great).

After reading the indexing and searching chapters, Chapter 6,
Relevance and Ranking, seems unnecessary.  After all, getting
relevant results is all about generating the right indexes from
your data and searching them with the right queries.  But then I
(tried to) read Chapter 6.  To be honest, it left me knocked out on
the floor.  Reading about the black art of relevance assessment made
me wonder if our author is a human being.  Yes, he is a Linux admin,
a MySQL admin, a great programmer and a good writer.  Like the best
O’Reilly authors, this one has top-shelf computer science chops.
Andrew Aksyonoff is from planet Sebastopol.  I will update this
review of Chapter 6 in ten years after I’ve had time to get the
background material required to fully understand it.  To be fair,
this chapter does provide very useful tips for rigging your search
results (for example, to make sure your boss’s document always comes
up first–“Your paper must be the most important one, Mr. Johnson”).
You still have to read Chapter 6–take what you can, leave what
you can’t and move along.  If nothing else, learn how to rig your
boss’s paper so it comes up first in the search results.

Introduction to Search with Sphinx is what you’d expect from
O’Reilly–a book you can use to do real work in a very short amount
of reading.  Sphinx itself, from what I can tell, is ready for an
O’Reilly title because Sphinx works as advertised and fills a huge
gap: medium-to-large-sized search power that’s easy to work with.
I recommend stopping after Chapter 2 and actually setting up and
playing with Sphinx.  The other chapters should also be read, in
order.  The whole book can be read by a decent Linux/MySQL sysadmin
and programmer in a couple of days.

If you are actually going to deploy, run and depend
on Sphinx, then you’ll need to spend time on the
excellent Sphinx site at http://sphinxsearch.com/.
You’ll also want to grab the other book from Packt at
http://www.packtpub.com/sphinx-search-beginners-guide/book
(Disclaimer: I don’t have the book and haven’t read it).  If you are
new to MySQL, O’Reilly’s “Learning MySQL” is by far the best place
to start.  After that you’ll want to grab O’Reilly’s High Performance
MySQL (whose Appendix C is also written by Andrew Aksyonoff).

This book has no index in the back of the book!  This is like a
bad joke–a search book without an index.  I’m serious about the
index thing–I don’t think it’s funny.  Another problem (in my view)
is that there is no mention of Apache Tika.  This seems to be an
omission, given that Sphinx eats text from XML documents.  But Joe
and Jane Website want it to search rich documents like PDFs, MSWord,
MSExcel, etc.  The Apache Tika parser seems like it would be very close
friends with Sphinx because Tika can spit XML from rich documents
like PDF.  In Chapter 1, our author mentions “DOC, PDF, MP3, AVI”
but says “Sphinx is not able to automatically identify types,
extract text based on type, and index that text.”  Instead he
could have introduced us to Sphinx’s friend Apache Tika, a vampire
that sucks text from PDF and other unsuspecting rich-text snobs.
Another problem (in my view) is the scant information about what
to do with gobzoodles of XML data structured in too many ways.
He says we need a schema.  (Funny how Solr and Lucene tell us
the same thing).  But even without a silver bullet, some general
advice would be nice.  For example, we cannot just put these seven
lines around every XML stream we have, because content inside of
undeclared-XML-tags is ignored:

WARNING: THIS IS NOT GOOD ADVICE:
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:document id="1234">
<content>
Awesome XML stream content, perhaps even stripped of all tags, dude.
</content>
</sphinx:document>
</sphinx:docset>

I want to know what to do with all my bad XML data.  And I want
a different answer than “create fifty schemas for it.”  OK, sorry
about the ignorance, whining and ranting.
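One small consolation from the Sphinx documentation: the xmlpipe2 format lets you declare the schema inside the stream itself, so the undeclared-tags problem at least travels with the data (it still doesn’t solve the fifty-schemas problem).  A sketch:

```
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
  <sphinx:field name="content"/>
</sphinx:schema>
<sphinx:document id="1234">
<content>
Now Sphinx knows the content tag is a full-text field worth indexing.
</content>
</sphinx:document>
</sphinx:docset>
```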

Introduction to Search with Sphinx and Sphinx itself are great
achievements.  If you need to implement your own search or if you’re
currently hoisting a broken behemoth of a search implementation, get
yourself a copy.  My message to the author and the Sphinx team is:

Congratulations and thanks!

Oreilly’s page for Introduction to Search with Sphinx is at:

http://oreilly.com/catalog/9780596809539/

I review for the O'Reilly Blogger Review Program
