Nowadays websites are content management systems where all of the content sits in a database, not in a tree of HTML files. As a website administrator, this all-content-stuck-in-MySQL arrangement can get frustrating because all of your blogs, products, etc. live in tables. I wanted to look closer at the content inside a locally running Drupal deployment, and tapping into that content via the mysql command line is mind-numbing non-fun. The easy way, I thought, would be to crawl the site with web clients. The last web-crawling book I read was Clinton Wong’s excellent Web Client Programming with Perl. It’s still a great book after 15 years! Another good book, 10 years on, is Sean Burke’s Perl and LWP. This desire to crawl Drupal content without any effort was in the back of my mind when I saw Webbots, Spiders, and Screen Scrapers, 2nd Edition: A Guide to Developing Internet Agents with PHP/CURL come across the wire. If you know a little PHP, chapters 3, 4, and 6 will take you an hour to read and you’ll be crawling right away. All of the other chapters are short, easy, and fun to read, and most are easy to skim or skip. Part 2, “Projects”, is likewise skimmable and skippable, but these cookbook-like projects are easy reads and lots of fun. Easy fun. Webbots, Spiders, and Screen Scrapers, 2nd Edition is easy fun!
The book uses some libraries written by the author, like LIB_http.php, that are very simple wrappers around the more extravagant and complicated offerings found in the open source cURL project of Swedish developer Daniel Stenberg. We also learn a little about PHP’s built-in web client functions like fopen(), file(), and fgets(). I liked the LIB_http.php wrapper because it is very small and easy to read and understand; using it lets the developer concentrate on web crawling instead of code. It took some nerve for our author to provide his own web client library, but the gamble pays off, keeping us focused on the crawling instead of on understanding some behemoth API (as found in many other web client libraries). Anyone who has navigated the DOM of lousy HTML will appreciate the parsing strategy found in this book: it’s more like Unix grep and less like tree navigation. The HTML parsing priesthood might scoff at the crude parsing found here, but I enjoyed it very much, as it serves most web-crawling needs without drowning in the DOM. These libraries are not super-powerful and won’t serve every need, but they are lots of fun and easy to use. (Check out Apache HttpComponents if you want to drown.) I was surprised to see that LIB_http.php was still being employed in Chapter 25, “Deployment and Scaling”, to operate a web botnet, and again in Chapter 28, “Writing Fault-Tolerant Webbots”. If you’re going to run a botnet, you’ll probably want a bigger library, but these simple wrappers leverage cURL to get a lot done in this book.
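To give a flavor of the thin-wrapper approach (without reproducing the book’s code), here is a minimal sketch of my own in the same spirit. The names simple_fetch() and build_get_url() are mine, not LIB_http.php’s, and the option choices are just reasonable defaults:

```php
<?php
// A minimal thin-wrapper sketch in the spirit of the book's LIB_http.php.
// These names (simple_fetch, build_get_url) are my own, not the library's.

// Pure helper: for GET, form data rides in the URL's query string.
function build_get_url($target, $data) {
    if (empty($data)) return $target;
    return $target . "?" . http_build_query($data);
}

// Thin cURL wrapper: fetch a page, optionally POSTing form data.
function simple_fetch($target, $ref = "", $method = "GET", $data = array()) {
    if ($method === "GET") {
        $target = build_get_url($target, $data);
    }
    $ch = curl_init($target);
    curl_setopt($ch, CURLOPT_REFERER, $ref);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return page as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    if ($method === "POST") {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($data));
    }
    $page = curl_exec($ch);   // false on failure, page source on success
    curl_close($ch);
    return $page;
}

// Usage: $html = simple_fetch("http://www.example.com/");
```

Twenty-odd lines of cURL plumbing is really all a basic crawler needs, which is the book’s point.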
The section on iMacros was news to me. Unfortunately this chapter relies on code that’s not open source. Sometimes web crawling calls for automating an actual web browser, and these macro chapters are wild newfangled fun. If your target website is hard to crawl, pull out a macro on it.
The worst page in the book is page 68, which has three problems:
First, on page 68, during a discussion of web form submission, the book says:
Since URL fetch requests are sent in HTTP headers, and since headers are never encrypted, sensitive data should always be transferred with POST methods. POST methods don’t transfer form data in headers, and thus, they may be encrypted. Obviously, this is only important for web pages using encryption.
…but this is incorrect. If the web site is using https, then the whole pipe, including the request line and HTTP headers, is encrypted and unavailable to third-party sniffing; using GET or POST makes no difference to what an eavesdropper on the wire can see. (GET does have other leaks worth worrying about: query strings end up in server logs, browser history, and Referer headers, but that is a separate issue from wire encryption.) This and similar mistakes are repeated in other places, including pages 196 and 261.
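To make the distinction concrete, here is the same form data in its two placements. The URL is made up for illustration. Over https both versions are encrypted in transit; the practical difference is only where the data also gets recorded:

```php
<?php
// Same form data, two placements. Over https BOTH are encrypted on the
// wire; the GET version additionally lands in server logs and history.
$data = array("user" => "alice", "pin" => "1234");

// GET: data is appended to the URL as a query string.
$get_url = "https://example.com/login?" . http_build_query($data);

// POST: the identical encoding travels in the request body instead.
$post_body = http_build_query($data);

echo $get_url . "\n";    // → https://example.com/login?user=alice&pin=1234
echo $post_body . "\n";  // → user=alice&pin=1234
```

Either way, what hits the network under https is ciphertext.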
On page 68, there is a typo on the very last line of Listing 6-6. This line:
$response = http($target=$action, $ref, $method, $data_array, EXCL_HEAD);
…should instead read:
$response = http($action, $ref, $method, $data_array, EXCL_HEAD);
Also on page 68, the last line of Listing 6-6 calls a function named http(). This function is directly callable, but the documentation in LIB_http.php advises:
# A common routine called by all of the previously described
# functions. You should always use one of the other wrappers for this
# routine and not call it directly.
…so we would probably be better off using LIB_http.php’s http_post_form().
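For context, a sketch of how that last line might look through the wrapper instead. The http_post_form() signature below is my assumption, modeled on the library’s naming pattern, so check LIB_http.php itself; the stub here only mimics the call shape so the snippet runs stand-alone:

```php
<?php
// Stand-in stub: the real http_post_form() lives in the book's
// LIB_http.php and performs an actual cURL POST. Its signature and
// return shape here are assumptions, not verified against the library.
function http_post_form($target, $ref, $data_array) {
    return array("TARGET" => $target, "BODY" => http_build_query($data_array));
}

$action     = "https://example.com/process.php";  // made-up form action
$ref        = "";                                  // referer, if any
$data_array = array("name" => "webbot");

$response = http_post_form($action, $ref, $data_array);
echo $response["BODY"] . "\n";   // → name=webbot
```

The point is simply that the higher-level wrapper hides the $method and header flags that the raw http() call exposes.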
Neither O’Reilly nor No Starch Press has an errata page for reporting errors in this book (that I could find). Fortunately, the book is not overloaded with errors, just one bad page. The writing is generally precise, but I did get tired of the recurring assumption that “fetch” is synonymous with “GET”. In HTTP, the method is GET, not “fetch”.
All in all, Michael Schrenk has done us a great favor with his excellent new book. I congratulate him. Webbots, Spiders, and Screen Scrapers, 2nd Edition is easy fun!