Full Text Search with Whistlepig

Yesterday I suddenly developed the intense need to add search to this site. One problem is that the site is kind of a weird hybrid between static and dynamic, and it has no database backend. If posts were stored in Postgres this would be a trivial matter, but they're just markdown files on disk. After flailing around for a while I came across a library named Whistlepig, which purported to do in-memory full text indexing with a full query language.

November 5, 2013: I've removed search because nobody used it and this way the site can be 100% static.

First pass: Regular expressions

To rewind a bit, my first horrible stab at search was to find all of the posts that matched a user-provided regular expression:

@results = @posts.find_all do |post|
  post.body.match(/\b#{user_query}\b/)
end

This is of course complete madness. Not only am I letting the user put whatever they want into a regex, but it also only matches whole words. Not to mention that every query scans every post, and while I only have around sixty pages right now, that number's never going to go down.
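For what it's worth, the injection half of the problem has a cheap fix. Here's a minimal sketch, assuming a hypothetical Post struct and a made-up literal_search helper (neither is from the site's code), that escapes the user's input with Regexp.escape and matches substrings case-insensitively. It still walks every post on every query, so it does nothing for the scaling complaint.

```ruby
# Hypothetical Post type, standing in for whatever holds a rendered post.
Post = Struct.new(:title, :body)

# Returns the posts whose body contains the query as a literal substring.
# Regexp.escape neutralizes regex metacharacters in the user's input, and
# Regexp::IGNORECASE makes the match case-insensitive.
def literal_search(posts, user_query)
  pattern = Regexp.new(Regexp.escape(user_query), Regexp::IGNORECASE)
  posts.select { |post| post.body.match?(pattern) }
end

posts = [
  Post.new("Regex", "Matching with (.*) is greedy"),
  Post.new("Search", "Full text search is hard")
]

# Metacharacters are treated as plain text, not as a pattern.
puts literal_search(posts, "(.*)").map(&:title).inspect
puts literal_search(posts, "SEARCH").map(&:title).inspect
```

Note that an empty query matches every post, so a real handler would want to reject blank input first.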

Other Solutions

I briefly considered a bunch of embedded, external, and hosted options, none of which was very satisfactory.

  • sqlite3 has built-in FTS but I would have to build a bunch of stuff around it.
  • xapian is an FTS engine, but as with sqlite3 I'd have to build stuff around it.
  • elasticsearch would work, but it's an external process that I'd have to run and it's an awful lot of overhead.
  • Some kind of hosted elasticsearch or solr provider would work, but again lots of overhead and not free and I'm then dependent on their uptime.

Whistlepig to the Rescue

Whistlepig is a small text search index. Small as in not very many features and not much code, but the features that are there are perfect for my needs:

  • Full query language
  • In-memory, in-process
  • Arbitrary number of indexes for the same document

Here's a full example of how to index and query a document:

require 'rubygems'
require 'whistlepig'

document = "Hi there"

index = Whistlepig::Index.new "index"

entry = Whistlepig::Entry.new
entry.add_string "body", document

docid = index.add_entry entry

query = Whistlepig::Query.new("body", "hi")
result = index.search(query)
raise "expected docid #{docid} as the first result" unless result[0] == docid

The indexing code in bugsplat's app is not much more complicated. Here's the interesting bit:

@pages_by_docid = {}

@pages.each do |page|
  entry = Whistlepig::Entry.new

  entry.add_string "body", page.render(@strip_renderer)
  entry.add_string "name", page.name
  entry.add_string "title", page.title.downcase
  entry.add_string "tags", page.tags.join(" ").downcase
  entry.add_string "page_id", page.page_id
  entry.add_string "blog_post", page.is_blog_post? ? "yes" : "no"
  docid = @index.add_entry(entry)

  @pages_by_docid[docid] = page
end

In bugsplat a Page encapsulates everything about an entry written in Markdown. I maintain six indexes on the pages: body, rendered with a Markdown-stripping, downcasing renderer; name, which is the canonical name of the post; title; tags; page_id, which is a short-code type of thing; and blog_post, which is a simple yes/no flag for whether the post has a date or not.

"Why so many indexes?" you may find yourself asking. Because instead of just implementing search and being done with it, I went and refactored the guts of the blog to use it throughout. See, I had these terrible little things everywhere, all over the place:

@page = @pages.find_all { |p| p.has_tag? params[:tag] }

Doing linear searches across the list of in-memory pages isn't too terrible but man it bugged me to have to repeat that everywhere. Instead of that, I can do nice things like this:

@tagged_pages = @pages.search(params[:tag].downcase, "tags")

Each time I found myself iterating over all of the pages to get a subset I replaced it with a search query. The code is much nicer to read and faster, although almost all of it is cached as static HTML in production.
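To make the shape of that search call concrete, here's a rough sketch of a collection that owns the pages, an index, and the docid lookup table. Everything in it is an assumption made for illustration: ToyIndex is a tiny stand-in so the example runs without the whistlepig gem (it is not Whistlepig's API), and PageCollection and the Page struct are hypothetical names, not bugsplat's actual classes.

```ruby
# Hypothetical stand-in for the real index, NOT Whistlepig's API.
# It maps [field, term] pairs to the docids that contain them.
class ToyIndex
  def initialize
    @postings = Hash.new { |hash, key| hash[key] = [] }
    @next_docid = 0
  end

  # fields is a Hash of field name => text. Returns the new docid.
  def add_entry(fields)
    docid = (@next_docid += 1)
    fields.each do |field, text|
      text.downcase.scan(/\w+/).uniq.each do |term|
        @postings[[field, term]] << docid
      end
    end
    docid
  end

  # Returns the docids whose field contains term, in insertion order.
  def search(field, term)
    @postings.fetch([field, term.downcase], [])
  end
end

# Hypothetical, trimmed-down page with only the fields this sketch needs.
Page = Struct.new(:name, :title, :tags)

# Hypothetical wrapper: indexes each page once and remembers its docid.
class PageCollection
  def initialize(pages)
    @index = ToyIndex.new
    @pages_by_docid = {}
    pages.each do |page|
      docid = @index.add_entry(
        "name"  => page.name,
        "title" => page.title,
        "tags"  => page.tags.join(" ")
      )
      @pages_by_docid[docid] = page
    end
  end

  # Same call shape as the snippet above: search(query, field).
  def search(query, field)
    @index.search(field, query).map { |docid| @pages_by_docid[docid] }
  end
end

pages = PageCollection.new([
  Page.new("whistlepig", "Full Text Search with Whistlepig", ["Programming", "Meta"]),
  Page.new("hello", "Hello World", ["Meta"])
])

puts pages.search("meta", "tags").map(&:name).inspect
```

The point is that callers never see docids. They hand over a query and a field name and get Page objects back, which is what lets a search call replace each find_all block.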

Try it out!

Go ahead and search for some stuff and let me know what you think! And next time you find yourself with a full text search problem, see if Whistlepig would help you out. It's not for everybody, but it's very good at what it does.

Posted in: Software  

Tagged: Programming Meta