A Practical Exercise in Web Scraping

Yesterday a friend of mine linked me to a fictional web serial that he was reading and enjoying, but could be enjoying more if it were available as a Kindle book. The author hasn't made one available yet and has asked that fan-made versions not be linked publicly. That said, it's a very long story and would be much easier to read in a dedicated reading app, so I built my own Kindle version to enjoy. This post is the story of how I built it.

Step 1: Source Analysis

The first step of any kind of web scraping is to understand your target. Here's what the first blog post's markup looks like (with the actual content swapped out):

<h1 class="entry-title">The Whale</h1>
<div class="entry-content">
  <p>
    <a title="Next Chapter" href="http://example.com/the/next/chapter">
      Next Chapter
    </a>
  </p>
  <p>"And what tune is it ye pull to, men?"</p>

  <p>"A dead whale or a stove boat!"</p>

  <p>
    More and more strangely and fiercely glad and approving, grew the
     countenance of the old man at every shout; while the mariners
     began to gaze curiously at each other, as if marvelling how it
     was that they themselves became so excited at such seemingly
     purposeless questions.
  </p>

  <p>
    But, they were all eagerness again, as Ahab, now half-revolving in
    his pivot-hole, with one hand reaching high up a shroud, and
    tightly,   almost convulsively grasping it, addressed them
    thus:&mdash;
  </p>
  <p>
    <a title="Next Chapter" href="http://example.com/the/next/chapter">
      Next Chapter
    </a>
  </p>
</div>

After browsing around I found a table of contents, but since all of the posts were linked together with "Next Chapter" pointers it seemed easier to just walk those. The other notable thing is that each post has a comment section, which I didn't care about.

Step 2: Choose Your Tools

The next stage of web scraping is to choose the appropriate tools. I started with just curl and probably could have gotten pretty far with it, but I knew the DOM futzing I wanted to do would require something more powerful later on. At the moment Ruby is what I turn to for most things, so naturally I picked Nokogiri. The first example on the Nokogiri docs page is actually a web scraping example, and that's basically what I cribbed from. Here's the initial version of the scraping function:

require 'open-uri'
require 'nokogiri'

def scrape_page(url)
  html = open(url)
  doc = Nokogiri::HTML(html.read) # pass a string, not the IO handle
  doc.encoding = 'utf-8'

  # Pull out the post body and title from the markup shown above
  content = doc.css('div.entry-content').first
  title = doc.css('h1.entry-title')

  next_url = ""

  # Grab the "Next Chapter" link and strip its containing paragraph
  # so the navigation links don't end up in the book
  content.search('a[title="Next Chapter"]').each do |node|
    next_url = node['href']
    node.parent.remove
  end

  {
    title: title,
    content: content,
    next_url: next_url
  }
end

Ruby has a built-in capability for opening URLs as readable files via the open-uri standard library module. Because of various problems with Nokogiri's unicode handling that I'd run into in previous web scraping projects, the best thing to do is to pass Nokogiri a string rather than the IO handle itself. Setting the encoding explicitly is also a best practice.

Then it's a simple matter of using Nokogiri's CSS selector method to pick out the nodes we're interested in and return them to the caller. The idea is that, since each page links to its successor, we can just follow the chain.
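
Before writing the full crawl loop, it's worth doing a one-off sanity check against a single page. A minimal sketch, using the placeholder example.com URLs from the sample markup above:

page = scrape_page('http://example.com/the/first/chapter/')

puts page[:title].text  # the chapter title, "The Whale" for the sample markup
puts page[:next_url]    # "http://example.com/the/next/chapter"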

Step 3: The Inevitable Bugfix Iteration

Of course it's never that easy. Turns out these links are generated by hand, and across hundreds of blog posts of course there will be some inconsistencies. At some point the author stopped using the title attribute. Instead of using the super clever CSS selector a[title="Next Chapter"] I had to switch to grabbing all of the anchor tags and selecting based on the text:

content.search('a').each do |node|
  if node.text == "Next Chapter"
    next_url = node['href']
  end
  node.parent.remove
end

This works great, except that in a few cases there's some whitespace in the text of the anchor node, so I had to switch to a regex:

content.search('a').each do |node|
  if node.text =~ /\s*Next Chapter\s*/
    next_url = node['href']
  end
  node.parent.remove
end

Another sticking point was that sometimes (but not always) the author used non-ASCII characters in their URLs. The trick for dealing with possibly-escaped URLs is to check whether decoding changes anything. If it does, the URL is already escaped and shouldn't be messed with; if it doesn't, escape it:

require 'uri'

def escape_if_needed(url)
  # If unescaping changes nothing, the URL isn't escaped yet, so escape it
  if URI.unescape(url) == url
    return URI.escape(url)
  end
  url
end
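
This helper then needs to run wherever a URL actually gets fetched. A minimal sketch, assuming it wraps the URL just before open-uri sees it inside scrape_page:

# Inside scrape_page: normalize the URL before fetching it
html = open(escape_if_needed(url))
doc = Nokogiri::HTML(html.read)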

Step 4: Repeat As Necessary

Now that we can reliably scrape one URL, it's time to actually follow the links:

task :scrape do
  next_url = 'http://example.com/the/first/chapter/'

  sh "mkdir -p output"

  counter = 0

  # Keep walking "Next Chapter" links until they point outside the serial's domain
  while next_url && next_url =~ /example.com/
    STDERR.puts(next_url)

    res = scrape_page(next_url)
    next_url = res[:next_url]

    # Zero-padded names keep the chapters in order when sorted lexicographically
    File.open("output/#{sprintf('%04d', counter)}.html", "w+") do |f|
      f.puts res[:title]
      f.puts res[:content]
    end

    counter += 1

    # Be polite: don't hammer the server
    sleep 1
  end
end

This is pretty simple. Set some initial state, make a directory to put the scraped pages in, then follow each link in turn and write the interesting content out to sequential files. Note that the file names are all four-digit numbers so that the sequence is preserved even under lexicographic sorting.
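
The zero-padding matters because bare integer names stop sorting correctly once there are more than nine files:

["10.html", "2.html"].sort       # => ["10.html", "2.html"] -- chapter 10 sorts before 2
["0010.html", "0002.html"].sort  # => ["0002.html", "0010.html"]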

Step 5: Actually Build The Book

At first I wanted to use Docverter, my project that mashes up pandoc and calibre for building rich documents (including ebooks) out of plain text files. I tried the demo installation first, but that runs on Heroku and repeatedly ran out of memory, so I tried a local installation. That timed out (did I mention that this web serial is also very long?), so instead I just ran pandoc and ebook-convert directly:

task :build do
  # Concatenate the chapter files, in order, into one input document
  File.open("input.html", "w+") do |f|
    Dir.glob('output/*.html').sort.each do |filename|
      f.write File.read(filename)
    end
  end

  STDERR.puts "Running conversion..."

  # HTML -> EPUB with pandoc, then EPUB -> MOBI with calibre's ebook-convert
  sh("pandoc --standalone --output=output.epub --from=html --to=epub --epub-metadata=metadata.xml --epub-stylesheet=epub_stylesheet.css input.html")
  sh("ebook-convert output.epub output.mobi")
end

Pandoc can take multiple input files, but it was easier to manage a single input file on the command line. The stylesheet and metadata XML files are lifted directly from the mmp-builder project that I use to build Mastering Modern Payments, with the authorship information changed appropriately.

In Conclusion, Please Don't Violate Copyright

Making your own ebooks is not hard with the tools that are out there. It's really just a matter of gluing them together with an appropriate amount of duct tape and baling twine.

That said, distributing content that isn't yours without permission directly affects authors, and platform shifting like this is something of a gray area. The author of this web serial seems to be fine with fan-made ebook editions as long as they don't get distributed, which is why I anonymized this post.
