Black Bytes

How to build a parser with Ruby

Parsing is the art of making sense of a bunch of strings and converting them into something we can understand. You could just use regular expressions, but they are not always suitable for the job.

For example, it is common knowledge that parsing HTML with regular expressions is probably not a good idea. In Ruby we have Nokogiri, which can do this work for us, but you can learn a lot by building your own parser. Let’s get started!

Parsing with Ruby

The core of our parser is the StringScanner class. This class holds a copy of a string and a position pointer. The pointer allows us to traverse the string in search of certain tokens. The methods we will be using are .peek, .scan_until and .getch. Another useful method is .scan (without the until).

Note:
If StringScanner is not available to you, try adding require 'strscan'.

I wrote two tests as documentation so we can understand how this class is supposed to work:
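As a rough sketch of the behaviour such tests would pin down (the sample string and variable names are assumptions, and this is a plain script, not a test file), the following walks a scanner through a small input:

```ruby
require 'strscan'

scanner = StringScanner.new("<body>testing</body>")

# peek looks at the next n characters without moving the pointer
first          = scanner.peek(1)        # "<"
pos_after_peek = scanner.pos            # 0

# getch returns one character and advances the pointer by one
char            = scanner.getch         # "<"
pos_after_getch = scanner.pos           # 1

# scan_until returns everything up to AND including the match
tag_part       = scanner.scan_until(/>/) # "body>"
pos_after_scan = scanner.pos             # 6

# scan only matches at the current position
word    = scanner.scan(/\w+/)           # "testing"
no_word = scanner.scan(/\w+/)           # nil, the next character is "<"
```

When scan fails it returns nil and leaves the pointer where it was, which makes it safe to try several patterns one after another.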

One important thing to notice about this class is that some methods advance the position pointer (getch, scan, scan_until), while others don’t (peek). At any point you can inspect your scanner (using .inspect or p) to see where it’s at.

The parser class

The parser class is where most of the work happens. We initialize it with the snippet of text we want to parse; it then creates a StringScanner for that text and calls the parse method.

In the test we define it like this:
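A minimal sketch of the class and of how a test might create it. The names `tags` and `@buffer` and the sample input are assumptions, and `parse` is only a stub at this point:

```ruby
require 'strscan'

class Parser
  attr_reader :tags   # assumed name for the list of parsed tags

  def initialize(text)
    @buffer = StringScanner.new(text)  # scanner over the input text
    @tags   = []
    parse
  end

  private

  # Stub: the real parsing logic is discussed in the sections that follow.
  def parse
  end
end

# How a test setup might build the parser
parser = Parser.new("<body>testing</body> <title>parsing with ruby</title>")
```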

We will dive into how this class does its job in a bit, but first let’s take a look at the last piece of our program.

The Tag Class

This class is very simple: it mainly serves as a container & data class for the parsing results.
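Something along these lines would be enough. The attribute names `name` and `content` are assumptions chosen to match the two things we want to extract:

```ruby
# A plain data holder: the tag name is set once,
# the content is filled in later by the parser.
class Tag
  attr_reader :name
  attr_accessor :content

  def initialize(name)
    @name = name
  end
end

tag = Tag.new("body")
tag.content = "testing"
```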

Let’s Parse!

To parse something we will need to look at our input text to find patterns. For example, we know that a simple HTML element has the following form: <tag>contents</tag>

There are clearly two different components we can identify here: the tag names and the text inside the tags. If we were to define a formal grammar using the BNF notation it would look something like this:
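One possible grammar for this simplified HTML is written out in the comments below. The rule names are illustrative assumptions, and attributes and nested tags are ignored. Each rule is mirrored by a small regular expression only so that the grammar can be tried out on a sample string:

```ruby
# <element>   ::= <open-tag> <content> <close-tag>
# <open-tag>  ::= "<" <name> ">"
# <close-tag> ::= "</" <name> ">"
# <name>      ::= one or more letters
# <content>   ::= any text that contains no "<"
#
# Note: like the grammar, these patterns do not check that the
# opening and closing names are the same.

name      = /[a-zA-Z]+/
content   = /[^<]*/
open_tag  = /<#{name}>/
close_tag = %r{</#{name}>}
element   = /\A#{open_tag}#{content}#{close_tag}\z/

complete   = "<body>testing</body>".match?(element)
incomplete = "<body>testing".match?(element)
```

Here `complete` is true and `incomplete` is false, because the second string is missing its closing tag.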

We are going to use StringScanner’s peek to see if the next symbol in our input buffer is an opening tag. If that’s the case then we will call the find_tag and find_content methods on our Parser class:
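The check itself can be sketched on its own. The helper name `opening_tag?` is hypothetical and the scanner is passed in as an argument so the snippet runs by itself:

```ruby
require 'strscan'

# Hypothetical helper: true when the next character starts a tag.
# peek does not move the position pointer, so nothing is consumed.
def opening_tag?(buffer)
  buffer.peek(1) == "<"
end

buffer = StringScanner.new("<body>testing</body>")
at_tag = opening_tag?(buffer)   # pointer is on "<"
buffer.getch                    # move past "<"
inside = opening_tag?(buffer)   # pointer is now on "b"
```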

The find_tag method will:

  • ‘Consume’ the opening tag character
  • Scan until the closing symbol (“>”) is found
  • Create and return a new Tag object with the tag name

Here is the code. Notice how we have to chop the last character: scan_until includes the ‘>’ in its result, and we don’t want that.
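A sketch of the three steps, written as a standalone method that takes the scanner as an argument (in the Parser class it would use the scanner created in initialize). The Struct is only a stand-in for the Tag class, and well-formed input is assumed:

```ruby
require 'strscan'

Tag = Struct.new(:name, :content)   # stand-in for the Tag class

def find_tag(buffer)
  buffer.getch                      # consume the opening "<"
  raw = buffer.scan_until(/>/)      # e.g. "body>", the ">" is included
  Tag.new(raw.chop)                 # chop drops that trailing ">"
end

buffer = StringScanner.new("<body>testing</body>")
tag = find_tag(buffer)
```

After the call the pointer sits right after the ‘>’, so the rest of the buffer starts with the tag contents.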

The next step is finding the content inside the tag. This shouldn’t be too hard, since scan_until has already advanced the position pointer to the right spot. We are going to use scan_until again to find the closing tag and return the tag contents.
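One way to do this, again as a standalone sketch. Passing the tag name in and searching for the exact closing tag is an assumption of this example:

```ruby
require 'strscan'

def find_content(buffer, tag_name)
  closing = "</#{tag_name}>"
  # scan_until returns the content followed by the closing tag
  text = buffer.scan_until(/#{Regexp.escape(closing)}/)
  return nil if text.nil?           # no closing tag was found
  text.chomp(closing)               # strip the closing tag from the end
end

buffer = StringScanner.new("<body>testing</body>")
buffer.scan_until(/>/)              # skip past "<body>"
content = find_content(buffer, "body")
```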

Now all we need to do is to call parse_element in a loop until we can’t find more tags in our input buffer.
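Putting the pieces together, the whole parser could look roughly like this. It is a sketch, not the code from the repository: the names `tags` and `@buffer`, the way text between elements is skipped, and the assumption of well-formed, non-nested input are all choices made for this example.

```ruby
require 'strscan'

class Tag
  attr_reader :name
  attr_accessor :content

  def initialize(name)
    @name = name
  end
end

class Parser
  attr_reader :tags

  def initialize(text)
    @buffer = StringScanner.new(text)
    @tags   = []
    parse
  end

  private

  # Keep parsing until the whole input has been consumed.
  def parse
    parse_element until @buffer.eos?
  end

  def parse_element
    if @buffer.peek(1) == "<"
      tag = find_tag
      tag.content = find_content(tag.name)
      @tags << tag
    else
      @buffer.getch                 # not a tag: skip one character
    end
  end

  def find_tag
    @buffer.getch                   # consume "<"
    Tag.new(@buffer.scan_until(/>/).chop)
  end

  def find_content(name)
    closing = "</#{name}>"
    text = @buffer.scan_until(/#{Regexp.escape(closing)}/)
    text.chomp(closing) if text
  end
end

parser   = Parser.new("<body>testing</body> <title>parsing with ruby</title>")
names    = parser.tags.map(&:name)
contents = parser.tags.map(&:content)
```

With this input, `names` holds "body" and "title", and `contents` holds the text found inside each of them.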

You can find the complete code here: https://github.com/matugm/simple-parser. You can also look at the ‘nested_tags’ branch for the extended version that can deal with tags inside another tag.

Conclusion

Writing a parser is an interesting topic and it can also get pretty complicated at times. If you don’t want to make your own parser from scratch you can use one of the so-called ‘parser generators’. In Ruby we have Treetop and Parslet.
