How to build a parser with Ruby

By Jesus Castello, last year

Parsing is the art of making sense of a bunch of strings and converting them into something we can understand. You could just use regular expressions, but they are not always suitable for the job.

For example, it is common knowledge that parsing HTML with regular expressions is probably not a good idea. In Ruby we have Nokogiri, which can do this work for us, but you can learn a lot by building your own parser. Let’s get started!

Parsing with Ruby

The core of our parser is the StringScanner class. This class holds a copy of a string and a position pointer. The pointer allows us to traverse the string in search of certain tokens. The methods we will be using are .peek, .scan_until and .getch. Another useful method is .scan (without the until).

Note:
If StringScanner is not available to you, try adding require 'strscan'

I wrote two tests as documentation so we can understand how this class is supposed to work:

require 'strscan'

describe StringScanner do
  let(:buff) { StringScanner.new "testing" }

  it "can peek one step ahead" do
    expect(buff.peek(1)).to eq "t"
  end

  it "can read one char and return it" do
    expect(buff.getch).to eq "t"
    expect(buff.getch).to eq "e"
  end
end

One important thing to notice about this class is that some methods advance the position pointer (getch, scan), while others don’t (peek). At any point you can inspect your scanner (using .inspect or p) to see where it’s at.
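To make the pointer behavior concrete, here is a small sketch that walks a scanner over a short string. The sample input "<b>hi</b>" is just an illustration and is not part of the original tests.

```ruby
require 'strscan'

scanner = StringScanner.new("<b>hi</b>")

scanner.peek(1)          # => "<"   (pointer stays at 0)
scanner.getch            # => "<"   (pointer moves to 1)
scanner.scan(/\w+/)      # => "b"   (scan only matches at the current position)
scanner.scan(/\w+/)      # => nil   (next char is ">", so nothing is consumed)
scanner.scan_until(/>/)  # => ">"   (everything up to and including the match)
scanner.scan_until(/</)  # => "hi<"
scanner.pos              # => 6
scanner.rest             # => "/b>" (what is left to scan)
```

Note that scan returns nil and leaves the pointer alone when the pattern does not match right at the current position, while scan_until searches forward from it.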

The parser class

The parser class is where most of the work happens. We initialize it with the snippet of text we want to parse; it creates a StringScanner for that text and then calls the parse method:

def initialize(str)
  @buffer = StringScanner.new(str)  # scanner over the input text
  @tags = []                        # Tag objects found so far
  parse
end

In the test we define it like this:

let(:parser) { Parser.new "<body>testing</body> <title>parsing with ruby</title>" }

We will dive into how this class does its job in a bit, but first let’s take a look at the last piece of our program.

The Tag Class

This class is very simple. It mainly serves as a container/data class for the parsing results.

class Tag
  attr_reader :name
  attr_accessor :content

  def initialize(name)
    @name = name
  end
end
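As a quick illustration of how this container might be used, the sketch below repeats the Tag class so it runs on its own, then fills one in by hand. The sample values are made up for the example.

```ruby
# Same Tag class as in the article, repeated so the snippet is self-contained.
class Tag
  attr_reader :name
  attr_accessor :content

  def initialize(name)
    @name = name
  end
end

tag = Tag.new("title")
tag.content = "parsing with ruby"  # content is filled in after creation

tag.name                 # => "title"
tag.content              # => "parsing with ruby"
tag.respond_to?(:name=)  # => false, since name only has a reader
```

The name is fixed at creation, while the content is assigned later, which matches the order in which the parser discovers them.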

Let’s Parse!

To parse something we will need to look at our input text to find patterns. For example, we know HTML code has the following form:

<tag>contents</tag>

There are clearly two different components we can identify here: the tag names and the text inside the tags. If we were to define a formal grammar using BNF notation, it would look something like this:

<tag> ::= <opening_tag> <contents> <closing_tag>
<opening_tag> ::= "<" <tag_name> ">"
<closing_tag> ::= "</" <tag_name> ">"

We are going to use StringScanner’s peek to see if the next symbol in our input buffer is the start of an opening tag. If that’s the case, we will call the find_tag and find_content methods on our Parser class:

def parse_element
  # Only start a new element when the next character opens a tag
  if @buffer.peek(1) == '<'
    @tags << find_tag
    last_tag.content = find_content
  end
end

The find_tag method will:

‘Consume’ the opening tag character
Scan until the closing symbol (“>”) is found
Create and return a new Tag object with the tag name

Here is the code. Notice how we have to chop the last character: scan_until includes the ‘>’ in its result, and we don’t want that.

def find_tag
  @buffer.getch                  # consume the opening "<"
  tag = @buffer.scan_until(/>/)  # e.g. "body>"
  Tag.new(tag.chop)              # drop the trailing ">"
end
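The same three steps can be tried outside the Parser class. This is only a sketch on a standalone scanner; the local variable names (buffer, raw, name) are chosen for the example.

```ruby
require 'strscan'

buffer = StringScanner.new("<body>testing</body>")

buffer.getch                  # consume the "<"
raw  = buffer.scan_until(/>/) # => "body>"
name = raw.chop               # => "body"

buffer.rest                   # => "testing</body>"
```

After these calls the pointer sits right at the start of the tag’s contents.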

The next step is finding the content inside the tag. This shouldn’t be too hard, since the scan_until call in find_tag already advanced the position pointer to the right spot. We are going to use scan_until again to find the closing tag and return the tag contents.

def find_content
  tag = last_tag.name
  content = @buffer.scan_until(/<\/#{tag}>/)  # text up to and including the closing tag
  content.sub("</#{tag}>", "")                # strip the closing tag itself
end
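Here is a standalone sketch of that step. To keep it short, it sets the scanner position by hand (an assumption made only for this demo) to where find_tag would have left it.

```ruby
require 'strscan'

buffer = StringScanner.new("<body>testing</body> <title>x</title>")
buffer.pos = 6  # demo shortcut: jump just past "<body>"

tag = "body"
raw     = buffer.scan_until(/<\/#{tag}>/)  # => "testing</body>"
content = raw.sub("</#{tag}>", "")         # => "testing"

buffer.rest                                # => " <title>x</title>"
```

One thing to keep in mind: scan_until returns nil when the pattern is never found, so input with a missing closing tag would make the sub call fail with a NoMethodError. A more defensive version would need to check for nil first.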

Now all we need to do is call parse_element in a loop until we reach the end of our input buffer and there are no more tags to find.

def parse
  until @buffer.eos?
    skip_spaces
    parse_element
  end
end
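The article does not show last_tag or skip_spaces, so the sketch below fills them in with guesses in order to run the pieces together. Both helper bodies, the tags reader, and the else branch in parse_element are assumptions made for this example. The code in the repository may differ.

```ruby
require 'strscan'

class Tag
  attr_reader :name
  attr_accessor :content

  def initialize(name)
    @name = name
  end
end

class Parser
  attr_reader :tags  # assumption: expose the results for inspection

  def initialize(str)
    @buffer = StringScanner.new(str)
    @tags = []
    parse
  end

  private

  def parse
    until @buffer.eos?
      skip_spaces
      parse_element
    end
  end

  def parse_element
    if @buffer.peek(1) == '<'
      @tags << find_tag
      last_tag.content = find_content
    else
      # Assumption: skip a stray character so the loop cannot stall
      @buffer.getch
    end
  end

  def find_tag
    @buffer.getch
    tag = @buffer.scan_until(/>/)
    Tag.new(tag.chop)
  end

  def find_content
    tag = last_tag.name
    content = @buffer.scan_until(/<\/#{tag}>/)
    content.sub("</#{tag}>", "")
  end

  # Hypothetical helper: the most recently parsed tag
  def last_tag
    @tags.last
  end

  # Hypothetical helper: move past whitespace between elements
  def skip_spaces
    @buffer.skip(/\s+/)
  end
end

parser = Parser.new("<body>testing</body> <title>parsing with ruby</title>")
parser.tags.map { |t| [t.name, t.content] }
# => [["body", "testing"], ["title", "parsing with ruby"]]
```

The else branch is there because, without it, any text outside a tag would leave the pointer in place and the until loop would never reach the end of the buffer.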

You can find the complete code here: https://github.com/matugm/simple-parser. You can also look at the ‘nested_tags’ branch for the extended version that can deal with tags inside another tag.

Conclusion

Writing a parser is an interesting topic, and it can also get pretty complicated at times. If you don’t want to make your own parser from scratch, you can use one of the so-called ‘parser generators’. In Ruby we have Treetop and Parslet.

2 comments

Patrick Mulder (@mulpat) says last year

nice overview!

k-ta-yamada says 9 months ago

This article was very helpful for me.
Thank you.
