parsing Archives - Black Bytes

Static Analysis in Ruby

last year /
By Jesus Castello

If you want to know something about your source code, like the name and line number of all your methods, what do you do?

Your first idea might be to write a regexp for it, but what if I told you there is a better way?


Static analysis is a technique you can use when you need to extract information from the source code itself, without running it. This is done by converting the source code into tokens and then into a tree structure (parsing). Let’s get right into it!

Using the Parser Gem

Ruby has a parser available in the standard library, called Ripper. Its output is hard to work with, so I prefer using the fantastic parser gem. RuboCop uses this gem to do its magic.
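To give an idea of why Ripper’s output is awkward, here is a minimal sketch that finds method names and line numbers by walking the nested arrays returned by Ripper.sexp. The find_defs helper and the sample source are made up for this illustration, and the sketch only handles plain def nodes.

```ruby
require 'ripper'

# Recursively walk the nested arrays returned by Ripper.sexp and
# collect [name, line] pairs for every plain :def node.
# (find_defs is a hypothetical helper written for this example.)
def find_defs(sexp, found = [])
  return found unless sexp.is_a?(Array)

  if sexp[0] == :def
    # The second element looks like [:@ident, "greet", [line, column]]
    _type, name, (line, _column) = sexp[1]
    found << [name, line]
  end

  sexp.each { |child| find_defs(child, found) }
  found
end

source = <<~RUBY
  def greet
    puts "hi"
  end

  def farewell
    puts "bye"
  end
RUBY

find_defs(Ripper.sexp(source)).each do |name, line|
  puts "Found #{name} at line #{line}"
end
```

You have to know the exact position of every element inside each array. The parser gem, used in the rest of this post, gives you proper node objects instead.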

This gem also includes a binary you can use to parse some code directly and see the resulting parse tree.

Here is an example:

    ruby-parse -e '%w(hello world).map { |c| c.upcase }'

The output looks like this:

    (block
      (send
        (array
          (str "hello")
          (str "world")) :map)
      (args
        (arg :c))
      (send
        (lvar :c) :upcase))

This can be useful if you are trying to understand how Ruby parses some code. But if you want to create your own analysis tools you will have to read the source file, parse it and then traverse the generated tree.

Example:

    require 'parser/current'

    code = File.read('app.rb')
    parsed_code = Parser::CurrentRuby.parse(code)

The parser will return an AST (Abstract Syntax Tree) of your code. Don’t get too intimidated by the name, it’s simpler than it sounds.

Traversing the AST

Now that you have parsed your code using the parser gem you need to traverse the resulting AST.

To do that you need to create a class and inherit from AST::Processor:

    class Processor < AST::Processor
    end

Then you have to instantiate this class and call the .process method:

    ast = Processor.new
    ast.process(parsed_code)

You need to define some on_ methods. These methods correspond to the node names in the AST.

To discover what methods you need to define you can add the handler_missing method to your Processor class. You also need the on_begin method.

    class Processor < AST::Processor
      # Visit every statement inside a begin node
      def on_begin(node)
        node.children.each { |c| process(c) }
      end

      # Called for every node type that has no on_ method yet
      def handler_missing(node)
        puts "missing #{node.type}"
      end
    end

Here is where we are:

You have your AST and a basic processor. When you run this code you will see the node types for your AST.

Now:

You need to implement all the on_ methods that you want to use. For example, if I want all the instance method names along with their line numbers I can do this:

    def on_def(node)
      line_num    = node.loc.line
      method_name = node.children[0]

      puts "Found #{method_name} at line #{line_num}"
    end

When you run your program now it should print all the method names found.

Conclusion

Building a Ruby static analysis tool is not as difficult as it may look. If you want a more complete example take a look at my class_indexer gem. Now it’s your turn to make your own tools!


How to build a parser with Ruby

last year /
By Jesus Castello

Parsing is the art of making sense of a bunch of strings and converting them into something we can understand. You could just use regular expressions, but they are not always suitable for the job.

For example, it is common knowledge that parsing HTML with regular expressions is probably not a good idea. In Ruby we have nokogiri that can do this work for us, but you can learn a lot by building your own parser. Let’s get started!

Parsing with Ruby

The core of our parser is the StringScanner class. This class holds a copy of a string and a position pointer. The pointer will allow us to traverse the string in search of certain tokens. The methods we will be using are .peek, .scan_until and .getch. Another useful method is .scan (without the until).

Note:
If StringScanner is not available to you, try adding require 'strscan'

I wrote two tests as documentation so we can understand how this class is supposed to work:

    require 'strscan'

    describe StringScanner do
      let(:buff) { StringScanner.new "testing" }

      it "can peek one step ahead" do
        expect(buff.peek 1).to eq "t"
      end

      it "can read one char and return it" do
        expect(buff.getch).to eq "t"
        expect(buff.getch).to eq "e"
      end
    end

One important thing to notice about this class is that some methods advance the position pointer (getch, scan), while others don’t (peek). At any point you can inspect your scanner (using .inspect or p) to see where it’s at.
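The difference between methods that move the pointer and methods that don’t is easier to see by tracking the position after each call. This is a small sketch; the scanner_walkthrough helper and the sample string are made up for the example.

```ruby
require 'strscan'

# Hypothetical helper: run a few StringScanner calls and record
# [method, result, position after the call] for each one.
def scanner_walkthrough(text)
  scanner = StringScanner.new(text)
  steps = []

  steps << [:peek,       scanner.peek(4),          scanner.pos] # looks ahead, does not move
  steps << [:scan,       scanner.scan(/test/),     scanner.pos] # matches at the pointer and advances
  steps << [:scan,       scanner.scan(/\d+/),      scanner.pos] # no match at the pointer: nil, no move
  steps << [:scan_until, scanner.scan_until(/\s/), scanner.pos] # advances up to and including the match
  steps << [:getch,      scanner.getch,            scanner.pos] # reads a single character
  steps << [:rest,       scanner.rest,             scanner.pos] # what is left, does not move
  steps
end

scanner_walkthrough("testing 123").each do |name, result, pos|
  puts "#{name}: #{result.inspect} (pos=#{pos})"
end
```

Notice that scan only matches right at the current position, while scan_until searches forward from it.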

The parser class

The parser class is where most of the work happens. We initialize it with the snippet of text we want to parse; it creates a StringScanner for that text and calls the parse method:

    def initialize(str)
      @buffer = StringScanner.new(str)
      @tags   = []
      parse
    end

In the test we define it like this:

    let(:parser) { Parser.new "<body>testing</body> <title>parsing with ruby</title>" }

We will dive into how this class does its job in a bit, but first let’s take a look at the last piece of our program.

The Tag Class

This class is very simple. It mainly serves as a container/data class for the parsing results.

    class Tag
      attr_reader :name
      attr_accessor :content

      def initialize(name)
        @name = name
      end
    end

Let’s Parse!

To parse something we will need to look at our input text to find patterns. For example, we know HTML code has the following form:

    <tag>contents</tag>

There are clearly two different components we can identify here: the tag names and the text inside the tags. If we were to define a formal grammar using the BNF notation it would look something like this:

    tag = <opening_tag> <contents> <closing_tag>
    opening_tag = "<" <tag_name> ">"
    closing_tag = "</" <tag_name> ">"

We are going to use StringScanner’s peek to see if the next symbol in our input buffer is an opening tag. If that’s the case then we will call the find_tag and find_content methods on our Parser class:

    def parse_element
      if @buffer.peek(1) == '<'
        @tags << find_tag
        # last_tag is a helper not shown here; presumably it returns @tags.last
        last_tag.content = find_content
      end
    end

The find_tag method will:

‘Consume’ the opening tag character
Scan until the closing symbol (“>”) is found
Create and return a new Tag object with the tag name

Here is the code. Notice how we have to chop the last character: scan_until includes the ‘>’ in its result, and we don’t want that.

    def find_tag
      @buffer.getch
      tag = @buffer.scan_until(/>/)
      Tag.new(tag.chop)
    end
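The three steps of find_tag can be tried in isolation on a bare StringScanner. In this sketch, read_tag_name is a made-up stand-in that returns the name as a plain string instead of a Tag object, and it returns nil when no ‘>’ is found (the original code does not handle that case).

```ruby
require 'strscan'

# Hypothetical stand-in for find_tag that works on any scanner.
def read_tag_name(scanner)
  scanner.getch                 # consume the "<"
  raw = scanner.scan_until(/>/) # for "<body>" this is "body>"
  raw&.chop                     # drop the trailing ">" (nil if there was no ">")
end

scanner = StringScanner.new("<body>testing</body>")
puts read_tag_name(scanner) # the tag name
puts scanner.rest           # what is left for the next step
```

After the call, the scanner is positioned right at the start of the tag contents.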

The next step is finding the content inside the tag. This shouldn’t be too hard, since the scan_until call in find_tag has already advanced the position pointer to the right spot. We are going to use scan_until again to find the closing tag and return the tag contents.

    def find_content
      tag = last_tag.name
      content = @buffer.scan_until(/<\/#{tag}>/)
      content.sub("</#{tag}>", "")
    end
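Here is the same idea as a standalone sketch. The read_content helper is made up for this example and differs from find_content in two small ways that are my own choices, not part of the original code: it escapes the tag name before building the regular expression, and it returns nil when the closing tag is missing.

```ruby
require 'strscan'

# Hypothetical stand-in for find_content that works on any scanner.
def read_content(scanner, tag_name)
  closing = "</#{tag_name}>"
  # scan_until returns everything up to and including the closing tag
  raw = scanner.scan_until(/#{Regexp.escape(closing)}/)
  # Remove the closing tag from the end (nil if it was never found)
  raw&.delete_suffix(closing)
end

scanner = StringScanner.new("testing</body> <title>parsing with ruby</title>")
puts read_content(scanner, "body") # the contents of the body tag
puts scanner.rest.inspect          # the scanner now sits right after </body>
```

Because scan_until stops at the first closing tag it finds, this approach cannot handle a tag nested inside another tag with the same name.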

Now all we need to do is call parse_element in a loop until we reach the end of our input buffer.

    def parse
      until @buffer.eos?
        skip_spaces   # helper not shown here
        parse_element
      end
    end

You can find the complete code here: https://github.com/matugm/simple-parser. You can also look at the ‘nested_tags’ branch for the extended version that can deal with tags inside another tag.
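For reference, here is how the snippets from this post could fit together in a single runnable file. This is a sketch, not the code from the repository: the class is called SimpleParser here, and the skip_spaces and last_tag helpers, the tags reader and the else branch in parse_element are my assumptions about the parts the post doesn’t show. It only handles well-formed, non-nested input.

```ruby
require 'strscan'

class Tag
  attr_reader :name
  attr_accessor :content

  def initialize(name)
    @name = name
  end
end

class SimpleParser
  attr_reader :tags # assumed reader so the results can be inspected

  def initialize(str)
    @buffer = StringScanner.new(str)
    @tags   = []
    parse
  end

  private

  def parse
    until @buffer.eos?
      skip_spaces
      parse_element
    end
  end

  def parse_element
    if @buffer.peek(1) == '<'
      @tags << find_tag
      last_tag.content = find_content
    else
      # Assumption: skip any character that does not start a tag,
      # so the loop always moves forward.
      @buffer.getch
    end
  end

  def find_tag
    @buffer.getch
    tag = @buffer.scan_until(/>/)
    Tag.new(tag.chop)
  end

  def find_content
    tag = last_tag.name
    content = @buffer.scan_until(/<\/#{tag}>/)
    content.sub("</#{tag}>", "")
  end

  # Assumed helper: move the pointer past any whitespace.
  def skip_spaces
    @buffer.skip(/\s+/)
  end

  # Assumed helper: the tag we are currently filling in.
  def last_tag
    @tags.last
  end
end

parser = SimpleParser.new("<body>testing</body> <title>parsing with ruby</title>")
parser.tags.each { |tag| puts "#{tag.name}: #{tag.content}" }
```

A missing ‘>’ or a missing closing tag makes scan_until return nil, which raises a NoMethodError in this version, so real input would need some error handling.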

Conclusion

Writing a parser is an interesting topic and it can also get pretty complicated at times. If you don’t want to make your own parser from scratch you can use one of the so-called ‘parser generators’. In Ruby we have treetop and parslet.

Parsing HTML in Ruby

4 years ago /
By Jesus Castello

If you have ever tried to write a scraping tool you probably had to deal with parsing HTML. This task can be a bit difficult if you don’t have the right tools. Ruby has this wonderful library called Nokogiri, which makes HTML parsing a walk in the park. Let’s see some examples.

First install the nokogiri gem with: gem install nokogiri

Extracting the title

Then create the following script, which contains a basic HTML snippet that will be parsed by nokogiri. The output will be the page title.

    require 'nokogiri'

    html = "<title>test</title><body>actual content here...</body>"
    parsed_data = Nokogiri::HTML.parse(html)
    puts parsed_data.title
    # prints: test

Extracting anchor links

So that was pretty easy, wasn’t it? Well, it doesn’t get much harder than that. For example, if we want all the links from a page, we need to use the xpath method on the object we get back from nokogiri. Then we can print the individual attributes of each tag or the text inside it:

    # html must contain at least one <a href="..."> tag for this to work
    parsed_data = Nokogiri::HTML.parse(html)
    anchor_tags = parsed_data.xpath("//a[@href]")
    puts anchor_tags.first[:href] + " " + anchor_tags.first.text

And that’s it. As you may have already guessed, the xpath method uses the XPath query language. You can also use CSS selectors: just replace the xpath method with the css method.

Nokogiri documentation: http://www.rubydoc.info/github/sparklemotion/nokogiri
