RubyGuides

Parsing HTML in Ruby

If you have ever tried to write a scraping tool, you probably had to deal with parsing HTML. This task can be a bit difficult if you don’t have the right tools. Ruby has a wonderful library called Nokogiri, which makes HTML parsing a walk in the park. Let’s see some examples.

First, install the nokogiri gem with: gem install nokogiri

Extracting the title

Then create the following script, which contains a basic HTML snippet that will be parsed by Nokogiri. The output will be the page title.

Extracting anchor links

So that was pretty easy, wasn’t it? Well, it doesn’t get much harder than that. For example, if we want all the links from a page, we need to use the xpath method on the object we get back from Nokogiri. Then we can print the individual attributes of each tag or the text inside the tags:

And that’s it. As you may have already guessed, the xpath method uses the XPath query language; for more info on XPath, check out this link. You can also use CSS selectors: just replace the xpath method with the css method.

Example:

Note: The difference between at_css and css is that at_css returns only the first matched element, while css returns all matched elements.

To find the correct CSS selector, you can use your browser’s developer tools.

Nokogiri documentation: http://www.rubydoc.info/github/sparklemotion/nokogiri
