Parsing HTML in Ruby

4 years ago /
By Jesus Castello

If you have ever tried to write a scraping tool, you probably had to deal with parsing HTML. This task can be a bit difficult if you don’t have the right tools. Ruby has a wonderful library called Nokogiri, which makes HTML parsing a walk in the park. Let’s see some examples.

First install the nokogiri gem with: gem install nokogiri

Extracting the title

Then create the following script, which contains a basic HTML snippet that will be parsed by Nokogiri. The output will be the page title.

require 'nokogiri'

html = "<title>test</title><body>actual content here...</body>"

parsed_data = Nokogiri::HTML.parse html

puts parsed_data.title

# prints: test

Extracting anchor links

So that was pretty easy, wasn’t it? Well, it doesn’t get much harder than that. For example, if we want all the links from a page, we need to use the xpath method on the object we get back from Nokogiri. Then we can print the individual attributes of a tag or the text inside it:

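# Assumes html holds a page with at least one <a href="..."> tag;
# otherwise anchor_tags.first is nil and the last line raises an error
parsed_data = Nokogiri::HTML.parse html

anchor_tags = parsed_data.xpath("//a[@href]")

puts anchor_tags.first[:href] + " " + anchor_tags.first.text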

And that’s it. As you may have already guessed, the xpath method uses the XPath query language; for more info on XPath, check out this link. You can also use CSS selectors: just replace the xpath method with the css method.

Nokogiri documentation: http://www.rubydoc.info/github/sparklemotion/nokogiri
