If you have ever tried to write a scraping tool, you probably had to deal with parsing HTML. This task can be a bit difficult if you don’t have the right tools. Ruby has a wonderful library called Nokogiri, which makes HTML parsing a walk in the park. Let’s see some examples.
First install the nokogiri gem with: gem install nokogiri
Then create the following script, which contains a basic HTML snippet that will be parsed by Nokogiri. The output will be the page title.
```ruby
require 'nokogiri'

html = "<title>test</title><body>actual content here...</body>"
parsed_data = Nokogiri::HTML.parse html

puts parsed_data.title
# => "test"
```
So that was pretty easy, wasn’t it? Well, it doesn’t get much harder than that. For example, if we want all the links from a page, we use the xpath method on the object we get back from Nokogiri; then we can print the individual attributes of a tag or the text inside it:
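```ruby
# Assumes `html` holds markup with at least one <a href="..."> tag;
# otherwise anchor_tags.first is nil and the last line raises.
parsed_data = Nokogiri::HTML.parse html
anchor_tags = parsed_data.xpath("//a[@href]")
puts anchor_tags.first[:href] + " " + anchor_tags.first.text
```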
And that’s it. As you may have already guessed, the xpath method uses the XPath query language; for more info on XPath, check out this link. You can also use CSS selectors: just replace the xpath method with the css method and pass a CSS selector instead of an XPath expression.
Nokogiri documentation: http://www.rubydoc.info/github/sparklemotion/nokogiri