Black Bytes

All posts by Jesus Castello

Parsing HTML in Ruby

If you have ever tried to write a scraping tool, you probably had to deal with parsing HTML. This task can be a bit difficult if you don’t have the right tools. Ruby has a wonderful library called Nokogiri, which makes HTML parsing a walk in the park. Let’s see some examples.

First install the nokogiri gem with:  gem install nokogiri

Extracting the title

Then create the following script, which contains a basic HTML snippet that will be parsed by Nokogiri. The output will be the page title.

Extracting anchor links

So that was pretty easy, wasn’t it? Well, it doesn’t get much harder than that. For example, if we want all the links from a page, we need to use the xpath method on the object we get back from Nokogiri. Then we can print the individual attributes of each tag or the text inside the tags:

And that’s it. As you may have already guessed, the xpath method uses the XPath query language; for more info on XPath check out this link. You can also use CSS selectors: just replace the xpath method with the css method.

Example:

Note: The difference between at_css & css is that the first one only returns the first matched element, but the latter returns ALL matched elements.

To find the correct CSS selector, you can use your browser’s developer tools.

Nokogiri documentation: http://www.rubydoc.info/github/sparklemotion/nokogiri


Ruby String Formatting

Let’s talk about how you can format strings in Ruby.

Why would you want to format a string? Well, you may want to do things like have a leading zero even if the number is under 10 (example: 01, 02, 03…), or have some console output nicely formatted in columns.

In other languages you can use the printf function to format strings, and if you have ever used C you are probably familiar with that. To use printf you have to define a list of format specifiers and a list of variables or values.

Getting Started with Ruby String Formatting

While sprintf is also available in Ruby, in this post we will use a more idiomatic way (for some reason the community style guide doesn’t seem to agree on this, but I think that’s ok).

Here is an example:
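The snippet below is a reconstruction that assumes the "more idiomatic way" is the % operator on String, which takes a format string on the left and an array of values on the right:

```ruby
time = 5

# %d is replaced by the integer value of `time`.
message = "Processing of the data has finished in %d seconds" % [time]
puts message
```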

Output => "Processing of the data has finished in 5 seconds"

In this example, %d is the format specifier (here is a list of available specifiers) and time is the variable we want formatted. A %d format will give us whole numbers only.

If we want to display floating point numbers we need to use %f. We can specify the number of decimal places we want like this: %0.2f.

The 2 here indicates that we want to keep only two decimal places.

Here is an example:
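A minimal sketch, with a made-up score value and the same % operator assumed above:

```ruby
score = 78.5431

# %0.2f keeps two digits after the decimal point.
average = "The average is %0.2f" % [score]
puts average
```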

Output => The average is 78.54

Remember that the number will be rounded, not truncated. For example, if I used 78.549 in the last example, it would have printed 78.55.

Converting and Padding

You can convert a decimal number and print it as hexadecimal using the %x format:
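For instance, a sketch that passes the same number twice, once for %d and once for %x:

```ruby
number = 122

# %d prints the value in decimal, %x prints it in lowercase hexadecimal.
line = "%d in HEX is %x" % [number, number]
puts line
```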

Output => 122 in HEX is 7a

To pad a number with leading zeros, use the format %0<width>d, where <width> is the total number of digits you want in the output:
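A short sketch of zero padding; note that the 4 is the total width of the field, so a two-digit number gets two zeros in front:

```ruby
# %04d pads the integer with zeros on the left until it is 4 characters wide.
padded = "The number is %04d" % [20]
puts padded
```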

Output => The number is 0020

You can also use this Ruby string format trick to create aligned columns of text. Replace the 0 with a dash to left-align each value within its column width.
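As an illustration with invented data, the following sketch prints two aligned columns. The names, the numbers, and the width of 10 are arbitrary choices:

```ruby
# Hypothetical rows: a name and a number for each.
people = [["Peter", 20], ["John", 35], ["David", 40]]

# %-10s left-aligns the string in a field 10 characters wide.
rows = people.map { |name, value| "%-10s %s" % [name, value] }
puts rows
```

Every name is followed by enough spaces to fill its 10-character field, so the second column starts at the same position on every line.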

Alternatively, you can use the .ljust and .rjust methods from the String class to do the same.

Example:
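A brief sketch with made-up values; both methods take the target width and an optional padding string:

```ruby
# ljust pads on the right, so the text stays on the left.
left = "Peter".ljust(10) + "20"

# rjust pads on the left, so the text ends up on the right.
right = "20".rjust(6)

# The optional second argument sets the padding character.
zero_padded = "7".rjust(3, "0")

puts left
puts right
puts zero_padded
```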

Conclusion

As you have seen, string formatting in Ruby and Rails is really easy. It all comes down to understanding the different format specifiers available to you.

I hope you enjoyed this fast trip into the world of output formatting! Don’t forget to subscribe to my newsletter so I can send you more great content 🙂
