Black Bytes
Share this post!

Category Archives for Programming

ruby hash tables

Hash Tables Explained

One of my favorite data structures is the hash table because it’s simple & powerful. You probably have used it before since it’s an efficient way to store key-value pairs.

There are some interesting computer science concepts behind the implementation of a hash table that are worth studying, and that’s exactly what we are going to do in this article!

Buckets & The Hash Function

The basic idea of a hash table is to allow us to efficiently (in O(1)) retrieve data that is indexed by a key.

As a quick refresher, this is what using a hash table looks like in Ruby:

To implement a hash table we need two components:

  • A place to store the table entries
  • A way to assign key/value pairs to a specific position (index) inside this data store

In other words, we need an array & a hash function.

Implementing a Simple Hash Function

The hash function is an important component of a hash table. This function transforms the key into an index that can be used to lookup or update its associated value.

ruby hash tables

This is the big difference between plain arrays & hash tables.

In an array, you can only access values via their index and the index can only be a number. In a hash table, you access values via their key & the key can be anything (string, symbol, integer…), as long as you can write a hash function for it.

You can write a simple hash function for strings by converting all the letters into their ASCII values then adding them up.

Here is an example:

In this method we use to_s to make sure we are working with a string. This will help us avoid ‘undefined method’ errors. Then a combination of chars (to convert the string into an Array of its characters) & inject for adding up the values.

Inside the block I used the ord method to convert the characters into their ordinal values.

Finally, I used the modulo operator % to make sure the resulting value fits into the array. We call each entry in this array a ‘bucket’.

Bucket Distribution

Ideally we want all our buckets to be filled evenly, this will give us the best performance when we need to retrieve a value.

Let’s look at what happens when we test our hash function using this code:

This produces the following results:

It looks like our keys are evenly distributed…

…but what happens if we ramp up the number of buckets?

In this example I used a bucket size of 128 (it was 32 before):

That doesn’t look like a great distribution anymore!

What’s going on?

The problem is that our hash function is not good enough because all the strings of the same size stay in a certain range. That’s why we have a lot of empty buckets in the middle.

A Better Hash Function

We need a better way to convert our string into an index. Let’s take a look at one possible implementation.

What’s happening here is bit shifting (with the >> & << operators). The values are combined using the “exclusive OR operator” (^).

This bit shifting will mix things up, which will give us a better key distribution. Not perfect, but it’s better than our simple ASCII-based function 🙂

If you want a proper hash function you would be looking at something like MurmurHash, which I believe is what Ruby uses.

Handling Collisions

We don’t have a useful hash table yet.

Why?

Well, you may have noticed that when two keys hash to the same index they will overwrite the old value, and that’s not good!

This is called a hash collision & there are a few strategies to deal with these.

For example:

  • Double Hashing
  • Linear probing
  • Separate chaining

Let’s take a look at separate chaining, where we use a linked-list to store the entries for a particular “bucket”.

So if we assume that :abc & :ccc hash to the same index, our hash table would look something like this:

Then we will need a linear search to find the correct key.

This will have an impact on our performance because our lookup time can slowly degrade towards O(n), instead of the expected O(1).

If you are not familiar with this O(something) notation that’s called “Big-O Notation“.

To avoid the linked list from growing too deep & therefore degrade the performance of the hash table, it’s necessary to recreate the hash table using a bigger number of buckets.

Ruby does this for you automatically, but still good to know.

Conclusion

The purpose of this article is not to have you writing a hash table implementation, but to help you understand how they actually work, so I hope that you found that interesting!

Don’t forget to share this post to keep the blog going 🙂

ruby pack unpack

Packing & Unpacking: A Guide to Reading Binary Data in Ruby

Working with text is a lot easier than working with binary data…

…with text you can use regular expressions & methods like scan, match & gsub.

But if you want to work with binary data there is some extra work to do. That’s where the Array#pack & String#unpack methods come into play.

Let me show you some examples, starting with just a plain string & then moving on to more interesting stuff.

String to ASCII Values

This will convert every character in the string into a decimal value:

Notice the "c*" argument for unpack.

This is a “format string” which tells unpack what to do with the data. In this case, c means take one character & convert it into an integer value (the String#ord method also does this).

The asterisk * just says “repeat this format for all the input data”.

Hex to Integer

This format string takes 4 bytes of data & returns an integer. One thing to notice is that these bytes are in “little-endian” format.

Examples:

I used first here because unpack returns an array.

Binary File Parsing

How do you read a binary file like an EXE, PNG or GZIP?

If you treat these like strings you will just see something that looks like random data…

ruby string pack

…but there is a documented structure for most of these file formats & the unpack method is what you would use to read that data and convert it into something useful.

Here is an example:

In this example, the binary data (represented in hexadecimal, which is way more compact than 1s & 0s) has a two-byte (16 bit) length field that contains the length of the following string. Then there is the string itself.

It is very common for binary files & binary network protocols to have a “length” field.

This tells the parser exactly how many bytes should be read (and yes, I know in this example I read both the length & the data in one step, that’s just to keep things simple).

There is also the bindata gem, which is built specifically to help you parse binary structures.

Here is an example:

Notice the read_length parameter. This will tell bindata to work out the length from the field, so this will save you a lot of work 🙂

So if you want to write a parser for any binary format, these are the steps:

  1. Find the specification for this format (if it’s not public you will have to reverse-engineer it, which is an entire topic on its own)
  2. Write a bindata class for every section of the file (you will usually find a header section first with metadata & then multiple data sections)
  3. Read the data & process it however you want (for example, in a PNG you could change the colors of the image)
  4. Profit!

If you want to see a full example of bindata in action take a look at my PNG parser on github.

Base64 Encoding

There is this type of encoding called “Base64”. You may have seen it before on a URL.

Looks something like this:

The double equals at the end is usually the tell-tale sign that you are dealing with Base64, although some inputs can result in the equals signs not being there (they are used as padding).

So why I’m telling you this, besides being a useful thing to know in itself?

Well it turns out that you can convert a string into Base64 using the pack method.

As you can see here:

In fact, this is the exact method used in the Base64 module from the standard library 🙂

Summary

In this post you learned about the pack & unpack methods, which help you work with binary data. It can be used to parse binary files, convert a string into ASCII values & Base64 encoding.

Don’t forget to share & subscribe so you can enjoy more blog post like this! 🙂

ruby tracepoint

How To Spy on Your Ruby Methods

Ruby has a built-in tracing system which you can access using the TracePoint class. Some of the things you can trace are method calls, new threads & exceptions.

Why would you want to use this?

Well, it could be useful if you want to trace the execution of a certain method. You will be able to see what other methods are being called & what are the return values.

Let’s see a few examples!

Tracing Method Calls

Most of the time you will want TracePoint to trace application code & not built-in methods (like puts, size, etc).

You can do this using the call event.

Example:

This prints the file path, the line number, the event name & the method name.

If you don’t specify any events Ruby will call your block for all of them, resulting in more output. So I would recommend that you focus on specific events to find what you want faster 🙂

Here’s is a table of TracePoint events:

Event name Description
call Application methods
c_call C-level methods (like puts)
return Method return (for tracing return values & call depth)
b_call Block call
b_return Block return
raise Exception raised
thread_begin New thread
thread_end Thread ending

TracePoint + Graphviz

Many methods will make more than just 3 methods calls, especially in framework code, so the output from Tracepoint can be hard to visualize.

So I made a gem that lets you create a visual call graph like this:

This generates a call_graph.png file with the results.

ruby call graph

Keep in mind that this is not static analysis, this will actually call the method!

Showing File Paths

Would you like to know where these methods are defined?

Don’t worry, I got you covered! I added an option you can enable to show the file path for each method call.

Which results in:

visual call graph

If you want to see some massive call graphs you just have to trace some Rails methods 😉

Return Values

In the intro I mentioned that you can also get return values. For this you will need to trace the return event and use the return_value method.

Example:

This will print:

Events First

Someone asked on reddit how it’s possible to avoid having the word “bar” printed when calling the foo method in the following code:

There are many ways to achieve this, like prepending a module, redirecting $stdout or redefining the bar method.

If you are feeling creative, comment on this post with your own idea!

But I found one of the answers particularly interesting because it used the TracePoint class.

Here it is:

This code will call exit when the method bar is called, which prevents the string from being printed by ending the program.

Probably not something you want to use in real code, but it proves one thing about TracePoint: Events are triggered before they happen.

Something to keep in mind if you are going to build some sort of tool around this 🙂

Summary

In this post you learned about the TracePoint class, which allows you to trace a few events like methods calls or new threads. This can be useful as a debugging tool or for code exploration.

Remember to share this post so more people can enjoy it 🙂

ruby 2.4 features

9 New Features in Ruby 2.4

It has become a tradition to release new Ruby versions on Christmas.

And in this post I want to cover some of the most interesting changes in the next version so you can keep up with the news 🙂

Float#round with keyword

If you are using floats in your app I hope you use floor or ceil for rounding, because the Float#round method is changing its default behavior in Ruby 2.4.

Example:

The default behavior is now to “round-to-nearest-even”.

UPDATE: The default behavior for Float#round is back to “rounding up” in the final version of Ruby 2.4. This was a decision taken by Matz after this post was originally published.

In addition, Float#round now takes an argument which you can use to define the type of rounding you want.

The options are:

  • :even
  • :up
  • :down

Example:

Also Float#floor, Float#ceil & Float#truncate now take an optional argument that lets you set a precision.

Chomp flag for IO methods

If you ever used a method like gets or each_line I’m sure you remember having to deal with those pesky new line characters.

The ones that looks like this: \n.

This is something beginners always have trouble with, but this new feature might be able to help!

Here’s an example:

Now in Ruby 2.4 you will be able to set the chomp keyword argument & gets will remove the newline character for you.

Example:

This will work with any value that’s not false or nil (the only “falsy” values in Ruby).

So this also works (but not recommended because it can be confusing):

It’s not a super big deal, but it can save you a method call 🙂

Pathname#empty?

Ruby 2.4 implements Dir#empty? & File#empty?, these methods let you check if a directory or a file is empty (that was pretty obvious, right?).

But Pathname#empty? was also added recently.

In case you are not familiar with Pathname, it’s a class that merges together functionality from both the Dir class & the File class.

In addition, it’s also more “OO” (Object Oriented) in the sense that it returns Pathname objects, instead of strings.

Example:

Commit: https://github.com/ruby/ruby/commit/9373c5efb993dd8cae0526118805449b19af2c22

Set compare_by_identity

Sets are a data structure that is available as part of the standard library. It helps you keep a collection of unique items.

By default the objects are compared based on their value (or to be more accurate, based on their hash value).

But in Ruby 2.4 it’s possible to have a set of unique objects based on their object id.

Example:

If anyone knows of any interesting use for this feature leave a comment 🙂

Commit: https://github.com/ruby/ruby/commit/76977611dd68e384fdce8c546efda5e1931e67a6

Refinements with Kernel#send, BasicObject#send, Symbol#to_proc

I’m not going to cover refinements in detail here, but the basic idea is to be able to add methods to a class like String, while keeping this change “localized” to one file or class.

Since Ruby 2.4, methods defined via refinements will be available when called via methods like Kernel#send & Symbol#to_proc.

Example:

If you try this in 2.3 or below you will get an ‘undefined method’ error.

Commit: https://github.com/ruby/ruby/commit/35a29390197750abf97ef16fa0740e377764daef

Hash#transform_values

Here’s another method extracted from Rails & coming directly to Ruby. I’m talking about Hash#transform_values, which works in a similar way to Array#map.

Example:

There is also Hash#transform_values! if you need in-place mutation.

Kernel#clone now takes an optional keyword argument

As you may know, it’s possible to make a copy of a Ruby object. This is useful because most Ruby objects are mutable & you may want to avoid making changes to the original object.

We have two methods for making an object copy:

  • clone
  • dup

There are a few small differences between clone & dup, but for this post let’s just say that clone keeps the “frozen” status of the original object while dup does not.

New in 2.4 is the ability to call clone with a “freeze” flag.

Example:

I’m not sure to what extend this is useful, but who doesn’t want more options 🙂

Commit: https://github.com/ruby/ruby/commit/320ae01c5fb091eab0926c186f304a9caeda1ace

Thread.report_on_exception

Another feature coming with 2.4 is Thread.report_on_exception. This was proposed because Thread exceptions are silent by default, and this can hide problems with your code.

The default value is false, but if your app is using Threads you should try enabling this when upgrading to Ruby 2.4.

Example:

This will show the exception, but it will not crash your program. The alternative is Thread.abort_on_exception, which has always been available.

Notice that there is also an instance version of this method, so you can set this option per-thread instead of all threads.

Binding#irb

Are you a fan of using binding.pry for debugging? Well now we also have binding.irb which works in a similar way.

But since it’s irb you still don’t get all the nice things that pry gives you, like the syntax highlighting.

Better than nothing if for some reason you don’t have access to pry 🙂

Commit: https://github.com/ruby/ruby/commit/493e48897421d176a8faf0f0820323d79ecdf94a

Conclusion

This new version of Ruby brings in many interesting features, so make sure to check it out when it’s available.

Don’t forget to share this post so more people can learn about these new features in Ruby 2.4!

ruby port scanner

How to Write a Port Scanner in Ruby

Why would you want to write a port scanner?

Writing a port scanner is a great way to learn the basics of the TCP protocol, which is the transport layer used by most internet protocols (including HTTP & SSH).

It’s also a good exercise to learn more about how Ruby network programming works.

Let’s gets started by talking about ports!

What Is a Port?

When we talk about ports what are we really talking about? A port at the O.S. (Operating System) level is just a “file descriptor” associated with a process.

A file descriptor is just a number which is used to reference an open I/O channel, like stdout (standard output), a network socket or a file.

When the OS receives a TCP/IP packet it looks at the destination port & tries to find a process which is listening on this port.

Then if there is a listening process the packet is delivered to it.

Port Ranges

There are 65.535 ports available in total, but in practice not all of them are used regularly.

Ports can be divided into 3 groups:

Range Name Examples
1-1023 Well-known ports 22 (SSH), 80 (HTTP), 443 (HTTPS)
1024-49151 Registered ports 3306 (MySQL), 5432 (PostgreSQL)
49152-65535 Ephemeral ports Chrome, Firefox

Ports in the 2nd range are assigned by the IANA (Internet Assigned Numbers Authority).

And the 3rd range is used for “dynamic” or “ephemeral” ports, these are the ports used by the client side of the connection to receive data from the server.

The Basics of TCP Communication

Now you should be more familiar with ports, but we are still not ready to write our port scanner.

First, we need to discuss what it means for a port to be open. Then we will examine the behavior of both an open port & a closed port at the network level so we can differentiate them.

How can you know if a port is open?

An open port just means that there is an application listening on the other end & that we have access to it (not blocked by a firewall).

Let’s see how a new TCP connection is initiated.

ruby tcp three way handshake

A new TCP connection is started with a SYN packet. This packet symbolizes the start of a new connection.

There are three possible outcomes:

  • The server answers with a SYN/ACK – this means it’s ready to establish a connection
  • The server answers with a RST packet – this means that the connection is rejected
  • The server doesn’t answer at all (some firewalls implement this as the DROP policy)

If the client receives a SYN/ACK packet then it can finish establishing the connection by sending an ACK packet.

This is called the “three-way handshake“.

Note: A great way to watch this happening is by using a networking tool like tcpdump or wireshark. These tools allow you to see every packet that is coming in & out of your system.

Here is an example connection from tshark (wireshark’s command-line interface):

three-way-tshark

Now that you have a basic understanding of how a TCP connection is established we can write a simple port scanner.

Let’s Write a Port Scanner!

The simplest way to write our scanner is to open a new TCP connection using TCPSocket & then rely on the fact that a rejected connection will raise the Errno::ECONNREFUSED exception.

Here is the code:

This code has some limitations:

We can only scan one port at a time. Often you want to scan a range of ports to get a better idea of what services are exposed on a specific host.

Another limitation with this code is that a non-responding port (the third of the outcomes I mentioned earlier) will make you wait for about 20 seconds until the connection times out.

This can be solved with a combination of connect_nonblock, IO.select & Socket (instead of TCPSocket, which initiates a connection as soon as you create the object).

Example:

The IO.select call will wait until the socket is ready to receive data (which means it’s open) or until TIMEOUT time has passed, after which we can assume that the port is either closed or ignoring connection requests.

So, this is all great as a learning exercise, but if you need a proper port scanner you should use something like nmap.

Nmap is open source & has been under active development for over 15 years.

This is what nmap output looks like:

nmap

I know that’s a lot of info, so to help you remember write a comment with 2 new things you learned from this post 🙂

Summary

In this post you learned what a port is, the different port ranges available, how a TCP connection is initiated (the three-way handshake) & how to write a basic port scanner using Ruby.

If you enjoyed this post don’t forget to share it so more people can enjoy it 🙂