Multi-threaded web crawler in Ruby

Page 1: Multi-threaded web crawler in Ruby

Multi-threaded web crawler in Ruby

Page 2: Multi-threaded web crawler in Ruby

Hi, I’m Kamil Durski, Senior Ruby Developer at Polcode.

If improving Ruby skills is what you’re after, stick around. I’ll show you how to use multiple threads to drastically increase the efficiency of your application.

Since I focus on threads, only the relevant code is shown in the slideshow. You’ll find a link to the full source at the end.

Page 3: Multi-threaded web crawler in Ruby

The (much) underestimated threads

Page 4: Multi-threaded web crawler in Ruby

Ruby programmers have easy access to threads thanks to built-in support.

Threads can be very useful, yet for some reason they don’t receive much love.

Where can you use threads to see their prowess first-hand?

Crawling the web is a perfect example! Threads let you reclaim the time you’d otherwise spend waiting for data from a remote server.

Page 5: Multi-threaded web crawler in Ruby

I’m going to build a simple app so you can really understand the power of threads. It will fetch info on some popular U.S. TV shows (the one with dragons, and the one with an ex-chemistry teacher, too!) from a bunch of websites.

But before we take a look at the code, let’s start with a few slides of good old theory.

Page 6: Multi-threaded web crawler in Ruby

What’s the difference between a thread and a process?

Page 7: Multi-threaded web crawler in Ruby

A multi-threaded app is capable of doing a lot of things at the same time.

That’s because the app has the ability to switch between threads, letting each of them use some of the process time.

But it’s still a single process. The same thing goes for running many apps on a single-core processor – there it’s the operating system that does the switching.

Page 8: Multi-threaded web crawler in Ruby

Another big difference

Use threads within a single process and you can share memory and variables between all of them, which makes development easier. Use multiple processes (and processor cores) and that’s no longer the case – sharing data gets harder.

Check Wikipedia to find out more on threads.

Page 9: Multi-threaded web crawler in Ruby

Now we can go back to the TV shows. Aside from Ruby on Rails’ Active Record library for database access, all I’m going to use are:

Three components from Ruby’s thread library:

1) Thread – the core class that runs multiple parts of code at the same time,

2) Queue – this class will let me schedule jobs to be used by all the threads,

3) Mutex – it synchronizes access to shared resources, so a thread can’t cut in while another one is in the middle of a protected section (see the sketch below).
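
As a warm-up, here is a small self-contained sketch of all three working together. None of it is taken from the crawler’s source – the job names and the sleep call just simulate slow network requests:

    queue = Queue.new                       # thread-safe FIFO shared by the workers
    mutex = Mutex.new                       # guards the shared console output

    %w[page-1 page-2 page-3 page-4].each { |job| queue << job }

    workers = 2.times.map do
      Thread.new do
        begin
          loop do
            job = queue.pop(true)           # non-blocking; raises ThreadError when empty
            sleep 0.1                       # pretend we're waiting on a remote server
            mutex.synchronize { puts "finished #{job}" }
          end
        rescue ThreadError
          # queue drained -- this worker is done
        end
      end
    end

    workers.each(&:join)                    # wait for every worker before exiting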

Page 10: Multi-threaded web crawler in Ruby

The app itself is also divided into three major components:

1) Module – I’m going to supply the app with a list of modules to run. A module creates multiple threads and tells the crawlers what to do,

2) Crawler – I’m going to create crawler classes to fetch data from websites,

3) Model – models will allow me to store and retrieve data from the database (see the sketch below).
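
To make the split concrete, here is a rough sketch of how the three pieces could relate. The class names, URLs and columns are illustrative assumptions, not taken from the repository:

    require 'active_record'

    # "Model": stores and retrieves the data (table and columns are assumptions)
    class Show < ActiveRecord::Base
    end

    # "Crawler": fetches and parses data for a single page
    class ShowsCrawler
      def initialize(url)
        @url = url
      end

      def call
        # fetch @url, extract the show's details, then persist them via the model
        Show.create!(title: "title parsed from #{@url}")
      end
    end

    # "Module": schedules one job per URL on the shared thread pool
    class ShowsModule
      def run(threads)
        %w[https://example.com/show-1 https://example.com/show-2].each do |url|
          threads.add(url) { |u| ShowsCrawler.new(u).call }
        end
      end
    end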

Page 11: Multi-threaded web crawler in Ruby

Crawler module

Page 12: Multi-threaded web crawler in Ruby

The Crawler module is responsible for setting up the environment and connecting to the database.

Page 13: Multi-threaded web crawler in Ruby

The autoload calls refer to the major components inside the lib/ directory. The setup_env method connects to the database, adds the app/ subdirectories to the $LOAD_PATH variable and requires all of the files under the app/ directory. A new Mutex instance is stored in the @mutex variable, so we can access it via Crawler.mutex.
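
The slide’s code isn’t in the transcript, but based on that description lib/crawler.rb could look roughly like this – the adapter, database path and autoload targets are assumptions:

    # lib/crawler.rb -- a minimal sketch, not the exact source
    require 'active_record'

    module Crawler
      autoload :Threads, 'crawler/threads'

      class << self
        attr_reader :mutex

        def setup_env
          @mutex = Mutex.new

          # connect to the database (connection details are assumptions)
          ActiveRecord::Base.establish_connection(
            adapter:  'sqlite3',
            database: 'db/crawler.sqlite3'
          )

          # add every app/ subdirectory to $LOAD_PATH and require its files
          Dir[File.expand_path('../../app/*', __FILE__)].each do |dir|
            $LOAD_PATH.unshift(dir)
            Dir[File.join(dir, '*.rb')].each { |file| require file }
          end
        end
      end
    end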

Page 14: Multi-threaded web crawler in Ruby

Crawler::Threads class – the core feature

Page 15: Multi-threaded web crawler in Ruby

Now I’m going to create the core feature of the app. I’m initializing a few variables: @size, to know how many threads to spawn; the @threads array, to keep track of the threads; and @queue, to store the jobs to do.
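
A sketch of that constructor might look like this (the keyword argument and its default are my assumptions):

    module Crawler
      class Threads
        def initialize(size: 4)
          @size    = size        # how many worker threads to spawn
          @threads = []          # keeps references to the spawned threads
          @queue   = Queue.new   # thread-safe store for the jobs to do
        end
      end
    end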

Page 16: Multi-threaded web crawler in Ruby

I’m calling the #add method to add each job to the queue. It accepts optional arguments and a block (look up blocks in Ruby if you’re not familiar with the concept).
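
A hedged sketch of #add: it simply stores the block together with its arguments so a worker can execute it later.

    def add(*args, &block)
      # keep the block and its arguments together as one job
      @queue << [block, args]
    end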

Page 17: Multi-threaded web crawler in Ruby

Next, the #start method spawns the threads and calls #join on each of them. That call is essential for the whole app to work – without it, the main thread would finish, instantly kill the spawned threads and exit before the work is done.
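
Under those assumptions, #start could be sketched like this – spawn @size workers, then #join each one so the main thread waits for them:

    def start
      @size.times do
        @threads << Thread.new { work }   # `work` is the worker loop sketched below
      end
      @threads.each(&:join)               # block the main thread until all workers finish
    end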

Page 18: Multi-threaded web crawler in Ruby

To complete the core functionality, I pop a block from the queue and run it with the arguments passed to the earlier #add call. The true argument makes sure #pop runs in non-blocking mode. Otherwise, a thread would keep waiting for a new job even after the queue is already empty, and I would run into a deadlock (eventually throwing the error „No live threads left. Deadlock?”).
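
Put together, the worker loop could look like this (the method name work is an assumption). @queue.pop(true) raises ThreadError on an empty queue, which is the signal for the worker to stop:

    def work
      loop do
        block, args = @queue.pop(true)   # non-blocking pop; raises ThreadError when empty
        block.call(*args)                # run the job with the arguments passed to #add
      end
    rescue ThreadError
      # queue is empty -- nothing left to do, let this thread finish
    end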

Page 19: Multi-threaded web crawler in Ruby

I can use the Crawler::Threads class to crawl multiple pages at the same time.

Page 20: Multi-threaded web crawler in Ruby

Now I can run some code to see what all of it amounts to:
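
The slide’s code isn’t reproduced in the transcript, but a usage sketch along these lines would match the description – the URLs are purely illustrative:

    require 'open-uri'

    URLS = %w[
      https://example.com/show-1
      https://example.com/show-2
    ].freeze

    # size: 1 reproduces the single-threaded run; bump it to 10 for the second test
    threads = Crawler::Threads.new(size: 1)

    URLS.each do |url|
      threads.add(url) do |u|
        html = URI.open(u).read          # this is where the app waits on the remote server
        puts "fetched #{u} (#{html.size} bytes)"
      end
    end

    threads.start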

Page 21: Multi-threaded web crawler in Ruby

10 seconds to visit 10 pages and fetch some basic information. Alright, now I’m going to try 10 threads.

Page 22: Multi-threaded web crawler in Ruby

All it took to do the same task was 1.51 s!

The app no longer wastes time doing nothing while waiting for the remote server to deliver data.

What’s also interesting is that the output order is different: with a single thread it matches the order in the config file, while with multiple threads it’s effectively random, because some threads finish their jobs faster than others.

Page 23: Multi-threaded web crawler in Ruby

Thread safety

Page 24: Multi-threaded web crawler in Ruby

The code I used outputs information with puts. That isn’t thread-safe, because puts does two separate things:

- it outputs the given string,

- then it outputs the newline (NL) character.

This can make NL characters appear out of place: a thread switch can happen in the middle, and another thread takes over before the NL character is printed. See the example below:

Page 25: Multi-threaded web crawler in Ruby

I fixed this with the mutex, by creating a custom #log method that wraps the console output in it:
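
A minimal sketch of such a #log helper, assuming it lives on the Crawler module and reuses the Crawler.mutex instance set up earlier:

    module Crawler
      def self.log(message)
        mutex.synchronize do
          puts message   # no other thread can interleave output until this block finishes
        end
      end
    end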

Now the console output always stays in order, because each thread has to wait for the puts call to finish before another one can write.

Page 26: Multi-threaded web crawler in Ruby

And that’s it. Now you know more about how threads work.

I wrote this code as a side project, since web crawling is an important part of what I do. The previous version included more features, such as the use of proxies and Tor network support. The latter improves anonymity, but also slows the code down a lot.

Thanks for your time and, again, feel free to tackle the entire code at:

https://github.com/kdurski/crawler