
A brief introduction to the beanstalkd queue

Posted by Steve on Wed 24 Apr 2013 at 11:42

There are many times when having access to a queue is useful while you're developing projects and code. These days there are several queueing daemons available, and here we're going to look at one of them, beanstalkd.

beanstalkd describes itself as a fast and simple work queue, and it is documented at the beanstalkd project page.

Taking a step back, we should first look at what a queue is. In brief, a queue is something that lets you store "jobs" in it, and later retrieve them. "What is a job?", you might ask, and the answer is "almost anything".

Imagine you're writing a web application to display RSS feeds. In that situation you might need to poll 100+ websites to fetch their feeds. Rather than writing some code to do that en masse you might just have a queue:

  • For each RSS feed you wish to poll:
    • Add the URL of the feed to the queue, to be processed later.
  • When you're idle:
    • Fetch one feed URL from the queue.
    • "Handle it".
    • Delete the job from the queue.

This process ensures that you don't lose jobs, and you can pick them off when you're idle.
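The steps above can be sketched in a few lines of Ruby. This is only an illustration: it uses Ruby's built-in in-memory Queue as a stand-in for a real queueing daemon, and the feed URLs are made up for the example.

```ruby
require 'thread'

# In-memory stand-in for a real queue (an assumption for this sketch;
# a queueing daemon would take its place in a real application).
feed_queue = Queue.new

# Hypothetical feed URLs, for illustration only.
feeds = ["http://example.com/rss", "http://example.org/atom.xml"]

# For each feed, add its URL to the queue, to be processed later.
feeds.each { |url| feed_queue.push(url) }

# When idle: take one URL at a time and handle it.
handled = []
until feed_queue.empty?
  url = feed_queue.pop
  # "Handle it": a real worker would fetch and parse the feed here.
  handled << url
  # With a real queueing daemon the job would be deleted explicitly at
  # this point; popping from the in-memory Queue has already removed it.
end

puts handled
```

Note that the in-memory stand-in loses its contents when the process exits, which is exactly the problem a dedicated queue solves.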

Similarly if you were writing a spider to crawl websites your algorithm might look like this:

  • Pick a URL from the queue.
  • Fetch it.
  • Parse it.
  • For each new URL you've found in the body, add it to the queue.
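As a rough sketch of that loop, the following uses an in-memory Queue and a small hard-coded link table in place of real HTTP fetching and HTML parsing. Both are assumptions made purely so the example is self-contained. The "seen" set is an addition of mine: without it two pages which link to each other would keep the spider busy forever.

```ruby
require 'thread'
require 'set'

# Hypothetical link table standing in for real fetching and parsing:
# each URL maps to the URLs "found" in its body.
links = {
  "http://example.com/"  => ["http://example.com/a", "http://example.com/b"],
  "http://example.com/a" => ["http://example.com/b"],
  "http://example.com/b" => ["http://example.com/"],
}

queue = Queue.new
seen  = Set.new

start = "http://example.com/"
queue.push(start)
seen.add(start)

crawled = []
until queue.empty?
  url   = queue.pop              # pick a URL from the queue
  found = links.fetch(url, [])   # "fetch" and "parse" it
  crawled << url

  found.each do |link|
    # Set#add? returns nil when the link is already known,
    # so every URL is queued at most once.
    queue.push(link) if seen.add?(link)
  end
end

puts crawled
```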

Queues become even more useful if you're looking to distribute work. Pretend you wish to monitor a bunch of services in a distributed fashion:

  • Push all your checks into a queue
    • For example your check might be "http://example.com must run http", or "192.168.1.1 must ping".
  • Have N worker-hosts which will run the tests.
    • Each host picks a job from the queue, executes it, and alerts on the success/failure.

(This monitoring system exists. It was introduced on the Bytemark blog, and the code is available under a free license.)
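To illustrate the idea of several workers sharing one queue, here is a minimal sketch in which threads play the part of the worker-hosts and an in-memory Queue plays the part of the queueing daemon. Those substitutions, and the worker count, are assumptions for the example. It is not how the monitoring system mentioned above is implemented.

```ruby
require 'thread'

# The shared queue of checks (in-memory stand-in for a real daemon).
checks = Queue.new
["http://example.com must run http", "192.168.1.1 must ping"].each do |check|
  checks.push(check)
end

results = Queue.new   # thread-safe place to record what each worker did
worker_count = 2      # the "N" in "N worker-hosts"

workers = (1..worker_count).map do |id|
  Thread.new do
    loop do
      begin
        # Non-blocking pop: raises ThreadError once the queue is empty.
        check = checks.pop(true)
      rescue ThreadError
        break
      end
      # A real worker would run the test here and alert on the outcome.
      results.push([id, check])
    end
  end
end
workers.each(&:join)

done = []
done << results.pop until results.empty?
done.each { |id, check| puts "worker #{id} ran: #{check}" }
```

Each check is handed to exactly one worker, although which worker gets which check will vary from run to run.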

In short a queue is a useful thing that lets you "store stuff" and "retrieve stuff". The stuff you store is generally a string, but sometimes might be a hash, a piece of JSON, or some similarly open format. (Storing strings is just as useful as storing objects if you use a serialization library, be it JSON, YAML, or similar.)
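For instance, a hash can be turned into a string before it is stored, and back into a hash once it is retrieved. The sketch below uses Ruby's standard JSON library, and the job's fields are invented for the example.

```ruby
require 'json'

# A hypothetical job description; the field names are made up.
job = { "url" => "http://example.com/rss", "attempts" => 0 }

# Serialize to a plain string: this is what you'd store in the queue.
body = JSON.generate(job)

# Later, a worker turns the string it retrieved back into a hash.
restored = JSON.parse(body)

puts body
puts restored["url"]
```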

The key attributes of a queue are:

  • The ability to easily store and retrieve items.
  • The ability to never lose jobs.
  • The ability to access the queue from several different languages.

There are many queues to choose from, but I've been fond of beanstalkd for some time, partly because it is simple, reliable, and easy to understand, and partly because it is available for Debian's current stable release, Squeeze.

Installing beanstalkd is very straightforward:

# aptitude install beanstalkd
# echo "START=yes" >> /etc/default/beanstalkd
# service beanstalkd restart

Once you have your queue running you'll want to insert/remove jobs from it. The simplest way is to use the Ruby language-bindings, so you'll need to install those too:

# aptitude install libbeanstalkclient-ruby

Once installed you can then write a simple program to insert a job into the queue. The following example adds two simple strings:

#!/usr/bin/ruby1.8

require 'beanstalk-client'

queue = Beanstalk::Pool.new(["127.0.0.1:11300"] )
queue.put( "This is a job" )
queue.put( "here is another job" )

Now that the jobs are present we can retrieve them:

#!/usr/bin/ruby1.8

require 'beanstalk-client'

beanstalk = Beanstalk::Pool.new(['127.0.0.1:11300'])
loop do
  job = beanstalk.reserve   # blocks until a job is available
  puts job.body             # prints the literal body of the job
  job.delete                # remove the job so it isn't handed out again
end

Running these in succession does the obvious thing:

precious ~ $ ./queue-add.rb
precious ~ $ ./queue-get.rb
This is a job
here is another job
[hang - waiting for more jobs]

If you prefer, you can time out if there are no jobs:

#!/usr/bin/ruby1.8

require 'beanstalk-client'

beanstalk = Beanstalk::Pool.new(['127.0.0.1:11300'])
while( true )
  begin
    job = beanstalk.reserve(1)
    puts job.body
    job.delete
  rescue Beanstalk::TimedOut => ex
    puts "Timed out"
    exit( 0 )
  end
end

Running this will work as before, but once the queue has been empty for a second the process will exit:

precious ~ $ ./queue-add.rb
precious ~ $ ./queue-timeout.rb
This is a job
here is another job
Timed out

These three sample scripts cover 99% of all you would wish to do with beanstalkd:

  • Add a job.
  • Retrieve a job.
  • Exit if no jobs are pending.

Beyond that there are extra facilities. For example the queue we've used above is global, but there is a notion of named queues (in beanstalkd these are called "tubes"). In our simple example we merely inserted jobs and pulled them out in the order they were submitted, but there is also the notion of priorities, so you can retrieve the most important pending jobs first.

I've used the notion of priority in the past to order things alphabetically which is a cute thing to do. Merely defining the priority of every job as the ASCII value of the first character in the job allows all the A-jobs to be pulled out first, then all the B-jobs, etc.
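A minimal sketch of that trick follows. beanstalkd hands out jobs with the lowest priority number first, and jobs of equal priority in the order they were submitted, so the example simulates that ordering in plain Ruby rather than talking to a server. The helper name priority_for and the job bodies are made up for the illustration.

```ruby
# Hypothetical helper: use the value of the first byte of the job body
# as its priority, falling back to 0 for an empty body.
def priority_for(body)
  body.bytes.first || 0
end

jobs = ["banana", "apple", "cherry", "avocado"]

# Simulate the order in which the jobs would be handed out:
# lowest priority number first, ties broken by submission order.
ordered = jobs.each_with_index
              .sort_by { |body, index| [priority_for(body), index] }
              .map { |body, _index| body }

puts ordered

# With the Ruby bindings used earlier, the priority is assumed to be
# passed as an optional second argument when adding a job, along the
# lines of: queue.put(body, priority_for(body))
```

One thing to watch is that upper-case letters have lower ASCII values than lower-case ones, so "Zebra" would be handed out before "apple" unless you normalise the case first.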

The actual nature of your jobs is going to be application-specific, but beanstalkd is ideal for moving background processing out of busy user interfaces, and for distributing load across a number of hosts.

Hopefully, despite the lack of further code and specifics, this will be a useful introduction.

 

 


Re: A brief introduction to the beanstalkd queue
Posted by Anonymous (216.183.xx.xx) on Wed 29 May 2013 at 23:29
You pointed out 3 key attributes:
-easily store and retrieve items
-never lose jobs
-different languages access

But those all look fairly achievable using any database (sqlite, mysql etc). Can you describe what benefit beanstalkd (or any queueing system) would give us over using a db?
I wrote my own queueing system a few months back with mysql, but now I sort of wish that I had known about beanstalkd. However I really can't pinpoint the advantages, though I'm sure they're buried in there somewhere. Does beanstalkd do a better job of avoiding the scenario where one item is processed at the same time by two different threads?


Re: A brief introduction to the beanstalkd queue
Posted by Steve (2.120.xx.xx) on Sun 29 Sep 2013 at 15:45

I accept that doing things "by hand", via MySQL or similar, could be more convenient, and it is hard to justify precisely why to use a dedicated tool.

But that said it is simple, reliable, and fast at accepting jobs. Having queues all in one place is very useful, and the simplicity makes it easy to understand and interact with.

As you suggest, beanstalkd does go to great lengths to ensure that jobs that are fetched ("reserved") are put back into the queue if they're not released after a while (30 seconds by default), and to avoid issues with concurrency.

Steve
