What do you do when you fail at a task?

What do you do when your computer fails at a task?

What about your code?

As programmers we deal with flaky network connections, crashing data stores, and user input day in and day out. It’s important to remember that all of these things will fail, and our code should be prepared to get back on the horse, turn itself off and on again, and try again.

Retrying Bundler

In the Ruby buildpack for Heroku we deal with a staggering array of uncertainties, so anything we can do to ensure a smooth experience is a good thing. About 2 months ago JD opened an issue requesting that we retry failed bundle installs. It’s something Travis does, and a network hiccup that lasts a split second is no reason to cancel an entire app compile. Besides, all a user would do is simply try again. After talking the issue over with Terence Lee we decided to take on the problem. Instead of solving it locally for the buildpack, though, we made the conscious decision to push the change upstream into Bundler itself. This means that the Ruby buildpack gets retry logic for free, as does every Bundler user, which is what I like to call a win-win-win scenario. About a month after the initial PR, bundler/bundler#2601 was merged.

This patch allows several network operations to be retried, and even better, it defaults those operations to be retried twice (so the operation will run up to 3 times by default). To skip these retry attempts you would need to explicitly tell it to retry 0 times:

$ bundle install --retry 0

While I’m excited about this behavior and the general interface for Bundler, I’m more excited that it gives the project an extensible and reusable retry class. For now we’re only retrying the fetching of gemspecs and git commands, but we can easily extend any part of the Bundler code to retry cleanly and reliably.
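To make the idea of a reusable retry class concrete, here is a minimal sketch. It is not Bundler's actual implementation: the class name `SimpleRetry` and its `retries:` and `on:` parameters are invented for illustration. It mirrors the semantics described above, where 2 retries means the operation runs up to 3 times.

```ruby
# Hypothetical helper for illustration; NOT Bundler's retry class.
class SimpleRetry
  attr_reader :attempts_made

  # retries: how many extra attempts are allowed after the first failure
  # on:      exception classes that are considered retryable
  def initialize(retries: 2, on: [StandardError])
    @retries = retries
    @on = on
    @attempts_made = 0
  end

  # Runs the block and re-runs it on a retryable error until the retry
  # budget is spent, then re-raises the last error.
  def attempt
    @attempts_made = 0
    begin
      @attempts_made += 1
      yield @attempts_made
    rescue *@on
      retry if @attempts_made <= @retries
      raise
    end
  end
end

# Usage: a fake network call that fails twice, then succeeds.
calls = 0
result = SimpleRetry.new(retries: 2, on: [IOError]).attempt do
  calls += 1
  raise IOError, "network hiccup" if calls < 3
  "fetched"
end
puts result
```

Keeping the list of retryable exceptions explicit means a genuine bug, such as a `NoMethodError`, fails immediately instead of being retried.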

Retry Culture

This isn’t the first bit of retry code I’ve worked on. About 8 months ago I wrote rrrretry, which monkey patches Enumerable to allow retry behavior:

require 'rrrretry'

[0, 1, 2].each.retry { |i| 1/i }
  # => 1
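For readers who would rather not monkey patch, the same idea can be sketched as a plain method. This is not how rrrretry is implemented. It assumes the example above does what its `# => 1` comment suggests, moving on to the next element whenever the block raises. The name `first_success` and its `on:` parameter are hypothetical.

```ruby
# Hypothetical helper: returns the block's result for the first element
# that does not raise. Re-raises the last error if every element fails.
def first_success(candidates, on: StandardError)
  last_error = nil
  candidates.each do |candidate|
    begin
      return yield(candidate)
    rescue on => e
      last_error = e
    end
  end
  raise last_error if last_error
  nil
end

# 1 / 0 raises ZeroDivisionError, so the next element is tried.
puts first_success([0, 1, 2]) { |i| 1 / i }
  # prints 1
```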

It’s a small library that I’ve used in several projects and find incredibly helpful. Since writing it I’ve found that many of the tools I use every day actually have retry behavior baked in. “Like what?” you might ask. Well, have you ever used curl? It has a --retry option (defaulted to zero).

$ curl https://www.schneems.com --retry 3

Then there’s Net::HTTP::Persistent, and my personal favorite Ruby HTTP lib, Excon:

Excon.get("https://www.schneems.com", idempotent: true, retry_limit: 6)

Even, strangely enough, tail:

$ tail ./log/development.log --retry

Then there’s retry logic buried deep in things we use every day, like database connections.

Not just for Production

Sure, it’s important to keep failure out of your production environments, but I’ve found it to be equally important in the Ruby buildpack’s testing environment. We use rspec-retry to automatically re-run any failed tests. The buildpack uses full stack integration tests (it deploys real apps on Heroku using a tool called Hatchet, which is testing-framework agnostic), and any number of network effects can easily add up to create a heisenfailure. If the same test fails twice in a row, it’s much more likely that it’s a result of the code and not the network.

The deploy process is an especially network-sensitive time, so in addition to rspec-retry, the Hatchet library can retry deploys:

$ HATCHET_RETRIES=3 bundle exec parallel_rspec -n 7 spec/

While this doesn’t guarantee we won’t see false failures due to the network, it drastically reduces the chances and helps bump the signal-to-noise ratio of our tests.

Idempotent

Idempotent, pronounced ˈī-dəm-ˌpō-tənt, is the idea that when your code is run again, the result should not change. If part of your code succeeds, part of it fails, and the whole thing re-runs, will everything work out? Network connections for most GET requests are easy to retry. When doing more complicated work or manipulating data inside of data stores, you should use transactions to avoid partially applied changes. While the entirety of the code being retried does not need to be idempotent, the individual pieces do.

When in doubt only retry the smallest amount of code possible.
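As a rough illustration of why this matters, here is a sketch using an in-memory stand-in for a data store. The `Ledger` class, its methods, and the `"order-42"` key are all hypothetical. The technique shown (remembering which keys were already applied) is one common way to make a write idempotent. It is offered alongside the transactions mentioned above, not as a replacement for them.

```ruby
require "set"

# Hypothetical in-memory stand-in for a data store.
class Ledger
  attr_reader :balance

  def initialize
    @balance = 0
    @applied = Set.new
  end

  # Not idempotent: every call changes the balance again.
  def credit!(amount)
    @balance += amount
  end

  # Idempotent: a repeated call with the same key is a no-op.
  def credit_once!(key, amount)
    return @balance if @applied.include?(key)
    @balance += amount
    @applied << key
    @balance
  end
end

ledger = Ledger.new
attempts = 0
begin
  attempts += 1
  ledger.credit_once!("order-42", 10)
  # Simulate the connection dropping after the write already succeeded.
  raise IOError, "connection dropped" if attempts == 1
rescue IOError
  retry
end
puts ledger.balance
  # prints 10; with credit! the retry would have produced 20
```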

Theirs but to do and Retry

If you’re interested in adding some more retry logic to your own code, Jordan Sissel has some patterns: Retry on Failure Ruby software pattern. You can also check out my rrrretry gem, or one of its many, many compatriots. You can even try writing your own (it’s a fun kata).

While for some failure is not an option, making your code embrace, extend, and retry can give your programs a second chance, literally.


Richard @schneems works for Heroku and sometimes writes books, teaches at the University of Texas and runs Code Triage. If you didn’t enjoy this article maybe you should come back later and give his writing a (re)try.