By Najaf Ali
Almost all non-trivial web applications integrate with other applications at some point. In this article we're going to talk about some of the problems integrating with APIs lead to and how to mitigate or avoid those issues.
The functionality you get from API providers is the same as the functionality you build for yourself apart from one key difference: you have no control over them. Here are some problems that this leads to:
- Providers can go out of business. Startups come and go all the time. The best service for X functionality this year may be filing for bankruptcy in a few years' time.
- Providers can deprecate APIs. Providers change their APIs all the time. While most are good about not shipping breaking changes to old APIs (or at least give ample notice), they can and do deprecate older versions whenever they feel like it.
- Providers can have downtime. Even the most apparently stable services will go down from time to time.
- Providers can experience security breaches. Even the most established and apparently secure web applications get breached eventually.
We can't prevent these problems from affecting our applications. We can, however, take steps to minimize the damage they cause.
Don't access third-party APIs in the request-response cycle
Problem this helps with: provider downtime.
Avoid making requests to an API while your application is in the process of serving a web request. If the API is experiencing downtime, at best you'll need to wait until it times out before being able to return a response to your user. At worst, your application won't handle the error and will show the user a confusing error screen.
Instead try to make all access to third-party APIs in jobs run in a separate process to the one serving web requests. That process can do complex, multi-stage interactions with an API without affecting response-time for users. This process can also fail and retry the job with exponential backoff in the event of downtime in the API. This is usually accomplished by queue/worker libraries available for popular web frameworks (e.g. Resque, Sidekiq or Delayed Job for Rails).
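As a rough illustration of retrying with exponential backoff, here is a minimal Python sketch. It is not how Resque, Sidekiq or Delayed Job implement retries; the names `ProviderUnavailable` and `run_with_backoff` are made up for this example, and a real queue library would normally schedule the retry for you rather than sleep inside the worker.

```python
import time


class ProviderUnavailable(Exception):
    """Hypothetical error a job raises when the third-party API is down."""


def run_with_backoff(job, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run job(); after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return job()
        except ProviderUnavailable:
            if attempt == max_attempts - 1:
                # Out of attempts: re-raise so monitoring can report the failure.
                raise
            # Delays grow 1s, 2s, 4s, ... so a struggling provider is not hammered.
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is only there so the waiting can be swapped out, for example in tests.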
If you're lucky, the changes you make through the API will be idempotent. In this case that means you can make the same request to the API as many times as you like and have the same end result as if you had made it only once. Having idempotence means that you can be a bit less careful about how you write the code to go in your job queue. Jobs in your queue can be retried, and this will have no adverse effect if all access to the API is idempotent.
If you can't have idempotent API access, then try to make the jobs in your queue idempotent as a whole. If you're creating data for example, check that a resource with the same data hasn't been created already. If there is a failure at any stage of the job, you should be able to retry it with no negative effects.
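One way to make a job idempotent as a whole is to look for the resource before creating it. The sketch below assumes a provider that lets you look customers up by email; `FakeProvider`, `find_customer` and `create_customer` are hypothetical stand-ins, not any real client library.

```python
class FakeProvider:
    """In-memory stand-in for a third-party API client (hypothetical)."""

    def __init__(self):
        self.customers = {}
        self.create_calls = 0

    def find_customer(self, email):
        return self.customers.get(email)

    def create_customer(self, email, name):
        self.create_calls += 1
        customer = {"email": email, "name": name}
        self.customers[email] = customer
        return customer


def create_customer_job(provider, email, name):
    """Safe to retry: only creates the customer if it does not exist yet."""
    existing = provider.find_customer(email)
    if existing is not None:
        return existing
    return provider.create_customer(email, name)
```

Running `create_customer_job` twice with the same arguments leaves a single customer record. Note that check-then-create can still race if two workers run the same job at the same moment, so this is a starting point rather than a guarantee.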
At the very least, have only one thing that can fail in a given job. Here's an example situation where not following this advice turns out badly:
A job in a queue has a list of thirty email addresses to iterate over and send emails to. It needs to send this particular email exactly once to each address.
On the first run it successfully sends emails to addresses 1 to 27. Attempting to send an email to address 28 results in an error, causing the job to terminate and report an error.
The job queueing/monitoring system detects the error and schedules the job to be retried.
The job is retried and emails are sent to addresses 1 to 27 again.
Repeat until customers 1-27 are screaming down the phone at you because of all the spurious emails you're sending them.
To avoid this type of problem, you could record the delivery of a particular email to a particular address in a server side data store and check against it on subsequent attempts to send a given email. A simpler solution would be to have a single job for each email delivery i.e. only one thing that can fail in a given job.
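The "one job per delivery" approach might look something like the following sketch. The queue here is a plain in-memory `deque` and `send` is whatever function talks to your email provider; both are assumptions made for illustration, not the API of any real queue library.

```python
from collections import deque


def enqueue_deliveries(queue, addresses, message):
    """Enqueue one job per address, so each job has one thing that can fail."""
    for address in addresses:
        queue.append({"address": address, "message": message, "attempts": 0})


def work(queue, send, max_attempts=3):
    """Process jobs until the queue is empty; return jobs that never succeeded."""
    failed = []
    while queue:
        job = queue.popleft()
        try:
            send(job["address"], job["message"])
        except Exception:
            # Only this one delivery is retried; the others are unaffected.
            job["attempts"] += 1
            if job["attempts"] < max_attempts:
                queue.append(job)
            else:
                failed.append(job)
    return failed
```

With this structure, an error on address 28 causes only that delivery to be retried. Addresses 1 to 27 each receive the email once.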
Use one layer of indirection around API access
Problems this helps with: provider deprecation/end of life.
By "layer of indirection", we mean an object, function, closure or whatever tool your programming environment gives you for creating abstractions. In object oriented programming we might use an adapter. This could be an object with provider-agnostic methods that delegates work to a provider API client library.
Using an adapter like this helps to reduce the amount of work required when a provider changes their API or you need to move to a new provider. Rather than littering your code with provider-specific method calls, you only access the API through your adapter. Updating your code to work with any API changes should then be as simple as implementing another adapter.
In practice, the semantics and timing of the interaction with a new API will probably be different after the API changes, so it won't be quite this simple. Writing the adapter will, however, focus you on exactly what your requirements are of the API and should make the transition at least a little bit more straightforward.
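To make the adapter idea concrete, here is a small sketch. `AcmeSmsClient` and its `dispatch_text` method are invented to play the role of a provider's client library; the point is only that application code such as `notify_order_shipped` talks to the provider-agnostic `send_sms` method and never to the provider's own names.

```python
class AcmeSmsClient:
    """Hypothetical provider client library with provider-specific naming."""

    def __init__(self):
        self.outbox = []

    def dispatch_text(self, msisdn, body):
        self.outbox.append((msisdn, body))
        return {"status": "queued"}


class AcmeSmsAdapter:
    """Adapter: exposes provider-agnostic methods, delegates to the client."""

    def __init__(self, client):
        self._client = client

    def send_sms(self, to, text):
        result = self._client.dispatch_text(msisdn=to, body=text)
        return result["status"] == "queued"


def notify_order_shipped(sms, phone_number):
    """Application code depends only on the adapter's send_sms method."""
    return sms.send_sms(phone_number, "Your order has shipped.")
```

Moving to another provider would then mean writing a second adapter with the same `send_sms` method and passing it in, leaving `notify_order_shipped` untouched.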
Wrap all API access in a feature toggle
Problem this helps with: provider downtime.
A feature toggle is a setting that allows you to turn a feature on or off in a running application without deploying any new code. They're sometimes implemented as a database-backed option that can be modified from the admin panel of a web application. They can also be implemented using environment variables.
If accessing an API results in notifications of some sort (e.g. emails or text messages) then it can be extremely useful to have an off switch that turns off all access to the given API. In the case where your application is sending thousands of spurious notifications to all of your customers, you want a braindead simple way of turning it off close to hand.
The feature toggle should be consulted at the following points in your integration:
- Before API access within the request-response cycle.
- Before enqueuing any jobs that will result in API access.
- Before actual API access inside a running job.
If you happen to attempt API access within the request-response cycle, having the feature toggle in place means that you'll be able to turn off access during provider downtime.
As jobs may have been enqueued before the feature toggle was turned off, you'll want to check against the toggle from within jobs as well.
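Here is a minimal sketch of an environment-variable toggle checked both before enqueuing and inside the job. The variable name `SMS_API_ENABLED`, the helper names and the "on unless explicitly switched off" default are all assumptions made for this example.

```python
import os


def api_enabled(name, environ=os.environ):
    """Return False only when NAME_ENABLED is set to an 'off' value."""
    value = environ.get(f"{name}_ENABLED", "true")
    return value.strip().lower() not in ("0", "false", "off")


def enqueue_sms(queue, to, text, environ=os.environ):
    """Check the toggle before enqueuing a job that will hit the API."""
    if not api_enabled("SMS_API", environ):
        return False
    queue.append((to, text))
    return True


def perform_sms(job, send, environ=os.environ):
    """Check again inside the job: it may have been enqueued before the switch."""
    if not api_enabled("SMS_API", environ):
        return False
    send(*job)
    return True
```

A job enqueued while the toggle was on is skipped by `perform_sms` once `SMS_API_ENABLED=false` is set. Whether a skipped job should be dropped or retried later is a decision for your application.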
Rotate keys regularly
Problem this helps with: provider security breaches.
Point-to-point API authentication is usually achieved with a set of API keys intended to be kept secret. In practice there are plenty of opportunities for API keys to inadvertently be made public. Developers pass them around over email, instant message and file uploads. They might accidentally commit API credentials into version control.
While putting procedures in place to discourage these mishaps is probably a good idea, you can minimize the effect of your keys being compromised by sticking to a regular rotation schedule.
This means regularly:
- Generating new API keys
- Using your new keys in your application
- Invalidating old API keys
Providers vary in their support for key rotation. To avoid any gap in service they need to provide a way for you to have at least two sets of valid keys you can use for authentication at a time.
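The rotation steps can be sketched against a pretend provider that allows more than one active key at a time. `FakeKeyProvider` and its methods are hypothetical; real providers expose key management in very different ways, often only through a dashboard.

```python
import secrets


class FakeKeyProvider:
    """Hypothetical provider that allows several active keys at once."""

    def __init__(self):
        self.active_keys = set()

    def create_key(self):
        key = secrets.token_hex(16)
        self.active_keys.add(key)
        return key

    def revoke_key(self, key):
        self.active_keys.discard(key)

    def is_valid(self, key):
        return key in self.active_keys


def rotate_key(provider, app_config):
    """Generate a new key, switch the application over, then revoke the old key."""
    old_key = app_config["api_key"]
    new_key = provider.create_key()      # 1. both keys are valid at this point
    if not provider.is_valid(new_key):
        raise RuntimeError("new key was not accepted; keeping the old one")
    app_config["api_key"] = new_key      # 2. application now uses the new key
    provider.revoke_key(old_key)         # 3. old key stops working
    return new_key
```

Because the old key is only revoked after the application has switched, the application holds a valid key at every step.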
Rotating API keys regularly has the benefit of making key rotation a known, regular process that your team is used to performing. This is important because you really don't want to learn how to do key rotation with a given provider on the same day that they've experienced, say, a major security breach.
Back up all data stored with your provider
Problems this helps with: provider downtime/end of life/security breaches.
Assume that your provider can go out of business at any moment. You probably don't need to plan every last detail of recovery from catastrophic failure. At the same time, you want to have all the resources at hand to recover your systems should parts of them start to fail. Backups are essential for recovering from that situation. As long as you have backups, you might in the worst-case scenario be able to process your users' requests by hand. Without backups, however, you really do have to close up shop until the situation is resolved.
Make daily backups of all data you store with a third-party API and persist them with at least two cloud-based file stores. Disk space is cheap so it will likely take a long time before you need to worry about pruning backups.
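As a sketch of the "write every backup to two places" idea, the function below saves a dated JSON export into each of several directories and verifies each copy by checksum. The directories stand in for two cloud file stores, and `records` is assumed to be data you have already exported from the provider; uploading to a real store would use that store's own client library.

```python
import hashlib
import json
from datetime import date
from pathlib import Path


def backup_export(records, store_dirs, today=None):
    """Write one dated JSON backup into every store and verify each copy."""
    today = today or date.today()
    payload = json.dumps(records, sort_keys=True, indent=2).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    written = []
    for store in store_dirs:
        target = Path(store) / f"provider-backup-{today.isoformat()}.json"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
        # Read the copy back so a truncated or corrupt write is caught now.
        if hashlib.sha256(target.read_bytes()).hexdigest() != digest:
            raise IOError(f"backup verification failed for {target}")
        written.append(target)
    return written
```

Running this once a day from a scheduled job gives one file per day in each store.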
Get set up to receive security announcements
Problems this helps with: provider security breaches.
When providers experience security breaches they're extremely inconsistent in the way they report them.
In the best case they will have a dedicated security mailing list where you'll receive timely notification of a breach, details of the extent to which your data was compromised and specific instructions for mitigating the risks the breach will have exposed you to. In the worst case scenario there might be no notification at all (another reason for you to rotate your keys regularly). There might be a few mentions of it on Twitter or an article a few months later on the company's engineering blog.
Make it easy for your team to know when a security breach has happened. If you use an IM platform like Slack, set up a channel that gets notified when a new post appears on the company blog. Get notifications of any Twitter activity around the company in the same channel. Periodically Google for "security breach [provider name]" to make doubly sure (you could do this at the same time as your API key rotation).
You should use APIs
This article makes it sound like integrating with third-party APIs is more effort than it's worth. In reality the ecosystem of APIs we have available to us allows us to make immensely powerful systems.
Depending on systems outside of your control does, however, expose you to a level of risk. Applying a bit of forethought and spending regular time on maintenance allows you to mitigate that risk and better respond to issues when they inevitably arise.