Impedance (mis)matching


In this post we’ll walk through the architecture of our Push Notification subsystem and some of the decisions behind it, introduce our open source Go library for sending pushes through the Apple Push Notification System, and wrap up with an anecdote about how we ended up DDoS’ing ourselves with what we built.

Sunsetting the Rails Monolith

Over the past year at Timehop, we broke our big monolithic Rails app into a service-based architecture, written almost entirely in Go.

Breaking a big system down into smaller parts makes it far more modular and, when done right, more available and fault-tolerant.

We ended up with more services but fewer single points of failure. There are now more potential points of failure but none of them can — or should — cause a complete halt.

One of the side effects of dividing to conquer is that communication becomes explicit. Functionality and error handling are now spread across multiple processes, over the network — which also makes them less reliable.

Impedance matching: buffering and throttling

At Timehop, we put a lot of effort into making sure that all communication between our systems is correctly buffered and/or throttled, so as to avoid internally DDoS’ing ourselves.

Whenever a system can offload some work for deferred processing by other systems, we use message queues as buffers. As those queues often grow to hold millions of records in a short amount of time, we keep a close eye on them through an extensive set of alarms.

Whenever a system needs a real-time response from another (e.g. an HTTP call or some other form of RPC), we use aggressive timeouts on the requesting side and throttling on the serving side. It’s all designed to fail fast; the requester won’t wait longer than a few seconds for a response and the server will immediately return an error if too many requests are already being served.
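To make the two halves of that concrete, here is a minimal sketch, not our production code. The names Throttle, NewThrottle, Wrap and newFailFastClient are made up for illustration, and the 503 status is just one reasonable choice for "too busy". The requester gets a hard timeout; the server refuses work outright once too many requests are in flight.

```go
package main

import (
	"net/http"
	"time"
)

// newFailFastClient returns an HTTP client that abandons any request
// (connect, send and read combined) once timeout has elapsed.
func newFailFastClient(timeout time.Duration) *http.Client {
	return &http.Client{Timeout: timeout}
}

// Throttle caps the number of requests served concurrently.
// Hypothetical helper type, used only in this sketch.
type Throttle struct {
	slots chan struct{}
}

// NewThrottle allows at most maxInFlight concurrent holders.
func NewThrottle(maxInFlight int) *Throttle {
	return &Throttle{slots: make(chan struct{}, maxInFlight)}
}

// TryAcquire claims a slot without blocking and reports whether it succeeded.
func (t *Throttle) TryAcquire() bool {
	select {
	case t.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot. Call it only after a successful TryAcquire.
func (t *Throttle) Release() {
	<-t.slots
}

// Wrap serves requests through next while a slot is free and
// immediately answers 503 when every slot is taken.
func (t *Throttle) Wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !t.TryAcquire() {
			http.Error(w, "too many requests in flight", http.StatusServiceUnavailable)
			return
		}
		defer t.Release()
		next.ServeHTTP(w, r)
	})
}
```

A server would wrap its handler once at startup, e.g. NewThrottle(100).Wrap(handler), while callers would share a single client built with a timeout of a few seconds.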

We would rather fail fast and keep the median response times low, even if it comes at a small cost in percentage of successful requests served:

  • From an infrastructure perspective, we’re in control of the internal resource usage. We decide when the system should stop responding vs let it grind itself to a halt.
  • From a UX perspective, we can often silently mask errors. When we cannot, we believe it’s still preferable to quickly show the user something went wrong over keeping her indefinitely waiting for that spinner.
  • From a service perspective, a degraded experience for some is better than no experience for all.

The push notification subsystem

We call it salt-n-pepa. I would have personally gone with static-x.

Whenever we need to send out a push notification to a Timehopper, we load up all her device tokens (one per phone) and then hit Google’s GCM or Apple’s APNS.

If you’ve never dealt with push notification systems, a device token is what Apple and Google use to uniquely identify your device(s) so that we can send you notifications.

With our monolithic system, we kept all these tokens in a PostgreSQL database, which was hidden behind the niceties of Rails’ ActiveRecord. Grabbing the Apple device tokens for a user was as easy as calling a method on a User object: user.valid_apns_tokens.

As the need arose to perform the same tasks from multiple parts of our shiny new (but incredibly lean and minimalist) Go stack, multiple problems became apparent:

  • Duplicate code in different languages: higher effort to maintain two codebases.
  • Tight coupling to the database: some systems required a connection to the database for the sole purpose of loading tokens to send pushes.
  • Harder cache management: if cache invalidation on its own is often very tricky, then distributed cache invalidation is a nightmare.
  • Difficult upgrades: whenever the logic changed, we’d have to upgrade not only the different codebases but all the different systems using that code. The more independent moving parts you have, the harder this procedure is.

To solve those problems, we created a black-box service, salt-n-pepa, that has message queues as entry points. Messages (or tasks) in this queue are JSON documents whose most notable fields are a target user ID, some content and, optionally, a delivery time (so that it supports scheduling for future delivery as well as immediate sending).
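As a rough illustration of what such a task could look like on the Go side, here is a sketch. The type name PushTask, the JSON keys (user_id, content, deliver_at) and the RFC 3339 timestamp format are assumptions made for this example, not the actual wire format.

```go
package main

import (
	"encoding/json"
	"time"
)

// PushTask mirrors the three fields described above.
// Field names and JSON keys are hypothetical.
type PushTask struct {
	UserID    int64      `json:"user_id"`
	Content   string     `json:"content"`
	DeliverAt *time.Time `json:"deliver_at,omitempty"` // nil means "send immediately"
}

// Scheduled reports whether the task asks for delivery at a later time.
func (t PushTask) Scheduled() bool {
	return t.DeliverAt != nil
}

// decodePushTask parses one JSON document read off the queue.
func decodePushTask(raw []byte) (PushTask, error) {
	var task PushTask
	err := json.Unmarshal(raw, &task)
	return task, err
}
```

Using a pointer for the delivery time keeps "no delivery time given" distinct from any real timestamp.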

The moving parts

Internally, the push system has multiple components, each with a single, very well defined responsibility.

  • The Demuxer: The entry point into the push system. It reads push notification jobs (the messages described above) off of a queue, loads all the valid device tokens for both APNS and GCM and, for each token, queues a push to be sent immediately by the appropriate Pusher. If the push is scheduled for future delivery, it instead goes into a timestamp-based set so that the Deschedulers can move it to the appropriate Pusher queue when the time comes. A single push notification job may end up generating multiple pushes if the user has Timehop installed on more than one device.
  • The APNS & GCM Deschedulers: At the right time, they transfer pushes scheduled for future delivery to the appropriate Pusher queue (APNS or GCM).
  • The APNS Pusher: Converts the contents of a message into APNS format and sends it down Apple’s Push Notification System. This is a fire-and-forget system, with no feedback on message delivery status. This process uses our open source Go APNS library, which we’ll cover ahead.
  • The GCM Pusher: Converts the contents of a message into GCM format and sends it down Google’s Cloud Messaging platform. This system is synchronous in the sense that for every request that hits GCM, we know whether the push was successfully scheduled or whether the token is invalid. When a token is invalid, the GCM Pusher queues an invalidation for the GCM Invalidator.
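The hand-off between the Demuxer and the Deschedulers boils down to "add with a timestamp, later pop everything that is due". The sketch below shows that idea with an in-memory heap standing in for the real timestamp-based set. The Schedule type and its methods are invented for illustration, timestamps are plain Unix seconds, and the type is not safe for concurrent use.

```go
package main

import "container/heap"

// scheduledPush is one push waiting for its delivery time.
type scheduledPush struct {
	deliverAt int64 // Unix seconds
	payload   string
}

// pushHeap orders pushes by delivery time, earliest first.
type pushHeap []scheduledPush

func (h pushHeap) Len() int            { return len(h) }
func (h pushHeap) Less(i, j int) bool  { return h[i].deliverAt < h[j].deliverAt }
func (h pushHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *pushHeap) Push(x interface{}) { *h = append(*h, x.(scheduledPush)) }
func (h *pushHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// Schedule is a stand-in for the timestamp-based set (hypothetical type).
type Schedule struct {
	pending pushHeap
}

// Add is what a Demuxer would call for a push with a future delivery time.
func (s *Schedule) Add(deliverAt int64, payload string) {
	heap.Push(&s.pending, scheduledPush{deliverAt: deliverAt, payload: payload})
}

// PopDue is what a Descheduler would call periodically: it removes and
// returns every payload whose delivery time is at or before now,
// earliest first, leaving later pushes in place.
func (s *Schedule) PopDue(now int64) []string {
	var due []string
	for s.pending.Len() > 0 && s.pending[0].deliverAt <= now {
		due = append(due, heap.Pop(&s.pending).(scheduledPush).payload)
	}
	return due
}
```

Whatever PopDue returns would then be appended to the matching Pusher queue for immediate sending.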

Aside from these, there are also a few other components related to token registration and invalidation.

  • The APNS Invalidator: Periodically connects to APNS to download a list of invalid tokens and update our Apple device token records.
  • The GCM Invalidator: Reads off of the GCM token invalidation queue (populated by the GCM Pusher) and updates our GCM device token records.
  • The Registrar: Reads off of the device token registration queue (populated by other subsystems that want to register new device tokens for users) and updates the device token records for the user.

With this system we send, on average, 25 million push notifications every day.

Timehop’s Go APNS library

One of the hardest parts of this whole system was writing the actual code that talks to APNS to send the pushes.

Whereas with GCM you perform an HTTP request and immediately know the results, Apple took a less common approach in which you have to open a TLS connection and adopt their binary protocol. You write bytes to a socket instead of sending HTTP POSTs to a Web server. To gather feedback on which tokens are now invalid, you have to open up a separate connection to receive this information.

As we looked for good libraries, we realized the landscape was grim, so we decided to roll our own, which features:

  • Long Lived Clients: Apple’s documentation states that you should hold a persistent connection open as opposed to creating a new one for every payload.
  • Use of v2 Protocol: Apple came out with v2 of their API with support for variable length payloads. This library uses that protocol.
  • Robust Send Guarantees: APNS gives asynchronous feedback on whether a push was sent. That means that if you send pushes after a bad send, those pushes will be lost forever. Our library records the last N pushes, detects errors, and is able to resend the pushes that could have been lost. You can learn more about this here.
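To give a feel for the "record the last N pushes" idea, here is a simplified sketch. It is not the library's actual code: SentBuffer, Record and SentAfter are names invented for this example, and it assumes that the error coming back from APNS carries the identifier of the push that failed.

```go
package main

// sentPush is a push that has already been written to the connection.
type sentPush struct {
	ID      uint32
	Payload string
}

// SentBuffer remembers the most recent pushes, oldest first.
// Hypothetical type, used only in this sketch.
type SentBuffer struct {
	capacity int
	items    []sentPush
}

// NewSentBuffer keeps up to capacity pushes (at least one).
func NewSentBuffer(capacity int) *SentBuffer {
	if capacity < 1 {
		capacity = 1
	}
	return &SentBuffer{capacity: capacity, items: make([]sentPush, 0, capacity)}
}

// Record stores a push right after it is sent, evicting the oldest
// entry once the buffer is full.
func (b *SentBuffer) Record(id uint32, payload string) {
	if len(b.items) == b.capacity {
		copy(b.items, b.items[1:])
		b.items = b.items[:len(b.items)-1]
	}
	b.items = append(b.items, sentPush{ID: id, Payload: payload})
}

// SentAfter returns a copy of every push recorded after the one with
// failedID, oldest first. These are the candidates for resending.
// The boolean is false when failedID has already been evicted, in
// which case the caller cannot tell which pushes were lost.
func (b *SentBuffer) SentAfter(failedID uint32) ([]sentPush, bool) {
	for i, p := range b.items {
		if p.ID == failedID {
			out := make([]sentPush, len(b.items)-i-1)
			copy(out, b.items[i+1:])
			return out, true
		}
	}
	return nil, false
}
```

The size of the buffer is the trade-off: it has to cover however many pushes can go out between a bad send and the moment its error is noticed.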

So head on to the GitHub project page and give it a spin!

How we DDoS’ed ourselves with pushes

Every day, the system that prepares your next Timehop day (briefly discussed in this other article) enqueues about 15 million push notifications to be sent shortly before 9am in your local time zone. This scheduling is randomized within a 30 minute window, so that for every time zone, we get an evenly distributed traffic pattern, as opposed to a massive influx of traffic when everyone opens the app at the exact same time.

All this is performed far in advance of the actual push being sent so we end up queueing plenty of messages, which the de-schedulers will then move on to the appropriate queues to be sent immediately when the time comes. It’s normal to have a few million entries scheduled for later delivery.

The actual sending on the APNS side is pretty fast. It takes about 2ms to run a full cycle — pop a notification from the queue and send it to Apple’s Push servers. Rinse and repeat.

We run a single process, in a single machine, with 50 workers (each in its own goroutine). It’s so fast that its queue never backs up, no matter what we throw at it.

It’s one of those things that has been so reliable for so long that you kind of forget about it when there are other fires to put out. So reliable and fast we forgot to put alarms in place for the case when its queue starts backing up.

And then it got fun.

What goes around, comes around

We never really put thought into limiting the outbound rate of our pushes — as long as Apple could handle it, we’d hammer them.

What we naively overlooked was the fact that pretty much every push we send causes an indirect hit on our client-facing API, as the users open the app.

The morning push: nobody can resist opening the app after one of these.

The higher the volume of immediate pushes sent, the higher the potential volume of hits on our API.

A week ago, due to a certificate problem with our APNS pusher, each of the 50 workers running on the APNS Pusher slowly started to die. We didn’t really notice anything as, even with just a couple workers left, we were still keeping up with the rate at which pushes were being generated.

Then, the last worker died. No more APNS pushes were sent.

While we did not have an alarm in place, the unusually low morning traffic that our dashboards were showing was not a good sign — that and the fact that we didn’t get our own morning pushes either.

We investigated, reached the natural conclusion that the APNS Pusher was dead (at that point, the queue held over 6 million pushes and was still growing), and restarted it.

Within 30 minutes, our client-facing API error rates went up by 60% and our inbound traffic went up nearly 3x. When we looked at the push queue, it was empty. Over 6 million pushes had been sent in under 40 minutes. Most of those went to people who actually opened Timehop and hit our servers.

An incredibly simple rate limiter

All it took for this to never happen again were a few lines of code. The algorithm is pretty simple:

  • Each worker, running on its own goroutine, has access to a rate limiter
  • Whenever they’re about to begin a cycle, they query the rate limiter
  • If the rate limiter is over the threshold, the worker sleeps for a bit
  • If the rate limiter is under the threshold, the worker performs a work cycle
  • Every minute, another worker resets the rate limiter

Kinda like pushing the button in LOST.

Here’s what it looks like:

import "sync/atomic"

func NewLimiter(limit int64) *Limiter {
  return &Limiter{limit: limit}
}

type Limiter struct {
  limit   int64
  counter int64
}

// Atomically increments the underlying counter
// and returns whether the new value of counter
// is under the limit, i.e. whether the caller should
// proceed or abort.
func (s *Limiter) Increment() bool {
  return atomic.AddInt64(&s.counter, 1) <= s.limit
}

// Atomically resets the value of the counter to 0.
func (s *Limiter) Clear() {
  atomic.StoreInt64(&s.counter, 0)
}

The limiter is then shared across all the workers (goroutines) and whenever they’re about to begin a new cycle, they simply test whether they can proceed:

func (s *apnsWorker) workCycle() bool {
  if !s.limiter.Increment() {
    return false
  }
  // ...
}

Lastly, another goroutine calls Clear() on this shared Limiter every minute, which allows the workers to begin sending pushes again.
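Putting the pieces together, a self-contained sketch of the whole loop could look like the following. The Limiter is the one shown above; resetEvery and drain are hypothetical helpers written for this example, and unlike our real worker, drain takes a job off the channel before checking the limiter.

```go
package main

import (
	"sync/atomic"
	"time"
)

// Limiter counts work cycles and compares the count against a limit.
type Limiter struct {
	limit   int64
	counter int64
}

func NewLimiter(limit int64) *Limiter {
	return &Limiter{limit: limit}
}

// Increment atomically bumps the counter and reports whether the
// caller is still within the limit.
func (l *Limiter) Increment() bool {
	return atomic.AddInt64(&l.counter, 1) <= l.limit
}

// Clear atomically resets the counter to 0.
func (l *Limiter) Clear() {
	atomic.StoreInt64(&l.counter, 0)
}

// resetEvery clears the limiter once per interval until stop is closed.
// Run it in its own goroutine; interval must be greater than zero.
func resetEvery(l *Limiter, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			l.Clear()
		case <-stop:
			return
		}
	}
}

// drain is one worker: for each job it waits, sleeping backoff between
// attempts, until the limiter lets it through, then calls send.
// It returns once jobs is closed and empty.
func drain(l *Limiter, jobs <-chan string, send func(string), backoff time.Duration) {
	for job := range jobs {
		for !l.Increment() {
			time.Sleep(backoff)
		}
		send(job)
	}
}
```

In a setup like ours you would start one resetEvery goroutine with a one minute interval and then launch every worker as go drain(...) against the same Limiter, so the limit applies to all of them combined.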

A final note

When going distributed you’ll invariably run into throughput impedance mismatches. Make sure you dedicate some time to understand how every part of your system will affect the next and how you can use different techniques, such as the ones we talked about in this article, to help mitigate the effects.

Oh, and always keep an eye out for how outbound traffic can get back at you so you don’t end up nuking yourself like we did! 😬