Developing a Distributed Event Store at BitSight: Why We Are Moving Away From HBase
by Ethan Geil and Nick Whalen | October 30, 2015
Every day, BitSight analyzes billions of security events. Not only do we collect billions of new events per day; we also regularly re-examine all of our historical data, to provide security ratings for new companies. Storing and accessing all these events efficiently is crucial to BitSight’s mission.
We are developing a major upgrade to BitSight’s event store. This series of blog posts will describe some key aspects of our approach, and the challenges we encounter along the way. The first post will introduce the problem and outline our basic approach; subsequent posts will fill in some of the details.
To the right is a (highly simplified) diagram of BitSight’s data pipeline. We constantly collect raw events from a wide variety of sources and add them to our event store. Next, the events are associated with individual companies, using an identifying key (e.g. IP address). Finally, we take into account all the events observed for each company, in order to provide daily security ratings and other important information about that company.
We focus here on the second step of the pipeline (highlighted in red): the event store. In order to provide accurate historical ratings, we must retain all events, so we can’t simply stream the events through the pipeline and throw them away. Storing and efficiently accessing this huge volume of data creates unique challenges.
Why not HBase
In the past, we used HBase on Amazon EC2 for this type of data, but it turned out to be a less-than-ideal fit for our particular use cases. First, HBase tends to strongly couple data and computation: in the typical case, region servers are co-located with the computational tasks. (It is possible to query an HBase database located on a different cluster, but the limited bandwidth of the region servers makes this problematic.) This makes it difficult to scale and optimize storage and computation independently.
Another drawback of HBase is that it requires a cluster to always be up and running to provide data access. Yes, EBS provides data persistence when the cluster goes down, but we can't actually query that data until we spin up another EC2 cluster, attach the volumes, and restart HBase. A permanently live HBase cluster can be expensive (and wasteful, if it sits idle for much of the time).
Data locality also makes it difficult to share data between clusters and developers. Copying and loading HBase tables between production, staging, and other environments is error-prone and time-consuming.
Storing data in EBS volumes is expensive, compared to other solutions. Worse, the volumes are of fixed size: to optimize for cost, it’s necessary to keep EBS volumes dangerously close to being full; and scaling up the storage on an EBS-backed cluster is an operational headache.
One final reason why we're moving away from HBase is that we don’t need many of the key features that HBase excels at. HBase is optimized to provide efficient low-latency random PUTs and GETs—neither of which we need (see below). Thus, HBase is simply not the right tool for our jobs.
Designing an S3-based database
In contrast to EBS, Amazon S3 offers high reliability, nearly unlimited scalability, and very low cost per GB. However, it also has several limitations. Perhaps the most challenging, from the point of view of database implementation, is that S3 files are essentially immutable: there is no way to modify or even append to files without copying them in their entirety. Additionally, S3 is “eventually consistent”: the results of operations are not guaranteed to be immediately visible, or to appear in order1.
For these reasons, implementing a general full-featured key-value store (like HBase) on S3 alone would be extremely difficult. Luckily, our use cases allow for several major simplifications:
Our data pipeline needs only bulk inserts—no random PUTs.
We rarely use random GETs, so these don’t need to be particularly efficient.
Crucially, events are immutable. Once inserted, they are never modified, so we don’t have to support UPDATEs.
Each of our events has a key (e.g. IP address) and a timestamp (the time when the event occurred or was detected). The event store must efficiently support two basic types of queries. The first is
(1) An incremental query: Get all of the events that happened recently (e.g. within the past day), for all keys belonging to all the companies in our database.
We compute historical security ratings, but it doesn’t make sense to reprocess a company’s entire history on every update—the old events haven’t changed, after all. We need to examine only those events that have arrived since the last update.
On the other hand, we also constantly add new companies to our database. We must examine each new company’s entire history at least once. So the second type of query is
(2) A historical query: Get all events for a relatively small subset of companies, for all time.
These two requirements are in apparent conflict. Sorting events by timestamp would make the first query efficient, but the second would require a full scan (and vice-versa). However, there is a compromise: Partition the events into ranges of keys (“buckets”), and then sort by time within each bucket. Split the timeline for each bucket into a set of non-overlapping time ranges, and write a file for each.
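As a concrete sketch of this layout (the bucket count, hash function, and path format here are illustrative assumptions, not BitSight's actual scheme), each event key can be hashed to a stable bucket, and each batch of events written to an object whose name encodes both the bucket and its time range:

```python
import hashlib

NUM_BUCKETS = 256  # assumed number of key-range buckets


def bucket_for_key(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map an event key (e.g. an IP address) to a stable bucket number."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def object_path(bucket: int, start_ts: int, end_ts: int) -> str:
    """Compose an object name encoding the bucket and its (disjoint) time range."""
    return f"events/bucket={bucket:04d}/{start_ts}-{end_ts}.data"
```

Because the bucket and the time range both appear in the object name, a query planner can decide which objects to read just by listing names, without opening any files.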
This makes both types of queries relatively efficient. For query 1, we can just grab the “top” few files in each bucket. For query 2, we grab all the files in the bucket for the key in question. Note that both queries now return more data than requested, but this is easy to filter, and still far more efficient than scanning the entire store.
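The file-selection logic for the two queries can be sketched as follows (a simplified model, assuming each stored file is described by its bucket number and the time range it covers):

```python
from typing import List, NamedTuple, Set


class StoredFile(NamedTuple):
    bucket: int    # key-range bucket the file belongs to
    start_ts: int  # earliest event timestamp in the file (inclusive)
    end_ts: int    # latest event timestamp in the file (inclusive)


def incremental_files(files: List[StoredFile], since_ts: int) -> List[StoredFile]:
    """Query 1: the 'top' files in every bucket—those that may contain
    events newer than since_ts."""
    return [f for f in files if f.end_ts >= since_ts]


def historical_files(files: List[StoredFile], buckets: Set[int]) -> List[StoredFile]:
    """Query 2: all files, for all time, in the buckets covering
    a small subset of companies' keys."""
    return [f for f in files if f.bucket in buckets]
```

Both functions over-select (a file's time range or key range is coarser than the query), so the caller filters individual events afterward—still far cheaper than a full scan.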
This scheme also solves the S3-immutable-files problem: if we write events out in daily (or larger) batches to each bucket, those files never need to be changed again2. The next day, new, disjoint files will be created.
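The daily batch write can be sketched like this (a minimal pure-Python model, with an in-memory dict standing in for S3; the tuple format and path layout are illustrative):

```python
from collections import defaultdict


def write_daily_batch(events, store):
    """Group one day's events by bucket, sort each group by timestamp,
    and write each group as a single new object that is never modified.

    events: iterable of (bucket, timestamp, payload) tuples
    store:  dict mapping object path -> time-sorted list of events
    """
    groups = defaultdict(list)
    for bucket, ts, payload in events:
        groups[bucket].append((ts, payload))
    for bucket, batch in groups.items():
        batch.sort()  # sort by timestamp within the bucket
        start, end = batch[0][0], batch[-1][0]
        path = f"events/bucket={bucket:04d}/{start}-{end}.data"
        assert path not in store  # immutability: new files only, no rewrites
        store[path] = batch
    return store
```

Each day's write produces fresh, disjoint objects, so nothing ever needs to be appended to or rewritten in place—sidestepping S3's immutable-file constraint entirely.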
In the next post, we’ll consider some of the many details needed to make this type of store efficient and reliable.
____________________

1 Most endpoints now support read-after-write consistency, however.
2 This is not quite true, but the fraction of files changed is relatively small.