This is the final entry in a three-part series on BitSight’s new Event Store. In the first and second posts, we described some key components of the architecture. Because of the limited number of access patterns we had to support (bulk inserts, mostly in chronological order; full scans, coarsely filtered by key range and time), we were able to implement a simple NoSQL-style database, using flat Parquet files on Amazon’s S3 as the storage layer.
Last month, we shipped the first version of the Event Store, and it’s been running in production since then. So far, we’ve been very happy with the results. In this post, we’ll relate a few of our experiences and lessons learned during the development process.
Many of the difficulties we faced were due to S3’s unique characteristics:
Slow S3 Lists
Each of our Parquet files is stored as a separate S3 object, and there are hundreds of thousands of these. We quickly learned that naively querying S3 to find our objects took a huge amount of time (especially using Hadoop’s built-in FileSystems). In fact, early versions of our code spent more time listing files and getting file metadata than in actually processing the data.
We now keep track of all of the objects and metadata ourselves, in flat manifest files. This has the additional benefit of ensuring data integrity: we notice when files are missing (due to eventual consistency, or otherwise), and we can either wait or repair the damage.
Adjusting our code and configuration for efficient, reliable writes to and reads from S3 required some trial and error (as official tuning guidance is sparse).
For example, an early version of our sort-merge join algorithm (for aggregating and deduplicating records upon insertion into the store) simultaneously kept open a large number of “file” handles to S3, which resulted in sporadic, hard-to-debug connection timeouts and connection pool exhaustion. We reworked the algorithm to open files only when absolutely necessary, and to close them aggressively.
Downstream, we use Hadoop’s s3a filesystem in client mapreduce jobs that ingest the Parquet files from the store. Initially, we saw an alarming fraction of tasks fail on connection timeouts and broken streams. Increasing the maximum number of connection threads (fs.s3a.threads.max) and thread lifetime (fs.s3a.threads.keepalivetime) eliminated these problems.
S3 bandwidth to EC2
One of our main concerns was whether we would see a performance decrease because we were reading data from S3, rather than local HDFS (which violates one of Hadoop’s original core principles), and we spent a lot of time testing S3’s bandwidth and latency. Happily, in the end, we actually observed a slight performance increase. Our workload is largely CPU-bound, and pulling data from S3 meant that the HBase regionservers were no longer competing for CPU time.
At times, these and other difficulties made us question the wisdom of implementing our own database, especially on such an unconventional storage system. Once we worked out most of the major problems, however, we began to realize a number of significant benefits.
With our previous HBase store, it was very difficult for developers and data scientists to access our data. Records were serialized using custom Java code, which meant that any consumers had to use the same classes; getting the data into third-party tools required time-consuming ETL. Additionally, HBase was too delicate to risk hitting the production servers with additional queries, so for anyone else to use the data, it had to be copied to a separate HBase cluster, which was cumbersome and expensive.
Now, everyone on our team can safely read the same data that production uses, from anywhere, even laptops. Many tools can ingest Parquet files; in particular, Spark SQL makes it fast and easy to run ad-hoc queries and analyses (one of our developers even wrote a HipChat plugin to query the Event Store). Since the files are on S3 (with read-only access for everything except the jobs that produce them), we don’t have to worry about destabilizing our production servers.
Now that our Event Store is on S3, no critical data is stored on our EC2 instances. This means that we can use ephemeral (rather than EBS) storage for our clusters, which significantly reduces cost and deployment complexity. It also means that we no longer have to do error-prone, time-consuming copies of data between clusters. We can run our data pipelines with an EMR-style workflow: create a cluster, read the data from S3, do some computation, store the results elsewhere, and then destroy the cluster. Short-lived clusters are easier to maintain: if they break, we throw them away and start fresh.
Moving from a NoSQL store to flat Parquet files on S3 did require substantial engineering effort, and we encountered many unexpected challenges along the way. However, the benefits in operational simplicity, storage costs, and accessibility of the data have been worthwhile. We are continuing to improve the performance and flexibility of the Event Store, and we are investigating using the same technology in other areas of BitSight’s pipeline.
BitSight is moving fast, but we don’t want to sacrifice code quality for speed, which is why tests have always played an important role in our development process. Although we are not doing TDD (Test-driven development), one of the key...
A few months back we added a new feature to the heart of our security ratings portal: the ability for users to not only filter companies in their portfolios, but also to see real-time updated counts of how many "filtered" companies match...