Aggregation pipelines in MongoDB can get very complex very quickly. But they provide a much more flexible query interface than simple “find” queries and often allow you to leverage data locality by moving much of your data processing to MongoDB itself. In this blog post, we will discuss the decision to move a particular data processing pipeline to MongoDB, and how we managed to stay sane in the process.
The story begins last year, when we introduced the Endpoint Agent. Endpoint Agents install on employee laptops and desktops and record real end-user performance data from the browser. As you would expect, this generates a huge amount of data, and the volume quickly adds up with the number of Endpoint Agents installed and the browser activity of the users. The Endpoint Agent view aggregates and summarizes these vast data sets. Before the changes described here, that aggregation was done entirely in Java. It worked, but it was very slow on large datasets, and the slowness compounded quickly when a user navigated across the timeline or applied multiple filters.
To improve the performance of these requests, we reimplemented the functionality as a single MongoDB aggregation pipeline, which significantly reduced the average response time. The query became very complex, though, since the Java aggregation function it replaced was itself non-trivial. Building the query in Java added yet another layer of complexity, because the challenge was no longer just the query itself but also how it was expressed.
Why go through all this trouble, then? The benefit of using aggregations is that they move computation closer to where the data is, while also running it through MongoDB’s query optimizer. This can significantly reduce the number of round trips to the database server, which is especially important if the latency to your database is high (for example, when using cloud databases), or if you routinely hit the 16 MB BSON document size limit. To say that aggregations in MongoDB are arcane, however, is only a slight exaggeration. The JSON syntax resembles freakishly distorted S-expressions, error messages can be non-obvious, and the language itself is quite limited, necessitating ‘clever’ workarounds in certain situations, like using $facet and $match to implement control flow.
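To illustrate that last trick, here is a minimal, hypothetical sketch (using the MongoDB Java driver’s aggregation builders, with made-up field names like `loadTimeMs`, `domain`, and `agentId`, not our actual query) of how $facet branches guarded by $match can stand in for an if/else: each branch filters on a flag injected with $addFields, so only the branch whose condition holds produces any output.

```java
import static java.util.Arrays.asList;

import java.util.List;

import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Facet;
import com.mongodb.client.model.Field;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.conversions.Bson;

public class ControlFlowSketch {

    // groupBy is decided by the caller ("domain" or "agent"); the pipeline
    // then "branches" on it inside a single aggregation.
    static List<Bson> buildPipeline(String groupBy) {
        // Stamp every document with the requested grouping so the branches can test it.
        Bson tagRequest = Aggregates.addFields(new Field<>("requestedGrouping", groupBy));

        // Branch 1: produces output only when groupBy == "domain".
        Facet perDomain = new Facet("perDomain", asList(
                Aggregates.match(Filters.eq("requestedGrouping", "domain")),
                Aggregates.group("$domain", Accumulators.avg("avgLoadTimeMs", "$loadTimeMs"))));

        // Branch 2: produces output only when groupBy == "agent".
        Facet perAgent = new Facet("perAgent", asList(
                Aggregates.match(Filters.eq("requestedGrouping", "agent")),
                Aggregates.group("$agentId", Accumulators.avg("avgLoadTimeMs", "$loadTimeMs"))));

        // Merge the two branch outputs (one of which is empty) into a single array.
        Bson merge = Aggregates.project(new Document("results",
                new Document("$concatArrays", asList("$perDomain", "$perAgent"))));

        return asList(tagRequest, Aggregates.facet(perDomain, perAgent), merge);
    }
}
```

The branch that does not match simply yields an empty array, and $concatArrays stitches whichever branch fired back into a single result list.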
The final product is a class of many lines, a creation both bizarre and complex, inspiring feelings of awe and cosmic unease that permeate its reader’s very being. A true-to-form Cthulhu’s Query, it performs great leaps over vast amounts of data while being capable of inspiring madness in some. Not to worry, though, as the accompanying test suite comprehensively documents the class, allowing sanity to prevail. The tests match the tests for the Java implementation nearly one to one, which is what keeps the two implementations consistent with each other.
So, how to write such a query without losing one’s mind? It pays to remember some common-sense principles of software engineering:
- Splitting the query into a tree of functions helps the reader of the class understand how the query is constructed and what each piece actually does, while also keeping the nesting depth in check. At a minimum, every non-trivial stage should get its own function (see the sketch after this list).
- Read the fabulous manual. While the system is complex and not particularly friendly to work with, it is well documented: the manual defines the behavior of every operator and expression in the language.
- Tests are the single most important tool you’ll have when porting complex logic from one system to another. Write many, and run them often.
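To make the first point concrete, here is a minimal sketch of that structure, again using the MongoDB Java driver and invented names (`EndpointSummaryPipeline`, `pageUrl`, `loadTimeMs`): the top-level build method reads like an outline, and each stage lives in its own small, well-named function.

```java
import static java.util.Arrays.asList;

import java.time.Instant;
import java.util.Date;
import java.util.List;

import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Projections;
import com.mongodb.client.model.Sorts;
import org.bson.conversions.Bson;

// Hypothetical class and field names; the shape, not the content, is the point.
class EndpointSummaryPipeline {

    // The top-level function reads like a table of contents for the query.
    List<Bson> build(Instant from, Instant to) {
        return asList(
                matchTimeWindow(from, to),
                groupByPage(),
                projectSummaryFields(),
                sortBySlowestPages());
    }

    private Bson matchTimeWindow(Instant from, Instant to) {
        return Aggregates.match(Filters.and(
                Filters.gte("timestamp", Date.from(from)),
                Filters.lt("timestamp", Date.from(to))));
    }

    private Bson groupByPage() {
        return Aggregates.group("$pageUrl",
                Accumulators.sum("samples", 1),
                Accumulators.avg("avgLoadTimeMs", "$loadTimeMs"));
    }

    private Bson projectSummaryFields() {
        return Aggregates.project(Projections.fields(
                Projections.computed("pageUrl", "$_id"),
                Projections.include("samples", "avgLoadTimeMs"),
                Projections.excludeId()));
    }

    private Bson sortBySlowestPages() {
        return Aggregates.sort(Sorts.descending("avgLoadTimeMs"));
    }
}
```

Each helper can then be read and reviewed on its own, while the tests for the whole pipeline, mirroring the Java implementation’s tests, remain the final arbiter of correctness.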
In addition, if you are porting an existing data processing function to MongoDB, it helps greatly to think about it in terms of how data flows through it, as it may be possible to reconstruct that flow with judicious applications of the $facet stage, as well as other tricks ($facet can even be used to produce key-value maps right inside MongoDB; a sketch follows below).
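For example, here is a hedged sketch, under the same made-up field names as above, of building a key-value map (page URL to average load time) entirely inside MongoDB. $facet by itself already returns a document keyed by its branch names, i.e., a map with fixed keys; for dynamic keys, $group can collect {k, v} pairs and $arrayToObject can fold them into an object.

```java
import static java.util.Arrays.asList;

import java.util.List;

import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;
import org.bson.conversions.Bson;

public class KeyValueMapSketch {

    static List<Bson> pageUrlToAvgLoadTime() {
        // Average load time per page.
        Bson groupPerPage = Aggregates.group("$pageUrl",
                Accumulators.avg("avgLoadTimeMs", "$loadTimeMs"));

        // Collect all per-page results into a single array of {k, v} documents.
        Bson collectPairs = Aggregates.group(null,
                Accumulators.push("pairs",
                        new Document("k", "$_id").append("v", "$avgLoadTimeMs")));

        // Turn the array of pairs into an actual object: { "<pageUrl>": <avg>, ... }.
        Bson toMap = Aggregates.replaceRoot(new Document("$arrayToObject", "$pairs"));

        return asList(groupPerPage, collectPairs, toMap);
    }
}
```

The last stage returns a single document whose keys are page URLs, which the Java side can consume directly as a map instead of re-keying an array of results.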
Does this mean that you should immediately rewrite all your complex processing as crazy big MongoDB queries? Most likely not. Recall the disadvantages of using the aggregation framework for this sort of thing, and keep in mind that it is much better to keep business logic in a language designed to express it. One does not make a pact with the Old Ones to do a grocery run, after all; it is only in times of dire need for performance that such solutions become reasonable. Choosing the right tool for the job also matters: in some cases, MongoDB’s aggregation framework is not as appropriate as, for instance, ElasticSearch, which we use in other parts of the application.
In conclusion, MongoDB aggregations can, in the right circumstances, bring significant performance improvements to data aggregation and summarization workflows, but at the cost of extra complexity and of having to work around the system’s limitations to redefine your business logic in terms that MongoDB will understand. It worked for us in this case, as well as for building the pagination for the Endpoint Agent path trace visualization, but it is important to choose one’s tools carefully and appropriately.
I’a! I’a! Mongo fhtagn!