Fixing a location lookup by moving the $limit

Some Background:
I’m the technical co-founder of Drivible.

I was working on a new feature which ended up causing a race condition that soft-deleted users' files. In fixing that I caused some other minor issues.
We didn't lose any actual user data, but I pride myself on being a good developer, and this was something that was hard to unit test and that we didn't manually test enough to catch the original bug. Mostly because I had a lot of personal issues happening at the same time: lack of sleep, family stuff, etc.

So I was feeling a bit down and questioning my skills as a developer when an old employer contacted me on a Tuesday wanting some help with a project. They were desperate.
The project was meant to go live the next Monday and it seemed broken. They might have had to pay back the equivalent of a year's salary if it failed.

Especially bad was an important API call to get a list of locations near a user.
The listing wasn't simply the closest locations to the user; there was a weighting system, so some entries could rank higher in the ordering despite being a bit further away.

The original developer did it all in application code: the web server fetched all ~6k location entries, calculated the distance to each one and then applied the weighting formula.
The API call was taking 14s, sometimes over 20s, and simply having a couple of users hit it at the same time would break it.
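As a rough illustration of that approach, here's a minimal sketch, assuming a "locations" collection with GeoJSON points and a made-up weighting formula (the real schema and formula were the client's own):

```ts
import { Db } from "mongodb";

// A rough sketch of the original all-in-code approach (field names and the
// weighting formula are assumed): every request pulled all ~6k locations
// into Node and did the distance and weighting maths there.
async function nearbyInCode(db: Db, userLng: number, userLat: number, limit: number) {
  const all = await db.collection("locations").find({}).toArray(); // ~6k documents per request

  const toRad = (deg: number) => (deg * Math.PI) / 180;
  const haversineKm = (aLng: number, aLat: number, bLng: number, bLat: number) => {
    const dLat = toRad(bLat - aLat);
    const dLng = toRad(bLng - aLng);
    const h =
      Math.sin(dLat / 2) ** 2 +
      Math.cos(toRad(aLat)) * Math.cos(toRad(bLat)) * Math.sin(dLng / 2) ** 2;
    return 2 * 6371 * Math.asin(Math.sqrt(h)); // Earth radius is roughly 6371 km
  };

  return all
    .map((doc) => {
      const [lng, lat] = doc.location.coordinates; // assumes a GeoJSON Point field
      const distance = haversineKm(userLng, userLat, lng, lat);
      return { ...doc, distance, score: distance / (doc.weight ?? 1) }; // assumed weighting
    })
    .sort((a, b) => a.score - b.score)
    .slice(0, limit);
}
```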
Another developer had replaced the all-in-code version with a MongoDB aggregation.
This was slightly better: around 2s per request if you searched within a 10km range and something like 10 seconds if you searched within a 100km range.
The assumption was that the issue was with the first $geoNear aggregation step. However, there were a lot of other steps in the database method.

After spending a few hours getting to know what was going on, I didn't believe the issue was with the $geoNear aggregation step. I was pretty sure that the 2dsphere index and the sheer effort MongoDB has put into that operation wouldn't be making it slow.
However, the database had been created by someone used to SQL databases and was rather normalised. There were $lookups to 2 other important collections, then some work to $unwind the arrays, then a $project of the needed fields, a $sort by distance, and finally a $limit, usually only returning 2 or maybe up to 10 entries.
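To make that concrete, here's a rough sketch of the pipeline shape; the collection names, field names and $project list are my assumptions rather than the client's real schema:

```ts
// Illustrative only: collection names, field names and the $project list are assumed.
function buildOriginalPipeline(userLng: number, userLat: number, radiusMetres: number, limit: number) {
  return [
    {
      $geoNear: {
        near: { type: "Point", coordinates: [userLng, userLat] },
        distanceField: "distance",
        maxDistance: radiusMetres,
        spherical: true,
      },
    },
    // Every document $geoNear returns goes through both $lookups below
    // before anything gets thrown away.
    { $lookup: { from: "companies", localField: "companyId", foreignField: "_id", as: "company" } },
    { $lookup: { from: "services", localField: "serviceIds", foreignField: "_id", as: "services" } },
    { $unwind: "$company" },
    { $project: { name: 1, distance: 1, weight: 1, company: 1, services: 1 } },
    { $sort: { distance: 1 } },
    { $limit: limit }, // only 2 to 10 entries are actually wanted
  ];
}
```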

After putting the aggregation pipeline into MongoDB Compass I noticed something important: the initial $geoNear step returned over 800 results for our test location!
It was then doing $lookups on all of those, which was the equivalent of firing around 1,600 requests at the other collections, all for a single API call.

I simply moved the $limit up, making it the 2nd step, and BOOM, suddenly response times were under 500ms!
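In pipeline terms the fix was just reordering the same stages, roughly like this sketch (same assumed names as above):

```ts
// Same stages, but $limit is now the 2nd step, so the two $lookups only run on
// the handful of nearest documents instead of 800+ of them.
function buildFixedPipeline(userLng: number, userLat: number, radiusMetres: number, limit: number) {
  return [
    {
      $geoNear: {
        near: { type: "Point", coordinates: [userLng, userLat] },
        distanceField: "distance",
        maxDistance: radiusMetres,
        spherical: true,
      },
    },
    { $limit: limit }, // $geoNear output is already sorted by distance
    { $lookup: { from: "companies", localField: "companyId", foreignField: "_id", as: "company" } },
    { $lookup: { from: "services", localField: "serviceIds", foreignField: "_id", as: "services" } },
    { $unwind: "$company" },
    { $project: { name: 1, distance: 1, weight: 1, company: 1, services: 1 } },
  ];
}
```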

I had to do some more work to fix things up. For example, the ranking priority was no longer being calculated, so the results were ordered purely by distance. I made it so that if you wanted fewer than 10 results we still set the database limit to a minimum of 10 (yay for clamp), then did the weighting calculations server side, sorted by the updated priority, and finally applied a limit to get the requested number of results.
This seemed a decent compromise and moved a touch of the CPU processing onto the NodeJS servers, which meant auto-scaling could kick in if needed.
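Here's a minimal sketch of that flow, reusing the buildFixedPipeline() sketch from above; the weighting formula is again a made-up stand-in for the client's real priority calculation:

```ts
import { Collection } from "mongodb";

// The weighting formula and field names here are assumed; the real priority
// calculation was the client's. This just shows the clamp, re-rank and
// final-limit flow around the buildFixedPipeline() sketch above.
async function nearbyWeighted(
  locations: Collection,
  lng: number,
  lat: number,
  radiusMetres: number,
  requested: number
) {
  const dbLimit = Math.max(requested, 10); // never ask MongoDB for fewer than 10 candidates
  const candidates = await locations
    .aggregate(buildFixedPipeline(lng, lat, radiusMetres, dbLimit))
    .toArray();

  return candidates
    .map((doc) => ({
      ...doc,
      // A higher weight lets a slightly further entry rank higher.
      priority: doc.distance / (doc.weight ?? 1),
    }))
    .sort((a, b) => a.priority - b.priority)
    .slice(0, requested); // the final limit is what the caller actually asked for
}
```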

I could hit the endpoint with 100 virtual users in Postman's new performance tester without any issues. I had to break out ab (ApacheBench) and simulate 500+ concurrent requests to make it struggle. Simply upping the MongoDB Atlas cluster from an M20 to something with more vCPUs meant we couldn't overload it at all.

There were a bunch of other fixes I helped apply: caching certain results from a 3rd party, sending less data to the front end, and some minor changes to the ReactJS + TypeScript frontend.

I was pleased with myself. The system went from being pretty much broken to something I couldn’t overload.

It had the added benefit of letting me play with NodeJS again and realising how much I enjoy it, plus dealing with a different team of people who were working in a different way. They had a deadline, which meant we were working at a frantic pace, and I did what would normally be 3 weeks of work in about a week and a half. So I doubled my output, but I didn't see much of my family and needed a decent rest afterwards to catch up on sleep and exercise.

I used to be a freelancer, but since last year I've decided to focus on Drivible, so I wouldn't normally take on freelance work. This was mostly a favour for an old friend who needed it. It helped that it also paid well.

But Drivible is my focus and so back I go.

Still, I’m grateful that I had spent some weeks over the years working on side projects in order to have the skills to jump in fairly quickly.
I don't really want to touch SQL again, so I'm also grateful it was using MongoDB, which I really love. Especially the power of aggregations, which I've not seen replicated by other systems.

An area I need to focus on more is ReactJS and TypeScript, a setup we're also using for Drivible but one I know is my current main weakness (along with ML, but that's a whole other area).

Having played with some of Postman's more advanced features, I now want to set up a test runner using it to double-check that we won't lose users' data in Drivible.

It's something that's become core for us: that we don't lose your car dealership's data, and that we help you do things faster and with better quality than the analogue equivalent, be that payments, checking car rego expiry or getting feedback from clients.

Cheers!

By Michael Kubler

Photographer, cinematographer, web master/coder.
