Advanced Filtering with MongoDB Aggregation Pipelines

This is a repost from https://medium.com/@kublermdk/advanced-filtering-with-mongodb-aggregation-pipelines-5ee7a8798746, although go read it there so I can get the Medium $$ and because it reads better there.


For a project Chris Were and I have been working on, we discovered a great way of doing advanced queries using MongoDB's aggregation pipelines together with the $merge stage, in order to filter down to a specific set of customers based on their demographics and purchasing behaviour.

For those that don't know, MongoDB's aggregation pipeline is much more sophisticated and, at least for me, more intuitive than filtering using map-reduce.

The project is part of the APIs and admin control panel backend that powers a mobile app for customers who are shopping.

We know things like the customer's age, gender, and state.

We save some of the customers' transaction data, including the storeId, an array of products and an array of product categories, although only the last few months' worth.

We have an admin control panel where the managers can create a set of filters to select a specific customer group. They can then send push notifications to those customers, assign them coupons or send them surveys.

Except for the number of days ago, the entries allow for multiple selections, e.g. you can select all age ranges, just one, or a couple of them.

Although it'd take you a really long time in the UI to select almost all of the products, or even all of the categories. Hence we have both "have" and "haven't" purchased / shopped at versions of the filters.

Example filters:

Age: 0–20, 21–30, 31–40, 41–50, 51–60, 60+

Gender: Male, Female, Other

State: South Australia, Victoria, Queensland, New South Wales, Western Australia, ACT, NT

Have Purchased Products [X,Y, …] in the last Z days

Haven’t Purchased Products [X,Y, …] in the last Z days

Have Purchased from Product Categories [X,Y, …] in the last Z days

Haven’t Purchased from Product Categories [X,Y, …] in the last Z days

Have Shopped at Store(s) [X,Y, …] in the last Z days

Haven’t Shopped at Store(s) [X,Y, …] in the last Z days

They needed the system to be flexible, so there are also Include and Exclude versions of the filters.
Importantly, the Include filters are AND'd together whilst the Exclude filters are OR'd together, so you can get very specific with the includes whilst applying some broad exclude filters.

An example might be selecting people who have purchased toilet paper and alcohol hand sanitiser in the last 7 days, but excluding all people aged 60+ and all people who've purchased kitty litter. A notification can then be sent about a new pandemic preparedness set that's now in stock, or about how the stores are being regularly disinfected during the Covid19 pandemic.

Another option could be to target people in South Australia who are 30 yrs old or under and have purchased from the Deli – Vegan meats product category in the last 10 days, but exclude those who’ve purchased the new Vegan burger. Then they can be given a 2 for 1 voucher for the new burger.

With many thousands of products, hundreds of categories and a few dozen stores there’s a reasonable amount to search through. We also don’t know how many customers there will be as the system hasn’t been launched yet.

But the system has to be fairly fast as we sometimes need to queue up customer push notifications whilst processing the HTTP requests from the 3rd party sending us the transaction information.

The important parts of the data structure look like this:
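(The exact documents aren't reproduced here; the sketch below is illustrative only, with made-up field names and values, written as the PHP arrays the backend would insert.)

// Illustrative customer and transaction documents, not the exact schema.
$customer = [
    '_id'         => 'customer-123',
    'dateOfBirth' => strtotime('1985-06-15'), // stored as a timestamp so age filters become range queries
    'gender'      => 'Female',
    'state'       => 'South Australia',       // in the real system this is a postcode, not a string
];

$transaction = [
    '_id'               => 'transaction-456',
    'customerId'        => 'customer-123',
    'storeId'           => 'store-42',
    'products'          => ['product-1', 'product-9'],
    'productCategories' => ['category-7'],
    'createdAt'         => time(),            // unix timestamp of the purchase
];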

What we needed after filtering all the customers and transactions was a list of customer IDs. We can then feed those into a variety of systems, like the one for sending push notifications, or for selecting which people get coupons.

A main limitation was that whilst the database servers were quite powerful, the web servers weren't. My initial thought was to process a bunch of aggregation pipelines, get a list of customer IDs and do the merging and processing in PHP, but with potentially 100k+ customers and transactions, Chris pushed to go harder on the database. Thankfully MongoDB is powerful and flexible enough to do what we wanted.

In the latest version of MongoDB (v4.2) there's now a $merge aggregation pipeline stage, which can output documents to a collection and, unlike the $out stage, has some advanced controls over what to do when the output matches existing documents.
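As a quick illustration, a $merge stage looks something like this when built as a PHP array on the backend (the collection name here is made up):

$mergeStage = [
    '$merge' => [
        'into'           => 'customerGroupResults', // collection to write the results into
        'on'             => '_id',                  // field(s) used to match existing documents
        'whenMatched'    => 'merge',                // merge fields into the existing document
        'whenNotMatched' => 'insert',               // insert documents that aren't already there
    ],
];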

I worked out that we can do two types of queries: a "Have" select of all those who should stay, and a "Have Not" select of all those who should be removed.

For a "Have" select we output the customerIds into a merged results collection for the customer group with an extra {selected: true} field, bulk delete those without the selected field, then bulk update to remove the selected: true field.

For the "Have Not"s we select all the customerIds of those we don't want, set {excluded: true}, and bulk delete those with that field.
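Here's a rough sketch of the "Have" flow, using the plain mongodb/mongodb PHP library rather than our Yii2 / Mozzler stack, and with made-up collection, field and filter names:

<?php
require 'vendor/autoload.php'; // composer require mongodb/mongodb

$db = (new MongoDB\Client('mongodb://localhost:27017'))->selectDatabase('app');

// 1. Select the matching customers and $merge their IDs into the customer
//    group's results collection with an extra {selected: true} marker.
$db->selectCollection('customers')->aggregate([
    ['$match'   => ['gender' => ['$in' => ['Female', 'Other']]]], // an example Include filter
    ['$project' => ['_id' => 1, 'selected' => ['$literal' => true]]],
    ['$merge'   => [
        'into'           => 'customerGroupResults',
        'on'             => '_id',
        'whenMatched'    => 'merge',
        'whenNotMatched' => 'insert', // use 'discard' if an earlier pipeline already seeded the collection
    ]],
]);

$results = $db->selectCollection('customerGroupResults');

// 2. Bulk delete everyone this pipeline did NOT mark as selected...
$results->deleteMany(['selected' => ['$exists' => false]]);

// 3. ...then bulk update to remove the marker, ready for the next pipeline.
$results->updateMany([], ['$unset' => ['selected' => '']]);

The "Have Not" version is the same shape, but it marks the matching customers with {excluded: true} and then bulk deletes everything that has that field.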

Example of Include and Exclude filters

This is an example of the UI for setting the filters. The approximate customer count shown is based upon some randomly created data used for load testing; in this instance 17k customers and 340k transactions.

The UI creates a set of filters with values for things like the productIds, but the PHP backend (using the Yii2 framework, Mozzler base and some custom code) does some parsing of things like ages, e.g. converting the string "0–20" into the range from the current unix time() back to time() minus 20 years. Similar changes are done on the backend to convert something like "60 days ago" into a unix timestamp relative to now.
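For example, the age conversion works out roughly along these lines (a simplified sketch rather than the production code; it ignores the "60+" case and any validation):

// Simplified sketch: convert an age range string like "21–30" into a
// date-of-birth range of unix timestamps, and convert "Z days ago" into a
// timestamp relative to now.
function ageRangeToDobRange(string $range): array
{
    [$minAge, $maxAge] = array_map('intval', explode('–', $range));
    return [
        'dobFrom' => strtotime('-' . ($maxAge + 1) . ' years'), // oldest acceptable date of birth (roughly)
        'dobTo'   => strtotime("-{$minAge} years"),             // youngest acceptable date of birth
    ];
}

$purchasedSince = time() - (60 * 86400); // "in the last 60 days" as a unix timestamp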

I was going to run an aggregation pipeline for each filter. However, if there are 10 filters that could be a lot of work for the database. MongoDB seems to be better at handling a somewhat complicated $match query (with some good indexes) than at running that many separate aggregations.

Chris then suggested we merge the aggregations, and I realised that doing it the following way actually works out perfectly and we only need a maximum of 4 aggregations. I'm sure that if the Exclude filters weren't treated as a logical OR, or the Includes weren't treated as a logical AND, then things would be different.

Aggregation pipelines

1. Customer Select (Have):
Include filters for a Customer’s Age, Gender and/or State

2. Customer Exclude (Have Not):
Exclude filters for a Customer’s Age, Gender and/or State

3. Transaction Select (Have):
Include filters for Have Purchased Products, Categories and/or at Stores
Exclude filters for Haven’t Purchased Products, Categories and/or at Stores

4. Transaction Exclude (Have Not):
Include filters for Haven't Purchased Products, Categories and/or at Stores
Exclude filters for Have Purchased Products, Categories and/or at Stores

From the admin control panel UI we've grouped the filters into Include or Exclude and have an array of them.
Because things like "7 days ago" need to be converted into a unix timestamp based on the current time, we need to build the aggregations dynamically, hence using MongoDB views wasn't really possible.

On the backend I wrote a customer group manager system for grouping the filters into the different categories and merging them together.
The actual queries we ran were a bit more complicated than shown below, because we don't save the state as a string but as a postcode (so there are a bunch of postcode ranges), but what we do is very similar to the example aggregation below, based on the filters in the UI screenshot:
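As a rough idea of the shape (this is a hedged sketch rather than the actual Gist; the product IDs, field names and collection names are made up), a transaction "Have Purchased ... in the last 7 days" pipeline ends up looking something like:

$sevenDaysAgo = time() - (7 * 86400);

$transactionHavePipeline = [
    // Include filters are AND'd together inside a single $match
    ['$match' => [
        'createdAt' => ['$gte' => $sevenDaysAgo],
        'products'  => ['$in' => ['product-toilet-paper', 'product-hand-sanitiser']],
    ]],
    // Collapse to one result per customer, not one per transaction
    ['$group'   => ['_id' => '$customerId']],
    ['$project' => ['_id' => 1, 'selected' => ['$literal' => true]]],
    // Merge the surviving customer IDs into the group's results collection
    ['$merge'   => [
        'into'           => 'customerGroupResults',
        'on'             => '_id',
        'whenMatched'    => 'merge',
        'whenNotMatched' => 'discard', // only keep customers who survived the earlier pipelines
    ]],
];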

The aggregations in the Gist should have enough comments to explain the actual steps in detail, but it's expected that you've used MongoDB aggregations before and have some idea of what's going on.

Some points we discovered:

  • The $merge stage lets us put all the data into a merged collection, so we can iterate over the results using a cursor, use the collection count and other things that make it both easy and very scalable. The merged collections are effectively free caching.
  • It was very powerful to always be doing a $match (select) query first. It's fast with the right indexes, and combining the selected: true or excluded: true markers with a bulk delete / update is powerful.
  • Merging into the Have / Have Not filter sets on the Customer and Transaction models means there's a maximum of 4 pipelines that will be run, although if the Exclude filters were AND'd rather than OR'd between them, this might not be the case.
  • An edge case is that if there isn't a Customer Have pipeline we have to add all customerIds as a first pass, so that customers who don't have any transactions can still be returned, and so that if there are no filters at all it selects everyone.
  • I also developed some tweaks which let us run a set of filters on just a single customer, or a small selection of customers, to know if they are part of a customer group. This is especially used when a new transaction comes in. In that case we point to a different (temporary) customerGroup collection (see the sketch below). Obviously querying against a small set of customerIds makes things much faster.
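For example, the single-customer check is essentially the same pipeline with an extra $match prepended and the $merge pointed at a temporary collection (again, the names here are illustrative):

// Restrict a filter run to one customer (or a handful) by prepending a $match
// on the customerId, and point $merge at a temporary per-check collection.
$singleCustomerPipeline = array_merge(
    [['$match' => ['customerId' => ['$in' => ['customer-123']]]]],
    $transactionHavePipeline // the pipeline sketched earlier
);
$lastStage = count($singleCustomerPipeline) - 1;
$singleCustomerPipeline[$lastStage]['$merge']['into'] = 'customerGroupCheckTmp';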

The end results are good: I get around 600ms when querying against 20k customers and 180k transactions on a basic Vagrant VM on my laptop, although that is random data I generated just for load testing.

We are still waiting to see what this will be like in production.

Let me know if something doesn’t make sense or if you want more information.

Via Negativa

Via Negativa translates to "by removal." Taleb argues that a lot of problems can be solved by removing things, not by adding more. In decision making, if you have to come up with more than one reason to do something, it's probably because you're just trying to convince yourself to do it. Decisions that are robust to errors don't need more than one good reason. You can observe the beneficial effects of Via Negativa in a vast number of fields, from medicine and diet to wealth.

Copied from: https://anantja.in/antifragile-things-that-gain-from-disorder/

This mental model isn't in the main list of mental models I often look at (https://fs.blog/mental-models/), but it's one I've come across before and wanted to point out to people, so I thought I'd post it here on its own.

In this case, I’m thinking about using Via Negativa for removing stupid people from a group.

Covid19 – It’s not intentional, it’s neglect

I've been seeing conspiracy theories pop up saying that the Covid19 coronavirus was intentionally created.
But from my research, the truth seems to be a string of incidents going back to the 1960s, culminating in the neglect of animals.

Firstly, I know that in Australia at least, government funding cuts weren't just cutting off the 'fat' but also the meat from the healthcare services. Core services were being affected. So in times like that, having a pandemic response team and spare stock of required PPE supplies is a luxury that can't be afforded.
If a rather rich country like Australia has these issues, then I’d expect a large number of countries would also have the same issue. Maybe some Nordic countries would have a bit better setup, but I’ve no idea.

If the Covid19 virus RNA did have evidence of something like CRISPR manipulation, e.g. many copies of a particular nucleotide sequence, then, given the RNA has been sequenced and looked at by many professionals, I'd expect that to have been announced, and I'd be more likely to consider it.

I do think that Capitalism and the rich and powerful are somewhat to blame. But even Communism has its part.

The story as I understand it is such:

Back in the 1960s China followed the USSR's crazy Communist ideas around farming practices, pitched by Trofim Lysenko, as part of the 'Great Leap Forward'. Ideas like having the plants spaced really close to each other weren't tested and were very incorrect, and when combined with a few other issues they caused tens of millions of people to die from hunger when the farms failed.

Still somewhat struggling, in 1978 the Chinese government gave up its control over farming, allowing companies to take over.
During this time many peasants started hunting wildlife to survive: bats, turtles, chimps, snakes, goats and more.
That practice then became industrialised in 1988 when the Chinese government declared wildlife a natural resource, which allowed the farming to be expanded and industrialised. Over time they also started to trade in more and more exotic animals, like bears, lions, pangolins and rhinos.

These animals are sold at wet markets like the one in Wuhan, China, where their cages are stacked on top of each other. The poop, blood, pus and other liquids fall from the animals at the top through to those at the bottom.

It’s believed that Covid19 was passed from a bat to a pangolin then to a human.

Covid19 Bat to Pangolin to Human

This isn't the first time such a transmission from animals to humans has happened at a wet market in China; the SARS virus also came from China, in a similar incident.

But China isn't closing the wet markets down permanently, because the main customers of the wildlife farming industry, and especially of the illegal trades, are the rich and powerful.
That's the case because the wildlife industry obviously wants to keep itself going, and so has been promoting its animals as tonic products: good for bodybuilding, fighting disease and enhancing sex. Whilst none of these claims are true, they are lapped up by the rich and powerful.
That means the industry has an enormous lobbying capability.

So if you want to blame Covid19 on both Capitalism and the rich and powerful, you can. But it's not an intentional virus. It's the result of a complex chain of events, from bad farming practices that spawned a side industry with bad animal farming practices, through to the rich and powerful who keep it going.
I think the vegans and vegetarians are right. We wouldn't be in this mess if we weren't eating so much meat and were looking after the animals better.

We wouldn't be creating strains of antibiotic-resistant diseases if we weren't dosing the livestock because we force them into horrible, cramped conditions. We wouldn't have a global pandemic and an unprecedented full lockdown, possibly causing one of the worst economic downturns in decades, if we were more caring of the animals and the environment.

So yeah, someone in Wuhan likely didn't care about his animals, or was in a situation where he wasn't able to look after them well and clean them properly, so the fluids of various animals mixed at the bottom of the stack, and now the world is paying for it. Currently (2pm on Monday the 23rd of March 2020) 330k people have been infected and 14,703 have died, and if we weren't taking extreme measures likely millions more would be affected.

Covid19 Virus graph 23rd March 2020

The above won't stop people, especially those who are rich and have savings, from trying to take advantage of the situation and buying things up at bargain-basement prices. Others are likely to grab other forms of control: political, military or otherwise.

However I've also seen people actively helping others in a very altruistic way. It's definitely inspiring. But unfortunately the elderly and the poor are going to be the most affected by Covid19. I hope you, dear reader, aren't in either group, but even if you aren't, take what precautions you can.

Main sources:

https://www.youtube.com/watch?v=TPpoJGYlW54 Vox – How wildlife trade is linked to coronavirus
https://www.worldometers.info/coronavirus/ – Latest Stats

Evernote lost my work

TLDR: Evernote doesn't seem to back up your work whilst you're writing. It doesn't save until you've actually exited the note, so it's highly vulnerable to your phone / tablet dying.

The Story

This afternoon I got my Android Tablet out and started to write up my weekly review. I haven’t actually done my review for the last month, so there was a lot to write in, like how my Baby Boy is developing, issues I’ve had with sudden sciatica in my back and some of the craziness of dealing with the Corona virus reactions and lockdown.

It’s stuff I can re-write, but won’t. I certainly won’t be using Evernote to do so.

Actions

I started by duplicating an existing template I made recently, then I renamed it, sat down for what felt like 45mins and poured words onto the screen.

My tablet is a little old (it's a Samsung Galaxy Tab S2) and the battery has a habit of dying. Which it did. A notification appeared to say the battery was at 15%, then suddenly the screen went blank and it rebooted. I didn't think too much of it, figuring that my tablet was connected to the Internet and should be saving both locally and to the cloud as I was writing. It would have been fine if I'd only lost the last minute of work.

But instead I lost probably thousands of words.

If I was writing my novel I'd be livid. It's just not acceptable that it doesn't automatically back up. It's a mobile app, not a desktop app, so I shouldn't need to [Ctrl] + [S] save every few words as I normally do when working. Especially when I'm in the flow and actively writing, I don't want to think about saving.

Switching to Pure Writer

I've been burned, so from now on I think I'll be doing my initial writing in Pure Writer before copying over to Evernote. I still like Evernote's syncing across devices and the way it works, but I now loathe that it doesn't autosave whilst you're actually writing.

Killing Windows Night Light

O.M.F.G. I finally found out why my colour grading has been so off. Windows 10 "kindly" enabled its Night Light mode and made everything more red in the evening.

Night Light mode is the same idea as f.lux or Twilight: it puts a red tint over the top of your screen to reduce the amount of blue light, which is meant to help you get to sleep more easily.

However, it completely throws off any colour grading work. You need to disable it in order to do any Photoshopping, video editing or anything else to do with colour grading.

You can go to [Settings] -> [Display]
Then check that [Night light] is switched to off.

Then go to [Night light settings]

In the Night light settings ensure that the Schedule is set to Off and that the Colour temperature at night is all the way to the right and thus is white.

Unfortunately it took me way too long to realise what was going on and why. When I colour corrected a clip so that it looked white to me, the DaVinci Resolve colour scopes looked like they were out. I'd notice what looked like a bit of red on the monitor, but would just tilt it until it looked fine, which seemed reasonable given I had an X-Rite i1 Display Pro colour calibrator and was using a 4K monitor.

Note that there are also a few other apps which can cause a colour tint. On my Asus Republic of Gamers laptop, the Armoury Crate app includes a featured app called GameVisual which also likes to do some colour temperature changes of its own.

Hopefully this helps others with similar issues. Let me know of any other apps which cause problems.

Things I want to teach my children

Here’s a collection of things I’d love to teach my kids:

One of the most important things is how to be happy and successful. The secret to happiness is more than just having low expectations and being happily surprised.
Having a meaningful life where you are working towards a unified purpose is important.

Another thing is not to take on too many things. I've found it hard to say no, because I can see that there aren't enough skilled people in the world trying to help, and I can see the potential in many of the projects I come across.
But by making a list of 25+ things you want to do and focusing on only the top 5, you can hopefully keep your focus.

Note that to know your purpose, and thus how to prioritise what you want to do in life, you should read Stephen R. Covey's book The 7 Habits of Highly Effective People, which will also teach you things like balancing production vs production capacity, taking different views of your life, and doing decent planning.

vagrant plugin update

Vagrant is a program which makes it easy to start new virtual machines.
I've got a Windows machine (for the games and video editing software), but I usually code websites which run on Linux servers.

I usually have 1 or 2 VMs running on my laptop.

After getting messages from Vagrant every time I started up a VM that a new update was available, I decided to install the latest version (v2.2.5).

It then stopped working and when running the usual vagrant up I got the following:

Vagrant failed to initialize at a very early stage:
The plugins failed to initialize correctly. This may be due to manual
modifications made within the Vagrant home directory. Vagrant can
attempt to automatically correct this issue by running:
vagrant plugin repair
If Vagrant was recently updated, this error may be due to incompatible
versions of dependencies. To fix this problem please remove and re-install
all plugins. Vagrant can attempt to do this automatically by running:
vagrant plugin expunge --reinstall
Or you may want to try updating the installed plugins to their latest
versions:
vagrant plugin update
Error message given during initialization: Unable to resolve dependency: user requested 'vagrant-hostmanager (= 1.8.9)'

Running a vagrant plugin repair showed a new error.

Unable to resolve dependency: user requested 'vagrant-vbguest (= 0.18.0)'

Running the vagrant plugin expunge --reinstall didn't help.

The vbguest part is a reference to VirtualBox, the VM manager I use, and likely to the guest additions plugin which allows for better two-way communication between my host machine and the guest VMs.

There was no reference to the plugin in my Vagrantfile, nor in the vagrant folder. There weren't any good Google results (hence why I'm writing this post).

After some playing around I found the command which fixed it:

vagrant plugin update

Running a vagrant plugin update caused it to update to v0.19.0 of the plugin and then everything worked happily.

Hopefully if others have the same issue they can quickly try a vagrant plugin update and see if that fixes their issue.

Yii2 Swiftmailer 0 Auth exception

If you get the error Message: Failed to authenticate on SMTP server with username “****” using 0 possible authenticators

Then try to remove the username and password from the configuration file.

Context

This happened when using Swiftmailer, a common PHP mailer.
This example specifically covers the Yii2 configuration file, but it likely applies to other frameworks.

Here’s an example of the offending config

config/web.php (or console.php or a common.php file if you merge the two).

[
    'components' => [
        'mailer' => [
            'class' => 'yii\swiftmailer\Mailer',
            'transport' => [
                'class' => 'Swift_SmtpTransport',
                'plugins' => [
                    ['class' => 'Openbuildings\Swiftmailer\CssInlinerPlugin'],
                ],
                'username' => 'smtp-auth-user',
                'password' => '*****',
                'host' => 'exchange.local',
                'port' => 25,
            ],
        ],
    ],
];

The exception seen was

Message: Failed to authenticate on SMTP server with username “….” using 0 possible authenticators

This exception caused a major headache.

After investigation it turned out that removing the username and password from the transport caused it to work.

It seems that the server we were on was in a corporate environment where SMTP authentication was disabled, but Swiftmailer was still trying to authenticate, and failing.
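For reference, the working transport config ended up looking roughly like this, with no username or password, so Swiftmailer doesn't attempt authentication at all:

'transport' => [
    'class' => 'Swift_SmtpTransport',
    'plugins' => [
        ['class' => 'Openbuildings\Swiftmailer\CssInlinerPlugin'],
    ],
    'host' => 'exchange.local',
    'port' => 25,
],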

Bonus – Enabling SMTP Logging

[
    'components' => [
        'mailer' => [
            'class' => 'yii\swiftmailer\Mailer',
            'enableSwiftMailerLogging' => true,
            'transport' => [
                'class' => 'Swift_SmtpTransport',
                'host' => 'localhost',
                'port' => 25,
            ],
        ],
        'log' => [
            'traceLevel' => YII_DEBUG ? 3 : 0,
            'targets' => [
                [
                    'class' => 'yii\log\FileTarget',
                    'levels' => ['error', 'warning'],
                ],
                [
                    // Log the emails
                    'class' => 'yii\log\FileTarget',
                    'categories' => ['yii\swiftmailer\Logger::add'],
                    'logFile' => '@app/runtime/logs/email.log',
                ],
            ],
        ],
    ],
];

With the above config you should now see detailed logs in the runtime/logs/email.log file.

Sabby Love

I’m a bit of a night owl but she takes it to a whole new level, sleeping most of the day.
So when she started staying over we got very little sleep. I was exhausted. But we are both in a good rhythm now.
I love the way we’ll seek each other and curl up next to each other. She’ll fall asleep in my arms. Other times she’ll caress my feet. Sometimes she’ll bite and scratch a bit when she’s feeling that way inclined. It’s not my thing, but to each their own.
I met her some time ago but we met again through the same mutual friend who looks after her Mum and we’ve been together for some months now. She’s black, which is new for me, although she has some white hairs that really stand out and I occasionally pluck out.
I love her.
Her eating habits leave a lot to be desired. Like many Vietnamese, she doesn't put her litter in the bin but on the floor, so I often have to sweep up afterwards.
We’ve already been through a lot. When she first arrived she was a scaredy cat, especially afraid of the rain and thunder. The roof here does make the rain extra loud 🔊  but now she is fine and can sleep through it.
It’s hard to have any privacy with her around. She’ll often come into the bathroom whilst I’m sitting on the toilet. Yet she keeps away when I shower.

We've watched movies together and fought off the flies, moths and bugs that often try to attack in the night. She loves it when I sweep up. She didn't always like me being on the computer too long, but now she enjoys it because she's learnt how to be with me.

I love watching her play and enjoy life. She’s so cute.
Although she’s also invasive. She’ll check everything and go through my stuff given half a chance and sometimes destroys things. But she’s curious, not malicious.
But. I recently learnt that she’s not a she.
It turns out that Sabby, my cat, is a boy. Both Mrs Loan and myself misinterpreted Sabby’s gender. Apparently that’s pretty easy to do when the cat is young.
Sabby is a mostly black Bombay Cat whom I love dearly.
I was going to post this some time ago but wanted to make a nice collage of images. I ran out of time then, but in 15 minutes time Sabby is going away, back to his home.
See, I’ve been living in Vietnam for nearly 7 months now, but it’s time to return to Australia and then, well I’m not sure where I’m going next but I know it’s to be with my girlfriend Jen.

Using jq to update the contents of certain JSON fields

OK, I’ll be brief.


I created a set of API docs with Apiary using Markdown format.

We needed to change over to Postman, so I used Apimatic for the conversion, which was 99% great, except that for the item descriptions it only kept a single line break, not two. As Postman reads the description as Markdown, a single line break doesn't actually create a new line.


So I needed to replace the string \n with \n\n, but the key is I only needed to do it on the description fields.

Oh, and I needed to add an x-api-key header to use the mock server. Even Postman's own authorisation system didn't seem to support this easily.

Using the incredibly useful comment by NathanNorman on the GitHub Postman issue linked in the resources below, I got a glimpse of what I could do.


So to add the x-api-key into the Postman headers, I ran the following in the terminal on my Linux VM:

jq 'walk(if (type == "object" and has("header")) then .header |= (. + [{"key": "x-api-key", "value": "{{apiKey}}"}] | unique) else . end)' postman_api.json > postman_api_apiHeader.json


I then checked some resources, learnt about the |= update operator and gsub for replacement.

So to replace \n with \n\n in just the description fields I ended up with:

cat postman_api_apiHeader.json | jq 'walk(if (type == "object" and has("description")) then .description |= gsub("\\n"; "\n\n") else . end)' > postman_api_apiHeader_description.json


If you want to see a list of the updated description fields to make sure it worked you can pipe the results to jq again.

cat postman_api_apiHeader.json | jq 'walk(if (type == "object" and has("description")) then .description |= gsub("\\n"; "\n\n") else . end)' | jq '..|.description?'

Hopefully that helps others, or myself in the future.


Note that I downloaded the latest version of jq in order to run this. The Debian distros only ship version 1.5, but I needed v1.6 for the walk function; it's a pretty easy download.

Some resources:

https://stedolan.github.io/jq/ Official jq site

https://stedolan.github.io/jq/manual/#walk(f) – Official docs description of the Walk function in jq.

https://remysharp.com/drafts/jq-recipes – Some jq recipes

https://github.com/postmanlabs/postman-app-support/issues/4044 – The Github issue that got me down this path

https://www.apimatic.io/transformer – The very powerful API Blueprint online conversion system, allowing me to upload a Markdown-style Apiary file and download both Postman and Swagger .json files.