Notes on Airbnb's learning to rank paper
These are my notes on Airbnb’s paper on Applying Deep Learning To Airbnb Search.
I’ve read this paper a few times, since my team is trying out learning to rank and is going on a similar journey.
I really enjoyed reading this paper. It’s well written and I learnt a lot from it. It’s a great theory-to-practice kind of paper: it covers the technical details, but also discusses how they applied them in practice. I appreciated that they noted what didn’t work, too.
We present our perspective not with the intention of pushing the frontier of new modelling techniques. Instead, ours is a story of the elements we found useful in applying neural networks to a real life product. Deep learning was steep learning for us. To other teams embarking on similar journeys, we hope an account of our struggles and triumphs will provide some useful pointers. Bon voyage!
These notes are mostly for my own consumption, though if you’re working on something similar you might find this useful.
Haldar et al. tell the story of adopting a neural network (NN) for search ranking at Airbnb.
The application to search ranking is one of the biggest machine learning success stories at Airbnb.
The paper is about the listing search feature, which enables people to find a place to rent.
They want people searching for “apartment in Rome” to be shown a list of places to rent, ranked by relevance. In this case, relevance is something like the likelihood that a user will rent a given place.
Airbnb search started out as a “manually crafted scoring function”.
I’d guess that the scoring function would be something like TF-IDF, along with some manual tuning.
Manual tuning would involve applying a weighted boost to documents that (e.g.) are popular or were recently updated.
I would expect most search pages on the web to function like that. Your average WordPress site is probably using a simple text search function (which works great!). When you run something like Elasticsearch you get a bit more control. You get to ‘tune’ relevance, so you can manually boost documents based on some features.
This isn’t machine learning; it’s usually a simple function. For example, below is a rough sketch of applying a boost to a document’s score in search results, using its popularity as a multiplicative factor.
# Base relevance score from text matching (e.g. TF-IDF)
tf_idf_score = 0.0123
# Popularity signal for this document
total_visits = 500
# Hand-tuned weight controlling how much popularity matters
popularity_weight = 0.0001
# Multiplicative popularity boost: more visits means a higher final score
score = tf_idf_score * (1 + popularity_weight * total_visits)
All things being equal between two documents (i.e. they both had the same tf_idf_score), the more popular document would have the higher score.
You can tune the impact popularity has on search result ranking by changing that popularity_weight. Of course, you can have lots of relevancy ‘signals’ like popularity, all of which have weights that you can manually tune.
(Again, I’m just guessing this is what Airbnb started out with!)
Airbnb then transitioned to using a gradient boosted decision tree (GBDT) model.
I imagine this is because it became hard to improve search relevancy manually. I’ve found manually tuning relevancy boosts to be a bit like whack-a-mole. Some queries get better, while others get worse, and the returns from these manual efforts diminish over time.
What is a gradient boosted decision tree? A decision tree is a tree-shaped flowchart. But what about gradient boosting?
Gradient boosting is a machine learning technique for building a prediction model from an ensemble of decision trees. At each step, the algorithm adds another decision tree, fitted to the errors of the ensemble so far, to nudge the predicted values closer to the ground truth. So the final model, as far as I understand it, ends up as a weighted sum of many small decision trees rather than one huge tree.
I’d recommend reading the Towards Data Science article linked above. It’s a great explanation of how GBDT works.
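To make that concrete, here’s a minimal sketch of gradient boosting for a regression target, assuming squared-error loss. It uses scikit-learn’s DecisionTreeRegressor; the function names and defaults are mine, not anything Airbnb describes.

# A minimal sketch of gradient boosting for regression (assumed, not Airbnb's code).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    base = y.mean()                            # start from a constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction             # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                 # fit a small tree to the residuals
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_gbdt(base, trees, X, learning_rate=0.1):
    # The final model is the base value plus the (scaled) sum of all the trees.
    return base + learning_rate * sum(tree.predict(X) for tree in trees)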
Facebook use GBDT models for ranking news feed (or at least, they did in 2017!) because they’re efficient at inference time.
That GBDT model drove most of the initial gains at Airbnb, but those gains plateaued.
They had a long period of “neutral experiments”.
I don’t know exactly what ‘neutral’ means here, but I presume it means experiments that failed to demonstrate a statistically significant improvement in online metrics. We went through something similar!
Since they were seeing diminishing returns from the GBDT model, they write that the moment was ripe for trying big changes to the system.
They then transitioned from using a traditional machine learning model to a NN.
Airbnb have an ‘ecosystem’ of models, which contribute towards deciding which listings to display. For example they have models that contribute extra features.
I can understand the reason for having lots of models involved, aside from the primary ranking model.
For instance, we’re thinking about implementing a click model to turn user clicks into training data. This would replace our current fairly hacky but surprisingly OK method.
We’ve also thought about using multiple different models depending on the time of day, because weekend and night-time traffic is so different to traffic during working hours. That is, we’d only do this if we weren’t able to represent that difference to a single model as a feature.
There are also situations where people run a collection of models, and request predictions from all of them, in order to get a consensus.
The paper discusses the main ranking model - “the most complex piece of the ranking puzzle”.
Like us, they use NDCG as their main offline metric. They care about relative gains in NDCG.
However, unlike us, they just use relevance scores of 0 and 1 rather than 0 to 3, where 1 = booked and 0 = not booked. This is domain-specific, since a user doing a search is unlikely to book multiple listings from one set of search results. In our case, a user might want multiple documents from the results if they’re doing some research. Relevance isn’t binary for us, which is why we assign a relevance score between 0 and 3 to query-document pairs in training data.
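To make the metric concrete, here’s a minimal sketch of how NDCG can be computed for a single result list; the relevance grades in the examples are made up.

# A minimal NDCG sketch (illustrative only).
import math

def dcg(relevances):
    # Discounted cumulative gain: relevant results near the top count more.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    # Normalise by the DCG of the ideal ordering, so 1.0 means a perfect ranking.
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

print(ndcg([0, 1, 0, 0]))  # binary grades: the booked listing ranked second
print(ndcg([2, 0, 3, 1]))  # graded relevance on our 0-3 scale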
They also do A/B testing of their online metrics to see if they can get a statistically significant increase in conversions. We’re doing the same.
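The paper doesn’t go into the statistics, but this kind of conversion-rate A/B test typically comes down to something like a two-proportion z-test. Here’s a sketch with made-up numbers, not Airbnb’s actual method.

# A two-proportion z-test sketch for a conversion-rate A/B test (made-up numbers).
import math

def two_proportion_z(conversions_a, visitors_a, conversions_b, visitors_b):
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return (p_b - p_a) / se  # compare against ~1.96 for 95% confidence

print(two_proportion_z(480, 10000, 540, 10000))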
They switched to a NN by getting a NN model to be “booking neutral” against their existing GBDT model.
I quite liked the Andrej Karpathy advice they quote regarding model architecture: “don’t be a hero” - though they say they followed it with caveats!
It’s something we followed in implementing learning to rank. However, we were still able to significantly improve our metrics with learning to rank, since we weren’t running a GBDT model - we went from a manually tuned function to a NN in one leap.
Getting their NN model to be “booking neutral” validated that the NN was production ready and that the pipeline worked.
They initially gave the NN the same features as the GBDT model, but had to do some extra feature engineering to get it to work well.
They used an L2 regression loss function. I’d like to learn more about that.
They next used Lambdarank to directly optimise the NN for NDCG, rather than using a regression-based formulation. They used a cross-entropy loss.
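As I understand the LambdaRank idea, it’s a cross-entropy loss over (booked, not-booked) pairs, with each pair weighted by how much NDCG would change if the two listings swapped positions. A rough sketch with made-up scores, not the paper’s code:

# My sketch of a LambdaRank-style pairwise loss (illustrative numbers).
import math

def pairwise_loss(score_booked, score_not_booked, delta_ndcg):
    # Probability the model ranks the booked listing above the not-booked one.
    p = 1.0 / (1.0 + math.exp(score_not_booked - score_booked))
    # Cross entropy against the desired outcome (booked ranks higher), scaled
    # by the NDCG swap delta so pairs near the top of the results matter more.
    return -delta_ndcg * math.log(p)

print(pairwise_loss(score_booked=0.2, score_not_booked=0.9, delta_ndcg=0.35))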
As a side note, while DNNs have achieved human level performance on certain image applications, it is very hard for us to judge where we stand for a similar comparison. Part of the problem is that it’s unclear how to define human level performance.
Yes, we’ve found the same thing. Some time ago we implemented admin software for search, enabling domain experts to create ‘best bets’ for a search term. It was a way to hard-code some documents to be top-ranked for a given query. However, one problem with this is that domain experts can have different opinions about what is relevant, and their views can also differ from what the user wants.
I found the section on feature engineering particularly interesting, and I’ve yet to try some of what they said worked.
Feature engineering: This flavour of feature engineering is different from the traditional one: instead of deriving the math to perform on the features before feeding them into the model, the focus shifts to ensuring the features comply with certain properties so that the NN can do the math effectively by itself
They write that feature normalisation was important. Features must be normalised, otherwise large feature values can saturate the network and lead to vanishing gradients during backpropagation; e.g. a feature value should mostly fall between -1 and 1.
They looked at the distributions of their features to investigate this, which looked interesting. I believe we can get these into TensorBoard.
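If I remember the paper right, they map roughly normally distributed features to zero mean and unit variance, and squash power-law features with a log transform. A rough sketch of that idea; the helper names and example data are mine:

# Rough feature normalisation sketch (helper names and data are assumptions).
import numpy as np

def normalise_normal_feature(values):
    # Roughly normally distributed features: centre on zero, unit variance.
    return (values - values.mean()) / values.std()

def normalise_power_law_feature(values):
    # Heavy-tailed features (prices, view counts, etc.): a log transform keeps
    # most values in a small range around zero.
    return np.log((1.0 + values) / (1.0 + np.median(values)))

prices = np.array([40.0, 55.0, 80.0, 120.0, 900.0])
print(normalise_power_law_feature(prices))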
The strength of NNs is in figuring out nonlinear interactions between the features. This is also the weakness when it comes to understanding what role a particular feature is playing, as nonlinear interactions make it very difficult to study any feature in isolation. But how, then, do you figure out which features have the biggest impact, so you can prioritise engineering effort?
One way was to remove features one at a time to observe the impact on NDCG. They also built a score decomposition tool for NNs.
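A rough sketch of the remove-one-feature-at-a-time idea; train_fn and evaluate_ndcg are hypothetical callbacks standing in for a training run and an offline evaluation, not Airbnb’s tooling.

def feature_ablation(features, train_fn, evaluate_ndcg):
    # train_fn trains a model on a list of feature names; evaluate_ndcg scores
    # it offline. Both are hypothetical placeholders.
    baseline = evaluate_ndcg(train_fn(features))
    impact = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        impact[feature] = baseline - evaluate_ndcg(train_fn(reduced))
    # Largest NDCG drop ~ most important feature, though correlated features
    # can mask each other's contribution.
    return sorted(impact.items(), key=lambda kv: kv[1], reverse=True)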
The section on system engineering was quite interesting. They’re working at a much larger scale than we are, but it’s cool to read about the efficiency gains they made. They wrote a custom neural network scoring library in Java. I wonder, now that TensorFlow Ranking has been released, whether they’d use that instead.
I really liked how they ended the paper:
Most of us have managed to bring down the metrics singlehandedly at some point. But lifting the metrics have always been the work of a collective.
We’ve been able to get learning to rank to work by collaborating heavily. Working across disciplines has helped: we’ve involved analysts, data scientists, designers, developers, product managers, and more! Working closely in a diverse team has helped us a lot.
I’m still learning things when I return to this paper. I’d like to read more about learning to rank, so I think looking at the references in the paper is a good starting point.