Recommender Systems, Not Just Recommender Models
by Even Oldridge and Karl Byleen-Higley
One of the biggest challenges facing people new to building recommender systems is the lack of understanding around what these systems look like in the real world. The majority of the online content around recommender systems focuses on models and is often limited to a simple example of collaborative filtering. For new practitioners, there is an enormous gap between examples of simple models and a production system that serves recommendations.
In this blog post we’ll share a pattern that we feel covers the majority of recommender systems deployed today, with examples from companies like Meta, Netflix, and Pinterest. This pattern is central to how we think about building end-to-end recsys within the NVIDIA Merlin team, and we’re excited to share it with the broader community to help build an understanding of, and consensus on, what recommender systems (not just models) look like in production. If you’re interested in this content in a different format, we’ve also got a talk from KDD’s Industrial Recommender Systems workshop on the same topic.
Beyond Recommender Models
The role of a recommender model, whether it’s a simple collaborative filtering example or a deep learning model like DLRM, is ranking, or more accurately Scoring, the interest that a user may have in a set of items. But these scores by themselves aren’t enough to serve recommendations to that user in most real-world contexts. There are several reasons for this, which we’ll dig into below before we explore the solutions and how they shape the system that we end up with.
More items, more problems
The first issue that we run into at scale is the number of items being recommended. Item catalogues can grow to millions, hundreds of millions, even billions in extreme cases. Scoring every item for every user just isn’t feasible in those (or really most) cases. Scoring is computationally expensive. In practice, you start by quickly selecting a relevant subset of those items, say a thousand or ten thousand that you’ll score.
Enter 2-stage recommender systems. Before we score the items, we need to select a reasonably relevant set that contains the items that the user will eventually engage with. This stage is commonly referred to as the candidate Retrieval stage, but we’ve also heard it referred to as candidate generation. Retrieval models take many forms, including matrix factorization, two-tower, linear models, approximate nearest neighbor, and graph traversal, and are generally (much!) more computationally efficient than the scoring model. YouTube has an excellent paper from 2016 that is one of the first public references to this architecture, but the method was almost certainly in use at many other places and is commonly applied in industry. Eugene Yan has a great blog post on this topic, and his 2-stage images were the inspiration for the 4-stage recommender diagrams we’ll introduce below. It’s worth noting that it’s also commonplace to use multiple candidate sources in the same recommender system to ensure that a diverse set of candidate items is presented to the user, but we’ll save that topic for another blog.
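To make the two stages concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the random embeddings, the brute-force dot-product `retrieve` function (a real system would more likely use an approximate nearest neighbor index), and the `expensive_score` stand-in for a heavier scoring model. The point is only the shape: a cheap pass narrows 10,000 items to 100 candidates, and the costlier scorer runs on those 100 alone.

```python
import heapq
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec, item_vecs, k):
    """Stage 1 (Retrieval): cheap top-k by embedding dot product.

    Brute force here for clarity; this is a stand-in for an ANN index.
    """
    return heapq.nlargest(k, item_vecs, key=lambda i: dot(user_vec, item_vecs[i]))


def expensive_score(user_vec, item_vec):
    """Stage 2 (Scoring): hypothetical stand-in for a heavier model."""
    distance = sum(abs(u - v) for u, v in zip(user_vec, item_vec))
    return dot(user_vec, item_vec) - 0.1 * distance


def recommend(user_vec, item_vecs, n_candidates, n_results):
    """Retrieve a small candidate set, then score only those candidates."""
    candidates = retrieve(user_vec, item_vecs, n_candidates)
    scored = [(expensive_score(user_vec, item_vecs[i]), i) for i in candidates]
    scored.sort(reverse=True)
    return [i for _, i in scored[:n_results]]


# Synthetic catalog: 10,000 items with 8-dimensional embeddings.
rng = random.Random(0)
item_vecs = {i: [rng.uniform(-1, 1) for _ in range(8)] for i in range(10_000)}
user_vec = [rng.uniform(-1, 1) for _ in range(8)]

top = recommend(user_vec, item_vecs, n_candidates=100, n_results=10)
print(top)
```

The scorer is called 100 times per request instead of 10,000, which is the whole motivation for the retrieval stage.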
Anything but that!
While it might seem like we’re able to serve relevant recommendations at scale with these two stages, there are other constraints that a recommender system needs to support. In most contexts, there are items that you don’t want to show to the user: when the item is out of stock, when it’s not age appropriate, when the user has already consumed the content, or when licensing rights don’t allow you to show it in this user’s country.
Rather than relying on the scoring or retrieval models to infer this business logic and to recommend items appropriately, it’s necessary to add a Filtering stage to your recommender system. Filtering most frequently follows the Retrieval stage, but it can also be integrated with it (one of the more complex problems you run into with filtering is ensuring that enough candidates remain after retrieval) or can even follow scoring in some situations. Filtering allows you to apply business logic rules that would otherwise be impossible (or at least very hard) to enforce through the model. In some cases these are simple exclusion queries, but they can be more complex, as is the case for Bloom filters, which can be used to remove items that users have already interacted with.
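As an illustration of both kinds of filter, the sketch below combines a simple exclusion set with a small Bloom filter of already-seen items. The `BloomFilter` class, its sizing (`num_bits`, `num_hashes`), the SHA-256 based hashing, and the item IDs are all assumptions chosen for the example rather than anything a particular production system uses. A Bloom filter never reports a seen item as unseen, but it can occasionally report an unseen item as seen, so a few valid candidates may be dropped in exchange for a compact memory footprint.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def filter_candidates(candidates, seen, blocked_ids):
    """Drop blocked items (exact exclusion) and probably-seen items (Bloom)."""
    return [c for c in candidates if c not in blocked_ids and c not in seen]


# Hypothetical data: the user has already interacted with two items,
# and one item is out of stock.
seen = BloomFilter()
for item_id in ["item_1", "item_7"]:
    seen.add(item_id)
out_of_stock = {"item_3"}

candidates = [f"item_{i}" for i in range(10)]
kept = filter_candidates(candidates, seen, out_of_stock)
print(kept)
```

Because filtering shrinks the candidate set, it is worth retrieving more candidates than you ultimately need so that enough survive to be scored.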
Order up!
The three stages introduced so far, Retrieval, Filtering, and Scoring, provide us with a list of relevant recommendations at scale and their corresponding scores. These scores represent the scoring model’s guess at how interested the user will be in each item. Recommendations are generally presented to the user as a list, though, and that presents an interesting conundrum: the best list is unlikely to fully align with the individual item scores. Instead, we may want to provide a diverse set of items from among the candidates, or even show the user items outside their normal pool of recommended candidates so they can explore spaces they haven’t seen, helping to prevent filter bubbles.
In some literature and examples our third stage of the recommender system would be referred to as ranking, but the final rank (or position) at which a recommendation is shown to the user is rarely aligned directly with the output of the model. By providing an explicit Ordering stage we’re able to align the output of the model with other needs or constraints of the business.
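One simple way to picture an Ordering stage is a greedy re-rank that discounts an item’s score each time its category has already been placed in the list. This is a sketch under our own assumptions: the `order_with_diversity` function, the `penalty` value, and the toy categories and scores are made up for illustration, and real systems use a wide variety of ordering strategies.

```python
def order_with_diversity(scored_items, penalty=0.2, top_n=5):
    """Greedy ordering that penalizes repeated categories.

    scored_items: list of (item_id, category, score) tuples.
    Each pick lowers the effective score of remaining items in the
    same category by `penalty`.
    """
    remaining = list(scored_items)
    category_counts = {}
    ordered = []
    while remaining and len(ordered) < top_n:
        best = max(
            remaining,
            key=lambda it: it[2] - penalty * category_counts.get(it[1], 0),
        )
        ordered.append(best[0])
        category_counts[best[1]] = category_counts.get(best[1], 0) + 1
        remaining.remove(best)
    return ordered


# Hypothetical scoring output: three comedies outrank everything else.
scored = [
    ("a", "comedy", 0.95),
    ("b", "comedy", 0.90),
    ("c", "comedy", 0.85),
    ("d", "drama", 0.80),
    ("e", "documentary", 0.75),
]

print(order_with_diversity(scored, penalty=0.0, top_n=4))  # ['a', 'b', 'c', 'd']
print(order_with_diversity(scored, penalty=0.2, top_n=4))  # ['a', 'd', 'e', 'b']
```

With no penalty the list follows the raw scores and is dominated by one category; with the penalty the same scores produce a list that spans three categories.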
4-Stage Recommender Systems
These four stages of Retrieval, Filtering, Scoring, and Ordering make up a design pattern which covers nearly every recommender system that we’ve encountered or built. The diagram below shows these stages and presents an example of how each stage could be built. It’s significantly more complex than a basic recommender model, especially as we begin to think about deploying to production, but we think it accurately represents how most production recommender systems are built today.
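The four stages compose naturally as a pipeline of functions. The sketch below is a toy, not a description of how any particular system is implemented: the `Item` fields, the popularity-based `retrieval`, the affinity-weighted `scoring`, and the round-robin `ordering` are all hypothetical stand-ins, chosen so that each stage has a visible effect on a six-item catalog.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    category: str
    popularity: float
    in_stock: bool = True


def retrieval(catalog, user_categories, k):
    """Cheap candidate selection: popular items in categories the user likes."""
    pool = [it for it in catalog if it.category in user_categories]
    return sorted(pool, key=lambda it: it.popularity, reverse=True)[:k]


def filtering(candidates, already_seen):
    """Business rules: drop out-of-stock and already-consumed items."""
    return [it for it in candidates if it.in_stock and it.item_id not in already_seen]


def scoring(candidates, affinity):
    """Stand-in for a model: popularity weighted by category affinity."""
    return [(it, it.popularity * affinity.get(it.category, 0.0)) for it in candidates]


def ordering(scored, top_n):
    """Round-robin across categories so one category cannot dominate."""
    by_category = {}
    for item, _ in sorted(scored, key=lambda pair: pair[1], reverse=True):
        by_category.setdefault(item.category, []).append(item)
    ordered = []
    while len(ordered) < top_n and any(by_category.values()):
        for items in by_category.values():
            if items and len(ordered) < top_n:
                ordered.append(items.pop(0))
    return ordered


def recommend(catalog, user_categories, already_seen, affinity, k=50, top_n=3):
    candidates = retrieval(catalog, user_categories, k)
    candidates = filtering(candidates, already_seen)
    scored = scoring(candidates, affinity)
    return [it.item_id for it in ordering(scored, top_n)]


catalog = [
    Item("shoes_1", "shoes", 0.90),
    Item("shoes_2", "shoes", 0.80),
    Item("shoes_3", "shoes", 0.70, in_stock=False),
    Item("shoes_4", "shoes", 0.75),
    Item("hats_1", "hats", 0.60),
    Item("toys_1", "toys", 0.99),
]

result = recommend(
    catalog,
    user_categories={"shoes", "hats"},
    already_seen={"shoes_1"},
    affinity={"shoes": 1.0, "hats": 0.8},
)
print(result)  # ['shoes_2', 'hats_1', 'shoes_4']
```

Retrieval ignores the toy even though it is the most popular item, filtering removes the seen and out-of-stock shoes, scoring puts both remaining shoes ahead of the hat, and ordering interleaves the hat into second place.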
Examples in the Wild
Now that we have a pattern for describing recommender systems, let’s see how it stands up. First, exploring examples from common recsys tasks, we see that, at least at a high level, the four stages cover these use cases and demonstrate a consistent pattern.
Taking this a step further, we can look at examples from real-world recommender systems in order to see if we can recognize the four stages from our pattern.
First up, Meta’s Instagram has a great article about the query language they’ve developed to make recommender systems easier to develop: Powered by AI: Instagram’s Explore recommender system (IGQL query language). As we can see from the example they provide, this query language exactly maps to the four stages of our pattern:
Pinterest has a number of papers (Related Pins at Pinterest: The Evolution of a Real-World Recommender System, Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time, and Applying deep learning to Related Pins), the first of which contains this figure, which outlines their system architecture’s evolution over time. Again we see this same pattern, with the subtle difference that retrieval and filtering have been grouped into a single stage in their diagram.
Finally, in 2016 Instacart shared this architecture for providing recommendations, which follows the four stages directly. First, candidates are retrieved and previously bought items are filtered out. The top candidates are then scored, and the final results are reordered to improve the diversity of the set presented to the user.
Complex Systems
In our 4-stage diagram we’ve tried to articulate the components required to train, deploy, and support inference-time querying of all of the stages. This system is much more complex than a single model, and it’s no wonder that those who looked for information about recommender systems online and found only collaborative filtering models are overwhelmed when they actually try to build one.
In our next post we’ll look in more detail at some of the issues that this complexity introduces, and talk about some of the solutions we’re proposing in Merlin, our recommender system framework. For now we’ll leave you with this challenge: take a look at the recommender systems that you use and see if you recognize the four stages, and if you find any that don’t fit, please let us know! We’re constantly iterating on and refining our thoughts and our libraries to ensure that we’re providing the best solution we can for the RecSys space, and your input is appreciated.
As a final note for those of you who made it this far… If you’re passionate about building open source software that helps solve these problems, please reach out to us as well. Come chat with us at our online RecSys Summit! We have a number of open roles focused on making recommender systems easier to build and deploy. We’d love to hear from you.