Supervised by: Ujjayanta Bhaumik MSC, B.Tech (Hons). Ujjayanta Bhaumik is currently pursuing his PhD in Computer Science and Physics with a focus on Virtual Reality in the Light and Lighting Lab, KU Leuven. He worked as a creative software developer for over 2 years. He has a masters in Computer Graphics, Vision and Imaging from University College London.
Abstract
Video content recommendation systems are revolutionizing how we consume media, with tailored access to the things we consume, such as displaying our favorite movies and genre-related movies first on the home page.
With more movies and TV shows being released every day, the choices for a viewer are overwhelming. It is important that systems are able to recommend the shows that will catch people’s interest from the wide assortment available. This is where AI steps in.
This paper aims to investigate how AI works within the context of video content recommendations and will specifically focus on how streaming services like Netflix, Sky, and Amazon Prime Video decide which content to present to their subscribers to choose from. This paper will also explore some of the limitations of content recommendations and how future developments aim to overcome them.
What are video content recommendation systems?
Video content recommendation systems are software systems used to present users of a video platform with suggestions of content to watch. Recommenders use a mix of techniques to assess the content available at the time in order to assist the user in finding something to watch. The techniques involve various algorithms (or engines), complex data processing, and machine learning in order to continuously improve the accuracy of their recommendations.
Evolution of recommendation
The general public first had enough content delivered into their homes to have a choice about what to consume in the 1920s, with the adoption of broadcast radio, and then from the 1930s onwards with television. The earliest edition of the Radio Times, published in 1923, may therefore be considered one of the earliest attempts to inform users about which content to consume (1).
Linear Television (TV) was the predominant medium for the next 60 years, and as satellite and cable TV took off from the 1980s onwards, there was much more content available than time to watch it. With so much content available, a better way was sought than searching through TV listings, magazines, or their digital equivalent, Electronic Program Guides (EPGs).
Editorial recommendations such as “Top Picks” started being used to guide users to certain content, but as the move to downloadable and then streaming content emerged throughout the 2000s, the amount of content and therefore viewing choices increased dramatically.
Editorial recommendations are also used by brands for the double exposure and validation they bring: if a large brand endorses a TV show, other advertisements of that TV show are more likely to reference that brand. This makes the brand seem trustworthy and important.
One of the early streaming services, Netflix, led the field with a user experience (UX) that was predominantly based on recommendations. The Netflix Prize was a competition they ran between 2006 and 2009 which invited people to try to improve Netflix’s recommendation algorithm. This brought widespread awareness of the utility and methods of recommenders, and of the use of Data Science and AI to make relevant recommendations, to the wider industry.
Why are content recommendation systems required?
User Goals
The user watches TV for entertainment and generally as part of their leisure activities. If finding something to watch becomes an obstacle to their entertainment, their user experience (UX) will suffer, and there is an elevated chance of them switching to competitors.
A typical user of a video platform will have access to hundreds of linear TV channels, many video streaming apps, and other apps with video content. Netflix is just one of those sources and in 2021 had 6074 titles available in the US (6). At any one time, a user may have tens of thousands of titles available to choose from. This is an overwhelming amount of content and creates the “Paradox of choice”, which the author Barry Schwartz summarises as “choose less and feel better.” When a user is presented with too many options, they are more likely to choose none (8).
Editorially selected recommendations, usually chosen by an experienced editor, are too blunt an instrument to be useful over such a large set of content. They are helpful but apply a “one size fits all” approach and don’t deliver any level of personalization or advanced techniques because the human curator is limited to their knowledge and biases. Put simply, they don’t scale well.
Business Goals
The video platforms themselves have a lot to gain from recommenders. Each video platform may have a slightly different business model, but they are generally:
- Advertisement funded – content is free to access, and adverts are inserted. The video platform gets paid for the number of people that see the advert. Examples: YouTube and Linear TV broadcasters such as ITV.
- Pay Per View – the user pays for a short-term rental or to “buy to own” the content. Example: Sky Store.
- Subscription – the user pays a monthly fee to access all available content during their subscription period. Examples: Netflix and Amazon Prime Video.
Common to all business models is the requirement for users to find and consume more content, either to generate one-time payments, see more adverts, or continue to subscribe to the platform. If accurate recommendations are not given, then the user may leave the video platform and go to a competitor.
Technical Aspects
Recommenders typically employ more than one “engine” consisting of multiple layered algorithms to populate carousels (also known as rails or stripes) of content that make up the user interface. Each carousel may use one or a mix of engines to populate the content. Let’s consider which engines relate to which carousels in a typical video platform and where AI applies.
Pictured below is an example of a carousel taken from Netflix. The account was previously used to watch kids’ content, so Netflix places a ‘Children and Family TV’ carousel near the top of the user’s homepage.
Figure 1. An example of a personalized carousel, taken from Netflix.
More Like This
The “More Like This” carousel has content similar to the current content selected or just watched. The content in this rail is populated by building a content profile for each title. For series, each episode is usually the same profile as the series itself, but this is not universally true. The profile is built using important characteristics of the title like genre, year of production, cast, crew, duration, and format. Then a table of scores from 0 to 1 is calculated using Cosine similarity, with 0 being the least similar and 1 being the most similar. This allows the recommender to suggest the most similar content to the one currently being browsed or watched.
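As a minimal sketch of the scoring step described above, the following Python computes cosine similarity between content profiles and ranks the closest titles. The titles, the six one-hot features, and the function names are hypothetical and chosen only for illustration; a production profile would carry far more characteristics.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical one-hot profiles:
# [comedy, drama, sci-fi, 1990s, 2010s, feature-length]
profiles = {
    "Title A": [1, 0, 1, 0, 1, 1],
    "Title B": [1, 0, 1, 0, 1, 0],
    "Title C": [0, 1, 0, 1, 0, 1],
}

def more_like_this(title, profiles, top_n=2):
    """Rank every other title by similarity to the given one, most similar first."""
    target = profiles[title]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in profiles.items()
        if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

print(more_like_this("Title A", profiles))
```

Because every feature here is non-negative, the scores stay within the 0 to 1 range mentioned above; "Title B" shares three features with "Title A" and so ranks ahead of "Title C", which shares only one.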
Further refinement of this method is possible by analyzing the descriptive metadata about each piece of content, e.g., the synopsis. This can also extend to subtitles (captions), audio descriptions, and transcripts where they are available. The words within the descriptive metadata are processed to down-weight commonly occurring words, which means that unusual (low-frequency) words stand out, and these are more likely to give insight into the nature of the piece of content. This process uses a method called “Term Frequency – Inverse Document Frequency” (TF-IDF), which weighs how often a word occurs within one piece of descriptive metadata against how common that word is across a reference collection of documents (for example, the whole catalog), giving the relative importance of each word. Lemmatization can also be applied to reduce related word forms to a common base so that a mention of “run” and “running” will result in a frequency of 2 for the word “run”.
Figure 2. Calculating Cosine Similarity.
Once these significant words are identified within the descriptive metadata, they can be used as inputs into the similarity score and as ways to match search results when a user performs a search.
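To make the weighting concrete, here is a small sketch of TF-IDF over three invented synopses. It assumes the common formulation in which a word's frequency within one synopsis is multiplied by the logarithm of the inverse of its document frequency across the collection; the synopses and function names are hypothetical, and lemmatization is left out for brevity.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {word: weight} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many synopses does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical synopses, invented for this example.
synopses = [
    "A detective hunts a killer in the city",
    "A chef opens a restaurant in the city",
    "A detective and a chef solve a mystery",
]

weights = tf_idf(synopses)
top_words = sorted(weights[0], key=weights[0].get, reverse=True)[:3]
print(top_words)
```

A word such as "a" appears in every synopsis, so its weight is zero, while "killer" appears in only one and therefore outweighs "detective", which appears in two.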
You Might Like
‘You Might Like’ carousels use Collaborative Filtering to select content. Collaborative filtering uses the user profile of the individual user and matches them to other similar users based on their viewing history and any indication of preferences, such as likes and ratings. The recommender can then suggest content that other, similar users like.
There are multiple methods to determine similarity, such as K-Nearest-Neighbour over users, or item-based approaches that compare titles instead. Where the user hasn’t rated a piece of content, a prediction of their likely rating can be made by taking the weighted average of the ratings given by similar users.
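The weighted-average prediction can be sketched as follows. This is an illustrative, user-based nearest-neighbour example under simplifying assumptions: the users and ratings are invented, similarity is plain cosine similarity over co-rated titles, and ratings are not mean-centred as they often are in practice.

```python
import math

# Hypothetical ratings on a 1 to 5 scale: user -> {title: rating}
ratings = {
    "ana":  {"T1": 5, "T2": 4, "T3": 1},
    "ben":  {"T1": 5, "T2": 5, "T3": 1, "T4": 5},
    "cara": {"T1": 1, "T2": 2, "T3": 5, "T4": 1},
}

def similarity(u, v):
    """Cosine similarity over the titles both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(u[t] ** 2 for t in shared))
    norm_v = math.sqrt(sum(v[t] ** 2 for t in shared))
    return dot / (norm_u * norm_v)

def predict_rating(user, title, ratings, k=2):
    """Weighted average of the k most similar users' ratings for the title."""
    neighbours = [
        (similarity(ratings[user], other_ratings), other_ratings[title])
        for other, other_ratings in ratings.items()
        if other != user and title in other_ratings
    ]
    neighbours = sorted(neighbours, reverse=True)[:k]
    total_weight = sum(sim for sim, _ in neighbours)
    if total_weight == 0:
        return None  # nobody comparable has rated this title
    return sum(sim * rating for sim, rating in neighbours) / total_weight

print(predict_rating("ana", "T4", ratings))
```

Here "ben" rates titles much as "ana" does, so his rating of 5 for "T4" pulls the prediction above 3, the plain average of the two available ratings.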
More advanced forms of user clustering can be used to understand not only viewing history and ratings but more complex user behaviors such as frequency and patterns of watching, which then allow more accurate targeting of similar users.
Continue Watching
Continue Watching is an important set of content as the user has already discovered and started to watch the content. This indicates some level of preference for the content, especially where several episodes of a series may have already been watched. If the next episode in the series is suggested, then it is likely the user will find this an accurate recommendation. Coupled with the time of day and day of the week from the user’s viewing history, the recommendations can be further optimized for accuracy.
Social Recommendations
Social recommendations typically arise through social media platforms where a friend or connection can suggest something for the user to watch. Alternatively, the user can share their viewing history with their connections, which allows those connections to be influenced to watch new content.
Trending
Trending content may also be identified through social media; for example, if there is a spike in Tweets around a live sporting event, then the hashtags for that event can identify the content and influence video platform users to watch it.
Trending carousels are also used within the confines of the video platform itself. In the absence of a more compelling recommendation, users can be influenced by social proof to watch something – because lots of other people are watching it too.
Algorithms that calculate trending content need to balance their scoring depending on the type of content. For example, news content may have a shorter trending score before being replaced by fresher news content, whereas movies that have trended over the last few weeks may still make accurate recommendations for users interested in watching a movie.
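One way to express this balance is a time-decayed view count with a different half-life per content type. The sketch below is an assumption made for illustration rather than a description of any platform's algorithm; the half-life values and names are hypothetical.

```python
# Hypothetical half-lives: news fades within hours, movies over weeks.
HALF_LIFE_HOURS = {"news": 6.0, "movie": 24.0 * 14}

def trending_score(view_times, content_type, now_hours):
    """Sum of views, each discounted exponentially by its age in hours."""
    half_life = HALF_LIFE_HOURS[content_type]
    return sum(0.5 ** ((now_hours - t) / half_life) for t in view_times)

# Four views that all happened 48 hours ago.
views = [0.0, 0.0, 0.0, 0.0]
news_score = trending_score(views, "news", now_hours=48.0)
movie_score = trending_score(views, "movie", now_hours=48.0)
print(news_score, movie_score)
```

With identical viewing activity, the news item has almost stopped trending after two days, while the movie keeps most of its score.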
Promoted Content
Promoted content is clearly labeled on eCommerce websites but is more subtle within video platforms. Promoted content is generally content that the video platform will make the most revenue from, and therefore they want to place it in a more prominent position within the user interface. Where video platforms create their own content, for example, Netflix, Amazon, and Sky originals, that content will also be shown prominently as a reminder of the exclusive content only available on that video platform.
Promoted content carousels may be labeled as “New in” or “Last chance to watch” but generally, the most profitable content is blended into other carousels in a way that doesn’t disrupt them. A promoted piece of content may be placed by the algorithm only once every ten pieces of content in a carousel, or the first piece of content in each carousel may be a promoted piece, but the rest of the content follows the underlying algorithm.
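The "once every ten" placement can be sketched as a simple interleaving step applied after the underlying algorithm has ranked the organic titles. The function and title names below are hypothetical.

```python
def blend_promoted(organic, promoted, every=10):
    """Place one promoted title in every `every`-th carousel position."""
    if every < 2:
        raise ValueError("every must be at least 2")
    result = []
    promoted_iter = iter(promoted)
    for index, title in enumerate(organic, start=1):
        result.append(title)
        # After each run of (every - 1) organic titles, slot in a promoted one.
        if index % (every - 1) == 0:
            promoted_title = next(promoted_iter, None)
            if promoted_title is not None:
                result.append(promoted_title)
    return result

organic = [f"O{i}" for i in range(1, 19)]  # 18 organically ranked titles
carousel = blend_promoted(organic, ["P1", "P2"])
print(carousel)
```

The organic order is preserved, and the promoted titles land in the tenth and twentieth positions.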
Problems in making accurate recommendations
Collaborative Filtering vs Content-Based Filtering
When determining the preferences of a user, two types of filtering are commonly used: Collaborative filtering and content-based filtering.
Collaborative filtering uses the data of users with similar preferences to the user to select what shows to recommend. It attempts to track the interactions of each user, then groups similar users together so that it can recommend one user’s activities to the other users in the same group. This holds some advantages over content-based filtering, with the most obvious being that no knowledge of the data is required: the algorithm doesn’t need to know anything about the content itself, only that similar users are engaging with it. This can save companies a lot of time and resources that would otherwise need to be used to collect and store data on every piece of content.
Content-based filtering, on the other hand, uses metadata (data that describes other data) to find content similar to that which the user has already interacted with. In the context of TV shows, this could include anything from broad categories, such as the genre or episode length, to much more niche categories, such as particular actors or directors. When comparing the similarity of two programs, the engine will compare all data of each one using an algorithm such as cosine similarity (7). An advantage of content-based over collaborative filtering is that new content that is introduced to the platform is more likely to perform well; collaborative filtering runs the risk of creating a closed loop of similar users who are recommended the same content repeatedly, while newer content with very few views and ratings is less likely to be recommended at all. Clever implementation of collaborative filtering, such as artificially boosting the popularity of new content, can help to reduce this issue, but content-based filtering has the advantage of circumventing it in the first place by recommending newer shows based on their content rather than their popularity. Another advantage is that content-based filtering does not suffer as much from the ‘cold start’ problem, discussed later in this paper.
Additionally, many modern algorithms use a ‘hybrid’ system, layering both types of filtering to get a broader sense of what should be recommended to the user next.
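One simple way to layer the two approaches, shown here purely as an illustrative assumption, is a weighted blend of the two scores that falls back to the content-based score when no collaborative signal exists. The function name and the `alpha` weight are hypothetical.

```python
def hybrid_score(content_score, collaborative_score, alpha=0.5):
    """Blend two scores in the range 0 to 1.

    `alpha` is the weight given to the content-based score. When there is no
    collaborative score (for example, a title nobody has watched yet), the
    content-based score is used on its own.
    """
    if collaborative_score is None:
        return content_score
    return alpha * content_score + (1 - alpha) * collaborative_score

print(hybrid_score(0.8, 0.4, alpha=0.25))  # blended score
print(hybrid_score(0.8, None))             # new title, content-based only
```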
While researching this paper, we interviewed Sophia Goldberg and Jian Li from Sky, who told us the following:
“There isn’t a single better recommendation engine, but rather it depends on the problem that you’re trying to solve: if you know that people have watched something, liked it, and they want to find [content] that’s similar to that, content-based is better, but if you’re in a situation where there isn’t one specific article, but you want to [predict] the whole user’s personality, such as the way they watch the content or the times they watch it, … then collaborative filtering is your friend. So I think it’s just about ‘what is the business problem that you’re facing’.”
Sophia mentions some potential aims of content filtering: if the user has enjoyed a show and wants to find more like it, content-based filtering would serve that purpose better. On the other hand, a user who turns the TV on with no particular show in mind would benefit more from collaborative filtering. These scenarios mean that the user journey may require multiple algorithms at different points.
Cold Start and prioritizing content within carousels
Now that we have explored the main recommender engines, we must consider how recommenders prioritize which content to place in which order in each of the carousels. Even with a similarity score being calculated, there may be hundreds of similar titles with the same score (or very similar scores). Which of those should be shown in position 1 in the carousel? This problem is solved by using a popularity score for each piece of content. Assuming that the content being recommended has filtered out previously watched content, the popularity score will be for unwatched content for this user and therefore come from other users’ ratings of the content. Where there is a sparse set of ratings, ratings from external data sources like IMDb, TMDb, and local TV and magazine websites can be used as a supplement.
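The ordering logic described above can be sketched as a sort on similarity with popularity as the tie-breaker. This is a simplified illustration: all names and figures are invented, both rating sources are assumed to share a 1 to 5 scale, and the external score simply replaces the platform average when ratings are sparse rather than being blended with it.

```python
from statistics import mean

def popularity(title, platform_ratings, external_ratings, min_votes=3):
    """Mean platform rating, or an external score when ratings are sparse."""
    votes = platform_ratings.get(title, [])
    if len(votes) >= min_votes:
        return mean(votes)
    return external_ratings.get(title, 0.0)

def order_carousel(similarities, watched, platform_ratings, external_ratings):
    """Drop watched titles, then sort by similarity with popularity breaking ties."""
    unwatched = [title for title in similarities if title not in watched]
    return sorted(
        unwatched,
        key=lambda title: (
            round(similarities[title], 2),
            popularity(title, platform_ratings, external_ratings),
        ),
        reverse=True,
    )

# Hypothetical data: A, B, and C tie on similarity; C has already been watched.
similarities = {"A": 0.90, "B": 0.90, "C": 0.90, "D": 0.75}
watched = {"C"}
platform_ratings = {"A": [3, 4, 3], "B": [5]}   # B has too few ratings
external_ratings = {"B": 4.5}                   # e.g. taken from an external site

print(order_carousel(similarities, watched, platform_ratings, external_ratings))
# → ['B', 'A', 'D']
```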
In our interview with Sophia Goldberg and Jian Li, we discovered that Sky’s recommendation engine only requires two instances of a user consuming content to get a basic understanding of their preferences.
Measuring the accuracy of recommendations
When recommendations have been made, it is important to know whether those recommendations were deemed accurate by the user. This is a subjective measure but the user’s interactions with the video platform – the clickstream data – give implicit feedback, which can be used to calculate whether the user likes or dislikes the content. Factors that indicate an accurate recommendation are:
- The user hovered over the recommendation.
- The user browsed the information about the program.
- The user played the trailer for the program.
- The user started watching the program.
- The user watched a large share of the program. This is the Consumption metric, which can be measured in minutes or in quartiles (how many quarters of the program the user watched) (3).
- A high rating score.
Factors that indicate an inaccurate recommendation are:
- No user interaction with the program information.
- An early abandonment once they’ve started watching the content.
- A low rating score.
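The positive and negative signals listed above could be combined into a single implicit-feedback score, for example as in the sketch below. The weights, penalty values, and event names are hypothetical and chosen only to show the idea; real systems tune such values against observed behavior.

```python
# Hypothetical weights for each clickstream event.
SIGNAL_WEIGHTS = {"hover": 0.5, "info": 1.0, "trailer": 1.5, "play": 2.0}

def feedback_score(events, watched_fraction=0.0, rating=None):
    """Combine clickstream signals; a score above 0 suggests a good recommendation."""
    if not events and rating is None:
        return -1.0  # the recommendation was shown but ignored
    score = sum(SIGNAL_WEIGHTS.get(event, 0.0) for event in set(events))
    if "play" in events:
        quartiles = min(4, int(watched_fraction * 4))  # completed quarters, 0 to 4
        # Abandoning within the first quarter counts against the recommendation.
        score += quartiles if quartiles > 0 else -3.0
    if rating is not None:
        score += rating - 3  # ratings assumed to run from 1 to 5
    return score

print(feedback_score(["hover", "info", "play"], watched_fraction=0.8, rating=5))
print(feedback_score(["play"], watched_fraction=0.05, rating=1))
print(feedback_score([]))
```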
Content metadata sparsity and consistency
Most of the AI used to make recommendations relies on accurate and comprehensive metadata, that is, the information about the programs like title, synopsis, cast, year, and genre. Metadata may be much more accurate and comprehensive for big-budget movies and series, but it is not uniformly available across all titles. While methods like Spectral Regularisation can be used to predict user ratings, these do not work for trying to predict missing cast member data (5).
Inappropriate content
Targeting the right recommendation to the right user has complexities. Some video platforms, such as YouTube, do not require the user to set up a profile, which may be seen as a barrier to the user adopting the video platform in the first place. Without a user profile, different users within the same household may all be understood to be the same person. Sophisticated viewing pattern detection can be employed to determine individual users.
The converse problem is knowing who the user is but not recognizing that someone else may be using that profile, or that group viewing may be taking place. This also compromises recommendation accuracy.
Children make up a considerable amount of the overall hours viewed as they generally have more free time (6). Ideally, they should only be shown age-appropriate content and be protected from harmful content. Age rating and content classification systems go some way to address this, but ultimately only dedicated child-friendly video apps like BabyTV, or the kids’ versions of Netflix and YouTube, can offer a level of certainty that inappropriate content will not be shown.
Repetitive recommendations
As popularity and explicit, positive user feedback help AI algorithms make more accurate recommendations, there is a danger that the self-reinforcing nature of this positive feedback loop causes very similar content to be recommended continuously. While these recommendations are strictly accurate, the experience for the user is that very similar content is recommended frequently, leading to either all the relevant content being watched too quickly or a sense of fatigue.
Another problem with repetitive recommendations is that the Effective Catalogue Size, that is, the proportion of the whole content catalog that a user is aware of, stays small. A Vimeo study of OTT providers showed that once a video platform has over 200 videos in their service, an increase of 10 titles to a user’s effective catalog size correlates to the user watching 2 hours more video (2).
The future of video content recommendation systems
Video content recommendation systems have been around for more than 20 years, and the so-called ‘streaming wars’ have accelerated innovation in order for video platforms to obtain and keep subscribers.
Deep learning goes beyond the basic content metadata to use advanced natural language processing, such as latent dimension analysis, to determine topics and similar phrases within associated content to uncover less obvious relationships between content (4).
Another way to generate a deeper understanding of the content so that similar content can be more accurately predicted is to use computer vision to process the video of the content. There are commercially available services like Amazon Rekognition that can identify people, places, and objects within video. All of these can be used to find similar content, for example, content where the location of the filming is not mentioned but the Eiffel Tower is detected in the video, which means that an association can be created between those pieces of content and a weighting applied to indicate how strong a similarity has been detected.
This kind of data can be stored in a graph database where entities like the Eiffel Tower are stored as nodes connected by edges. Each edge has a weight to indicate the strength of the association.
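As a rough in-memory stand-in for such a graph database, the sketch below stores titles and detected entities as nodes joined by weighted edges, then finds titles that share an entity. The class, titles, and weights are hypothetical, and scoring a connection by the product of the two edge weights is an assumption made for this example.

```python
from collections import defaultdict

class ContentGraph:
    """Undirected graph of titles and entities with weighted edges."""

    def __init__(self):
        self.edges = defaultdict(dict)  # node -> {neighbour: weight}

    def add_edge(self, a, b, weight):
        """Connect two nodes; the weight is the strength of the association."""
        self.edges[a][b] = weight
        self.edges[b][a] = weight

    def related_titles(self, title):
        """Titles reachable through one shared entity, strongest link first."""
        scores = defaultdict(float)
        for entity, w1 in self.edges.get(title, {}).items():
            for other, w2 in self.edges.get(entity, {}).items():
                if other != title:
                    scores[other] = max(scores[other], w1 * w2)
        return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

graph = ContentGraph()
# Hypothetical detection confidences linking titles to a detected landmark.
graph.add_edge("Film A", "Eiffel Tower", 0.9)
graph.add_edge("Film B", "Eiffel Tower", 0.6)
graph.add_edge("Film C", "Eiffel Tower", 0.2)

related = graph.related_titles("Film A")
print(related)
```

"Film B" is ranked above "Film C" because the landmark was detected in it with more confidence, which gives a stronger combined association with "Film A".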
The context in which recommendations are made is one of the largest areas for improvement. Whether a user of the video platform is having breakfast, on the train, or with friends can make a huge difference to the type, genre, and length of the content they will be most likely to watch if it is recommended. Overcoming privacy issues and being able to access location, nearby device information, and other indicators could make recommendations much more accurate.
Conclusion
In summary, this paper set out to explain how AI is employed in video content recommendation systems in order to help viewers make choices amongst the overwhelming selection of content available. The main carousel types and the AI algorithms employed have been explained, as well as how business goals are balanced with the viewer’s enjoyment.
While there are many limitations of video content recommendation systems, their adoption on all major video streaming services indicates that they’re here to stay. There is sufficient evidence of future development to believe that the next generation of recommenders will offer a more accurate and valuable set of recommendations.
Figure 3. A diagram demonstrating the differences between content-based and collaborative filtering.
References
[1] BBC (no date) ‘The Radio Times’. Available at: https://www.bbc.com/historyofthebbc/research/radio-times (Accessed: 22 August 2022).
[2] Deep discovery (no date) Tivo.com. Available at: https://business.tivo.com/products-solutions/metadata/deep-discovery (Accessed: 22 August 2022).
[3] Key performance indicators video service providers must track to drive views and engagement (no date b) Tivo.com. Available at: https://business.tivo.com/content/dam/tivo/resources/whitepapers/tivo_kpi_white_paper.pdf (Accessed: 22 August 2022).
[4] Lineberry, A. (2018) Creating a hybrid content-collaborative movie recommender using deep learning, Towards Data Science. Available at: https://towardsdatascience.com/creating-a-hybrid-content-collaborative-movie-recommender-using-deep-learning-cc8b431618af (Accessed: 22 August 2022).
[5] Mazumder, R., Hastie, T., and Tibshirani, R. (2010) ‘Spectral regularization algorithms for learning large incomplete matrices’, Journal of machine learning research: JMLR, 11, pp. 2287–2322. Available at: https://web.stanford.edu/~hastie/Papers/mazumder10a.pdf (Accessed: 22 August 2022).
[6] ‘Netflix statistics in 2022: The status of the internet streaming giant’ (2020) Internetadvisor.com. InternetAdvisor, 30 March. Available at: https://www.internetadvisor.com/netflix-facts-statistics (Accessed: 22 August 2022).
[7] Prabhakaran, S. (2018) Cosine Similarity – Understanding the math and how it works? (with python), Machine Learning Plus. Available at: https://www.machinelearningplus.com/nlp/cosine-similarity/ (Accessed: 22 August 2022).
[8] Schwartz, B. (2016) The paradox of choice: Why more is less, revised edition. New York, NY, USA: ECCO Press.