The Universal Recommender (UR) is a new type of collaborative filtering recommender built on the Correlated Cross-Occurrence (CCO) algorithm, which can use data from a wide variety of user taste indicators. Unlike the matrix factorization embodied in tools such as MLlib’s ALS, the UR’s CCO algorithm can ingest any number of user actions, events, profile data, and contextual information. It then serves results in a fast and scalable way. It also supports item properties for filtering and boosting recommendations and can therefore be considered a hybrid collaborative filtering and content-based recommender.
Using multiple types of data fundamentally changes the way a recommender is used and, when employed correctly, significantly improves recommendation quality compared with using only one user event. Most recommenders, for instance, can only use “purchase” events. Using everything we know about a user and their context lets us predict their preferences much more accurately.
User | Action | Item |
---|---|---|
u1 | view | t1 |
u1 | view | t2 |
u1 | view | t3 |
u1 | view | t5 |
u2 | view | t1 |
u2 | view | t3 |
u2 | view | t4 |
u2 | view | t5 |
u3 | view | t2 |
u3 | view | t3 |
u3 | view | t5 |
Grouping the events by user gives the following relationships:
u1 => [ t1, t2, t3, t5 ]
u2 => [ t1, t3, t4, t5 ]
u3 => [ t2, t3, t5 ]
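The grouping step can be sketched in a few lines of Python. This is only an illustration: the `events` list is copied from the table above, while the function name `group_by_user` is made up for this example.

```python
from collections import defaultdict

# (user, action, item) rows copied from the table above.
events = [
    ("u1", "view", "t1"), ("u1", "view", "t2"), ("u1", "view", "t3"), ("u1", "view", "t5"),
    ("u2", "view", "t1"), ("u2", "view", "t3"), ("u2", "view", "t4"), ("u2", "view", "t5"),
    ("u3", "view", "t2"), ("u3", "view", "t3"), ("u3", "view", "t5"),
]

def group_by_user(rows, action):
    """Collect, per user, the distinct items touched by `action` (first-seen order)."""
    history = defaultdict(list)
    for user, act, item in rows:
        if act == action and item not in history[user]:
            history[user].append(item)
    return dict(history)

histories = group_by_user(events, "view")
for user, items in histories.items():
    print(user, "=>", items)
```

Repeated events on the same item are collapsed here, so each user's history is a set of distinct items.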
$r=(P^{T}P)h_{p}$
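$h_{p}$ = one user's history of the primary action (for example, purchases)
Actions on the same item can be repeated over a user's history. How should that be represented?
$h_{u1}=\begin{bmatrix}1 & 2 & 1 & 0 & 1\end{bmatrix}$, where the 2 means item 2 was purchased twice
Representing history this way raises a problem: recent actions and long-past actions do not mean the same thing. Someone who buys a crutch after an occasional injury should not be recommended crutches on the strength of that one action. Could LLR dampen cases like this?
$P$ = the matrix of all users' historical primary actions (primary events)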
$P=\begin{bmatrix}1 & 1 & 1& 0 & 1\\ 1 & 0 & 1 & 1 &1 \\ 0& 1& 1 & 0 & 1\end{bmatrix}$
$(P^{T}P)$ = compares column to column using log-likelihood based correlation test
Let’s call ($P^{T}P$) an indicator matrix for some primary action like purchase
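As a rough sketch, the matrix $P$ and the raw counts behind $P^{T}P$ can be built from the per-user histories as follows. The helper name `transpose_times` is an assumption of this example, and the counts are plain cooccurrence counts before any LLR test is applied.

```python
items = ["t1", "t2", "t3", "t4", "t5"]
histories = {
    "u1": ["t1", "t2", "t3", "t5"],
    "u2": ["t1", "t3", "t4", "t5"],
    "u3": ["t2", "t3", "t5"],
}

# One row per user, one column per item; 1 means the user acted on the item.
P = [[1 if item in seen else 0 for item in items] for seen in histories.values()]

def transpose_times(A, B):
    """Return A^T B for two matrices that share the same rows (users)."""
    return [
        [sum(a[i] * b[j] for a, b in zip(A, B)) for j in range(len(B[0]))]
        for i in range(len(A[0]))
    ]

counts = transpose_times(P, P)
print(P)          # [[1, 1, 1, 0, 1], [1, 0, 1, 1, 1], [0, 1, 1, 0, 1]]
print(counts[0])  # [2, 1, 2, 1, 2]: t1 against t1..t5
```

The diagonal entry is simply the number of users who acted on the item itself. The same helper gives a cross-occurrence matrix when the second argument is a different action matrix.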
The log-likelihood ratio (LLR) finds important, correlated cooccurrences and filters out the rest, a major improvement in quality over simple cooccurrence counts or other similarity metrics.
The LLR value is computed from the cooccurrence counts of two events and measures how strongly the two events are associated:
$P^{T}\cdot P=\begin{bmatrix}- & 1 & 2 & 1 & 2\\ 1 & - & 1 & 1 & 1\\ 2 & 1 & - & 1 & 2\\ 1 & 1 & 1 & - & 1\\ 2 & 1 & 2 & 1 & -\end{bmatrix}\overset{LLR}{\rightarrow}\begin{bmatrix}- & 1.05 & 3.82 & 1.05 & 3.82\\ 1.05 & - & 1.05 & 1.05 & 1.05\\ 3.82 & 1.05 & - & 1.05 & 3.82\\ 1.05 & 1.05 & 1.05 & - & 1.05\\ 3.82 & 1.05 & 3.82 & 1.05 & -\end{bmatrix}$
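Note: every user clicked ad a4, yet a4's LLR value is 0, meaning a4 is not associated with any post. This looks odd, but it is a characteristic of LLR: it heavily penalizes popular events. Put simply, LLR concludes that viewing t1 and clicking ad a4 occur together not because the two are related, but only because clicking a4 is itself a high-frequency event.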
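To make the LLR step concrete, here is a minimal sketch of the standard log-likelihood ratio test on a 2x2 contingency table, written in its entropy form. The function names are chosen for this example. The calls at the end assume three users in total and show how values such as 1.05 and 3.82 arise, and why an event performed by every user scores 0.

```python
import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for two events A and B.

    k11: users with both A and B    k12: users with A but not B
    k21: users with B but not A     k22: users with neither
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))

print(round(llr(1, 1, 1, 0), 2))  # 1.05: the events overlap for one of three users
print(round(llr(2, 0, 0, 1), 2))  # 3.82: the same two users did both, the third did neither
print(round(llr(2, 0, 1, 0), 2))  # 0.0: B was done by all three users
```

In the last call the second event happens for every user, so knowing it happened says nothing about the first event and the score collapses to zero. That is the popularity penalty described in the note above.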
$r=(P^{T}P)h_{p}$
$h_{p}$ = the user's history of the primary action $P$
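CCO-based collaborative filtering recommendations can be expressed as:
$r=(P^{T}P)h_{p}+(P^{T}V)h_{v}+(P^{T}C)h_{c}+\cdots$
Here $V$, $C$, and so on are matrices for secondary actions, and $h_{v}$, $h_{c}$ are the user's histories of those actions.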
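A toy calculation may help show how the terms combine. In this sketch $P$ is the view matrix from the example above, while the secondary matrix `V`, its three columns, and both history vectors are made up for illustration. Raw cross-occurrence counts are used directly, so the LLR filtering step is skipped.

```python
def transpose_times(A, B):
    """Return A^T B for two matrices that share the same rows (users)."""
    return [
        [sum(a[i] * b[j] for a, b in zip(A, B)) for j in range(len(B[0]))]
        for i in range(len(A[0]))
    ]

def times_vector(M, v):
    """Return the matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Primary action matrix from the example (users x items t1..t5).
P = [[1, 1, 1, 0, 1], [1, 0, 1, 1, 1], [0, 1, 1, 0, 1]]
# Hypothetical secondary action matrix (users x three secondary items).
V = [[1, 0, 1], [0, 1, 1], [1, 0, 0]]

h_p = [1, 0, 0, 0, 0]  # hypothetical user: primary action on t1 only
h_v = [0, 1, 0]        # hypothetical user: secondary action on the second item only

term_p = times_vector(transpose_times(P, P), h_p)
term_v = times_vector(transpose_times(P, V), h_v)
r = [a + b for a, b in zip(term_p, term_v)]
print(r)  # [3, 1, 3, 2, 3]
```

Note that $P^{T}V$ has one row per primary item even though `V` covers a different item space, which is what allows secondary actions to contribute to recommendations of primary items.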
Given strong data about user preferences across a general population, we can also use:
Collaborative Topic Filtering
Entity Preferences:
Indicators can also be based on content similarity
$r=(TT^{T})h_{t}+I\cdot L$
$(TT^{T})$ compares every pair of documents and finds the most similar, based on content alone
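The content-based comparison can be sketched as follows. The document-by-term matrix `T` is hypothetical, and shared-term counts stand in for whatever similarity weighting a real system would apply.

```python
# Hypothetical document-by-term matrix: one row per document, one column per term.
T = [
    [1, 1, 0, 0],  # d0
    [1, 0, 1, 0],  # d1
    [0, 1, 1, 1],  # d2
    [1, 1, 0, 1],  # d3
]

def row_similarity(T):
    """Return T T^T: entry (i, j) counts the terms documents i and j share."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in T] for ri in T]

def most_similar(sim, i):
    """Index of the document most similar to document i, excluding itself."""
    others = [j for j in range(len(sim)) if j != i]
    return max(others, key=lambda j: sim[i][j])

sim = row_similarity(T)
print(sim[0])                # [2, 1, 1, 2]
print(most_similar(sim, 0))  # 3
```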
Indicator type | Computed with |
---|---|
Cooccurrence, cross-occurrence | SimilarityAnalysis.cooccurrences |
Content or metadata | SimilarityAnalysis.rowSimilarity |
Intrinsic | |
“Universal” means one query on all indicators at once
$r=(P^{T}P)h_{p}+(P^{T}V)h_{v}+(P^{T}C)h_{c}+\cdots+(TT^{T})h_{t}+I\cdot L$
Unified query:
Once indicators are indexed as search fields, this entire equation is a single query
Fast!
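As an illustration of the single-query idea, the sketch below assembles one search request from a user's history across several indicators. It assumes an Elasticsearch-style `bool`/`should`/`terms` query body. The field names `purchase`, `view`, and `category` and the function `build_query` are hypothetical and not taken from the UR itself.

```python
import json

def build_query(history_by_indicator, size=10):
    """Build one query with a `should` clause per indicator field.

    Each clause matches items whose indexed indicator field contains any of
    the ids in the user's history for that indicator.
    """
    should = [
        {"terms": {field: ids}}
        for field, ids in history_by_indicator.items()
        if ids  # skip indicators for which the user has no history
    ]
    return {"size": size, "query": {"bool": {"should": should}}}

query = build_query({
    "purchase": ["t1", "t3"],  # h_p
    "view": ["t2", "t5"],      # h_v
    "category": [],            # h_c is empty for this user
})
print(json.dumps(query, indent=2))
```

Each term of the equation becomes one clause, and the search engine's scoring takes the place of the matrix-vector products.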
Solution to the “cold-start” problem—items with too short a lifespan or new users with no history
How is this solved?
v0.3.0—most current release
Randomize some of the returned recs; if they are acted upon, they become part of the new training data and are more likely to be recommended in the future
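One simple way to picture this randomization is to swap a fraction of the returned list for random candidates. The sketch below is an assumption about how that could look, not the UR's actual mechanism. The function name `mix_in_random`, the `fraction` parameter, and the item ids are all made up.

```python
import random

def mix_in_random(recs, candidates, fraction=0.2, seed=None):
    """Replace a fraction of `recs` with random items drawn from `candidates`."""
    rng = random.Random(seed)
    mixed = list(recs)
    pool = [c for c in candidates if c not in recs]  # items not already recommended
    n_swap = min(int(len(mixed) * fraction), len(pool))
    positions = rng.sample(range(len(mixed)), n_swap)
    replacements = rng.sample(pool, n_swap)
    for pos, item in zip(positions, replacements):
        mixed[pos] = item
    return mixed

top = ["t1", "t2", "t3", "t4", "t5"]
catalog = top + ["t6", "t7", "t8"]
mixed = mix_in_random(top, catalog, fraction=0.4, seed=7)
print(mixed)
```

Items that enter the list this way and get acted upon produce new events, which is how they can later earn recommendations on their own.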
Visibility control:
References