For personal notes only
A pattern recognition problem
There is a large amount of "weakly labeled" data online, e.g. tweets tagged with hashtags.
Can we use this unlabeled data to improve our classifier?
- labeled data
- unlabeled data
- some applications
  - image classification (images are easy to obtain, e.g. from Flickr)
  - protein function prediction
  - document classification
  - part-of-speech tagging
- semi-supervised classification
- semi-supervised regression: similar, but with a continuous outcome measure
- semi-supervised clustering: using some labels to improve a clustering solution
- measuring how much the unlabeled data actually helps
Content
self-learning
One of the earliest studies on SSL (Hartley & Rao 1968):
• Maximum likelihood, trying all possible labelings (!)
(the problem with unlabeled data is the combinatorial explosion of possible label assignments)
More feasible suggestion (McLachlan 1975):
• Start with supervised solution
• Label unlabeled objects using this classifier
• Retrain the classifier, treating the predicted labels as if they were true labels
Also known as self-training, self-labeling or pseudo-labeling
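The loop above can be sketched in code. This is a hedged illustration, not the lecture's specific method: it uses a simple nearest-mean classifier as the base model, and the confidence threshold and iteration cap are arbitrary choices.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-mean classifier: one centroid per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_proba(X, centroids):
    """Softmax over negative squared distances to the centroids."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, max_iter=10):
    """Self-training: fit, pseudo-label confident unlabeled points, refit."""
    X, y, pool = X_lab, y_lab, X_unlab
    classes, centroids = fit_centroids(X, y)
    for _ in range(max_iter):
        if len(pool) == 0:
            break
        proba = predict_proba(pool, centroids)
        keep = proba.max(axis=1) >= threshold     # only confident predictions
        if not keep.any():
            break
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, classes[proba[keep].argmax(axis=1)]])
        pool = pool[~keep]                        # shrink the unlabeled pool
        classes, centroids = fit_centroids(X, y)
    return classes, centroids
```

Note that pseudo-labels added in early iterations are never revisited, which is exactly why self-training can reinforce its own mistakes.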
Self-learning ≈ expectation maximization (EM)
- Linear Discriminant Analysis (LDA)
$$p(X, y; \theta) = \prod_{i=1}^{L} \left[\pi_0 N(x_i; \mu_0, \Sigma)\right]^{1-y_i} \left[\pi_1 N(x_i; \mu_1, \Sigma)\right]^{y_i}$$
Both classes share the covariance matrix $\Sigma$; $N(x_i; \mu_c, \Sigma)$ is the Gaussian density for class $c$.

- LDA + unlabeled data
$$p(X, y, X_u, h; \theta) = \prod_{i=1}^{L} \left[\pi_0 N(x_i; \mu_0, \Sigma)\right]^{1-y_i} \left[\pi_1 N(x_i; \mu_1, \Sigma)\right]^{y_i} \times \prod_{i=1}^{U} \left[\pi_0 N(x_i; \mu_0, \Sigma)\right]^{1-h_i} \left[\pi_1 N(x_i; \mu_1, \Sigma)\right]^{h_i}$$
But we do not know $h$… Marginalize it out! Since each $h_i$ is a discrete label, the integral reduces to a sum:
$$p(X, y, X_u; \theta) = \sum_{h} p(X, y, X_u, h; \theta)$$
LDA + unlabeled data
$$\prod_{i=1}^{L} \left[\pi_0 N(x_i; \mu_0, \Sigma)\right]^{1-y_i} \left[\pi_1 N(x_i; \mu_1, \Sigma)\right]^{y_i} \times \prod_{i=1}^{U} \sum_{c=0}^{1} \pi_c N(x_i; \mu_c, \Sigma)$$
Like LDA plus a Gaussian mixture with the same parameters
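The EM updates for this LDA-plus-mixture model can be sketched as follows. This is an illustrative implementation under stated assumptions (two classes, uniform 0.5 initialization of the hidden responsibilities, a fixed iteration count); the E-step computes soft labels $h_i$ for the unlabeled points, and the M-step refits priors, class means, and the shared covariance using both labeled and soft-labeled data.

```python
import numpy as np

def ssl_lda_em(X_l, y_l, X_u, n_iter=50):
    """EM for LDA + unlabeled data: two Gaussians with a shared covariance."""
    d = X_l.shape[1]
    R_l = np.eye(2)[y_l]                    # labeled responsibilities: fixed one-hot
    R_u = np.full((len(X_u), 2), 0.5)       # unlabeled: start uninformative
    X = np.vstack([X_l, X_u])
    for _ in range(n_iter):
        # --- M-step: update priors, means, pooled covariance ---
        R = np.vstack([R_l, R_u])
        Nk = R.sum(axis=0)
        pi = Nk / Nk.sum()
        mu = (R.T @ X) / Nk[:, None]
        Sigma = np.zeros((d, d))
        for c in range(2):
            D = X - mu[c]
            Sigma += (R[:, c, None] * D).T @ D
        Sigma /= len(X)
        # --- E-step: responsibilities for unlabeled points only ---
        # (the shared Sigma makes the log-det term cancel between classes)
        inv = np.linalg.inv(Sigma)
        logp = np.array([
            np.log(pi[c]) - 0.5 * np.einsum('ij,jk,ik->i',
                                            X_u - mu[c], inv, X_u - mu[c])
            for c in range(2)]).T
        logp -= logp.max(axis=1, keepdims=True)
        R_u = np.exp(logp)
        R_u /= R_u.sum(axis=1, keepdims=True)
    return pi, mu, Sigma, R_u
```

The labeled responsibilities stay clamped to the observed labels throughout, which is what distinguishes this from fully unsupervised mixture fitting.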
EM algorithm
• The log of a sum makes direct optimization difficult
• Change the goal: find a local maximum of this function instead
EM algorithm: finding a lower bound
The idea is to construct a lower bound that touches the objective function exactly at the current parameter estimate, and to choose the tightest such lower bound available.
Jensen’s inequality
If $f(x)$ is concave, then $f(E[X]) \geq E[f(X)]$
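Concretely, applying Jensen's inequality with the concave $\log$ gives the standard EM lower bound (a textbook derivation, not specific to these notes): for any distribution $q(h)$ over the hidden labels,

```latex
\log p(X, y, X_u; \theta)
  = \log \sum_h q(h)\,\frac{p(X, y, X_u, h; \theta)}{q(h)}
  \;\ge\; \sum_h q(h)\,\log \frac{p(X, y, X_u, h; \theta)}{q(h)}
```

Equality holds when $q(h) = p(h \mid X, y, X_u; \theta)$ (the E-step), so the bound touches the objective at the current $\theta$; the M-step then maximizes the bound over $\theta$.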
Does unlabeled data help?
In a graphical-model view: $\theta_X \rightarrow X$, $X \rightarrow Y$, and $\theta_{Y|X} \rightarrow Y$. Unlabeled data only carries information about $\theta_X$, so it can help only if the model ties the marginal distribution of $X$ to the decision function.
Self-learning and EM conclusions
• For generative models:
• Integrate out the missing variables
• Difficult optimization problem can often be “solved” efficiently using
expectation maximization
• Only guaranteed to improve performance asymptotically, if the model is
correct
• Self-learning is a closely related technique that is applicable to any classifier
• Related: co-training (multi-view learning)
• Use labels predicted by other view(s) as newly labeled objects
Low-density assumption
Low-density assumption conclusion
• “Natural” extension for the SVM
• Local minima may be a problem
• Lots of work on optimization
• My experience: quite sensitive to parameter settings
• Other low-density approaches:
• Entropy Regularization (Grandvalet & Bengio 2005)
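The entropy-regularization objective is easy to state in code: the usual cross-entropy on labeled data plus a weighted entropy penalty on the unlabeled predictions, so the classifier is pushed toward confident outputs, i.e. decision boundaries in low-density regions. A minimal sketch (the weight `lam` and input shapes are assumptions, not from the original paper):

```python
import numpy as np

def entropy_reg_loss(p_lab, y_lab, p_unlab, lam=0.5, eps=1e-12):
    """Cross-entropy on labeled data + lam * mean predictive entropy
    on unlabeled data. p_lab, p_unlab: (n, k) class-probability arrays."""
    # supervised term: negative log-likelihood of the true labels
    ce = -np.mean(np.log(p_lab[np.arange(len(y_lab)), y_lab] + eps))
    # unsupervised term: entropy of the predictions on unlabeled points
    ent = -np.mean(np.sum(p_unlab * np.log(p_unlab + eps), axis=1))
    return ce + lam * ent
```

Minimizing the second term rewards confident (low-entropy) predictions on unlabeled points; with `lam=0` the loss reduces to plain supervised cross-entropy.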
Manifold assumption
- manifold regularization
- consistency regularization
$$\Vert f(x; w) - g(x'; w^t) \Vert^2$$
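This consistency penalty can be sketched as follows: the student model $f$ sees an unlabeled input $x$, a teacher model $g$ sees a perturbed copy $x'$, and their squared output difference is averaged. Everything here is an assumption for illustration (Gaussian input noise as the perturbation, generic prediction functions); in common instantiations such as Mean Teacher, $w^t$ is an exponential moving average of the student weights $w$.

```python
import numpy as np

def consistency_loss(f_student, f_teacher, X_unlab, noise_std=0.1, rng=None):
    """Mean squared distance between student predictions on x and
    teacher predictions on a noise-perturbed x'.
    f_student / f_teacher map an (n, d) array to an (n, k) array."""
    rng = rng if rng is not None else np.random.default_rng()
    X_pert = X_unlab + rng.normal(0.0, noise_std, X_unlab.shape)  # x' = x + noise
    diff = f_student(X_unlab) - f_teacher(X_pert)
    return np.mean(np.sum(diff ** 2, axis=1))
```

No labels appear anywhere in this term, which is what lets it be computed on the unlabeled pool and added to the supervised loss.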
Semi-Supervised Conclusion
• Unlabeled data is often available
• Semi-supervised learning attempts to use it to improve the classifier
• Often worthwhile, but it does not come for free
• Modeling time
• Computational cost
• Remember: an unlabeled object is less valuable than a labeled one
• Labeling a few more objects can be more effective
• Remember the goal: transductive or inductive?