Machine Learning: A Probabilistic Perspective
Kevin P. Murphy
Today's Webenabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and selfcontained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudocode for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled modelbased approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software packagePMTK (probabilistic modeling toolkit)that is freely available online. The book is suitable for upperlevel undergraduates with an introductorylevel college math background and beginning graduate students.
Series:
Adaptive Computation and Machine Learning series
The file will be sent to your email address. It may take up to 15 minutes before you receive it.
The file will be sent to your Kindle account. It may takes up to 15 minutes before you received it.
Please note you need to add our email km0@bookmail.org to approved email addresses.
Read more.
Most frequently terms
machine learning
Machine Learning
A Probabilistic Perspective
Kevin P. Murphy
Today’s Webenabled deluge of electronic data calls for
automated methods of data analysis. Machine learning
provides these, developing methods that can automatically
detect patterns in data and use the uncovered patterns to
predict future data. This textbook offers a comprehensive
and selfcontained introduction to the field of machine
learning, a unified, probabilistic approach.
The coverage combines breadth and depth, offering
necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of
recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book
is written in an informal, accessible style, complete with
pseudocode for the most important algorithms. All topics
are copiously illustrated with color images and worked
examples drawn from such application domains as biology,
text processing, computer vision, and robotics. Rather than
providing a cookbook of different heuristic methods, the
book stresses a principled modelbased approach, often
using the language of graphical models to specify models in a concise and intuitive way. Almost all the models
described have been implemented in a MATLAB software
package—PMTK (probabilistic modeling toolkit)—that is
freely available online. The book is suitable for upperlevel
undergraduates with an introductorylevel college math
background and beginning graduate students.
Kevin P. Murphy is a Research Scientist at Google. Previously, he was Associate Professor of Computer Science
and Statistics at the University of British Columbia.
“An astonishing machine learning book: intuitive, full
of examples, fun to read but still comprehensive,
strong, and deep! A great starting point for any university student—and a musthave for anybody in the field.”
Jan Peters, Darmstadt University of Technology;
MaxPlanck Institute for Intelligent Systems
“Kevin Murphy excels at unraveling t; he complexities
of machine learning methods while motivating the
reader with a stream of illustrated examples and
realworld case studies. The accompanying software
package includes source code for many of the figures,
making it both easy and very tempting to dive in and
explore these methods for yourself. A mustbuy for
anyone interested in machine learning or curious
about how to extract useful knowledge from big data.”
John Winn, Microsoft Research
“This is a wonderful book that starts with basic topics
in statistical modeling, culminating in the most advanced topics. It provides both the theoretical foundations of probabilistic machine learning as well as
practical tools, in the form of MATLAB code. The book
should be on the shelf of any student interested in the
topic, and any practitioner working in the field.”
Yoram Singer, Google Research
“This book will be an essential reference for practitioners of modern machine learning. It covers the basic
concepts needed to understand the field as a whole,
and the powerful modern methods that build on those
concepts. In Machine Learning, the language of probability and statistics reveals important connections between seemingly disparate algorithms and strategies.
Thus, its readers will become articulate in a holistic
view of the stateoftheart and poised to build the next
generation of machine learning algorithms.”
David Blei, Princeton University
9780262018029
The MIT Press
Massachusetts Institute of Technology
Cambridge, Massachusetts 02142
http://mitpress.mit.edu
The cover image is based on sequential Bayesian updating
Adaptive Computation and Machine Learning series
of a 2D Gaussian distribution. See Figure 7.11 for details.
Machine Learning
A Probabilistic Perspective
Kevin P. Murphy
Machine Learning: A Probabilistic Perspective
Machine Learning
A Probabilistic Perspective
Kevin P. Murphy
The MIT Press
Cambridge, Massachusetts
London, England
© 2012 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical
means (including photocopying, recording, or information storage and retrieval) without permission in
writing from the publisher.
For information about special quantity discounts, please email special_sales@mitpress.mit.edu
This book was set in the LATEX programming language by the author. Printed and bound in the United
States of America.
Library of Congress CataloginginPublication Information
Murphy, Kevin P.
Machine learning : a probabilistic perspective / Kevin P. Murphy.
p. cm. — (Adaptive computation and machine learning series)
Includes bibliographical references and index.
ISBN 9780262018029 (hardcover : alk. paper)
1. Machine learning. 2. Probabilities. I. Title.
Q325.5.M87 2012
006.3’1—dc23
2012004558
10
9
8
7
6
5
4
3
2
1
This book is dedicated to Alessandro, Michael and Stefano,
and to the memory of Gerard Joseph Murphy.
Contents
Preface
1
xxvii
Introduction
1
1.1
Machine learning: what and why?
1
1.1.1
Types of machine learning
2
1.2
Supervised learning
3
1.2.1
Classiﬁcation
3
1.2.2
Regression
8
1.3
Unsupervised learning
9
1.3.1
Discovering clusters
10
1.3.2
Discovering latent factors
11
1.3.3
Discovering graph structure
13
1.3.4
Matrix completion
14
1.4
Some basic concepts in machine learning
16
1.4.1
Parametric vs nonparametric models
16
1.4.2
A simple nonparametric classiﬁer: Knearest neighbors
1.4.3
The curse of dimensionality
18
1.4.4
Parametric models for classiﬁcation and regression
19
1.4.5
Linear regression
19
1.4.6
Logistic regression
21
1.4.7
Overﬁtting
22
1.4.8
Model selection
22
1.4.9
No free lunch theorem
24
2 Probability
27
2.1
Introduction
27
2.2
A brief review of probability theory
28
2.2.1
Discrete random variables
28
2.2.2
Fundamental rules
28
2.2.3
Bayes rule
29
2.2.4
Independence and conditional independence
2.2.5
Continuous random variables
32
30
16
CONTENTS
viii
2.3
2.4
2.5
2.6
2.7
2.8
3
2.2.6
Quantiles
33
2.2.7
Mean and variance
33
Some common discrete distributions
34
2.3.1
The binomial and Bernoulli distributions
34
2.3.2
The multinomial and multinoulli distributions
35
2.3.3
The Poisson distribution
37
2.3.4
The empirical distribution
37
Some common continuous distributions
38
2.4.1
Gaussian (normal) distribution
38
2.4.2
Degenerate pdf
39
2.4.3
The Laplace distribution
41
2.4.4
The gamma distribution
41
2.4.5
The beta distribution
42
2.4.6
Pareto distribution
43
Joint probability distributions
44
2.5.1
Covariance and correlation
44
2.5.2
The multivariate Gaussian
46
2.5.3
Multivariate Student t distribution
46
2.5.4
Dirichlet distribution
47
Transformations of random variables
49
2.6.1
Linear transformations
49
2.6.2
General transformations
50
2.6.3
Central limit theorem
51
Monte Carlo approximation
52
2.7.1
Example: change of variables, the MC way
53
2.7.2
Example: estimating π by Monte Carlo integration
54
2.7.3
Accuracy of Monte Carlo approximation
54
Information theory
56
2.8.1
Entropy
56
2.8.2
KL divergence
57
2.8.3
Mutual information
59
Generative models for discrete data
65
3.1
Introduction
65
3.2
Bayesian concept learning
65
3.2.1
Likelihood
67
3.2.2
Prior
67
3.2.3
Posterior
68
3.2.4
Posterior predictive distribution
3.2.5
A more complex prior
72
3.3
The betabinomial model
72
3.3.1
Likelihood
73
3.3.2
Prior
74
3.3.3
Posterior
75
3.3.4
Posterior predictive distribution
71
77
CONTENTS
3.4
3.5
4
The Dirichletmultinomial model
78
3.4.1
Likelihood
79
3.4.2
Prior
79
3.4.3
Posterior
79
3.4.4
Posterior predictive
81
Naive Bayes classiﬁers
82
3.5.1
Model ﬁtting
83
3.5.2
Using the model for prediction
85
3.5.3
The logsumexp trick
86
3.5.4
Feature selection using mutual information
3.5.5
Classifying documents using bag of words
ix
86
87
Gaussian models
97
4.1
Introduction
97
4.1.1
Notation
97
4.1.2
Basics
97
4.1.3
MLE for an MVN
99
4.1.4
Maximum entropy derivation of the Gaussian *
4.2
Gaussian discriminant analysis
101
4.2.1
Quadratic discriminant analysis (QDA)
102
4.2.2
Linear discriminant analysis (LDA)
103
4.2.3
Twoclass LDA
104
4.2.4
MLE for discriminant analysis
106
4.2.5
Strategies for preventing overﬁtting
106
4.2.6
Regularized LDA *
107
4.2.7
Diagonal LDA
108
4.2.8
Nearest shrunken centroids classiﬁer *
109
4.3
Inference in jointly Gaussian distributions
110
4.3.1
Statement of the result
111
4.3.2
Examples
111
4.3.3
Information form
115
4.3.4
Proof of the result *
116
4.4
Linear Gaussian systems
119
4.4.1
Statement of the result
119
4.4.2
Examples
120
4.4.3
Proof of the result *
124
4.5
Digression: The Wishart distribution *
125
4.5.1
Inverse Wishart distribution
126
4.5.2
Visualizing the Wishart distribution *
127
4.6
Inferring the parameters of an MVN
127
4.6.1
Posterior distribution of μ
128
4.6.2
Posterior distribution of Σ *
128
4.6.3
Posterior distribution of μ and Σ *
132
4.6.4
Sensor fusion with unknown precisions *
138
101
x
CONTENTS
5 Bayesian statistics
149
5.1
Introduction
149
5.2
Summarizing posterior distributions
149
5.2.1
MAP estimation
149
5.2.2
Credible intervals
152
5.2.3
Inference for a difference in proportions
154
5.3
Bayesian model selection
155
5.3.1
Bayesian Occam’s razor
156
5.3.2
Computing the marginal likelihood (evidence)
158
5.3.3
Bayes factors
163
5.3.4
JeffreysLindley paradox *
164
5.4
Priors
165
5.4.1
Uninformative priors
165
5.4.2
Jeffreys priors *
166
5.4.3
Robust priors
168
5.4.4
Mixtures of conjugate priors
168
5.5
Hierarchical Bayes
171
5.5.1
Example: modeling related cancer rates
171
5.6
Empirical Bayes
172
5.6.1
Example: betabinomial model
173
5.6.2
Example: GaussianGaussian model
173
5.7
Bayesian decision theory
176
5.7.1
Bayes estimators for common loss functions
177
5.7.2
The false positive vs false negative tradeoff
180
5.7.3
Other topics *
184
6 Frequentist statistics
191
6.1
Introduction
191
6.2
Sampling distribution of an estimator
191
6.2.1
Bootstrap
192
6.2.2
Large sample theory for the MLE *
193
6.3
Frequentist decision theory
194
6.3.1
Bayes risk
195
6.3.2
Minimax risk
196
6.3.3
Admissible estimators
197
6.4
Desirable properties of estimators
200
6.4.1
Consistent estimators
200
6.4.2
Unbiased estimators
200
6.4.3
Minimum variance estimators
201
6.4.4
The biasvariance tradeoff
202
6.5
Empirical risk minimization
204
6.5.1
Regularized risk minimization
205
6.5.2
Structural risk minimization
206
6.5.3
Estimating the risk using cross validation
206
6.5.4
Upper bounding the risk using statistical learning theory *
209
CONTENTS
6.6
7
6.5.5
Surrogate loss functions
210
Pathologies of frequentist statistics *
211
6.6.1
Counterintuitive behavior of conﬁdence intervals
6.6.2
pvalues considered harmful
213
6.6.3
The likelihood principle
214
6.6.4
Why isn’t everyone a Bayesian?
215
xi
212
Linear regression
217
7.1
Introduction
217
7.2
Model speciﬁcation
217
7.3
Maximum likelihood estimation (least squares)
217
7.3.1
Derivation of the MLE
219
7.3.2
Geometric interpretation
220
7.3.3
Convexity
221
7.4
Robust linear regression *
223
7.5
Ridge regression
225
7.5.1
Basic idea
225
7.5.2
Numerically stable computation *
227
7.5.3
Connection with PCA *
228
7.5.4
Regularization effects of big data
230
7.6
Bayesian linear regression
231
7.6.1
Computing the posterior
232
7.6.2
Computing the posterior predictive
233
7.6.3
Bayesian inference when σ 2 is unknown *
234
7.6.4
EB for linear regression (evidence procedure)
238
8 Logistic regression
245
8.1
Introduction
245
8.2
Model speciﬁcation
245
8.3
Model ﬁtting
245
8.3.1
MLE
246
8.3.2
Steepest descent
247
8.3.3
Newton’s method
249
8.3.4
Iteratively reweighted least squares (IRLS)
250
8.3.5
QuasiNewton (variable metric) methods
251
8.3.6
2 regularization
252
8.3.7
Multiclass logistic regression
252
8.4
Bayesian logistic regression
254
8.4.1
Laplace approximation
255
8.4.2
Derivation of the BIC
255
8.4.3
Gaussian approximation for logistic regression
256
8.4.4
Approximating the posterior predictive
256
8.4.5
Residual analysis (outlier detection) *
260
8.5
Online learning and stochastic optimization
261
8.5.1
Online learning and regret minimization
262
CONTENTS
xii
8.6
8.5.2
Stochastic optimization and risk minimization
8.5.3
The LMS algorithm
264
8.5.4
The perceptron algorithm
265
8.5.5
A Bayesian view
266
Generative vs discriminative classiﬁers
267
8.6.1
Pros and cons of each approach
268
8.6.2
Dealing with missing data
269
8.6.3
Fisher’s linear discriminant analysis (FLDA) *
262
271
9 Generalized linear models and the exponential family
281
9.1
Introduction
281
9.2
The exponential family
281
9.2.1
Deﬁnition
282
9.2.2
Examples
282
9.2.3
Log partition function
284
9.2.4
MLE for the exponential family
286
9.2.5
Bayes for the exponential family *
287
9.2.6
Maximum entropy derivation of the exponential family *
289
9.3
Generalized linear models (GLMs)
290
9.3.1
Basics
290
9.3.2
ML and MAP estimation
292
9.3.3
Bayesian inference
293
9.4
Probit regression
293
9.4.1
ML/MAP estimation using gradientbased optimization
294
9.4.2
Latent variable interpretation
294
9.4.3
Ordinal probit regression *
295
9.4.4
Multinomial probit models *
295
9.5
Multitask learning
296
9.5.1
Hierarchical Bayes for multitask learning
296
9.5.2
Application to personalized email spam ﬁltering
296
9.5.3
Application to domain adaptation
297
9.5.4
Other kinds of prior
297
9.6
Generalized linear mixed models *
298
9.6.1
Example: semiparametric GLMMs for medical data
298
9.6.2
Computational issues
300
9.7
Learning to rank *
300
9.7.1
The pointwise approach
301
9.7.2
The pairwise approach
301
9.7.3
The listwise approach
302
9.7.4
Loss functions for ranking
303
10 Directed graphical models (Bayes nets)
10.1 Introduction
307
10.1.1
Chain rule
307
10.1.2
Conditional independence
307
308
CONTENTS
10.2
10.3
10.4
10.5
10.6
10.1.3
Graphical models
308
10.1.4
Graph terminology
309
10.1.5
Directed graphical models
310
Examples
311
10.2.1
Naive Bayes classiﬁers
311
10.2.2
Markov and hidden Markov models
312
10.2.3
Medical diagnosis
313
10.2.4
Genetic linkage analysis *
315
10.2.5
Directed Gaussian graphical models *
318
Inference
319
Learning
320
10.4.1
Plate notation
320
10.4.2
Learning from complete data
322
10.4.3
Learning with missing and/or latent variables
323
Conditional independence properties of DGMs
324
10.5.1
dseparation and the Bayes Ball algorithm (global Markov
properties)
324
10.5.2
Other Markov properties of DGMs
327
10.5.3
Markov blanket and full conditionals
327
Inﬂuence (decision) diagrams *
328
11 Mixture models and the EM algorithm
337
11.1
Latent variable models
337
11.2 Mixture models
337
11.2.1
Mixtures of Gaussians
339
11.2.2
Mixture of multinoullis
340
11.2.3
Using mixture models for clustering
340
11.2.4
Mixtures of experts
342
11.3 Parameter estimation for mixture models
345
11.3.1
Unidentiﬁability
346
11.3.2
Computing a MAP estimate is nonconvex
347
11.4 The EM algorithm
348
11.4.1
Basic idea
349
11.4.2
EM for GMMs
350
11.4.3
EM for mixture of experts
357
11.4.4
EM for DGMs with hidden variables
358
11.4.5
EM for the Student distribution *
359
11.4.6
EM for probit regression *
362
11.4.7
Theoretical basis for EM *
363
11.4.8
Online EM
365
11.4.9
Other EM variants *
367
11.5 Model selection for latent variable models
370
11.5.1
Model selection for probabilistic models
370
11.5.2
Model selection for nonprobabilistic methods
370
11.6 Fitting models with missing data
372
xiii
CONTENTS
xiv
11.6.1
EM for the MLE of an MVN with missing data
373
12 Latent linear models
381
12.1 Factor analysis
381
12.1.1
FA is a low rank parameterization of an MVN
381
12.1.2
Inference of the latent factors
382
12.1.3
Unidentiﬁability
383
12.1.4
Mixtures of factor analysers
385
12.1.5
EM for factor analysis models
386
12.1.6
Fitting FA models with missing data
387
12.2 Principal components analysis (PCA)
387
12.2.1
Classical PCA: statement of the theorem
387
12.2.2
Proof *
389
12.2.3
Singular value decomposition (SVD)
392
12.2.4
Probabilistic PCA
395
12.2.5
EM algorithm for PCA
396
12.3 Choosing the number of latent dimensions
398
12.3.1
Model selection for FA/PPCA
398
12.3.2
Model selection for PCA
399
12.4 PCA for categorical data
402
12.5 PCA for paired and multiview data
404
12.5.1
Supervised PCA (latent factor regression)
405
12.5.2
Partial least squares
406
12.5.3
Canonical correlation analysis
407
12.6 Independent Component Analysis (ICA)
407
12.6.1
Maximum likelihood estimation
410
12.6.2
The FastICA algorithm
411
12.6.3
Using EM
414
12.6.4
Other estimation principles *
415
13 Sparse linear models
421
13.1 Introduction
421
13.2 Bayesian variable selection
422
13.2.1
The spike and slab model
424
13.2.2
From the BernoulliGaussian model to 0 regularization
425
13.2.3
Algorithms
426
429
13.3 1 regularization: basics
430
13.3.1
Why does 1 regularization yield sparse solutions?
13.3.2
Optimality conditions for lasso
431
13.3.3
Comparison of least squares, lasso, ridge and subset selection
435
13.3.4
Regularization path
436
13.3.5
Model selection
439
13.3.6
Bayesian inference for linear models with Laplace priors
440
13.4 1 regularization: algorithms
441
13.4.1
Coordinate descent
441
CONTENTS
13.5
13.6
13.7
13.8
13.4.2
LARS and other homotopy methods
441
13.4.3
Proximal and gradient projection methods
442
13.4.4
EM for lasso
447
1 regularization: extensions
449
13.5.1
Group Lasso
449
13.5.2
Fused lasso
454
13.5.3
Elastic net (ridge and lasso combined)
455
Nonconvex regularizers
457
13.6.1
Bridge regression
458
13.6.2
Hierarchical adaptive lasso
458
13.6.3
Other hierarchical priors
462
Automatic relevance determination (ARD)/sparse Bayesian learning (SBL)
13.7.1
ARD for linear regression
463
13.7.2
Whence sparsity?
465
13.7.3
Connection to MAP estimation
465
13.7.4
Algorithms for ARD *
466
13.7.5
ARD for logistic regression
468
Sparse coding *
468
13.8.1
Learning a sparse coding dictionary
469
13.8.2
Results of dictionary learning from image patches
470
13.8.3
Compressed sensing
472
13.8.4
Image inpainting and denoising
472
14 Kernels
479
14.1 Introduction
479
14.2 Kernel functions
479
14.2.1
RBF kernels
480
14.2.2
Kernels for comparing documents
480
14.2.3
Mercer (positive deﬁnite) kernels
481
14.2.4
Linear kernels
482
14.2.5
Matern kernels
482
14.2.6
String kernels
483
14.2.7
Pyramid match kernels
484
14.2.8
Kernels derived from probabilistic generative models
485
14.3 Using kernels inside GLMs
486
14.3.1
Kernel machines
486
14.3.2
L1VMs, RVMs, and other sparse vector machines
487
14.4 The kernel trick
488
14.4.1
Kernelized nearest neighbor classiﬁcation
489
14.4.2
Kernelized Kmedoids clustering
489
14.4.3
Kernelized ridge regression
492
14.4.4
Kernel PCA
493
14.5 Support vector machines (SVMs)
496
14.5.1
SVMs for regression
497
14.5.2
SVMs for classiﬁcation
498
xv
463
CONTENTS
xvi
14.6
14.7
14.5.3
Choosing C
504
14.5.4
Summary of key points
504
14.5.5
A probabilistic interpretation of SVMs
505
Comparison of discriminative kernel methods
505
Kernels for building generative models
507
14.7.1
Smoothing kernels
507
14.7.2
Kernel density estimation (KDE)
508
14.7.3
From KDE to KNN
509
14.7.4
Kernel regression
510
14.7.5
Locally weighted regression
512
15 Gaussian processes
515
15.1 Introduction
515
15.2 GPs for regression
516
15.2.1
Predictions using noisefree observations
517
15.2.2
Predictions using noisy observations
518
15.2.3
Effect of the kernel parameters
519
15.2.4
Estimating the kernel parameters
521
15.2.5
Computational and numerical issues *
524
15.2.6
Semiparametric GPs *
524
15.3 GPs meet GLMs
525
15.3.1
Binary classiﬁcation
525
15.3.2
Multiclass classiﬁcation
528
15.3.3
GPs for Poisson regression
531
15.4 Connection with other methods
532
15.4.1
Linear models compared to GPs
532
15.4.2
Linear smoothers compared to GPs
533
15.4.3
SVMs compared to GPs
534
15.4.4
L1VM and RVMs compared to GPs
534
15.4.5
Neural networks compared to GPs
535
15.4.6
Smoothing splines compared to GPs *
536
15.4.7
RKHS methods compared to GPs *
538
15.5 GP latent variable model
540
15.6 Approximation methods for large datasets
542
16 Adaptive basis function models
543
16.1 Introduction
543
16.2 Classiﬁcation and regression trees (CART)
544
16.2.1
Basics
544
16.2.2
Growing a tree
545
16.2.3
Pruning a tree
549
16.2.4
Pros and cons of trees
550
16.2.5
Random forests
550
16.2.6
CART compared to hierarchical mixture of experts *
16.3 Generalized additive models
552
551
CONTENTS
16.4
16.5
16.6
16.7
16.8
16.3.1
Backﬁtting
552
16.3.2
Computational efficiency
553
16.3.3
Multivariate adaptive regression splines (MARS)
553
Boosting
554
16.4.1
Forward stagewise additive modeling
555
16.4.2
L2boosting
557
16.4.3
AdaBoost
558
16.4.4
LogitBoost
559
16.4.5
Boosting as functional gradient descent
560
16.4.6
Sparse boosting
561
16.4.7
Multivariate adaptive regression trees (MART)
562
16.4.8
Why does boosting work so well?
562
16.4.9
A Bayesian view
563
Feedforward neural networks (multilayer perceptrons)
563
16.5.1
Convolutional neural networks
564
16.5.2
Other kinds of neural networks
568
16.5.3
A brief history of the ﬁeld
568
16.5.4
The backpropagation algorithm
569
16.5.5
Identiﬁability
572
16.5.6
Regularization
572
16.5.7
Bayesian inference *
576
Ensemble learning
580
16.6.1
Stacking
580
16.6.2
Errorcorrecting output codes
581
16.6.3
Ensemble learning is not equivalent to Bayes model averaging
Experimental comparison
582
16.7.1
Lowdimensional features
582
16.7.2
Highdimensional features
583
Interpreting blackbox models
585
17 Markov and hidden Markov models
589
17.1 Introduction
589
17.2 Markov models
589
17.2.1
Transition matrix
589
17.2.2
Application: Language modeling
591
17.2.3
Stationary distribution of a Markov chain *
596
17.2.4
Application: Google’s PageRank algorithm for web page ranking *
17.3 Hidden Markov models
603
17.3.1
Applications of HMMs
604
17.4 Inference in HMMs
606
17.4.1
Types of inference problems for temporal models
606
17.4.2
The forwards algorithm
609
17.4.3
The forwardsbackwards algorithm
610
17.4.4
The Viterbi algorithm
612
17.4.5
Forwards ﬁltering, backwards sampling
616
xvii
581
600
CONTENTS
xviii
17.5
17.6
Learning for HMMs
617
17.5.1
Training with fully observed data
617
17.5.2
EM for HMMs (the BaumWelch algorithm)
618
17.5.3
Bayesian methods for “ﬁtting” HMMs *
620
17.5.4
Discriminative training
620
17.5.5
Model selection
621
Generalizations of HMMs
621
17.6.1
Variable duration (semiMarkov) HMMs
622
17.6.2
Hierarchical HMMs
624
17.6.3
Inputoutput HMMs
625
17.6.4
Autoregressive and buried HMMs
626
17.6.5
Factorial HMM
627
17.6.6
Coupled HMM and the inﬂuence model
628
17.6.7
Dynamic Bayesian networks (DBNs)
628
18 State space models
631
18.1 Introduction
631
18.2 Applications of SSMs
632
18.2.1
SSMs for object tracking
632
18.2.2
Robotic SLAM
633
18.2.3
Online parameter learning using recursive least squares
18.2.4
SSM for time series forecasting *
637
18.3 Inference in LGSSM
640
18.3.1
The Kalman ﬁltering algorithm
640
18.3.2
The Kalman smoothing algorithm
643
18.4 Learning for LGSSM
646
18.4.1
Identiﬁability and numerical stability
646
18.4.2
Training with fully observed data
647
18.4.3
EM for LGSSM
647
18.4.4
Subspace methods
647
18.4.5
Bayesian methods for “ﬁtting” LGSSMs
647
18.5 Approximate online inference for nonlinear, nonGaussian SSMs
18.5.1
Extended Kalman ﬁlter (EKF)
648
18.5.2
Unscented Kalman ﬁlter (UKF)
650
18.5.3
Assumed density ﬁltering (ADF)
652
18.6 Hybrid discrete/continuous SSMs
655
18.6.1
Inference
656
18.6.2
Application: data association and multitarget tracking
18.6.3
Application: fault diagnosis
659
18.6.4
Application: econometric forecasting
660
19 Undirected graphical models (Markov random ﬁelds)
19.1 Introduction
661
19.2 Conditional independence properties of UGMs
19.2.1
Key properties
661
661
661
636
647
658
CONTENTS
19.3
19.4
19.5
19.6
19.7
19.2.2
An undirected alternative to dseparation
663
19.2.3
Comparing directed and undirected graphical models
664
Parameterization of MRFs
665
19.3.1
The HammersleyClifford theorem
665
19.3.2
Representing potential functions
667
Examples of MRFs
668
19.4.1
Ising model
668
19.4.2
Hopﬁeld networks
669
19.4.3
Potts model
671
19.4.4
Gaussian MRFs
672
19.4.5
Markov logic networks *
674
Learning
676
19.5.1
Training maxent models using gradient methods
676
19.5.2
Training partially observed maxent models
677
19.5.3
Approximate methods for computing the MLEs of MRFs
678
19.5.4
Pseudo likelihood
678
19.5.5
Stochastic maximum likelihood
679
19.5.6
Feature induction for maxent models *
680
19.5.7
Iterative proportional ﬁtting (IPF) *
681
Conditional random ﬁelds (CRFs)
684
19.6.1
Chainstructured CRFs, MEMMs and the labelbias problem
684
19.6.2
Applications of CRFs
686
19.6.3
CRF training
692
Structural SVMs
693
19.7.1
SSVMs: a probabilistic view
693
19.7.2
SSVMs: a nonprobabilistic view
695
19.7.3
Cutting plane methods for ﬁtting SSVMs
698
19.7.4
Online algorithms for ﬁtting SSVMs
700
19.7.5
Latent structural SVMs
701
20 Exact inference for graphical models
707
20.1 Introduction
707
20.2 Belief propagation for trees
707
20.2.1
Serial protocol
707
20.2.2
Parallel protocol
709
20.2.3
Gaussian BP *
710
20.2.4
Other BP variants *
712
20.3 The variable elimination algorithm
714
20.3.1
The generalized distributive law *
717
20.3.2
Computational complexity of VE
717
20.3.3
A weakness of VE
720
20.4 The junction tree algorithm *
720
20.4.1
Creating a junction tree
720
20.4.2
Message passing on a junction tree
722
20.4.3
Computational complexity of JTA
725
xix
CONTENTS
xx
20.5
20.4.4
JTA generalizations *
726
Computational intractability of exact inference in the worst case
20.5.1
Approximate inference
727
726
21 Variational inference
731
21.1 Introduction
731
21.2 Variational inference
732
21.2.1
Alternative interpretations of the variational objective
733
21.2.2
Forward or reverse KL? *
733
21.3 The mean ﬁeld method
735
21.3.1
Derivation of the mean ﬁeld update equations
736
21.3.2
Example: mean ﬁeld for the Ising model
737
21.4 Structured mean ﬁeld *
739
21.4.1
Example: factorial HMM
740
21.5 Variational Bayes
742
21.5.1
Example: VB for a univariate Gaussian
742
21.5.2
Example: VB for linear regression
746
21.6 Variational Bayes EM
749
21.6.1
Example: VBEM for mixtures of Gaussians *
750
21.7 Variational message passing and VIBES
756
21.8 Local variational bounds *
756
21.8.1
Motivating applications
756
21.8.2
Bohning’s quadratic bound to the logsumexp function
758
21.8.3
Bounds for the sigmoid function
760
21.8.4
Other bounds and approximations to the logsumexp function *
21.8.5
Variational inference based on upper bounds
763
22 More variational inference
767
22.1 Introduction
767
22.2 Loopy belief propagation: algorithmic issues
767
22.2.1
A brief history
767
22.2.2
LBP on pairwise models
768
22.2.3
LBP on a factor graph
769
22.2.4
Convergence
771
22.2.5
Accuracy of LBP
774
22.2.6
Other speedup tricks for LBP *
775
22.3 Loopy belief propagation: theoretical issues *
776
22.3.1
UGMs represented in exponential family form
776
22.3.2
The marginal polytope
777
22.3.3
Exact inference as a variational optimization problem
778
22.3.4
Mean ﬁeld as a variational optimization problem
779
22.3.5
LBP as a variational optimization problem
779
22.3.6
Loopy BP vs mean ﬁeld
783
22.4 Extensions of belief propagation *
783
22.4.1
Generalized belief propagation
783
762
CONTENTS
22.5
22.6
xxi
22.4.2
Convex belief propagation
785
Expectation propagation
787
22.5.1
EP as a variational inference problem
788
22.5.2
Optimizing the EP objective using moment matching
22.5.3
EP for the clutter problem
791
22.5.4
LBP is a special case of EP
792
22.5.5
Ranking players using TrueSkill
793
22.5.6
Other applications of EP
799
MAP state estimation
799
22.6.1
Linear programming relaxation
799
22.6.2
Maxproduct belief propagation
800
22.6.3
Graphcuts
801
22.6.4
Experimental comparison of graphcuts and BP
804
22.6.5
Dual decomposition
806
23 Monte Carlo inference
815
23.1 Introduction
815
23.2 Sampling from standard distributions
815
23.2.1
Using the cdf
815
23.2.2
Sampling from a Gaussian (BoxMuller method)
817
23.3 Rejection sampling
817
23.3.1
Basic idea
817
23.3.2
Example
818
23.3.3
Application to Bayesian statistics
819
23.3.4
Adaptive rejection sampling
819
23.3.5
Rejection sampling in high dimensions
820
23.4 Importance sampling
820
23.4.1
Basic idea
820
23.4.2
Handling unnormalized distributions
821
23.4.3
Importance sampling for a DGM: likelihood weighting
23.4.4
Sampling importance resampling (SIR)
822
23.5 Particle ﬁltering
823
23.5.1
Sequential importance sampling
824
23.5.2
The degeneracy problem
825
23.5.3
The resampling step
825
23.5.4
The proposal distribution
827
23.5.5
Application: robot localization
828
23.5.6
Application: visual object tracking
828
23.5.7
Application: time series forecasting
831
23.6 RaoBlackwellised particle ﬁltering (RBPF)
831
23.6.1
RBPF for switching LGSSMs
831
23.6.2
Application: tracking a maneuvering target
832
23.6.3
Application: Fast SLAM
834
24 Markov chain Monte Carlo (MCMC) inference
837
789
822
CONTENTS
xxii
24.1
24.2
24.3
24.4
24.5
24.6
24.7
Introduction
837
Gibbs sampling
838
24.2.1
Basic idea
838
24.2.2
Example: Gibbs sampling for the Ising model
838
24.2.3
Example: Gibbs sampling for inferring the parameters of a GMM
24.2.4
Collapsed Gibbs sampling *
841
24.2.5
Gibbs sampling for hierarchical GLMs
844
24.2.6
BUGS and JAGS
846
24.2.7
The Imputation Posterior (IP) algorithm
847
24.2.8
Blocking Gibbs sampling
847
Metropolis Hastings algorithm
848
24.3.1
Basic idea
848
24.3.2
Gibbs sampling is a special case of MH
849
24.3.3
Proposal distributions
850
24.3.4
Adaptive MCMC
853
24.3.5
Initialization and mode hopping
854
24.3.6
Why MH works *
854
24.3.7
Reversible jump (transdimensional) MCMC *
855
Speed and accuracy of MCMC
856
24.4.1
The burnin phase
856
24.4.2
Mixing rates of Markov chains *
857
24.4.3
Practical convergence diagnostics
858
24.4.4
Accuracy of MCMC
860
24.4.5
How many chains?
862
Auxiliary variable MCMC *
863
24.5.1
Auxiliary variable sampling for logistic regression
863
24.5.2
Slice sampling
864
24.5.3
Swendsen Wang
866
24.5.4
Hybrid/Hamiltonian MCMC *
868
Annealing methods
868
24.6.1
Simulated annealing
869
24.6.2
Annealed importance sampling
871
24.6.3
Parallel tempering
871
Approximating the marginal likelihood
872
24.7.1
The candidate method
872
24.7.2
Harmonic mean estimate
872
24.7.3
Annealed importance sampling
873
25 Clustering
875
25.1 Introduction
875
25.1.1
Measuring (dis)similarity
875
25.1.2
Evaluating the output of clustering methods *
25.2 Dirichlet process mixture models
879
25.2.1
From ﬁnite to inﬁnite mixture models
879
25.2.2
The Dirichlet process
882
876
840
CONTENTS
25.3
25.4
25.5
25.6
25.2.3
Applying Dirichlet processes to mixture modeling
25.2.4
Fitting a DP mixture model
886
Affinity propagation
887
Spectral clustering
890
25.4.1
Graph Laplacian
891
25.4.2
Normalized graph Laplacian
892
25.4.3
Example
893
Hierarchical clustering
893
25.5.1
Agglomerative clustering
895
25.5.2
Divisive clustering
898
25.5.3
Choosing the number of clusters
899
25.5.4
Bayesian hierarchical clustering
899
Clustering datapoints and features
901
25.6.1
Biclustering
903
25.6.2
Multiview clustering
903
xxiii
885
26 Graphical model structure learning
907
26.1 Introduction
907
26.2 Structure learning for knowledge discovery
908
26.2.1
Relevance networks
908
26.2.2
Dependency networks
909
26.3 Learning tree structures
910
26.3.1
Directed or undirected tree?
911
26.3.2
ChowLiu algorithm for ﬁnding the ML tree structure
912
26.3.3
Finding the MAP forest
912
26.3.4
Mixtures of trees
914
26.4 Learning DAG structures
914
26.4.1
Markov equivalence
914
26.4.2
Exact structural inference
916
26.4.3
Scaling up to larger graphs
920
26.5 Learning DAG structure with latent variables
922
26.5.1
Approximating the marginal likelihood when we have missing data
26.5.2
Structural EM
925
26.5.3
Discovering hidden variables
926
26.5.4
Case study: Google’s Rephil
928
26.5.5
Structural equation models *
929
26.6 Learning causal DAGs
931
26.6.1
Causal interpretation of DAGs
931
26.6.2
Using causal DAGs to resolve Simpson’s paradox
933
26.6.3
Learning causal DAG structures
935
26.7 Learning undirected Gaussian graphical models
938
26.7.1
MLE for a GGM
938
26.7.2
Graphical lasso
939
26.7.3
Bayesian inference for GGM structure *
941
26.7.4
Handling nonGaussian data using copulas *
942
922
CONTENTS
xxiv
26.8
Learning undirected discrete graphical models
26.8.1
Graphical lasso for MRFs/CRFs
942
26.8.2
Thin junction trees
944
942
27 Latent variable models for discrete data
945
27.1 Introduction
945
27.2 Distributed state LVMs for discrete data
946
27.2.1
Mixture models
946
27.2.2
Exponential family PCA
947
27.2.3
LDA and mPCA
948
27.2.4
GaP model and nonnegative matrix factorization
949
27.3 Latent Dirichlet allocation (LDA)
950
27.3.1
Basics
950
27.3.2
Unsupervised discovery of topics
953
27.3.3
Quantitatively evaluating LDA as a language model
953
27.3.4
Fitting using (collapsed) Gibbs sampling
955
27.3.5
Example
956
27.3.6
Fitting using batch variational inference
957
27.3.7
Fitting using online variational inference
959
27.3.8
Determining the number of topics
960
27.4 Extensions of LDA
961
27.4.1
Correlated topic model
961
27.4.2
Dynamic topic model
962
27.4.3
LDAHMM
963
27.4.4
Supervised LDA
967
27.5 LVMs for graphstructured data
970
27.5.1
Stochastic block model
971
27.5.2
Mixed membership stochastic block model
973
27.5.3
Relational topic model
974
27.6 LVMs for relational data
975
27.6.1
Inﬁnite relational model
976
27.6.2
Probabilistic matrix factorization for collaborative ﬁltering
27.7 Restricted Boltzmann machines (RBMs)
983
27.7.1
Varieties of RBMs
985
27.7.2
Learning RBMs
987
27.7.3
Applications of RBMs
991
28 Deep learning
995
28.1 Introduction
995
28.2 Deep generative models
995
28.2.1
Deep directed networks
996
28.2.2
Deep Boltzmann machines
996
28.2.3
Deep belief networks
997
28.2.4
Greedy layerwise learning of DBNs
28.3 Deep neural networks
999
998
979
CONTENTS
28.4
28.5
xxv
28.3.1
Deep multilayer perceptrons
999
28.3.2
Deep autoencoders
1000
28.3.3
Stacked denoising autoencoders
1001
Applications of deep networks
1001
28.4.1
Handwritten digit classiﬁcation using DBNs
1001
28.4.2
Data visualization and feature discovery using deep autoencoders
28.4.3
Information retrieval using deep autoencoders (semantic hashing)
28.4.4
Learning audio features using 1d convolutional DBNs
1004
28.4.5
Learning image features using 2d convolutional DBNs
1005
Discussion
1005
Notation
Bibliography
1009
1015
Indexes
1047
Index to code
1047
Index to keywords
1050
1002
1003
Preface
Introduction
With the ever increasing amounts of data in electronic form, the need for automated methods
for data analysis continues to grow. The goal of machine learning is to develop methods that
can automatically detect patterns in data, and then to use the uncovered patterns to predict
future data or other outcomes of interest. Machine learning is thus closely related to the ﬁelds
of statistics and data mining, but differs slightly in terms of its emphasis and terminology. This
book provides a detailed introduction to the ﬁeld, and includes worked examples drawn from
application domains such as molecular biology, text processing, computer vision, and robotics.
Target audience
This book is suitable for upperlevel undergraduate students and beginning graduate students in
computer science, statistics, electrical engineering, econometrics, or any one else who has the
appropriate mathematical background. Speciﬁcally, the reader is assumed to already be familiar
with basic multivariate calculus, probability, linear algebra, and computer programming. Prior
exposure to statistics is helpful but not necessary.
A probabilistic approach
This books adopts the view that the best way to make machines that can learn from data is to
use the tools of probability theory, which has been the mainstay of statistics and engineering for
centuries. Probability theory can be applied to any problem involving uncertainty. In machine
learning, uncertainty comes in many forms: what is the best prediction (or decision) given some
data? what is the best model given some data? what measurement should I perform next? etc.
The systematic application of probabilistic reasoning to all inferential problems, including
inferring parameters of statistical models, is sometimes called a Bayesian approach. However,
this term tends to elicit very strong reactions (either positive or negative, depending on who
you ask), so we prefer the more neutral term “probabilistic approach”. Besides, we will often
use techniques such as maximum likelihood estimation, which are not Bayesian methods, but
certainly fall within the probabilistic paradigm.
Rather than describing a cookbook of different heuristic methods, this book stresses a principled modelbased approach to machine learning. For any given model, a variety of algorithms
xxviii
Preface
can often be applied. Conversely, any given algorithm can often be applied to a variety of
models. This kind of modularity, where we distinguish model from algorithm, is good pedagogy
and good engineering.
We will often use the language of graphical models to specify our models in a concise and
intuitive way. In addition to aiding comprehension, the graph structure aids in developing
efficient algorithms, as we will see. However, this book is not primarily about graphical models;
it is about probabilistic modeling in general.
A practical approach
Nearly all of the methods described in this book have been implemented in a MATLAB software
package called PMTK, which stands for probabilistic modeling toolkit. This is freely available
from pmtk3.googlecode.com (the digit 3 refers to the third edition of the toolkit, which is the
one used in this version of the book). There are also a variety of supporting ﬁles, written by other
people, available at pmtksupport.googlecode.com. These will be downloaded automatically,
if you follow the setup instructions described on the PMTK website.
MATLAB is a highlevel, interactive scripting language ideally suited to numerical computation
and data visualization, and can be purchased from www.mathworks.com. Some of the code
requires the Statistics toolbox, which needs to be purchased separately. There is also a free
version of Matlab called Octave, available at http://www.gnu.org/software/octave/, which
supports most of the functionality of MATLAB. Some (but not all) of the code in this book also
works in Octave. See the PMTK website for details.
PMTK was used to generate many of the ﬁgures in this book; the source code for these ﬁgures
is included on the PMTK website, allowing the reader to easily see the effects of changing the
data or algorithm or parameter settings. The book refers to ﬁles by name, e.g., naiveBayesFit.
In order to ﬁnd the corresponding ﬁle, you can use two methods: within Matlab you can type
which naiveBayesFit and it will return the full path to the ﬁle; or, if you do not have Matlab
but want to read the source code anyway, you can use your favorite search engine, which should
return the corresponding ﬁle from the pmtk3.googlecode.com website.
Details on how to use PMTK can be found on the website, which will be udpated over time.
Details on the underlying theory behind these methods can be found in this book.
Acknowledgments
A book this large is obviously a team effort. I would especially like to thank the following people:
my wife Margaret, for keeping the home ﬁres burning as I toiled away in my office for the last six
years; Matt Dunham, who created many of the ﬁgures in this book, and who wrote much of the
code in PMTK; Baback Moghaddam, who gave extremely detailed feedback on every page of an
earlier draft of the book; Chris Williams, who also gave very detailed feedback; Cody Severinski
and WeiLwun Lu, who assisted with ﬁgures; generations of UBC students, who gave helpful
comments on earlier drafts; Daphne Koller, Nir Friedman, and Chris Manning, for letting me use
their latex style ﬁles; Stanford University, Google Research and Skyline College for hosting me
during part of my sabbatical; and various Canadian funding agencies (NSERC, CRC and CIFAR)
who have supported me ﬁnancially over the years.
In addition, I would like to thank the following people for giving me helpful feedback on
Preface
xxix
parts of the book, and/or for sharing ﬁgures, code, exercises or even (in some cases) text: David
Blei, Hannes Bretschneider, Greg Corrado, Arnaud Doucet, Mario Figueiredo, Nando de Freitas,
Mark Girolami, Gabriel Goh, Tom Griffiths, Katherine Heller, Geoff Hinton, Aapo Hyvarinen,
Tommi Jaakkola, Mike Jordan, Charles Kemp, Emtiyaz Khan, Bonnie Kirkpatrick, Daphne Koller,
Zico Kolter, Honglak Lee, Julien Mairal, Andrew McPherson, Tom Minka, Ian Nabney, Arthur
Pope, Carl Rassmussen, Ryan Rifkin, Ruslan Salakhutdinov, Mark Schmidt, Daniel Selsam, David
Sontag, Erik Sudderth, Josh Tenenbaum, Kai Yu, Martin Wainwright, Yair Weiss.
Kevin Patrick Murphy
Palo Alto, California
June 2012
1
1.1
Introduction
Machine learning: what and why?
We are drowning in information and starving for knowledge. — John Naisbitt.
We are entering the era of big data. For example, there are about 1 trillion web pages1 ; one
hour of video is uploaded to YouTube every second, amounting to 10 years of content every
day2 ; the genomes of 1000s of people, each of which has a length of 3.8 × 109 base pairs, have
been sequenced by various labs; Walmart handles more than 1M transactions per hour and has
databases containing more than 2.5 petabytes (2.5 × 1015 ) of information (Cukier 2010); and so
on.
This deluge of data calls for automated methods of data analysis, which is what machine
learning provides. In particular, we deﬁne machine learning as a set of methods that can
automatically detect patterns in data, and then use the uncovered patterns to predict future
data, or to perform other kinds of decision making under uncertainty (such as planning how to
collect more data!).
This books adopts the view that the best way to solve such problems is to use the tools
of probability theory. Probability theory can be applied to any problem involving uncertainty.
In machine learning, uncertainty comes in many forms: what is the best prediction about the
future given some past data? what is the best model to explain some data? what measurement
should I perform next? etc. The probabilistic approach to machine learning is closely related to
the ﬁeld of statistics, but differs slightly in terms of its emphasis and terminology3 .
We will describe a wide variety of probabilistic models, suitable for a wide variety of data and
tasks. We will also describe a wide variety of algorithms for learning and using such models.
The goal is not to develop a cook book of ad hoc techiques, but instead to present a uniﬁed
view of the ﬁeld through the lens of probabilistic modeling and inference. Although we will pay
attention to computational efficiency, details on how to scale these methods to truly massive
datasets are better described in other books, such as (Rajaraman and Ullman 2011; Bekkerman
et al. 2011).
1. http://googleblog.blogspot.com/2008/07/weknewwebwasbig.html
2. Source: http://www.youtube.com/t/press_statistics.
3. Rob Tibshirani, a statistician at Stanford university, has created an amusing comparison between machine learning
and statistics, available at http://wwwstat.stanford.edu/~tibs/stat315a/glossary.pdf.
2
Chapter 1. Introduction
It should be noted, however, that even when one has an apparently massive data set, the
effective number of data points for certain cases of interest might be quite small. In fact, data
across a variety of domains exhibits a property known as the long tail, which means that a
few things (e.g., words) are very common, but most things are quite rare (see Section 2.4.6 for
details). For example, 20% of Google searches each day have never been seen before4 . This
means that the core statistical issues that we discuss in this book, concerning generalizing from
relatively small samples sizes, are still very relevant even in the big data era.
1.1.1
Types of machine learning
Machine learning is usually divided into two main types. In the predictive or supervised
learning approach, the goal is to learn a mapping from inputs x to outputs y, given a labeled
set of inputoutput pairs D = {(xi , yi )}N
i=1 . Here D is called the training set, and N is the
number of training examples.
In the simplest setting, each training input xi is a Ddimensional vector of numbers, representing, say, the height and weight of a person. These are called features, attributes or
covariates. In general, however, xi could be a complex structured object, such as an image, a
sentence, an email message, a time series, a molecular shape, a graph, etc.
Similarly the form of the output or response variable can in principle be anything, but
most methods assume that yi is a categorical or nominal variable from some ﬁnite set,
yi ∈ {1, . . . , C} (such as male or female), or that yi is a realvalued scalar (such as income
level). When yi is categorical, the problem is known as classiﬁcation or pattern recognition,
and when yi is realvalued, the problem is known as regression. Another variant, known as
ordinal regression, occurs where label space Y has some natural ordering, such as grades A–F.
The second main type of machine learning is the descriptive or unsupervised learning
approach. Here we are only given inputs, D = {xi }N
i=1 , and the goal is to ﬁnd “interesting
patterns” in the data. This is sometimes called knowledge discovery. This is a much less
welldeﬁned problem, since we are not told what kinds of patterns to look for, and there is no
obvious error metric to use (unlike supervised learning, where we can compare our prediction
of y for a given x to the observed value).
There is a third type of machine learning, known as reinforcement learning, which is
somewhat less commonly used. This is useful for learning how to act or behave when given
occasional reward or punishment signals. (For example, consider how a baby learns to walk.)
Unfortunately, RL is beyond the scope of this book, although we do discuss decision theory
in Section 5.7, which is the basis of RL. See e.g., (Kaelbling et al. 1996; Sutton and Barto 1998;
Russell and Norvig 2010; Szepesvari 2010; Wiering and van Otterlo 2012) for more information
on RL.
4.
http://certifiedknowledge.org/blog/aresearchqueriesbecomingevenmoreuniquestatistic
sfromgoogle.
1.2. Supervised learning
(a)
3
(b)
Figure 1.1 Left: Some labeled training examples of colored shapes, along with 3 unlabeled test cases.
Right: Representing the training data as an N × D design matrix. Row i represents the feature vector xi .
The last column is the label, yi ∈ {0, 1}. Based on a ﬁgure by Leslie Kaelbling.
1.2
Supervised learning
We begin our investigation of machine learning by discussing supervised learning, which is the
form of ML most widely used in practice.
1.2.1
Classiﬁcation
In this section, we discuss classiﬁcation. Here the goal is to learn a mapping from inputs x
to outputs y, where y ∈ {1, . . . , C}, with C being the number of classes. If C = 2, this is
called binary classiﬁcation (in which case we often assume y ∈ {0, 1}); if C > 2, this is called
multiclass classiﬁcation. If the class labels are not mutually exclusive (e.g., somebody may be
classiﬁed as tall and strong), we call it multilabel classiﬁcation, but this is best viewed as
predicting multiple related binary class labels (a socalled multiple output model). When we
use the term “classiﬁcation”, we will mean multiclass classiﬁcation with a single output, unless
we state otherwise.
One way to formalize the problem is as function approximation. We assume y = f (x) for
some unknown function f , and the goal of learning is to estimate the function f given a labeled
training set, and then to make predictions using ŷ = fˆ(x). (We use the hat symbol to denote
an estimate.) Our main goal is to make predictions on novel inputs, meaning ones that we have
not seen before (this is called generalization), since predicting the response on the training set
is easy (we can just look up the answer).
1.2.1.1
Example
As a simple toy example of classiﬁcation, consider the problem illustrated in Figure 1.1(a). We
have two classes of object which correspond to labels 0 and 1. The inputs are colored shapes.
These have been described by a set of D features or attributes, which are stored in an N × D
design matrix X, shown in Figure 1.1(b). The input features x can be discrete, continuous or a
combination of the two. In addition to the inputs, we have a vector of training labels y.
In Figure 1.1, the test cases are a blue crescent, a yellow circle and a blue arrow. None of
these have been seen before. Thus we are required to generalize beyond the training set. A
Chapter 1. Introduction
4
reasonable guess is that blue crescent should be y = 1, since all blue shapes are labeled 1 in the
training set. The yellow circle is harder to classify, since some yellow things are labeled y = 1
and some are labeled y = 0, and some circles are labeled y = 1 and some y = 0. Consequently
it is not clear what the right label should be in the case of the yellow circle. Similarly, the correct
label for the blue arrow is unclear.
1.2.1.2
The need for probabilistic predictions
To handle ambiguous cases, such as the yellow circle above, it is desirable to return a probability.
The reader is assumed to already have some familiarity with basic concepts in probability. If
not, please consult Chapter 2 for a refresher, if necessary.
We will denote the probability distribution over possible labels, given the input vector x and
training set D by p(yx, D). In general, this represents a vector of length C. (If there are just two
classes, it is sufficient to return the single number p(y = 1x, D), since p(y = 1x, D) + p(y =
0x, D) = 1.) In our notation, we make explicit that the probability is conditional on the test
input x, as well as the training set D, by putting these terms on the right hand side of the
conditioning bar . We are also implicitly conditioning on the form of model that we use to make
predictions. When choosing between different models, we will make this assumption explicit by
writing p(yx, D, M ), where M denotes the model. However, if the model is clear from context,
we will drop M from our notation for brevity.
Given a probabilistic output, we can always compute our “best guess” as to the “true label”
using
C
ŷ = fˆ(x) = argmax p(y = cx, D)
(1.1)
c=1
This corresponds to the most probable class label, and is called the mode of the distribution
p(yx, D); it is also known as a MAP estimate (MAP stands for maximum a posteriori). Using
the most probable label makes intuitive sense, but we will give a more formal justiﬁcation for
this procedure in Section 5.7.
Now consider a case such as the yellow circle, where p(ŷx, D) is far from 1.0. In such a
case we are not very conﬁdent of our answer, so it might be better to say “I don’t know” instead
of returning an answer that we don’t really trust. This is particularly important in domains
such as medicine and ﬁnance where we may be risk averse, as we explain in Section 5.7.
Another application where it is important to assess risk is when playing TV game shows, such
as Jeopardy. In this game, contestants have to solve various word puzzles and answer a variety
of trivia questions, but if they answer incorrectly, they lose money. In 2011, IBM unveiled a
computer system called Watson which beat the top human Jeopardy champion. Watson uses a
variety of interesting techniques (Ferrucci et al. 2010), but the most pertinent one for our present
purposes is that it contains a module that estimates how conﬁdent it is of its answer. The system
only chooses to “buzz in” its answer if sufficiently conﬁdent it is correct. Similarly, Google has a
system known as SmartASS (ad selection system) that predicts the probability you will click on
an ad based on your search history and other user and adspeciﬁc features (Metz 2010). This
probability is known as the clickthrough rate or CTR, and can be used to maximize expected
proﬁt. We will discuss some of the basic principles behind systems such as SmartASS later in
this book.
1.2. Supervised learning
5
100
200
documents
300
400
500
600
700
800
900
1000
10
20
30
40
50
words
60
70
80
90
100
Figure 1.2 Subset of size 16242 x 100 of the 20newsgroups data. We only show 1000 rows, for clarity.
Each row is a document (represented as a bagofwords bit vector), each column is a word. The red
lines separate the 4 classes, which are (in descending order) comp, rec, sci, talk (these are the titles of
USENET groups). We can see that there are subsets of words whose presence or absence is indicative
of the class. The data is available from http://cs.nyu.edu/~roweis/data.html. Figure generated by
newsgroupsVisualize.
1.2.1.3
Realworld applications
Classiﬁcation is probably the most widely used form of machine learning, and has been used
to solve many interesting and often difficult realworld problems. We have already mentioned
some important applciations. We give a few more examples below.
Document classiﬁcation and email spam ﬁltering
In document classiﬁcation, the goal is to classify a document, such as a web page or email
message, into one of C classes, that is, to compute p(y = cx, D), where x is some representation of the text. A special case of this is email spam ﬁltering, where the classes are spam
y = 1 or ham y = 0.
Most classiﬁers assume that the input vector x has a ﬁxed size. A common way to represent
variablelength documents in featurevector format is to use a bag of words representation.
This is explained in detail in Section 3.4.4.1, but the basic idea is to deﬁne xij = 1 iff word j
occurs in document i. If we apply this transformation to every document in our data set, we get
a binary document × word cooccurrence matrix: see Figure 1.2 for an example. Essentially the
document classiﬁcation problem has been reduced to one that looks for subtle changes in the
pattern of bits. For example, we may notice that most spam messages have a high probability of
containing the words “buy”, “cheap”, “viagra”, etc. In Exercise 8.1 and Exercise 8.2, you will get
handson experience applying various classiﬁcation techniques to the spam ﬁltering problem.
Chapter 1. Introduction
6
(a)
(b)
(c)
Figure 1.3 Three types of iris ﬂowers: setosa, versicolor and virginica. Source: http://www.statlab.u
niheidelberg.de/data/iris/ . Used with kind permission of Dennis Kramb and SIGNA.
sepal width
petal length
petal width
petal width
petal length
sepal width
sepal length
sepal length
Figure 1.4 Visualization of the Iris data as a pairwise scatter plot. The diagonal plots the marginal
histograms of the 4 features. The off diagonals contain scatterplots of all possible pairs of features. Red
circle = setosa, green diamond = versicolor, blue star = virginica. Figure generated by fisheririsDemo.
Classifying ﬂowers
Figure 1.3 gives another example of classiﬁcation, due to the statistician Ronald Fisher. The goal
is to learn to distinguish three different kinds of iris ﬂower, called setosa, versicolor and virginica.
Fortunately, rather than working directly with images, a botanist has already extracted 4 useful
features or characteristics: sepal length and width, and petal length and width. (Such feature
extraction is an important, but difficult, task. Most machine learning methods use features
chosen by some human. Later we will discuss some methods that can learn good features from
the data.) If we make a scatter plot of the iris data, as in Figure 1.4, we see that it is easy to
distinguish setosas (red circles) from the other two classes by just checking if their petal length
1.2. Supervised learning
7
true class = 7
true class = 2
true class = 1
true class = 7
true class = 2
true class = 1
true class = 0
true class = 4
true class = 1
true class = 0
true class = 4
true class = 1
true class = 4
true class = 9
true class = 5
true class = 4
true class = 9
true class = 5
(a)
(b)
Figure 1.5 (a) First 9 test MNIST grayscale images. (b) Same as (a), but with the features permuted
randomly. Classiﬁcation performance is identical on both versions of the data (assuming the training data
is permuted in an identical way). Figure generated by shuffledDigitsDemo.
or width is below some threshold. However, distinguishing versicolor from virginica is slightly
harder; any decision will need to be based on at least two features. (It is always a good idea
to perform exploratory data analysis, such as plotting the data, before applying a machine
learning method.)
Image classiﬁcation and handwriting recognition
Now consider the harder problem of classifying images directly, where a human has not preprocessed the data. We might want to classify the image as a whole, e.g., is it an indoors or
outdoors scene? is it a horizontal or vertical photo? does it contain a dog or not? This is called
image classiﬁcation.
In the special case that the images consist of isolated handwritten letters and digits, for
example, in a postal or ZIP code on a letter, we can use classiﬁcation to perform handwriting
recognition. A standard dataset used in this area is known as MNIST, which stands for “Modiﬁed
National Institute of Standards”5 . (The term “modiﬁed” is used because the images have been
preprocessed to ensure the digits are mostly in the center of the image.) This dataset contains
60,000 training images and 10,000 test images of the digits 0 to 9, as written by various people.
The images are size 28 × 28 and have grayscale values in the range 0 : 255. See Figure 1.5(a) for
some example images.
Many generic classiﬁcation methods ignore any structure in the input features, such as spatial
layout. Consequently, they can also just as easily handle data that looks like Figure 1.5(b), which
is the same data except we have randomly permuted the order of all the features. (You will
verify this in Exercise 1.1.) This ﬂexibility is both a blessing (since the methods are general
purpose) and a curse (since the methods ignore an obviously useful source of information). We
will discuss methods for exploiting structure in the input features later in the book.
5. Available from http://yann.lecun.com/exdb/mnist/.
Chapter 1. Introduction
8
(a)
(b)
Figure 1.6 Example of face detection. (a) Input image (Murphy family, photo taken 5 August 2010). Used
with kind permission of Bernard Diedrich of Sherwood Studios. (b) Output of classiﬁer, which detected 5
faces at different poses. This was produced using the online demo at http://demo.pittpatt.com/. The
classiﬁer was trained on 1000s of manually labeled images of faces and nonfaces, and then was applied
to a dense set of overlapping patches in the test image. Only the patches whose probability of containing
a face was sufficiently high were returned. Used with kind permission of Pittpatt.com
Face detection and recognition
A harder problem is to ﬁnd objects within an image; this is called object detection or object
localization. An important special case of this is face detection. One approach to this problem
is to divide the image into many small overlapping patches at different locations, scales and
orientations, and to classify each such patch based on whether it contains facelike texture or
not. This is called a sliding window detector. The system then returns those locations where
the probability of face is sufficiently high. See Figure 1.6 for an example. Such face detection
systems are builtin to most modern digital cameras; the locations of the detected faces are
used to determine the center of the autofocus. Another application is automatically blurring
out faces in Google’s StreetView system.
Having found the faces, one can then proceed to perform face recognition, which means
estimating the identity of the person (see Figure 1.10(a)). In this case, the number of class labels
might be very large. Also, the features one should use are likely to be different than in the face
detection problem: for recognition, subtle differences between faces such as hairstyle may be
important for determining identity, but for detection, it is important to be invariant to such
details, and to just focus on the differences between faces and nonfaces. For more information
about visual object detection, see e.g., (Szeliski 2010).
1.2.2
Regression
Regression is just like classiﬁcation except the response variable is continuous. Figure 1.7 shows
a simple example: we have a single realvalued input xi ∈ R, and a single realvalued response
yi ∈ R. We consider ﬁtting two models to the data: a straight line and a quadratic function.
(We explain how to ﬁt such models below.) Various extensions of this basic problem can arise,
such as having highdimensional inputs, outliers, nonsmooth responses, etc. We will discuss
ways to handle such problems later in the book.
1.3. Unsupervised learning
9
degree 1
degree 2
15
15
10
10
5
5
0
0
−5
−5
−10
−10
0
5
10
15
20
0
5
(a)
10
15
20
(b)
Figure 1.7 (a) Linear regression on some 1d data. (b) Same data with polynomial regression (degree 2).
Figure generated by linregPolyVsDegree.
Here are some examples of realworld regression problems.
• Predict tomorrow’s stock market price given current market conditions and other possible
side information.
• Predict the age of a viewer watching a given video on YouTube.
• Predict the location in 3d space of a robot arm end effector, given control signals (torques)
sent to its various motors.
• Predict the amount of prostate speciﬁc antigen (PSA) in the body as a function of a number
of different clinical measurements.
• Predict the temperature at any location inside a building using weather data, time, door
sensors, etc.
1.3
Unsupervised learning
We now consider unsupervised learning, where we are just given output data, without any
inputs. The goal is to discover “interesting structure” in the data; this is sometimes called
knowledge discovery. Unlike supervised learning, we are not told what the desired output is
for each input. Instead, we will formalize our task as one of density estimation, that is, we
want to build models of the form p(xi θ). There are two differences from the supervised case.
First, we have written p(xi θ) instead of p(yi xi , θ); that is, supervised learning is conditional
density estimation, whereas unsupervised learning is unconditional density estimation. Second,
xi is a vector of features, so we need to create multivariate probability models. By contrast,
in supervised learning, yi is usually just a single variable that we are trying to predict. This
means that for most supervised learning problems, we can use univariate probability models
(with inputdependent parameters), which signiﬁcantly simpliﬁes the problem. (We will discuss
multioutput classiﬁcation in Chapter 19, where we will see that it also involves multivariate
probability models.)
Unsupervised learning is arguably more typical of human and animal learning. It is also
more widely applicable than supervised learning, since it does not require a human expert to
Chapter 1. Introduction
10
280
260
260
240
240
220
220
200
200
weight
weight
K=2
280
180
180
160
160
140
140
120
120
100
80
55
100
60
65
70
height
(a)
75
80
80
55
60
65
70
75
80
height
(b)
Figure 1.8 (a) The height and weight of some people. (b) A possible clustering using K = 2 clusters.
Figure generated by kmeansHeightWeight.
manually label the data. Labeled data is not only expensive to acquire6 , but it also contains
relatively little information, certainly not enough to reliably estimate the parameters of complex
models. Geoff Hinton, who is a famous professor of ML at the University of Toronto, has said:
When we’re learning to see, nobody’s telling us what the right answers are — we just
look. Every so often, your mother says “that’s a dog”, but that’s very little information.
You’d be lucky if you got a few bits of information — even one bit per second — that
way. The brain’s visual system has 1014 neural connections. And you only live for 109
seconds. So it’s no use learning one bit per second. You need more like 105 bits per
second. And there’s only one place you can get that much information: from the input
itself. — Geoffrey Hinton, 1996 (quoted in (Gorder 2006)).
Below we describe some canonical examples of unsupervised learning.
1.3.1
Discovering clusters
As a canonical example of unsupervised learning, consider the problem of clustering data into
groups. For example, Figure 1.8(a) plots some 2d data, representing the height and weight of
a group of 210 people. It seems that there might be various clusters, or subgroups, although
it is not clear how many. Let K denote the number of clusters. Our ﬁrst goal is to estimate
the distribution over the number of clusters, p(KD); this tells us if there are subpopulations
within the data. For simplicity, we often approximate the distribution p(KD) by its mode,
K ∗ = arg maxK p(KD). In the supervised case, we were told that there are two classes (male
and female), but in the unsupervised case, we are free to choose as many or few clusters as we
like. Picking a model of the “right” complexity is called model selection, and will be discussed
in detail below.
Our second goal is to estimate which cluster each point belongs to. Let zi ∈ {1, . . . , K}
represent the cluster to which data point i is assigned. (zi is an example of a hidden or
6. The advent of crowd sourcing web sites such as Mechanical Turk, (https://www.mturk.com/mturk/welcome),
which outsource data processing tasks to humans all over the world, has reduced the cost of labeling data. Nevertheless,
the amount of unlabeled data is still orders of magnitude larger than the amount of labeled data.
1.3. Unsupervised learning
11
2
0
4
2
−2
0
4
2
0
−2
−4
−8
−6
−4
−2
0
2
4
(a)
6
8
5
−2
0
−4
−5
(b)
Figure 1.9 (a) A set of points that live on a 2d linear subspace embedded in 3d. The solid red line is the
ﬁrst principal component direction. The dotted black line is the second PC direction. (b) 2D representation
of the data. Figure generated by pcaDemo3d.
latent variable, since it is never observed in the training set.) We can infer which cluster each
data point belongs to by computing zi∗ = argmaxk p(zi = kxi , D). This is illustrated in
Figure 1.8(b), where we use different colors to indicate the assignments, assuming K = 2.
In this book, we focus on model based clustering, which means we ﬁt a probabilistic model
to the data, rather than running some ad hoc algorithm. The advantages of the modelbased
approach are that one can compare different kinds of models in an objective way (in terms of
the likelihood they assign to the data), we can combine them together into larger systems, etc.
Here are some real world applications of clustering.
• In astronomy, the autoclass system (Cheeseman et al. 1988) discovered a new type of star,
based on clustering astrophysical measurements.
• In ecommerce, it is common to cluster users into groups, based on their purchasing or
websurﬁng behavior, and then to send customized targeted advertising to each group (see
e.g., (Berkhin 2006)).
• In biology, it is common to cluster ﬂowcytometry data into groups, to discover different
subpopulations of cells (see e.g., (Lo et al. 2009)).
1.3.2
Discovering latent factors
When dealing with high dimensional data, it is often useful to reduce the dimensionality by
projecting the data to a lower dimensional subspace which captures the “essence” of the data.
This is called dimensionality reduction. A simple example is shown in Figure 1.9, where we
project some 3d data down to a 2d plane. The 2d approximation is quite good, since most points
lie close to this subspace. Reducing to 1d would involve projecting points onto the red line in
Figure 1.9(a); this would be a rather poor approximation. (We will make this notion precise in
Chapter 12.)
The motivation behind this technique is that although the data may appear high dimensional,
there may only be a small number of degrees of variability, corresponding to latent factors. For
example, when modeling the appearance of face images, there may only be a few underlying
latent factors which describe most of the variability, such as lighting, pose, identity, etc, as
illustrated in Figure 1.10.
Chapter 1. Introduction
12
(a)
(b)
Figure 1.10 a) 25 randomly chosen 64 × 64 pixel images from the Olivetti face database. (b) The mean
and the ﬁrst three principal component basis vectors (eigenfaces). Figure generated by pcaImageDemo.
When used as input to other statistical models, such low dimensional representations often
result in better predictive accuracy, because they focus on the “essence” of the object, ﬁltering
out inessential features. Also, low dimensional representations are useful for enabling fast
nearest neighbor searches and two dimensional projections are very useful for visualizing high
dimensional data.
The most common approach to dimensionality reduction is called principal components
analysis or PCA. This can be thought of as an unsupervised version of (multioutput) linear
regression, where we observe the highdimensional response y, but not the lowdimensional
“cause” z. Thus the model has the form z → y; we have to “invert the arrow”, and infer the
latent lowdimensional z from the observed highdimensional y. See Section 12.1 for details.
Dimensionality reduction, and PCA in particular, has been applied in many different areas.
Some examples include the following:
• In biology, it is common to use PCA to interpret gene microarray data, to account for the
fact that each measurement is usually the result of many genes which are correlated in their
behavior by the fact that they belong to different biological pathways.
• In natural language processing, it is common to use a variant of PCA called latent semantic
analysis for document retrieval (see Section 27.2.2).
• In signal processing (e.g., of acoustic or neural signals), it is common to use ICA (which is a
variant of PCA) to separate signals into their different sources (see Section 12.6).
• In computer graphics, it is common to project motion capture data to a low dimensional
space, and use it to create animations. See Section 15.5 for one way to tackle such problems.
1.3. Unsupervised learning
13
Figure 1.11 A sparse undirected Gaussian graphical model learned using graphical lasso (Section 26.7.2)
applied to some ﬂow cytometry data (from (Sachs et al. 2005)), which measures the phosphorylation status
of 11 proteins. Figure generated by ggmLassoDemo.
1.3.3
Discovering graph structure
Sometimes we measure a set of correlated variables, and we would like to discover which ones
are most correlated with which others. This can be represented by a graph G, in which nodes
represent variables, and edges represent direct dependence between variables (we will make
this precise in Chapter 10, when we discuss graphical models). We can then learn this graph
structure from data, i.e., we compute Ĝ = argmax p(GD).
As with unsupervised learning in general, there are two main applications for learning sparse
graphs: to discover new knowledge, and to get better joint probability density estimators. We
now give somes example of each.
• Much of the motivation for learning sparse graphical models comes from the systems biology
community. For example, suppose we measure the phosphorylation status of some proteins
in a cell (Sachs et al. 2005). Figure 1.11 gives an example of a graph structure that was learned
from this data (using methods discussed in Section 26.7.2). As another example, Smith et al.
(2006) showed that one can recover the neural “wiring diagram” of a certain kind of bird
from timeseries EEG data. The recovered structure closely matched the known functional
connectivity of this part of the bird brain.
• In some cases, we are not interested in interpreting the graph structure, we just want to
use it to model correlations and to make predictions. One example of this is in ﬁnancial
portfolio management, where accurate models of the covariance between large numbers of
different stocks is important. Carvalho and West (2007) show that by learning a sparse graph,
and then using this as the basis of a trading strategy, it is possible to outperform (i.e., make
more money than) methods that do not exploit sparse graphs. Another example is predicting
traffic jams on the freeway. Horvitz et al. (2005) describe a deployed system called JamBayes
for predicting traffic ﬂow in the Seattle area; predictions are made using a graphical model
whose structure was learned from data.
Chapter 1. Introduction
14
(a)
(b)
Figure 1.12 (a) A noisy image with an occluder. (b) An estimate of the underlying pixel intensities, based
on a pairwise MRF model. Source: Figure 8 of (Felzenszwalb and Huttenlocher 2006). Used with kind
permission of Pedro Felzenszwalb.
1.3.4
Matrix completion
Sometimes we have missing data, that is, variables whose values are unknown. For example, we
might have conducted a survey, and some people might not have answered certain questions.
Or we might have various sensors, some of which fail. The corresponding design matrix will
then have “holes” in it; these missing entries are often represented by NaN, which stands for
“not a number”. The goal of imputation is to infer plausible values for the missing entries. This
is sometimes called matrix completion. Below we give some example applications.
1.3.4.1
Image inpainting
An interesting example of an imputationlike task is known as image inpainting. The goal is
to “ﬁll in” holes (e.g., due to scratches or occlusions) in an image with realistic texture. This is
illustrated in Figure 1.12, where we denoise the image, as well as impute the pixels hidden behind
the occlusion. This can be tackled by building a joint probability model of the pixels, given a
set of clean images, and then inferring the unknown variables (pixels) given the known variables
(pixels). This is somewhat like masket basket analysis, except the data is realvalued and spatially
structured, so the kinds of probability models we use are quite different. See Sections 19.6.2.7
and 13.8.4 for some possible choices.
1.3.4.2
Collaborative ﬁltering
Another interesting example of an imputationlike task is known as collaborative ﬁltering. A
common example of this concerns predicting which movies people will want to watch based
on how they, and other people, have rated movies which they have already seen. The key idea
is that the prediction is not based on features of the movie or user (although it could be), but
merely on a ratings matrix. More precisely, we have a matrix X where X(m, u) is the rating
1.3. Unsupervised learning
15
XVHUV
PRYLHV
"
"
"
"
Figure 1.13 Example of movierating data. Training data is in red, test data is denoted by ?, empty cells
are unknown.
(say an integer between 1 and 5, where 1 is dislike and 5 is like) by user u of movie m. Note
that most of the entries in X will be missing or unknown, since most users will not have rated
most movies. Hence we only observe a tiny subset of the X matrix, and we want to predict
a different subset. In particular, for any given user u, we might want to predict which of the
unrated movies he/she is most likely to want to watch.
In order to encourage research in this area, the DVD rental company Netﬂix created a competition, launched in 2006, with a $1M USD prize (see http://netflixprize.com/). In
particular, they provided a large matrix of ratings, on a scale of 1 to 5, for ∼ 18k movies
created by ∼ 500k users. The full matrix would have ∼ 9 × 109 entries, but only about 1%
of the entries are observed, so the matrix is extremely sparse. A subset of these are used for
training, and the rest for testing, as shown in Figure 1.13. The goal of the competition was to
predict more accurately than Netﬂix’s existing system. On 21 September 2009, the prize was
awarded to a team of researchers known as “BellKor’s Pragmatic Chaos”. Section 27.6.2 discusses
some of their methodology. Further details on the teams and their methods can be found at
http://www.netflixprize.com/community/viewtopic.php?id=1537.
1.3.4.3
Market basket analysis
In commercial data mining, there is much interest in a task called market basket analysis. The
data consists of a (typically very large but sparse) binary matrix, where each column represents
an item or product, and each row represents a transaction. We set xij = 1 if item j was
purchased on the i’th transaction. Many items are purchased together (e.g., bread and butter),
so there will be correlations amongst the bits. Given a new partially observed bit vector,
representing a subset of items that the consumer has bought, the goal is to predict which other
bits are likely to turn on, representing other items the consumer might be likely to buy. (Unlike
collaborative ﬁltering, we often assume there is no missing data in the training data, since we
know the past shopping behavior of each customer.)
This task arises in other domains besides modeling purchasing patterns. For example, similar
techniques can be used to model dependencies between ﬁles in complex software systems. In
this case, the task is to predict, given a subset of ﬁles that have been changed, which other ones
need to be updated to ensure consistency (see e.g., (Hu et al. 2010)).
It is common to solve such tasks using frequent itemset mining, which create association
rules (see e.g., (Hastie et al. 2009, sec 14.2) for details). Alternatively, we can adopt a probabilistic
approach, and ﬁt a joint density model p(x1 , . . . , xD ) to the bit vectors, see e.g., (Hu et al.
Chapter 1. Introduction
16
(a)
(b)
Figure 1.14 (a) Illustration of a Knearest neighbors classiﬁer in 2d for K = 3. The 3 nearest neighbors
of test point x1 have labels 1, 1 and 0, so we predict p(y = 1x1 , D, K = 3) = 2/3. The 3 nearest
neighbors of test point x2 have labels 0, 0, and 0, so we predict p(y = 1x2 , D, K = 3) = 0/3. (b)
Illustration of the Voronoi tesselation induced by 1NN. Based on Figure 4.13 of (Duda et al. 2001). Figure
generated by knnVoronoi.
2010). Such models often have better predictive acccuracy than association rules, although they
may be less interpretible. This is typical of the difference between data mining and machine
learning: in data mining, there is more emphasis on interpretable models, whereas in machine
learning, there is more emphasis on accurate models.
1.4
Some basic concepts in machine learning
In this Section, we provide an introduction to some key ideas in machine learning. We will
expand on these concepts later in the book, but we introduce them brieﬂy here, to give a ﬂavor
of things to come.
1.4.1
Parametric vs nonparametric models
In this book, we will be focussing on probabilistic models of the form p(yx) or p(x), depending
on whether we are interested in supervised or unsupervised learning respectively. There are
many ways to deﬁne such models, but the most important distinction is this: does the model
have a ﬁxed number of parameters, or does the number of parameters grow with the amount
of training data? The former is called a parametric model, and the latter is called a nonparametric model. Parametric models have the advantage of often being faster to use, but the
disadvantage of making stronger assumptions about the nature of the data distributions. Nonparametric models are more ﬂexible, but often computationally intractable for large datasets.
We will give examples of both kinds of models in the sections below. We focus on supervised
learning for simplicity, although much of our discussion also applies to unsupervised learning.
1.4.2
A simple nonparametric classiﬁer: Knearest neighbors
A simple example of a nonparametric classiﬁer is the K nearest neighbor (KNN) classiﬁer.
This simply “looks at” the K points in the training set that are nearest to the test input x,
1.4. Some basic concepts in machine learning
17
train
p(y=1data,K=10)
1
120
5
0.9
4
100
0.8
0.7
3
80
0.6
2
0.5
60
1
0.4
40
0.3
0
0.2
20
−1
0.1
−2
−3
−2
−1
0
1
2
20
3
40
60
(a)
80
0
100
(b)
predicted label, K=10
p(y=2data,K=10)
1
120
5
0.9
100
0.8
0.7
4
3
80
0.6
2
0.5
60
0.4
40
0.3
1
0
0.2
20
−1
0.1
20
40
60
80
100
0
−2
−3
c1
c2
c3
−2
−1
0
1
2
3
(d)
(c)
Figure 1.15 (a) Some synthetic 3class training data in 2d. (b) Probability of class 1 for KNN with K = 10.
(c) Probability of class 2. (d) MAP estimate of class label. Figure generated by knnClassifyDemo.
counts how many members of each class are in this set, and returns that empirical fraction as
the estimate, as illustrated in Figure 1.14. More formally,
1
p(y = cx, D, K) =
I(yi = c)
(1.2)
K
i∈NK (x,D)
where NK (x, D)
function deﬁned
1
I(e) =
0
are the (indices of the) K nearest points to x in D and I(e) is the indicator
as follows:
if e is true
if e is false
(1.3)
This method is an example of memorybased learning or instancebased learning. It can
be derived from a probabilistic framework as explained in Section 14.7.3. The most common
Chapter 1. Introduction
18
1
0.9
d=10
d=7
d=5
0.8
1
Edge length of cube
0.7
d=3
0.6
0.5
0.4
d=1
0.3
0.2
0
0.1
s
1
(a)
0
0
0.2
0.4
0.6
Fraction of data in neighborhood
0.8
1
(b)
Figure 1.16 Illustration of the curse of dimensionality. (a) We embed a small cube of side s inside a larger
unit cube. (b) We plot the edge length of a cube needed to cover a given volume of the unit cube as a
function of the number of dimensions. Based on Figure 2.6 from (Hastie et al. 2009). Figure generated by
curseDimensionality.
distance metric to use is Euclidean distance (which limits the applicability of the technique to
data which is realvalued), although other metrics can be used.
Figure 1.15 gives an example of the method in action, where the input is two dimensional, we
have three classes, and K = 10. (We discuss the effect of K below.) Panel (a) plots the training
data. Panel (b) plots p(y = 1x, D) where x is evaluated on a grid of points. Panel (c) plots
p(y = 2x, D). We do not need to plot p(y = 3x, D), since probabilities sum to one. Panel (d)
plots the MAP estimate ŷ(x) = argmaxc (y = cx, D).
A KNN classiﬁer with K = 1 induces a Voronoi tessellation of the points (see Figure 1.14(b)).
This is a partition of space which associates a region V (xi ) with each point xi in such a way
that all points in V (xi ) are closer to xi than to any other point. Within each cell, the predicted
label is the label of the corresponding training point.
1.4.3
The curse of dimensionality
The KNN classiﬁer is simple and can work quite well, provided it is given a good distance metric
and has enough labeled training data. In fact, it can be shown that the KNN classiﬁer can come
within a factor of 2 of the best possible performance if N → ∞ (Cover and Hart 1967).
However, the main problem with KNN classiﬁers is that they do not work well with high
dimensional inputs. The poor performance in high dimensional settings is due to the curse of
dimensionality.
To explain the curse, we give some examples from (Hastie et al. 2009, p22). Consider applying
a KNN classiﬁer to data where the inputs are uniformly distributed in the Ddimensional unit
cube. Suppose we estimate the density of class labels around a test point x by “growing” a
hypercube around x until it contains a desired fraction f of the data points. The expected edge
length of this cube will be eD (f ) = f 1/D . If D = 10, and we want to base our estimate on 10%
1.4. Some basic concepts in machine learning
19
PDF
0.4
0.35
0.3
0.25
0.2
0.15
0.1
0.05
0
−3
−2
−1
0
1
2
3
(a)
(b)
Figure 1.17 (a) A Gaussian pdf with mean 0 and variance 1. Figure generated by gaussPlotDemo. (b)
Visualization of the conditional density model p(yx, θ) = N (yw0 + w1 x, σ 2 ). The density falls off
exponentially fast as we move away from the regression line. Figure generated by linregWedgeDemo2.
of the data, we have e10 (0.1) = 0.8, so we need to extend the cube 80% along each dimension
around x. Even if we only use 1% of the data, we ﬁnd e10 (0.01) = 0.63: see Figure 1.16. Since
the entire range of the data is only 1 along each dimension, we see that the method is no longer
very local, despite the name “nearest neighbor”. The trouble with looking at neighbors that are
so far away is that they may not be good predictors about the behavior of the inputoutput
function at a given point.
1.4.4
Parametric models for classiﬁcation and regression
The main way to combat the curse of dimensionality is to make some assumptions about
the nature of the data distribution (either p(yx) for a supervised problem or p(x) for an
unsupervised problem). These assumptions, known as inductive bias, are often embodied in
the form of a parametric model, which is a statistical model with a ﬁxed number of parameters.
Below we brieﬂy describe two widely used examples; we will revisit these and other models in
much greater depth later in the book.
1.4.5
Linear regression
One of the most widely used models for regression is known as linear regression. This asserts
that the response is a linear function of the inputs. This can be written as follows:
y(x) = wT x + =
D
w j xj +
(1.4)
j=1
where wT x represents the inner or scalar product between the input vector x and the model’s
weight vector w7 , and is the residual error between our linear predictions and the true
response.
7. In statistics, it is more common to denote the regression weights by β.
Chapter 1. Introduction
20
degree 14
degree 20
15
15
10
10
5
5
0
0
−5
−5
−10
−10
0
5
10
15
20
0
5
(a)
10
15
20
(b)
Figure 1.18 Polynomial of degrees 14 and 20 ﬁt by least squares to 21 data points. Figure generated by
linregPolyVsDegree.
We often assume that has a Gaussian8 or normal distribution. We denote this by ∼
N (μ, σ 2 ), where μ is the mean and σ 2 is the variance (see Chapter 2 for details). When we plot
this distribution, we get the wellknown bell curve shown in Figure 1.17(a).
To make the connection between linear regression and Gaussians more explicit, we can rewrite
the model in the following form:
p(yx, θ) = N (yμ(x), σ 2 (x))
(1.5)
This makes it clear that the model is a conditional probability density. In the simplest case, we
assume μ is a linear function of x, so μ = wT x, and that the noise is ﬁxed, σ 2 (x) = σ 2 . In
this case, θ = (w, σ 2 ) are the parameters of the model.
For example, suppose the input is 1 dimensional. We can represent the expected response as
follows:
μ(x) = w0 + w1 x = wT x
(1.6)
where w0 is the intercept or bias term, w1 is the slope, and where we have deﬁned the vector
x = (1, x). (Prepending a constant 1 term to an input vector is a common notational trick which
allows us to combine the intercept term with the other terms in the model.) If w1 is positive,
it means we expect the output to increase as the input increases. This is illustrated in 1d in
Figure 1.17(b); a more conventional plot, of the mean response vs x, is shown in Figure 1.7(a).
Linear regression can be made to model nonlinear relationships by replacing x with some
nonlinear function of the inputs, φ(x). That is, we use
p(yx, θ) = N (ywT φ(x), σ 2 )
(1.7)
This is known as basis function expansion. For example, Figure 1.18 illustrates the case where
φ(x) = [1, x, x2 , . . . , xd ], for d = 14 and d = 20; this is known as polynomial regression.
We will consider other kinds of basis functions later in the book. In fact, many popular
machine learning methods — such as support vector machines, neural networks, classiﬁcation
and regression trees, etc. — can be seen as just different ways of estimating basis functions
from data, as we discuss in Chapters 14 and 16.
8. Carl Friedrich Gauss (1777–1855) was a German mathematician and physicist.
1.4. Some basic concepts in machine learning
21
1
1
0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
0.1
0.1
0
0
−10
−5
0
5
10
460
480
500
520
(a)
540
560
580
600
620
640
(b)
Figure 1.19 (a) The sigmoid or logistic function. We have sigm(−∞) = 0, sigm(0) = 0.5, and
sigm(∞) = 1. Figure generated by sigmoidPlot. (b) Logistic regression for SAT scores. Solid black dots
are the data. The open red circles are the predicted probabilities. The green crosses denote two students
with the same SAT score of 525 (and hence same input representation x) but with different training labels
(one student passed, y = 1, the other failed, y = 0). Hence this data is not perfectly separable using just
the SAT feature. Figure generated by logregSATdemo.
1.4.6
Logistic regression
We can generalize linear regression to the (binary) classiﬁcation setting by making two changes.
First we replace the Gaussian distribution for y with a Bernoulli distribution9 ,which is more
appropriate for the case when the response is binary, y ∈ {0, 1}. That is, we use
p(yx, w) = Ber(yμ(x))
(1.8)
where μ(x) = E [yx] = p(y = 1x). Second, we compute a linear combination of the inputs,
as before, but then we pass this through a function that ensures 0 ≤ μ(x) ≤ 1 by deﬁning
μ(x) = sigm(wT x)
(1.9)
where sigm(η) refers to the sigmoid function, also known as the logistic or logit function.
This is deﬁned as
sigm(η)
1
eη
= η
1 + exp(−η)
e +1
(1.10)
The term “sigmoid” means Sshaped: see Figure 1.19(a) for a plot. It is also known as a squashing
function, since it maps the whole real line to [0, 1], which is necessary for the output to be
interpreted as a probability.
Putting these two steps together we get
p(yx, w) = Ber(ysigm(wT x))
(1.11)
This is called logistic regression due to its similarity to linear regression (although it is a form
of classiﬁcation, not regression!).
9. Daniel Bernoulli (1700–1782) was a DutchSwiss mathematician and physicist.
Chapter 1. Introduction
22
A simple example of logistic regression is shown in Figure 1.19(b), where we plot
p(yi = 1xi , w) = sigm(w0 + w1 xi )
(1.12)
where xi is the SAT10 score of student i and yi is whether they passed or failed a class. The
solid black dots show the training data, and the red circles plot p(y = 1xi , ŵ), where ŵ are
the parameters estimated from the training data (we discuss how to compute these estimates in
Section 8.3.4).
If we threshold the output probability at 0.5, we can induce a decision rule of the form
ŷ(x) = 1 ⇐⇒ p(y = 1x) > 0.5
(1.13)
By looking at Figure 1.19(b), we see that sigm(w0 + w1 x) = 0.5 for x ≈ 545 = x∗ . We can
imagine drawing a vertical line at x = x∗ ; this is known as a decision boundary. Everything to
the left of this line is classiﬁed as a 0, and everything to the right of the line is classiﬁed as a 1.
We notice that this decision rule has a nonzero error rate even on the training set. This
is because the data is not linearly separable, i.e., there is no straight line we can draw to
separate the 0s from the 1s. We can create models with nonlinear decision boundaries using
basis function expansion, just as we did with nonlinear regression. We will see many examples
of this later in the book.
1.4.7
Overﬁtting
When we ﬁt highly ﬂexible models, we need to be careful that we do not overﬁt the data, that
is, we should avoid trying to model every minor variation in the input, since this is more likely
to be noise than true signal. This is illustrated in Figure 1.18(b), where we see that using a high
degree polynomial results in a curve that is very “wiggly”. It is unlikely that the true function
has such extreme oscillations. Thus using such a model might result in accurate predictions of
future outputs.
As another example, consider the KNN classiﬁer. The value of K can have a large effect on
the behavior of this model. When K = 1, the method makes no errors on the training set (since
we just return the labels of the original training points), but the resulting prediction surface is
very “wiggly” (see Figure 1.20(a)). Therefore the method may not work well at predicting future
data. In Figure 1.20(b), we see that using K = 5 results in a smoother prediction surface,
because we are averaging over a larger neighborhood. As K increases, the predictions becomes
smoother until, in the limit of K = N , we end up predicting the majority label of the whole
data set. Below we discuss how to pick the “right” value of K.
1.4.8
Model selection
When we have a variety of models of different complexity (e.g., linear or logistic regression
models with different degree polynomials, or KNN classiﬁers with different values of K), how
should we pick the right one? A natural approach is to compute the misclassiﬁcation rate on
10. SAT stands for “Scholastic Aptitude Test”. This is a standardized test for college admissions used in the United States
(the data in this example is from (Johnson and Albert 1999, p87)).
1.4. Some basic concepts in machine learning
23
predicted label, K=1
predicted label, K=5
5
5
4
4
3
3
2
2
1
1
0
−1
−2
−3
0
−1
c1
c2
c3
−2
−1
0
1
2
3
−2
−3
c1
c2
c3
−2
(a)
−1
0
1
2
3
(b)
Figure 1.20 Prediction surface for KNN on the data in Figure 1.15(a). (a) K=1. (b) K=5. Figure generated by
knnClassifyDemo.
the training set for each method. This is deﬁned as follows:
err(f, D) =
N
1
I(f (xi ) = yi )
N i=1
(1.14)
where f (x) is our classiﬁer. In Figure 1.21(a), we plot this error rate vs K for a KNN classiﬁer
(dotted blue line). We see that increasing K increases our error rate on the training set, because
we are oversmoothing. As we said above, we can get minimal error on the training set by using
K = 1, since this model is just memorizing the data.
However, what we care about is generalization error, which is the expected value of the
misclassiﬁcation rate when averaged over future data (see Section 6.3 for details). This can be
approximated by computing the misclassiﬁcation rate on a large independent test set, not used
during model training. We plot the test error vs K in Figure 1.21(a) in solid red (upper curve).
Now we see a Ushaped curve: for complex models (small K), the method overﬁts, and for
simple models (big K), the method underﬁts. Therefore, an obvious way to pick K is to pick
the value with the minimum error on the test set (in this example, any value between 10 and
100 should be ﬁne).
Unfortunately, when training the model, we don’t have access to the test set (by assumption),
so we cannot use the test set to pick the model of the right complexity.11 However, we can create
a test set by partitioning the training set into two: the part used for training the model, and a
second part, called the validation set, used for selecting the model complexity. We then ﬁt all
the models on the training set, and evaluate their performance on the validation set, and pick
the best. Once we have picked the best, we can reﬁt it to all the available data. If we have a
separate test set, we can evaluate performance on this, in order to estimate the accuracy of our
method. (We discuss this in more detail in Section 6.5.3.)
Often we use about 80% of the data for the training set, and 20% for the validation set. But
if the number of training cases is small, this technique runs into problems, because the model
11. In academic settings, we usually do have access to the test set, but we should not use it for model ﬁtting or model
selection, otherwise we will get an unrealistically optimistic estimate of performance of our method. This is one of the
“golden rules” of machine learning research.
Chapter 1. Introduction
24
0.35
train
test
misclassification rate
0.3
0.25
0.2
0.15
0.1
0.05
0
0
20
40
60
K
80
100
120
(a)
(b)
Figure 1.21 (a) Misclassiﬁcation rate vs K in a Knearest neighbor classiﬁer. On the left, where K is
small, the model is complex and hence we overﬁt. On the right, where K is large, the model is simple
and we underﬁt. Dotted blue line: training set (size 200). Solid red line: test set (size 500). (b) Schematic
of 5fold cross validation. Figure generated by knnClassifyDemo.
won’t have enough data to train on, and we won’t have enough data to make a reliable estimate
of the future performance.
A simple but popular solution to this is to use cross validation (CV). The idea is simple: we
split the training data into K folds; then, for each fold k ∈ {1, . . . , K}, we train on all the
folds but the k’th, and test on the k’th, in a roundrobin fashion, as sketched in Figure 1.21(b).
We then compute the error averaged over all the folds, and use this as a proxy for the test error.
(Note that each point gets predicted only once, although it will be used for training K −1 times.)
It is common to use K = 5; this is called 5fold CV. If we set K = N , then we get a method
called leaveone out cross validation, or LOOCV, since in fold i, we train on all the data cases
except for i, and then test on i. Exercise 1.3 asks you to compute the 5fold CV estimate of the
test error vs K, and to compare it to the empirical test error in Figure 1.21(a).
Choosing K for a KNN classiﬁer is a special case of a more general problem known as model
selection, where we have to choose between models with different degrees of ﬂexibility. Crossvalidation is widely used for solving such problems, although we will discuss other approaches
later in the book.
1.4.9
No free lunch theorem
All models are wrong, but some models are useful. — George Box (Box and Draper 1987,
p424).12
Much of machine learning is concerned with devising different models, and different algorithms
to ﬁt them. We can use methods such as cross validation to empirically choose the best method
for our particular problem. However, there is no universally best model — this is sometimes
called the no free lunch theorem (Wolpert 1996). The reason for this is that a set of assumptions
that works well in one domain may work poorly in another.
12. George Box is a retired statistics professor at the University of Wisconsin.
1.4. Some basic concepts in machine learning
25
As a consequence of the no free lunch theorem, we need to develop many different types of
models, to cover the wide variety of data that occurs in the real world. And for each model,
there may be many different algorithms we can use to train the model, which make different
speedaccuracycomplexity tradeoffs. It is this combination of data, models and algorithms that
we will be studying in the subsequent chapters.
Exercises
Exercise 1.1 KNN classiﬁer on shuffled MNIST data
Run mnist1NNdemo and verify that the misclassiﬁcation rate (on the ﬁrst 1000 test cases) of MNIST of a
1NN classiﬁer is 3.8%. (If you run it all on all 10,000 test cases, the error rate is 3.09%.) Modify the code
so that you ﬁrst randomly permute the features (columns of the training and test design matrices), as in
shuffledDigitsDemo, and then apply the classiﬁer. Verify that the error rate is not changed.
Exercise 1.2 Approximate KNN classiﬁers
Use the Matlab/C++ code at http://people.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN to perform approximate nearest neighbor search, and combine it with mnist1NNdemo to classify the MNIST data
set. How much speedup do you get, and what is the drop (if any) in accuracy?
Exercise 1.3 CV for KNN
Use knnClassifyDemo to plot the CV estimate of the misclassiﬁcation rate on the test set. Compare this
to Figure 1.21(a). Discuss the similarities and differences to the test error rate.
2
2.1
Probability
Introduction
Probability theory is nothing but common sense reduced to calculation. — Pierre Laplace,
1812
In the previous chapter, we saw how probability can play a useful role in machine learning. In
this chapter, we discuss probability theory in more detail. We do not have to space to go into
great detail — for that, you are better off consulting some of the excellent textbooks available
on this topic, such as (Jaynes 2003; Bertsekas and Tsitsiklis 2008; Wasserman 2004). But we will
brieﬂy review many of the key ideas you will need in later chapters.
Before we start with the more technical material, let us pause and ask: what is probability?
We are all familiar with the phrase “the probability that a coin will land heads is 0.5”. But what
does this mean? There are actually at least two different interpretations of probability. One is
called the frequentist interpretation. In this view, probabilities represent long run frequencies
of events. For example, the above statement means that, if we ﬂip the coin many times, we
expect it to land heads about half the time.1
The other interpretation is called the Bayesian interpretation of probability. In this view,
probability is used to quantify our uncertainty about something; hence it is fundamentally
related to information rather than repeated trials (Jaynes 2003). In the Bayesian view, the above
statement means we believe the coin is equally likely to land heads or tails on the next toss.
One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about events that do not have long term frequencies. For example, we might want to
compute the probability that the polar ice cap will melt by 2020 CE. This event will happen zero
or one times, but cannot happen repeatedly. Nevertheless, we ought to be able to quantify our
uncertainty about this event; based on how probable we think this event is, we will (hopefully!)
take appropriate actions (see Section 5.7 for a discussion of optimal decision making under
uncertainty). To give some more machine learning oriented examples, we might have received
a speciﬁc email message, and want to compute the probability it is spam. Or we might have
observed a “blip” on our radar screen, and w
