A probabilistic view on machine learning

Author

Alexandre Dauphin and Borja Requena


Note

This section has been adapted from Sections 2.3 and 2.4.1 of our book (Dawid et al. 2022).

1 Introduction

The need for a probabilistic approach to machine learning becomes apparent when we consider that the field has to tackle three sources of uncertainty, following the deep learning book (Goodfellow, Bengio, and Courville 2016):

  • Firstly, there may be an inherent stochasticity of the system generating the data we have access to.
  • Secondly, we need to account for a possible incomplete observability, i.e., an unavoidable lack of information regarding all possible variables influencing the system. In other words, we have only partial access (by means of the available data) to all relevant parts of the mechanism or distribution underlying the system.
  • Finally, the models we use are rarely complete and need to discard some available information. An example of incomplete modeling may be a robot whose movement space we discretize. Such a discretization immediately makes the robot uncertain about the “omitted” parts of the space.

To mathematically account for the uncertainty of a model, we can follow the so-called Bayesian approach to probability, which interprets probability as a quantification of a degree of belief.

In this section, we provide a concise reminder of basic concepts from probability theory that appear in the rest of these lectures:

  • Random variables are variables that take random values. If they are drawn independently from the same probability distribution, they are called independent and identically distributed (i.i.d.) random variables.
  • A probability distribution is a measure of how likely a random variable \(X\) is to take on each of its possible states \(x\) 1, e.g., \(p(X=x)\). A probability distribution over discrete (continuous) variables is called a probability mass function (probability density function). A joint probability distribution is a probability distribution over many variables at the same time and is denoted, e.g., as \(p(X=x, Y=y)\). When the notation is clear, we typically also drop the random variable and just write \(p(X=x) \equiv p(x)\), instead.
  • Two random variables \(X\) and \(Y\) are independent if their joint probability distribution can be expressed as a product of two factors, one involving only \(X\) and one involving only \(Y\):

\[ \forall x, y:\quad p(X=x, Y=y)=p(X=x) p(Y=y)\,. \]

You can denote this independence by \(X \perp Y\).

  • A vector whose elements are random variables is called a random vector, and we denote it simply by \(\mathbf{x}\).

  • A conditional probability is a probability of one event given that some other event has happened. We denote the conditional probability with \(p(Y=y \mid X=x)\), meaning the probability of \(Y=y\) given the observation that \(X=x\). It can be calculated as

\[ p(Y=y \mid X=x)=\frac{p(Y=y, X=x)}{p(X=x)}\,. \]

  • Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable each, which is called the chain rule or product rule of probability:

\[ p\left(x^{(1)}, \ldots, x^{(N)}\right)=p\left(x^{(1)}\right) \prod_{i=2}^{N} p\left(x^{(i)} \mid x^{(1)}, \ldots, x^{(i-1)}\right)\,. \]

  • Finally, let us discuss a situation where we know the conditional probability \(p(y\mid x)\) and need to know the opposite one, \(p(x\mid y)\). Fortunately, if we also know \(p(x)\), we can compute the desired quantity using Bayes’ rule:

\[ p(x \mid y)=\frac{p(y \mid x) p(x)}{p(y)}\,. \] Bayes’ rule is a direct consequence of the definition of conditional probability. If we do not know \(p(y)\), we can compute it via \(p(y)=\sum_{x} p(y \mid x) p(x)\), the sum rule of probabilities. Coming back to the notion of belief in Bayesian statistics, we can give the other terms of Bayes’ rule a clear interpretation. In this view, \(p(x)\) encodes our prior knowledge about a proposition \(x\), i.e., our belief before any evidence \(y\) has been collected. Hence, \(p(x)\) is called the prior. Bayes’ rule then gives us the recipe for updating our beliefs using the likelihood \(p(y \mid x)\) to arrive at the posterior \(p(x \mid y)\). The posterior models our updated knowledge about the proposition \(x\), taking the collected evidence \(y\) into account. A small numerical illustration of this update follows below.
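As a concrete example, the short sketch below applies the sum rule and Bayes’ rule to a binary proposition; the prior and likelihood values are made up purely for illustration.

import numpy as np

# Hypothetical prior belief over a binary proposition x (e.g., "signal" vs "no signal")
prior = np.array([0.2, 0.8])               # p(x)
# Hypothetical likelihood of the observed evidence y for each value of x
likelihood = np.array([0.9, 0.1])          # p(y | x)

# Sum rule: p(y) = sum_x p(y | x) p(x)
evidence = np.sum(likelihood * prior)
# Bayes' rule: p(x | y) = p(y | x) p(x) / p(y)
posterior = likelihood * prior / evidence
print(posterior, posterior.sum())          # the posterior is normalized to one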

We are now armed with enough tools to look at machine learning models in a probabilistic way. In particular, we can reformulate the definition of supervised and unsupervised learning. Unsupervised learning consists of observing some examples of a random variable \(X\), e.g., \(x_1, x_2, \ldots, x_N\), and then learning the probability distribution \(p(X)\) or some of its properties. Supervised learning is about observing some examples of a random vector \(X\) and an associated vector \(Y\), e.g., \(\{x_1,y_1\}, \{x_2,y_2\}, \ldots, \{x_N,y_N\}\), and learning to predict \(y\) from \(x\), usually by estimating \(p(Y=y \mid X=x)\). An ideal machine learning model perfectly learns the probability distribution that generates the data.2

2 Likelihood function

The likelihood function is the joint probability of the observed data as a function of the parameters of the chosen model, \(p(\mathcal{D} \mid \theta)\), estimating the data-generating probability distribution.3 Intuitively, the probability is a property of a sample coming from some distribution. The likelihood, on the other hand, is a property of a parametrized model. In particular, if you plot \(p(\mathcal{D} \mid \theta)\) as a function of possible \(\{\theta\}\), it does not have to integrate to one.

For the sake of concreteness, let us consider the Gaussian probability distribution

\[ p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

and let us draw some samples from this distribution for given \(\mu\) and \(\sigma\).

import numpy as np

# Draw 500 samples from a Gaussian with mean mu and standard deviation sigma
mu, sigma = 2, 3
x = np.random.normal(loc=mu, scale=sigma, size=500)

We can then inspect the distribution of the random variable with the help of a histogram, as shown in Figure 1.

Code
import plotly.express as px

fig = px.histogram(x=x)
fig.show()
Figure 1: Histogram of the random variable

We can now compute the likelihood of a given pair \(\mu\) and \(\sigma\) given our observations \(\{x\}\):

\[p(\{x\}\mid \mu, \sigma)=\prod_{i=1}^{N}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}.\]
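Since the product of many probability densities quickly underflows numerically, it is more convenient to work with the logarithm of the likelihood. The sketch below, reusing the samples x generated above, evaluates the log-likelihood for a few candidate parameter pairs; the candidate values are chosen arbitrarily for illustration.

def log_likelihood(x, mu, sigma):
    """Log-likelihood of i.i.d. samples x under a Gaussian with parameters mu and sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

# The log-likelihood is largest near the true parameters (mu=2, sigma=3)
for mu_c, sigma_c in [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]:
    print(f"mu={mu_c}, sigma={sigma_c}: log-likelihood = {log_likelihood(x, mu_c, sigma_c):.1f}")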

Maximum Likelihood Estimator

Given a dataset and a target probability distribution, one can try to find an estimator of the parameters of the distribution that best fit the observations. This can be done through the maximum likelihood estimation method, which consists in maximizing the likelihood with respect to the parameters,

\[ \partial_{\theta_i} p(\mathcal{D}\mid \theta) = 0 \quad \forall\, i\,.\]

In practice, it is often preferred to minimize the negative log-likelihood \(-\log (p(\mathcal{D}\mid \theta))\), which turns the product of probabilities into a sum.

In the case of our Gaussian distribution, we find the following expression for the negative log-likelihood:

\[\mathrm{NLL} = \sum_{i=1}^{N} \left[\frac{1}{2}\log\left(2\pi \sigma^2\right)+\frac{(x_i-\mu)^2}{2\sigma^2}\right]\]

By computing the derivatives of this expression with respect to \(\mu\) and \(\sigma\) and setting them equal to zero, one finds the estimators

\[ \begin{aligned} &\hat{\mu}=\frac{1}{N}\sum_{i=1}^{N}x_i\\ & \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i-\hat{\mu})^2. \end{aligned} \]

mu_estimator, sigma_estimator = np.mean(x), np.std(x)
print(f'Estimator of mu: {mu_estimator:.2f}\n Estimator of sigma: {sigma_estimator:.2f}')
Estimator of mu: 2.03
 Estimator of sigma: 2.91
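As a sanity check, we can also minimize the negative log-likelihood numerically and compare the result with the closed-form estimators. The sketch below assumes SciPy is available; the bound on \(\sigma\) simply keeps the optimizer in the valid parameter region.

from scipy.optimize import minimize

def nll(params, x):
    """Negative log-likelihood of i.i.d. Gaussian samples for parameters (mu, sigma)."""
    mu, sigma = params
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + (x - mu)**2 / (2 * sigma**2))

result = minimize(nll, x0=[0.0, 1.0], args=(x,), method="L-BFGS-B",
                  bounds=[(None, None), (1e-6, None)])
print(result.x)  # close to [2.03, 2.91] found above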

3 Kullback-Leibler divergence

Finally, sometimes it is useful to compare two probability distributions over the same random variable \(X\), e.g., \(p(x)\) and \(q(x)\). Such a measure is provided by a relative entropy, called the Kullback-Leibler divergence, \(D_{\mathrm{KL}}(p||q)\). To be precise, it is a measure of how the probability distribution \(q\) is different from a reference probability distribution \(p\). As we typically employ it in classification tasks where \(p\) and \(q\) are both discrete variable distributions, it is defined as

\[\begin{equation} D_{\mathrm{KL}}(p||q) = \left\langle\log\frac{p(x)}{q(x)}\right\rangle_p = \sum_{i=1}^{N} p(x_i) \log\frac{p (x_i)}{q (x_i)}\,, \end{equation}\] where the sum runs over the \(N\) possible states \(x_i\).

For continuous distributions, the sum has to be replaced by an integral. \(D_{\mathrm{KL}}(p||q)\) has some properties of a distance, i.e., it is non-negative and is zero if and only if \(p\) and \(q\) are equal. But it is not a proper distance measure as it is not symmetric, \(D_{\mathrm{KL}}(p||q) \ne D_{\mathrm{KL}}(q||p)\).4

Using the properties of the logarithm, \(D_{\mathrm{KL}}(p||q)\) can be expressed as \[\begin{align} D_{\mathrm{KL}}(p||q) &= \sum_{i=1}^{N} p(x_i) \log p (x_i) - \sum_{i=1}^{N} p(x_i) \log q (x_i) \\ &=-{\cal S}(p) + {L_\mathrm{CE}}(p,q), \end{align}\] where \({\cal S}(p)\) is the Shannon entropy of the reference probability distribution \(p\), and the second term is the cross-entropy, which we have already introduced in the logistic regression section! We rediscover it by noting that minimizing \(D_{\mathrm{KL}}(p||q)\), i.e., the difference of \(q\) with respect to the reference \(p\), is equivalent to minimizing the cross-entropy, because \(q\) does not appear in \({\cal S}(p)\).
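The relation between the KL divergence, the Shannon entropy, and the cross-entropy is easy to check numerically for discrete distributions. In the sketch below (reusing numpy imported above), the two distributions are made up for illustration.

p = np.array([0.7, 0.2, 0.1])   # reference distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

kl = np.sum(p * np.log(p / q))              # D_KL(p || q)
shannon = -np.sum(p * np.log(p))            # Shannon entropy S(p)
cross_entropy = -np.sum(p * np.log(q))      # cross-entropy L_CE(p, q)

print(kl, cross_entropy - shannon)          # both give the same value
print(np.sum(q * np.log(q / p)))            # D_KL(q || p) differs: the divergence is not symmetric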

Deriving the binary cross entropy

While the utility of comparing probability distributions is clear in the case of estimating a true probability distribution with a parametrized one, it may not be immediately obvious for arbitrary machine learning models. Let us discuss the case of supervised learning with a model \(f\).

Consider the labeled training data set consisting of \(N\) tuples \(\{x_i,y_i\}\), where \(x_i\) is a given sample with label \(y_i\), and each label belongs to one of two classes. We can think of each label \(y_i = k\) as a very specific probability distribution over the classes, \(q(x_i) = \delta_{k,j}\) with \(k,j \in \{1,2\}\), which is nothing but its one-hot encoding.

Next, the training data \(x_i\) are fed to the model \(f\), and as an output we obtain the probability distribution \(p(x_i) = f(x_i)\), which gives us the probabilities of a given sample \(x_i\) belonging to each class. In the last step, we have to compare the two probability distributions \(p\) and \(q\). Therefore, we rediscover the binary cross-entropy from the logistic regression section.
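The sketch below spells this out for a hypothetical batch of binary labels and model outputs; averaging the cross-entropy between the label distribution and the predicted distribution over the samples gives the familiar binary cross-entropy loss.

# Hypothetical binary labels and predicted probabilities for five samples
y_true = np.array([1, 0, 1, 1, 0])              # labels, i.e., delta distributions q
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.1])    # model outputs f(x_i) for class 1

# Binary cross-entropy: average cross-entropy between labels and predictions
bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(bce)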

4 Linear regression: a probabilistic viewpoint

We consider a labeled data set \(\mathcal{D} = \{ (\mathbf{x}_i,y_i)\}_{i=1}^N \equiv \{(X,\mathbf{y})\}\) of observations that are derived from an underlying function \(f\), possibly subject to some (stochastic) noise \(\epsilon\). The latter is often assumed to be sampled from an unknown noise distribution \(\mathcal{E}\), i.e., \(\epsilon \sim \mathcal{E}\): \[\begin{equation} y_i = f(\mathbf{x}_i) + \epsilon_i \quad \forall\ (\mathbf{x}_i,y_i) \in \mathcal{D}\,. \end{equation}\] The function \(f\) is generally unknown and our goal is to infer it. To this end, we build a regression fit of the data \(\hat{f}\) such that \(\hat{f}(\mathbf{x}) \approx y\).5

Arguably, the simplest parametrized fitting method one can produce is a linear model, where we seek to find parameters \(\theta\in R^d\) that linearly connect the input variable \(\mathbf{x}\) with the prediction \(\hat{y}\), i.e., \[\begin{equation} \hat{y} = \sum_{i=1}^{d - 1} \theta_i x_i + b \equiv \sum_{i=0}^{d - 1} \theta_i x_i = \mathbf{x}^T\theta\,. \end{equation}\] To shorten the notation, we have absorbed the constant \(\theta_0 = b\), the so-called bias, into the definition of the input \(\mathbf{x}\) by setting \(x_0 = 1\). So far, the linear model aims to find a hyperplane6 through the data points. We can extend the model by a nonlinear transformation \(\phi\) of the input, i.e., \(\mathbf{x}\mapsto\phi(\mathbf{x})\). This is still linear regression as we maintain linearity in the parameters \(\theta\) that we seek to optimize. As an example of a nonlinear transformation, the map \(\phi_p: x \mapsto (1,x,x^2,x^3,\dots,x^p)\) promotes our model to polynomial regression up to the \(p\)-th degree. To simplify the notation in the rest of the section, we consider the case where no feature map is applied. The inclusion of a feature map is a central element of Gaussian processes and is discussed in far greater detail in Chapter 4 of our book (Dawid et al. 2022).

Once a certain hyperplane is defined, by means of its parameters \(\theta\), we need to define a quality measure that compares our predictions to their corresponding ground-truth values. That is, we have to choose a suitable loss function \(L\). The most conventional choice for the loss is the mean square error over the data set \(\mathcal{D}\), \[\begin{equation} L_{\mathrm{MSE}}(\theta\mid\mathcal{D}) = \frac{1}{N} \sum_{i=1}^N \left( y_i - \mathbf{x}_i^T\theta \right)^2 = \frac{1}{N}\parallel \mathbf{y}-X\theta \parallel^2\,. \end{equation}\] To obtain the right-most expression, we stack the inputs \(\mathbf{x}_i^T\) as the rows of the matrix \(X\in R^{N\times d}\) and collect the targets \(y_i\) into the vector \(\mathbf{y}\). The last step allows us to find the set of parameters \(\theta\) that minimizes the mean square error. This yields the least-squares estimator (LSE) \[\begin{equation}\label{eq:intro_least_squares_estimator} \theta_\mathrm{LSE} = \left(X^T X\right)^{-1}X^T\mathbf{y} = X^+ \mathbf{y}\,, \end{equation}\] where \(X^+\) denotes the Moore-Penrose inverse of \(X\) (Rao 1972).
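As a minimal sketch with made-up synthetic data, the least-squares estimator can be computed directly via the Moore-Penrose inverse:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic one-dimensional data: y = 1.5 x + 0.5 + Gaussian noise
N = 50
x1 = rng.uniform(-1, 1, size=N)
y = 1.5 * x1 + 0.5 + rng.normal(scale=0.3, size=N)

# Design matrix whose rows are the inputs x_i^T, including the bias column x_0 = 1
X = np.stack([np.ones(N), x1], axis=1)

# Least-squares estimator theta_LSE = X^+ y
theta_lse = np.linalg.pinv(X) @ y
print(theta_lse)  # approximately [0.5, 1.5]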

The mean square error as the choice of our loss function appears to be self-evident. In fact, we can derive it by maximizing the likelihood of the labeled data given the model’s parameters, \(p(\mathbf{y} \mid X,\theta)\). To this end, we assume that our targets \(y\) are actually sampled from a Gaussian whose mean is given by our linear model, i.e., \(\mathbf{x}^T\theta\), with some variance \(\sigma^2\) which models the noise on the data. We can then write the likelihood of observing the targets \(\mathbf{y}\) given the locations \(X\) and model parameters \(\theta\) as \[\begin{align}\label{eq:intro_data_likelihood} p(\mathbf{y} \mid X,\theta) &= \mathcal{N}(\mathbf{y} \mid X\theta,\sigma^2\mathbb{1}) \\ &= \prod_{i=1}^N \mathcal{N}(y_i \mid \mathbf{x}_i^T\theta,\sigma^2)\,. \end{align}\] In the last step, we furthermore assumed a data set of i.i.d. random variables to factorize the multivariate Gaussian. A common assumption is to regard the observed data set as the most probable one of the underlying linear model. We therefore seek to maximize the likelihood in order to find the set of parameters \(\theta\) that have led to the most probable data. This is the idea of maximum likelihood estimation (MLE): its estimator is defined as the argument of the maximum of the likelihood. Since the logarithm is monotonic, we can equivalently maximize the log-likelihood and obtain \[\begin{align} \theta_\mathrm{MLE} &= \underset{\theta}{\text{arg max}}\ p(\mathbf{y} \mid X,\theta) \\ &= \underset{\theta}{\text{arg max}}\ \log p(\mathbf{y} \mid X,\theta) \\ &= \underset{\theta}{\text{arg max}} \left( -\frac{1}{2\sigma^2} \sum_{i=1}^N \left(y_i - \mathbf{x}_i^T\theta\right)^2 + \text{const.}\right)\\ &= \underset{\theta}{\text{arg min}}\ \left( \sum_{i=1}^N \left(y_i - \mathbf{x}_i^T\theta\right)^2 \right) \\ &\equiv \underset{\theta}{\text{arg min}}\ L_\mathrm{MSE}\,. \end{align}\] The constants appearing in the second-to-last line can be ignored as they are independent of \(\theta\). We hence see that the i.i.d. assumption, together with the concept of MLE, leads to the MSE as the preferred loss function, and we conclude that the maximum likelihood estimator \(\theta_\mathrm{MLE}\) coincides with the least-squares estimator \(\theta_\mathrm{LSE}\).

However, the estimator fully ignores the data noise modelled by \(\sigma^2\), as it dropped out in the maximization of \(p(\mathbf{y} \mid X,\theta)\). Thus, even if we chose the model correctly, the minimization of the MSE typically performs well on the provided data set \(\mathcal{D}\) but not on previously unencountered data points. The reason is overfitting, which we already introduced as a concept in the polynomial regression section. This phenomenon occurs because we incorporate the noise on the targets into our model parameters \(\theta_\mathrm{MLE}\). As a way out of this issue, we have introduced the notion of regularization. In our linear model, we can introduce regularization by means of Bayesian inference.

This means that, instead of maximizing merely the data likelihood, we encode any prior knowledge of the model into the prior distribution \(p(\theta)\). By virtue of Bayes’ theorem, we can calculate the posterior distribution7 \(p(\theta \mid X,\mathbf{y})\) over the parameters given the data set and maximize this quantity instead. This yields the maximum a posteriori (MAP) estimator defined as \[\begin{align} \theta_\mathrm{MAP} &= \underset{\theta}{\text{arg max}}\ p(\theta \mid X,\mathbf{y}) \\ &= \underset{\theta}{\text{arg max}}\ \frac{ p(\theta)p(\mathbf{y} \mid X,\theta)}{p(X,\mathbf{y})}\,, \end{align}\] where we have used Bayes’ theorem in the second step. The denominator does not depend on \(\theta\) and can therefore be ignored. As the prior, we now draw the parameter values from a Gaussian distribution centered around \(\mathbf{0}\) with some variance \(\tau^2\), i.e., \[\begin{equation}\label{eq:intro_prior_dist} p(\theta) = \mathcal{N}(\theta \mid \mathbf{0},\tau^2\mathbb{1})\,. \end{equation}\]

The product of two Gaussian distributions is itself Gaussian, which allows us to apply the same trick with the logarithm as before. We arrive at \[\begin{equation}\label{eq:intro_MAP_estimator} \theta_\mathrm{MAP} = \underset{\theta}{\text{arg min}}\ \left( \parallel \mathbf{y} - X\theta \parallel^2 + \frac{\sigma^2}{\tau^2} \parallel \theta \parallel^2 \right) = \left( X^T X + \frac{\sigma^2}{\tau^2}\mathbb{1}\right)^{-1} X^T \mathbf{y}\,, \end{equation}\] where the first term is, up to the factor \(1/N\), the MSE loss \(L_\mathrm{MSE}(\theta \mid \mathcal{D})\). We can picture the parameter \(\lambda = \sigma^2/\tau^2\) as a noise-to-signal ratio which effectively penalizes large magnitudes of the parameter values through the additional term in the loss function. Hence, \(\lambda\) is referred to as the regularization strength. This particular choice of penalty term is called Tikhonov regularization. The corresponding MAP estimator is also called the linear ridge regression estimator.

Let us compare the two estimators with and without the noise. The additional term \(\sigma^2/\tau^2\mathbb{1}\) in the estimator stems from the fact that we take both the data noise as well as a parameter constraint into account. Both are discarded in the limit of \(\tau\to\infty\)8 , where we have \(\theta_\mathrm{MAP}\to\theta_\mathrm{MLE}\).
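As an illustration, reusing the synthetic data X and y from the least-squares sketch above, the ridge estimator and the limit \(\tau\to\infty\) can be checked numerically; the value of the regularization strength is chosen by hand.

# Ridge (MAP) estimator with regularization strength lambda = sigma^2 / tau^2
lam = 0.5
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(theta_map)  # slightly shrunk towards zero compared to theta_lse

# In the limit tau -> infinity (lambda -> 0), the MAP estimator recovers the least-squares one
theta_limit = np.linalg.solve(X.T @ X + 1e-12 * np.eye(X.shape[1]), X.T @ y)
print(np.allclose(theta_limit, theta_lse))  # True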

Note

In order to consider the underlying noise in the training data, we have to constrain the linear model. Dealing with overfitting in such a way is usually referred to as regularization.

Finally, the choice of the prior is by no means unique. In fact, there is a plethora of regularization ideas and corresponding penalty terms (Goodfellow, Bengio, and Courville 2016). An easy variation could, for example, be to replace the \(l_2\)-norm with an \(l_1\)-norm. This is achieved by choosing a Laplace distribution for the parameters as the prior. The corresponding estimator is the result of least absolute shrinkage and selection operator (LASSO) regression (Tibshirani 1996). Because the \(l_1\)-norm, unlike the \(l_2\)-norm, still penalizes small parameter values significantly, it favors sparse solutions for the parameters \(\theta\). This can, for example, be desired to detect the significant features out of a pool of possible candidates in certain tasks (Zhou, Li, and Zare 2017).
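As a final, minimal sketch, the difference between the \(l_2\) and \(l_1\) priors can be seen by fitting synthetic data in which only one out of several features is relevant; the example below assumes scikit-learn is available, and the data and regularization strengths are made up for the illustration.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)

# Synthetic data in which only the first of five features carries signal
N, d = 100, 5
X_feat = rng.normal(size=(N, d))
y_feat = 2.0 * X_feat[:, 0] + rng.normal(scale=0.1, size=N)

ridge = Ridge(alpha=1.0).fit(X_feat, y_feat)   # l2 (Gaussian) prior
lasso = Lasso(alpha=0.1).fit(X_feat, y_feat)   # l1 (Laplace) prior

print(ridge.coef_.round(3))  # irrelevant coefficients are shrunk but typically remain nonzero
print(lasso.coef_.round(3))  # irrelevant coefficients are driven exactly to zero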

References

Dawid, Anna, Julian Arnold, Borja Requena, Alexander Gresch, Marcin Płodzień, Kaelan Donatella, Kim A. Nicoli, et al. 2022. “Modern Applications of Machine Learning in Quantum Sciences.” https://doi.org/10.48550/ARXIV.2204.04198.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. The MIT Press. https://doi.org/10.5555/3086952.
Rao, C. Radhakrishna. 1972. “Generalized Inverse of a Matrix and Its Applications.” In Theory of Statistics, Vol. 1, 601–20. University of California Press. https://doi.org/10.1525/9780520325883-032.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” J. R. Stat. Soc. Ser. B Methodol. 58 (1): 267–88. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.
Zhou, Zhenpeng, Xiaocheng Li, and Richard N. Zare. 2017. “Optimizing Chemical Reactions with Deep Reinforcement Learning.” ACS Cent. Sci 3 (12): 1337–44. https://doi.org/10.1021/acscentsci.7b00492.

Footnotes

  1. This notation is easily generalized to vector-valued random variables.↩︎

  2. If there is any kind of stochasticity in the underlying distribution, then even an ideal model can be wrong. Such an error is called the Bayes optimal error that poses a fundamental limit on statistical models.↩︎

  3. Do not confuse likelihood and probability!↩︎

  4. We recommend an illustrative discussion of the asymmetry of \(D_{\mathrm{KL}}(p||q)\) in Fig. 3.6 of Goodfellow, Bengio, and Courville (2016).↩︎

  5. For the sake of simplicity, we consider one-dimensional output. The following derivations, however, can easily be extended to multi-dimensional output as well.↩︎

  6. For one-dimensional input and output, the hyperplane simply is a line.↩︎

  7. Remember that we call it posterior because it is computed after the observation of the data set \(\mathcal{D}\).↩︎

  8. This corresponds to a uniform prior of the parameters \(\theta\).↩︎