Generative models

Introduction to unsupervised learning and generative models
Author

Borja Requena


1 Unsupervised learning

Unsupervised learning consists of capturing rich patterns in the data in a label-free approach. This is in contrast to the supervised learning scheme, in which we have a data set comprised of labeled samples \(\left\{\mathbf{x}, y\right\}\) and we try to approximate the function \(f(x)\approx y(x)\).

In unsupervised learning, even though we follow label-free approaches, what we would consider labels can sometimes be part of the data corpus.

We can split deep unsupervised learning into two main categories: generative and self-supervised learning, although the line between them is often blurred. In generative learning, we try to recreate the data distribution. This allows us to generate new data points that are likely to belong to the original data set and, often, to even know the probability of observing them. In self-supervised learning, we instead focus on finding different representations of the data. These are often useful to accomplish other tasks, compress the information, etc.

Indeed, in some cases, the resulting models can accomplish downstream tasks without having been trained to perform them explicitly. For example, the generative model GPT-3 Brown et al. (2020) is a language model, as we saw earlier in the course, that can perform question-answering tasks (among others) without any further task-specific training. Similarly, the self-supervised vision model DINO Caron et al. (2021) can extract segmentation masks from images (see Figure 1), such as the ones we saw in the computer vision example tasks.

Figure 1: Self-supervised segmentation masks from DINO

Unsupervised methods have gathered a lot of attention in scientific applications, as they can help us extract physically relevant information from experimental data Iten et al. (2020). Actually, in science, sometimes we do not even know what to look for in the data! For example, suppose that we want to characterize a complex quantum system. To do so, we need to consider all the possible phases the system can be in and devise appropriate order parameters to test whether they exist and to find the phase transitions. With self-supervised methods, we can find different data representation schemes for specific regions of the phase diagram. This way, we can explore the phase diagram autonomously to find where the phase transitions may be in our system Kottmann et al. (2020).

Most of the recent advances in the machine learning (ML) field have been mainly due to massive scaling, both in terms of the model size and the amount of data. This has relied heavily on the vast amount of unlabeled data that exists on the internet. Think about it: for every cat image in every appropriately labeled data set we can find, how many unlabeled cat images and videos are there on the internet? The current state-of-the-art practice in many ML applications consists of training an unsupervised model with huge amounts of unlabeled data and, then, leveraging its knowledge to accomplish the desired task. We saw this procedure when we adapted our language model trained on Wikipedia to write movie reviews, and then used it to classify them.

This process is akin to the way humans learn. Our brain processes a continuous stream of unlabeled data containing rich information about our environment. Furthermore, we never process the exact same information twice, as there are no two instances of our life that are exactly the same. This allows us to generalize extremely well and make the most out of the relatively scarce labeled data we have access to. For example, given a single stegosaurus image, we can immediately recognize this dinosaur species anywhere else, with any camera angle, any art style, and even with partial information (e.g., just a part of the dinosaur).

Thus, unsupervised learning is essential for the entire ML field, and it is especially promising in scientific applications.

2 Generative modeling

Here, we focus on generative learning. As we have briefly mentioned before, it consists of learning the data distribution in order to generate new samples. This is extremely powerful both on its own, since high-quality new samples can be very valuable, and in combination with other tools to tackle downstream tasks, as in the movie review example.

There are many data generation approaches that we can consider. The most straightforward one is to simply generate samples that are similar to the training ones, such as face images or digits. We can also perform conditioned synthesis, such as generating an audio signal from a text prompt, possibly conditioned on a specific speaker's voice (e.g., WaveNet). This includes all sorts of translation tasks, where we write text from a sample fragment, generate a new image from a reference one (see the emblematic horse-to-zebra example), or even create a video from a text fragment!

Note

This is a very broad field and here we just show a handful of representative examples.

2.1 Learning the data probability distribution

The task is to learn the underlying distribution \(p_{\text{data}}(x_i)\) given a data set \(\left\{x_i\right\}\). We can do this either by finding a model that approximates the probability distribution, \(p_\theta(x_i)\approx p_{\text{data}}(x_i)\), and then sampling from it, or by training a model that generates new samples, from which we can then estimate \(p_{\text{data}}\).

Note

Being able to compute the probability does not mean that we can sample easily, and vice versa. In general, there is a trade-off between ease of sampling and ease of evaluating the probability.

To illustrate the main concepts, we will consider a toy model with samples \(\left\{x_i\right\}\) drawn from a mixture of two Gaussian distributions: \(\mathcal{N}_0(\mu_0,\sigma_0)\) and \(\mathcal{N}_1(\mu_1,\sigma_1)\). A common way to define mixture models is through a multinoulli (categorical) distribution that assigns the probability of sampling from each of the possible modes. Since in this case we only have two modes, we can use a Bernoulli distribution instead, \(p(z)=\phi^z(1-\phi)^{1-z}\), where \(z\) indicates the chosen mode, so that \(p(z=1) = \phi\) and \(p(z=0) = 1-\phi\).

Exercise

Define a function to sample from the Gaussian mixture described above. It should take the desired number of samples as input: GaussianMixture.sample(self, n_samples).

Code
import numpy as np

class GaussianMixture:
    def __init__(self, phi, mu_0, std_0, mu_1, std_1):
        """Initialize a Gaussian mixture with two modes. `phi` denotes
        the probability to sample from distribution 1."""
        self.phi = phi
        self.mu_0, self.std_0 = mu_0, std_0
        self.mu_1, self.std_1 = mu_1, std_1

    def sample(self, n_samples):
        "Draw samples from a Gaussian mixture model."
        which = np.random.uniform(size=n_samples) < self.phi
        samples_0 = np.random.normal(self.mu_0, self.std_0, n_samples)
        samples_1 = np.random.normal(self.mu_1, self.std_1, n_samples)
        return np.where(which, samples_1, samples_0)

    def pdf(self, x):
        "Evaluate the Gaussian mixture pdf over x."
        pdf_0 = self.gaussian_pdf(x, self.mu_0, self.std_0)
        pdf_1 = self.gaussian_pdf(x, self.mu_1, self.std_1)
        return (1-self.phi)*pdf_0 + self.phi*pdf_1
    
    @staticmethod
    def gaussian_pdf(x, mu, std):
        return np.exp(-(x-mu)**2/(2*std**2))/(std*np.sqrt(2*np.pi))

2.1.1 Empirical distribution and histograms

We can see our data set \(\left\{x_i\right\}\) as a collection of samples drawn from the probability distribution \(p_{\text{data}}(x_i)\). The empirical distribution, or Dirac delta distribution, specifies the probability distribution \(\hat{p}_{\text{data}}(x_i)\) from which we sample as we draw examples \(x_i\) from the data set. This way, it maximizes the likelihood of our training samples by construction.

For continuous variables, we define the empirical distribution as \[\hat{p}_{\text{data}}(x)=\frac{1}{m}\sum_{i=1}^m \delta(x - x_i)\,,\] which assigns the same probability mass \(1/m\) to every data point in a collection of \(m\) samples.
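A convenient consequence, which we will use when writing the maximum likelihood objective below, is that expectations under the empirical distribution reduce to plain averages over the data set: \[\mathbb{E}_{x\sim\hat{p}_{\text{data}}}\left[f(x)\right]=\int f(x)\,\hat{p}_{\text{data}}(x)\,dx=\frac{1}{m}\sum_{i=1}^m f(x_i)\,.\]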

For discrete variables, however, we define the empirical probability to be the empirical frequency with which the value appears in the training set. This is what we typically visualize in normalized histograms. Indeed, histograms are one of the simplest generative models we can have!

Let’s see an example with our Gaussian mixture model. First of all, we need to create some data from which we wish to learn the underlying distribution.

phi, mu_0, std_0, mu_1, std_1 = 0.7, 5, 2, 20, 3
size_train, size_test = 500, 500

np.random.seed(0)
mixture = GaussianMixture(phi, mu_0, std_0, mu_1, std_1)
x_train = np.round(mixture.sample(size_train)).astype(int)
x_test = np.round(mixture.sample(size_test)).astype(int)
Note

Notice that we have rounded the outputs and converted them to integers. This is because the histograms represent the empirical distribution for discrete random variables.

Now we can build the histogram of the training data by computing the empirical frequency of each value.

values_train, counts_train = np.unique(x_train, return_counts=True)
probs_train = counts_train/counts_train.sum()
Code
import plotly.graph_objects as go

discrete_pdf = lambda x: np.trapz(mixture.pdf(x), x) # Integrate the pdf over the value range
hist_pdf = [discrete_pdf(np.linspace(val-0.5, val+0.5, 10)) for val in values_train]
fig = go.Figure()
fig.add_bar(x=values_train, y=probs_train, name="Histogram")
fig.add_trace(go.Scatter(name="pdf", x=values_train, y=hist_pdf, mode='markers+lines'))
fig.update_layout(xaxis_title='x', yaxis_title='probability', title='Training set')

We can use this histogram as a generative model to draw samples according to the empirical distribution of the training data.

def sample_histogram(n_samples, values, probs):
    """Draw samples from the probability distribution defined by the
    histogram assigning normalized `probs` to `values`."""
    cumprobs = probs.cumsum()
    samples = [values[cumprobs >= np.random.uniform()][0]
               for _ in range(n_samples)]
    return np.array(samples)
sample_histogram(10, values_train, probs_train)
array([20,  5, 21, 23,  7, 25, 21, 15, 20, 16])
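The loop above performs inverse transform sampling over the cumulative histogram. As a side note, the same distribution can be sampled in a vectorized way with NumPy's built-in categorical sampler; this is just an equivalent sketch, and the function name is ours rather than part of the lecture code.

def sample_histogram_choice(n_samples, values, probs):
    """Vectorized equivalent of `sample_histogram` based on `np.random.choice`."""
    # `np.random.choice` draws directly from the categorical distribution
    # defined by `values` and their normalized probabilities `probs`.
    return np.random.choice(values, size=n_samples, p=probs)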

We can even make a histogram of the samples drawn from the histogram!

Code
samples_hist = sample_histogram(2000, values_train, probs_train)
values_hist, counts_hist = np.unique(samples_hist, return_counts=True)
probs_hist = counts_hist/counts_hist.sum()
fig = go.Figure()
fig.add_bar(x=values_hist, y=probs_hist, name="Histogram")
fig.add_trace(go.Scatter(name="pdf", x=values_train, y=hist_pdf, mode='markers+lines'))
fig.update_layout(xaxis_title='x', yaxis_title='probability',
                  title='Histogram of the training histogram')

The main issue with this approach is that we maximize the likelihood of the training data at the expense of heavily overfitting it, which generally results in terrible generalization to the test set. As we can see below, the histograms for the training set and the test set show some strong differences despite coming from the same underlying distribution. Thus, it is desirable to train a smoother model that generalizes better to unseen data.

Code
values_test, counts_test = np.unique(x_test, return_counts=True)
probs_test = counts_test/counts_test.sum()

fig = go.Figure()
fig.add_bar(x=values_test, y=probs_test, name="Histogram")
fig.add_trace(go.Scatter(name="pdf", x=values_train, y=hist_pdf, mode='markers+lines'))
fig.update_layout(xaxis_title='x', yaxis_title='probability', title='Test set')

Even though this can be mitigated by increasing the amount of data, this solution becomes infeasible when we go to high-dimensional data and face the curse of dimensionality. For example, if we try to learn the probability distribution of the MNIST data set that we have used in previous lectures, even in its binarized form (black or white pixels), the data is \(28\times28\)-dimensional, meaning that there are \(2^{784}\sim10^{236}\) possible configurations. Thus, even if every atom in the observable universe were a training sample (\(\sim10^{80}\)), the resulting histogram would still be extremely sparse, assigning null probability almost everywhere.
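To get a feeling for these numbers, here is a throwaway sanity check (not part of the lecture code):

import math

# Number of binarized 28x28 images vs. atoms in the observable universe
print(f"2^784 ~ 10^{784 * math.log10(2):.0f}")  # 2^784 ~ 10^236
print(f"Shortfall with ~10^80 samples: ~10^{784 * math.log10(2) - 80:.0f} configurations per sample")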

2.1.2 Maximum likelihood estimation

As we have seen, it is desirable to find better solutions than the simple empirical distribution to model the underlying probability distribution of our data \(p_{\text{data}}\). We can derive a parametrized estimator \(p_{\mathbf{\theta}}\approx p_{\text{data}}\) directly from the data following the maximum likelihood principle, which minimizes the Kullback-Leibler (KL) divergence between the empirical distribution of the data and our model: \[\mathbf{\theta}^* = \text{arg}\,\text{min}_{\mathbf{\theta}} D_{KL}(\hat{p}_{\text{data}}||p_{\mathbf{\theta}}) = \text{arg}\,\text{min}_{\mathbf{\theta}} -\mathbb{E}_{x\sim\hat{p}_{\text{data}}}\left[\log p_{\mathbf{\theta}}(x)\right]\,.\] We can recognize here the negative log-likelihood loss function that we have previously seen in the course, which is the cross entropy between the empirical distribution and the one defined by the model, as we introduced in the logistic regression section.
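To make the last equality explicit, we can expand the definition of the KL divergence, \[D_{KL}(\hat{p}_{\text{data}}||p_{\mathbf{\theta}}) = \mathbb{E}_{x\sim\hat{p}_{\text{data}}}\left[\log \hat{p}_{\text{data}}(x)\right] - \mathbb{E}_{x\sim\hat{p}_{\text{data}}}\left[\log p_{\mathbf{\theta}}(x)\right]\,,\] where the first term does not depend on \(\mathbf{\theta}\). Hence, minimizing the KL divergence is equivalent to minimizing the negative log-likelihood.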

This is known as the maximum likelihood estimator (MLE), and it is the most statistically efficient estimator. This means that, as the number of samples grows, no other consistent estimator achieves a lower mean squared error (MSE) in the parameter estimate than the MLE. Furthermore, it is consistent, which guarantees that it converges to the true value as we increase the number of data points, under two conditions:

  • \(p_{\text{data}}\) lies within the hypothesis space of \(p_{\mathbf{\theta}}\).
  • \(p_{\text{data}}\) corresponds to a unique \(\mathbf{\theta}\).
Note

Intuitively, the MLE tries to maximize the probability of observing the samples in the training set. We would obtain the same estimator by taking \[\mathbf{\theta}^* = \text{arg}\,\text{max}_{\mathbf{\theta}} \prod_{i=1}^m p_{\mathbf{\theta}}(x_i) = \text{arg}\,\text{max}_{\mathbf{\theta}} \sum_{i=1}^m \log p_{\mathbf{\theta}}(x_i)\,.\] This principle is also applicable to conditional probability distributions with labeled data, \[\mathbf{\theta}^* = \text{arg}\,\text{max}_{\mathbf{\theta}} \sum_{i=1}^m \log p_{\mathbf{\theta}}(y_i|x_i)\,,\] from which we can derive the MSE loss for supervised learning tasks. Thus, minimizing the MSE provides the MLE!
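To sketch that last step (the standard derivation, assuming a Gaussian conditional model with fixed variance, which is not spelled out in the lecture): taking \(p_{\mathbf{\theta}}(y|x)=\mathcal{N}\left(y;\, f_{\mathbf{\theta}}(x),\, \sigma^2\right)\), we have \[\log p_{\mathbf{\theta}}(y_i|x_i) = -\frac{\left(y_i - f_{\mathbf{\theta}}(x_i)\right)^2}{2\sigma^2} - \log\left(\sigma\sqrt{2\pi}\right)\,,\] so maximizing the conditional log-likelihood is equivalent to minimizing the mean squared error \(\frac{1}{m}\sum_{i=1}^m \left(y_i - f_{\mathbf{\theta}}(x_i)\right)^2\).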

Let’s find the MLE for our toy example. We will cheat and already assume that our distribution follows a Gaussian mixture with two modes. First, we define the loss function for the training data. Since we deal with a fairly low amount of data, we can compute the loss for the whole training set at once.

def mle_train_loss(params):
    phi, mu_0, std_0, mu_1, std_1 = params
    pdf_0 = mixture.gaussian_pdf(x_train, mu_0, std_0)
    pdf_1 = mixture.gaussian_pdf(x_train, mu_1, std_1)
    log_likelihood = np.log((1-phi)*pdf_0 + phi*pdf_1)
    return -np.mean(log_likelihood)
Note

To be completely rigorous here, we should consider the discrete probability mass function instead of a probability density function.

Now we can simply use a scipy optimizer to find the minimum.

from scipy.optimize import minimize

initial_parameters = np.array([0.5, 5., 3., 20., 3.])
result = minimize(mle_train_loss, x0=initial_parameters, bounds=[(0, 1), (0, 25), (0, 5), (0, 25), (0, 5)])
result.x
array([ 0.72000346,  4.90814727,  1.91073172, 20.26624794,  2.84841709])
Code
print("The parameters are:")
print(f"\tGround truth: phi={mixture.phi:.2f}, mu_0={mixture.mu_0:.2f},"+
      f" std_0={mixture.std_0:.2f}, mu_1={mixture.mu_1:.2f}, std_1={mixture.std_1:.2f}")
print(f"\tEstimation:   phi={result.x[0]:.2f}, mu_0={result.x[1]:.2f},"+
      f" std_0={result.x[2]:.2f}, mu_1={result.x[3]:.2f}, std_1={result.x[4]:.2f}")
The parameters are:
    Ground truth: phi=0.70, mu_0=5.00, std_0=2.00, mu_1=20.00, std_1=3.00
    Estimation:   phi=0.72, mu_0=4.91, std_0=1.91, mu_1=20.27, std_1=2.85

Not bad! We have obtained a good estimate of the underlying parameters of our data distribution. The estimate of the second mode is a bit rougher than that of the first one; looking at the data distribution below helps to understand why.

We can compare the negative log likelihood loss for the MLE and the histogram in the train and test data sets.

Code
p_hist = dict(zip(values_train, probs_train))
nll_hist_train = -np.mean(np.log([p_hist.get(x, 1e-9) for x in x_train]))
nll_hist_test = -np.mean(np.log([p_hist.get(x, 1e-9) for x in x_test]))
# To evaluate the MLE over the discrete points we need to integrate around each value
def p_mle(x, params):
    phi, mu_0, std_0, mu_1, std_1 = params
    def _p(x):
        pdf_0 = mixture.gaussian_pdf(x, mu_0, std_0)
        pdf_1 = mixture.gaussian_pdf(x, mu_1, std_1)
        return (1-phi)*pdf_0 + phi*pdf_1
    xx = np.linspace(x-0.5, x+0.5, 20)
    return np.trapz(_p(xx), xx)
nll_mle_train = -np.mean(np.log([p_mle(x, result.x) for x in x_train]))
nll_mle_test = -np.mean(np.log([p_mle(x, result.x) for x in x_test]))
print("Negative log-likelihood loss")
print("      Train Test")
print(f"Hist: {nll_hist_train:.2f}  {nll_hist_test:.2f}")
print(f"MLE:  {nll_mle_train:.2f}  {nll_mle_test:.2f}")
Negative log-likelihood loss
      Train Test
Hist: 2.91  3.12
MLE:  2.95  2.97

We clearly see how the histogram outperforms the MLE on the training data, but it does not generalize well to the test data. In contrast, while the MLE performs slightly worse on the training set, it generalizes well to the test data, keeping a small gap between train and test losses. Hence, the MLE is the preferred choice.

Looking at the resulting probability distributions below, we clearly see how the histogram overfits the training data, with large spikes at \(x=5\) and \(x=21\).

Code
values_test, counts_test = np.unique(x_test, return_counts=True)
probs_test = counts_test/counts_test.sum()
probs_mle = [p_mle(x, result.x) for x in values_test]
probs_hist = [p_hist.get(x, 1e-9) for x in values_test]

fig = go.Figure()
fig.add_bar(x=values_test, y=probs_test, name="Histogram")
fig.add_trace(go.Scatter(name="pdf", x=values_train, y=hist_pdf, mode='markers+lines'))
fig.add_trace(go.Scatter(name="MLE", x=values_test, y=probs_mle, mode='markers+lines'))
fig.add_trace(go.Scatter(name="hist", x=values_test, y=probs_hist, mode='markers+lines'))
fig.update_layout(xaxis_title='x', yaxis_title='probability', title='Test set')
Note

Gaussian mixture models are universal approximators of probability distributions given enough modes, i.e., enough pairs \((\mu_i, \sigma_i)\). Hence, they are a common ansatz in these kinds of applications. We haven’t cheated that much :D
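As a side note, mixtures like this one are typically fitted with expectation-maximization rather than a generic optimizer. Below is a rough sketch using scikit-learn's GaussianMixture, assuming scikit-learn is installed and reusing the mixture object defined above (not part of the lecture code):

from sklearn.mixture import GaussianMixture as SklearnGM

# Fit a two-mode Gaussian mixture to fresh continuous samples via EM
samples = mixture.sample(500).reshape(-1, 1)
gm = SklearnGM(n_components=2, random_state=0).fit(samples)
print("weights:", gm.weights_)
print("means:  ", gm.means_.ravel())
print("stds:   ", np.sqrt(gm.covariances_).ravel())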

References

Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In Advances in Neural Information Processing Systems, edited by H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, 33:1877–1901. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Caron, Mathilde, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. “Emerging Properties in Self-Supervised Vision Transformers.” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 9650–60. https://doi.org/10.1109/ICCV48922.2021.00951.
Iten, Raban, Tony Metger, Henrik Wilming, Lı́dia Del Rio, and Renato Renner. 2020. “Discovering Physical Concepts with Neural Networks.” Phys. Rev. Lett. 124 (1): 010508. https://doi.org/10.1103/PhysRevLett.124.010508.
Kottmann, Korbinian, Patrick Huembeli, Maciej Lewenstein, and Antonio Acı́n. 2020. “Unsupervised Phase Discovery with Deep Anomaly Detection.” Phys. Rev. Lett. 125 (October): 170603. https://doi.org/10.1103/PhysRevLett.125.170603.