Logistic regression

Author

Alexandre Dauphin


1 Binary classification

In this section, we consider the classification task. In particular, we focus on binary classification, which means that a datapoint can either be in class \(0\) or in class \(1\).

Let us consider a concrete dataset, the toxicity dataset. In this dataset, scientists studied the toxicity of a substance for a population of insects. They administered a series of toxic doses and measured the number of deaths in a population of \(250\) individuals. The results of the study are shown in Figure 1.

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
n_ind = 250  # number of individuals per dose
data = dict(doses=range(1,7), deaths=[28,53,93,126,172,197])
df = pd.DataFrame.from_dict(data)
df['p_dying'] = df['deaths']/n_ind  # empirical probability of dying
Code
px.bar(data_frame=df, x='doses', y='deaths')
Figure 1: Number of deaths with respect to the number of doses.

Here, the goal of the classification is to predict, given a number of doses, whether an individual will be dead (class \(1\)) or alive (class \(0\)).

2 Logistic regression

We can define a probability of dying for each value of the dose by dividing the number of deaths by the total number of individuals (here \(250\)). We can therefore see the classification task as fitting this probability. A simple trial function is the sigmoid function

\[ \sigma(x) =\frac{1}{1+e^{-(ax+b)}},\] and the goal is to find the values of \(a\) and \(b\) that best fit the data.

Binomial distribution

When we divide the number of deaths by the number of individuals, we are effectively evaluating the estimator of the parameter \(p\) of the binomial distribution. As a reminder, the binomial distribution characterizes the probability of having \(k\) successes over \(n\) independent samples of a Bernoulli distribution with success probability \(p\).
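One way to see this: for \(k\) observed deaths out of \(n\) individuals, maximizing the binomial log-likelihood with respect to \(p\) gives

\[ \frac{d}{dp}\log\left[\binom{n}{k} p^k (1-p)^{n-k}\right] = \frac{k}{p}-\frac{n-k}{1-p}=0 \quad\Rightarrow\quad \hat{p}=\frac{k}{n}, \]

which is exactly the ratio stored in the column p_dying above.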

Notice how we transformed our classification problem into a regression problem. We will come back to this probabilistic view of machine learning in the next lecture.

def sigmoid(x, a, b):
    # Sigmoid trial function with parameters a (slope) and b (offset)
    return 1/(1 + np.exp(-(a*x + b)))

Figure 2 shows a trial fit for the values \(a=0.5\) and \(b=-2\). You can play with the values of \(a\) and \(b\) to find a better fit in the next figure (only in the notebook).

Code
a, b = 0.5, -2

vecd = np.arange(0,7,0.1)

Fig = go.Figure()


Fig.add_bar(x=df['doses'], y=df['p_dying'], name='toxicity dataset')

Fig.add_trace(go.Scatter(name=f'sigmoid: a={a}, b={b}',x=vecd,y=sigmoid(vecd,a,b)))
Fig.update_layout(xaxis_title='doses',yaxis_title='p')
Figure 2: Probability of dying with respect to the number of doses.
Exercise

Show that, for the logistic regression,

\[\log(\frac{p}{1-p})=ax+b\]
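Solution sketch: with \(p=\sigma(x)\), one has

\[ 1-p = \frac{e^{-(ax+b)}}{1+e^{-(ax+b)}}, \qquad \frac{p}{1-p} = e^{ax+b}, \]

and taking the logarithm yields \(\log\left(\frac{p}{1-p}\right)=ax+b\). This quantity is known as the log-odds (or logit).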

3 The dataset

Until now, we have considered the viewpoint of the whole population. In general, however, the dataset is built from the data of each individual, to which is associated a tuple \((x,y)\), where \(x\) is the number of doses and \(y\) is the class (\(1\): dead, \(0\): alive). To this end, we construct a dataset with \(n_\text{ind}\times n_\text{doses}\) individuals.

# Build one (dose, label) pair per individual: label 1 for the dead, 0 for the alive
for i, d in zip(df['doses'], df['deaths']):
    vec1 = np.zeros((n_ind, 2))
    vec1[:, 0], vec1[:d, 1] = i, 1
    vec = vec1 if i == 1 else np.concatenate((vec, vec1))

np.random.shuffle(vec)
x, y = vec[:, 0], vec[:, 1]
print(x[:20])
print(y[:20])
[5. 2. 5. 2. 2. 3. 1. 1. 1. 6. 6. 3. 3. 6. 6. 6. 2. 6. 3. 4.]
[1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1.]

4 Loss function: Binary Cross Entropy

We now want to automatically find the values of \(a\) and \(b\). To this end, we have to define a loss function. We will here use the binary cross entropy

\[ BCE = -\frac{1}{N}\sum_{i=1}^N \left\{ y_i \log [\sigma(x_i,a,b)] +(1-y_i)\log[1-\sigma(x_i,a,b)]\right\}.\]
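The function BCE(x, y, fun, params) used in the code below is provided in the course utilities and is not shown here. A minimal sketch consistent with that call signature (the actual implementation may differ) would be:

def BCE(x, y, fun, params):
    # Binary cross entropy of the model fun(x, a, b) on the dataset (x, y)
    p = fun(x, params['a'], params['b'])
    p = np.clip(p, 1e-12, 1 - 1e-12)  # clip to avoid log(0)
    return -np.mean(y*np.log(p) + (1 - y)*np.log(1 - p))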

Let us have a look at the loss landscape of the binary cross entropy with respect to the parameters \(a\) and \(b\), as depicted in Figure 3. We observe that the landscape is convex for this choice of loss function. In this particular case, convexity can actually be proven (a short argument is given in the next section).

Code to generate the data for the plot
vec_a = np.arange(-5,5,0.1)
vec_b = np.arange(-5,5,0.1)
matz = np.zeros((vec_a.size,vec_b.size))

for i, a1 in enumerate(vec_a):
    for j, b1 in enumerate(vec_b):
        p = dict(a=a1, b=b1)
        matz[i,j] = BCE(x, y, sigmoid, params=p)
Code
fig = go.Figure()

fig.add_contour(z=matz,x=vec_b, y=vec_a,hovertemplate=
                    'a:%{y:.2f}'
                    +'<br>b:%{x:.2f}</br>'
                    +'f:%{z:.2f}<extra></extra>')


d = dict(width=600,
         height=600,
         xaxis={'title':'b'},
         yaxis={'title':'a'}
       )

fig.update_layout(d)
fig.show()
Figure 3: Binary cross entropy loss landscape with respect to the parameters \(a\) and \(b\).
Exercise

What is the value of the loss for your guess of \(a\) and \(b\)? Is it close to the minimum?

5 Gradient of the Loss function

Let us compute the gradient with respect to \(a\) and \(b\).

Exercise

What is the derivative of the sigmoid \(\sigma(z)=\frac{1}{1+e^{-z}}\) with respect to its argument \(z\)?

\(\sigma'(z) = \sigma(z)\,(1-\sigma(z))\)

We find

\[\begin{aligned} \partial_a BCE &= \frac{1}{N}\sum_{i=1}^N (\sigma_i -y_i)x_i\\ \partial_b BCE &= \frac{1}{N}\sum_{i=1}^N (\sigma_i -y_i),\\ \end{aligned}\]

where we have defined \(\sigma_i\equiv \sigma(ax_i+b)\).
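These expressions also allow us to verify the convexity observed in Figure 3: writing \(z_i=(x_i,1)^T\), the Hessian of the BCE reads

\[ \nabla^2 BCE = \frac{1}{N}\sum_{i=1}^N \sigma_i(1-\sigma_i)\, z_i z_i^T, \]

which is positive semi-definite since \(\sigma_i(1-\sigma_i)\geq 0\); the BCE is therefore convex in \((a,b)\).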

Although the problem is convex, there is no easy way to derive a closed-form solution. This comes from the non-linearity introduced by the sigmoid. Libraries like scikit-learn therefore rely on iterative algorithms such as, e.g., conjugate gradient. In the next section, we will consider the stochastic gradient descent approach.

6 Stochastic gradient descent

Exercise

Implement the function grad_BCE(x, y, p), where p=dict(a=a, b=b) is a dictionary containing the parameters \(a\) and \(b\). The function should return an array of the form np.array([grad_a, grad_b]).
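A possible solution, directly transcribing the gradient formulas above (a minimal sketch; other implementations are of course possible):

def grad_BCE(x, y, p):
    # Gradient of the BCE with respect to a and b (cf. the formulas above)
    s = sigmoid(x, p['a'], p['b'])
    grad_a = np.mean((s - y)*x)
    grad_b = np.mean(s - y)
    return np.array([grad_a, grad_b])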

Having computed the gradients, we can now apply the stochastic gradient descent algorithm.
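The sgd helper used below is provided in the course utilities and is not shown here. A minimal sketch consistent with how the returned trackers dictionary is used afterwards (keys 'a', 'b' and 'loss', one entry per update), with the learning rate lr as an assumed parameter, could look as follows:

def sgd(x, y, pini, ll, niter=100, lr=0.01):
    # Stochastic gradient descent: one parameter update per (shuffled) sample
    p = dict(pini)
    trackers = dict(a=[], b=[], loss=[])
    for _ in range(niter):
        for i in np.random.permutation(x.size):
            g = ll['grads'](x[i:i+1], y[i:i+1], p)  # gradient on a single sample
            p['a'] -= lr*g[0]
            p['b'] -= lr*g[1]
            trackers['a'].append(p['a'])
            trackers['b'].append(p['b'])
            trackers['loss'].append(ll['loss'](x, y, ll['fun'], params=p))
    return trackers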

a0, b0 = 2, 1 

pini = dict(a=a0, b=b0)
ll = dict(loss=BCE, grads=grad_BCE, fun=sigmoid)

trackers = sgd(x, y, pini, ll, niter=int(1.5E2))

Figure 4 shows the trajectory of the stochastic gradient descent algorithm.

Code
amin, amax = np.min(trackers['a']), np.max(trackers['a'])
bmin, bmax = np.min(trackers['b']), np.max(trackers['b'])

n = 100
stepa, stepb =(amax-amin)/n, (bmax-bmin)/n

vec_a = np.arange(amin-19*stepa, amax+20*stepa, stepa)
vec_b = np.arange(bmin-19*stepb, bmax+20*stepb,stepb)
matz = np.zeros((vec_a.size,vec_b.size))

for i, a1 in enumerate(vec_a):
    for j, b1 in enumerate(vec_b):
        p = dict(a=a1, b=b1)
        matz[i,j] = BCE(x, y, sigmoid, params=p)
        
fig = go.Figure()
fig.add_contour(z=matz,x=vec_b, y=vec_a,
                hovertemplate=
                    'a:%{y:.2f}'
                    +'<br>b:%{x:.2f}</br>'
                    +'f:%{z:.2f}<extra></extra>')
mask = np.arange(0,len(trackers['a']),100)
veca, vecb, vecl = np.array(trackers['a'])[mask],np.array(trackers['b'])[mask], np.array(trackers['loss'])[mask]
fig.add_scatter(x=vecb, y=veca, name=f'SGD',text=vecl, mode='lines+markers',
                hovertemplate=
                'a:%{y:.2f}'
                +'<br>b:%{x:.2f}</br>'
                +'f:%{text:.2f}<extra></extra>')


fig.show()
Figure 4: Trajectory in the parameter space of the stochastic gradient descent algorithm.

7 Accuracy and other metrics

So far, we have not discussed the choice of the metric. The first metric that comes to mind is the accuracy, i.e.

\[ \text{acc}= \frac{\# \text{correct predictions}}{\#\text{dataset}},\]

which considers the ratio between the number of correct predictions and the number of elements in our training set.

Let us compute the accuracy during the training.

veca, vecb, vecl = np.array(trackers['a']), np.array(trackers['b']), np.array(trackers['loss'])
mask = np.arange(0,len(veca),100)
veca, vecb, vecl = veca [mask], vecb[mask], vecl[mask]
Exercise

Write the function accuracy(x,y,a,b), which returns the accuracy for a given dataset and for the parameters \(a\) and \(b\). Choose the label with the following rule: \(0\) if \(\sigma_i\leq 0.5\), else \(1\).

Code
def accuracy(x,y,a,b):
    yp = sigmoid(x, a, b)
    yp[yp>0.5] = 1
    yp[yp<=0.5] = 0
    return np.sum(yp==y)/y.size

We then compute the accuracy during the training.

vec_acc = np.zeros_like(veca)

for i,(a,b) in enumerate(zip(veca,vecb)):
    vec_acc[i] = accuracy(x,y,a,b)

Figure 5 shows the value of the loss function and the accuracy during the training. It is interesting to note that, while the loss function is smooth, the accuracy is not, and that the variations of the loss function do not directly correspond to the variations of the accuracy. It is therefore important to monitor not only the loss function but also the figure of merit of the problem, which can be, for example, the accuracy.

Code
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

# Add traces
fig.add_trace(
    go.Scatter(y=vecl, name="Loss"),
    secondary_y=False,
)

fig.add_trace(
    go.Scatter(y=vec_acc, name="Accuracy"),
    secondary_y=True,
)


# Set x-axis title
fig.update_xaxes(title_text="iterations")

# Set y-axes titles
fig.update_yaxes(title_text="Loss", secondary_y=False)
fig.update_yaxes(title_text="Accuracy", secondary_y=True)

fig.show()
Figure 5: Loss function vs Accuracy

Depending on the application, it might be advantageous to define different figures of merit. This can be done with the help of the confusion matrix:

|                         | Positive (Prediction) | Negative (Prediction) |
|-------------------------|-----------------------|-----------------------|
| Positive (Ground Truth) | True Positive         | False Negative        |
| Negative (Ground Truth) | False Positive        | True Negative         |

Let us now construct the confusion matrix for our model.

from sklearn.metrics import confusion_matrix

# Predictions with the final values of a and b from the training above
yp = sigmoid(x, a, b)
yp[yp>0.5] = 1
yp[yp<=0.5] = 0
cm = confusion_matrix(y,yp)

Figure 6 shows the confusion matrix for the logistic regression.

Code
X, Y = ["Alive(P)", "Dead(P)"], ["Alive(GT)", "Dead(GT)"]
fig = px.imshow(cm, x=X, y=Y, text_auto=True,color_continuous_scale='Blues')


fig.show()
Figure 6: Confusion matrix for the logistic regression

We readily observe that the accuracy corresponds to the trace of this matrix normalized by the number of individuals.

print(f'{np.trace(cm)/np.sum(cm):.3f}')
0.714

There are other metrics that one can use to assess the predictions:

| Metric    | Description                                                          | Formula |
|-----------|----------------------------------------------------------------------|---------|
| Precision | Correct positive predictions over the total of positive predictions  | \(\frac{TP}{TP+FP}\) |
| Recall    | Correct positive predictions over the total of actual positives      | \(\frac{TP}{TP+FN}\) |
| F1 score  | Harmonic mean of precision and recall                                | \(\frac{2\,\text{Recall}\,\text{Precision}}{\text{Recall}+\text{Precision}}\) |
Exercise

Compute the different metrics for the trained logistic regression model.

precision = cm[0,0]/(cm[0,0]+cm[1,0])

recall = cm[0,0]/(cm[0,0]+cm[0,1])

F1_score = 2*recall*precision/(precision+recall)

print(f'Precision: {precision:.3f}')

print(f'Recall: {recall:.3f}')

print(f'F1 score: {F1_score:.3f}')

Let us perform the same analysis on another dataset, the breast cancer dataset of scikit-learn.

from sklearn.datasets import load_breast_cancer

x1, y1 = load_breast_cancer(return_X_y=True)
x1 = x1[:,3]  # keep a single feature (the mean area) for a 1D logistic regression

print(x1[:10])
print(y1[:10])
[1001.  1326.  1203.   386.1 1297.   477.1 1040.   577.9  519.8  475.9]
[0 0 0 0 0 0 0 0 0 0]
x1 = x1.reshape([np.size(x1),1])  # scikit-learn expects a 2D feature array

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(x1,y1)
yp1 = clf.predict(x1)

cm = confusion_matrix(y1,yp1)
Code
X1, Y1 = ["Malignant(P)", "Benign(P)"], ["Malignant(GT)", "Benign(GT)"]
fig = px.imshow(cm, x=X1, y=Y1, text_auto=True,color_continuous_scale='Blues')


fig.show()
Figure 7: Confusion matrix for the logistic regression on the breast cancer dataset
Exercise

Compute the different metrics for the trained logistic regression model.