1 - Regression

The goal of this homework is to revisit the main concepts of linear/polynomial and logistic regression.

This homework has two main tasks: a regression and a classification. However, your first goal will be to design each of them yourself by choosing an appropriate problem and dataset.

You have complete freedom to choose any dataset that you wish to work with, from a personal project to any random dataset on the internet, as long as it allows you to perform the tasks below (read them first!). We encourage you to choose a problem that you find interesting or a topic that you are curious about. We recommend Kaggle, although you can also download data from scikit-learn or simply do a Google search.

Submit a written report in .pdf format (no code) to the corresponding deliverable entry in the aula virtual. The report should contain a description of the whole procedure with some representative figures.

1 - Linear regression

We will start with a regression task. Choose an appropriate dataset that allows us to perform a linear regression between at least two variables.

1.1 - Become familiar with the data

Describe the dataset that you have chosen. Look at the different features in the samples and describe them, then try to identify a couple of potential linear (or polynomial) relationships between pairs of variables.

Tip

For example, if we look at the weather, we may find a nearly linear relationship between the apparent temperature and the ambient temperature, or the apparent temperature and the relative humidity.
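One possible way to screen for candidate relationships is to compute the pairwise Pearson correlation between features. The sketch below assumes a synthetic stand-in for the weather example: the variable names and all numbers are made up for illustration and would be replaced by the columns of your own dataset. Note that a correlation coefficient only captures linear dependence, so a scatter plot of each pair is still worth checking.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for a weather table (all values are made up).
temperature = rng.uniform(-5, 35, n)
humidity = np.clip(0.9 - 0.015 * temperature + rng.normal(0, 0.08, n), 0, 1)
apparent = 1.1 * temperature - 2.0 + rng.normal(0, 1.5, n)

features = {
    "temperature": temperature,
    "humidity": humidity,
    "apparent_temperature": apparent,
}
names = list(features)

# Rows of the stacked array are variables, so corr[i, j] is the
# Pearson correlation between feature i and feature j.
corr = np.corrcoef(np.vstack([features[name] for name in names]))

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: r = {corr[i, j]:+.2f}")
```

Pairs with a correlation close to +1 or -1 are natural candidates for the regression in the next step.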

1.2 - Perform the regression

Take any pair of related features and perform a regression. Plot the data and the resulting fit, and report an appropriate metric to evaluate the performance of the resulting model. Is the fit good?

Does the data follow the relationship that you were expecting? Would a polynomial regression of a different degree result in a better fit? Try it and explain why it does or why it doesn’t.
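As a rough illustration of how such a comparison could be set up, the sketch below fits polynomials of several degrees with NumPy and reports the coefficient of determination (R²) on held-out samples. The data is synthetic and made up (a noisy quadratic), the degrees tried are arbitrary, and `r_squared` is a small helper defined here rather than a library function. R² is only one of several reasonable metrics; the mean squared error would work as well.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with a quadratic trend (made up for illustration).
x = rng.uniform(-3, 3, 150)
y = 0.5 * x**2 + x + rng.normal(0, 0.5, x.size)

# Hold out a third of the samples to evaluate the fit on unseen data.
indices = rng.permutation(x.size)
train, test = indices[:100], indices[100:]


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


scores = {}
for degree in (1, 2, 5):
    # Least-squares polynomial fit on the training samples only.
    coeffs = np.polyfit(x[train], y[train], deg=degree)
    y_pred = np.polyval(coeffs, x[test])
    scores[degree] = r_squared(y[test], y_pred)
    print(f"degree {degree}: held-out R^2 = {scores[degree]:.3f}")
```

Evaluating on held-out samples matters here: on the training data, a higher degree can never fit worse, so only unseen data reveals whether the extra flexibility actually helps.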

1.3 - Deepening the analysis based on the results

After the first preliminary analysis, we can proceed based on the previous results. We propose two possible directions, but completing one is enough. If you find anything else interesting that you would like to explore instead, feel free to do so and report it.

1.3 (a) - Second relationship

If you identified more linear or polynomial relationships between your features, repeat 1.2 for these variables. Which of them results in a better model? Can you explain why?

Tip

In the weather example, we could predict the apparent temperature from the ambient temperature or from the relative humidity. Which one would result in a better regressor?

1.3 (b) - Improving our fit with more features

In 1.2, we only looked at the dependency between two variables. However, we can consider further features in our samples to obtain better predictions. Perform the same regression task as before considering additional features, e.g., with one and with two more features, and report the results. Explain these features and discuss why they do or do not improve the performance.

Tip

In the weather example, would we obtain a better regressor for the apparent temperature by combining the ambient temperature and the relative humidity?
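A minimal sketch of this comparison, assuming synthetic weather-like data in which the apparent temperature depends on both the ambient temperature and the relative humidity (the coefficients and noise level are invented for illustration). The helper `fit_and_score` is hypothetical: it solves an ordinary least-squares problem with an intercept column and returns the held-out R².

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Synthetic features and target (all values are made up).
temperature = rng.uniform(0, 35, n)
humidity = rng.uniform(0.2, 1.0, n)
apparent = 1.0 * temperature + 15.0 * humidity - 4.0 + rng.normal(0, 1.0, n)

X = np.column_stack([temperature, humidity])
indices = rng.permutation(n)
train, test = indices[:200], indices[200:]


def fit_and_score(X_train, y_train, X_test, y_test):
    """Ordinary least squares with an intercept; returns (weights, R^2)."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    weights, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    y_pred = A_test @ weights
    ss_res = np.sum((y_test - y_pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return weights, 1.0 - ss_res / ss_tot


# One feature (temperature only) versus two features (temperature + humidity).
_, r2_one = fit_and_score(X[train][:, [0]], apparent[train],
                          X[test][:, [0]], apparent[test])
weights_two, r2_two = fit_and_score(X[train], apparent[train],
                                    X[test], apparent[test])

print(f"temperature only:        R^2 = {r2_one:.3f}")
print(f"temperature + humidity:  R^2 = {r2_two:.3f}")
print("weights (intercept, temperature, humidity):", np.round(weights_two, 2))
```

In this constructed example the second feature carries information that the first one lacks, so the fit improves. With real data, an added feature that is strongly correlated with one already in the model may bring little or no gain, which is worth discussing in the report.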

2 - Logistic regression

We now consider a binary classification task. Choose an appropriate dataset that allows us to perform a classification task between at least two classes. It does not matter if the dataset has more classes; we can always take a subset of them.

Tip

For example, a dataset of flowers may contain many kinds of flowers, but we can choose to simply distinguish between tulips and orchids. However, we need to ensure that we have enough samples of each class to train our model.

2.1 - Become familiar with the data

Describe the dataset that you have chosen. Look at the different features in the samples and describe them, then try to identify a possible binary classification task based on a few relevant features.

2.2 - Train the classifier

Train a logistic regression classifier on one of the relevant features. Plot the results and report an appropriate metric to evaluate the method’s performance. Is the classifier good? Is this one feature enough to perform the task?

Tip

In the flowers example, we can distinguish between some species, such as daisies and poppies, just by looking at the flower radius. We can visualize this by drawing the one-dimensional radius line and placing markers, colored by class, where the examples lie.
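The single-feature case could be sketched as follows with scikit-learn. The flower radii are synthetic (the class means and spreads are invented), and accuracy is used only as an example metric; for imbalanced classes, precision, recall or the F1 score may be more informative. For a one-dimensional logistic regression, the decision boundary is the radius at which the predicted probability equals 0.5, i.e. where `coef * x + intercept = 0`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n_per_class = 150

# Synthetic flower radii for two classes (values are made up):
# class 0 plays the role of daisies, class 1 the role of poppies.
radius = np.concatenate([
    rng.normal(2.0, 0.4, n_per_class),
    rng.normal(4.0, 0.6, n_per_class),
])
label = np.concatenate([
    np.zeros(n_per_class, dtype=int),
    np.ones(n_per_class, dtype=int),
])

# scikit-learn expects a 2D feature matrix, even for a single feature.
X = radius.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, label, test_size=0.3, random_state=0, stratify=label
)

clf = LogisticRegression()
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
# Radius at which the model assigns probability 0.5 to each class.
boundary = -clf.intercept_[0] / clf.coef_[0, 0]

print(f"held-out accuracy: {accuracy:.3f}")
print(f"decision boundary at radius = {boundary:.2f}")
```

Drawing the boundary as a vertical mark on the one-dimensional radius line, together with the class-colored samples, gives a compact picture of which examples the classifier gets wrong.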

2.3 - Considering further features

So far, we have only used a single feature to perform the classification. Consider now a second one and retrain the logistic regression classifier with both of them. Do the results improve? Why? Visualize the results.
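One way to quantify the effect of the second feature is to train the same classifier twice on an identical train/test split, once per feature set. The sketch below assumes synthetic two-dimensional data in which the classes overlap along each individual feature but separate better when both are combined; the class means and sample sizes are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n_per_class = 2000

# Two synthetic classes in 2D (made up): same spread, shifted means.
class0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
class1 = rng.normal(loc=[1.5, 1.5], scale=1.0, size=(n_per_class, 2))
X = np.vstack([class0, class1])
y = np.concatenate([
    np.zeros(n_per_class, dtype=int),
    np.ones(n_per_class, dtype=int),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Same split, different feature subsets.
clf_one = LogisticRegression().fit(X_train[:, [0]], y_train)
clf_two = LogisticRegression().fit(X_train, y_train)

acc_one = accuracy_score(y_test, clf_one.predict(X_test[:, [0]]))
acc_two = accuracy_score(y_test, clf_two.predict(X_test))

print(f"one feature:  accuracy = {acc_one:.3f}")
print(f"two features: accuracy = {acc_two:.3f}")

# With two features, the decision boundary is the line
# w0 * x0 + w1 * x1 + b = 0, which can be drawn on a scatter plot.
w0, w1 = clf_two.coef_[0]
b = clf_two.intercept_[0]
print(f"boundary: {w0:.2f} * x0 + {w1:.2f} * x1 + {b:.2f} = 0")
```

A scatter plot of the two features, colored by class and overlaid with this line, is a natural visualization for the two-feature case.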

2.4 - Understanding the model

Finally, train the classifier with as many relevant features as you consider appropriate. Does the performance improve? Visualize the weights of the resulting model and discuss which features have the highest and lowest impact on the prediction.
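When comparing weights, it helps to standardize the features first; otherwise the magnitude of each weight also reflects the units of its feature. The sketch below assumes synthetic data with hypothetical feature names (`petal_length`, `stem_height`, `noise`), where the labels are generated so that the first feature matters most and the last one not at all. It ranks the learned weights by absolute value; a bar plot of the same values would be the natural figure for the report.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 2000

# Hypothetical feature names; all values are synthetic.
feature_names = ["petal_length", "stem_height", "noise"]
X = rng.normal(0.0, 1.0, size=(n, 3))

# Labels drawn from a logistic model with a strong, a weak
# and an irrelevant feature (true weights 3, 1 and 0).
logits = 3.0 * X[:, 0] + 1.0 * X[:, 1]
probabilities = 1.0 / (1.0 + np.exp(-logits))
y = (rng.random(n) < probabilities).astype(int)

# Standardize, then fit, so that weights are comparable across features.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

weights = model.named_steps["logisticregression"].coef_[0]

# Sort features from highest to lowest absolute weight.
ranked = sorted(zip(feature_names, weights), key=lambda item: -abs(item[1]))
for name, weight in ranked:
    print(f"{name:>14s}: {weight:+.2f}")
```

A weight near zero suggests that the feature contributes little given the others, although strongly correlated features can share or trade off weight, so the ranking should be interpreted with some care.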