Houjun Liu

AIBridgeLab D2Aft

Welcome to the Day-2 Afternoon Lab! We are super excited to work through tasks in linear regression and logistic regression, as well as to familiarize you with the Iris dataset.

Iris Dataset

Let’s load the Iris dataset! Begin by importing the load_iris tool from sklearn, a convenient loader for the Iris dataset.

from sklearn.datasets import load_iris

Then, we simply execute the following to load the data.

x,y = load_iris(return_X_y=True)

We use the return_X_y argument here so that, instead of one large object bundling everything together, we get the cleanly separated input and output arrays directly.
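As a quick sanity check, we can peek at the shapes of these arrays; the Iris dataset ships 150 samples with 4 features each.

x.shape, y.shape
((150, 4), (150,))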

Let’s inspect this data a little.

x[0]
array([5.1, 3.5, 1.4, 0.2])

We can see that each sample of the data is a vector in \(\mathbb{R}^4\). The components correspond to four attributes (confirmed in the sketch after this list):

  • sepal length
  • sepal width
  • petal length
  • petal width
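If you want to double-check these attribute names yourself, the plain loader (without return_X_y) returns a bundle that carries them as metadata; a small sketch:

from sklearn.datasets import load_iris

# the default return value bundles metadata alongside the data
iris = load_iris()
iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']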

What’s the output?

y[0]
0

We can actually see all the possible values of the output by putting it into a set.

set(y)
{0, 1, 2}

There are three different output classes, one per Iris species (the sketch after this list confirms the mapping):

  • Iris Setosa
  • Iris Versicolour
  • Iris Virginica
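Again, the loader’s metadata confirms which integer maps to which species; a small sketch:

from sklearn.datasets import load_iris

# target_names[i] is the species for integer label i
load_iris().target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')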

Excellent. So we have a dataset with four input features and one output label per sample. Let’s see what we can do with it.

Logistic Regression

The simplest thing we can do is a logistic regression. We have three output categories and a lot of input data. Let’s figure out if we can predict the output from the input!

Let’s import the logistic regression tool first, and instantiate it.

from sklearn.linear_model import LogisticRegression
reg = LogisticRegression()

We will “fit” the data to the model: adjusting the model to best represent the data. Our data has 150 samples, so let’s fit the model on 145 of them and hold out the last 5 for testing.

testing_samples_x = x[-5:]
testing_samples_y = y[-5:]
x = x[:-5]
y = y[:-5]
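A side note: because the Iris samples are sorted by class, our last 5 rows are all one species. sklearn also ships a helper that shuffles before splitting, which would give a more representative held-out set; a quick sketch (not the split we use below):

from sklearn.model_selection import train_test_split

# hold out 5 randomly chosen samples instead of the last 5
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=5, random_state=0)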

Wonderful. Let’s fit the data onto the model.

reg = reg.fit(x,y)

Let’s go ahead and run the model on our 5 testing samples!

predicted_y = reg.predict(testing_samples_x)
predicted_y
array([2, 2, 2, 2, 2])

And, let’s figure out what our actual results say:

testing_samples_y
array([2, 2, 2, 2, 2])

Woah! That’s excellent.
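Eyeballing two arrays works for 5 samples, but for larger test sets sklearn’s accuracy_score can do the comparison for us; a quick sketch:

from sklearn.metrics import accuracy_score

# fraction of test samples whose predicted class matches the true class
accuracy_score(testing_samples_y, predicted_y)
1.0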

Linear Regression

Instead of predicting the output class, we can predict one of the measurements themselves. How about if we used sepal length, sepal width, and petal length to predict petal width? The output is now a number, not a class, which calls for linear regression!

Let’s import the linear regression tool first, and instantiate it.

from sklearn.linear_model import LinearRegression
reg = LinearRegression()

We will “fit” the data to the model again. As we have already set aside the testing samples, we simply need to split out the fourth column for the new x and y:

new_x = x[:,:3]  # sepal length, sepal width, petal length
new_y = x[:,3]   # petal width

new_testing_samples_y = testing_samples_x[:,3]   # petal width
new_testing_samples_x = testing_samples_x[:,:3]  # the other three measurements

Now, taking our newly split data, let’s fit it to a linear model.

reg = reg.fit(new_x,new_y)
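Since the fitted model is just a linear function, we can also read off what it learned; a quick sketch (the exact numbers depend on the fit):

# one learned weight per input feature, plus an intercept
reg.coef_, reg.intercept_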

Let’s go ahead and run the model on our 5 testing samples!

new_predicted_y = reg.predict(new_testing_samples_x)
new_predicted_y
array([1.7500734 , 1.61927061, 1.79218767, 2.04824364, 1.86638164])

And, let’s figure out what our actual results say:

new_testing_samples_y
array([2.3, 1.9, 2. , 2.3, 1.8])

Close on some samples, not quite there on others. How well does our model actually do? We can use .score() to figure out the \(r^2\) value of our line on some data.

reg.score(new_x, new_y)
0.9405617534915884

Evidently, about \(94\%\) of the variation in petal width can be explained by a linear function of the other three measurements. The remaining variation tells us that the relationship between these measurements is not exactly linear!
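One caveat: we just scored the model on the same data it was fit on. A fairer check uses the held-out samples, though with only 5 points the number will be noisy; a quick sketch:

# r^2 on the five samples the model never saw
reg.score(new_testing_samples_x, new_testing_samples_y)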

Now you try

  • Download the wine quality dataset (a loading sketch follows this list)
  • Predict the quality of a wine given its chemical metrics
  • Predict whether a wine is red or white given its chemical metrics
  • Vary the amount of data used to .fit the model. How does that influence the results?
  • Vary the number of samples in each “class” (red wine, white wine) used to fit the model. How much does that influence the results?
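To get started on loading, here is one possible approach, assuming the UCI-hosted CSVs (the URLs and the semicolon separator are assumptions about that particular mirror; adjust to wherever you downloaded the files):

import pandas as pd

# assumed UCI mirror; both CSVs are semicolon-separated there
red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=";")

# for the quality task: chemical metrics in, quality out
x = red.drop(columns="quality").values
y = red["quality"].values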