Houjun Liu

AIBridgeLab D4Aft

Let’s run some clustering algorithms! We are still going to use the Iris data, because we are already super familiar with it. Loading it works exactly the same way as before; I will not repeat the notes, but just copy the code and description from before here for your reference.

Iris Dataset

Let’s load the Iris dataset! Begin by importing the load_iris tool from sklearn. This is an easy loader scheme for the iris dataset.

from sklearn.datasets import load_iris

Then, we simply execute the following to load the data.

x,y = load_iris(return_X_y=True)

We use the return_X_y argument here so that, instead of getting one big bundle with everything in it, we get the neatly cleaned input and output values directly.
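If you want a quick sanity check of what came back, x and y are plain NumPy arrays; printing their shapes (just an illustrative check, not a required step) looks like this.

# x holds the four measurements per flower; y holds one class label per flower
print(x.shape)  # (150, 4)
print(y.shape)  # (150,)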

k-means clustering

The basics of k-means clustering work exactly the same way as before, except this time we have to specify, and can inspect, a few more parameters. Let’s begin by importing KMeans and getting some clusters together!

from sklearn.cluster import KMeans

Let’s instantiate the KMeans model with 3 clusters, which is the number of classes in the Iris dataset.

kmeans = KMeans(n_clusters=3)
kmeans = kmeans.fit(x)
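One thing worth knowing: k-means starts from randomly chosen initial centers, so the cluster numbering (and occasionally a few assignments) can change between runs. If you want a reproducible result, one option is to pass a fixed random_state when instantiating; here is a small sketch, where the variable name and seed value are just illustrative.

# fixing the seed makes the clustering reproducible across runs
kmeans_seeded = KMeans(n_clusters=3, random_state=42)
kmeans_seeded = kmeans_seeded.fit(x)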

Great! Let’s take a look at how it sorted all of our samples.

kmeans.labels_
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2,
       2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0])
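If you'd rather see a quick tally than read the whole array, NumPy can count how many samples landed in each cluster; a small optional sketch:

import numpy as np

# count the number of samples assigned to each cluster label
labels, counts = np.unique(kmeans.labels_, return_counts=True)
print(labels, counts)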

Let’s plot our results.

import matplotlib.pyplot as plt

We then need to define some colours.

colors=["red", "green", "blue"]

Recall from yesterday that we realized that differences within the sepal (or petal) measurements are not as informative as differences between the sepal and petal measurements. So, we will plot the first and third columns (sepal length vs. petal length) against each other, and use labels_ for coloring.

# for each sample
for indx, element in enumerate(x):
    # add a scatter point: sepal length (column 0) vs. petal length (column 2),
    # colored by the cluster k-means assigned to this sample
    plt.scatter(element[0], element[2], color=colors[kmeans.labels_[indx]])
# save our figure
plt.savefig("scatter.png")

Nice. It looks like the main groups are captured!
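As an aside, matplotlib can also take whole columns at once, so the loop above can be collapsed into a single scatter call. A sketch of that variant (the output filename is just an example):

# plot sepal length vs. petal length, letting matplotlib map cluster labels to colors
plt.scatter(x[:, 0], x[:, 2], c=kmeans.labels_)
plt.savefig("scatter_vectorized.png")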

Let’s compare that to the intended classes.

y
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

There are obviously some clustering mistakes, and note that the cluster numbers themselves are arbitrary (k-means happened to call the first class “1” here). Woah, though! Without being prompted with any answers, our model was able to figure out much of the general cluster structure of our data. Nice.
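Because the cluster numbers are arbitrary, comparing labels_ to y by eye takes a little squinting. If you want a single agreement score that ignores which number each cluster happened to get, one option is scikit-learn's adjusted Rand index; a small optional sketch:

from sklearn.metrics import adjusted_rand_score

# 1.0 would be perfect agreement with the true classes; 0.0 is chance level
print(adjusted_rand_score(y, kmeans.labels_))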

We can also see the “average”/“center” for each of the clusters:

kmeans.cluster_centers_
array([[5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
       [5.006     , 3.428     , 1.462     , 0.246     ],
       [6.85      , 3.07368421, 5.74210526, 2.07105263]])

Nice! These are what our model thinks are the centers of each group.
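Once fitted, the model can also assign a brand-new measurement to its nearest center via predict. A small sketch, where the flower measurements are made up purely for illustration:

# a hypothetical flower: sepal length, sepal width, petal length, petal width (cm)
new_flower = [[5.8, 3.0, 4.3, 1.3]]
print(kmeans.predict(new_flower))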

Principal Component Analysis

Let’s try reducing the dimensionality of our data by one, so that we only have three dimensions. We do this by, again, beginning with importing PCA.

from sklearn.decomposition import PCA

When instantiating, we need to create a PCA instance with the keyword argument n_components, which is the number of dimensions (“component vectors”) we want to keep.

pca = PCA(n_components=3)

Great, let’s fit our data to this PCA.

pca.fit(x)
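If what we actually want is the reduced-dimension version of the data itself, calling transform after fitting hands it back directly; a quick sketch (the variable name is just illustrative):

# project all 150 samples down to the 3 retained components
x_reduced = pca.transform(x)
print(x_reduced.shape)  # (150, 3)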

Wonderful. components_ is how we can get the change-of-basis matrix that the PCA produced:

cob = pca.components_
cob
array([[ 0.36138659, -0.08452251,  0.85667061,  0.3582892 ],
       [ 0.65658877,  0.73016143, -0.17337266, -0.07548102],
       [-0.58202985,  0.59791083,  0.07623608,  0.54583143]])

So, we can then take the change-of-basis matrix and apply it to a sample!

cob@(x[0])
array([ 2.81823951,  5.64634982, -0.65976754])
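One caveat worth flagging: scikit-learn's own transform() subtracts the per-feature mean (stored in pca.mean_) before projecting, so its output differs from the raw product above by that offset. A sketch of the equivalent computation:

# centering the sample first reproduces what pca.transform gives back
print(cob @ (x[0] - pca.mean_))
print(pca.transform(x[:1]))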

What’s @? Python uses a separate operator for matrix operations (the “dot” product): @ performs matrix multiplication, while the ordinary arithmetic operators on NumPy arrays perform element-wise operations.
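A tiny illustration of the difference, using throwaway 2×2 arrays:

import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

print(a * b)  # element-wise: [[ 5, 12], [21, 32]]
print(a @ b)  # matrix product: [[19, 22], [43, 50]]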

We can also see how much variance is explained along each of the new axes: that is, the variance captured by each of the retained dimensions.

pca.explained_variance_
array([4.22824171, 0.24267075, 0.0782095 ])

Nice! As you can see, much of the variance is contained in our first dimension here.
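If you would rather see these as fractions of the total variance, PCA also exposes explained_variance_ratio_; a quick sketch:

# the proportion of total variance captured by each retained component
print(pca.explained_variance_ratio_)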