A simple Machine Learning process example (supervised) – just to start with

This is the fifth article in the Machine Learning series. I have introduced you to Machine Learning, described the process and the tools. Last time I told you about my favourite web pages where you can find a lot of good data sources to start your learning adventure. Now it is time to start with an example.

I have chosen a simple example just to give you an impression of how all the parts that build a Machine Learning (and AI) process work together. No sweat. This will be easy (but only in this post).

I bet you know that when you start a new programming language journey, it always begins with a basic project like “Hello world”. The same is true for the first Machine Learning project I would like to do here. The difference is that “Hello world” in Python or R does not require running any Machine Learning procedures, so I have found another example. I think the classic “Hello world” of the Machine Learning and AI area is the Iris data set. You can download it from Kaggle or UCI or hundreds of other sites.

What should we do next? You know that already:

  • What would you like to achieve
  • Prepare the data
  • Choose an algorithm
  • Build and train the model
  • Test and evaluate the model

Important assumption – you have downloaded the tools I have described in the previous article.

What would you like to achieve

If you define a goal this way: “I would like to be able to classify irises correctly”, then it really means nothing, because you have not defined the word “correctly”. I would rephrase it in the following way: “I would like to be able to classify irises correctly 95 times out of 100”. Which means the model cannot make mistakes in more than 5% of cases.

Remember: you always need to define the goal in a way that can be measured or verified.
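Such a goal can even be checked directly in code once a model is evaluated. A minimal sketch (both numbers below are made up, just to show the idea):

goal_accuracy = 0.95    # "95 times out of 100"
model_accuracy = 0.98   # hypothetical result of some trained model
print(model_accuracy >= goal_accuracy)  # True means the goal is met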

Prepare the data

The iris data set is good as a starting one because of its properties:

  • there are three types of flowers (Iris Setosa, Iris Virginica and Iris Versicolor)
  • each flower is described by four attributes (Sepal Length, Sepal Width, Petal Length, Petal Width)
  • there is no need to do data preparation because:
    • all attributes are numeric
    • all attributes are in the same units and scale
    • there are no missing values
  • there are 150 observations, 50 for each type of iris

Wait – I will show you in the next article that you can still reduce the number of dimensions from four to just two by using PCA (Principal Component Analysis).

Let’s load the data along with the important libraries:

import pandas as pd

I have imported the pandas library to be able to create a so-called data frame and to load data into it.

Here you have two variables that are useful for loading the data and adding names to the columns.

# location of the raw iris data file on the UCI server
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# the raw file has no header row, so we provide the column names ourselves
columns = ['sepal length','sepal width','petal length','petal width','class']
iris_df = pd.read_csv(url, names=columns)

You see how easy it was. Pandas offers methods to load data in various formats, and CSV is one of them. In the same way you could load the data from your local disk. All you need to do is replace the URL with a local path, but do not forget to double the backslashes, so the path should look like this: C:\\Temp\\iris.data
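For example, a minimal sketch assuming you have saved the file locally (the path below is just an illustration):

# hypothetical local path – adjust it to wherever you saved the file
iris_df = pd.read_csv("C:\\Temp\\iris.data", names=columns)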

A data frame in this example is like a matrix that stores all the data. It has 150 rows and five columns. The first four columns are called features – this is the input to our model. The last column is the output, which means that based on the four features the model needs to predict the output.

By the way – you can load the iris dataset directly from the sklearn library. It is just part of it. Take a look at the code below. Note that load_iris() returns a so-called Bunch object, not a data frame, so the rest of this article sticks to the pandas version.

from sklearn.datasets import load_iris
iris = load_iris()  # a Bunch object holding the data, the targets and metadata
iris

I just wanted to show you two ways to get the data into your script.

Now let’s do some exploratory analysis of our dataset. Let’s run the head() method to display the top 5 rows of the data frame. You can control how many rows are displayed, like this: head(3) displays the top 3 rows.

iris_df.head()

You can also use the tail() method to display the last rows of the data frame.
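For example, this displays the last three rows:

iris_df.tail(3)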

Another way to look at the data set is to run the info() method. It gives you information about all columns, their data types, the number of observations (rows) and the memory allocation for the data frame.

iris_df.info()

The last method I would like to show is the describe() method. This one gives you a statistical look into your data. You will see information about the minimum, maximum and mean values and of course about the data distribution.

iris_df.describe()

Speaking of the data distribution – you can also use this code to see how many observations there are for every distinct class of flowers:

iris_df.groupby('class').size()

But wait, what if I would like to see my data not as numbers but as plots? Let’s import two useful libraries and explore the data visually:

import matplotlib.pyplot as plt
import seaborn as sns

# plot every pair of features against each other, colored by class
sns.pairplot(iris_df, hue="class")
plt.show()

Well, this is it! Now we see there are three types of irises, and we should even be able to classify them. Each color represents a different class of irises. You see that we have a 4 × 4 grid of plots, which means there are 4 dimensions. I have mentioned that I will show you a method to reduce the number of dimensions to only 2, so I will be able to visualize the entire data set on a 2-dimensional chart. What you see now are 2-dimensional projections of the 4-dimensional space we have to work with. Yes, that sounds ugly, but how could you visualize a 4-dimensional space? There is no way to do it!

Choose an algorithm

Let’s load the libraries that are needed to split the data into train and test sets, to import the neighbors module (I will use the k-nearest neighbors algorithm) and to import the classes for scoring the algorithm:

from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

I have indicated in the imports that I will use only one algorithm. This is not the usual practice and I have done it to keep this example simple. In real life you would try many algorithms along with so-called hyperparameter tuning to find the best model. We will get into that, so be patient!
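Just to plant the seed, here is a minimal sketch of what hyperparameter tuning could look like with sklearn’s GridSearchCV (the candidate values below are only an illustration):

from sklearn.model_selection import GridSearchCV

# illustrative candidate values for the number of neighbors
params = {"n_neighbors": [3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), params, cv=5)
# search.fit(X_train, y_train) would cross-validate every candidate
# and search.best_params_ would report the winner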

Build and train the model

Now we need to split the data frame into two objects. The first object (I will call it X) will contain only the input data – the features. It will still be a data frame. The second object (I will call it y) will contain the output column. This will be a vector, as we have one output column.

y = iris_df["class"]
X = iris_df.drop(["class"], axis=1)

Comment – the drop() method has created a data frame without the class column. The parameter axis=1 says that I am removing a column named “class” (axis=0 would refer to rows).
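A quick, purely illustrative sketch of the difference between the two axis values:

iris_df.drop(["class"], axis=1).head(2)  # axis=1 removes the "class" column
iris_df.drop([0, 1], axis=0).head(2)     # axis=0 removes the rows with index 0 and 1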

The most important part now is to create good training and testing data sets. What do I mean by “good”? Let’s discuss this using the iris data set. As you know, it has 150 observations for 3 different flower types. Each flower type has 50 observations in the data set. Generally speaking, we should split the data set into two parts – a training part which should have 70-80% of the data and a testing part which gets the rest.

Important! The 70%-30% split is just a general rule and I will show you examples where we do exactly the opposite in one of the further articles.

Now the question is how to take 70% of the data to build the training data set. I could do it this way – take roughly the first 100 rows for training and the last 50 rows for testing. Do you think this would work? Of course it would not! The file is ordered by class, so you would train the model on two types of irises and test it on the third one. You see, this would end up with 0% accuracy.
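To see it concretely, here is a sketch of that naive split – this is what NOT to do:

# naive split – do NOT do this: the data file is ordered by class
X_train_bad, y_train_bad = X.iloc[:100], y.iloc[:100]
X_test_bad, y_test_bad = X.iloc[100:], y.iloc[100:]
print(y_train_bad.unique())  # only two classes seen during training
print(y_test_bad.unique())   # the third class appears only in testing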

You have to create the test and training data sets properly. That means both sets need to maintain your data distribution. How to do this? Take a look at my code:

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7,
                                                    random_state=42)
# tip: adding stratify=y would guarantee identical class proportions
# in both parts; here we simply rely on the random shuffle

What has just happened? I now have 4 objects:

  • data frame – X_train (70% data from X)
  • data frame – X_test (30% data from X)
  • vector – y_train (70% data from y)
  • vector – y_test (30% data from y)

The train_size parameter controls the data split. The random_state parameter seeds the shuffling of the data so the split is reproducible, and it can be any number. For some reason 42 is often used, but you could use any number you like.
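If you want to verify that the shuffled split kept the class distribution more or less intact, you can simply count the classes in both parts:

print(y_train.value_counts())
print(y_test.value_counts())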

Let’s build the model now! Look how simple it is:

# k-nearest neighbors with the Minkowski metric and p=2 (Euclidean distance)
knn = KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')
knn.fit(X_train, y_train)

Yes, the model is done and trained! Two lines of code and that is it? Yes, it is! I have trained a model using the k-Nearest Neighbors algorithm, which performs classification of objects. We have specified some important parameters:

  • n_neighbors = 5
  • metric = 'minkowski'
  • p = 2

Do not worry if this seems odd to you now. The meaning is as follows: the output value (the type of the flower) will be calculated based on the five nearest input points (remember, I mean a point in 4-dimensional space, which is still a point). The metric and p parameters define how the “nearest” is calculated. This is because the distance between two points can be calculated in many ways – with metric='minkowski' and p=2 you get the good old Euclidean distance – and I will show you more of this later.
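If you are curious, here is a small sketch of the Minkowski distance itself (the two points below are made up just for the demonstration):

import numpy as np

def minkowski_distance(a, b, p):
    # the p-th root of the sum of |a_i - b_i| to the power p
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([5.1, 3.5, 1.4, 0.2])    # a made-up 4-dimensional point
b = np.array([6.2, 2.9, 4.3, 1.3])    # another made-up point
print(minkowski_distance(a, b, p=2))  # p=2 gives the Euclidean distance
print(np.linalg.norm(a - b))          # the same value, computed by numpy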

Test and evaluate the model

The model is done. Is it good? Is it bad? How do we find out? Let’s check the model accuracy on the training data:

knn_predict_train = knn.predict(X_train)
print("Accuracy: {0:.4f}".format(accuracy_score(y_train, knn_predict_train)))

The result is 0.9524, which is OK, as the goal is to have a classification accuracy of at least 95%.

Now, let’s check how the model behaves when we apply the test data. Remember, the model has not seen this data yet, so it is a true test for it!

knn_predict_test = knn.predict(X_test)
print("Accuracy: {0:.4f}".format(accuracy_score(y_test, knn_predict_test)))

I am surprised, because the accuracy is 1! This means the model has classified all test cases correctly! I have my doubts – do you? Let’s use some other statistical tools, like the confusion matrix:

print(confusion_matrix(y_test, knn_predict_test))
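The raw output is just a matrix of numbers. If you would like the rows (actual classes) and columns (predicted classes) to carry the class names, a small pandas sketch helps:

# wrap the confusion matrix in a data frame so that rows (actual)
# and columns (predicted) are labelled with the class names
labels = sorted(y_test.unique())
cm = pd.DataFrame(confusion_matrix(y_test, knn_predict_test, labels=labels),
                  index=labels, columns=labels)
print(cm)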

The results on the diagonal represent correct classifications. Any value off the diagonal represents a misclassification, which would be an error. Let’s see the last report for today – it is called the classification report. It has all the important information and we will be using it very often:

print(classification_report(y_test, knn_predict_test))

You see, the model works perfectly on this data. Please try the same checks on the training data to see where the errors were, because there the accuracy was not 100%.
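For example, using the training predictions we computed earlier:

print(confusion_matrix(y_train, knn_predict_train))
print(classification_report(y_train, knn_predict_train))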

Now the important question – “Can we do any better?” The answer is “why bother?” We have reached the goal, so the case is closed!

But of course life is not always (read: never) that easy and we will need to dig deeper into the algorithms and their parameters. I will explain that later.

Summary

You can find the code of this example here.

The idea behind this post was to give you an overview of the entire Machine Learning process. Now you have seen all the steps and the code behind them.

Next time I will (finally) show you how to start with data preparation. I mean how to clean data, remove duplicates, perform feature engineering (it could be tough), do the PCA analysis and so on. And this is not all, as we have a lot to do with the modeling as well!

Stay tuned!

Damian
