Introduction
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for both classification and regression. It assigns a label to a new data point based on how close, or how similar, that point is to the points in the training data.
Here, ‘k’ represents the number of neighbors that are considered when classifying a new data point. KNN is called a lazy learning algorithm because it does no real work at training time: it simply stores the training data, and the computation is deferred until a new point has to be classified.
In other words, it doesn’t learn a model from the training data ahead of time; it stores the data and, when a new point is introduced, classifies it by comparing it against the stored examples.
KNN algorithm
- Choose a suitable value of k (the number of neighbors)
- Calculate the distance between the new point and every point in the training data, then select the k closest points
- Count how many of those k neighbors fall in each category
- Assign the new point to the category with the most neighbors
- The model is ready (a short plain-Python sketch of these steps follows below)
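A minimal, from-scratch sketch of the steps above, assuming made-up toy points and labels (scikit-learn's optimized implementation is used later in this article):

import math
from collections import Counter

def knn_classify(training_data, new_point, k):
    # training_data: list of ((x, y), label) pairs
    # steps 1-2: distance from the new point to every training point
    distances = [(math.dist(point, new_point), label)
                 for point, label in training_data]
    # step 2: keep the k closest points
    k_nearest = sorted(distances)[:k]
    # steps 3-4: count the labels among the neighbors and take the majority
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

# hypothetical toy data, for illustration only
training_data = [((1, 2), "red"), ((2, 1), "red"), ((3, 3), "red"),
                 ((7, 8), "green"), ((8, 7), "green")]
print(knn_classify(training_data, (2, 2), k=3))  # -> red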
How does KNN work?
- Suppose that we have two categories in the input dataset: one colored red and the other green. We want to classify a new data point (colored white) using KNN.
- The next step is to choose the number of neighbors, i.e. the value of k. Let’s take k = 5.
- Now, the third step is to calculate the distance between the new point and the other points. Here are two of the most common distance metrics:
- Euclidean distance is the straight-line distance between two points. For points (x1, y1) and (x2, y2), it is d = √((x2 − x1)² + (y2 − y1)²).
- Manhattan distance is the distance between two points measured along axes at right angles. For points (x1, y1) and (x2, y2), it is d = |x2 − x1| + |y2 − y1|. (A short code sketch of both distance functions appears after this walkthrough.)
- After calculating the distance between the new point and the other points, we have the nearest neighbors, i.e. the 5 points closest to the new point.
- The next step is to count the number of neighbors in each category. In our example, three (3) of the 5 nearest points are in the red category and two (2) are in the green category.
So, the new data point belongs to the red category.
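To make the distance step concrete, here is a small sketch of both distance functions; the points (1, 2) and (4, 6) are made-up values chosen for illustration:

import math

def euclidean(p, q):
    # straight-line distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

def manhattan(p, q):
    # distance along the axes: |x2 - x1| + |y2 - y1|
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

p, q = (1, 2), (4, 6)
print(euclidean(p, q))  # 5.0 (sqrt(3**2 + 4**2))
print(manhattan(p, q))  # 7 (3 + 4)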
Python code for implementation of KNN
For demonstration, we’ll use a Jupyter Notebook and the Iris flower classification dataset to implement KNN.
# importing required modules
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# read the dataframe
df = pd.read_csv("IRIS.csv")

# target and input variable selection
X, y = df.iloc[:, :-1].values, df.iloc[:, -1].values

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# standardizing the data
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# creating the model with an arbitrary starting value of k
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
model = knn.fit(X_train_scaled, y_train)

# making predictions
pred = model.predict(X_test_scaled)

# checking model accuracy
print(classification_report(y_test, pred))
Output
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         9
Iris-versicolor       0.90      0.82      0.86        11
 Iris-virginica       0.82      0.90      0.86        10

       accuracy                           0.90        30
      macro avg       0.91      0.91      0.90        30
   weighted avg       0.90      0.90      0.90        30
The KNN model performance is pretty good, with an accuracy of 90%. (Note that train_test_split was called without a fixed random_state, so the exact split, and therefore the exact scores, will vary from run to run.)
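As an aside, the accuracy_score function imported earlier returns this single accuracy figure directly:

print(accuracy_score(y_test, pred))  # e.g. 0.90, matching the report above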
How to select the best value for K
To select the best value of k, we can compute the error produced by different values of k and see at which particular value the error is at its minimum.
import matplotlib.pyplot as plt

# compute the test error for each candidate value of k
error = []
for k in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    model = knn.fit(X_train_scaled, y_train)
    pred = model.predict(X_test_scaled)
    error.append(np.mean(pred != y_test))

# plotting the graph of error vs value of k
plt.plot(range(1, 50), error)
plt.xlabel("K")
plt.ylabel("Error")
plt.title("Best estimation of k")
plt.show()
Output

The code above produces a line plot of error against k (titled “Best estimation of k”). As we can see from the plot, the error is at its minimum when k = 25, so the best value of k is 25. Using k = 25, we can rebuild the model:
# rebuilding the model with the best value of k
knn = KNeighborsClassifier(n_neighbors=25)
model = knn.fit(X_train_scaled, y_train)

# making predictions
pred = model.predict(X_test_scaled)

# checking model accuracy
print(classification_report(y_test, pred))
Output
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        12
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00         9

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30
After substituting the best value of k (25), the model accuracy has increased to 100%. So, KNN is useful for classification problems in machine learning.
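With the final model trained, we can classify a brand-new measurement. A minimal sketch, reusing the sc scaler and model fitted above; the flower values here are made up, and the assumed feature order (sepal length, sepal width, petal length, petal width) matches the usual Iris CSV layout:

# hypothetical measurements for one unseen flower
new_flower = [[5.1, 3.5, 1.4, 0.2]]
new_flower_scaled = sc.transform(new_flower)  # reuse the fitted scaler
print(model.predict(new_flower_scaled))       # e.g. ['Iris-setosa']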
Conclusion
The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning algorithm, applicable to both classification and regression. It assigns a label to new data by analyzing how similar the new data is to each existing category.
To determine the best k for KNN, it is important to calculate the error associated with different values of k and choose the value with the lowest error. The data must also be standardized before training the model, because KNN is a distance-based algorithm: features on larger scales would otherwise dominate the distance calculation.
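Note that the error curve above was computed on a single train/test split, so it can be noisy. A more robust way to pick k (a sketch using scikit-learn's cross_val_score, not part of the original walkthrough) is k-fold cross-validation:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# mean 5-fold cross-validation accuracy for each candidate k
scores = {}
for k in range(1, 50):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X_train_scaled, y_train, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", scores[best_k])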
Happy Learning 🙂