K-Nearest Neighbors (KNN) is a machine learning algorithm that can be used for both regression and classification tasks. It examines the labels of a chosen number of data points surrounding a target data point in order to predict which class the target falls into (or, for regression, which value it takes). KNN is conceptually simple yet often very effective, which is why it is one of the most popular machine learning algorithms. Let's take a deep dive into the KNN algorithm and see exactly how it works. A good understanding of how KNN works will help you recognize its best and worst use cases.
K-Nearest Neighbors (KNN) overview
Let's visualize a dataset on a 2D plane. Imagine a group of data points on a graph, distributed across the graph in small clusters, with each point carrying a known label. When a new point arrives, KNN looks at where it falls among these labeled points and, depending on the arguments provided to the model, assigns it a label based on the points around it. The main assumption that a KNN model makes is that data points (instances) that lie close to each other are very similar, whereas a data point that is far from a group is dissimilar to the points in that group.
A KNN model calculates similarity using the distance between two points on a graph. The greater the distance between the points, the less similar they are. There are several ways to calculate the distance between points, but the most common distance metric is just the Euclidean distance (the distance between two points in a straight line).
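As a small illustration of the Euclidean distance described above, here is a minimal sketch in plain Python. The function name `euclidean_distance` and the sample points are assumptions made for this example, not part of any particular library.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    # Square the difference along each axis, sum the squares, take the root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two points on a 2D plane forming a 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The smaller the returned value, the more similar KNN considers the two points to be.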
KNN is a supervised learning algorithm, which means that the examples in the dataset must be labeled (their classes must be known). There are two other important things to know about KNN. First, KNN is a non-parametric algorithm. This means that the model makes no assumptions about the underlying distribution of the data. Rather, the model is built entirely from the data provided. Second, KNN has no separate training phase. It does not build a generalized model from a training set. Instead, it stores the training data and uses all of it each time it is asked to make a prediction. You can still hold out a test set to evaluate how well those predictions work.
How a KNN algorithm works
A KNN algorithm goes through five main steps during its execution:
Set K to the chosen number of neighbors.
Calculate the distance between the supplied (test) example and each example in the dataset.
Sort the calculated distances.
Obtain the labels of the first K items.
Return a prediction for the test example.
In the first step, K is chosen by the user and tells the algorithm how many neighbors (how many surrounding data points) should be considered when judging which group the target example belongs to. In the second step, the model computes the distance between the target example and every example in the dataset. The distances are then added to a list and sorted. Next, the sorted list is checked and the labels of the first K items are returned. In other words, if K is set to 5, the model checks the labels of the 5 data points closest to the target data point. When making a prediction for the target data point, it matters whether the task is regression or classification. For a regression task, the mean of the first K labels is used, while for classification the mode of the first K labels is used.
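The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `knn_predict`, the `task` parameter, and the toy data are all assumptions made for this example, and Euclidean distance (via `math.dist`, Python 3.8+) is assumed as the metric.

```python
import math
from collections import Counter

def knn_predict(dataset, query, k, task="classification"):
    """Predict a label for `query` from `dataset`, a list of (features, label) pairs."""
    # Step 1: K is supplied by the caller; it must fit the dataset.
    if not 1 <= k <= len(dataset):
        raise ValueError("k must be between 1 and the dataset size")
    # Step 2: distance from the query to every example in the dataset.
    distances = [(math.dist(features, query), label) for features, label in dataset]
    # Step 3: sort by distance, closest first.
    distances.sort(key=lambda pair: pair[0])
    # Step 4: labels of the first K items.
    k_labels = [label for _, label in distances[:k]]
    # Step 5: mean for regression, mode for classification.
    if task == "regression":
        return sum(k_labels) / k
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical labeled points on a 2D plane.
points = [
    ((1.0, 1.0), "Red"), ((1.5, 2.0), "Red"), ((2.0, 1.0), "Red"),
    ((6.0, 6.0), "Blue"), ((7.0, 5.5), "Blue"), ((6.5, 7.0), "Blue"),
]
print(knn_predict(points, (2.0, 2.0), k=3))  # Red

# Hypothetical one-feature regression data.
values = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
print(knn_predict(values, (2.0,), k=3, task="regression"))  # 20.0
```

Note that every prediction scans the whole dataset, and that in this sketch a tied vote goes to the label that appears first among the nearest neighbors.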
The exact mathematical operations used to perform KNN differ depending on the distance metric chosen. If you'd like to learn more about how these metrics are calculated, you can read about some of the more common distance metrics, such as Euclidean, Manhattan, and Minkowski.
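The three metrics named above are closely related: the Minkowski distance of order p reduces to the Manhattan distance when p = 1 and to the Euclidean distance when p = 2. The sketch below illustrates this; the function name `minkowski_distance` and the sample points are assumptions made for this example.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p (p >= 1) between two equal-length points."""
    if p < 1:
        raise ValueError("p must be at least 1")
    # Sum the p-th powers of the per-axis differences, then take the p-th root.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0, 0), (3, 4)
print(minkowski_distance(a, b, p=1))  # Manhattan: 7.0
print(minkowski_distance(a, b, p=2))  # Euclidean: 5.0
```

Swapping the metric changes which points count as "nearest", so the same K can give different predictions under different metrics.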
Why the value of K is important
The main limitation when using KNN is that an improper value of K (the wrong number of neighbors to consider) could be chosen. In that case, the returned predictions can be substantially off. It is therefore very important to choose an appropriate value for K when using a KNN algorithm. You want a value of K that maximizes the model's ability to make predictions on unseen data by reducing the number of errors it makes.
Lower values of K mean that the predictions made by KNN are less stable and reliable. To see why, consider a case where we have 7 neighbors around a target data point. Suppose the KNN model is running with a K value of 2 (we ask it to look at the two closest neighbors to make a prediction). If the vast majority of the neighbors (five out of seven) belong to the Blue class, but the two closest neighbors both happen to be Red, the model predicts that the query example is Red. Despite the model's prediction, Blue would be a better guess in such a scenario.
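The seven-neighbor scenario above can be reproduced with a short sketch. The distances below are made up for illustration, and the helper name `vote` is an assumption for this example.

```python
from collections import Counter

# Hypothetical neighbors of one query point as (distance, label),
# already sorted by distance. The two closest are Red, the other five are Blue.
neighbors = [
    (0.9, "Red"), (1.0, "Red"),
    (1.1, "Blue"), (1.2, "Blue"), (1.3, "Blue"), (1.4, "Blue"), (1.5, "Blue"),
]

def vote(sorted_neighbors, k):
    """Return the most common label among the k closest neighbors."""
    labels = [label for _, label in sorted_neighbors[:k]]
    return Counter(labels).most_common(1)[0][0]

for k in (2, 3, 5, 7):
    print(k, vote(neighbors, k))
# 2 Red
# 3 Red
# 5 Blue
# 7 Blue
```

With K at 2 or 3 the two nearby Red points decide the outcome, while from K = 5 upward the Blue majority wins.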
If this is the case, why not just choose the highest possible K value? Because telling the model to consider too many neighbors also reduces accuracy. As the radius that the KNN model considers grows, it eventually starts including data points that belong to other groups rather than to the group around the target data point, and misclassification begins to occur. For example, even if the query point lay well inside a region of Red points, a K set too high would make the model reach into the other regions and count their points as well. In practice, different values of K are tried to see which one gives the model the best performance.
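One common way to try different values of K is to score each candidate on held-out examples. The sketch below uses leave-one-out evaluation on a small made-up dataset; the function names (`predict`, `leave_one_out_accuracy`, `best_k`) and the data are assumptions for this example, and Euclidean distance is assumed as the metric.

```python
import math
from collections import Counter

def predict(train, query, k):
    """Majority label among the k training examples closest to `query`."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(dataset, k):
    """Fraction of examples classified correctly when each is held out in turn."""
    hits = 0
    for i, (features, label) in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]  # everything except example i
        hits += predict(rest, features, k) == label
    return hits / len(dataset)

def best_k(dataset, candidates):
    """Return the candidate K with the highest accuracy, plus all scores."""
    scores = {k: leave_one_out_accuracy(dataset, k) for k in candidates}
    return max(scores, key=scores.get), scores

# Hypothetical data: a Red cluster, a Blue cluster, and one noisy Blue
# point sitting in the middle of the Red cluster.
data = [
    ((1, 1), "Red"), ((1, 2), "Red"), ((2, 1), "Red"), ((2, 2), "Red"),
    ((6, 6), "Blue"), ((6, 7), "Blue"), ((7, 6), "Blue"), ((7, 7), "Blue"),
    ((1.5, 1.5), "Blue"),
]
k, scores = best_k(data, [1, 3, 7])
print(k)  # 3
```

On this toy data K = 1 is misled by the single noisy point and K = 7 reaches into the other cluster, so the middle value scores best. Real datasets will not always show such a clean pattern.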
Pros and cons of KNN
Let's look at some of the pros and cons of the KNN model.
Pros:
KNN can be used for both regression and classification tasks, unlike some other supervised learning algorithms.
KNN can be quite accurate and is simple to use. It is easy to interpret, understand and implement.
KNN makes no assumptions about the data, which means it can be used for a wide variety of problems.
Cons:
KNN stores most or all of the data, which means that the model requires a lot of memory and is computationally expensive. Large datasets can also make predictions take a long time.
KNN is very sensitive to the scale of the features in the dataset and can be thrown off by irrelevant features quite easily compared to other models.
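The sensitivity to scale can be seen in a small sketch. The data, the column meanings, and the helper names `standardize` and `nearest_index` are assumptions made for this example. Standardization (zero mean, unit standard deviation per column) is one common remedy, not the only one.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation.

    Assumes no column is constant (a zero standard deviation would divide by zero).
    """
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]

def nearest_index(rows, query_index):
    """Index of the row closest to rows[query_index], excluding the row itself."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], rows[query_index]))

# Hypothetical (age in years, income in dollars) rows; row 0 is the query.
people = [(25, 50_000), (26, 58_000), (60, 51_000), (40, 90_000)]
print(nearest_index(people, 0))               # 2: income dominates the distance
print(nearest_index(standardize(people), 0))  # 1: both features now contribute
```

In raw units the 60-year-old with a similar income looks closest to the 25-year-old, simply because dollar differences dwarf age differences. After standardization the 26-year-old becomes the nearest neighbor.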
K-Nearest Neighbors (KNN) summary
K-Nearest Neighbors is one of the simplest machine learning algorithms. Despite its conceptual simplicity, KNN is also a capable algorithm that can provide fairly high accuracy on many problems. When using KNN, be sure to experiment with various values of K to find the one that gives the best accuracy.
House of Codes
Technical Advisor and Business Partner