What is K-Nearest Neighbors (KNN)?

The K-Nearest Neighbors (KNN) algorithm is a supervised learning algorithm used for classification and regression tasks in machine learning. The idea behind KNN is that similar data points lie close to each other in the feature space. The algorithm finds the K closest data points (neighbors) to a new data point according to a distance metric, and then predicts the label or value of the new data point from the labels or values of those nearest neighbors. The distance metric can be Euclidean, Manhattan, or another distance measure.

How does it work?

Suppose we have a dataset with N data points, where each data point is represented by a d-dimensional feature vector x and belongs to one of C classes. We also have a new data point x0 whose class label we want to predict.

The steps of the KNN algorithm are:


Choose a distance metric: The distance metric is used to measure the distance between two data points. The most common distance metric used in KNN is the Euclidean distance, which is defined as follows:

d(x, y) = sqrt(sum((xi - yi)^2)), i=1 to d

where x and y are two d-dimensional feature vectors, xi and yi are their i-th components, and sqrt denotes the square root.
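
As a minimal illustration, this distance can be computed in Python with NumPy as follows. The function name and example values are illustrative only, not code from the project; x and y are assumed to be 1-D arrays of length d.

import numpy as np

def euclidean_distance(x, y):
    # Euclidean distance between two d-dimensional feature vectors
    return np.sqrt(np.sum((x - y) ** 2))

# Example: points (1, 2, 3) and (4, 6, 3) are 5.0 apart
print(euclidean_distance(np.array([1.0, 2.0, 3.0]), np.array([4.0, 6.0, 3.0])))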


Compute the distances between x0 and all N data points: For each data point xi in the dataset, we compute its distance to x0 using the chosen distance metric.

dist_i = d(xi, x0), i=1 to N
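
In NumPy this step can be vectorized. The sketch below assumes X is an N x d array holding the training points and x0 is a length-d query vector (illustrative names, not from the project):

import numpy as np

def all_distances(X, x0):
    # dist_i = d(x_i, x0) for every row x_i of X, computed in one vectorized step
    return np.sqrt(np.sum((X - x0) ** 2, axis=1))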

Find the K nearest neighbors: We sort the distances in ascending order and select the K nearest data points to x0 based on their distances.


idx = argsort(dist_1, dist_2, ..., dist_N)

k_nearest = {x_i | i in idx[1:K]}

where idx is the array of data-point indices ordered by increasing distance, and k_nearest is the set of K nearest data points.
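
A minimal NumPy sketch of this selection step, assuming distances is the array of distances computed above:

import numpy as np

def k_nearest_indices(distances, k):
    # argsort gives the indices that order the distances from smallest to largest;
    # the first k of them identify the k nearest neighbors
    idx = np.argsort(distances)
    return idx[:k]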


Predict the class label: For classification tasks, we predict the class label of x0 based on the labels of its K nearest neighbors. The predicted class label is the mode (most common) label among the K nearest neighbors.


y0 = argmax_j sum(1{y_i = j}), i=1 to K, j=1 to C

where y_i is the label of the i-th nearest neighbor, 1{y_i = j} is an indicator function that returns 1 if y_i = j and 0 otherwise, and argmax_j denotes the class label j (among the C classes) that maximizes the count over the K neighbors.
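
For example, the majority vote can be written with Python's standard library as in the sketch below, where neighbor_labels is assumed to hold the labels of the K nearest neighbors:

from collections import Counter

def predict_class(neighbor_labels):
    # Return the most common (mode) label among the K nearest neighbors
    return Counter(neighbor_labels).most_common(1)[0][0]

# Example: three neighbors labeled 'a' and one labeled 'b' -> predict 'a'
print(predict_class(['a', 'b', 'a', 'a']))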


For regression tasks, we predict the value of x0 based on the values of its K nearest neighbors. The predicted value is usually the mean or median value among the K nearest neighbors.


y0 = mean({y_i | i in idx[1:K]})

where y_i is the value of the i-th nearest neighbor.
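
A corresponding sketch for regression, assuming neighbor_values holds the target values of the K nearest neighbors:

import numpy as np

def predict_value(neighbor_values):
    # Mean of the neighbors' values; np.median(neighbor_values) is a robust alternative
    return np.mean(neighbor_values)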


Return the predicted class label or value: The KNN algorithm returns the predicted class label or value of x0.


That's the KNN algorithm in a nutshell. The value of K is a hyperparameter that needs to be tuned based on the data and the problem at hand. A larger value of K means more neighbors are considered, which produces a smoother decision boundary but may lead to underfitting. A smaller value of K means fewer neighbors are considered, which produces a more complex decision boundary but may lead to overfitting.
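
Putting the steps together, a minimal from-scratch classifier might look like the sketch below. The class name SimpleKNN and all other names are illustrative; this is not the project's implementation.

import numpy as np
from collections import Counter

class SimpleKNN:
    # Minimal KNN classifier following the steps above (written for clarity, not speed)

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN is a lazy learner: "training" simply stores the data
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict_one(self, x0):
        distances = np.sqrt(np.sum((self.X - x0) ** 2, axis=1))  # step 2: distances
        idx = np.argsort(distances)[:self.k]                     # step 3: K nearest
        return Counter(self.y[idx]).most_common(1)[0][0]         # step 4: majority vote

    def predict(self, X_new):
        return np.array([self.predict_one(x0) for x0 in np.asarray(X_new, dtype=float)])

# Example usage on a tiny toy dataset
model = SimpleKNN(k=3).fit([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], [0, 0, 0, 1, 1, 1])
print(model.predict([[0.5, 0.5], [5.5, 5.5]]))   # expected: [0 1]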

Several variations of K-Nearest Neighbors

There are several variations and extensions of the K-Nearest Neighbors (KNN) algorithm that have been proposed over the years. Here are some of the most common methods:
Weighted KNN: In this method, instead of just taking a majority vote or average of the K nearest neighbors, we assign a weight to each neighbor based on its distance to the new data point. Closer neighbors are given higher weights, while farther neighbors are given lower weights. This gives more influence to the neighbors that are most similar to the new data point.

Distance-weighted KNN: This is a variation of weighted KNN in which each neighbor's weight is the inverse of its distance raised to a power p. Closer neighbors still receive higher weights and farther neighbors lower weights, but the weights fall off more rapidly as the distance increases. This can help reduce the influence of outliers and noise in the data.
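
A small sketch of distance-weighted voting, where each neighbor contributes 1/d^p to its label's score. The function name and the small eps constant are illustrative assumptions:

import numpy as np

def weighted_vote(neighbor_labels, neighbor_distances, p=2, eps=1e-9):
    # Each neighbor's weight is the inverse of its distance raised to the power p;
    # eps avoids division by zero when a neighbor coincides with the query point
    weights = 1.0 / (np.asarray(neighbor_distances, dtype=float) ** p + eps)
    scores = {}
    for label, w in zip(neighbor_labels, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# Example: a close 'b' neighbor outweighs two distant 'a' neighbors
print(weighted_vote(['a', 'a', 'b'], [3.0, 4.0, 0.5]))   # -> 'b'

Scikit-learn's KNeighborsClassifier offers a related built-in option, weights="distance", which weights neighbors by the inverse of their distance.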

Ball Tree: This is a data structure used to speed up the search for the K nearest neighbors. It works by recursively partitioning the data points into nested hyperspheres (balls). During a query, balls that are provably too far from the query point can be skipped, which reduces the number of distance calculations needed to find the K nearest neighbors.

KD Tree: This is another data structure used to speed up the search for the K nearest neighbors. It works by recursively partitioning the data space with axis-aligned splitting hyperplanes, one dimension at a time, producing a tree of hyperrectangular regions. As with the ball tree, whole regions can be pruned during a query, which reduces the number of distance calculations needed.
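
Both structures are available in scikit-learn, where the algorithm parameter selects the search strategy. A small illustrative example on random toy data:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 8))        # toy data: 1000 points in 8 dimensions
x0 = rng.random((1, 8))          # query point

# algorithm can be 'ball_tree', 'kd_tree', 'brute', or 'auto'
nn = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(X)
distances, indices = nn.kneighbors(x0)
print(indices)                   # indices of the 5 nearest training points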

Radius-based KNN: In this method, instead of finding the K nearest neighbors, we find all the neighbors within a certain radius r of the new data point. The predicted class label or value is then based on the labels or values of these neighbors. This method can be useful when the density of the data points is not uniform, and we want to find all the neighbors within a certain neighborhood.
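
Scikit-learn implements this idea in RadiusNeighborsClassifier; a tiny illustrative example on toy data:

from sklearn.neighbors import RadiusNeighborsClassifier

# Toy 1-D data: all training points within the radius vote on the query's label
X = [[0.0], [0.5], [1.0], [5.0]]
y = [0, 0, 0, 1]

clf = RadiusNeighborsClassifier(radius=1.5).fit(X, y)
print(clf.predict([[0.7]]))   # -> [0]; the three points within radius 1.5 all have label 0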

In Our Project


Our project, which has been uploaded to GitHub and implemented in Python, aims to build a K-Nearest Neighbors (KNN) model for classification and to identify the optimal value of K (the number of neighbors) for the given data. To achieve this, we have used several Python packages, including Pandas, Seaborn, and Sklearn, along with Sklearn's StandardScaler and KNeighborsClassifier classes.

Pandas is a powerful data manipulation library that allows us to load, process, and manipulate the dataset effectively. Seaborn is a visualization library that enables us to create appealing and informative visualizations to better understand the data. Sklearn is a widely-used machine learning library that provides a comprehensive set of tools and algorithms for data analysis and modeling.
We have used the StandardScaler class from Sklearn to scale our data so that all features contribute on a comparable scale to the distance computation during model training. Additionally, we have used the KNeighborsClassifier class from Sklearn to build our KNN classification model.
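
A rough sketch of this pipeline is shown below. It uses scikit-learn's bundled iris data purely as a stand-in, since the project's actual dataset and preprocessing live in the GitHub repository:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Stand-in data; the project loads its own dataset with Pandas (e.g. pd.read_csv)
df = load_iris(as_frame=True).frame
X = df.drop(columns="target")
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features so that no single feature dominates the distance computation
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
print("Accuracy:", accuracy_score(y_test, knn.predict(X_test_scaled)))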

In our project, we have created several visualizations to better understand the data and identify any patterns or trends. These include scatter plots, histograms, and box plots, which help us visualize the distributions of the features and the relationships between them.
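
As an illustration of the kind of exploratory plots involved, the sketch below uses Seaborn's bundled iris dataset as a stand-in for the project's data:

import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in dataset; the project explores its own data in the same way
df = sns.load_dataset("iris")

sns.pairplot(df, hue="species")                # pairwise scatter plots plus per-feature histograms
plt.show()

sns.boxplot(data=df.drop(columns="species"))   # box plots of each numeric feature
plt.show()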

The main focus of our project is to find the optimal value of K (the number of neighbors) for the KNN model. To do this, we have used the elbow method: we run the KNN model with a range of K values, compute the model's error rate for each value, and plot the results. The "elbow" point, where increasing K stops improving the error appreciably, indicates a good choice of K for the given data.
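
A hedged sketch of this procedure, again using the iris data as a stand-in for the project's dataset:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in data; the project uses its own dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train, X_test = scaler.fit_transform(X_train), scaler.transform(X_test)

# Evaluate the error rate for a range of K values and look for the "elbow"
k_values = range(1, 31)
error_rates = [1 - KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
               for k in k_values]

plt.plot(k_values, error_rates, marker="o")
plt.xlabel("K (number of neighbors)")
plt.ylabel("Error rate")
plt.title("Elbow plot for choosing K")
plt.show()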



Jay Mudgal

Jay.mudgal@baruchmail.cuny.edu