Weighted k-NN Classification Using C# — Visual Studio Magazine

The Data Science Lab

Weighted k-NN Classification Using C#

Microsoft Research’s Dr. James McCaffrey uses a complete program and step-by-step instructions to explain the machine learning technique, which can be used to predict a person’s happiness score from their income and education, for example.

The goal of a weighted k-NN (“k-nearest neighbors”) classification problem is to predict a discrete value called the class label. The idea is best explained with a concrete example. Suppose you have a set of 30 data items where each item consists of a person’s annual income, education (in years), and happiness score (0, 1, or 2). You want to predict a person’s happiness score based on their income and education. This is an example of multiclass classification because the variable to predict, happiness, has three or more possible values. If the variable to predict had only two possible values, the problem would be called binary classification. The k-NN technique can handle both multiclass and binary classification problems.

Figure 1: Weighted k-NN classification using C#

A good way to see where this article is going is to take a look at the screenshot of a demo program in Figure 1. The demo configures 30 data items. Each has a data ID (from 0 to 29), annual income (normalized by dividing by $100,000), education (normalized by dividing by 20), and happiness score (0, 1, 2). Data elements look like:

ID  Income Education Happy
--------------------------
00   0.32   0.43       0
01   0.26   0.54       0
. . . 
12   0.39   0.43       1
. . .
29   0.71   0.22       2

The graph in Figure 2 displays the data. The goal is to predict the happiness score of a person who has (income, education) of (0.62, 0.35). Note that all predictor values are numeric. This is a key characteristic of the k-NN classification technique presented in this article. If you have non-numeric predictor data, such as a person’s eye color, you should use a different classification technique such as a neural network or a decision tree.

The demo program specifies k = 6, then finds the six labeled data points closest to the item to classify, (0.62, 0.35). Of the six closest points, three are class 0, one is class 1, and two are class 2. A simple majority-rule voting scheme would conclude that (0.62, 0.35) is class 0. The demo uses a weighted voting scheme where closer data points are given more weight. The weighted votes are (0.5711, 0.1782, 0.2508). The largest value is at index [0] and so the prediction is happiness = 0.

Figure 2: The k-NN demo data

This article assumes that you have intermediate or better programming skills with a C-family language such as Python or Java, but does not assume that you know anything about weighted k-NN classification. The complete demo code and associated data are shown in this article. Source code and data are also available in the accompanying download. All normal error checking has been removed to keep the main ideas as clear as possible.

Understanding weighted k-NN classification
The first step in weighted k-NN classification is to calculate the distances between the data item to be predicted and each of the data points. There are many ways to calculate the distance between two points, but the Euclidean distance is simple and effective in most cases.

For example, the Euclidean distance between the target element (0.62, 0.35) and the data point [09] at (0.61, 0.42) is:

dist = sqrt( (0.62 - 0.61)^2 + (0.35 - 0.42)^2 )
     = sqrt( 0.0001 + 0.0049 )
     = sqrt(0.0050)
     = 0.0707

It is important that the values of the numeric predictors be normalized so that variables with large magnitudes, such as income, do not overwhelm variables with smaller magnitudes.
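
For reference, the demo data was prepared by dividing raw income by $100,000 and raw education years by 20. A minimal normalization sketch, using a hypothetical helper that is not part of the demo program, could look like:

static double[] NormalizeItem(double rawIncome, double rawEducationYears)
{
  // divide-by-constant normalization, matching how the demo data was prepared
  double[] result = new double[2];
  result[0] = rawIncome / 100000.0;       // e.g., 62,000.00 -> 0.62
  result[1] = rawEducationYears / 20.0;   // e.g., 7 years   -> 0.35
  return result;
}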

The next step is to find the k closest data points to the target element. This is done by sorting all the distances and extracting the first k elements and distances. For the demo data, the k = 6 elements closest to (0.62, 0.35) are:

ID  Data       Class Dist   Inv Dist
------------------------------------
09  (0.61 0.42)  0   0.0707  14.1421
07  (0.55 0.32)  0   0.0762  13.1306
18  (0.55 0.41)  1   0.0922  10.8465
28  (0.64 0.24)  2   0.1118   8.9443
05  (0.49 0.32)  0   0.1334   7.4953
29  (0.71 0.22)  2   0.1581   6.3246

To weight closer data items more heavily, the inverse of the distance is used. At this point, you can simply add the inverse distances for each class as follows:

class 0: 14.1421 + 13.1306 + 7.4953 = 34.7680
class 1:                              10.8465
class 2:  8.9443 + 6.3246           = 15.2689
                                      -------
                                      60.8834

The prediction is the class with the highest weighted vote, class 0. The demo program normalizes the weighted votes by dividing each by their sum, which is 60.8834. This makes the weighted votes sum to 1.0 and makes them a bit easier to interpret:

class 0: 34.7680 / 60.8834 = 0.5711
class 1: 10.8465 / 60.8834 = 0.1782
class 2: 15.2689 / 60.8834 = 0.2508
                             ------
                             1.0000
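
The arithmetic above can be verified with a few lines of C#. This is just a sketch, with the six inverse distances and class labels copied from the worked example:

double[] invDists = new double[] { 14.1421, 13.1306, 10.8465, 8.9443, 7.4953, 6.3246 };
int[] classes = new int[] { 0, 0, 1, 2, 0, 2 };
double[] votes = new double[3];               // one accumulator per class
for (int i = 0; i < invDists.Length; ++i)
  votes[classes[i]] += invDists[i];           // sum inverse distances by class
double sum = votes[0] + votes[1] + votes[2];  // 60.8834
for (int c = 0; c < 3; ++c)
  Console.WriteLine("class " + c + ": " + (votes[c] / sum).ToString("F4"));
// displays 0.5711, 0.1782, 0.2508 -- so the prediction is class 0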

The biggest weakness of k-NN classification is that there is no principled way to select the value of k. For a given problem, different values of k often give different predictions. One technique for dealing with this is to make predictions using several different values of k and then use the most common predicted class.
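
One possible implementation sketch of that idea follows. It assumes a hypothetical PredictClass() helper that wraps the weighted-vote logic of the Analyze() method in Listing 1 and returns the winning zero-based class index:

static int ConsensusPredict(double[] item, double[][] data, int[] kValues, int c)
{
  // count how often each class wins across several values of k
  int[] winCounts = new int[c];
  foreach (int k in kValues)
  {
    int pred = PredictClass(item, data, k, c);  // hypothetical helper
    ++winCounts[pred];
  }
  int best = 0;  // return the most common predicted class
  for (int i = 1; i < c; ++i)
    if (winCounts[i] > winCounts[best]) best = i;
  return best;
}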

The demo program
The complete demo program, with a few minor edits to save space, is shown in Listing 1. To create the program, I launched Visual Studio and created a new C# .NET Core console application named KNN. I used Visual Studio 2019 (the free Community Edition), but the demo has no significant dependencies, so any version of Visual Studio will work fine. You could also use Visual Studio Code.

Once the template code loaded, in the editor window, I removed all unnecessary namespace references, leaving only the reference to the top-level System namespace. In the Solution Explorer window, I right-clicked the Program.cs file, renamed it to the more descriptive KNNProgram.cs, and allowed Visual Studio to automatically rename the Program class to KNNProgram.

Listing 1: Complete Weighted k-NN Demo Code

using System;
namespace KNN
{
  class KNNProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin weighted k-NN classification demo ");
      Console.WriteLine("Normalized income, education data: ");
      Console.WriteLine("[id =  0, 0.32, 0.43, class = 0]");
      Console.WriteLine(" . . . ");
      Console.WriteLine("[id = 29, 0.71, 0.22, class = 2]");

      double[][] data = GetData();
      double[] item = new double[] { 0.62, 0.35 };
      Console.WriteLine("nNearest (k=6) to (0.35, 0.38):");
      Analyze(item, data, 6, 3);  // 3 classes
      Console.WriteLine("nEnd weighted k-NN demo ");
      Console.ReadLine();
    }

    static void Analyze(double[] item, double[][] data, int k, int c)
    {
      // 1. Compute all distances
      int N = data.Length;
      double[] distances = new double[N];
      for (int i = 0; i < N; ++i)
        distances[i] = DistFunc(item, data[i]);

      // 2. Get ordering
      int[] ordering = new int[N];
      for (int i = 0; i < N; ++i)
        ordering[i] = i;
      double[] distancesCopy = new double[N];
      Array.Copy(distances, distancesCopy, distances.Length);
      Array.Sort(distancesCopy, ordering);

      // 3. Show info for k-nearest
      double[] kNearestDists = new double[k];
      for (int i = 0; i < k; ++i)
      {
        int idx = ordering[i];
        ShowVector(data[idx]);
        Console.Write("  dist = " +  
          distances[idx].ToString("F4"));
        Console.WriteLine("  inv dist " +
          (1.0 / distances[idx]).ToString("F4"));
        kNearestDists[i] = distances[idx];
      }

      // 4. Vote
      double[] votes = new double[c];  // one per class
      double[] wts = MakeWeights(k, kNearestDists);
      Console.WriteLine("nWeights (inverse technique): ");
      for (int i = 0; i < wts.Length; ++i)
        Console.Write(wts[i].ToString("F4") + "  ");
      Console.WriteLine("nnPredicted class: ");
      for (int i = 0; i < k; ++i)
      {
        int idx = ordering[i];
        int predClass = (int)data[idx][3];
        votes[predClass] += wts[i] * 1.0;
      }
      for (int i = 0; i < c; ++i)
        Console.WriteLine("[" + i + "]  " +
        votes[i].ToString("F4"));
    } // Analyze

    static double[] MakeWeights(int k, double[] distances)
    {
      // Inverse technique
      double[] result = new double[k];  // one per neighbor
      double sum = 0.0;
      for (int i = 0; i < k; ++i) {
        result[i] = 1.0 / distances[i];
        sum += result[i];
      }
      for (int i = 0; i < k; ++i)
        result[i] /= sum;
      return result;
    }

    static double DistFunc(double[] item, double[] dataPoint)
    {
      double sum = 0.0;
      for (int i = 0; i < 2; ++i)  {
        double diff = item[i] - dataPoint[i + 1];
        sum += diff * diff;
      }
      return Math.Sqrt(sum);
    }

    static double[][] GetData()
    {
      double[][] data = new double[30][];
      data[0] = new double[] { 0, 0.32, 0.43, 0 };
      data[1] = new double[] { 1, 0.26, 0.54, 0 };
      data[2] = new double[] { 2, 0.27, 0.6, 0 };
      data[3] = new double[] { 3, 0.37, 0.36, 0 };
      data[4] = new double[] { 4, 0.37, 0.68, 0 };
      data[5] = new double[] { 5, 0.49, 0.32, 0 };
      data[6] = new double[] { 6, 0.46, 0.7, 0 };
      data[7] = new double[] { 7, 0.55, 0.32, 0 };
      data[8] = new double[] { 8, 0.57, 0.71, 0 };
      data[9] = new double[] { 9, 0.61, 0.42, 0 };
      data[10] = new double[] { 10, 0.63, 0.51, 0 };
      data[11] = new double[] { 11, 0.62, 0.63, 0 };
      data[12] = new double[] { 12, 0.39, 0.43, 1 };
      data[13] = new double[] { 13, 0.35, 0.51, 1 };
      data[14] = new double[] { 14, 0.39, 0.63, 1 };
      data[15] = new double[] { 15, 0.47, 0.4, 1 };
      data[16] = new double[] { 16, 0.48, 0.5, 1 };
      data[17] = new double[] { 17, 0.45, 0.61, 1 };
      data[18] = new double[] { 18, 0.55, 0.41, 1 };
      data[19] = new double[] { 19, 0.57, 0.53, 1 };
      data[20] = new double[] { 20, 0.56, 0.62, 1 };
      data[21] = new double[] { 21, 0.28, 0.12, 1 };
      data[22] = new double[] { 22, 0.31, 0.24, 1 };
      data[23] = new double[] { 23, 0.22, 0.3, 1 };
      data[24] = new double[] { 24, 0.38, 0.14, 1 };
      data[25] = new double[] { 25, 0.58, 0.13, 2 };
      data[26] = new double[] { 26, 0.57, 0.19, 2 };
      data[27] = new double[] { 27, 0.66, 0.14, 2 };
      data[28] = new double[] { 28, 0.64, 0.24, 2 };
      data[29] = new double[] { 29, 0.71, 0.22, 2 };
      return data;
    }

    static void ShowVector(double[] v)
    {
      Console.Write("idx = " + v[0].ToString().PadLeft(3) +
        "  (" + v[1].ToString("F2") + " " +
        v[2].ToString("F2") + ") class = " + v[3]);
    }

  } // Program
} // ns

The demo program hard-codes the data in a program-defined GetData() function. The data is returned as an array-of-arrays style double[][] matrix. In a non-demo scenario, you might want to store the data in a text file and write a helper function to read it into an array-of-arrays matrix, as sketched after the next paragraph.

The first value of each data item is an arbitrary ID. I used consecutive whole numbers from 0 to 29, but in many cases you would have meaningful identifiers such as a person’s employee ID number. The k-NN algorithm doesn’t need data IDs, so the ID column could have been omitted. The second and third values are the normalized predictor values. The fourth value is the class ID. Class IDs for k-NN need not be zero-based integers, but using this scheme is convenient.
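
If you do store the data in a text file, a minimal loading sketch, assuming comma-delimited lines such as 0,0.32,0.43,0 and a hypothetical file name, could be:

static double[][] LoadData(string fileName)
{
  // assumes one data item per line: ID, income, education, class
  string[] lines = System.IO.File.ReadAllLines(fileName);
  double[][] result = new double[lines.Length][];
  for (int i = 0; i < lines.Length; ++i)
  {
    string[] tokens = lines[i].Split(',');
    result[i] = new double[tokens.Length];
    for (int j = 0; j < tokens.Length; ++j)
      result[i][j] = double.Parse(tokens[j]);
  }
  return result;
}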

Calculation of distances
The first step of the k-NN algorithm is to calculate the distance between each labeled data item and the item to be classified. In my experience and some research results, the choice of which distance function to apply is not as critical as one might expect, and the simple Euclidean distance usually works well.

The demo implementation of the Euclidean distance is:

static double DistFunc(double[] item, double[] dataPoint)
{
  double sum = 0.0;
  for (int i = 0; i < 2; ++i) {
    double diff = item[i] - dataPoint[i+1];
    sum += diff * diff;
  }
  return Math.Sqrt(sum);
}

Note that the for-loop indexing is hard-coded with 2 for the number of predictor values, and the difference item[i] - dataPoint[i+1] is offset by one to account for the data ID at position [0]. The point here is that there is usually a trade-off between coding a machine learning algorithm for a specific problem and coding the algorithm for general-purpose use. Because k-NN is so simple and easy to implement, I generally prefer the code-for-specific-problem approach.
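
If you prefer a slightly more general version, one possible sketch passes the number of predictor columns as a parameter instead of hard-coding 2; the data row is still assumed to hold the ID at position [0]:

static double DistFunc(double[] item, double[] dataPoint, int numPredictors)
{
  double sum = 0.0;
  for (int i = 0; i < numPredictors; ++i)
  {
    double diff = item[i] - dataPoint[i + 1];  // skip the ID column
    sum += diff * diff;
  }
  return Math.Sqrt(sum);
}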

Sorting or ordering the distances
Once all the distances have been calculated, the k-NN algorithm must find the nearest (smallest) k distances. The approach used by the demo is to use a neat overload of the .NET Array.Sort method to sort only the distances and associated data indexes in parallel. The key code is:

int[] ordering = new int[N];
for (int i = 0; i < N; ++i)
  ordering[i] = i;
double[] distancesCopy = new double[N];
Array.Copy(distances, distancesCopy, distances.Length);
Array.Sort(distancesCopy, ordering);

The array named ordering initially contains 0, 1, 2 . . 29. A copy of the 30 distances to the item to be classified is created because you don’t want to lose the distances in their original order. The Array.Sort(distancesCopy, ordering) statement sorts the values in the distancesCopy array from smallest to largest using the Quicksort algorithm and simultaneously reorders the index values in the ordering array. The result is that the first value in the ordering array is the index of the data item with the smallest distance, the second value is the index of the item with the second-smallest distance, and so on. This parallel-sort overload is a very useful feature of the .NET Array class.
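
If you prefer LINQ, an equivalent ordering can be generated without the explicit copy-and-sort, at the cost of a dependency on System.Linq. This is just an alternative sketch, not the demo code:

// assumes a "using System.Linq;" directive at the top of the file
int[] ordering = Enumerable.Range(0, distances.Length)
  .OrderBy(i => distances[i])   // order indices by their distances
  .ToArray();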

Wrapping up
The two main advantages of weighted k-NN classification are that 1.) it is simple to understand and implement, and 2.) the results are easily interpretable. The two main disadvantages are that 1.) there is no principled method for choosing the value of k, and 2.) the technique works well only with strictly numeric predictor values, because it is usually not possible to define a good distance function for non-numeric data.

For an example using Python, see the article “Weighted k-NN classification using Python”.

About the Author


Dr. James McCaffrey works for Microsoft Research in Redmond, Washington. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].


