Chi-Square Test Using C# – Visual Studio Magazine

The Data Science Lab

Chi-Square Test Using C#

A chi-square test (also called a chi-squared test) is a classical statistics technique that can be used to determine whether observed count data matches expected count data. For example, suppose you have three web servers designed to handle 50%, 25%, and 25% of your traffic, respectively. If you observe 1000 HTTP requests, you would expect to see approximately 500 requests handled by the first server, 250 requests handled by the second server, and 250 requests handled by the third server.

But suppose your actual observed counts are (529, 241, 230). Do you conclude that the differences between the observed and expected counts are just due to chance, or do you conclude that there is statistical evidence that your web servers are not handling traffic as expected? This is an example of a chi-square goodness-of-fit test.

A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo sets up observed counts of (529, 241, 230) and expected counts of (500.0, 250.0, 250.0). In a chi-square test, the observed counts are type int, but the expected (theoretical) counts are often type double.
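The expected counts above are just the design proportions scaled by the total number of observations, which is why they are naturally type double. A minimal sketch of that step, assuming the proportions are known in advance; the class and method names are hypothetical and are not part of the demo program:

```csharp
using System;

public static class ExpectedCountsSketch
{
  // Hypothetical helper (not in the demo): scales the design
  // proportions by the total number of observations.
  public static double[] FromProportions(double[] proportions, int total)
  {
    double[] expected = new double[proportions.Length];
    for (int i = 0; i < proportions.Length; ++i)
      expected[i] = proportions[i] * total;
    return expected;
  }
}
```

For the web server example, calling FromProportions with proportions (0.50, 0.25, 0.25) and a total of 1000 gives the expected counts (500.0, 250.0, 250.0) used in the demo.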

The three main output statements from the demo program are:

The chi-square statistic = 3.6060
The corresponding p-val  = 0.1648
Insufficient evidence observed is off

The chi-square statistic is a single value that measures the difference between the observed counts and the expected counts. A chi-square statistic of 0.0 means that the observed and expected counts match exactly. The larger the statistic, the greater the difference between the observed and expected counts.

Figure 1: Chi-Square Test Demo

The p-val (“probability value”) is, loosely, the probability of seeing a chi-square statistic at least as large as the computed one if the observed counts really do follow the expected counts. Thus, a small p-value, such as 0.03 (3%), indicates a mismatch between observed and expected. In this case, the p-value of 0.1648 is smallish but not small enough to conclude that something is wrong. In other words, the difference between the observed and expected counts is somewhat suspicious, but it could be due to random fluctuations in the data.

This article assumes you have intermediate or better programming skills, but does not assume that you know anything about chi-square goodness-of-fit tests. The demo program is coded in C#, but you should have no trouble refactoring the code to another language such as JavaScript or Python. All of the demo code is presented in this article and is also available in the accompanying download.

Understanding the Chi-Square Test
There are two steps to a chi-square test. First, the observed and expected counts are used to calculate a chi-square statistic, which is a measure of the difference between the counts. Second, the chi-square statistic is used to calculate a p-value, which measures how likely a difference at least that large would be if the counts really did match.

The chi-square statistic is defined as the sum of the squared differences between the observed and the expected divided by the expected:

chi-square = sum( (obs[i] - exp[i])^2 / exp[i] )

The idea is best explained by an example. Suppose, as in the demo, that the observed counts are (529, 241, 230) and that the expected counts are (500, 250, 250). The calculated chi-square statistic is:

chi-square = (529 - 500)^2 / 500 +
             (241 - 250)^2 / 250 +
             (230 - 250)^2 / 250

           = (841 / 500) + (81 / 250) + (400 / 250)
           = 1.6820 + 0.3240 + 1.6000
           = 3.6060

The demo implements this function as:

public static double ChiSqStat(int[] observed,
  double[] expected)
{
  double sum = 0.0;
  for (int i = 0; i < observed.Length; ++i) {
    sum += ((observed[i] - expected[i]) *
      (observed[i] - expected[i])) / expected[i];
  }
  return sum;
}

For simplicity, there is no error checking, but in a production system you would want to verify that the observed and expected arrays have the same length, that no expected count is zero, and so on.
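One way the missing checks might look is sketched below. The particular checks, the exception types, and the name ChiSqStatChecked are assumptions made for illustration; they are not part of the demo.

```csharp
using System;

public static class ChiSqStatSketch
{
  // Same calculation as the demo's ChiSqStat, with argument
  // validation added (the checks themselves are assumptions).
  public static double ChiSqStatChecked(int[] observed, double[] expected)
  {
    if (observed == null || expected == null)
      throw new ArgumentNullException(
        observed == null ? nameof(observed) : nameof(expected));
    if (observed.Length != expected.Length)
      throw new ArgumentException("Arrays must have the same length");
    if (observed.Length < 2)
      throw new ArgumentException("Need at least two categories");

    double sum = 0.0;
    for (int i = 0; i < observed.Length; ++i)
    {
      if (observed[i] < 0)
        throw new ArgumentException("Observed counts cannot be negative");
      // Written as !(... > 0.0) so that NaN is rejected too;
      // a zero expected count would cause a division by zero.
      if (!(expected[i] > 0.0))
        throw new ArgumentException("Expected counts must be positive");

      double diff = observed[i] - expected[i];
      sum += (diff * diff) / expected[i];
    }
    return sum;
  }
}
```

For the demo data, this version returns the same chi-square statistic as the unchecked function, 3.6060 to four decimals.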

Calculating the p-Value
Calculating a chi-square statistic is easy, but calculating the associated p-value is very difficult. The ideas are illustrated in the graphic in Figure 2.

Figure 2: Chi-Square Distribution for the Demo Data

There isn't just one chi-square distribution; there is a different chi-square distribution for each value of degrees of freedom (df). For a chi-square test of observed versus expected counts, the df is the number of categories minus one. Therefore, for the demo data, df = 3 – 1 = 2.

The x-axis ranges from 0.0 (when the observed and expected counts are equal) to infinity (there is no limit to how different the observed and expected counts can be). The curve that defines the chi-square distribution is called the probability density function (pdf). The p-value is the area under the pdf curve from the chi-square statistic to infinity.
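For the demo data, the area can be worked out by hand, because with df = 2 the chi-square pdf reduces to a simple exponential. The following uses that standard result as a cross-check on the demo output; it is not how the demo code computes the p-value:

```latex
p \;=\; \int_{x}^{\infty} \tfrac{1}{2}\, e^{-t/2}\, dt \;=\; e^{-x/2},
\qquad
p \;=\; e^{-3.6060/2} \;=\; e^{-1.8030} \;\approx\; 0.1648
```

This agrees with the p-value reported by the demo program.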

To recap, when the observed and expected numbers are similar, the calculated chi-square statistic will be small (close to 0) and the p-value will be large (close to 1). When the observed and expected counts are very different, the calculated chi-square statistic will be high and the p-value will be low (perhaps less than 0.05).

There are several algorithms that estimate the area under a chi-square distribution. The demo program uses ACM Algorithm 299, which in turn calls another algorithm, ACM Algorithm 209. Here is a small snippet of the demo function that calculates the area under a chi-square distribution:

. . .
if (a > 40.0) {  // ACM remark (5)
  if (even == true) ee = 0.0;
  else ee = 0.5723649429247000870717135; // log(sqrt(pi))
  c = Math.Log(a); // log base e
  while (z <= x) {
    ee = Math.Log(z) + ee;
    s = s + Exp(c * z - a - ee); // ACM update (6)
    z = z + 1.0;
  }
. . .

The point is, estimating the area under a chi-square distribution is conceptually complex, but these are just numerical calculations.
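To make the "just numerical calculations" point concrete: when df is even, the upper-tail area has a short closed form, exp(-a) times the sum of a^k / k! for k = 0 to df/2 - 1, where a = x / 2. The sketch below implements only that special case as an independent cross-check. It is not the demo's ACM 299 code, it does not handle odd df, and the class and method names are hypothetical.

```csharp
using System;

public static class ChiSqEvenDfSketch
{
  // Upper-tail area of the chi-square distribution, even df only:
  //   p = exp(-a) * sum_{k=0}^{df/2 - 1} a^k / k!,  where a = x / 2
  public static double PvalEvenDf(double x, int df)
  {
    if (double.IsNaN(x) || x < 0.0)
      throw new ArgumentException("x must be a number >= 0");
    if (df < 2 || df % 2 != 0)
      throw new ArgumentException("df must be even and >= 2");

    double a = 0.5 * x;
    double term = 1.0; // holds a^k / k!, starting at k = 0
    double sum = 1.0;
    for (int k = 1; k < df / 2; ++k)
    {
      term *= a / k;   // a^k / k! from a^(k-1) / (k-1)!
      sum += term;
    }
    return Math.Exp(-a) * sum;
  }
}
```

Called with the demo's chi-square statistic of 3.6060 and df = 2, this returns approximately 0.1648, matching the demo output.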

The Demo Program
The complete code for the demo program, with a few minor edits to save space, is shown in Listing 1. To create the demo, I launched Visual Studio and created a new console app named ChiSquareUsingCSharp. I specified a .NET Core app, but the code has no dependencies, so a .NET Framework app would work just as well.

Listing 1: Complete Demo Program

using System;
namespace ChiSquareCSharp
{
  class ChiSquareProgram
  {
    static void Main(string[] args)
    {
      Console.WriteLine("Begin chi-square test ");

      int[] observed = new int[] { 529, 241, 230 }; 
      double[] expected = new double[] { 500.0, 250.0, 250.0 };
      Console.Write("Observed counts: ");
      ShowVector(observed);
      Console.Write("Expected counts: ");
      ShowVector(expected);

      double[] result = ChiSqTest(observed, expected);
      double chiStat = result[0];
      double pVal = result[1];
      Console.WriteLine("The chi-square statistic = " + 
        chiStat.ToString("F4"));
      Console.WriteLine("The corresponding p-val  = " + 
        pVal.ToString("F4"));

      if (pVal < 0.05)
        Console.WriteLine("Strong evidence observed is off ");
      else if (pVal < 0.10)
        Console.WriteLine("Moderate evidence observed is off ");
      else
        Console.WriteLine("Insufficient evidence observed is off ");

      Console.WriteLine("End demo ");
      Console.ReadLine();
    }  // Main

    public static double[] ChiSqTest(int[] observed,
      double[] expected)
    {
      // 1. compute chi-square statistic
      double x = ChiSqStat(observed, expected);
      // 2. compute the p-val
      int df = observed.Length - 1;
      double pVal = ChiSqPval(x, df);
      // 3. return both values
      double[] result = new double[] { x, pVal };
      return result;
    } // ChiSqTest

    public static double ChiSqStat(int[] observed,
      double[] expected)
    {
      double sum = 0.0;
      for (int i = 0; i < observed.Length; ++i)
      {
        sum += ((observed[i] - expected[i]) *
          (observed[i] - expected[i])) / expected[i];
      }
      return sum;
    }

    public static double ChiSqPval(double x, int df)
    {
      // ACM Algorithm #299
      if (x <= 0.0 || df < 1)
        throw new Exception("Bad arg in ChiSqPval()");

      double a = 0.0; // 299 variable names
      double y = 0.0;
      double s = 0.0;
      double z = 0.0;
      double ee = 0.0; // change from e
      double c;

      bool even; // is df even?

      a = 0.5 * x;
      if (df % 2 == 0) even = true; else even = false;

      if (df > 1) y = Exp(-a); // ACM update remark (4)

      if (even == true) s = y;
      else s = 2.0 * Gauss(-Math.Sqrt(x));

      if (df > 2)
      {
        x = 0.5 * (df - 1.0);
        if (even == true) z = 1.0; else z = 0.5;
        if (a > 40.0) // ACM remark (5)
        {
          if (even == true) ee = 0.0;
          else ee = 0.5723649429247000870717135; // log(sqrt(pi))
          c = Math.Log(a); // log base e
          while (z <= x)
          {
            ee = Math.Log(z) + ee;
            s = s + Exp(c * z - a - ee); // ACM update remark (6)
            z = z + 1.0;
          }
          return s;
        } // a > 40.0
        else
        {
          if (even == true) ee = 1.0;
          else ee = 0.5641895835477562869480795 / Math.Sqrt(a);
          c = 0.0;
          while (z <= x)
          {
            ee = ee * (a / z); // ACM update remark (7)
            c = c + ee;
            z = z + 1.0;
          }
          return c * y + s;
        }
      } // df > 2
      else
      {
        return s;
      }
    } // ChiSqPval()

    private static double Exp(double x) // ACM update remark (3)
    {
      if (x < -40.0) // ACM update remark (8)
        return 0.0;
      else
        return Math.Exp(x);
    }

    public static double Gauss(double z)
    {
      // ACM Algorithm #209
      double y; // 209 scratch variable
      double p; // result. called 'z' in 209
      double w; // 209 scratch variable

      if (z == 0.0)
        p = 0.0;
      else
      {
        y = Math.Abs(z) / 2;
        if (y >= 3.0)
        {
          p = 1.0;
        }
        else if (y < 1.0)
        {
          w = y * y;
          p = ((((((((0.000124818987 * w
            - 0.001075204047) * w
            + 0.005198775019) * w
            - 0.019198292004) * w + 0.059054035642) * w
            - 0.151968751364) * w + 0.319152932694) * w
            - 0.531923007300) * w + 0.797884560593) * y * 2.0;
        }
        else
        {
          y = y - 2.0;
          p = (((((((((((((-0.000045255659 * y
            + 0.000152529290) * y - 0.000019538132) * y
            - 0.000676904986) * y + 0.001390604284) * y
            - 0.000794620820) * y - 0.002034254874) * y
            + 0.006549791214) * y - 0.010557625006) * y
            + 0.011630447319) * y - 0.009279453341) * y
            + 0.005353579108) * y - 0.002141268741) * y
            + 0.000535310849) * y + 0.999936657524;
        }
      }

      if (z > 0.0)
        return (p + 1.0) / 2;
      else
        return (1.0 - p) / 2;
    } // Gauss()

    public static void ShowVector(int[] vector)
    {
      for (int i = 0; i < vector.Length; ++i)
        Console.Write(vector[i].ToString() + "  ");
      Console.WriteLine();
    }

    public static void ShowVector(double[] vector)
    {
      for (int i = 0; i < vector.Length; ++i)
        Console.Write(vector[i].ToString("F1") + "  ");
      Console.WriteLine();
    }

  }  // Program
}  // ns

After the template code loaded into the VS editor, I right-clicked on file Program.cs in the Solution Explorer window and renamed it to ChiSquareProgram.cs. When prompted, I allowed VS to automatically rename class Program. At the top of the template code, I removed all using statements except the one for the top-level System namespace.

The Main() method starts with:

static void Main(string[] args)
{
  Console.WriteLine("Begin chi-square test ");
  int[] observed = new int[] { 529, 241, 230 }; 
  double[] expected = new double[] { 500.0, 250.0, 250.0 };
  Console.Write("Observed counts: ");
  ShowVector(observed);
  Console.Write("Expected counts: ");
  ShowVector(expected);
. . .

The observed and expected counts are hard-coded. In a non-demo scenario, these values would likely come from another system that is being monitored. The program-defined, overloaded ShowVector functions are just for convenience.

The heart of the program is made up of five statements:

  double[] result = ChiSqTest(observed, expected);
  double chiStat = result[0];
  double pVal = result[1];
  Console.WriteLine("The chi-square statistic = " + 
    chiStat.ToString("F4"));
  Console.WriteLine("The corresponding p-val  = " + 
    pVal.ToString("F4"));

The ChiSqTest() function returns an array with two cells. Cell [0] holds the calculated chi-square statistic, and cell [1] holds the corresponding p-value. The reason for this somewhat unusual design is that it mimics the chisquare() function in the Python SciPy library.
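If mirroring the SciPy return shape is not a requirement, one alternative is a named tuple, which removes the need to remember which cell holds which value. A minimal sketch, assuming C# 7.0 or later; the ToNamed adapter and its class are hypothetical and are not part of the demo:

```csharp
using System;

public static class ChiSqTupleSketch
{
  // Hypothetical adapter: converts the demo's two-cell result
  // array into a tuple with named fields.
  public static (double Statistic, double PValue) ToNamed(double[] result)
  {
    if (result == null || result.Length != 2)
      throw new ArgumentException("Expected a two-cell array");
    return (result[0], result[1]);
  }
}
```

With the demo's function, a call might look like var (stat, p) = ChiSqTupleSketch.ToNamed(ChiSqTest(observed, expected)); so that each value gets a descriptive name at the call site.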

The Main() method ends with:

. . .
  if (pVal < 0.05)
    Console.WriteLine("Strong evidence observed is off ");
  else if (pVal < 0.10)
    Console.WriteLine("Moderate evidence observed is off ");
  else
    Console.WriteLine("Insufficient evidence observed is off ");

  Console.WriteLine("End demo ");
  Console.ReadLine();
}  // Main

The interpretation of the p-value depends on the problem. A common critical threshold for the p-value is 0.05. When the p-value is less than 0.05, there is less than a 5% chance of seeing differences this large from sampling chance alone if the observed counts really do follow the expected counts. However, you can use whatever threshold value is appropriate for the problem at hand.
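Because the threshold depends on the problem, one option is to pass the cutoffs in rather than hard-coding them. The sketch below is one possible arrangement; the function name, the parameter names, and the defaults (taken from the demo's 0.05 and 0.10) are assumptions for illustration:

```csharp
using System;

public static class PvalInterpretSketch
{
  // Maps a p-value to the demo's three messages, using
  // caller-supplied cutoffs instead of hard-coded ones.
  public static string Interpret(double pVal,
    double strong = 0.05, double moderate = 0.10)
  {
    if (double.IsNaN(pVal) || pVal < 0.0 || pVal > 1.0)
      throw new ArgumentOutOfRangeException(nameof(pVal));
    if (!(0.0 < strong && strong <= moderate && moderate < 1.0))
      throw new ArgumentException("Require 0 < strong <= moderate < 1");

    if (pVal < strong)
      return "Strong evidence observed is off";
    if (pVal < moderate)
      return "Moderate evidence observed is off";
    return "Insufficient evidence observed is off";
  }
}
```

With the default cutoffs, the demo's p-value of 0.1648 maps to the "Insufficient evidence" message. A stricter monitoring setup could call Interpret(pVal, 0.01, 0.05) instead.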

Wrapping Up
As is the case with all classical statistical techniques, it is important to understand that the results are probabilistic. For the demonstration problem, even if the calculated p-value is very small, it is still possible that the difference between the observed and expected numbers is due to random errors inherent in all real data. In other words, a small chi-square p-value means that “this data needs to be examined by a human” rather than “the observed numbers do not match the expected numbers.”

About the Author

Dr James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].

