

The Data Science Lab

Anomaly Detection Using Principal Component Analysis (PCA)

The main advantage of using PCA for anomaly detection, compared to alternative techniques such as a neural autoencoder, is simplicity, assuming you have a function that computes eigenvalues and eigenvectors.

Principal component analysis (PCA) is a classic statistical technique that decomposes a data matrix into vectors called principal components. The principal components can be used for several different purposes. One way to use them is to examine a set of data items for anomalous items using reconstruction error. Briefly, the idea is to decompose the source data matrix into its principal components and then reconstruct the original data using only the first few principal components. The reconstructed data will be similar to, but not exactly the same as, the original data. The reconstructed data items that are most different from their corresponding original items are the anomalous items.

Figure 1: Anomaly Detection Using PCA in Action

A good way to see where this article is headed is to take a look at the screenshot of the demo program shown in Figure 1. The demo sets up a dummy dataset of six items:

 [ 5.1  3.5  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 6.4  3.2  4.5  1.5]
 [ 5.7  2.8  4.5  1.3]
 [ 7.2  3.6  6.1  2.5]
 [ 6.9  3.2  5.7  2.3]

Each data item has four elements. The demo normalizes the data so that elements with large magnitudes don't dominate elements with small magnitudes:

 [ 0.6375  0.8750  0.2000  0.0667]
 [ 0.6750  0.9750  0.2429  0.1333]
 [ 0.8000  0.8000  0.6429  0.5000]
 [ 0.7125  0.7000  0.6429  0.4333]
 [ 0.9000  0.9000  0.8714  0.8333]
 [ 0.8625  0.8000  0.8143  0.7667]

The demo applies principal component analysis to the normalized data, which results in four principal components. The first two of the four principal components are used to reconstruct the data:

 [ 0.6348  0.8822  0.2125  0.0571]
 [ 0.6813  0.9691  0.2343  0.1383]
 [ 0.7761  0.8024  0.6379  0.5125]
 [ 0.7266  0.6919  0.6332  0.4366]
 [ 0.9081  0.8936  0.8626  0.8379]
 [ 0.8606  0.8108  0.8338  0.7510]

The reconstructed data is compared to the original data by computing the sum of squared errors between corresponding elements. The six error values are (0.00031, 0.00017, 0.00076, 0.00037, 0.00021, 0.00075).

Error value [2] is the largest reconstruction error (0.00076), and therefore data item [2], which is (6.4, 3.2, 4.5, 1.5), is the most anomalous.

This article assumes you have intermediate or better skill with a C-family programming language, but doesn't assume you know anything about principal component analysis. The demo program is implemented using Python, but you should be able to refactor it to another language, such as C# or JavaScript, if you wish.

The full source code for the demo program is presented in this article and is also available in the accompanying download file. All normal error checking has been removed to keep the main ideas as clear as possible.

To run the demo program, you must have Python installed on your machine. The demo program was developed on Windows 10 using the 64-bit Anaconda 2020.02 distribution (which contains Python 3.7.6). The demo program has no significant dependencies, so any relatively recent version of Python 3 will work fine.

Understanding PCA for Anomaly Detection
PCA is based on decomposition. Suppose you want to decompose the integer value 64 into three components. There are many possible decompositions. One decomposition is (8, 4, 2) because 8 * 4 * 2 = 64. The first component, 8, accounts for most of the original value, the 4 accounts for less, and the 2 accounts for the least. If you use all three components to reconstruct the source, you reproduce the source exactly. But if you use only the first two components to reconstruct the source, you get a value that's close to the source: 8 * 4 = 32.

Principal component analysis is a much more complicated decomposition that works on data matrices rather than simple integer values. In my opinion, PCA is best understood by examining a concrete example, such as the demo.

The six-item source dataset consists of six arbitrary items selected from the well-known 150-item Iris dataset. Each data item represents an iris flower and has four elements: sepal length and width (a sepal is a leaf-like structure), and petal length and width. PCA anomaly detection works only on strictly numeric data, which is the main limitation of the technique.

The source data are:

 [ 5.1  3.5  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 6.4  3.2  4.5  1.5]
 [ 5.7  2.8  4.5  1.3]
 [ 7.2  3.6  6.1  2.5]
 [ 6.9  3.2  5.7  2.3]

Because PCA is based on statistical variance, it's important to normalize the source data. The demo normalizes the data by dividing each of the four columns by a constant (8, 4, 7, 3) so that all values are between 0.0 and 1.0:

 [ 0.6375  0.8750  0.2000  0.0667]
 [ 0.6750  0.9750  0.2429  0.1333]
 [ 0.8000  0.8000  0.6429  0.5000]
 [ 0.7125  0.7000  0.6429  0.4333]
 [ 0.9000  0.9000  0.8714  0.8333]
 [ 0.8625  0.8000  0.8143  0.7667]

There are three PCA results: transformed data, principal components, and explained variance. The transformed data are:

 [-0.5517  0.0053  0.0173  0.0029]
 [-0.4754 -0.0993 -0.0128  0.0027]
 [ 0.0915  0.0357 -0.0013 -0.0275]
 [ 0.0314  0.1654 -0.0161  0.0105]
 [ 0.4948 -0.1040 -0.0136  0.0046]
 [ 0.4093 -0.0031  0.0265  0.0068]

Notice that the transformed data has the same shape as the original source data. The transformed data is an internal representation that can be used together with the principal components to reconstruct the original data.

The principal components are:

 [ 0.2325 -0.0822  0.6488  0.7199]
 [-0.2739 -0.8898  0.2646 -0.2516]
 [ 0.3001 -0.4363 -0.6979  0.4823]
 [-0.8837  0.1060 -0.1482  0.4312]

The principal components are stored as the rows of the matrix, so the first principal component is (0.2325, -0.0822, 0.6488, 0.7199). The number of columns in the original data is sometimes called the dimension (dim) of the problem, so dim = 4 for the demo data. Each principal component has dim elements and there are dim components. In other words, the principal components matrix has shape dim x dim.

The principal components are stored so that the first component accounts for the largest amount of statistical variance in the decomposition, the second component accounts for the second largest amount, and so on. For the demo, the fractions of total variance accounted for are (0.94828, 0.04918, 0.00160, 0.00095). This means the first principal component accounts for roughly 95 percent of the total variance, the second roughly 5 percent, and the third and fourth components account for the tiny remainder.
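
In a non-demo scenario, a common way to decide how many principal components to keep is to examine the cumulative variance explained. Here is a minimal sketch using the demo's variance-explained values; the variable names are hypothetical:

import numpy as np

ve = np.array([0.94828, 0.04918, 0.00160, 0.00095])  # variance explained, from the demo
cum_ve = np.cumsum(ve)                   # cumulative variance explained
n_comp = np.argmax(cum_ve >= 0.95) + 1   # fewest components covering 95 percent
print(cum_ve)   # [0.94828 0.99746 0.99906 1.00001] (small rounding error)
print(n_comp)   # 2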

One way to think about the principal components is that they are an alternative description, or representation, of the source data. The demo program shows that if you use all the principal components to reconstruct the data, you get back the original source data exactly. This isn't useful for anomaly detection. If you use just a few of the principal components to reconstruct the data, the reconstructed data will be close to the source data. The more principal components you use, the closer the reconstruction will be to the source. The demo uses the first two components to reconstruct the data:

 [ 0.6348  0.8822  0.2125  0.0571]
 [ 0.6813  0.9691  0.2343  0.1383]
 [ 0.7761  0.8024  0.6379  0.5125]
 [ 0.7266  0.6919  0.6332  0.4366]
 [ 0.9081  0.8936  0.8626  0.8379]
 [ 0.8606  0.8108  0.8338  0.7510]

The demo uses the sum of squared error between elements to compute a reconstruction error for each of the six data items. The six reconstruction error values are (0.00031, 0.00017, 0.00076, 0.00037, 0.00021, 0.00075).

For example, the first normalized source data item is (0.6375, 0.8750, 0.2000, 0.0667). Its reconstruction is (0.6348, 0.8822, 0.2125, 0.0571). The item error is: (0.6375 - 0.6348)^2 + (0.8750 - 0.8822)^2 + (0.2000 - 0.2125)^2 + (0.0667 - 0.0571)^2 = 0.00031.
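
As a quick sanity check, the item error can be verified directly with NumPy. Here is a minimal sketch using the values from the demo output:

import numpy as np

x  = np.array([0.6375, 0.8750, 0.2000, 0.0667])  # first normalized item
xx = np.array([0.6348, 0.8822, 0.2125, 0.0571])  # its two-component reconstruction
err = np.sum((x - xx)**2)   # sum of squared differences
print(err)                  # approximately 0.00031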

The Demo Program
The complete demo program is presented in Listing 1. The program begins by setting up the source data:

import numpy as np

def main():
  print("Begin Iris PCA reconstruction error ")
  X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
    [6.4, 3.2, 4.5, 1.5],
    [5.7, 2.8, 4.5, 1.3],
    [7.2, 3.6, 6.1, 2.5],
    [6.9, 3.2, 5.7, 2.3]])
. . .

The demo data is hard-coded. In a non-demo scenario, you would likely read the source data into memory from a text file using np.loadtxt() or a similar function.
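
For example, here is a minimal sketch, assuming a hypothetical comma-delimited text file named iris_subset.txt with one data item per line:

import numpy as np

# hypothetical file: one item per line, four comma-delimited values
X = np.loadtxt("iris_subset.txt", delimiter=",", dtype=np.float64)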

Listing 1: Complete Anomaly Detection Demo Program

# pca_recon.py
# anomaly detection using PCA reconstruction error

# Anaconda3-2020.02  Python 3.7.6  NumPy 1.19.5
# Windows 10 

import numpy as np

def my_pca(X):
  # returns transformed X, prin components, var explained
  dim = len(X[0])  # n_cols
  means = np.mean(X, axis=0)
  z = X - means  # avoid changing X
  square_m = np.dot(z.T, z)
  (evals, evecs) = np.linalg.eig(square_m) 
  trans_x = np.dot(z, evecs[:,0:dim])
  prin_comp = evecs.T 
  v = np.var(trans_x, axis=0, ddof=1)  # sample var
  sv = np.sum(v)
  ve = v / sv
  # order everything based on variance explained
  ordering = np.argsort(ve)[::-1]  # sort high to low
  trans_x = trans_x[:,ordering]
  prin_comp = prin_comp[ordering,:]
  ve = ve[ordering]
  return (trans_x, prin_comp, ve)

def reconstructed(X, n_comp, trans_x, p_comp):
  means = np.mean(X, axis=0)
  result = np.dot(trans_x[:,0:n_comp], p_comp[0:n_comp,:])
  result += means
  return result

def recon_error(X, XX):
  diff = X - XX
  diff_sq = diff * diff
  errs = np.sum(diff_sq, axis=1)
  return errs

def main():
  print("nBegin Iris PCA reconstruction error ")
  np.set_printoptions(formatter={'float': '{: 0.1f}'.format})

  X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],

    [6.4, 3.2, 4.5, 1.5],
    [5.7, 2.8, 4.5, 1.3],

    [7.2, 3.6, 6.1, 2.5],
    [6.9, 3.2, 5.7, 2.3]])

  print("nSource X: ")
  print(X)

  col_divisors = np.array([8.0, 4.0, 7.0, 3.0])
  X = X / col_divisors
  print("nNormalized X: ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print(X)  

  print("nPerforming PCA computations ")
  (trans_x, p_comp, ve) = my_pca(X)
  print("Done ")

  print("nTransformed X: ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print(trans_x)

  print("nPrincipal components: ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print(p_comp)

  print("nVariance explained: ")
  np.set_printoptions(formatter={'float': '{: 0.5f}'.format}) 
  print(ve)

  XX = reconstructed(X, 4, trans_x, p_comp)
  print("nReconstructed X using all components: ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print(XX)

  XX = reconstructed(X, 2, trans_x, p_comp)
  print("nReconstructed X using first two components: ")
  np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
  print(XX)

  re = recon_error(X, XX)
  print("nReconstruction errors using two components: ")
  np.set_printoptions(formatter={'float': '{: 0.5f}'.format})
  print(re)

  print("nEnd PCA reconstruction error ")
if __name__ == "__main__":
  main()

Next, the demo normalizes the source data by dividing each column by a constant so that all values are between 0.0 and 1.0:

  col_divisors = np.array([8.0, 4.0, 7.0, 3.0])
  X = X / col_divisors

The demo modifies the source data in place. In some scenarios you might want to create a new matrix of normalized values so that the original source data is left unchanged. Alternative normalization techniques include min-max normalization and z-score normalization. If you don't normalize the source data, the reconstruction error will be dominated by the column that has the largest magnitude values.
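
Here is a minimal sketch of those two alternatives, applied column by column with NumPy; these statements aren't part of the demo program:

  # min-max normalization: each column is scaled to [0.0, 1.0]
  mins = np.min(X, axis=0)
  maxs = np.max(X, axis=0)
  X_mm = (X - mins) / (maxs - mins)

  # z-score normalization: each column gets mean 0.0 and std dev 1.0
  means = np.mean(X, axis=0)
  stds = np.std(X, axis=0)
  X_z = (X - means) / stds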

The principal component analysis is performed by a call to a program-defined my_pca() function:

  print("Performing PCA computations ")
  (trans_x, p_comp, ve) = my_pca(X)
  print("Done ")

The returned result is a tuple with three values. The trans_x is the internal transformed data that's needed to reconstruct the data. The p_comp is the matrix of principal components, where the components are stored as rows. The ve is a vector of fractions of variance explained. The my_pca() function is implemented so that the principal components are stored in order from largest variance explained to smallest variance explained.

The demo reconstructs the data using a program-defined reconstructed() function:

  XX = reconstructed(X, 4, trans_x, p_comp)
  . . .
  XX = reconstructed(X, 2, trans_x, p_comp)

The first call to reconstructed() uses all four principal components, so the normalized source data is reconstructed exactly. The second call uses only the first two principal components, so the reconstructed data is close to, but not exactly the same as, the source data.
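
One quick way to confirm the exact-reconstruction behavior is to compare the all-components reconstruction to the normalized source data with np.allclose(). This check is hypothetical and isn't part of the demo program:

  XX_all = reconstructed(X, 4, trans_x, p_comp)
  print(np.allclose(X, XX_all))  # True: all four components recover X exactly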

The demo concludes by computing a vector of reconstruction errors, one for each data item, using a program-defined recon_error() function:

. . .
  re = recon_error(X, XX)
  print("Reconstruction errors using two components: ")
  np.set_printoptions(formatter={'float': '{: 0.5f}'.format})
  print(re)

  print("End PCA reconstruction error ")
if __name__ == "__main__":
  main()

In a non-demo scenario, you would likely sort the error values from largest to smallest so that the most anomalous data items appear first.
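
Here is a minimal sketch of that sorting, assuming the re vector of reconstruction errors computed by the demo:

  ordering = np.argsort(re)[::-1]   # item indices, largest error first
  print(ordering)                   # [2 5 3 0 4 1] for the demo data
  for i in ordering:
    print(i, re[i])                 # item index and its reconstruction error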

Behind the Scenes
The key statements of the program-defined my_pca() function are:

def my_pca(X):
  dim = len(X[0])  # number cols
  means = np.mean(X, axis=0)
  z = X - means  # avoid changing X
  square_m = np.dot(z.T, z)
  (evals, evecs) = np.linalg.eig(square_m)  # 'right-hand'
  trans_x = np.dot(z, evecs[:,0:dim]) 
  prin_comp = evecs.T
  # compute variance explained, sort everything
  . . .
  return (trans_x, prin_comp, ve)
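
The eigendecomposition approach can be cross-checked against PCA computed with a singular value decomposition of the centered data. Here is a minimal sketch, not part of the demo program; the signs of some components may be flipped relative to the eig-based results, which doesn't affect reconstruction error:

import numpy as np

def pca_via_svd(X):
  # principal components are the rows of vh; variance explained comes from s
  z = X - np.mean(X, axis=0)                 # center the data
  u, s, vh = np.linalg.svd(z, full_matrices=False)
  trans_x = np.dot(z, vh.T)                  # project data onto the components
  ve = (s * s) / np.sum(s * s)               # fraction of variance explained
  return (trans_x, vh, ve)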

