Preparing MNIST Image Data Text Files — Visual Studio Magazine

The Data Science Lab

Preparing MNIST Image Data Text Files

Dr. James McCaffrey of Microsoft Research demonstrates how to fetch and prepare MNIST data for image recognition machine learning problems.

Many machine learning problems fall into one of three categories: tabular data prediction (such as the Iris species problem), natural language processing (such as the IMDB movie review sentiment problem), and image recognition (such as the MNIST handwritten digits problem). This article explains how to fetch and prepare MNIST data.

The MNIST (Modified National Institute of Standards and Technology) dataset consists of 60,000 training images and 10,000 test images. Each image is a crude 28 x 28 (784 pixels) handwritten digit from “0” to “9”. Each pixel value is a grayscale integer between 0 and 255.

Most popular neural network libraries, including PyTorch, scikit-learn, and Keras, have some form of built-in MNIST dataset designed to work with the library. But there are two problems with using a built-in dataset. First, data access becomes a magic black box and important information is hidden. Second, the built-in datasets use all 60,000 training images and 10,000 test images, and datasets that large are awkward to work with.

Figure 1: Converting Source Binary MNIST Data to Text Files

A good way to see where this article is headed is to take a look at the screenshot of a Python language program in Figure 1. The source MNIST data files are stored in a custom binary format. The program loads the binary pixel and label training files into memory, converts the data to tab-delimited text, and saves just the first 1,000 training images and their “0” to “9” labels. To verify the generated training file, the demo program reads the first training image into memory and displays that image, a “5” digit, in the shell and graphically.

This article assumes that you have intermediate or better knowledge of a C-family programming language, preferably Python, but does not assume that you know anything about the MNIST dataset. The full source code for the demo program is shown in this article, and the code is also available in the companion file download.

Obtaining Source Data Files
The primary storage site for the binary MNIST data files is the MNIST Database of Handwritten Digits page, but you can also find the files at other locations by searching the Internet. The page has links to four GNU-zip compressed files:

  train-images-idx3-ubyte.gz
  train-labels-idx1-ubyte.gz
  t10k-images-idx3-ubyte.gz
  t10k-labels-idx1-ubyte.gz

The first two files contain the pixel values and the associated labels for the 60,000 training items. The second two files hold the 10,000 test items. If you click on a link, you can download the associated file. I suggest downloading to a directory named ZippedBinary. Unlike .zip files, .gz files cannot be extracted by Windows itself, so you must use an application. I recommend the 7-Zip utility. After installing 7-Zip, you can open Windows File Explorer, then right-click on each .gz file and select the Extract Files option. I suggest extracting to a directory named UnzippedBinary and adding a .bin extension to the unzipped files.
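If you prefer not to install a separate utility, the files can also be decompressed with the Python standard library. The following is a minimal sketch, not part of the demo program. The function name gunzip_file() is made up for illustration, and the paths in the usage comment are assumptions that you should adjust to match your own directory layout and the file names the demo expects.

```python
import gzip
import shutil
from pathlib import Path

def gunzip_file(src_gz, dst_bin):
    """Decompress one .gz file to dst_bin, creating the target directory."""
    dst = Path(dst_bin)
    dst.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(src_gz, "rb") as f_in, open(dst, "wb") as f_out:
        # copy in chunks so the whole file is never held in memory
        shutil.copyfileobj(f_in, f_out)
    return dst

# Hypothetical usage (directory and file names are assumptions):
# gunzip_file("ZippedBinary/train-images-idx3-ubyte.gz",
#             "UnzippedBinary/train-images.idx3-ubyte.bin")
```

The function takes the destination name explicitly, so you choose the .bin name yourself rather than relying on whatever name is stored inside the archive.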

Converting the Binary Files to Text Files
The complete converter_mnist.py data preparation program is presented in Listing 1. The program begins by importing two libraries:

import numpy as np
import matplotlib.pyplot as plt

Both libraries are used to display data but are not needed to convert binary to text. The demo uses three program-defined functions: main(), convert(), and display_from_file(). The main() function is:

def main():
  n_images = 1000
  print("Creating %d MNIST train images from binary files " % n_images)
  convert(".\\UnzippedBinary\\train-images.idx3-ubyte.bin",
          ".\\UnzippedBinary\\train-labels.idx1-ubyte.bin",
          "mnist_train_1000.txt", 1000)
  print("Showing train image [0]: ")
  img_file = ".\\mnist_train_1000.txt"
  display_from_file(img_file, idx=0)  # first image

The convert() function accepts the unzipped binary pixel file, the unzipped binary label file, a destination text file, and the number of images to convert.

Listing 1: Program to Convert MNIST Binary Data to Text

# converter_mnist.py

import numpy as np
import matplotlib.pyplot as plt

# convert MNIST binary to text file; combine pixels and labels
# target format:
# pixel_1 (tab) pixel_2 (tab) . . pixel_784 (tab) digit

# 1. manually download four zipped-binary files from
#    yann.lecun.com/exdb/mnist/ 
# 2. use 7-Zip to unzip files, add ".bin" extension
# 3. determine format you want and modify script

def convert(img_file, label_file, txt_file, n_images):
  print("\nOpening binary pixels and labels files ")
  lbl_f = open(label_file, "rb")   # labels (digits)
  img_f = open(img_file, "rb")     # pixel values
  print("Opening destination text file ")
  txt_f = open(txt_file, "w")      # output to write to

  print("Discarding binary pixel and label headers ")
  img_f.read(16)   # discard header info
  lbl_f.read(8)    # discard header info

  print("\nReading binary files, writing to text file ")
  print("Format: 784 pixels then labels, tab delimited ")
  for i in range(n_images):   # number requested 
    lbl = ord(lbl_f.read(1))  # one byte -> integer 0-9
    for j in range(784):  # get 784 pixel vals
      val = ord(img_f.read(1))
      txt_f.write(str(val) + "\t")
    txt_f.write(str(lbl) + "\n")
  img_f.close(); txt_f.close(); lbl_f.close()
  print("\nDone ")

def display_from_file(txt_file, idx):
  all_data = np.loadtxt(txt_file, delimiter="\t",
    usecols=range(0,785), dtype=np.int64)

  x_data = all_data[:,0:784]  # all rows, 784 cols
  y_data = all_data[:,784]    # all rows, last col

  label = y_data[idx]
  print("digit = ", str(label), "\n")

  pixels = x_data[idx]
  pixels = pixels.reshape((28,28))
  for i in range(28):
    for j in range(28):
      # print("%.2X" % pixels[i,j], end="")
      print("%3d" % pixels[i,j], end="")
      print(" ", end="")
    print("")

  plt.tight_layout()
  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
  plt.show()  

def main():
  n_images = 1000
  print("\nCreating %d MNIST train images from binary files "
    % n_images)
  convert(".\\UnzippedBinary\\train-images.idx3-ubyte.bin",
          ".\\UnzippedBinary\\train-labels.idx1-ubyte.bin",
          "mnist_train_1000.txt", 1000)

  # n_images = 100
  # print("\nCreating %d MNIST test images from binary files " %
  #   n_images)
  # convert(".\\UnzippedBinary\\t10k-images.idx3-ubyte.bin",
  #         ".\\UnzippedBinary\\t10k-labels.idx1-ubyte.bin",
  #         "mnist_test_100.txt", 100)

  print("\nShowing train image [0]: ")
  img_file = ".\\mnist_train_1000.txt"
  display_from_file(img_file, idx=0)  # first image

if __name__ == "__main__":
  main()

The organization of the source MNIST binary files is somewhat unusual. Storing the features (the pixel predictor values) and the labels (the digit to predict) in separate files, rather than together in a single file, was common in the 1990s when computers had limited memory.
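Each source binary file begins with a short header, which the demo program skips (16 bytes for the pixel file, 8 bytes for the label file). As a sketch, the header can be decoded rather than discarded, which gives a quick check that a file was unzipped correctly. In the IDX format the header is a 4-byte magic number followed by one 4-byte size per dimension, all big-endian. The helper name read_idx_header() is made up for illustration and is not part of the demo.

```python
import struct

def read_idx_header(f, n_dims):
    """Read an IDX header from an open binary file.

    Returns (magic, dims). Every value is a big-endian unsigned
    32-bit integer, so the header is 4 + 4 * n_dims bytes long.
    """
    magic = struct.unpack(">I", f.read(4))[0]
    dims = struct.unpack(">" + "I" * n_dims, f.read(4 * n_dims))
    return magic, dims

# Hypothetical usage with the training files:
# with open(img_file, "rb") as img_f:
#     magic, (count, rows, cols) = read_idx_header(img_f, 3)  # 16 bytes
# with open(label_file, "rb") as lbl_f:
#     magic, (count,) = read_idx_header(lbl_f, 1)             # 8 bytes
```

For the MNIST files, the magic number should be 2051 for a pixel file and 2049 for a label file. After the call, the file position sits at the first data byte, exactly where the demo's read(16) and read(8) calls leave it.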

Converting MNIST Binary to Text
The convert() function begins by opening two source binary files for reading and a target text file for writing:

def convert(img_file, label_file, txt_file, n_images):
  lbl_f = open(label_file, "rb")   # labels (digits)
  img_f = open(img_file, "rb")     # pixel values
  txt_f = open(txt_file, "w")      # output file

The “rb” argument means “read binary” and the “w” argument means write (text mode by default). Next, the header information is consumed and discarded:

  img_f.read(16)   # discard header
  lbl_f.read(8)    # discard header

The read() method accepts the number of bytes to read. The pixel file has a 16 byte header and the label file has an 8 byte header. The main processing loop is:

  for i in range(n_images):   # number images 
    lbl = ord(lbl_f.read(1))  # get label
    for j in range(784):  # get 784 pixel values
      val = ord(img_f.read(1))
      txt_f.write(str(val) + "\t")
    txt_f.write(str(lbl) + "\n")
  img_f.close(); txt_f.close(); lbl_f.close()

Because each label value is between 0 and 9 and each pixel value is between 0 and 255, every label and every pixel is stored as a single byte. The Python ord() function converts a value stored as a single byte into an integer. This loop is the main customization point. The demo output format writes one image per line, where the 784 pixel values come first and the label/digit is the last value on the line. The values are tab-delimited.

You should have no trouble modifying this code to create MNIST text data in any format you like. For example, you might want to use a comma instead of a tab as the delimiter, or you might want to put the label/digit as the first value on each line, followed by the 784 pixel values.
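As an illustration of that kind of customization, here is a sketch of a variant that writes the label first and uses a configurable delimiter (a comma by default). The function name convert_label_first() and the delim parameter are made up for this example. It also uses a with statement and reads a whole image at a time, which is a stylistic choice rather than the demo's approach.

```python
def convert_label_first(img_file, label_file, txt_file, n_images, delim=","):
    """Variant of convert(): label first, then 784 pixels, joined by delim."""
    with open(img_file, "rb") as img_f, \
         open(label_file, "rb") as lbl_f, \
         open(txt_file, "w") as txt_f:
        img_f.read(16)  # discard pixel-file header
        lbl_f.read(8)   # discard label-file header
        for _ in range(n_images):
            lbl = lbl_f.read(1)[0]    # indexing a bytes object yields an int
            pixels = img_f.read(784)  # one whole image (784 bytes)
            fields = [str(lbl)] + [str(p) for p in pixels]
            txt_f.write(delim.join(fields) + "\n")

# Hypothetical usage (output file name is an assumption):
# convert_label_first(".\\UnzippedBinary\\train-images.idx3-ubyte.bin",
#                     ".\\UnzippedBinary\\train-labels.idx1-ubyte.bin",
#                     "mnist_train_1000.csv", 1000)
```

Note that a file written this way no longer matches the layout that display_from_file() expects, so any reader code has to be adjusted to use the same delimiter and column order.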

Viewing MNIST Data
Once the MNIST data has been stored in a text file, it is useful to view it to verify that the data has been correctly converted and saved. The program-defined display_from_file() function assumes that the MNIST data has been stored one image per line, tab delimited, with pixels first then label. The function definition begins:

def display_from_file(txt_file, idx):
  all_data = np.loadtxt(txt_file, delimiter="\t",
    usecols=range(0,785), dtype=np.int64)
  x_data = all_data[:,0:784]  # all rows, 784 cols
  y_data = all_data[:,784]    # all rows, last col

The numpy loadtxt() function reads numeric values stored as text into a two-dimensional numpy array. The x_data array holds all the pixel values and the y_data array holds all the labels. Next, the specified pixels and label are extracted:

  label = y_data[idx]
  pixels = x_data[idx]
  pixels = pixels.reshape((28,28))

The 784 pixel values are reshaped into a 28 x 28 two-dimensional array in preparation for sending to the matplotlib library's imshow() (“image show”) function. But first, the 784 pixel values are displayed in the command shell:

  for i in range(28):
    for j in range(28):
      print("%3d" % pixels[i,j], end="")
      print(" ", end="")
    print("")

And then the pixel values are displayed graphically:

  plt.tight_layout()
  plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
  plt.show()

The “gray_r” argument to the get_cmap() function means “reversed grayscale”, where 0 values are displayed as white, 255 values are displayed as black, and intermediate values are shades of gray.
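If matplotlib is not available, a rough text rendering can serve the same sanity-check purpose. The following sketch uses only built-in Python features and assumes the demo's file format (784 tab-delimited pixels, then the label). The function name render_ascii() and the threshold parameter are made up for illustration.

```python
def render_ascii(line, threshold=128):
    """Render one tab-delimited line (784 pixels then label) as 28 text rows."""
    fields = line.rstrip("\n").split("\t")
    pixels = [int(v) for v in fields[:784]]
    label = int(fields[784])
    rows = []
    for r in range(28):
        # row-major layout: pixel (r, c) is stored at index r * 28 + c
        row = pixels[r * 28:(r + 1) * 28]
        rows.append("".join("#" if v >= threshold else "." for v in row))
    return label, rows

# Hypothetical usage with the file created by the demo:
# with open("mnist_train_1000.txt") as f:
#     label, rows = render_ascii(f.readline())
# print("digit =", label)
# print("\n".join(rows))
```

Pixels at or above the threshold print as "#" and all others as ".", so the shape of the digit should be recognizable in the shell even without a grayscale display.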

Using MNIST Data in a PyTorch Program
After the MNIST data has been saved as a text file, you can code a PyTorch Dataset class to read the data and send it to a DataLoader object for training. One possible Dataset implementation is presented in Listing 2. The Dataset assumes the MNIST data is in the format described in this article.

Listing 2: Using MNIST Data

import numpy as np
import torch as T
device = T.device('cpu')

class MNIST_Dataset(T.utils.data.Dataset):
  # 784 tab-delim pixel values (0-255) then label (0-9)
  def __init__(self, src_file):
    all_xy = np.loadtxt(src_file, usecols=range(785),
      delimiter="\t", comments="#", dtype=np.float32)

    tmp_x = all_xy[:, 0:784]  # all rows, cols [0,783]
    tmp_x /= 255              # scale pixels to [0.0, 1.0]
    tmp_x = tmp_x.reshape(-1, 1, 28, 28)  # (n, channel, h, w)
    tmp_y = all_xy[:, 784]    # all rows, last col

    self.x_data = \
      T.tensor(tmp_x, dtype=T.float32).to(device)
    self.y_data = \
      T.tensor(tmp_y, dtype=T.int64).to(device)

  def __len__(self):
    return len(self.x_data)

  def __getitem__(self, idx):
    lbl = self.y_data[idx]     # label (digit)
    pixels = self.x_data[idx]
    return (pixels, lbl)

With this Dataset definition, the MNIST data can be traversed using code like this:

train_ds = MNIST_Dataset(".\\mnist_train_1000.txt")
train_ldr = T.utils.data.DataLoader(train_ds,
  batch_size=2, shuffle=False)
for (batch_idx, batch) in enumerate(train_ldr):
  (X, y) = batch
  print(X)  # pixels
  print(y)  # label/digit
  # training code here

In a non-demo scenario, you would set the shuffle parameter to True instead of False.

Wrapping Up
The MNIST dataset is the “Hello World” of machine learning image recognition. Once you understand how to work with MNIST data, it is possible to create and train a convolutional neural network (CNN) to recognize handwritten digits. This will be the topic of the next Visual Studio Magazine Data Science Lab article.

The Fashion-MNIST dataset is closely related to the MNIST data. Fashion-MNIST has 60,000 training images and 10,000 test images, where each image is a 28 x 28 grayscale image representing one of 10 types of clothing (dress, coat, shirt, etc.). The Fashion-MNIST clothing images are more difficult to classify than the MNIST digit images.

The MNIST and Fashion-MNIST datasets are relatively simple because they use grayscale values (one channel). Working with color images (RGB, three channels) is more difficult. The Hello World of color images is the CIFAR-10 dataset. CIFAR-10 contains 50,000 training images and 10,000 test images, each 32 x 32, in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck.

About the Author


Dr. James McCaffrey works for Microsoft Research in Redmond, Washington. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].


