Preparing MNIST Image Data Text Files — Visual Studio Magazine
The Data Science Lab
Preparing MNIST Image Data Text Files
Dr. James McCaffrey of Microsoft Research demonstrates how to fetch and prepare MNIST data for image recognition machine learning problems.
Many machine learning problems fall into one of three categories: tabular data prediction (such as the Iris species problem), natural language processing (such as the IMDB movie review sentiment problem), and image recognition (such as the MNIST handwritten digits problem). This article describes how to fetch and prepare MNIST data.
The MNIST (Modified National Institute of Standards and Technology) data consists of 60,000 training images and 10,000 test images. Each image is a crude 28 x 28 (784 pixels) handwritten digit from “0” to “9”. Each pixel value is a grayscale integer between 0 and 255.
Most popular neural network libraries, including PyTorch, scikit, and Keras, have some form of built-in MNIST dataset designed to work with the library. But there are two problems with using a built-in dataset. First, data access becomes a magic black box and important information is hidden. Second, the built-in datasets use all 60,000 training images and all 10,000 test images, which are awkward to work with because they are so large.
A good way to see where this article is headed is to take a look at the screenshot of a Python language program in Figure 1. The source MNIST data files are stored in a proprietary binary format. The program loads the binary pixel and label training files into memory, converts the data to tab-delimited text, and saves just the first 1,000 training images and their “0” to “9” labels. To verify the generated training file, the demo program reads the first training image into memory and displays that image, a “5” digit, in the shell and graphically.
This article assumes that you have intermediate or better knowledge of a C-family programming language, preferably Python, but does not assume that you know anything about the MNIST dataset. The full source code for the demo program is shown in this article, and the code is also available in the companion file download.
Obtaining Source Data Files
The primary repository for the MNIST binary data files is the MNIST Database of Handwritten Digits site (yann.lecun.com/exdb/mnist/), but you can also find the files elsewhere by searching the Internet. The site has links to four GNU zip compressed files:
train-images-idx3-ubyte.gz
train-labels-idx1-ubyte.gz
t10k-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte.gz
The first two files hold the pixel values and the associated labels for the 60,000 training items. The second two files hold the 10,000 test items. If you click on a link, you can download the associated file. I suggest downloading to a directory named ZippedBinary. Unlike .zip compressed files, Windows cannot extract .gz files, so you need a utility application. I recommend the 7-Zip utility. After installing 7-Zip, you can open Windows File Explorer, right-click on each .gz file, and select the Extract Files option. I suggest extracting to a directory named UnzippedBinary and adding a .bin extension to the unzipped files.
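If installing a separate utility is not convenient, the decompression step can also be done with Python's standard gzip module. The sketch below is only one possible approach, not part of the demo program. The function names gunzip_file() and gunzip_mnist() are hypothetical, and the directory names follow the suggestions above. Because the demo program refers to files named like train-images.idx3-ubyte.bin (a dot before "idx"), the sketch renames the output files to match; that renaming rule is an assumption based on the paths used in the demo.

```python
import gzip
import shutil
from pathlib import Path

def gunzip_file(src_gz, dest_path):
    """Decompress one .gz file to dest_path, streaming in chunks."""
    dest_path = Path(dest_path)
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(src_gz, "rb") as f_in, open(dest_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest_path

def gunzip_mnist(src_dir="ZippedBinary", dest_dir="UnzippedBinary"):
    """Decompress every .gz file in src_dir into dest_dir."""
    results = []
    for gz in sorted(Path(src_dir).glob("*.gz")):
        # assumption: match the names used by the demo program, e.g.
        # "train-images-idx3-ubyte.gz" -> "train-images.idx3-ubyte.bin"
        out_name = gz.name[:-3].replace("-idx", ".idx") + ".bin"
        results.append(gunzip_file(gz, Path(dest_dir) / out_name))
    return results

if __name__ == "__main__":
    for p in gunzip_mnist():
        print("wrote", p)
```

The files are streamed with shutil.copyfileobj() rather than read in one call, so the roughly 47 MB training pixel file never has to sit in memory all at once.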
Converting the Binary Files to Text Files
The complete converter_mnist.py data preparation program is presented in Listing 1. The program begins by importing two libraries:
import numpy as np
import matplotlib.pyplot as plt
Both libraries are used to display data but are not needed to convert binary to text. The demo uses three program-defined functions: main(), convert(), and display_from_file(). The main() function is:
def main():
    n_images = 1000
    print("Creating %d MNIST train images from binary files " % n_images)
    convert(".\\UnzippedBinary\\train-images.idx3-ubyte.bin",
            ".\\UnzippedBinary\\train-labels.idx1-ubyte.bin",
            "mnist_train_1000.txt", 1000)

    print("Showing train image [0]: ")
    img_file = ".\\mnist_train_1000.txt"
    display_from_file(img_file, idx=0)  # first image
The convert() function accepts the unzipped binary pixels file, the unzipped binary labels file, a destination text file, and the number of images to convert.
Listing 1: Program to Convert MNIST Binary Data to Text
# converter_mnist.py

import numpy as np
import matplotlib.pyplot as plt

# convert MNIST binary to text file; combine pixels and labels
# target format:
# pixel_1 (tab) pixel_2 (tab) . . pixel_784 (tab) digit

# 1. manually download four zipped-binary files from
#    yann.lecun.com/exdb/mnist/
# 2. use 7-Zip to unzip files, add ".bin" extension
# 3. determine format you want and modify script

def convert(img_file, label_file, txt_file, n_images):
    print("\nOpening binary pixels and labels files ")
    lbl_f = open(label_file, "rb")   # labels (digits)
    img_f = open(img_file, "rb")     # pixel values
    print("Opening destination text file ")
    txt_f = open(txt_file, "w")      # output to write to

    print("Discarding binary pixel and label headers ")
    img_f.read(16)   # discard header info
    lbl_f.read(8)    # discard header info

    print("\nReading binary files, writing to text file ")
    print("Format: 784 pixels then labels, tab delimited ")
    for i in range(n_images):   # number requested
        lbl = ord(lbl_f.read(1))  # Unicode, one byte
        for j in range(784):  # get 784 pixel vals
            val = ord(img_f.read(1))
            txt_f.write(str(val) + "\t")
        txt_f.write(str(lbl) + "\n")
    img_f.close(); txt_f.close(); lbl_f.close()
    print("\nDone ")

def display_from_file(txt_file, idx):
    all_data = np.loadtxt(txt_file, delimiter="\t",
        usecols=range(0,785), dtype=np.int64)

    x_data = all_data[:,0:784]  # all rows, 784 cols
    y_data = all_data[:,784]    # all rows, last col

    label = y_data[idx]
    print("digit = ", str(label), "\n")

    pixels = x_data[idx]
    pixels = pixels.reshape((28,28))
    for i in range(28):
        for j in range(28):
            # print("%.2X" % pixels[i,j], end="")
            print("%3d" % pixels[i,j], end="")
            print(" ", end="")
        print("")

    plt.tight_layout()
    plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
    plt.show()

def main():
    n_images = 1000
    print("\nCreating %d MNIST train images from binary files " % n_images)
    convert(".\\UnzippedBinary\\train-images.idx3-ubyte.bin",
            ".\\UnzippedBinary\\train-labels.idx1-ubyte.bin",
            "mnist_train_1000.txt", 1000)

    # n_images = 100
    # print("\nCreating %d MNIST test images from binary files " % n_images)
    # convert(".\\UnzippedBinary\\t10k-images.idx3-ubyte.bin",
    #         ".\\UnzippedBinary\\t10k-labels.idx1-ubyte.bin",
    #         "mnist_test_100.txt", 100)

    print("\nShowing train image [0]: ")
    img_file = ".\\mnist_train_1000.txt"
    display_from_file(img_file, idx=0)  # first image

if __name__ == "__main__":
    main()
The organization of the MNIST source binaries is somewhat unusual. Storing features (the pixel predictor values) and labels (the digit to predict) in separate files, rather than together in a single file, was common in the 1990s when computers had limited memory.
Converting MNIST Binary to Text
The convert() function begins by opening two source binary files for reading and a target text file for writing:
def convert(img_file, label_file, txt_file, n_images):
    lbl_f = open(label_file, "rb")   # labels (digits)
    img_f = open(img_file, "rb")     # pixel values
    txt_f = open(txt_file, "w")      # output file
The “rb” argument means “read binary” and “w” means write (text mode is the default). Then the header information is consumed and discarded:
    img_f.read(16)   # discard header
    lbl_f.read(8)    # discard header
The read() method accepts the number of bytes to read. The pixel file has a 16 byte header and the label file has an 8 byte header. The main processing loop is:
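The demo simply discards both headers. A slightly more defensive variant could decode them first and confirm the files are what they claim to be. The sketch below assumes the IDX layout documented alongside the dataset: every header field is a big-endian 32-bit integer, the pixel file header holds a magic number of 2051 followed by the image count, row count, and column count, and the label file header holds a magic number of 2049 followed by the label count. The helper name read_idx_headers() is hypothetical and is not part of the demo program.

```python
import struct

def read_idx_headers(img_f, lbl_f):
    """Read and validate both headers.

    Leaves each file positioned at its first data byte, exactly as the
    demo's img_f.read(16) and lbl_f.read(8) calls do.
    """
    # pixel file: magic, image count, rows, cols (4 x 4 bytes = 16 bytes)
    magic_i, n_img, rows, cols = struct.unpack(">IIII", img_f.read(16))
    # label file: magic, label count (2 x 4 bytes = 8 bytes)
    magic_l, n_lbl = struct.unpack(">II", lbl_f.read(8))

    if magic_i != 2051 or magic_l != 2049:
        raise ValueError("unexpected magic numbers: %d, %d" % (magic_i, magic_l))
    if n_img != n_lbl:
        raise ValueError("image count %d != label count %d" % (n_img, n_lbl))
    return n_img, rows, cols
```

Inside convert(), the two read() calls could be replaced by a single call such as n_total, rows, cols = read_idx_headers(img_f, lbl_f), which also makes it possible to reject a request for more images than a file contains.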
    for i in range(n_images):   # number images
        lbl = ord(lbl_f.read(1))  # get label
        for j in range(784):  # get 784 pixel values
            val = ord(img_f.read(1))
            txt_f.write(str(val) + "\t")
        txt_f.write(str(lbl) + "\n")
    img_f.close(); txt_f.close(); lbl_f.close()
Because each label value is between “0” and “9” and each pixel value is between 0 and 255, every label and every pixel is stored as a single byte. The Python ord() function converts a value stored as one byte into an integer. This loop is the main customization point. The demo output format is one image per line, where the 784 pixel values come first and the label/digit is the last value on the line. Values are delimited by tabs.
You should have no problem modifying this code to create MNIST text data with any format. For example, you might want to use a comma instead of a tab as a delimiter, or you might want to put the label/number as the first value on each line, followed by the 784 pixel values.
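As an illustration of that kind of modification, here is a minimal sketch of a comma-delimited, label-first variant. The function name convert_label_first() and the delim parameter are hypothetical. The sketch also reads each image as one 784-byte block and indexes the bytes object directly, which yields integers in Python 3, instead of calling ord() on one byte at a time. That is a swapped-in technique, not the demo's approach.

```python
def convert_label_first(img_file, label_file, txt_file, n_images, delim=","):
    """Write one image per line: label first, then the 784 pixel values."""
    with open(img_file, "rb") as img_f, \
         open(label_file, "rb") as lbl_f, \
         open(txt_file, "w") as txt_f:
        img_f.read(16)   # discard pixel-file header
        lbl_f.read(8)    # discard label-file header
        for i in range(n_images):
            lbl_byte = lbl_f.read(1)    # one label byte
            pixels = img_f.read(784)    # one whole image
            if len(lbl_byte) != 1 or len(pixels) != 784:
                raise ValueError("ran out of data at image %d" % i)
            # iterating over a bytes object yields ints in 0..255
            fields = [str(lbl_byte[0])] + [str(p) for p in pixels]
            txt_f.write(delim.join(fields) + "\n")
```

Using with blocks closes all three files even if an error occurs partway through. A reader for this format would then take the label from column 0 and the pixels from columns 1 through 784.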
Viewing MNIST Data
Once the MNIST data has been stored in a text file, it is useful to view it to verify that the data has been correctly converted and saved. The program-defined display_from_file() function assumes that the MNIST data has been stored one image per line, tab delimited, with pixels first then label. The function definition begins:
def display_from_file(txt_file, idx):
    all_data = np.loadtxt(txt_file, delimiter="\t",
        usecols=range(0,785), dtype=np.int64)

    x_data = all_data[:,0:784]  # all rows, 784 cols
    y_data = all_data[:,784]    # all rows, last col
The numpy loadtxt() function reads numeric values stored as text into a two-dimensional numpy array. The x_data array holds all the pixel values and the y_data array holds all the labels. Then the specified pixels and label are extracted:
    label = y_data[idx]
    pixels = x_data[idx]
    pixels = pixels.reshape((28,28))
The 784 pixel values are reshaped into a 28 x 28 two-dimensional array in preparation for sending to the imshow() (“image show”) function of the matplotlib library. But first, the 784 pixel values are displayed in the command shell:
    for i in range(28):
        for j in range(28):
            print("%3d" % pixels[i,j], end="")
            print(" ", end="")
        print("")
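The shell display works because reshape() fills the 28 x 28 grid row by row, so the first 28 values on a text line become the top row of the image. The small check below illustrates that ordering using a stand-in array of the values 0 through 783 rather than real pixel data. The variable names are only for illustration.

```python
import numpy as np

flat = np.arange(784)           # stand-in for one image's 784 pixel values
grid = flat.reshape((28, 28))   # row-major: first 28 values become row 0

# the value at row i, column j comes from position i * 28 + j in the flat data
i, j = 3, 5
assert grid[i, j] == flat[i * 28 + j]

# flattening the grid again restores the original order
assert np.array_equal(grid.reshape(784), flat)
```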
And then the pixel values are displayed graphically:
    plt.tight_layout()
    plt.imshow(pixels, cmap=plt.get_cmap('gray_r'))
    plt.show()
The “gray_r” argument of the get_cmap() method means “inverted grayscale” where 0 values are displayed in white, 255 values are displayed in black, and intermediate values vary in shades of gray.
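Viewing one image is a good spot check, but it says nothing about the other lines in the file. A complementary structural check is sketched below, assuming the tab-delimited, pixels-then-label format that the demo writes. The function name check_mnist_text() is hypothetical. It uses only the standard library and returns the number of rows along with a count of how often each digit appears.

```python
from collections import Counter

def check_mnist_text(txt_file, delim="\t"):
    """Return (n_rows, label_counts); raise ValueError on a malformed line."""
    counts = Counter()
    n_rows = 0
    with open(txt_file, "r") as f:
        for line_num, line in enumerate(f, start=1):
            # int() raises ValueError on any non-numeric field
            vals = [int(v) for v in line.rstrip("\n").split(delim)]
            if len(vals) != 785:
                raise ValueError("line %d has %d values, expected 785"
                                 % (line_num, len(vals)))
            pixels, label = vals[:784], vals[784]
            if min(pixels) < 0 or max(pixels) > 255:
                raise ValueError("line %d has a pixel outside 0-255" % line_num)
            if not 0 <= label <= 9:
                raise ValueError("line %d has label %d" % (line_num, label))
            counts[label] += 1
            n_rows += 1
    return n_rows, counts
```

For the file created by the demo, the returned row count should equal the number of images requested, and the label counts give a quick view of how balanced the 1,000-item subset is.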
Using MNIST Data in a PyTorch Program
Once the MNIST data is saved as a text file, it is possible to code a PyTorch Dataset class to read the data and send it to a DataLoader object for training. One possible Dataset implementation is shown in Listing 2. The Dataset assumes that the MNIST data is in the format described in this article.
Listing 2: Using MNIST Data
import numpy as np
import torch as T
device = T.device('cpu')

class MNIST_Dataset(T.utils.data.Dataset):
    # 784 tab-delim pixel values (0-255) then label (0-9)
    def __init__(self, src_file):
        all_xy = np.loadtxt(src_file, usecols=range(785),
            delimiter="\t", comments="#", dtype=np.float32)

        tmp_x = all_xy[:, 0:784]  # all rows, cols [0,783]
        tmp_x /= 255              # scale pixels to [0.0, 1.0]
        tmp_x = tmp_x.reshape(-1, 1, 28, 28)  # (N, channels, rows, cols)
        tmp_y = all_xy[:, 784]    # all rows, last col is the label

        self.x_data = T.tensor(tmp_x, dtype=T.float32).to(device)
        self.y_data = T.tensor(tmp_y, dtype=T.int64).to(device)

    def __len__(self):
        return len(self.x_data)

    def __getitem__(self, idx):
        lbl = self.y_data[idx]     # label for this image
        pixels = self.x_data[idx]  # shape (1, 28, 28)
        return (pixels, lbl)
With this Dataset definition, the MNIST data can be accessed with code like this:
train_ds = MNIST_Dataset(".\\mnist_train_1000.txt")
train_ldr = T.utils.data.DataLoader(train_ds,
    batch_size=2, shuffle=False)
for (batch_idx, batch) in enumerate(train_ldr):
    (X, y) = batch
    print(X)  # pixels
    print(y)  # label/digit
    # training code here
In a non-demo scenario, you would set the shuffle parameter to True instead of False.
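The preprocessing inside the MNIST_Dataset constructor (scaling the pixels, reshaping them, and splitting off the labels) can be checked in isolation with NumPy alone, before PyTorch is involved. The sketch below uses two synthetic rows instead of real MNIST data, and the helper name prepare_arrays() is hypothetical.

```python
import numpy as np

def prepare_arrays(all_xy):
    """Split an (N, 785) array into scaled pixels and integer labels."""
    x = all_xy[:, 0:784] / 255.0          # scale 0-255 to 0.0-1.0 (makes a copy)
    x = x.reshape(-1, 1, 28, 28)          # (N, channels=1, rows, cols)
    y = all_xy[:, 784].astype(np.int64)   # labels as integers
    return x.astype(np.float32), y

# two synthetic rows: an all-255 image labeled 5, an all-zero image labeled 0
fake = np.zeros((2, 785), dtype=np.float32)
fake[0, 0:784] = 255
fake[0, 784] = 5

x, y = prepare_arrays(fake)
print(x.shape, x.min(), x.max(), y)
```

The leading -1 in reshape() lets NumPy infer the number of images, and the extra dimension of size 1 is the single grayscale channel that convolutional layers generally expect.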
Wrapping Up
The MNIST dataset is the “Hello World” of machine learning image recognition. Once you understand how to work with MNIST data, it is possible to create and train a convolutional neural network (CNN) to recognize handwritten digits. This will be the topic of the next Visual Studio Magazine Data Science Lab article.
The Fashion-MNIST dataset is closely related to the MNIST data. Fashion-MNIST has 60,000 training images and 10,000 test images, where each image is a 28 x 28 grayscale image representing one of 10 types of clothing (dress, coat, shirt, and so on). The Fashion-MNIST clothing images are more difficult to classify than the MNIST digit images.
The MNIST and Fashion-MNIST datasets are relatively simple because they use grayscale values (one channel). Working with color images (RGB, three channels) is more difficult. The Hello World of color images is the CIFAR-10 dataset. CIFAR-10 has 50,000 training images and 10,000 test images, each 32 x 32 pixels, in 10 classes: airplane, car, bird, cat, deer, dog, frog, horse, ship, and truck.
About the Author
Dr. James McCaffrey works for Microsoft Research in Redmond, Washington. He has worked on several Microsoft products including Azure and Bing. James can be reached at [email protected].