

The Data Science Lab

How to Calculate the Accuracy of a Transformer Architecture Model

Dr. James McCaffrey of Microsoft Research uses the Hugging Face library to simplify the implementation of NLP systems using Transformer Architecture (TA) models.

This article explains how to calculate the accuracy of a trained Transformer architecture model for natural language processing. Specifically, it describes how to compute the classification accuracy of a distilled BERT model that predicts the sentiment (positive or negative) of movie reviews taken from the IMDB movie review dataset.

Transformer architecture (TA) models have revolutionized natural language processing (NLP), but TA systems are extremely complex and implementing them from scratch can take hundreds or even thousands of hours of work. Hugging Face (HF) is an open source code library that provides pretrained models and a set of APIs for working with them. The HF library makes implementing NLP systems using TA models much less difficult.

You can think of a pretrained TA model as a sort of English language expert that knows things like sentence structure and synonyms. But the TA expert doesn't know anything about movies, so you provide additional training to fine-tune the model so that it understands the difference between a positive movie review and a negative review. I explained how to fine-tune and save a binary classification model in a previous article.

A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1. The demo program begins by loading a DistilBERT (distilled bidirectional encoder representations from transformers) model into memory. Then the demo loads the model with the trained weights and biases that were previously saved to file. Next, the demo loads test movie review data that has the same structure as the movie reviews that were used to train the model.

The test reviews are tokenized: converted from text ("I liked this movie") to integer token IDs. The tokenized reviews are loaded into a PyTorch Dataset object so that they can be fed to the trained model. The demo processes the first five test reviews, one at a time. Test review [2] is incorrectly predicted as class 0 (negative) when it is actually class 1 (positive), so the demo displays the token IDs and the source review text (truncated to save space) for diagnosis.

Figure 1: Computing Model Accuracy for a DistilBERT Model for Movie Sentiment Analysis
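To make the tokenization step concrete, here is a minimal sketch, not part of the demo program, assuming the same 'distilbert-base-uncased' tokenizer used by the demo:

# minimal tokenization sketch (not part of the demo)
from transformers import DistilBertTokenizerFast

toker = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
enc = toker("I liked this movie", truncation=True, padding=True)
print(enc['input_ids'])  # e.g. [101, 1045, ..., 102]
print(toker.decode(enc['input_ids']))  # [CLS] i liked this movie [SEP]

Token ID 101 is the special [CLS] (classification) token, 102 is the [SEP] (separator) token, and 0 is the [PAD] (padding) token.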

This article assumes that you have advanced knowledge of a C family programming language, preferably Python, and a basic knowledge of PyTorch, but does not assume that you are familiar with the Hugging Face code library. The full source code for the demo program is presented in this article, and the code is also available in the accompanying download file.

To run the demo program, you must have Python, PyTorch and HF installed on your machine. The demo programs were developed on Windows 10 using the Anaconda 2020.02 64-bit distribution (which contains Python 3.7.6), the CPU version of PyTorch 1.8.0 installed via pip, and HF transformers version 4.11.3. Installing PyTorch is not trivial; you can find detailed step-by-step installation instructions in my blog post. Installing the HF transformers library is relatively straightforward: you can run the shell command "pip install transformers".
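If you want to verify your installation, a quick sanity check is to print the library versions. A minimal sketch, assuming the versions listed above:

# quick installation sanity check
import torch
import transformers
print(torch.__version__)         # e.g. 1.8.0+cpu
print(transformers.__version__)  # e.g. 4.11.3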

The Demo Program
The complete demo code, with a few minor edits to save space, is shown in Listing 1. I prefer to indent using two spaces rather than the standard four spaces. The backslash character is used for line continuation to break long statements.

Listing 1: The Complete Model Accuracy Demo Program

# imdb_hf_02_eval.py
# accuracy for tuned HF model for IMDB sentiment analysis 
# Python 3.7.6  PyTorch 1.8.0  HF 4.11.3  Windows 10
# zipped raw data at:
# https://ai.stanford.edu/~amaas/data/sentiment/

import numpy as np 
from pathlib import Path
from transformers import DistilBertTokenizerFast
import torch as T
from torch.utils.data import DataLoader
from transformers import DistilBertForSequenceClassification
from transformers import logging  # to suppress warnings

device = T.device('cpu')

class IMDbDataset(T.utils.data.Dataset):
  def __init__(self, reviews_lst, labels_lst):
    self.reviews_lst = reviews_lst  # dict-like of token IDs
    self.labels_lst = labels_lst    # list of 0-1 ints

  def __getitem__(self, idx):
    item = {}  # [input_ids] [attention_mask] [labels]
    for key, val in self.reviews_lst.items():
      item[key] = T.tensor(val[idx]).to(device)
    item['labels'] = \
      T.tensor(self.labels_lst[idx]).to(device)
    return item

  def __len__(self):
    return len(self.labels_lst)

def read_imdb(root_dir):
  reviews_lst = []; labels_lst = []
  root_dir = Path(root_dir)
  for label_dir in ["pos", "neg"]:
    for f_handle in (root_dir/label_dir).iterdir():
      txt = f_handle.read_text(
        encoding='utf-8')
      reviews_lst.append(txt)
      if label_dir == "pos":
        labels_lst.append(1)
      else:
        labels_lst.append(0)
  return (reviews_lst, labels_lst)  # list of strings, list of ints

def print_list(lst, front, back):
  # print first and last items
  n = len(lst)
  for i in range(front): print(lst[i] + " ", end="")
  print(" . . . ", end="")
  for i in range(back): print(lst[n-1-i] + " ", end="")
  print("")

def accuracy(model, ds, toker, num_reviews):
  # item-by-item: good for debugging but slow
  n_correct = 0; n_wrong = 0
  loader = DataLoader(ds, batch_size=1, shuffle=False)
  for (b_ix, batch) in enumerate(loader):
    print("==========================================")
    print(str(b_ix) + "  ", end="")
    input_ids = batch['input_ids'].to(device)  # just IDs

    # tensor([[101, 1045, 2253, . . 0, 0]])
    # words = toker.decode(input_ids[0])
    # [CLS] i went and saw . . [PAD] [PAD]

    lbl = batch['labels'].to(device)  # target 0 or 1
    mask = batch['attention_mask'].to(device)
    with T.no_grad():
      outputs = model(input_ids, 
        attention_mask=mask, labels=lbl)

    # SequenceClassifierOutput(
    #  loss=tensor(0.0168),
    #  logits=tensor([[-2.2251, 1.8527]]),
    #  hidden_states=None,
    #  attentions=None)
    logits = outputs[1]  # a tensor
    pred_class = T.argmax(logits)
    print("  target: " + str(lbl.item()), end="")
    print("  predicted: " + str(pred_class.item()), end="")
    if lbl.item() == pred_class.item():
      n_correct += 1; print(" | correct")
    else:
      n_wrong += 1; print(" | wrong")

    if lbl.item() != pred_class.item():
      print("Test review as token IDs: ")
      T.set_printoptions(threshold=100, edgeitems=3)
      print(input_ids)
      print("Review source: ")
      words = toker.decode(input_ids[0])  # giant string
      print_list(words.split(' '), 3, 3)

    if b_ix == num_reviews - 1:
      break

  print("==========================================")

  acc = (n_correct * 1.0) / (n_correct + n_wrong)
  print("nCorrect: %4d " % n_correct)
  print("Wrong:   %4d " % n_wrong)
  return acc

def main():
  # 0. get ready
  print("nBegin evaluation of IMDB HF model ")
  logging.set_verbosity_error()  # suppress wordy warnings
  T.manual_seed(1)
  np.random.seed(1)

  # 1. load pretrained model
  print("nLoading (cached) untuned DistilBERT model ")
  model = 
    DistilBertForSequenceClassification.from_pretrained( 
    'distilbert-base-uncased')
  model.to(device)
  print("Done ")

  # 2. load tuned model wts and biases from file
  print("nLoading tuned model wts and biases ")
  model.load_state_dict(T.load(".Modelsimdb_state.pt"))
  model.eval()
  print("Done ")

  # 3. load training data used to create tuned model
  print("nLoading test data from file into memory ")
  test_texts, test_labels = 
    read_imdb(".DataSmallaclImdbtest")
  print("Done ")

  # 4. tokenize the raw text data
  print("nTokenizing test reviews data ")
  tokenizer = 
    DistilBertTokenizerFast.from_pretrained(
    'distilbert-base-uncased')
  test_encodings = \
    tokenizer(test_texts, truncation=True, padding=True)
  print("Done ")

  # 5. put tokenized text into PyTorch Dataset
  print("nConverting tokenized text to Pytorch Dataset ")
  test_dataset = IMDbDataset(test_encodings, test_labels)
  print("Done ")

  # 6. compute classification accuracy
  print("nComputing model accuracy on first 5 test data ")
  acc = accuracy(model, test_dataset, tokenizer, 
    num_reviews=5)
  print("Accuracy = %0.4f " % acc)

  print("nEnd demo ")

if __name__ == "__main__":
  main()
 
The main() function starts with these four statements:

def main():
  # 0. get ready
  print("nBegin evaluation of IMDB HF model ")
  logging.set_verbosity_error()  # suppress wordy warnings
  T.manual_seed(1)
  np.random.seed(1)
 . . .

In general, it's not a good idea to suppress warning messages, but I do so here to keep the output tidy. It is a good idea to set the NumPy and PyTorch random number generator seeds so that demo runs are reproducible.

The pretrained Hugging Face TA model is loaded like so:

  # 1. load pretrained model
  print("nLoading (cached) untuned DistilBERT model ")
  model = 
    DistilBertForSequenceClassification.from_pretrained( 
    'distilbert-base-uncased')
  model.to(device)
  print("Done ")

The term "pretrained" means that the model has been trained using Wikipedia text so that it understands English, but the pretrained model does not understand movie reviews. The first time you run the program, the code will reach out over your internet connection and download the model. On subsequent runs, the code will use the cached version of the model. On Windows systems, cached HF models are stored by default in C:\Users\(user)\.cache\huggingface\transformers.
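If you want to control where HF caches the model files, the from_pretrained() method accepts an optional cache_dir argument. A minimal sketch, where the .\HFCache directory name is hypothetical:

# sketch: cache model files in a custom local directory
model = DistilBertForSequenceClassification.from_pretrained(
  'distilbert-base-uncased',
  cache_dir='.\\HFCache')  # hypothetical directory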

The saved weights and biases are loaded into the model:

  # 2. load tuned model wts and biases from file
  print("nLoading tuned model wts and biases ")
  model.load_state_dict(T.load(".Modelsimdb_state.pt"))
  model.eval()
  print("Done ")

These are the weights and biases that were computed during training using the IMDB movie review training dataset. There are several ways to save a trained PyTorch model; this code assumes that the training code saved the model's state dictionary object, which contains the weights and biases but not the model structure.
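For reference, a training script can save just the state dictionary with a single statement. This is a sketch of the save side, assuming the same file path used by the demo:

# sketch: how the training code might have saved the tuned
# weights and biases (state dictionary only, no model structure)
T.save(model.state_dict(), '.\\Models\\imdb_state.pt')

An alternative approach, not used here, is to save the entire model object with T.save(model, path), which records the model structure too.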

IMDB test data is loaded into memory as follows:

  # 3. load training data used to create tuned model
  print("nLoading test data from file into memory ")
  test_texts, test_labels = 
    read_imdb(".DataSmallaclImdbtest")
  print("Done ")

The full IMDB movie review test dataset contains 25,000 reviews (12,500 positive and 12,500 negative), which is difficult to work with. I manually pruned the dataset down to the first 100 positive test reviews and the first 100 negative test reviews.

I tweaked the test movie reviews slightly. First, I renamed the first 10 positive test reviews (located in the test\pos directory) from 0_10.txt, 1_10.txt, 2_7.txt, . . . 9_7.txt to 00_10.txt, 01_10.txt, . . . 09_7.txt so that they are processed by the read_imdb() function in order by file name. Second, I replaced the contents of file 02_7.txt in the pos (positive reviews) directory with:

“This movie was a total waste of my time. If you are looking for entertainment you should look elsewhere.”

I edited the movie review text to make it negative so that the review would be incorrectly predicted, for demonstration purposes.

The DistilBERT tokenizer object is instantiated with this code:

  # 4. tokenize the raw text data
  print("nTokenizing test reviews data ")
  tokenizer = 
    DistilBertTokenizerFast.from_pretrained(
    'distilbert-base-uncased')
  test_encodings = 
    tokenizer(test_texts, truncation=True, padding=True)
  print("Done ")

In general, each pretrained HF model has its own tokenizer. Most HF tokenizers come in a fast version and a basic version. The demo uses the uncased versions of the model and the tokenizer, which means all movie review text is converted to lowercase. In many NLP scenarios, casing is important and you'll want to use a cased model and tokenizer.
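To see the difference, the HF model hub also hosts a cased checkpoint named 'distilbert-base-cased'. A small sketch comparing the two tokenizers:

# uncased tokenizers lowercase text; cased tokenizers preserve case
tok_uncased = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
tok_cased = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
print(tok_uncased.tokenize("This Movie"))  # ['this', 'movie']
print(tok_cased.tokenize("This Movie"))    # ['This', 'Movie']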

Movie reviews are truncated to a default maximum length of 512 tokens, and reviews shorter than 512 tokens are padded at the end so that all reviews have the same length.
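If you need explicit control instead of the defaults, the tokenizer call accepts a max_length argument and a padding strategy. A sketch, where the 256-token limit is an arbitrary example:

# sketch: truncate reviews at 256 tokens and pad all reviews to
# exactly that length (rather than to the longest review)
test_encodings = tokenizer(test_texts, truncation=True,
  padding='max_length', max_length=256)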

Tokenized movie reviews are loaded into a PyTorch Dataset object using this instruction:

  # 5. put tokenized text into PyTorch Dataset
  print("nConverting tokenized text to Pytorch Dataset ")
  test_dataset = IMDbDataset(test_encodings, test_labels)
  print("Done ")

The PyTorch Dataset will be passed to a DataLoader object, which can serve up movie reviews and target labels (0 or 1) in batches or one at a time. To recap: the movie review text and labels are loaded from file into memory using the program-defined read_imdb() function, the text is tokenized using an HF library tokenizer object, and the result is passed to a program-defined IMDbDataset object.
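You can verify what the Dataset serves up by fetching a single item. A quick sketch, where the exact tensor length depends on the padded review length:

# each Dataset item is a dictionary of three tensors
item = test_dataset[0]
print(item['input_ids'].shape)       # e.g. torch.Size([512])
print(item['attention_mask'].shape)  # same shape as input_ids
print(item['labels'])                # tensor(1) -- positive reviews load first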

The test movie reviews are used to compute classification accuracy like so:

  # 6. compute classification accuracy
  print("nComputing model accuracy on first 5 test data ")
  acc = accuracy(model, test_dataset, tokenizer, 
    num_reviews=5)
  print("Accuracy = %0.4f " % acc)

The program-defined accuracy() function accepts the IMDbDataset that holds the movie review data. The tokenizer is also passed to accuracy() so that token IDs can be converted back to their source text when diagnosing incorrectly predicted test cases. The number of reviews is limited to five just to keep the size of the output small.
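The item-by-item approach used by the demo is good for diagnosing individual test cases, but it is slow. For a larger test set, a batched variant is a reasonable alternative. This is a sketch, not part of the demo program:

def accuracy_batched(model, ds, batch_size=16):
  # faster batched variant -- no per-item diagnostics
  n_correct = 0; n_total = 0
  loader = DataLoader(ds, batch_size=batch_size, shuffle=False)
  for batch in loader:
    with T.no_grad():
      outputs = model(batch['input_ids'],
        attention_mask=batch['attention_mask'],
        labels=batch['labels'])
    preds = T.argmax(outputs[1], dim=1)  # logits to class indices
    n_correct += (preds == batch['labels']).sum().item()
    n_total += len(batch['labels'])
  return n_correct / n_total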

The Accuracy Function
The definition of the program-defined accuracy() function begins with:

def accuracy(model, ds, toker, num_reviews):
  # item-by-item: good for debugging but slow
  n_correct = 0; n_wrong = 0
  loader = DataLoader(ds, batch_size=1, shuffle=False)
. . .