The Data Science Lab

Preparing IMDB Movie Review Data for NLP Experiments

Microsoft Research’s Dr. James McCaffrey shows how to get the raw source IMDB data, read movie reviews into memory, parse and tokenize reviews, create a vocabulary dictionary, and convert reviews to numeric form.

A common dataset for natural language processing (NLP) experiments is the IMDB movie review data. The goal of the IMDB dataset problem is to predict whether a movie review has a positive sentiment (“It was a great movie”) or a negative sentiment (“The movie was a waste of time”). A major challenge when working with the IMDB dataset is data preparation.

This article explains how to get the raw source IMDB data, read the movie reviews into memory, parse and tokenize the reviews, create a vocabulary dictionary, and convert the reviews into a numeric form suitable for use by a system such as a deep neural network, an LSTM network, or a Transformer architecture network.

The most popular neural network libraries, including PyTorch, scikit and Keras, have some form of built-in IMDB dataset designed to work with the library. But there are two problems with using a built-in dataset. First, data access becomes a magic black box and important information is hidden. Second, the built-in datasets use all 25,000 training and all 25,000 test movie reviews, which are difficult to work with because they’re so large.

Figure 1: Converting Source IMDB Train Data to Token IDs

A good way to see where this article is heading is to take a look at the screenshot of a Python language program in Figure 1. Source IMDB movie reviews are stored as text files, one review per file. The program starts by loading all 50,000 movie reviews into memory and then parses each review into words/tokens. Words/tokens are used to create a vocabulary dictionary that associates each word/token with an integer identifier. For example, the word “the” maps to ID=4.

The vocabulary dictionary is then used to convert movie reviews that have 20 words or fewer into token IDs. Reviews with fewer than 20 words/tokens are padded to an exact length of 20 by prepending the special padding ID of 0.

Movie reviews that have a positive sentiment, such as “This is a good movie,” get a label of 1 as the last value, and negative-sentiment reviews get a label of 0. The result is 16 movie reviews for training and 17 reviews for testing. In a non-demo scenario, you would allow longer reviews, such as up to 80 words, to get more training and test reviews.
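
To make the file format concrete, the following sketch (not part of the demo program; the token IDs shown are hypothetical) shows how one short tokenized review becomes a line of 20 padded IDs followed by its class label:

review = ["a", "great", "movie"]             # tokenized review
vocab = {"a": 6, "great": 190, "movie": 17}  # word-to-ID map (hypothetical IDs)
max_review_len = 20
ids = [vocab[w] for w in review]                   # [6, 190, 17]
padded = [0] * (max_review_len - len(ids)) + ids   # prepend padding ID 0
line = " ".join(str(x) for x in padded) + " 1"     # append class label: 1 = positive
print(line)  # 17 zeros, then 6 190 17, then the label 1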

This article assumes you have intermediate or better knowledge of a C-family programming language, preferably Python, but does not assume you know anything about the IMDB dataset. The full source code for the demo program is shown in this article, and the code is also available in the companion file download.

Obtaining Source Data Files
IMDB movie review data consists of 50,000 reviews – 25,000 for training and 25,000 for testing. The training and test files are split evenly into 12,500 positive reviews and 12,500 negative reviews. Negative reviews are reviews associated with films that the critic has rated from 1 to 4 stars. Positive reviews are those rated from 7 to 10 stars. Film reviews that have received 5 or 6 stars are not considered positive or negative and are not used.

The Stanford Large Movie Review Dataset page is the primary site hosting the raw IMDB movie review data, but you can also find it in other places using an internet search. If you click the download link on the webpage, you will get an 80MB file in tar-GNU-zip format named aclImdb_v1.tar.gz.

Unlike ordinary compressed .zip files, Windows cannot extract tar.gz files, so you must use a third-party application. I recommend the free 7-Zip utility. After installing 7-Zip, you can open Windows File Explorer, then right-click the aclImdb_v1.tar.gz file and select the Extract Here option. This will produce a 284MB file named aclImdb_v1.tar (“tape archive”). If you right-click on this tar file and select the Extract Here option, you will get an uncompressed root directory named aclImdb of about 300MB.
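
If you prefer not to install a separate utility, Python’s standard tarfile module can also extract the archive. This is a minimal sketch that assumes aclImdb_v1.tar.gz is in the current directory:

import tarfile

with tarfile.open("aclImdb_v1.tar.gz", "r:gz") as tar:
  tar.extractall(".")  # creates the aclImdb root directory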

The aclImdb root directory contains subdirectories named test and train, as well as three files that you can ignore. The test and train directories contain subdirectories named neg and pos, as well as five files that you can ignore; the train directory also contains a subdirectory named unsup (50,000 unlabeled reviews for unsupervised analysis) that you can ignore. The neg and pos directories each contain 12,500 text files where each review is a single file.

The 50,000 filenames look like 102_4.txt, where the first part of the filename is the review index, from [0] to [12499], and the second part is the numerical review score (0-4 for negative reviews and 7-10 for positive reviews).
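
For example, a hypothetical helper function (not part of the demo program) could split a filename such as 102_4.txt into its review index and numerical score:

def parse_review_filename(fn):
  stem = fn.split(".")[0]          # "102_4"
  (idx, score) = stem.split("_")   # "102", "4"
  return (int(idx), int(score))

print(parse_review_filename("102_4.txt"))  # (102, 4) -- a negative review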

Figure 2: The First Positive Training Review of the IMDB Dataset

The screenshot in Figure 2 shows the directory structure of the IMDB movie review data. The contents of the first positive-sentiment training review (file 0_9.txt) are displayed in Notepad.

Making the IMDB Review Training and Test Files
The complete make_data_files.py data preparation program, with a few minor edits to save space, is presented in Listing 1. The program accepts the 50,000 movie review files as input and creates one training file and one test file.

The program has three helper functions that do all the work:

get_reviews(dir_path, max_reviews)
make_vocab(all_reviews)
generate_file(reviews_lists, outpt_file, w_or_a, 
  vocab_dict, max_review_len, label_char)

The get_reviews() function reads all the files in a directory, tokenizes the reviews, and returns a list of lists such as [[“a”, “great”, “movie”], [“i”, “liked”, “it”, “a”, “lot”], . . . [“terrific”, “film”]]. The make_vocab() function accepts a list of tokenized reviews and builds a vocabulary dictionary where the keys are words/tokens such as “movie” and the values are integer identifiers such as 27. The dictionary key-value pairs are also written to a text file named vocab_file.txt so that they can be used later by an NLP system.

The generate_file() function accepts the results of get_reviews() and make_vocab() and produces a training file or a test file. The program’s control logic is in a main() function that begins:

def main():
  print("Loading all reviews into memory - be patient ")
  pos_train_reviews = get_reviews(".\\aclImdb\\train\\pos", 12500)
  neg_train_reviews = get_reviews(".\\aclImdb\\train\\neg", 12500)
  pos_test_reviews = get_reviews(".\\aclImdb\\test\\pos", 12500)
  neg_test_reviews = get_reviews(".\\aclImdb\\test\\neg", 12500)
. . .

Next, the vocabulary dictionary is created from the training data:

  vocab_dict = make_vocab([pos_train_reviews, 
    neg_train_reviews])  # key = word, value = word rank
  v_len = len(vocab_dict)  
  # need this plus 4, for Embedding: 129888+4 = 129892

For the demo, there are 129,888 distinct words/tokens. This is a very large number because, in addition to ordinary English words such as “film” and “excellent”, there are thousands of words specific to movie reviews, such as “hitchcock” (a director) and “dicaprio” (an actor).

The vocabulary is based on word frequencies where ID = 4 is the most common word (“the”), ID = 5 is the second most common word (“and”) and so on. This allows you to filter out rare words that appear only once or twice.
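
For example, because ordinary words receive IDs starting at 4 in frequency order, one simple filtering approach (a sketch, not part of the demo program, with a hypothetical cutoff) is to map any ID beyond the cutoff to the out-of-vocabulary ID of 2:

max_vocab_size = 20000  # hypothetical cutoff: keep only the 20,000 most frequent words

def filter_rare(token_ids, max_vocab_size):
  # frequent-word IDs run from 4 to max_vocab_size + 3; anything larger becomes OOV (ID 2)
  return [idx if idx < max_vocab_size + 4 else 2 for idx in token_ids]

print(filter_rare([4, 52, 131072, 9], max_vocab_size))  # [4, 52, 2, 9]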

The vocabulary reserves identifiers 0, 1, 2 and 3 for special tokens. ID=0 is used for padding. ID=1 indicates the start of a sequence. ID=2 is used for out-of-vocabulary words. ID=3 is reserved but not used. The total number of tokens in the vocabulary is 129,888 + 4 = 129,892. This number is needed for an Embedding layer when building an NLP prediction system.
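
For example, when using PyTorch, the 129,892 value would be passed as the num_embeddings argument of an Embedding layer (the embedding dimension of 100 below is just an illustrative choice):

import torch

# 129,888 vocabulary words plus the 4 special IDs
embed = torch.nn.Embedding(num_embeddings=129892, embedding_dim=100)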

The demo program creates a training file containing movie reviews of 20 words or fewer with these three statements:

  max_review_len = 20  # exact fixed length
  generate_file(pos_train_reviews, ".\\imdb_train_20w.txt", 
    "w", vocab_dict, max_review_len, "1")
  generate_file(neg_train_reviews, ".\\imdb_train_20w.txt",
    "a", vocab_dict, max_review_len, "0")

The first call to generate_file() passes a “w” argument, which creates the destination file and writes the positive reviews. The second call uses an “a” argument to append the negative reviews. It is possible to use “a+” mode, but using separate “w” and “a” modes is clearer in my opinion.

The test file is created in the same way:

  generate_file(pos_test_reviews, ".\\imdb_test_20w.txt", 
    "w", vocab_dict, max_review_len, "1")
  generate_file(neg_test_reviews, ".\\imdb_test_20w.txt", 
    "a", vocab_dict, max_review_len, "0")

The demo inspects the training file:

  f = open(".imdb_train_20w.txt", "r", encoding="utf8")
  for line in f: 
    print(line, end="")
  f.close()

The vocabulary dictionary accepts a word/token such as “movie” and returns an ID such as 87. The demo creates an inverted vocabulary object named index_to_word that accepts an ID and returns the corresponding word/token, taking the four special tokens into account:

  index_to_word = {}
  index_to_word[0] = "<PAD>"
  index_to_word[1] = "<ST>"
  index_to_word[2] = "<OOV>"
  for (k,v) in vocab_dict.items():
    index_to_word[v+3] = k

The demo program concludes by using the inverted vocabulary dictionary to decode and display the training file:

  f = open(".imdb_train_20w.txt", "r", encoding="utf8")
  for line in f:
    line = line.strip()
    indexes = line.split(" ")
    for i in range(len(indexes)-1):  # last is '0' or '1'
      idx = (int)(indexes[i])
      w = index_to_word[idx]
      print("%s " % w, end="")
    print("%s " % indexes[len(indexes)-1])
  f.close()

There is no standard format for NLP vocabulary collections, which is another problem with using the built-in IMDB datasets from PyTorch and Keras. Moreover, a vocabulary collection depends entirely on how the source data is tokenized. This means that you need to tokenize NLP data and create the associated vocabulary together, using a consistent scheme.
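
For example, at prediction time a new review must be tokenized in the same way and converted to IDs using the saved vocab_file.txt. The following sketch assumes the file format written by make_vocab() (a word, a space, and a 1-based rank on each line) and applies the same +3 offset used by generate_file():

vocab_dict = {}
f = open(".\\vocab_file.txt", "r", encoding="utf8")
for line in f:
  (word, rank) = line.strip().split(" ")
  vocab_dict[word] = int(rank)
f.close()

review = "a great movie".split(" ")  # already lower-cased, punctuation removed
ids = [vocab_dict[w] + 3 if w in vocab_dict else 2 for w in review]  # 2 = OOV
print(ids)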

Listing 1: Program to Create the IMDB Movie Review Training and Test Files

# make_data_files.py
#
# input: 50,000 source Stanford movie review data files
# output: one combined train file, one combined test file
# output files are in index form, using the Keras dataset
# format where 0 = padding, 1 = 'start', 2 = OOV, 3 = unused
# 4 = most frequent word ('the'), 5 = next most frequent, etc.

import os
# allow the Windows cmd shell to deal with wacky characters
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)

# ---------------------------------------------------------------

def get_reviews(dir_path, max_reviews):
  remove_chars = "!"#$%&()*+,-./:;<=>[email protected][]^_`{|}~" 
  # leave ' for words like it's 
  punc_table = {ord(char): None for char in remove_chars}  # dict
  reviews = []  # list-of-lists of words
  ctr = 1
  for file in os.listdir(dir_path):
    if ctr > max_reviews: break
    curr_file = os.path.join(dir_path, file)
    f = open(curr_file, "r", encoding="utf8")  # one line
    for line in f:
      line = line.strip()
      if len(line) > 0:  # number characters
        # print(line)  # to show non-ASCII == errors
        line = line.translate(punc_table)  # remove punc
        line = line.lower()  # lower case
        line = " ".join(line.split())  # remove consecutive WS
        word_list = line.split(" ")  # list of words
        reviews.append(word_list)
    f.close()  # close curr file
    ctr += 1
  return reviews

# ---------------------------------------------------------------

def make_vocab(all_reviews):
  word_freq_dict = {}   # key = word, value = frequency

  for i in range(len(all_reviews)):
    reviews = all_reviews[i]
    for review in reviews:
      for word in review:
        if word in word_freq_dict:
          word_freq_dict[word] += 1
        else:
          word_freq_dict[word] = 1

  kv_list = []  # list of word-freq tuples so can sort
  for (k,v) in word_freq_dict.items():
    kv_list.append((k,v))

  # list of tuples index is 0-based rank, val is (word,freq)
  sorted_kv_list = sorted(kv_list, key=lambda x: x[1],
    reverse=True)  # sort by freq

  f = open(".vocab_file.txt", "w", encoding="utf8")
  vocab_dict = {}  
  # key = word, value = 1-based rank 
  # ('the' = 1, 'a' = 2, etc.)
  for i in range(len(sorted_kv_list)):  # filter here . . 
    w = sorted_kv_list[i][0]  # word is at [0]
    vocab_dict[w] = i+1       # 1-based as in Keras dataset

    f.write(w + " " + str(i+1) + "n")  # save word-space-index
  f.close()

  return vocab_dict

# ---------------------------------------------------------------

def generate_file(reviews_lists, outpt_file, w_or_a, 
  vocab_dict, max_review_len, label_char):

  # write first time, append later. could use "a+" mode instead.
  fout = open(outpt_file, w_or_a, encoding="utf8")  
  offset = 3  # Keras offset: 'the' = 1 (most frequent) 1+3 = 4
      
  for i in range(len(reviews_lists)):  # walk each review-list
    curr_review = reviews_lists[i]
    n_words = len(curr_review)     
    if n_words > max_review_len:
      continue  # next i, continue without writing anything

    n_pad = max_review_len - n_words   # number 0s to prepend

    for j in range(n_pad):
      fout.write("0 ")
    
    for word in curr_review: 
      # a word in test set might not have been in train set     
      if word not in vocab_dict:  
        fout.write("2 ")   # out-of-vocab index        
      else:
        idx = vocab_dict[word] + offset
        fout.write("%d " % idx)
    
    fout.write(label_char + "\n")  # '0' or '1'
        
  fout.close()

# ---------------------------------------------------------------          

def main():
  print("Loading all reviews into memory - be patient ")
  pos_train_reviews = get_reviews(".\\aclImdb\\train\\pos", 12500)
  neg_train_reviews = get_reviews(".\\aclImdb\\train\\neg", 12500)
  pos_test_reviews = get_reviews(".\\aclImdb\\test\\pos", 12500)
  neg_test_reviews = get_reviews(".\\aclImdb\\test\\neg", 12500)

  # mp = max(len(l) for l in pos_train_reviews)  # 2469
  # mn = max(len(l) for l in neg_train_reviews)  # 1520
  # mm = max(mp, mn)  # longest review is 2469
  # print(mp, mn)

# ---------------------------------------------------------------  

  print("Analyzing reviews and making vocabulary ")
  vocab_dict = make_vocab([pos_train_reviews, 
    neg_train_reviews])  # key = word, value = word rank
  v_len = len(vocab_dict)  
  # need this value, plus 4, for Embedding: 129888+4 = 129892
  print("Vocab size = %d -- use this +4 for 
    Embedding nw " % v_len)

  max_review_len = 20  # exact fixed length
  # if max_review_len == None or max_review_len > mm:
  #   max_review_len = mm

  print("Generating training file len %d words or less " 
    % max_review_len)

  generate_file(pos_train_reviews, ".\\imdb_train_20w.txt", 
    "w", vocab_dict, max_review_len, "1")
  generate_file(neg_train_reviews, ".\\imdb_train_20w.txt",
    "a", vocab_dict, max_review_len, "0")

  print("Generating test file with len %d words or less " 
    % max_review_len)

  generate_file(pos_test_reviews, ".\\imdb_test_20w.txt", 
    "w", vocab_dict, max_review_len, "1")
  generate_file(neg_test_reviews, ".\\imdb_test_20w.txt", 
    "a", vocab_dict, max_review_len, "0")

  # inspect a generated file
  # vocab_dict was used indirectly (offset)

  print("Displaying encoded training file: n")
  f = open(".imdb_train_20w.txt", "r", encoding="utf8")
  for line in f: 
    print(line, end="")
  f.close()

# ---------------------------------------------------------------  

  print("Displaying decoded training file: ") 

  index_to_word = {}
  index_to_word[0] = "<PAD>"
  index_to_word[1] = "<ST>"
  index_to_word[2] = "<OOV>"
  for (k,v) in vocab_dict.items():
    index_to_word[v+3] = k

  f = open(".imdb_train_20w.txt", "r", encoding="utf8")
  for line in f:
    line = line.strip()
    indexes = line.split(" ")
    for i in range(len(indexes)-1):  # last is '0' or '1'
      idx = (int)(indexes[i])
      w = index_to_word[idx]
      print("%s " % w, end="")
    print("%s " % indexes[len(indexes)-1])
  f.close()

if __name__ == "__main__":
  main()
