Project - How to Build a Sentiment Classifier using Python and IMDB Reviews


Creating a lookup table

Computers can process only numbers, not words, so we need to convert the words in truncated_vocabulary into numbers.

To do this, we add a preprocessing step that replaces each word with its ID (i.e., its index in truncated_vocabulary). We will create a lookup table for this, using 1,000 out-of-vocabulary (oov) buckets.

We shall create the lookup table such that the most frequently occurring words have lower indices than less frequently occurring words.
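
As an illustration of why frequency ordering yields low IDs for common words, here is a minimal pure-Python sketch. It assumes a tiny made-up token list rather than the IMDB data, and the names toy_tokens, toy_vocabulary and toy_word_to_id are hypothetical, introduced only for this example.

```python
from collections import Counter

# Hypothetical toy corpus; the course builds its vocabulary from IMDB reviews.
toy_tokens = "the movie was great the plot was thin the end".split()

counts = Counter(toy_tokens)

# most_common(n) returns the n most frequent words, most frequent first.
toy_vocabulary = [word for word, _ in counts.most_common(3)]

# A word's ID is its position in the frequency-sorted list,
# so the most frequent word gets ID 0, the next gets ID 1, and so on.
toy_word_to_id = {word: idx for idx, word in enumerate(toy_vocabulary)}

print(toy_word_to_id)
```

Because the vocabulary list is already sorted by frequency, using each word's index as its ID is enough to give frequent words the smallest numbers.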

Note:

  • tf.lookup.KeyValueTensorInitializer : Table initializer built from a tensor of keys and a tensor of values.

  • tf.lookup.StaticVocabularyTable : String-to-ID table wrapper that assigns out-of-vocabulary keys to buckets.

    A term that is not in the vocabulary is mapped to a bucket_id between vocab_size and vocab_size + num_oov_buckets - 1 (here, 10,000 to 10,999), calculated by: hash(<term>) % num_oov_buckets + vocab_size

  • table.lookup : Looks up keys in the table, outputs the corresponding values.
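
The bucket formula above can be sketched in plain Python. This is only an approximation of the idea, not TensorFlow's implementation: zlib.crc32 is used as a stand-in hash (TensorFlow uses its own fingerprint function, so actual bucket numbers will differ), and lookup_id and toy_word_to_id are hypothetical names for this example.

```python
import zlib

def lookup_id(word, word_to_id, num_oov_buckets):
    """Return the vocabulary ID of a word, or an OOV bucket ID if it is unknown."""
    vocab_size = len(word_to_id)
    if word in word_to_id:
        return word_to_id[word]
    # Stand-in hash (assumption): any deterministic hash shows the same idea.
    # The result always falls in [vocab_size, vocab_size + num_oov_buckets - 1].
    return zlib.crc32(word) % num_oov_buckets + vocab_size

# Hypothetical three-word vocabulary with five OOV buckets.
toy_word_to_id = {b"the": 0, b"movie": 1, b"was": 2}
toy_sentence = b"the movie was faaaaaantastic".split()

ids = [lookup_id(w, toy_word_to_id, num_oov_buckets=5) for w in toy_sentence]
print(ids)
```

With a vocabulary of 3 words and 5 buckets, the known words keep IDs 0 to 2 and any unknown word lands somewhere in 3 to 7. Two different unknown words may share a bucket, which is the trade-off for keeping the table size fixed.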

INSTRUCTIONS
  • Create a tensor words containing the words of truncated_vocabulary.

    << your code comes here >> = tf.constant(truncated_vocabulary)
    
  • Create word_ids, the indices of the corresponding words in truncated_vocabulary.

    word_ids = tf.range(len(truncated_vocabulary), dtype=tf.int64)
    
  • Create the table initializer vocab_init using tf.lookup.KeyValueTensorInitializer, given the keys (here, words) and the values (here, word_ids) tensors.

    vocab_init = << your code comes here >>(words, word_ids)
    
  • Set num_oov_buckets = 1000 and create the lookup table, named table, using tf.lookup.StaticVocabularyTable. Note that we pass vocab_init and num_oov_buckets as input arguments.

    num_oov_buckets = 1000
    table = << your code comes here >>(vocab_init, num_oov_buckets)
    
  • Let's use the above table to look up the IDs of a few words:

    table.lookup(tf.constant([b"This movie was faaaaaantastic".split()]))
    

    Note: The words “this,” “movie,” and “was” were found in the table, so their IDs are lower than 10,000, while the word “faaaaaantastic” was not found, so it was mapped to one of the oov buckets, with an ID greater than or equal to 10,000.
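
Since every out-of-vocabulary ID is at least the vocabulary size, a simple threshold check separates known words from unknown ones. The sketch below assumes a vocabulary size of 10,000 as in the note above; the IDs in example_ids are made up for illustration and are not actual output of the lookup, and split_oov is a hypothetical helper.

```python
def split_oov(word_ids, vocab_size):
    """Separate in-vocabulary IDs from OOV bucket IDs using the size threshold."""
    known = [i for i in word_ids if i < vocab_size]
    oov = [i for i in word_ids if i >= vocab_size]
    return known, oov

vocab_size = 10_000

# Hypothetical IDs for a four-word sentence (not real lookup output).
example_ids = [17, 5, 9, 10421]

known, oov = split_oov(example_ids, vocab_size)

# Subtracting vocab_size recovers which bucket an OOV word fell into.
oov_buckets = [i - vocab_size for i in oov]

print(known, oov, oov_buckets)
```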
