Project - How to Build a Sentiment Classifier using Python and IMDB Reviews

Creating the Final Train and Test sets

Now we will create the final training and test sets.

To create the final training set train_set:

  • we batch the reviews

  • then we convert them to short sequences of words using the preprocess() function

  • then encode these words using a simple encode_words() function that uses the table we just built and finally prefetch the next batch.
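
The order of these steps can be sketched in plain Python, without TensorFlow. This is only an illustration under stated assumptions: toy_batch(), toy_preprocess(), toy_encode_words(), the tiny vocab dictionary and the sample reviews are hypothetical stand-ins, not the tutorial's preprocess() function or lookup table.

```python
import re

# Hypothetical toy vocabulary standing in for the lookup table built earlier.
vocab = {"<pad>": 0, "great": 1, "movie": 2, "bad": 3}
UNKNOWN_ID = len(vocab)  # every out-of-vocabulary word maps to this single ID


def toy_batch(samples, batch_size):
    """Group (review, label) pairs into (reviews, labels) batches."""
    for start in range(0, len(samples), batch_size):
        chunk = samples[start:start + batch_size]
        yield [review for review, _ in chunk], [label for _, label in chunk]


def toy_preprocess(X_batch, y_batch, max_words=4):
    """Lowercase, drop punctuation, split into words, truncate, then pad."""
    tokenized = [
        re.sub(r"[^a-z']", " ", review.lower()).split()[:max_words]
        for review in X_batch
    ]
    width = max(len(words) for words in tokenized)
    padded = [words + ["<pad>"] * (width - len(words)) for words in tokenized]
    return padded, y_batch


def toy_encode_words(X_batch, y_batch):
    """Replace every word with its integer ID; labels pass through unchanged."""
    encoded = [[vocab.get(word, UNKNOWN_ID) for word in words] for words in X_batch]
    return encoded, y_batch


samples = [("Great movie!", 1), ("Bad, bad movie overall", 0), ("Great", 1)]

# batch -> preprocess -> encode, in the same order as the steps above
encoded_batches = [
    toy_encode_words(*toy_preprocess(X, y)) for X, y in toy_batch(samples, 2)
]
for X_batch, y_batch in encoded_batches:
    print(X_batch, y_batch)
# [[1, 2, 0, 0], [3, 3, 2, 4]] [1, 0]
# [[1]] [1]
```

Because batching comes first, padding only needs to reach the longest review within each batch, which is why the two batches have different widths.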

Testing on the whole test set takes a lot of time, so after training we will test the model on 1000 samples of the test data. We create the final test set from 1000 samples as follows.

To create the final test set test_set:

  • we create a batch of 1000 test samples

  • then we convert them to short sequences of words using the preprocess() function

  • then encode these words using a simple encode_words() function that uses the table we just built.

Note:

  • dataset.repeat().batch(32) repeats the dataset indefinitely and groups its samples into batches of 32.

  • dataset.repeat().batch(32).map(preprocess) applies the function preprocess to every batch.

  • dataset.map(encode_words).prefetch(1) applies the function encode_words to each element of the dataset and prepares the next batch in parallel while the current one is being processed.
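
One consequence of calling repeat() before batch() is that batches can straddle the boundary between two passes over the data. The following is a pure-Python analogy built on itertools, not the tf.data implementation; repeat_then_batch() is a hypothetical helper name introduced only for this sketch.

```python
from itertools import cycle, islice


def repeat_then_batch(samples, batch_size):
    """Endlessly yield fixed-size batches from a repeating stream of samples."""
    stream = cycle(samples)  # plays the role of repeat()
    while True:
        yield list(islice(stream, batch_size))  # plays the role of batch()


# Five samples in batches of two: the third batch mixes the end of the
# first pass with the start of the second.
first_four = list(islice(repeat_then_batch(["a", "b", "c", "d", "e"], 2), 4))
print(first_four)
# [['a', 'b'], ['c', 'd'], ['e', 'a'], ['b', 'c']]
```

Since the repeated stream never ends, the consumer has to decide how many batches to draw, as islice() does here.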

INSTRUCTIONS
  • Define the encode_words() function to encode the words of train data using the lookup table table.

    def encode_words(X_batch, y_batch):
        return table.lookup(X_batch), y_batch
    
  • Repeat the train data datasets["train"], group it into batches of 32 samples, and apply the function preprocess to every batch.

    train_set = datasets["train"].repeat().batch(32).map(<< your code comes here >>)
    
  • Apply the function encode_words to the train_set and fetch the next batch in parallel.

    train_set = train_set.map(<< your code comes here >>).prefetch(1)
    
  • Similarly, apply the function preprocess to the test data datasets["test"].

    test_set = datasets["test"].batch(1000).map(<< your code comes here >>)
    
  • Apply the function encode_words to the test_set.

    test_set = test_set.map(<< your code comes here >>)
    
  • Let us see what the first batch of the resulting train_set looks like:

    for X_batch, y_batch in train_set.take(1):
        print(X_batch)
        print(y_batch)
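
The idea behind prefetch(1) on train_set can be illustrated with a background thread and a one-slot queue: while the consumer works on the current batch, the producer is already preparing the next one. This is a simplified analogy and not how tf.data is implemented; the prefetch() helper below is hypothetical and written only for this sketch.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(iterable, buffer_size=1):
    """Yield items from `iterable` while a background thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(_DONE)

    # Daemon thread so an abandoned producer does not keep the program alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()  # waits until the producer has an item ready
        if item is _DONE:
            return
        yield item


# Items come out in their original order; only the timing changes.
batches = list(prefetch([i, i + 1] for i in range(0, 6, 2)))
print(batches)
# [[0, 1], [2, 3], [4, 5]]
```

Prefetching changes when batches are prepared, not their content or order.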
    