Project - How to Build a Sentiment Classifier using Python and IMDB Reviews

Defining the preprocess function

  • We will now define a preprocessing function that does the following:

    • Truncate each review to its first 300 characters, since the first sentence or two is usually enough to tell whether a review is positive or negative.

    • Use regular expressions to replace <br/> tags with spaces, and to replace every character other than letters and single quotes (apostrophes) with spaces.

    • Finally, split each review on whitespace, which returns a ragged tensor, and convert that ragged tensor to a dense tensor, padding every review with the padding token <pad> so that all reviews in the batch have the same length.
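
The truncation, cleaning, and splitting steps above can be tried on a single review without TensorFlow. The sketch below is a plain-Python analogue using the standard re module, not the TensorFlow implementation itself; the helper name clean_review and the sample review are made up for illustration.

```python
import re

def clean_review(review: bytes, max_len: int = 300) -> list:
    """Plain-Python analogue of the cleaning steps (hypothetical helper)."""
    review = review[:max_len]                      # keep the first max_len bytes
    review = re.sub(rb"<br\s*/?>", b" ", review)   # <br>, <br/>, <br /> -> space
    review = re.sub(rb"[^a-zA-Z']", b" ", review)  # non-letters (except ') -> space
    return review.split()                          # split on runs of whitespace

sample = b"Loved it!<br />Wouldn't skip it, 10/10."
tokens = clean_review(sample)
print(tokens)
# [b'Loved', b'it', b"Wouldn't", b'skip', b'it']
```

Because truncation happens before the tags are removed, a tag that is cut off at the 300-byte boundary (for example a trailing "<br") no longer matches the first pattern. The second pattern then turns "<" into a space and leaves "br" behind as a token.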


  • tf.strings - Operations for working with string Tensors.

  • tf.strings.substr(X_batch, 0, 300) - For each string in the input Tensor X_batch, returns the substring that starts at index pos (here 0) and has a total length of len (here 300). By default, positions and lengths are counted in bytes.

  • tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ") - In each element of X_batch, replaces every match of the regex pattern <br\s*/?> with the rewrite string (here a single space).

  • tf.strings.split(X_batch) - Splits each element of X_batch on whitespace (the default separator) and returns the pieces as a RaggedTensor.

  • X_batch.to_tensor(default_value=b"<pad>") - Converts the RaggedTensor into a dense tf.Tensor. default_value is the value used to fill positions that have no entry in X_batch (here <pad>), so shorter rows are padded up to the length of the longest row.
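
To make the ragged-to-dense conversion concrete, here is a minimal plain-Python sketch of the padding idea. It only mimics what to_tensor(default_value=b"<pad>") does to rows of different lengths; the helper name pad_batch and the two short reviews are assumptions made for illustration.

```python
def pad_batch(token_lists, pad=b"<pad>"):
    """Pad variable-length token lists to the length of the longest one."""
    width = max((len(tokens) for tokens in token_lists), default=0)
    return [tokens + [pad] * (width - len(tokens)) for tokens in token_lists]

# Two "reviews" with different numbers of tokens (a ragged batch)
ragged = [b"a great movie".split(), b"boring".split()]
dense = pad_batch(ragged)
for row in dense:
    print(row)
# [b'a', b'great', b'movie']
# [b'boring', b'<pad>', b'<pad>']
```

The padded width depends on the longest review in the batch, so two different batches can end up with different widths.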

  • Use the following code to preprocess the data as described above:

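    import tensorflow as tf

    def preprocess(X_batch, y_batch):
        # Keep only the first 300 bytes of each review
        X_batch = tf.strings.substr(X_batch, 0, 300)
        # Replace <br>, <br/> and <br /> tags with spaces
        X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
        # Replace everything except letters and apostrophes with spaces
        X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
        # Split on whitespace (RaggedTensor), then pad to a dense tensor
        X_batch = tf.strings.split(X_batch)
        return X_batch.to_tensor(default_value=b"<pad>"), y_batch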
  • Let us now call the preprocess() function on X_batch, y_batch to see what the data looks like after preprocessing:

    << your code comes here >>(X_batch, y_batch)