Defining the preprocess function

Next, we define a preprocessing function that does the following:
- Truncates each review to its first 300 characters, since you can generally tell whether a review is positive or negative from the first sentence or two.
- Uses regular expressions to replace <br/> tags with spaces, and to replace every character other than letters and quotes with a space.
- Splits each review on whitespace, which returns a ragged tensor, then converts that ragged tensor to a dense tensor, padding every review with the padding token <pad> so that they all have the same length.
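
To make the effect of these steps concrete, here is a rough plain-Python analogue that uses the standard re module instead of TensorFlow. The helper name preprocess_plain and the two sample reviews are made up for illustration; this is only a sketch of the same idea, not the function used in this project.

```python
import re

def preprocess_plain(reviews):
    """Plain-Python analogue of the steps above (hypothetical helper)."""
    tokenized = []
    for review in reviews:
        review = review[:300]                          # keep the first 300 bytes
        review = re.sub(rb"<br\s*/?>", b" ", review)   # replace <br/> tags with a space
        review = re.sub(rb"[^a-zA-Z']", b" ", review)  # keep only letters and quotes
        tokenized.append(review.split())               # split on whitespace
    # Pad every review to the length of the longest one in the batch.
    longest = max(len(words) for words in tokenized)
    return [words + [b"<pad>"] * (longest - len(words)) for words in tokenized]

# Made-up sample reviews.
padded = preprocess_plain([b"It's great!<br/>A must-see.", b"Awful"])
for row in padded:
    print(row)
```

Every row ends up with the same number of tokens, which is what allows a batch of reviews to be stored in a single rectangular tensor.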

Note:

tf.strings - Operations for working with string Tensors.
tf.strings.substr(X_batch, 0, 300) - For each string in the input Tensor X_batch, returns the substring that starts at index pos (here 0) and has length len (here 300). Strings shorter than 300 are returned whole. By default the length is counted in bytes.
tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ") - Replaces every part of each element of X_batch that matches the regex pattern <br\s*/?> with the rewrite string (here a single space).
tf.strings.split(X_batch) - Splits each element of X_batch on whitespace and returns the pieces as a RaggedTensor.
X_batch.to_tensor(default_value=b"<pad>") - Converts the RaggedTensor into a regular tf.Tensor. default_value is the value used for positions not present in the RaggedTensor, so shorter rows are filled with default_value (here <pad>).
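
The difference between the ragged and the dense tensor is easiest to see on a tiny batch. The sketch below uses two made-up, already-cleaned strings (assumptions for illustration, not data from the IMDB set) and only the tf.strings.split and to_tensor calls described above.

```python
import tensorflow as tf

# Two made-up, already-cleaned reviews of different lengths.
X_batch = tf.constant([b"Great movie  Loved it ", b"Bad "])

# Splitting on whitespace gives rows of different lengths: a RaggedTensor.
ragged = tf.strings.split(X_batch)
print(ragged.row_lengths())   # number of words in each review

# Converting to a dense tensor fills the shorter row with the padding token.
dense = ragged.to_tensor(default_value=b"<pad>")
print(dense.shape)
print(dense)
```

The dense tensor has one row per review and as many columns as the longest review in the batch has words.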

INSTRUCTIONS

Use the following code to preprocess the data as described above:

def preprocess(X_batch, y_batch):
    X_batch = tf.strings.substr(X_batch, 0, 300)
    X_batch = tf.strings.regex_replace(X_batch, rb"<br\s*/?>", b" ")
    X_batch = tf.strings.regex_replace(X_batch, b"[^a-zA-Z']", b" ")
    X_batch = tf.strings.split(X_batch)
    return X_batch.to_tensor(default_value=b"<pad>"), y_batch

Now call the preprocess() function on X_batch, y_batch to see what the data looks like after preprocessing:
```
<< your code comes here >>(X_batch, y_batch)
```

