Project - Building Spam Classifier


Spam Classifier - Create Transformer to Convert Word Counts to Vectors

Now that we have the word counts, we need to convert them to vectors. For this, we will build another transformer whose fit() method builds the vocabulary (an ordered list of the most common words) and whose transform() method uses that vocabulary to convert word counts to vectors. Each word in the vocabulary gets its own column, starting at index 1; column 0 collects the counts of all words that are not in the vocabulary. The output is a sparse matrix with one row per email.
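To make the two steps concrete, here is a minimal standard-library sketch of the same idea, using dense Python lists instead of a sparse matrix. The function names `build_vocabulary` and `to_vector` and the toy word counts are illustrative assumptions, not part of the exercise. As in the transformer in this step, each email's contribution to a word's total is capped at 10 when ranking words, and index 0 is reserved for words outside the vocabulary.

```python
from collections import Counter

def build_vocabulary(word_counts, vocabulary_size):
    """Map the most common words to indices 1..vocabulary_size (0 is reserved)."""
    total_count = Counter()
    for word_count in word_counts:
        for word, count in word_count.items():
            # Cap each email's contribution to a word's total at 10.
            total_count[word] += min(count, 10)
    most_common = total_count.most_common(vocabulary_size)
    return {word: index + 1 for index, (word, _) in enumerate(most_common)}

def to_vector(word_count, vocabulary, vocabulary_size):
    """Convert one word-count mapping to a dense list of length vocabulary_size + 1."""
    vector = [0] * (vocabulary_size + 1)
    for word, count in word_count.items():
        # Unknown words fall back to index 0.
        vector[vocabulary.get(word, 0)] += count
    return vector

# Made-up word counts for two emails.
emails = [
    Counter({"free": 3, "money": 1}),
    Counter({"free": 1, "meeting": 2, "agenda": 1}),
]
vocabulary = build_vocabulary(emails, vocabulary_size=2)
vectors = [to_vector(wc, vocabulary, 2) for wc in emails]
print(vocabulary)  # {'free': 1, 'meeting': 2}
print(vectors)     # [[1, 3, 0], [1, 1, 2]]
```

In the first vector, "money" is not among the two most common words, so its count of 1 lands in position 0.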

  • Create a transformer WordCounterToVectorTransformer to convert word counts to vectors.

    from collections import Counter
    from scipy.sparse import csr_matrix
    from sklearn.base import BaseEstimator, TransformerMixin
    class << your code goes here >>(BaseEstimator, TransformerMixin):
        def __init__(self, vocabulary_size=1000):
            self.vocabulary_size = vocabulary_size
        def fit(self, X, y=None):
            total_count = Counter()
            for word_count in X:
                for word, count in word_count.items():
                    total_count[word] += min(count, 10)
            most_common = total_count.most_common()[:self.vocabulary_size]
            self.most_common_ = most_common
            self.vocabulary_ = {word: index + 1 for index, (word, count) in enumerate(most_common)}
            return self
        def transform(self, X, y=None):
            rows = []
            cols = []
            data = []
            for row, word_count in enumerate(X):
                for word, count in word_count.items():
                    rows.append(row)
                    # Words outside the vocabulary map to column 0.
                    cols.append(self.vocabulary_.get(word, 0))
                    data.append(count)
            return csr_matrix((data, (rows, cols)), shape=(len(X), self.vocabulary_size + 1))
  • Now let's try the transformer we just created:

    vocab_transformer = WordCounterToVectorTransformer(vocabulary_size=10)
    X_few_vectors = vocab_transformer.fit_transform(X_few_wordcounts)
  • Finally, convert the resulting sparse matrix to a dense array:

    << your code goes here >>.toarray()
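
The role of column 0 is easier to see on a tiny example. The sketch below builds a sparse matrix directly from hand-written `rows`, `cols`, and `data` lists (made-up values, not the output of the exercise) and converts it to a dense array. It relies on entries that share the same (row, column) position being summed, which is how several out-of-vocabulary words in one email end up added together in column 0.

```python
from scipy.sparse import csr_matrix

# Hypothetical triplets for 2 emails and a vocabulary of 2 words (columns 1 and 2).
# Email 0 contains two different unknown words, so column 0 appears twice in row 0.
rows = [0, 0, 0, 1]
cols = [0, 0, 1, 2]
data = [1, 2, 3, 4]

sparse = csr_matrix((data, (rows, cols)), shape=(2, 3))
dense = sparse.toarray()
print(dense)  # dense is [[3, 3, 0], [0, 0, 4]]
```

The two entries at position (0, 0) are combined into a single value of 3 in the dense array.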