Only Few Seats Left for Advanced Certification Courses on Data Science, ML & AI by E&ICT Academy IIT RoorkeeApply Now
Now we will create a transformer that we will use to convert emails to word counters. Here we will use NLTK for stemming, and Regex for to replace URLs with the word "URL".
Stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes or to the roots of words known as a lemma. Stemming and AI knowledge extract meaningful information from vast sources like big data or the Internet since additional forms of a word related to a subject may need to be searched to get the best results. Stemming is also a part of queries and Internet search engines.
Here we will split sentences into words using Python's split() method, which uses whitespaces for word boundaries.
Copy paste the following code for the transformer as is:
import nltk from sklearn.base import BaseEstimator, TransformerMixin url_extractor = None stemmer = nltk.PorterStemmer() class EmailToWordCounterTransformer(BaseEstimator, TransformerMixin): def __init__(self, strip_headers=True, lower_case=True, remove_punctuation=True, replace_urls=True, replace_numbers=True, stemming=True): self.strip_headers = strip_headers self.lower_case = lower_case self.remove_punctuation = remove_punctuation self.replace_urls = replace_urls self.replace_numbers = replace_numbers self.stemming = stemming def fit(self, X, y=None): return self def transform(self, X, y=None): X_transformed =  for email in X: text = email_to_text(email) or "" if self.lower_case: text = text.lower() if self.replace_urls and url_extractor is not None: urls = list(set(url_extractor.find_urls(text))) urls.sort(key=lambda url: len(url), reverse=True) for url in urls: text = text.replace(url, " URL ") if self.replace_numbers: text = re.sub(r'\d+(?:\.\d*(?:[eE]\d+))?', 'NUMBER', text) if self.remove_punctuation: text = re.sub(r'\W+', ' ', text, flags=re.M) word_counts = Counter(text.split()) if self.stemming and stemmer is not None: stemmed_word_counts = Counter() for word, count in word_counts.items(): stemmed_word = stemmer.stem(word) stemmed_word_counts[stemmed_word] += count word_counts = stemmed_word_counts X_transformed.append(word_counts) return np.array(X_transformed)
No hints are availble for this assesment
Answer is not availble for this assesment