Enrollments Open for Advanced Certification Courses on Data Science, ML & AI by E&ICT Academy IIT RoorkeeApply Now
Now let us create a preprocessing function.
The following function first drops thesection, then converts all
<a>tags to the word HYPERLINK, then it gets rid of all HTML tags, leaving only the plain text. For readability, it also replaces multiple newlines with single newlines, and finally it unescapes html entities:
import re from html import unescape def html_to_plain_text(html): text = re.sub('<head.*?>.*?</head>', '', html, flags=re.M | re.S | re.I) text = re.sub('<a\s.*?>', ' HYPERLINK ', text, flags=re.M | re.S | re.I) text = re.sub('<.*?>', '', text, flags=re.M | re.S) text = re.sub(r'(\s*\n)+', '\n', text, flags=re.M | re.S) return unescape(text)
re module provides regular expression matching operations similar to those found in Perl. With the help of
html.unescape method, we can convert the ascii string into html script by replacing ascii characters with special characters by using
re.sub returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed.
No hints are availble for this assesment
Answer is not availble for this assesment