In this final step, we will scrape the review page and save the data in a CSV file named amazon_product_reviews.csv.
Here, we will use XPath to navigate through elements and attributes in an XML document. XPath stands for XML Path Language. XPath uses a path-like syntax to identify and navigate nodes in an XML document. In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. XML documents are treated as trees of nodes. The topmost element of the tree is called the root element. Here is an example of an XML document:
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book>
    <title lang="en">A Game of Thrones</title>
    <author>George R. R. Martin</author>
    <year>1996</year>
  </book>
</library>
Example of nodes in the XML document above:
<library> (root element node)
<author>George R. R. Martin</author> (element node)
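To make the idea of a tree of nodes concrete, here is a minimal sketch that loads the library document above and inspects a few node kinds. It uses Python's standard-library xml.etree.ElementTree module, which is a choice made for this illustration and not something this tutorial prescribes; the variable names are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# The example document from above, held as a string for this sketch.
XML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book>
    <title lang="en">A Game of Thrones</title>
    <author>George R. R. Martin</author>
    <year>1996</year>
  </book>
</library>"""

# Parse from bytes so the parser reads the encoding from the XML declaration.
root = ET.fromstring(XML_DOC.encode("utf-8"))

book = root.find("book")          # element node, child of the root
title = book.find("title")        # element node, child of <book>
child_tags = [child.tag for child in book]

print(root.tag)       # root element node: library
print(child_tags)     # element nodes under <book>
print(title.attrib)   # attribute node(s) of <title>
print(title.text)     # text node inside <title>
```

The root element has one book child, which in turn holds the title, author, and year element nodes; the lang attribute and the title text hang off the title element.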
XPath uses path expressions to select nodes in an XML document by following a path. Here are some of the path expressions we will be using:
nodename = Selects all nodes with the name "nodename"
/ = Selects from the root node
// = Selects nodes in the document from the current node that match the selection no matter where they are
. = Selects the current node
.. = Selects the parent of the current node
@ = Selects attributes
In our case, the web page would look something like this:
<Reviews>
  <Review>
    <title> </title>
    <author> </author>
  </Review>
  <Review>
    <title> </title>
    <author> </author>
  </Review>
</Reviews>
Our code will visit each Review tag in turn and collect the information inside it.
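The path expressions listed earlier can be tried on a filled-in version of this skeleton. The sketch below assumes the lxml library (suggested by the parser.xpath calls used later, but not confirmed here); the review titles, author names, and the lang attribute are made-up placeholder data.

```python
from lxml import etree

# Placeholder data shaped like the skeleton above (values are invented).
REVIEWS_XML = b"""<Reviews>
  <Review>
    <title lang="en">Great read</title>
    <author>Alice</author>
  </Review>
  <Review>
    <title lang="en">Too long</title>
    <author>Bob</author>
  </Review>
</Reviews>"""

root = etree.fromstring(REVIEWS_XML)

review_nodes = root.xpath("/Reviews/Review")   # '/' starts at the root node
all_titles = root.xpath("//title/text()")      # '//' matches at any depth
langs = root.xpath("//title/@lang")            # '@' selects attributes

first_review = review_nodes[0]
own_author = first_review.xpath("./author/text()")  # '.' is the current node
parent_tag = first_review.xpath("..")[0].tag        # '..' is the parent node

print(all_titles, langs, own_author, parent_tag)
```

Note that every xpath call returns a list, even when only one node matches.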
Define a variable reviews_df as a pandas DataFrame to store the review data
import pandas as pd  # harmless if pandas was already imported in an earlier step
reviews_df = pd.DataFrame()
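Set the XPath that matches each review block and select those blocks with the parser (assumed here to be the page parsed in an earlier step)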
xpath_reviews = '//div[@data-hook="review"]'
reviews = parser.xpath(xpath_reviews)
Set the XPath for each field (rating, title, author, etc.) using the path expressions discussed above. The leading . makes each expression relative to the current review block
xpath_rating = './/i[@data-hook="review-star-rating"]//text()'
xpath_title = './/a[@data-hook="review-title"]//text()'
xpath_author = './/span[@class="a-profile-name"]//text()'
xpath_date = './/span[@data-hook="review-date"]//text()'
xpath_body = './/span[@data-hook="review-body"]//text()'
xpath_helpful = './/span[@data-hook="helpful-vote-statement"]//text()'
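To see why the field expressions start with a dot, the sketch below runs two of them against a tiny hand-written page. It assumes that parser is an lxml HTML tree built with lxml.html.fromstring, which is a guess at how the earlier step created it; the sample HTML is invented and far simpler than a real product page.

```python
from lxml import html

# Invented sample markup that only mimics the attributes used above.
SAMPLE_HTML = """
<html><body>
  <div data-hook="review">
    <a data-hook="review-title"><span>Solid purchase</span></a>
    <span class="a-profile-name">Alice</span>
  </div>
  <div data-hook="review">
    <a data-hook="review-title"><span>Not for me</span></a>
    <span class="a-profile-name">Bob</span>
  </div>
</body></html>
"""

parser = html.fromstring(SAMPLE_HTML)   # assumption: how `parser` is created
reviews = parser.xpath('//div[@data-hook="review"]')

first = reviews[0]
# './/' searches only inside the current review block.
relative_authors = first.xpath('.//span[@class="a-profile-name"]//text()')
# '//' without the dot searches the whole document, even from `first`.
document_authors = first.xpath('//span[@class="a-profile-name"]//text()')
first_title = first.xpath('.//a[@data-hook="review-title"]//text()')

print(relative_authors)
print(document_authors)
print(first_title)
```

Dropping the leading dot would make every review row pick up the authors of all reviews on the page, which is why the expressions above are written relative to the current node.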
Scrape the data
rows = []  # one dictionary per review
for review in reviews:
    rating = review.xpath(xpath_rating)
    title = review.xpath(xpath_title)
    author = review.xpath(xpath_author)
    date = review.xpath(xpath_date)
    body = review.xpath(xpath_body)
    helpful = review.xpath(xpath_helpful)
    review_dict = {'rating': rating,
                   'title': title,
                   'author': author,
                   'date': date,
                   'body': body,
                   'helpful': helpful}
    rows.append(review_dict)
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
reviews_df = pd.DataFrame(rows)
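Each review.xpath(...) call returns a list of text fragments, so every cell in reviews_df holds a list and not a plain string. If plain strings are preferred, one option is to join the fragments before storing them. The helper below is hypothetical (clean_text is not part of this tutorial or of any library) and is only a sketch of that idea.

```python
def clean_text(fragments):
    """Join text fragments from an XPath text() query into one trimmed string."""
    parts = [fragment.strip() for fragment in fragments]
    # Skip fragments that were only whitespace, then join with single spaces.
    return " ".join(part for part in parts if part)


print(clean_text(["\n  Great ", "product", "  "]))
print(repr(clean_text([])))   # an empty result list becomes an empty string
```

It could be applied when building the dictionary, for example 'title': clean_text(title), so that a missing field shows up as an empty string and not as an empty list.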
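Save the scraped data in the CSV file. Because sep='\t' is passed, the columns are separated by tabs and not commas, despite the .csv extension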
reviews_df.to_csv("amazon_product_reviews.csv", sep='\t', encoding='utf-8')
Print the data we just saved
print(reviews_df)
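Since the file is written with a tab separator, it has to be read back with the same separator. The sketch below shows the round trip on a one-row DataFrame of made-up data, writing to an in-memory buffer (an assumption made to keep the example self-contained) in place of amazon_product_reviews.csv.

```python
import io

import pandas as pd

# Made-up single review standing in for the scraped data.
sample_df = pd.DataFrame([{"rating": "5.0 out of 5 stars",
                           "title": "Solid purchase"}])

buffer = io.StringIO()
sample_df.to_csv(buffer, sep="\t")   # same separator as in the step above
buffer.seek(0)

# index_col=0 restores the row index that to_csv wrote as the first column.
restored = pd.read_csv(buffer, sep="\t", index_col=0)
print(restored)
```

Reading the real file works the same way, with the file name passed in place of the buffer.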