Scraping Amazon Product Reviews - Get page data

In this final step, we will scrape the URL and save the data in a CSV file named amazon_product_reviews.csv.

Here, we will use XPath (XML Path Language) to navigate through the elements and attributes of the page. XPath uses a path-like syntax to identify and navigate nodes in an XML document, and the same expressions work on an HTML page once it has been parsed into a tree.

XML documents are treated as trees of nodes, and the topmost element of the tree is called the root element. In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. Here is an example of an XML document:

<?xml version="1.0" encoding="UTF-8"?>

<library>
  <book>
    <title lang="en">A Game of Thrones</title>
    <author>George R. R. Martin</author>
    <year>1996</year>
  </book>
</library>

Example of nodes in the XML document above:

<library> (root element node)

<author>George R. R. Martin</author> (element node)

XPath uses path expressions to select nodes in an XML document by following a path. Here are some of the path expressions we will be using:

nodename = Selects all nodes with the name "nodename"
/ = Selects from the root node
// = Selects nodes in the document from the current node that match the selection no matter where they are
. = Selects the current node
.. = Selects the parent of the current node
@ = Selects attributes
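
The path expressions listed above can be tried out on the library document from earlier. The following is a minimal sketch that uses Python's standard-library xml.etree.ElementTree module, chosen here only so that it runs without extra packages. That module supports a subset of XPath (for example, it rejects absolute paths that start with / and cannot return attributes or text nodes directly), so the expressions are written relative to the root element. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The library document from the example above (XML declaration omitted).
xml_doc = """
<library>
  <book>
    <title lang="en">A Game of Thrones</title>
    <author>George R. R. Martin</author>
    <year>1996</year>
  </book>
</library>
"""

root = ET.fromstring(xml_doc.strip())  # root is the <library> element

# nodename: direct children of the current node named "book"
books = root.findall("book")

# . and //: start at the current node and match <title> at any depth below it
titles = [t.text for t in root.findall(".//title")]

# ..: step from a matched <year> up to its parent, which is <book>
parent_of_year = root.find(".//year/..")

# @: keep only the <title> elements whose lang attribute equals "en"
english_titles = [t.text for t in root.findall(".//title[@lang='en']")]

# Attribute values are read with .get() in ElementTree
lang = root.find(".//title").get("lang")

print(len(books), titles, parent_of_year.tag, english_titles, lang)
```

A full XPath engine also accepts absolute expressions such as //title and can return text or attribute nodes directly, which is what the expressions in the instructions below rely on.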

In our case, the structure of the web page looks something like this:

<Reviews>
    <Review>
        <title> </title>
        <author> </author>
    </Review>
    <Review>
        <title> </title>
        <author> </author>
    </Review>
</Reviews>

Our code will visit each Review element in turn and collect the information from the tags inside it.
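
As a rough illustration of that loop, the sketch below fills the Reviews layout with made-up values and collects one dictionary per Review element. The sample titles and author names are hypothetical, and the standard-library xml.etree.ElementTree module is used only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data that follows the <Reviews> layout sketched above.
sample = """
<Reviews>
    <Review>
        <title>Great value</title>
        <author>Reader One</author>
    </Review>
    <Review>
        <title>Not for me</title>
        <author>Reader Two</author>
    </Review>
</Reviews>
"""

root = ET.fromstring(sample.strip())

rows = []
for review in root.findall("Review"):         # one iteration per <Review>
    rows.append({
        "title": review.findtext("title"),    # text of the <title> child
        "author": review.findtext("author"),  # text of the <author> child
    })

for row in rows:
    print(row)
```

Each dictionary in rows corresponds to one review, which is the same shape of data the instructions below store in a DataFrame.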

INSTRUCTIONS

Define a variable reviews_df as a pandas DataFrame to store the review data
```
reviews_df = pd.DataFrame()
```

Select every review block on the page. Here, parser is assumed to be the parsed HTML tree from the previous step.

xpath_reviews = '//div[@data-hook="review"]'
reviews = parser.xpath(xpath_reviews)

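Define XPath expressions for the individual fields (rating, title, author, and so on) using the path expressions discussed above. The leading . makes each expression relative to the current review node.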

xpath_rating  = './/i[@data-hook="review-star-rating"]//text()'
xpath_title   = './/a[@data-hook="review-title"]//text()'
xpath_author  = './/span[@class="a-profile-name"]//text()'
xpath_date    = './/span[@data-hook="review-date"]//text()'
xpath_body    = './/span[@data-hook="review-body"]//text()'
xpath_helpful = './/span[@data-hook="helpful-vote-statement"]//text()'
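
To see how these expressions behave, here is a small sketch that runs some of them against a hand-written HTML fragment. It assumes the lxml package is installed and that parser in this lesson is an lxml tree. The markup, the sample values, and the join_text helper are hypothetical and far simpler than a real review page.

```python
from lxml import html  # third-party package: pip install lxml

# Hypothetical, heavily simplified markup that reuses the data-hook names above.
sample_page = """
<html><body>
  <div data-hook="review">
    <span class="a-profile-name">Reader One</span>
    <i data-hook="review-star-rating"><span>5.0 out of 5 stars</span></i>
    <a data-hook="review-title"><span>Great value</span></a>
  </div>
  <div data-hook="review">
    <span class="a-profile-name">Reader Two</span>
    <i data-hook="review-star-rating"><span>2.0 out of 5 stars</span></i>
    <a data-hook="review-title"><span>Not for me</span></a>
  </div>
</body></html>
"""

parser = html.fromstring(sample_page)
reviews = parser.xpath('//div[@data-hook="review"]')


def join_text(node, expression):
    """Join the text nodes matched by expression and trim whitespace."""
    return ''.join(node.xpath(expression)).strip()


results = []
for review in reviews:
    author = join_text(review, './/span[@class="a-profile-name"]//text()')
    rating = join_text(review, './/i[@data-hook="review-star-rating"]//text()')
    # No helpful-vote element exists in the sample, so this is an empty string.
    helpful = join_text(review, './/span[@data-hook="helpful-vote-statement"]//text()')
    results.append((author, rating, helpful))

for row in results:
    print(row)
```

Because xpath() returns a list of text nodes, an expression that matches nothing yields an empty list rather than an error. Joining the list, as join_text does, is one way to end up with a plain string per field.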

Scrape the data

for review in reviews:
    # Each xpath() call returns a list of the matching text nodes
    rating  = review.xpath(xpath_rating)
    title   = review.xpath(xpath_title)
    author  = review.xpath(xpath_author)
    date    = review.xpath(xpath_date)
    body    = review.xpath(xpath_body)
    helpful = review.xpath(xpath_helpful)

    review_dict = {'rating': rating,
                   'title': title,
                   'author': author,
                   'date': date,
                   'body': body,
                   'helpful': helpful}

    # DataFrame.append was removed in pandas 2.0, so add the row with pd.concat
    reviews_df = pd.concat([reviews_df, pd.DataFrame([review_dict])],
                           ignore_index=True)

Save the scraped data to the CSV file (the sep='\t' argument makes the file tab-delimited)

reviews_df.to_csv("amazon_product_reviews.csv", sep='\t', encoding='utf-8')
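
As a sketch of an alternative way to assemble and check the output, the example below builds the DataFrame once from a list of row dictionaries, writes it with the same tab separator, and reads it back. The rows are hypothetical stand-ins for scraped reviews. The file goes to a temporary directory, and index=False is an added choice here (the call above keeps the index column).

```python
import os
import tempfile

import pandas as pd  # third-party package: pip install pandas

# Hypothetical rows standing in for the scraped reviews.
rows = [
    {'rating': '5.0 out of 5 stars', 'title': 'Great value', 'author': 'Reader One'},
    {'rating': '2.0 out of 5 stars', 'title': 'Not for me', 'author': 'Reader Two'},
]

# Build the DataFrame once from the collected rows.
reviews_df = pd.DataFrame(rows)

# Write to a temporary directory so the sketch does not overwrite real output.
out_path = os.path.join(tempfile.mkdtemp(), 'amazon_product_reviews.csv')
reviews_df.to_csv(out_path, sep='\t', encoding='utf-8', index=False)

# The same separator has to be passed when reading the file back.
loaded_df = pd.read_csv(out_path, sep='\t', encoding='utf-8')
print(loaded_df)
```

Collecting rows in a list and constructing the DataFrame at the end avoids copying the whole frame on every loop iteration, which can matter when there are many reviews.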

Print the data we just saved
```
print(reviews_df)
```
