Project - Exploring Web Scraping: Python Adventures on Wikipedia and Amazon

Scraping Amazon Product Reviews - Get page data

In this final step, we will scrape the review page and save the data to a file named amazon_product_reviews.csv.

Here, we will use XPath to navigate through the elements and attributes of the page. XPath stands for XML Path Language. It uses a path-like syntax to identify and navigate nodes in an XML document, and the same expressions work on an HTML page once it has been parsed into a tree. In XPath, there are seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document nodes. XML documents are treated as trees of nodes, and the topmost element of the tree is called the root element. Here is an example of an XML document:

<?xml version="1.0" encoding="UTF-8"?>

<library>
  <book>
    <title lang="en">A Game of Thrones</title>
    <author>George R. R. Martin</author>
    <year>1996</year>
  </book>
</library>

Example of nodes in the XML document above:

<library> (root element node)

<author>George R. R. Martin</author> (element node)

XPath uses path expressions to select nodes in an XML document by following a path. Here are some of the path expressions we will be using:

  • nodename = Selects all child nodes of the current node with the name "nodename"
  • / = Selects from the root node
  • // = Selects matching nodes at any depth: anywhere in the document when the expression starts with //, or anywhere below the current node when written as .//
  • . = Selects the current node
  • .. = Selects the parent of the current node
  • @ = Selects attributes
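
To see these expressions in action, here is a minimal sketch that runs them against the library document above. It uses Python's built-in xml.etree.ElementTree module rather than the parser used in this project, so it only illustrates the subset of XPath that ElementTree supports: paths must be relative (hence the leading .), and @ can be used only inside a predicate. The XML declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# The example document from above.
XML = """
<library>
  <book>
    <title lang="en">A Game of Thrones</title>
    <author>George R. R. Martin</author>
    <year>1996</year>
  </book>
</library>
"""

root = ET.fromstring(XML.strip())  # root is the <library> element

# nodename: child elements of the current node with that name
books = root.findall("book")

# //: match at any depth below the current node
titles = root.findall(".//title")

# . and /: an explicit path that starts at the current node
author = root.find("./book/author")

# ..: step from <title> up to its parent <book>
parents = root.findall(".//title/..")

# @: ElementTree supports attributes only inside a predicate,
# so the attribute value itself is read with .get()
english_titles = root.findall(".//title[@lang='en']")
lang = english_titles[0].get("lang")

print(len(books), titles[0].text, author.text, parents[0].tag, lang)
# prints: 1 A Game of Thrones George R. R. Martin book en
```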

In our case, the review section of the web page can be thought of as having a structure like this (simplified):

<Reviews>
    <Review>
        <title> </title>
        <author> </author>
    </Review>
    <Review>
        <title> </title>
        <author> </author>
    </Review>
</Reviews>

Our code will visit each Review node in turn and collect the information from the tags inside it.
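
The idea of visiting each review and reading its child tags can be sketched as follows. This is only an illustration on the simplified structure above: the sample titles and authors are made up, collect_reviews is a hypothetical helper name, and the built-in xml.etree.ElementTree module stands in for the parser used in the actual steps.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data in the simplified <Reviews> shape shown above.
SAMPLE = """
<Reviews>
    <Review>
        <title>Great read</title>
        <author>Alice</author>
    </Review>
    <Review>
        <title>Too long</title>
        <author>Bob</author>
    </Review>
</Reviews>
"""

def collect_reviews(xml_text):
    """Return one dict per <Review>, using paths relative to each review."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for review in root.findall(".//Review"):
        # Paths starting with "." are evaluated against this review only.
        rows.append({
            "title": review.findtext("./title", default="").strip(),
            "author": review.findtext("./author", default="").strip(),
        })
    return rows

rows = collect_reviews(SAMPLE)
print(rows)
```

The loop in the instructions below follows the same pattern: one query to find every review node, then relative queries to read the fields of each review.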

INSTRUCTIONS
  • Define a variable reviews_df as a pandas DataFrame to store the review data

    reviews_df = pd.DataFrame()
    
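  • Select all review nodes on the page with an XPath expression (parser is assumed to be the parsed page tree created in an earlier step)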

    xpath_reviews = '//div[@data-hook="review"]'
    reviews = parser.xpath(xpath_reviews)
    
  • Define the XPath for each field (rating, title, author, etc.) using the path expressions discussed above. Each one starts with . so that it is evaluated relative to a single review node

    xpath_rating  = './/i[@data-hook="review-star-rating"]//text()'
    xpath_title   = './/a[@data-hook="review-title"]//text()'
    xpath_author  = './/span[@class="a-profile-name"]//text()'
    xpath_date    = './/span[@data-hook="review-date"]//text()'
    xpath_body    = './/span[@data-hook="review-body"]//text()'
    xpath_helpful = './/span[@data-hook="helpful-vote-statement"]//text()'
    
  • Scrape the data

    rows = []
    for review in reviews:
        # Each xpath() call returns a list of text fragments for this review
        rating  = review.xpath(xpath_rating)
        title   = review.xpath(xpath_title)
        author  = review.xpath(xpath_author)
        date    = review.xpath(xpath_date)
        body    = review.xpath(xpath_body)
        helpful = review.xpath(xpath_helpful)

        review_dict = {'rating': rating,
                       'title': title,
                       'author': author,
                       'date': date,
                       'body': body,
                       'helpful': helpful}

        rows.append(review_dict)

    # DataFrame.append was removed in pandas 2.0, so build the frame from the list
    reviews_df = pd.DataFrame(rows)
    
  • Save the scraped data to the CSV file (note that sep='\t' makes the columns tab-separated)

    reviews_df.to_csv("amazon_product_reviews.csv", sep='\t', encoding='utf-8')
    
  • Print the data we just saved

    print(reviews_df)
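
Because each XPath query ending in text() returns a list of text fragments, every cell stored above is a list rather than a clean string. The sketch below shows one possible way to flatten those lists and to confirm that tab-separated output reads back correctly. It is not part of the original instructions: clean_field and FIELDS are hypothetical names, the sample values are invented, and the standard csv module with an in-memory buffer stands in for pandas and the real file.

```python
import csv
import io

FIELDS = ["rating", "title", "author", "date", "body", "helpful"]

def clean_field(fragments):
    """Join the text fragments of one field and collapse runs of whitespace."""
    return " ".join(" ".join(fragments).split())

# Hypothetical raw result for one review: each value is a list of strings,
# which is the shape an XPath query ending in text() produces.
raw_review = {
    "rating": ["4.0 out of 5 stars"],
    "title": ["\n  Worth reading\n"],
    "author": ["Alice"],
    "date": ["Reviewed on 1 January 2020"],
    "body": ["Long, but ", "hard to put down.\n"],
    "helpful": [],  # this review has no helpful-vote element
}

row = {name: clean_field(raw_review[name]) for name in FIELDS}

# Write tab-separated text to an in-memory buffer (stands in for the file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t")
writer.writeheader()
writer.writerow(row)

# Read it back with the same delimiter to confirm the round trip.
buffer.seek(0)
rows_read = list(csv.DictReader(buffer, delimiter="\t"))
print(rows_read[0]["body"])
```

When reading the saved file back with pandas, the same separator has to be passed, for example pd.read_csv("amazon_product_reviews.csv", sep='\t').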
    