Transcript: Let us take the data store of the Google search engine as an example.
A search engine displays the ordered results relevant to the user's search query.
Are these search results queried from the websites in real time or from Google's database? From Google's database.
Google keeps downloading websites, parsing them, and storing the data found on them. This process of downloading websites is called crawling, and the tool that does it is called a crawler.
On what basis are these results ordered?
The answer is All of the above.
To figure out how important a page is, Google assigns it a page rank based on how many other websites link to it. The more websites link to your page, the better its page rank, and links from more important websites, those that have a better page rank themselves, improve it even further. So the search engine needs to find and maintain who is linking to whom and with what name.
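To make the idea concrete, here is a minimal sketch of the simplified textbook form of PageRank, computed by repeated iteration. It is not Google's actual ranking code. The function name page_rank, the damping value 0.85, and the iteration count are assumptions chosen for illustration, and the link graph uses the three example sites from this walkthrough.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration (illustrative only).

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values, not taken from the transcript.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with the same rank for every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Who links to whom among the example sites.
links = {
    "microsoft.com": ["cnn.com", "si.com"],
    "si.com": ["cnn.com"],
    "cnn.com": [],
}

ranks = page_rank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this sketch cnn.com ends up with the highest score because both other sites link to it, and si.com ranks above microsoft.com because it has one incoming link while microsoft.com has none.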
Say we have the website www.cnn.com, which has some HTML content and is referred to by si.com as "CNN" and by microsoft.com as "CNN News". si.com also has some content and is referred to by microsoft.com as "Sports Illustrated". This can be represented as a graph, and the graph is stored in an HBase table. The table has two column families: contents and anchor. As the crawler crawls www.cnn.com, it first creates a row with the URL as the row key and adds a value in, say, the "HTML" column of the contents column family.
After a while, say the crawler crawls si.com and creates a row for it. It also notices that the content contains a link to cnn.com, so it adds a column 'si.com' to the anchor column family and sets its value in the cnn.com row to 'CNN'. Later, the crawler stumbles upon microsoft.com, which has hrefs to both cnn.com and si.com. It adds a row for microsoft.com, and then adds a 'microsoft.com' column to the cnn.com row and to the si.com row.
This process goes on. Notice that as more websites are crawled, both the rows and the columns keep increasing. Also notice that we are using the column name itself to store data: it holds the name of the referring site.
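The crawl sequence described above can be sketched with plain Python dictionaries standing in for the HBase table. This is only an in-memory model of the layout, not the HBase client API. The crawl function, the table variable, and the placeholder HTML strings are hypothetical names and values introduced for illustration.

```python
from collections import defaultdict

# table[row_key][column_family][column_name] = value
# Every row starts with the two column families from the example.
table = defaultdict(lambda: {"contents": {}, "anchor": {}})


def crawl(url, html, outgoing_links):
    """Record one crawled page in the in-memory table.

    outgoing_links maps each linked URL to the anchor text used for the link.
    """
    # Store the page content under the contents column family.
    table[url]["contents"]["HTML"] = html
    for target, anchor_text in outgoing_links.items():
        # The referring site becomes a column name in the target's row,
        # and the anchor text becomes the value of that cell.
        table[target]["anchor"][url] = anchor_text


# The placeholder HTML strings below are made up for the sketch.
crawl("www.cnn.com", "<html>cnn placeholder</html>", {})
crawl("si.com", "<html>si placeholder</html>", {"www.cnn.com": "CNN"})
crawl(
    "microsoft.com",
    "<html>microsoft placeholder</html>",
    {"www.cnn.com": "CNN News", "si.com": "Sports Illustrated"},
)

for row_key in sorted(table):
    print(row_key, table[row_key]["anchor"])
```

Each new crawl adds a row for the crawled site and may add columns to rows that already exist, which is why both dimensions of the table grow over time.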
We often need the data of all sub-domains under a domain to be stored together. Since the data in an HBase table is ordered by row key, using the reversed URL as the key brings the records of a domain together.
Let me show you. If the URL was the key, the data in HBase would be in alphabetical order as:
hdinsights.microsoft.com
learn.cloudxlab.com
mail.cloudxlab.com
outlook.microsoft.com
www.cloudxlab.com
www.microsoft.com
You can clearly see that all of the rows for Microsoft aren't together. But if we reverse the URLs, the records for the same domain would be together after sorting:
com.cloudxlab.learn
com.cloudxlab.mail
com.cloudxlab.www
com.microsoft.hdinsights
com.microsoft.outlook
com.microsoft.www
So, keeping the reversed URL as the row key brings the data for each domain together. Each cell can also keep the past few versions of its value, as shown in another example: the previous content of the website www.cnn.com is kept in the t3 and t5 versions.
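The effect of reversing the URL can be checked with a short sketch. It assumes the row key is built only from the host name, with the parts separated by dots. The helper name reverse_host is hypothetical, and Python's sorted stands in for the ordering HBase applies to row keys.

```python
def reverse_host(host):
    """Reverse the dot-separated parts of a host name."""
    return ".".join(reversed(host.split(".")))


hosts = [
    "www.microsoft.com",
    "learn.cloudxlab.com",
    "outlook.microsoft.com",
    "www.cloudxlab.com",
    "hdinsights.microsoft.com",
    "mail.cloudxlab.com",
]

# Sorting the plain host names interleaves the two domains.
plain_keys = sorted(hosts)

# Sorting the reversed host names groups each domain together.
reversed_keys = sorted(reverse_host(host) for host in hosts)

print(plain_keys)
print(reversed_keys)
```

Because all keys for a domain now share the same prefix, such as com.cloudxlab., they sit next to each other in sorted order.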