Transcript: What is data modeling? It is the process of structuring your data using the constructs provided by a datastore to solve your problem. On one side, you have business requirements, and on the other side, you have the features and limitations of the database. So, we need to model our data keeping the limitations of the database in view and utilizing its features in order to solve the business problem.
Let us understand the data model in HBase. HBase is based on Google's paper on Bigtable, and as per the definition of Google's Bigtable, it is basically a map, meaning it stores data in the form of keys and values. This map is sorted by the key and is multidimensional: the value can have any number of dimensions. It is persistent: the data that is saved into HBase remains there even after a reboot. This map is distributed across multiple machines. The dimensions or columns defined for one key can be entirely different from those of another key.
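To make the "sorted map" idea concrete, here is a rough sketch in Python. It is only a conceptual model, not HBase code: a plain dictionary stands in for the table, the row keys and column names are made up for illustration, and the scan function is a hypothetical helper that mimics reading rows in key order.

```python
# Conceptual sketch only: a plain in-memory dict standing in for an HBase table.
# Row keys and values are bytes; the rows and columns below are invented.
table = {
    b"row-002": {b"name": b"Bob", b"home_phone": b"555-0102"},
    b"row-001": {b"name": b"Alice"},
    b"row-003": {b"office_address": b"12 Main St"},
}


def scan(table):
    """Yield (row_key, columns) pairs in ascending byte order of the row key."""
    for row_key in sorted(table):
        yield row_key, table[row_key]


# Rows come back ordered by key, and each row carries its own set of columns.
for row_key, columns in scan(table):
    print(row_key, sorted(columns))
```

Notice that each row has a different set of columns, which is the point made above: the columns of one key need not match those of another.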
Let us understand the basic building blocks of the data model in HBase, starting with the HBase table. HBase organizes data into tables. Table names are strings that are safe to use in a file system path. Within a table, data is stored according to its row. Rows are identified uniquely by their row key; row keys do not have a data type and are treated as byte arrays.
Data within a row is grouped by column family. This grouping affects the physical arrangement of the data. The column families must be defined upfront and are not easily modifiable. Every row in a table has the same column families, but a row need not store data in all of its column families. The names of column families should be strings that are safe for use in a file system path, because HBase creates a directory with the same name as the column family.
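As a small illustration of these rules, the sketch below models a table whose column families are fixed when the table is created, while each row is free to leave a family empty. The class SketchTable and its put method are hypothetical names made up for this example; they are not part of any HBase client API.

```python
class SketchTable:
    """Toy model of a table with column families fixed at creation time."""

    def __init__(self, name, families):
        self.name = name
        self.families = frozenset(families)  # defined upfront, not changed later
        self.rows = {}  # row_key -> family -> qualifier -> value

    def put(self, row_key, family, qualifier, value):
        # Writes to a family that was never declared are rejected.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


people = SketchTable("people", ["personal", "office"])
people.put(b"r1", "personal", b"name", b"Alice")  # r1 stores nothing in "office"
people.put(b"r2", "office", b"phone", b"555-0199")  # r2 stores nothing in "personal"
```

Both rows belong to a table with the same two column families, yet each row only holds data in one of them.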
Data within a column family is addressed by a column identifier, or column qualifier. Column qualifiers need not be specified in advance, and they need not be consistent between rows. Like row keys, column qualifiers do not have a data type; they are always treated as a byte array (byte[]). In other words, the row key and column name can store any kind of binary data. Together, the table name, row key, column family, column qualifier, and version identify a cell. Cell values are also binary data (byte[]), meaning each cell can store any kind of data.
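One way to picture how a cell is identified is as a lookup by a composite key. The sketch below is an assumption-laden illustration, not HBase code: CellKey, put, and get are hypothetical names, and the table name is left out because the sketch models a single table.

```python
from typing import NamedTuple


class CellKey(NamedTuple):
    """Coordinates of one cell within a single (implicit) table."""
    row_key: bytes
    family: str
    qualifier: bytes
    version: int


cells = {}  # CellKey -> value (bytes)


def put(row_key, family, qualifier, version, value):
    cells[CellKey(row_key, family, qualifier, version)] = value


def get(row_key, family, qualifier, version):
    # Returns None when no cell exists at these exact coordinates.
    return cells.get(CellKey(row_key, family, qualifier, version))


# Qualifiers are arbitrary bytes and are not declared in advance.
put(b"r1", "personal", b"name", 1001, b"Alice")
put(b"r1", "personal", b"\x00\x01", 1001, b"\xff\xfe")
```

Changing any one of the coordinates, including the version, addresses a different cell.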
Values within a cell are versioned. The version number is by default the timestamp at the time of writing: if you do not specify it when writing, the current time is used. If you do not specify the version number when reading data, the latest value is returned. The number of versions retained by HBase is configured for each column family. The default number of cell versions was three in older releases; since HBase 0.96 it is one.
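The versioning behavior can be sketched roughly as follows. This is a simplified model with made-up names (VersionedCell, max_versions): it drops old versions immediately on write, whereas a real HBase cluster cleans up excess versions later, so treat it only as an illustration of the read and write rules described above.

```python
import time


class VersionedCell:
    """Toy model of one cell that keeps at most max_versions values."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._versions = {}  # timestamp (int) -> value (bytes)

    def put(self, value, timestamp=None):
        # With no explicit version, the current time in milliseconds is used.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self._versions[timestamp] = value
        # Keep only the newest max_versions entries (simplification: eager pruning).
        for old in sorted(self._versions, reverse=True)[self.max_versions:]:
            del self._versions[old]

    def get(self, timestamp=None):
        if not self._versions:
            return None
        # With no explicit version, the latest value is returned.
        if timestamp is None:
            timestamp = max(self._versions)
        return self._versions.get(timestamp)


cell = VersionedCell(max_versions=3)
for ts in (1, 2, 3, 4):
    cell.put(b"v%d" % ts, timestamp=ts)
```

After the four writes, a read without a version returns the value written at timestamp 4, and the value written at timestamp 1 is no longer available because only three versions are kept.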
This is the logical view of a table. It has two column families, personal and office. In the personal column family, we have the name and home phone number columns, while in the office column family, we have the phone and address columns.
The physical view is represented hierarchically: each row key contains column families, which contain columns, then timestamps, and then the values.
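The hierarchy can be written out as nested maps. The sketch below reuses the personal and office column families from the example table; the row key, timestamps, and values are invented for illustration, and read_latest is a hypothetical helper rather than an HBase call.

```python
# row key -> column family -> column qualifier -> timestamp -> value
table = {
    b"row1": {
        "personal": {
            b"name": {1002: b"Alice B.", 1001: b"Alice"},
            b"home_phone": {1001: b"555-0100"},
        },
        "office": {
            b"phone": {1001: b"555-0199"},
            b"address": {1001: b"12 Main St"},
        },
    },
}


def read_latest(table, row_key, family, qualifier):
    """Return the value with the highest timestamp for the given column."""
    versions = table[row_key][family][qualifier]
    return versions[max(versions)]


print(read_latest(table, b"row1", "personal", b"name"))
```

Walking down the nesting one level at a time follows the same order as the physical view: key, column family, column, timestamp, value.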