{"id":3882,"date":"2022-09-20T07:50:56","date_gmt":"2022-09-20T07:50:56","guid":{"rendered":"https:\/\/cloudxlab.com\/blog\/?p=3882"},"modified":"2022-09-20T08:03:49","modified_gmt":"2022-09-20T08:03:49","slug":"bucketing-clustered-by-and-cluster-by","status":"publish","type":"post","link":"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/","title":{"rendered":"Bucketing- CLUSTERED BY and CLUSTER BY"},"content":{"rendered":"\n<p>The bucketing in Hive is a data-organising technique. It is used to decompose data into more manageable parts, known as <em>buckets<\/em>, which in result, improves the performance of the queries. It is similar to partitioning, but with an added functionality of hashing technique. <\/p>\n\n\n\n<h2>Introduction<\/h2>\n\n\n\n<p>Bucketing, a.k.a clustering is a\u00a0technique to decompose data into buckets. In bucketing, Hive splits the data into a fixed number of buckets, according to a hash function over some set of columns. Hive ensures that all rows that have the same hash will be stored in the same bucket. However, a single bucket may contain multiple such groups.<\/p>\n\n\n\n<p>For example, bucketing the data in 3 buckets will look like-<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img src=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2022\/09\/Untitled.png\" alt=\"\" class=\"wp-image-3883\" width=\"626\" height=\"261\"\/><\/figure>\n\n\n\n<!--more-->\n\n\n\n<h2>How to perform bucketing?<\/h2>\n\n\n\n<p>Bucketing is performed by using <strong>CLUSTERED BY <\/strong>clause in Hive. You can use it like-<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">CREATE TABLE u_data_b( userid INT, movieid INT, rating INT, unixtime STRING)\n<span class=\"has-inline-color has-dark-red-color\">CLUSTERED BY(userid) INTO 32 BUCKETS<\/span>\nROW FORMAT DELIMITED\nFIELDS TERMINATED BY '\\t'\nSTORED AS TEXTFILE;\n\nINSERT OVERWRITE TABLE u_data_b\nSELECT * from u_data;\n<\/pre>\n\n\n\n<p>&nbsp;So, the above code will create only 32 buckets irrespective of the fact that how many unique values are in the <strong><em>userid<\/em><\/strong> column.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img src=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2022\/09\/06F3C757-EF6F-4096-A8CF-3E6A0B7483C3_1_201_a.jpeg\" alt=\"\" class=\"wp-image-3891\" width=\"865\" height=\"422\"\/><\/figure><\/div>\n\n\n\n<p>That further means, 32 sub-directories will be created which we can also see in the below figure:<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><img src=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2022\/09\/92B89022-ADAB-45AB-B95B-4823FEDE627E_4_5005_c.jpeg\" alt=\"\" class=\"wp-image-3894\" width=\"858\" height=\"399\"\/><\/figure><\/div>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-verse\">Note that bucketing is preferred in case of a column having high cardinality because in such case, partitioning can create too many partitions. Furthermore, partitioning will result in multiple Hadoop files which will increase the load on the same node as it has to carry the metadata of each of the partitions.<\/pre>\n\n\n\n<p><strong>CLUSTERED BY<\/strong> is often confused with the <strong>CLUSTER BY<\/strong> clause as they both sound the same. But, in reality, it is much different. Let&#8217;s understand the <strong>CLUSTER BY<\/strong> clause for more clarity.<\/p>\n\n\n\n<h2>CLUSTER BY<\/h2>\n\n\n\n<p><strong>CLUSTER BY&nbsp;<\/strong>is a shortcut for both&nbsp;<strong>DISTRIBUTE BY<\/strong>&nbsp;and&nbsp;<strong>SORT BY<\/strong>. It first, <em>distributes<\/em>&nbsp;the input rows among reducers as per <strong>DISTRIBUTE BY<\/strong> clause and then ensures the sorting order of values present in multiple reducers, giving global ordering, as per the <strong>SORT BY<\/strong> clause.<\/p>\n\n\n\n<p><strong>CLUSTER BY<\/strong> is used mainly with the <em>Transform\/Map-Reduce<\/em> Scripts. It is also sometimes useful in <strong>SELECT<\/strong> statements if there is a need to partition and sort the output of a query for subsequent queries.<\/p>\n\n\n\n<p>It&#8217;s syntax is:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><strong>CLUSTER BY colName (',' colName)*<\/strong><\/pre>\n\n\n\n<p>For example,<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"><strong>SELECT * FROM u_data WHERE userid &lt; 5 AND movieid &lt; 20 CLUSTER BY userid;<\/strong><\/pre>\n\n\n\n<p>So, the above example will give the output as:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img width=\"449\" height=\"497\" src=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2022\/09\/746A0F06-DAB8-4129-951B-A79071696B65_1_201_a.jpeg\" alt=\"\" class=\"wp-image-3886\"\/><\/figure>\n\n\n\n<p>As we can see above, the output is sorted in ascending order of the column <em>userid<\/em>.<\/p>\n\n\n\n<p>I hope it&#8217;s clear to all that in <strong>CLUSTER BY<\/strong> and <strong>CLUSTERED BY<\/strong>, the only thing common is that they sound the same.<\/p>\n\n\n\n<p>Happy Learning!!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The bucketing in Hive is a data-organising technique. It is used to decompose data into more manageable parts, known as buckets, which in result, improves the performance of the queries. It is similar to partitioning, but with an added functionality of hashing technique. Introduction Bucketing, a.k.a clustering is a\u00a0technique to decompose data into buckets. In &hellip; <a href=\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Bucketing- CLUSTERED BY and CLUSTER BY&#8221;<\/span><\/a><\/p>\n","protected":false},"author":36,"featured_media":3883,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[24,1],"tags":[174,175,177],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v16.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Bucketing- CLUSTERED BY and CLUSTER BY | CloudxLab Blog<\/title>\n<meta name=\"description\" content=\"Bucketing is used to decompose data into more manageable parts, known as buckets, which in result, improves the performance of the queries.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Bucketing- CLUSTERED BY and CLUSTER BY | CloudxLab Blog\" \/>\n<meta property=\"og:description\" content=\"Bucketing is used to decompose data into more manageable parts, known as buckets, which in result, improves the performance of the queries.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/\" \/>\n<meta property=\"og:site_name\" content=\"CloudxLab Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cloudxlab\" \/>\n<meta property=\"article:published_time\" content=\"2022-09-20T07:50:56+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-09-20T08:03:49+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2022\/09\/Untitled.png\" \/>\n\t<meta property=\"og:image:width\" content=\"626\" \/>\n\t<meta property=\"og:image:height\" content=\"261\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@CloudxLab\" \/>\n<meta name=\"twitter:site\" content=\"@CloudxLab\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\">\n\t<meta name=\"twitter:data1\" content=\"3 minutes\">\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#website\",\"url\":\"https:\/\/cloudxlab.com\/blog\/\",\"name\":\"CloudxLab Blog\",\"description\":\"Learn AI, Machine Learning, Deep Learning, Devops &amp; Big Data\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/cloudxlab.com\/blog\/?s={search_term_string}\",\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/#primaryimage\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/cloudxlab.com\/blog\/wp-content\/uploads\/2022\/09\/Untitled.png\",\"contentUrl\":\"https:\/\/cloudxlab.com\/blog\/wp-content\/uploads\/2022\/09\/Untitled.png\",\"width\":626,\"height\":261},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/#webpage\",\"url\":\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/\",\"name\":\"Bucketing- CLUSTERED BY and CLUSTER BY | CloudxLab Blog\",\"isPartOf\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/#primaryimage\"},\"datePublished\":\"2022-09-20T07:50:56+00:00\",\"dateModified\":\"2022-09-20T08:03:49+00:00\",\"author\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/#\/schema\/person\/4438d405318314ec50940bde93ef548a\"},\"description\":\"Bucketing is used to decompose data into more manageable parts, known as buckets, which in result, improves the performance of the queries.\",\"breadcrumb\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"item\":{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/\",\"url\":\"https:\/\/cloudxlab.com\/blog\/\",\"name\":\"Home\"}},{\"@type\":\"ListItem\",\"position\":2,\"item\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/bucketing-clustered-by-and-cluster-by\/#webpage\"}}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#\/schema\/person\/4438d405318314ec50940bde93ef548a\",\"name\":\"Shubh Tripathi\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/76bb13891affbf9da48fa9701d774ff0?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/76bb13891affbf9da48fa9701d774ff0?s=96&d=mm&r=g\",\"caption\":\"Shubh Tripathi\"}}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","_links":{"self":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/3882"}],"collection":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/users\/36"}],"replies":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/comments?post=3882"}],"version-history":[{"count":10,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/3882\/revisions"}],"predecessor-version":[{"id":3907,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/3882\/revisions\/3907"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/media\/3883"}],"wp:attachment":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/media?parent=3882"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/categories?post=3882"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/tags?post=3882"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}