{"id":3596,"date":"2021-06-22T20:33:38","date_gmt":"2021-06-22T20:33:38","guid":{"rendered":"https:\/\/cloudxlab.com\/blog\/?p=3596"},"modified":"2021-06-22T20:36:04","modified_gmt":"2021-06-22T20:36:04","slug":"parallel-computing-with-dask","status":"publish","type":"post","link":"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/","title":{"rendered":"Parallel Computing with Dask"},"content":{"rendered":"\n<figure class=\"wp-block-image is-resized\"><img src=\"https:\/\/docs.dask.org\/en\/latest\/_images\/dask-overview.svg\" alt=\"Dask collections and schedulers\" width=\"608\" height=\"357\"\/><figcaption>Source: dask.org<\/figcaption><\/figure>\n\n\n\n<p>I recently discovered a nice, simple library called Dask.<\/p>\n\n\n\n<p>Parallel computing basically means performing multiple tasks in parallel, either on the same machine or on multiple machines. When it is on multiple machines, it is called distributed computing.<\/p>\n\n\n\n<p>There are various libraries that support parallel computing, such as Apache Spark and TensorFlow. A common characteristic of most parallel computing libraries is the computational graph. A computational graph is essentially a directed acyclic graph, or dependency graph.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>Say, for example, we want to compute the value of Z from y, and we are given the following:<\/p>\n\n\n\n<p>y1 = y*4<\/p>\n\n\n\n<p>x = y1*y1 + 3<\/p>\n\n\n\n<p>Z = x\/3 &#8211; 4*y1<\/p>\n\n\n\n<p>This computation can be expressed as a dependency graph:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img src=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2021\/06\/image-22.png\" alt=\"\" class=\"wp-image-3597\" width=\"296\" height=\"183\"\/><figcaption>Directed Acyclic Graph representing computation<\/figcaption><\/figure>\n\n\n\n<p>Z depends on x and y1, x depends on y1, and y1 depends on y. This graph is evaluated only when you need the result, not automatically. 
This is called lazy evaluation, and it helps in optimization.<\/p>\n\n\n\n<p>Here is an analogy for lazy evaluation (starts at 1 min.):<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube\"><div class=\"wp-block-embed__wrapper\">\n<div style=\"max-width: 1778px;\"><div style=\"left: 0; width: 100%; height: 0; position: relative; padding-bottom: 56.25%;\"><iframe title=\"4.11. Apache Spark Basics | Lazy Evaluation &amp; Lineage Graph\" src=\"https:\/\/www.youtube.com\/embed\/Gk3Xv_1coVA?rel=0&amp;start=61\" style=\"border: 0; top: 0; left: 0; width: 100%; height: 100%; position: absolute;\" allowfullscreen scrolling=\"no\" allow=\"encrypted-media; accelerometer; clipboard-write; gyroscope; picture-in-picture\"><\/iframe><\/div><\/div><script type=\"text\/javascript\">window.addEventListener(\"message\",function(e){\n                window.parent.postMessage(e.data,\"*\");\n            },false);<\/script>\n<\/div><figcaption>How lazy evaluation helps<\/figcaption><\/figure>\n\n\n\n<p>I hope this helps you understand what parallel computing and lazy evaluation are.<\/p>\n\n\n\n<p>Here is a snippet of code that computes the mean earnings per share for each ticker symbol:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">import dask.dataframe as dd\n# Build a lazy Dask DataFrame; the full file is not loaded yet\ndf = dd.read_csv('\/cxldata\/datasets\/project\/ny_stock_prediction\/fundamentals.csv')\n# compute() triggers the actual evaluation of the task graph\ndf.groupby(df[\"Ticker Symbol\"])[\"Earnings Per Share\"].mean().compute()<\/code><\/pre>\n\n\n\n<p>This should print something like the following:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">Ticker Symbol\n AAL     -0.3600\n AAP      5.9625\n AAPL    16.0375\n ABBV     2.2800\n ABC      2.3025\n          \u2026\n YHOO     1.8950\n YUM      2.8025\n ZBH      3.4625\n ZION     1.3575\n ZTS      0.9500\n Name: Earnings Per Share, Length: 448, dtype: float64<\/pre>\n\n\n\n<p>If you have to achieve the same thing using pandas, the code will look like the following:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">import pandas 
as pd\ndf = pd.read_csv('\/cxldata\/datasets\/project\/ny_stock_prediction\/fundamentals.csv')\ndf.groupby(df[\"Ticker Symbol\"])[\"Earnings Per Share\"].mean()<\/code><\/pre>\n\n\n\n<p>Did you spot the difference between pandas and Dask? Apart from the imported module and its alias, the only difference is an extra &#8220;compute()&#8221; call in the case of Dask.<\/p>\n\n\n\n<p>Learn more about Dask here: <a href=\"https:\/\/docs.dask.org\/en\/latest\/\">https:\/\/docs.dask.org\/en\/latest\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently discovered a nice simple library called Dask. Parallel computing basically means performing multiple tasks in parallel &#8211; it could be on the same machine or on multiple machines. When it is on multiple machines, it is called distributed computing. There are various libraries that support parallel computing such as Apache Spark, Tensorflow. A &hellip; <a href=\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Parallel Computing with Dask&#8221;<\/span><\/a><\/p>\n","protected":false},"author":14,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[124,123,126,125],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v16.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Parallel Computing with Dask | CloudxLab Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Parallel Computing with Dask | CloudxLab Blog\" \/>\n<meta property=\"og:description\" content=\"I recently discovered a nice simple library called Dask. 
Parallel computing basically means performing multiple tasks in parallel &#8211; it could be on the same machine or on multiple machines. When it is on multiple machines, it is called distributed computing. There are various libraries that support parallel computing such as Apache Spark, Tensorflow. A &hellip; Continue reading &quot;Parallel Computing with Dask&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/\" \/>\n<meta property=\"og:site_name\" content=\"CloudxLab Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cloudxlab\" \/>\n<meta property=\"article:published_time\" content=\"2021-06-22T20:33:38+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2021-06-22T20:36:04+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/docs.dask.org\/en\/latest\/_images\/dask-overview.svg\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@CloudxLab\" \/>\n<meta name=\"twitter:site\" content=\"@CloudxLab\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\">\n\t<meta name=\"twitter:data1\" content=\"2 minutes\">\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#website\",\"url\":\"https:\/\/cloudxlab.com\/blog\/\",\"name\":\"CloudxLab Blog\",\"description\":\"Learn AI, Machine Learning, Deep Learning, Devops &amp; Big Data\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/cloudxlab.com\/blog\/?s={search_term_string}\",\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/#primaryimage\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/docs.dask.org\/en\/latest\/_images\/dask-overview.svg\",\"contentUrl\":\"https:\/\/docs.dask.org\/en\/latest\/_images\/dask-overview.svg\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/#webpage\",\"url\":\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/\",\"name\":\"Parallel Computing with Dask | CloudxLab 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/#primaryimage\"},\"datePublished\":\"2021-06-22T20:33:38+00:00\",\"dateModified\":\"2021-06-22T20:36:04+00:00\",\"author\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/#\/schema\/person\/4835f1b3d5000626cb15e9311d748e09\"},\"breadcrumb\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"item\":{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/\",\"url\":\"https:\/\/cloudxlab.com\/blog\/\",\"name\":\"Home\"}},{\"@type\":\"ListItem\",\"position\":2,\"item\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/parallel-computing-with-dask\/#webpage\"}}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#\/schema\/person\/4835f1b3d5000626cb15e9311d748e09\",\"name\":\"Sandeep Giri\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1393214840cf7455bb4cba055cb30468?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1393214840cf7455bb4cba055cb30468?s=96&d=mm&r=g\",\"caption\":\"Sandeep Giri\"},\"sameAs\":[\"https:\/\/cloudxlab.com\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","_links":{"self":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/3596"}],"collection":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/comments?post=3596"}],"version-history":[{"count":3,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/3596\/revisions"}],"predecessor-version":[{"id":3601,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/3596\/revisions\/3601"}],"wp:attachment":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/media?parent=3596"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/categories?post=3596"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/tags?post=3596"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}