Parallel Computing with Dask

Dask collections and schedulers
Source: dask.org

I recently discovered a nice simple library called Dask.

Parallel computing basically means performing multiple tasks in parallel – it could be on the same machine or on multiple machines. When it is on multiple machines, it is called distributed computing.

There are various libraries that support parallel computing such as Apache Spark, Tensorflow. A common characteristic you would find in most parallel computing libraries you would is the computational graph. A computational graph is essentially a directed acyclic graph or dependency graph.

Say, for example, we want to compute the value of Z from x and y and we are giving this:

y1 = y*4

x = y1*y1 + 3

Z = x/3 – 4y1

This computation can be expressed as a dependency graph:

Directed Acyclic Graph representing computation

Z depends on x and y1 and x depends on y1 and y1 depends on y. This graph is evaluated only when you need not automatically. This helps in optimization.

Here is an analogy of lazy evaluation (Starts at 1 min.):

How does the Lazy Evaluation helps!

I hope this helps you understanding what is the parallel computing and lazy evaluation.

Here is an snippet of code to compute the mean price

import dask.dataframe as dd
df = dd.read_csv('/cxldata/datasets/project/ny_stock_prediction/fundamentals.csv')
df.groupby(df["Ticker Symbol"])["Earnings Per Share"].mean().compute()

This should print something like the following:

Ticker Symbol
 AAL     -0.3600
 AAP      5.9625
 AAPL    16.0375
 ABBV     2.2800
 ABC      2.3025
          …
 YHOO     1.8950
 YUM      2.8025
 ZBH      3.4625
 ZION     1.3575
 ZTS      0.9500
 Name: Earnings Per Share, Length: 448, dtype: float64

If you have to achieve the same thing using pandas, the code will look like the following:

import pandas as pd
 df = pd.read_csv('/cxldata/datasets/project/ny_stock_prediction/fundamentals.csv')
 df.groupby(df["Ticker Symbol"])["Earnings Per Share"].mean()

Did you spot the difference between pandas and Dask? The only difference is an extra “compute()” in the case of Dask.

Learn more about dask here: https://docs.dask.org/en/latest/