I recently discovered a nice simple library called Dask.
Parallel computing basically means performing multiple tasks in parallel – it could be on the same machine or on multiple machines. When it is on multiple machines, it is called distributed computing.
There are various libraries that support parallel computing such as Apache Spark, Tensorflow. A common characteristic you would find in most parallel computing libraries you would is the computational graph. A computational graph is essentially a directed acyclic graph or dependency graph.
Say, for example, we want to compute the value of Z from x and y and we are giving this:
y1 = y*4
x = y1*y1 + 3
Z = x/3 – 4y1
This computation can be expressed as a dependency graph:
Z depends on x and y1 and x depends on y1 and y1 depends on y. This graph is evaluated only when you need not automatically. This helps in optimization.
Here is an analogy of lazy evaluation (Starts at 1 min.):
I hope this helps you understanding what is the parallel computing and lazy evaluation.
Here is an snippet of code to compute the mean price
import dask.dataframe as dd
df = dd.read_csv('/cxldata/datasets/project/ny_stock_prediction/fundamentals.csv')
df.groupby(df["Ticker Symbol"])["Earnings Per Share"].mean().compute()
This should print something like the following:
Ticker Symbol AAL -0.3600 AAP 5.9625 AAPL 16.0375 ABBV 2.2800 ABC 2.3025 … YHOO 1.8950 YUM 2.8025 ZBH 3.4625 ZION 1.3575 ZTS 0.9500 Name: Earnings Per Share, Length: 448, dtype: float64
If you have to achieve the same thing using pandas, the code will look like the following:
import pandas as pd
df = pd.read_csv('/cxldata/datasets/project/ny_stock_prediction/fundamentals.csv')
df.groupby(df["Ticker Symbol"])["Earnings Per Share"].mean()
Did you spot the difference between pandas and Dask? The only difference is an extra “compute()” in the case of Dask.
Learn more about dask here: https://docs.dask.org/en/latest/