Running PySpark in Jupyter / IPython notebook

You can run PySpark code in a Jupyter notebook on CloudxLab. The following instructions cover Apache Spark versions 2.2, 2.3, 2.4, and 3.1.

What is the Jupyter Notebook?

The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media. For more details on the Jupyter Notebook, please see the Jupyter website.

Please follow the steps below to access the Jupyter notebook on CloudxLab.

To start a Python notebook, click on the “Jupyter” button under My Lab and then click on “New -> Python 3”.

The initialization code is also available in our GitHub repository here.

To access Spark, you have to set several environment variables and system paths. You can do that manually, or you can use a package that does all this work for you. For the latter, findspark is a suitable choice; it wraps up all these tasks in just two lines of code:

import findspark
findspark.init('/usr/spark2.4.3')

Here, we have used Spark version 2.4.3. You can specify any other installed version that you want to use instead. You can check the available Spark versions using the following command:

!ls /usr/spark*

If you choose to do the setup manually instead of using the package, you can access different versions of Spark by following the steps below.

If you want to access Spark 2.2, use the code below:

import os
import sys

os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.4-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

If you plan to use version 2.3, please use the code below to initialize:

import os
import sys

os.environ["SPARK_HOME"] = "/usr/spark2.3/"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")

If you plan to use version 2.4, please use the code below to initialize:

import os
import sys

os.environ["SPARK_HOME"] = "/usr/spark2.4.3"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
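Note that the py4j zip name in the last two lines is version-specific (0.10.4 for Spark 2.2 above, 0.10.7 for 2.3 and 2.4). If you would rather not hard-code it, a small sketch like the following can discover whichever py4j ships with the selected Spark; this glob-based lookup is our own suggestion, not part of the original steps:

import glob
import os
import sys

# Locate whatever py4j source zip ships with the chosen Spark release,
# so the version string never goes stale. Assumes PYLIB was set as above.
py4j_zip = glob.glob(os.path.join(os.environ["PYLIB"], "py4j-*-src.zip"))[0]
sys.path.insert(0, py4j_zip)
sys.path.insert(0, os.path.join(os.environ["PYLIB"], "pyspark.zip"))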

Now, initialize the entry points of Spark, SparkContext and SparkConf (old style):

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("appName")
sc = SparkContext(conf=conf)
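One caveat: a notebook kernel can run only one active SparkContext at a time, so re-executing the cell above will fail. A minimal guard, using getOrCreate (our own suggestion, not part of the original steps):

from pyspark import SparkContext, SparkConf

# getOrCreate returns the already-running context if there is one,
# instead of failing with "Cannot run multiple SparkContexts at once"
conf = SparkConf().setAppName("appName")
sc = SparkContext.getOrCreate(conf=conf)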

Once you have successfully initialized sc and conf, use the code below to test the setup:

rdd = sc.textFile("/data/mr/wordcount/input/")
print(rdd.take(10))
print(sc.version)
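Since that path points to word-count input data, a natural extra smoke test is the classic word count. This sketch is our own illustration, built on the rdd defined above:

# Split each line into words, pair every word with 1, and sum the pairs
counts = (rdd.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b))
print(counts.take(5))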

You can also initialize Spark the 2.x (DataFrame) way, using SparkSession, as follows:

# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()
sc = spark.sparkContext

# Now you can even use Hive
# Here we are querying the Hive table 'student' located in the 'ab' database
spark.sql("select * from ab.student").show()

The show() call displays the contents of the student table in tabular form.
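Beyond Hive tables, the same session can read files directly into DataFrames. As a hedged illustration, reusing the word-count input path from earlier:

# Each line of the text files becomes a row with a single "value" column
df = spark.read.text("/data/mr/wordcount/input/")
df.show(5, truncate=False)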

You can also initialize Spark version 3.1, using the code below:

import findspark
findspark.init('/usr/spark-3.1.2')
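After findspark.init, the Spark 3.1 entry point is created exactly as in 2.x. A quick sanity check (the app name here is just an example):

from pyspark.sql import SparkSession

# Build (or reuse) a session and confirm which Spark version was picked up
spark = SparkSession.builder.appName("spark-3.1-example").getOrCreate()
print(spark.version)  # should print 3.1.2 for /usr/spark-3.1.2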

Access S3 Files in Spark

In this blog post, we will learn how to access S3 files using Spark on CloudxLab.
Please follow the steps below to access S3 files:

# Log in to the web console

# Specify the Hadoop config
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/

# Specify the Spark classpath
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/hdp/current/hadoop-client/hadoop-aws.jar"
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/hdp/current/hadoop-client/lib/aws-java-sdk-1.7.4.jar"
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/hdp/current/hadoop-client/lib/guava-11.0.2.jar"

# Launch the Spark shell
/usr/spark1.6/bin/spark-shell

// On the Spark shell, specify the AWS keys
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_AWS_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_AWS_SECRET_ACCESS_KEY")

// Now access S3 files using Spark
// Create an RDD out of the S3 file
val nationalNames = sc.textFile("s3n://cxl-spark-test-data/sss/baby-names.csv")

// Just check the first line
nationalNames.take(1)
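The same steps can also be done from PySpark. Below is a hedged sketch that mirrors the spark-shell session above; it reaches the Hadoop configuration through the internal _jsc handle:

from pyspark import SparkContext

sc = SparkContext(appName="s3-access-example")

# Set the AWS credentials on the underlying Hadoop configuration
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "YOUR_AWS_ACCESS_KEY")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "YOUR_AWS_SECRET_ACCESS_KEY")

# Create an RDD out of the S3 file and check the first line
nationalNames = sc.textFile("s3n://cxl-spark-test-data/sss/baby-names.csv")
print(nationalNames.take(1))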

INSOFE Ties Up With CloudxLab

Adding to an already impressive list of collaborations, International School of Engineering (INSOFE) has recently signed up with CloudxLab (CxL). This move will enable INSOFE’s students to practice in real-world scenarios through the cloud-based labs offered by CloudxLab.

INSOFE’s flagship program, CPEE – Certificate Program in Engineering Excellence – was created to transform “individuals into analytics professionals”. CIO.com lists it at #3, between Columbia at #2 and Stanford at #4, and it holds the distinction of being the only institute outside the US to hold a spot on this list – all within an admirable 3 years of inception. Having established itself as one of the top institutes globally, INSOFE is ceaselessly on the lookout for innovative ways to engage and enhance the student experience.

Continue reading “INSOFE Ties Up With CloudxLab”

SCMHRD Partners With CloudxLab

In a recent strategic partnership that demonstrates SCMHRD’s superior vision in pedagogy, the Post Graduate Program in Business Analytics (PGPBA) has tied up with the well-known learning innovation firm CloudxLab. With this partnership, SCMHRD’s students will get to learn and work with Big Data and analytics tools in the same manner that enterprises learn and use them.

SCMHRD’s flagship analytics program, PGPBA, emphasizes Big Data analytics as opposed to standard analytics, which makes it relevant to a wider gamut of employers and hence the better choice. This emphasis isn’t easy to cater to: providing Big Data tools to learners entails providing a cluster (a bunch of computers) that they can practice on, which in turn translates to expensive infrastructure, big support teams, and the operational costs that go with everything.

Continue reading “SCMHRD Partners With CloudxLab”

CloudxLab Getting Started Guide

Please use the resources below to make the most of your CloudxLab subscription.

You can find the complete getting started guide here.

CloudxLab hands-on videos

Hadoop videos on CloudxLab

Spark videos on CloudxLab

Stream Processing Using Apache Spark and Kafka

Thank you all for your overwhelming response to our “Stream Processing Using Apache Spark and Apache Kafka” session in the “Apache Spark Hands-On” series, which took place on June 15, 2016 at 8:00 PM IST.

Key takeaways:

+ Introduction to Apache Spark
+ Introduction to stream processing
+ Understanding RDD (Resilient Distributed Datasets)
+ Understanding DStream
+ Kafka Introduction
+ Understanding Stream Processing flow
+ Real time hands-on using CloudxLab
+ Questions and Answers

Continue reading “Stream Processing Using Apache Spark and Kafka”

Apache Spark Introduction

Thank you all for your overwhelming response to our Apache Spark Introduction session in the “Apache Spark Hands-On” series, which took place on April 28, 2016 at 8:00 PM IST.

Presented By
Sandeep Giri

Key takeaways for this webinar were:

+ Introduction to Apache Spark
+ Introduction to RDD (Resilient Distributed Datasets)
+ Loading data into an RDD
+ RDD Operations – Transformation
+ RDD Operations – Actions
+ Hands-on demos using CloudxLab
+ Questions and Answers

Continue reading “Apache Spark Introduction”

CloudxLab Introduction

What is CloudxLab?

CloudxLab is a cloud-based virtual lab for practicing Big Data (Hadoop, Spark, etc.), Machine Learning, and Deep Learning technologies.

Origins

While training students on Big Data technologies at KnowBigData, we realized that our learners were facing a lot of trouble downloading and configuring the virtual machines (VMs) provided by major Hadoop vendors. Most often, these virtual machines were slow and would not allow the use of any other application on the same computer.

Moreover, working on a VM did not give a real-world experience: one is still dealing with a single machine instead of a cluster of machines, even though distributed computing across a cluster is the whole idea behind Big Data technologies.

This is how CloudxLab was conceptualized, in an effort to resolve these pain points for learners. The video below will help you understand how one of our clients – Simplilearn – is using CloudxLab to provide a better learning experience to their course takers.

Continue reading “CloudxLab Introduction”