Running PySpark in Jupyter / IPython notebook

You can run PySpark code in a Jupyter notebook on CloudxLab. The following instructions cover both Spark 1.x and Spark 2.x versions of Apache Spark.

What is the Jupyter Notebook?

The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment in which you can combine code execution, rich text, mathematics, plots, and rich media. For more details on the Jupyter Notebook, please see the Jupyter website.

Please follow the steps below to access the Jupyter notebook on CloudxLab.

Step 1 – Log in to the web console

Step 2 – Run the commands below on the web console
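The exact commands depend on where Jupyter lives on the host; a minimal sketch, assuming an Anaconda install at /usr/local/anaconda and port 8890 (both the path and the port are assumptions, adjust them to your setup):

    export PATH=/usr/local/anaconda/bin:$PATH
    jupyter notebook --no-browser --ip 0.0.0.0 --port 8890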

The above commands will launch a Jupyter notebook and display lines like the following in the console.
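(The exact text varies with the Jupyter version; it typically looks something like this.)

    [I 10:59:07.123 NotebookApp] The Jupyter Notebook is running at: http://0.0.0.0:8890/
    [I 10:59:07.124 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).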

To access the notebook, go to http://webconsole:port, where

         webconsole – the domain of your web console

         port – the port on which your notebook is running

For example, if your web console is f.cloudxlab.com and your notebook is running on port 8890, go to http://f.cloudxlab.com:8890 to access the notebook in your browser.

Step 3 – Set up environment variables
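These variables are set in the first cell of the notebook itself. A minimal sketch for Spark 1.x, assuming an HDP-style install at /usr/hdp/current/spark-client; both the path and the py4j version below are assumptions, so check $SPARK_HOME/python/lib on your cluster:

    import os
    import sys

    # Assumed location of the Spark 1.x client install; adjust for your cluster
    os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
    os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

    # Make the bundled pyspark and py4j packages importable from the notebook
    sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.8.2.1-src.zip")
    sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")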

Step 4 – Load the PySpark module
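With the paths in place, loading the module is just a matter of importing pyspark and creating a SparkContext; a sketch (the app name is arbitrary):

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("JupyterPySpark")
    sc = SparkContext(conf=conf)
    sc.version  # prints the Spark version once the module has loaded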

You should get a result like the image below if the PySpark module has loaded properly.

PySpark Jupyter Notebook

Running Spark 2.0.1 using Jupyter

Start Jupyter using the following commands:
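A minimal sketch, assuming Spark 2.0.1 is installed under /usr/spark2.0.1; the install path and the py4j zip name are assumptions, so check $SPARK_HOME/python/lib for the exact version:

    export SPARK_HOME=/usr/spark2.0.1
    export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
    jupyter notebook --no-browser --ip 0.0.0.0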

And in the notebook, please use the following code:
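A sketch using the Spark 2.x SparkSession entry point; the master and app name here are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .master("yarn") \
        .appName("Spark2FromJupyter") \
        .getOrCreate()

    spark.version  # should report 2.0.1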

The output should look like the following:

Running Python Spark 2.0.1 using jupyter


  • Sravan Ch

    Could you please let us know how to get the JDBC connection working?

    I tried adding the library to SPARK_CLASSPATH instead of adding it to spark.driver.extraLibraryPath or spark.executor.extraLibraryPath:
    os.environ['SPARK_CLASSPATH'] = r"/home/sravandata002869/mysql-connector-java-5.1.40/mysql-connector-java-5.1.40-bin.jar"

    Would really appreciate it if you could get me a working example with a JDBC connection.

    Thanks,
    Sravan

    • abhinav singh

      Hi Sravan,

      Just curious, is the above code working from the command line?

      • Sravan Ch

        I'm doing all this work in a Jupyter notebook

      • Sravan Ch

        Never mind, I got this working.

        • abhinav singh

          Hi Sravan,

          Great!

          Can you please post the solution here so that it may help other users 🙂

          Thanks

  • Sravan Ch

    To give you the complete code that I was running in the notebook:

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext, HiveContext

    sconf = SparkConf()
    sc = SparkContext(conf=sconf)
    sqlc = SQLContext(sc)

    username = "sqoopuser"
    pwd = "NHkkP876rp"
    hostname = "ip-172-31-13-154"
    dbname = "sqoopex"
    port = 3306
    table = "ani_country"

    df = sqlc.read.format("jdbc").options(
        url="jdbc:mysql://{0}:{1}/{2}".format(hostname, port, dbname),
        driver="com.mysql.jdbc.Driver",
        dbtable=table,
        user=username,
        password=pwd,
    ).load()

    It's not working, as it cannot recognize the driver, and I was not able to add the jar properly.

  • Sravan Ch

    I've put the mysql jar here:
    /home/sravandata002869
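For anyone hitting the same driver error: SPARK_CLASSPATH has been deprecated since Spark 1.0, and setting it after the JVM is already running has no effect. A commonly used alternative in PySpark is to pass the jar through PYSPARK_SUBMIT_ARGS before the SparkContext is created. A sketch, reusing the jar path from the comments above (adjust it to your own home directory):

    import os

    # Must run before the SparkContext is created; the jar path comes from
    # the comments above, so point it at your own copy of the connector.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--jars /home/sravandata002869/mysql-connector-java-5.1.40/"
        "mysql-connector-java-5.1.40-bin.jar pyspark-shell"
    )

    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext

    sc = SparkContext(conf=SparkConf())
    sqlc = SQLContext(sc)  # the JDBC read from the comment should now find the driver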