{"id":272,"date":"2016-10-18T10:33:10","date_gmt":"2016-10-18T10:33:10","guid":{"rendered":"http:\/\/blog.cloudxlab.com\/?p=272"},"modified":"2023-07-03T12:41:36","modified_gmt":"2023-07-03T12:41:36","slug":"running-pyspark-jupyter-notebook","status":"publish","type":"post","link":"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/","title":{"rendered":"Running PySpark in Jupyter \/ IPython notebook"},"content":{"rendered":"<p>You can run PySpark code in Jupyter notebook on CloudxLab. The following instructions cover 2.2, 2.3, 2.4 and 3.1 versions of Apache Spark.<\/p>\n<p><strong>What is\u00a0Jupyter\u00a0notebook?<\/strong><\/p>\n<p>The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media. For more details on the Jupyter Notebook, please see the <a href=\"http:\/\/jupyter.org\/\">Jupyter website<\/a>.<\/p>\n<p>Follow the steps below to access the Jupyter notebook on CloudxLab.<\/p>\n<p>To start a Python notebook, click the &#8220;Jupyter&#8221; button under <a href=\"https:\/\/cloudxlab.com\/my-lab\">My Lab<\/a> and then click &#8220;New -&gt; Python 3&#8221;.<\/p>\n<p>The initialization code is also available in <a href=\"https:\/\/github.com\/cloudxlab\/bigdata\/blob\/master\/spark\/python\/SparkStart.ipynb\">this GitHub repository<\/a>.<\/p>\n<p>To access Spark, you have to set several environment variables and system paths. You can either do that manually or use a package that does all this work for you. For the latter, <a href=\"https:\/\/github.com\/minrk\/findspark\">findspark<\/a> is a suitable choice. It wraps up all these tasks in just two lines of code:<\/p>\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">import findspark\nfindspark.init('\/usr\/spark2.4.3')<\/code><\/pre>\n\n\n\n<p>Here, we have used Spark version 2.4.3. 
You can specify any other available version instead. You can check the available Spark versions using the following command:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">!ls \/usr\/spark*<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-verse\"><strong>If you choose to do the setup manually instead of using the package, you can access different versions of Spark by following the steps below:<\/strong><\/pre>\n\n\n\n<p>To use Spark 2.2, initialize with the code below:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">import os\nimport sys\n\nos.environ[\"SPARK_HOME\"] = \"\/usr\/hdp\/current\/spark2-client\"\nos.environ[\"PYLIB\"] = os.environ[\"SPARK_HOME\"] + \"\/python\/lib\"\n# In the two lines below, use \/usr\/bin\/python2.7 if you want to use Python 2\nos.environ[\"PYSPARK_PYTHON\"] = \"\/usr\/local\/anaconda\/bin\/python\"\nos.environ[\"PYSPARK_DRIVER_PYTHON\"] = \"\/usr\/local\/anaconda\/bin\/python\"\nsys.path.insert(0, os.environ[\"PYLIB\"] + \"\/py4j-0.10.4-src.zip\")\nsys.path.insert(0, os.environ[\"PYLIB\"] + \"\/pyspark.zip\")<\/code><\/pre>\n\n\n\n<p>To use Spark 2.3, initialize with the code below:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">import os\nimport sys\n\nos.environ[\"SPARK_HOME\"] = \"\/usr\/spark2.3\"\nos.environ[\"PYLIB\"] = os.environ[\"SPARK_HOME\"] + \"\/python\/lib\"\n# In the two lines below, use \/usr\/bin\/python2.7 if you want to use Python 2\nos.environ[\"PYSPARK_PYTHON\"] = \"\/usr\/local\/anaconda\/bin\/python\"\nos.environ[\"PYSPARK_DRIVER_PYTHON\"] = \"\/usr\/local\/anaconda\/bin\/python\"\nsys.path.insert(0, os.environ[\"PYLIB\"] + \"\/py4j-0.10.7-src.zip\")\nsys.path.insert(0, os.environ[\"PYLIB\"] + \"\/pyspark.zip\")<\/code><\/pre>\n\n\n\n<p>To use Spark 2.4, initialize with the code below:<\/p>\n\n\n\n<pre 
class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">import os\nimport sys\n\nos.environ[\"SPARK_HOME\"] = \"\/usr\/spark2.4.3\"\nos.environ[\"PYLIB\"] = os.environ[\"SPARK_HOME\"] + \"\/python\/lib\"\n# In the two lines below, use \/usr\/bin\/python2.7 if you want to use Python 2\nos.environ[\"PYSPARK_PYTHON\"] = \"\/usr\/local\/anaconda\/bin\/python\"\nos.environ[\"PYSPARK_DRIVER_PYTHON\"] = \"\/usr\/local\/anaconda\/bin\/python\"\nsys.path.insert(0, os.environ[\"PYLIB\"] + \"\/py4j-0.10.7-src.zip\")\nsys.path.insert(0, os.environ[\"PYLIB\"] + \"\/pyspark.zip\")<\/code><\/pre>\n\n\n\n<p>Now, initialize the entry points of Spark: SparkContext and SparkConf (old style).<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">from pyspark import SparkContext, SparkConf\nconf = SparkConf().setAppName(\"appName\")\nsc = SparkContext(conf=conf)<\/code><\/pre>\n\n\n\n<p>Once sc and conf are initialized, use the code below to test them:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">rdd = sc.textFile(\"\/data\/mr\/wordcount\/input\/\")\nprint(rdd.take(10))\nprint(sc.version)<\/code><\/pre>\n\n\n\n<p>You can also initialize Spark the Spark 2 (DataFrame) way, using SparkSession, as follows:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\"># Entry point for Spark 2.x\nfrom pyspark.sql import SparkSession\nspark = SparkSession.builder.appName(\"Spark SQL basic example\").enableHiveSupport().getOrCreate()\nsc = spark.sparkContext\n\n# Now you can even use Hive\n# Here we query the Hive table student in the ab database\nspark.sql(\"select * from ab.student\").show()\n\n# It displays something like this:\n\n<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image size-large\"><img width=\"536\" height=\"269\" src=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2020\/10\/hive_spark_sql.png\" alt=\"\" 
class=\"wp-image-3225\"\/><\/figure>\n\n\n\n<p>You can also initialize Spark 3.1 using the code below:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"python\" class=\"language-python line-numbers\">import findspark\nfindspark.init('\/usr\/spark-3.1.2')<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>You can run PySpark code in Jupyter notebook on CloudxLab. The following instructions cover 2.2, 2.3, 2.4 and 3.1 versions of Apache Spark. What is\u00a0Jupyter\u00a0notebook? The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media. &hellip; <a href=\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Running PySpark in Jupyter \/ IPython notebook&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"link","meta":[],"categories":[14],"tags":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v16.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Running PySpark in Jupyter \/ IPython notebook | CloudxLab Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Running PySpark in Jupyter \/ IPython notebook | CloudxLab Blog\" \/>\n<meta property=\"og:description\" content=\"You can run PySpark code in Jupyter notebook on CloudxLab. The following instructions cover 2.2, 2.3, 2.4 and 3.1 versions of Apache Spark. What is\u00a0Jupyter\u00a0notebook? 
The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media. &hellip; Continue reading &quot;Running PySpark in Jupyter \/ IPython notebook&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/\" \/>\n<meta property=\"og:site_name\" content=\"CloudxLab Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cloudxlab\" \/>\n<meta property=\"article:published_time\" content=\"2016-10-18T10:33:10+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-07-03T12:41:36+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2020\/10\/hive_spark_sql.png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@CloudxLab\" \/>\n<meta name=\"twitter:site\" content=\"@CloudxLab\" \/>\n<meta name=\"twitter:label1\" content=\"Est. 
reading time\">\n\t<meta name=\"twitter:data1\" content=\"3 minutes\">\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#website\",\"url\":\"https:\/\/cloudxlab.com\/blog\/\",\"name\":\"CloudxLab Blog\",\"description\":\"Learn AI, Machine Learning, Deep Learning, Devops &amp; Big Data\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/cloudxlab.com\/blog\/?s={search_term_string}\",\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/#primaryimage\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2020\/10\/hive_spark_sql.png\",\"contentUrl\":\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2020\/10\/hive_spark_sql.png\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/#webpage\",\"url\":\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/\",\"name\":\"Running PySpark in Jupyter \/ IPython notebook | CloudxLab 
Blog\",\"isPartOf\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/#primaryimage\"},\"datePublished\":\"2016-10-18T10:33:10+00:00\",\"dateModified\":\"2023-07-03T12:41:36+00:00\",\"author\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/#\/schema\/person\/0efa3c54df68406de820ea466f002d3c\"},\"breadcrumb\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"item\":{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/\",\"url\":\"https:\/\/cloudxlab.com\/blog\/\",\"name\":\"Home\"}},{\"@type\":\"ListItem\",\"position\":2,\"item\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/running-pyspark-jupyter-notebook\/#webpage\"}}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#\/schema\/person\/0efa3c54df68406de820ea466f002d3c\",\"name\":\"Abhinav Singh\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/fc74fe31169bf872f6ab11bbab621d53?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/fc74fe31169bf872f6ab11bbab621d53?s=96&d=mm&r=g\",\"caption\":\"Abhinav Singh\"},\"sameAs\":[\"https:\/\/cloudxlab.com\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","_links":{"self":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/272"}],"collection":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/comments?post=272"}],"version-history":[{"count":57,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/272\/revisions"}],"predecessor-version":[{"id":4169,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/272\/revisions\/4169"}],"wp:attachment":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/media?parent=272"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/categories?post=272"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/tags?post=272"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}