What, How & Why of Artificial Intelligence

Artificial Intelligence (AI) is the buzzword resounding all over the world. While large corporations, organizations, and institutions publicly proclaim massive investments in developing and deploying AI capabilities, many people remain perplexed about what AI actually means. This blog post is an attempt to demystify AI and briefly introduce its various aspects to anyone, engineer, non-engineer, or beginner, who is seeking to understand it. In the discussion that follows, we will explore the following questions:

  • What is AI & what does it seek to accomplish?
  • Through which methodologies will the goals of AI be accomplished?
  • Why is AI gaining so much momentum now?

Continue reading “What, How & Why of Artificial Intelligence”

GraphFrames on CloudxLab

GraphFrames is a useful Spark library that brings DataFrames and the GraphX package together.

From the GraphFrames website:

GraphFrames is a package for Apache Spark which provides DataFrame-based Graphs. It provides high-level APIs in Scala, Java, and Python. It aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding, DataFrame-based serialization, and highly expressive graph queries.

You can use GraphFrames with spark-shell on CloudxLab via the --packages option, as sketched below. Continue reading “GraphFrames on CloudxLab”
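As a minimal sketch (the package coordinate and the toy graph below are illustrative assumptions; pick the GraphFrames version that matches your Spark and Scala build):

# Launch PySpark with the package on the classpath, e.g.:
#   pyspark --packages graphframes:graphframes:0.8.1-spark2.4-s_2.11

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("GraphFramesDemo").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
g.inDegrees.show()             # DataFrame-based graph query
g.find("(x)-[e]->(y)").show()  # motif finding, mentioned in the quote above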

CloudxLab Webinar on “Big Data & AI”: An Overwhelming Success

Recognizing the tremendous current interest of students and professionals in “Big Data & AI”, CloudxLab conducted a webinar on July 12, 2017 to introduce and explain the many nuances of this emerging field to enthusiasts. Mr. Sandeep Giri, founder of CloudxLab, who has more than 15 years of industry experience at companies such as Amazon, InMobi, and D. E. Shaw, was the lead presenter.

The Scope:

The webinar covered the following:

  • Overview of Big Data
  • Big Data Technology Stack
  • Introduction to Artificial Intelligence
  • Demo of Machine Learning Example
  • Artificial Intelligence Technology Stack

Continue reading “CloudxLab Webinar on “Big Data & AI”: An Overwhelming Success”

How CloudxLab Helped Our User With A Job

We recently had a heartwarming moment: one of our subscribers received a job offer from Tata Consultancy Services. His thank-you note to us made our day. Meara Laxman had subscribed to CloudxLab to practice his Big Data skills and, in his own words, got more than he expected. Here is our interview with him.

CxL: How did CloudxLab help you learn Big Data tools better?
Laxman: CloudxLab helped me a lot in learning all the Big Data ecosystem components. I had gained enough theoretical knowledge of Big Data tools from the internet, but I ran into trouble trying to practice because my own system did not meet the required configurations. That is when I found CloudxLab and subscribed to it. I got good exposure to the practical aspects, as CloudxLab provides sample lab-session videos that are clear and easy to follow and practice with. Moreover, the CloudxLab team helped me every time I had an issue and clarified all my queries.

CxL: How did CloudxLab help you with finding a new job?
Laxman: CloudxLab played a key role in getting me my new job. I lacked Continue reading “How CloudxLab Helped Our User With A Job”

Install Python packages on CloudxLab

In this blog post, we will learn how to install Python packages on CloudxLab.

Step 1:

Create a virtual environment for your project. A virtual environment keeps the dependencies required by different projects in separate places by creating an isolated Python environment for each of them. Log in to the CloudxLab web console and create a virtual environment for your project.

First, switch to Python 3 by putting Anaconda at the front of your PATH:

export PATH=/usr/local/anaconda/bin:$PATH

Now let’s create a directory and the virtual environment inside it.

$ mkdir my_project
$ cd my_project
$ python -m venv venv
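
Once the environment exists, activate it before installing anything into it; a minimal sketch (the package name is only an example):

$ source venv/bin/activate
(venv) $ pip install requests

Packages installed this way stay local to my_project and do not affect the system-wide Python.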

Continue reading “Install Python packages on CloudxLab”

CloudxLab Reviews

Jose Manual Ramirez Leon

It is really a great site. As a 37-year-old with a masters in mechanical engineering, I decided to switch careers and get another masters. One of my courses was Big Data and, at the beginning, I was completely lost and falling behind in my assignments. After searching the internet for a solution, I finally found CloudxLab.

Not only do they have every conceivable Big Data technology on their servers, they have superb customer support. Whenever I have had a doubt, even in debugging my own programs, they have answered me with the correct solution in a few hours.

I earnestly recommend it to everyone.

Continue reading “CloudxLab Reviews”

Building Real-Time Analytics Dashboard Using Apache Spark


In this blog post, we will learn how to build a real-time analytics dashboard using Apache Spark streaming, Kafka, Node.js, Socket.IO and Highcharts.

Complete the Spark Streaming topic on CloudxLab to refresh your Spark Streaming and Kafka concepts and get the most out of this guide.

Problem Statement

An e-commerce portal (http://www.aaaa.com) wants to build a real-time analytics dashboard to visualize the number of orders getting shipped every minute to improve the performance of their logistics.

Solution

Before working on the solution, let’s take a quick look at all the tools we will be using:

Apache Spark – A fast and general engine for large-scale data processing. It runs up to 100x faster than Hadoop MapReduce in memory and up to 10x faster on disk. Learn more about Apache Spark here

Python – Python is a widely used high-level, general-purpose, interpreted, dynamic programming language. Learn more about Python here

Kafka – A high-throughput, distributed, publish-subscribe messaging system. Learn more about Kafka here

Node.js – Event-driven I/O server-side JavaScript environment based on V8. Learn more about Node.js here

Socket.IO – Socket.IO is a JavaScript library for real-time web applications. It enables real-time, bi-directional communication between web clients and servers. Read more about Socket.IO here

Highcharts – Interactive JavaScript charts for web pages. Read more about Highcharts here

CloudxLab – Provides a real cloud-based environment for practicing and learning various tools. You can start practicing right away just by signing up online.

How To Build A Data Pipeline?

Below is the high-level architecture of the data pipeline:

Data Pipeline
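
To make the first two stages concrete, here is a minimal PySpark Streaming sketch (the topic name, broker address, message format, and package coordinate are illustrative assumptions; the complete code is in the full post):

# Run with the Kafka integration on the classpath, e.g.:
#   spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 app.py

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="OrderDashboard")
ssc = StreamingContext(sc, 60)  # one-minute batches, matching "orders per minute"

# Each Kafka message is assumed to carry one order's shipment status
stream = KafkaUtils.createDirectStream(
    ssc, ["order-data"], {"metadata.broker.list": "localhost:9092"})

# Count how many orders were shipped in each one-minute batch
shipped_counts = (stream.map(lambda kv: kv[1])
                        .filter(lambda status: status == "shipped")
                        .count())
shipped_counts.pprint()  # the real pipeline would push this to Node.js / Socket.IO

ssc.start()
ssc.awaitTermination()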

Our real-time analytics dashboard will look like this:

Real-Time Analytics Dashboard

Continue reading “Building Real-Time Analytics Dashboard Using Apache Spark”

Cloudera Certification Practice On CloudxLab

How does CloudxLab help with preparing for Cloudera, Hortonworks, and related certifications? Here is an interview with one of our users who has successfully completed the ‘Cloudera Certified Associate Spark and Hadoop Developer‘ (CCA175) certification using CloudxLab for hands-on practice. Having completed the certification, Senthil Ramesh, who is currently working with Accenture, gladly discussed his experience with us.

CxL: How did CloudxLab help you with the Cloudera certification and help you learn Big Data overall?

Senthil: CloudxLab played an important part in the hands-on experience for my Big Data learning. As soon as I realized that my laptop might not be able to support all the tools necessary to prepare for the certification, I started looking for a cloud-based solution and found CloudxLab. The sign-up was easy and everything was set up in a short time. I must say, without the hands-on practice it would have been much harder to crack the certification. Thanks to CloudxLab for that.

CxL: Why CloudxLab and not a Virtual Machine?

Continue reading “Cloudera Certification Practice On CloudxLab”

CloudxLab Joins Hands With TechM’s UpX Academy

CloudxLab is proud to announce its partnership with Tech Mahindra’s UpX Academy. TechM’s e-learning platform, UpX Academy, delivers courses in Big Data and Data Science. With programs spanning 6–12 weeks and covering in-demand skills such as Hadoop, Spark, Machine Learning, R, and Tableau, UpX has tied up with CloudxLab to provide the latest lab environment to its course takers.

We at CloudxLab are in awe of the attention UpX’s excellent team pays to its users’ needs. As Sandeep (CEO at CloudxLab) puts it, “We were not surprised when UpX decided to come on board. Their ultimate interest is in keeping their users happy and we are more than glad to work with them on this.”

Continue reading “CloudxLab Joins Hands With TechM’s UpX Academy”

Running PySpark in Jupyter / IPython notebook

You can run PySpark code in a Jupyter notebook on CloudxLab. The following instructions cover Apache Spark versions 2.2, 2.3, 2.4, and 3.1.

What is Jupyter notebook?

The IPython Notebook is now known as the Jupyter Notebook. It is an interactive computational environment, in which you can combine code execution, rich text, mathematics, plots and rich media. For more details on the Jupyter Notebook, please see the Jupyter website.

Please follow the steps below to access the Jupyter notebook on CloudxLab.

To start a Python notebook, click on the “Jupyter” button under My Lab, and then click on “New -> Python 3”.

The initialization code below is also available in our GitHub repository here.

For accessing Spark, you have to set several environment variables and system paths. You can do that either manually or you can use a package that does all this work for you. For the latter, findspark is a suitable choice. It wraps up all these tasks in just two lines of code:

import findspark
findspark.init('/usr/spark2.4.3')

Here, we have used Spark version 2.4.3. You can specify any other installed version instead. You can check the available Spark versions using the following command:

!ls /usr/spark*

If you choose to do the setup manually instead of using the package, you can access different versions of Spark by following the steps below.

If you want to access Spark 2.2, use the code below:

import os
import sys

os.environ["SPARK_HOME"] = "/usr/hdp/current/spark2-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.4-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

If you plan to use version 2.3, use the code below to initialize:

import os
import sys

os.environ["SPARK_HOME"] = "/usr/spark2.3/"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

If you plan to use version 2.4, use the code below to initialize:

import os
import sys

os.environ["SPARK_HOME"] = "/usr/spark2.4.3"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

Now, initialize the entry points of Spark, SparkContext and SparkConf (the old, pre-2.x style):

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("appName")
sc = SparkContext(conf=conf)

Once you have successfully initialized sc and conf, use the code below to test the setup:

rdd = sc.textFile("/data/mr/wordcount/input/")
print(rdd.take(10))
print(sc.version)

You can also initialize Spark in the 2.x (DataFrame) way as follows:

# Entrypoint 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()
sc = spark.sparkContext

# Now you can even use Hive
# Here we are querying the Hive table "student" located in the database "ab"
spark.sql("select * from ab.student").show()

# .show() prints the first rows of the table in a tabular grid

You can also initialize Spark 3.1 using findspark in the same way, with the code below; see the SparkSession sketch that follows.

import findspark
findspark.init('/usr/spark-3.1.2')
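
After findspark.init, create the SparkSession entry point just as in the 2.x example; a short sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Spark3Example").getOrCreate()
print(spark.version)  # should report 3.1.2, matching the path given to findspark.init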