CloudxLab has hosted several webinars in the past and all of them have been successful. But this time we thought to try something different. So, we all sat together and decided to do an offline meetup for Machine Learning. Though we had done some in the past, the engagement and interaction that one can get in the online webinar are not comparable. Anyhow, we then got in touch with Drupal Bangalore and they were having this event in R. V College of engineering. And one of the topics was Introduction to Machine Learning. We found this a good opportunity to bring our knowledge in the offline circle too.
Machine Learning Bootcamp
So it all happened on Nov 17 where Machine Learning enthusiasts gathered to attend the one day workshop on Machine Learning. The presenter was none other than Mr. Sandeep Giri, who has over 15 years of experience in the domain of Machine learning and Big Data technologies. He has worked in companies like Amazon, InMobi, and D. E. Shaw.
We have a one-day workshop on Introduction to Machine Learning with Drupal Bangalore. In this workshop, you will learn how to apply various Machine Learning techniques for everyday business problems.
Date: Saturday, Dec 16, 2017
Place: R. V. College of Engineering, Bangalore
Time: 11.30 am – 1.30 pm: Presentation and Demo, 2.30 pm – 4.30 pm: Hands-on
What will be covered?
An exposure to Machine Learning using Python to analyze, draw intelligence and build powerful models using real-world datasets. You’ll also gain the insights to apply data processing and Machine Learning techniques in real time.
After completing this workshop, you will be able to build and optimize your own automated classifier to extract insights from real-world data sets.
Python is increasingly being used as a scientific language. Matrix and vector manipulations are extremely important for scientific computations. Both NumPy and Pandas have emerged to be essential libraries for any scientific computation, including machine learning, in python due to their intuitive syntax and high-performance matrix computation capabilities.
In this post, we will provide an overview of the common functionalities of NumPy and Pandas. We will realize the similarity of these libraries with existing toolboxes in R and MATLAB. This similarity and added flexibility have resulted in wide acceptance of python in the scientific community lately. Topic covered in the blog are:
Overview of NumPy
Overview of Pandas
This post is an excerpt from a live hands-on training conducted by CloudxLab on 25th Nov 2017. It was attended by more than 100 learners around the globe. The participants were from countries namely; United States, Canada, Australia, Indonesia, India, Thailand, Philippines, Malaysia, Macao, Japan, Hong Kong, Singapore, United Kingdom, Saudi Arabia, Nepal, & New Zealand.
On November 3, CloudxLab conducted a successful webinar on “Introduction to Machine Learning”. It was a 3-hour session wherein the instructor shed some light on Machine Learning and its terminologies.
It was attended by more than 200 learners around the globe. The participants were from countries namely; United States, Canada, Australia, Indonesia, India, Thailand, Philippines, Malaysia, Macao, Japan, Hong Kong, Singapore, United Kingdom, Saudi Arabia, Nepal, & New Zealand.
Unless you’ve been living under the rock, you must have heard or read the term – Big Data. But many people don’t know what Big Data actually means. Even if they do then the definition of the same is not clear to them. If you’re one of them then don’t be disheartened. By the time you complete reading this very article, you will have a clear idea about Big Data and its terminology.
What is Big Data?
In very simple words, Big Data is data of very big size which can not be processed with usual tools like file systems & relational databases. And to process such data we need to have distributed architecture. In other words, we need multiple systems to process the data to achieve a common goal.
Can a machine create quiz which is good enough for testing a person’s knowledge of a subject?
So, last Friday, we wrote a program which can create simple ‘Fill in the blank’ type questions based on any valid English text.
This program basically figures out sentences in a text and then for each sentence it would first try to delete a proper noun and if there is no proper noun, it deletes a noun.
We are using textblob which is basically a wrapper over NLTK – The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English written in the Python programming language.
In this post, we will learn how to configure tools required for CloudxLab’s Python for Machine Learning course. We will use Python 3 and Jupyter notebooks for hands-on practicals in the course. Jupyter notebooks provide a really good user interface to write code, equations, and visualizations.
Please choose one of the options listed below for practicals during the course.
Here are the top Apache Spark interview questions and answers. There is a massive growth in the big data space, and job opportunities are skyrocketing, making this the perfect time to launch your career in this space.
Our experts have curated these questions to give you an idea of the type of questions which may be asked in an interview. Hope these Apache Spark interview questions and answers guide will help you in getting prepared for your next interview.
1. What is Apache Spark and what are the benefits of Spark over MapReduce?
Spark is really fast. If run in-memory it is 100x faster than Hadoop MapReduce.
In Hadoop MapReduce, you write many MapReduce jobs and then tie these jobs together using Oozie/shell script. This mechanism is very time consuming and MapReduce tasks have heavy latency. Between two consecutive MapReduce jobs, the data has to be written to HDFS and read from HDFS. This is time-consuming. In case of Spark, this is avoided using RDDs and utilizing memory (RAM). And quite often, translating the output of one MapReduce job into the input of another MapReduce job might require writing another code because Oozie may not suffice.
In Spark, you can basically do everything from single code or console (PySpark or Scala console) and get the results immediately. Switching between ‘Running something on cluster’ and ‘doing something locally’ is fairly easy and straightforward. This also leads to less context switch of the developer and more productivity.
Spark kind of equals to MapReduce and Oozie put together.
Watch this video to learn more about benefits of using Apache Spark over MapReduce.
In this data analytics case study, we will use the US census data to build a model to predict if the income of any individual in the US is greater than or less than USD 50000 based on the information available about that individual in the census data.
The dataset used for the analysis is an extraction from the 1994 census data by Barry Becker and donated to the public site http://archive.ics.uci.edu/ml/datasets/Census+Income. This dataset is popularly called the “Adult” data set. The way that we will go about this case study is in the following order:
Describe the data- Specifically the predictor variables (also called independent variables features) from the Census data and the dependent variable which is the level of income (either “greater than USD 50000” or “less than USD 50000”).
Acquire and Read the data- Downloading the data directly from the source and reading it.
Clean the data- Any data from the real world is always messy and noisy. The data needs to be reshaped in order to aid exploration of the data and modeling to predict the income level.
Explore the independent variables of the data- A very crucial step before modeling is the exploration of the independent variables. Exploration provides great insights to an analyst on the predicting power of the variable. An analyst looks at the distribution of the variable, how variable it is to predict the income level, what skews it has, etc. In most analytics project, the analyst goes back to either get more data or better context or clarity from his finding.
Build the prediction model with the training data- Since data like the Census data can have many weak predictors, for this particular case study I have chosen the non-parametric predicting algorithm of Boosting. Boosting is a classification algorithm (here we classify if an individual’s income is “greater than USD 50000” or “less than USD 50000”) that gives the best prediction accuracy for weak predictors. Cross validation, a mechanism to reduce over fitting while modeling, is also used with Boosting.
Validate the prediction model with the testing data- Here the built model is applied on test data that the model has never seen. This is performed to determine the accuracy of the model in the field when it would be deployed. Since this is a case study, only the crucial steps are retained to keep the content concise and readable.
Buoyed by the success of our previous webinar and excited by the unending curiosity of our audience, we at CloudxLab decided to conduct another webinar on “Big Data & AI” on 24th August. Mr Sandeep Giri, founder of CloudxLab, was the lead presenter in the webinar. A graduate from IIT Roorkee with more than 15 years of experience in companies such as DE Shaw, Inmobi & Amazon, Sandeep conducted the webinar to the appreciation of all.