Getting Started with various tools at CloudxLab

Welcome!

We are happy to announce a new consolidated playlist that summarizes the various tools available in the CloudxLab environment, how to use them, and where to learn more about them.

It will be improved incrementally as new technologies are installed on the lab.

You may find the playlist here.

In this playlist, there is a dedicated slide for each technology. For example, if you want to understand how to use Pandas on the lab, go to the slide named Pandas.

Clicking on Pandas opens the Pandas guide.

The slide contains all the basic information you need, such as:

  • the purpose of the library
  • a link to the official home page
  • a link to the official documentation
  • related resources you can use to learn about the library
  • instructions on how to use it in the CloudxLab environment
  • one or two lines of sample code, such as how to import the library and how to check its version
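
As an illustration of the last point, a sample of this kind might look like the minimal sketch below. The helper name library_version is hypothetical and is not taken from the playlist. It returns None when the library is not installed or does not expose a __version__ attribute.

```python
import importlib


def library_version(name):
    """Return the version string of an importable module, or None."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        # The library is not installed in this environment.
        return None
    # Most libraries, including Pandas, expose their version as __version__.
    return getattr(module, "__version__", None)


# Import the library by name and print its version (None if it is missing).
print(library_version("pandas"))
```

When the library is known to be installed, the two-line form `import pandas as pd` followed by `print(pd.__version__)` does the same job.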

We hope this will be a great starting guide for our users and will make getting started easier.

Happy learning!

Getting Started with Apache Airflow

Apache Airflow

When you are building a production system, whether it’s a machine learning model deployment or simple data cleaning, you need to run multiple steps with several different tools, and you want to trigger some processes periodically. Doing this manually more than once is not practical. Therefore, you need a workflow manager and a scheduler. In the workflow manager, you define which processes to run and their interdependencies; in the scheduler, you specify the schedule on which they are executed.
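
To make the idea of processes and their interdependencies concrete, here is a minimal sketch in plain Python. It is not Airflow's API. It uses the standard library's graphlib module (Python 3.9 or later), and the task names and functions are hypothetical placeholders. A real scheduler would additionally call something like run_workflow at fixed intervals.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "clean_data": {"fetch_data"},
    "train_model": {"clean_data"},
    "publish_report": {"clean_data", "train_model"},
}

# Hypothetical task bodies; real tasks would call scripts or other tools.
tasks = {
    "fetch_data": lambda: "fetched",
    "clean_data": lambda: "cleaned",
    "train_model": lambda: "trained",
    "publish_report": lambda: "published",
}


def execution_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def run_workflow(deps, task_functions):
    """Run each task once, after all of its dependencies have run."""
    return [(name, task_functions[name]()) for name in execution_order(deps)]


for name, result in run_workflow(dependencies, tasks):
    print(name, result)
```

The workflow manager's job corresponds to execution_order: it works out a valid order from the declared dependencies, so that no step runs before the steps it relies on.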

When I started using Apache Hadoop in 2012, we cleaned our HDFS data with multiple streaming jobs written in Python, along with shell scripts and so on. It was cumbersome to run these manually. So we started using Azkaban for this, and later Oozie came along. Honestly, Oozie was less than impressive, but it stayed due to the lack of alternatives.

As of today, Apache Airflow seems to be the best solution for creating your workflow. Unlike Oozie, Airflow is not really specific to Hadoop. It is an independent tool – more like a combination of Apache Ant and Unix Cron jobs. It has many more integrations. Check out Apache Airflow’s website.


Introduction to Big Data and Distributed Systems

Introduction

Big Data is a term of fascination in the present-day era of computing. It is in high demand in today’s IT industry and is believed to be revolutionizing technical solutions like never before.


CloudXLab is proud to sponsor RACE360 as a Technology Partner.

RACE360, an Emerging Technology Conference 2019 (powered by The Times of India), is happening on Wed, Aug 28th at The Lalit Ashok, Bengaluru. It is presented by the REVA Academy for Corporate Excellence (RACE), REVA University, Bengaluru.
