Before we start with the main topic, let me explain a very important idea called serialization and its utility.
The data in the RAM is accessed based on the address that is why the name Random Access Memory but the data in the disc is stored sequentially. In the disc, the data is accessed using a file name and the data inside a file is kept in a sequence of bits. So, there is inherent mismatch in the format in which data is kept in memory and data is kept in the disc. You can watch this video to understand serialization further.
SQL is a very important skill. You not only can access the relational databases but also big data using Hive, Spark-SQL etcetera. Learning SQL could help you excel in various roles such as Business Analytics, Web Developer, Mobile Developer, Data Engineer, Data Scientist, and Data Analyst. Therefore having access to SQL client is very important via browser. In this blog, we are going to walk through the examples of interacting with SQLite and MySQL using Jupyter notebook.
A Jupyter notebook is a great tool for analytics and interactive computing. You can interact with various tools such as Python, Linux, File System, Scala, Lua, Spark, R, and SQL from the comfort of the browser. For almost every interactive tool, there is a kernel in Jupyter. Let us walk through how would you use SQL to interact with various databases from the comfort of your browser.
When you are building a production system whether it’s a machine learning model deployment or simple data cleaning, you would need to run multiple steps with multiple different tools and you would want to trigger some processes periodically. This is not possible to do it manually more than once. Therefore, you need a workflow manager and a scheduler. In workflow manager, you would define which processes to run and their interdependencies and in scheduler, you would want to execute them at a certain schedule.
When I started using Apache Hadoop in 2012, we used to get the HDFS data cleaned using our multiple streaming jobs written in Python, and then there were shell scripts and so on. It was cumbersome to run these manually. So, we started using Azkaban for the same, and later on Oozie came. Honestly, Oozie was less than impressive but it stayed due to the lack of alternatives.
As of today, Apache Airflow seems to be the best solution for creating your workflow. Unlike Oozie, Airflow is not really specific to Hadoop. It is an independent tool – more like a combination of Apache Ant and Unix Cron jobs. It has many more integrations. Check out Apache Airflow’s website.
Consider a situation where we have an email inbox that consists of emails, and emails are to be processed. For example, processing those emails and classifying each of the emails as spam or non-spam. The other example of the processing could be we are indexing the email so that the search could be performed.
We have an email-processor program, running on various machines distributed physically from each other.
Now these machines need to somehow coordinate such that:
No email is processed two times
No email is left unprocessed
Usage of flags: we could mark the emails to be read or unread by any machine previously, and only consider those emails which are not yet read.
While processor 1 reads an email and marks it as read, and then the processor dies, then the email would not be touched by any other processor in future, because it was already marked as read by the first processor, and thus this email would be left unprocessed.
There should be a manager that could handle the workload and distribute the work to workers.
This manager could be a bottleneck as it has to maintain a large number of systems, and thus it would be overloaded. Also, what is the manager dies?
We need a central storage which could note down who is doing what, like email id, timestamp it was taken up by a processor, status of completion of processing, etc.
The central storage system can be a bottleneck. Say the email processor programs are running on a lot of machines, then the central storage system would be on high demand and thus it will be overloaded, and it may also die.
By using a distributed system that provides locking such as Zookeeper. You can also use the standard RDBMS system with locking but that would not be highly available.
provides simple primitives like set/get, so easy to program
has an easy data model, like a directory tree
is a resilient and highly available tool
To know more about CloudxLab courses, here you go!
In the Hadoop ecosystem, Apache Zookeeper plays an important role in coordination amongst distributed resources. Apart from being an important component of Hadoop, it is also a very good concept to learn for a system design interview.
If you would prefer the videos with hands-on, feel free to jump in here.
Having known of the prevalence of BigData in real-world scenarios, it’s time for us to understand how they work. This is a very important topic in understanding the principles behind system design and coordination among machines in big data. So let’s dive in.
Consider a scenario where there is a resource of data, and there is a worker machine that has to accomplish some task using that resource. For example, this worker is to process the data by accessing that resource. Remember that the data source is having huge data; that is, the data to be processed for the task is very huge.
As everyone knows, Big Data is a term of fascination in the present-day era of computing. It is in high demand in today’s IT industry and is believed to revolutionize technical solutions like never before.
The remaining useful life (RUL) is the length of time a machine is likely to operate before it requires repair or replacement. By taking RUL into account, engineers can schedule maintenance, optimize operating efficiency, and avoid unplanned downtime. For this reason, estimating RUL is a top priority in predictive maintenance programs.
Three are modeling solutions used for predicting the RUL which are mentioned below:
Regression: Predict the Remaining Useful Life (RUL), or Time to Failure (TTF).
Binary classification: Predict if an asset will fail within a certain time frame (e.g., Hours).
Multi-class classification: Predict if an asset will fail in different time windows: E.g., fails in window [1, w0] days; fails in the window [w0+1, w1] days; not fail within w1 days.
In this blog, I have covered binary classification and multi-class classification in the below sections.