What are the pre-requisites to learn big data?

Pre-requisites for Big Data Hadoop

At CloudxLab, we keep getting a lot of questions, online and sometimes offline, along these lines:

“I want to learn big data, but I don’t know whether I am eligible or not.”

“I am so and so, can I learn big data?”

We have compiled the most common questions here. And, we will answer each one of them.

So, here we go.

What are those questions?

  1. I am from a non-technical background. Can I learn big data?
  2. Do I need to know programming languages such as Java, Python, PHP, etc.?
  3. Since it is big data, do I need to know a relational database such as Oracle, or more generally, do I need to be well versed in SQL?
  4. Do I also need to know Unix or Linux?

The first question: I don’t have any technical background or programming experience. Can I learn big data?

Well, the answer is that a technical background is not compulsory. That said, picking up a few programming basics would be more than enough, and it takes only a few hours to get familiar with them.

The second question: do I need to know any programming languages, such as Java, Python, etc.?

The answer is, you don’t have to be a hard-core programmer. That said, you should know the fundamentals of programming, which again take only a few hours to pick up.

For example, we offer a free Java course and a free self-paced Python course. You can check more details on our website.

The third question: do I need to know SQL or an RDBMS?

Well, the answer is yes. You should know at least SQL. If you don’t, there are many free resources available online.

The final question here: do I need to have Linux or Unix skills?

The answer is that they are not compulsory, but they are good to have.

Some generic questions:

  1. I am from a mainframe background, will learning big data help me?
  2. I am from a telecom/pharma/manufacturing/FMCG background, will learning big data help me?
  3. I have not been in a job for the last few years, will learning big data help me find a job?
  4. I have been working in the SAP field and now want to change my career to big data, can a big data course help me?
  5. I am an MBA, will learning big data help me shift my career?

I am from a mainframe background, will learning big data help me shift my career?

Coming from mainframes, you probably have a good grasp of programming in a language such as COBOL. You are also likely to be comfortable with SQL by now. This would accelerate your learning of big data. Since mainframes are not progressing much, it is important to upgrade your technical skills to suit the new generation of technologies. We have seen many students with a mainframe background enroll in our courses and successfully transition their careers.

I am from a telecom/pharma/manufacturing background, will learning big data help me?

In telecom, pharma and manufacturing, the data being generated has become big data. Earlier, traditional tools were enough to derive insights or predictions. That is no longer the case because the data has grown exponentially. So, naturally, these industries are adopting big data technologies.

I have not been in a job for the last few years, will learning big data help me find a job?

From time to time, the technology landscape changes, giving an opportunity to those who have been in the industry. Before it is too late, it is better to equip yourself with new technologies and new skills to get a job in the current scenario. Long story short: learning big data along with a few other skills will definitely help.

I have been working in the SAP field and now want to change my career to big data, can a big data course help me?

This is a slightly tricky question. It depends on whether you are a functional consultant or a technical consultant in SAP. Learning big data does help, but the transition may take some time.

I am an MBA, will learning big data help me shift my career?

If you are at the beginning of your career, learning big data will definitely help you. If you have been in a job for a while and want to switch careers, it takes additional effort to master the skills discussed above.

So, to put it in a nutshell,

You need to know the fundamentals of a programming language such as Java or Python. We have free courses for both. Please visit our website www.cloudxlab.com and enroll yourself.

And also, you do need to know SQL. Again, we have a free course for this as well. Please visit our website for further details.

And, a little bit of Linux or Unix will complete the equation.

More than anything else, you need great passion, the ambition to succeed in your career, and the willingness to put in sincere effort and hard work.

Before we wrap up, please visit www.cloudxlab.com for more details about our big data courses. We have an instructor-led course on big data and a few self-paced courses as well.

We hope we have answered all your questions. If you have any other questions, please put them here in the comments or post them on the discussion forum on our website.

Phrase matching using Apache Spark

Recently, a friend whose company is working on a large-scale project reached out to us for a solution to a seemingly simple problem: finding a list of phrases (approximately 80,000) in a huge set of rich text documents (approximately 6 million).

The problem looked simple at first. The engineers had solved it by simply loading the two datasets into Apache Spark DataFrames and joining them using “like”. Something along these lines:

select phrases.id, docs.id from phrases, docs where docs.txt like concat('%', phrases.phrase, '%')

But it was taking a huge amount of time even on a small subset of the data, even though the processing was done in a distributed fashion. Any guesses why?

They had also tried Apache Spark’s broadcast mechanism on the smaller dataset, but even a small task still took a long time to finish.

So, how did we finally solve it? Here is one of my approaches. Please feel free to provide your input.

We first brought together the phrases and documents that have at least one word in common. Then we grouped the data by the pair of phrase id and document id. Finally, we filtered the results based on whether all of the words in the phrase are found in the document, and in the same order.
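The three steps above can be sketched in plain Python, without Spark, to make the logic concrete. This is only a minimal illustration and not the project's actual code: the function names (`tokenize`, `contains_in_order`, `match_phrases`) are hypothetical, the tokenizer is a plain whitespace split, and "in the same order" is assumed to mean that the phrase words appear as a contiguous run in the document.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple, assumed tokenizer)."""
    return text.lower().split()


def contains_in_order(doc_words, phrase_words):
    """True if phrase_words occurs as a contiguous run inside doc_words."""
    n = len(phrase_words)
    return any(doc_words[i:i + n] == phrase_words
               for i in range(len(doc_words) - n + 1))


def match_phrases(phrases, docs):
    """phrases and docs map id -> text; returns sorted (phrase_id, doc_id) pairs."""
    # Step 1: index documents by word, so that a phrase only meets
    # documents that share at least one word with it (the join on words).
    word_to_docs = defaultdict(set)
    doc_words = {}
    for doc_id, text in docs.items():
        doc_words[doc_id] = tokenize(text)
        for word in doc_words[doc_id]:
            word_to_docs[word].add(doc_id)

    # Step 2: group the word matches by (phrase_id, doc_id),
    # remembering which phrase words were seen in that document.
    candidates = defaultdict(set)
    phrase_words = {}
    for phrase_id, text in phrases.items():
        phrase_words[phrase_id] = tokenize(text)
        for word in phrase_words[phrase_id]:
            for doc_id in word_to_docs.get(word, ()):
                candidates[(phrase_id, doc_id)].add(word)

    # Step 3: keep only the pairs where every phrase word was found
    # and the words appear in the document in the same order.
    results = []
    for (phrase_id, doc_id), seen in candidates.items():
        words = phrase_words[phrase_id]
        if seen == set(words) and contains_in_order(doc_words[doc_id], words):
            results.append((phrase_id, doc_id))
    return sorted(results)
```

In Spark, step 1 would correspond to an equality join on the word column and step 2 to a group-by on the (phrase id, document id) pair, which is what lets the engine avoid comparing every phrase with every document. Note that matching on whole words is not identical to a “like” substring match, which can also match inside a longer word.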

You can take a look at the project here. The Scala version is not yet finished, though the Python version is done.

You may be wondering whether this really makes it faster, and if so, why.

Suppose you have m phrases and n documents, where each phrase has w words and each document has k words.

With the “like” join, the total complexity will be of the order of m*w * n * k, because each word from the phrases will be compared with each word in the documents.

The complexity of our approach is not as straightforward to compute. Let me try.

First, it is going to sort the data. The total number of words is m*w + n*k. Let’s call it W:

W = m*w + n*k

The complexity of sorting it is: W log W

Then we are going to sort the data based on (phrase id, document id). If every phrase were found in every document, there would be a total of m * n records to sort, which would cost:

m*n log (m*n)

But in practice the number of records is going to be far smaller and can be approximated by n.

So, the final sort will take approximately: n * log(n)

We can safely ignore the other processing steps as those are linear. In the worst case, the overall complexity or the time consumption is going to be of the order of:

(m*w + n*k) log(m*w + n*k)  +  m*n log (m*n)

This is far better than m*w * n * k.
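To get a feel for the difference, the two expressions can be evaluated for the sizes mentioned earlier (about 80,000 phrases and 6 million documents). The values of w and k below are assumptions chosen only for illustration, since the actual phrase and document lengths are not given, and the function names are hypothetical. The logarithm base is taken as 2, which does not affect the order of growth.

```python
import math


def naive_cost(m, w, n, k):
    """Order of the 'like' join: every phrase word against every document word."""
    return m * w * n * k


def sort_based_cost(m, w, n, k):
    """Worst-case order of the sort-based approach: W log W + m*n log(m*n)."""
    total_words = m * w + n * k   # W = m*w + n*k
    pairs = m * n                 # worst case: every phrase found in every document
    return total_words * math.log2(total_words) + pairs * math.log2(pairs)


m, n = 80_000, 6_000_000   # sizes from the problem statement
w, k = 3, 500              # assumed average words per phrase / per document

print("naive      :", naive_cost(m, w, n, k))
print("sort-based :", round(sort_based_cost(m, w, n, k)))
```

Under these assumed lengths the sort-based figure comes out smaller even in the worst case, and it shrinks much further when the number of matching (phrase id, document id) pairs is closer to n than to m * n.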

I hope you find it useful. Please visit cloudxlab.com to see various courses and lab offerings.

References: