Recently, a friend whose company is working on a large-scale project reached out to us for a solution to a simple-sounding problem: finding a list of phrases (approximately 80,000) in a huge set of rich-text documents (approximately 6 million).
The problem at first looked simple. The engineers had solved it by simply loading the two datasets into Apache Spark DataFrames and joining them using "like", something along these lines:
select phrases.id, docs.id from phrases, docs where docs.txt like concat('%', phrases.phrase, '%')
But it was taking a huge amount of time even on a small subset of the data, despite the processing being distributed. Any guesses why?
They had also tried Apache Spark's broadcast mechanism on the smaller dataset, but even a small task was still taking a long time to finish.
So, how did we finally solve it? Here is one of my approaches. Please feel free to provide your input.
We first tokenized both the phrases and the documents into words and joined them wherever at least one word matched. Then we grouped the data by the pair (phrase id, document id). Finally, we kept only the pairs where every word of the phrase appears in the document, in the same order.
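The three steps above can be sketched in plain Python. The real implementation ran on Spark DataFrames; everything here, including whitespace tokenization and the function name, is an illustrative simplification.

```python
def find_matches(phrases, docs):
    """phrases: {phrase_id: text}, docs: {doc_id: text}.
    Returns (phrase_id, doc_id) pairs where the full phrase occurs
    in the document as a contiguous sequence of words."""
    # Step 1: explode documents into word postings (word -> positions),
    # so phrases and documents can be joined on individual words.
    postings = {}  # word -> list of (doc_id, position)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            postings.setdefault(word, []).append((doc_id, pos))

    # Step 2: join each phrase word against the postings and group
    # the candidate hits by (phrase_id, doc_id).
    candidates = {}  # (phrase_id, doc_id) -> {phrase_pos: set(doc_positions)}
    for pid, ptext in phrases.items():
        for ppos, word in enumerate(ptext.lower().split()):
            for doc_id, dpos in postings.get(word, []):
                candidates.setdefault((pid, doc_id), {}) \
                          .setdefault(ppos, set()).add(dpos)

    # Step 3: filter — keep a pair only if every phrase word was found
    # and the positions line up in the same order (consecutively).
    results = []
    for (pid, doc_id), hits in candidates.items():
        words = phrases[pid].lower().split()
        if len(hits) < len(words):
            continue  # not all words of the phrase occur in this doc
        for start in hits.get(0, set()):
            if all(start + i in hits.get(i, set())
                   for i in range(1, len(words))):
                results.append((pid, doc_id))
                break
    return results
```

In Spark terms, step 1 corresponds to exploding the text into words, step 2 to a join on the word column, and step 3 to a group-by with a filtering aggregation.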
You may be wondering whether this really makes it faster, and why.
Suppose you have m phrases and n documents, where each phrase has w words and each document has k words. The naive approach compares each word of every phrase against each word of every document, so its total complexity is of the order of m*w * n * k.
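To get a feel for the scale, here is a quick back-of-the-envelope calculation. The counts m and n come from the problem statement; the per-phrase and per-document word counts w and k are hypothetical:

```python
m, n = 80_000, 6_000_000   # phrases and documents (from the problem)
w, k = 5, 1_000            # assumed words per phrase / per document

# Every phrase word compared against every document word:
naive = m * w * n * k
print(f"{naive:.1e}")      # on the order of 2.4e+15 comparisons
```

Even at a billion comparisons per second, that is weeks of compute, which is why the like-join crawled.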
The complexity of our approach is not as straightforward to compute, but let me try.
First, the join is going to sort the data by word. The total number of words is m*w + n*k; let's call it W:
W = m*w + n*k
The complexity of sorting it is: W log W
Then we group the data by (phrase id, document id), which again involves sorting. In the worst case, where every phrase is found in every document, there would be a total of m * n records to sort, costing:
m*n log (m*n)
In practice, the number of matching pairs is going to be far smaller and can be approximated by n, so the final sort takes approximately: n * log(n)
We can safely ignore the other processing steps, as they are linear. The overall complexity, and hence the time consumed, is going to be of the order of:
(m*w + n*k) log(m*w + n*k) + m*n log (m*n)
which is definitely way better than m*w * n * k.
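As a sanity check, we can plug numbers into both formulas, using the worst-case m*n grouping term. Again, m and n come from the problem while w and k are assumptions:

```python
import math

m, n = 80_000, 6_000_000   # from the problem statement
w, k = 5, 1_000            # hypothetical words per phrase / per document

naive = m * w * n * k                    # m*w * n * k
W = m * w + n * k                        # total number of words
sort_based = W * math.log2(W) + m * n * math.log2(m * n)

# The sort-based estimate comes out roughly two orders of magnitude
# smaller than the naive one, even with the pessimistic m*n term.
print(f"naive:      {naive:.1e}")
print(f"sort-based: {sort_based:.1e}")
```

With the realistic approximation of n log(n) for the grouping step instead of m*n log(m*n), the gap widens further.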
I hope you find it useful. Please visit coudxlab.com to see various courses and lab offerings.