{"id":1338,"date":"2018-07-20T10:52:39","date_gmt":"2018-07-20T10:52:39","guid":{"rendered":"http:\/\/blog.cloudxlab.com\/?p=1338"},"modified":"2019-01-08T12:12:23","modified_gmt":"2019-01-08T12:12:23","slug":"phrase-matching-using-apache-spark","status":"publish","type":"post","link":"https:\/\/cloudxlab.com\/blog\/phrase-matching-using-apache-spark\/","title":{"rendered":"Phrase matching using Apache Spark"},"content":{"rendered":"<p>Recently, a friend whose company is working on large scale project reached out to us to seek a solution to a simple problem of finding a list of phrases (approximately 80,000) in a huge set of rich text documents (approx 6 million).<\/p>\n<p>The problem at first looked simple. The way engineers had solved it is by simply loading the two documents in Apache Spark&#8217;s DataFrame\u00a0and joining those using &#8220;like&#8221;. Something on these lines:<\/p>\n<blockquote><p><strong>select\u00a0phrase.id, docs.id from phrases, docs where docs.txt like &#8216;%&#8217; + phrases.phrase\u00a0+ &#8216;%&#8217;<\/strong><\/p><\/blockquote>\n<p>But it was taking huge time even on the small subset of the data and processing is done in distributed fashion. Any Guesses, why?<\/p>\n<p>They had also tried to use Apache Spark&#8217;s broadcast mechanism on the smaller dataset but still, it was taking a long while finishing even a small task.<\/p>\n<p><!--more--><\/p>\n<p>So, how we solved it finally? Here is one of my approaches. Please feel free to provide your input.<\/p>\n<p>We first brought together the phrase and documents where there is at least one match.\u00a0 Then we grouped the data based on the pair of phrase id and document id. 
And finally, we filtered the results based on whether all of the words in the phrase are found in the document, in the same order.</p>
<p>You can take a look at <a href="https://github.com/cloudxlab/bigdata/tree/master/spark/projects/phrase-search">the project here</a>. The Scala version is not yet finished, though the <a href="https://github.com/cloudxlab/bigdata/tree/master/spark/projects/phrase-search/python">Python version</a> is done.</p>
<p>You may be wondering whether this really makes it faster, and if so, what makes it faster.</p>
<p>Suppose we have m phrases and n documents, where each phrase has w words and each document has k words.</p>
<p>In the naive approach, each word from the phrases is compared with each word in the documents, so the total complexity is of the order of m*w * n*k.</p>
<p>The complexity of our approach is not as straightforward to compute. Let me try.</p>
<p>First, we sort the data by word. The total number of words is m*w + n*k. Let's call it W:</p>
<p>W = m*w + n*k</p>
<p>The complexity of sorting it is W log W.</p>
<p>Then we sort the candidate matches by (phrase id, document id). In the worst case, where every phrase is found in every document, there are m*n records to sort, costing m*n log(m*n). In practice, the number of matching pairs is far smaller, so this step is considerably cheaper.</p>
<p>We can safely ignore the other processing steps, as they are linear. The overall time consumption is therefore of the order of:</p>
<p>(m*w + n*k) log(m*w + n*k) + m*n log(m*n)</p>
<p>which is definitely way better than m*w * n*k.</p>
<p>I hope you found this useful.</p>
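<p>To make the three steps concrete, here is a minimal, plain-Python sketch of the same idea. It is an illustration only: the actual project runs these steps as Spark transformations, and all names below are my own, not taken from the project code. The sketch checks for a contiguous, in-order occurrence of each phrase:</p>

```python
from collections import defaultdict

def match_phrases(phrases, docs):
    """Return (phrase_id, doc_id) pairs where the whole phrase occurs in order.

    phrases: {phrase_id: "phrase text"}, docs: {doc_id: "document text"}
    """
    # Step 1: bring phrases and documents together on shared words.
    # Build word -> [(phrase_id, position_in_phrase)], like a join key in Spark.
    phrase_words = defaultdict(list)
    for pid, text in phrases.items():
        for i, word in enumerate(text.lower().split()):
            phrase_words[word].append((pid, i))

    # Step 2: group candidate word hits by (phrase_id, doc_id).
    candidates = defaultdict(list)  # (pid, did) -> [(pos_in_phrase, pos_in_doc)]
    for did, text in docs.items():
        for j, word in enumerate(text.lower().split()):
            for pid, i in phrase_words.get(word, ()):
                candidates[(pid, did)].append((i, j))

    # Step 3: keep pairs where every phrase word appears, in order and
    # contiguously. If all phrase positions map to the same document offset
    # (j - i), the phrase occurs as an unbroken run.
    results = []
    for (pid, did), hits in candidates.items():
        length = len(phrases[pid].split())
        offsets = {j - i for i, j in hits}
        for off in offsets:
            if sum(1 for i, j in hits if j - i == off) == length:
                results.append((pid, did))
                break
    return results
```

<p>The key point mirrors the Spark version: documents are never scanned once per phrase. Words act as the join keys, and the final per-(phrase, document) check only touches the small set of candidate pairs.</p>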
<p>Please visit <a href="https://cloudxlab.com">cloudxlab.com</a> to see our various courses and lab offerings.</p>
<p>References:</p>
<p><a href="https://github.com/cloudxlab/bigdata/tree/master/spark/projects/phrase-search/python">https://github.com/cloudxlab/bigdata/tree/master/spark/projects/phrase-search/python</a></p>