{"id":3544,"date":"2021-06-08T17:33:01","date_gmt":"2021-06-08T17:33:01","guid":{"rendered":"https:\/\/cloudxlab.com\/blog\/?p=3544"},"modified":"2021-06-08T18:22:24","modified_gmt":"2021-06-08T18:22:24","slug":"getting-started-with-apache-airflow","status":"publish","type":"post","link":"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/","title":{"rendered":"Getting Started with Apache Airflow"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img width=\"1220\" height=\"659\" src=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2021\/06\/image-2.png\" alt=\"\" class=\"wp-image-3551\"\/><figcaption>Apache Airflow<\/figcaption><\/figure>\n\n\n\n<p>When you are building a production system whether it&#8217;s a machine learning model deployment or simple data cleaning, you would need to run multiple steps with multiple different tools and you would want to trigger some processes periodically. This is not possible to do it manually more than once. Therefore, you need a workflow manager and a scheduler. In workflow manager, you would define which processes to run and their interdependencies and in scheduler, you would want to execute them at a certain schedule.<\/p>\n\n\n\n<p>When I started using Apache Hadoop in 2012, we used to get the HDFS data cleaned using our multiple streaming jobs written in Python, and then there were shell scripts and so on. It was cumbersome to run these manually. So, we started using Azkaban for the same, and later on Oozie came. Honestly, Oozie was less than impressive but it stayed due to the lack of alternatives.<\/p>\n\n\n\n<p>As of today, Apache Airflow seems to be the best solution for creating your workflow. Unlike Oozie, Airflow is not really specific to Hadoop. It is an independent tool &#8211; more like a combination of Apache Ant and Unix Cron jobs. It has many more integrations. 
Check out <a href=\"https:\/\/airflow.apache.org\/\">Apache Airflow&#8217;s website<\/a>.<\/p>\n\n\n\n<!--more-->\n\n\n\n<p>We have not made Apache Airflow available on CloudxLab. Here are the steps to get started:<\/p>\n\n\n\n<p><strong>1. Python3 Path<\/strong><\/p>\n\n\n\n<p><code>export PATH=\/usr\/local\/anaconda\/bin:$PATH<\/code><\/p>\n\n\n\n<p><strong>2.<\/strong> Set Airflow home<\/p>\n\n\n\n<pre id=\"codecell0\" class=\"wp-block-preformatted\">export AIRFLOW_HOME=~\/airflow<\/pre>\n\n\n\n<p>3. <strong>Initialize DB<\/strong><\/p>\n\n\n\n<p><code>airflow db init<\/code><\/p>\n\n\n\n<p><strong>4. Create admin user name. <\/strong><\/p>\n\n\n\n<p>Once you run this command it will ask for a password. Please feel free to set the correct parameters.<\/p>\n\n\n\n<pre id=\"codecell0\" class=\"wp-block-preformatted\">airflow users create \\\n    --username admin \\\n    --firstname Peter \\\n    --lastname Parker \\\n    --role Admin \\\n    --email myself@gmail.com<\/pre>\n\n\n\n<p id=\"block-59b449a2-1799-425c-898d-d3d4c5dac710\"><strong>5. Start the web server<\/strong><\/p>\n\n\n\n<p>This starts the user interface of Airflow. Please note that the only ports that are public on CloudxLab range from 4040 to 4100. In case, the following command gives an error, please try choosing some other port.<\/p>\n\n\n\n<p id=\"codecell0\"><code>airflow webserver --port 4050<\/code><\/p>\n\n\n\n<p>For example, if you are getting an error <kbd>[ERROR] Connection in use: ('0.0.0.0', 4050)<\/kbd>, press <kbd>CTRL+c <\/kbd>to cancel and try increasing the port number <code>airflow webserver --port 4060<\/code><\/p>\n\n\n\n<p><strong>6. Find  out the host on which you are located. <\/strong><\/p>\n\n\n\n<p>First find out the private ip address of the host using <code>ifconfig<\/code><\/p>\n\n\n\n<p>And then, using the <a href=\"https:\/\/cloudxlab.com\/my-lab#ip-mappings\">IP Mapping in MyLab<\/a> find out the public hostname. <\/p>\n\n\n\n<p><strong>7. 
Open the Airflow UI<\/strong> using the port (step 5) and host (step 6). The URL would look something like this: http:\/\/e.cloudxlab.com:4050<\/p>\n\n\n\n<p>You will see a UI like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img width=\"1043\" height=\"548\" src=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2021\/06\/image.png\" alt=\"\" class=\"wp-image-3547\"\/><\/figure>\n\n\n\n<p>Log in with the username &#8216;admin&#8217; and the password you set in step 4.<\/p>\n\n\n\n<p>Once you have logged in, you will see an interface like the following.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img width=\"1412\" height=\"1016\" src=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2021\/06\/image-1.png\" alt=\"\" class=\"wp-image-3548\"\/><figcaption>Apache Airflow User Interface<\/figcaption><\/figure>\n\n\n\n<p>Now you can go ahead and start using Apache Airflow. To run your DAGs, you also need to start the scheduler:<\/p>\n\n\n\n<pre id=\"codecell0\" class=\"wp-block-preformatted\"># start the scheduler\n# open a new terminal, or run the webserver with the -D option so it runs as a daemon\nairflow scheduler<\/pre>\n\n\n\n<p>You can follow <a href=\"https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/tutorial.html\">these tutorials<\/a> and <a href=\"https:\/\/airflow.apache.org\/docs\/apache-airflow\/stable\/howto\/index.html\">How-To guides<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When you are building a production system, whether it&#8217;s a machine learning model deployment or simple data cleaning, you need to run multiple steps across multiple different tools, and you often want to trigger some processes periodically. Running all of this by hand is impractical beyond the first few times. 
Therefore, you need a workflow manager &hellip; <a href=\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Getting Started with Apache Airflow&#8221;<\/span><\/a><\/p>\n","protected":false},"author":14,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[118,73],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v16.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Getting Started with Apache Airflow | CloudxLab Blog<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Getting Started with Apache Airflow | CloudxLab Blog\" \/>\n<meta property=\"og:description\" content=\"When you are building a production system whether it&#8217;s a machine learning model deployment or simple data cleaning, you would need to run multiple steps with multiple different tools and you would want to trigger some processes periodically. This is not possible to do it manually more than once. 
Therefore, you need a workflow manager &hellip; Continue reading &quot;Getting Started with Apache Airflow&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/\" \/>\n<meta property=\"og:site_name\" content=\"CloudxLab Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/cloudxlab\" \/>\n<meta property=\"article:published_time\" content=\"2021-06-08T17:33:01+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2021-06-08T18:22:24+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2021\/06\/image-2.png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@CloudxLab\" \/>\n<meta name=\"twitter:site\" content=\"@CloudxLab\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\">\n\t<meta name=\"twitter:data1\" content=\"3 minutes\">\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebSite\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#website\",\"url\":\"https:\/\/cloudxlab.com\/blog\/\",\"name\":\"CloudxLab Blog\",\"description\":\"Learn AI, Machine Learning, Deep Learning, Devops &amp; Big Data\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":\"https:\/\/cloudxlab.com\/blog\/?s={search_term_string}\",\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/#primaryimage\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2021\/06\/image-2.png\",\"contentUrl\":\"https:\/\/blog.cloudxlab.com\/wp-content\/uploads\/2021\/06\/image-2.png\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/#webpage\",\"url\":\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/\",\"name\":\"Getting Started with Apache Airflow | CloudxLab Blog\",\"isPartOf\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/#primaryimage\"},\"datePublished\":\"2021-06-08T17:33:01+00:00\",\"dateModified\":\"2021-06-08T18:22:24+00:00\",\"author\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/#\/schema\/person\/4835f1b3d5000626cb15e9311d748e09\"},\"breadcrumb\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"item\":{\"@type\":\"WebPage\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/\",\"url\":\"https:\/\/cloudxlab.com\/blog\/\",\"name\":\"Home\"}},{\"@type\":\"ListItem\",\"position\":2,\"item\":{\"@id\":\"https:\/\/cloudxlab.com\/blog\/getting-started-with-apache-airflow\/#webpage\"}}]},{\"@type\":\"Person\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#\/schema\/person\/4835f1b3d5000626cb15e9311d748e09\",\"name\":\"Sandeep 
Giri\",\"image\":{\"@type\":\"ImageObject\",\"@id\":\"https:\/\/cloudxlab.com\/blog\/#personlogo\",\"inLanguage\":\"en-US\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1393214840cf7455bb4cba055cb30468?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1393214840cf7455bb4cba055cb30468?s=96&d=mm&r=g\",\"caption\":\"Sandeep Giri\"},\"sameAs\":[\"https:\/\/cloudxlab.com\"]}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","_links":{"self":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/3544"}],"collection":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/comments?post=3544"}],"version-history":[{"count":6,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/3544\/revisions"}],"predecessor-version":[{"id":3554,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/posts\/3544\/revisions\/3554"}],"wp:attachment":[{"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/media?parent=3544"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/categories?post=3544"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cloudxlab.com\/blog\/wp-json\/wp\/v2\/tags?post=3544"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}