In this blog post, we will learn how to access S3 files using Spark on CloudxLab.
Please follow the steps below to access S3 files:
# Login to the web console

# Specify the Hadoop config
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/

# Specify the Spark classpath
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/hdp/current/hadoop-client/hadoop-aws.jar"
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/hdp/current/hadoop-client/lib/aws-java-sdk-1.7.4.jar"
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/hdp/current/hadoop-client/lib/guava-11.0.2.jar"

# Launch the Spark shell
/usr/spark1.6/bin/spark-shell

// On the Spark shell, specify the AWS keys
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_AWS_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_AWS_SECRET_ACCESS_KEY")

// Now access S3 files using Spark
// Create an RDD from the S3 file
val nationalNames = sc.textFile("s3n://cxl-spark-test-data/sss/baby-names.csv")

// Check the first line
nationalNames.take(1)
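Once the RDD is created, you can work with it like any other text-file RDD. Below is a minimal continuation sketch, run in the same spark-shell session; it assumes the CSV is comma-separated with a header row (the exact column layout of baby-names.csv is an assumption here, used only for illustration):

// Count the total number of records in the file
val totalLines = nationalNames.count()

// Drop the header row and split each remaining line into fields
// (assumes comma-separated values with a single header line)
val header = nationalNames.first()
val records = nationalNames
  .filter(line => line != header)
  .map(line => line.split(","))

// Inspect a few parsed rows
records.take(5).foreach(fields => println(fields.mkString(" | ")))

Because RDD transformations are lazy, the split only runs when an action such as take or count is invoked.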
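Note that s3n is the older S3 connector; newer Hadoop builds ship the s3a connector in the same hadoop-aws jar. If your cluster's jar supports it, the equivalent setup is as follows (a sketch assuming s3a support in your hadoop-aws version; the config key names are the standard Hadoop s3a properties):

// Set credentials for the s3a connector and read via an s3a:// URI
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_AWS_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_AWS_SECRET_ACCESS_KEY")
val namesViaS3a = sc.textFile("s3a://cxl-spark-test-data/sss/baby-names.csv")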