Access S3 Files in Spark

In this blog post, we will learn how to access S3 files using Spark on CloudxLab.
Please follow the steps below to access S3 files:

#Log in to the Web Console

#Specify the Hadoop configuration directory
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/

#Specify the Spark classpath
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/hdp/current/hadoop-client/hadoop-aws.jar"
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/hdp/current/hadoop-client/lib/aws-java-sdk-1.7.4.jar"
export SPARK_CLASSPATH="$SPARK_CLASSPATH:/usr/hdp/current/hadoop-client/lib/guava-11.0.2.jar"

#Launch Spark Shell
/usr/spark1.6/bin/spark-shell
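
#Once the shell starts, the SparkContext is available as sc
#Optionally, confirm the Spark version before going further
sc.version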

#On the Spark shell, specify the AWS credentials
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_AWS_ACCESS_KeY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_AWS_SECRET_ACCESS_KeY")

#Now access S3 files using Spark
#Create an RDD out of the S3 file
val nationalNames = sc.textFile("s3n://cxl-spark-test-data/sss/baby-names.csv")

#Just check the first line
nationalNames.take(1)
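
#As a quick, optional sketch of further processing: count the records and split each
#line on commas (this assumes the file is comma-separated, as the .csv name suggests;
#adjust the parsing to match your data)
nationalNames.count()
val fields = nationalNames.map(line => line.split(","))
fields.take(2)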