Apache Spark Introduction

Thank you all for your overwhelming response to our Apache Spark Introduction session in the “Apache Spark Hands-On” series, held on April 28, 2016 at 8:00 pm IST.

Presented By
Sandeep Giri

The key takeaways from this webinar were:

+ Introduction to Apache Spark
+ Introduction to RDD (Resilient Distributed Datasets)
+ Loading data into an RDD
+ RDD Operations – Transformation
+ RDD Operations – Actions
+ Hands-on demos using CloudxLab
+ Questions and Answers

Hands-on Webinar

Presentation

Feedback

We will be organizing more hands-on sessions on Apache Spark in the coming days. Please follow CloudxLab on Twitter for updates on upcoming events.

Please feel free to drop your comments. They will help us improve the quality of our webinars and content.

See you at the next event.

  • Jagpreet Singh

    Hello,

    Could you please help me understand why I am facing the following issue?

    I am trying to print the text of a Twitter post. I can print the text when the producer sends only the tweet's text to Spark, i.e. producer.send(topic, tweet["text"])

    However, when I send the complete JSON object, that is:
    json_data = json.dumps(tweet)
    producer.send(topic, json_data)
    nothing is printed on the console by the pprint() function.

    The Spark file has:

    os.environ["SPARK_HOME"] = "C:spark"
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 5)
    sc.setLogLevel("ERROR")
    brokers, topic = sys.argv[1:]
    print("creating Kafka Direct Stream Object")
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    print("Json object extracted from kvs map")
    tweetRDD = kvs.map(lambda (k, v): json.loads(v)).map(lambda tweet: tweet["text"])
    print("TweetRDD Created")
    tweetRDD.pprint()
    print("Should Print TweetRDD")
    ssc.start()
    ssc.awaitTermination()

    Producer file:

    import tweepy
    from kafka import KafkaProducer
    import json

    class StdOutListener(tweepy.StreamListener):

        def on_status(self, tweet):
            try:
                print tweet.text
                json_data = json.dumps(tweet)
                producer.send(mytopic, json_data)
            ………
    if __name__ == '__main__':
        producer = KafkaProducer(bootstrap_servers="localhost:9092", value_serializer=lambda v: json.dumps(v).encode('utf-8'))
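
One possible cause, offered as a hedged sketch rather than a confirmed diagnosis: the producer's value_serializer already calls json.dumps, so a value that was passed through json.dumps beforehand is encoded twice. On the Spark side, a single json.loads would then return a string rather than a dict, and tweet["text"] would fail. The snippet below reproduces this with the standard json module only; the sample `tweet` dict and the `value_serializer` helper name are assumptions made for illustration, and no Kafka or Spark is involved.

```python
import json

# Hypothetical stand-in for a tweet payload; a real tweet has many more fields.
tweet = {"text": "hello spark", "lang": "en"}


def value_serializer(v):
    # Mirrors the serializer passed to KafkaProducer in the comment above.
    return json.dumps(v).encode("utf-8")


# Case 1: the caller pre-encodes with json.dumps, then the serializer encodes again.
double_encoded = value_serializer(json.dumps(tweet))
once_decoded = json.loads(double_encoded.decode("utf-8"))
# once_decoded is a str here, so once_decoded["text"] would raise TypeError.

# Case 2: pass the dict itself and let the serializer encode it exactly once.
single_encoded = value_serializer(tweet)
decoded = json.loads(single_encoded.decode("utf-8"))
# decoded is a dict again, so decoded["text"] works as expected.
```

If this is what is happening, encoding in only one place (either the explicit json.dumps call or the value_serializer, not both) should restore the expected behaviour. Separately, if `tweet` is a tweepy status object rather than a plain dict, json.dumps(tweet) may raise inside the try block before anything is sent, which would also leave nothing to print.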
