Python PySpark & Big Data Analysis Using Python Made Simple
Python PySpark & Big Data Analysis Using Python Made Simple is available for $19.99 and has 93 lectures, an average rating of 3.85 based on 21 reviews, and 732 subscribers.
You will learn different PySpark functions and how big data analysis is done using PySpark. This course is ideal for beginners who are interested in learning the different functions of Python PySpark, or who are keen to learn big data analysis using Python PySpark.
Enroll now: Python PySpark & Big Data Analysis Using Python Made Simple
Summary
Title: Python PySpark & Big Data Analysis Using Python Made Simple
Price: $19.99
Average Rating: 3.85
Number of Lectures: 93
Number of Published Lectures: 93
Number of Curriculum Items: 93
Number of Published Curriculum Objects: 93
Original Price: $19.99
Quality Status: approved
Status: Live
What You Will Learn
- Learn different PySpark functions
- Learn how big data analysis is done using PySpark
Who Should Attend
- Beginners who are interested in learning different functions of Python PySpark
- Beginners who are keen to learn big data analysis using Python PySpark
Target Audiences
- Beginners who are interested in learning different functions of Python PySpark
- Beginners who are keen to learn big data analysis using Python PySpark
Welcome to the course ‘Python PySpark and Big Data Analysis Using Python Made Simple’
This course is from a software engineer who has managed to crack interviews at around 16 software companies.
Sometimes, life gives us no time to prepare. There are emergencies in which we have to summon our courage and start bringing the situation under our control rather than being controlled by it. At the end of the day, all of us leave this earth empty-handed. But in a given situation, we should live and fight in such a way that the whole sequence of actions makes us proud and gives us goosebumps even when we think about it ten years later.
Apache Spark is an open-source processing engine built around speed, ease of use, and analytics.
Developed to utilize distributed, in-memory data structures to improve data processing speeds for most workloads, Spark performs up to 100 times faster than Hadoop MapReduce for iterative algorithms. It supports Java, Scala, and Python APIs for ease of development.
The PySpark API enables the use of Python to interact with the Spark programming model. For programmers who are already familiar with Python, it provides easy access to the extremely high-performance data processing enabled by Spark’s Scala architecture, without really needing to learn any Scala.
Though Scala is more efficient, the PySpark API allows data scientists experienced in Python to write programming logic in the language most familiar to them. They can use it to perform rapid distributed transformations on large sets of data and get the results back in Python-friendly notation.
PySpark transformations (such as map, flatMap, and filter) return resilient distributed datasets (RDDs). Short functions are passed to RDD methods using Python’s lambda syntax, while longer functions are defined with the def keyword.
PySpark automatically ships the requested functions to the worker nodes. The worker nodes then run the Python processes and push the results back to the SparkContext, which stores the data in the RDD.
PySpark offers access via an interactive shell, providing a simple way to learn the API.
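As a minimal sketch of this model (the findspark install path below is hypothetical; adjust it to your own Spark location):
import findspark
findspark.init('/opt/spark')   # hypothetical install path

from pyspark import SparkContext

sc = SparkContext()
rdd = sc.parallelize([1, 2, 3, 4, 5])

# short function: an inline lambda passed to a transformation
squares = rdd.map(lambda x: x * x)

# longer function: defined with def and shipped to the workers automatically
def keep_even(n):
    return n % 2 == 0

print(squares.filter(keep_even).collect())   # [4, 16]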
This course has a lot of programs and single-line statements which extensively explain the use of the PySpark APIs.
Through programs and small data sets, we have explained how a file with big data sets is actually analyzed and the required results are returned.
The course duration is around 6 hours. We have followed a question-and-answer approach to explain the PySpark API concepts.
We would request you to kindly check the list of PySpark questions on the course landing page, and then, if you are interested, you can enroll in the course.
Note: This course is designed for absolute beginners
Questions:
>> Create and print an RDD from a Python collection of numbers. The given collection of numbers should be distributed across 5 partitions
>> Demonstrate the use of the glom() function
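A possible answer sketch for the two questions above (assuming sc is an initialized SparkContext):
rdd = sc.parallelize(range(10), 5)   # distribute the collection across 5 partitions
print(rdd.collect())                 # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(rdd.glom().collect())          # glom() shows per-partition contents: [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]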
>> Using the range() function, print '1, 3, 5'
>> What is the output of the below statements?
sc = SparkContext()
sc.setLogLevel("ERROR")
sc.range(5).collect()
sc.range(2, 4).collect()
sc.range(1, 7, 2).collect()
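For reference, sc.range() behaves like Python's built-in range, so the three collect() calls should print [0, 1, 2, 3, 4], [2, 3], and [1, 3, 5] respectively.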
>> Given a Python collection of numbers in an RDD with a given set of partitions, perform the following:
-> write a function which calculates the square of each number
-> apply this function on the specified partitions of the RDD
>> Given an RDD whose partition contents are:
[[0, 1], [2, 3], [4, 5]]
write statements such that you get the below outputs:
[0, 1, 16, 25]
[0, 1]
[4, 9]
[16, 25]
>> With the help of SparkContext(), read and display the contents of a text file
>> Explain the use of the union() function
>> Is it possible to combine and print the contents of a text file and the contents of an RDD?
>> Write a program to list a particular directory’s text files and their contents
>> Given two functions seqOp and combOp, what is the output of the below statements:
seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
print(sc.parallelize([1, 2, 3, 4], 2).aggregate((0, 0), seqOp, combOp))
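For reference: seqOp folds each partition into a (running sum, running count) pair and combOp merges the per-partition pairs, so the printed result should be (10, 4), i.e. the sum and count of [1, 2, 3, 4].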
>> Given a data set: [1, 2], write a statement such that we get the output as below:
[(1, 1), (1, 2), (2, 1), (2, 2)]
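One possible sketch, using cartesian() to pair every element with every element:
rdd = sc.parallelize([1, 2])
print(sorted(rdd.cartesian(rdd).collect()))   # [(1, 1), (1, 2), (2, 1), (2, 2)]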
>> Given the data: [1,2,3,4,5].
What is the difference between the output of the below 2 statements:
print(sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(4).glom().collect())
print(sc.parallelize([1, 2, 3, 4, 5], 5).coalesce(4).glom().collect())
>> Given two RDDs x and y:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
Write a PySpark statement which produces the below output:
[('a', ([1], [2])), ('b', ([4], []))]
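A sketch using cogroup(), which groups the values for each key across both RDDs:
result = x.cogroup(y)
print([(k, tuple(map(list, v))) for k, v in sorted(result.collect())])   # [('a', ([1], [2])), ('b', ([4], []))]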
>> Given the below statement:
m = sc.parallelize([(1, 2), (3, 4)]).collectAsMap()
Find out a way to print the below values:
'2'
'4'
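A sketch: collectAsMap() returns an ordinary Python dict, so the values can be looked up by key:
print(m[1])   # 2
print(m[3])   # 4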
>> Explain the output of the below statement:
print(sc.parallelize([2, 3, 4]).count())
output: 3
>> Given the statement:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
Find a way to count the occurrences of the keys and print the output as below:
[('a', 2), ('b', 1)]
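One possible answer using countByKey():
print(sorted(rdd.countByKey().items()))   # [('a', 2), ('b', 1)]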
>> Explain the output of the below statement:
print(sorted(sc.parallelize([1, 2, 1, 2, 2], 2).countByValue().items()))
output: [(1, 2), (2, 3)]
>> Given the RDD which contains the elements -> [1, 1, 2, 3],
try to print only the first occurrence of each number
output: [1, 2, 3]
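A sketch using distinct() to drop the duplicates:
print(sorted(rdd.distinct().collect()))   # [1, 2, 3]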
>> Given the below statement:
rdd = sc.parallelize([1, 2, 3, 4, 5])
write a statement to print only -> [2, 4]
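A sketch using filter() to keep only the even numbers:
print(rdd.filter(lambda x: x % 2 == 0).collect())   # [2, 4]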
>> Given data: [2, 3, 4]. Try to print only the first element in the data (i.e., 2)
>> Given the below statement:
rdd = sc.parallelize([2, 3, 4])
Write a statement to get the below output from the above rdd:
[(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]
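A sketch using flatMap(), which flattens the two pairs emitted per element into one list:
print(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())   # [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]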
>> Given the below statement:
x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
Write a statement/statements to get the below output from the above rdd:
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
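A sketch using flatMapValues(), which pairs the key with each element of its value list:
print(x.flatMapValues(lambda value: value).collect())   # [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]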
>> Given the below statement:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
What is the output of the below statements (assuming add has been imported from the operator module):
print(sorted(rdd.foldByKey(0, add).collect()))
print(sorted(rdd.foldByKey(1, add).collect()))
print(sorted(rdd.foldByKey(2, add).collect()))
>> Given the below statements:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("c", 8)])
Write a statement to get the output as
[('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]
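A sketch using fullOuterJoin(), which keeps keys from both sides and fills the missing side with None:
print(sorted(x.fullOuterJoin(y).collect()))   # [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]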
>> Is it possible to get the number of partitions of an RDD?
>> Given the below statement:
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
write a snippet to get the following output:
[(0, [2, 8]), (1, [1, 1, 3, 5])]
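A sketch using groupBy() with an even/odd key function:
result = rdd.groupBy(lambda x: x % 2)
print(sorted([(k, sorted(v)) for k, v in result.collect()]))   # [(0, [2, 8]), (1, [1, 1, 3, 5])]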
>> Given the below statements:
w = sc.parallelize([("a", 5), ("b", 6)])
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
z = sc.parallelize([("b", 42)])
write a snippet to get the following output:
output: [('a', ([5], [1], [2], [])), ('b', ([6], [4], [], [42]))]
>> Given the below statements:
rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
write a snippet to get the following output:
output:
[1, 2, 3]
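A sketch using intersection(), which keeps only the elements common to both RDDs:
print(sorted(rdd1.intersection(rdd2).collect()))   # [1, 2, 3]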
>> Given the below statements:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3)])
write a snippet to get the following output:
output:
[('a', (1, 2)), ('a', (1, 3))]
[('a', (2, 1)), ('a', (3, 1))]
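A sketch: join() matches pairs by key, and the side you call it on determines the value order:
print(sorted(x.join(y).collect()))   # [('a', (1, 2)), ('a', (1, 3))]
print(sorted(y.join(x).collect()))   # [('a', (2, 1)), ('a', (3, 1))]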
>> For the given data: [0, 1, 2, 3]
Write a statement to get the output as:
[(0, 0), (1, 1), (4, 2), (9, 3)]
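A sketch using keyBy(), which keys each element by the result of the function:
print(sc.parallelize(range(4)).keyBy(lambda x: x * x).collect())   # [(0, 0), (1, 1), (4, 2), (9, 3)]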
>> For the given data: [0, 1, 2, 3, 4] and [0, 1, 2, 3, 4]
Write a statement to get the output as:
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
>> Given the data:
[(0, 0), (1, 1), (4, 2), (9, 3)]
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
Write a statement to get the output as:
[(0, [[0], [0]]), (1, [[1], [1]]), (2, [[], [2]]), (3, [[], [3]]), (4, [[2], [4]]), (9, [[3], []])]
>> Given the data: [(1, 2), (3, 4)]
Print only '1' and '3'
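A sketch using keys():
print(sc.parallelize([(1, 2), (3, 4)]).keys().collect())   # [1, 3]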
>> Given the below statements:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
write a snippet to get the following output:
output:
[('a', (1, 2)), ('b', (4, None))]
[('a', (1, 2))]
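A sketch: leftOuterJoin() keeps every key of x (padding with None), while a plain join() keeps only matching keys:
print(sorted(x.leftOuterJoin(y).collect()))   # [('a', (1, 2)), ('b', (4, None))]
print(sorted(x.join(y).collect()))            # [('a', (1, 2))]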
>> What is the output of the below statements:
rdd = sc.parallelize(["b", "a", "c"])
print(sorted(rdd.map(lambda x: (x, 1)).collect()))
>> What is the output of the below statements:
rdd = sc.parallelize([1, 2, 3, 4], 2)
def f(iterator): yield sum(iterator)
print(rdd.mapPartitions(f).collect())
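For reference: mapPartitions() applies f once per partition, so with the two partitions [1, 2] and [3, 4] this should print [3, 7].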
>> Explain the output of the below code snippet:
rdd = sc.parallelize([1, 2, 3, 4], 4)
def f(splitIndex, iterator):
    yield splitIndex
print(rdd.mapPartitionsWithIndex(f).sum())
output: 6
>> Explain the output of the below code snippet:
x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
def f(x): return len(x)
print(x.mapValues(f).collect())
output: [('a', 3), ('b', 1)]
>> What is the output of the below snippet:
import findspark
findspark.init('/opt/spark-2.2.1-bin-hadoop2.7')
import pyspark
import os
from pyspark import SparkContext
sc = SparkContext()
sc.setLogLevel("ERROR")
print(sc.parallelize([1, 2, 3]).mean())
>> What is the output of the below snippet:
pairs = sc.parallelize([1, 2, 3]).map(lambda x: (x, x))
sets = pairs.partitionBy(2).glom().collect()
print(sets)
>> Given the rdd below:
sc.parallelize([1, 2, 3, 4, 5])
write a statement to get the below output:
output: 15
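A sketch: sum() (or equivalently a reduce() with addition) totals the elements:
print(sc.parallelize([1, 2, 3, 4, 5]).sum())   # 15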
>> Given the statement below:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
Write a statement to get the below output:
output:
[('a', 2), ('b', 1)]
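A sketch using reduceByKey() with addition:
from operator import add
print(sorted(rdd.reduceByKey(add).collect()))   # [('a', 2), ('b', 1)]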
>> What is the difference between leftOuterJoin() and rightOuterJoin()?
>> Given the below statement:
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
what is the output of the below statements:
print(sc.parallelize(tmp).sortBy(lambda x: x[0]).collect())
print(sc.parallelize(tmp).sortBy(lambda x: x[1]).collect())
>> Given the statement:
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
y = sc.parallelize([("a", 3), ("c", None)])
do something to get the output:
[('a', 1), ('b', 4), ('b', 5)]
>> Given the statement:
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
y = sc.parallelize([("a", 3), ("c", None)])
do something to get the output:
[('b', 4), ('b', 5)]
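Sketches for the two questions above, each using its own x and y: subtract() removes whole (key, value) pairs that also occur in y, while subtractByKey() removes every pair whose key appears in y:
print(sorted(x.subtract(y).collect()))        # [('a', 1), ('b', 4), ('b', 5)]
print(sorted(x.subtractByKey(y).collect()))   # [('b', 4), ('b', 5)]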
>> Given the statement:
sc.parallelize(["a", "b", "c", "d"], 3)
do something to get the output:
[('a', 0), ('b', 1), ('c', 2), ('d', 3)]
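A sketch using zipWithIndex(), which pairs each element with its position:
print(sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect())   # [('a', 0), ('b', 1), ('c', 2), ('d', 3)]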
>> Given the statement:
sc.parallelize(["a", "b", "c", "d", "e"], 3)
do something to get the output:
[('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]
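A sketch using zipWithUniqueId(), whose ids depend on the partitioning rather than on element order:
print(sc.parallelize(["a", "b", "c", "d", "e"], 3).zipWithUniqueId().collect())   # [('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]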
>> Given the statements:
x = sc.parallelize(range(0,5))
y = sc.parallelize(range(1000, 1005))
do something to get the output:
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
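A sketch using zip(), which pairs the RDDs element by element (they must have the same partitioning and element counts):
print(x.zip(y).collect())   # [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]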
>> What is the output of the given program?
sc = SparkContext()
sc.setLogLevel("ERROR")
data = [["xyz1", "a1", 1, 2],
        ["xyz1", "a2", 3, 4],
        ["xyz2", "a1", 5, 6],
        ["xyz2", "a2", 7, 8],
        ["xyz3", "a1", 9, 10]]
rdd = sc.parallelize(data, 4)
output = rdd.map(lambda y: [y[0], y[1], (y[2] + y[3]) / 2])
output2 = output.filter(lambda y: "a2" in y)
output4 = output2.takeOrdered(num=3, key=lambda x: -x[2])
print(output4)
output5 = output2.takeOrdered(num=3, key=lambda x: x[2])
print(output5)
>> Output the contents of a text file
>> Output the contents of a CSV file
>> Write a program to save to a sequence file and read from a sequence file
>> Write a program to save data in JSON format and display the contents of a JSON file
>> Write a program to add indices to data sets
>> Write a program to differentiate between odd and even numbers using the filter function
>> Write a program to explain the concept of the join function
>> Write a program to explain the concept of the map function
>> Write a program to explain the concept of the fold function
>> Write a program to explain the concept of the reduceByKey function
>> Write a program to explain the concept of the combineByKey function
>> There are also many programs showcasing how big data is analyzed
Course Curriculum
Chapter 1: Introduction
Lecture 1: Introduction
Lecture 2: Question 1
Lecture 3: Question 2
Lecture 4: Question 3
Lecture 5: Question 4
Lecture 6: Question 5
Lecture 7: Question 6
Lecture 8: Question 7
Lecture 9: Question 8
Lecture 10: Question 9
Lecture 11: Question 10
Lecture 12: Question 11
Lecture 13: Question 12
Lecture 14: Question 13
Lecture 15: Question 14
Lecture 16: Question 15
Lecture 17: Question 16
Lecture 18: Question 17
Lecture 19: Question 18
Lecture 20: Question 19
Lecture 21: Question 20
Lecture 22: Question 21
Lecture 23: Question 22
Lecture 24: Question 23
Lecture 25: Question 24
Lecture 26: Question 25
Lecture 27: Question 26
Lecture 28: Question 27
Lecture 29: Question 28
Lecture 30: Question 29
Lecture 31: Question 30
Lecture 32: Question 31
Lecture 33: Question 32
Lecture 34: Question 33
Lecture 35: Question 34
Lecture 36: Question 35
Lecture 37: Question 36
Lecture 38: Question 37
Lecture 39: Question 38
Lecture 40: Question 39
Lecture 41: Question 40
Lecture 42: Question 41
Lecture 43: Question 42
Lecture 44: Question 43
Lecture 45: Question 44
Lecture 46: Question 45
Lecture 47: Question 46
Lecture 48: Question 47
Lecture 49: Question 48
Lecture 50: Question 49
Lecture 51: Question 50
Lecture 52: Question 51
Lecture 53: Question 52
Lecture 54: Question 53
Lecture 55: Question 54
Lecture 56: Question 55
Lecture 57: Question 56
Lecture 58: Question 57
Lecture 59: Question 58
Lecture 60: Question 59
Lecture 61: Question 60
Lecture 62: Question 61
Lecture 63: Question 62
Lecture 64: Question 63
Lecture 65: Question 64
Lecture 66: Question 65
Lecture 67: Question 66
Lecture 68: Question 67
Lecture 69: Question 68
Lecture 70: Question 69
Lecture 71: Question 70
Lecture 72: Question 71
Lecture 73: Question 72
Lecture 74: Question 73
Lecture 75: Question 74
Lecture 76: Question 75
Lecture 77: Question 76
Lecture 78: Question 77
Chapter 2: Exercises
Lecture 1: Exercise 1
Lecture 2: Exercise 2
Lecture 3: Exercise 3
Lecture 4: Exercise 4
Lecture 5: Exercise 5
Lecture 6: Exercise 6
Lecture 7: Exercise 7
Lecture 8: Exercise 8
Lecture 9: Exercise 9
Lecture 10: Exercise 10
Lecture 11: Exercise 11
Lecture 12: Exercise 12
Lecture 13: Exercise 13
Lecture 14: Exercise 14
Lecture 15: Exercise 15
Instructors
-
Satish Venkatesh
Software Engineer
Rating Distribution
- 1 star: 3 votes
- 2 stars: 1 vote
- 3 stars: 4 votes
- 4 stars: 4 votes
- 5 stars: 9 votes
Frequently Asked Questions
How long do I have access to the course materials?
You can view and review the lecture materials indefinitely, like an on-demand channel.
Can I take my courses with me wherever I go?
Definitely! If you have an internet connection, courses on Udemy are available on any device at any time. If you don’t have an internet connection, some instructors also let their students download course lectures. That’s up to the instructor though, so make sure you get on their good side!