Python PySpark & Big Data Analysis Using Python Made Simple
Python PySpark & Big Data Analysis Using Python Made Simple is available for $19.99 and has 93 lectures, an average rating of 3.85 based on 21 reviews, and 732 subscribers.
You will learn different PySpark functions and how big data analysis is done using PySpark. This course is ideal for beginners who are interested in learning the different functions of Python PySpark, or who are keen to learn big data analysis using Python PySpark.
Enroll now: Python PySpark & Big Data Analysis Using Python Made Simple
Summary
Title: Python PySpark & Big Data Analysis Using Python Made Simple
Price: $19.99
Average Rating: 3.85
Number of Lectures: 93
Number of Published Lectures: 93
Number of Curriculum Items: 93
Number of Published Curriculum Objects: 93
Original Price: $19.99
Quality Status: approved
Status: Live
What You Will Learn
- Learn different PySpark functions
- Learn how big data analysis is done using PySpark
Who Should Attend
- Beginners who are interested in learning different functions of Python PySpark
- Beginners who are keen to learn big data analysis using Python PySpark
Target Audiences
- Beginners who are interested in learning different functions of Python PySpark
- Beginners who are keen to learn big data analysis using Python PySpark
Welcome to the course ‘Python PySpark and Big Data Analysis Using Python Made Simple’
This course is from a software engineer who has managed to crack interviews at around 16 software companies.
Sometimes, life gives us no time to prepare. There are emergencies in which we have to summon our courage and start bringing the situation under our control rather than being controlled by it. At the end of the day, all of us leave this earth empty-handed. But in a given situation, we should live and fight in such a way that the whole sequence of actions makes us proud and gives us goosebumps even when we think about it ten years later.
Apache Spark is an open-source processing engine built around speed, ease of use, and analytics.
Developed to utilize distributed, in-memory data structures to improve data processing speeds for most workloads, Spark performs up to 100 times faster than Hadoop MapReduce for iterative algorithms. It supports Java, Scala, and Python APIs for ease of development.
The PySpark API enables the use of Python to interact with the Spark programming model. For programmers who are already familiar with Python, it provides easy access to the extremely high-performance data processing enabled by Spark’s Scala architecture, without really needing to learn any Scala.
Though Scala is more efficient, the PySpark API allows data scientists experienced in Python to write programming logic in the language most familiar to them. They can use it to perform rapid distributed transformations on large sets of data and get the results back in Python-friendly notation.
PySpark transformations (such as map, flatMap, and filter) return resilient distributed datasets (RDDs). Short functions are passed to RDD methods using Python’s lambda syntax, while longer functions are defined with the def keyword.
PySpark automatically ships the requested functions to the worker nodes. The worker nodes then run the Python processes and push the results back to the SparkContext, which stores the data in the RDD.
PySpark offers access via an interactive shell, providing a simple way to learn the API.
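As a minimal sketch of this model (the findspark install path below is hypothetical; adjust it to your own Spark location):
import findspark
findspark.init('/opt/spark')   # hypothetical install path

from pyspark import SparkContext

sc = SparkContext()
rdd = sc.parallelize([1, 2, 3, 4, 5])

# short function: an inline lambda passed to a transformation
squares = rdd.map(lambda x: x * x)

# longer function: defined with def and shipped to the workers automatically
def keep_even(n):
    return n % 2 == 0

print(squares.filter(keep_even).collect())   # [4, 16]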
This course has a lot of programs and single-line statements which extensively explain the use of the PySpark APIs.
Through programs and small data sets, we have explained how a file with big data sets is actually analyzed and the required results are returned.
The course duration is around 6 hours. We have followed a question-and-answer approach to explain the PySpark API concepts.
We would request you to kindly check the list of PySpark questions on the course landing page, and then, if you are interested, you can enroll in the course.
Note: This course is designed for absolute beginners
Questions:
>> Create and print an RDD from a Python collection of numbers. The given collection of numbers should be distributed across 5 partitions
>> Demonstrate the use of the glom() function
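A possible answer sketch for the two questions above (assuming sc is an initialized SparkContext):
rdd = sc.parallelize(range(10), 5)   # distribute the collection across 5 partitions
print(rdd.collect())                 # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(rdd.glom().collect())          # glom() shows per-partition contents: [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]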
>> Using the range() function, print '1, 3, 5'
>> What is the output of the below statements?
sc = SparkContext()
sc.setLogLevel("ERROR")
sc.range(5).collect()
sc.range(2, 4).collect()
sc.range(1, 7, 2).collect()
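For reference, sc.range() behaves like Python's built-in range, so the three collect() calls should print [0, 1, 2, 3, 4], [2, 3], and [1, 3, 5] respectively.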
>> Given a Python collection of numbers in an RDD with a given set of partitions, perform the following:
-> write a function which calculates the square of each number
-> apply this function on the specified partitions of the RDD
>> Given an RDD whose partition contents are:
[[0, 1], [2, 3], [4, 5]]
write statements such that you get the below outputs:
[0, 1, 16, 25]
[0, 1]
[4, 9]
[16, 25]
>> With the help of SparkContext(), read and display the contents of a text file
>> Explain the use of the union() function
>> Is it possible to combine and print the contents of a text file and the contents of an RDD?
>> Write a program to list a particular directory’s text files and their contents
>> Given two functions seqOp and combOp, what is the output of the below statements:
seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
print(sc.parallelize([1, 2, 3, 4], 2).aggregate((0, 0), seqOp, combOp))
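For reference: seqOp folds each partition into a (running sum, running count) pair and combOp merges the per-partition pairs, so the printed result should be (10, 4), i.e. the sum and count of [1, 2, 3, 4].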
>> Given a data set: [1, 2], write a statement such that we get the output as below:
[(1, 1), (1, 2), (2, 1), (2, 2)]
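One possible sketch, using cartesian() to pair every element with every element:
rdd = sc.parallelize([1, 2])
print(sorted(rdd.cartesian(rdd).collect()))   # [(1, 1), (1, 2), (2, 1), (2, 2)]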
>> Given the data: [1,2,3,4,5].
What is the difference between the output of the below 2 statements:
print(sc.parallelize([1, 2, 3, 4, 5], 3).coalesce(4).glom().collect())
print(sc.parallelize([1, 2, 3, 4, 5], 5).coalesce(4).glom().collect())
>> Given two RDDs x and y:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
Write a PySpark statement which produces the below output:
[('a', ([1], [2])), ('b', ([4], []))]
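A sketch using cogroup(), which groups the values for each key across both RDDs:
result = x.cogroup(y)
print([(k, tuple(map(list, v))) for k, v in sorted(result.collect())])   # [('a', ([1], [2])), ('b', ([4], []))]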
>> Given the below statement:
m = sc.parallelize([(1, 2), (3, 4)]).collectAsMap()
Find out a way to print the below values:
'2'
'4'
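A sketch: collectAsMap() returns an ordinary Python dict, so the values can be looked up by key:
print(m[1])   # 2
print(m[3])   # 4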
>> Explain the output of the below statement:
print(sc.parallelize([2, 3, 4]).count())
output: 3
>> Given the statement:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
Find a way to count the occurrences of the keys and print the output as below:
[('a', 2), ('b', 1)]
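One possible answer using countByKey():
print(sorted(rdd.countByKey().items()))   # [('a', 2), ('b', 1)]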
>> Explain the output of the below statement:
print(sorted(sc.parallelize([1, 2, 1, 2, 2], 2).countByValue().items()))
output: [(1, 2), (2, 3)]
>> Given the RDD which contains the elements -> [1, 1, 2, 3],
try to print only the first occurrence of each number
output: [1, 2, 3]
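A sketch using distinct() to drop the duplicates:
print(sorted(rdd.distinct().collect()))   # [1, 2, 3]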
>> Given the below statement:
rdd = sc.parallelize([1, 2, 3, 4, 5])
write a statement to print only -> [2, 4]
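A sketch using filter() to keep only the even numbers:
print(rdd.filter(lambda x: x % 2 == 0).collect())   # [2, 4]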
>> Given data: [2, 3, 4]. Try to print only the first element in the data (i.e., 2)
>> Given the below statement:
rdd = sc.parallelize([2, 3, 4])
Write a statement to get the below output from the above rdd:
[(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]
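A sketch using flatMap(), which flattens the two pairs emitted per element into one list:
print(rdd.flatMap(lambda x: [(x, x), (x, x)]).collect())   # [(2, 2), (2, 2), (3, 3), (3, 3), (4, 4), (4, 4)]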
>> Given the below statement:
x = sc.parallelize([("a", ["x", "y", "z"]), ("b", ["p", "r"])])
Write a statement/statements to get the below output from the above rdd:
[('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]
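A sketch using flatMapValues(), which pairs the key with each element of its value list:
print(x.flatMapValues(lambda value: value).collect())   # [('a', 'x'), ('a', 'y'), ('a', 'z'), ('b', 'p'), ('b', 'r')]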
>> Given the below statement:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
What is the output of the below statements (assuming add has been imported from the operator module):
print(sorted(rdd.foldByKey(0, add).collect()))
print(sorted(rdd.foldByKey(1, add).collect()))
print(sorted(rdd.foldByKey(2, add).collect()))
>> Given the below statements:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("c", 8)])
Write a statement to get the output as
[('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]
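A sketch using fullOuterJoin(), which keeps keys from both sides and fills the missing side with None:
print(sorted(x.fullOuterJoin(y).collect()))   # [('a', (1, 2)), ('b', (4, None)), ('c', (None, 8))]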
>> Is it possible to get the number of partitions of an RDD?
>> Given the below statement:
rdd = sc.parallelize([1, 1, 2, 3, 5, 8])
write a snippet to get the following output:
[(0, [2, 8]), (1, [1, 1, 3, 5])]
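A sketch using groupBy() with an even/odd key function:
result = rdd.groupBy(lambda x: x % 2)
print(sorted([(k, sorted(v)) for k, v in result.collect()]))   # [(0, [2, 8]), (1, [1, 1, 3, 5])]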
>> Given the below statements:
w = sc.parallelize([("a", 5), ("b", 6)])
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
z = sc.parallelize([("b", 42)])
write a snippet to get the following output:
output: [('a', ([5], [1], [2], [])), ('b', ([6], [4], [], [42]))]
>> Given the below statements:
rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
write a snippet to get the following output:
output:
[1, 2, 3]
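A sketch using intersection(), which keeps only the elements common to both RDDs:
print(sorted(rdd1.intersection(rdd2).collect()))   # [1, 2, 3]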
>> Given the below statements:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("a", 3)])
write a snippet to get the following output:
output:
[('a', (1, 2)), ('a', (1, 3))]
[('a', (2, 1)), ('a', (3, 1))]
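A sketch: join() matches pairs by key, and the side you call it on determines the value order:
print(sorted(x.join(y).collect()))   # [('a', (1, 2)), ('a', (1, 3))]
print(sorted(y.join(x).collect()))   # [('a', (2, 1)), ('a', (3, 1))]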
>> For the given data: [0, 1, 2, 3]
Write a statement to get the output as:
[(0, 0), (1, 1), (4, 2), (9, 3)]
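A sketch using keyBy(), which keys each element by the result of the function:
print(sc.parallelize(range(4)).keyBy(lambda x: x * x).collect())   # [(0, 0), (1, 1), (4, 2), (9, 3)]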
>> For the given data: [0, 1, 2, 3, 4] and [0, 1, 2, 3, 4]
Write a statement to get the output as:
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
>> Given the data:
[(0, 0), (1, 1), (4, 2), (9, 3)]
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
Write a statement to get the output as:
[(0, [[0], [0]]), (1, [[1], [1]]), (2, [[], [2]]), (3, [[], [3]]), (4, [[2], [4]]), (9, [[3], []])]
>> Given the data: [(1, 2), (3, 4)]
Print only '1' and '3'
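A sketch using keys():
print(sc.parallelize([(1, 2), (3, 4)]).keys().collect())   # [1, 3]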
>> Given the below statements:
x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2)])
write a snippet to get the following output:
output:
[('a', (1, 2)), ('b', (4, None))]
[('a', (1, 2))]
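A sketch: leftOuterJoin() keeps every key of x (padding with None), while a plain join() keeps only matching keys:
print(sorted(x.leftOuterJoin(y).collect()))   # [('a', (1, 2)), ('b', (4, None))]
print(sorted(x.join(y).collect()))            # [('a', (1, 2))]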
>> What is the output of the below statements:
rdd = sc.parallelize(["b", "a", "c"])
print(sorted(rdd.map(lambda x: (x, 1)).collect()))
>> What is the output of the below statements:
rdd = sc.parallelize([1, 2, 3, 4], 2)
def f(iterator): yield sum(iterator)
print(rdd.mapPartitions(f).collect())
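For reference: mapPartitions() applies f once per partition, so with the two partitions [1, 2] and [3, 4] this should print [3, 7].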
>> Explain the output of the below code snippet:
rdd = sc.parallelize([1, 2, 3, 4], 4)
def f(splitIndex, iterator):
    yield splitIndex
print(rdd.mapPartitionsWithIndex(f).sum())
output: 6
>> Explain the output of the below code snippet:
x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
def f(x): return len(x)
print(x.mapValues(f).collect())
output: [('a', 3), ('b', 1)]
>> What is the output of the below snippet:
import findspark
findspark.init('/opt/spark-2.2.1-bin-hadoop2.7')
import pyspark
import os
from pyspark import SparkContext
sc = SparkContext()
sc.setLogLevel("ERROR")
print(sc.parallelize([1, 2, 3]).mean())
>> What is the output of the below snippet:
pairs = sc.parallelize([1, 2, 3]).map(lambda x: (x, x))
sets = pairs.partitionBy(2).glom().collect()
print(sets)
>> Given the rdd below:
sc.parallelize([1, 2, 3, 4, 5])
write a statement to get the below output:
output: 15
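A sketch: sum() (or equivalently a reduce() with addition) totals the elements:
print(sc.parallelize([1, 2, 3, 4, 5]).sum())   # 15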
>> Given the statement below:
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
Write a statement to get the below output:
output:
[('a', 2), ('b', 1)]
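A sketch using reduceByKey() with addition:
from operator import add
print(sorted(rdd.reduceByKey(add).collect()))   # [('a', 2), ('b', 1)]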
>> What is the difference between leftOuterJoin() and rightOuterJoin()?
>> Given the below statement:
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
what is the output of the below statements:
print(sc.parallelize(tmp).sortBy(lambda x: x[0]).collect())
print(sc.parallelize(tmp).sortBy(lambda x: x[1]).collect())
>> Given the statement:
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
y = sc.parallelize([("a", 3), ("c", None)])
do something to get the output:
[('a', 1), ('b', 4), ('b', 5)]
>> Given the statement:
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 2)])
y = sc.parallelize([("a", 3), ("c", None)])
do something to get the output:
[('b', 4), ('b', 5)]
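Sketches for the two questions above, each using its own x and y: subtract() removes whole (key, value) pairs that also occur in y, while subtractByKey() removes every pair whose key appears in y:
print(sorted(x.subtract(y).collect()))        # [('a', 1), ('b', 4), ('b', 5)]
print(sorted(x.subtractByKey(y).collect()))   # [('b', 4), ('b', 5)]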
>> Given the statement:
sc.parallelize(["a", "b", "c", "d"], 3)
do something to get the output:
[('a', 0), ('b', 1), ('c', 2), ('d', 3)]
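A sketch using zipWithIndex(), which pairs each element with its position:
print(sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect())   # [('a', 0), ('b', 1), ('c', 2), ('d', 3)]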
>> Given the statement:
sc.parallelize(["a", "b", "c", "d", "e"], 3)
do something to get the output:
[('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]
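A sketch using zipWithUniqueId(), whose ids depend on the partitioning rather than on element order:
print(sc.parallelize(["a", "b", "c", "d", "e"], 3).zipWithUniqueId().collect())   # [('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]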
>> Given the statements:
x = sc.parallelize(range(0,5))
y = sc.parallelize(range(1000, 1005))
do something to get the output:
[(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]
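A sketch using zip(), which pairs the RDDs element by element (they must have the same partitioning and element counts):
print(x.zip(y).collect())   # [(0, 1000), (1, 1001), (2, 1002), (3, 1003), (4, 1004)]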
>> What is the output of the given program?
sc = SparkContext()
sc.setLogLevel("ERROR")
data = [["xyz1", "a1", 1, 2],
        ["xyz1", "a2", 3, 4],
        ["xyz2", "a1", 5, 6],
        ["xyz2", "a2", 7, 8],
        ["xyz3", "a1", 9, 10]]
rdd = sc.parallelize(data, 4)
output = rdd.map(lambda y: [y[0], y[1], (y[2] + y[3]) / 2])
output2 = output.filter(lambda y: "a2" in y)
output4 = output2.takeOrdered(num=3, key=lambda x: -x[2])
print(output4)
output5 = output2.takeOrdered(num=3, key=lambda x: x[2])
print(output5)
>> Output the contents of a text file
>> Output the contents of a CSV file
>> Write a program to save to a sequence file and read from a sequence file
>> Write a program to save data in JSON format and display the contents of a JSON file
>> Write a program to add indices to data sets
>> Write a program to differentiate between odd and even numbers using the filter function
>> Write a program to explain the concept of the join function
>> Write a program to explain the concept of the map function
>> Write a program to explain the concept of the fold function
>> Write a program to explain the concept of the reduceByKey function
>> Write a program to explain the concept of the combineByKey function
>> There are also many programs showcasing how big data is analyzed
Course Curriculum
Chapter 1: Introduction
Lecture 1: Introduction
Lecture 2: Question 1
Lecture 3: Question 2
Lecture 4: Question 3
Lecture 5: Question 4
Lecture 6: Question 5
Lecture 7: Question 6
Lecture 8: Question 7
Lecture 9: Question 8
Lecture 10: Question 9
Lecture 11: Question 10
Lecture 12: Question 11
Lecture 13: Question 12
Lecture 14: Question 13
Lecture 15: Question 14
Lecture 16: Question 15
Lecture 17: Question 16
Lecture 18: Question 17
Lecture 19: Question 18
Lecture 20: Question 19
Lecture 21: Question 20
Lecture 22: Question 21
Lecture 23: Question 22
Lecture 24: Question 23
Lecture 25: Question 24
Lecture 26: Question 25
Lecture 27: Question 26
Lecture 28: Question 27
Lecture 29: Question 28
Lecture 30: Question 29
Lecture 31: Question 30
Lecture 32: Question 31
Lecture 33: Question 32
Lecture 34: Question 33
Lecture 35: Question 34
Lecture 36: Question 35
Lecture 37: Question 36
Lecture 38: Question 37
Lecture 39: Question 38
Lecture 40: Question 39
Lecture 41: Question 40
Lecture 42: Question 41
Lecture 43: Question 42
Lecture 44: Question 43
Lecture 45: Question 44
Lecture 46: Question 45
Lecture 47: Question 46
Lecture 48: Question 47
Lecture 49: Question 48
Lecture 50: Question 49
Lecture 51: Question 50
Lecture 52: Question 51
Lecture 53: Question 52
Lecture 54: Question 53
Lecture 55: Question 54
Lecture 56: Question 55
Lecture 57: Question 56
Lecture 58: Question 57
Lecture 59: Question 58
Lecture 60: Question 59
Lecture 61: Question 60
Lecture 62: Question 61
Lecture 63: Question 62
Lecture 64: Question 63
Lecture 65: Question 64
Lecture 66: Question 65
Lecture 67: Question 66
Lecture 68: Question 67
Lecture 69: Question 68
Lecture 70: Question 69
Lecture 71: Question 70
Lecture 72: Question 71
Lecture 73: Question 72
Lecture 74: Question 73
Lecture 75: Question 74
Lecture 76: Question 75
Lecture 77: Question 76
Lecture 78: Question 77
Chapter 2: Exercises
Lecture 1: Exercise 1
Lecture 2: Exercise 2
Lecture 3: Exercise 3
Lecture 4: Exercise 4
Lecture 5: Exercise 5
Lecture 6: Exercise 6
Lecture 7: Exercise 7
Lecture 8: Exercise 8
Lecture 9: Exercise 9
Lecture 10: Exercise 10
Lecture 11: Exercise 11
Lecture 12: Exercise 12
Lecture 13: Exercise 13
Lecture 14: Exercise 14
Lecture 15: Exercise 15
Instructors
-
Satish Venkatesh
Software Engineer
Rating Distribution
- 1 star: 3 votes
- 2 stars: 1 vote
- 3 stars: 4 votes
- 4 stars: 4 votes
- 5 stars: 9 votes
Frequently Asked Questions
How long do I have access to the course materials?
You can view and review the lecture materials indefinitely, like an on-demand channel.
Can I take my courses with me wherever I go?
Definitely! If you have an internet connection, courses on Udemy are available on any device at any time. If you don’t have an internet connection, some instructors also let their students download course lectures. That’s up to the instructor though, so make sure you get on their good side!