Big Data Professional Program
Preference | Dates | Timing | Location |
---|---|---|---|
In-Person and Live Webinars | 06, 09, 13, 16, 20, 23 February 2023 | Mondays & Thursdays: 7:00PM - 9:30PM | Dubai Knowledge Park |
In-Person and Live Webinars | 20, 22, 24, 27 February, 1, 2 March 2023 | Mondays, Wednesdays, Fridays: 10:00AM - 12:30PM | Dubai Knowledge Park |
Course Description
One of the most valuable technology skills is the ability to analyze and gain insight from massive datasets. This course is specifically designed to bring you up to speed on one of the best technologies for this task, Apache Spark! Leading organizations, including Google, Facebook, Netflix, Airbnb, Amazon, and NASA, use Spark to solve their big data problems!
This course covers the latest Spark technologies, including Spark SQL, MLlib, and Spark Streaming. After completing this course, you will be able to work on real-world big data projects using Spark and PySpark. Upon successful completion of this program, you will obtain a KHDA-attested Course Completion Certificate.

Course Outline
Unit 1 – Big Data Ecosystem
- What is Big Data?
- Big Data Characteristics
- Data Processing Challenges
- What is a Distributed File System (DFS)?
- Solving the Speed Problem with DFS
- What is Hadoop / HDFS?
- Big Data Processing with MapReduce
- Hive vs. Pig
- Introducing Apache Spark
- MapReduce, Hive, Pig vs. Spark
- Hadoop Ecosystem Overview
Unit 2 – Environment Setup
- Apache Spark Installation
- PySpark Installation and Configuration
Unit 3 – Spark DataFrames
- Spark DataFrame Basics
- Spark DataFrame Basic Operations
- GroupBy and Aggregate Operations
- Handling Missing Data
- Working with Dates and Timestamps
- Practical Project: Processing 20 million records with Apache Spark
Unit 4 – Introduction to Machine Learning with Spark’s MLlib
Unit 5 – Regression
- Developing a Regression Model with Spark’s MLlib
- Practical Project: Building a Linear Regression Model for a Shipping Company
Unit 6 – Classification
- Implementing a Classification Model with Spark’s MLlib
- Practical Project: Building a Logistic Regression Model for a Marketing Agency
Unit 7 – Clustering
- Solving an Unsupervised Learning Problem with Spark’s MLlib
- Practical Project: Implementing the K-Means Clustering Algorithm
Unit 8 – Recommender Systems
- Introduction to Recommender Systems
- Collaborative Filtering Recommender Systems
- Practical Project: Building a Movie Recommender System with Spark
Unit 9 – Natural Language Processing
- Introduction to Natural Language Processing (NLP) with Spark
- Word and Character Tokenizers
- Stop Words Removal
- Feature Extractors (TF-IDF)
- Practical Project: Building a Spam Filter
Unit 10 – Spark Streaming
- Introduction to Streaming with Spark
- Processing Unstructured Data with Spark Streaming
- Practical Project: Processing Twitter Data Feeds using Spark Streaming
Who Should Attend
- Data analysts and future data scientists interested in learning how to process Big Data with Apache Spark.
- Software engineers and programmers who want to understand the larger Big Data ecosystem and use it to store and analyze Big Data.
- Project, program, or product managers interested in learning about the high-level architecture of Big Data.
Prerequisites
- Experience with Python programming and machine learning, or successful completion of our Artificial Intelligence Professional Program.
Course Outcome
Participants who successfully complete this course will be able to analyze large datasets (structured and unstructured data) and build predictive models using Apache Spark and PySpark.