Big Data Professional Program

Preference | Dates | Timing | Location
In-Person and Live Webinars | 06, 09, 13, 16, 20, 23 February 2023 | Mondays & Thursdays: 7:00PM - 9:30PM | Dubai Knowledge Park
In-Person and Live Webinars | 20, 22, 24, 27 February, 1, 2 March 2023 | Mondays, Wednesdays, Fridays: 10:00AM - 12:30PM | Dubai Knowledge Park

Course Description

One of the most valuable technology skills is the ability to analyze and gain insight from massive datasets. This course is designed to bring you up to speed on one of the best technologies for this task: Apache Spark! Leading organizations including Google, Facebook, Netflix, Airbnb, Amazon, and NASA all use Spark to solve their big data problems!

This course covers the latest Spark technologies, including Spark SQL, MLlib, and Spark Streaming. After completing this course, you will be able to work on real-world big data projects using Spark and PySpark. Upon successful completion of this program, you will obtain a KHDA-Attested Course Completion Certificate.

Unit 1 – Big Data Ecosystem

  • What is Big Data?
  • Big Data Characteristics
  • Data Processing Challenges
  • What is a Distributed File System (DFS)?
  • Solving the Speed Problem with DFS
  • What is Hadoop / HDFS?
  • Big Data Processing with MapReduce
  • Hive vs. Pig
  • Introducing Apache Spark
  • MapReduce, Hive, Pig vs. Spark
  • Hadoop Ecosystem Overview

Unit 2 – Environment Setup

  • Apache Spark Installation
  • PySpark Installation and Configuration

Unit 3 – Spark DataFrames

  • Spark DataFrame Basics
  • Spark DataFrame Basic Operations
  • Groupby and Aggregate Operations
  • Handling Missing Data
  • Working with Dates and Timestamps
  • Practical Project: Processing 20 million records with Apache Spark

Unit 4 – Introduction to Machine Learning with Spark’s MLlib

Unit 5 – Regression

  • Developing a Regression Model with Spark’s MLlib
  • Practical Project: Building a Linear Regression Model for a Shipping Company

Unit 6 – Classification

  • Implementing a Classification Model with Spark’s MLlib
  • Practical Project: Building a Logistic Regression Model for a Marketing Agency

Unit 7 – Clustering

  • Solving an Unsupervised Learning Problem with Spark’s MLlib
  • Practical Project: Implementing K-Means Clustering Algorithm

Unit 8 – Recommender Systems

  • Introduction to Recommender Systems
  • Collaborative Filtering Recommender Systems
  • Practical Project: Building a Movie Recommender System with Spark

Unit 9 – Natural Language Processing

  • Introduction to Natural Language Processing (NLP) with Spark
  • Word and Character Tokenizers
  • Stop Words Removal
  • Feature Extractors (TF-IDF)
  • Practical Project: Building a Spam Filter

Unit 10 – Spark Streaming

  • Introduction to Streaming with Spark
  • Processing Unstructured Data with Spark Streaming
  • Practical Project: Processing Twitter Data Feeds using Spark Streaming

Target Audience

  • Data analysts and future data scientists interested in learning how to process Big Data with Apache Spark.
  • Software engineers and programmers who want to understand the larger Big Data ecosystem and use it to store and analyze Big Data.
  • Project, program, or product managers interested in learning about the high-level architecture of Big Data.

Participants who successfully complete this course will be able to analyze large datasets (structured and unstructured data) and build predictive models using Apache Spark and PySpark.