Big Data: Storing and Processing Massive Datasets

Preference | Dates | Timing | Location
Evening Course | 17, 18, 19, 20, 23, 24 February 2020 | 07:15 PM - 09:45 PM | Dubai Knowledge Park
Evening Course | 13, 14, 15, 16, 19, 20 April 2020 | 07:15 PM - 09:45 PM | Dubai Knowledge Park

Course Description

The ability to store and process huge data sets is one of the most valuable technology skills. This course is designed to bring you up to speed on widely used technologies for the task, including Hadoop and Apache Spark, which many leading technology companies use to solve their big data problems.

This course will help you learn the most popular Big Data and Hadoop technologies, including HDFS, MapReduce, Spark, MLlib and Spark Streaming. It includes hands-on projects from several industries, including transportation, advertising and entertainment.


Unit 1 – Big Data Overview

  • Overview of the Hadoop Ecosystem
  • Hadoop’s Core: HDFS and MapReduce
  • How MapReduce distributes processing

Unit 2 – Environment Setup

  • Hadoop Installation on a Linux Virtual Machine
  • Spark Installation on a Linux Virtual Machine
  • Configuring HDFS
  • Configuring PySpark

Unit 3 – Spark DataFrames

  • What is Spark?
  • Spark DataFrame Basics
  • Spark DataFrame Basic Operations
  • GroupBy and Aggregate Operations
  • Handling Missing Data
  • Working with Dates and Timestamps
  • Practical Project: Processing a 20-million-record dataset

Unit 4 – Machine Learning with Spark MLlib

  • How to implement a Linear Regression model with MLlib
  • Practical Project: Building a regression model for a shipping company
  • How to implement a Logistic Regression model with MLlib
  • Practical Project: Building a classification model for a marketing agency

Unit 5 – Recommender Systems

  • Introduction to Recommender Systems
  • Collaborative Filtering Recommender Systems
  • Practical Project: Building a movie recommender system with Spark

Unit 6 – Spark Streaming

  • Introduction to Streaming with Spark
  • Spark Streaming Documentation
  • Practical Project: Processing Twitter feeds using Spark Streaming

Who Should Attend

  • Data analysts and database administrators who are curious about Hadoop and how it relates to their work.
  • Software engineers and programmers who want to understand the larger Hadoop ecosystem and use it to store, analyze, and serve big data at scale.
  • Project, program, or product managers who want to understand the high-level architecture of Big Data and Hadoop.

Participants who successfully complete this course are encouraged to take the Innosoft Certified Big Data Professional Exam (BD-200).