Big Data: Storing and Processing Massive Datasets

Preference: Evening Course
Dates: 18 - 26 November 2020
Timing: 07:00 PM - 09:30 PM
Delivery Method: Live Sessions, Lecture Videos and Hands-on Projects

Course Description

The ability to store and process huge data sets is one of the most valuable technology skills. This course is designed to bring you up to speed on some of the most widely used technologies for the task, including Hadoop and Apache Spark, which many leading technology companies use to solve their big data problems.

This course will enable you to learn and master the most popular Big Data and Hadoop technologies, including HDFS, MapReduce, Spark, MLlib and Spark Streaming. It’s filled with hands-on projects from various industries and verticals, including transportation, advertising and entertainment.


Unit 1 – Big Data Ecosystem 

  • What is Big Data?
  • Big Data Characteristics
  • Data Processing Challenges
  • What is a Distributed File System (DFS)?
  • Solving the Speed Problem with DFS
  • What is Hadoop?
  • HDFS
  • Big Data Processing with MapReduce
  • Hive vs. Pig
  • Introducing Apache Spark
  • MapReduce, Hive, Pig vs. Spark
  • Ambari Web UI
  • Hadoop Ecosystem

Unit 2 – Linux Operating System Review

  • Create virtual machines using VirtualBox
  • Install Linux on virtual machines
  • Run simple Linux commands using the shell
  • Manage files and directories from the shell prompt
  • Create, view, and edit text files from the command line with the vi editor
  • Set Linux permissions on files and directories
  • Access remote systems securely using SSH
  • Configure basic Linux networking
  • Archive files and copy them from one system to another
  • Download, install, update, and manage software packages

Unit 3 – Environment Setup

  • Install and Set Up a Hadoop Cluster on Linux
  • Install and Set Up a Spark Cluster on Linux
  • Configure HDFS
  • Configure PySpark
  • Configure Jupyter Notebook to access Hadoop and Spark Clusters

Unit 4 – Spark DataFrames

  • What is Spark?
  • Spark DataFrame Basics
  • Spark DataFrame Basic Operations
  • Groupby and Aggregate Operations
  • Handling Missing Data
  • Working with Dates and Timestamps
  • Practical Project: Processing a 20-million-record dataset

Unit 5 – Machine Learning with Spark MLlib

  • Implement a Linear Regression model with Spark’s MLlib
  • Practical Project: Build a regression model for a shipping company
  • Implement a Logistic Regression model with Spark’s MLlib
  • Practical Project: Build a classification model for a marketing agency 

Unit 6 – Recommender Systems

  • Introduction to Recommender Systems
  • Collaborative Filtering Recommender Systems
  • Practical Project: Building a movie recommender system with Spark

Unit 7 – Spark Streaming

  • Introduction to Streaming with Spark
  • Processing Unstructured Data with Spark Streaming 
  • Practical Project: Processing Twitter Feeds using Spark Streaming

Who Should Attend

  • Data analysts and database administrators who are curious about Hadoop and how it relates to their work.
  • Software engineers and programmers who want to understand the larger Hadoop ecosystem, and use it to store, analyze, and vend “big data” at scale.
  • Project, program, or product managers who want to understand the high-level architecture of Big Data and Hadoop.

Participants who successfully complete this course are encouraged to take the Innosoft Certified Big Data Professional Exam (BD-200).