Cloud Computing for Big Data

Dates                 | Timing            | Location
03 - 07 February 2019 | 07:00PM - 10:00PM | Dubai Knowledge Park
07 - 11 April 2019    | 07:00PM - 10:00PM | Dubai Knowledge Park

Course Description

To process massive datasets, you need to set up clusters for both data processing and data storage. Many aspiring data science professionals and engineers have little knowledge of, or experience with, doing this in a Linux environment. This course will enable you to master the skills required to set up cloud clusters for data storage using Hadoop’s HDFS, and for data processing using Spark.
Not only will you successfully install and configure Hadoop and Spark clusters, but you will also learn how to configure your development environment (Jupyter Notebook) to access these clusters to store and process massive datasets.

Linux Administration Fundamentals

Create virtual machines using VirtualBox
Install Linux (Ubuntu 16 Server) on a virtual machine
Copy, move, create, delete, and organize files from the bash shell prompt
Resolve problems by using Linux documentation
Create, view, and edit text files from the command line with the VIM editor
Set file permissions and understand the effect of different security permissions
Extract and archive compressed files with the tar command
Create symbolic links to files
Set environment variables
Access remote systems securely
Configure basic networking
Configure DNS using /etc/hosts
Monitor and manage Linux processes
Locate and accurately interpret log files for troubleshooting
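A few of the shell tasks above can be sketched as follows; the directory and file names are illustrative, not part of the course material:

```shell
# Create a working directory and a sample file
mkdir -p /tmp/linux-demo && cd /tmp/linux-demo
echo "hello" > notes.txt

# Copy, move, and organize files
cp notes.txt notes.bak
mkdir -p archive && mv notes.bak archive/

# Set file permissions: owner read/write, group and others read-only
chmod 644 notes.txt

# Archive and compress with tar, then extract into a new directory
tar -czf archive.tar.gz notes.txt
mkdir -p extracted && tar -xzf archive.tar.gz -C extracted

# Create a symbolic link to a file (-f replaces an existing link)
ln -sf /tmp/linux-demo/notes.txt notes-link

# Set an environment variable for the current session
export DEMO_HOME=/tmp/linux-demo
echo "$DEMO_HOME"
```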


Hadoop Cluster Installation and Configuration

Architecture of a Hadoop Cluster
DNS Configuration
Creating and Distributing SSH Keys
Downloading and Unpacking Hadoop Binaries
Setting up Environment Variables
Configuring the Master Node
Configuring the Slave Nodes
Configuring Memory Allocation
Formatting and Running HDFS
Configuring YARN as a Job Scheduler
Running and Monitoring HDFS
Running YARN
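The master-node configuration steps above can be sketched roughly as below. The hostnames (node-master, node1, node2), the port, and the installation path are assumptions to adapt to your own cluster; the format and start commands are shown as comments because they need a live Hadoop installation:

```shell
# Assume Hadoop binaries were unpacked here (adjust to your installation)
export HADOOP_HOME=/tmp/hadoop-demo
mkdir -p "$HADOOP_HOME/etc/hadoop"

# Minimal core-site.xml: tell every node where the NameNode lives
cat > "$HADOOP_HOME/etc/hadoop/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node-master:9000</value>
  </property>
</configuration>
EOF

# List the slave (worker) nodes, one hostname per line
# (this file is named "workers" in Hadoop 3.x, "slaves" in 2.x)
printf 'node1\nnode2\n' > "$HADOOP_HOME/etc/hadoop/workers"

# With a real installation you would then format and start the cluster:
#   hdfs namenode -format
#   start-dfs.sh
#   start-yarn.sh
```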

Spark Cluster Installation and Configuration

Preparing your System for Spark Installation
Installing Spark on the Master Node
Installing Spark on the Slave Nodes
Integrating Spark with YARN
Running the Spark Cluster
Configuring the Memory Allocation
Running a Spark Application on top of a YARN Cluster
Monitoring Your Spark Applications
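As a sketch of the YARN-integration and memory-allocation steps above, the following writes a minimal spark-defaults.conf; the installation path and memory values are illustrative, and the submit command is shown as a comment because it needs a running cluster:

```shell
# Assume Spark binaries were unpacked here (adjust to your installation)
export SPARK_HOME=/tmp/spark-demo
mkdir -p "$SPARK_HOME/conf"

# Point Spark at YARN and cap driver/executor memory (values illustrative)
cat > "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
spark.master            yarn
spark.driver.memory     512m
spark.executor.memory   512m
EOF

# On a live cluster you would then submit an application to YARN, e.g.:
#   spark-submit $SPARK_HOME/examples/src/main/python/pi.py 10
# and monitor it through the YARN ResourceManager web UI (port 8088 by default)
```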

Running Massive Datasets on Spark and Hadoop Clusters

Storing Massive Datasets on HDFS
Configuring Jupyter Notebooks to Access Spark and Hadoop Clusters
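One common way to wire Jupyter Notebook to a Spark cluster is to launch it as the PySpark driver front end. A minimal sketch, assuming Spark is installed under /opt/spark and the NameNode runs on a host called node-master (both assumptions):

```shell
# Write PySpark/Jupyter settings to a shell profile snippet
# (paths and hostnames are illustrative; adjust to your cluster)
cat > /tmp/pyspark-jupyter.sh <<'EOF'
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"
# Launch Jupyter Notebook as the PySpark driver front end
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0'
EOF

# After sourcing this snippet, running `pyspark` opens a notebook whose
# SparkSession can read HDFS paths such as hdfs://node-master:9000/data.csv
source /tmp/pyspark-jupyter.sh
echo "$PYSPARK_DRIVER_PYTHON"
```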

Who Should Attend

IT professionals, Data Scientists, and Big Data Engineers who are interested in setting up Hadoop and Spark clusters in the cloud and running massive datasets on top of this infrastructure.

Prerequisites

There are no prerequisites for this course.

Learning Outcomes

Participants who have successfully completed this course will be able to set up large-scale cloud infrastructure for Big Data in a Linux environment.