Cloud Computing for Big Data

Dates Timing Location
03 - 07 February 2019 07:00PM - 10:00PM Dubai Knowledge Park
07 - 11 April 2019 07:00PM - 10:00PM Dubai Knowledge Park

Course Description

To process massive datasets, you need to set up clusters for both data processing and data storage. Many aspiring data science professionals and engineers have little knowledge or experience of how to do this in a Linux environment. This course will enable you to master the skills required to set up cloud clusters that use Hadoop’s HDFS for data storage and Spark for data processing.
Not only will you install and configure Hadoop and Spark clusters, but you will also learn how to configure your development environment (Jupyter Notebook) to access these clusters to store and process massive datasets.

Linux Administration Fundamentals

Create virtual machines using VirtualBox
Install Linux (Ubuntu 16 Server) on a virtual machine
Copy, move, create, delete, and organize files from the bash shell prompt
Resolve problems by using Linux documentation
Create, view, and edit text files from command line with the VIM editor
Set file permissions and understand the effect of different security permissions
Extract and archive compressed files with the tar command
Create symbolic links to files
Set environment variables
Access remote systems securely
Configure basic networking
Configure DNS using /etc/hosts
Monitor and manage Linux processes
Locate and accurately interpret log files for troubleshooting
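
As a small taste of the file permissions topic above, the following is a minimal Python sketch (not course material) that turns an octal mode, such as the value passed to chmod, into the permission string shown by ls -l. The helper name describe_mode is hypothetical; stat.filemode comes from the Python standard library.

```python
import stat


def describe_mode(octal_mode: int) -> str:
    """Return the ls-style permission string for a regular file.

    describe_mode is a hypothetical helper used only for illustration.
    """
    # S_IFREG marks the entry as a regular file, so the string starts with "-".
    return stat.filemode(stat.S_IFREG | octal_mode)


# 644: owner can read and write, group and others can only read.
print(describe_mode(0o644))
# 750: owner has full access, group can read and execute, others have none.
print(describe_mode(0o750))
```

Reading the string in groups of three characters (owner, group, others) shows who can read, write, or execute the file.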

Hadoop Cluster Installation and Configuration

Architecture of a Hadoop Cluster
DNS Configuration
Creating and Distributing SSH Keys
Downloading and Unpacking Hadoop Binaries
Setting up Environment Variables
Configuring the Master Node
Configuring the Slave Nodes
Configuring Memory Allocation
Formatting and Running HDFS
Configuring YARN as a Job Scheduler
Running and Monitoring HDFS
Running YARN
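
To illustrate what configuring the master node involves, here is a minimal Python sketch that renders a Hadoop-style configuration file from a dictionary of properties. It is an illustration under assumptions, not the procedure taught in the course: the helper hadoop_site_xml, the hostname node-master, and port 9000 are hypothetical, while fs.defaultFS and dfs.replication are standard Hadoop property names (for core-site.xml and hdfs-site.xml respectively).

```python
from xml.sax.saxutils import escape


def hadoop_site_xml(properties: dict) -> str:
    """Render a Hadoop *-site.xml document from a name -> value mapping.

    hadoop_site_xml is a hypothetical helper used only for illustration.
    """
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(str(value))}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


# Assumed values: a master node reachable as "node-master" on port 9000.
core_site = hadoop_site_xml({"fs.defaultFS": "hdfs://node-master:9000"})
# Assumed value: each HDFS block is stored on two data nodes.
hdfs_site = hadoop_site_xml({"dfs.replication": 2})

print(core_site)
print(hdfs_site)
```

On a real cluster the rendered text would be written to the configuration directory of the Hadoop installation on every node; the exact values depend on the cluster being built.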

Spark Cluster Installation and Configuration

Preparing your System for Spark Installation
Installing Spark on the Master Node
Installing Spark on the Slave Nodes
Integrating Spark with YARN
Running the Spark Cluster
Configuring the Memory Allocation
Running a Spark Application on top of a YARN Cluster
Monitoring Your Spark Applications
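
Running a Spark application on top of YARN usually comes down to a spark-submit call with the right options. The sketch below assembles such a command in Python; the helper spark_submit_command, the script name wordcount.py, and the 512m memory sizes are assumptions made for illustration, while --master, --deploy-mode, --driver-memory, and --executor-memory are standard spark-submit options.

```python
import shlex


def spark_submit_command(app_path: str,
                         deploy_mode: str = "cluster",
                         driver_memory: str = "512m",
                         executor_memory: str = "512m") -> list:
    """Build the argument list for submitting an application to YARN.

    spark_submit_command is a hypothetical helper used only for illustration.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
        app_path,
    ]


# Assumed application script; memory sizes are placeholders for small nodes.
command = spark_submit_command("wordcount.py")
# Quote each argument so the printed line is safe to paste into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

Keeping the command as a list means it could also be passed directly to subprocess.run on a machine where Spark is installed.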

Running Massive Datasets on Spark and Hadoop Clusters

Storing Massive Datasets on HDFS
Configure Jupyter Notebooks to access Spark and Hadoop Clusters
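
Once both clusters are running, a notebook cell can read a file stored on HDFS through Spark. The following is a rough Python sketch of that idea, assuming PySpark is installed in the notebook environment and the cluster configuration is visible to it; the helpers hdfs_uri and count_lines, the hostname node-master, port 9000, and the file path are all hypothetical.

```python
def hdfs_uri(host: str, port: int, path: str) -> str:
    """Build an hdfs:// URI for an absolute path (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    return f"hdfs://{host}:{port}{path}"


def count_lines(uri: str) -> int:
    """Count the lines of a text file on HDFS using Spark on YARN.

    Needs PySpark and a reachable cluster, so it is defined here
    but not called in this sketch.
    """
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("yarn")
             .appName("line-count")
             .getOrCreate())
    try:
        # Each line of the file becomes one row of the resulting DataFrame.
        return spark.read.text(uri).count()
    finally:
        spark.stop()


# Assumed name node address and file location.
uri = hdfs_uri("node-master", 9000, "/user/hadoop/books/alice.txt")
print(uri)
```

In a notebook connected to the cluster, calling count_lines(uri) would run the count as a Spark job scheduled by YARN.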

This course is intended for IT professionals, data scientists and big data engineers who are interested in setting up Hadoop and Spark clusters on the cloud and processing massive datasets on top of this infrastructure.

There are no prerequisites for this course.

Participants who successfully complete this course will be able to set up large cloud infrastructure for Big Data in a Linux environment.