MOD 20775: Perform Data Engineering on Microsoft HDInsight


Duration: 5 days

This five-day course gives students the ability to plan and implement big data workflows on HDInsight.

Those who may be interested in this course are:

Data engineers
Data architects
Data scientists
Data developers who intend to implement big data engineering workflows on HDInsight

After completing this course, students will be able to:

Deploy HDInsight Clusters
Authorize Users to Access Resources
Load Data into HDInsight
Troubleshoot HDInsight
Implement Batch Solutions
Design Batch ETL Solutions for Big Data with Spark
Analyze Data with Spark SQL
Analyze Data with Hive and Phoenix
Describe Stream Analytics
Implement Spark Streaming Using the DStream API
Develop Big Data Real-Time Processing Solutions with Apache Storm
Build Solutions that use Kafka and HBase


In addition to professional experience, students should have:

Programming experience using R, and familiarity with common R packages
Knowledge of common statistical methods and data analysis best practices
Basic knowledge of the Microsoft Windows operating system and its core functionality
Working knowledge of relational databases

What’s included?

  • Authorized Courseware
  • Intensive Hands-on Skills Development with an Experienced Subject Matter Expert
  • Hands-on practice on real servers and extended lab support 1.800.482.3172
  • Examination Vouchers & Onsite Certification Testing (excluding Adobe and PMP Boot Camps)
  • Academy Code of Honor: Test Pass Guarantee
  • Optional: Package for Hotel Accommodations, Lunch and Transportation

With several convenient training delivery methods, The Code Academy makes it easy to get the training you need. You can learn in a classroom, in a live online virtual environment, through training videos hosted online, or in private group classes hosted at your site. We offer expert instruction to individuals, government agencies, non-profits, and corporations. Our live classes, on-sites, and online training videos all feature certified instructors who teach a detailed curriculum and share their expertise and insights with trainees. However you prefer to receive the training, you can count on The Code Academy for an engaging and effective learning experience.


  • Instructor-Led (the best training format we offer)
  • Live Online Classroom – Online Instructor Led
  • Self-Paced Video

Speak to an Admissions Representative for complete details


Module 1: Getting Started with HDInsight

What is Big Data?
Introduction to Hadoop
Working with the MapReduce Function
Introducing HDInsight
Lab: Working with HDInsight

Provision an HDInsight cluster and run MapReduce jobs

Module 2: Deploying HDInsight Clusters

Identifying HDInsight cluster types
Managing HDInsight clusters by using the Azure portal
Managing HDInsight Clusters by using Azure PowerShell
Lab: Managing HDInsight clusters with the Azure Portal

Create an HDInsight cluster that uses Data Lake Store storage
Customize HDInsight by using script actions
Delete an HDInsight cluster

Module 3: Authorizing Users to Access Resources

Non-domain Joined clusters
Configuring domain-joined HDInsight clusters
Manage domain-joined HDInsight clusters
Lab: Authorizing Users to Access Resources

Prepare the Lab Environment
Manage a non-domain joined cluster

Module 4: Loading data into HDInsight

Storing data for HDInsight processing
Using data loading tools
Maximising value from stored data
Lab: Loading Data into your Azure account

Load data for use with HDInsight

Module 5: Troubleshooting HDInsight

Analyze HDInsight logs
YARN logs
Heap Dumps
Operations Management Suite
Lab: Troubleshooting HDInsight

Analyze HDInsight logs
Analyze YARN logs
Monitor resources with Operations Management Suite

Module 6: Implementing Batch Solutions

Apache Hive storage
HDInsight data queries using Hive and Pig
Operationalize HDInsight
Lab: Implement Batch Solutions

Deploy HDInsight cluster and data storage
Use data transfers with HDInsight clusters
Query HDInsight cluster data

Module 7: Design Batch ETL solutions for big data with Spark

What is Spark?
ETL with Spark
Spark performance
Lab: Design Batch ETL solutions for big data with Spark

Create an HDInsight Cluster with access to Data Lake Store
Use HDInsight Spark cluster to analyze data in Data Lake Store
Analyzing website logs using a custom library with Apache Spark cluster on HDInsight
Managing resources for Apache Spark cluster on Azure HDInsight

Module 8: Analyze Data with Spark SQL

Implementing iterative and interactive queries
Perform exploratory data analysis
Lab: Performing exploratory data analysis by using iterative and interactive queries

Build a machine learning application
Use Zeppelin for interactive data analysis
View and manage Spark sessions by using Livy

Module 9: Analyze Data with Hive and Phoenix

Implement interactive queries for big data with interactive Hive
Perform exploratory data analysis by using Hive
Perform interactive processing by using Apache Phoenix
Lab: Analyze data with Hive and Phoenix

Implement interactive queries for big data with interactive Hive
Perform exploratory data analysis by using Hive
Perform interactive processing by using Apache Phoenix

Module 10: Stream Analytics

Stream Analytics
Process streaming data from Stream Analytics
Managing Stream Analytics jobs
Lab: Implement Stream Analytics

Process streaming data with Stream Analytics
Managing Stream Analytics jobs

Module 11: Implementing Streaming Solutions with Kafka and HBase

Building and Deploying a Kafka Cluster
Publishing, Consuming, and Processing data using the Kafka Cluster
Using HBase to store and Query Data
Lab: Implementing Streaming Solutions with Kafka and HBase

Create a virtual network and gateway
Create a Storm cluster for Kafka
Create a Kafka producer
Create a streaming processor client topology
Create a Power BI dashboard and streaming dataset
Create an HBase cluster
Create a streaming processor to write to HBase

Module 12: Develop big data real-time processing solutions with Apache Storm

Persist long-term data
Stream data with Storm
Create Storm topologies
Configure Apache Storm
Lab: Developing big data real-time processing solutions with Apache Storm

Stream data with Storm
Create Storm Topologies

Module 13: Create Spark Streaming Applications

Working with Spark Streaming
Creating Spark Structured Streaming Applications
Persistence and Visualization
Lab: Building a Spark Streaming Application

Installing Required Software
Building the Azure Infrastructure
Building a Spark Streaming Pipeline