20775 Performing Data Engineering on Microsoft HD Insight

COURSE TIMES: 9:00am - 4:30pm


The main purpose of the course is to give students the ability to plan and implement big data workflows on HDInsight.


The primary audience for this course is data engineers, data architects, data scientists, and data developers who plan to implement big data engineering workflows on HDInsight.

In addition to their professional experience, students who attend this course should have:

  • Programming experience using R, and familiarity with common R packages.
  • Knowledge of common statistical methods and data analysis best practices.
  • Basic knowledge of the Microsoft Windows operating system and its core functionality.
  • Working knowledge of relational databases.


*The listed course cost does not include the cost of courseware or the exam. The course is subject to a minimum enrollment to run. If the minimum enrollment is not met, the course may run virtually as a Virtual Instructor-Led Training (VILT) class, or it may run as a 4-day class (bootcamp style). For more information, please contact learn@vtec.org or call 207-775-0244.


Module 1: Getting Started with HDInsight
What is Big Data?
Introduction to Hadoop
Working with MapReduce Function
Introducing HDInsight
Lab : Working with HDInsight

Module 2: Deploying HDInsight Clusters
Identifying HDInsight cluster types
Managing HDInsight clusters by using the Azure portal
Lab : Managing HDInsight clusters with the Azure Portal

Module 3: Authorizing Users to Access Resources
Non-domain Joined clusters
Configuring domain-joined HDInsight clusters
Manage domain-joined HDInsight clusters
Lab : Authorizing Users to Access Resources

Module 4: Loading data into HDInsight
Storing data for HDInsight processing
Using data loading tools
Maximising value from stored data
Lab : Loading Data into your Azure account

Module 5: Troubleshooting HDInsight
Analyze HDInsight logs
YARN logs
Heap dumps
Operations Management Suite
Lab : Troubleshooting HDInsight

Module 6: Implementing Batch Solutions
Apache Hive storage
HDInsight data queries using Hive and Pig
Operationalize HDInsight
Lab : Implement Batch Solutions

Module 7: Design Batch ETL solutions for big data with Spark
What is Spark?
ETL with Spark
Spark performance
Lab : Design Batch ETL solutions for big data with Spark

Module 8: Analyze Data with Spark SQL
Implementing iterative and interactive queries
Perform exploratory data analysis
Lab : Performing exploratory data analysis by using iterative and interactive queries

Module 9: Analyze Data with Hive and Phoenix
Implement interactive queries for big data with Interactive Hive
Perform exploratory data analysis by using Hive
Perform interactive processing by using Apache Phoenix
Lab : Analyze data with Hive and Phoenix

Module 10: Stream Analytics
Stream analytics
Process streaming data from stream analytics
Managing stream analytics jobs
Lab : Implement Stream Analytics

Module 11: Implementing Streaming Solutions with Kafka and HBase
Building and Deploying a Kafka Cluster
Publishing, Consuming, and Processing data using the Kafka Cluster
Using HBase to Store and Query Data
Lab : Implementing Streaming Solutions with Kafka and HBase

Module 12: Develop big data real-time processing solutions with Apache Storm
Persist long term data
Stream data with Storm
Create Storm topologies
Configure Apache Storm
Lab : Developing big data real-time processing solutions with Apache Storm

Module 13: Create Spark Streaming Applications
Working with Spark Streaming
Creating Spark Structured Streaming Applications
Persistence and Visualization
Lab : Building a Spark Streaming Application