A Practical Introduction to Data Analysis and Big Data

COURSE OVERVIEW

The course starts with an introduction to the fundamental concepts of Big Data, then progresses to the programming languages and methodologies used to perform Data Analysis. Finally, we discuss the tools and infrastructure that enable Big Data storage, Distributed Processing, and Scalability.

Participants will have the opportunity to put this knowledge into practice through hands-on exercises. Group interaction and instructor feedback are an important component of the class.

Introduction to Data Analysis and Big Data

What Makes Big Data "Big"?
- Velocity, Volume, Variety, Veracity (VVVV)
Limits to Traditional Data Processing
Distributed Processing
Statistical Analysis
Types of Machine Learning Analysis
Data Visualization

Big Data Roles and Responsibilities

Administrators
Developers
Data Analysts

Languages Used for Data Analysis

R Language
- Why R for Data Analysis?
- Data manipulation, calculation and graphical display
Python
- Why Python for Data Analysis?
- Manipulating, processing, cleaning, and crunching data
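
As a flavour of the "manipulating, processing, cleaning, and crunching" item above, here is a minimal sketch using only the Python standard library. The sensor names, the sample rows, and the helper clean_readings are hypothetical and chosen purely for illustration; in practice a course exercise might use a dedicated data library instead.

```python
from statistics import mean


def clean_readings(raw_rows):
    """Normalise names and keep only rows whose value parses as a number."""
    cleaned = []
    for name, value in raw_rows:
        name = name.strip().lower()  # normalise whitespace and case
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue  # skip missing or malformed values
        if name:  # skip rows with an empty name
            cleaned.append((name, number))
    return cleaned


# Hypothetical raw input with typical problems: padding, mixed case, gaps.
raw_rows = [
    (" Sensor-A ", "21.5"),
    ("sensor-b", "n/a"),
    ("SENSOR-C", "19.0"),
    ("", "3.0"),
    ("sensor-d", None),
]

cleaned = clean_readings(raw_rows)
average = mean(value for _, value in cleaned)
print(cleaned)
print(average)
```

Only the two well-formed rows survive the cleaning step, so the average is taken over those two values.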

Approaches to Data Analysis

Statistical Analysis
- Time Series analysis
- Forecasting with Correlation and Regression models
- Inferential Statistics (estimating)
- Descriptive Statistics in Big Data sets (e.g. calculating mean)
Machine Learning
- Supervised vs unsupervised learning
- Classification and clustering
- Estimating cost of specific methods
- Filtering
Natural Language Processing
- Processing text
- Understanding the meaning of the text
- Automatic text generation
- Sentiment analysis / topic analysis
Computer Vision
- Acquiring, processing, analyzing, and understanding images
- Reconstructing, interpreting and understanding 3D scenes
- Using image data to make decisions
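
To make the Statistical Analysis topics above more concrete, the following sketch computes a mean and fits a simple least-squares regression line, then uses it for a one-step forecast. The monthly sales figures and the helper fit_line are made up for illustration; this is a toy example under those assumptions, not the method the course necessarily teaches.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-like and variance-like sums (unnormalised).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: sales observed over five months.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

average_sales = mean(sales)              # descriptive statistic
slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept         # predicted sales for month 6

print(average_sales, slope, intercept, forecast)
```

Because the sample data lie exactly on a straight line, the fitted line reproduces them exactly; real data would leave residuals that need to be examined.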

Big Data Infrastructure

Data Storage
- Relational databases (SQL)
  - MySQL
  - PostgreSQL
  - Oracle
- Non-relational databases (NoSQL)
  - Cassandra
  - MongoDB
  - Neo4j
- Understanding the nuances
  - Hierarchical databases
  - Object-oriented databases
  - Document-oriented databases
  - Graph-oriented databases
  - Other
Distributed Processing
- Hadoop
  - HDFS as a distributed filesystem
  - MapReduce for distributed processing
- Spark
  - All-in-one in-memory cluster computing framework for large-scale data processing
  - Structured streaming
  - Spark SQL
  - Machine Learning libraries: MLlib
  - Graph processing with GraphX
Scalability
- Public cloud
  - AWS, Google Cloud, Alibaba Cloud (Aliyun), etc.
- Private cloud
  - OpenStack, Cloud Foundry, etc.
- Auto-scalability
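
The MapReduce model listed under Distributed Processing above can be illustrated with a single-process word count. This sketch only mimics the map, shuffle, and reduce phases in plain Python; it does not use the Hadoop API, and the function names and sample documents are assumptions made for the example. On a real cluster the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical input: each string stands in for a file block on one node.
documents = ["big data is big", "data analysis with big data"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

The point of the split is that each document can be mapped independently, and each key can be reduced independently, which is what lets the work be distributed.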

Requirements

A general understanding of math.
A general understanding of programming.
A general understanding of databases.

Duration

35 hours (usually 5 days including breaks)

COURSE COMPLETION

Participants who complete this instructor-led, live training will gain a practical, real-world understanding of Big Data and its related technologies, methodologies and tools.

CREDIT BEARING

This course is NOT credit bearing.

COURSE LICENCE

This course is available under the Attribution-ShareAlike 2.0 South Africa licence.
