Become a Data Engineer
Everything you need to know: what data engineers do, what tools they use, and exactly how to start building your skills.
Code. Report. Analyse. Excel
What Does a Data Engineer Actually Do?
Data Engineers are the architects of the data world. They build and maintain the pipelines that collect, transform, and deliver clean, reliable, and fast data across systems.
Architect & Optimise Data Systems (20 - 25%)
You’ll help design scalable, efficient, and secure data infrastructure. This could involve data modeling, partitioning datasets for performance, managing permissions, and creating reusable data products. Organising data pipelines in a simple, consistent structure is key to optimisation.
Working Example: Designing a new event schema in Snowflake, ensuring tables are indexed for fast querying, and automating nightly refreshes.
Monitor, Debug & Improve Data Quality (15 - 20%)
Pipelines break. Data gets messy. A big part of your time is spent testing, validating, and fixing issues. You’ll write data tests, monitor performance, and create alerts for anomalies. Periodically re-engineering pipelines keeps performance from degrading over time.
Working Example: After a marketing tool update, your pipeline starts failing; you trace the schema change, update your parser, and reprocess missed records.
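Much of this day-to-day quality work can be expressed as small, automated checks. Here is a minimal sketch in Python, assuming the pipeline output is available as a pandas DataFrame; the column names and threshold are hypothetical:

```python
import pandas as pd

def validate_orders(df: pd.DataFrame, expected_min_rows: int = 1000) -> list[str]:
    """Run basic data-quality checks and return a list of failure messages."""
    failures = []

    # Volume check: catch silently empty or truncated loads
    if len(df) < expected_min_rows:
        failures.append(f"Row count {len(df)} below expected minimum {expected_min_rows}")

    # Completeness check: key columns must not contain nulls
    for col in ("order_id", "customer_id", "order_date"):
        if df[col].isna().any():
            failures.append(f"Null values found in required column '{col}'")

    # Uniqueness check: the primary key must not be duplicated
    if df["order_id"].duplicated().any():
        failures.append("Duplicate order_id values detected")

    return failures

# In a real pipeline these failures would trigger an alert; here we just print them.
orders = pd.DataFrame({
    "order_id": [1, 2, 2],
    "customer_id": [10, None, 12],
    "order_date": ["2024-01-01", "2024-01-02", "2024-01-02"],
})
print(validate_orders(orders, expected_min_rows=2))
```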
Collaborate & Define Requirements (20 - 25%)
You’ll work closely with data analysts, scientists, and business leads to understand what data is needed, how it should be structured, and what problems the data needs to solve. Engineers often define what’s even possible, or how to build toward it.
Working Example: A product team wants daily churn analysis. You define how to capture that data from a CRM system, and where to store it.
Design, Build & Maintain Data Pipelines (35 - 40%)
This is the core of the job: creating and maintaining data pipelines that ingest, transform, and load data into storage layers like data warehouses. You'll use SQL, Python, and orchestration tools to ensure data is accurately and reliably transferred into final data tables.
Working Example: Building an ETL pipeline that extracts transactions from an API, transforms them in Python, and loads them into Databricks for analysis.
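As a rough illustration, here is a minimal Python sketch of that extract-transform-load flow. The API URL, response shape, and output path are placeholders, and the final load step would normally target a warehouse such as Databricks or Snowflake rather than a local Parquet file:

```python
import requests
import pandas as pd

API_URL = "https://api.example.com/v1/transactions"  # hypothetical endpoint

def extract(url: str) -> list[dict]:
    """Pull raw transaction records from the source API."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()["results"]  # assumed response shape

def transform(records: list[dict]) -> pd.DataFrame:
    """Clean and reshape the raw records into an analysis-ready table."""
    df = pd.DataFrame(records)
    df["transaction_date"] = pd.to_datetime(df["transaction_date"])
    df["amount"] = df["amount"].astype(float)
    return df.drop_duplicates(subset="transaction_id")

def load(df: pd.DataFrame, path: str = "transactions.parquet") -> None:
    """Stand-in load step: in production this would write to the warehouse."""
    df.to_parquet(path, index=False)

if __name__ == "__main__":
    load(transform(extract(API_URL)))
```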
Who Do Data Engineers Work With?
You’re not just coding in isolation. Data Engineers are deeply embedded in the data workflow, and collaboration across departments is essential to keep data reliable and useful. You’ll work with:
Data Analysts & Scientists - they rely on you for accurate, well-structured data
Product Managers & Founders - they need your expertise to understand what's possible
BI & Ops Teams - You’ll automate pipelines, support dashboards, and troubleshoot issues
IT & Technical Teams - to ensure security and infrastructure are ready for the data landscape
Foundational Skills
These are the core skills you’ll need to become job-ready, and we've provided some recommended resources to help you prepare.
SQL
Every data pipeline starts with data extraction, and SQL is one of the primary ways to do it. Even if you’re later working in Python or with tools like Spark, writing efficient SQL is a must for staging data before transformation.
Why It Matters?
Data Orchestration
Orchestration tools like Azure Data Factory (ADF), Airflow, and Fabric allow you to automate and schedule ETL/ELT workflows. These tools manage complex sequences of tasks across systems, without manual intervention.
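As a flavour of what orchestration code looks like, here is a minimal Airflow DAG sketch in Python (Airflow 2.4+ assumed); the task names and schedule are placeholders, and ADF or Fabric would express the same flow through their visual designers instead:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("Pulling data from the source API...")

def transform():
    print("Cleaning and reshaping the extracted data...")

def load():
    print("Loading the transformed data into the warehouse...")

# Run the extract -> transform -> load sequence every night without manual intervention.
with DAG(
    dag_id="nightly_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # in Airflow < 2.4 this argument is schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```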
Where to Start
Why It Matters?


SQL Performance
Where to Start
Azure Data Engineering
Real World Use Cases
Extracting tables from a transactional database
Writing CTEs for time-based aggregations
Validating row counts across systems
Designing lean, efficient data warehouse structures
Automating transformations with stored procedures
Pro Tip
Focus on mastering window functions, partitioning, and query optimisation; these show up constantly in engineering interviews and production jobs.
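To make the window-function tip concrete, here is a small runnable sketch using Python's built-in sqlite3 module (SQLite 3.25+ is needed for window functions); the table and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, '2024-01-05', 120.0),
        (1, '2024-02-10',  80.0),
        (2, '2024-01-20', 200.0),
        (2, '2024-03-01',  50.0);
""")

# Rank each customer's orders by date and compute a running total per customer.
query = """
    SELECT
        customer_id,
        order_date,
        amount,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_rank,
        SUM(amount)  OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total
    FROM orders
    ORDER BY customer_id, order_date;
"""

for row in conn.execute(query):
    print(row)
```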
SQL Bolt
Advanced SQL
Real World Use Cases
Automating daily refresh from APIs into SQL
Triggering data validation after a load completes
Coordinating multiple pipelines across systems
Creating reportable views from large datasets
Creating alerts for activity monitoring
Pro Tip
Use parameterised pipelines and dynamic datasets; they make ADF/Airflow jobs reusable, scalable, and efficient.
Engineering on AWS
ADF for Engineers
Python
Python is the most flexible language for automating data workflows. It’s used to clean, transform, and move data between sources, and it glues together workflows that span multiple languages and orchestration tools.
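For instance, a few lines of pandas are often enough to turn a raw export into something a warehouse can ingest; the file name and column names below are hypothetical:

```python
import pandas as pd

# Read a raw CSV export (hypothetical file and columns) and standardise it for loading.
events = pd.read_csv("raw_events.csv")

events.columns = [c.strip().lower().replace(" ", "_") for c in events.columns]  # tidy headers
events["event_time"] = pd.to_datetime(events["event_time"], errors="coerce")    # parse timestamps
events = events.dropna(subset=["user_id", "event_time"])                        # drop unusable rows
events = events.drop_duplicates()                                               # remove exact dupes

events.to_csv("clean_events.csv", index=False)
```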
Why It Matters?
Data Warehousing
Data engineers build the structures where analysts and applications can access data reliably. You need to understand data models, warehouse architecture, and cloud-native platforms to ensure scalability and future-proofing.
Why It Matters?
Real World Use Cases
Writing ETL scripts that pull data via API
Using Pandas to preprocess CSVs or JSON logs
Scheduling batch jobs with Python + Airflow
Creating helpers to join data in Databricks
Logging results in tables for later audit
Pro Tip
Learn to write reusable functions and add logging to your scripts early; it pays off as your projects scale.
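A lightweight way to follow that tip is to configure logging once and reuse it in every pipeline function; this is a minimal sketch rather than a prescribed pattern:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("pipeline")

def load_table(name: str, rows: list[dict]) -> int:
    """Reusable load step that reports what it did instead of failing silently."""
    logger.info("Loading %d rows into %s", len(rows), name)
    # ... actual load logic would go here ...
    logger.info("Finished loading %s", name)
    return len(rows)

load_table("stg_orders", [{"order_id": 1}, {"order_id": 2}])
```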
Real World Use Cases
Designing star schemas in Snowflake or SSMS
Noticing when averages hide outliers
Digging into the causes of changing results
Spotting opportunities in weekly sales data
Talking confidently to all stakeholders
Pro Tip
Understanding dimensional modeling (facts vs dimensions) helps you build for query speed and usability, not just storage.
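To illustrate the facts-versus-dimensions idea, here is a tiny star schema built in-memory with sqlite3; the table and column names are invented, and a real warehouse would use Snowflake, Databricks, or similar:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: descriptive attributes about each customer
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        customer_name TEXT,
        region TEXT
    );

    -- Fact: one row per sale, holding measures plus keys into the dimensions
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        sale_date TEXT,
        amount REAL
    );

    INSERT INTO dim_customer VALUES (1, 'Acme Ltd', 'EMEA'), (2, 'Globex', 'APAC');
    INSERT INTO fact_sales VALUES (100, 1, '2024-01-05', 250.0), (101, 2, '2024-01-06', 90.0);
""")

# Typical analytical query: aggregate the fact table, sliced by a dimension attribute.
for row in conn.execute("""
    SELECT d.region, SUM(f.amount) AS total_sales
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.region
"""):
    print(row)
```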
Where to Start
Python for Engineering
Where to Start
Data Warehouse Toolkit
Python ETL Projects
Google Colab
Snowflake 101
Google Cloud Engineering
Advanced Skills
These are the aspirational skills you’ll need to excel as a Data Engineer or prepare to transition into a more advanced role.
Advanced SQL
Basic SQL gets you through simple extracts. Advanced SQL, including window functions, CTEs, optimisation techniques, and indexing, is how you handle billions of rows efficiently. These skills separate junior engineers from experienced ones.
Why It Matters?
Advanced Reporting
As companies grow, analysts can’t handle every request. Engineers build data marts, semantic layers, and reusable views that allow non-technical users to generate insights without waiting. Creating clean, easy-to-interpret datasets helps collaborators at every level.
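One concrete form this takes is a pre-aggregated view that BI tools can query directly. The sketch below uses sqlite3 and invented table names purely for illustration; in practice the same idea would live in your warehouse or a dbt model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")

# A reusable, pre-aggregated view: BI tools and analysts query daily_revenue
# instead of scanning and re-aggregating the raw orders table every time.
conn.execute("""
    CREATE VIEW daily_revenue AS
    SELECT
        DATE(order_date) AS order_day,
        COUNT(*)         AS orders,
        SUM(amount)      AS revenue
    FROM orders
    GROUP BY DATE(order_date)
""")

print(conn.execute("SELECT * FROM daily_revenue").fetchall())
```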
Where to Start
Why It Matters?
SQL Performance
Where to Start
Python for Data Analysts
Real World Use Cases
Writing multi-level aggregations for reporting
Debugging and tuning slow-running queries
Using PARTITION BY and RANK for churn analysis
Dynamically reconciling data through automation
Logging execution runs for later audit
Pro Tip
Use EXPLAIN or query plan tools; they reveal where your queries bottleneck before deployment, helping you optimise up front.
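Here is a small demonstration of that tip using sqlite3's EXPLAIN QUERY PLAN; the table is invented, and the equivalent commands in Snowflake, Postgres, or Databricks differ in syntax but serve the same purpose:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT, created_at TEXT)")

query = "SELECT COUNT(*) FROM events WHERE user_id = 42"

# Before the index: the planner has to scan the whole table.
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# Add an index on the filter column, then check the plan again.
conn.execute("CREATE INDEX idx_events_user_id ON events (user_id)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```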
Advanced SQL
Real World Use Cases
Creating pre-aggregated views for Power BI
Building dbt models that analysts can extend
Automating KPI refresh logic for exec dashboards
Allowing dashboard input to backfill datasets
Embedding curated views in various apps
Pro Tip
The best pipelines serve business users; think like a product manager, not just a coder, to ensure your data is used and effective.
Reporting Bootcamp
Cloud Data Tools
Modern data engineers are cloud-native. Whether you're batch processing with Spark, building notebooks in Databricks, or managing serverless ETL jobs, these are the operational tools of enterprise-scale infrastructure and analytics.
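As a taste of what that looks like in practice, here is a minimal PySpark sketch; it assumes a local Spark installation (on Databricks the spark session already exists), and the file paths and column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_sales_batch").getOrCreate()

# Read raw JSON events, keep completed sales, and aggregate per day.
events = spark.read.json("/data/raw/sales/*.json")

daily_sales = (
    events
    .filter(F.col("status") == "completed")
    .groupBy(F.to_date("created_at").alias("sale_date"))
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)

# Write the result as partitioned Parquet for downstream consumers.
daily_sales.write.mode("overwrite").partitionBy("sale_date").parquet("/data/curated/daily_sales")
```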
Why It Matters?
Version Control
Data teams now operate like software teams. Using Git to manage your pipeline code, version your transformations, and collaborate via pull requests brings clear alignment and ownership across data teams, and gives the visibility needed to manage changes more easily.
Why It Matters?
Real World Use Cases
Using Spark to transform large datasets
Building ML-ready data pipelines in Databricks
Using AWS Lambda to trigger ETL after an S3 drop
Using notebooks for multi-language transformation
Reconciliation pipelines to ensure accuracy
Pro Tip
Pair something like Databricks with Delta Lake to make your big data pipelines reliable, ACID-compliant, and lightning fast.
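For instance, assuming a Spark session configured for Delta (as it is by default on Databricks, or via the delta-spark package locally), writing a Delta table is a one-line change from plain Parquet; the data and path below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta_demo").getOrCreate()

# A toy DataFrame standing in for real pipeline output.
df = spark.createDataFrame([("2024-01-05", 250.0), ("2024-01-06", 90.0)], ["sale_date", "revenue"])

# Writing as Delta (instead of plain Parquet) adds ACID transactions,
# schema enforcement, and time travel on top of the underlying files.
df.write.format("delta").mode("overwrite").save("/tmp/daily_sales_delta")

# Reads and MERGE/UPDATE operations go through the same Delta format.
spark.read.format("delta").load("/tmp/daily_sales_delta").show()
```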
Real World Use Cases
Managing dbt model changes across dev and prod
Using branches to test experimental logic safely
Reviewing peer pipeline edits before merge
Aligning on an approval process for confident merges
Rolling back when a careless approval slips through
Pro Tip
Adopt Git workflows early, even if you're working solo. It’s good practice and acts as a safety net when pipelines fail.
Where to Start
Databricks Engineering
Where to Start
Git Pocket Guide
Databricks essentials
Git & GitHub Bootcamp
Latest Insights & Career Guides
Get practical thoughts and advice, step-by-step guides, and honest comparisons to help you launch or switch into a data career.
Stay Ahead in Data
Join our community for exclusive tips, career guides, and recommendations delivered straight to your inbox.
Contact
info@futureskillsnow.blog