A producer's curriculum · verified 2 Jul 2026
What every instrument of the data stack can and cannot do: its range, its cost model, and when it mixes. No implementation required. Every link checked live.
Core track is ~14 h; depth shelves take it to ~28 h. Sections appear in playing order: 0 · 2 · 1 · 3 · 4 · 5 · 6. Section 4 is the hinge between the data stack and the AI layer.
Section 0 · Play first
How to hear music at all. Why the ERP can't be queried directly, what a lakehouse actually is, and why open table formats prevent vendor lock-in.
Say what a database is (organized, queryable, multi-user) versus a spreadsheet or loose files.
Explain SQL as a declarative language for asking questions of tables, and why every stack tool speaks it.
Hear "we use Postgres / Mongo / a graph DB" and judge fit to workload.
The most load-bearing concept in the curriculum: why analysts can't query the live ERP, and why a separate analytical store exists.
Say why the modern stack parks everything in cheap object storage, and the tradeoffs.
Narrate 30 years of analytical storage: warehouse, lake, lakehouse, and why the last isn't a buzzword mashup.
Ask a client "warehouse, lake, or lakehouse, and why?" and evaluate the answer.
Explain what Iceberg/Delta adds to a lake (transactions, schema evolution, time travel) and why open formats kill lock-in.
Depth shelf
A second angle: what broke in Hive-era data lakes.
The full ecosystem tour: catalogs, engines, lock-in risk, at conceptual altitude.
Fills the "files vs tables" gap no video covers well.
Optional academic grounding. Stop after lecture 1; the rest is engineering.
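A minimal sketch of the "SQL is declarative" objective in this section, using Python's built-in sqlite3 module. The `orders` table, its columns and its values are made up for illustration. The query states which answer is wanted; the engine decides how to compute it.

```python
import sqlite3

# Hypothetical data: an in-memory table standing in for any SQL-speaking store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# Declarative: describe the result (total per region), not the loop that builds it.
revenue_by_region = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
)
print(revenue_by_region)
```

The same question asked of a warehouse, a lakehouse engine or DuckDB would read almost identically, which is the practical meaning of "every stack tool speaks it."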
Section 2 · Play second
Where the tiers diverge most. The single most valuable idea here: most companies' data fits on one machine, and the industry sold them distributed systems anyway.
Say why the app's Postgres isn't a warehouse, and when "lake" is marketing.
From a founding BigQuery engineer: push back when a vendor sizes a $10M-$50M company for petabytes. Written version.
Describe when DuckDB/MotherDuck covers a mid-size company entirely, and its ceiling (concurrency, enterprise governance).
Explain 60-second minimum billing, idle warehouses, and why Snowflake bills surprise CFOs.
Contrast per-query-scan vs per-second-compute pricing, and each one's failure mode.
Say what Databricks adds over a warehouse, and why a BI-only company doesn't need it.
Identify when a company picks Fabric for fit versus licensing inertia.
Hold your own in a "Snowflake or Databricks" boardroom conversation; open table formats are erasing the moats.
Depth shelf
Distinguish Spark-the-engine from Databricks-the-business. Concept lessons only.
The BigQuery-insider war stories the talk omits.
The sequel: what matters is the hot recent slice, not total volume.
Row vs column storage and the ~1-2 TB comfort zone. No good video exists for this.
Two-minute warm-ups before a client meeting.
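A back-of-envelope sketch of the two billing models contrasted in this section and the failure mode of each. Every rate below is a hypothetical placeholder, not a vendor's list price; only the 60-second minimum comes from the curriculum text.

```python
def per_second_cost(query_seconds, rate_per_hour, minimum_seconds=60):
    """Cost of one warehouse resume under per-second billing with a minimum charge."""
    billed_seconds = max(query_seconds, minimum_seconds)
    return billed_seconds / 3600 * rate_per_hour


def per_scan_cost(bytes_scanned, price_per_tb):
    """Cost of one query under scan-based pricing; runtime does not matter."""
    return bytes_scanned / 1e12 * price_per_tb


RATE_PER_HOUR = 4.0  # hypothetical compute rate
PRICE_PER_TB = 5.0   # hypothetical scan price

# Failure mode of per-second compute: a 3-second query that wakes an idle
# warehouse is billed as if it had run for the full 60-second minimum.
short_query = per_second_cost(3, RATE_PER_HOUR)
full_minute = per_second_cost(60, RATE_PER_HOUR)

# Failure mode of per-query scan: an unfiltered query over a 2 TB table costs
# far more than the same question asked of a 50 GB partition.
full_scan = per_scan_cost(2e12, PRICE_PER_TB)
pruned_scan = per_scan_cost(50e9, PRICE_PER_TB)

print(short_query, full_minute, full_scan, pruned_scan)
```

The useful client question follows directly: many short dashboard refreshes punish the first model, a few careless wide scans punish the second.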
Section 1 · Play third
Getting the music out of the 20 systems. Connectors, batch versus streaming, and the loop back into operations.
Define ETL and explain why data must move before analysis.
Say why the industry flipped to load-then-transform, and spot a vendor selling legacy architecture.
Draw a client's future stack and name each tool's category. Anchors Section 6 too; watch once.
Explain what you pay Fivetran for (schema-drift handling, incremental syncs, zero maintenance) and what it costs.
Tell a client when Fivetran's per-row pricing bites and when self-hosted Airbyte makes sense.
Push back on "real-time" dashboard demands; batch is right for ~90% of mid-market reporting.
Explain "warehouse data pushed back into the CRM" and name Hightouch and Census.
The skeptical counterweight: judge whether a client needs reverse ETL or just a cleaner CRM.
Depth shelf
Watch for concepts, skim the code: a free library replaces a five-figure Fivetran bill if you have one engineer.
Why SAP extraction is hard (licensing walls, 100k+ cryptic tables, delta logic). No video equivalent exists.
Market context: the extraction and transformation leaders are now one company.
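A toy illustration of the "incremental sync" idea named in this section: copy only what changed since the last run, tracked by a cursor. This is a sketch under simple assumptions (every row carries a trustworthy `updated_at`, no deletes), not how any particular connector is implemented; the row shape and names are hypothetical.

```python
from datetime import datetime


def incremental_sync(source_rows, destination, last_cursor):
    """Upsert rows changed after last_cursor; return (new_cursor, rows_copied)."""
    new_cursor = last_cursor
    copied = 0
    for row in source_rows:
        if row["updated_at"] > last_cursor:
            destination[row["id"]] = row  # upsert by primary key
            new_cursor = max(new_cursor, row["updated_at"])
            copied += 1
    return new_cursor, copied


# Hypothetical source system and an empty analytical store.
source = [
    {"id": 1, "name": "Acme", "updated_at": datetime(2026, 1, 1)},
    {"id": 2, "name": "Globex", "updated_at": datetime(2026, 1, 2)},
]
warehouse = {}

# First run: everything is new.
cursor, first_run = incremental_sync(source, warehouse, datetime.min)

# One record changes in the source; the second run moves only that record.
source[0] = {"id": 1, "name": "Acme Corp", "updated_at": datetime(2026, 1, 3)}
cursor, second_run = incremental_sync(source, warehouse, cursor)

print(first_run, second_run, warehouse[1]["name"])
```

Much of the difficulty a managed connector absorbs sits in the assumptions this sketch skips: sources without a reliable change timestamp, deleted rows, and columns that appear or change type between runs.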
Section 3 · Play fourth
The arrangement: dbt as the portable notation, and the Kimball discipline of translating a business into a queryable model. This is composition theory.
Say "dbt is the T in ELT: SQL transformations with software-engineering discipline."
Recognize a dbt project; know what models, tests and docs mean there.
Explain to a CFO why stored-procedure spaghetti is a liability, and what version control, testing and lineage buy.
Explain orchestration as "the thing that runs pipeline steps in order and complains on failure."
Say why newer teams pick Dagster while enterprises stay on Airflow.
Sketch a star schema for a client's sales process; use "fact" and "dimension" correctly.
Argue the business case for dimensional modeling. Full 4-lesson playlist.
Diagnose "the same customer exists five times" as an MDM problem, not a BI problem.
Depth shelf
Why fuzzy-matching customers is hard and "dedupe it in Excel" fails. 2-minute definition.
Task-centric and mature (Airflow) vs asset-centric and modern (Dagster), with tradeoffs.
State of the transformation layer after the Fivetran merger and in the AI era.
So "we use Data Vault" doesn't bluff you.
Highly visual modeling explainer from an ex-Mercedes-Benz data platform lead. His free full courses (30 h SQL, data warehouse project) are the best "watch someone actually build a warehouse" material when a lab build is on the agenda.
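A minimal star-schema sketch for the "fact" and "dimension" objective in this section, built in an in-memory SQLite database. Table names, keys and figures are illustrative assumptions: one fact table of sales events, two dimensions describing who bought and when.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions: descriptive context, one row per customer / per date.
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, segment TEXT);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- Fact: one row per sale, holding measures plus keys into the dimensions.
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        quantity INTEGER,
        amount REAL
    );

    INSERT INTO dim_customer VALUES (1, 'Acme', 'enterprise'), (2, 'Globex', 'mid-market');
    INSERT INTO dim_date VALUES (20260101, 2026, 1), (20260201, 2026, 2);
    INSERT INTO fact_sales VALUES
        (1, 20260101, 2, 200.0),
        (2, 20260101, 1, 50.0),
        (1, 20260201, 3, 300.0);
    """
)

# A typical business question: revenue by customer segment and month.
rows = conn.execute(
    """
    SELECT c.segment, d.month, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY c.segment, d.month
    ORDER BY c.segment, d.month
    """
).fetchall()
print(rows)
```

Every new question ("by year", "by customer name") reuses the same joins and only changes the grouped columns, which is the practical argument for the dimensional shape.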
Section 4 · Play fifth · the hinge
Where the data stack meets the AI layer, and where the consulting differentiation lives. Memorize one gap: raw text-to-SQL plateaus near 64% accuracy; semantic-layer queries hit 98-100%.
Define a semantic layer in one sentence and explain the "revenue defined five ways" problem it kills.
Cite the exact accuracy gap when someone says "the model can just write the SQL." The decisive evidence; no video carries it.
Judge whether a team's "metrics in the BI tool" setup is a governance liability.
Articulate Cube vs dbt Semantic Layer, and what "agents query the model, not the tables" means operationally.
Explain a Genie space, why it needs curated datasets, and how the approach differs from dbt and Cube.
The killer diligence question: "who maintains the semantic model?" Accuracy comes from curation, not the LLM.
Depth shelf
The strategic case, defensible against "context windows will fix it."
Third-party reality check on the modeling effort involved.
Spot how each vendor bends the definition toward its own architecture.
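A toy version of the semantic-layer idea in this section: the metric is defined once, and every caller (analyst, dashboard or agent) asks for it by name instead of rewriting the formula. This is a sketch of the concept only; it is not the syntax of Cube, the dbt Semantic Layer or Genie, and the `invoices` table and the "paid only" rule are assumptions for illustration.

```python
import sqlite3

# Hypothetical single source of truth: "revenue" has exactly one definition.
METRICS = {
    "revenue": {
        "table": "invoices",
        "expression": "SUM(amount)",
        "filter": "status = 'paid'",
    },
}


def compile_metric(name, group_by):
    """Turn a metric name plus one dimension into SQL; callers never see the formula."""
    metric = METRICS[name]
    return (
        f"SELECT {group_by}, {metric['expression']} AS {name} "
        f"FROM {metric['table']} WHERE {metric['filter']} "
        f"GROUP BY {group_by} ORDER BY {group_by}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (region TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [("north", 100.0, "paid"), ("north", 40.0, "refunded"), ("south", 60.0, "paid")],
)

sql = compile_metric("revenue", "region")
result = conn.execute(sql).fetchall()
print(result)
```

An agent that can only request `revenue` by `region` cannot accidentally count the refunded invoice, which is one way to read "agents query the model, not the tables." Someone still has to own the `METRICS` definitions, which is the point of the diligence question above.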
Study the original: Palantir
The $300B version of the intersection layer, explained by the company that built it: the business as objects, links and actions, not tables.
The design pattern to imitate at one-hundredth the price. Their sharpest line: the Ontology "represents the decisions in an enterprise, not simply the data."
Real enterprises demoing AI over their ontology: use cases across manufacturing, healthcare and logistics. Watch a couple and translate each to a mid-size company. Index of editions: palantir.com/aipcon.
Section 5 · Play sixth
The release. MCP over data stores and the AI-analyst pattern: a frontier model exploring schemas, writing SQL, hitting errors, and self-correcting.
Explain what an MCP server over a database exposes, and why remote managed MCP beats laptop-local for client rollouts.
The best live footage of the AI-analyst loop over 50M+ rows, warts included. Describe it from having watched it, not from marketing copy.
The three ingredients beyond a frontier model: curated context, execution loop, verifiable output. Judge vendor demos against them.
A non-Databricks second opinion: compare Genie's curated-space approach with the open MCP approach.
Depth shelf
The skeptical counterweight; reuse his stress-test questions in any AI-analyst sales demo.
Postgres and Snowflake MCP have no quality video yet; the pattern transfers 1:1 from the MotherDuck sessions.
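The execution loop described in this section (write SQL, hit an error, self-correct) can be sketched without any model or MCP server at all. Below, `stub_model` is a hypothetical stand-in that deliberately guesses a wrong column first; a real setup would call a frontier model and reach the database through an MCP server instead. Only the loop structure is the point.

```python
import sqlite3


def stub_model(question, schema, error=None):
    """Hypothetical stand-in for a model: wrong column first, fixed after an error."""
    if error is None:
        return "SELECT SUM(revenue) FROM sales"
    return "SELECT SUM(amount) FROM sales"


def answer(conn, question, max_attempts=3):
    """Run generated SQL, feeding database errors back until a query succeeds."""
    # Curated context: here just the table definitions read from the database.
    schema = conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'").fetchall()
    error = None
    attempts = []
    for _ in range(max_attempts):
        sql = stub_model(question, schema, error)
        attempts.append(sql)
        try:
            # Verifiable output: return the rows together with the SQL that produced them.
            return conn.execute(sql).fetchall(), attempts
        except sqlite3.OperationalError as exc:
            error = str(exc)  # e.g. an unknown column name
    raise RuntimeError(f"no valid query after {max_attempts} attempts: {error}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("north", 100.0), ("south", 50.0)])

rows, attempts = answer(conn, "What is total sales?")
print(rows, len(attempts))
```

The three ingredients named above map onto the sketch: the schema lookup is the context, the retry loop is the execution loop, and returning the attempted SQL alongside the rows is what makes the answer checkable.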
Section 6 · Play last · the finale
Which combinations mix, which are the saxophone in the orchestra, and how each billing model punishes each usage pattern. Lands hardest after the tool sections.
The man who coined the term conceding that it has outlived its usefulness: why the modern data stack (MDS) was a zero-interest-rate phenomenon.
The core critique: too many tools, integration tax, too little value for mid-size companies.
Situate any vendor a client mentions on the map. The landscape itself.
Give a client three concrete signals that they do (or don't yet) need a warehouse.
Sketch a defensible stack for a $10M vs a $50M company and defend each component.
Audit-question a client's Snowflake bill without touching a console.
Predict which billing model (credits, scan pricing, DBUs) punishes which usage pattern; discount the self-serving conclusion.
Depth shelf
Heavier on the Databricks vs Snowflake vs Fabric platform war.
The capacity-unit model; prices partly stale, mechanics current.
Connects "small data" to "smaller stack" in one narrative.
The permanent shelf
Four books, each the canon of its layer. Read alongside the videos, not instead of them.
The craft of translating a business into a queryable model. The single highest-leverage book for the logical side of data. Read the first ~6 chapters; the rest is a reference by industry.
The conceptual map of the whole territory, written for architecture thinking rather than tool operation. The spine text of this curriculum.
Enterprise-wide data architecture and systems integration: domains, data products, master data. The logic of connecting 20 systems.
The canon of the AI layer: model selection, evaluation, RAG, agents. Most-read book on the O'Reilly platform in 2025.