I am currently building an Agentic AI data pipeline to test the current state of Agentic AI on Data Engineering tasks.
As I was deep into this experiment, trying to give my AI agent more autonomy, I had this exchange with my pair-programmer, Microsoft Copilot. It was so insightful that I had to share it upfront:
Copilot: "What you’re bumping into is the gap between the marketing narrative ('LLMs can run your whole pipeline') and the current reality ('they need a lot of scaffolding to not get stuck, hallucinate, or loop forever')... Your goal is different: You want to strip away the scaffolding so you can see the raw capability (or lack thereof) — and measure it honestly."
Copilot nailed it. My mission for this edition of Lab Notes was to test that "raw capability."
I wanted to create a simple, agentic data pipeline where I could drop a CSV file into a bucket and have an AI agent decide what to do with it. No guardrails. No handholding. Just a raw test of the agent's autonomy to see where the state of agentic AI currently is.
The results were... unexpected.
The Setup: My Toolkit and Philosophy
For this experiment, my stack was modern and local-first: Python, Llama 3 served locally via Ollama for the LLM, MinIO for object storage, and LangGraph to define the agent's workflow. The special ingredient was Microsoft Copilot, which I used as a pair-programmer for the entire process.
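For context, a minimal sketch of that local-first wiring could look like the following. The endpoint, credentials, database file name, and model tag are my own assumptions for illustration, not values from the actual project.

```python
# Hypothetical local-first wiring for this kind of stack (assumed
# endpoints and credentials -- adjust to your own environment).
import duckdb
from minio import Minio

# MinIO running locally with its default development credentials (assumption).
minio_client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,
)

# A single DuckDB file holds the Bronze/Silver/Gold tables (assumed filename).
con = duckdb.connect("ai_data_pipeline.duckdb")

# The LLM is served by a local Ollama instance; the model tag is an assumption.
OLLAMA_MODEL = "llama3"
```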
From the start, Copilot and I aligned on the core philosophy:
Copilot: "If you want this agentic from day one, then we design the pipeline so the agent is the conductor, not just a helper bolted on later... The code you write is mostly 'tools' the agent can call, not fixed ETL logic."
With that in mind, I gave the agent a set of tools—functions for ingesting files, cleaning data, profiling tables, and executing SQL. Then, I wrote a detailed "constitution," a master prompt that laid out the rules of the game for my Ollama agent.
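In code terms, those tools can be nothing more than plain Python functions over the MinIO client and a DuckDB connection. The sketch below is only illustrative: the function names, the bronze_/silver_ naming convention, and the naive cleaning logic are my assumptions, but the shape (small callable tools rather than a fixed ETL script) is the point Copilot was making.

```python
# Illustrative tool functions the agent can call; names, the bronze_/silver_
# convention, and the cleaning logic are assumptions, not the project's code.
import duckdb
from minio import Minio

def ingest_csv_to_bronze(client: Minio, con: duckdb.DuckDBPyConnection,
                         bucket: str, key: str) -> str:
    """Download a CSV from MinIO and load it into a Bronze table in DuckDB."""
    local_path = "/tmp/" + key.replace("/", "_")
    client.fget_object(bucket, key, local_path)
    table = "bronze_" + key.rsplit("/", 1)[-1].removesuffix(".csv")
    con.execute(
        f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM read_csv_auto('{local_path}')"
    )
    return table

def profile_table(con: duckdb.DuckDBPyConnection, table: str) -> list:
    """Return a per-column summary the agent can reason about."""
    return con.execute(f"SUMMARIZE {table}").fetchall()

def clean_to_silver(con: duckdb.DuckDBPyConnection, bronze_table: str) -> str:
    """Naive cleaning step: deduplicate a Bronze table into a Silver table."""
    silver_table = bronze_table.replace("bronze_", "silver_", 1)
    con.execute(
        f"CREATE OR REPLACE TABLE {silver_table} AS SELECT DISTINCT * FROM {bronze_table}"
    )
    return silver_table

def run_sql(con: duckdb.DuckDBPyConnection, query: str) -> list:
    """Let the agent execute arbitrary SQL, e.g. to build Gold tables."""
    return con.execute(query).fetchall()
```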
The Agent's "Constitution" (Master Prompt):
You are controlling a data pipeline.
## Current State
- CSV files in MinIO (total: {csv_count}): {csv_files}
- Bronze tables in DuckDB (count: {bronze_count}): {bronze}
- Silver tables in DuckDB (count: {silver_count}): {silver}
- Gold tables in DuckDB (count: {gold_count}): {gold}
## Goal
The pipeline is complete when there are NO missing Bronze, Silver, or Gold tables.
## Rules
1. If all three "Missing" lists are empty, you MUST choose "end".
2. You MUST choose exactly one action from the Available Actions list.
3. If Bronze is not missing for any file, you MUST NOT choose `ingest_all_bronze`.
... and so on.
## Available Actions
- ingest_all_bronze
- profile_all_bronze
- clean_all_to_silver
- create_all_gold
- end
## Final Output
Respond ONLY in valid JSON...
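To show how a prompt like this becomes a decision, here is a hedged sketch of the call: format the constitution with the current pipeline state, send it to the local model, and parse the JSON reply. The ollama Python client usage, the model tag, and the decide_next_step name are assumptions about the setup, and the template below is an abbreviated copy of the prompt above.

```python
# Hypothetical decision step: fill the constitution with the current pipeline
# state, ask the local model for one action, and parse its JSON reply.
# The ollama client usage and the model tag are assumptions about this setup.
import json
import ollama

# Abbreviated copy of the master prompt shown above; the real template
# also contains the rules and the "Missing" lists.
CONSTITUTION = """You are controlling a data pipeline.
## Current State
- CSV files in MinIO (total: {csv_count}): {csv_files}
- Bronze tables in DuckDB (count: {bronze_count}): {bronze}
## Final Output
Respond ONLY in valid JSON with keys "reasoning", "next_step" and "params"."""

def decide_next_step(state: dict) -> dict:
    prompt = CONSTITUTION.format(**state)
    response = ollama.chat(
        model="llama3",  # model tag is an assumption
        messages=[{"role": "user", "content": prompt}],
        format="json",   # ask the server to constrain the reply to JSON
    )
    return json.loads(response["message"]["content"])
```

The reasoning / next_step / params dictionaries you will see in the logs later are exactly this kind of parsed reply.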
Building with a Co-Pilot: The Struggle for Autonomy
This is where the philosophical tension began. Copilot's first instinct was to build a scaffold with a simulated agent to test the flow.
Copilot: "Once you’re happy with the flow, we swap
call_agent()for a real LLM call — and that’s when you’ll start seeing the agent truly adapt..."
I pushed back. The whole point was to test a real agent from the start. But as we integrated the real LLM, Copilot kept trying to add guardrails. For example, it wanted to hard-code table names. I had to reiterate:
Me: "Let it have control, we should not be creating things for it." Copilot: "Exactly — if you want the agent to have full control... then the missing piece is to give it the ability to discover what’s in DuckDB for itself."
This led to a crucial insight. As we were building, I noticed the agent's "reasoning" logs felt artificial.
Me: "We are creating the explanation, how do we know that is the explanation?" Copilot: "Ah — I see what you’re getting at... If you want the real explanation... you need to capture it from the LLM’s own output... not from our hard‑coded control flow."
The penny dropped when we started using LangGraph. The branching logic was all in Python.
Me: "Ok, so adding langgraph we are now deciding and not an llm?" Copilot: "Exactly — the way we’ve built your current LangGraph pipeline, the branching logic is entirely in Python, so you (the developer) are deciding the next step, not the LLM."
After clarifying the mission one last time—to test the raw, un-influenced agent—we built the final version.
The Twist: The Unexpected Results
With the scaffolding removed, I dropped a single CSV file into MinIO and let the agent run. It started well... and then it got weird.
Here's a snippet from the log:
2025-09-18 08:50:52,066 [INFO] LLM Decision: {'reasoning': '...the most logical next step is to start by creating a Bronze table.', 'next_step': 'ingest_all_bronze', 'params': {}}
2025-09-18 08:50:52,096 [INFO] Ingesting ai_data_pipeline/raw/planets.csv -> bronze_planets
2025-09-18 08:50:55,048 [INFO] LLM Decision: {'reasoning': 'Since there are no missing Bronze tables... I will choose an action that helps move towards creating Silver tables.', 'next_step': 'clean_all_to_silver', 'params': {}}
2025-09-18 08:50:55,049 [INFO] Cleaning bronze_planets -> silver_planets
2025-09-18 08:50:57,208 [INFO] LLM Decision: {'reasoning': 'Since there are no missing Bronze tables and the CSV file is present, we need to create a Bronze table.', 'next_step': 'ingest_all_bronze', 'params': {}}
2025-09-18 08:51:02,143 [INFO] LLM Decision: {'reasoning': "Since there are no missing Bronze tables, I will choose an action that doesn't involve creating new Bronze tables. Ingesting all Bronze data could potentially create new Silver and Gold tables.", 'next_step': 'ingest_all_bronze', 'params': {}}
2025-09-18 08:51:03,572 [INFO] LLM Decision: {'reasoning': 'Since there are no missing Bronze tables... we should focus on ingesting new data into Bronze.', 'next_step': 'ingest_all_bronze', 'params': {}}
It got stuck in a loop, violating its own rules. Here's the final breakdown:
What Worked:
Ingestion & Discovery: The agent correctly identified the file, chose names, and loaded the Bronze table. It successfully used its tools to see its own creation.
Profiling & Cleaning: It successfully profiled the Bronze table and created a Silver table from it.
What Broke (and Why):
Duplicate Actions: It called the cleaning step three times because it lacked idempotency checks or memory of its past actions.
Hallucinated Logic: It ran a simple `SELECT` for the Gold step instead of a `CREATE TABLE AS SELECT`, so nothing was saved (a minimal sketch of the difference follows this list).
Observation Confusion: It would correctly list the tables but then ignore that information in the next step, demonstrating a lack of trust in its own tools.
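To make the Gold-step failure concrete, here is a minimal DuckDB illustration of the difference, plus the kind of idempotency check the duplicate cleaning runs were missing. Table and column names are invented for the example.

```python
# Minimal DuckDB illustration of the Gold-step bug (names are made up).
import duckdb

con = duckdb.connect()  # in-memory database for the example
con.execute("CREATE TABLE silver_planets AS SELECT 1 AS id, 'Mars' AS name")

# What the agent did: a plain SELECT returns rows but persists nothing.
rows = con.execute("SELECT name, COUNT(*) FROM silver_planets GROUP BY name").fetchall()

# What the Gold step needed: materialize the result as a table.
con.execute(
    "CREATE TABLE gold_planet_counts AS "
    "SELECT name, COUNT(*) AS n FROM silver_planets GROUP BY name"
)

# The kind of idempotency check the duplicate cleaning runs were missing:
# look the table up before re-running the step.
exists = con.execute(
    "SELECT COUNT(*) FROM information_schema.tables "
    "WHERE table_name = 'gold_planet_counts'"
).fetchone()[0] > 0
```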
Conclusion: Reflections from the Lab
This experiment perfectly highlights the gap between the marketing narrative and the current engineering reality of agentic AI. My pair-programmer, Copilot, said it best: out-of-the-box agents are still brittle. They don't have enough self-awareness of the environment they're operating in.
The journey revealed that building a truly autonomous agent isn't about writing a clever prompt. It's about building a sophisticated environment with memory, introspection tools, and fault-tolerant loops. Without that scaffolding, the "raw capability" of the agent, while impressive, gets stuck in simple loops and struggles with basic state management.
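For the record, the scaffolding I mean is not exotic. Something as small as the sketch below, which feeds an action history back to the agent and aborts on repeats, would have caught the ingest_all_bronze loop. It is an illustration of the idea, not code from the experiment, and the function names are placeholders.

```python
# Illustrative scaffolding: a memory of past actions plus a guard that stops
# the run when the agent repeats itself. Not code from the experiment.
from collections import Counter

MAX_STEPS = 20
MAX_REPEATS = 2

def run_agent_loop(decide, execute, state: dict) -> dict:
    history: list = []
    for _ in range(MAX_STEPS):
        action = decide(state)["next_step"]
        if action == "end":
            break
        if Counter(history)[action] >= MAX_REPEATS:
            # The agent is looping (e.g. repeated ingest_all_bronze); bail out.
            raise RuntimeError(f"Agent repeated '{action}' too many times")
        history.append(action)
        state = execute(action, state)
        state["history"] = history  # feed its own past actions back into the prompt
    return state
```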
The future of agentic AI is bright, but for now, the conductor still needs a very well-designed orchestra pit to avoid falling off the stage. And this leads to the ultimate practical question: if we have to spend as much time designing that complex orchestra pit as we would just conducting the music ourselves, then for now, we might as well just write the code.


