Welcome to Lab Notes, the mid-month post where I share what I'm currently working on, the tools I'm testing, and the real-world problems I'm trying to solve.
Here’s a dilemma I bet you’ve faced: Your data science team needs to analyze the freshest production data for their models to be accurate. But your security team (rightfully) has a zero-tolerance policy for exposing sensitive customer PII. The traditional solution? A slow, expensive ETL process to copy and sanitize the data into a separate environment, leaving your analysts with stale data and you with a bigger cloud bill.
It feels like an impossible trade-off between speed, cost, and security.
For the last few weeks, I've been experimenting with a modern, open-source stack that might just solve it. Let's walk through the setup.
The Foundation: Zero-Copy Clones with Nessie & Iceberg
The first step is to create an isolated environment without physically copying any data. This is where the concept of "zero-copy cloning" comes in, and it's made possible by catalogs that function like Git for your data.
For this experiment, I'm using Project Nessie as my Apache Iceberg catalog. Nessie allows you to "branch" your data tables. A branch isn't a copy of the data; it's just a lightweight, independent pointer to a version of the table. This means you can create a new, isolated workspace in seconds, no matter how many petabytes are in the table.
Here's a Python snippet that uses the Nessie v2 REST API directly to create a new branch for our ephemeral environment.
Python
import requests

NESSIE_V2_URL = "http://localhost:19120/api/v2/"
branch_name = "analyst-dev-branch"

# Fetch all references and locate 'main' explicitly, rather than
# assuming it happens to be the first reference returned.
trees = requests.get(f"{NESSIE_V2_URL}trees")
trees.raise_for_status()
references = trees.json().get("references", [])
main_ref = next(ref for ref in references if ref["name"] == "main")
main_branch_hash = main_ref["hash"]

# The new branch's name and type go in the query string; the body
# identifies the source reference (and its hash) to branch from.
params = {
    "name": branch_name,
    "type": "BRANCH",
}
body = {
    "name": "main",
    "type": "BRANCH",
    "hash": main_branch_hash,
}

# Create the new branch via the API
response = requests.post(
    f"{NESSIE_V2_URL}trees",
    params=params,
    json=body,
)
response.raise_for_status()
print(f"Branch '{branch_name}' created successfully.")
Just like that, we have an isolated sandbox pointing to our production data, and we haven't copied a single data file.
The Ephemeral Query Layer with Trino
With our analyst-dev-branch created, we need a way to query it. The key insight here is that we don't point a long-running Trino cluster to our branch. Instead, we spin up an ephemeral Trino instance at runtime that is configured to use our new branch as its default reference.
This approach is powerful because it keeps our analytical environment completely isolated and temporary. We can launch it, run our tests and queries, and then tear it down when we delete the branch. A common way to manage this is with a Docker Compose file where we can set the Trino catalog's target branch.
etc/catalog/nessie.properties
Properties
connector.name=iceberg
iceberg.catalog.type=nessie
iceberg.nessie-catalog.uri=http://nessie:19120/api/v2
iceberg.nessie-catalog.ref=analyst-dev-branch
iceberg.nessie-catalog.default-warehouse-dir=s3://warehouse
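Here's a minimal sketch of what that Compose file might look like, assuming the stock trinodb/trino and ghcr.io/projectnessie/nessie images; image tags, ports, and volume paths are illustrative and should be adjusted for your setup.

YAML
services:
  nessie:
    image: ghcr.io/projectnessie/nessie
    ports:
      - "19120:19120"

  trino:
    image: trinodb/trino
    ports:
      - "8080:8080"
    volumes:
      # Mount the catalog file defined above into the Trino image's
      # default catalog directory.
      - ./etc/catalog/nessie.properties:/etc/trino/catalog/nessie.properties
    depends_on:
      - nessie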
With Trino running and pointed directly at our isolated branch, our queries become much simpler.
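For example, there's no branch qualifier or time-travel syntax in the SQL itself; the catalog's ref property pins every query to the branch:

SQL
SELECT count(*) FROM nessie.bronze.people;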
The Magic Trick: On-the-Fly PII Masking
Here's where this pattern goes from useful to game-changing. Even with our ephemeral Trino instance, we still need to solve the PII exposure problem.
Trino's powerful security features allow it to mask data at query time. The analyst or developer never sees the raw PII because Trino intercepts the query and applies a masking function before returning the results. The underlying data files are never changed.
We can configure this with a simple set of rules. For example, we can define a rule that says any user with the analyst role should see a masked version of the email column.
Example Trino Access Control Rules (rules are evaluated top to bottom, so the final catch-all grants full access to everyone else):
JSON
{
  "tables": [
    {
      "role": "analyst",
      "schema": "bronze",
      "table": "people",
      "privileges": ["SELECT"],
      "columns": [
        {
          "name": "email",
          "mask": "regexp_replace(email, '.*@', '****@')"
        }
      ]
    },
    {
      "privileges": ["SELECT", "INSERT", "UPDATE", "DELETE", "OWNERSHIP"]
    }
  ]
}
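To activate these rules, we point Trino's file-based access control at the JSON file. A minimal sketch, assuming the rules above are saved as etc/rules.json:

etc/access-control.properties
Properties
access-control.name=file
security.config-file=etc/rules.json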
With these rules in place, and because our Trino instance is already pointing at the correct branch, the queries stay simple and clean:
An engineer with full privileges runs: SELECT name, email FROM nessie.bronze.people LIMIT 1;
Result: Jane Doe | jane.doe@example.com
An analyst with the analyst role runs the exact same query: SELECT name, email FROM nessie.bronze.people LIMIT 1;
Result: Jane Doe | ****@example.com
The user gets to work with the data's full structure and scale in an isolated, ephemeral environment, but without the risk of exposing sensitive information.
A Full SDLC Workflow
The real power of this pattern goes far beyond ad-hoc analysis. By providing a secure, production-scale data environment that can be spun up in seconds, we unlock a full "shift-left" workflow for the entire Software Development Lifecycle.
For Developers: Instead of working with stale, sampled CSVs, developers can test their new features (like a new dbt model or a Python script) locally against the fresh, masked analyst-dev-branch. This catches bugs earlier and ensures code works on realistic data volumes and structures.
For QA Testers: QA teams can point their automated test suites at this ephemeral environment to validate new code against production-like data conditions. When testing is done, the branch can simply be deleted; a sketch of that cleanup call follows below.
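Cleanup is just as lightweight as creation. A minimal sketch, assuming Nessie's v2 API, where deleting a reference requires the branch name plus its expected hash as a concurrency safeguard:

Python
import requests

NESSIE_V2_URL = "http://localhost:19120/api/v2/"
branch_name = "analyst-dev-branch"

# Nessie requires the branch's current hash on delete, so you can't
# accidentally remove a reference someone else has since moved.
ref = requests.get(f"{NESSIE_V2_URL}trees/{branch_name}")
ref.raise_for_status()
branch_hash = ref.json()["reference"]["hash"]

# Deleting the branch removes only the pointer; the underlying
# Iceberg data files are untouched.
response = requests.delete(f"{NESSIE_V2_URL}trees/{branch_name}@{branch_hash}")
response.raise_for_status()
print(f"Branch '{branch_name}' deleted.")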
This approach brings data into every step of the SDLC, making the entire development process faster, safer, and more reliable, without the need to maintain permanent dev and QA environments. And as we'll see next, this principle of a secure query layer isn't limited to the lakehouse.
An Incremental Step for Any Architecture
Here’s the most powerful takeaway from this experiment for me: While the zero-copy cloning with Nessie and Iceberg is cutting-edge, the on-the-fly PII masking with Trino is a feature that almost any data team can adopt today.
You don't need a full lakehouse to get this benefit.
Trino can be deployed as a secure, federated query layer in front of your existing databases—whether they're PostgreSQL, MySQL, SQL Server, or others. By channeling analyst queries through Trino, you can implement robust PII masking and fine-grained access control immediately. It's a powerful, incremental step to reduce data exposure risk without needing to re-architect your entire data platform.
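For example, exposing an existing PostgreSQL database through Trino takes one small catalog file. A sketch, where the hostname, database, and credentials are placeholders:

etc/catalog/postgres.properties
Properties
connector.name=postgresql
connection-url=jdbc:postgresql://db-host:5432/appdb
connection-user=trino_readonly
connection-password=${ENV:POSTGRES_PASSWORD}

From there, the same kind of file-based masking rules shown earlier can be applied to this catalog's tables.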
Lab Summary
So, to recap the experiment: we successfully created a safe, isolated, and ephemeral environment for production data analysis. We did it with zero storage cost for the copy and with sensitive PII masked directly at the query layer.
This pattern is incredibly promising, and I'll be continuing to test its limits. This is a real-time look at my notes, and I'd love to hear your thoughts.
Have you experimented with Trino's security features or Project Nessie? What are your go-to methods for handling PII? I'd love to hear your experiences in the comments below.