Show HN: Rocky – Rust SQL engine with branches, replay, column lineage
Hi HN, I'm Hugo. I've been building Rocky over the past month, shipping fast in the open. The binary is on GitHub Releases, `dagster-rocky` on PyPI, and the VS Code extension on the Marketplace. I held off on a broader announcement until the trust-system surface was coherent enough to talk about as one thing. The governance waveplan — column classification, per-env masking, 8-field audit trail on every run, `rocky compliance` rollup, role-graph reconciliation, retention policies — landed end-to-end last week in engine-v1.16.0 and rounded out in v1.17.4 (tagged 2026-04-26). That's the milestone I'd been waiting for.
The pitch: keep Databricks or Snowflake. Bring Rocky for the DAG. Rocky is a Rust-based control plane for warehouse pipelines. Storage and compute stay with your warehouse. Rocky owns the graph — dependencies, compile-time types, drift, incremental logic, cost, lineage, governance. The things your current stack can't give you because it doesn't own the DAG.
A few things I think are interesting:
- Branches + replay. `rocky branch create stg` gives you a logical copy of a pipeline's tables (schema-prefix today; native Delta SHALLOW CLONE and Snowflake zero-copy are next). `rocky replay <run_id>` reconstructs which SQL ran against which inputs. Git-grade workflow on a warehouse.
- Column-level lineage from the compiler, not a post-hoc graph crawl. The type checker traces columns through joins, CTEs, and windows. VS Code surfaces it inline via LSP.
- Governance as a first-class surface. Column classification tags plus per-env masking policies, applied to the warehouse via Unity Catalog (Databricks) or masking policies (Snowflake). 8-field audit trail on every run. `rocky compliance` rollup that CI can gate on. Role-graph reconciliation via SCIM + per-catalog GRANT. Retention policies with a warehouse-side drift probe.
- Cost attribution. Every run produces per-model cost (bytes, duration). `[budget]` blocks in `rocky.toml`; breaches fire a `budget_breach` hook event.
- Compile-time portability + blast radius. Dialect-divergence lint across Databricks / Snowflake / BigQuery / DuckDB (12 constructs). `SELECT *` downstream-impact lint.
- Schema-grounded AI. Generated SQL goes through the compiler — AI suggestions type-check before they can land.
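To make the `[budget]` idea concrete, here's a sketch of what a budget block in `rocky.toml` might look like. The `max_usd` and `max_duration_ms` keys and the `budget_breach` hook event are described in this thread; the per-model table layout, the model name, and the values are my assumptions:

```toml
# Hypothetical rocky.toml fragment: per-model budget thresholds.
# (Per-model keying is assumed; the thread only confirms the key names.)
[budget.dim_customer]
max_usd = 1.50           # flag a run of this model costing more than $1.50
max_duration_ms = 60000  # ...or taking longer than 60 seconds

# A breach fires a `budget_breach` hook event that CI or alerting can consume.
```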
What Rocky isn't:
- Not a warehouse — it's the control plane on top.
- Not a Fivetran replacement. `rocky load` handles files (CSV/Parquet/JSONL); for SaaS sources use Fivetran, Airbyte, or warehouse-native CDC.
- Not dbt Cloud — no hosted UI, no managed scheduler. First-class Dagster integration if you need orchestration.
Adapters: Databricks (GA), Snowflake (Beta), BigQuery (Beta), DuckDB (local dev / playground). Apache 2.0.
I'd love feedback on the trust-system framing, the governance surface (particularly classification-to-masking resolution in `rocky compile` and the `rocky compliance` CI gate), the branches/replay design, the cost-attribution primitives, or anything else that catches your eye. Happy to go deep in the thread.
The compile-time lineage part is the most interesting bit to me. A lot of “data lineage” tools feel like archaeology after the fact: parse logs, reconstruct what probably happened, then hope it matches reality.
Having the compiler know “this column flows into these downstream models” before execution changes the workflow quite a bit. It makes refactors and masking policies much less scary.
Do you expose any kind of “lineage diff” between branches? For example: this PR changes the downstream impact of `customer.email` from A/B/C to A/B/D. That would be useful in code review.
Same. I worked on an in-house product many years ago now where lineage and provenance were the entire point. Really cool to see this!
Thank you!
Hey Xiaoher-C! Hmm, I don't have a lineage diff command yet. For now, it's a small lift to wire together two commands that already exist: `rocky ci --diff --base main`, which diffs main against HEAD, and `rocky lineage --column columnname --downstream`. I'll add this to my backlog!
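Since there's no built-in command yet, the diff semantics the parent comment describes boil down to a per-column set comparison. A toy sketch in Python (nothing Rocky-specific; the `base`/`head` maps are stand-ins for whatever a downstream-lineage query would emit on each branch):

```python
def lineage_diff(base: dict[str, set[str]], head: dict[str, set[str]]) -> dict:
    """Compare per-column downstream sets between two branches.

    Returns, for each column whose impact changed, the models added to
    and removed from its downstream set.
    """
    changed = {}
    for col in base.keys() | head.keys():
        b, h = base.get(col, set()), head.get(col, set())
        if b != h:
            changed[col] = {"added": h - b, "removed": b - h}
    return changed

# The PR example from the parent comment: downstream impact of
# customer.email goes from {A, B, C} on main to {A, B, D} on HEAD.
base = {"customer.email": {"A", "B", "C"}}
head = {"customer.email": {"A", "B", "D"}}
print(lineage_diff(base, head))
# {'customer.email': {'added': {'D'}, 'removed': {'C'}}}
```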
Data contracts as types and compile time checks (even across languages) are not new - this is a recent paper exposing the idea of correctness-by-design pipeline, which is a super set of this particular issue obviously (disclaimer: I'm one of the author of the paper): https://arxiv.org/pdf/2602.02335
Hey jtagliabuetooso! Absolutely, the idea isn't new. Rocky's bet is on the shipped implementation. I'll read your paper, thank you for sharing.
“Why it’s distinctive” is misleading (perhaps LLM-generated)?
Imo, we cite other work because it puts our work in context for experts and beginners alike; because it makes clear that we all stand on someone else’s shoulders (progress is, most of the time, a collective endeavor, not a lone-genius affair); because it is intellectually honest to acknowledge our debts.
Especially today, when putting research ideas out there almost guarantees they will be plagiarized by someone vibe-coding or vibe-writing, recognizing that our contributions come from somewhere is more important than ever. The implementation may or may not be novel, but the fact that it depends on LLMs even in the README should make you even more aware of why proper attribution is crucial: what's the incentive for open innovation if we all behave like this?
I hadn't come across Bauplan or your work before today's thread. Looks like a few of us are landing on branches/replay/lineage from different angles (like yours Iceberg-native, mine warehouse-delegated). I will spend proper time with the paper and Bauplan.
If your introduction message already includes a bunch of uncurated claims and LLM smells, then what does that say about the code I'm about to run?
Yeah, fair pushback, and yes, the intro was AI-assisted. Marketing is not my strength, nor am I a native English speaker. I built this in about a month with heavy LLM tooling, and the seed comment is part of that. I'm not going to pretend otherwise.
The code is what it is. `cargo test --workspace` runs across 19 crates. CI on 5 platforms (macOS ARM/Intel, Linux x86/ARM, Windows). JSON output schemas are codegen-checked in CI so docs can't drift from the binary.
If you want to skip the marketing copy and look at engine reasoning instead: PR #240 (audit trail), #241 (column classification + masking), #270 (failed-source surfacing in discover).
I'd rather hear "the code is bad" than "the post sounds AI-written".
Not sure why you are downvoted here.
A lot of side projects, hobby projects, etc. are using AI tools now. Marketing too: every sales/marketing firm is using AI. So why criticize this guy in particular?
AI is pervasive, the train has left the station. So that is not a reason to criticize this project. There might be other reasons, I'm not sure, but not that an AI was used.
It's really a weird world now.
I do think the author is doing a disservice to themselves by writing the post and comments using LLM, even if the code is mostly agent built. People can tell right away, all the LLM shibboleths are there... it feels cheap. Just write naturally and then Google translate, don't let the LLM speak on your behalf.
What's going to distinguish projects that are built this way is the ability to explain, document, support, and maintain said projects over the long term. That will be the crucible. Gone are the days of "build it and they will come", and I feel a bit sad about that.
It's so easy to let the code grow under you beyond what you have the capacity to do the above for.
I've got the same thing going on. Eschewing paid work and grinding 16, 17 hours a day boiling the sea to build the whole universe from scratch (also a database, but of a different sort than this project) integrating all my favourite DB research papers and ideas that I've accumulated over the last 30 years. Outperforms postgres 2-4x or more, has a battery of correctness tests, Lean proofs, benchmarks, etc. etc.
But frankly I'd be nervous to share. Especially here. I don't even know where it ends up. Not least because if I'm doing it, so are 50 other people, probably.
I totally acknowledge that. The only reason I passed my replies through AI is that it's my first time posting here and opening a side project of mine publicly.
All the engine architecture decisions are mine though, and this project came about to solve a real problem I currently have at work with a zero-touch data pipeline leveraging Fivetran, Dagster, dbt and Databricks. It's a pipeline that serves multiple agencies and data producers working with data from more than 300 clients and multiple connectors.
Rocky was essentially built on all the time spent awake at night thinking about these problems and how they could be addressed differently, given that dbt doesn't suit this particular use case well.
I decided to open Rocky to the public for free for two simple reasons: first, it might help others, and I fulfill my ego of having built something other people like and use; second, I'm the solo maintainer, and a project can only get proper traction if more people contribute to it.
Because "Yeah, fair pushback" is AI smell. Either everything this person does is passed through an AI from code to blogs to even their HN comments and submissions; or they use AI so much they're starting to talk like it colloquially. Either way no one has time for that.
"Yeah, fair pushback"
Really hard to tell. Because that used to be a common phrase that real people would use.
So now I have to change my own language in order to not appear like I'm an AI? We are getting to a weird place where humans have to act/sound increasingly 'odd' to appear not 'perfect' like an AI.
It's really not hard to tell. It's the "How do you do fellow kids" of AI-isms. The presence of "fair pushback" and a single em dash reads as 99% AI generated as far as I am concerned.
Yes, if you don't want to sound like you're cargo culting AI, you do have to change the way you talk because people aren't going to care otherwise. At the very least just because it's boring. That's always been the nature of slang and lingo.
"not hard to tell"
Or, with all of the AI slop, you think you're detecting all AI, and don't notice the stuff that is AI but slips by. There is a wide variety of tools now, with different degrees of output quality.
https://ifunny.co/picture/it-s-been-forever-it-s-been-foreve...
I'm fine with work that uses AI. I use AI every day. I'm not fine with AI slop and it's very easy to tell what is slop and what's not, the same way it's easy to distinguish a selfie from a museum quality photograph. Are some selfies works of art? Few and far between, so you'd be forgiven if you dismiss all selfie-looking photographs as not worth your time.
This comment itself is likely written by AI by the sounds of it. It may be worth your time writing it out in your own words in your native language and then finding a competent translation tool to translate your words.
> I'd rather hear "the code is bad" than "the post sounds AI-written".
Of course you would. Reading through and judging the quality of AI output is the largest amount of effort in a world where you can get everything else by prompting. Please internalize this: If you want to be respected you will have to put in effort yourself. There is no way around this.
I truly appreciate your feedback and it's definitely a lesson learned for me. As I said to cmrdporcupine, "The only reason for passing my replies through AI was just because it's my first time posting here and opening a side-project of mine publicly. "
All the engine architecture decisions are mine though and this project came up to solve a real problem in a data pipeline that serves multiple clients, connectors, producers, etc.
I'm late to the party, but there is a dilemma that seems to be facing poor English speakers and writers that I think is a bit imagined, and using LLMs to cover your weakness with the language hurts the public perception of you. There was an article posted last year about East African contractors who were used to do the RLHF post-training for early frontier models, saying LLM speak was really just the way Africans speak English. I don't think that's entirely correct, because frankly, after several decades of working with international teams, I think it's the way a lot of non-Anglophone English speakers speak.
It comes across as both childish and overly formal at the same time. Affected, too excited. It's the guy that says "Hi name!" on Slack and then waits for you to respond, instead of just saying what they actually wanted to say. You're responding to everyone here with some variant of "thank you, I appreciate it." That just isn't the way people speak to each other in normal conversation. It's the way a consultant speaks to you when you're being told you're being laid off and your own manager is too cowardly to deliver the news personally. Sandwich the real point between effusion and praise, when all we actually want is the real point. It feels patronizing, like we're being spoken down to. It's the way politicians and CEOs speak, every word prepared by committee, nothing genuine.
It's all the worse knowing this isn't even you and we're being patronized and spoken down to by a marketing bot you're delegating communication to.
Yes, I can see you're a bit late indeed. Several replies ago I already acknowledged that I used an LLM for the initial post and initial replies.
I use AI every day, for different purposes: fixing my English when I feel I need it, brainstorming, automating tasks, getting ideas for dinner, asking things I don't know about raising a 1-year-old boy, heck, even finance stuff. And to be honest, I'm quite pleased about it and see no shame in it. I'm no salesperson, nor do I have a grasp of marketing, nor am I used to promoting my work. With this thread on HN (which was a suggestion by the LLM! :D), I just wanted to share what I built and let others use it, regardless of whether I typed each character or used an LLM or smoke signals.
But one thing I cannot avoid is being polite and friendly, because that is who I am when speaking in my native language or in English with my peers. So saying "Hey" and "Hi", "I appreciate" and "thank you" is part of my day-to-day.
I didn't know about the article you mentioned but thanks for letting me know. :)
It's a bit confusing to claim that "The things your current stack can't give you because it doesn't own the DAG" and then use Databricks as your example: Databricks includes jobs and pipelines, so it very much owns the DAG, no?
Fair point. Databricks owns a scheduling DAG (Workflows, DLT). What I meant by "owns the DAG" is the semantic DAG: model-to-model dependencies with column-level types that the compiler builds.
Workflows knows task A runs before task B. Rocky knows `dim_customer.email` flows from `raw_users.email_address` through three CTEs in `stg_customers`. Different layer, same word.
I'll be more careful with that framing.
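A toy illustration of the two layers being conflated here (not Rocky's actual data structures; model and column names are made up to match the example above):

```python
# Scheduling DAG: a scheduler only knows task A runs before task B.
task_dag = {"stg_customers": ["dim_customer"]}

# Semantic DAG: column-to-column provenance a compiler can trace.
column_dag = {
    "dim_customer.email": ["stg_customers.email"],
    "stg_customers.email": ["raw_users.email_address"],
}

def column_sources(col: str, dag: dict[str, list[str]]) -> list[str]:
    """Walk column edges back to the raw source columns."""
    parents = dag.get(col)
    if not parents:
        return [col]  # no upstream edge: this is a source column
    out = []
    for parent in parents:
        out.extend(column_sources(parent, dag))
    return out

print(column_sources("dim_customer.email", column_dag))
# ['raw_users.email_address']
```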
> I'll be more careful with that framing.
I think you should also try to do a better job selling the benefit of this.
As a data engineer, I can see why this might be useful, but glancing through your README, the dots were not completely connected
Makes sense. Reviewing the README is on my TODO list. Thanks for the heads-up!
Cool release.
IMO, "Why it's distinctive" is a bit misleading on a few points: certainly the dbt and DX folks can add their POV, but even considering stuff I know / authored ;-), https://arxiv.org/pdf/2308.05368 from 2023 (and following releases) cover branches in a native way (no clone), immutability (re-run), and lineage.
Extensions to consider are different languages (what about Python?) and branch semantics. Two immediate questions: can you nest branches? How does merge work across systems if you don't control compute?
Hey jtagliabuetooso! Rocky is SQL-first instead of Python-first, and that was a deliberate scope choice. Also, Rocky acts as a control plane that delegates compute to the warehouse (Databricks, etc.) rather than owning the runtime. I don't have a strong use case for nested branches, so no, that's not a feature Rocky has as of today. I'll read the paper, thank you for sharing.
How is the Git semantics of merge, rebase and diff defined in the system?
It’s quite odd to choose the name ”Rocky” when that is already the name of one of the most popular Linux distributions.
First time I heard about Rocky Linux distribution. Thank you for the heads up.
hiya, anders from dbt here. cool project -- I especially love the branching and budgeting options you've built in. both are things that I'd love for the dbt standard to include one day. was it dbt's lack of those features that inspired you to start this project? It also seems you have an aversion to Jinja, which, believe me, I get!
FYI dbt-fusion [1] is going GA next week (though GA for Databricks will come later). Most of it is source-available and ELv2-licensed, but there's a number of crates that are Apache 2.0, namely: dbt-xdbc, dbt-adapter, dbt-auth, dbt-jinja, dbt-agate. We also have plans to OSS more as time goes on (stay tuned).
I just wanted to call out the OSS crates in case you'd rather focus on "making your beer taste better" than have to re-build foundations. I'd love to hear if any of those crates come in handy for you (even more so if they don't work for you).
Feel free to reach out on LinkedIn or dbt community Slack if you ever want to chat more!
[1]: https://github.com/dbt-labs/dbt-fusion
Hey Anders! Thanks a lot for dropping a comment and showing interest in Rocky. Yes! I'm not going to lie, Jinja is one of the things that gives me some itches :). But it wasn't the main reason I started building Rocky.
It all started with the need to auto-generate dbt models from the Fivetran connections I integrate with, then having to hot-reload the code location in Dagster to discover new assets. All in a zero-touch data pipeline: Fivetran connections are discovered as they're created, and assets are materialized as those connections sync.
Auto-generating these dbt models, keeping the manifest aligned between Dagster code-location reloads, and spinning up pods in EKS for each Dagster run that relies on these auto-generated models all have an impact on overall performance, not only in production but also on DX in local environments.
Rocky wasn't born with a "dbt replacement" in mind at all; it was born to solve a real issue I'm facing. I made sure it integrates well with dbt, as it's in my plans to leverage the awesome work available as dbt packages for Fivetran.
I'll definitely have a look at the crates you mentioned! Thank you!
thanks for the context!
> Auto-generating these dbt models and get the manifest aligned between Dagster code location
I just added you on LinkedIn. if you accept my connection there I can DM you a private preview document that you might find very interesting related to dbt project metadata (that is way less painful than `manifest.json`)
Accepted! :)
Congrats on the work, but have you considered another name? Naming is hard and always will be: When I first scanned the headline, my initial thought was "that's an interesting area for the Rocky Linux team to explore". After a moment, "wait, no, that's confusing, it's some other Rocky".
Thanks Peter. All my side-projects are named after my pets. I had a dog named Rocky and given this project is also an underdog competing with well-established tools such as dbt and sqlmesh, I decided to keep Rocky when opening it to public. But I'm happy to get some suggestions for a better name to this tool :)
I love that! I am inspired to create Terry, Tizzie, Topé, Bubba, and Roxy (the three Ts are in my office right now), the last two are no longer with us but for the hole in my heart.
I have no idea what these projects would be, but based on personalities, Roxy would chew through CPU and memory like a beaver (she loved turning large branches into small chunks), Bubba would inspire calm and peacefulness but walk into things (he was one-eyed and a little clumsy), Terry would stick like glue (an eBPF program, maybe?), Tizzie would work well most of the time then destroy your stuff (an AI agent?), and Topé would always be there, but never quite willing to participate (a bad Windows driver?).
I don't know the area well enough to suggest an alternate name, but maybe Wiley, which is an indirect reference to Dag from Barnyard via Wile E. Coyote?
Love your pet names and how you characterize the projects you'd name after them :) Wiley is an interesting name!
I have another side-project, still private, which I named Shimi, my current dog's name. I'd thought of naming my dog Sashimi, but Shimi is just shorter and simpler. I'm now considering stealing the name from this other side-project to rename Rocky, but I'll put more thought into it :)
I fear that there is an even closer candidate for confusion: RocksDB
Oh yeah, good call.
Looks cool, I've been waiting for someone to build this since the dbt and SQLMesh acquisitions. It would be great to have model versioning and support for ClickHouse SQL.
Thanks. On model versioning — what's the use case you have in mind? A few options that map to different designs:
- dbt-style semantic-layer versions (v1/v2 of a model)
- schema migration history
- branch-based (Rocky already has branches + replay)
Different design choice for each, so it helps to know which problem you're trying to solve.
ClickHouse is tractable through the Adapter SDK without engine patching. If you can share roughly your model count and workload shape, I can put a real timeline on it. Open to community PRs too.
fyi, llm written comments are discouraged on hackernews.
https://news.ycombinator.com/item?id=47340079
Not saying yours are, but those em-dashes certainly look like it ;)
Fair. I just use it for tidying up my replies as I'm not a native English speaker.
* * *
Thanks for the careful read. The "what breaks if I rename this column" question is exactly what column lineage from the compiler is meant to answer, and you said it better than I did in the post.
On the schema-grounded AI angle: agreed. The failure mode you describe — structurally valid SQL that joins on the wrong key or aggregates at the wrong grain because the model hallucinated a relationship — is exactly what the compiler is positioned to catch. AI-generated SQL runs through the type checker before it can land, so suggestions that don't validate against the actual DAG never reach the user. NL-to-SQL tools that integrate a compile step would close exactly the gap you're pointing at.
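As a toy illustration of that "validate before it lands" gate (this is not Rocky's compiler; a real type checker parses the SQL, while this sketch just scans qualified `table.column` references against a known schema, so it can't catch wrong join keys or grain errors):

```python
import re

# Known schema: table -> columns (stand-in for what a real compiler derives).
SCHEMA = {
    "raw_users": {"id", "email_address"},
    "orders": {"id", "user_id", "total"},
}

def check_column_refs(sql: str) -> list[str]:
    """Return qualified column references that don't exist in the schema."""
    errors = []
    for table, column in re.findall(r"\b(\w+)\.(\w+)\b", sql):
        if table in SCHEMA and column not in SCHEMA[table]:
            errors.append(f"{table}.{column}")
    return errors

# An AI-suggested query that hallucinates a column name:
sql = (
    "SELECT orders.total, raw_users.emial FROM orders "
    "JOIN raw_users ON orders.user_id = raw_users.id"
)
print(check_column_refs(sql))
# ['raw_users.emial']
```

The gate is the point: suggestions that fail this kind of check never reach the user, regardless of how plausible the SQL looks.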
On your two questions:
1. Branch isolation for stateful models — mixed answer, and worth being honest about:
2. Cost attribution. Both bytes scanned and duration are captured per-model in the run record (`bytes_scanned` and duration on `RunRecord`). Budget gating today is on cost (USD) and duration — `max_usd` and `max_duration_ms` in `[budget]` blocks in `rocky.toml`, as independent thresholds. A direct bytes-scanned budget threshold isn't gateable today; the bytes are in the run record for analysis but you can't currently fail CI on "this run scanned more than N TB". Reasonable extension if there's demand.