I'm building an agent framework in golang and it is extremely light weight. Startup time is under 1/2 second, and RAM usage is really low. I have a 12 year old laptop and it happily runs without slowing down.
There's no reason what is essentially a string concat engine should be slow on any hardware, including old hardware.
Link is here [0]. The idea is to model cognitive states (how to think), and workflows (what to think about) as statecharts. The charts will be defined in YAML (version-able, hot-reloading). Context payloads are defined in an agent YAML file. Think of it as a map, like a drive map for a computer's HDD/SSD. You spec the order of context chunks, what goes into them, and then when the inference payload is built, it uses the context map definition (comprised of the chunks you defined), the agent definition (including model params like context length, temp, etc), cognitive state, and workflow state to build out the inference payload.
Agent cognitive states may add chunks to the system prompt. Workflows may add chunks to the system prompt. Tool access may vary by agent/workflow state (policy is last-defined-wins overlays to keep it simple to reason about).
Agents may run by themselves or be 'bound' to a workflow. Agents can detach from a workflow before it is finished, and either re-bind, or another agent may bind to the workflow (one implements, another reviews, for example).
Conceptually, this is all very simple, which is why I'm hand rolling it.
The goal is a minimal runtime that can support long-running agents in a 'zero human company' setting.
On top of the runtime will be a minimal change control workflow (if you've spent time in hardware engineering, these are standard processes governed by a company's quality system).
I've yet to wire in the economic pieces (token spend, power consumption, rollups that show performance of various agents based on inputs and outputs).
It is a bit far fetched, but I'd like to get this thing ISO9001 certified, and maybe AS9100 certified.
This is all to scratch my own itch, tbh. Most agentic systems are hard to reason about, bloated, lack visibility in the appropriate places, lack economic data of sufficient granularity, and so on. So I'm building this.
I've been trying to migrate over the zed and think they're Agent Client Protocol[1] is pretty neat, I wonder how much memory pressure Claude Code exerts if it is going through that mechanism instead
It has nothing to do with local RAM usage. But a million tokens of LLM context is decidedly not 5mb.
The rough estimate is 2 * L * H_kv * D * bytes per element
Where:
* L = number of layers
* H_kv = # of KV heads
* D = head dimension
* factor of 2 = keys + values
The dominant factor here is typically 2 * H_kv * D since it’s usually at least 2048 bytes. Per token.
For Llama3 7B youre looking at 128gib if you’re context is really 1M (not that that particular model supports a context so big). DeepSeek4 uses something called sparse attention so the above calculus is improved - 1M of context would use 5-10GiB.
But regardless of the details, you’re off by several orders of magnitude.
'A million tokens of context' is literally Terrabytes of KV cache VRAM on very expensive Nvidia silicon - on the model.
On the Agent, yes, the context window does relate to RAM, because the 'entire conversational history' is generally kept in memory. So ballpark 1M 'words' across a bunch of strings. It's not that-that much.
Claude Code is not inneficient because 'it's not Rust' - it's just probably not very efficiently designed.
Rust does not bestow magical properties that make memory more efficient really.
A bit more, but it's not going to change this situation.
'Dong it in Rust' might yield amazing returns just because the very nature of the activity is 'optimization'.
Lamenting any 'not even criticism' of Rust as 'denialism' is just evidence of the insane cult that is Rust.
Rebuilding Claude Code in Rust will make almost no difference in terms of real world performance. V8 is 'relatively fast', and there wouldn't be any noticeable improvements there, and probably not memory footprint either.
The source for Claude Code was leaked and it's a vibe-coded mess, there's not much thought given to clean architecture, it's unlikely they've just cleaned up a bit and given thought to memory consumption etc, if they did, they'd get by far most of the way there and likely abnegate and real want to 'do it in rust', unless there are other architectural considerations.
You're the delusional one for bringing up the memory usage of the inference server that clearly isn't running inside the coding agent.
The problem with your comments is that you're showing off a fundamental lack of understanding between managed languages and unmanaged languages.
The vast majority of GCs are optimized for throughput and allocate big chunks of memory. They also tend to never release it if there was a temporary memory spike. The most advanced GCs also tend to have either read or write barriers, which slow down basic object accesses.
Just in time compilation and managed languages in general need to retain a runtime representation of the source code to perform JIT compilation and then they have to store the compiled code in memory as well.
JavaScript uses references against dynamic objects, which means you have to pay the indirection cost of a pointer but you also need to store type information as well to monomorphize the object literals and classes at runtime and fall back to a regular hashmap when fields are added dynamically.
All of these things will add up and increase the amount of memory the application uses and how slow it runs.
Sure Claude Code has severe architectural issues causing it to leak hundreds of gigabytes of RAM, but if those were not there you could easily build a C++ based alternative that runs circles around a hypothetical JavaScript based Claude Code that got its act together.
1) I'm not 'delusional' for bringing up 'What Memory is Used Where' - I'm clarifying for the people who seem a bit confused (see above) as to 'where the context lives' - and trying to provide a simple mental model for that.
That's the opposite of delusional.
It's just information.
Attacking people for anything 'Rust related' however - is the quintessential reason why everyone hates the Rust community.
2) 'The problem with your comment' is that it's presumptive and arrogant - as if I 'don't know the difference between GC and managed languages'.
I've been writing software since 1990.
Embedded (on custom Silicon), UI, SaaS, backend, some embedded work I've done is still in production today from almost 30 years ago.
I've written a scripting languages (for production), and cyclic ref-count gc (didn't make it to production).
Your comments about GC etc. are fine - but they but they don't really offer any insight into the actual problem.
There's one critical detail aka 'memory not released after spikes', yes, this is observed behaviour, but it's usually accommodated with a little bit of decent Engineering.
If you're going to make the comparative basis an an 'Idiomatic Rust' solution (aka good patterns), the we should make the assumption of an 'Idiomatic Node' solution for Claude Code.
3) 'The other problem with your comment' is that your conclusion is wrong - by your own hand.
Right here: "Claude Code has severe architectural issues causing it to leak hundreds of gigabytes of RAM," - the implication being that Claude Claude does not inherently have to 'leak all that RAM' - and would run just as fine with some basic work.
An 'Idiomatic Node' implementation of Claude Code wouldn't exhibit those problems, and would perform pragmatically just as well as an Idiomatic Rust implementation.
From a memory management situation, Rust might use significantly less memory, but a 150Mb footprint vs 350Mb foot print for an average session is 'pragmatically immaterial'.
The difference in 'perceived performance' would be negligible - if any.
The 'cost' of writing a the 'kind of program that Claude code is' in a systems-level language would be quite a lot, for not really much benefit.
The 'Rust or C++' solution would not 'run circles' around the 'node' implementation in anything but some 'preformative', inward looking benchmarks, aka 'the worst kind of Engineering'.
Consider pondering why almost nobody writes such applications in Rust or C++.
Hi, I'm the developer of zerostack!
No, the memory footprint is not beacuse of the context window size: on my benchmarks, with a 128k context loaded, and it jumped from 8MB (without any chat/context loaded) to 11MB.
The reasons why the memory footprint of zerostack are:
- Rust, and not JS/Python, so no interpreters/VMs on top
- Load-as-needed, so we only allocate things like LLM connectors when needed
- `smallvec` used for most of the array usage of the tool (up to N items are stored in stack)
- `compactstring` used for most of the string usage of the tool (up to N chars are stored in stack)
- `opt-level=z` to force LLVM to optimize for binary size and not for performance (even tho we still beat both in TTFT and in tool use time opencode)
The context window is not on your system. It's on the server with the model. There may be some local prompt caching, of some sort, but you're not locally hosting the context unless you're also locally hosting the model.
To me it would certainly make sense if the protocol just said "append this text to context window id/sha256", in particular as the data is cached in tensor level in the provider side, so they need to first do that lookup anyway. So I would be surprised if they don't have that.
In addition, this protocol could make it more transparent to say "oh we cannot proceed as we dropped the this cache, are you sure you want to proceed and consume a whole lot of expensive uncached tokens?". Oh, maybe that's a reason not to do it..
That's just the plain text (or whatever files), that's not the context the model is directly working with on the server, which is tokenized, embedded, vectorized and has attention run against those vectors. The local history is generally quite small, the context generally quite a bit larger. A text conversation of a few hundred kilobytes in plain text will be gigabytes in context.
I'm building an agent framework in golang and it is extremely light weight. Startup time is under 1/2 second, and RAM usage is really low. I have a 12 year old laptop and it happily runs without slowing down.
There's no reason what is essentially a string concat engine should be slow on any hardware, including old hardware.
Isn't 2 second startup time a lot? With zerostack, I managed to get it down to ~90ms
They said 1/2 as in 0.5 seconds as in 500 ms.
Sounds interesting, would you like to share any more information about your project?
Link is here [0]. The idea is to model cognitive states (how to think), and workflows (what to think about) as statecharts. The charts will be defined in YAML (version-able, hot-reloading). Context payloads are defined in an agent YAML file. Think of it as a map, like a drive map for a computer's HDD/SSD. You spec the order of context chunks, what goes into them, and then when the inference payload is built, it uses the context map definition (comprised of the chunks you defined), the agent definition (including model params like context length, temp, etc), cognitive state, and workflow state to build out the inference payload.
Agent cognitive states may add chunks to the system prompt. Workflows may add chunks to the system prompt. Tool access may vary by agent/workflow state (policy is last-defined-wins overlays to keep it simple to reason about).
Agents may run by themselves or be 'bound' to a workflow. Agents can detach from a workflow before it is finished, and either re-bind, or another agent may bind to the workflow (one implements, another reviews, for example).
Conceptually, this is all very simple, which is why I'm hand rolling it.
The goal is a minimal runtime that can support long-running agents in a 'zero human company' setting.
On top of the runtime will be a minimal change control workflow (if you've spent time in hardware engineering, these are standard processes governed by a company's quality system).
I've yet to wire in the economic pieces (token spend, power consumption, rollups that show performance of various agents based on inputs and outputs).
It is a bit far fetched, but I'd like to get this thing ISO9001 certified, and maybe AS9100 certified.
This is all to scratch my own itch, tbh. Most agentic systems are hard to reason about, bloated, lack visibility in the appropriate places, lack economic data of sufficient granularity, and so on. So I'm building this.
[0] https://github.com/zerohumancompany2/maelstrom-code
I've been trying to migrate over the zed and think they're Agent Client Protocol[1] is pretty neat, I wonder how much memory pressure Claude Code exerts if it is going through that mechanism instead
1: https://zed.dev/acp
Not answering your question, but I just realized the new Anthropic billing changes are affecting ACP clients like Zed :(
https://zed.dev/blog/anthropic-subscription-changes
The memory footprint is great, it allows finally running these coding agents in extra small instances -- say x1 on shellbox.dev
Hmm, if they're this small something like smolmachines (like shellbox, but free and local) might be a great fit.
Yes. Just this fact is going to make a lot of people try it out.
I have 29 Claude Codes open, using 6.3 GiB RSS total
Are you sure you don't have an LSP plugin or something running?
Isn't that because of the context window size?
The context window has nothing to do with RAM usage and even if it did, a million tokens of context is maybe 5mb.
It has nothing to do with local RAM usage. But a million tokens of LLM context is decidedly not 5mb.
The rough estimate is 2 * L * H_kv * D * bytes per element
Where:
* L = number of layers * H_kv = # of KV heads * D = head dimension * factor of 2 = keys + values
The dominant factor here is typically 2 * H_kv * D since it’s usually at least 2048 bytes. Per token.
For Llama3 7B youre looking at 128gib if you’re context is really 1M (not that that particular model supports a context so big). DeepSeek4 uses something called sparse attention so the above calculus is improved - 1M of context would use 5-10GiB.
But regardless of the details, you’re off by several orders of magnitude.
Pretty sure we're talking about the output text, not the tensors.
These LLM replies are really getting annoying.
Mine? I literally wrote what I wrote because “context window” as a term of art refers to the LLM’s context window.
I guess get better at detecting LLMs instead of accusing everything of being an LLM reply?
'A million tokens of context' is literally Terrabytes of KV cache VRAM on very expensive Nvidia silicon - on the model.
On the Agent, yes, the context window does relate to RAM, because the 'entire conversational history' is generally kept in memory. So ballpark 1M 'words' across a bunch of strings. It's not that-that much.
Claude Code is not inneficient because 'it's not Rust' - it's just probably not very efficiently designed.
Rust does not bestow magical properties that make memory more efficient really.
A bit more, but it's not going to change this situation.
'Dong it in Rust' might yield amazing returns just because the very nature of the activity is 'optimization'.
Rust "denialism" is as annoying as rust evangelism.
Of course any seemingly idiomatic rust is going to run circles around TS transpiled into JIT-compiled JS.
Lamenting any 'not even criticism' of Rust as 'denialism' is just evidence of the insane cult that is Rust.
Rebuilding Claude Code in Rust will make almost no difference in terms of real world performance. V8 is 'relatively fast', and there wouldn't be any noticeable improvements there, and probably not memory footprint either.
The source for Claude Code was leaked and it's a vibe-coded mess, there's not much thought given to clean architecture, it's unlikely they've just cleaned up a bit and given thought to memory consumption etc, if they did, they'd get by far most of the way there and likely abnegate and real want to 'do it in rust', unless there are other architectural considerations.
You're the delusional one for bringing up the memory usage of the inference server that clearly isn't running inside the coding agent.
The problem with your comments is that you're showing off a fundamental lack of understanding between managed languages and unmanaged languages.
The vast majority of GCs are optimized for throughput and allocate big chunks of memory. They also tend to never release it if there was a temporary memory spike. The most advanced GCs also tend to have either read or write barriers, which slow down basic object accesses.
Just in time compilation and managed languages in general need to retain a runtime representation of the source code to perform JIT compilation and then they have to store the compiled code in memory as well.
JavaScript uses references against dynamic objects, which means you have to pay the indirection cost of a pointer but you also need to store type information as well to monomorphize the object literals and classes at runtime and fall back to a regular hashmap when fields are added dynamically.
All of these things will add up and increase the amount of memory the application uses and how slow it runs.
Sure Claude Code has severe architectural issues causing it to leak hundreds of gigabytes of RAM, but if those were not there you could easily build a C++ based alternative that runs circles around a hypothetical JavaScript based Claude Code that got its act together.
1) I'm not 'delusional' for bringing up 'What Memory is Used Where' - I'm clarifying for the people who seem a bit confused (see above) as to 'where the context lives' - and trying to provide a simple mental model for that.
That's the opposite of delusional.
It's just information.
Attacking people for anything 'Rust related' however - is the quintessential reason why everyone hates the Rust community.
2) 'The problem with your comment' is that it's presumptive and arrogant - as if I 'don't know the difference between GC and managed languages'.
I've been writing software since 1990.
Embedded (on custom Silicon), UI, SaaS, backend, some embedded work I've done is still in production today from almost 30 years ago.
I've written a scripting languages (for production), and cyclic ref-count gc (didn't make it to production).
Your comments about GC etc. are fine - but they but they don't really offer any insight into the actual problem.
There's one critical detail aka 'memory not released after spikes', yes, this is observed behaviour, but it's usually accommodated with a little bit of decent Engineering.
If you're going to make the comparative basis an an 'Idiomatic Rust' solution (aka good patterns), the we should make the assumption of an 'Idiomatic Node' solution for Claude Code.
3) 'The other problem with your comment' is that your conclusion is wrong - by your own hand.
Right here: "Claude Code has severe architectural issues causing it to leak hundreds of gigabytes of RAM," - the implication being that Claude Claude does not inherently have to 'leak all that RAM' - and would run just as fine with some basic work.
An 'Idiomatic Node' implementation of Claude Code wouldn't exhibit those problems, and would perform pragmatically just as well as an Idiomatic Rust implementation.
From a memory management situation, Rust might use significantly less memory, but a 150Mb footprint vs 350Mb foot print for an average session is 'pragmatically immaterial'.
The difference in 'perceived performance' would be negligible - if any.
The 'cost' of writing a the 'kind of program that Claude code is' in a systems-level language would be quite a lot, for not really much benefit.
The 'Rust or C++' solution would not 'run circles' around the 'node' implementation in anything but some 'preformative', inward looking benchmarks, aka 'the worst kind of Engineering'.
Consider pondering why almost nobody writes such applications in Rust or C++.
You have a point but it's definitely not TBs for 1M. Should be more like 100G.
Hi, I'm the developer of zerostack! No, the memory footprint is not beacuse of the context window size: on my benchmarks, with a 128k context loaded, and it jumped from 8MB (without any chat/context loaded) to 11MB.
The reasons why the memory footprint of zerostack are:
- Rust, and not JS/Python, so no interpreters/VMs on top
- Load-as-needed, so we only allocate things like LLM connectors when needed
- `smallvec` used for most of the array usage of the tool (up to N items are stored in stack)
- `compactstring` used for most of the string usage of the tool (up to N chars are stored in stack)
- `opt-level=z` to force LLVM to optimize for binary size and not for performance (even tho we still beat both in TTFT and in tool use time opencode)
- heavy usage of [LTO](https://en.wikipedia.org/wiki/Interprocedural_optimization#W...)
The context window is not on your system. It's on the server with the model. There may be some local prompt caching, of some sort, but you're not locally hosting the context unless you're also locally hosting the model.
Chat history is kept locally, generally you have to send the 'whole history' to the model 'each turn'.
Only "generally"? I'm curious what API has moved away from this protocol that seems mode adapted to conversaions with humans than agentic loops.
So the standard API you pass it all along but I think there are some odd open ai apis that are different.
To me it would certainly make sense if the protocol just said "append this text to context window id/sha256", in particular as the data is cached in tensor level in the provider side, so they need to first do that lookup anyway. So I would be surprised if they don't have that.
In addition, this protocol could make it more transparent to say "oh we cannot proceed as we dropped the this cache, are you sure you want to proceed and consume a whole lot of expensive uncached tokens?". Oh, maybe that's a reason not to do it..
That's just the plain text (or whatever files), that's not the context the model is directly working with on the server, which is tokenized, embedded, vectorized and has attention run against those vectors. The local history is generally quite small, the context generally quite a bit larger. A text conversation of a few hundred kilobytes in plain text will be gigabytes in context.
KV for a sota model is into terrabytes