Show HN: How LLMs Work – Interactive visual guide based on Karpathy's lecture

ynarwal.github.io

236 points by ynarwal__ 1 day ago

All content is based on Andrej Karpathy's "Intro to Large Language Models" lecture (youtube.com/watch?v=7xTGNNLPyMI). I downloaded the transcript and used Claude Code to generate the entire interactive site from it, as a single HTML file. I find it useful to revisit this content from time to time.

lateral_cloud 1 day ago

This is completely AI-generated. Don't bother reading.

skiing_crawling 20 hours ago

What's the value of publishing purely LLM-generated content? Anyone can prompt the same thing out of it.

jblitzar 20 hours ago

yeah this seems entirely LLM-generated.

dylkil 21 hours ago

Was this created with Claude design? It looks very similar to something it mocked up for me last week.

arcza 1 day ago

Another low effort, dark mode slopsite. You lost me at "44 terabytes" before I even got to the emdash in that sentence.

@dang, when is the 'flag as slop' button coming?

rrr_oh_man 1 day ago

Slopsite... What a perfect name for that

rahen 23 hours ago

Some subreddits have already taken measures against this: https://www.reddit.com/r/ProgrammingLanguages/comments/1sd66...
Same everywhere: avalanches of AI garbage and intellectual dishonesty. People claiming "I wrote this", then a look at the code shows massive slop and an author with no clue about the topic.
More worrying, this trend is creeping into all domains: "Nearly 75,000 tracks uploaded to Deezer are fully created using AI. That’s 44% of daily uploads, and more than 2 million per month. Back in June, the daily number was around 20,000."
https://www.vice.com/en/article/how-deezer-is-fighting-fraud...

PetitPrince 1 day ago

Did you reread what Claude Code produced before publishing? This line in one of the first paragraphs jumps out:

> you end up with about 44 terabytes — roughly what fits on a single hard drive

No normal person would think that 44 TB is a usual hard drive size (I don't think it even exists? 32 TB seems to be the max at my retailer of choice). I don't think it's wrong per se to use an LLM to produce a cool visualization, but this lack of proofreading doesn't inspire confidence (especially since the 44 TB is displayed prominently in a different color).

QuantumNomad_ 1 day ago

I agreed when I read your comment, but that turned out to be almost directly from the video https://youtu.be/7xTGNNLPyMI
From around the 2:20 mark he says:
“[…] actually ends up being only about 44 TB of disk space. You can get a USB stick for like a TB very easily, or I think this could fit on a single hard drive almost today”
So it’s just slightly altered from what was said in the original video. And the LLM rewritten version of it also says “roughly” where he said “almost”, and I guess 44 TB is pretty roughly or pretty almost 32 TB. Although I’d still personally probably put it as “can fit on a pair of decently sized hard drives today” (for example across two 24 TB drives).
Regardless, it’s close enough to what was said in the source video that it’s not something the LLM just made up out of nowhere.
- erwincoumans 1 day ago
  
  There is a 122TB nvme PCIe SSD: https://www.solidigm.com/products/data-center/d5/p5336.html
  
  sigmoid10 22 hours ago
  
  That thing is like 10,000 bucks. That's not what I'd call a consumer hard drive. In fact there's already another one with 245TB. But you probably won't see these outside datacenters for a while.
  
  littlestymaar 21 hours ago
  
  > But you probably won't see these outside datacenters for a while.
  That's especially true now that data center spending is crazy high.
lucideer 1 day ago

Hard drives are currently scarce due to market factors, so it's not surprising that 32 TB is the biggest at your local retailer, but 40 TB+ SSDs were a little more widely available a year or two ago.
Still obviously crazy to consider that any kind of "average" or common size, but certainly not outrageous, especially for someone working in that field.
ashtonshears 1 day ago

Yes, Claude is being dumb/hallucinating. And yes, such drives do exist: the main manufacturers produce much larger ones than that.
- embedding-shape 23 hours ago
  
  Even if those things are true, no reasonable person who knows what they're talking about, and who writes for a wider audience, would say something like "you end up with about 44 terabytes — roughly what fits on a single hard drive" though.
  
  ashtonshears 23 hours ago
  
  That's what I said.
  
  embedding-shape 23 hours ago
  
  What your comment currently states is basically "It's possible", which, yeah, I guess I said too. But it doesn't seem you said something like "no reasonable person would say something like that", which for me was the main point of my comment...
  
  ashtonshears 23 hours ago
  
  My statement had two sentences. You chose to read only the second for some reason. We don't disagree. Respect.
  
  embedding-shape 22 hours ago
  
  Sorry, we must be reading two completely different comment threads. Hope you'll have a nice weekend regardless, take care! :)
  
  ashtonshears 2 hours ago
  
  Thanks I am! I hope you do as well
  
  raincole 21 hours ago
  
  You're probably hallucinating. The parent comment is quite clear.
  
  embedding-shape 21 hours ago
  
  Seems there might be two of us then. I never said I think the parent's comment wasn't clear :)
tcp_handshaker 23 hours ago

Maybe Karpathy used an LLM too. The 44 TB number happens to exactly match the largest drives Seagate currently sells to enterprise customers: not 40 TB, not 50 TB, but 44 TB... coincidence? [1]
For SSDs, the record seems to be 245 TB. [2]
[1] - https://www.seagate.com/stories/articles/seagate-delivers-in...
"Seagate’s Mozaic™ 4+ hard drives supporting capacities up to 44TB are now shipping in volume to two leading hyperscale cloud providers."
[2] - https://fudzilla.com/kioxia-showcases-245-76tb-lc9-enterpris...

gushogg-blake 1 day ago

I haven't found an explanation yet that answers a couple of seemingly basic questions about LLMs:

What does the input side of the neural network look like? Is it enough bits to represent N tokens where N is the context size? How does it handle inputs that are shorter than the context size?

I think embedding is one of the more interesting concepts behind LLMs but most pages treat it as a side note. How does embedding treat tokens that can have vastly different meanings in different contexts - if the word "bank" were a single token, for example, how does embedding account for the fact that it can mean river bank or money bank? Do the elements of the vector point in both directions? And how exactly does embedding interact with the training and inference processes - does inference generate updated embeddings at any point or are they fixed at training time?

(Training vs inference time is another thing explanations are usually frustratingly vague on)

maciejzj 1 day ago

AFAIK, the input is (at the most basic level) a matrix with L tokens (rows) and embedding length d (cols). The input tokens are initially coded as discrete IDs, but they are turned into embeddings by something like `torch.nn.Embedding`. The embedding layer can be thought of as a "lookup table", but it is equivalent to multiplying one-hot token vectors by a weight matrix learned through gradient descent (adjusted at train time, fixed at inference time). The embedding length (d) is also fixed; L is not. If you check the matrix multiplication formulas for both the embedding layer and attention, you will notice that they work for any number of rows/tokens/L (linear algebra and the rules of matrix multiplication). The context limit is imposed by auxiliary factors: positional encoding and the model's overall ability to produce coherent output for very long input.
As for the meaning of the "bank" embedding, it cannot be interpreted directly; however, you can run statistical analysis on embeddings (like PCA). If we were to say that the embedding for "bank" contains all possible meanings of this word, the particular one is inferred not by the embedding layer but by later attention operations that associate this token with the other tokens in the sequence (e.g. self-attention).
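To make the "lookup table vs. matrix multiplication" point concrete, here is a minimal pure-Python sketch. It is not how `torch.nn.Embedding` is implemented; the vocabulary size, embedding length, random table values, and function names are all made up for illustration. It only shows that a row lookup and a one-hot matrix product give the same L x d result, for any L.

```python
import random

VOCAB_SIZE = 6  # hypothetical tiny vocabulary
D_MODEL = 4     # embedding length d (fixed)

random.seed(0)
# Embedding table: one row of d floats per token ID. In a real model these
# values are learned during training and frozen at inference time; here
# they are just random placeholders.
table = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)]


def embed_lookup(token_ids):
    """Row lookup: returns an L x d matrix for a sequence of L token IDs."""
    return [table[t] for t in token_ids]


def embed_matmul(token_ids):
    """Same result written as one-hot (L x V) times table (V x d)."""
    one_hot = [[1.0 if v == t else 0.0 for v in range(VOCAB_SIZE)] for t in token_ids]
    return [
        [sum(row[v] * table[v][j] for v in range(VOCAB_SIZE)) for j in range(D_MODEL)]
        for row in one_hot
    ]


short = embed_lookup([2, 5])            # L = 2
longer = embed_lookup([2, 5, 0, 1, 2])  # L = 5, same table, no padding needed
```

Note that token 2 gets the identical row at positions 0 and 4 of `longer`: at this stage nothing depends on the neighbouring tokens.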
- gushogg-blake 23 hours ago
  
  This is exactly what I was looking for, thanks!
  
  sigmoid10 22 hours ago
  
  In this particular case the embedding wouldn't tell you anything about river bank vs any other bank. At that stage of the computation, this info simply isn't encoded yet. That would come from the context, which is later calculated in the attention matrix, i.e. the only place where tokens are cross-computed along the sequence dimension. Bank would have a strong connection to another token (or several) that defines its exact meaning in the current context, and together they would create a feature vector in an intermediate embedding space somewhere in the deep layers of the model. The embedding space talked about here is just the input/output matrix that compactifies a huge, highly sparse input matrix (essentially just an array of one-hot vectors glued together) into something more compact and less sparse. There's no real theoretical need for this; it just so happens that GPUs suck at multiplying huge sparse matrices. If we ever get LLMs designed to run on CPUs or analog circuits, you might even be able to just get rid of it entirely.
GistNoesis 1 day ago

Typically the input of an LLM is a sequence of tokens, aka a list of integers between 0 and the maximum token ID (the vocabulary size minus one).
The sequence is of variable length. It was one of the "early" problems in sequence modelling: how to deal with input of varying length with neural networks. There is a lot of literature about it.
This is the source of plenty of silent problems of various kinds:
- data out of distribution (short sequence vs long sequences may not have the same performance )
- quadratic behavior due to data copy
- normalization issues
- memory fragmentation
- bad alignment
One way of dealing with it is to treat a variable-length sequence as a fixed-size sequence, filling the empty elements with zeros (padding) and using "masks" to specify which elements should be ignored during the operations.
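A minimal sketch of that padding-plus-mask idea, in plain Python. The function names and the choice of 0 as the padding ID are assumptions for illustration; real frameworks do the same thing on tensors.

```python
PAD_ID = 0  # hypothetical ID reserved for padding


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad variable-length token lists to a common length and build masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = ignore
    return padded, masks


def masked_mean(values, mask):
    """Average only over positions where the mask is 1 (assumes at least one)."""
    kept = [v for v, m in zip(values, mask) if m]
    return sum(kept) / len(kept)


padded, masks = pad_batch([[7, 3, 9], [4], [5, 8]])
```

Forgetting the mask in something like `masked_mean` is one way the silent normalization issues listed above creep in: the padded zeros get averaged as if they were data.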
----
Concerning the embedding having multiple semantic meanings, it is best effort; all combinations of behavior can occur. The embedding layer is typically the first layer, and it converts the integer token ID into a vector of floating point numbers of the embedding dimension. It tries its best to separate the meanings to make the task of the subsequent layers of the neural network easier. It's shovelling the shit it can't handle down the road for the next layers to deal with.
For experiments you can try to merge two tokens into one or into an <unknown> token, in order to free some tokens for special use without having to increase the size of the vocabulary.
Embeddings can sometimes be the average of the disambiguated embeddings. Sometimes they can be their own thing.
In addition to embeddings, you can often look at the inner representation at a specific depth of the neural network. There, after a few layers, the representations have usually been disambiguated based on the context.
The last layer is also especially interesting because it is the one used to project back to the original token space. Sometimes we force its weights to be shared with the embedding layer. This projection layer usually can't use context, so it must contain within itself all the information necessary to map back to token space very simply. This last representation is often used as a full-sequence representation vector, which can be used for subsequent, more specialized training tasks.
Embedding weights are fixed after training, but in-context learning occurs during inference. The early tokens of the prompt help disambiguate the new tokens more easily. For example, <paragraph about money> bank vs <paragraph about landscape> bank vs bank will have the same input embedding for the bank token, but one or two layers down the line, the associated representation will be very different and close to the appropriate meaning.
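A toy illustration of that last point, as a hedged sketch rather than a real transformer layer: the 2-D embeddings are hand-picked, and the attention uses identity Q/K/V projections with no learned weights, residuals, or masking. The input vector for "bank" is the same in both sequences, but its representation after one attention step differs with the neighbouring token.

```python
import math

# Hypothetical 2-D input embeddings, fixed "after training".
# Axis 0 loosely means "finance", axis 1 loosely means "nature".
EMBED = {
    "money": [1.0, 0.0],
    "river": [0.0, 1.0],
    "bank": [0.5, 0.5],  # ambiguous: sits between the two meanings
}


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    x = [EMBED[t] for t in tokens]
    d = len(x[0])
    out = []
    for q in x:
        # Scaled dot-product scores of this token against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # New representation: attention-weighted mix of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


bank_in_money = self_attention(["money", "bank"])[1]
bank_in_river = self_attention(["river", "bank"])[1]
```

Here `bank_in_money` leans toward the "finance" axis and `bank_in_river` toward the "nature" axis, while `EMBED["bank"]` itself never changes.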
- gushogg-blake 23 hours ago
  
  Exactly what I was looking for, thanks!

lukeholder 1 day ago

Page keeps annoyingly scroll-jumping a few pixels on iOS safari

tbreschi 1 day ago

Yeah, that typing effect at the top (expanding the composer) seems to be the issue
vinnymac 22 hours ago

This made it impossible for me to read, and Safari Reader was unavailable, further cementing my press of the back button.
skinner927 22 hours ago

You can use the “Hide Distracting Items” feature on iOS Safari to delete the box.

vova_hn2 21 hours ago

I think that BPE visualization is slightly misleading, because it seems to imply that the "old" (smaller) tokens are thrown away and replaced with longer tokens, which is not the case.

In fact, it is a purely additive process: we iteratively add the most frequent pairs to the set until we reach the desired total number of tokens. But we never remove tokens; we keep everything, including the initial 256 tokens representing bytes.

This ensures that the model is capable of producing every possible Unicode sequence (in fact, I think that it is capable of producing every possible byte sequence, but bytes that are not valid Unicode are filtered during sampling).
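A toy byte-level BPE trainer shows the additive behaviour. This is a sketch under simplifying assumptions (no pre-tokenization regex, ties broken by first occurrence, a made-up training string), not the tokenizer of any particular model.

```python
from collections import Counter


def train_bpe(text, num_merges):
    """Toy byte-level BPE: returns (vocab, merges). The vocab only ever grows."""
    vocab = {i: bytes([i]) for i in range(256)}  # base tokens: one per byte
    ids = list(text.encode("utf-8"))
    merges = {}
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = len(vocab)
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]  # add; never remove
        merges[pair] = new_id
        # Rewrite the training sequence using the new token.
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return vocab, merges


vocab, merges = train_bpe("low lower lowest low low", num_merges=5)
```

After training, `vocab` holds the 256 byte tokens plus one new entry per merge. The merged tokens replace pairs in the training sequence, but the smaller tokens stay in the vocabulary, so any byte string can still be encoded.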

Edit #1: also, this page entirely skips the attention mechanism, which is, in my opinion, both the most interesting part and the part that is hardest to understand (I can't say that I fully understand it, to me it is just some linear algebra matrix multiplication magic).

jasonjmcghee 21 hours ago

Highly recommend instead reading the human created "The Illustrated GPT-2" by Jay Alammar - https://jalammar.github.io/illustrated-gpt2/

And his similar work.

He also has a free course on "how llms work"

jblitzar 20 hours ago

This Jay Alammar guide is great! I used it when writing my own transformer

ynarwal__ 21 hours ago

Update: The "single hard drive" claim was wrong and I've corrected it to "roughly 10 consumer hard drives" (44TB ÷ ~4TB = ~11). Attribution to Karpathy is now a direct link. Added a caveat under the stats noting these are representative 2024-era figures — the exact numbers shift with every release and that's somewhat the point. Also did a few iterations on visual redesign (linked in the header as v2) with a proper top navigation bar after a few people found the dot nav hard to use and UI was jumping.

Also, I have not fact-checked everything, but I have read it and it seems to be aligned with what is described in the lecture.
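For anyone wanting to check the drive arithmetic, a quick sketch using the sizes mentioned in this thread (4 TB, 24 TB, 32 TB). It assumes decimal terabytes throughout and ignores filesystem overhead and redundancy; `drives_needed` is just an illustrative helper.

```python
import math

DATASET_TB = 44  # figure quoted from the lecture


def drives_needed(dataset_tb, drive_tb):
    """Smallest whole number of drives whose combined capacity covers the dataset."""
    return math.ceil(dataset_tb / drive_tb)


for drive_tb in (4, 24, 32):
    print(f"{drive_tb} TB drives: {drives_needed(DATASET_TB, drive_tb)}")
```

With 4 TB drives this gives 11, and with either 24 TB or 32 TB drives it gives 2.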

thesz 21 hours ago

The page does a very poor job tokenizing the phrase "Noinceolik fiyulnabmed fyvaproldge" into "Noinceolik fiyulnabm ed fyvaproldge", splitting off only the "ed" suffix. As if made-up words such as "noinceolik" were so common that they are part of a 100K-token vocabulary.

Applying the actual GPT-5 tokenizer at [1] to my made-up phrase results in 14 tokens; only two of them are four characters long, and some tokens contain spaces.

[1] https://gpt-tokenizer.dev/

I will read along, though.

ynarwal__ 21 hours ago

I appreciate the feedback. I noticed that as well, and my first thought was that it might not be worth fixing since I already link to tiktokenizer. In the end I removed it and added a more prominent link to tiktokenizer.
- thesz 2 hours ago
  
  BPE that is used in tokenization is very simple: https://en.wikipedia.org/wiki/Byte-pair_encoding

amelius 19 hours ago

In these types of tutorials, can't we move tokenization to the end and just start with a model that inputs and outputs bytes (latin-1) instead?

That would (1) take the reader to the real meat sooner, and would (2) make it way more compelling why we need something as convoluted as tokenization.

Barbing 1 day ago

Left-hand labels (like Introduction) can overlap the main text content on the right in the central panel. You may be able to trigger it by reducing the window width.

ynarwal__ 21 hours ago

I disagree with some comments saying it's not worth reading because it's LLM-generated, even though I made it clear that I downloaded the transcript. LLMs are exceptionally good at generating accurate information when the source material is loaded directly into the context window.

siva7 21 hours ago

> WITHOUT RAG
> "I don't have reliable information about a colony called Ares Base. As of my training cutoff, no such Mars colony has been established..."

Oh we must have lived in a parallel universe then if this is a "without rag" textbook example.

gslepak 23 hours ago

Just want to give appreciation for proper attribution. I feel like some people will still say "Here's something I made" when the reality is "Here's something I asked my AI to make."

endymion-light 1 day ago

I really dislike the default AI slop css - if you're going to do this - please have a design language and taste ideas beforehand. It can help so much in refining the look.

Genuine piece of feedback, as soon as I see those gradients + quirks. My perception immediately becomes - you put no effort into finding your own style, therefore you will not have put effort into creating this website.

rrr_oh_man 1 day ago

> Genuine piece of feedback, as soon as I see those gradients + quirks. My perception immediately becomes - you put no effort into finding your own style, therefore you will not have put effort into creating this website.
Right on both counts
- jblitzar 20 hours ago
  
  I think the CSS here is what Claude spits out when you prompt it to make a frontend

learningToFly33 1 day ago

I’ve had a look, and it’s very well explained! If you ever want to expand it, you could also add how embedded data is fed at the very final step for specific tasks, and how it can affect prediction results.

rrr_oh_man 1 day ago

Nobody explained anything. This is entirely LLM-generated.
- davkap92 22 hours ago
  
  Well, Karpathy did. This is basically his work (with the big caveat that this holds only if there are no LLM mistakes/hallucinations). It's just exporting his transcript to a different format. I think if it's presented as 'look what I built', it's misleading. But if it's presented as 'this format of information was helpful to digest', and properly attributed to the author (which it was here), there's no real issue.

5asHajh 1 day ago

"Retrieved chunks are prepended to the prompt before the LLM sees the question. The model generates from injected facts rather than relying on memorized training data — dramatically reducing hallucination on knowledge-intensive tasks."

So plagiarism is even explicit now. A stolen database relying on cosine similarity to parse the prompts.
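For readers unfamiliar with the mechanism the quote describes, here is a minimal sketch of "retrieve by cosine similarity, then prepend". Everything in it is a stand-in: the bag-of-words `bow` function replaces a real embedding model, the chunk texts are invented, and the prompt template is just one plausible layout.

```python
import math
from collections import Counter


def bow(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question, chunks, k=1):
    """Rank chunks by similarity to the question and prepend the top k."""
    ranked = sorted(chunks, key=lambda c: cosine(bow(question), bow(c)), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"


chunks = [
    "Ares Base is a fictional Mars colony used as an example.",
    "Byte-pair encoding merges frequent token pairs.",
]
prompt = build_prompt("What is Ares Base?", chunks)
```

The model then sees the retrieved chunk ahead of the question, which is all "injected facts" means in the quoted passage.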

Why doesn't The Pirate Bay have a $1 trillion valuation?

weego 22 hours ago

putting text in colored boxes around the page isn't really interactive or visual in the way I'd hoped, but it looks pretty.

hansmayer 1 day ago

> and used Claude Code to generate the entire interactive site from it

Hard pass on AI slop. First, and principally, it brings no real value: anyone can iterate over some prompts to generate a version of this. Second, and more specifically, don't you know that LLMs are particularly prone to mistakes when summarising, where subtle changes in wording can alter the meaning of the wider context?

If you insist on being the human part of a centaur, then at least do your human slave part - inspect the excremented "content", fix inconsistencies etc.

PeakScripter 1 day ago

Currently working on somewhat the same thing myself