Microgpt explained interactively

311 points by growingswe 4 months ago

> By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". None of them are copies from the dataset.

Hey, I am able to see kamon, karai, anna, and anton in the dataset, it'd be worth using some other names: https://raw.githubusercontent.com/karpathy/makemore/988aa59/...
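
A claim like "none of them are copies" is easy to check mechanically. Here is a minimal sketch, assuming the names file has been saved locally with one name per line. The inline `dataset_text`, the made-up name "zorvelle", and the helper `split_seen_and_novel` are hypothetical stand-ins for illustration, not part of the article or of makemore:

```python
def split_seen_and_novel(generated, dataset_names):
    """Partition generated names into ones already in the dataset and ones that are new."""
    known = {name.strip().lower() for name in dataset_names if name.strip()}
    seen = [g for g in generated if g.lower() in known]
    novel = [g for g in generated if g.lower() not in known]
    return seen, novel

# Hypothetical stand-in for the real names file (one name per line).
dataset_text = "emma\nolivia\nanna\nanton\nkamon\nkarai\n"

# The four names quoted from the article, plus one made-up name for contrast.
generated = ["kamon", "karai", "anna", "anton", "zorvelle"]

seen, novel = split_seen_and_novel(generated, dataset_text.splitlines())
print("already in dataset:", seen)
print("novel:", novel)
```

Running the same function over the full list of samples would show what fraction of the generated names is actually new.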

ayhanfuat 4 months ago

You are absolutely right. The whole post reads like AI generated.
- butterisgood 4 months ago
  
  ISWYDT
- re 4 months ago
  
  I didn't get that sense from the prose; it didn't have the usual LLM hallmarks to me, though I'm not enough of an expert in the space to pick up on inaccuracies/hallucinations.
  The "TRAINING" visualization does seem synthetic though, the graph is a bit too "perfect" and it's odd that the generated names don't update for every step.
  
  oytis 4 months ago
  
  For me it was the prose that alarmed me. Short sentences, aggressive punctuation, desperately trying to keep you engaged. It is totally possible to ask the model to choose a different style - I think that's either the default or corresponds to tastes of the content creators
- jsheard 4 months ago
  
  The rate they are posting new articles on random subjects is also pretty indicative of a content mill.
  In 3 days they've covered machine learning, geometry, cryptography, file formats and directory services.
  
  growingswe 4 months ago
  
  I had to look up what a content mill is. I'm not one, I think. It's "random" stuff because my interests are different. These posts are not written sequentially, I've been working on them (except for this MicroGPT one) for weeks and only publishing now.
  
  UltraSane 4 months ago
  
  It has gotten to the point you need timestamped keystrokes and a screen recording to prove you actually wrote something yourself.
  
  bonoboTP 4 months ago
  
  Soon you'll be able to generate a video with AI that shows you typing the entire thing, and will narrate it in your own voice with voice cloning.
  
  sincerely 4 months ago
  
  This already exists in visual art: timelapses of the drawing process were being used to prove that pictures weren’t AI generated, until someone made a program that takes a picture and generates a fake progress vid
  
  gwern 4 months ago
  
  Dude, you literally start the article with
  > Andrej Karpathy wrote a 200-line Python script that trains and runs a GPT from scratch, with no libraries or dependencies, just pure Python.
  Almost immediately afterwards, you have a section titled "Numbers, not letters". Need I go on?
  Interestingly, despite all the AI tics, the opening passes Pangram as 100% human... though all the following sections I randomly checked also come back as 100% AI. So the simplest explanation would be that you are operating adversarially and you tweaked the opening to target Pangram (perhaps through an anti-AI-detection service, which now exist and are being used by the cutting edge, as Pangram is known to be relatively easy to beat, similar to how people started search-and-replacing em-dashes when that got a little too well known), which unfortunately means I now expect you to lie to me in your response since you apparently went that far to start building up clout.
  (BTW, how did you accidentally pick 4 rare names which were in the dataset? "Thanks, will fix" is not a real response to that observation. Are you also going to remove all of the 'just pure X' and 'Y, not X' constructions from your posts now that I've pointed it out?)
  
  gwern 4 months ago
  
  (Pangram commentary: https://x.com/max_spero_/status/2028620466995220568 https://x.com/max_spero_/status/2028689219112034436 )
  
  jsheard 4 months ago
  
  Addendum - now they've changed the dates of several articles retroactively, to increase the spacing.
growingswe 4 months ago

Thanks, will fix

jmkd 4 months ago

It says it's tailored for beginners, but I don't know what kind of beginner can parse multiple paragraphs like this:

"How wrong was the prediction? We need a single number that captures "the model thought the correct answer was unlikely." If the model assigns probability 0.9 to the correct next token, the loss is low (0.1). If it assigns probability 0.01, the loss is high (4.6). The formula is $-\log(p)$, where $p$ is the probability the model assigned to the correct token. This is called cross-entropy loss."

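The two loss values in the quoted passage can be reproduced in a few lines. This is a small sketch, assuming the natural logarithm (which is what gives 0.1 and 4.6); the function name `cross_entropy` is chosen here for illustration:

```python
import math

def cross_entropy(p_correct):
    """Loss for one prediction: negative natural log of the probability given to the correct token."""
    return -math.log(p_correct)

# A confident correct guess costs little; a near-miss costs a lot.
for p in (0.9, 0.5, 0.01):
    print(f"p = {p:<4}  loss = {cross_entropy(p):.3f}")
```

With p = 0.9 the loss is about 0.105, which the passage rounds to 0.1, and with p = 0.01 it is about 4.605. The loss falls to 0 only when the model puts probability 1 on the correct token.
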
growingswe 4 months ago

I see. The problem with me writing these is even though I'm not an expert, I do have a bit of knowledge on certain things so I'm prone to say things that make sense to me but not to beginners. I'll rethink it
- gwern 4 months ago
  
  One of the downsides of using an expert LLM to write for you is that they know all that perfectly well, even if you don't, and aren't too bothered by such a chunk. It's like reading any Wikipedia article on mathematics... This is the kind of thing that people are documenting in the LLM-user literature in creating an illusion of expertise (or 'illusion of transparency'). Because the LLM explains it so fluently, you feel like you understand, even though you don't. Hence new phrases like 'cognitive debt' to try to deal with it.
  (This is also why people like cramming or lectures rather than quizzing or spaced repetition, because they produce a certain 'illusion of depth' https://gwern.net/doc/psychology/cognitive-bias/illusion-of-... ).

windowshopping 4 months ago

The part that eludes me is how you get from this to the capability to debug arbitrary coding problems. How does statistical inference become reasoning?

For a long time, it seemed the answer was it doesn't. But now, using Claude code daily, it seems it does.

fc417fc802 4 months ago

Because it's not statistical inference on words or characters but rather stacked layers of statistical inference on ~arbitrarily complex semantic concepts which is then performed recursively.
- love2read 4 months ago
  
  This answer makes sense if you know that LLMs have layers; if you don't, this answer is not super informative.
  If I were to describe this to a nontechnical person, I would say:
  LLMs are big stacks of layers of "understanders" that each teach the next guy something.
  Imagine you are making a large language model that has 4 layers. Each layer will talk to its immediate neighbor.
  The first layer will get the bare minimum; in the LLMs of today, that's groups of letters that commonly come up together, called "tokens". This layer will try to derive a bit of meaning to tell the next layer, such as grouping of letters into words.
  The next layer may be a little bit more semantic, for example interpreting that the word "hot" immediately followed by the word "dog" maps to a phrase "hot dog".
  The layer after that, becoming a bit more intelligent given that its predecessors have already had some chances at smaller interpretations, may now try to group words into bigger blobs, such as "i want a hot dog" as one combined phrase rather than a set of separated concepts.
  The final layer may do something even more intelligent afterward, like realize that this is a quote in a book.
  The point is that each layer tries to add a little meaning for the next layer.
  I want to stress this: the layers do not actually correspond to specific concepts the way I just expressed, the point is that each layer adds a bit more "semantic meaning" for the next layer.
ferris-booler 4 months ago

IMO your question is the largest unknown in the ML research field (neural net interpretability is a related area), but the most basic explanation is "if we can always accurately guess the next 'correct' word, then we will always answer questions correctly".
An enormous amount of research+eng work (most of the work of frontier labs) is being poured into making that 'correct' modifier happen, rather than just predicting the next token from 'the internet' (naive original training corpus). This work takes the form of improved training data (e.g. expert annotations), human-feedback finetuning (e.g. RLHF), and most recently reinforcement learning (e.g. RLVR, meaning RL with verifiable rewards), where the model is trained to find the correct answer to a problem without 'token-level guidance'. RL for LLMs is a very hot research area and very tricky to solve correctly.
antonvs 4 months ago

One problem is that "statistical inference" is overly reductive. Sure, there's a statistical aspect to the computations in a neural network, but there's more to it than that. As there is in the human brain.
mike_hearn 4 months ago

DNNs aren't really "statistical" inference in the way most people would understand the term statistics. The underlying maths owes much more to calculus than statistics. The model isn't just encoding statistics about the text it was trained on, it's attempting to optimize a solution to the problem of picking the next token with all the complexity that goes into that.
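
To make the "calculus, not statistics" point concrete, here is a toy sketch of a single optimization loop: three raw scores (logits) for three candidate tokens, nudged by gradient descent so the correct token becomes more likely. The three-token setup, the learning rate, and the function names are assumptions for illustration; a real model computes the logits from many layers of weights and gets the gradients via backpropagation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step(logits, target, lr=0.5):
    """One gradient-descent step on the loss -log(p[target]).

    For softmax followed by this loss, the gradient with respect to each
    logit is (probability - 1) for the target and (probability) otherwise.
    """
    probs = softmax(logits)
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [v - lr * g for v, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0]  # start with no preference among the 3 tokens
target = 2                # index of the correct next token
for i in range(5):
    loss = -math.log(softmax(logits)[target])
    print(f"step {i}: loss = {loss:.3f}")
    logits = step(logits, target)
```

The printed loss shrinks at every step: nothing is being counted or tabulated, the numbers are simply pushed downhill along the derivative of the loss.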

malnourish 4 months ago

I read through this entire article. There was some value in it, but I found it to be very "draw the rest of the owl". It read like introductions to conceptual elements or even proper segues had been edited out. That said, I appreciated the interactive components.

davidw 4 months ago

It started off nicely but before long you get
"The MLP (multilayer perceptron) is a two-layer feed-forward network: project up to 64 dimensions, apply ReLU (zero out negatives), project back to 16"
Which starts to feel pretty owly indeed.
I think the whole thing could be expanded to cover some more of it in greater depth.
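
For what it's worth, the quoted MLP sentence unpacks into very little code. This is a rough sketch in plain Python, assuming random weights and no bias terms; the helper names (`linear`, `relu`, `mlp`) and the initialization scale are illustrative choices, not taken from the article or from Karpathy's script:

```python
import random

def linear(x, w):
    """Multiply vector x by weight matrix w (one row of w per output dimension)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def relu(x):
    """Zero out negative values, keep positive ones unchanged."""
    return [max(0.0, v) for v in x]

def mlp(x, w_up, w_down):
    hidden = relu(linear(x, w_up))  # project up: 16 -> 64, then zero out negatives
    return linear(hidden, w_down)   # project back: 64 -> 16

random.seed(0)
n_embd, n_hidden = 16, 64
# Random weights stand in for trained ones (assumption for this sketch).
w_up = [[random.gauss(0, 0.08) for _ in range(n_embd)] for _ in range(n_hidden)]
w_down = [[random.gauss(0, 0.08) for _ in range(n_hidden)] for _ in range(n_embd)]

x = [random.gauss(0, 1) for _ in range(n_embd)]  # one token's 16-number vector
y = mlp(x, w_up, w_down)
print(len(x), "->", n_hidden, "->", len(y))  # prints: 16 -> 64 -> 16
```

"Project up" and "project back" are each just a matrix multiply; the wider middle gives the network room to compute intermediate features before squeezing back to the original size.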
- tibbar 4 months ago
  
  I think the big frustration I've had in learning modern ML is that the entire owl is just so complicated. A poor explainer reads like "black box is black boxing the other black box", completely undecipherable. A mediocre-to-above-average explanation will be like "(loosely introduced concept) is (doing something that sounds meaningful) to black box", which is a little better. However, when explanations start getting more accurate, you run into the sheer volume of concepts/data transforms taking place in a transformer, and there's too much information to be useful as a pedagogical device.
- growingswe 4 months ago
  
  I tried to include tooltips in some places that go into more depth, but I understand there's a jump. I'm not sure what will be the best way to go about it tbh
  
  malnourish 4 months ago
  
  I liked the tooltips. You should define each term the first time it shows up (MLP for example).

love2read 4 months ago

Is it becoming a thing to misspell and add grammatical mistakes on purpose to show that an LLM didn't write the blog post? I noticed several spelling mistakes in Karpathy's blog post that this article is based on and in this article.

klysm 4 months ago

I expect this kind of counter signaling to become more common in the coming years.
efilife 4 months ago

You just started to notice it
refulgentis 4 months ago

People aren't gonna be happy I spell this out, but, Karpathy's not The Dude.
He's got a big Twitter following so people assume something's going on or important, but he just isn't.
Biggest thing he did in his career was feed Elon's Full Self Driving delusion for years and years and years.
Note, then, how long he lasted at OpenAI, and how much time he spends on code golf.
If you're angry to read this, please, take a minute and let me know the last time you saw something from him that didn't involve A) code golf B) coining phrases.
- fennecbutt 4 months ago
  
  Agree, same as Carmack. They're suit and tie types now.
- tibbar 4 months ago
  
  I have no skin in the game here, but this seems a bit "sharp-edged", do you have something against the guy? He just seems deep into his influencer/retired hobbyist arc to me...
  
  refulgentis 4 months ago
  
  No, and me too. Just had been sitting in my chest a while when I see people expecting non-hobbyist work from him. And had been worried to post it because things you and I understand become sharp-edged when spoken out loud to other people who don't.
- mike_hearn 4 months ago
  
  Tesla FSD isn't a delusion. There are people using it to successfully do long distance drives across the USA right now, without interventions. Dunno how much credit Karpathy gets for that, but the tech works.
  
  refulgentis 4 months ago
  
  I almost edited in something about 2018 vs 2026 but didn’t, trusted you to understand :)
- love2read 4 months ago
  
  Is this AI generated?
  
  refulgentis 4 months ago
  
  What?

grey-area 4 months ago

The original article from Karpathy: https://karpathy.github.io/2026/02/12/microgpt/

kinnth 4 months ago

That was one of the most helpful walkthroughs I've read. Thanks for explaining so well with all of the steps.

I wasn't a coder but with AI I am actually writing code. The more I familiarise myself with everything, the easier it becomes to learn. I find AI fascinating. By making it so simple and clear, it helps when I think about what I need to feed it.

lozzo 4 months ago

This was a beautiful article to stumble upon.

I had seen Karpathy's work - https://karpathy.github.io/2026/02/12/microgpt/ - but still found it too demanding to follow.

This was just the next simplification I needed.

dreamking 4 months ago

It seems that T-Mobile is blocking this website, so I can't open this blog page...

https://www.t-mobile.com/home-internet/http-warning?url=http...

thebiblelover7 4 months ago

I know many comments mentioned that it was too introductory, or too deep. But as someone that does not have much experience understanding how these models work, I found this overview to be pretty great.

There were some concepts I didn't quite understand but I think this is a good starting point to learning more about the topic.

danhergir 4 months ago

I went through the article, and it makes sense to me that we're getting names as an output, but why do this with names?

growingswe 4 months ago

Names are just a random problem to demonstrate the model. It could be anything, I believe.
- danhergir 4 months ago
  
  What if you just use words
  
  aeve890 4 months ago
  
  Probably because names kinda obfuscate the ridiculous impracticality of this exercise. This microgpt can produce a random sequence of letters and by chance it might look like a name. If the thing outputs, let's say, "Kianna", you just think "wow, it IS a name", but is it though? (Idk if it's a real name, at least not in Spanish.) It isn't a normal word, so the randomness of names helps to hide the fact that this gpt just outputs random shit that looks like names. If you just use words you will get mostly random shit that doesn't resemble any real words. Just my hypothesis. I can see the convenience of using names. The output looks like real names, but you can achieve the same result with old AI and very basic algorithms.
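
One example of the "very basic algorithms" mentioned above is a character bigram model: count which letter follows which, then sample from those counts. This is a minimal sketch, assuming a tiny made-up training list; the boundary markers and function names are hypothetical choices for this illustration, and it is not the method used in the article:

```python
import random
from collections import defaultdict

START, END = "<", ">"  # hypothetical boundary markers chosen for this sketch

def train_bigram(names):
    """Count how often each character follows another across the training names."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in names:
        chars = [START] + list(name) + [END]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts

def sample_name(counts, rng, max_len=12):
    """Walk the bigram table from START, sampling each next character by its count."""
    out, prev = [], START
    while len(out) < max_len:
        followers = counts[prev]
        prev = rng.choices(list(followers), weights=list(followers.values()))[0]
        if prev == END:
            break
        out.append(prev)
    return "".join(out)

# Tiny made-up training list; a real run would use the full names dataset.
names = ["anna", "anton", "emma", "karina", "maria", "nina"]
counts = train_bigram(names)
rng = random.Random(0)
print([sample_name(counts, rng) for _ in range(5)])
```

A bigram model only ever looks one character back, so its output tends to be rougher than what a transformer with a longer context produces, but on short strings like names the results can look surprisingly plausible.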
  
  danhergir 4 months ago
  
  In other words, it works well because it doesn't make out weird names.

ChrisArchitect 4 months ago

Microgpt

https://news.ycombinator.com/item?id=47202708