teo_zero 17 hours ago

When you have so few bits, does it really make sense to invent a meaning for the bit positions? Just use an index into a "palette" of pre-determined numbers.

As a bonus, any operation can be replaced with a lookup into an n×n table.

  • childintime 17 hours ago

    Exactly. And pick them on the e^x curve.

  • 0-_-0 14 hours ago

    You want to make multiplication cheap; it's not just about compression

    • mysterydip 12 hours ago

      Wouldn’t multiplication just be an 8-bit lookup table? a*b is just lut[(a << 4) | b]

    • kevmo314 4 hours ago

      Multiplication at this resolution is already implemented via lookup tables.

      • ineedasername 2 hours ago

        For FP4, yes... sometimes... it depends. But newer Nvidia architectures, e.g. Blackwell with NVFP4, do not: they perform micro-block scaling in the core. On older architectures, low quants like FP4 are also often not done natively, and are instead inflated back to BF16, e.g. with BnB.

  • petters 4 hours ago

    That's a good idea and it exists: https://www.johndcook.com/blog/2026/04/18/qlora/

    It seems quite wasteful to have two zeros when you only have 4 bits in total

    • saulpw 3 hours ago

      OTOH, it seems quite plausible that the most important numbers to represent are:

         +0
         -0
         +1
         -1
         +inf
         -inf
      • Dwedit 2 hours ago

        Why waste a slot on -0?

        • saulpw 1 hour ago

          Because it means "infinitesimal negative" which is distinct from "infinitesimal positive".

      • parsimo2010 1 hour ago

        In standard FP32, the infs are represented as a sign bit, all exponent bits=1, and all mantissa bits=0. The NaNs are represented as a sign bit, all exponent bits=1, and a non-zero mantissa. If you used that interpretation with FP4, you'd get the table below, which restricts the representable range to +/- 3, and it feels less useful to me. If you're using FP4 you're probably optimizing for space and don't want to waste a quarter of your possible encodings on things that aren't actually numbers; you'd likely focus your efforts on writing code that doesn't need to represent inf and NaN.

          Bits s exp m  Value
          -------------------
          0000 0  00 0     +0
          0001 0  00 1   +0.5
          0010 0  01 0     +1
          0011 0  01 1   +1.5
          0100 0  10 0     +2
          0101 0  10 1     +3
          0110 0  11 0     +inf
          0111 0  11 1     NaN
          1000 1  00 0     -0
          1001 1  00 1   -0.5
          1010 1  01 0     -1
          1011 1  01 1   -1.5
          1100 1  10 0     -2
          1101 1  10 1     -3
          1110 1  11 0     -inf
          1111 1  11 1     NaN
FarmerPotato 7 hours ago

I too want fewer bits of mantissa in my floating point!

But what I wish is that there had been fp64 encoding with a field for number of significant digits.

strtod() would encode this, fresh out of an instrument reading (serial). It would be passed along. It would be useful EVEN if it weren't updated by arithmetic with other such numbers.

Every day I get a query like "why does the datum have so many decimal digits? You can't possibly be saying that the instrument is that precise!"

Well, it's because of sprintf(buf, "%.16g", x) as the default to CYA.

Also sad is the complaint about "0.56000 ... 01" because someone did sprintf(buf, "%.16f", x).

I can't fix this in one class -- data travels between too many languages and communication buffers.

In short, I wish I had an fp64 double where the last 4 bits were ALWAYS left alone by the CPU.

conaclos 1 day ago

There is a relevant Wikipedia page about minifloats [0]

> The smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values, is a 4-bit float with 1-bit sign, 2-bit exponent, and 1-bit mantissa.

[0] https://en.wikipedia.org/wiki/Minifloat

sc0ttyd 1 day ago

9 years ago, I shared this as an April Fools joke here on HN.

It seems that life is imitating art.

https://github.com/sdd/ieee754-rrp

  • Dylan16807 1 day ago

    > 9 years ago, I shared this as an April Fools joke here on HN.

    That's fun.

    > It seems that life is imitating art.

    You didn't even beat wikipedia to the punch. They've had a nice page about minifloats using 6-8 bit sizes as examples for about 20 years.

    The 4 bit section is newer, but it actually follows IEEE rules. Your joke formats forgot there's an implied 1 bit in the fraction. And how exponents work.

  • nomel 21 hours ago

    Lowest I've used is 8 bit floats for time delays, in embedded devices.

    • the__alchemist 12 hours ago

      Interesting! I have been using integers or f32 for that. What was the use case specifically? Did you write a software float for it? I remember writing an `f16` type for an IC that used one - that was a pain!

Figs 23 hours ago

> The notation ExMm denotes a format with x exponent bits and y mantissa bits.

Shouldn't that be m mantissa bits (not y) -- i.e. typo here -- or am I misunderstanding something?

karmakaze 1 day ago

There's an "Update:" note pointing to a next post on the NF4 format. As far as I can tell this is neither NVFP4 nor MXFP4, which are commonly used with LLM model files. The thing with those formats is that common scaling information is factored out per block, so each is not a format for individual values but for groups of values. I'd like to know more about these (but not enough to go research them myself).

nivertech 14 hours ago

FP2 spec:

  00 -> 0.0
  01 -> 1.0
  10 -> Inf
  11 -> NaN

or

  00 -> 0.0
  01 -> 1.0
  10 -> Inf
  11 -> -Inf
  • 0-_-0 14 hours ago
      00 -> 0.0
      01 ->-0.0
      10 -> Inf
      11 -> -Inf
  • tim333 10 hours ago

    I guess my first car's four speed box was a bit like a FP2 float. Lever forward/back, right/left -> 3.65, 2.15, 1.42, 1.00 ratios.

chrisjj 1 day ago

> Programmers were grateful for the move from 32-bit floats to 64-bit floats. It doesn’t hurt to have more precision

Someone didn't try it on a GPU...

  • kimixa 1 day ago

    Even the latest CPUs have a 2:1 fp64:fp32 performance ratio - plus the effects of 2x the data size in cache and bandwidth use mean you can often get greater than a 2x difference.

    If you're in a numeric heavy use case that's a massive difference. It's not some outdated "Ancient Lore" that causes languages that care about performance to default to fp32 :P

    • adgjlsfhk1 1 day ago

      > languages that care about performance to default to fp32

      What do you mean by this? In C 1.0 is a double.

      • kimixa 1 day ago

        But the "float" typename is generally fp32 - if we assume the "most generically named type" is the "default". Though this is a bit of an inconsistency with C - the type name "double" surely implies it's double the expected baseline while, as you mentioned, constants and much of libm default to 'double'.

    • pixelesque 1 day ago

      > Even the latest CPUs have a 2:1 fp64:fp32 performance ratio

      Not completely - for basic operations (and ignoring data size effects like cache hit ratio and memory bandwidth), if you look at, say, Agner Fog's optimisation PDFs of instruction latencies, the basic SSE/AVX add/sub/mul/div (yes, even divides these days) almost always have the same latency for float and double on the most recent AMD/Intel CPUs (and execution ports can normally handle both now).

      Where it differs is gather/scatter and some shuffle instructions (larger size to work on), and maths routines like transcendentals - sqrt(), sin(), etc, where the backing algorithms (whether on the processor in some cases or in libm or equivalent) obviously have to do more work (often more iterations of refinement) to calculate the value to greater precision for f64.

      • kimixa 1 day ago

        > ... if you look at (say Agner Fog's optimisation PDFs of instruction latency) ...

        That.... doesn't seem true? At least for most architectures I looked at?

        While it's true that ADDPS and ADDPD have the same latency (using the Zen 4 example at least), the double variant only calculates 4 fp64 values compared to the single-precision's 8 fp32. Which was my point? If each double-precision instruction processes a smaller number of inputs, it needs to be lower latency to keep the same operation rate.

        And DIV also has a significantly lower throughput for fp64 vs fp32 on Zen 4 (5 clk/op vs 3), while also processing half the values?

        Sure, if you're doing scalar fp32/fp64 instructions it's not much of a difference (though DIV still has a lower throughput) - but then you're already leaving significant peak flops on the table, so I'm not sure it's a particularly useful comparison. It's just the truism of "if you're not performance limited you don't need to think about performance" - which has always been the case.

        So yes, they do at least have a 2:1 difference in throughput on zen4 - even higher for DIV.

        • adgjlsfhk1 1 day ago

          This depends largely on your operations. There is lots of performance critical code that doesn't vectorize smoothly, and for those operations, 64 bit is just as fast.

          • kimixa 22 hours ago

            Yes, if you're not FP ALU limited (which is likely the case if not vectorized), or data cache/bandwidth/thermally limited from the increased cost of fp64, then it doesn't matter - but as I said that's true for every performance aspect that "doesn't matter".

            That doesn't mean that there are no situations where it does matter today - which is what I feel is implied by calling it "Ancient".

        • pixelesque 14 hours ago

          Well, maybe not all admittedly, and I didn't look at AVX2/512, but it looks like `_mm_div_ps` and `_mm_div_pd` are identical for divide, at the 4-wide level for the basics.

          Obviously, the wider you go, the more constrained you are on infrastructure and how many ports there are.

          My point was more it's very often the expensive transcendentals where the performance difference is felt between f32 and f64.

      • omoikane 1 day ago

        > the latency between float and double is almost always the same on the most recent AMD/Intel CPUs

        If you are developing for ARM, some systems have hardware support for FP32 but use software emulation for FP64, with noticeable performance difference.

        https://gcc.godbolt.org/z/7155YKTrK

  • Sharlin 1 day ago

    Yeah, and even on CPU using doubles is almost unheard of in many fields.

Panzerschrek 18 hours ago

This doesn't look like a good floating point format. NaNs and Infs are missing.

ant6n 1 day ago

> In ancient times, floating point numbers were stored in 32 bits.

I thought in ancient times, floating point numbers used to be 80 bit. They lived in a funky mini stack on the coprocessor (x87). Then one day, somebody came along and standardized those 32 and 64 bit floats we still have today.

  • _trampeltier 1 day ago

    80 bits is just inside the processor. That's why you might get a slightly different result, depending on the order of operations and whether an intermediate was stored to RAM along the way.

  • convolvatron 1 day ago

    I was going to reply that just because Intel did something funny doesn't mean that it was the beginning of the story. But it turns out that the release of the 8087 predates the ratification of IEEE floats by 2 years. In addition, the primary numeric designer for the 8087 was apparently Kahan, which means that they were both part of the same design process. Of course there were other formats predating both of these.

    • indolering 1 day ago

      The floating point "standard" was basically codifying multiple different vendor implementations of the same idea. Hence the mess of floating point not being consistent across implementations.

      • jcranmer 1 day ago

        IEEE 754 basically had three major proposals that were considered for standardization. There was the "KCS draft" (Kahan, Coonen, Stone), which was the draft implemented for the x87 coprocessor. There was DEC's counter proposal (aka the PS draft, for Payne and Strecker), and HP's counter proposal (aka the FW draft, for Fraley and Walther). Ultimately, it was the KCS draft that won out and became what we now know as IEEE 754.

        One of the striking things, though, is just how radically different KCS was. By the time IEEE 754 forms, there is a basic commonality of how floating-point numbers work. Most systems have a single-precision and double-precision form, and many have an additional extended-precision form. These formats are usually radix-2, with a sign bit, a biased exponent, and an integer mantissa, and several implementations had hit on the implicit integer bit representation. (See http://www.quadibloc.com/comp/cp0201.htm for a tour of several pre-IEEE 754 floating-point formats). What KCS did that was really new was add denormals, and this was very controversial. I also think that support for infinities was introduced with KCS, although there were more precedents for the existence of NaN-like values. I'm also pretty sure that sticky bits as opposed to trapping for exceptions was considered innovative. (See, e.g., https://ethw-images.s3.us-east-va.perf.cloud.ovh.us/ieee/f/f... for a discussion of the differences between the early drafts.)

        Now, once IEEE 754 came out, pretty much every subsequent implementation of floating-point has started from the IEEE 754 standard. But it was definitely not a codification of existing behavior when it came out, given the number of innovations that it had!

  • Sharlin 1 day ago

    x87 always had a choice of 32/64/80-bit user-facing floats. It just operated internally on 80 bits.

brcmthrowaway 23 hours ago

Does Apple GPU support any of these natively?

Or does that matter - it's the kernel that handles the FP format?

burnt-resistor 1 day ago

FP4 1:2:0:1 (other examples: binary32 1:8:0:23, 8087 ep 1:15:1:63)

S:E:l:M

S = sign bit present (or magnitude-only absolute value)

E = exponent bits (typically biased by 2^(E-1) - 1)

l = explicit leading integer present (almost always 0 because the leading digit is always 1 for normals, 0 for denormals, and not very useful for special values)

M = mantissa (fraction) bits

The limitation of FP4 is that it lacks infinities, [sq]NaNs, and denormals, which restricts it to special purposes only. There's no denying that it can be extremely efficient for particular problems.

If a more even distribution were needed, a simpler fixed point format like 1:2:1 (sign:integer:fraction bits) is possible.