NOTE: this is critiquing the author's 4NF definition (from a link in TFA), not TFA itself.
> If you read any text that defines 4NF, the first new term you hear is “multivalued dependency”. [Kent 1983] also uses “multivalued facts”. I may be dumb but I only very recently realized that it means just “a list of unique values”. Here it would be even better to say that it’s a list of unique IDs.
This is an inaccurate characterization, and the rest of the post only makes sense when viewed through this strawman. The reason 4NF is explained in the "weird, roundabout way" is that it demonstrates [one of] the precise problem[s] the normal form sets out to solve: a combinatorial explosion of rows.
If you have a table:
CREATE TABLE Product(
    product_id INT NOT NULL,
    supplier_id INT NOT NULL,
    warehouse_id INT NOT NULL
);
If you only ever add an additional supplier or an additional warehouse for a given product, it's only adding one row. But if you add both to the same product, you now have 4 rows for a single product; if you add 5 suppliers and 3 warehouses to the same product, you now have 15 rows for a single product, etc. This fact might be lost on someone if they're creating a table with future expansion in mind without thinking it through, because they'd never hit the cross-product, so the design would seem reasonable.
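To make the arithmetic concrete, here is a small sketch using SQLite from Python (the `Product` table follows the DDL above; the decomposed table names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product(product_id INT, supplier_id INT, warehouse_id INT)")

# 5 suppliers x 3 warehouses for product 1: every combination must be stored,
# or the table starts asserting spurious supplier-ships-from-warehouse facts.
conn.executemany(
    "INSERT INTO Product VALUES (1, ?, ?)",
    [(s, w) for s in range(5) for w in range(3)],
)
rows = conn.execute("SELECT COUNT(*) FROM Product").fetchone()[0]
print(rows)  # 15 rows for a single product

# The 4NF decomposition: each independent multivalued fact gets its own table.
conn.execute("CREATE TABLE ProductSupplier(product_id INT, supplier_id INT)")
conn.execute("CREATE TABLE ProductWarehouse(product_id INT, warehouse_id INT)")
conn.executemany("INSERT INTO ProductSupplier VALUES (1, ?)", [(s,) for s in range(5)])
conn.executemany("INSERT INTO ProductWarehouse VALUES (1, ?)", [(w,) for w in range(3)])
total = conn.execute(
    "SELECT (SELECT COUNT(*) FROM ProductSupplier)"
    " + (SELECT COUNT(*) FROM ProductWarehouse)"
).fetchone()[0]
print(total)  # 8 rows, and adding a 6th supplier adds exactly 1 row
```

Adding a supplier to the decomposed form is always one row; in the combined table it multiplies by the current warehouse count.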
The conclusion reached (modulo treating an array as an atomic value) is in fact in 4NF, but it doesn't make any sense why it's needed if you redefine multivalued dependency to mean a set.
In a roundabout way this article captures well why I don't really like thinking in terms of "normal forms", especially as a numbered list like that. The key insights are really 1. Avoid redundancy and 2. This may involve synthesizing relationships that don't immediately obviously exist from a human perspective. Both of those can be expanded on at quite some length, but I never found much value in the supposedly-blessed intermediate points represented by the nominally numbered "forms". I don't find them useful either for thinking about the problem or for communicating about it.
Someone, somewhere writing down a list and that list being blessed with the imprimatur of Academic Approval (TM) doesn't mean it is actually useful... sometimes it just means that it made it easy to write multiple choice test questions. (e.g., "What does Layer 2 of the OSI network model represent? A: ... B: ... C: ... D: ..." to which the most appropriate real-world answer is "Who cares?")
> Someone, somewhere writing down a list and that list being blessed with the imprimatur of Academic Approval (TM)
One problem is that normal forms are underspecified even by the academy.
E.g., Millist W. Vincent "A corrected 5NF definition for relational database design" (1997) (!) shows that the traditional definition of 5NF was deficient. 5NF was introduced in 1979 (I was one year old then).
2NF and 3NF should basically be merged into BCNF, if I understand correctly, with BCNF treated as the general case (as per Darwen).
Well, we are roughly the same age then. Ours is a cynical generation.
"One problem is that normal forms are underspecified even by the academy."
The cynic in me would say they were doing their job by the example I gave, which is just to provide easy test answers, after which there wasn't much reason to iterate on them. I imagine waving around normalization forms was a good gig for consultants in the 1980s, but I bet even then the real practitioners had a skeptical, arm's-length relationship with them.
> I imagine waving around normalization forms was a good gig for consultants in the 1980s, but I bet even then the real practitioners had a skeptical, arm's-length relationship with them.
Real-talk: those consultants are absolutely essential - and are the unsung heroes of so many "organic" database projects that would have gotten started as an Excel spreadsheet on a nontechnical middle-manager's workgroup-networked desktop, which grew over time into a dBase file, then MSAccess/JET, then MSDE or MSSQL Express if they (think they) know what they're doing, and then if it's the mid-2000s then maybe it'll be moved onto a dedicated on-prem Oracle or MSSQL box - but still an RDBMS; I remember in 2014 all the talk was about moving data out of on-prem RDBMS siloes and onto Cloud(TM)-y OLAP clusters (trying to hide the fact they're running stock Postgres) which acted as a source for a Hadoop cluster - all to produce dashboards and visualizations made with the $100k Tableau license your company purchased after their sales guys showed your org's procurement people a good time in Cancun.
None of the evolution and progress described above could have happened if not for the awful DB designs in that initial Access DB - the anti-patterns would be carried through the DB whenever it ascended to the next tier of serious-business-ness, and each and every design-decision made out of innocent ignorance gets gradually massaged-out of the model by the regular and recurring visits by DBA consultants - because (and goddamnit it's true): a depressingly tiny proportion of software people (let alone computer-people) know anything about DB design+theory - nor all the vendor-specific gotchas.
What I still don't understand is how in 2026 - after 30 years of scolding beginners online - we've successfully gotten greenhorn software-dev people to move away from VBA/VB6's dead-end, PHP's unintentional fractal of bad design, and MySQL's meh-ness - and onto sane and capable platforms like TypeScript, Node, and Postgres - all good stuff; and yet on my home turf of StackOverflow, I still see people writing SQL-89-style comma JOINs and CREATE TABLE statements covered in more backticks than my late grandmother's labrador. I honestly have no idea where, when, or how all those people are still learning that obsolete JOIN syntax today.
So in conclusion: the evidence suggests that not enough people today truly understand databases well-enough to render expensive DBA consultants irrelevant.
> Also, personally I think that 6NF should be foundational, but that's a separate matter.
I share your ideal, but there exists a slight problem: no RDBMS I'm aware of really facilitates 6NF or DKNF (or even Codd's full relational concept; or newfound essentials like relational-division, and so on...).
There are also genuine ergonomic issues to contend with: pretty much every RDBMS design and/or administration tool I've used in the past 20 years (SSMS, SSDT, DBeaver, MSAccess (lol), phpMyAdmin, etc.) will present the database as a long, flat list of tables - often only in alphabetical order. If you're lucky, the tooling might let you group the tables into logical subfolders based on some kind of 2-part name scheme baked into the RDBMS (e.g. "schemas" in MSSQL).
...which starts being counterproductive when 6NF means you have a large number of tables that absolutely need to exist - but aren't really that significant by themselves; but they always need to remain accessible to the user of the tool (so they can't be completely hidden). So you'll turn to the Diagramming feature in your DB GUI, which gives you a broader 2D view of your DB where you can proximally group related objects together - instead of endlessly scrolling a long alphabetical list; and you can actually see FKs represented by physical connections, which aids intuitive grokking when you're mentally onboarding onto a huge, legacy production DB design.
...but DB diagrams are just too slow to load (as the tooling needs to read the entire DB's schema, design, and all objects first before it can give you a useful view of everything) - it's just so incredibly grating, whereas that alphabetical list loads instantly.
Sorry, I'm just rambling now, but anyway, my point is: 6NF is great, but our tooling sucks, and the RDBMSs they connect to suck even more. SQL-92 defined the main CONSTRAINT types seen in practically all RDBMSs today (CHECK, FOREIGN KEY, UNIQUE, and PRIMARY KEY); over 30 years later we still have the same anaemic set of primitive constraints; only Postgres went further (with its `EXCLUDE` constraint). As of 2026, more than 30 years after it was defined, no mainstream RDBMS supports ASSERTION constraints; whither DOMAIN constraints and a unified type system that elegantly mediates between named scalars, relations (unordered sets of tuples), queries, DOMAINs, and the rest.
...this situation is maddening to me because so many data-modelling problems exist _because_ of how unevolved our RDBMS are.
I could have worded my post a bit better - I didn't mean to imply DBeaver only showed a flat list of tables/objects; but DBeaver is hardly unique in having DB diagrams; my point was that every DB-diagram feature/tool/workspace in a DB admin/IDE (like DBeaver, SSMS, SSDT, etc) is necessarily performance-constrained because they need to load _so much_ metadata before they can show an accurate - and therefore useful - picture of the DB - even if it's just a subset of all tables/objects.
You can do what you do at the MAC layer without any regard for whether or not it is "OSI layer 2", or whether your MAC layer "cheats" and has features that extend into layers 1, or 3, or any other layer. Failing to implement something useful because "that's not what OSI layer 2 is and this is data layer 2 and the OSI model says not to do that" is silly.
To stay on the main topic, same for the "normalization forms". Do what your database needs.
The concepts are just attractive nuisances. They are more likely to hurt someone than to help them.
OSI is particularly obnoxious because layers 5 and 6 don't exist separately in practically any system. Application layer protocols handle them in their own bespoke ways, so we have a software stack consisting of layers 2, 3, 4, 7, like it's the pentatonic scale or something.
Breaking 1NF is essentially always incorrect. You're fundamentally limiting your system, and making it so that you will struggle to perform certain queries. Only break 1NF when you're absolutely 100% certain that nobody anywhere will ever need to do anything even slightly complex with the data you're looking at. And then, probably still apply 1NF anyways. Everyone that ever has to use your system is going to hate you when they find this table because you didn't think of the situation that they're interested in. "Why does this query use 12 CTEs and random functions I've never heard of and take 5 minutes to return 20,000 rows?" "You broke 1NF."
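As an illustration of that pain, here is a sketch (with a made-up `article` table storing tags as a comma-separated string) of what even a simple membership query looks like against a broken-1NF column, next to the normalized version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article(id INT, tags TEXT)")
conn.executemany("INSERT INTO article VALUES (?, ?)",
                 [(1, "sql,db"), (2, "db,design"), (3, "sql")])

# "Which articles are tagged 'sql'?" A naive LIKE '%sql%' would also match a
# hypothetical 'nosql' tag, so you end up doing delimiter gymnastics, which
# can never use an ordinary index:
hits = conn.execute(
    "SELECT id FROM article WHERE ',' || tags || ',' LIKE '%,sql,%' ORDER BY id"
).fetchall()
print(hits)  # [(1,), (3,)]

# The 1NF version is a plain, indexable equality predicate:
conn.execute("CREATE TABLE article_tag(article_id INT, tag TEXT)")
conn.executemany("INSERT INTO article_tag VALUES (?, ?)",
                 [(1, "sql"), (1, "db"), (2, "db"), (2, "design"), (3, "sql")])
hits2 = conn.execute(
    "SELECT article_id FROM article_tag WHERE tag = 'sql' ORDER BY article_id"
).fetchall()
print(hits2)  # [(1,), (3,)]
```

And this is the easy case; joins, counts, and updates against the packed string get much worse.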
2NF is usually incorrect to break. Like it's going to be pretty obnoxious to renormalize your data using query logic, but it won't come up nearly as frequently. If it's really never going to come up that often in practical terms, then okay.
3NF and BCNF are nice to maintain, but the number of circumstances where they're just not practical or necessary starts to feel pretty common. Further, the complexity of the query to undo the denormalization will not be as obnoxious as it is for 1NF or 2NF. But if you can do it, you probably should normalize to here.
4NF and higher continue along the same lines, but increasingly get into what feel like pretty arbitrary requirements, or situations where the cost you're paying in indexes starts to become higher than the relational algebra benefits. Your database disk usage by table report is going to be dominated by junction tables, foreign key constraints, and indexes, and all you're really buying with that disk space is academic satisfaction.
> Your database disk usage by table report is going to be dominated by junction tables, foreign key constraints, and indexes, and all you're really buying with that disk space is academic satisfaction.
FK constraints add a negligible amount of space, if any. The indexes they require do, certainly, but presumably you're already doing joins on those FKs, so they should already be indexed.
Junction tables are how you represent M:N relationships. If you don't have them, you're either storing multiple values in an array (which, depending on your POV, may or may not violate 1NF), or you have a denormalized wide table with multiple attributes, some of which are almost certainly NULL.
Also, these all serve to prevent various forms of data anomalies. Databases must be correct above all else; if they're fast but wrong, they're useless.
> Junction tables are how you represent M:N relationships.
Yeah, the problem is that when you get to 4NF+, you're often creating a new table joined through a junction table for a single multi-valued field that may hold just a single value a plurality or majority of the time. So you need the base table, the junction table with at least two columns, and the actual data table.
So, you've added two tables, two foreign key constraints, two primary key indexes, potentially more non-clustered indexes... and any query means you need two joins. And data validation is hard because you need to use an anti-join to find missing data.
Or, you can go with a 1:N relationship. Now you have only one more table at the cost of potentially duplicating values between entities. But if we're talking about, say, telephone numbers? Sure, different entities might share the same phone number. Do you need a junction table so you don't duplicate a phone number? You're certainly not saving disk space or improving performance by doing that unless there are regularly dozens of individual records associated with a single phone number.
And if the field is 1:1... or even 90% or 95% 1:1... do you really need a separate table just so you don't store a NULL in a column? You're not going to be eliminating nulls from your queries. They'll be full of LEFT JOINs everywhere; three-valued logic isn't going anywhere.
> Databases must be correct above all else; if they're fast but wrong, they're useless.
Yeah, and if they're "correct" but you can't get it to return data in a timely manner, they're also useless. A database that's a black hole is not an improvement. If it takes 20 joins just to return basic information, you're going to run into performance problems as well as usability problems. If 18 of those joins are to describe fidelity that you don't even need?
Right. But faceting data is also part of what a good database designer does. That includes views over the data; materialisation, if it is justified; stored procedures and cursors.
I've never had to do 18 joins to extract information in my career. I'm sure these cases do legitimately exist but they are of course rare, even in large enterprises. Most companies are more than capable of distinguishing OLTP from OLAP and real-time from batch and design (or redesign) accordingly.
Databases and their designs shift with the use case.
> joining through a junction table for a single multi-valued data field
I may be misunderstanding you, but to me it sounds like you're conflating domain modeling with schema modeling. If your domain is like most SaaS apps, then Phone, Email, Address, etc. are probably all attributes of a User, and are 1:N. The fact that multiple Users may share an Address (either from multiple people living together, or people moving) doesn't inherently mean you have an M:N relationship that you must model with schema. If you were using one of those attributes as an identity (e.g. looking up a customer by their phone number), that still doesn't automatically mean you have to model everything as M:N - you could choose to accept the possibility of duplicates that you have to deal with in application code or by a human, or you could choose to create a UNIQUE constraint that makes sense for 99% of your users (e.g. `(phone_number, deactivated_at)` enforces that a phone number is only assigned to one active user at a time), and find another way to handle the rare exceptions. In both cases, you're modeling the schema after your business logic, which is IMO the correct way to do so.
I apologize if I came across as implying that any possible edge case means that you must change your schema to handle it. That is not my design philosophy. The schema model should rigidly enforce your domain model, and if your domain model says that a User has 0+ PhoneNumber, then you should design for 1:N.
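One caveat worth noting on the `(phone_number, deactivated_at)` idea above: a plain composite UNIQUE won't enforce it in most engines, because every NULL `deactivated_at` compares as distinct. A partial unique index (supported by Postgres and SQLite, among others) captures the intent directly. A sketch with made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user(
    id INTEGER PRIMARY KEY,
    phone TEXT NOT NULL,
    deactivated_at TEXT NULL)""")

# "A phone number belongs to at most one ACTIVE user at a time":
conn.execute("""CREATE UNIQUE INDEX one_active_phone
                ON user(phone) WHERE deactivated_at IS NULL""")

conn.execute("INSERT INTO user(phone, deactivated_at) VALUES ('555-0100', NULL)")
# A deactivated former owner of the same number is fine:
conn.execute("INSERT INTO user(phone, deactivated_at) VALUES ('555-0100', '2024-01-01')")

try:
    # A second ACTIVE user with the same number is rejected:
    conn.execute("INSERT INTO user(phone, deactivated_at) VALUES ('555-0100', NULL)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```

The schema stays 1:N and the business rule lives in one declarative place, which is exactly the "model the schema after your business logic" point.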
> And if the field is 1:1... or even 90% or 95% 1:1... do you really need a separate table just so you don't store a NULL in a column? You're not going to be eliminating nulls from your queries. They'll be full of LEFT JOINs everywhere; three-valued logic isn't going anywhere.
If the attribute is mostly 1:1, then whether or not you should decompose it largely comes down to semantic clarity, performance, and the possibility of expansion.
This table is in 3NF (and BCNF, and 4NF):
CREATE TABLE User (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    email VARCHAR(254) NOT NULL,
    phone VARCHAR(32) NULL
);
So is this:
CREATE TABLE User (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    email VARCHAR(254) NOT NULL,
    phone_1 VARCHAR(32) NULL,
    phone_2 VARCHAR(32) NULL
);
Whereas this may violate 3NF depending on how you define a Phone in your domain:
CREATE TABLE User (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    email VARCHAR(254) NOT NULL,
    phone_1 VARCHAR(32) NULL,
    phone_1_type ENUM('HOME', 'CELL', 'WORK') NULL,
    phone_2 VARCHAR(32) NULL,
    phone_2_type ENUM('HOME', 'CELL', 'WORK') NULL
);
If a Phone is still an attribute of a User, and you're not trying to model the Phone as its own entity, then arguably `phone_1_type` is describing how the User uses it (I personally think this is a bit of a stretch). Similarly, it can be argued that this design violates 1NF, because `(phone_n, phone_n_type)` is a repeating group, even if you've split it out into two columns. Either way, I think it's a bad design (adding two more columns that will be NULL for most users to support a tiny minority isn't great, and the problem compounds over time).
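A quick sketch of how the repeating group bites in queries (SQLite standing in for the MySQL DDL above, with ENUM replaced by TEXT): even "who has a CELL number?" must OR across every column, and breaks again the day `phone_3` is added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE user(
    id INTEGER PRIMARY KEY, name TEXT,
    phone_1 TEXT, phone_1_type TEXT,
    phone_2 TEXT, phone_2_type TEXT)""")
conn.execute("INSERT INTO user VALUES (1, 'a', '111', 'HOME', '222', 'CELL')")
conn.execute("INSERT INTO user VALUES (2, 'b', '333', 'CELL', NULL, NULL)")

# Repeating-group design: one predicate per column pair.
wide = conn.execute(
    "SELECT id FROM user WHERE phone_1_type = 'CELL' OR phone_2_type = 'CELL'"
    " ORDER BY id"
).fetchall()

# Decomposed design: one predicate, no matter how many phones per user.
conn.execute("CREATE TABLE user_phone(user_id INT, phone TEXT, type TEXT)")
conn.executemany("INSERT INTO user_phone VALUES (?, ?, ?)",
                 [(1, '111', 'HOME'), (1, '222', 'CELL'), (2, '333', 'CELL')])
narrow = conn.execute(
    "SELECT DISTINCT user_id FROM user_phone WHERE type = 'CELL' ORDER BY user_id"
).fetchall()
print(wide, narrow)  # [(1,), (2,)] [(1,), (2,)]
```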
> If it takes 20 joins just to return basic information, you're going to run into performance problems as well as usability problems. If 18 of those joins are to describe fidelity that you don't even need?
The only times I've seen anything close to that many joins are:
1. Recreating a denormalized table from disparate sources (which are themselves often not well-constructed) to demonstrate that it's possible.
2. Doing some kinds of queries in MySQL <= 5.7 on tables modeling hierarchical data using an adjacency list, because it doesn't have CTEs.
3. When product says "what if we now supported <wildly different feature from anything currently offered>" and the schema was in no way designed to support that.
Even with the last one, I think the most I saw was 12, which was serendipitous because it's the default `geqo_threshold` for Postgres.
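For anyone who hasn't seen the fix for case 2: with `WITH RECURSIVE` (MySQL 8.0+, Postgres, SQLite), the one-self-join-per-level pile collapses into a single query that walks the hierarchy to any depth. A minimal sketch with a made-up `category` adjacency list:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category(id INT, parent_id INT)")
# 1 is the root; 2 -> 3 -> 4 is a chain; 5 is another child of 1.
conn.executemany("INSERT INTO category VALUES (?, ?)",
                 [(1, None), (2, 1), (3, 2), (4, 3), (5, 1)])

# All descendants of category 2 (including itself), at any depth:
descendants = conn.execute("""
    WITH RECURSIVE tree(id) AS (
        SELECT id FROM category WHERE id = 2
        UNION ALL
        SELECT c.id FROM category c JOIN tree t ON c.parent_id = t.id
    )
    SELECT id FROM tree ORDER BY id
""").fetchall()
print(descendants)  # [(2,), (3,), (4,)]
```

Before CTEs, answering this to depth N meant N left self-joins and still guessing the maximum depth wrong.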
What always frustrates me is that when people on here discuss deeply technical and/or meta-aspects of programming (e.g. type theory), it's taken at face value, but the same is not true of databases. They are generally treated as a dumb data store that you can throw anything into, and when someone explains why that's a bad idea, or why an academic concept like normal form is still applicable, it's met with criticism.
Even when it's purely performance-related, it usually gets a shrug, and "it's good enough." Cool, you're wrecking the B+tree, maybe don't do that. It's as if I said, "I'm using an array to store these millions of items that I later need to de-duplicate," and when someone suggests maybe using a set, I dismiss it.
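The array-vs-set analogy is easy to make concrete; it is just the textbook O(n) vs O(1) membership test, with made-up data:

```python
import timeit

items = list(range(2000)) * 2  # 4000 items, half duplicates

def dedup_list():
    seen, out = [], []
    for x in items:
        if x not in seen:       # linear scan of `seen` every time: O(n) per item
            seen.append(x)
            out.append(x)
    return out

def dedup_set():
    seen, out = set(), []
    for x in items:
        if x not in seen:       # hash lookup: O(1) per item
            seen.add(x)
            out.append(x)
    return out

assert dedup_list() == dedup_set()  # same result either way
t_list = timeit.timeit(dedup_list, number=3)
t_set = timeit.timeit(dedup_set, number=3)
print(t_set < t_list)  # True, and the gap widens with n
```

"Good enough" at 4,000 items becomes minutes at millions; the B+tree abuse the parent describes has the same shape.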
Agreed. In practice I just ask "am I storing the same fact in two places?" & fix it if yes. Never once sat down and thought "let me check if this is in 4NF specifically."
Imperative mood "normalize" assumes that you had something not-normalized before you received that instruction. It's not useful when your table design strategy is already normalization-preserving, such as the most basic textbook strategy (a table per anchor, a column per attribute or 1:N link, a 2-column table per M:N link).
And this is basically the main point of my critique of 4NF and 5NF. They both traditionally present an unexplained table that is supposed to be normalized. But it's not clear where this original structure comes from. Why are its own authors not aware of the (arguably quite simple) concept of normalization?
It's like saying that in order to implement an algorithm you have to remove bugs from its original implementation; where does this implementation come from?
The other side of this coin is that lots of real-world designs have a lot of denormalized representations that are often reasonably well engineered.
Because of that if you, as a novice, look at a typical production schema, and you have this "thou shalt normalize" instruction, you'll be confused.
> But it's not clear where this original structure comes from. Why are its own authors not aware of the (arguably quite simple) concept of normalization?
I find the bafflement expressed in the article as well as the one linked extremely attractive. It made both a joy to read.
Were I to hazard a guess: might it be a consequence of the lack of disk space in those early decades, resulting in developers being cautious about defining new tables and failing to rationalise that the duplication in their tragic designs would result in more space wasted?
> The other side of this coin is that lots of real-world designs have a lot of denormalized representations that are often reasonably well engineered.
Agreed, but as the OP comment stated they usually started out normalised and then pushed out denormalised representations for nice contiguous reads.
As a victim of maintaining a stack on top of an EAV schema once upon a time, I have great appreciation for contiguous reads.
> Normalization-as-process makes sense in a specific scenario: When converting a hierarchical database model into a relational model.
That makes much more sense as reasoning.
If I can also hazard a second guess: I used to work in embedded in the 2000s, and it was absolutely insane how almost all of the elderly architects and developers would readily accept some fixed-width file format for data storage over a sensible solution that offered out-of-the-box transactionality and relational modelling, like SQLite. This creates a mindset where each datastore is effectively siloed and must contain all the information to perform the operation, potentially leading to these denormalised designs.
Bit weird, given that was from the waterfall era, implying that the "Big Design Up Front" wasn't actually doing any real thinking about modelling up front. But I've been in that room, and I think a lot of it was cargo cult. To deal with the insanity of simple file I/O as data, I had to write a rudimentary atomicity system from scratch to fix the dumb corruption issues of their design, when I would have got that for free with SQLite.
The numbered forms are most useful as a teaching device, not an engineering specification. Once you have internalized 2NF and 3NF violations through a few painful bugs, you start spotting partial and transitive dependencies by feel rather than by running through definitions. The forms gave you the vocabulary. The bugs gave you the instinct.
I'll go one further and say that if you're reaching for DISTINCT and you have joins, you may have joined the data the wrong way. It's not a RULE, but it's ALWAYS a 'smell' when I see a query that uses DISTINCT to shove away duplicate matches. I always add a comment for the exceptions.
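A minimal illustration of that smell (made-up `customer`/`orders` tables): the join fans out one row per order, DISTINCT papers over the fan-out, and an EXISTS states the actual question without it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer(id INT, name TEXT)")
conn.execute("CREATE TABLE orders(id INT, customer_id INT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, 'a'), (2, 'b')])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, 1), (12, 1)])

# "Customers with at least one order", joined naively: customer 1 appears
# once per order, so the raw result fans out to 3 rows.
raw = conn.execute(
    "SELECT c.id FROM customer c JOIN orders o ON o.customer_id = c.id"
).fetchall()

# The smell: DISTINCT shoves the duplicates away after the fact.
smell = conn.execute(
    "SELECT DISTINCT c.id FROM customer c"
    " JOIN orders o ON o.customer_id = c.id"
).fetchall()

# The fix: EXISTS asks the question directly, with no fan-out to undo.
fixed = conn.execute(
    "SELECT c.id FROM customer c"
    " WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)"
).fetchall()
print(len(raw), smell, fixed)  # 3 [(1,)] [(1,)]
```

Same answer, but the EXISTS version documents intent and gives the planner a semi-join instead of a join plus de-duplication.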
It depends on if you are doing OLTP (granular, transactional) vs OLAP (fact/date based aggregates) - dates are generally not something you'd consider in a fully normalized flow to uniqify records.
The article makes a good point about when 5NF becomes impractical. In my experience, stopping at BCNF or 4NF often strikes a better balance unless you have very clear join dependencies. How do others decide where to stop normalizing in real-world apps?
I have so many questions about that. Should that normal form basically replace 5NF for the purposes of teaching?
Why do they hate us and not provide any illustrative real-life example without using algebraic notation? Is it even possible?
I just want to see a CREATE TABLE statement, and some illustrative SELECT statements. The standard examples always give just the dataset, but dataset examples are often ambiguous.
> (in its joins)
Do you understand what "its" joins are? What even is "it" here?
For me, NF>3 seems like an implicit encoding of the underlying data logic. They impose additional restrictions (usually contrived and artificial, and they break really fast in real life) on data not directly expressed as data tuples. Because of that they are hard to explain; the natural reaction is "why don't you just store the data?".
I haven’t finished reading this but I am commenting because of the form. Lead with the conclusions, table of contents, and then sources? This is someone who is confident in what they write. I wish more writing trusted the audience to decide if the writing were important instead of stringing the audience along. Keep up the good work.
I love reading about the normal forms, because it makes me sound like I know what I'm talking about in the conversation where the backend folks tell me, "if we normalized that data then the database would go down". This is usually followed by arguments over UUID versions for some reason.
Me still using bigints... Which haven't given me any problems. Wouldn't use it for client generated IDs but that is not what most applications require anyway.
This was an attempt to extend the joke, not ask for an explanation: there are a number of normal forms, and people usually talk about "normalization" without being specific, thus conflating all of them; out of 7 UUID versions, only 2 generally make sense for use today, depending on whether you need a time-incrementing version or not.
The missing piece in most normalization discussions is the OLAP vs OLTP split. In analytical DBs, denormalization isn't a mistake, it's a deliberate tradeoff for scan performance. Teaching normal forms without that context sets people up to make the wrong calls when they hit a warehouse workload.
Color me impressed. Even being very well versed in database design myself, this is just pragmatic and straight to the point, the way I'd have liked it back in the day.
I think the main problem of how 4NF and 5NF formal definitions were taught is that essentially common sense (which is mostly "sufficient" to understand 1NF-3NF) starts to slip away, and you start needing the mathematical background that Ed Codd (and others) had. And trying to avoid that is how those weird examples came up.
Over time I’ve developed a philosophy of starting roughly around 3NF and adjusting as the project evolves. Usually this means some parts of the db get demoralize and some get further normalized
>> Usually this means some parts of the db get demoralize
I largely agree with your practical approach, but try and keep the data excited about the process, sell the "new use cases for the same data!" angle :)
This is great. Then I would consider the aggregated, validated, and canonicalized source as a Golden Source. Where I've seen issues is when someone starts to query a nonauthoritative source because they know about it, instead of going upstream to a proper source.
JSON is extremely fast these days. Gzipped JSON perhaps even more so.
I find that JSON blobs up to about 1 megabyte are very reasonable in most scenarios. You are looking at maybe a millisecond of latency overhead in exchange for much denser I/O for complex objects. If the system is very write-intensive, I would cap the blobs around 10-100kb.
I adore the contiguous reads that ideas like that yield. I'd rather push that out to a read-only endpoint than get sucked into the entropy of treating what is effectively an unschema'd blob as editable data.
> You are looking at maybe a millisecond of latency overhead [for 1 megabyte]
Considering the data transfer alone for 1 MB / 1 msec requires 8 Gbps, I have doubts. But for fun, I created a small table in Postgres 18 with an INT PK, and a few thousand JSONB blobs of various sizes, up to 1 MiB. Median timing was 4.7 msec for a simple point select, compared to 0.1 msec (blobs of 3 KiB), and 0.8 msec (blobs of 64 KiB). This was on a MBP M4 Pro, using Python with psycopg, so latency is quite low.
The TOAST/de-TOAST overhead is going to kill you for any blobs > 2 KiB (by default, adjustable). And for larger blobs, especially in cloud solutions where the disk is almost always attached over a network, the sheer number of pages you have to fetch (a 1 MiB blob will nominally consume 128 pages, modulo compression, row overhead, etc.) will add significant latency. All of this will also add pressure to actually useful pages that may be cached, so queries to more reasonable tables will be impacted as well.
RDBMS should not be used to store blobs; it's not a filesystem.
Also, the numeric sequence is not very useful because there are at least four non-numeric forms (https://andreipall.github.io/sql/database-normalization/).
Also, personally I think that 6NF should be foundational, but that's a separate matter.
"1979 (I was one year old then)."
Well, we are roughly the same age then. Ours is a cynical generation.
"One problem is that normal forms are underspecified even by the academy."
The cynic in me would say they were doing their job by the example I gave, which is just to provide easy test answers, after which there wasn't much reason to iterate on them. I imagine waving around normalization forms was a good gig for consultants in the 1980s, but I bet even then the real practitioners had a skeptical, arm's-length relationship with them.
yep. born 1960.
> I imagine waving around normalization forms was a good gig for consultants in the 1980s, but I bet even then the real practitioners had a skeptical, arm's-length relationship with them.
Real-talk: those consultants are absolutely essential - and are the unsung heroes of so many "organic" database projects that would have gotten started as an Excel spreadsheet on a nontechnical middle-manager's workgroup-networked desktop, which grew over time into a dBase file, then MSAccess/JET, then MSDE or MSSQL Express if they (think they) know what they're doing, and then if it's the mid-2000s then maybe it'll be moved onto a dedicated on-prem Oracle or MSSQL box - but still an RDBMS; I remember in 2014 all the talk was about moving data out of on-prem RDBMS siloes and onto Cloud(TM)-y OLAP clusters (trying to hide the fact they're running stock Postgres) which acted as a source for a Hadoop cluster - all to produce dashboards and visualizations made with the $100k Tableau license your company purchased after their sales guys showed your org's procurement people a good time in Cancun.
None of the evolution and progress described above could have happened if not for the awful DB designs in that initial Access DB - the anti-patterns would be carried through the DB whenever it ascended to the next tier of serious-business-ness, and each and every design-decision made out of innocent ignorance gets gradually massaged-out of the model by the regular and recurring visits by DBA consultants - because (and goddamnit it's true): a depressingly tiny proportion of software people (let alone computer-people) know anything about DB design+theory - nor all the vendor-specific gotchas.
What I still don't understand is how in 2026 - after 30 years of scolding beginners online - we've successfully gotten greenhorn software-dev people to move away from VBA/VB6's dead-end, PHP's unintentional fractal of bad design, and MySQL's meh-ness - and onto sane and capable platforms like TypeScript, Node, and Postgres - all good stuff; and yet on my home-turf of StackOverflow, I still see people writing SQL-92 style JOINs and CREATE TABLE statements covered in more backticks than my late grandmother's labrador. I honestly have no idea where/when/how all those people are still learning SQL-92's obsolete JOIN syntax today.
So in conclusion: the evidence suggests that not enough people today truly understand databases well-enough to render expensive DBA consultants irrelevant.
> Also, personally I think that 6NF should be foundational, but that's a separate matter.
I share your ideal, but there exists a slight problem: no RDBMS I'm aware of really facilitates 6NF or DKNF (or even Codd's full relational concept; or newfound essentials like relational-division, and so on...).
There are also genuine ergonomic issues to contend with: pretty-much every RDBMS design and/or administration tool I've used in the past 20 years (SSMS, SSDT, DBeaver, MSAccess (lol), phpMyAdmin, etc) will present the database as a long, flat list of tables - often only in alphabetical order (if you're lucky, the tooling might let you group the tables into logical subfolders based on some kind of 2-part name scheme baked into the RDBMS (e.g. "schemas" in MSSQL)).
...which starts being counterproductive when 6NF means you have a large number of tables that absolutely need to exist - but aren't really that significant by themselves; but they always need to remain accessible to the user of the tool (so they can't be completely hidden). So you'll turn to the Diagramming feature in your DB GUI, which gives you a broader 2D view of your DB where you can proximally group related objects together - instead of endlessly scrolling a long alphabetical list; and you can actually see FKs represented by physical connections, which aids intuitive grokking when you're mentally onboarding onto a huge, legacy production DB design.
...but DB diagrams are just too slow to load (the tooling needs to read the entire DB's schema, design, and all objects before it can give you a useful view of everything) - it's just so incredibly grating, whereas that alphabetical list loads instantly.
Sorry I'm just rambling now but anyway, my point is, 6NF is great, but our tooling sucks, and the RDBMS they connect to suck even more. E.g. SQL-92 defined the 4 main CONSTRAINT types seen in practically all RDBMS today (CHECK, FOREIGN KEY, UNIQUE, and DEFAULT); over 30 years later we still have the same anaemic set of primitive constraints; only Postgres went further (with its `EXCLUDE` constraint). As of 2026, and almost 40 years since it was defined, no RDBMS supports ASSERTION constraints; whither DOMAIN constraints and a unified type-system that elegantly mediates between named scalars, relations (unordered sets of tuples), queries, and DOMAINs and the rest.
...this situation is maddening to me because so many data-modelling problems exist _because_ of how unevolved our RDBMS are.
DBeaver can show a relationship diagram between tables. It's the main reason I've used it at all.
https://dbeaver.com/docs/dbeaver/ER-Diagrams/
I could have worded my post a bit better - I didn't mean to imply DBeaver only showed a flat list of tables/objects; but DBeaver is hardly unique in having DB diagrams; my point was that every DB-diagram feature/tool/workspace in a DB admin/IDE (like DBeaver, SSMS, SSDT, etc) is necessarily performance-constrained because they need to load _so much_ metadata before they can show an accurate - and therefore useful - picture of the DB - even if it's just a subset of all tables/objects.
Why shouldn’t we care about layer 2? You can do really fun and interesting things at the MAC layer.
You can do what you do at the MAC layer without any regard for whether or not it is "OSI layer 2", or whether your MAC layer "cheats" and has features that extend into layers 1, or 3, or any other layer. Failing to implement something useful because "that's not what OSI layer 2 is and this is data layer 2 and the OSI model says not to do that" is silly.
To stay on the main topic, same for the "normalization forms". Do what your database needs.
The concepts are just attractive nuisances. They are more likely to hurt someone than to help them.
OSI is particularly obnoxious because layers 5 and 6 don't exist separately in practically any system. Application layer protocols handle them in their own bespoke ways, so we have a software stack consisting of layers 2,3,4,7 like its the pentatonic scale or something.
The layers do the most important thing in computer science: give discrete and meaningful levels to talk/argue about at the watercooler.
I still see value in the numbering.
Breaking 1NF is essentially always incorrect. You're fundamentally limiting your system, and making it so that you will struggle to perform certain queries. Only break 1NF when you're absolutely 100% certain that nobody anywhere will ever need to do anything even slightly complex with the data you're looking at. And then, probably still apply 1NF anyways. Everyone that ever has to use your system is going to hate you when they find this table because you didn't think of the situation that they're interested in. "Why does this query use 12 CTEs and random functions I've never heard of and take 5 minutes to return 20,000 rows?" "You broke 1NF."
2NF is usually incorrect to break. Like it's going to be pretty obnoxious to renormalize your data using query logic, but it won't come up nearly as frequently. If it's really never going to come up that often in practical terms, then okay.
3NF and BCNF are nice to maintain, but the number of circumstances where they're just not practical or necessary starts to feel pretty common. Further, the complexity of the query to undo the denormalization will not be as obnoxious as it is for 1NF or 2NF. But if you can do it, you probably should normalize to here.
4NF and higher continue along the same lines, but increasingly get to what feel like pretty arbitrary requirements, or situations where the cost you're paying in indexes starts to become higher than the relational algebra benefits. Your database disk usage by table report is going to be dominated by junction tables, foreign key constraints, and indexes, and all you're really buying with that disk space is academic satisfaction.
> Your database disk usage by table report is going to be dominated by junction tables, foreign key constraints, and indexes, and all you're really buying with that disk space is academic satisfaction.
FK constraints add a negligible amount of space, if any. The indexes they require do, certainly, but presumably you're already doing joins on those FKs, so they should already be indexed.
Junction tables are how you represent M:N relationships. If you don't have them, you're either storing multiple values in an array (which, depending on your POV, may or may not violate 1NF), or you have a denormalized wide table with multiple attributes, some of which are almost certainly NULL.
Also, these all serve to prevent various forms of data anomalies. Databases must be correct above all else; if they're fast but wrong, they're useless.
> Junction tables are how you represent M:N relationships.
Yeah, the problem is that when you get to 4NF+, you're often looking at creating a new table, joined through a junction table, for a single multi-valued data field that may be single-valued a plurality or majority of the time. So you need the base table, the junction table that has at least two columns, and the actual data table.
So, you've added two tables, two foreign key constraints, two primary key indexes, potentially more non-clustered indexes... and any query means you need two joins. And data validation is hard because you need to use an anti-join to find missing data.
Or, you can go with an 1:N relationship. Now you have only one more table at the cost of potentially duplicating values between entities. But if we're talking about, say, telephone numbers? Sure, different entities might share the same phone number. Do you need a junction table so you don't duplicate a phone number? You're certainly not saving disk space or improving performance by doing that unless there's regularly dozens of individual records associated to a single phone number.
And if the field is 1:1... or even 90% or 95% 1:1... do you really need a separate table just so you don't store a NULL in a column? You're not going to be eliminating nulls from your queries. They'll be full of LEFT JOINs everywhere; three-valued logic isn't going anywhere.
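To make the tradeoff concrete, here is a hypothetical sketch of both routes (all table and column names are illustrative): the full M:N design with a junction table, and the simpler 1:N design that tolerates duplicated values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Route 1: full M:N -- base table, junction table, data table.
    CREATE TABLE person(person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE phone(phone_id INTEGER PRIMARY KEY, number TEXT UNIQUE);
    CREATE TABLE person_phone(
        person_id INTEGER REFERENCES person,
        phone_id  INTEGER REFERENCES phone,
        PRIMARY KEY (person_id, phone_id)
    );

    -- Route 2: plain 1:N -- one extra table, numbers may repeat.
    CREATE TABLE person_phone_1n(
        person_id INTEGER REFERENCES person,
        number    TEXT
    );

    INSERT INTO person VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO phone VALUES (10, '555-0100');
    INSERT INTO person_phone VALUES (1, 10), (2, 10);
    INSERT INTO person_phone_1n VALUES (1, '555-0100'), (2, '555-0100');
""")

# The M:N route costs two joins per lookup:
mn = conn.execute("""
    SELECT p.name FROM person p
    JOIN person_phone pp ON pp.person_id = p.person_id
    JOIN phone ph ON ph.phone_id = pp.phone_id
    WHERE ph.number = '555-0100'
    ORDER BY p.name
""").fetchall()

# The 1:N route is one join, at the cost of storing '555-0100' twice:
one_n = conn.execute("""
    SELECT p.name FROM person p
    JOIN person_phone_1n pp ON pp.person_id = p.person_id
    WHERE pp.number = '555-0100'
    ORDER BY p.name
""").fetchall()
print(mn == one_n)  # same answer, different storage and join cost
```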
> Databases must be correct above all else; if they're fast but wrong, they're useless.
Yeah, and if they're "correct" but you can't get it to return data in a timely manner, they're also useless. A database that's a black hole is not an improvement. If it takes 20 joins just to return basic information, you're going to run into performance problems as well as usability problems. If 18 of those joins are to describe fidelity that you don't even need?
Right. But faceting data is also part of what a good database designer does. That includes views over the data; materialisation, if it is justified; stored procedures and cursors.
I've never had to do 18 joins to extract information in my career. I'm sure these cases do legitimately exist but they are of course rare, even in large enterprises. Most companies are more than capable of distinguishing OLTP from OLAP and real-time from batch and design (or redesign) accordingly.
Databases and their designs shift with the use case.
> joining through a junction table for a single multi-valued data field
I may be misunderstanding you, but to me it sounds like you're conflating domain modeling with schema modeling. If your domain is like most SaaS apps, then Phone, Email, Address, etc. are probably all attributes of a User, and are 1:N. The fact that multiple Users may share an Address (either from multiple people living together, or people moving) doesn't inherently mean you have an M:N relationship that you must model with schema. If you were using one of those attributes as an identity (e.g. looking up a customer by their phone number), that still doesn't automatically mean you have to model everything as M:N - you could choose to accept the possibility of duplicates that you have to deal with in application code or by a human, or you could choose to create a UNIQUE constraint that makes sense for 99% of your users (e.g. `(phone_number, deactivated_at)` enforces that a phone number is only assigned to one active user at a time), and find another way to handle the rare exceptions. In both cases, you're modeling the schema after your business logic, which is IMO the correct way to do so.
I apologize if I came across as implying that any possible edge case means that you must change your schema to handle it. That is not my design philosophy. The schema model should rigidly enforce your domain model, and if your domain model says that a User has 0+ PhoneNumber, then you should design for 1:N.
> And if the field is 1:1... or even 90% or 95% 1:1... do you really need a separate table just so you don't store a NULL in a column? You're not going to be eliminating nulls from your queries. They'll be full of LEFT JOINs everywhere; three-valued logic isn't going anywhere.
If the attribute is mostly 1:1, then whether or not you should decompose it largely comes down to semantic clarity, performance, and the possibility of expansion.
This table is in 3NF (and BCNF, and 4NF):
So is this:
Whereas this may violate 3NF depending on how you define a Phone in your domain:
If a Phone is still an attribute of a User, and you're not trying to model the Phone as its own entity, then arguably `phone_1_type` is describing how the User uses it (I personally think this is a bit of a stretch). Similarly, it can be argued that this design violates 1NF, because `(phone_n, phone_n_type)` is a repeating group, even if you've split it out into two columns. Either way, I think it's a bad design (adding two more columns that will be NULL for most users to support a tiny minority isn't great, and the problem compounds over time).
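The schema examples this comment refers to ("This table is in 3NF...", "So is this:", "Whereas this may violate 3NF...") did not survive in this copy of the thread. A purely hypothetical reconstruction, inferred only from the column names the comment mentions (`phone_1_type`, `(phone_n, phone_n_type)`), might look like:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reconstructed guesses at the three designs being compared; the
# originals are lost, so treat these as illustrative only.
conn.executescript("""
    -- "In 3NF (and BCNF, and 4NF)": one phone per user.
    CREATE TABLE users_a(
        user_id INTEGER PRIMARY KEY,
        name    TEXT,
        phone   TEXT
    );

    -- "So is this": phones decomposed into their own table.
    CREATE TABLE users_b(user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_phones(
        user_id INTEGER REFERENCES users_b,
        phone   TEXT,
        type    TEXT,
        PRIMARY KEY (user_id, phone)
    );

    -- "May violate 3NF": a repeating group split across
    -- (phone_n, phone_n_type) column pairs.
    CREATE TABLE users_c(
        user_id      INTEGER PRIMARY KEY,
        name         TEXT,
        phone_1      TEXT,
        phone_1_type TEXT,
        phone_2      TEXT,
        phone_2_type TEXT
    );
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
)]
print(tables)
```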
> If it takes 20 joins just to return basic information, you're going to run into performance problems as well as usability problems. If 18 of those joins are to describe fidelity that you don't even need?
The only times I've seen anything close to that many joins are:
1. Recreating a denormalized table from disparate sources (which are themselves often not well-constructed) to demonstrate that it's possible.
2. Doing some kinds of queries in MySQL <= 5.7 on tables modeling hierarchical data using an adjacency list, because it doesn't have CTEs.
3. When product says "what if we now supported <wildly different feature from anything currently offered>" and the schema was in no way designed to support that.
Even with the last one, I think the most I saw was 12, which was serendipitous because it's the default `geqo_threshold` for Postgres.
What always frustrates me is that when people on here discuss deeply technical and/or meta-aspects of programming (e.g. type theory), it's taken at face value, but the same is not true of databases. They are generally treated as a dumb data store that you can throw anything into, and when someone explains why that's a bad idea, or why an academic concept like normal form is still applicable, it's met with criticism.
Even when it's purely performance-related, it usually gets a shrug, and "it's good enough." Cool, you're wrecking the B+tree, maybe don't do that. It's as if I said, "I'm using an array to store these millions of items that I later need to de-duplicate," and when someone suggests maybe using a set, I dismiss it.
Agreed. In practice I just ask "am I storing the same fact in two places?" & fix it if yes. Never once sat down and thought "let me check if this is in 4NF specifically."
Normalize till it hurts, then denormalize till it works!
what a marvelous motto <3.
Certainly a lot more concise than the article or the works the article references.
Imperative mood "normalize" assumes that you had something not-normalized before you received that instruction. It's not useful when your table design strategy is already normalization-preserving, such as the most basic textbook strategy (a table per anchor, a column per attribute or 1:N link, a 2-column table per M:N link).
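The textbook strategy named above can be sketched in a few lines of DDL (all names here are invented for illustration): a table per anchor, a column per attribute or 1:N link, a 2-column table per M:N link. A design built this way starts out normalized, so there is nothing to "normalize" afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author(                   -- anchor
        author_id INTEGER PRIMARY KEY,
        name      TEXT                     -- attribute: a column
    );
    CREATE TABLE book(                     -- anchor
        book_id      INTEGER PRIMARY KEY,
        title        TEXT,                 -- attribute: a column
        publisher_id INTEGER               -- 1:N link: a column
    );
    CREATE TABLE book_author(              -- M:N link: 2-column table
        book_id   INTEGER REFERENCES book,
        author_id INTEGER REFERENCES author,
        PRIMARY KEY (book_id, author_id)
    );
""")
n_tables = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type='table'"
).fetchone()[0]
print(n_tables)  # 3
```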
And this is basically the main point of my critique of 4NF and 5NF. They both traditionally present an unexplained table that is supposed to be normalized. But it's not clear where this original structure comes from. Why are its own authors not aware of the (arguably, quite simple) concept of normalization?
It's like saying that in order to implement an algorithm you have to remove bugs from its original implementation — where does this implementation come from?
The other side of this coin is that lots of real-world designs have a lot of denormalized representations that are often reasonably well engineered.
Because of that if you, as a novice, look at a typical production schema, and you have this "thou shalt normalize" instruction, you'll be confused.
This is my big teaching pet peeve.
> But it's not clear where this original structure comes from. Why are its own authors not aware of the (arguably, quite simple) concept of normalization?
I find the bafflement expressed in the article as well as the one linked extremely attractive. It made both a joy to read.
Were I to hazard a guess: might it be a consequence of the lack of disk space in those early decades, resulting in developers being cautious about defining new tables and failing to rationalise that the duplication in their tragic designs would result in more space wasted?
> The other side of this coin is that lots of real-world designs have a lot of denormalized representations that are often reasonably well engineered.
Agreed, but as the OP comment stated they usually started out normalised and then pushed out denormalised representations for nice contiguous reads.
As a victim of maintaining a stack on top of an EAV schema once upon a time, I have great appreciation for contiguous reads.
> Might it be a consequence of lack of disk space in those early decades
A plausible explanation of "normalization as a process" was actually found in https://www.cargocultcode.com/normalization-is-not-a-process... ("So where did it begin?").
I hope someday to find some technical report of migrating to the relational database, from around that time.
> Normalization-as-process makes sense in a specific scenario: When converting a hierarchical database model into a relational model.
That makes much more sense as reasoning.
If I can also hazard a second guess: I used to work in embedded in the 2000s, and it was absolutely insane how almost all of the elderly architects and developers would readily accept some fixed-width file format for data storage over a sensible solution that offered out-of-the-box transactionality and relational modelling, like SQLite. This creates a mindset where each datastore is effectively siloed and must contain all the information to perform the operation, potentially leading to these denormalised designs.
Bit weird, given that was from the waterfall era, implying that the "Big Design Up Front" wasn't actually doing any real thinking about modelling up front. But I've been in that room and I think a lot of it was cargo cult. To deal with the insanity of simple file I/O as data, I had to write a rudimentary atomicity system from scratch in order to fix the dumb corruption issues of their design, when I would have got that for free with SQLite.
Normalize the app, denormalize the reporting.
The numbered forms are most useful as a teaching device, not an engineering specification. Once you have internalized 2NF and 3NF violations through a few painful bugs, you start spotting partial and transitive dependencies by feel rather than by running through definitions. The forms gave you the vocabulary. The bugs gave you the instinct.
The lost art of normalizing databases. ”Why is the ARR so high on client X? Oh, we’re counting it 11 times lol”.
I would maybe throw in date as a key too. Bad idea?
Frankly I don't think that overcounting is solved by normalizing, because it's easy to write an overcounting SQL query over perfectly normalized data.
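A tiny demonstration (hypothetical schema, echoing the "counting it 11 times" example above): the data is perfectly normalized, yet joining before aggregating fans a single ARR fact out over the child rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One ARR fact per client, plus a 1:N child table of contacts.
conn.executescript("""
    CREATE TABLE client(client_id INTEGER PRIMARY KEY, arr INTEGER);
    CREATE TABLE contact(contact_id INTEGER PRIMARY KEY,
                         client_id  INTEGER REFERENCES client);
    INSERT INTO client VALUES (1, 100);
""")
# Eleven contacts for the one client.
conn.executemany("INSERT INTO contact VALUES (?, 1)",
                 [(i,) for i in range(11)])

# Joining first duplicates the ARR row 11 times before the SUM:
wrong = conn.execute("""
    SELECT SUM(c.arr)
    FROM client c JOIN contact t ON t.client_id = c.client_id
""").fetchone()[0]

# Aggregating at the right grain (before or without the join) fixes it:
right = conn.execute("SELECT SUM(arr) FROM client").fetchone()[0]
print(wrong, right)  # 1100 100
```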
I tried to explain the real cause of overcounting in my "Modern Guide to SQL JOINs":
https://kb.databasedesignbook.com/posts/sql-joins/#understan...
Great read, thank you!
I'll go one further and say that if you're reaching for DISTINCT and you have joins, you may have joined the data the wrong way. It's not a RULE, but it's ALWAYS a 'smell' when I see a query that uses DISTINCT to shove away duplicate matches. I always add a comment for the exceptions.
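A sketch of that smell with invented tables: joining to a 1:N table and then using DISTINCT to hide the fan-out, versus the semi-join (EXISTS) that was probably intended and never produces duplicates in the first place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders(id INTEGER PRIMARY KEY,
                        customer_id INTEGER REFERENCES customer);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    INSERT INTO orders VALUES (1, 1), (2, 1), (3, 2);
""")

# The smell: join to a 1:N table, then DISTINCT away the duplicates
# produced by Ada's two orders.
smell = conn.execute("""
    SELECT DISTINCT c.name
    FROM customer c JOIN orders o ON o.customer_id = c.id
    ORDER BY c.name
""").fetchall()

# What was actually meant: a semi-join; no duplicates to remove.
semi = conn.execute("""
    SELECT c.name FROM customer c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
    ORDER BY c.name
""").fetchall()
print(smell == semi)  # same rows, but the intent is explicit
```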
It depends on if you are doing OLTP (granular, transactional) vs OLAP (fact/date based aggregates) - dates are generally not something you'd consider in a fully normalized flow to uniqify records.
Makes sense. I’m an OLAP guy.
Well you are in luck: if someone tells you to normalize, you can tell them their source system can fuck right off.
Especially loved the linked article dissing the formal definitions of 4NF.
The article makes a good point about when 5NF becomes impractical. In my experience, stopping at BCNF or 4NF often strikes a better balance unless you have very clear join dependencies. How do others decide where to stop normalizing in real-world apps?
https://en.wikipedia.org/wiki/Essential_tuple_normal_form is cool!
Since I have a bad memory, I asked the AI to make me a mnemonic:
* Every
* Table
* Needs
* Full-keys (in its joins)
I have so many questions about that. Should that normal form basically replace 5NF for the purposes of teaching?
Why do they hate us and not provide any illustrative real-life example without using algebraic notation? Is it even possible?
I just want to see a CREATE TABLE statement, and some illustrative SELECT statements. The standard examples always give just the dataset, but dataset examples are often ambiguous.
> (in its joins)
Do you understand what "its" joins are? What even is "it" here?
I'm super frustrated. This paper is 14 years old.
https://dl.acm.org/doi/10.1145/2274576.2274589
I'll try reading it again.
Chris Date has a course on this using his parts and supplies example. Don't have time to find it but maybe ai can find it.
https://www.oreilly.com/videos/c-j-dates-database/9781449336...
https://www.amazon.ca/Database-Design-Relational-Theory-Norm...
For me, NF > 3 seems like an implicit encoding of the underlying data logic. They impose additional restrictions (usually contrived and artificial; they break really fast in real life) on data not directly expressed as data tuples. Because of that they are hard to explain, and the natural reaction is: "why don't you just store the data?"
I haven’t finished reading this but I am commenting because of the form. Lead with the conclusions, table of contents, and then sources? This is someone who is confident in what they write. I wish more writing trusted the audience to decide if the writing were important instead of stringing the audience along. Keep up the good work.
I love reading about the normal forms, because it makes me sound like I know what I'm talking about in the conversation where the backend folks tell me, "if we normalized that data then the database would go down". This is usually followed by arguments over UUID versions for some reason.
So which normal form do they argue for and against? And what UUID version wins the argument?
Not OP, but UUID v7 is what you want for most database workloads (other than something like Spanner)
I use the null uuid as primary key - never had any DB scaling issues.
Yeah, no NULL is ever equal to any other NULL, so they are basically unique.
You are also guaranteed to be able to retrieve your data, just query for '... is null'. No complicated logic needed!
Me still using bigints... Which haven't given me any problems. Wouldn't use it for client generated IDs but that is not what most applications require anyway.
Explaining jokes is poor form.
On the internet it is normal.
This was an attempt to extend the jokes, not ask for explanation: there are a number of normal forms, and people usually talk about "normalization" without being specific, thus conflating all of them; out of 7 UUID versions, only 2 generally make sense for use today, depending on whether you need a time-ordered version or not.
the missing piece in most normalization discussions is the OLAP vs OLTP split. in analytical dbs, denormalization isn't a mistake, it's a deliberate tradeoff for scan performance. teaching normal forms without that context sets people up to make the wrong calls when they hit a warehouse workload
My brain has been blunted too far due to dynamodb and NoSQL storage usage and now I can’t even normalize anymore
Color me impressed. Even being very well versed in database design myself, this is just pragmatic and straight to the point, the way I'd have liked it back in the day.
I think the main problem of how 4NF and 5NF formal definitions were taught is that essentially common sense (which is mostly "sufficient" to understand 1NF-3NF) starts to slip away, and you start needing the mathematical background that Ed Codd (and others) had. And trying to avoid that is how those weird examples came up.
i like it but i find the writing style difficult to read.
Could you share an example of a writing style that you enjoy?
Beautiful, well done!
love this
There are use cases where it's better not to normalize the data.
Typically it's better to take normalized data and denormalize it for your use case vs. not normalizing in the first place. Really depends on your needs.
Over time I’ve developed a philosophy of starting roughly around 3NF and adjusting as the project evolves. Usually this means some parts of the db get demoralize and some get further normalized
>> Usually this means some parts of the db get demoralize
I largely agree with your practical approach, but try and keep the data excited about the process, sell the "new use cases for the same data!" angle :)
lol took me a second. an amusing autocorrect!
One day I hope to write about denormalization, explained explicitly via JOINs.
Please do, your content is great!
I'm a fan of the sushi principle: raw data is better than cooked data.
Each process should take data from a golden source and not a pre-aggregated or overly normalized non-authoritative source.
Sometimes the role of your system is to be the authoritative source of data that it has aggregated, validated, and canonicalized.
This is great. Then I would consider the aggregated, validated, and canonicalized source as a Golden Source. Where I've seen issues is that someone starts to query from a non-authoritative source because they know about it, instead of going upstream to a proper source.
JSON is extremely fast these days. Gzipped JSON perhaps even more so.
I find that JSON blobs up to about 1 megabyte are very reasonable in most scenarios. You are looking at maybe a millisecond of latency overhead in exchange for much denser I/O for complex objects. If the system is very write-intensive, I would cap the blobs around 10-100kb.
I adore contiguous reads that ideas like that yield. I'd rather push that out to a read-only endpoint than get sucked into the entropy of treating what is effectively an unschema'd blob as editable data.
> You are looking at maybe a millisecond of latency overhead [for 1 megabyte]
Considering the data transfer alone for 1 MB / 1 msec requires 8 Gbps, I have doubts. But for fun, I created a small table in Postgres 18 with an INT PK, and a few thousand JSONB blobs of various sizes, up to 1 MiB. Median timing was 4.7 msec for a simple point select, compared to 0.1 msec (blobs of 3 KiB), and 0.8 msec (blobs of 64 KiB). This was on a MBP M4 Pro, using Python with psycopg, so latency is quite low.
The TOAST/de-TOAST overhead is going to kill you for any blobs > 2 KiB (by default, adjustable). And for larger blobs, especially in cloud solutions where the disk is almost always attached over a network, the sheer number of pages you have to fetch (a 1 MiB blob will nominally consume 128 pages, modulo compression, row overhead, etc.) will add significant latency. All of this will also add pressure to actually useful pages that may be cached, so queries to more reasonable tables will be impacted as well.
RDBMS should not be used to store blobs; it's not a filesystem.
yep.
for immutable blobs use crypto digest as db key and store blobs in file system.
Doing reports or lookups for stuff inside json is the death of any performance, however.
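SQLite's query planner makes the point visible (assuming a build with the JSON functions, standard in recent Python): filtering on a value buried inside a JSON blob forces a full scan and a parse of every blob, while the same value in a plain indexed column is a simple index search. The table here is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The same value stored both inside a JSON blob and in an indexed column.
conn.executescript("""
    CREATE TABLE doc(id INTEGER PRIMARY KEY, body TEXT, status TEXT);
    CREATE INDEX doc_status ON doc(status);
""")
conn.executemany(
    "INSERT INTO doc(body, status) VALUES (?, ?)",
    [('{"status": "open"}' if i % 2 else '{"status": "closed"}',
      'open' if i % 2 else 'closed') for i in range(1000)],
)

# Filtering inside the JSON cannot use the index: every row is scanned.
json_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM doc "
    "WHERE json_extract(body, '$.status') = 'open'"
).fetchall()

# The plain column is an index search.
col_plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM doc WHERE status = 'open'"
).fetchall()
print(json_plan[0][-1])
print(col_plan[0][-1])
```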