I find it distasteful and disturbing that copyright infringement by the people training the LLM in violation of a license is considered contamination by the licensed code. It’s not contamination. The code didn’t seep into your codebase. If the LLM was trained in such a way that it reproduces portions of code long enough to be protectable, then the license was violated by humans. The liability for the problem doesn’t lie on the shoulders of the contributors to the originally licensed code. It lies with the people inserting that code into your codebase without following the terms of the license.
The article also singles out the GPL repeatedly as a source of contamination. It doesn’t mention source-available proprietary licenses. It doesn’t mention code put online with no clear license, which according to the Berne Convention and the laws in at least the United States is automatically copyright-protected, with no license for use by others at all. It doesn’t talk about attribution for BSD-style or CC Attribution-ShareAlike licenses. There’s no mention of leaked proprietary code. It just singles out the GPL as some sort of unique problem.
This seems quite shoddy and biased for an article by someone who’s writing about the law.
It is probably fair to say that a huge share of FOSS code is licensed under the GPL, a much larger share than that of source-available proprietary-licensed code.
You would assume that there is more proprietary code available to read on the internet than GPL code? Do you have any rationale for that assumption?
Basically all GPL code is available on the web and there is a vast amount of it. I barely see any current non-FOSS code on the internet, although I think it would be fair to count the big projects that have been using pseudo-OSS licenses lately as proprietary. Wouldn't a safer assumption be a ratio of 10:1 or 100:1 for lines of GPL vs. lines of "shared source"?
Is GPL a larger share of source out there than BSD, MIT, ISC, CC, BSL, Apache, and source available combined? Enough bigger that it is repeatedly mentioned as a singular issue without so much as the words “or other licenses”?
That's the wrong metric, however. Thousands of small pet repos are unlikely to have more code than a single Chromium repo (mostly LGPL), Linux, Qt, etc.
I thought fair use was decided on a case by case basis, and could not be guaranteed? If true, wouldn't that mean that in other cases it could be ruled differently?
I don't have the exact ruling in front of me, but IIRC the judge pretty clearly said that training a model was fair use. IIRC, he declared it "quintessentially transformative".
The case by case basis was about acquisition and possession of the copyrighted material. Anthropic pirated a large number of books and illegally stored digital copies of many that they did purchase legally. The training being protected doesn't give them the right to violate copyright in that way.
Google, for example, purchased print versions of their training material and had a small army of employees digitize them and then delete the digital copies when they were done. That hasn't been challenged AFAIK, but it would likely have been found not to be a violation. That, I think, is what was meant by "case by case basis."
It's like if someone breaks into my house and I shoot them with my gun, that's very likely self defense, but if I'm not allowed to own a gun, I may still end up in trouble with the law.
Whether or not you’re pirating and making illegal copies of something depends greatly on the terms under which you’re allowed to make those copies. You can copy GPL-licensed code all day every day so long as you abide by the license. The same is true of the BSD licenses, MIT, ISC, Apache, et cetera.
If you’re copying or making substantially derivative works of them outside the terms of the license, you’re violating the copyright.
> If you’re copying or making substantially derivative works of them outside the terms of the license, you’re violating the copyright
I don't disagree with that.
What I'm saying is that the judge ruled that training a model using copyrighted books wasn't derivative. It was transformative, so the training wasn't a copyright violation.
He then went on to say that the way Anthropic acquired and handled that material was a copyright violation, because Anthropic pirated and copied a large number of books that were not under a license like the ones you mentioned. They downloaded a bunch of books you would find at most bookstores and then actually purchased copies of them much later, once they were accused of violating copyrights.
I'm just trying to make that clear because I've heard a lot of people who don't understand that the violation wasn't about the act of training or the material they used; it was just about how they acquired the training material.
If the trained LLM spits out large, recognizable portions of licensed code and you use it in your product, don’t count on that case to keep you from having to defend yourself in court. The court found in Bartz v. Anthropic that training was fair use. It also found that pirating content to train against was not fair use, and Anthropic paid $1,500,000,000 in a settlement.
There are licenses on most software source code. If you redistribute works derived from that code, you must abide by those licenses or you are violating the copyright. That’s what’s meant by “piracy” here.
Now if you have an LLM that has trained on code and learned to actually write new software, only small snippets too short to be protected by copyright should be identical between the training material and the output. However, if you’re getting output that is substantial in size and recognizably derivative of the original, that’s an issue that, as far as I’m aware, hasn’t yet been settled in court. One would hope the major LLM players don’t copy and paste large functional chunks of existing programs.
It would certainly seem to me that the code you sell after using an LLM should meet the same standards for difference in implementation as if it was written by a human. That should apply to both copyright protection and patent protection.
I find it pretty horrible that a company can pay a mere fine that is a small percentage of its total funding in exchange for materially benefiting from a conspiracy to commit a series of criminal acts.
If Anthropic hadn’t pirated training materials, would they even exist? Would they still have been as competitive?
Would they still have gotten every bit of VC funding in anticipation of future successes derived in part from past crimes?
What’s next? Armed bank robbery when VC funding dries up?
Also, fair use is much more limited in the EU. I don't know how it applies here or if there were any rulings. Are you going to stop doing business with the EU (and Japan, etc.)?
The seller of the code has no visibility on the training set of the LLM. If the situation you're describing ends up being illegal, responsibility should fall on the LLM provider to provide tools to detect such overlap with their training sets, and on the clients to run the tools.
The provider of the LLM should want to enable this and to take on that responsibility (I mean take it from the clients), otherwise no one will want to use the tool. Maybe there could be AI tool-use lawsuit insurance, but I feel like that's worse than the copyright infringement detection tool for everyone involved.
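For what it's worth, the detection side is not exotic technology. Here is a minimal sketch (in Python; the index format and everything else here are invented for illustration, not a real vendor API) of what an overlap check against a hashed n-gram index of the training corpus could look like:

    import hashlib

    def shingles(code: str, n: int = 8):
        # Hash every sliding window of n whitespace-separated tokens.
        tokens = code.split()
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            yield hashlib.sha256(window.encode()).hexdigest()

    def overlap_ratio(generated: str, training_index: set) -> float:
        # Fraction of the generated code's shingles found in the index.
        hashes = list(shingles(generated))
        if not hashes:
            return 0.0
        return sum(h in training_index for h in hashes) / len(hashes)

A provider could publish such an index (say, as a Bloom filter) without revealing the training code itself; the client only learns hit or miss, and anything above a chosen threshold gets flagged for human review.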
I can see the tool happening in the EU, but basically nowhere else, especially not in the US, where the government sees "AI dominance" as a national priority and a national security priority.
Rather, I suspect viral copyleft is why this lawyer is focusing on the GPL. It's the only(?) FOSS license that can force a proprietary codebase into the open.
Other than putting something into the public domain, I don't really know of any open source licence that doesn't require at least attribution. One can assume that 99.9% of training data had some sort of license requirements, so just blindly using it is a copyright violation. People just don't seem to care.
> The US Copyright Office confirmed this in January 2025, and the Supreme Court declined to disturb it in March 2026 when it turned away the Thaler appeal. Works predominantly generated by AI without meaningful human authorship are not eligible for copyright protection, and that rule is now settled at the highest judicial level available.
Misstates the law. Denial of certiorari can happen for many reasons unrelated to the merits and does not settle the issue nationwide.
Fair and correct. Cert denial means the Court declined to hear the case, not that it endorsed the lower court's reasoning or settled the question nationally. The DC Circuit ruling stands and the Copyright Office's position is consistent, but that is stable doctrine rather than Supreme Court-settled law. Updated the piece to reflect this distinction accurately.
Also, I don't think there is any example testing the conclusion. There is no case to point at showing that any of the factors they listed are sufficient to convey authorship. I would love to be pointed to a case where rejecting decisions and redirecting to a different approach was deemed human authorship. What we do know is that you can disclaim the part of the code a human didn't author. In fact, the Copyright Office requires you to disclose and disclaim. If anyone out there has more factual and citable sources, please share.
You are right that no court has yet ruled that a specific set of human contributions to AI-assisted work was sufficient to establish authorship. What exists is the inverse: the Copyright Office has granted partial registrations where human-authored elements were separated from AI-generated elements, as in Zarya of the Dawn, where the human-written text was protected but the Midjourney images were not. The Allen v. Perlmutter case pending in Colorado is the first direct judicial test of whether iterative prompting and editing can constitute authorship. Until that decision, the positive threshold is genuinely unknown. The piece reflects this in the calibration section at the end, though your point is worth adding to the authorship discussion more explicitly.
It's in fact the opposite, from what I've read. In one of the Supreme Court cases cited by the Copyright Office itself in its opinion on AI works (https://en.wikipedia.org/wiki/Community_for_Creative_Non-Vio...), it was held that merely advising someone who does the work for you, giving criticisms and revisions, isn't enough for authorship or co-authorship.
The Supreme Court declining to take up an issue is taking a position.
Now, different circuits can take a different view of the same issue. This is a common reason why the Supreme Court will grant cert: to resolve a circuit split. Appeals court judges know this and have at times (allegedly) intentionally split to force an issue to the Supreme Court.
Even without settling the issue, appeals courts will look at how other circuits have ruled and generally be guided by their reasoning. The fact that the Supreme Court declined to grant cert actually carries weight.
The real issue is that the Thaler case posed a different question: "Can an AI be an author?" The lower court said no, and SCOTUS left it alone. But the question of "what is enough for the human to be the author" wasn't even part of the case. That question remains completely unexamined.
Logically, I think there's a big difference between code produced from a single generic prompt with no other input vs. code produced from multiple complex prompts with a large existing codebase as input.
When I'm feeding the AI my code as input and it ends up producing new code that adheres to my architecture, my coding style, and my detailed technical requirements, the copyright over the output should be mine, since the code looks exactly like what I would have produced by hand; there is no creative input from the AI. It's just a code completion tool to save time.
I understand if someone leaves an LLM running as an agent for multiple days and it produces a whole bunch of code, then it's a very different process.
Fair point and worth being precise about. Cert denial is not meaningless: it leaves the lower court ruling intact, it signals the Court did not find the issue urgent enough to resolve now, and as you note, other circuits will look at the DC Circuit's reasoning. What it does not do is bind other circuits or establish Supreme Court precedent. The distinction matters here because if a Ninth Circuit case involving AI-generated code reaches a different conclusion, that circuit split would be live law regardless of the Thaler cert denial.
But it means that the appellate decision remains precedent, no? Wouldn’t losing precedential force be the primary legal effect of overturning that decision? All case law that hasn’t touched the Supreme Court could theoretically be challenged, but most of it isn’t, and it’s considered the law until it isn’t anymore, right? How would this be any different?
The decision is binding only within the jurisdiction of the Court of Appeals for the D.C. Circuit.
So it’s not correct to say “because SCOTUS denied cert, Thaler is now binding national copyright law.”
Practically speaking, it is binding on the US Copyright Office (one of the parties in the case) in CADC. And that’s important. But copyright litigation happens all across the country, while this ruling only directly constrains the relatively small number of cases within CADC.
Yes, I didn’t imply national precedence. I imagine it would also signal to attorneys appealing cases in other circuits that the same challenge will likely yield the same result.
Although this decision is not binding in other circuit courts, it is still something you can bring to a judge in other courts. They are not required to follow this ruling because they are not in that circuit. However, they will still consider what other courts have said, and that is an incentive to think hard before doing something different. A judge who rules differently is generally expected to write up a reason why, and that write-up would be given to an appeals court to weigh why the other court was wrong.
Yeah, I’ve heard lawyers use decisions in other jurisdictions to give weight to their line of reasoning. The SC saying they aren’t reviewing an appeal might not make that universally binding, but it signals that they don’t categorically reject the lower court’s decision.
I doubt any lawyer would mention that the SC didn't review this - that is meaningless and judges know it. They will, however, mention this case. Even if the case goes against them they will mention it, so they can say why it is wrong (the opposition will be sure to mention it, so they have to be prepared to take it down).
Leaks, whistleblowers. Some circumstantial evidence will also do if there's enough of it. Like having hallucinated parts of code that do absolutely nothing, and can't be explained as e.g. leftovers from a refactor.
And what exactly does it mean to "direct" how the work is constructed?
If I enter dark factory mode and go live my life while it churns tokens, then it's not copyrightable, but if I interact with it at every turn, then it is?
> When the Supreme Court declined to hear the Thaler appeal in March 2026, it did not endorse the lower court's reasoning or settle the question nationally. Cert denial means the Court chose not to hear the case, nothing more. What it does mean is that the DC Circuit's ruling stands, the Copyright Office's position is intact, and no court has yet gone the other way.
Let's hire humans as pAIrrots? They see it, they rearrange it, they rename variables, and then they "authored" it.
What a job to start in as a junior, but if you understand what's happening, you may augment the AI's code by giving "feedback" with enough time.
Free water but not electricity? I'll just hook up a generator to the shower...
These sorts of simplistic loopholes rarely work. Imagine if you could get copyright for the Linux kernel by just rearranging it and renaming a few variables.
Is there really likely to be any? The design is very different, isn't it? Ghidra with LLM plugins is likely at a point where a determined person could find out.
Do you really think, with that massive amount of open code, some would not be injected into the Windows kernel (or even .NET with Mono, or even Windows userland with Wine)? It is easier to hide it: it is closed source, and they are probably using the same hiding tricks as those used to hide AI-generated code (usually some level of refactoring to adapt to Windows data structures).
Seems like way more effort to read the Linux code, copy and adapt it to Windows, and actively "hide" it than to just write code that fits your situation from the get-go.
In my experience, reading and understanding code takes a lot more time than writing from scratch, so I don't really see what Windows developers (assuming they are somewhat competent coders; this assumption may not hold after around 2010 or so) would have to gain by copying from Linux.
Yep, that's why in many cases it is better to refactor already tested and debugged code.
Additionally, the size of the code base increases the difficulty of spotting 'obviously' refactored code from open source projects. There is a code complexity threshold. Coding AIs could help?
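Plain renaming, at least, is mechanically detectable today. A toy sketch in Python, using its own ast module for brevity (the same idea applies to any language you have a parser for): map every identifier to a canonical placeholder and compare the normalized trees.

    import ast

    def normalize(source: str) -> str:
        tree = ast.parse(source)
        names = {}
        for node in ast.walk(tree):
            if isinstance(node, ast.Name):
                # Map each distinct identifier to v0, v1, ... in order seen.
                node.id = names.setdefault(node.id, f"v{len(names)}")
        return ast.dump(tree)

    a = "total = 0\nfor item in items:\n    total += item"
    b = "acc = 0\nfor x in xs:\n    acc += x"
    print(normalize(a) == normalize(b))  # True: same code, renamed variables

Deeper refactoring (changed control flow, functions split or merged) defeats this kind of check, which is exactly where that complexity threshold kicks in.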
The only protection would be the honesty of Microsoft coders... wait, did I say "honesty" and "Microsoft coders" in the same sentence?
Ah, the infamous "no, I wrote it myself" submission in university coursework. Usually gets you a free visit to the guidance counselor and a bonus free mark (on your three-strikes-and-you're-out plagiarism form).
It also contradicts everything else I have read about Thaler. AFAIK the ruling was that the AI could not hold copyright; Thaler waived any claim to the copyright himself.
The last two bullet points on this page cover this:
Furthermore, we shouldn't even be looking to the Supreme Court at all for this. Congress needs to define the laws around AI and copyright. The Supreme Court is likely avoiding cases in the hope that the legislature gets its act together.
Personally, I think that the human directing the agent owns the copyright for whatever is produced, but the ability for the agent to build it in the first place is based off of stolen IP.
I'm concerned about the copyright 'washing' this enables though, especially in OSS, and I think the right thing for OSS devs to do is to try to publish resulting code with the strongest copyleft licensing that they are comfortable with - https://jackson.dev/post/moral-ai-licensing/
Funny how the copyright industry was able to spin copyright infringement into the pejorative "stealing". If you still have the item, what was stolen?
Dowling v. United States, 473 U.S. 207 (1985): The Supreme Court ruled that the unauthorized sale of phonorecords of copyrighted musical compositions does not constitute "stolen, converted or taken by fraud" goods under the National Stolen Property Act
The issue is that of copyright law WRT derivative works. A machine transformation of an original work does not create a new copyright for the person who directed the transformation. That's why you can't pirate a bunch of media by simply adding a red pixel to the right-hand corner or by color-shifting the video.
Copyright law is very clear that if a machine does it, the original copyright on the input is kept. This is why your distributed binaries are still copyrighted, because the machine transformed, very significantly, the source code into binary which maintains the copyright throughout.
It would be inconsistent for the courts to suddenly decide that "actually, this specific type of machine transformation is actually innovative."
I know this is generally really bad for the AI industry, so they just ignore it until a court tells them they can't anymore. And they might get away with it as I don't have faith that the courts will be consistent.
Shredding is a machine transformation. Does it mean that shreds retain original copyright even if the content can't be restored and the provenance can't be traced? Just an example that treating all machine transformations equally with no regard to the specifics doesn't make much sense.
And the specifics of autoregressive pretraining is that it is lossy compression. Good luck finding which copyrighted materials have made it into the final weights.
> Does it mean that shreds retain original copyright even if the content can't be restored?
Yup, it absolutely does. In fact, that's why you are still violating copyright law by using bittorrent even though each of the users is only giving out a small slice or shred of the original content.
The US has a statutory defense covering cases like shredding, called "fair use", but that doesn't mean or imply that a copyright is void simply because of a fair use claim.
> And the specifics of autoregressive pretraining is that it is lossy compression.
That doesn't matter. Why would it? If I take a FLAC recording and change it to an MP3. The fact that it was a lossy transform doesn't suddenly give me the legal right to distribute the MP3.
> Good luck finding which copyrighted materials have made it into the final weights.
That's what the NYT v. OpenAI lawsuit is all about. And for earlier models they could, in fact, pull out full NYT articles which proved they made it into the final weights.
Further, the NYT is currently in discovery which means OpenAI must open up to the NYT what goes into their weights. A move that, if OpenAI loses, other litigants can also use because there's a real good shot that OpenAI also included their works in the dataset.
Well, it's not the first time the law has contradicted the laws of nature (for the entertainment of future generations). BitTorrent is not a relevant example, because the system is designed to restore the work in its fullness.
> in fact, pull out full NYT articles
That's when they used their knowledge of the exact text they wanted to "retrieve" to get the text? It wouldn't be so efficient with a random number generator, but it's doable.
> BitTorrent is not a relevant example, because the system is designed to restore the work in its fullness.
You can restore shredded documents with enough time and effort. And if you did that and started making photo copies, even if they are incomplete, you will run afoul of copyright law.
BitTorrent is a relevant example because it shows that shredding doesn't destroy copyright.
Remember, copyright is about the right to copy something. Simply shredding or destroying a thing isn't applicable to copyright. Nor is giving that thing away. What's applicable is when you start to actually copy the thing.
I meant idealized shredding: a destructive transformation, which is still a machine transformation (think blender instead of shredder). When you need exact knowledge of a thing to make an (imperfect) copy of it using some mechanism, it doesn't mean that the mechanism violates copyright.
EDIT: I don't say that neural networks can't rote learn extensive passages (it's an effect of data duplication). I'm saying that they are not designed to do that and it's possible to prevent that (as demonstrated by the latest models).
I'd assume it's still a copyright violation if you copied and distributed the shredded copy.
The way I arrive at that is: imagine you add just 1 pixel of static to a video; that'd still be a copyright violation. Now imagine you slowly keep adding those random pixels. Eventually you get to the point where the whole video is just static, but at some point it wasn't.
Now, would any media company or court sue over that? Probably not. But I believe that still falls under copyright (but maybe fair use?).
The issue with neural networks is that they aren't people. Even when you point your LLM at a website and say "summarize this", the output of that summarization would be owned by the website itself by nature of it being a machine-transformed work.
Remember, it's not just rote recitation that violates the law; any transformation counts as well. The fact that AI companies are preventing it doesn't really solve the problem that they are in fact transforming multiple copyrighted works into their responses.
When you point your browser at a website the browser creates a (transformed) local copy of the information that is owned by the website itself. The browser needs to do that to render the website on your screen. Is it a violation of copyright (that the website is willing to tolerate because it profits from advertisements)?
No, because your browser is dealing with the distribution of data in a way intended by the copyright holder. You also aren't redistributing the webpage after rendering. Client side modifications fall under fair use which is what keeps the likes of ad blockers and other page modifiers legal.
What would violate copyright is if you took that rendered page, turned it into a JPEG, and then hosted that JPEG from your own servers. That's the copying that would run afoul of copyright law.
A human is not a commercial product. Here we have commercial product that was created by using a lot of various copyrighted and protected IP, without licensing agreements, without paying, without even citing it.
LLMs seem to be so devoid of intelligence that I think it's arguable whether that's learning: https://machinelearning.apple.com/research/illusion-of-think... Typically, you imply a level of understanding when you say learning. LLMs apparently can't do that, by design.
Copy/pasting at scale is how tons of software has been written for a long time, or have we all forgotten the jokes people used to make about StackOverflow?
Yes I guess there's also no such thing as stealing in torrents since the computer "learns" the data and returns it in a transcoded fashion so it's technically not a reproduction. Yes LLMs can reproduce passages from copyrighted works verbatim but that's only because it "learned" it and it's just telling you what it "knows".
The mental calisthenics required to justify this stuff must be exhausting.
> The mental calisthenics required to justify this stuff must be exhausting.
It's only exhausting if you think copyright ever reasonably settled the matter of ownership of knowledge and want to morally justify an incoherent set of outcomes that you personally favor. In practice it's primarily been a tool for the powerful party in any dispute to hammer others for disrupting their business model. I think that's pretty much the only way attempting to apply ownership semantics to knowledge or information can end up.
This is a perfect example of 'begging the question': arriving at a conclusion from a premise assumed to be true without evidence. Your reductio does not actually demonstrate that copyright applies to LLMs, because you did not demonstrate how transcoding is comparable to inference, just that LLMs can reproduce some passages from copyrighted works. You could also produce passages from copyrighted works by generating enough random sequences of words, but no one is arguing that is comparable to transcoding. That the people who do not share this conclusion are engaging in motivated reasoning is based only on your assumption and has no logical backing, and is therefore begging the question.
> Yes LLMs can reproduce passages from copyrighted works verbatim but that's only because it "learned" it and it's just telling you what it "knows".
Are you finding people that actually say this?
When it can quote something like that, it's a training error. A popular enough work gets quoted and copied by people online, and then it's not properly deduplicated. It's a very small fraction of works it can do that with, and the cleaner your data the less it happens.
I'll once again quote that stable diffusion launched with fewer weights than training images. It had some accidental memorizations, but there wasn't room for its core functionality to be memorization-based.
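To put rough numbers on that (back-of-envelope only; the parameter and image counts below are assumed round figures, not exact ones):

    # Assumed round numbers for a Stable Diffusion v1-era model.
    params = 0.9e9        # ~0.9B weights (assumption)
    bits_per_param = 16   # fp16
    images = 2.0e9        # LAION-scale training set (assumption)

    print(params * bits_per_param / images)  # ~7 bits per training image

A handful of bits per image is nowhere near enough for verbatim storage; memorization shows up mainly for works duplicated many times in the data, which effectively get many images' worth of that budget.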
I find it more ridiculous to equate the act of a human learning with for-profit AI training without recompense to the authors of the training material.
The "learning" isn't learning really. I mean it might be, but if you define learning to be a human endeavor than AI can't learn.
It's perfectly reasonable to say it's okay for humans to do something but not okay for a computer program to do the same thing. We don't have to equate AI to humans, that's a choice and usually a bad one.
It's a relevant extension if you think the ability to learn from a work is a right people have that exempts them from the more general lockdown copyright would impose.
If you come at it from the view of copyright being a limited set of control over some areas but not others, then if copyright doesn't block human learning it shouldn't affect anything similar either, unless a specific rule is added to make those situations be handled differently.
It's also perfectly reasonable to say it's ok for a program or machine to do the same thing as a human. This has been the basis for the technological revolution since the dawn of technology.
It's legal and perfectly reasonable for a human being to combine organic fuels with oxygen from the air to create energy and CO2. Any law restricting that would be the worst form of tyranny.
It would not be reasonable to allow machines to do that at unlimited scale without restrictions.
(Hopefully the fossil fuels industry won't draw inspiration from the legal arguments made by AI companies...)
You're taking the metaphor much too seriously. It was only an example to illustrate that human rights don't automatically apply to machines. Let's not read too much into it.
You made a claim and used a metaphor to demonstrate that claim. I asked a very simple question about the bounds of the metaphor and thus the claim. You are dodging answering the questions which mean that you cannot defend the logic of your claim. Thus you have forfeited that your claim is valid and 'human rights don't automatically apply to machines' has not been illustrated.
What's your strategy for solving problems where there are diverse viewpoints if there is no desire to convince anyone else? Rhetoric is a time-proven set of communication standards that allows us to demonstrate the validity of our positions, and thus gives us a way to find agreement, or at least understand what others think. Few people are completely irrational, and understanding why they think what they do, even if one does not agree with them, is important in a system where people have to co-exist with the decisions that affect everyone.
Because the alternative would be to just railroad people who don't agree, and even when it does work in one's favor the pendulum tends to swing back hard in response.
I think that it's absurd that we've jumped to the conclusion that backpropagation in neural networks should be legally treated the same as human learning.
I mean, I don't think I could find a better description than "derivative work" for something created by following the derivatives of the error in reproducing a set of works.
>> ... we've jumped to the conclusion that backpropagation in neural networks should be legally treated the same as human learning.
I agree. However, the reverse is also likely true, i.e., it cannot currently be denied that learning in humans is different from learning in artificial neural networks, from the point of view of producing works that mix ideas/memes from several works processed/read. Surely, as the article says, copyright law talks exclusively about humans - not machines, not animals.
I understand the article - the point about 'learning' is that if the model and its outputs are derivative works, then the copyright belongs to the human creators of the works it was trained on.
Edit: Or, put more pseudo-legally, that the created works infringe on the copyrights of the original human creators.
The part I agree with is that copyright law calls out humans specifically as the potential owners of copyright. So what you suggest seems to be the only way out. Calling out humans could imply that when a human reads a thousand books and then writes something based on them which is not a substantial copy of anything explicitly read, that human owns the copyright to the text written; whereas, if an artificial neural network did the same (hypothetically writing the same text), it would not.
The above does not follow from, imply or conclude anything about learning in artificial neural networks and humans being similar or dissimilar.
"Learning" for LLMs is just as goofy and propagandistic a metaphor as "stealing" for copyright. I find it predictive of your position that you'll accept one dumb metaphor for something that we didn't need a metaphor for, but not the other.
Are you for stealing and against learning?
We know exactly what is happening in both cases. We can talk about that, or we can use obfuscating euphemisms that make our preferred position seem obviously true.
Everybody has had a complete 180 in terms of copyright protections. Before, nobody cared about downloading music, movies, TV shows, or pirating games. Now, when copyright law is affecting them, they are gung-ho about protecting these billion-dollar companies' copyrights.
You are attempting to invoke a strawman. So is your point that there is not a significant overlap between posters who think that AI companies should not be allowed to use pirated copyrighted material in their training corpus and posters who themselves pirated copyrighted material such as movies, music, games, etc.?
Yes, that is their point. Do you have evidence against it?
I'm sure you can find some overlap, but I bet the vast majority is caused by people making a distinction between commercial and noncommercial piracy. I don't think there's a big cohort of piracy hypocrites.
Due to the nature of the argument, of course I do not have evidence for or against it. However, I am willing to leave it at that, because I think that any rational observer will be able to look at the general mood toward copyright/piracy online (including using LimeWire back in the day, pirating movies, downloading Photoshop, etc.) and come to their own conclusion about whether it's plausible that there isn't a significant overlap between the two.
It's not about "billion-dollar companies' copyrights"; it's also about voluntary copyleft free software. If I license my code under the GPL, I don't want other persons/companies to just whitewash that code through LLMs and use it in their proprietary code.
I agree with this, and I think that it is an open question whether or not training on copyrighted material is considered transformative or not. However, someone said that thumbnails of full photos are considered transformative enough to allow fair use, and LLM training is (in my opinion) clearly more transformative than converting a picture to a thumbnail. But we will see how it plays out.
The music and movie companies have power. They have the funds to bankrupt you with a small army of lawyers. You as an individual do not stand a chance against corporate lawyers. They can destroy your life over fairly minimal and non-violent offenses.
AI companies are backed by the very powerful. They can steal all they want and use the same army of lawyers to bankrupt any small rights holder. The big rights holders go to the same parties and allow it to happen.
Regardless of the actual take on copyright, both methods skullfuck the little guy without power.
People cry foul because, at least in the US, we claim to live in a free country based on equality, yet there is a very obvious caste system of the haves and the have-nots.
It erodes the legitimacy of the system. Imagine if for years you saw news reports of a mother getting a judgment against her where she owes hundreds of thousands because she seeded a Britney Spears song. Then you suddenly see the same laws that were leveraged to instill fear in you tossed aside when the rich and powerful say they don't count anymore - you're going to cry foul!
It's not a hypocrisy of position on copyright, it's bearing witness to the illegitimacy of the laws they're bound by.
I find the idea that code could be copyrightable weak. There are only so many ways to write a for loop. Similarly, you can't copyright schematics (apart from the exact visual representation as a form of art). Code is just a schematic.
Copyrights already preclude short phrases for the same reason -- there are only so many ways in which short phrases could be produced. The moment a work becomes larger (large enough; AFAIK, the threshold is not precisely defined), the reasoning you applied fails to apply.
The Google-Oracle lawsuit did not decide whether APIs (when large in number) are copyrightable or not.
Let me get this straight: since there are only so many ways to write a for loop, you doubt that for loops are copyrightable. From this you conclude that code, in general, isn't copyrightable?
That's like saying "there's only so many ways to greet your neighbor, so any text that simply greets your neighbor isn't copyrightable – and therefore no text is copyrightable".
> but the ability for the agent to build it in the first place is based off of stolen IP.
I honestly don't understand why the attitude that underlies this is so prevalent.
When I write code, what I write and how I write it is informed by having read countless source code files over my education and my career. Just as I ingest all that experience to fine-tune how my later code is written, so does the LLM from the code it's seen.
The immediate retort to that is that the LLM is looking at code that wasn't its to read. But I don't think that's a valid objection. Pretty much by definition, everything I've learned from has a copyright on it, and other than my own code on my own time, that copyright is owned by someone else. Much of the code that's built up my understanding has been protected by NDA, or even defense-department classifications: it wasn't mine in any way. But it still informs how I do all my future coding.
By analogy: I'm also an artist, especially since my retirement. My approach to photography was influenced by Ansel Adams, and countless other artists whose works I've seen displayed in museums, or in publications and online. My current approach to painting was inspired by Bob Ross and others, and the teachers who have helped me develop. I've taken pieces of what I've seen in all their work, and all of that comes out in my photos and paintings, to varying degrees.
I've taken ideas from others in code and in art, and produced something (hopefully!) different by combining those bits with my own perspective. I don't think anyone has a claim on my product because of this relationship.
Likewise, I know that many of my successors have learned from my code (heck, I led teams, wrote one book about software development!). And I hope that someday my artwork has developed to the point where there's something in it that's worth someone else's attention to assimilate. I've never for a minute - even decades before the advent of LLMs - hoped or even imagined that my work would remain locked up with me, and that the ideas would follow me to the grave.
As they say, we are all standing on the shoulders of giants. None of us would be able to achieve the tiniest fraction of what we have, without assimilating what has come before us. Through many layers of inheritance it's constantly being incorporated in subsequent works.
In a few decades at best, I'll be dead. It probably won't be very long after that when people even forget my name. But the idea that something I've done - my work in developing software systems, or in my photography and painting - will continue to have ripples through time, inspires me and gives me hope that I'll have some tiny shred of immortality beyond my personal demise.
Scale, and the ability to make a livelihood from your creations and/or to control how what you have created is used - for instance, to demand attribution.
The attitude is derived from a general animus many have towards AI companies. They resent the efficacy of AI because it devalues individual expertise.
I can't imagine it's really justifiable to say that training off data is the same as "stealing", when that same claim - that learned information a person could retain and reproduce constitutes copyright infringement - is the subject of many dystopian narratives, like this one, where once your brain is uploaded to the cloud you have to pay royalties based on every media product you remember.
Part of how AI works is that it's just really complicated compression; you can get AI to write out Harry Potter novels word for word with the right prompting.
When it picks out a rare bit of code, it will simply be copying that code, illegally, and presenting it without attribution or any license, which is in fact breaking the law - but AI companies are too important for the law to apply to them.
There's been instances where models have spat out comments in code that mention original authors, etc., effectively outing itself as a copyright thief.
There's nothing anyone can do about it, but the suspicion is that the big companies have taken everyone's code on GitHub, without consent, and trained on it.
And now they are spitting out big chunks of copyrighted code and presenting it as somehow transformed even though all they've actually done is change a few variable names.
It is copyright theft, but because programmers are little people, not Disney, we don't have any recourse.
When I write fizzbuzz do I owe royalties to the inventor of fizzbuzz? Is my brain copyright thieving because I can write out the song lyrics from memory?
> There's nothing anyone can do about it, but the suspicion is that the big companies have taken everyone's code on GitHub, without consent, and trained on it.
I asked agent X for the source of the training data it generated code from; it couldn’t say. Then I asked why the code implementation was exactly the same as the output of agent Y. It said they were trained on the same ‘high-quality library’, and still couldn’t say which one.
So I guess that’s fine because everyone is doing it.
You asked a machine that makes things up when it doesn't know the answer a question that it has no way of knowing the answer to. I don't know why you bothered to relay its response.
> And now they are spitting out big chunks of copyrighted code and presenting it as somehow transformed even though all they've actually done is change a few variable names.
It's pretty likely that I've done the same thing. I mean, I've written enough CRUD functions in my life, for example, that in all likelihood I'm regurgitating stuff that's a copy, for all practical purposes, of stuff I've done before as work-for-hire for my employer. I'm not stealing intentionally or consciously, but it seems quite likely that it's happening. And that's probably true for many of you, at least that have been in the industry for a while.
For another human being to look at my open source code, learn from it, get inspired by it, appreciate what I did, and let it influence their own creativity would bring me joy. That's why I open sourced it in the first place.
Few people ever actually read open source code, but I'd like to think on the rare occasions they do, they share a connection with the author. I know when I read somebody else's code, for me to understand it I have to be thinking about the problem the same way they were when they wrote it. I feel empathy with them and can sometimes picture the struggle, backtracking, and eureka moments they went through to come up with their solution.
Somehow I don't get the same warm fuzzy feelings about a machine powered by investor money ingesting my work automatically, in milliseconds, and coldly compressing it down to a few nudges on a few weights out of trillions of parameters. All so the machine can produce outputs on-demand for lazy users who will never know of me or appreciate my little contribution, and ultimately for the financial benefit of some billionaires who see me as an obsolete waste of space.
We're moving into the 'industrial age of software'. Your exact issue, about bespoke, well-thought-out and well-crafted code, is one that craftsmen felt at the beginning of the industrial age. Now, parts are designed and churned out by machines that no one sees or cares about (generally speaking). This is where we are going with software, and production at a truly industrial scale has its place.
And so does well-crafted bespoke software.
The engineers who built the foundation for the industrial expansion of our forefathers went through the same exact thing we're going through now. They look at what existed, and use it to inform their efforts. This is what LLMs do.
I'm not attempting to moralize here, just to comment on the parallels. Do I agree that a craftsman's work is consumed by the juggernauts and no second thought is given? No. I think it's a shame. But I also think the output will never match that of the artisans who practice now. By the very nature of the machines we employ, we cannot match the skill or thought that goes into bespoke code.
It is not even about quality. In fact with an LLM following my orders I can create higher quality code than I ever did before. I always was operating within a budget whether it was defined by the # of hours my customers were willing to pay for, or the # of hours I was personally willing to invest in a side project. This budget manifested in the form of cut features, limited test coverage, limited documentation, and so on. So given the same budget or even a slightly reduced budget I can actually make higher quality software with slop superpowers.
If I spend 2 hours designing the domain model, 1 hour slopping out a rough implementation, and 5 hours polishing it with a combo of handwritten and vibed refactorings, I will get a better result than if I spent 8 hours writing everything by hand.
So my point is not that vibe software is lower quality; my experience has shown the opposite. It is simply that I shared my work in the spirit of sharing with others who toiled in the same craft, not for consumption by machine. Not that I ever contributed anything very important to the open source world, that anybody depended on. Just personal projects I thought were neat or educational.
In hindsight I would probably still have open sourced what I did, because I think it's valuable to have on record that I competently programmed stuff before AI even existed, like pre-atomic steel. But I don't know if I will open source any personal code going forward.
====
To put it more succinctly: if somebody "ripped off" my open source code in 2018, I wasn't mad about that. Even if they didn't bother to attribute me, well, at least they saw my stuff, had a human brain cell light up appreciating it, and thought it was worth stealing. I'm flattered. But with LLMs my work can be reappropriated without a single human ever directly knowing or caring about it.
Well put. I agree wholeheartedly with your sentiment.
Maybe this is me just being angry at the new world that's being created, but the beauty of the open source ecosystem was humans giving away things they found useful in the hope that other humans could find them useful too. Having a machine take all of that and regurgitate it somewhere else without that connection (for profit, no less) feels like a betrayal of that open source ethos.
Now in the back of my mind I worry that everything I open source will be scooped up by corporations to make them more rich and more powerful, so I end up not publishing anything (not that it was of any value). I suspect I'm not alone in feeling that way.
Humans should have more legal privileges than machines, just as individuals should have more legal privileges than corporations. It's really as simple as that. I don't want to go around making up justifications; that's how the law should be, and if it turns out not to be that, I'm going to be nettled.
I live in the UK, and most US law is based upon English common law, it's not some immutable code given to us from above. It's based upon assumptions and capabilities of the entities participating in the system at the time the law was codified. It can and should change to make more sense if those assumptions and capabilities shift massively.
I get the individual/corporation distinction, but how is a machine another tier here? It's a tool, it can't have any rights at all. The wielder has rights, and curtailing their rights depending on what tool they're using to exercise them seems strange. Potentially justifiable, but it's a different axis from the nature of the actor.
Our positions are completely compatible. People are anthropomorphizing LLMs, saying that because humans train on protected works, then it is fine for LLMs to do the same.
If they have only the rights that their human creators have, then access to them cannot be sold, in the exact same way that I cannot sell you a database that I have collected filled with copyrighted material. The "humans do training too" argument only holds if you imbue LLMs with similar rights to humans.
I am allowed to sell myself (in a very limited capacity) to others for them to exploit my training, even if that training was on protected material, which is a privilege humans should have, but machines should not.
Thing is, the LLM's level of compression of the training set means that, under the same rules that say you cannot sell that database filled with copyrighted material, the LLM is effectively fine to sell, because you have to be able to meaningfully trace each claimed work to the final output (the weights). For example, for some older Stable Diffusion model, it was calculated that each individual work's addition or removal resulted in about 1-2 bits of change, meaning the same rules would qualify it as not a derivative work.
However, because it is an issue with the (at least historical) goals of copyright law, the common pattern that is evolving is that AI is not granted copyright of any work it generates, making it a bit of a poison pill for some of the more egregious ideas of corporate abuse. Not sure if the weights will be considered copyrightable either.
Under a "sweat of the brow" doctrine, the creator of a work, even if it is completely unoriginal, is entitled to have that effort and expense protected; no one else may use such a work without permission, but must instead recreate the work by independent research or effort. The classic example is a telephone directory. In a "sweat of the brow" jurisdiction, such a directory may not be copied, but instead a competitor must independently collect the information to issue a competing directory. The same rule generally applies to databases and lists of facts.
306 The Human Authorship Requirement
The U.S. Copyright Office will register an original work of authorship, provided that the work was created by a human being. The copyright law only protects “the fruits of intellectual labor” that “are founded in the creative powers of the mind.” Trade-Mark Cases, 100 U.S. 82, 94 (1879). Because copyright law is limited to “original intellectual conceptions of the author,” the Office will refuse to register a claim if it determines that a human being did not create the work. Burrow-Giles Lithographic Co. v. Sarony, 111 U.S. 53, 58 (1884). For representative examples of works that do not satisfy this requirement, see Section 313.2 below.
313.2 Works That Lack Human Authorship
As discussed in Section 306, the Copyright Act protects “original works of authorship.” 17 U.S.C. § 102(a) (emphasis added). To qualify as a work of “authorship” a work must be created by a human being. See Burrow-Giles Lithographic Co., 111 U.S. at 58. Works that do not satisfy this requirement are not copyrightable.
...
Similarly, the Office will not register works produced by a machine or mere mechanical process that operates randomly or automatically without any creative input or intervention from a human author. The crucial question is “whether the ‘work’ is basically one of human authorship, with the computer [or other device] merely being an assisting instrument, or whether the traditional elements of authorship in the work (literary, artistic, or musical expression or elements of selection, arrangement, etc.) were actually conceived and executed not by man but by a machine.”
The question is, does Claude Code fall into that category of authorship without creative input or intervention from a human author?
The prompts may be copyrightable... but the output if you don't go in and fix it up and provide that minimal amount of human originality to it? That appears to still be an open question of law in the United States.
> When I write code, what I write and how I write it is informed by having read countless source code files over my education and my career. Just as I ingest all that experience to fine-tune how my later code is written, so does the LLM from the code it's seen.
You are presumably human. We have granted humans specific exemptions in copyright law. We have not granted that to LLMs. Why are we so eager to?
Because that allows us to create useful tools that we didn't have before. For me it feels like a carpenter going from a hand-saw to an electrical saw. Still requires the skills of a good carpenter, but faster and easier.
… so a bunch of people just decided that rights we granted to humans also apply to their tools? Without any discussion? This isn't how anything is supposed to work when it comes to common rules!
The common rules are so because we agree on them. On principle, in this case, we do not agree on what the rule should be here, and it's in a way unprecedented. We'll soon converge on a societal agreement. I hope society abstaining from tools will not be the answer.
What's special about LLMs in your argument? When I was an edgy teenager in the 90s, I'd argue that it's not piracy because the DivX representation of the movie isn't bit-for-bit identical to the Hollywood master or whatever. If your reasoning works for LLMs as the tools, surely it also works for video compression.
I'm not sure where in our lawbooks there are laws that specifically target humans to the exclusion of human-operated tools.
There's also a TON of irony here. What an about face it is, for the community at large* to switch from "information wants to be free, we support copyleft and FOSS" to leaning so heavily on an incredibly conservative reading of IP law.
> I'm not sure where in our lawbooks there are laws that specifically target humans to the exclusion of human-operated tools.
It doesn't need to. Laws are for humans.
Laws don't give rights to chainsaws. Or lawnmowers. Or kitchen knives, hammers, screwdrivers, and spades.
You can't use any of those to commit a crime and then claim that the law specifically did not exclude those tools.
Why are you seemingly in favour of carving out an exemption for LLMs?
> Laws are for humans.
Arguing that the law did not specifically address "intentionally killing a person by tickling them till they died" means that you found a loophole which can be used to kill people is...
> I'm not sure where in our lawbooks there are laws that specifically target humans to the exclusion of human-operated tools.
If we take the point of view that LLMs are tools (I agree), then people need to be absolutely certain that these tools don't contain (compressed) representations of copyrighted works.
People seem not to want to do that. And they argue that the LLMs have "learned" or "been inspired" by the copyrighted works, which is OK for humans.
This is the problem. People can't even agree on which of two mutually exclusive defenses to appeal to! Are LLMs tools which we have to ensure aren't used to reproduce copyrighted work without permission, or are they entities that can be granted exemptions like humans can? It can't be both!
> There's also a TON of irony here. What an about face it is, for the community at large* to switch from "information wants to be free, we support copyleft and FOSS" to leaning so heavily on an incredibly conservative reading of IP law.
True. While IP-owning companies like Microsoft now say "it's online, so we can use it".
It's bizarre.
I'll tell you what: I'll drop my conservative stance in defense of FOSS when Windows and the latest Hollywood movie are "fair use" for consumption by whatever LLM I cook up.
> If we take the point of view that LLMs are tools (I agree), then people need to be absolutely certain that these tools don't contain (compressed) representations of copyrighted works.
I've pointed out elsewhere in this thread that this is the opposite of how the real world works.
In actual fact, people who need software built hire a tool (e.g., a software developer like me) to build it for them. That tool - me or you - has inside it a tremendous library of copyrighted works represented. I've worked on enough different projects over the decades that the next CRUD function, or rule-driven data-entry tool, or whatever, that I build is going to draw very significantly from the last ones I built. And those last ones were copyrighted, with those rights held by my employer at the time, and maybe even protected by NDA or defense-style classifications.
Is your position that this is OK so long as it's stuff that I can keep in my squishy brain, but the moment that mechanism moves to silicon, it somehow becomes fundamentally different?
The other major argument I see in this thread is that for LLMs it's different because there's a third party who is aggregating the data, and selling me (or my employer) use of that tool. But this doesn't change the overall picture at all. It just adds one more layer of dereferencing into it. The addition of that middleman hasn't altered the moral landscape: how is hiring me, along with what's in my memory, different from hiring the combination of me plus a helper to supplement my memory? There's an aspect of scale, I suppose. With that helper I can achieve greater quantities, but it's not changing the story in a qualitative way.
> In actual fact, people who need software built hire a tool (e.g., a software developer like me) to build it for them. That tool - me or you - has inside it a tremendous library of copyrighted works represented.
Humans are distinct from tools, both ethically (to most people) and legally. You may not see it this way, but it is the majority opinion and the stance of the law in most jurisdictions. The rest of your paragraph falls apart unless humans are considered tools.
(Incidentally: you can own tools. I don't think you want to open that door…)
> Is your position that this is OK so long as it's stuff that I can keep in my squishy brain, but the moment that mechanism moves to silicon, it somehow becomes fundamentally different?
Yes. We, humans, structured our laws because we consider ourselves and our squishy brains special.
This is, for example, why you don't get charged with murder for terminating a computer program. We, the humans, have decided that the right not to be terminated only applies to humans (and other animals, but then because we grant them that protection).
In many of those examples, there is payment to the creator of the works that others are learning from. Authors are paid for their books, when we listen to music on the radio the musician is paid royalties, etc. When you lead a team and mentor junior engineers you're being paid for your time.
The nature of the source material matters though. Training a model on open source software seems perfectly fair - it has explicitly been released to the public, and learning from the code has never been a contested use.
IMO the questions around coding models should be seen as less about LLMs and more as a subset of the conversation about large companies driving immense profits from the work of volunteers on open-source projects, i.e. it's more about open source than AI.
You’re not a product that was created by other human beings based on someone else’s IP.
It turns out that's false. We know that genes are patentable; remember back during the Human Genome Project, when there was such a rush to patent them? So genes are IP. (This seems bizarre to me, since they're patenting something that was found just sitting there, but this is what the system says right now.)
Well, two other humans (aka mom and dad) did create me, based on those patentable genes (and most likely including some genes that were, in fact, patented).
I'm not sure what to conclude from all of that, but I do think that it invalidates your argument.
It's a little more complicated, and I would argue that the court got it wrong, but you cannot patent a gene as it exists and rests in nature. You can patent the cDNA (reverse-transcribed mRNA) genetic code after intron removal, which they argue is not a natural thing, but I think they misunderstood the science, or rather the triviality of the "invention".
He's making a point about responsibility/liability.
If you only get copyright for the prompt you make, but not the output, then it's like being responsible only for the prompt, but not the output.
Ie he's only responsible for pushing the boulder up the hill. The fact that it rolled down from the hill and crushed someone's house "isn't his fault" (he doesn't get copyright on it).
>The Office concludes that, given current generally available technology, prompts alone do not provide sufficient human control to make users of an AI system the authors of the output. Prompts essentially function as instructions that convey unprotectible ideas. While highly detailed prompts could contain the user’s desired expressive elements, at present they do not control how the AI system processes them in generating the output.
If you're not the author then why would you have to be liable for it?
In some places simply not keeping the public street in front of property ice-free can incur liability, even when you are not actually there when it snows. There are so many such examples I'm kind of surprised to see this kind of confused argument made here.
But that's not a comparable situation at all, because it is your party. It doesn't matter where it is held; we assign "ownership" of the party to you. Even the language we use states that explicitly. In the case of copyright, it is explicitly stated (by the copyright office) that you are not the author of an AI-generated work.
> If you're not the author then why would you have to be liable for it?
If you do not understand this, make sure that you always operate within a framework of people who do, because this sort of misunderstanding can cause you a world of grief.
Because you are the person shipping it, and as such regular liability applies. If I'm not the author of a book and I make a lot of copies and distribute them, I'm liable for the content of that book regardless of whether or not I hold the copyright to it. Conversely, if the original author sues because they feel the work infringes on theirs, that too is a liability that stems from the distribution.
And 'distribution' is a pretty wide term, not unlike 'interstate commerce', lots of things that you might not consider to be distribution can be classified as such in court.
Different laws do not come in packages, they apply individually, and sometimes they apply collectively but it isn't a menu where you can pick the combination that you think makes the most sense.
Oh, I do understand it - laws are contradictory and can be bent toward whatever people shout loudest that they should do (though they don't always work that way). I just think that it is extremely bad when laws work this way.
Technically when you select "copy image" instead of "copy image url" and paste that to a friend you're often committing copyright infringement. Do I think this is reasonable? Absolutely not. The same goes for this - the author should hold liability, so make the person who ends up causing the work to exist the damn author.
But nooo, we can't have that. Instead we need to have these convoluted exceptions that don't at all work how the real world works, so that lawyers can have even more work.
Besides, if we go by "the law" then we already have a court case where training an AI model is protected by fair use. But obviously that isn't satisfying enough for people, so they keep talking about how it's stealing (refer to my first sentence).
Also, this situation is going to get funny when some country decides that AI generated content does get copyright protection.
> Oh, I do understand it - laws are contradictory and can be bent toward whatever people shout loudest that they should do (though they don't always work that way). I just think that it is extremely bad when laws work this way.
You are completely misunderstanding GP's distinction between ownership and liability.
In short, if you use someone else's car to kill someone, you are still liable for killing that person even though you don't own the car.
Aren't you agreeing with him?
He pushed the boulder up the hill, thus he is responsible and liable for what happens. He is the author of the work of pushing the boulder up the hill.
In your analogy: He was driving the car, he is liable for the death. He is the author of the work of driving the car.
You are kinda unnecessarily introducing the creation of an object used for the work. Whoever did create the car/boulder is not liable for what happened.
So whoever made the LLM is not the author but the one who used it to create the code.
>>>> If you're not the author then why would you have to be liable for it?
And all his arguments after that are to support that claim. His claim is wrong.
Ownership and liability are independent of each other, and all his supporting arguments ignore this fact.
> Whoever did create the car/boulder is not liable for what happened.
Incorrect; whoever owns the car/boulder is not liable. The creator doesn't even enter this argument.
> So whoever made the LLM is not the author but the one who used it to create the code.
No; whoever created the LLM is irrelevant. The author who creates the code is similarly irrelevant. What matters in his argument is who owns the code, and this is also irrelevant to his argument, because ownership does not mean liability.
You can't really argue that things are in a certain way when that contradicts the way the law works, that's a recipe for disaster. The rules have been set, you can disagree with them and then you will be forced to litigate, which is both expensive and time consuming. Purposefully going against the grain is only for those with extremely deep pockets (and for lawyers...).
> Besides, if we go by "the law" then we already have a court case where training an AI model is protected by fair use.
Yes, but training an AI is a completely different thing than distributing the work product generated by that AI.
Note that I don't agree with all aspects of copyright law either, but I'll be happy to play by the rules as set today simply because I can't afford to be wrong and held liable for infringement. For instance I strongly believe that the length of copyright is a problem (and don't get me started on patents, especially on software). I also believe that only the original author should have copyright, not the company they worked for, their heirs (see Ravel for a really nasty case) or anybody else. I believe they should not be transferable at all.
But because I'm a nobody and not wealthy enough to challenge the likes of Disney in court I play by the rules.
As for 'this situation is going to get funny when some country decides that AI generated content does get copyright protection':
Copyright is one of the most harmonized legislative constructs in the world. Almost every country has adopted it, often without meaningful change. In practice US courts are obviously a very important driver behind changes in copyright law. But in general these changes tend to lean towards more protection for copyright owners, not less. So far the Trump admin has not touched copyright law in their usual heavy handed manner. I'm not sure if this is by design or by accident but maybe there are lines that even they can not easily cross without massive consequences.
Some parties in the AI/copyright debate are talking out of both sides of their mouth. For instance, Microsoft relies heavily on being able to infringe copyright at will, while at the same time jealously guarding their own code. Such hypocrisy is going to be the main wedge that those in favor of strong copyright will use to reduce the chances that AI work product deserves copyright. After all, if the output really were original and not derivative, then Microsoft could (and should!) train their AI on their own confidential code. But they're not doing that. Maybe they know something you and I do not...
Imagine you cut the sentence "I'm going to kill you, this is an imminent threat." out of a book and hand it to someone.
It would be silly to consider you the author of that sentence in a copyright sense.
It would be equally silly to say you have no liability from that sentence.
Looking back at the boulder example, that LLM output has no consequences to be liable for if you throw it immediately into the trash bin. It's when you take boulder.txt and use it to do things that you have liability despite not having copyright.
That is not how responsibility works anywhere. If you steal a gun and murder someone with it, you are still responsible, even though it is not your gun.
If you have a more recent citation referring to case law that states the opposite then that would be great but afaik this article reflects the current state of affairs.
The human using the tool creates a prompt; there is then an automatic transformation of the prompt into code. Such automatic transformation is generally accepted not to create a new work (after all, anybody else inputting the same prompt would have a reasonable expectation of generating the same output, modulo some noise due to versioning and possibly other local context).
Claude Code, and AI-generated code in general, does not at present create a new work. But the prompt, the part which you input, may be sufficiently creative to warrant copyright protection.
In the US, the copyright office (as the article you link to says), has declined to define “meaningful” contribution. If you want to argue that the user doesn’t own it for incredibly trivial prompts, I won’t argue (though I consider that to be non-useful code).
Every developer I've seen use these tools has engaged in a meaningful contribution: specific directions across multiple prompts, often (though not always) editing the code afterwards, manually running the code and prompting for changes, etc.
Until the courts, legislators, or the copyright office define something otherwise, I’m highly confident of my assertion. (Mostly because of the insane number of hours I’ve spent with counsel on this. And, as a disclaimer, since I am biased: I worked on Copilot and Google’s various AI assisted coding products as an SVP and VP.)
If my business depended on a legal fiction to be true and I had invested a whole pile of effort + money into it being so then I would argue at every opportunity that 'of course it is legal'. But that's just a version of fake-it-until-you-make-it and in practice not all of those bets pay off.
The fact that meaningful contribution has not been defined is a strong signal that things are not nearly as clear cut as you make them out to be. Until there is a ruling that clearly establishes that the person that generated the prompt owns the copyright on the code I think it is misleading to suggest that this is already the case, your lawyers are not the lawyers of the parties that will end up hurt if it ends up not being so.
For contrast: we have a very clear idea on what things are copyrighted and in general these things do not rest on a foundation of IP appropriated from others outside of the license terms. The fact that the infringement is fine grained and effectively harms the rights of 1000s or more individuals doesn't change the heart of the matter, whoever wrote the code: it wasn't you.
Given your bias I'm not surprised that this would be your argument though, effectively you have created a copyright laundromat using code that you were nominally the steward of and not the owner but whether it stands long term or not is not up to your lawyers.
You warrant that you wrote the code yourself; then it is found that your code infringes on code owned by other entities. Now you have a tough choice: admit you lied about writing your code yourself, tainting all of the code you have claimed to write since these tools became available, or stand firm and take the infringement penalty, which could be very substantial.
Judges and courts don't like playing silly games like this.
I've sued two parties for copyright infringement and won and a third settled out of court for a substantial sum. You don't tell a judge you don't need to prove you wrote the code, that's an automatic loss. Then there are such things as expert witnesses who will interview you and check how much you know about the code you claim you wrote.
>I've sued two parties for copyright infringement and won and a third settled out of court for a substantial sum. You don't tell a judge you don't need to prove you wrote the code, that's an automatic loss. Then there are such things as expert witnesses who will interview you and check how much you know about the code you claim you wrote.
This doesn't really make sense; an "expert" interview can in no way definitively establish whether someone wrote a piece of code, especially if the person has access to the code beforehand.
If that were true, a developer may own copyright over the source code, but nothing on the compiled binaries, and I could download practically all software available as compiled binaries and use for free.
Indeed a developer owns copyright over the source code and over the compiled binaries, because there is no expansion happening here, just a translation from one format into another - the kind of thing that has been covered by copyright for as long as copyright has existed. The same goes for translations from one human language into another, and anybody with knowledge of more than one language will be happy to acknowledge that translating is hard work. Even so, the translator does not hold copyright on the result; at best they can say they have created a derived work, and it is the original author who continues to hold copyright.
Compilation and translation happen in a generic manner and do not rely on a mountain of other IP. The compiler is really just a transformative tool that happens to do something useful; someone constructed it to be such a precise translation that any mistakes in it are called bugs, and we fix them to ensure the process stays deterministic. Translators try hard to 'get it right' too: to affect the intentions of the original author as little as possible.
When you use a model loaded up with noise or that you have trained exclusively on code that you actually wrote I think a strong case could be made that you own the copyright on that work product. But when you train that model on other people's work, especially without their consent or use a model that has been trained in that way you lose your right to call the output of that model yours.
You did not write it, and the transformative process requires terabytes of other people's IP and only a little bit by you.
As soon as you can prove that your contribution substantially outweighs the amount of IP contributed in total you would have a much stronger case.
>> Indeed a developer owns copyright over the source code and over the compiled binaries, because there is no expansion happening here, just a translation from one format into another ... do not rely on a mountain of other IP
... and, the license agreement of the compiler and libraries used / linked to practically always explicitly waive copyrights over the said non-mountain of IP.
>> As soon as you can prove that your contribution substantially outweighs the amount of IP contributed in total you would have a much stronger case.
... a much stronger case that you have a partial copyright over the work, which is now likely a derivative work. You still may not have a case that you own the copyright exclusively (or as the original article says, that your employer does).
>> No, that human owns the copyright on the prompt, not on the work product.
I think I may have misunderstood your original comment above. It seems intending to say:
No, that human owns the copyright on the prompt, not necessarily on the work product. The human may partially have copyright over the work product as well, "how much" being dependent on how much new creative expression from the human was involved vs that from others.
Both the compiler (in the absence of copyrighted libraries being included) and the LLM are considered not to add creative work, and thus do not change the copyright status of the works they transform.
You can consider the training set of the LLM or other AI model to be 3rd party libraries, and the level of copyright that passes from them to the final output to be a question of how much of that output is directly derivative - just as reading copyrighted code and being inspired by it does not pass that copyright to your work unless your work is obviously derivative.
>> You can consider the training set of the LLM or other AI model to be 3rd party libraries ...
I like this comparison -- training set as '3rd party libraries'. Except, of course, that the authors behind the training set may not have actually granted permission to use, whereas the 3rd party libraries usually have some permission by way of license.
The law only cares about how the work is distributed - if you acquired it legally by purchasing it, then yes, you can train an LLM on it, and with the exception of moral rights in places like the EU, the author does not have more to say about it.
It's treated the same as human reading and learning from the work.
Under US law you are only granted an artificial monopoly on acts of distribution.
What is interesting is that I used to write a program and get a binary executable back from the compiler, and I'd have copyright on the source and on the binary. Now I write a prompt and get a binary executable back from Claude, and I have copyright on the prompt but, depending on my creativity, I might be able to have copyright on parts of the output binary. The questions remain: how much, which parts, and how the hell could anyone ever tell. This really puts the color of the bits through a vat of dyeing solution.
> If that were true, a developer may own copyright over the source code, but nothing on the compiled binaries, and I could download practically all software available as compiled binaries and use for free.
If the compiled binaries (output) were produced by running the input (source code) over every program written, then sure.
But that's not what's happening with compilers, is it? The output of a prompt is dependent on copyrighted work of others every single time it is run.
The output of a compiler is not dependent on the copyrighted output of every other program.
1. The "every"ies in your comment are not to be taken literally either. :-)
>> If the compiled binaries (output) were produced by running the input (source code) over every program written, then sure.
2. More importantly, the above seems cyclically dependent on whether output from generative AI is deemed to be in public domain or not, which I consider is an open-ended issue as of now. It is not so 'sure' as yet. :-)
Copyright isn't some natural state of being though, it's something that's granted to people by the government to "promote the progress of science and useful arts". If copyright hinders things then I think it's reasonable that exceptions would be made.
This analysis yields very different results under utilitarianism vs rule utilitarianism.
Under the former, you could argue, "What I'm doing is a science or useful art, so if copyright exists to advance those things then taking a more permissive interpretation of copyright to allow my efforts to succeed is in the spirit of the law."
Under the latter, you could argue, "Works get published because as a rule, researchers and artists know they have lawful recourse through copyright if the work gets used without their consent. The absence of that rule incentivizes safeguarding works by treating them as secret and each disclosure as a matter of personal trust, so the existence of that rule promotes the sciences and useful arts."
I agree with this sentiment, because the person directing the agent can still direct it in a way where it'll produce a better or worse output than another person directing it.
If the LLM generates output that a court decides is sufficiently derivative, and especially (but not necessarily) if the LLM was trained on the source material being infringed, then whoever redistributes the derivative output is going to be liable for copyright infringement.
Creation of the LLM itself is transformative, but LLM output which infringes is not.
So if someone stole an entire code base from a vibe-coded app built on a non-permissively licensed project, and that person claimed it was derived from an LLM and not stolen at all, would the person who stole the code not be a thief because it came from the same place? Or are they a thief because someone else copyrighted it first? How do vibe coders protect themselves without knowing who else has the same derivative code or who holds the copyright first? Or can't they?
The only thing a vibe coder should be able to copyright, is the prompt text they wrote. Not the output of the LLM, only the text they wrote to instruct the LLM what to do. And even that is pretty iffy, because most of it like "put a button on a page" is not copyright-able.
I could possibly see an argument for the owner being whoever paid for the tokens used, but honestly I think the argument for that is weaker than what you're suggesting; I'm merely playing devil's advocate here.
I don't think there's even a valid argument for any other ownership model, or at least none that I can think of.
I see the argument for whoever paid for the tokens. Or in the case of a free AI usage, the person who sent the prompt (or whoever they are acting on behalf of, i.e. the company they are working for at the time).
The primary issue being that it's all built on stolen data in the first place.
Even taking the least generous interpretation of what LLMs do and saying they're just "copy/pasting others' code" it's still not stealing because the original still exists and presumably still makes money. The original has to be gone for theft to have occurred.
In order to have a sane conversation about this we have to all agree not to lie.
I've created my own DSL, and instruct Claude Code how to generate code for this DSL using skills.
Since this is a new language, not documented on the web or on GitHub, Claude's ability is not based on stolen IP. At best it's trained on other language concepts, just as we can train ourselves on code from GitHub.
Maybe a good reason to create a new programming language?
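For concreteness, here is a minimal sketch of the kind of thing I mean - a toy pipeline DSL with a tiny Python interpreter. The verbs and names here are hypothetical, not my actual language:

    # Toy pipeline DSL: each line is "verb arg", e.g.
    #   load data.txt
    #   take 10
    #   save out.txt
    def run_pipeline(script, rows=None):
        for line in script.strip().splitlines():
            verb, _, arg = line.strip().partition(" ")
            if verb == "load":            # read input lines from a file
                with open(arg) as f:
                    rows = f.read().splitlines()
            elif verb == "take":          # keep only the first N rows
                rows = rows[:int(arg)]
            elif verb == "save":          # write the result out
                with open(arg, "w") as f:
                    f.write("\n".join(rows))
            else:
                raise ValueError("unknown verb: " + verb)
        return rows

None of this surface syntax appears in any training set; what the model contributes is general knowledge of how languages and interpreters work.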
Interesting, but I still do not think it is that easy. The AI model is still trained on existing works, and the code it generates in the new DSL or programming language is still based on higher-level ideas and expressions it consumed during training. You have added just one more level of indirection. The output can no longer be a verbatim copy of some existing work or of non-trivial snippets; however, it may still carry "expression" that is substantially similar to something pre-existing.
Note: IANAL. The above is just from my current understanding.
You can think that's how it should be. But that's not necessarily how it is. I'm reminded of the famous monkey selfie copyright dispute [1]. A photographer set up a camera and gave it to a monkey but after a legal dispute, courts decided nobody owned the copyright.
I can totally see this applying here as well.
Now this doesn't resolve the issue of AIs being trained on copyrighted works it had no rights to. The counterargument is that this is a derivative or transformative work but I don't believe that's settled law at all.
Do you think that human directing the agent owns copyright for any legal reason?
The case Community for Creative Non-Violence v. Reid (https://en.wikipedia.org/wiki/Community_for_Creative_Non-Vio...) established the Supreme Court's position that commissioning a work and directing its author does not grant authorship to the commissioner of the work; it grants authorship to the person actually doing the work.
The author can grant authorship and copyright to the commissioner with a contract, but the monkey picture (and other cases) have established that only humans can be granted copyright. Since LLMs aren't human they can't hold copyright, and if the LLM doesn't have legal copyright then it doesn't have legal rights to assign copyright to you.
Interesting, though, that ownership of the code can still be transferred to the employer. So it's in the public domain (because not human authored) but owned by the employer (because the human and/or LLM was employed by the employer)? I don't really understand how this works.
I think what this means is that the employee may not be the copyright owner for multiple reasons, which are possibly applicable simultaneously. It does not imply that the employer owns copyright over the work that is in public domain, which would be a contradiction.
Copyright works on derivation rules: is the component of the work unmistakably derived from another copyrighted work?
Under at least the EU AI Act, work done by AI is not granted copyright. But that does not mean copyright does not apply; it means the amount of work credited to the AI is set at 0% (a simplification). A human working off another's work, unless it's a perfect copy, will get "credit" for changes that are judged creative/transformative, meaning a human plagiarizing something can still claim some degree of authorship. An AI won't.
In a sense, the copyright status of the final work is a sort of "sum with dilution" where each work involved adds to the claims, but the AI's output is set at 0 - the prompt and any further rework by a human are not.
As for the employer: details vary, but generally "work for hire" rules and contracts handle the reassignment of material rights (in the EU and some other places you cannot reassign moral rights, which are a different thing).
When you write code by hand, you are the author. As part of your contract with your employer you grant copyright and authorship to your employer by default (as stated in the contract).
The LLM is not employed by you or your employer, because you can't enter into contracts with non-humans or non-human organizations.
When you license a non-LLM code generation service (like a page that creates a website for you), that company owns the copyright of the generated website, because their deterministic system generated the code by rules and mechanisms the company defined. Assuming no LLM is part of that, no code is generated outside of the rules that they defined (it's not filling in blanks that you or the code generation system didn't explicitly define).
Since they own the copyright of the website, they can then assign the copyright and authorship to you because of your license agreement to them.
Since the LLM is filling in the blanks on its own in undefined ways, it is the author, not Anthropic/OpenAI/etc. That means that even though you have a license agreement with Anthropic/OpenAI/etc. to transfer copyright, they didn't have copyright/authorship; the LLM did. And since the LLM can't legally own copyright/authorship (since it isn't a human), it can't grant it to you, and you can't then grant it to your employer.
yeah this is what I understand - that it's the copyright ownership that is being transferred. But if there's no copyright to transfer, what does the company end up owning?
It depends on what level of creative control you had over the code.
Code is protected by copyright as a literary work. The method is not protected by copyright, that would be the domain of patents. What's protected are the words.
If you say "Claude, build me a website about X" then you do not have any creative control over the literary work Claude is producing. You just told a machine to write it for you. Nor, unlike the case of a compiler, is the output derivative of any other work that you wrote.
If, on the other hand, you are working jointly with Claude to make specific changes to the code on a line-by-line basis, then you will have no problem claiming copyright over the code. Claude in this case is acting as a tool, but there's still a human making decisions about the code.
In the case where you wrote a bunch of markdown and then told Claude to generate the corresponding code but didn't have any involvement in writing the code itself, you could perhaps claim that the code is a derivative work of the markdown, a court would have to handle that case-by-case basis and evaluate how much control you exerted over the work.
I don't think case law totally supports the idea that working on a line by line basis means you have "no problem claiming copyright".
The problem is that the LLM is still making its own choices on that line of code, and thus it's still the main author (based on existing case law and the copyright office's current opinion).
The markdown case is definitely more like the case I cited, where the Supreme Court decided that specifications and back-and-forth direction do not make the result a derivative work; the actual implementor is the author, not the spec writer.
No, a copyright application can be filed with a corporation listed as the author. Watch for the copyright notice at the end of the next major movie you see.
However, until very recently the creative product must have been created by someone so there is an implicitly created copyright over the product in the first place. With AI output, that might not continue to be true, we don't really know how it'll work out yet.
In any case, the corporation did not create the product, people created it and their contractual relationship with the corporation defined how the ownership of that work was managed. So, I don't find it too unusual that this element of personhood is available to corporations.
The employees and contractors are the authors, and because of the contract they sign they assign copyright to the corporation. Corporations, as a collection of humans are allowed to have authorship.
LLMs are not companies and they are not humans in any way shape or form, and thus cannot get copyright nor grant copyright to a third party.
It's not that corporations can't hold copyright. But a corporation cannot mechanically create "original works of authorship" by a purely mechanical process. That process is limited to human authors. "Works for hire" would be a common case of a human creator (author) resulting in a corporate assignment (ownership), see: <https://en.wikipedia.org/wiki/Work_for_hire>.
Notable cases:
- The "monkey selfie" copyright case, in which photographer David Slater arranged for monkeys to take selfies. Copyright ownership denied by both the US Copyright Office (against Slater's claim) and (in a separate case arguing the monkey should hold copyright) by an appellate court: <https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...>, Naruto v. Slater, No. 16‑CV‑00063 (N.D. Cal. 2016).
- Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991). Simple compilations are not copyrightable regardless of whether created by humans.
- THALER v. PERLMUTTER (2023). "[T]his case presents only the question of whether a work generated autonomously by a computer system is eligible for copyright. In the absence of any human involvement in the creation of the work, the clear and straightforward answer is the one given by the Register: No." <https://caselaw.findlaw.com/court/us-dis-crt-dis-col/1149169...>.
This interpretation makes sense. I think even the 'fair use' clause in the US doesn't protect LLMs. One argument I've heard often is that LLMs synthesize their training set to produce novel output in the same way a human would... That may be the case, but legally an LLM isn't a human. You can't look at the output of an LLM and say that it's 'fair use' with respect to its training set; it hasn't been established that an AI has the same 'fair use' rights as a human does. It's already pushing it that companies have this right (let alone an AI agent). Anyway, that's just one problem... This also ignores the fact that the researchers who compiled the training set COPIED the original copyrighted data in order to produce that training set. They either copied the entire work into the training set or fed the entire work directly into the LLM; in either case, at some point the entire work was copied verbatim into the LLM's input layer before it was ingested by the AI. The researchers copied the copyrighted content without permission.
Also, when it comes to code the case is even more damning, because the vast majority of the code which LLMs are trained on was not only copyrighted but subject to, at best, an MIT license, and even the MIT license, among the most permissive licenses in existence, still says clearly:
"Permission is hereby granted, free of charge, to any person obtaining a copy of this software"
The word 'person' is used very intentionally here.
I think there should be several kinds of AI taxes which should be distributed to all copyright holders. There should be a tax to go to writers (and book authors), a tax to go to open source developers and a tax for the general population to distribute as UBI to account for small-form content like comments and photography...
People invested a lot of time building their entire careers around the assumption of copyright protection; so for it to be violated on such a scale would be a massive betrayal.
The LLM is just a database. It's like saying 'I own the copyright to what comes out of an API because I crafted the query' or 'I own the copyright to the responses I get from the bots on the Starship Titanic because I crafted the message they respond to'.
That's not what's been established to date in US caselaw:
THALER v. PERLMUTTER (2023). "[T]his case presents only the question of whether a work generated autonomously by a computer system is eligible for copyright. In the absence of any human involvement in the creation of the work, the clear and straightforward answer is the one given by the Register: No."
I want this question to have an interesting answer, but everyone knows that if this question ever goes to the courts, ownership will go to the people in charge with the money. The idea that Anthropic may not own Claude Code just because Claude wrote it is wishful thinking.
Best part is, it's likely to have a different answer in every country, who knows what'll happen, not every country implicitly sides with the ones with the most money.
It's not wishful thinking, and ownership isn't a foregone conclusion.
Sure the courts could mint a communist society with a few weird decisions about property rights, but this being the US do you really suppose that's likely?
There's really no legal question of any kind here: models aren't people and therefore cannot own property (and also cannot enter into the legal contracts that would be required to reassign the intellectual property they don't and can't own).
The catch-22 is that the fact that models aren't people is only relevant if you treat them similar to a person. Like the US Copyright Office's opinion which treats it similar to a freelancer. If you treat the LLM as a machine similar to a camera, with the author expressing their existing intent through the tools of this machine, ownership is back on the table and more or less how it was before LLMs.
Well if the camera in addition to choosing autoexposure also decided how to frame the shots, which lens to use, where to stand, and everything else salient to the artistry of photography -- all without direct human intervention, then I would think the situation would again be analogous. If the camera could do all that because an intern was holding it, the intern would still own the shots even if their employer gave them the assignment.
That's why the intern signs an employment contract that reassigns their rights to their employer!!
The work-for-hire doctrine actually supports your intuition more than the AI authorship question does. The reason Anthropic likely owns Claude Code has little to do with whether Claude wrote it and everything to do with the employment contracts of the engineers who directed it. The DMCA takedown question is genuinely interesting though because DMCA requires the claimant to assert copyright ownership in good faith. If a court later found the codebase was predominantly AI-authored and therefore not copyrightable, the 8,000 takedowns could be challenged as bad faith DMCA claims. That is a different and more tractable legal question than the ownership one.
Work-for-hire doctrine doesn't automagically absolve you from IP law. Microsoft and Intel already learned this in the nineties when they paid the San Francisco Canyon Company to steal Apple code.
The San Francisco Canyon case is a good example of exactly the right distinction. Work-for-hire determines who owns the output, but if the process of creating that output involved copying protected material, the infringement claim runs separately. The piece makes this point in the open source contamination section: owning the output and having a clean chain of title to the output are different questions. You can own AI-generated code and still have a copyleft problem in it.
I have trouble believing that the DMCA claims would be found to be in bad faith when they were made at a time when the question of what degree of human input is required to acquire copyright on AI generate code hasn't been resolved at all.
It doesn't seem like bad faith to think that copyright is stronger than the courts end up thinking, just being mistaken.
fair correction, updated the piece to reflect this. Bad faith under DMCA requires knowing the claim is false, not merely being wrong. A good faith belief in copyright ownership, even one that turns out to be mistaken, is a defense. The more accurate framing is that if the codebase is found to be predominantly AI-authored, the takedowns would fail on the threshold question of whether there is a valid copyright to assert, which is a different issue from intent.
As a developer, the fact that my source code passed through a compiler - an automated tool - doesn't give the author of the compiler any claim on my executable code.
As an artist, the fact that I used, e.g., Rebelle to paint a digital painting, or that I used Lightroom (including generative AI to fill, or other ML/AI tools to de-noise and sharpen my image) in editing a photograph, doesn't give EscapeMotion, Adobe, or Topaz, any claims to my product.
Why, then, would there be any chance that use of a tool like Claude - a tool that's super-advanced to be sure, but at the end of the day operates by way of mathematical algorithms - would confer any claims to Anthropic?
> If a court later found the codebase was predominantly AI-authored and therefore not copyrightable
Is figuring out the appropriate prompts to use in directing Claude qualitatively different than using a (much) higher-level abstraction in coding? That is, there was never any talk, as we climbed the abstraction ladder from machine code to assembly to Fortran or C to 4GLs to Rust etc., that the assembler/compiler/IDE builder would have any ownership claim on the produced executable. In what sense can Anthropic et al. assert that their tool, which just transforms our directives to some lower-level representation, creates ownership of that lower-level representation?
Too late to edit, but OpenAI certainly doesn't want ownership of, or liability for, the CSAM they've produced. They certainly don't want ownership of/liability for code which does $ONLYAWFULTHING.
They won't want to own code that is malicious/illegal/used in crime, although it's really weird to me that no one (in law enforcement) seems to care that, for example, Grok generates CSAM, revenge porn, and probably other illegal things, so they'll probably get to have their cake and eat it too.
Those things have precise legal definitions, and it is not entirely clear that an LLM can even generate material that meets them - especially in the USA, where the First Amendment covers things that many would consider illegal (and that are illegal in other countries).
Zarya of the Dawn already settled it for Midjourney output: human-written elements were protected, AI-generated images were not. The character design didn't get copyright even though the human picked, prompted, and curated. Code isn't different. Prompting Claude to produce a function is closer to prompting Midjourney to produce a frame than to writing the function yourself.
The reason it feels different to engineers is that we're used to thinking of the compiler as the analogy. But a compiler is deterministic — same input, same output. An LLM isn't. That's the line the Copyright Office is drawing, and image cases got there first.
But is there anything stopping a human from applying for copyright in their own name? Does the fact that somebody can recreate the prompt invalidate their claim?
Copyright Office requires you to disclose AI involvement and disclaim the AI-generated parts. Zarya of the Dawn is the example — applicant filed for the whole graphic novel, got partial registration on the human-written text, refused on the Midjourney images. The reproducibility of the prompt isn't really the test. The test is whether a human made the expressive choices.
LLMs are amazing of course and we use them heavily ourselves - but not for modifying text that is to be posted to HN. Doing so leaves imprints on the language that readers are increasingly becoming allergic to, and we want HN to be a place for human conversation.
Not really. Copyright registration is pretty much automatic. The Copyright Office does not check for duplicates. Patent registration involves actual examination for patentability. Issued patents are presumed valid (less so than they used to be), but issued copyrights are not. You have to litigate.
The US does not have "sweat of the brow" copyrights. It's the "spark" that creates the originality, not the work. Which is why you can't copyright a telephone directory (Feist vs. Rural Telephone) or a copy of an uncopyrighted image (Bridgeman vs. Corel) or a scan of a 3D object (Meshwerks vs. Toyota). Or the contents of a database as a collective work. Note that some EU countries do allow database copyright.
Interestingly, a corporation can be an author for copyright purposes. The movie industry pushed for that. We may in time see AI corporate personhood for IP purposes.
> But a compiler is deterministic — same input, same output. An LLM isn't.
Temperature-0 determinism is a subject of active research. NVIDIA has tried but failed so far; DeepSeek V4 seems to have done it. I hope judges won't be swayed by this, and that AI-generated code won't be classified as uncopyrightable the way images have been.
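For anyone unfamiliar with what temperature 0 means here, a toy sketch of next-token selection (illustrative only, not any vendor's actual sampler):

    import math, random

    def next_token(logits, temperature):
        # Temperature 0: plain argmax. The same logits always yield
        # the same token, i.e. fully deterministic decoding.
        if temperature == 0:
            return max(range(len(logits)), key=logits.__getitem__)
        # Temperature > 0: softmax sampling; repeated runs can differ.
        weights = [math.exp(l / temperature) for l in logits]
        return random.choices(range(len(logits)), weights=weights)[0]

The catch is that on real GPU serving stacks the logits themselves can vary slightly between runs, because floating-point reductions and batching happen in nondeterministic order, so even argmax decoding can occasionally flip a token. That is the part the vendors have been trying to pin down.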
Fair point on temp-0. But I don't think determinism is what the courts will hang it on. A deterministic LLM still makes the expressive choices — naming, structure, control flow — that the human didn't make. The image cases didn't turn on whether you could re-roll the same Midjourney frame. They turned on who made the creative decisions. Same logic should hold for code.
Depends on the scale of LLM involvement. The copyright office left a pretty big carve-out for things that are human-sourced and then modified by LLM, or the reverse, LLM output that's modified by human intention. (They had to do this because there are already pseudo-random elements in digital artwork, like, say, render-clouds and render-noise, that might otherwise poison an artwork.) In fact I don't think this has been tested with "highlight area > prompt a change to this area of the image" workflows.
They also mention in the same document that were LLMs to more closely approximate deterministic tools, they would be open to reevaluating. That is, requesting X gets X without substantial wiggle room.
I don't think that last part has been tested with an extremely large set of prompts and human-generated input to create a more deterministic output. Even outside of code, where you see large prompts, creative-writing LLM tools such as NovelAI or Sudowrite can have pages and pages of spec for the LLM, sometimes close to 50% of the size of the final output.
Then there's testing, review etc, human processes confirming that the output meets spec, updating it where needed intelligently.
There are also foreign courts, with similar rules about human intention, that have found in favor of prompts only, where it could be demonstrated that multiple rounds of prompts were used to refine the image.
I wouldn't call this settled at all, tbh. And a lot of this doesn't require exposure: you don't need to own up to LLM use in a lot of settings, and proving LLM use is so difficult that it's easy to jump up the ladder from LLM (100%) to LLM (50%) and ultimately claim ownership.
The people who will get busted for this are basically just super lazy leaving ChatGPT responses in, failing to pay an editor, failing to modify images for anything more than layouts.
That's quite an impressive approach from the companies' perspective: let's use Claude Code first and think about who the code belongs to later.
I think the gold-rush approach happening right now around me (my company's EMs forcing me to work with Claude as fast as possible) shows the real short-sightedness of all the management people.
First - I lose my understanding of the code base by relying too much on Claude Code.
Second - we drop all the good coding practices (like XP, code review etc.) because Claude is reviewing Claude's code.
Third - we just take a big smelly dump on teamwork: it's easier and cheaper to let one developer drive the whole change from backend to frontend, even though there are (or were) two different teams, one for FE and one for BE.
Fourth - code commenting was passé, because "the code is its own documentation"... unless there is a problem with context (and there is). When humans were writing the code, not understanding the over-engineered code was their own fault. But now we take a step back for our beloved Claude, because it has a small context window... It's unfair treatment.
I could go on and on. And all those cultural changes are because of money. So I dub this "goldrush", open my popcorn and see what happens next.
I rarely see #3 yield better solutions, it's usually better to collaborate as a team on requirements and gotchas, but let one person own implementation.
Also, it's supremely easy to pick the wrong abstractions for the long term and lock in premature internal designs, which then start to starve for lack of human mental modeling - the modeling that lets someone explain, with accountability, how things work and what the plans are when an incident happens.
Also, if the wrong generalizations are introduced, coded correctly and reviewed and approved by AIs, then who's even driving really?
> Third - we just take a big smelly dump on teamwork: it's easier and cheaper to let one developer drive the whole change from backend to frontend, even though there are (or were) two different teams, one for FE and one for BE.
Agree with your other points, but IMO this one has always been better. You often need to design the backend and frontend to work with each other, and that requires a lot more coordination when it's separate teams.
One of the few things I do kind of like about LLM-assisted coding is that it's helping to bring back "lone wolf" programming. We currently default to using massive teams to build massive software because of all the work involved, but teams have a huge communication/documentation cost, and a lot can leak and be lost the more communication has to happen to get things done. Code assistants cut down on the "all the work involved" part, and I think will help to bring one-man shops back into fashion.
The fourth point about code commenting is the one that connects directly to the ownership question. When developers write comments to explain intent, those comments are evidence of human creative direction. When Claude writes the code and the comments, and the developer merges without adding their own explanation of the architectural decisions, the record of human authorship disappears along with the institutional knowledge. The documentation problem and the copyright problem are the same problem.
People have quickly forgotten: when Copilot was announced, there were warnings not to use it for company code because of the license attribution problem. So what's changed? That Anthropic is willing to defend and indemnify?
This is all well and good as an intellectual exercise, but in real life none of this matters. Almost no one thinks their code is copyrightable or seriously thinks their code is a moat. I've written the same chunks of code for a number of employers as has every engineer. We've all taken chunks from stack overflow and other places without carefully considering attribution.
This comes up in a few places as a kind of vindictive battle. One example is Oracle suing Google for too closely mimicking their API in Android. Here is the kind of code that was at issue:
    private static void rangeCheck(int arrayLen, int fromIndex, int toIndex) {
        if (fromIndex > toIndex)
            throw new IllegalArgumentException("fromIndex(" + fromIndex +
                    ") > toIndex(" + toIndex + ")");
        if (fromIndex < 0)
            throw new ArrayIndexOutOfBoundsException(fromIndex);
        if (toIndex > arrayLen)
            throw new ArrayIndexOutOfBoundsException(toIndex);
    }
And it was deemed fair use by the Supreme Court. Other times, high-frequency hedge funds have sued exiting employees, sometimes successfully. In America anyone can sue you for any reason, so sure, you'll have Ellison take up a feud with Page and Brin all the way to the Supreme Court.
In 99.9% of instances none of this matters. Sure, there's the technical letter of the law, but in practice, and especially now, none of this matters.
> Almost no one thinks their code is copyrightable or seriously thinks their code is a moat.
You'd be surprised! Among non-software management types, they often think of the code as extremely valuable IP and a trade secret. I'm a CTO and I've made comments before to non/less technical peers about how the code (generally speaking) isn't that big of a secret, and I routinely get shocked expressions. In one case the company almost passed on a big contract because it required disclosure of the source code (with an NDA). When I told them that was a silly reason and explained why, they got it, but the old way of thinking still permeates and is a hard habit to break.
Edit: Fixed errant copy pasta error. Glad that wasn't a password :-)
You're right, I guess maybe I mean in any serious actionable way. Senior, non technical people leave plenty of money on the table by thinking they're protecting something valuable or they have some kind of secret sauce. It's all silly is what I meant to say, and digging into the technicalities of whether your code is truly copyrightable is kind of pointless. It's all vibes.
The place where it concretely matters is M&A due diligence. Acquirers are now routinely asking about AI tool usage in development and running license scans as a condition of closing. A codebase that cannot demonstrate human authorship over its core IP, or that contains GPL contamination, creates a representation and warranty problem in the purchase agreement. For most companies day to day you are right. For the companies that get acquired or raise institutional capital, the question becomes very concrete very quickly.
Very interesting, I had no idea. That's probably going to be a very painful lesson learned by all the startups that have been pumping out AI code. I know of several just among my peer groups that will be shocked and dismayed by this. Thanks for sharing that!
That is exactly the gap the piece is aimed at. The M&A conversation is where this becomes concrete very fast, and most founders shipping AI-assisted code have not had it yet.
Eh, it does and it doesn't. PE investors are actively asking why more of their portfolio companies aren't generating codebases using Claude Code. You are right that lawyers are asking about code generated by LLMs, but this is CYA out of ignorance more than anything else (btw, many purchase agreements have funny representations like "your code is free of bugs", which is downright hilarious).
So these two things are squarely at odds with each other... meaning, I don't know of any PE acquirers who are actively terminating deals because the target acquisition's code was generated by an LLM, even if the lawyers try to get a rep about it into the purchase agreement.
For the record, I still have yet to have an M&A lawyer explain to me unilaterally that AI generated code is an infringement...hence the question "who owns the code Claude Code writes" is still open.
The tension you are describing is real and the piece does not capture it well enough. PE acquirers pushing portfolio companies toward Claude Code while their lawyers are adding AI code reps to purchase agreements is exactly the gap that will produce the first painful deal. The rep usually survives unsigned because neither side has done the analysis. When the first deal falls apart or a rep is breached post-close because of GPL contamination in an AI-assisted codebase, that will set the market standard faster than any court ruling.
> When the first deal falls apart or a rep is breached post-close because of GPL contamination in an AI-assisted codebase, that will set the market standard faster than any court ruling.
Assuming it ever does...first, GPL is hardly enforced and second, I feel like there is going to be enough money (e.g. Anthropic's own code it uses for the harness) that pushes back against it being problematic. We'll see.
Maybe LLM coding agents change the equation by making it much easier to adapt and use foreign and probably incomplete code. Getting you closer to competing with the original authors in a shorter amount of time than generating new code from scratch.
I work in M&A. Nearly every lawyer, accountant, investor, and software business owner thinks their code is singularly valuable and a trade secret. I find it hilarious and try to be as diplomatic as possible about why it's not. They will also willingly give their client list to a potential acquirer but get super cagey the moment a third-party provider asks for their code to be scanned.
This argument gets shut down easily when I ask why Twitch, a $1B business, didn't crater to its competition when its full codebase was leaked.
I’ve worked at too many places where I mused that if someone gave the source code to the competitors, it’d likely drive the competitors out of business as they tried to use it.
Keeping it proprietary probably has the greatest value in preserving the company’s reputation…
> Almost no one thinks their code is copyrightable
I think this is an unusual opinion.
Code may not be copyrightable in as small chunks as you put there, but in terms of larger pieces I think companies and individuals very often labour under the belief that code is intellectual property under copyright law.
If code isn't copyrightable, from where comes the GPL?
And why does anyone care if (for instance) some Microsoft code might have accidentally ended up in ReactOS, causing that project to need to go into a locked-down review mode for months or years? For that matter why do employers assert that they own the copyright in contracts?
I think it's the opposite - almost everyone thinks their code is copyrightable, outside of APIs and interop stuff, or things so simple as to be trivial.
If there is no artwork, there can be no copyright. If every character of the code to write is basically predetermined by the APIs you need to call, there is no artwork and no copyright.
Build a novel new API, and you'll be protected though.
It is based on the premise that if the proprietary licenses are valid, then also the open source licenses are valid.
So what is held as true is only the implication stated above, not the truth value of the claim that either kind of license is valid.
If the proprietary licenses are not valid, then it does not matter that the open source licenses are also not valid.
The open source licenses are intended as defenses against the people who would otherwise attempt to claim ownership of that code and apply a proprietary license to it, i.e. exactly what Anthropic and the like have now done, together with their corporate customers.
Of course, if it is accepted that the code generated by an AI coding assistant is not copyrightable, then using it would not really be a violation of the original open source licenses. The problem is that even if this principle is the one accepted legally, at least for now, both Anthropic and their corporate customers appear to assume that they own the copyright for this code that should have been either non-copyrightable or governed by the original licenses of the code used for training.
“Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.”
The copyright assertion is the very first line of the MIT license, and the right to copy the code is granted. Clearly a reasonable person would affirm that that license (and all similar licenses) are based on a premise that code can be copyrighted.
> It is based on the premise that if the proprietary licenses are valid, then also the open source licenses are valid.
> If the proprietary licenses are not valid, then it does not matter that the open source licenses are also not valid.
That’s not true. Imagine a world where proprietary licenses are made invalid.
In such a world a company could take open source code, compile it, and distribute it (or build a SaaS) without the source code.
Even if you only focus on licenses that don’t prohibit this, most of those licenses require attribution.
So even in a world where proprietary licenses were invalid, the majority of open source licenses would still have a purpose.
You’re attempting to split hairs to argue on a very subtle technicality, but you’re not even technically right.
MIT just disclaims all the author's rights except attribution. If it turns out the code isn't copyrightable, nothing really changes. A better example would be GPL.
I think it should be pretty clear that if you provided the tool the specification for the code you want, you have already provided creative input.
After all, is this not what happens with compilers as well? LLM agents are just quite advanced compilers that don't require the specification to be as detailed as with traditional compilers.
>it should be pretty clear that if you provided the tool the specification for the code you want, you have already provided creative input.
If you provided a human contractor with the specifications for the code you want, the courts have repeatedly made clear you have not provided the creative input from a copyright perspective, and the contractor needs to explicitly assign those rights to you if you want to own the copyright on the code.
Let's say we didn't have assemblers, but instead we would have three professions:
- Specifiers, who make the specification for the system
- Programmers, who write C code
- Machine encoders, that take that C code and write machine code for a CPU
Would it be that the copyright would then belong to programmers, if no other explicit assignments would be made?
---
Thinking about it, probably yes: copyright of the spec belongs to the specifiers, copyright of the C code belongs to the programmers, and copyright of the machine code to the machine encoders. Or would it depend on the amount of optimization the machine encoders do, i.e. whether it is creative or not? And how does this relate to the copyrightability of C compiler output, where optimizations can sometimes surprise the developer?
In music, you can have copyright for a composition (like, lyrics and sheet music), and then for a master record. If you sell a copy of a song, you generally have pay royalties to both copyright holders.
So, in your example, the specifiers would own the specification, the programmers the C code, and machine encoders own the machine code.
But the ownership wouldn't be complete. If you sell the machine code, you'd have to pay royalties to all three. If you only sold the C code, only to the specifiers and the programmers.
The compiler analogy is the right one to reach for and the Copyright Office addressed it directly: the question is not whether you provided input, it is whether the creative expression in the output reflects human authorship. With a traditional compiler, the programmer authors every expression in the source. With an LLM, the programmer authors the intent and the model makes the expressive decisions about structure, naming, pattern, and implementation. Whether that distinction matters legally is what Allen v. Perlmutter is working through right now. The summary judgment briefing completed in early 2026 and it may be the next landmark ruling on exactly this question.
Specifications are not necessarily creative input. E.g., if I write a prompt that just says “write a rate limiter in Python”, there’s really no creative input. I didn’t decide on the API, or the algorithm to bucket requests, or where to store counters, etc. I just gave it statements of fact, which are inherently not creative.
Compilers are different in that the resulting binaries are not separately copyrighted. They are the same object to the Copyright Office because one produces the other, in the same way that converting an image to a PDF is still the same copyright.
LLMs don’t do that. The stuff coming in may not be copyrighted, and may not be copyrightable. The stuff that comes out is not a rote series of transformations, there are decisions being made. In common use, running a prompt 10 times might yield 10 meaningfully different results.
I’m dubious the outcome will be “any level of prompting is enough creativity”.
Possibly; I'm not going to hazard a guess on what the Supreme Court will decide the exact bar is. I just don't think it will be either extreme. "Nothing is copyrighted" is too damaging to the economy, "everything is copyrighted" has weird impacts on non-LLM copyrights that conflict with precedent.
This is actually the opposite of what the copyright office has said. Directly addressing AI generated code/prompts, they compared it to someone who is commissioning art, describing to the artist what they want.
The copyright falls to the artist, not the person commissioning it.
Complicated in this case, because there is no artist.
It's well known that recipes cannot be copyrighted. But recipes are still protected intellectual property under trade secret law if the holder treats them as a secret.
Claude code itself is a trade secret, and it is not open source, so its own copyrightability is moot till you get your hands on a copy of it with clean hands.
Recipes cannot be copyrighted because they are not expressions of human creativity. Software written by AIs is also not an expression of human creativity, so the balance is tilted in favor of AI-generated copy not being copyrightable.
The Supreme Court or legislation could change this, and I'd guess there will be a movement to go in that direction, but till something like that succeeded it's not so.
> Software written by AIs is also not an expression of human creativity
I mean I'm not the biggest fan of AI on the planet by any means (which I think my post history would prove, lol), but isn't prompt design and steering the AI "human creativity"? In one of my AI-assisted projects I spent like a week in unending threads of posts trying to make the AI do stuff the way I wanted, testing the output, finding a bazillion of bugs and "basic bitch" solutions, asking for more robust this and edge case that. It felt like I wrote a novel. How is that not creativity (Crayon-eater or Picasso, creativity is creativity)?
To some extent yes. Your output at work is based on a combination of inputs from others in your organization, and is being paid for by your employer, so the organization owns the copyright on what you make for them.
I think from this view it makes sense that an LLM is a tool, and the operator of that tool (or their employer) can own the output.
The tricky part is when you squint and view an LLM with training input and prompted output as a machine that launders copyrighted input into customized output that is now copyrighted by a new owner.
A machine that vacuums up film reels and splices them according to a set of instructions by the user to create a compilation of recent animated Disney movies with the Shrek soundtrack superimposed would probably not pass legal challenges if the user of the tool attempted to claim full copyright on the output.
His prompt might be the result of human creativity, but even in that case it's more than likely not a copyrightable expression of human creativity.
A copyrightable expression of human creativity in that case would need to be substantial enough in size to carry an imprint exclusive to your boss.
"Why'd the chicken cross the road? To get to the other side" is not copyrightable. You can dress it up all you want: "why didst thy chickencock traverseth thee highway?..." etc. would not qualify as something exclusively yours/your boss's, because that trick is still rote.
BUT:
How do I love thee? Let me count the ways I like to see you work.
I love thee to the depth and breadth and height and number of your pull requests
My soul can reach, when feeling out of sight of your overnight toils
For the ends of being and ideal grace you provide me when you ship!
that would be copyrightable if it was original to your boss.
The law and the interpretation of the law do not have the tidy and necessary obsession with fencepost errors and corner cases. They deal with them by stepping back and asking, "what would an ordinary person think should be copyrightable, versus what would be more akin to the wordgames that clever nerds on the playground get beat up for?"
>isn't prompt design and steering the AI "human creativity"?
Yes, it is, but that does not make the AI's response an expression of human creativity, and therefore it is not copyrightable.
If you wrote down your teachings about prompt design and published a book, your expression of your creativity would be copyrightable, but your ideas expressed in the book would not be.
If you-the-creative-human's prompt design was written by you as a computer program, that expression of your ideas would be copyrightable. But other people could just express themselves by typing in what your program does without using your particular expression, and would not be stopped by copyright.
It's easy to get into the intellectual weeds questioning this, but just step back to this: copyright was intended to give authors an income from their work without stopping other authors from writing their own works. Everybody gets to write a King Lear play if they want; they just can't copy somebody else's expression of the ideas. What expression is trying to capture is "what makes you different from me, be we alike in most other ways."
As a funny sidelight, the titles of books and movies are not copyrightable, nor are they considered part of the copyrighted work, because they are not considered to, in a sense, "leave enough room to contain expressions of human creativity," although in the larger context they might contain creative puns or double meanings that make illuminating sense.
However, the title of a movie may be a trademarked term (like Pokemon, Xformerz, or whatnot) but trademark has a "type" of good or service component (line of business) and the trademark would apply to action figures (dolls) and pajamas (clothing), but not to the film itself.
I wasn't advocating for trade secrets as "equal" or "the way to go"; I was trying to explain in simple terms how to think about copyright issues in concordance with the existing legal structures.
People here without much experience were intellectually trying to reinvent wheels, and I wanted to save them time in structuring their arguments. I have been exposed to various tips of the legal iceberg, was thrilled to learn what I learned, and am trying to pass it on.
> Claude code itself is a trade secret, and it is not open source, so its own copyrightability is moot till you get your hands on a copy of it with clean hands.
> What should matter is intent, the human that gives the orders.
I'd like to hear more nuance with regard to this line of reasoning. Can you conceive of a model that contains highly non-trivial representations of IP owned by someone other than yourself? Can you conceive that you might "order" the model to "produce" that IP? What happens then?
Try this both for "open source code" as the IP, and "the novel I wrote", and "latest Hollywood movie". The model does not have to be a real model currently available. It's just a thought experiment.
Try also to elaborate on the sliding scale between "an AI model" and "a compression system".
> If you hover over a line of code in your application, coding assistance services will display code strings of supported function calls available through the coding assistance service that are also present in your current code file. Coding assistive services will retrieve snippets from publicly available open source code showing how others are using those same functions. 3. THIRD PARTY COMPONENTS. The software may include third party components with separate legal notices or governed by other agreements, as may be described in the notices file(s) accompanying the software.
I've read that paragraph multiple times (both in the original and in your post) and I don't see anything that says who owns the resulting text. Just where it comes from. Am I missing something obvious?
>will retrieve snippets from publicly available open source code
Pretty sure it depends on the license the open source project uses. I don't think it's too troublesome if the autocomplete was truly taken only from open source projects, but it wouldn't surprise me if most closed source projects are also weighted into these models...
>What should matter is intent, the human that gives the orders.
If you are instructed by your professor to write an application, do you own the copyright or the professor?
Suddenly, you think you own the copyright again. In fact, in every case, you think you own the copyright. Because of your feelings. That's a common opinion here on HN too. You don't hold this opinion on any logical basis, nor from any legal doctrine.
The fact is: Copyright law applies to human authors. AI is not a human.
This is of course assuming you take AI-generated code unchanged. But you don't, in my experience. And that generates a new work fully copyrightable even if the original wasn't. Just like how the fad a decade or so ago of taking Tolstoy and Jane Austen works and adding new elements -- "Android Karenina" and "Sense and Sensibility and Sea Monsters" are copyrighted works even if the majority of the text in them was from public domain sources.
I'm sure it's not quite that simple. Only the parts of those knock-off works that aren't public domain could be copyrightable. If you only own the copyright to ten lines in a 10k-line codebase, then it's probably fair use for someone else to just take the whole thing.
Anna Karenina is public domain, assuming you’re talking about the original? If you translate it then maybe you could release it under GPL, but bit odd?
I think you missed the "what if". It was just a point about how the constructed scenario might be different to the real scenario. Most AIs are not trained only on public-domain work.
You use humans to edit AI code? When you level up you are just using AI to write, AI to review, AI to edit, AI to test. Not a lot of steps left for meat bags.
AI for review is terrible, and through no fault of its own. It's our job to specify and document intention, domain, and the right problems to solve, and that is just hard to do. No getting around it. That's job security for us meat bags.
Skimming over the article, it's a lot about what the Copyright Office said and very little about what courts said. But the opinion of the Copyright Office doesn't have any legal force. Regulations passed by the Copyright Office would be binding, but their opinions are just opinions. We will have to wait until the relevant court cases reach a conclusion. And so far the ongoing litigation isn't even about that question; it's about infringing the rights of works that are in the training data.
Here's a question I have: if the AI generated image is of a character of which you own the IP, don't you have protections based on the character regardless of who gets copyright protections from authorship of the image?
Yeah, if you have a copyright on the character, the AI-generated image doesn't change that. It doesn't give you more or less protection than you already had.
> That's arbitrary and quite unproductive convo to be honest.
Yeah, but that's what the legal system ostensibly does. Splitting fine hairs over whether a derived work is "transformative" is something lawyers and judges have been arguing and deciding for centuries. Just because it's hard to define a bright red line doesn't mean the decision is arbitrary. Courts will mull over whether a dotted quarter note on the fourth bar of a melody constitutes an independent work all day long. It seems absurd, but deciding blurry lines is what courts are built to handle.
That makes no sense, because what if you refactor your code ad infinitum using AI? You spin up a working implementation, then read through the code, catalog the changes (interface, docs, code quality, patterns), and delegate to the AI to write what you would have written.
It's 100% AI code and it's 100% human code. That distinction is what's counterproductive.
Wrong. This territory was heavily covered in music before this code concept - it has to be “transformative” in the eyes of the law. Even going in and cleaning up code or adding 10-25% new code won’t pass this threshold. Don't bother arguing with me on this, just accept reality and deal with it.
My copy of "Sense and Sensibility and Sea Monsters" is explicitly listed as being copyrighted by Ben H. Winters in 2009 despite the majority of the words being Austen's, though. Perhaps music has different rules compared to text. I suspect Winters and his publisher have investigated the legality of this more than either of us have.
Jane Austen died long enough ago that her works are in the public domain, so Winters did not need a license to use them. That does not mean that he gained rights to her work: if he tried to sue someone for use of anything which appeared in the original, he would lose in court because it’s easy to show that copies made before he was born had the same text. This is also how they prevent people trying to extend copyright by making minor changes to an existing work: the new copyright only covers the additions.
There’s a very accessible summary of the United States rules here:
If you modify the work, that creates a derived work from whatever copyright the original works has, not a new work that is fully copyrightable.
As the article says in the TL;DR at the top, the code may be contaminated by open source licenses:
> Agentic coding tools like Claude Code, Cursor, and Codex generate code that may be uncopyrightable, owned by your employer, or contaminated by open source licenses you cannot see
> This is of course assuming you take AI-generated code unchanged. But you don't, in my experience. And that generates a new work fully copyrightable even if the original wasn't.
That's not how copyright works. The modified version is derivative. You can't just take the Linux kernel, make some changes, and slap a new license on it.
My opinion: copyright has mattered very little in the corporate world. Copyright is effectively meaningless with SaaS, and the compiled software run on your machine is protected more by technical controls and EULAs. A world where copyright didn't exist for software would look nearly the same for the commercial world. Trade secrets, NDAs, and employment contracts bind workers more than copyright. The only place where the question of copyright has real-world impact is open source, and even then only for more restrictive licenses such as the GPL.
What is being licensed by the End User License Agreement (EULA) is the copyright on the code and its artefacts (executable bytes, etc.) - you can't have an EULA without having the copyright to license.
You can have an EULA on anything; it's a contract. You don't need copyright to enforce terms that two parties have agreed upon. The only thing copyright can do is force anyone in possession of copyrightable material to honor an EULA. If you can only get software through approved channels, it's hard to avoid an EULA. You would have to obtain it through the same pirated channels you have to now.
One question I have is this: if an employee produces code predominantly generated by AI, it means that it is not copyrightable. Does that mean that the employee can take that code and publish it on the Internet?
Or is it still IP even if it is not copyrightable? That would feel weird: if it's in the public domain, then it's not IP, is it?
A recipe isn't copyrightable but is still protected under trade secret law. I imagine that the same would apply. I think the major difference with software copyright is that I can just decompile your binary, or copy a binary and give it to other people. For SaaS companies that don't distribute binaries, I imagine they basically have the same protections against rogue employees.
To look at it another way, just because some code I work on at my job is derived from open source MIT-licensed code doesn't mean I personally have the right to distribute it if my company doesn't want me to. I'd guess this comes under some generic "confidential information" clause in the employment contract.
Hmm, your example is different: if you manually write code, there is a copyright for it whether it is derived from MIT-licensed code or not. If you don't own that copyright (because your employer does), then you don't have the right to distribute it because it is not your code.
If you generate the same code with AI, now it does not have a copyright. If it depends on an MIT library, then the MIT library has a copyright and you have to honour the licence. But the code you produced does not have a copyright (because it was generated by an AI). And therefore nobody "owns" it. My question is: can your employer prevent you from distributing something they don't own?
This is a very long-standing and, AFAIK, never explicitly decided copyright and human rights question: if something is public domain, are contracts restricting distribution valid? Is our right to information or knowledge a fundamental human right that is not permissible to take from others, such that restrictions greater than those imposed directly by the State are invalid? In a healthy society, "I have created an extraction machine and your actions are hindering my extraction" is not a valid argument. So at the very least, contracts restricting rights to public domain works should be allowed only with heavy restrictions as to when, how, and for how long they are binding, much like the legality of non-competes has steadily been reduced in many places in recent years.
CC0 came about in part because of this ambiguity. To deal with it, part of CC0 basically says - even if there would still be restrictions to this if it were only in the public domain, I renounce those theoretical rights.
Outside the underdeveloped legal framework, I believe knowledge and truth are like life, and human society has some continued philosophical growth to do here.
That is exactly the right question and the answer is genuinely strange. Uncopyrightable work falls into the public domain, which means anyone can use it, copy it, or build on it freely. The employer can still call it a trade secret and protect it through confidentiality obligations in employment contracts, but that protection is contractual rather than property-based. A trade secret loses protection the moment it is disclosed. So the employer's claim over purely AI-generated code is essentially: "you cannot share this" rather than "we own this." Those are meaningfully different legal positions, and most companies have not thought through which one they actually have.
So employees are not allowed to distribute the code, but if it leaks, then it is public and the company cannot do anything about it. Correct? That's what happened to Anthropic I think?
Yes, and if the same code ends up in someone else's hands, they can state, "we didn't steal it, a GenAI generated it for us, the same as it did for you."
Given the non-deterministic operation of current GenAI systems (a major difference from compilers), it would probably be hard to prove either position.
Anyone can produce low-quality code, with or without AI. Agents have gotten exceptionally good, however, and everyone should be including them in their workflow if they're able to.
Depending on the scale. If you ask Claude to one-shot an app from a nebulous description, you get a prototype whose code you would understandably loathe to own. If you plan carefully and limit the scope, you get code that you understand, can approve of, and are okay owning further down the line.
I spent two and a half hours writing up a detailed outline for a small webapp. Claude popped it out in one shot, 100% working. I added features afterward, but the time you spend on a good outline saves hours later.
I think it was tor.com that last year had a story where the newbie hired for the corporate HR dept ended up being the last human left after all others were replaced.
The whole thing with GPL code seems like a mess and surely couldn't be set as actual precedent, right? It is totally infeasible for me to check every single GPL project on every code hosting platform to see if the code Claude etc. produces is too similar. If the training data used for the model were released to check against, that would be one thing, but you can't honestly expect someone to check every repo ever published to see if a model (whose training data you are not told, and which could therefore reproduce it) might've reproduced code from it.
That's not at all like checking the dependency chain of a dependency, as you can just read the licence of anything you're choosing to use. Surely the precedent would have to be that a model trained on GPL code has itself been infected by GPL, and therefore must have all source/weights released too, if the assumption here is that it can have embedded the code well enough to be able to reproduce it?
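To make the scale problem concrete, here is a minimal sketch in Python of even a naive similarity check, assuming you somehow had a local mirror of candidate repositories (the `gpl-corpus` path, the file glob, and the 0.8 threshold are made-up illustrations, not a real tool):

    # Hypothetical sketch: sliding-window similarity scan of one generated
    # snippet against one local corpus. All paths/thresholds are assumptions.
    from difflib import SequenceMatcher
    from pathlib import Path

    def scan(snippet: str, corpus_dir: str = "gpl-corpus", threshold: float = 0.8):
        hits = []
        n = len(snippet)
        for path in Path(corpus_dir).rglob("*.*"):
            try:
                text = path.read_text(errors="ignore")
            except OSError:
                continue
            # Compare the snippet against half-overlapping windows of equal
            # size: already O(corpus size x snippet size) for ONE mirror,
            # never mind every repository ever hosted anywhere.
            step = max(1, n // 2)
            for i in range(0, max(1, len(text) - n + 1), step):
                if SequenceMatcher(None, snippet, text[i:i + n]).ratio() >= threshold:
                    hits.append((str(path), i))
                    break
        return hits

Even this toy only answers "does it match the mirror I happen to have," which is the point: nobody outside the model vendor can enumerate the real comparison set.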
> Surely the precedent would have to be that a model trained on GPL code has itself been infected by GPL, and therefore must have all source/weights released
I don't see how this follows, unless we also agree that humans who have ever read any GPL code are themselves permanently tainted and therefore cannot produce anything that isn't influenced even slightly by said code.
Is it just because we think the robot does a better job at learning than we do? It's an impossible line to draw, I agree, but I don't agree that the answer is "well then everything must be considered tainted"; I say the answer is "ignore a vestigial concern of a bygone era."
The robot does a better job at reproduction. I don't think there exists a definition of "learning" unambiguous enough to make the claim that it learns better than humans. Specifically, published models don't learn at all -- after the training phase, the model weights are fully static.
Duplicating BSD-licensed code without copyright attribution and mention of the original license is just as much a violation of the original copyright -- that applies regardless of additional copyleft requirements imposed by the GPL. A different but no less serious restriction applies to all the code examples on MSDN: the license disallows using the samples in production code.
LLMs are effectively copyright laundering machines, and barring any indemnification clauses in the ToS (of course there are none), full liability lies with the user.
> but you can't honestly expect someone to check every repo available from all time to see if a model [...] might've reproduced code from it.
Well, if you care about not violating any licenses, you could buy services from an LLM provider that was only trained on code in the Public Domain (or code that the LLM provider licensed for that purpose), and/or buy some kind of legal guarantee from the LLM provider that the code produced is "clean".
Of course, that'd be much more expensive than current offerings, but it would reflect the real cost of software development, not just YOLOing it, from a legal perspective.
When I wrote a book, part of the contract with my publisher was that I had to attest that I actually wrote the book myself, that quotes were properly attributed etc. If you buy code-writing services, why shouldn't it contain similar clauses?
> It is totally infeasible for me to check every single GPL project on every code hosting platform to see if the code Claude etc produced is too similar.
I would say that choosing a tool that makes it infeasible doesn't actually excuse you from doing it.
Claude doesn't write code; the LLM writes code. Claude loops the LLM into writing consistent code. Humans loop Claude into consistently looping the LLM.
Who owns the code? Who owns a potato? If the code is the produce of the LLM and that costs tokens, the owner of the code is the one who paid for the tokens. Money, time, or attention: someone pays for the tokens and owns the code.
Similar to most entertainment: you have the right to consume, but not the right to adapt into your works and distribute them.
Even consumption is usually limited to private usage: in my country, a consumer subscription is not enough to broadcast in a cafe or even a waiting room.
I’m no lawyer but I feel that meta, my employer, wouldn’t be letting us go hog-wild with Claude code if they weren’t completely confident that they fully owned the outputs, whether we change it or not.
Meta's confidence almost certainly rests on the employment contracts and IP assignment clauses, not on a legal theory that AI output is inherently copyrightable. The enterprise agreement with Anthropic assigns outputs to the licensee. The employment contract assigns work product to Meta. Those two documents together give Meta a defensible ownership position regardless of the authorship question. The interesting gap is for developers using personal accounts or consumer plans on side projects, where neither of those documents exists.
I don't understand how a company can have copyright on code that is inherently uncopyrightable (in the unlikely event SCOTUS rules that way).
Worst case, meta will sue the programmer who produced infringing code.
I mean, if the code is not copyrightable, that does not mean anything; it's just public domain code, except that Meta will use good old security by obscurity to protect it. If somehow a Meta programmer vibe-codes, say, VVVVVV, and Terry Cavanagh recognizes it on his Facebook feed, sues Meta, and wins, all that will happen is that Meta will take down the copy of VVVVVV, fire and sue the engineer that vibe-coded it, and call it a day.
Seems to gloss over other kinds of contamination beyond GPL code: code from pirated textbooks, the problem of the entire language model being trained on copyrighted data, and the possibility of the training data containing various copyrighted code.
Anthropic "solved" this by intermingling the texts extracted from pirated books (illegal) with texts extracted from the physical books they bought and destroyed (legal), so no one can clearly say if the copyrighted material it spits out came from a legal source or not. Everyone rejoiced.
The intermingling argument is actually central to the Bartz settlement structure. The settlement required destruction of the pirated dataset specifically because commingled training data creates an unresolvable provenance problem. For deployers building on Claude, EDPB Opinion 28/2024 requires a documented assessment of the foundation model's training data legal basis before deployment. "We cannot tell which outputs came from which source" is not a satisfactory answer to a regulator running that assessment. I wrote about it before here: https://legallayer.substack.com/p/i-read-every-edpb-document...
They're only legal if training is fair use, and even then I don't think it's immediately clear what the legal status would be of verbatim regurgitation of code under copyright, or of code protected by patents.
AFAIK, I (as a human developer) can't assume that I can go and copy code out of a textbook, and then assume copyright and charge for a license to it?
The judge seems to have said it's because they "transformed" the books (destroying them after digitizing) in the process, and that made it legal.
> Ultimately, Judge William Alsup ruled that this destructive scanning operation qualified as fair use—but only because Anthropic had legally purchased the books first, destroyed each print copy after scanning, and kept the digital files internally rather than distributing them. The judge compared the process to “conserv[ing] space” through format conversion and found it transformative. - https://arstechnica.com/ai/2025/06/anthropic-destroyed-milli...
Interesting - so local models, like Google Gemini, are then likely pirated by this interpretation, because the model is distributed? Ditto open-weight models?
I've seen copyright notices that explicitly forbid use for AI training. Would this "transformation" argument still hold in such cases?
For example:
No Generative AI Training Use
For avoidance of doubt, Author reserves the rights, and grants no rights to, reproduce and/or otherwise use the Work in any manner for purposes of training artificial intelligence or machine learning technologies to generate text, text to speech, voice, or audio including without limitation, technologies that are capable of generating works in the same style or genre as the Work, unless individual or entity obtains Author’s specific and express permission to do so. Nor does any individual or entity have the right to sublicense others to reproduce and/or otherwise use the Work in any manner for the purposes of training artificial intelligence or machine learning technologies to generate text, text to speech, voice, or audio without Author’s specific and express permission.
Only if you also manage to purchase and destroy the source material, I suppose? In Anthropic's case it wouldn't have worked if they'd stolen or rented the books and then destroyed them; in the judge's eyes it was legal because they were legally purchased, then destroyed.
Nobody disputes that I own the copyright in a sound recording I made just by pushing the red button on my recorder. So it is a mystery to me that copyright to any sort of human conditioned machine generation is in dispute.
The sound recording analogy breaks down at the point where the recorder makes no creative decisions. Pressing record captures what is already there. Prompting Claude generates something that did not exist, through decisions the model makes about structure, naming, pattern, and implementation. The closer analogy is hiring a session musician and telling them the key and tempo. You own the recording under work-for-hire if they signed the right contract, but the creative expression in the performance is theirs unless explicitly assigned. The button you push to start the model is not the same button as the one on the recorder.
Fourier theory says that any sound, however complex, can be synthesized by summing sines and cosines. That's what an LLM does, if you twist the metaphor enough. It synthesizes complex outputs from simpler basis functions that are, or should be, uncopyrightable.
The fact that it inferred those basis functions from studying copyrighted works doesn't seem relevant. Nor does the fact that the "Fourier sums" sometimes coincide with larger fragments of works that are copyrighted. How weird would it be if that didn't happen?
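For reference, the identity the metaphor leans on is just the standard Fourier series (plain math, nothing LLM-specific):

    f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos(n\omega t) + b_n \sin(n\omega t) \right)

The individual sines and cosines carry no authorship; on this view, whatever expression exists lives in the particular arrangement of coefficients, which will occasionally reconstruct a protected fragment.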
Nobody is doing that, though. You might get a watermarked screenshot or stock photo now and then, or a couple of mostly-verbatim paragraphs from Harry Potter.
In any case, if the copyright mafia insists on butting heads with AI, they'll find that the fight doesn't quite play out the way it has in the past.
> Prompting Claude generates something that did not exist, through decisions the model makes about structure, naming, pattern, and implementation.
LLMs don't make decisions. Their output is completely determined by an algorithm using the human prompt, fixed weights, and a random seed. No different than the many effects humans use in image or audio editors. Nobody ever questioned whether art made using only those effects on a blank canvas was subject to copyright.
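The determinism point is easy to demonstrate with a toy sketch in Python; the `toy_generate` function below is a made-up stand-in for the real weights-plus-sampler pipeline, not how any actual model works:

    # Toy illustration: fix the prompt and seed and the "generation" is
    # fully reproducible; change the seed and you get a different but
    # equally deterministic output.
    import random

    def toy_generate(prompt: str, seed: int) -> str:
        rng = random.Random(f"{prompt}|{seed}")  # stand-in for weights + sampler
        vocab = ["alpha", "beta", "gamma", "delta"]
        return " ".join(rng.choice(vocab) for _ in range(5))

    print(toy_generate("hi", seed=1))  # identical on every run
    print(toy_generate("hi", seed=2))  # different seed, different (but fixed) output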
"if Claude was trained on the LGPL-licensed codebase and its output reflects patterns learned from that code, can the output be treated as license-free? The emerging legal consensus is probably not, and assuming it can creates significant liability for anyone shipping that code commercially."
Is there any citation for this "legal consensus"? I was not aware there were any evidence-backed stances on this topic as of yet.
This sounds like a problem that's pretty easy to get around.
CC does not need LGPL code. There's more than enough BSD and Apache code to go around.
And they can generate synthetic data that is better than LGPL for their training.
It's also a problem that does not seem feasible to meaningfully enforce.
It's easy to generate CC code and lie and say you didn't. It would be hard to prove that you did, especially if you took any precautions to make it even slightly difficult to show that you did.
Unlike GPL, BSD and Apache licenses do not claim to also cover your non-AI-generated code that only invokes the AI-generated code.
However, even if the BSD/Apache/MIT licensed code can be incorporated freely in your application, you still have no right to remove the copyright notices from it and/or to claim that you own the copyright for it.
Therefore, unless the AI model has been trained only on non-copyrighted public-domain code, incorporating the generated code in your application means that you have removed the copyright notices from it, which is not allowed by the original licenses.
There is absolutely no doubt that using an AI coding assistant works around the copyright laws, but it is still equivalent to copying and pasting fragments from copyrighted works into your source code.
I consider that copyright should not be applicable to program sources, at least not in its current form, so reusing parts from other programs should be fair use, but only if human programmers would be allowed to do the same.
> However, even if the BSD/Apache/MIT licensed code can be incorporated freely in your application, you still have no right to remove the copyright notices from it and/or to claim that you own the copyright for it.
I can't speak for all licenses, but I'm familiar with at least one BSD license. That's almost the entire point of it...
You cannot take their literal code and call it your own. You can derive code from it and call it your own. That's what LLMs primarily do.
The chardet dispute is the closest thing to an active test case on this specific question, and you are right that it has not resolved into settled law. "Emerging legal consensus" was imprecise. The more accurate framing is: the legal community's working assumption, based on how copyright doctrine treats derivative works, is that training-data provenance travels with the output. That assumption has not been tested definitively in court yet.
With sufficient obfuscation (which models seem to provide intrinsically), how would anyone know to sue? On top of that, only the most major sorts of litigation have the legal force to pierce even the flimsiest of obfuscation... this is likely all moot.
If some GPL-licensed group were to sue some commercial software project that they do not have the source code for, what would even give it away? But say they throw $1 million at a lawyer who can at least get it to the discovery phase somehow, and the source code is provided. It looks to be shit, but maybe an expert witness would come along and say "that looks inspired by the open source project." Where does it go from there? The model is a black box, but maybe you've got a superhero lawyer who manages to rope in Anthropic or OpenAI, and you can see how it produced the code given those prompts. What now? Are there any expert witnesses who both could and would say that it was "bulk copy-pasting code"? And if it were, what jury is going to go for that theory of the crime? Copying and pasting, but the code doesn't match, except in short little strings that any code might match. This isn't a slam dunk, and it's not going to proceed very far unless it's another Google-vs-Oracle shitfest.
Why should it be any different than it ever was? If a release manager checked it but didn’t catch the vulnerability, they have some culpability. If the developer shipped the code without checking it, they have some culpability too. Ultimately, if they both work under an organization that they report to, they’re responsible to that organization, which is, in turn, accountable to its customers (and investors perhaps.)
> What to preserve:
Commit messages that describe what you changed and why, not just what the AI generated. “Restructured Claude’s module architecture, rejected initial state management approach, rewrote error handling from scratch” is evidence. “Add rate limiting module” is not.
> The second commit message versus the first is the difference between a defensible authorship claim and a clean “Claude wrote this” record.
That makes no sense to me, as the commit message is probably LLM-generated as well (and even easier to generate, as it doesn't have to compile or pass automated tests).
Three things matter when it comes to eating my breakfast sandwich:
1/ Was the pork in my sausage reared on a farm that meets agricultural standards?
2/ Was the food handled safely by the kitchen that cooked my food?
3/ Does the owner of the diner pay kitchen wages in accordance with labor law?
By contrast, I have no idea what went into the models I use, what system prompts have prejudiced it, and whose IP has been exploited in pursuit of my answer.
That's being charitable, really. In practice, the open secret of the AI industry is that the vast majority of training data is, for want of a better word (even if it is likely the most precise description), stolen data.
Probably, yes, but the burden of proof is with us, not them.
I'm already glad some companies have the guts to open their models because proving it for open models is probably a lot easier than for a model behind a service.
That's a matter of changing a law; it's all up to the people and their representatives. We talk as if everything is set in stone, but if there really is a will, there is a way.
The proof is the $stupid-billion infrastructure built and kept up to host mousetraps armed with free cheese made of virtue signalling about doing the right thing and sharing the code with the world for free.
The media industry loves to quote ridiculous numbers on lost revenue due to piracy, etc. Maybe rough ballpark numbers will get them to do something about this theft.
Can someone put a rough estimate on the potential revenue loss (direct and incidental) from training AI, with an industry-wise breakdown?
It’s wrong to stop progress. I just want to know what data went into my model and have access to the same data. The same way we have national libraries of books but with the caveat that I don’t really know how one is supposed to browse petabytes of OpenAI .zips like I browse old books.
If the data is proprietary (eg Meta’s stash of FB comments) then I am satisfied to be told it’s private and I can’t see it. If, however, the works were public then give me a URL if it’s live or a cached copy if it isn’t.
This is a big question that makes my employer nervous about using LLM-generated code, along with the even-more-unresolved question "what happens if the LLM outputs an algorithm that is protected by patent?" (particularly worrying because we know the base training included patent descriptions.) Questionable copyright can often be worked around (particularly since we don't distribute source) but infringing on a patent can destroy a company.
The elephant in the room, of course, is what constitutes “meaningful human authorship.” However, I cannot shake off the feeling that all user interactions with these AI models are being logged. Perhaps this may turn out to be the bigger concern in a potential legal battle than code authorship.
The meaningful human authorship question is the elephant, agreed, and the regulators have deliberately refused to quantify it for exactly the reason you describe: any bright-line number becomes a target to game rather than a standard to meet.
The logging point is sharper than it might appear. In a copyright dispute over AI-assisted code, interaction logs could cut both ways. A plaintiff trying to establish human authorship would want the logs to show substantial architectural redirection, multiple rejections of Claude output, and documented reasoning for structural decisions. A defendant challenging that authorship claim would subpoena the same logs to show verbatim acceptance of output without modification.
The practical implication, I guess, is that developers who want to preserve a copyright claim over AI-assisted code should treat their prompt history as a legal document from the start. All over the world, it seems, the logs are the evidence. Whether they help or hurt depends entirely on what they show.
The bit about treating one’s prompt history as a legal document has really struck a nerve with me. I’ve been keeping a separate git history solely for my prompts. Initially, the goals were simple: reuse prompts, turn some into skills, etc. But in light of the insights from the article and the discussions here, I need to treat this practice as serious business.
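A minimal sketch of that practice, assuming Python and a dedicated `prompt-log/` directory that is already an initialized git repo (the layout and field names are mine, not any standard):

    # Hypothetical sketch: append each prompt to a JSONL file and commit
    # it, so the prompt history accumulates in its own git history.
    import json
    import subprocess
    import time
    from pathlib import Path

    LOG_DIR = Path("prompt-log")          # assumed to be a git repo already
    LOG_FILE = LOG_DIR / "prompts.jsonl"

    def record_prompt(prompt: str, note: str = "") -> None:
        entry = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "prompt": prompt,
            "note": note,  # e.g. which output was rejected and why
        }
        with LOG_FILE.open("a") as f:
            f.write(json.dumps(entry) + "\n")
        subprocess.run(["git", "-C", str(LOG_DIR), "add", LOG_FILE.name], check=True)
        subprocess.run(["git", "-C", str(LOG_DIR), "commit", "-m",
                        f"prompt: {note or prompt[:50]}"], check=True)

Whether such a self-maintained log would carry evidentiary weight is exactly the open question above; the point is only that the habit is cheap to automate.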
I wrote an R library doing some simple regressions using the GPU, with Claude. I asked it to provide the same API as lm, glm and some other base R functions. It copied their code wholesale without mentioning it to me. So, now my library is GPL… which is not a big deal in this context, but it was quite a shock.
Note to anyone reading this: the author is actively reading the comments and updating the piece based off reported issues. As a result, no meaningful discussion will take place here.
I think it's pretty clear cut, whoever is paying for your agentic coding tool subscription is part of the litmus test.
If I use my own computer, pay for my own subscription, and build my own open source projects, then the code belongs to me.
If I use my company's computer, they pay for my subscription and we work on the company's projects then the code belongs to the company.
At any step of the way, if some copyleft or other exotic open source license is violated, who pays for discovery? Is it the someone in Russia who created a popular OSS library and is now owed? How will it be enforced?
Twice in my career the owners of a company have wanted to sue competitors for stealing their "product" after poaching our staff.
Each time, the lawyers came in and basically told us that suing them for copyright infringement is suicide: it will be nearly impossible to prove, and the money would be better spent in many other areas.
In fact, we ended up suing them (and they settled) for stealing our copyrighted clinical content, which they copied so blatantly they left our own typos and customer support phone number in it.
Go ahead, try to sue over your copyrighted code; 10 years and $100M later you will end up like Google v. Oracle. What if the code is even 5% different? What about elements dictated by external constraints: hardware, industry standards, common programming practices? These aren't copyrightable.
Then you have the merger doctrine: how many ways can we really represent the same basic functions?
Same goes for the copyleft argument: "code resembling copyleft" is incredibly vague; it would need to be the code verbatim, not merely resembling it. Then you have the history of copyleft: there have been many abuses of copyleft and only ~10 notable lawsuits. Now, because AI wrote it (which makes it _even harder_ to enforce), we will see a sudden outburst of copyleft cases? I doubt it.
Ultimately anyone can sue you for any reason, nothing is stopping anyone right now from suing you claiming AI stole their copyleft code.
The documentation advice is practical, but commit messages and prompt logs are self-reported. "Meaningful human authorship" needs a verifiable evidentiary chain, not attestations.
> Code that Claude Code or Cursor generated and you accepted without meaningful modification may not be copyrightable by anyone.
Except if it happens to regurgitate a significant excerpt of some existing work, then the authors of that can assert their copyright; i.e. claim that it infringes.
Lawyers I have spoken to have stated strongly that they believe collective works doctrine will provide strong protections for most mature and sizable software. I see no mention of these considerations here.
Did Claude Code not start out as human input? Would it not be safe to say that a reasonable amount of it is still human input? But also, just because it's mysteriously "not theirs" doesn't mean they magically have to give you the code.
This particular AI-ism really encapsulates what annoys me about some AI-isms. I don't mind the delves and the em-dashes that just give away the AI source of what otherwise might be good text. But these structural pieces just feel fundamentally not for the reader. Part of it is blatant pick-me language for the human feedback ("hey look you wanted plain language I did that") and part of it feels like it's just helping the future token stream (thinking-like tokens polluting the actual text).
The not-this-but-that, the sycophancy, the symbolizing-vague-significance, they all have this flavor of serving a process that's no longer there as I now need to read it. It gives a similar sickening feeling to the one I get seeing something designed by committee.
Claude is not a legal entity, it is a software tool that outputs text based on statistics. There is a user that used a tool to create text and that user is the legal entity responsible for the text in any legal way that matters.
Anything else would be completely ridiculous given current laws in most countries.
It would be as ridiculous as blaming the car in a car accident where you drove over someone.
>It would be as ridiculous as blaming the car in a car accident where you drove over someone.
No more ridiculous than you posting something you know nothing about.
Just because you don't get the copyright doesn't mean Claude does. The fact that Claude is not a legal entity has no bearing on whether or not you are entitled to a copyright for a work you did not create.
Those "statistics" that the output is based on are often under licenses that forbid making proprietary software with them for example. It is not the same as using Word.
The statistics generally are not. But the data used to learn the statistics may have been under license.
Learning from licensed material is generally accepted for humans: you may learn from something and then create something else, and the new thing is not considered legally problematic, with the exception of patents, I guess.
Whether the same holds true for electronic systems is where people disagree, if you look at the problem space in its essence. I land on the side that it is the same thing (humans and electronic systems learning); some seem to think it is a different thing.
Maybe the useful test is not “who wrote this line?” but “can you show how it went from requirement/prompt/context to diff to human review/tests?” If you can’t, ownership is only one issue. You also can’t tell what was accepted as engineering work versus just copied output.
This is actually closer to how the Copyright Office thinks about it than the article makes clear. The registration guidance that emerged from the Thaler proceedings specifically asks applicants to describe the human creative contributions and how the AI was used. A documented workflow showing requirement, architectural decision, rejection of AI output, human restructuring, and review creates a paper trail that maps directly onto what the Office looks for. The "can you show how it got here" test you are describing is the practical version of the legal standard.
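As a sketch of what one record in that chain could look like (assuming Python; the field names are illustrative, not an official schema from the Copyright Office or anyone else):

    # Hypothetical provenance record for one AI-assisted change, mirroring
    # the requirement -> prompt -> diff -> review chain described above.
    from dataclasses import dataclass, field

    @dataclass
    class ProvenanceRecord:
        requirement: str                  # ticket or spec the change serves
        prompts: list[str]                # prompts sent to the coding agent
        rejected_outputs: int             # AI suggestions the human discarded
        human_changes: str                # summary of human restructuring/edits
        commit_sha: str                   # the resulting diff
        reviews: list[str] = field(default_factory=list)  # PR review / CI run ids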
Good overview of the issues. I'm sure there are a few nits to pick with that.
But something that is overlooked is that the world is bigger than the US and it's an absolute zoo out there in terms of copyright laws in different countries. Anything you think you might understand about this topic goes out of the window if you have international customers or provide software services outside the US. Or are not actually based there to begin with. And there are treaties between countries to consider as well.
Courts tend to try to be consistent with previous rulings, interpretations, etc. When it comes to copyright, there are a few centuries of such rulings. The commonly held opinions among developers that aren't lawyers are that AI is somehow different. And of course since the law hasn't actually changed, the simple legal question then becomes "How?". And the answer to that seems to involve a lot of different notions.
For example, "AIs are not people, and therefore any content produced by them isn't covered by copyright to begin with" is one of the notions brought up in the article. A lawyer might have some legal nits to pick with that one, but it seems to broadly be the common interpretation. So AIs don't violate copyright by doing what they do. In the same way you can't charge a Xerox machine with copyright infringement. Or Xerox. But you could go after a person using one.
And another notion is that any content distributed by a human can be infringing on somebody else's copyright and that party can try to argue their case in a court and ask for compensation. Note that that sentence doesn't involve the word AI in any way. How the infringing party creates/copies the content is actually irrelevant. Either it infringes or it doesn't. You could be using AI, a Tibetan Monk copying things by hand, trained monkeys hitting the keyboard randomly, a photo copier, or whatever. It does not really matter from a legal point of view. All that matters is that you somehow obtained a copy of an apparently copyrighted work. AI is just yet another way to create copies and not in any way special here.
There are of course lots of legal fine points to make to how models are trained, how training data is handled, etc. But if you break each of those down it boils down to "this large blob of random numbers doesn't really resemble the shape or form of some copyrighted thing" and "Anthropic used dodgy means to get their hands on copies of copyrighted work". I actually received a letter inviting me to claim some money back from them recently, like many other copyright holders.
Most of this is based on the copyright legal framework, which is surprisingly homogeneous around the world. The discussions about ownership of AI-generated material are exactly the same in the EU.
Copyright law kind of transcends national borders via certain international treaties like the Berne Convention. Which is why US copyright holders could enforce their "wouldn't steal a car" threats in Europe.
It’s the same as photography. No photographer built the multibillion dollar supply chain for the optics train in a camera, nor did they build the city scape they are enjoying as a background, they simply set the stage and push a button.
On a related note, another question: who owns the paper that Claude (or OpenAI) wrote? Should such paper submissions in conferences call out the model(s) used to write the paper itself?
This is the sharpest point in the thread. You are right: if the output has no copyright to begin with, there is nothing to assign. The employer's contractual claim over purely AI-generated code is not a copyright claim; it is a trade secret and confidentiality claim. Those are weaker protections: they require the information to remain secret, they do not survive disclosure, and they cannot be enforced against independent creation of the same code. Most IP assignment clauses in employment contracts were not drafted with this scenario in mind and may be claiming rights that do not legally exist.
The model ownership question and the output ownership question run on separate legal tracks and the piece focuses on the second deliberately. On the first: the model weights are owned by Anthropic under work-for-hire from their engineers regardless of what the training data contained. Training data copyright infringement is a separate tort claim against Anthropic, not a basis for anyone else to claim ownership of the model. The Bartz settlement resolved the pirated books claim without disturbing Anthropic's ownership of the weights. Owning the training data does not give you ownership of the model trained on it, any more than owning the paint gives you ownership of the painting.
I'm still flabbergasted that people – and big, visible companies with big targets on their backs – choose to keep on using the output of LLMs without having an answer to these questions.
And I'm worried that once that has been sufficiently normalized, laws and interpretations of them will adapt to whatever best suits those users. Which will mean copyrightwashing of FOSS. My only hope then is that surely if free software can be copyright-washed by the big guys, then so can the little guy copyright-wash the big guys' blockbuster movies or whatever, which might lead to some sort of reckoning.
The idea that the provenance of a given tool's code inherently pollutes the material it's used with seems kind of illogical. Wouldn't it follow from this premise that any code written using open source IDEs and debugged with open source debuggers and other tooling would itself then be considered copyleft? Are works written with LibreOffice not copyrightable?
There's obviously a huge issue with the legitimacy and ownership of training data being fed to LLMs. That seems like an issue between the owners of that IP and the people training the models and selling them as services more than the people using the tool. Isn't this just another flavor of SCO trying to extort money out of companies using Linux?
IMO this is the greatest argument against AI as technofascism. The general public seems to believe that AI will usher in technofascism by claiming corporate ownership of AI output: the independent entrepreneur will be unable to compete against the corporations compute, every piece of data about you will be stolen and monetized by AI, and you will own nothing.
But AI might in fact do the exact opposite and reverse the privatization trend that the West has been going through for the last 400 years. All of our copyright laws rely on the idea that there is a human consciousness behind the copyright. The more AI has input, the less we can claim ownership. If AI returns everything to the commons, then it results in a much more egalitarian world.
Hilariously, many people, especially artists, see the return of the commons as an assault against them. They're so captured by copyright that they assume any infringement on their copyright is inherently fascist. It's ridiculous. Copyright is a corporation's number one weapon when it comes to creating a moat and keeping the masses out.
The original intent of copyright, in fact, was an incentive to return an idea to the commons. Experts used to hide their discoveries in order to keep them for themselves. Copyright provided an opportunity to release this knowledge and still profit. There were even several cases where it was established that those who claimed copyright could retain copyright even if the idea had been previously discovered. This created a huge incentive: release the knowledge or risk having your process copyrighted by the opposition. But that system worked because copyright could only exist for so long (14 years, doubled if they filed again.)
Now copyright is a lifelong sentence at almost 100 years. The entire purpose of it has been undermined. Corporations own all your childhood and by the time you can profit off of it, it’s outdated.
A world where the mainstream is primarily a commons seems to me like an egalitarian world. I’d like to live in that world.
The original bargain you describe, limited term in exchange for public disclosure, is exactly what makes the current situation strange. If AI-generated output falls into the public domain immediately, that is actually closer to the original intent of copyright than 95-year terms. The legal question is whether that outcome happens by design or by accident, and what it means for the people building products on top of AI-generated codebases right now.
It seems the author unironically advises writing your commit messages like this: "Restructured Claude's module architecture, rejected initial state management approach, rewrote error handling from scratch", to have a chance at a defense in a potential court hearing. I find it funny, if vindicating for my personal approach. If the expectation is to "restructure, reject, rewrite" what "AI" spits out, why use "AI" at all at this point???
Copyright has a lot to do with what we as a society want to protect and encourage. We want to protect an author that put the hours into creating a book, as opposed to the person creating a copy of that work. The person copying can claim they put in work too but the claim is not strong enough to override our preference to protect original authors.
Part of the problem with generated works is that they are lower effort, like the work of the person copying something. It's not an activity that demands special protection the way original authorship does. I believe this is a large part of the reasoning.
AI is a monster to our current copyright system - monster in the philosophical sense, that is, an example that destroys the concept.
First, its creation is (claimed to be) extremely useful for society, but in order to be created it requires ignoring copyright for pretty much everything ever written. Something we kinda swept under the rug.
Then, it introduces an extreme jump down in creation effort; so if the focus is protection of effortful creation, nothing made with AI qualifies. But of course, you'd want society to benefit from effortlessness in general: spending more effort than needed on a task is the opposite of efficiency.
What if no meaningful thought was put into the code (entirely vibe-coded slop), but it’s made for your employer? Shouldn’t the work be uncopyrightable?
@dang, just wanted to say that the response to your statement does also seem to be AI generated. Dead-internet theory is turning real day by day, oof.
On that matter, wouldn't an AI flag for submissions help HN? I wouldn't flag a submission for LLM style, as that is too harsh, but I don't want to read them, if only because I don't like LLM prose.
There are so many submissions where most of the discussion is about whether the content has any human effort behind it, or whether the LLM played a purely assistive role like translating. It's really devaluing HN, IMO.
Not sure how much an AI flag would help, or whether it would introduce new issues, given how difficult the problem is, though.
Ask ChatGPT deep research to cite court cases, and it shows that dark-factory SWE code is not copyrightable under current precedents.
Even steering it with prompts isn't enough. The guy couldn't copyright the image he made with AI; code is no different.
Maybe prompts written by humans are copyrightable.
Can't wait for the billionaires to entrench in court that they can steal everything for these machines and claim it as their own, and maybe even reach for anything it helps produce. Fuck that
I find it distasteful and disturbing that copyright infringement by the people training the LLM in violation of a license is considered contamination by the licensed code. It’s not contamination. The code didn’t seep into your codebase. If the LLM was trained in such a way that portions of code long enough to be protectable then the license was violated by humans. The liability for the problem doesn’t lie on the shoulders of the contributors to the originally licensed code. It lies on the people inserting it into your codebase without following the terms of the license.
The article also singles out the GPL repeatedly as a source of contamination. It doesn’t mention source-available proprietary licenses. It doesn’t mention code put online with no clear license, which according to the Bern Convention and the laws in at least the United States is automatically copyright protected with no license for use by others at all. It doesn’t talk about attribution for BSD-style or CC-SA-Attribution licenses. There’s no mention of leaked proprietary code. It just singles out GPL as some sort of unique problem.
This seems quite shoddy and biased for an article by someone who’s writing about the law.
It is probably fair that a huge share of code that is Foss is licensed under GPL, much larger than the share of source available proprietary licensed code
I would have assumed the opposite is true. Do you have any data to back that up?
You would assume that there is more proprietary code available to read on the internet than GPL code? Do you have any rationale for that assumption?
Basically all GPL code is available on the web and there is a vast amount of it. I barely see any current non-FOSS code on the internet, although I think it would be fair to count the big projects who have been using pseudo-OSS licenses lately as proprietary. Wouldn't a safer assumption be a ratio of 10:1 or 100:1 for lines of GPL vs. lines of "shared source?"
Are you aware of non-GPL FOSS licenses?
There is a lot of code on the internet that isn't accompanied by a FOSS license or any license that permits reuse, or any license at all.
Is GPL a larger share of source out there than BSD, MIT, ISC, CC, BSL, Apache, and source available combined? Enough bigger that it is repeatedly mentioned as a singular issue without so much as the words “or other licenses”?
Here are GitHub statistics from 2015: https://github.blog/open-source/open-source-license-usage-on...
MIT is used by more projects than GPL.
That's the wrong metric, however. Thousands of small pet repos are unlikely to have more code than a single Chromium repo (mostly LGPL), Linux, Qt, etc.
> training the LLM in violation of a license
Bartz v. Anthropic found that this is fair use, so the license doesn't play into it.
I thought fair use was decided on a case by case basis, and could not be guaranteed? If true, wouldn't that mean that in other cases it could be ruled differently?
I don't have the exact ruling in front of me, but IIRC the judge pretty clearly said that training a model was fair use. IIRC, he declared it "quintessentially transformative".
The case by case basis was about acquisition and possession of the copyrighted material. Anthropic pirated a large number of books and illegally stored digital copies of many that they did purchase legally. The training being protected doesn't give them the right to violate copyright in that way.
Google, for example, purchased print versions of their training material and had a small army of employees digitize them and then delete the digital copies when they were done. That hasn't been challenged AFAIK, but would likely have been found to be not a violation. That's I think what was meant by case by case basis.
It's like if someone breaks into my house and I shoot them with my gun, that's very likely self defense, but if I'm not allowed to own a gun, I may still end up in trouble with the law.
Whether or not you’re pirating and making illegal copies of something depends greatly on the terms under which you’re allowed to make those copies. You can copy GPL-licensed code all day every day so long as you abide by the license. The same is true of the BSD licenses, MIT, ISC, Apache, et cetera.
If you’re copying or making substantially derivative works of them outside the terms of the license, you’re violating the copyright.
> If you’re copying or making substantially derivative works of them outside the terms of the license, you’re violating the copyright
I don't disagree with that.
What I'm saying is that the judge ruled that training a model using copyrighted books wasn't derivative. It was transformative, so the training wasn't a copyright violation.
He then went on to say that the way Anthropic acquired and handled that material was a copyright violation, because Anthropic pirated and copied a large number of books that were not under a license like the ones you mentioned. They downloaded a bunch of books you would find at most bookstores, and then actually purchased copies of them much later, once they were accused of violating copyrights.
I'm just trying to make that clear because I've heard a lot of people who don't understand that the violation wasn't about the act of training or material they used, it was just how they acquired the training material.
If the trained LLM spits out large, recognizable portions of licensed code and you use it in your product don’t count on that case to keep you from defending yourself in court. The court found in Bartz v. Anthropic that training was fair use. They also found that pirating content to train against was not fair use, and Anthropic paid $1,500,000,000 in a settlement.
There are licenses on most software source code. If you redistribute works derived from that code, you must abide by those licenses or you are violating the copyright. That's what's meant by "piracy" here.
Now if you have an LLM that has trained on code and learned to actually write new software, only small snippets too short to be protected by copyright should be identical between the training material and the output. However, if you’re getting output that is substantial in size and recognizably derivative from the original that’s an issue that hasn’t yet as far as I’m aware been settled in court. One would hope the major player LLMs don’t copy and paste large functional chunks of existing programs.
It would certainly seem to me that the code you sell after using an LLM should meet the same standards for difference in implementation as if it was written by a human. That should apply to both copyright protection and patent protection.
I find it pretty horrible that a company can pay a mere fine, a small percentage of its total funding, in exchange for materially benefiting from a conspiracy to commit a series of criminal acts.
If Anthropic hadn't pirated training materials, would they even exist? Would they still have been as competitive?
Would they still have gotten every bit of VC funding in anticipation of future successes derived in part from past crimes?
What's next? Armed bank robbery when VC funding dries up?
Also, fair use is much more limited in the EU. I don't know how it applies here or whether there were any rulings. Are you going to stop doing business with the EU (and Japan, etc.)?
The seller of the code has no visibility into the training set of the LLM. If the situation you're describing ends up being illegal, responsibility should fall on the LLM provider to provide tools to detect such overlap with their training sets (a rough sketch below), and on the clients to run those tools.
The provider of the LLM should want to enable this and to take on that responsibility (I mean take it from the clients), otherwise no one will want to use the tool. Maybe there could be AI tool-use lawsuit insurance, but I feel like that's worse than the copyright infringement detection tool for everyone involved.
I can see the tool happening in the EU, but nowhere else basically, especially in the US, the government sees "AI dominance" as a national priority and a national security priority.
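To sketch roughly what such a tool could look like: fingerprint every n-token window of the training corpus, then measure how many windows of a generated snippet are already in the index. All names and the toy corpus here are invented, and a real system would need something far more robust than whitespace tokenization:

    import hashlib

    def shingles(code: str, n: int = 8) -> set:
        """Hash every n-token window of a piece of code."""
        toks = code.split()
        return {
            hashlib.sha256(" ".join(toks[i:i + n]).encode()).hexdigest()
            for i in range(len(toks) - n + 1)
        }

    # The provider would precompute this over its real training corpus;
    # this stand-in "corpus" is obviously invented.
    training_index = shingles("def add ( a , b ) : return a + b  # toy corpus")

    def overlap_ratio(generated: str) -> float:
        # Fraction of the generated code's windows seen in training data.
        s = shingles(generated)
        return len(s & training_index) / max(len(s), 1)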
They probably focused on the GPL because of its viral copyleft features.
Do you suspect that an LLM that would recreate a substantial portion of a licensed work would honor any license? Even a 2-clause BSD one?
Rather, I suspect viral copyleft is why this lawyer is focusing on the GPL. It's the only(?) FOSS license that can force a proprietary codebase into the open.
Other than putting something into the public domain, I don't really know of any open source license that doesn't require at least attribution. One can assume that 99.9% of training data had some sort of license requirements, so just blindly using it is a copyright violation. People just don't seem to care.
Misstates the law. Denial of certiorari can happen for many reasons unrelated to the merits and does not settle the issue nationwide.
Fair and correct. Cert denial means the Court declined to hear the case, not that it endorsed the lower court's reasoning or settled the question nationally. The DC Circuit ruling stands and the Copyright Office's position is consistent, but that is stable doctrine rather than Supreme Court-settled law. Updated the piece to reflect this distinction accurately.
Since this is a tech audience... the Supreme Court uses a bounded priority queue (toy sketch below). An unbounded queue would risk growing impractically large.
There are some kinds of cases where the Court has "original jurisdiction," meaning they must hear them, but those are very rare.
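To stretch the joke, a toy version of that bounded queue; the capacity and priority numbers are invented, and the real cert process is of course not literally a heap:

    import heapq

    class BoundedDocket:
        """Keep only the k most cert-worthy petitions; deny the rest."""
        def __init__(self, capacity: int):
            self.capacity = capacity
            self.heap = []  # min-heap of (priority, case) tuples

        def petition(self, priority: float, case: str) -> None:
            if len(self.heap) < self.capacity:
                heapq.heappush(self.heap, (priority, case))
            elif priority > self.heap[0][0]:
                # Evict the least pressing petition: cert denied.
                heapq.heapreplace(self.heap, (priority, case))

    docket = BoundedDocket(capacity=70)  # roughly the cases heard per term
    docket.petition(0.2, "Thaler v. Perlmutter")  # dropped if 70 hotter issues exist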
Also, I don't think there is any case testing the conclusion. There is no case to point at where any of the factors they listed were deemed sufficient to convey authorship. I would love to be pointed to a case where rejecting decisions and redirecting to a different approach was deemed human authorship. What we do know is that you can disclaim the part of the code a human didn't author. In fact, the Copyright Office requires you to disclose and disclaim. If anyone out there has more factual and citable sources, please share.
You are right that no court has yet ruled that a specific set of human contributions to AI-assisted work was sufficient to establish authorship. What exists is the inverse: the Copyright Office has granted partial registrations where human-authored elements were separated from AI-generated elements, as in Zarya of the Dawn, where the human-written text was protected but the Midjourney images were not. The Allen v. Perlmutter case pending in Colorado is the first direct judicial test of whether iterative prompting and editing can constitute authorship. Until that decision, the positive threshold is genuinely unknown. The piece reflects this in the calibration section at the end, though your point is worth adding to the authorship discussion more explicitly.
It's in fact the opposite, from what I've read. In one of the Supreme Court cases cited by the Copyright Office itself in its opinion on AI works (https://en.wikipedia.org/wiki/Community_for_Creative_Non-Vio...), it is deemed that merely advising someone to do the work for you, and giving criticisms and revisions, isn't enough for authorship or co-authorship.
While it's not code related, the Copyright Office's opinion is a good read, and I don't see any reason to believe its opinion differs for works of text vs works of physical art: https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
The Supreme Court declining to take up an issue is taking a position.
Now, different circuits can take a different view of the same issue. This is a common reason why the Supreme Court will grant cert: to resolve a circuit split. Appeals court judges know this and have at times (allegedly) intentionally split to force an issue to the Supreme Court.
Even without settling the issue appeals courts will look at how other circuits have ruled and be guided by their reasoning, generally. The fact that the Supreme Court declined to grant cert actually carries weight.
The real issue is that the Thaler case answered a different question: "Can AI be an author?" The lower court said no and SCOTUS left it alone. But the question of "what is enough for the human to be the author" wasn't even part of the case. That remains completely unexamined.
Logically, I think there's a big difference between code which was produced from a simple generic prompt without other input vs code which was produced from multiple complex prompts with large existing code as input.
When I'm feeding AI my code as input and it ends up producing new code which adheres to my architecture, my coding style and my detailed technical requirements, the copyright over the output should be mine, since the code looks exactly like what I would have produced by hand; there is no creative input from the AI. It's just a code completion tool to save time.
I understand if someone leaves an LLM running as an agent for multiple days and it produces a whole bunch of code, then it's a very different process.
Fair point and worth being precise about. Cert denial is not meaningless: it leaves the lower court ruling intact, it signals the Court did not find the issue urgent enough to resolve now, and as you note, other circuits will look at the DC Circuit's reasoning. What it does not do is bind other circuits or establish Supreme Court precedent. The distinction matters here because if a Ninth Circuit case involving AI-generated code reaches a different conclusion, that circuit split would be live law regardless of the Thaler cert denial.
No it is not.
United States v. Carver, 260 U. S. 482, 490 (1923).
Moreover, SCOTUS does not decide issues, they decide cases.
Upjohn Co. v. United States, 449 U. S. 383, 386 (1981).
It does settle the law in as far as maintaining the status quo.
But it means that the appellate decision will retain precedence, no? Wouldn’t losing precedence be the primary legal effect of overturning that decision? All case law that hasn’t touched the Supreme Court could theoretically be challenged, but most of it isn’t, and it’s considered the law until it isn’t anymore, right? How would this be any different?
The decision is binding only within the jurisdiction of the Court of Appeals for the D.C. Circuit.
So it’s not correct to say “because SCOTUS denied cert, Thaler is now binding national copyright law.”
Practically speaking, it is binding on the US Copyright office (one of the parties in the case) in CADC. And that’s important. But copyright litigation happens all across the country, while this ruling only directly constrains the relatively small number of cases within CADC.
Yes, I didn't imply national precedence. I imagine it would also signal to attorneys appealing cases in other circuits that the same challenge will likely yield the same result.
Although this decision is not binding in other circuit courts, it is still something that you can bring to a judge in other courts. They are not required to follow this ruling because they are not in that circuit. However, they will still consider what other courts have said, and that is an incentive to think hard before they do something different. A judge who does something different is generally expected to write up a reason why, and that write-up would be given to an appeals court, if there is an appeal, for consideration of why the other court was wrong.
Yeah, I’ve heard lawyers use decisions in other jurisdictions to give weight to their line of reasoning. The SC saying they aren’t reviewing an appeal might not make that universally binding, but it signals that they don’t categorically reject the lower court’s decision.
I doubt any lawyer would mention that the SC didn't review this - that is meaningless and judges know it. They will, however, mention this case. Even if the case goes against them they will mention it, so they can say why it is wrong (the opposition will be sure to mention it, so they have to be prepared to take it down).
> meaningful human authorship
How is this defined? Is my code review "meaningful" ? Are my amendments and edits to the generated code "human authorship" ?
read the article?
From the article:
> Specifying an objective to the model is not enough. Directing how the work is constructed is what counts.
That's interesting but how is anyone supposed to prove it? They would have to get their hands on your prompts.
Leaks, whistleblowers. Some circumstantial evidence will also do if there's enough of it. Like having hallucinated parts of code that do absolutely nothing, and can't be explained as e.g. leftovers from a refactor.
> They would have to get their hands on your prompts
Unless you are running a local model, your prompts are almost certainly logged by your inference provider, and would only be a subpoena away?
That still sounds incredibly vague and open for interpretation. For example, is setting up md files defining how you want things to be written enough?
And what exactly does it mean to "direct" how the work is constructed?
If I enter dark factory mode and go live my life while it churns tokens, then it's not copyrightable, but if I interact with it at every turn, then it is?
From TFA:
> When the Supreme Court declined to hear the Thaler appeal in March 2026, it did not endorse the lower court's reasoning or settle the question nationally. Cert denial means the Court chose not to hear the case, nothing more. What it does mean is that the DC Circuit's ruling stands, the Copyright Office's position is intact, and no court has yet gone the other way.
Your quoted text is no longer in TFA.
c.f. OP’s comments in this thread.
Because the author acted on that comment.
Let's hire humans as pAIrrots? They see it, they rearrange it, they rename variables, and then they've "authored" it. What a job to start in as a junior; but if you understand what's happening, you may, with enough time, augment the AI's code by giving "feedback".
Free water but not electricity? I'll just hook up a generator to the shower...
These sorts of simplistic loopholes rarely work. Imagine if you could get copyright for the linux kernel by just rearranging it and renaming a few variables.
I wonder how much of linux and *BSD is in the windows kernel.
Is there really likely to be any? The design is very different, isn't it? Ghidra with LLM plugins is likely at a place where a determined person could find out.
Do you really think that, with that massive amount of open code, none would be injected into the Windows kernel (or even .NET with Mono, or the Windows userland with Wine)? It is easier to hide it: it is closed source, and they are probably using the same hiding tricks as those used to hide AI-generated code (usually some level of refactoring to adapt to Windows data structures).
Seems like way more effort to read the Linux code, copy and adapt it to Windows, and actively "hide" it, than to just write code that fits your situation from the get-go.
In my experience, reading and understanding code takes a lot more time than writing from scratch, so I don't really see what Windows developers (assuming they are somewhat competent coders; this assumption may not hold after around 2010 or so) would have to gain by copying from Linux.
If you write from scratch, you reintroduce "solved" problems. Which is why there's https://www.joelonsoftware.com/2000/04/06/things-you-should-...
Yep, that's why in many cases it is better to refactor already tested and debugged code.
Additionally, the size of the code base increases the difficulty of spotting 'obviously' refactored code from open source projects. There is a code complexity threshold. Coding AIs could help?
The only protection would be the honesty of Microsoft coders... wait, did I say "honesty" and "Microsoft coders" in the same sentence?
Judging from the attribution notices, Windows contains a non-trivial amount of BSD code.
Ah, the infamous "no, I wrote it myself" submission in university coursework. Usually gets you a free visit to the guidance counselor and a bonus free mark (on your three-strikes-and-you're-out plagiarism form).
Not with the way the ultrarich ones do it the old-fashioned way - paying a few people very well. Shows executive potential, after all.
It also contradicts everything else I have read about Thaler. AFAIK the ruling was that the AI could not hold copyright. Thaler waived any claim to being the copyright holder himself.
The last two bullet points on this page cover this:
https://www.authorsalliance.org/2025/03/19/thaler-v-perlmutt...
The site also explains the qualifications and experience in copyright law of the author of the above - unlike the article here.
Furthermore, we shouldn't even be looking to the Supreme Court at all for this. Congress needs to define the laws around AI and copyright. The Supreme Court is likely avoiding cases in the hope that the legislature gets its act together.
100%. This is the real fix, we have new situations, we need new laws. Unfortunately Congress is currently broken.
Personally, I think that the human directing the agent owns the copyright for whatever is produced, but the ability for the agent to build it in the first place is based off of stolen IP.
I'm concerned about the copyright 'washing' this enables though, especially in OSS, and I think the right thing for OSS devs to do is to try to publish resulting code with the strongest copyleft licensing that they are comfortable with - https://jackson.dev/post/moral-ai-licensing/
Funny how the copyright industry was able to spin copyright infringement into the pejorative "stealing". If you still have the item, what was stolen?
Dowling v. United States, 473 U.S. 207 (1985): The Supreme Court ruled that the unauthorized sale of phonorecords of copyrighted musical compositions does not constitute "stolen, converted or taken by fraud" goods under the National Stolen Property Act
I don't think it's unreasonable to consider it stolen potential profit, but agreed that's not how they spin it
I still find the idea that "learning" from code is "stealing" kind of ridiculous.
Learning, probably not.
Copy/pasting at scale, yes
It is learning though. It's not just copying the code.
Code gets turned into tokens, and then the model learns the next most likely token (a toy sketch below).
The issue that I see most people talk about is the scale at which it learns.
A human will learn from other people's code, but not from every person's code.
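A toy sketch of that loop, with a bigram counter standing in for the real thing. Actual models learn the probabilities by gradient descent over a transformer, not by counting, but the training objective, predict the next token, is the same:

    from collections import Counter, defaultdict

    # 'Tokenize' a scrap of code and count which token follows which.
    corpus = "for i in range ( n ) : total += i".split()

    bigrams = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigrams[prev][nxt] += 1

    def next_most_likely(token: str) -> str:
        # The model's 'knowledge' is just these learned statistics.
        return bigrams[token].most_common(1)[0][0]

    print(next_most_likely("in"))  # -> "range"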
The issue is that of copyright law WRT derivative works. Machine transformations of original works do not create a new copyright for the person that directed the machine transformation. That's why you can't pirate a bunch of media by simply adding a red pixel to the right-hand corner or by color shifting the video.
Copyright law is very clear that if a machine does it, the original copyright on the input is kept. This is why your distributed binaries are still copyrighted, because the machine transformed, very significantly, the source code into binary which maintains the copyright throughout.
It would be inconsistent for the courts to suddenly decide that "actually, this specific type of machine transformation is actually innovative."
I know this is generally really bad for the AI industry, so they just ignore it until a court tells them they can't anymore. And they might get away with it as I don't have faith that the courts will be consistent.
Shredding is a machine transformation. Does it mean that shreds retain original copyright even if the content can't be restored and the provenance can't be traced? Just an example that treating all machine transformations equally with no regard to the specifics doesn't make much sense.
And the specifics of autoregressive pretraining is that it is lossy compression. Good luck finding which copyrighted materials have made it into the final weights.
> Does it mean that shreds retain original copyright even if the content can't be restored?
Yup, it absolutely does. In fact, that's why you are still violating copyright law by using bittorrent even though each of the users is only giving out a small slice or shred of the original content.
The US has an affirmative defense for cases like shredding, called "fair use", but that doesn't mean or imply that a copyright is void simply because of a fair use claim.
> And the specifics of autoregressive pretraining is that it is lossy compression.
That doesn't matter. Why would it? If I take a FLAC recording and change it to an MP3, the fact that it was a lossy transform doesn't suddenly give me the legal right to distribute the MP3.
> Good luck finding which copyrighted materials have made it into the final weights.
That's what the NYT v. OpenAI lawsuit is all about. And for earlier models they could, in fact, pull out full NYT articles which proved they made it into the final weights.
Further, the NYT is currently in discovery which means OpenAI must open up to the NYT what goes into their weights. A move that, if OpenAI loses, other litigants can also use because there's a real good shot that OpenAI also included their works in the dataset.
> Yup, it absolutely does
Well, it's not the first time the law contradicts laws of nature (for the entertainment of future generations). BitTorrent is not a relevant example, because the system is designed to restore the work in its fullness.
> in fact, pull out full NYT articles
That's when they used their knowledge of the exact text they wanted to "retrieve" to get the text? It wouldn't be so efficient with a random number generator, but it's doable.
> BitTorrent is not a relevant example, because the system is designed to restore the work in its fullness.
You can restore shredded documents with enough time and effort. And if you did that and started making photo copies, even if they are incomplete, you will run afoul of copyright law.
BitTorrent is a relevant example because it shows that shredding doesn't destroy copyright.
Remember, copyright is about the right to copy something. Simply shredding or destroying a thing isn't applicable to copyright. Nor is giving that thing away. What's applicable is when you start to actually copy the thing.
I meant idealized shredding: a destructive transformation, which is still a machine transformation (think blender instead of shredder). When you need the exact knowledge of a thing to make its (imperfect) copy using some mechanism, it doesn't mean that the mechanism violates copyright.
EDIT: I don't say that neural networks can't rote learn extensive passages (it's an effect of data duplication). I'm saying that they are not designed to do that and it's possible to prevent that (as demonstrated by the latest models).
I'd assume it's still a copyright violation if you copied and distributed the shredded copy.
The way I arrive at that is imagine you add just 1 pixel of static to a video, that'd still be a copyright violation. Now imagine you slowly keep adding those random pixels. Eventually you get to the point where the whole video is just static, but at some point it wasn't.
Now, would any media company or court sue over that? Probably not. But I believe that still falls under copyright (but maybe fair use?).
The issue with neural networks is they aren't people. Even when you point your LLM at a website and say "summarize this", the output of that summarization would be owned by the website itself by nature of it being a machine-transformed work.
Remember, it's not just mere rote recitation which violates the law; any transformation counts as well. The fact that AI companies are preventing it doesn't really solve the problem that they are in fact transforming multiple copyrighted works into their responses.
When you point your browser at a website the browser creates a (transformed) local copy of the information that is owned by the website itself. The browser needs to do that to render the website on your screen. Is it a violation of copyright (that the website is willing to tolerate because it profits from advertisements)?
No, because your browser is dealing with the distribution of data in a way intended by the copyright holder. You also aren't redistributing the webpage after rendering. Client-side modifications fall under fair use, which is what keeps the likes of ad blockers and other page modifiers legal.
What would violate copyright is if you took that rendered page, turned it into a JPEG, and then hosted that JPEG from your own servers. That's the copying that would run afoul of copyright law.
A human is not a commercial product. Here we have a commercial product that was created by using a lot of various copyrighted and protected IP, without licensing agreements, without paying, without even citing it.
LLMs seem to be so devoid of intelligence, I think it's arguable if that's learning: https://machinelearning.apple.com/research/illusion-of-think... Typically, you would imply a level of understanding when you say learning. LLMs apparently can't do that, by design.
Copy/pasting at scale is how tons of software has been written for a long time, or have we all forgotten the jokes people used to make about StackOverflow?
If you can set a copyright trap and an LLM reproduces it, I think it's pretty clear cut that it's more than just "learning".
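A minimal sketch of such a trap; the marker string and function names here are invented:

    # Hypothetical "copyright trap": bury a unique, meaningless canary in
    # code you publish. Nobody would write this string independently, so if
    # a model emits it verbatim, that output was copied, not "learned".
    CANARY = "zq81_trap_4f19ac"  # invented marker, not from any real project

    def plant(source: str) -> str:
        # Publish code with the canary hidden in a comment.
        return source + f"\n# internal build marker: {CANARY}\n"

    def tripped(model_output: str) -> bool:
        return CANARY in model_output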
I have seen LLMs do all sorts of crap which was clearly reproduction of training material.
This is also why people are most impressed with how much better it is at reproducing boilerplate rather than, say, imaginative new ideas.
Remember last year (?) when one of the major AIs produced a bit of code that included Jeff Geerling's name in a comment?
If that were the case, then imagine having to give it back!
Yes I guess there's also no such thing as stealing in torrents since the computer "learns" the data and returns it in a transcoded fashion so it's technically not a reproduction. Yes LLMs can reproduce passages from copyrighted works verbatim but that's only because it "learned" it and it's just telling you what it "knows".
The mental calisthenics required to justify this stuff must be exhausting.
> The mental calisthenics required to justify this stuff must be exhausting.
It's only exhausting if you think copyright ever reasonably settled the matter of ownership of knowledge and want to morally justify an incoherent set of outcomes that you personally favor. In practice it's primarily been a tool for the powerful party in any dispute to hammer others for disrupting their business model. I think that's pretty much the only way attempting to apply ownership semantics to knowledge or information can end up.
Correct.
Knowledge consists of, roughly speaking, thoughts.
(a "justified true belief" - per https://plato.stanford.edu/entries/knowledge-analysis/ - is a kind of thought)
The "thinking" part of a "thinking being" - that also consists of thoughts.
If your knowledge is someone's property, you are someone's property.
A society where all knowledge is proprietary, is a society of ubiquitous slavery.
Maybe multi-layered, maybe fractional, maybe with a smiley-face drawn on top.
Doesn't matter.
Humans have been known to recite entire parts from plays from memory, live in front of audiences even.
And they are legally required to license the play to do that, if it's still in copyright.
Only to perform it, not learn it.
And LLMs perform when you prompt them.
This is a perfect example of 'begging the question'. Arriving at a conclusion from a fact assumed as true without evidence. Your reductio does not actually demonstrate that copyright applies to LLMs, because you did not demonstrate how transcoding is comparable to inference, just that LLMs can reproduce some passages from copyrighted works. You could also produce passages from copyrighted works by generating enough random sequences of words, but no one is arguing that is comparable to transcoding. That the people who do not share this conclusion are engaging in motivated reasoning is based only on your assumption and has no logical backing, and is therefore begging the question.
> Yes LLMs can reproduce passages from copyrighted works verbatim but that's only because it "learned" it and it's just telling you what it "knows".
Are you finding people that actually say this?
When it can quote something like that, it's a training error. A popular enough work gets quoted and copied by people online, and then it's not properly deduplicated. It's a very small fraction of works it can do that with, and the cleaner your data the less it happens.
I'll once again note that Stable Diffusion launched with fewer weights than training images. It had some accidental memorizations, but there wasn't room for its core functionality to be memorization-based.
I find it more ridiculous to equate the act of a human learning with for-profit AI training without recompense to the authors of the training material.
If I “learned” your essay and handed it in, would you be happy with that?
The "learning" isn't learning really. I mean it might be, but if you define learning to be a human endeavor then AI can't learn.
It's perfectly reasonable to say it's okay for humans to do something but not okay for a computer program to do the same thing. We don't have to equate AI to humans, that's a choice and usually a bad one.
If one defines 'flying' to be a bird's endeavor, then humans can't fly.
Now, if you'll excuse me, I need to catch a metal shuttle that chucks itself through the air on wings.
Sure, as a word it can be broad; as a concept in our legal system it should be much more nuanced.
The relevant extension of your analogy is: should birds be required to obey FAA rules? Or should plane factories be protected as nesting sites?
Relevant: https://www.bluewin.ch/en/news/swiss-company-builds-airport-...
It's a relevant extension if you think the ability to learn from a work is a right people have that exempts them from the more general lockdown copyright would impose.
If you come at it from the view of copyright being a limited set of control over some areas but not others, then if copyright doesn't block human learning it shouldn't affect anything similar either, unless a specific rule is added to make those situations be handled differently.
It's also perfectly reasonable to say it's ok for a program or machine to do the same thing as a human. This has been the basis for the technological revolution since the dawn of technology.
It's legal and perfectly reasonable for a human being to combine organic fuels with oxygen from the air to create energy and CO2. Any law restricting that would be the worst form of tyranny.
It would not be reasonable to allow machines to do that at unlimited scale without restrictions.
(Hopefully the fossil fuels industry won't draw inspiration from the legal arguments made by AI companies...)
> It's legal and perfectly reasonable for a human being to combine organic fuels with oxygen from the air to create energy and CO2.
Is there any line past which it becomes unreasonable?
> It would not be reasonable to allow machines to do that at unlimited scale without restrictions.
If the machines were a replacement for a damaged respiratory system in a human would it reasonable?
What about if the machine were being used by a human to do something else that was important?
Where is the line where it becomes reasonable?
> Is there any line past which it becomes unreasonable?
That's exactly the question we should be asking about AI and fair use.
Are you refusing to engage with your own metaphor?
You're taking the metaphor much too seriously. It was only an example to illustrate that human rights don't automatically apply to machines. Let's not read too much into it.
You made a claim and used a metaphor to demonstrate that claim. I asked a very simple question about the bounds of the metaphor and thus the claim. You are dodging answering the questions, which means that you cannot defend the logic of your claim. Thus you have forfeited that your claim is valid, and 'human rights don't automatically apply to machines' has not been illustrated.
Fortunately I don't care whether you're convinced. I doubt our discussion here will change policy in any way.
What's your strategy for solving problems where there are diverse viewpoints if there is no desire to convince anyone else? Rhetoric is a time-proven set of communication standards that allows us to demonstrate the validity of our positions, and thus gives us a way to find agreement, or at least understand what others think. Few people are completely irrational, and understanding why they think what they do, even if one does not agree with them, is important in a system where people have to co-exist with the decisions that affect everyone.
Because the alternative would be to just railroad people who don't agree, and even when it does work in one's favor the pendulum tends to swing back hard in response.
I think it's absurd that we've jumped to the conclusion that backpropagation in neural networks should be legally treated the same as human learning.
I mean, I don't think I could find a better description for following the derivatives of error in reproducing a set of works than creating a "derivative work".
>> ... we've jumped to the conclusion backpropagation in neural networks should be legally treated the same as human learning.
I agree. However, the reverse is also likely true, i.e., it cannot currently be denied that learning in humans is different from learning in artificial neural networks, from the point of view of producing works that mix ideas/memes from several works processed/read. Surely, as the article says, copyright law talks exclusively about humans, not machines, not animals.
I understand the article; the point about 'learning' is that if the model and its outputs are derivative works, then the copyright belongs to the human creators of the works it was trained on.
Edit: Or perhaps, put more pseudo-legally, the created works infringe on the copyrights of the original human creators.
The part I agree with is that copyright law calls out humans specifically as the potential owners of copyright. So what you suggest seems to be the only possible way out. Calling out humans could imply that when a human reads a thousand books and then writes something based on them, but which is not a substantial copy of anything explicitly read, that human owns the copyright to the text written. Whereas, if an artificial neural network does the same (hypothetically writing the same text), it would not.
The above does not follow from, imply or conclude anything about learning in artificial neural networks and humans being similar or dissimilar.
Is "learning" the correct term?
Or is it "plagiarism"?
"Learning" for LLMs is just as goofy and propagandistic a metaphor as "stealing" for copyright. I find it predictive of your position that you'll accept one dumb metaphor for something that we didn't need a metaphor for, but not the other.
Are you for stealing and against learning?
We know exactly what is happening in both cases. We can talk about that, or we can use obfuscating euphemisms that make our preferred position seem obviously true.
“Stolen” as in “profited on IP against terms and conditions of the license”.
Everybody has had a complete 180 in terms of copyright protections. Before, nobody cared about downloading music, movies, TV shows, or pirating games. Now, when copyright law is affecting them, they are gung-ho about protecting these billion-dollar companies' copyrights.
A more logical explanation would be that there are different opinions and those who complain are usually louder.
Yes, that's my point. They are different and contradictory opinions, which show hypocrisy.
No it is not your point. You're just arguing about a strawman that holds both of those contradictory positions.
You are attempting to invoke a strawman. So is your point that there is not a significant overlap between posters who think that AI companies should not be allowed to use pirated copyrighted material in their training corpus, and posters who themselves pirated copyrighted material such as movies, music, games, etc.?
Yes, that is their point. Do you have evidence against it?
I'm sure you can find some overlap, but I bet the vast majority is caused by people making a distinction between commercial and noncommercial piracy. I don't think there's a big cohort of piracy hypocrites.
Due to the nature of the argument, of course I do not have evidence for or against it. However, I am willing to leave it at that, because I think that any rational observer will be able to look at the general mood toward copyright/piracy online (including using Limewire back in the day, pirating movies, downloading Photoshop, etc.) and come to their own conclusion as to whether or not it's plausible that there isn't a significant overlap between the two.
It's not about "billion-dollar companies' copyrights"; it's also about voluntary copyleft free software. If I license my code under the GPL, I don't want other persons/companies to just whitewash that code through LLMs and use it in their proprietary code.
I agree with this, and I think that it is an open question whether or not training on copyrighted material is considered transformative or not. However, someone said that thumbnails of full photos are considered transformative enough to allow fair use, and LLM training is (in my opinion) clearly more transformative than converting a picture to a thumbnail. But we will see how it plays out.
It's all power.
The music and movie companies have power. They have the funds to bankrupt you with a small army of lawyers. You as an individual do not stand a chance against corporate lawyers. They can destroy your life over fairly minimal and non-violent offenses.
AI companies are backed by the very powerful. They can steal all they want and use the same army of lawyers to bankrupt any small rights holder. The big rights holders go to the same parties and allow it to happen.
Regardless of the actual take on copyright, both methods skullfuck the little guy without power.
People cry foul because, at least in the US, we claim to live in a free country based on equality, yet there is a very obvious caste system of haves and have-nots.
It erodes the legitimacy of the system. Imagine if for years you saw news reports of a mother getting a judgment against her where she owes hundreds of thousands because she seeded a Britney Spears song. Then you suddenly see the same laws that were leveraged to instill fear in you tossed aside when the rich and powerful say it doesn't count anymore. You're going to cry foul!
It's not a hypocrisy of position on copyright, it's bearing witness to the illegitimacy of the laws they're bound by.
I find the idea that code could be copyrightable weak. There are only so many ways to write a for loop. Similarly, you can't copyright schematics (apart from the exact visual representation as a form of art). Code is just a schematic.
Note: IANAL
Copyrights already preclude short phrases for the same reason -- there are only so many ways in which short phrases could be produced. The moment a work becomes larger (large enough; AFAIK, the threshold is not precisely defined), the reasoning you applied fails to apply.
The Google-Oracle lawsuit did not decide whether APIs (when large in number) are copyrightable or not.
Let me get this straight: since there are only so many ways to write a for loop, you doubt that for loops are copyrightable. From this you conclude that code, in general, isn't copyrightable?
That's like saying "there's only so many ways to greet your neighbor, so any text that simply greets your neighbor isn't copyrightable – and therefore no text is copyrightable".
> but the ability for the agent to build it in the first place is based off of stolen IP.
I honestly don't understand why the attitude that underlies this is so prevalent.
When I write code, what I write and how I write it is informed by having read countless source code files over my education and my career. Just as I ingest all that experience to fine-tune how my later code is written, so does the LLM from the code it's seen.
The immediate retort to that is that the LLM is looking at code that wasn't its to read. But I don't think that's a valid objection. Pretty much by definition, everything I've learned from has a copyright on it, and other than my own code on my own time, that copyright is owned by someone else. Much of the code that's built up my understanding has been protected by NDA, or even defense-department classifications: it wasn't mine in any way. But it still informs how I do all my future coding.
By analogy: I'm also an artist, especially since my retirement. My approach to photography was influenced by Ansel Adams, and countless other artists whose works I've seen displayed in museums, or in publications and online. My current approach to painting was inspired by Bob Ross and others, and the teachers who have helped me develop. I've taken pieces of what I've seen in all their work, and all of that comes out in my photos and paintings, to varying degrees.
I've taken ideas from others in code and in art, and produced something (hopefully!) different by combining those bits with my own perspective. I don't think anyone has a claim on my product because of this relationship.
Likewise, I know that many of my successors have learned from my code (heck, I led teams, wrote one book about software development!). And I hope that someday my artwork has developed to the point where there's something in it that's worth someone else's attention to assimilate. I've never for a minute - even decades before the advent of LLMs - hoped or even imagined that my work would remain locked up with me, and that the ideas would follow me to the grave.
As they say, we are all standing on the shoulders of giants. None of us would be able to achieve the tiniest fraction of what we have, without assimilating what has come before us. Through many layers of inheritance it's constantly being incorporated in subsequent works.
In a few decades at best, I'll be dead. It probably won't be very long after that when people even forget my name. But the idea that something I've done - my work in developing software systems, or in my photography and painting - will continue to have ripples through time, inspires me and gives me hope that I'll have some tiny shred of immortality beyond my personal demise.
Scale, and the ability to make a livelihood from your creations, and/or the ability to control how what you have created is used, for instance to demand attribution.
The attitude is derived from a general animus many have towards AI companies. They resent the efficacy of AI because it devalues individual expertise.
I can't see how it's justifiable to say that training on data is the same as "stealing", when that same claim, that learned information a person could retain and reproduce constitutes copyright infringement, is the premise of many dystopian narratives, like this one, where once your brain is uploaded to the cloud you have to pay royalties on every media product you remember.
https://www.youtube.com/watch?v=IFe9wiDfb0E
Part of how AI works is that it's really just complicated compression: you can get AI to write out Harry Potter novels word for word with the right prompting.
When it picks out a rare bit of code, it will simply be copying that code, illegally, and presenting it without attribution or any license, which is in fact breaking the law, but AI companies are too important for the law to apply to them.
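To make the compression/memorization point concrete, here is a deliberately crude sketch (a toy n-gram lookup table, nothing like a production LLM): even this trivial "model" reproduces a rare passage verbatim once prompted with its prefix.

    from collections import defaultdict

    def train(tokens, n=3):
        # Map each (n-1)-token prefix to the tokens that followed it in training.
        model = defaultdict(list)
        for i in range(len(tokens) - n + 1):
            model[tuple(tokens[i:i + n - 1])].append(tokens[i + n - 1])
        return model

    def generate(model, prompt, length=20, n=3):
        out = list(prompt)
        for _ in range(length):
            followers = model.get(tuple(out[-(n - 1):]))
            if not followers:
                break
            out.append(followers[0])  # replay the first continuation seen in training
        return out

    # A passage seen only once in training comes back word for word:
    corpus = "the quick brown fox jumps over the lazy dog".split()
    model = train(corpus)
    print(" ".join(generate(model, ["quick", "brown"])))
    # -> quick brown fox jumps over the lazy dog

Real models are vastly more sophisticated than this, but the failure mode being described is the same: sufficiently rare sequences can survive training nearly intact.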
There have been instances where models have spat out comments in code that mention the original authors, etc., effectively outing themselves as copyright thieves.
There's nothing anyone can do about it, but the suspicion is that the big companies have taken everyone's code on GitHub, without consent, and trained on it.
And now they are spitting out big chunks of copyrighted code and presenting it as somehow transformed, even though all they've actually done is change a few variable names.
It is copyright theft, but because programmers are little people, not Disney, we don't have any recourse.
Anthropic was sued successfully for training on books; the law still applies to them.
https://www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-...
When I write fizzbuzz do I owe royalties to the inventor of fizzbuzz? Is my brain copyright thieving because I can write out the song lyrics from memory?
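(For reference, the canonical fizzbuzz is a handful of lines that countless people have written independently in essentially this form:)

    for i in range(1, 101):
        if i % 15 == 0:
            print("FizzBuzz")
        elif i % 3 == 0:
            print("Fizz")
        elif i % 5 == 0:
            print("Buzz")
        else:
            print(i)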
I think if you write fizzbuzz and then sell it, without attribution, and it goes against the original fizzbuzz license, then you’re infringing.
They got sued for downloading pirated books and not for using them for training. Huge difference.
Indeed, the court actually explicitly held that Anthropic had the right to train their AIs on books, so long as they paid for them.
> There's nothing anyone can do about it, but the suspicion is that the big companies have taken everyone's code on GitHub, without consent, and trained on it.
I asked agent X what the source of the training data it generated code from was; it couldn’t say. Then I asked why the code implementation was exactly the same as the output of agent Y. It said they were trained on the same ‘high-quality library’, and still couldn’t say which one.
So I guess that’s fine because everyone is doing it.
You asked a machine that makes things up when it doesn't know the answer a question that it has no way of knowing the answer to. I don't know why you bothered to relay its response.
> And now they are spitting out big chunks of copyrighted code and presenting it as somehow transformed, even though all they've actually done is change a few variable names.
It's pretty likely that I've done the same thing. I mean, I've written enough CRUD functions in my life, for example, that in all likelihood I'm regurgitating stuff that's a copy, for all practical purposes, of stuff I've done before as work-for-hire for my employer. I'm not stealing intentionally or consciously, but it seems quite likely that it's happening. And that's probably true for many of you, at least that have been in the industry for a while.
This is the answer. People don't like having their livelihood threatened so they kick the thing that threatens it.
For another human being to look at my open source code, learn from it, get inspired by it, appreciate what I did, and let it influence their own creativity would bring me joy. That's why I open sourced it in the first place.
Few people ever actually read open source code, but I'd like to think on the rare occasions they do, they share a connection with the author. I know when I read somebody else's code, for me to understand it I have to be thinking about the problem the same way they were when they wrote it. I feel empathy with them and can sometimes picture the struggle, backtracking, and eureka moments they went through to come up with their solution.
Somehow I don't get the same warm fuzzy feelings about a machine powered by investor money ingesting my work automatically, in milliseconds, and coldly compressing it down to a few nudges on a few weights out of trillions of parameters. All so the machine can produce outputs on-demand for lazy users who will never know of me or appreciate my little contribution, and ultimately for the financial benefit of some billionaires who see me as an obsolete waste of space.
I guess I'm just irrational that way.
We're moving into the 'industrial age of software'. Your exact issue, of bespoke, well-thought-out and well-crafted code, is one that craftsmen felt at the beginning of the industrial age. Now, parts are designed and churned out by machines that no one sees or cares about (generally speaking). This is where we are going with software, and production at a truly industrial scale has its place.
And so does well-crafted bespoke software.
The engineers who built the foundation for the industrial expansion of our forefathers went through the exact same thing we're going through now. They looked at what existed and used it to inform their efforts. This is what LLMs do.
I'm not attempting to moralize here, just to comment on the parallels. Do I like that a craftsman's work is consumed by the juggernauts with no second thought given? No. I think it's a shame. But I also think the output will never match the artisans that practice now. By their very nature, the machines we employ cannot match the skill or thought that goes into bespoke code.
It is not even about quality. In fact with an LLM following my orders I can create higher quality code than I ever did before. I always was operating within a budget whether it was defined by the # of hours my customers were willing to pay for, or the # of hours I was personally willing to invest in a side project. This budget manifested in the form of cut features, limited test coverage, limited documentation, and so on. So given the same budget or even a slightly reduced budget I can actually make higher quality software with slop superpowers.
If I spend 2 hours designing the domain model, 1 hour slopping out a rough implementation, and 5 hours polishing it with a combo of handwritten and vibed refactorings, I will get a better result than if I spent 8 hours writing everything by hand.
So my point is not that vibe software is lower quality, as my experience has shown the opposite. It is simply that I shared my work in the spirit of sharing it with others who toiled in the same craft, not for consumption by machine. Not that I ever contributed anything very important to the open source world, that anybody depended on. Just personal projects I thought were neat or educational.
In hindsight I would probably still have open sourced what I did, because I think it's valuable to have on record that I competently programmed stuff before AI even existed, like pre-atomic steel. But I don't know if I will open source any personal code going forward.
====
To put it more succinctly: if somebody "ripped off" my open source code in 2018, I wasn't mad about that. Even if they didn't bother to attribute me, well, at least they saw my stuff, had a human brain cell light up appreciating it, and thought it was worth stealing. I'm flattered. But with LLMs my work can be reappropriated without a single human ever directly knowing or caring about it.
Well put. I agree wholeheartedly with your sentiment.
Maybe this is me just being angry at the new world that's being created, but the beauty of the open source ecosystem was humans giving away things they found useful in the hope that other humans could find them useful too. Having a machine take all of that and regurgitate it somewhere else without that connection (for profit, no less) feels like a betrayal of that open source ethos.
Now in the back of my mind I worry that everything I open source will be scooped up by corporations to make them more rich and more powerful, so I end up not publishing anything (not that it was of any value). I suspect I'm not alone in feeling that way.
Humans should have more legal privileges than machines, just as individuals should have more legal privileges than corporations. It's really as simple as that. I don't want to go around making up justifications; that's how the law should be, and if it turns out not to be that, I'm going to be nettled.
I live in the UK, and most US law is based upon English common law; it's not some immutable code given to us from above. It's based upon the assumptions and capabilities of the entities participating in the system at the time the law was codified. It can and should change to make more sense if those assumptions and capabilities shift massively.
I get the individual/corporation distinction, but how is a machine another tier here? It's a tool, it can't have any rights at all. The wielder has rights, and curtailing their rights depending on what tool they're using to exercise them seems strange. Potentially justifiable, but it's a different axis from the nature of the actor.
Our positions are completely compatible. People are anthropomorphizing LLMs, saying that because humans train on protected works, then it is fine for LLMs to do the same.
If they have only the rights that their human creators have, then access to them cannot be sold, in the exact same way that I cannot sell you a database that I have collected filled with copyrighted material. The "humans do training too" argument only holds if you imbue LLMs with similar rights to humans.
I am allowed to sell myself (in a very limited capacity) to others for them to exploit my training, even if that training was on protected material, which is a privilege humans should have, but machines should not.
Thing is, the LLM's level of compression of the training set means that, under the same rules that say you cannot sell that database filled with copyrighted material, the LLM is effectively fine to sell, because you have to be able to meaningfully trace each claim to the final output (the weights). For example, for some older Stable Diffusion model, it was calculated that each individual work's addition or removal resulted in about 1-2 bits of change, meaning the same rules would qualify it as not a derivative work.
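A back-of-envelope version of that dilution argument, with made-up round numbers (not the actual calculation being referenced):

    # Illustrative only: how much model capacity exists per training work?
    params = 1_000_000_000           # assume ~1B parameters
    bits_per_param = 16              # fp16 weights
    training_works = 5_000_000_000   # assume ~5B images/documents in the training set

    print(params * bits_per_param / training_works)  # ~3.2 bits per work

At ratios like these there is, on average, room for only a few bits of any individual work in the weights; verbatim memorization tends to show up only for outliers duplicated many times in the training set.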
However, because it is an issue with the (at least historical) goals of copyright law, the common pattern that is evolving is that AI is not granted copyright over any work it generates, making it a bit of a poison pill for some of the more egregious ideas of corporate abuse. Not sure if the weights will be considered copyrightable either.
Specifically in the area of copyright, there's a significant distinction between UK and US law.
The UK works under the "sweat of the brow" doctrine for copyright. https://en.wikipedia.org/wiki/Sweat_of_the_brow
In the US, the standard is a minimal threshold of human originality, per Feist Publications, Inc. v. Rural Telephone Service Co. https://en.wikipedia.org/wiki/Feist_Publications,_Inc._v._Ru.... https://en.wikipedia.org/wiki/Copyright_law_of_the_United_St...
The inclusion of "human" is important there. https://copyright.gov/comp3/chap300/ch300-copyrightable-auth... - the human authorship is mentioned several times.
The question is, does Claude Code fall into that category of authorship without creative input or intervention from a human author?
The prompts may be copyrightable... but the output if you don't go in and fix it up and provide that minimal amount of human originality to it? That appears to still be an open question of law in the United States.
> When I write code, what I write and how I write it is informed by having read countless source code files over my education and my career. Just as I ingest all that experience to fine-tune how my later code is written, so does the LLM from the code it's seen.
You are presumably human. We have granted humans specific exemptions in copyright law. We have not granted that to LLMs. Why are we so eager to?
Because that allows us to create useful tools that we didn't have before. For me it feels like a carpenter going from a hand-saw to an electrical saw. Still requires the skills of a good carpenter, but faster and easier.
… so a bunch of people just decided that rights we granted to humans also apply to their tools? Without any discussion? This isn't how anything is supposed to work when it comes to common rules!
The common rules are so because we agree on them. On principle, in this case, we do not agree on what the rule should be here, and it's in a way unprecedented. We'll soon converge on a societal agreement. I hope the answer will not be society abstaining from tools.
And the process by which we agree is lawmaking.
Ok, so I use the LLM. I use the tool. Can I now apply the exemption to me?
Are you telling me that I can use the thing, but I can't use it if I process it through an LLM? It gets slippery, fast.
No, that's how copyright normally works.
If I write a story, I can put it online. That doesn't mean it's ok to take that story and publish it in an anthology.
What's special about LLMs in your argument? When I was an edgy teenager in the 90s, I'd argue that it's not piracy because the DivX representation of the movie isn't bit-for-bit identical to the Hollywood master or whatever. If your reasoning works for LLMs as the tools, surely it also works for video compression.
I'm not sure where in our lawbooks there are laws that specifically target humans to the exclusion of human-operated tools.
There's also a TON of irony here. What an about-face it is for the community at large to switch from "information wants to be free, we support copyleft and FOSS" to leaning so heavily on an incredibly conservative reading of IP law.
> I'm not sure where in our lawbooks there are laws that specifically target humans to the exclusion of human-operated tools.
It doesn't need to. Laws are for humans.
Laws don't give rights to chainsaws. Or lawnmowers. Or kitchen knives, hammers, screwdrivers, and spades.
You can't use any of those to commit a crime and then claim that the law specifically did not exclude those tools.
Why are you seemingly in favour of carving out an exemption for LLMs?
Laws are for humans.
Arguing that the law did not specifically address "intentionally killing a person by tickling them till they died" means that you found a loophole which can be used to kill people is...
well, it's in the "not even wrong" category...
> I'm not sure where in our lawbooks there are laws that specifically target humans to the exclusion of human-operated tools.
If we take the point of view that LLMs are tools (I agree), then people need to be absolutely certain that these tools don't contain (compressed) representations of copyrighted works.
People seem not to want to do that. And they argue that the LLMs have "learned" or "been inspired" by the copyrighted works, which is OK for humans.
This is the problem. People can't even agree on which of two mutually exclusive defenses to appeal to! Are LLMs tools which we have to ensure aren't used to reproduce copyrighted work without permission, or are they entities that can be granted exemptions like humans can? It can't be both!
> There's also a TON of irony here. What an about-face it is for the community at large to switch from "information wants to be free, we support copyleft and FOSS" to leaning so heavily on an incredibly conservative reading of IP law.
True. While IP-owning companies like Microsoft now say "it's online, so we can use it".
It's bizarre.
I'll tell you what: I'll drop my conservative stance in defense of FOSS when Windows and the latest Hollywood movie are "fair use" for consumption by whatever LLM I cook up.
> If we take the point of view that LLMs are tools (I agree), then people need to be absolutely certain that these tools don't contain (compressed) representations of copyrighted works.
I've pointed out elsewhere in this thread that this is the opposite of how the real world works.
In actual fact, people who need software built hire a tool (e.g., a software developer like me) to build it for them. That tool - me or you - has inside it a tremendous library of copyrighted works represented. I've worked on enough different projects over the decades that the next CRUD function, or rule-driven data-entry tool, or whatever, that I build is going to draw very significantly from the last ones I built. And those last ones were copyrighted, with those rights held by my employer at the time, and maybe even protected by NDA or defense-style classifications.
Is your position that this is OK so long as it's stuff that I can keep in my squishy brain, but the moment that mechanism moves to silicon, it somehow becomes fundamentally different?
The other major argument I see in this thread is that for LLMs it's different because there's a third party who is aggregating the data, and selling me (or my employer) use of that tool. But this doesn't change the overall picture at all. It just adds one more layer of dereferencing into it. The addition of that middleman hasn't altered the moral landscape: how is hiring me, along with what's in my memory, different from hiring the combination of me plus a helper to supplement my memory? There's an aspect of scale, I suppose. With that helper I can achieve greater quantities, but it's not changing the story in a qualitative way.
> In actual fact, people who need software built hire a tool (e.g., a software developer like me) to build it for them. That tool - me or you - has inside it a tremendous library of copyrighted works represented.
Humans are distinct from tools, both ethically (to most people) and legally. You may not see it this way, but it is the majority opinion and the stance of the law in most jurisdictions. The rest of your paragraph falls apart without considering humans as tools.
(Incidentally: you can own tools. I don't think you want to open that door…)
> Is your position that this is OK so long as it's stuff that I can keep in my squishy brain, but the moment that mechanism moves to silicon, it somehow becomes fundamentally different?
Yes. We, humans, structured our laws because we consider ourselves and our squishy brains special.
This is, for example, why you don't get charged with murder for terminating a computer program. We, the humans, have decided that the right not to be terminated only applies to humans (and other animals, but then because we grant them that protection).
We did not grant human exemptions in copyright law.
We gave humans a certain temporary monopoly on certain uses, under rules little understood by laymen even when their livelihoods depend on them.
... and from that temporary monopoly humans have exemptions (critique, inspiration, etc.)
It's generally less an exemption and more a constraint on the monopoly, at least in the spirit of the law.
In many of those examples, there is payment to the creator of the works that others are learning from. Authors are paid for their books, when we listen to music on the radio the musician is paid royalties, etc. When you lead a team and mentor junior engineers you're being paid for your time.
The nature of the source material matters though. Training a model on open source software seems perfectly fair - it has explicitly been released to the public, and learning from the code has never been a contested use.
IMO the questions around coding models should be seen as less about LLMs and more as a subset of the conversation about large companies driving immense profits from the work of volunteers on open-source projects, i.e. it's more about open source than AI.
You’re confusing yourself with a commercial product. You’re not a product that was created by other human beings based on someone else’s IP.
> You’re not a product that was created by other human beings based on someone else’s IP.
It turns out that's false. We know that genes are patentable; remember back during the Human Genome Project, when there was such a rush to patent them? So genes are IP. (This seems bizarre to me, since they're patenting something that was found just sitting there, but this is what the system says right now.)
Well, two other humans (aka mom and dad) did create me, based on those patentable genes (and most likely including some genes that were, in fact, patented).
I'm not sure what to conclude from all of that, but I do think that it invalidates your argument.
It's a little more complicated, and I would argue that the court got it wrong, but you cannot patent a gene as it exists and rests in nature. You can patent the cDNA (reverse-transcribed mRNA) genetic code after intron removal, which they argue is not a natural thing, but I think they misunderstood the science, and really the triviality of the "invention".
https://en.wikipedia.org/wiki/Association_for_Molecular_Path....
No, that human owns the copyright on the prompt, not on the work product.
So I’m responsible for pushing the giant boulder at the top of the hill.
The humans at the bottom who were crushed should blame the boulder, which happened to be moving.
I'm not sure what point you are trying to make.
He's making a point about responsibility/liability.
If you only get copyright for the prompt you make, but not the output, then it's like being responsible only for the prompt, but not the output.
Ie he's only responsible for pushing the boulder up the hill. The fact that it rolled down from the hill and crushed someone's house "isn't his fault" (he doesn't get copyright on it).
Well, you are responsible for the consequences. Liability is simply a different thing than copyright.
The copyright office says that you don't get copyright because you're not considered the author:
https://www.copyright.gov/ai/
>The Office concludes that, given current generally available technology, prompts alone do not provide sufficient human control to make users of an AI system the authors of the output. Prompts essentially function as instructions that convey unprotectible ideas. While highly detailed prompts could contain the user’s desired expressive elements, at present they do not control how the AI system processes them in generating the output.
If you're not the author then why would you have to be liable for it?
If you hold an illegal party on public land, you would still be liable, even though you did not own the land.
In some places, simply not keeping the public street in front of your property ice-free can incur liability, even when you are not actually there when it snows. There are so many such examples that I'm kind of surprised to see this kind of confused argument made here.
But that's not at all a comparable situation, because it is your party. It doesn't matter where it is; we assign "ownership" of the party to you. Even the language we use explicitly states that. In the case of copyright, the copyright office explicitly states that you are not the author of an AI-generated work.
The same point applies if an animal takes a picture.
> If you're not the author then why would you have to be liable for it?
If you do not understand this, make sure that you always operate within a framework of people who do, because this sort of misunderstanding can cause you a world of grief.
Because you are the person shipping it, and as such regular liability applies. If I'm not the author of a book, and make a lot of copies and distribute those I'm liable for the content of that book, regardless of whether or not I hold the copyright to it. Conversely, if the original author sues because they feel their work infringes then that too is a liability that stems from the distribution.
And 'distribution' is a pretty wide term, not unlike 'interstate commerce', lots of things that you might not consider to be distribution can be classified as such in court.
Different laws do not come in packages, they apply individually, and sometimes they apply collectively but it isn't a menu where you can pick the combination that you think makes the most sense.
Oh, I do understand it - laws are contradictory and can end up doing whatever the people who shout loudest say they should do (but they don't always work that way). I just think that it is extremely bad when laws work this way.
Technically when you select "copy image" instead of "copy image url" and paste that to a friend you're often committing copyright infringement. Do I think this is reasonable? Absolutely not. The same goes for this - the author should hold liability, so make the person who ends up causing the work to exist the damn author.
But nooo, we can't have that. Instead we need to have these convoluted exceptions that don't at all work how the real world works, so that lawyers can have even more work.
Besides, if we go by "the law" then we already have a court case where training an AI model is protected by fair use. But obviously that isn't satisfying enough for people, so they keep talking about how it's stealing (refer to my first sentence).
Also, this situation is going to get funny when some country decides that AI generated content does get copyright protection.
> Oh, I do understand it - laws are contradictory and can end up doing whatever the people who shout loudest say they should do (but they don't always work that way). I just think that it is extremely bad when laws work this way.
You are completely misunderstanding GP's distinction between ownership and liability.
In short, if you use someone else's car to kill someone, you are still liable for killing that person even though you don't own the car.
Do you disagree with that statement?
Aren't you agreeing with him? He pushed the boulder up the hill, thus he is responsible and liable for what happens. He is the author of the work of pushing the boulder up the hill.
In your analogy: He was driving the car, he is liable for the death. He is the author of the work of driving the car.
You are kinda unnecessarily introducing the creation of an object used for the work. Whoever did create the car/boulder is not liable for what happened.
So whoever made the LLM is not the author but the one who used it to create the code.
> Aren't you agreeing with him?
No. His claim is:
>>>> If you're not the author then why would you have to be liable for it?
And all his arguments after that are to support that claim. His claim is wrong.
Ownership and liability are independent of each other, and all his supporting arguments dismiss this fact.
> Whoever did create the car/boulder is not liable for what happened.
Incorrect; whoever owns the car/boulder is not liable. The creator doesn't even enter this argument.
> So whoever made the LLM is not the author but the one who used it to create the code.
No; whoever created the LLM is irrelevant. The author who creates the code is similarly irrelevant. What matters in his argument is who owns the code, and this is also irrelevant to his argument, because ownership does not mean liability.
You can't really argue that things are in a certain way when that contradicts the way the law works, that's a recipe for disaster. The rules have been set, you can disagree with them and then you will be forced to litigate, which is both expensive and time consuming. Purposefully going against the grain is only for those with extremely deep pockets (and for lawyers...).
> Besides, if we go by "the law" then we already have a court case where training an AI model is protected by fair use.
Yes, but training an AI is a completely different thing than distributing the work product generated by that AI.
Note that I don't agree with all aspects of copyright law either, but I'll be happy to play by the rules as set today simply because I can't afford to be wrong and held liable for infringement. For instance I strongly believe that the length of copyright is a problem (and don't get me started on patents, especially on software). I also believe that only the original author should have copyright, not the company they worked for, their heirs (see Ravel for a really nasty case) or anybody else. I believe they should not be transferable at all.
But because I'm a nobody and not wealthy enough to challenge the likes of Disney in court I play by the rules.
As for 'this situation is going to get funny when some country decides that AI generated content does get copyright protection':
Copyright is one of the most harmonized legislative constructs in the world. Almost every country has adopted it, often without meaningful change. In practice US courts are obviously a very important driver behind changes in copyright law. But in general these changes tend to lean towards more protection for copyright owners, not less. So far the Trump admin has not touched copyright law in their usual heavy-handed manner. I'm not sure if this is by design or by accident, but maybe there are lines that even they cannot easily cross without massive consequences.
Some parties in the AI/copyright debate are talking out of both sides of their mouth. For instance, Microsoft is heavily relying on being able to infringe copyright at will, while at the same time jealously guarding their own code. Such hypocrisy is going to be the main wedge that those in favor of strong copyright will use to reduce the chances that AI work product deserves copyright; after all, if it is original and not transformative then Microsoft could (and should!) train their AI on their own confidential code. But they're not doing that, so maybe they know something you and I do not...
Imagine you cut the sentence "I'm going to kill you, this is an imminent threat." out of a book and hand it to someone.
It would be silly to consider you the author of that sentence in a copyright sense.
It would be equally silly to say you have no liability from that sentence.
Looking back at the boulder example, that LLM output has no consequences to be liable for if you throw it immediately into the trash bin. It's when you take boulder.txt and use it to do things that you have liability despite not having copyright.
That is not how responsibility works anywhere. If you are stealing a gun and murder someone with that gun, you are still responsible, even if it is not your gun.
That’s not how it works. The human using the tool (like Claude Code, etc.) owns the copyright of the code generated.
No, you are wrong about this.
See:
https://technophilosoph.com/en/2025/02/07/ai-prompts-and-out...
If you have a more recent citation referring to case law that states the opposite then that would be great but afaik this article reflects the current state of affairs.
The human using the tool creates a prompt; there is then an automatic transformation of the prompt into code. Such an automatic transformation is generally accepted not to create a new work (after all, anybody else inputting the same prompt would have a reasonable expectation of generating the same output, modulo some noise due to versioning and possibly other local context).
Claude Code, and AI-generated code in general, does not at present create a new work. But the prompt, that part which you input, may be sufficiently creative to warrant copyright protection.
In the US, the copyright office (as the article you link to says), has declined to define “meaningful” contribution. If you want to argue that the user doesn’t own it for incredibly trivial prompts, I won’t argue (though I consider that to be non-useful code).
Every developer I’ve seen use these tools has engaged in a meaningful contribution: specific directions across multiple prompts, often (though not always) editing the code afterwards, manually running the code and prompting for changes, etc.
Until the courts, legislators, or the copyright office define something otherwise, I’m highly confident of my assertion. (Mostly because of the insane number of hours I’ve spent with counsel on this. And, as a disclaimer, since I am biased: I worked on Copilot and Google’s various AI assisted coding products as an SVP and VP.)
If my business depended on a legal fiction to be true and I had invested a whole pile of effort + money into it being so then I would argue at every opportunity that 'of course it is legal'. But that's just a version of fake-it-until-you-make-it and in practice not all of those bets pay off.
The fact that meaningful contribution has not been defined is a strong signal that things are not nearly as clear cut as you make them out to be. Until there is a ruling that clearly establishes that the person that generated the prompt owns the copyright on the code I think it is misleading to suggest that this is already the case, your lawyers are not the lawyers of the parties that will end up hurt if it ends up not being so.
For contrast: we have a very clear idea on what things are copyrighted and in general these things do not rest on a foundation of IP appropriated from others outside of the license terms. The fact that the infringement is fine grained and effectively harms the rights of 1000s or more individuals doesn't change the heart of the matter, whoever wrote the code: it wasn't you.
Given your bias I'm not surprised that this would be your argument though, effectively you have created a copyright laundromat using code that you were nominally the steward of and not the owner but whether it stands long term or not is not up to your lawyers.
Obviously, we aren’t going to agree on this at all. I hope you have a good day.
Prove I did not write my code if I do not tell you which tools I used. =}
That's not how that works.
You warrant that you wrote the code yourself; then it is found that your code infringes on code owned by other entities. Now you have a tough choice: admit you lied about writing your code yourself, tainting all of the code you claim to have written since these tools became available, or stand and take the infringement penalty, which could be very substantial.
Judges and courts don't like playing silly games like this.
I've sued two parties for copyright infringement and won and a third settled out of court for a substantial sum. You don't tell a judge you don't need to prove you wrote the code, that's an automatic loss. Then there are such things as expert witnesses who will interview you and check how much you know about the code you claim you wrote.
>I've sued two parties for copyright infringement and won and a third settled out of court for a substantial sum. You don't tell a judge you don't need to prove you wrote the code, that's an automatic loss. Then there are such things as expert witnesses who will interview you and check how much you know about the code you claim you wrote.
This doesn't really make sense; in no way can an "expert" interview definitively assert someone wrote a piece of code or not, especially if the person has access to the code beforehand.
They don't need to prove it 100%. They just have to show that it's likely you did.
I believe the standard can be as low as "more likely than not".
I’ve done temp. transcription for grey hat medical “expert witnesses” who are paid by innocent and guilty alike, fwiw - mercenaries.
If that were true, a developer may own copyright over the source code, but nothing on the compiled binaries, and I could download practically all software available as compiled binaries and use it for free.
Indeed a developer owns copyright over the source code and on the compiled binaries, because there is no expansion happening here but just a translation from one format into another, the kind of thing that has been ruled copyrightable for as long as copyright has existed. The same goes for translations from one human language into another, and anybody with knowledge of more than one language will be happy to acknowledge that translating is hard work. Even so, the translator does not hold copyright on the result; at best they can say they have created a derived work, and it is the original author that continues to hold copyright.
Compilation and translation happen in a generic manner and do not rely on a mountain of other IP; the compiler is really just a transformative tool that happens to do something useful. Someone constructed it to be a very precise translation, to the point that any mistakes in it are called bugs, and we fix them to ensure the process stays deterministic. Translators try hard to 'get it right' too: to affect the intentions of the original author as little as possible.
When you use a model loaded up with noise or that you have trained exclusively on code that you actually wrote I think a strong case could be made that you own the copyright on that work product. But when you train that model on other people's work, especially without their consent or use a model that has been trained in that way you lose your right to call the output of that model yours.
You did not write it, and the transformative process requires terabytes of other people's IP and only a little bit by you.
As soon as you can prove that your contribution substantially outweighs the amount of IP contributed in total you would have a much stronger case.
+1
Adding two subtle points:
>> Indeed a developer owns copyright over the source code and on the compiled binaries, because there is no expansion happening here but just a translation from one format into another ... do not rely on a mountain of other IP
... and, the license agreement of the compiler and libraries used / linked to practically always explicitly waive copyrights over the said non-mountain of IP.
>> As soon as you can prove that your contribution substantially outweighs the amount of IP contributed in total you would have a much stronger case.
... a much stronger case that you have a partial copyright over the work, which is now likely a derivative work. You still may not have a case that you own the copyright exclusively (or as the original article says, that your employer does).
>> No, that human owns the copyright on the prompt, not on the work product.
I think I may have misunderstood your original comment above. It seems intending to say:
No, that human owns the copyright on the prompt, not necessarily on the work product. The human may partially have copyright over the work product as well, "how much" being dependent on how much new creative expression from the human was involved vs that from others.
That is in fact correct.
Both the compiler (in absence of inclusion of copyrighted libraries) and the LLM are considered to not add creative work and thus do not change copyright status of the works they transform.
You can consider the training set of the LLM or other AI model to be 3rd party libraries, and the level of copyright from them that applies to the final output to be however much can be directly considered derivative, just as reading copyrighted code and being inspired by it does not pass that copyright to your work unless your work is obviously derivative.
>> You can consider the training set of the LLM or other AI model to be 3rd party libraries ...
I like this comparison -- training set as '3rd party libraries'. Except, of course, that the authors behind the training set may not have actually granted permission to use, whereas the 3rd party libraries usually have some permission by way of license.
The law only cares about how the work is distributed - if you acquired it legally by purchasing it, yes, you can train an LLM on it, and with the exception of moral rights in places like the EU, the author does not have more to say about it.
It's treated the same as human reading and learning from the work.
Under US law, you have only the granted artificial monopoly on acts of distribution.
What is interesting is that I used to write a program and get a binary executable back from the compiler, and I'd have copyright on the source and on the binary. Now I write a prompt and get a binary executable back from Claude, and I have copyright on the prompt but, depending on my creativity, I might be able to have copyright on parts of the output binary. The questions remain: how much, which parts, and how the hell could anyone ever tell. This really puts the color of the bits through a vat of dyeing solution.
> If that were true, a developer may own copyright over the source code, but nothing on the compiled binaries, and I could download practically all software available as compiled binaries and use it for free.
If the compiled binaries (output) were produced by running the input (source code) over every program written, then sure.
But that's not what's happening with compilers, is it? The output of a prompt is dependent on the copyrighted work of others every single time it is run.
The output of a compiler is not dependent on the copyrighted output of every other program.
I think your comments are originating in how I may have taken jacquesm's comment too literally, as I just wrote here https://news.ycombinator.com/item?id=47944938
However:
1. The "every"ies in your comment are not to be taken literally either. :-)
>> If the compiled binaries (output) were produced by running the input (source code) over every program written, then sure.
2. More importantly, the above seems cyclically dependent on whether output from generative AI is deemed to be in public domain or not, which I consider is an open-ended issue as of now. It is not so 'sure' as yet. :-)
Copyright isn't some natural state of being though, it's something that's granted to people by the government to "promote the progress of science and useful arts". If copyright hinders things then I think it's reasonable that exceptions would be made.
This analysis yields very different results under utilitarianism vs rule utilitarianism.
Under the former, you could argue, "What I'm doing is a science or useful art, so if copyright exists to advance those things then taking a more permissive interpretation of copyright to allow my efforts to succeed is in the spirit of the law."
Under the latter, you could argue, "Works get published because as a rule, researchers and artists know they have lawful recourse through copyright if the work gets used without their consent. The absence of that rule incentivizes safeguarding works by treating them as secret and each disclosure as a matter of personal trust, so the existence of that rule promotes the sciences and useful arts."
I agree with this sentiment, because the person directing the agent can still direct it in a way where it'll produce a better or worse output than another person directing it.
Copyright laundering is an illusion.
If the LLM generates output that a court decides is sufficiently derivative, and especially (but not necessarily) if the LLM was trained on the source material being infringed, then whoever redistributes the derivative output is going to be liable for copyright infringement.
Creation of the LLM itself is transformative, but LLM output which infringes is not.
Is it true then that if someone stole an entire code base from a vibe-coded app under a non-permissive license, and claimed that it was derived from an LLM and not stolen at all, the person who stole the code is not a thief because it came from the same place? Or are they a thief because someone else copyrighted it first? How do vibe coders protect themselves without knowing who else has the same derivative code or who holds the copyright first? Or can't they?
The only thing a vibe coder should be able to copyright is the prompt text they wrote. Not the output of the LLM, only the text they wrote to instruct the LLM what to do. And even that is pretty iffy, because most of it, like "put a button on a page", is not copyrightable.
I could possibly see an argument for the owner being whoever paid for the tokens used, but honestly I think the argument for that is weaker than what you're suggesting; I'm merely playing devil's advocate here.
I don't think there's even a valid argument for any other ownership model, or at least none that I can think of.
I see the argument for whoever paid for the tokens. Or in the case of a free AI usage, the person who sent the prompt (or whoever they are acting on behalf of, i.e. the company they are working for at the time).
The primary issue being that it's all built on stolen data in the first place.
Even taking the least generous interpretation of what LLMs do and saying they're just "copy/pasting others' code" it's still not stealing because the original still exists and presumably still makes money. The original has to be gone for theft to have occurred.
In order to have a sane conversation about this we have to all agree not to lie.
I've created my own DSL, and instruct Claude Code how to generate code for this DSL using skills.
Since this is a new language, and not documented on the web nor on Github, Claude's ability is not based off of stolen IP. At best it's trained on other language concepts, just like we can train ourselves on code on GitHub.
Maybe a good reason to create a new programming language?
Interesting, but I still do not think this is that easy. The AI model is still trained on some existing works, and it is generating code in the new DSL or programming language still based on some higher-level ideas and expressions it has consumed during training. You have added just one more level of indirection. The output can no longer be a verbatim copy of some existing work or of non-short snippets; however, the output may still carry "expression" that is substantially similar to something pre-existing.
Note: IANAL. The above is just from my current understanding.
You can think that's how it should be. But that's not necessarily how it is. I'm reminded of the famous monkey selfie copyright dispute [1]. A photographer set up a camera and gave it to a monkey but after a legal dispute, courts decided nobody owned the copyright.
I can totally see this applying here as well.
Now this doesn't resolve the issue of AIs being trained on copyrighted works it had no rights to. The counterargument is that this is a derivative or transformative work but I don't believe that's settled law at all.
[1]: https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
Do you think that human directing the agent owns copyright for any legal reason?
The case Community for Creative Non-Violence v. Reid (https://en.wikipedia.org/wiki/Community_for_Creative_Non-Vio...) solidifies a Supreme Court opinion that someone commissioning a work and directing an author does not gain authorship; authorship goes to the person actually doing the work.
The author can grant authorship and copyright to the commissioner with a contract, but the monkey picture (and others) have solidified that only humans can be granted copyright. Since LLMs aren't human they can't hold copyright, and if the LLM doesn't have legal copyright then they don't have legal rights to assign copyright to you.
Interesting, though, that ownership of the code can still be transferred to the employer. So it's in the public domain (because not human authored) but owned by the employer (because the human and/or LLM was employed by the employer)? I don't really understand how this works.
Note: IANAL
I think what this means is that the employee may not be the copyright owner for multiple reasons, which are possibly applicable simultaneously. It does not imply that the employer owns copyright over the work that is in public domain, which would be a contradiction.
yeah, that makes sense
Copyright works on derivation rules - is the component of the work unmistakably derived from another copyrighted work?
Under at least the EU AI Act, any work done by AI is not granted copyright. But that does not mean copyright does not apply; it means the amount of work credited to the AI is set at 0% (a simplification). A human working off another's work, unless it's a perfect copy, will have "credit" for changes that are judged creative/transformative, meaning a human plagiarizing something can still claim some degree of authorship. An AI won't.
In a sense, the copyright status of the final work is a sort of "sum with dilution", where each work involved adds to the claims, but the AI's output is set at 0 - the prompt or further rework by a human is not.
As for the employer: details vary, but generally "work for hire" rules and contracts handle the reassignment of material rights (in the EU and some other places you cannot reassign moral rights, which are a different thing).
When you write code by hand, you are the author. As part of your contract with your employer you grant copyright and authorship to your employer by default (as stated in the contract).
The LLM is not employed by you or your employer, because you can't enter contracts with non-humans or non-human organizations.
When you license a non-LLM code-generation service (like a page that creates a website for you), that company owns the copyright of the generated website, because their deterministic system generated the code by rules and mechanisms that the service's creators defined. Assuming no LLM is part of that, there is no code generated by the system outside of the rules that they defined (it's not filling in blanks that you or the code-generation system didn't explicitly define).
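A toy version of that distinction (hypothetical generator, purely illustrative): in a deterministic template system, every byte of output traces back either to the template its authors wrote or to the caller's inputs.

    def make_site(title, items):
        # Nothing here is "filled in" by the system on its own; the output is
        # fully determined by this template plus the caller's inputs.
        lis = "".join(f"<li>{item}</li>" for item in items)
        return f"<html><body><h1>{title}</h1><ul>{lis}</ul></body></html>"

    print(make_site("My Shop", ["Apples", "Pears"]))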
Since they own the copyright of the website, they can then assign the copyright and authorship to you because of your license agreement to them.
Since the LLM is filling in the blanks on its own in undefined ways, it is the author, not Anthropic/OpenAI/etc. That means that even though you have a license agreement with Anthropic/OpenAI/etc. to transfer copyright, they didn't have copyright/authorship, the LLM did. And since the LLM can't legally own copyright/authorship (since it isn't a human), it can't grant it to you, and you can't then grant it to your employer.
yeah this is what I understand - that it's the copyright ownership that is being transferred. But if there's no copyright to transfer, what does the company end up owning?
It depends on what level of creative control you had over the code.
Code is protected by copyright as a literary work. The method is not protected by copyright, that would be the domain of patents. What's protected are the words.
If you say "Claude, build me a website about X" then you do not have any creative control over the literary work Claude is producing. You just told a machine to write it for you. Nor, unlike compiler output, is it derivative of any other work that you wrote.
If, on the other hand, you are working jointly with Claude to make specific changes to the code on a line-by-line basis, then you will have no problem claiming copyright over the code. Claude in this case is acting as a tool, but there's still a human making decisions about the code.
In the case where you wrote a bunch of markdown and then told Claude to generate the corresponding code but didn't have any involvement in writing the code itself, you could perhaps claim that the code is a derivative work of the markdown; a court would have to handle that on a case-by-case basis and evaluate how much control you exerted over the work.
I don't think case law totally supports the idea that working on a line by line basis means you have "no problem claiming copyright".
The problem is that the LLM is still making assumptions on that line of code, and thus it's still the main author (based on existing case law and the copyright office's current opinion).
The markdown case is definitely more like the case I cited, where the Supreme Court decided that specifications and back-and-forth do not make it a derivative work, and thus the actual implementor is the author, not the spec writer.
> only humans can be granted copyright.
No, a copyright application can be filed with a corporation listed as the author. Watch for the copyright notice at the end of the next major movie you see.
However, until very recently the creative product must have been created by someone, so there is an implicitly created copyright over the product in the first place. With AI output, that might not continue to be true; we don't really know how it'll work out yet.
In any case, the corporation did not create the product, people created it and their contractual relationship with the corporation defined how the ownership of that work was managed. So, I don't find it too unusual that this element of personhood is available to corporations.
This isn't what the copyright means.
The employees and contractors are the authors, and because of the contract they sign they assign copyright to the corporation. Corporations, as a collection of humans, are allowed to have authorship.
LLMs are not companies and they are not humans in any way shape or form, and thus cannot get copyright nor grant copyright to a third party.
Under US copyright law, copyright protection subsists "in original works of authorship fixed in any tangible medium of expression":
<https://www.law.cornell.edu/uscode/text/17/102>.
It's not that corporations can't hold copyright. But a corporation cannot create "original works of authorship" by a purely mechanical process. That process is limited to human authors. "Works for hire" would be a common case of a human creator (author) resulting in a corporate assignment (ownership), see: <https://en.wikipedia.org/wiki/Work_for_hire>.
Notably cases:
- The "monkey selfie" copyright case, in which photographer David Slater arranged for monkeys to take selfies. Copyright ownership denied by both the US Copyright Office (against Slater's claim) and (in a separate case arguing the monkey should hold copyright) by an appellate court: <https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...>, Naruto v. Slater, No. 16‑CV‑00063 (N.D. Cal. 2016).
- Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991). Simple compilations are not copyrightable regardless of whether created by humans.
- THALER v. PERLMUTTER (2023). "[T]his case presents only the question of whether a work generated autonomously by a computer system is eligible for copyright. In the absence of any human involvement in the creation of the work, the clear and straightforward answer is the one given by the Register: No." <https://caselaw.findlaw.com/court/us-dis-crt-dis-col/1149169...>.
This interpretation makes sense. I think even the 'fair use' clause in the US doesn't protect LLMs. One argument I've heard often is that LLMs synthesize their training set to produce novel output in the same way as a human would... That may be the case, but legally an LLM isn't a human. You can't look at the output of an LLM and say that it's 'fair use' with respect to its training set; it hasn't been established that AI has the same 'fair use' rights as a human does. It's already pushing it that companies have this right (let alone an AI agent); anyway, that's just one problem...
Also, this is ignoring the fact that the researchers who compiled the training set COPIED the original copyrighted data in order to produce that training set. They either copied the entire work into the training set or they fed the entire work directly into the LLM; in either case, at some point, the entire work was copied verbatim into the LLM's input layer before it was ingested by the AI. The researchers copied the copyrighted content without permission.
Also, when it comes to code, the case is even more damning, because the vast majority of the code which LLMs are trained on was not only copyrighted but, at best, subject to an MIT license. And even the MIT license, the most permissive license in common use, still says clearly:
"Permission is hereby granted, free of charge, to any person obtaining a copy of this software"
The word 'person' is used very intentionally here.
I think there should be several kinds of AI taxes which should be distributed to all copyright holders. There should be a tax to go to writers (and book authors), a tax to go to open source developers and a tax for the general population to distribute as UBI to account for small-form content like comments and photography...
People invested a lot of time building their entire careers around the assumption of copyright protection; so for it to be violated on such a scale would be a massive betrayal.
I wonder what OSS licenses would have looked like if we saw all of this coming.
The LLM is just a database. It's like saying 'I own the copyright to what comes out of an API because I crafted the query' or 'I own the copyright to the responses I get from the bots on the Starship Titanic because I crafted the message they respond to'.
That's not what's been established to date in US caselaw:
THALER v. PERLMUTTER (2023). "[T]his case presents only the question of whether a work generated autonomously by a computer system is eligible for copyright. In the absence of any human involvement in the creation of the work, the clear and straightforward answer is the one given by the Register: No."
<https://caselaw.findlaw.com/court/us-dis-crt-dis-col/1149169...>.
I want this question to have an interesting answer, but everyone knows that if this question ever goes to the courts, ownership will go to the people with the money. The idea that Anthropic may not own Claude Code just because Claude wrote it is wishful thinking.
Best part is, it's likely to have a different answer in every country. Who knows what'll happen; not every country implicitly sides with whoever has the most money.
Depends on where they pay their taxes generally.
Well, eventually it'll probably be added to the Berne Convention agreement or some such.
That's my feeling on the endgame too, but it'll probably be a decade before we get anywhere near it.
It's not wishful thinking, and ownership isn't a foregone conclusion.
Sure, the courts could mint a communist society with a few weird decisions about property rights, but this being the US, do you really suppose that's likely?
There's really no legal question of any kind here: models aren't people and therefore cannot own property (and also cannot enter into the legal contracts that would be required to reassign the intellectual property they don't and can't own).
The catch-22 is that the fact that models aren't people is only relevant if you treat them like a person - as in the US Copyright Office's opinion, which treats the model like a freelancer. If you treat the LLM as a machine, like a camera, with the author expressing their existing intent through the tools of this machine, ownership is back on the table and more or less how it was before LLMs.
Well, if the camera, in addition to choosing autoexposure, also decided how to frame the shots, which lens to use, where to stand, and everything else salient to the artistry of photography - all without direct human intervention - then I would think the situation would again be analogous. If the camera could do all that because an intern was holding it, the intern would still own the shots even if their employer gave them the assignment.
That's why the intern signs an employment contract that reassigns their rights to their employer!!
The work-for-hire doctrine actually supports your intuition more than the AI authorship question does. The reason Anthropic likely owns Claude Code has little to do with whether Claude wrote it and everything to do with the employment contracts of the engineers who directed it. The DMCA takedown question is genuinely interesting though because DMCA requires the claimant to assert copyright ownership in good faith. If a court later found the codebase was predominantly AI-authored and therefore not copyrightable, the 8,000 takedowns could be challenged as bad faith DMCA claims. That is a different and more tractable legal question than the ownership one.
Work-for-hire doctrine doesn't automagically absolve you from IP law. Microsoft and Intel already learned this in the nineties when they paid the San Francisco Canyon Company to steal Apple code.
https://en.wikipedia.org/wiki/San_Francisco_Canyon_Company
LLMs are just code stealers; they will gladly generate Carmack's fast inverse square root for you, original comments included.
The San Francisco Canyon case is a good example of exactly the right distinction. Work-for-hire determines who owns the output, but if the process of creating that output involved copying protected material, the infringement claim runs separately. The piece makes this point on the open source contamination section: owning the output and having a clean chain of title to the output are different questions. You can own AI-generated code and still have a copyleft problem in it.
I have trouble believing that the DMCA claims would be found to be in bad faith when they were made at a time when the question of what degree of human input is required to acquire copyright on AI-generated code hasn't been resolved at all.
It doesn't seem like bad faith to think that copyright is stronger than the courts end up thinking, just being mistaken.
fair correction, updated the piece to reflect this. Bad faith under DMCA requires knowing the claim is false, not merely being wrong. A good faith belief in copyright ownership, even one that turns out to be mistaken, is a defense. The more accurate framing is that if the codebase is found to be predominantly AI-authored, the takedowns would fail on the threshold question of whether there is a valid copyright to assert, which is a different issue from intent.
I can't see how that can work.
As a developer, the fact that my source code passed through a compiler - an automated tool - doesn't give the author of the compiler any claim on my executable code.
As an artist, the fact that I used, e.g., Rebelle to paint a digital painting, or that I used Lightroom (including generative AI to fill, or other ML/AI tools to de-noise and sharpen my image) in editing a photograph, doesn't give EscapeMotion, Adobe, or Topaz, any claims to my product.
Why, then, would there be any chance that use of a tool like Claude - a tool that's super-advanced, to be sure, but that at the end of the day operates by way of mathematical algorithms - would confer any claims to Anthropic?
If a court later found the codebase was predominantly AI-authored and therefore not copyrightable
Is figuring out the appropriate prompts to use in directing Claude qualitatively different than using a (much) higher-level abstraction in coding? That is, there was never any talk, as we climbed the abstraction ladder from machine code to assembly to Fortran or C to 4GLs to Rust etc., that the assembler/compiler/IDE builder would have any ownership claim on the produced executable. In what sense can Anthropic et al. assert that their tool, which just transforms our directives into some lower-level representation, creates ownership of that lower-level representation?
I love that genAI art will not be copyrightable and genAI code will be. The power of the Almighty Dollar at work.
I'm not sure Anthropic would appreciate the liability that ownership would imply.
Too late to edit, but OpenAI certainly doesn't want ownership or liability for the CSAM they've produced. They certainly don't want ownership/liability for code that does $ONLYAWFULTHING.
They won't want to own code that is malicious/illegal/used in crime, although it's really weird to me that no one (in LEO) seems to care that, for example, Grok generates CSAM, revenge porn, and probably other illegal things, so they'll probably get to have their cake and eat it too.
Those things have precise legal definitions, and it may not be entirely clear that an LLM can even generate them - especially in the USA, where the 1st Amendment covers things that many would think illegal (and that are illegal in other countries).
This is the same shape as the image cases.
Zarya of the Dawn already settled it for Midjourney output: human-written elements were protected, AI-generated images were not. The character design didn't get copyright even though the human picked, prompted, and curated. Code isn't different. Prompting Claude to produce a function is closer to prompting Midjourney to produce a frame than to writing the function yourself.
The reason it feels different to engineers is that we're used to thinking of the compiler as the analogy. But a compiler is deterministic — same input, same output. An LLM isn't. That's the line the Copyright Office is drawing, and image cases got there first.
But is there anything stopping a human from applying for copyright in their own name? Does the fact that somebody can recreate the prompt invalidate their claim?
Filing isn't the gate, registration is.
Copyright Office requires you to disclose AI involvement and disclaim the AI-generated parts. Zarya of the Dawn is the example — applicant filed for the whole graphic novel, got partial registration on the human-written text, refused on the Midjourney images. The reproducibility of the prompt isn't really the test. The test is whether a human made the expressive choices.
Your comments are getting classified by our software as LLM-generated or (more likely) LLM-edited. It's impossible to be certain, of course, but if this is the case—can you please not do this? It's not allowed here - see https://news.ycombinator.com/newsguidelines.html#generated and https://news.ycombinator.com/item?id=47340079.
LLMs are amazing of course and we use them heavily ourselves - but not for modifying text that is to be posted to HN. Doing so leaves imprints on the language that readers are increasingly becoming allergic to, and we want HN to be a place for human conversation.
Wow, yes sir! I was using Claude to write faster. But I understand. Thanks for the note.
> Filing isn't the gate, registration is.
Not really. Copyright registration is pretty much automatic. The Copyright Office does not check for duplicates. Patent registration involves actual examination for patentability. Issued patents are presumed valid (less so than they used to be), but issued copyrights are not. You have to litigate.
The US does not have "sweat of the brow" copyrights. It's the "spark" that creates the originality, not the work. Which is why you can't copyright a telephone directory (Feist vs. Rural Telephone) or a copy of an uncopyrighted image (Bridgeman vs. Corel) or a scan of a 3D object (Meshwerks vs. Toyota). Or the contents of a database as a collective work. Note that some EU countries do allow database copyright.
Interestingly, a corporation can be an author for copyright purposes. The movie industry pushed for that. We may in time see AI corporate personhood for IP purposes.
What you're asking is, "could someone do fraud" and "would being found out invalidate their copyright". To both of which the answer is generally, yes.
It'd be a form of plagiarism, just with different consequences to the most common form.
> But a compiler is deterministic — same input, same output. An LLM isn't.
Temperature-0 determinism is subject to active research. NVIDIA has tried but failed so far; DeepSeek V4 seems to have done it. I hope judges won't be swayed by this, and that AI-generated code will be classified as uncopyrightable, just like images are.
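To make the temperature point concrete, here is a toy sampler sketch (a minimal illustration, not any real inference stack; note that in practice even temperature 0 can go nondeterministic, because batching changes floating-point reduction order):

    import numpy as np

    def sample_token(logits, temperature, rng):
        # Temperature 0 is greedy decoding: always the single most
        # likely token, so repeated runs give identical output.
        if temperature == 0.0:
            return int(np.argmax(logits))
        # Positive temperature rescales the distribution and samples
        # from it, so repeated runs can diverge token by token.
        scaled = logits / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    logits = np.array([2.0, 1.5, 0.3])
    rng = np.random.default_rng()
    print([sample_token(logits, 0.0, rng) for _ in range(5)])  # always [0, 0, 0, 0, 0]
    print([sample_token(logits, 1.0, rng) for _ in range(5)])  # varies run to run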
Fair point on temp-0. But I don't think determinism is what the courts will hang it on. A deterministic LLM still makes the expressive choices — naming, structure, control flow — that the human didn't make. The image cases didn't turn on whether you could re-roll the same Midjourney frame. They turned on who made the creative decisions. Same logic should hold for code.
Depends on the scale of LLM involvement. The Copyright Office left a pretty big carve-out for things that are human-sourced and then modified by LLM, or the reverse, LLM output that's modified by human intention. (They had to do this because there are already pseudo-random elements in digital artwork, like, say, render clouds and render noise, that might otherwise poison an artwork.) In fact, I don't think this has been tested with "highlight an area > prompt a change to this area of the image" workflows.
They also mention in the same document that were LLMs to more closely approximate deterministic tools, they would be open to reevaluating. That is, requesting X gets X without substantial wiggle room.
I don't think that last part has been tested with an extremely large set of prompts and human-generated input creating a more deterministic output. Even outside of code, where you see large prompts, creative-writing LLM tools like NovelAI or Sudowrite can have pages and pages of spec for the LLM, sometimes close to 50% of the size of the final output.
Then there's testing, review, etc.: human processes confirming that the output meets spec and updating it intelligently where needed.
There are also foreign courts, with similar rules about human intention, that have found in favor of prompts only, where it could be demonstrated that multiple rounds of prompts were used to refine the image.
I wouldn't call this settled at all, tbh. And to be honest, a lot of this doesn't require exposure. You don't need to own up to LLM use in a lot of settings; proving LLM use is so difficult that it's easy to jump up the ladder from LLM (100%) to LLM (50%) and ultimately claim ownership.
The people who will get busted for this are basically just the super lazy: leaving ChatGPT responses in, failing to pay an editor, failing to modify images for anything more than layout.
AFAIK: even the slightest modification of the work is transformative and will produce copyrighted material.
It does not have to be a substantial transformation.
That's quite an impressive approach from the companies' perspective. Let's first use Claude Code, and then we'll think about who the code belongs to.
I think the gold-rush approach happening right now around me (my company's EMs forcing me to work with Claude as fast as possible) shows real short-sightedness on the part of management.
First - I lose my understanding of the code base by relying too much on Claude Code.
Second - we drop all the good coding practices (like XP, code review, etc.) because Claude is reviewing Claude's code.
Third - we just take a big smelly dump on the teamwork - it's easier and cheaper to let one developer drive the whole change from backend to frontend, even though there are (or were) two different teams - one for FE, one for BE.
Fourth - code commenting was passé, as "the code documents itself"... unless there is a problem with the context (and there is). When people were the ones writing the code and couldn't understand the over-engineered parts, that was their own fault. But now we take a step back for our beloved Claude because it has a small context window... It's unfair treatment.
I could go on and on. And all those cultural changes are because of money. So I dub this "goldrush", open my popcorn, and see what happens next.
I rarely see #3 yield better solutions; it's usually better to collaborate as a team on requirements and gotchas, but let one person own implementation.
But both backend and frontend? Does everyone have to be full stack?
Also, it's supremely easy to land on the wrong abstractions long-term and to lock in premature internal designs that are starved of human mental modeling - the very thing that lets you explain, with accountability, how things work and what the plans are when an incident happens. And if the wrong generalizations are introduced, coded correctly, and reviewed and approved by AIs, then who's even driving, really?
> Third - we just take a big smelly dump on the teamwork - it's easier and cheaper to let one developer drive the whole change from backend to frontend, even though there are (or were) two different teams - one for FE, one for BE.
Agree with your other points, but IMO this one has always been better. You often need to design the backend and frontend to work with each other, and that requires a lot more coordination when it's separate teams.
One of the few things I do kind of like about LLM-assisted coding is that it's helping to bring back "lone wolf" programming. We currently default to using massive teams to build massive software because of all the work involved, but teams have a huge communication/documentation cost, and a lot can leak and be lost the more communication has to happen to get things done. Code assistants cut down on the "all the work involved" part, and I think will help to bring one-man shops back into fashion.
On the other hand, separating FE and BE between two teams, necessitating proper interfaces, can often be considered a feature.
The fourth point about code commenting is the one that connects directly to the ownership question. When developers write comments to explain intent, those comments are evidence of human creative direction. When Claude writes the code and the comments, and the developer merges without adding their own explanation of the architectural decisions, the record of human authorship disappears along with the institutional knowledge. The documentation problem and the copyright problem are the same problem.
I opened my popcorn for the unholy trinity of HN x law x AI, your comment was one of my faves, love the purple prose. :)
People have quickly forgotten: when Copilot was announced, there were warnings not to use it for company code because of the license attribution problem. So what's changed? That Anthropic is willing to defend and indemnify?
This is all well and good as an intellectual exercise, but in real life none of this matters. Almost no one thinks their code is copyrightable or seriously thinks their code is a moat. I've written the same chunks of code for a number of employers as has every engineer. We've all taken chunks from stack overflow and other places without carefully considering attribution.
This comes up in a few places as a kind of vindictive battle. One example is Oracle suing Google for too closely mimicking their API in Android, including the famous rangeCheck function:
    private static void rangeCheck(int arrayLen, int fromIndex, int toIndex) {
        if (fromIndex > toIndex)
            throw new IllegalArgumentException("fromIndex(" + fromIndex +
                ") > toIndex(" + toIndex + ")");
        if (fromIndex < 0)
            throw new ArrayIndexOutOfBoundsException(fromIndex);
        if (toIndex > arrayLen)
            throw new ArrayIndexOutOfBoundsException(toIndex);
    }
And it was deemed fair use by the Supreme Court. Other times, high-frequency trading firms have sued exiting employees, sometimes successfully. In America, anyone can sue you for any reason, so sure, you'll have Ellison take a feud with Page and Brin all the way up to the Supreme Court.
In 99.9% of instances, none of this matters. Sure, there's the technical letter of the law, but in practice, and especially now, none of this matters.
https://www.supremecourt.gov/opinions/20pdf/18-956_d18f.pdf
> Almost no one thinks their code is copyrightable
Then why does reverse-engineered code need to be a clean-room implementation?
Ask any emulator developer, or the developers of ReactOS:
https://reactos.org/forum/viewtopic.php?t=21740
> Almost no one thinks their code is copyrightable or seriously thinks their code is a moat.
You'd be surprised! Among non-software management types, they often think of the code as extremely valuable IP and a trade secret. I'm a CTO and I've made comments before to non/less technical peers about how the code (generally speaking) isn't that big of a secret, and I routinely get shocked expressions. In one case the company almost passed on a big contract because it required disclosure of the source code (with an NDA). When I told them that was a silly reason and explained why, they got it, but the old way of thinking still permeates and is a hard habit to break.
Edit: Fixed errant copy pasta error. Glad that wasn't a password :-)
You're right, I guess maybe I mean in any serious actionable way. Senior, non technical people leave plenty of money on the table by thinking they're protecting something valuable or they have some kind of secret sauce. It's all silly is what I meant to say, and digging into the technicalities of whether your code is truly copyrightable is kind of pointless. It's all vibes.
The place where it concretely matters is M&A due diligence. Acquirers are now routinely asking about AI tool usage in development and running license scans as a condition of closing. A codebase that cannot demonstrate human authorship over its core IP, or that contains GPL contamination, creates a representation and warranty problem in the purchase agreement. For most companies day to day you are right. For the companies that get acquired or raise institutional capital, the question becomes very concrete very quickly.
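For a rough sense of what the most basic version of such a scan does, here is a minimal sketch (illustrative only; real diligence tools such as ScanCode or FOSSology match full license texts and package metadata, not the few hypothetical header phrases below):

    import os
    import re

    # Hypothetical marker list for illustration; real scanners match
    # entire license texts rather than a few telltale phrases.
    COPYLEFT = re.compile(
        r"GNU (Lesser |Affero )?General Public License"
        r"|SPDX-License-Identifier:\s*[AL]?GPL",
        re.IGNORECASE,
    )

    def scan(root):
        """Yield source files whose first 4 KB mention a copyleft license."""
        for dirpath, _, files in os.walk(root):
            for name in files:
                if not name.endswith((".c", ".h", ".py", ".js", ".go", ".rs", ".java")):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        head = f.read(4096)
                except OSError:
                    continue
                if COPYLEFT.search(head):
                    yield path

    for hit in scan("."):
        print("possible copyleft header:", hit)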
Very interesting, I had no idea. That's probably going to be a very painful lesson learned by all the startups that have been pumping out AI code. I know of several just among my peer groups that will be shocked and dismayed by this. Thanks for sharing that!
That is exactly the gap the piece is aimed at. The M&A conversation is where this becomes concrete very fast, and most founders shipping AI-assisted code have not had it yet.
Eh, it does and it doesn't. PE investors are actively asking why more of the portfolio companies aren't generating codebases using Claude Code. You are right that lawyers are asking about code generated by LLMs, but this is CYA out of ignorance more than anything else (btw, many purchase agreements have funny representations like "your code is free of bugs", which is downright hilarious).
So these two things are squarely at odds with each other... meaning, I don't know any PE acquirers who are actively terminating deals because the target acquisition's code was generated by an LLM, even if the lawyers try to get a rep about it in the purchase agreement.
For the record, I have yet to have an M&A lawyer explain to me definitively that AI-generated code is an infringement... hence the question "who owns the code Claude Code writes" is still open.
The tension you are describing is real and the piece does not capture it well enough. PE acquirers pushing portfolio companies toward Claude Code while their lawyers are adding AI code reps to purchase agreements is exactly the gap that will produce the first painful deal. The rep usually survives unsigned because neither side has done the analysis. When the first deal falls apart or a rep is breached post-close because of GPL contamination in an AI-assisted codebase, that will set the market standard faster than any court ruling.
> When the first deal falls apart or a rep is breached post-close because of GPL contamination in an AI-assisted codebase, that will set the market standard faster than any court ruling.
Assuming it ever does...first, GPL is hardly enforced and second, I feel like there is going to be enough money (e.g. Anthropic's own code it uses for the harness) that pushes back against it being problematic. We'll see.
Maybe LLM coding agents change the equation by making it much easier to adapt and use foreign, probably incomplete code, getting you closer to competing with the original authors in less time than generating new code from scratch would take.
Totally agreed.
I work in M&A. Nearly every lawyer, accountant, investor, and software business owner thinks their code is uniquely valuable and a trade secret. I find it hilarious and try to be as diplomatic as possible about why it's not. They'll also willingly give their client list to a potential acquirer but get super cagey the moment a third-party provider asks for their code to be scanned.
This argument gets shut down easily when I ask why Twitch, a $1B business, didn't crater to its competition when its full codebase was leaked.
I’ve worked at too many places where I mused that if someone gave the source code to the competitors, it’d likely drive the competitors out of business as they tried to use it.
Keeping it proprietary probably has the greatest value in preserving the company’s reputation…
> Almost no one thinks their code is copyrightable
I think this is an unusual opinion.
Code may not be copyrightable in as small chunks as you put there, but in terms of larger pieces I think companies and individuals very often labour under the belief that code is intellectual property under copyright law.
If code isn't copyrightable, from where comes the GPL?
And why does anyone care if (for instance) some Microsoft code might have accidentally ended up in ReactOS, causing that project to need to go into a locked-down review mode for months or years? For that matter why do employers assert that they own the copyright in contracts?
I think it's the opposite - almost everyone thinks their code is copyrightable, outside of APIs and interop stuff, or things so simple as to be trivial.
Nobody ever talks about convergence.
You, right now, are talking about convergence.
If there is no artwork, there can be no copyright. If every character of the code you write is basically predetermined by the APIs you need to call, there is no artwork and no copyright.
Build a novel new API, and you'll be protected though.
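To make the convergence point concrete with a hypothetical: ask two developers who have never met to load a JSON config file using only the Python standard library, and you will get essentially the same lines, because the API leaves no room for expression:

    import json

    # There is essentially one idiomatic way to write this, so any two
    # authors converge on it independently. Expression dictated entirely
    # by the API is exactly where copyright has nothing to protect.
    with open("config.json") as f:
        config = json.load(f)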
Why were the HFT firms suing employees?
> Almost no one thinks their code is copyrightable
Every open source license is built on the premise that code is copyrightable.
No.
It is based on the premise that if proprietary licenses are valid, then open source licenses are also valid.
So what is held as true is only the implication stated above, not the truth value of the claim that either kind of license is valid.
If proprietary licenses are not valid, then it does not matter that open source licenses are not valid either.
The open source licenses are intended as defenses against the people who would otherwise attempt to claim ownership of that code and apply a proprietary license to it, i.e. exactly what Anthropic and the like, together with their corporate customers, have now done.
Of course, if it is accepted that the code generated by an AI coding assistant is not copyrightable, then using it would not really be a violation of the original open source licenses. The problem is that even if this principle is the one accepted legally, at least for now, both Anthropic and their corporate customers appear to assume that they own the copyright for this code that should have been either non-copyrightable or governed by the original licenses of the code used for training.
Yes.
“ Copyright <YEAR> <COPYRIGHT HOLDER>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.”
The copyright assertion is the very first line of the MIT license, and the right to copy the code is granted. Clearly a reasonable person would affirm that that license (and all similar licenses) are based on a premise that code can be copyrighted.
> It is based on the premise that if proprietary licenses are valid, then open source licenses are also valid.
> If proprietary licenses are not valid, then it does not matter that open source licenses are not valid either.
That’s not true. Imagine a world where proprietary licenses are made invalid.
In such a world, a company could take open source code, compile it, and distribute it (or build a SaaS on it) without the source code.
Even if you only focus on licenses that don’t prohibit this, most of those licenses require attribution.
So even in a world where proprietary licenses were invalid, the majority of open source licenses would still have a purpose.
You’re attempting to split hairs to argue on a very subtle technicality, but you’re not even technically right.
MIT just disclaims all the author's rights except attribution. If it turns out the code isn't copyrightable, nothing really changes. A better example would be GPL.
I mentioned that in my comment, but attribution is a big deal.
I think it should be pretty clear that if you provided the tool the specification for the code you want, you have already provided creative input.
After all, is this not what happens with compilers as well? LLM agents are just quite advanced compilers that don't require the specification to be as detailed as with traditional compilers.
To me this is like asking who owns the binary files a compiler generates.
>it should be pretty clear that if you provided the tool the specification for the code you want, you have already provided creative input.
If you provided a human contractor with the specifications for the code you want, the courts have repeatedly made clear you have not provided the creative input from a copyright perspective, and the contractor needs to explicitly assign those rights to you if want to own the copyright on the code.
Let's say we didn't have assemblers, but instead we would have three professions:
- Specifiers, who make the specification for the system
- Programmers, who write C code
- Machine encoders, that take that C code and write machine code for a CPU
Would the copyright then belong to the programmers, if no other explicit assignments were made?
---
Thinking about it, probably yes: copyright of the spec belongs to the specifiers, copyright of the C code to the programmers, and copyright of the machine code to the machine encoders. Or would it depend on the amount of optimization the machine encoders did, i.e. is it creative or not? And how does this relate to the copyrightability of C compiler output, where optimizations can sometimes surprise the developer?
In music, you can have copyright for a composition (like, lyrics and sheet music), and then for a master record. If you sell a copy of a song, you generally have pay royalties to both copyright holders.
So, in your example, the specifiers would own the specification, the programmers the C code, and machine encoders own the machine code.
But the ownership wouldn't be complete. If you sell the machine code, you'd have to pay royalties to all three. If you only sold the C code, only to the specifiers and the programmers.
LLMs aren’t human.
The compiler analogy is the right one to reach for and the Copyright Office addressed it directly: the question is not whether you provided input, it is whether the creative expression in the output reflects human authorship. With a traditional compiler, the programmer authors every expression in the source. With an LLM, the programmer authors the intent and the model makes the expressive decisions about structure, naming, pattern, and implementation. Whether that distinction matters legally is what Allen v. Perlmutter is working through right now. The summary judgment briefing completed in early 2026 and it may be the next landmark ruling on exactly this question.
Specifications are not necessarily creative input. E.g., if I write a prompt that just says "write a rate limiter in Python", there's really no creative input. I didn't decide on the API, or the algorithm to bucket requests, or where to store counters, etc. I just gave it statements of fact, which are inherently not creative.
Compilers are different in that the resulting binaries are not separately copyrighted. They are the same object to the Copyright Office because one produces the other, in the same way that converting an image to a PDF is still the same copyright.
LLMs don’t do that. The stuff coming in may not be copyrighted, and may not be copyrightable. The stuff that comes out is not a rote series of transformations, there are decisions being made. In common use, running a prompt 10 times might yield 10 meaningfully different results.
I’m dubious the outcome will be “any level of prompting is enough creativity”.
Fine, then that's not copyrightable at all. Just like "hello world" isn't copyrightable, whether in source form or compiled form.
The trick is to constrain the LLM to program in a very defined coding style.
If I make the LLM generate code that follows my own code architecture and style, that should be enough creative input
Possibly; I'm not going to hazard a guess on what the Supreme Court will decide the exact bar is. I just don't think it will be either extreme. "Nothing is copyrighted" is too damaging to the economy, "everything is copyrighted" has weird impacts on non-LLM copyrights that conflict with precedent.
This is actually the opposite of what the copyright office has said. Directly addressing AI generated code/prompts, they compared it to someone who is commissioning art, describing to the artist what they want.
The copyright falls to the artist, not the person commissioning it.
Complicated in this case, because there is no artist.
It's well known that recipes cannot be copyrighted. But recipes still are protected intellectual property by trade secret law if they are treated as a secret by the holder of the recipe.
Claude code itself is a trade secret, and it is not open source, so its own copyrightability is moot till you get your hands on a copy of it with clean hands.
Recipes cannot be copyrighted because they are not expressions of human creativity. Software written by AIs is also not an expression of human creativity, so the balance is tilted in favor of AI-generated output not being copyrightable.
The Supreme Court or legislation could change this, and I'd guess there will be a movement to go in that direction, but till something like that succeeded it's not so.
> Software written by AIs is also not an expression of human creativity
I mean, I'm not the biggest fan of AI on the planet by any means (which I think my post history would prove, lol), but isn't prompt design and steering the AI "human creativity"? In one of my AI-assisted projects I spent like a week in unending threads of posts trying to make the AI do stuff the way I wanted, testing the output, finding a bazillion bugs and "basic bitch" solutions, asking for more robust this and edge-case that. It felt like I wrote a novel. How is that not creativity (crayon-eater or Picasso, creativity is creativity)?
I wonder: when my manager "prompts" me with "I want feature X and I want it fast", is his prompt human creativity?
To some extent yes. Your output at work is based on a combination of inputs from others in your organization, and is being paid for by your employer, so the organization owns the copyright on what you make for them.
I think from this view it makes sense that an LLM is a tool, and the operator of that tool (or their employer) can own the output.
The tricky part is when you squint and view an LLM with training input and prompted output as a machine that launders copyrighted input into customized output that is now copyrighted by a new owner.
A machine that vacuums up film reels and splices them according to a set of instructions by the user to create a compilation of recent animated Disney movies with the Shrek soundtrack superimposed would probably not pass legal challenges if the user of the tool attempted to claim full copyright on the output.
His prompt might be the result of human creativity, but even in that case it's more than likely not a copyrightable expression of human creativity.
A copyrightable expression of human creativity in that case would need to be substantial enough in size to carry an imprint exclusive to your boss.
"Why'd the chicken cross the road? To get to the other side" is not copyrightable. You can dress it up all you want - "why didst thy chickencock traverseth thee highway?..." etc. would not qualify as something that would be exclusively yours/your boss's, because that trick is still rote.
BUT:
How do I love thee? Let me count the ways I like to see you work.
I love thee to the depth and breadth and height and number of your pull requests
My soul can reach, when feeling out of sight of your overnight toils
For the ends of being and ideal grace you provide me when you ship!
(i'm just extending each line of Elizabeth Barrett Browning https://poets.org/poem/how-do-i-love-thee-sonnet-43 )
That would be copyrightable if it was original to your boss.
The law, and interpretation of the law, does not share our tidy and necessary obsession with fencepost errors and corner cases. It deals with them by stepping back and asking: what would an ordinary person think should be copyrightable, versus what is more akin to the word games that clever nerds on the playground get beat up for?
>isn't prompt design and steering the AI "human creativity"?
Yes, it is, but that does not make the AI's response an expression of human creativity, and so the response is not copyrightable.
If you wrote down your teachings about prompt design and published a book, your expression of your creativity would be copyrightable, but the ideas expressed in the book would not be.
If you-the-creative-human wrote your prompt design as a computer program, that expression of your ideas would be copyrightable. But other people could express themselves by typing in what your program does, without using your particular expression, and would not be stopped by copyright.
It's easy to get into the intellectual weeds questioning this, but just step back to this: copyright was intended to give authors an income from their work without stopping other authors from writing their own works. Everybody gets to write a King Lear play if they want; they just can't copy somebody else's expression of the ideas. What "expression" is trying to capture is "what makes you different from me, be we alike in most other ways."
As a funny sidelight, the titles of books and movies are not copyrightable, nor considered part of the copyrighted work, because they are not considered to, in a sense, "leave enough room to contain expressions of human creativity" - although, considered in the larger context, they might contain creative puns or double meanings that make illuminating sense.
However, the title of a movie may be a trademarked term (like Pokemon, Xformerz, or whatnot) but trademark has a "type" of good or service component (line of business) and the trademark would apply to action figures (dolls) and pajamas (clothing), but not to the film itself.
> But recipes still are protected intellectual property by trade secret law if they are treated as a secret by the holder of the recipe.
Trade secrets aren't very well protected, though.
You can sue the person who leaked/stole your secret, but if others keep sharing it once it is leaked you can do nothing to them.
I wasn't advocating for trade secrets as "equal" or "the way to go"; I was trying to explain in simple terms how to think about copyright issues in concordance with the existing legal structures.
People here without much experience were trying to intellectually reinvent wheels, and I wanted to save them time in structuring their arguments. I have been exposed to various tips of the legal iceberg, was thrilled by what I learned, and am trying to pass it on.
> Claude code itself is a trade secret, and it is not open source, so its own copyrightability is moot till you get your hands on a copy of it with clean hands.
In this case Anthropic published the Claude Code source map file on npm themselves. https://venturebeat.com/technology/claude-codes-source-code-...
So by this logic, my autocomplete function before AI also wrote 50% of my code, and it's not made by me because I didn't type it?
What should matter is intent, the human that gives the orders.
> What should matter is intent, the human that gives the orders.
I'd like to hear more nuance with regards to this line of reasoning. Can you conceive of a model that contains highly non-trivial representations of IP owned by others than yourself? Can you conceive that you might "order" the model to "produce" that IP? What happens then?
Try this both for "open source code" as the IP, and "the novel I wrote", and "latest Hollywood movie". The model does not have to be a real model currently available. It's just a thought experiment.
Try also to elaborate on the sliding scale between "an AI model" and "a compression system".
Well, have you actually read the license for the autocomplete function?
Example,
https://marketplace.visualstudio.com/items/VisualStudioExptT...
I just did. Nothing in here says who owns the resulting text. Did I miss something?
> If you hover over a line of code in your application, coding assistance services will display code strings of supported function calls available through the coding assistance service that are also present in your current code file. Coding assistive services will retrieve snippets from publicly available open source code showing how others are using those same functions. 3. THIRD PARTY COMPONENTS. The software may include third party components with separate legal notices or governed by other agreements, as may be described in the notices file(s) accompanying the software.
I've read that paragraph multiple times (both in the original and in your post) and I don't see anything that says who owns the resulting text. Just where it comes from. Am I missing something obvious?
>will retrieve snippets from publicly available open source code
Pretty sure it depends on the license the open source project uses. I don't think it's too troublesome if the autocomplete was truly taken only from open source projects, but it wouldn't surprise me if most closed-source projects are also weighted into these models...
Ah, thank you. I read that but wasn't connecting the dots properly.
That is not auto-complete, that's API Usage Examples. See: https://marketplace.visualstudio.com/items?itemName=VisualSt...
The autocomplete function doesn't really change what you write, so it doesn't remove the human creativity.
>What should matter is intent, the human that gives the orders.
If you are instructed by your professor to write an application, do you own the copyright or the professor?
Suddenly, you think you own the copyright again. In fact, in every case, you think you own the copyright. Because of your feelings. That's a common opinion here on HN too. You don't hold this opinion by any logical stance, nor by any legal doctrine.
The fact is: Copyright law applies to human authors. AI is not a human.
https://www.congress.gov/crs-product/LSB10922
This is of course assuming you take AI-generated code unchanged. But you don't, in my experience. And that generates a new work fully copyrightable even if the original wasn't. Just like how the fad a decade or so ago of taking Tolstoy and Jane Austen works and adding new elements -- "Android Karenina" and "Sense and Sensibility and Sea Monsters" are copyrighted works even if the majority of the text in them was from public domain sources.
I'm sure it's not quite that simple. Only the parts of those knock-off works that aren't public domain could be copyrightable. If you only own the copyright to ten lines in a 10k-line codebase, then it's probably fair use for someone else to just take the whole thing.
Plus what if Anna Karenina was GPL?
Anna Karenina is public domain, assuming you’re talking about the original? If you translate it then maybe you could release it under GPL, but bit odd?
I think you missed the "what if". It was just a point about how the constructed scenario might be different to the real scenario. Most AIs are not trained only on public-domain work.
I didn't, but not sure what the point is. Maybe I missed something else?
You use humans to edit AI code? When you level up you are just using AI to write, AI to review, AI to edit, AI to test. Not a lot of steps left for meat bags.
You're forgetting that you need coffee/tea/mate to fuel the button pushers. The Jetsons predicted this decades ago.
AI for review is terrible, through no fault of its own. It's our job to specify and document intention, the domain, and the right problems to solve, and that is just hard to do. No getting around it. That's job security for us meat bags.
AI to write - code is buggy and not what I asked for
AI to review - shallow minutia and bikeshedding
AI to edit - wrote duplicated functions that already existed
AI to test - special casing and disabling code to pass the narrow tests it wrote
AI report - "Everything looks good, ship it!"
The article addresses this explicitly:
> Works predominantly generated by AI without meaningful human authorship are not eligible for copyright protection
Note the word "predominantly", and the discussion that follows in the article about what the courts and the copyright office said.
Skimming over the article, it's a lot about what the Copyright Office said and very little about what courts said. But the opinion of the Copyright Office doesn't have any legal force. Regulations passed by the Copyright Office would be binding, but their opinions are just opinions. We will have to wait until relevant court cases reach a conclusion. And so far the pending litigation isn't even about that question; it's about infringing the rights of works that are in the training data.
Ok what about all the Anthropic’s engineers who say they don’t write code at all and it’s 100% AI-generated?
No such assumption is made in the article.
Nor does it give a single answer.
Mere prompting is still not enough for copyright, and it is an unsolved problem how much contribution a human needs to make to the generated code.
In the case of generated images, copyright has been assigned only to the human-modified parts.
Even worse, it will be slightly different in other nations.
The only jurisdiction that accepts copyright for the unchanged output of a prompt is China.
Here's a question I have: if the AI generated image is of a character of which you own the IP, don't you have protections based on the character regardless of who gets copyright protections from authorship of the image?
Yeah, if you have a copyright on the character, the AI-generated image doesn't change that. It doesn't give you more or less protection than you already had.
IANAL but this sounds more like trademark territory.
You can also trademark a character if it’s used as a brand identifier in commerce.
There are far more characters protected by copyright than trademark.
> This is of course assuming you take AI-generated code unchanged.
How much code do you need to change in order for it to be original? One line? 10%? More than 50%?
That's arbitrary and quite unproductive convo to be honest.
> That's arbitrary and quite unproductive convo to be honest.
Yeah but that’s what the legal system ostensibly does. Splitting fine hairs over whether a derived work is “transformative” is something lawyers and judges have been arguing and deciding for centuries. Just because it’s hard to define a bright red line, doesn’t mean the decision is arbitrary. Courts will mull over whether a dotted quarter note on the fourth bar of a melody constitutes an independent work all day long. It seems absurd, but deciding blurry lines are what courts are built to handle.
Because at the end of the day, someone has to own the code, so some lines have to be drawn no matter how arbitrary they seem.
EDIT: I changed my argument completely.
That makes no sense, because what if you refactor your code ad infinitum using AI? You spin up a working implementation, then read through the code, catalog the changes you want (interface, docs, code quality, patterns) and delegate to the AI to write what you would have written.
It's 100% AI code and it's 100% human code. That distinction is what's counterproductive.
Wrong. This territory was heavily covered in music before code was at issue - it has to be "transformative" in the eyes of the law. Even going in and cleaning up code or adding 10-25% new code won't pass this threshold. Don't bother arguing with me on this; just accept reality and deal with it.
My copy of "Sense and Sensibility and Sea Monsters" is explicitly listed as being copyrighted by Ben H. Winters in 2009 despite the majority of the words being Austen's, though. Perhaps music has different rules compared to text. I suspect Winters and his publisher have investigated the legality of this more than either of us have.
Jane Austen died long enough ago that her works are in the public domain, so Winters did not need a license to use it. That does not mean that he gained rights to her work: if he tried to sue someone for use of anything which appeared in the original, he would lose in court because it’s easy to show that copies made before he was born had the same text. This also how they prevent people trying to extend copyright by making minor changes to an existing work: the new copyright only covers the additions.
There’s a very accessible summary of the United States rules here:
https://www.copyright.gov/circs/circ14.pdf
If you modify the work, that creates a derived work from whatever copyright the original works has, not a new work that is fully copyrightable.
As the article says in the Tl;DR at the top the code may be contaminated by open source licenses
> Agentic coding tools like Claude Code, Cursor, and Codex generate code that may be uncopyrightable, owned by your employer, or contaminated by open source licenses you cannot see
> This is of course assuming you take AI-generated code unchanged. But you don't, in my experience. And that generates a new work fully copyrightable even if the original wasn't.
That's not how copyright works. The modified version is derivative. You can't just take the Linux kernel, make some changes, and slap a new license on it.
My opinion: copyright has mattered very little in the corporate world. Copyright is effectively meaningless with SaaS, and the compiled software run on your machine is protected more by technical controls and EULAs. A world where copyright didn't exist for software would look nearly the same for the commercial world. Trade secrets, NDAs, and employment contracts bind workers more than copyright. The only place where the question of copyright has real-world impact is open source, and even then only for the more restrictive licenses such as the GPL.
Plus, companies just violate the GPL billions of times with impunity (see: every phone ever) and nothing happens to them.
What is being licensed by the End User License Agreement (EULA) is the copyright on the code and its artefacts (executable bytes, etc.) - you can't have an EULA without having the copyright to license.
You can have an EULA on anything; it's a contract. You don't need copyright to enforce terms that two parties have agreed upon. The only thing copyright adds is forcing anyone in possession of the copyrightable material to honor an EULA. If you can only get software through approved channels, it's hard to avoid an EULA - otherwise you would have to obtain it through the same pirate channels you have to use now.
One question I have is this: if an employee produces code predominantly generated by AI, it means that it is not copyrightable. Does that mean that the employee can take that code and publish it on the Internet?
Or is it still IP even if it is not copyrightable? That would feel weird: if it's in the public domain, then it's not IP, is it?
A recipe isn't copyrightable but is still protected under trade secret law. I imagine the same would apply here. I think the major difference with software copyright is that, without it, I could just decompile your binary, or copy a binary and give it to other people. For SaaS companies that don't distribute binaries, I imagine they basically have the same protections against rogue employees.
Presumably company policy would be implicated here, not copyright law. Whether or not it's copyrightable, what you create using AI is work product.
To look at it another way, just because some code I work on at my job is derived from open source MIT-licensed code doesn't mean I personally have the right to distribute it if my company doesn't want me to. I'd guess this comes under some generic "confidential information" clause in the employment contract.
Hmm, your example is different: if you manually write code, there is a copyright on it, whether it is derived from MIT-licensed code or not. If you don't own that copyright (because your employer does), then you don't have the right to distribute it, because it is not your code.
If you generate the same code with AI, now it does not have a copyright. If it depends on an MIT library, then the MIT library has a copyright and you have to honour the licence. But the code you produced does not have a copyright (because it was generated by an AI). And therefore nobody "owns" it. My question is: can your employer prevent you from distributing something they don't own?
This is a very long-standing and, AFAIK, never explicitly decided copyright and human rights question: if something is public domain, are contracts restricting its distribution valid? Is our right to information or knowledge a fundamental human right that it is not permissible to take from others, such that restrictions greater than those imposed directly by the State are invalid? In a healthy society, "I have created an extraction machine and your actions are hindering my extraction" is not a valid argument. So at the very least, contracts restricting rights to public domain works should be allowed only with heavy restrictions as to when, how, and for how long they are binding - much as the legality of non-competes has steadily been reduced in many places in recent years.
CC0 came about in part because of this ambiguity. To deal with it, part of CC0 basically says - even if there would still be restrictions to this if it were only in the public domain, I renounce those theoretical rights.
Outside the underdeveloped legal framework, I believe knowledge and truth are like life, and human society has some continued philosophical growth to do here.
That is exactly the right question and the answer is genuinely strange. Uncopyrightable work falls into the public domain, which means anyone can use it, copy it, or build on it freely. The employer can still call it a trade secret and protect it through confidentiality obligations in employment contracts, but that protection is contractual rather than property-based. A trade secret loses protection the moment it is disclosed. So the employer's claim over purely AI-generated code is essentially: "you cannot share this" rather than "we own this." Those are meaningfully different legal positions, and most companies have not thought through which one they actually have.
So employees are not allowed to distribute the code, but if it leaks, then it is public and the company cannot do anything about it. Correct? That's what happened to Anthropic I think?
Yes, and if the same code ends up in someone else's hands, they can state "we didn't steal it, a GenAI generated it for us, the same as it did for you". Given the non-deterministic operation of current GenAI systems (a major difference from compilers), it would probably be hard to prove either position.
The more interesting question is "Who wants to own it?"...
The answer is probably "Nobody"!
Presumably, every company that has non-LGPL CC code in production wants to own it...
"Own" as in "be responsible for". Nobody is too keen to own a pile of semi-working trash, and extensive vide-coding can produce such piles easily.
Not sure why this is being down voted. Outsourcing work doesn't also outsource accountability.
Yea, that is how I meant it.
Anyone can produce low-quality code, with or without AI. Agents have gotten exceptionally good, however, and everyone who's able to should be including them in their workflow.
Agents are more prolific. As with any power tool, they increase both your ability to build and to wreak havoc, depending on how you handle them.
> "Own" as in "be responsible for". Nobody is too keen to own a pile of semi-working trash
And yet that was the state of software at every company I worked at before FAANG, and even a good amount there...
It depends on the scale. If you ask Claude to one-shot an app from a nebulous description, you get a prototype whose code you would understandably loathe to own. If you plan carefully and limit the scope, you get code that you understand, can approve of, and are okay owning further down the line.
I spent two and a half hours writing up a detailed outline for a small webapp. Claude popped it out in one shot, 100% working. I added features after, but the time you spend on a good outline saves hours later.
At what point is liability the only "job" left for humans?
I think it was tor.com that had a story last year where the newbie hired into the corporate HR dept ended up being the last human left after all the others were replaced.
Ah, here we go, courtesy of google-ml: '"Human Resources" by Adrian Tchaikovsky, published on Reactor[...] https://reactormag.com/human-resources-adrian-tchaikovsky/ '
The whole thing with GPL code seems like a mess and surely couldn't be set as actual precedent, right? It is totally infeasible for me to check every single GPL project on every code hosting platform to see if the code Claude etc. produced is too similar. If the set of training data used for the model were released to check against, that would be one thing, but you can't honestly expect someone to check every repo ever published to see whether a model (whose training data you are not told, and which could therefore reproduce any of it) might have reproduced code from it.
That's not at all like checking the dependency chain of a dependency, where you can just read the licence of anything you're choosing to use. Surely the precedent would have to be that a model trained on GPL code has itself been infected by GPL, and therefore must have all source/weights released too if the assumption here is that it can have embedded the code well enough to be able to reproduce it?
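To be clear, the mechanical comparison is the easy part. A minimal sketch of the classic MOSS-style fingerprinting idea (hypothetical file names; this assumes you already have the reference corpus, which is exactly what you don't have with an undisclosed training set):

    import hashlib

    def fingerprints(code, k=8):
        # Hash every k-token window; hashes shared between two files
        # flag verbatim or near-verbatim reuse.
        tokens = code.split()
        return {
            hashlib.sha1(" ".join(tokens[i:i + k]).encode()).hexdigest()
            for i in range(max(0, len(tokens) - k + 1))
        }

    generated = open("generated.py").read()      # hypothetical LLM output
    reference = open("gpl_reference.py").read()  # hypothetical GPL file
    overlap = fingerprints(generated) & fingerprints(reference)
    print(f"{len(overlap)} shared 8-token windows")

The infeasible part is the corpus: without knowing what the model was trained on, there is nothing to run this against.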
> Surely the precedent would have to be that a model trained on GPL code has itself been infected by GPL, and therefore must have all source/weights released
I don't see how this follows, unless we also agree that humans who have ever read any GPL code are themselves permanently tainted and therefore cannot produce anything that isn't influenced even slightly by said code.
Is it just because we think the robot does a better job at learning than we do? It's an impossible line to draw, I agree, but I don't agree that the answer is "well then everything must be considered tainted," I say the answer is "ignore a vestigial concern of a bygone era."
The robot does a better job at reproduction. I don't think there exists a definition of "learning" unambiguous enough to make the claim that it learns better than humans. Specifically, published models don't learn at all -- after the training phase, the model weights are fully static.
There's an easy solution... release your code as GPL :)
(but that doesn't protect you against GPL-incompatible copyleft licenses, I guess)
Duplicating BSD-licensed code without copyright attribution and mention of the original license is just as much a violation of the original copyright -- that applies regardless of additional copyleft requirements imposed by the GPL. A different but no less serious restriction applies to all the code examples on MSDN: the license disallows using the samples in production code.
LLMs are effectively copyright laundering machines, and barring any indemnification clauses in the ToS (of course there are none), full liability lies with the user.
> but you can't honestly expect someone to check every repo available from all time to see if a model [...] might've reproduced code from it.
Well, if you care about not violating any licenses, you could buy services from an LLM provider that was only trained on code in the Public Domain (or code that the LLM provider licensed for that purpose), and/or buy some kind of legal guarantee from the LLM provider that the code produced is "clean".
Of course, that'd be much more expensive than current offerings, but it would reflect the real cost of software development, not just YOLOing it, from a legal perspective.
When I wrote a book, part of the contract with my publisher was that I had to attest that I actually wrote the book myself, that quotes were properly attributed etc. If you buy code-writing services, why shouldn't it contain similar clauses?
> It is totally infeasible for me to check every single GPL project on every code hosting platform to see if the code Claude etc produced is too similar.
I would say that choosing a tool that makes it infeasible doesn't actually excuse you from doing it.
Claude doesn't write code? The LLM writes code. Claude loops the LLM into writing consistent code. Humans loop Claude into consistently looping the LLM.
Who owns the code? Who owns a potato? If the code is the product of the LLM and that costs tokens, the owner of the code is the one who paid for the tokens. Money, time or attention: someone pays for the tokens, owns the code.
>Who owns a potato?
I don't get what this analogy is trying to tell me but I know nothing about potato law. Is this about the Belgian potato surplus?
I pay to listen to songs. Do I own them?
Similar to most entertainment: you have the right to consume, but not the right to adapt into your works and distribute them.
Even consumption is usually limited to private usage: in my country, a consumer subscription is not enough to broadcast in a cafe or even a waiting room.
I'm no lawyer, but I feel that Meta, my employer, wouldn't be letting us go hog-wild with Claude Code if they weren't completely confident that they fully owned the outputs, whether we change them or not.
Meta's confidence almost certainly rests on the employment contracts and IP assignment clauses, not on a legal theory that AI output is inherently copyrightable. The enterprise agreement with Anthropic assigns outputs to the licensee. The employment contract assigns work product to Meta. Those two documents together give Meta a defensible ownership position regardless of the authorship question. The interesting gap is for developers using personal accounts or consumer plans on side projects, where neither of those documents exists.
I don't understand how a company can have copyright on code that is inherently uncopyrightable (in the unlikely event SCOTUS rules that way).
Worst case, Meta will sue the programmer who produced the infringing code.
I mean if the code is not copyrightable, that does not mean anything; it's just public domain code, except that Meta will just use good old security by obscurity to protect it. If somehow a Meta programmer vibe-codes, say, VVVVVV, and Terry Cavanagh recognizes it on his Facebook feed and sues Meta, and wins, all that will happen is that Meta will take down the copy of VVVVVV, fire and sue the engineer that vibe-coded it, and call it a day.
There’s so much FOMO right now around AI that no one is thinking clearly. I wouldn’t be so confident in your company.
To evaluate the legal risks of using AI generated code, let’s consider how many lawsuits there have been over these concerns.
Inadvertent copyleft license violations: probably 0 lawsuits
Competitor copied your software, you could not defend your rights in court because it was made with AI: probably also 0
Users of agentic AI for software development: >10 million
The thinking here seems pretty clear to me.
This is a terrible take. Complex litigation takes longer to play out than the time span that agents have existed.
I would bet you any amount of money that Meta's ownership over CC-generated code is never challenged or threatened.
Ok since you’re that confident—ten thousand to one odds. I’ll put up $100.
Seems to gloss over other kinds of contamination beyond GPL code: code from pirated textbooks, the problem of the entire language model being trained on copyrighted data, and the possibility of the training data containing various copyrighted code.
> Code from pirated textbooks
Anthropic "solved" this by intermingling the texts extracted from pirated books (illegal) with texts extracted from the physical books they bought and destroyed (legal), so no one can clearly say if the copyrighted material it spits out came from a legal source or not. Everyone rejoiced.
The intermingling argument is actually central to the Bartz settlement structure. The settlement required destruction of the pirated dataset specifically because commingled training data creates an unresolvable provenance problem. For deployers building on Claude, EDPB Opinion 28/2024 requires a documented assessment of the foundation model's training data legal basis before deployment. "We cannot tell which outputs came from which source" is not a satisfactory answer to a regulator running that assessment. I wrote about it before here: https://legallayer.substack.com/p/i-read-every-edpb-document...
> books they bought and destroyed (legal)
They're only legal if training is fair use - and even then I don't think it's immediately clear what the legal status of verbatim regurgitation of copyrighted code, or code protected by patents, would be.
AFAIK I (as a human developer) can't just go and copy code out of a textbook, then claim copyright and charge for a license to it?
> They're only legal if training is fair use
The judge seems to have said it's because they "transformed" the books (destroying them after digitizing) in the process, and that made it legal.
> Ultimately, Judge William Alsup ruled that this destructive scanning operation qualified as fair use—but only because Anthropic had legally purchased the books first, destroyed each print copy after scanning, and kept the digital files internally rather than distributing them. The judge compared the process to “conserv[ing] space” through format conversion and found it transformative. - https://arstechnica.com/ai/2025/06/anthropic-destroyed-milli...
Interesting - so local models, like Google Gemini, are then likely pirated by this interpretation - because the model is distributed? Ditto open-weight models?
I've seen copyright notices that explicitly forbid use for AI training. Would this "transformation" argument still hold in such cases?
For example:
No Generative AI Training Use
For avoidance of doubt, Author reserves the rights, and grants no rights to, reproduce and/or otherwise use the Work in any manner for purposes of training artificial intelligence or machine learning technologies to generate text, text to speech, voice, or audio including without limitation, technologies that are capable of generating works in the same style or genre as the Work, unless individual or entity obtains Author’s specific and express permission to do so. Nor does any individual or entity have the right to sublicense others to reproduce and/or otherwise use the Work in any manner for the purposes of training artificial intelligence or machine learning technologies to generate text, text to speech, voice, or audio without Author’s specific and express permission.
Only if you also manage to purchase and destroy the source material, I suppose? In Anthropic's case it wouldn't have worked if they'd stolen/rented the books and then destroyed them, but in the judge's eyes it was legal because legally purchased -> destroyed.
Nobody disputes that I own the copyright in a sound recording I made just by pushing the red button on my recorder. So it is a mystery to me that copyright to any sort of human conditioned machine generation is in dispute.
The sound recording analogy breaks down at the point where the recorder makes no creative decisions. Pressing record captures what is already there. Prompting Claude generates something that did not exist, through decisions the model makes about structure, naming, pattern, and implementation. The closer analogy is hiring a session musician and telling them the key and tempo. You own the recording under work-for-hire if they signed the right contract, but the creative expression in the performance is theirs unless explicitly assigned. The button you push to start the model is not the same button as the one on the recorder.
Fourier theory says that any sound, however complex, can be synthesized by summing sines and cosines. That's what an LLM does, if you twist the metaphor enough. It synthesizes complex outputs from simpler basis functions that are, or should be, uncopyrightable.
The fact that it inferred those basis functions from studying copyrighted works doesn't seem relevant. Nor does the fact that the "Fourier sums" sometimes coincide with larger fragments of works that are copyrighted. How weird would it be if that didn't happen?
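(For reference, the decomposition that analogy leans on is just the standard Fourier series, nothing LLM-specific: any reasonable periodic signal f(t) can be written as

  f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos(n\omega t) + b_n \sin(n\omega t) \right)

where the sines and cosines are the generic, uncopyrightable basis, and only the coefficients a_n, b_n encode the particular signal.)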
Of course it's relevant. How copyright infringement happens doesn't actually matter, all that matters is that the infringement happened.
If I painstakingly recreate A New Hope frame by frame, pixel by pixel, that's infringement. Even if I technically used 0 content from the original.
Nobody is doing that, though. You might get a watermarked screenshot or stock photo now and then, or a couple of mostly-verbatim paragraphs from Harry Potter.
In any case, if the copyright mafia insists on butting heads with AI, they'll find that the fight doesn't quite play out the way it has in the past.
> Prompting Claude generates something that did not exist, through decisions the model makes about structure, naming, pattern, and implementation.
LLMs don't make decisions. Their output is completely determined by an algorithm using the human prompt, fixed weights, and a random seed. No different than the many effects humans use in image or audio editors. Nobody ever questioned whether art made using only those effects on a blank canvas was subject to copyright.
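To make the determinism point concrete, here's a toy sketch in Python. The scoring function is a made-up stand-in, not any real model's API; the point is only that with fixed weights, a fixed prompt, and a fixed seed, the sampler produces the same tokens every time:

  import random

  def next_token_dist(context, weights):
      # Made-up stand-in for a real model's forward pass: any fixed
      # function of (context, weights) illustrates the point.
      h = hash((tuple(context), weights)) & 0xFFFF
      return [((h >> i) & 0xF) + 1 for i in range(0, 16, 4)]

  def generate(prompt, weights, seed, n=8):
      # Fixed weights + fixed prompt + fixed seed => identical output.
      # The model's "decisions" are just seeded sampling over a
      # deterministic distribution.
      rng = random.Random(seed)
      out = list(prompt)
      for _ in range(n):
          dist = next_token_dist(out, weights)
          out.append(rng.choices(range(len(dist)), weights=dist)[0])
      return out

  assert generate([1, 2, 3], weights=42, seed=0) == generate([1, 2, 3], weights=42, seed=0)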
"if Claude was trained on the LGPL-licensed codebase and its output reflects patterns learned from that code, can the output be treated as license-free? The emerging legal consensus is probably not, and assuming it can creates significant liability for anyone shipping that code commercially."
Is there any citation for this "legal consensus"? I was not aware there were any evidence-backed stances on this topic as of yet.
This sounds like a problem that's pretty easy to get around.
CC does not need LGPL code. There's more than enough BSD and Apache code to go around.
And they can generate synthetic data that is better than LGPL for their training.
It's also a problem that does not seem feasible to meaningfully enforce.
It's easy to generate CC code and lie and say you didn't. It would be hard to prove that you did, especially if you took any precautions to make it even slightly difficult to prove.
Unlike GPL, BSD and Apache licenses do not claim to also cover your non-AI-generated code that only invokes the AI-generated code.
However, even if the BSD/Apache/MIT licensed code can be incorporated freely in your application, you still have no right to remove the copyright notices from it and/or to claim that you own the copyright for it.
Therefore, unless the AI model has been trained only on non-copyrighted public-domain code, incorporating the generated code in your application means that you have removed the copyright notices from it, which is not allowed by the original licenses.
There is absolutely no doubt that using an AI coding assistant works around the copyright laws, but it is still equivalent to copying and pasting fragments from copyrighted works into your source code.
I consider that copyright should not be applicable to program sources, at least not in its current form, so reusing parts from other programs should be fair use, but only if human programmers would be allowed to do the same.
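Concretely, "keeping the notice" just means the original header travels with the derived code; something like this at the top of the file where it lands (the project name and wording here are made up for illustration):

  # Portions derived from the hypothetical "libexample" project,
  # distributed under the 3-clause BSD license. The original notice
  # must be retained:
  #
  #   Copyright (c) 2012, libexample contributors
  #   Redistribution and use in source and binary forms, with or
  #   without modification, are permitted provided that the following
  #   conditions are met: ... (full license text retained here)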
> However, even if the BSD/Apache/MIT licensed code can be incorporated freely in your application, you still have no right to remove the copyright notices from it and/or to claim that you own the copyright for it.
I can't speak for all licenses, but I'm familiar with at least one BSD license. That's almost the entire point of it...
You cannot take their literal code and call it your own. You can derive code from it and call it your own. That's what LLMs primarily do.
The chardet dispute is the closest thing to an active test case on this specific question, and you are right that it has not resolved into settled law. "Emerging legal consensus" was imprecise. The more accurate framing is: the legal community's working assumption, based on how copyright doctrine treats derivative works, is that training-data provenance travels with the output. That assumption has not been tested definitively in court yet.
thanks for this; it's definitely a fair point. I updated the piece to reflect this
With sufficient obfuscation (which models seem to provide intrinsically), how would anyone know to sue? On top of that, only the most major sorts of litigation have the legal force to pierce even the flimsiest of obfuscation... this is likely all moot.
If some GPL-licensed group were to sue some commercial software project that they do not have the source code for, what would even give it away? But they throw $1 million at a lawyer who can at least get it to the discovery phase somehow, and the source code is provided. It looks to be shit, but maybe an expert witness would come along and say "that looks inspired by the open source project". Where does it go from there? The model is a black box, but maybe you've got a superhero lawyer who manages to rope in Anthropic or OpenAI, and you can see how it produced the code given those prompts. What now? Are there any expert witnesses who both could say and would say that it was "bulk copying-pasting code". And if it were, what jury is going to go for that theory of the crime? Copying-and-pasting, but the code doesn't match, except in short little strings that any code might match. This isn't a slamdunk, and it's not going to proceed very far unless it's another Google-vs-Oracle shitfest.
Ownership is one question. IMO, a more interesting question is who is responsible when the code does some real-life damage.
No one. The usual.
Why should it be any different than it ever was? If a release manager checked it but didn’t catch the vulnerability, they have some culpability. If the developer shipped the code without checking it, they have some culpability too. Ultimately, if they both work under an organization that they report to, they’re responsible to that organization, which is, in turn, accountable to its customers (and investors perhaps.)
LLMs really change nothing about this.
> What to preserve: Commit messages that describe what you changed and why, not just what the AI generated. “Restructured Claude’s module architecture, rejected initial state management approach, rewrote error handling from scratch” is evidence. “Add rate limiting module” is not.
> The second commit message versus the first is the difference between a defensible authorship claim and a clean “Claude wrote this” record.
That makes no sense to me, as the commit message is probably LLM-generated as well (and even easier to generate, as it doesn't have to compile or pass automated tests).
Three things matter when it comes to eating my breakfast sandwich:
1/ Was the pork in my sausage reared on a farm that meets agricultural standards?
2/ Was the food handled safely by the kitchen that cooked my food?
3/ Does the owner of the diner pay kitchen wages in accordance with labor law?
By contrast, I have no idea what went into the models I use, what system prompts have prejudiced it, and whose IP has been exploited in pursuit of my answer.
That's being charitable, really. In practice, the open secret of the AI industry is that the vast majority of training data is, for want of a better word (even if it is likely the most precise description), stolen data.
Probably, yes, but the burden of proof is with us not them.
I'm already glad some companies have the guts to open their models because proving it for open models is probably a lot easier than for a model behind a service.
That's a matter of changing a law, it's all up to the people and their representatives. We talk as if everything is set on stone but if there really is a will, there is a way.
The proof is the $stupid-billion infrastructure built and kept up to host mousetraps armed with free cheese made of virtue signalling about doing the right thing and sharing the code with the world for free.
The media industry loves to quote ridiculous numbers on lost revenue due to piracy etc. Maybe rough ballpark numbers will get them to do something about this theft.
Can someone put a rough estimate on the potential revenue loss (direct and incidental) from training AI, with an industry-wise breakdown?
It’s wrong to stop progress. I just want to know what data went into my model and have access to the same data. The same way we have national libraries of books but with the caveat that I don’t really know how one is supposed to browse petabytes of OpenAI .zips like I browse old books.
If the data is proprietary (eg Meta’s stash of FB comments) then I am satisfied to be told it’s private and I can’t see it. If, however, the works were public then give me a URL if it’s live or a cached copy if it isn’t.
What's an example of data that might have been stolen?
This is a big question that makes my employer nervous about using LLM-generated code, along with the even-more-unresolved question "what happens if the LLM outputs an algorithm that is protected by patent?" (particularly worrying because we know the base training included patent descriptions.) Questionable copyright can often be worked around (particularly since we don't distribute source) but infringing on a patent can destroy a company.
The elephant in the room, of course, is what constitutes “meaningful human authorship.” However, I cannot shake off the feeling that all user interactions with these AI models are being logged. Perhaps this may turn out to be the bigger concern in a potential legal battle than code authorship.
The meaningful human authorship question is the elephant, agreed, and the regulators have deliberately refused to quantify it for exactly the reason you describe: any bright-line number becomes a target to game rather than a standard to meet.
The logging point is sharper than it might appear. In a copyright dispute over AI-assisted code, interaction logs could cut both ways. A plaintiff trying to establish human authorship would want the logs to show substantial architectural redirection, multiple rejections of Claude output, and documented reasoning for structural decisions. A defendant challenging that authorship claim would subpoena the same logs to show verbatim acceptance of output without modification.
The practical implication, I guess, is that developers who want to preserve a copyright claim over AI-assisted code should treat their prompt history as a legal document from the start. It seems that all over the world, the logs are the evidence. Whether they help or hurt depends entirely on what they show.
The bit about treating one’s prompt history as a legal document has really struck a nerve with me. I’ve been keeping a separate git history solely for my prompts. Initially, the goals were simple: reuse prompts, turn some into skills, etc. But in light of the insights from the article and the discussions here, I need to treat this practice as serious business.
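For anyone who wants to do the same, a minimal sketch of the idea in Python; prompt-log/ is a hypothetical, already-initialized git repo, and all names are illustrative:

  import datetime
  import pathlib
  import subprocess

  LOG_REPO = pathlib.Path("prompt-log")  # hypothetical: an already-initialized git repo

  def record_prompt(prompt: str) -> None:
      # Append a timestamped prompt and commit it, so the human side
      # of the work leaves a trail with commit timestamps attached.
      stamp = datetime.datetime.now().isoformat(timespec="seconds")
      logfile = LOG_REPO / "prompts.md"
      with logfile.open("a", encoding="utf-8") as f:
          f.write(f"## {stamp}\n\n{prompt}\n\n")
      subprocess.run(["git", "-C", str(LOG_REPO), "add", "prompts.md"], check=True)
      subprocess.run(
          ["git", "-C", str(LOG_REPO), "commit", "-m", f"prompt: {stamp}"],
          check=True,
      )

Whether such self-maintained logs carry evidentiary weight is exactly the open question upthread, but it costs almost nothing to keep them.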
I wrote an R library doing some simple regressions using the GPU, with Claude. I asked it to provide the same API as lm, glm and some other base R functions. It copied their code wholesale without mentioning it to me. So, now my library is GPL… which is not a big deal in this context, but it was quite a shock.
Note to anyone reading this: the author is actively reading the comments and updating the piece based off reported issues. As a result, no meaningful discussion will take place here.
I think it's pretty clear cut, whoever is paying for your agentic coding tool subscription is part of the litmus test.
If I use my own computer, pay for my own subscription, and build my open source projects, then the code belongs to me.
If I use my company's computer, they pay for my subscription and we work on the company's projects then the code belongs to the company.
In any step of the way, if some copyleft or other exotic open source license is violated, who pays for discovery? Is it someone in Russia who created a popular OSS library and is now owed? How will it be enforced?
If you want to go much deeper, https://www.copyright.gov/ai/ is particularly good at least on the side of comprehensiveness.
The article is incredibly fear-mongering.
Twice in my career the owners of a company have wanted to sue competitors for stealing their "product" after poaching our staff.
Each time, the lawyers came in and basically told us that suing them for copyright is suicide, will inevitably be nearly impossible to prove, and money would be better spent in many other areas.
In fact, we ended up suing them (and they settled) for stealing our copyrighted clinical content, which they copied so blatantly they left our own typos and customer support phone number in it.
Go ahead, try to sue over your copyrighted code; 10 years and 100M later you will end up like Google v Oracle. What if the code is even 5% different? What about elements dictated by external constraints (hardware, industry standards, common programming practices)? These aren't copyrightable.
Then you have merger doctrine, how many ways can we really represent the same basic functions?
Same goes with the copyleft argument, "code resembling copyleft" is incredibly vague, it would need to be verbatim the code, not resembling. Then you have the history of copyleft, there have been many abuses of copyleft and only ~10 notable lawsuits. Now because AI wrote it (which makes it _even harder_ to enforce), we will see a sudden outburst of copyleft cases? I doubt it.
Ultimately anyone can sue you for any reason, nothing is stopping anyone right now from suing you claiming AI stole their copyleft code.
The documentation advice is practical, but commit messages and prompt logs are self-reported. "Meaningful human authorship" needs a verifiable evidentiary chain, not attestations.
Normally this solved with an employment contract: "Anything you write, the copyright is transferred to your employer"
> Code that Claude Code or Cursor generated and you accepted without meaningful modification may not be copyrightable by anyone.
Except if it happens to regurgitate a significant excerpt of some existing work, then the authors of that can assert their copyright; i.e. claim that it infringes.
Lawyers I have spoken to have stated strongly that they believe collective works doctrine will provide strong protections for most mature and sizable software. I see no mention of these considerations here.
Tangential but I find this an interesting parallel from a few years ago:
https://www.vice.com/en/article/musicians-algorithmically-ge...
Did Claude Code not start out as human input? Would it not be safe to say that a reasonable amount of it is still human input? But also, just because it's mysteriously "not theirs" doesn't mean they magically have to give you the code.
> Here is the legal baseline, in plain terms:
This particular AI-ism really encapsulates what annoys me about some AI-isms. I don't mind the delves and the em-dashes that just give away the AI source of what otherwise might be good text. But these structural pieces just feel fundamentally not for the reader. Part of it is blatant pick-me language for the human feedback ("hey look you wanted plain language I did that") and part of it feels like it's just helping the future token stream (thinking-like tokens polluting the actual text).
The not-this-but-that, the sycophancy, the symbolizing-vague-significance, they all have this flavor of serving a process that's no longer there as I now need to read it. It gives a similar sickening feeling to the one I get seeing something designed by committee.
This is like asking:
"Who owns the text microsoft word helped you write?"
Claude code is a software tool not a legal entity.
Not if claude does the writing. MS doesn't write things for you, and if it did, you would not be entitled to a copyright in whatever it wrote for you.
Claude is not a legal entity, it is a software tool that outputs text based on statistics. There is a user that used a tool to create text and that user is the legal entity responsible for the text in any legal way that matters.
Anything else would be completely ridiculous given current laws in most countries.
It would be as ridiculous as blaming the car in a car accident where you drove over someone.
> Claude is not a legal entity
And?
>It would be as ridiculous as blaming the car in a car accident where you drove over someone.
No more ridiculous than you posting something you know nothing about.
Just because you don't get the copyright doesn't mean claude does. The fact that claude is not a legal entity has no bearing on whether or not you are entitled to a copyright for a work you did not create.
If neither the user or the tool created and is responsible for the text, who is in your mind?
If claude made it, it is not a copyrightable work. There is no copyright for anyone to own.
Okay. If it made it, it made it. That is true in a deductive way. If p, then p.
And?
Those "statistics" that the output is based on are often under licenses that forbid making proprietary software with them for example. It is not the same as using Word.
The statistics is generally not. But the data used to learn the statistics may have been under license.
Learning from licensed material is generally accepted for humans: you may learn from something and then create something else, and the new thing is not considered legally problematic, with the exception of patents, I guess.
Whether the same thing holds true for electronic systems is where people disagree, if you look at the problem space in its essence. I land on the side that it is the same thing (humans and electronic systems learning); some seem to think it is a different thing.
Maybe the useful test is not “who wrote this line?” but “can you show how it went from requirement/prompt/context to diff to human review/tests?” If you can’t, ownership is only one issue. You also can’t tell what was accepted as engineering work versus just copied output.
This is actually closer to how the Copyright Office thinks about it than the article makes clear. The registration guidance that emerged from the Thaler proceedings specifically asks applicants to describe the human creative contributions and how the AI was used. A documented workflow showing requirement, architectural decision, rejection of AI output, human restructuring, and review creates a paper trail that maps directly onto what the Office looks for. The "can you show how it got here" test you are describing is the practical version of the legal standard.
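One concrete shape for that paper trail is a commit-message convention; the trailer names below are invented for illustration, not any standard:

  Add rate limiting to login endpoint

  Restructured Claude's initial module layout, rejected its
  global-lock approach, rewrote error handling by hand.

  AI-Assisted: yes (Claude Code)
  Prompt-Log: prompt-log/prompts.md, entry 2025-06-14
  Human-Review: architecture and error paths rewritten before merge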
Good overview of the issues. I'm sure there are a few nits to pick with that.
But something that is overlooked is that the world is bigger than the US and it's an absolute zoo out there in terms of copyright laws in different countries. Anything you think you might understand about this topic goes out of the window if you have international customers or provide software services outside the US. Or are not actually based there to begin with. And there are treaties between countries to consider as well.
Courts tend to try to be consistent with previous rulings, interpretations, etc. When it comes to copyright, there are a few centuries of such rulings. The commonly held opinions among developers that aren't lawyers are that AI is somehow different. And of course since the law hasn't actually changed, the simple legal question then becomes "How?". And the answer to that seems to involve a lot of different notions.
For example, "AIs are not people, and therefore any content produced by them isn't covered by copyright to begin with" is one of the notions brought up in the article. A lawyer might have some legal nits to pick with that one but it seems to broadly be the common interpretation. So AI's don't violate copyright by doing what they do. In the same way you can't charge a Xerox machine with copyright infringements. Or Xerox. But you could go after a person using one.
And another notion is that any content distributed by a human can be infringing on somebody else's copyright and that party can try to argue their case in a court and ask for compensation. Note that that sentence doesn't involve the word AI in any way. How the infringing party creates/copies the content is actually irrelevant. Either it infringes or it doesn't. You could be using AI, a Tibetan Monk copying things by hand, trained monkeys hitting the keyboard randomly, a photo copier, or whatever. It does not really matter from a legal point of view. All that matters is that you somehow obtained a copy of an apparently copyrighted work. AI is just yet another way to create copies and not in any way special here.
There are of course lots of legal fine points to make to how models are trained, how training data is handled, etc. But if you break each of those down it boils down to "this large blob of random numbers doesn't really resemble the shape or form of some copyrighted thing" and "Anthropic used dodgy means to get their hands on copies of copyrighted work". I actually received a letter inviting me to claim some money back from them recently, like many other copyright holders.
This seems to be grounded in US law. Does anyone know if the same rules would apply in eg EU law?
Most of this is based on Copyright legal framework, which is surprisingly homogeneous around the world. The discussions about ownership of AI-generated material are exactly the same in EU.
Copyright law kind of transcends national borders via certain international treaties like the Berne Convention. Which is why the US copyright holders could enforce their "wouldn't steal a car" threats in Europe.
It's the same as photography. No photographer built the multibillion-dollar supply chain for the optics train in a camera, nor did they build the cityscape they are enjoying as a background; they simply set the stage and push a button.
On a related note, another question: who owns the paper that Claude (or OpenAI) wrote? Should such paper submissions in conferences call out the model(s) used to write the paper itself?
The "if you generated the code at work using company tools, it's owned by your employer" affirmation in the article makes no sense to me?
If computer generated code is not copyrightable, ownership cannot be reassigned either.
It is copyrightable. A *human* can copyright code they wrote.
I meant in the sense that the "tool" is an LLM and the "work" was vibe coded.
If vibe coded work is not copyrightable, it cannot be reassigned to the employer and become copyright protected.
correct
This is the sharpest point in the thread. You are right if the output has no copyright to begin with, there is nothing to assign. The employer's contractual claim over purely AI-generated code is not a copyright claim, it is a trade secret and confidentiality claim. Those are weaker protections: they require the information to remain secret, they do not survive disclosure, and they cannot be enforced against independent creation of the same code. Most IP assignment clauses in employment contracts were not drafted with this scenario in mind and may be claiming rights that do not legally exist.
How is it for human developers now if the company tool is a cloud tool and not running on company servers?
You don't, but nevertheless you bear the responsibility of making it public (whether in source or binary form). That is what Anthropic would like.
Your employer can claim your code if you use their tools to produce it. Nothing new here. This has nothing to do with AI tooling.
Well, I don't own anything I write while working at my company. Maybe my company and Claude can fight over who owns it.
First, answer who owns the model built with public data.
The model ownership question and the output ownership question run on separate legal tracks and the piece focuses on the second deliberately. On the first: the model weights are owned by Anthropic under work-for-hire from their engineers regardless of what the training data contained. Training data copyright infringement is a separate tort claim against Anthropic, not a basis for anyone else to claim ownership of the model. The Bartz settlement resolved the pirated books claim without disturbing Anthropic's ownership of the weights. Owning the training data does not give you ownership of the model trained on it, any more than owning the paint gives you ownership of the painting.
Missed opportunity for a tongue twister:
Who coded the code Claude Code code?
Who owns the code my keyboard wrote?
I'm still flabbergasted that people – and big, visible companies with big targets on their backs – choose to keep on using the output of LLMs without having an answer to these questions.
And I'm worried that once that has been sufficiently normalized, laws and interpretations of them will adapt to whatever best suits those users. Which will mean copyrightwashing of FOSS. My only hope then is that surely if free software can be copyright-washed by the big guys, then so can the little guy copyright-wash the big guys' blockbuster movies or whatever, which might lead to some sort of reckoning.
The idea that the provenance of a given tool's code inherently pollutes the material it's used with seems kind of illogical. Wouldn't it follow from this premise that any code written using open source IDEs and debugged with open source debuggers and other tooling would itself then be considered copyleft? Are works written with LibreOffice not copyrightable?
There's obviously a huge issue with the legitimacy and ownership of training data being fed to LLMs. That seems like an issue between the owners of that IP and the people training the models and selling them as services more than the people using the tool. Isn't this just another flavor of SCO trying to extort money out of companies using Linux?
I have a wood cutting machine and some wood. Who owns the timber?
Sadly, IP "ownership" and copyright law are vastly more complex than ownership of physical stuff.
Or were you planning to reproduce the (say) Ford Motor Company's trademarked symbol in wood? If so, you're right back in the stinkin' swamp.
What is the wood in your example?
This is like a machine you ask for timber and you get timber, but you didn't need to provide any wood.
The entire US economy rides on AI. No ruling throwing a wrench into the multi-trillion-dollar engine is ever going to be permitted to happen.
IMO this is the greatest argument against AI as technofascism. The general public seems to believe that AI will usher in technofascism by claiming corporate ownership of AI output: the independent entrepreneur will be unable to compete against the corporations' compute, every piece of data about you will be stolen and monetized by AI, and you will own nothing.
But AI might in fact do the exact opposite and reverse the privatization trend that the West has been going through for the last 400 years. All of our copyright laws rely on the idea that there is a human consciousness behind the copyright. The more AI has input, the less we can claim ownership. If AI returns everything to the commons, then it results in a much more egalitarian world.
Hilariously, many people, especially artists, see the return of the commons as an assault against them. They're so captured by copyright that they assume any infringement on their copyright is inherently fascist. It's ridiculous. Copyright is a corporation's number-one weapon when it comes to creating a moat and keeping the masses out.
The original intent of copyright, in fact, was an incentive to return an idea to the commons. Experts used to hide their discoveries in order to keep them for themselves. Copyright provided an opportunity to release this knowledge and still profit. There were even several cases where it was established that those who claimed copyright could retain copyright even if the idea had been previously discovered. This created a huge incentive: release the knowledge or risk having your process copyrighted by the opposition. But that system worked because copyright could only exist for so long (14 years, doubled if they filed again.)
Now copyright is a lifelong sentence at almost 100 years. The entire purpose of it has been undermined. Corporations own all your childhood and by the time you can profit off of it, it’s outdated.
A world where the mainstream is primarily a commons seems to me like an egalitarian world. I’d like to live in that world.
The original bargain you describe, limited term in exchange for public disclosure, is exactly what makes the current situation strange. If AI-generated output falls into the public domain immediately, that is actually closer to the original intent of copyright than 95-year terms. The legal question is whether that outcome happens by design or by accident, and what it means for the people building products on top of AI-generated codebases right now.
By design or by accident
It’ll happen by evolution. Just complex systems trending the way they trend.
It seems that the author unironically advises you to write your commit messages like this: "Restructured Claude's module architecture, rejected initial state management approach, rewrote error handling from scratch", to have a chance at a defense in a potential court hearing. I find it funny, if vindicating for my personal approach. If the expectation is to "restructure, reject, rewrite" what "AI" spits out, why use "AI" at all at this point???
Copyright has a lot to do with what we as a society want to protect and encourage. We want to protect an author that put the hours into creating a book, as opposed to the person creating a copy of that work. The person copying can claim they put in work too but the claim is not strong enough to override our preference to protect original authors.
Part of the problem with generated works is that it is lower effort like the person copying something. It’s not an activity that demands special protection like original authorship. I believe this is a large part of the reasoning.
AI is a monster to our current copyright system - monster in the philosophical sense, that is, an example that destroys the concept.
First, its creation is (claimed to be) extremely useful for society, but in order to be created it requires ignoring copyright for pretty much everything ever written. Something we kinda swept under the table.
Then, it introduces an extreme jump down in creation effort - so if the focus is protection of effortful creation, nothing with AI use qualifies. But of course, you'd want society to benefit from effortlessness in general, spending more effort than needed in a task is the opposite of efficiency.
Whoever pays for the tokens.
This is a non issue, since any complex thing needs a lot of human oversight, otherwise it's nothing more than a multitentacled monstrosity.
i do, all of it. sorry
What if no meaningful thought was put into the code (entirely vibe-coded slop), but it’s made for your employer? Shouldn’t the work be uncopyrightable?
I do. I used a tool to create it. I own the things I create.
Anything else is just bullshit equivocation.
LLMs are just tools we use. If I program an app in C++, do I not own the rights to the executable because my compiler wrote machine code for me?
There is no such thing as ownership of a pattern of information. It has been an illusion, and that illusion is now fading.
So as I understood it, GPL doesn't cover code written by agents?
That was a rather unhelpful TL;DR.
yo Mama
-Claude
Could you please stop posting generated comments to HN? It's not allowed here, and it looks like you've done it over 30 times already.
(Of course, there's no way to be certain of this, but it's what our software thinks, and the overall pattern is pretty convincing.)
See https://news.ycombinator.com/newsguidelines.html#generated and https://news.ycombinator.com/item?id=47340079
You are definitely right to flag it, apologize for that. I used an AI assistant for the replies, and I will make sure not to use one going forward.
Appreciated!
@dang, just wanted to say that the response to your statement does also seem to be AI-generated. Dead-internet theory is turning real day by day, oof.
Why do you use an AI assistant for the replies?
My guess is she wants to respond to all feedback and questions but doesn't have time to do it all by hand.
This too is a generated post
On that matter, wouldn't an AI flag for submissions help HN? I wouldn't flag a submission for LLM style, as that seems too harsh, but I don't want to read them -- if only because I don't like LLM prose.
There are so many submissions where most of the discussion is about whether the content has any human effort behind it, or whether the LLM played a purely assistive role like translating. It's really devaluing HN, IMO. Not sure how much an AI flag would help, or whether it would introduce new issues, given how difficult the problem is, though.
Curious: how do you exactly detect an AI-generated comment?
A case when security through obscurity is perfectly justified.
Ask ChatGPT deep research, citing court cases, and it shows dark-factory SWE code is not copyrightable under current precedents.
Even steering it with prompts isn't enough. The guy couldn't copyright the image he made with AI; code is no different.
Maybe prompts written by humans are copyrightable.
Can't wait for the billionaires to entrench in court that they can steal everything for these machines and claim it as their own, and maybe even reach for anything it helps produce. Fuck that.