
doc: add policy on LLM-generated contributions#62447

Open
bengl wants to merge 2 commits into nodejs:main from bengl:bengl/ai-policy

Conversation

@bengl
Member

@bengl bengl commented Mar 26, 2026

Add doc/contributing/ai-contributions.md banning LLM-generated content from commits. Scoped to committed content only, excluding discussion, vendored deps, and accessibility tools. Enforcement uses consensus seeking. Rationale covers reviewer burden and DCO compatibility.

References added in CONTRIBUTING.md and the collaborator guide.


A PR was opened with a more relaxed AI policy (#62105), so this one is offered as an alternative. Note, however, that some parts of that PR seem compatible with this one (e.g. the behavioural guidelines).

Some key points:

  • I tried to draw a lot from existing AI bans in other projects. There's no need to tread entirely new ground here.
    • This list is slightly out of date, so in addition to these, I also had a look at Wikipedia's ban.
  • In the recent discussions around AI usage in Node.js core, I found that some conversations (backchannel and otherwise) centered around whether an AI ban would be enforceable. I think consequences of policy violation can definitely be enforceable, but detection of policy violations is a lot more fuzzy, and that's okay. I tried to capture that here. A useful comparison is to local laws and policing. Not all law violations are universally detectable, but the ability to enforce them still ensures they're a deterrent.
  • I've made an attempt at making reasonable exceptions. Other reasonable exceptions could be added.

This goes without saying, but I'm going to say it anyway, since this subject is pretty heated: This is not an invitation for personal attacks on myself, the author of the other PR, anyone who wants to ban AI in Node.js core, or anyone who doesn't want to ban AI in Node.js core. Or anyone else for that matter. Please observe Node.js' Code of Conduct.

EDIT: Just to be absolutely clear, this PR was made on my own, and has nothing to do with any of my employers, past or present.

@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/tsc

@nodejs-github-bot nodejs-github-bot added the doc Issues and PRs related to the documentations. label Mar 26, 2026
Member Author

The language I used here is admittedly strong. Perhaps "have a higher probability of not meeting that bar" is more accurate.

Member

Strong and demonstrably false. I've used AI agents to assist in authoring PRs, and I can always fully explain the changes, respond to review, and take full responsibility for the PRs I open.

Member

I think we can weaken the language indeed; that being said, as a north star this statement makes a lot of sense. Regardless of how we write it, it is going to be subject to interpretation. Choosing a concise, strong form over lengthy, vague statements is clearer.

@jsumners
Contributor

I approve (even though GitHub won't let me).

Add doc/contributing/ai-contributions.md banning LLM-generated content
from commits. Scoped to committed content only, excluding discussion,
vendored deps, and accessibility tools. Enforcement uses consensus
seeking. Rationale covers reviewer burden and DCO compatibility.

References added in CONTRIBUTING.md and the collaborator guide.
@bengl bengl force-pushed the bengl/ai-policy branch from f9c42a3 to 39b580a Compare March 26, 2026 18:41
Member

@indutny indutny left a comment

LGTM

Comment on lines +13 to +14
LLM produced a draft that the contributor then edited. Contributors must be
the sole human authors of the changes they submit.
Contributor

@aduh95 aduh95 Mar 26, 2026

This seems to imply having co-authors in the PR would be forbidden, which is not the case – or at least, is not related to AI-contributions

Suggested change
LLM produced a draft that the contributor then edited. Contributors must be
the sole human authors of the changes they submit.
LLM produced a draft that the contributor then edited. Contributors must be
human after all.

Member Author

Good point! I'll reword this.

Member Author

Fixed in latest push.

Member

This would make our automated dependency updates via the github bot invalid and would require that all dependency updates be performed manually.

Member

I agree that language might be too exclusive. That being said, saying that dependabot PRs are subject to the AI contribution policy is a bit of a stretch. These are basically configuration-file updates and arguably not code changes.

Comment on lines +74 to +75
This policy can be revisited if the legal situation or the tools change.

Contributor

I would remove mention of legal here, because it's IMO obvious and does not bring much

Suggested change
This policy can be revisited if the legal situation or the tools change.

Member Author

Do you think the callout that revisiting it in the future as situations change is still worth it?

Contributor

My opinion is that it goes without saying; docs are always a "snapshot of the current state of things", and I don't think this document is any different in that respect. That being said, if you feel strongly that it's worth explicitly saying, feel free to treat this as a nit and resolve the thread.

Member Author

Your opinion here makes sense to me so I'll take the suggestion. πŸ‘

Member Author

Fixed in latest push.

Comment on lines +23 to +26
* Pull request descriptions, review comments, issue discussion, or other
communication that is not part of the committed tree. Those are covered by
general expectations around good-faith participation and the
[Code of Conduct][].
Contributor

Note that it somewhat conflicts with nodejs/admin@8b746bc – which would be worth revisiting if this PR lands as is

Member Author

Hmm, that's also somewhat conflicting with #62105 (and indeed, anything that seeks to solidify allowing LLM-generated PRs), so that moderation policy would have to change in either case.

That said, removing this exception would put it in line with that moderation policy, I think. I'm not sure on the best approach here yet.

Member

@Qard Qard left a comment

I worry such a policy leaves a lot of room for abuse if not defined very clearly and with clear and measurable criteria for the identification of such content. While detection tools may be unreliable, I think we do need some clear definition or test to make this not just a vague judgement call which itself cannot be explained clearly.

It seems to me that the real issue, which the content already hints at, is a contributor lacking enough understanding of the code being submitted to be able to explain or properly maintain it. It therefore follows, to me, that denying a change should require equal rigor: one should be able to clearly describe the criteria the contribution fails to meet.


Personally, I find LLMs can be helpful, and the scenarios in which they are most helpful are writing code or prose. I've used LLMs a bunch lately, and I'm confident in my ability not just to steer them, but also to have them build from my ideas rather than go off and build their own rough semblance of what I have described. As such, I feel I could generally quite effectively explain what I've had an LLM generate. I suspect many other contributors are in a similar position.

My worry is that such a policy would simply encourage dishonesty among those of us that know the things we build. I'm confident I could submit LLM-written code without raising suspicion. What if I'm a bit busy today and this bug needs fixing? Would it harm anyone if I just submit this one little PR and not tell anyone an LLM did it?

I don't want to be creating an environment which encourages people to tell little white lies about the origins of their code, fostering a community of subtle dishonesty. I would much prefer we feel welcome to be open and honest about the manner in which we've created our contributions and instead focus our attention on what actually are the problems emerging from the rise in LLM contributions and how can we address those things more directly.

Certainly some may have ethical qualms about LLM use; there is no shortage of reasons to be unhappy about it. But it exists whether we like it or not. It's like being unhappy that cars killed walking or biking places. Yeah, they pollute, they're expensive, and they reshape the world in ways we might not like, but they are part of the modern reality just as LLMs are. We could decide to be one of those niche communities where cars are banned, but that would mean isolating ourselves from most of the rest of the ecosystem and would thus likely significantly reduce contributions. Maybe that's fine, maybe it's not. It's a choice we would have to make, and I think it's quite a significant choice to make.

Personally, I see a bunch of problematic behaviours rising due to the emergence of LLMs, and banning LLMs does not actually address any of those problems; it's just lumping them in with one source in which those problems happen to be represented disproportionately.

In my opinion, we should consider policies which more directly target these problems. For example, limiting how many subsystems are "acceptable" for a feature PR to touch, or how many lines of code are allowed before it warrants an RFC doc or issue to discuss/explain the concept before submitting a huge and challenging-to-review PR. Any change which needs to exceed these limits would then have to ask for a TSC exception. Though my example there lacks definition, that direction seems to me more like one which would adequately address the actual problems emerging.


That was a bit of a wall of text, but all that having been said, I am not against having a desire to reduce LLM contributions; I'm just not convinced banning them outright is reasonable or would even work. I think this is less a policy concern and more a quality-standards concern, which we would generally address with linting rules, test coverage, and other mechanisms to validate contribution quality. What makes a fancy code generator so different that the same sorts of quality validation systems cannot apply to it?


Do not submit commits containing code, documentation, or other content
written in whole or in part by a large language model (LLM), AI
code-generation tool, or similar technology. This includes cases where an
Member

@Qard Qard Mar 26, 2026

"or similar technology" is fairly vague. For example, would this apply to our automation to rewrite REPLACEME markers? A human didn't write that replacement. To be clear, I know that's obviously not included, but only by my own intuition; we should make that clearer.

Member Author

James also pointed this out. I'll reword this so that non-AI automations are explicitly excluded.

Member

I see the distinction in reproducibility of changes. If the change was partly or fully automated, the source code for the automation should be mentioned in the PR.

Comment on lines +40 to +45
If a collaborator suspects a pull request contains LLM-generated content,
they should raise the concern in the pull request. The normal
[consensus seeking][] process applies: collaborators discuss, the author can
respond, and a good-faith determination is made. If consensus is that the
contribution violates this policy, it must not land. If consensus cannot be
reached, the matter can be escalated to the TSC.
Member

@Qard Qard Mar 26, 2026

This sounds a bit witch-hunt-y and also exploitable to block certain contributions someone might not like, or from a person another contributor might not like. Proving a negative is not really feasible so this could be used to get changes stuck in an uncertain limbo forever.

Member Author

These last two sentences are not particularly different from any existing policy involving consensus around a PR and escalating to TSC. I'll adjust this to refer to the existing procedure for dealing with such disputes, rather than re-hash it here.

submissions shift the verification burden onto those reviewers, with no
assurance the contributor has done that work themselves.

The copyright status of LLM output is unresolved. Training data for popular
Member

That's not quite accurate. Because the training of the LLM on that content is done by a different person than the one using the LLM to produce the output, it's legally clean-room design, for which there is already legal precedent from when Phoenix Technologies and American Megatrends copied the IBM BIOS.

It's possible the law could be revised to address LLMs specifically in a different way, but the current state is not an absence of definition.

Contributor

πŸ€”

I do not see the parallel you are suggesting.

Member

The details of how a thing works get captured as a description by one person; that description then gets relayed to another. They can now build the thing described, and it is legally not considered derivative. A bit of a wacky edge case of the legal system, but that's how it was ruled. 🀷🏻

Member Author

None of us in this thread are lawyers (AFAIK), and this is the first time I'm hearing of this. If the copyright law is truly resolved at this point, then this can be dropped. Can you point to references indicating that this applies here?

Member

The "clean room" aspect of LLM-generated code is at the very least debatable. Since an LLM can reproduce, word for word, the content it was trained on, we could argue very different things about it.


## [Policy on LLM-generated contributions](./doc/contributing/ai-contributions.md)

Do not submit commits containing content written in whole or in part by a
Member

@joyeecheung joyeecheung Mar 26, 2026

I think this effectively means we cannot land more V8 updates, for example, #61898 contains https://chromium-review.googlesource.com/c/v8/v8/+/7269587 and a bunch of other code written by Gemini (you can search the V8 commit log to find a lot more).

Contributor

Note that deps are an explicit exception:

* Vendored dependencies and other vendored content (e.g. code under `deps/`
and vendored code under `tools/`). That content is maintained upstream
under its own governance.

Member

@joyeecheung joyeecheung Mar 26, 2026

I see, perhaps that needs to be clearer and mention that there is a scope in the TL;DR, considering probably ~90% of the code in the code base is vendored in.

Member Author

I'll see what I can do to clarify this.

@jsumners
Contributor

jsumners commented Mar 26, 2026

What makes a fancy code generator so different that the same sorts of quality validation systems can not apply to it?

The fancy code generator has a verbatim "memory" of the code used to "train" it. It uses that bank of code to create the code you ask of it. It isn't generating code from a "oh, I did something similar once. I can do this again" position. It's generating code from everything that was stolen and shoved inside of it.

Member

@jasnell jasnell left a comment

I'm generally and firmly -1 on any form of ban on AI assisted contributions


Do not submit commits containing content written in whole or in part by a
large language model (LLM) or AI code-generation tool.

Member

The entire clause here is unenforceable. As models improve, how are you going to be able to tell the difference? All this does is provide incentive for contributors to lie.

Member

It doesn't have to be enforceable, and this policy explicitly says so.

@Qard
Member

Qard commented Mar 26, 2026

The fancy code generator has a verbatim "memory" of the code used to "train" it. It uses that bank of code to create the code you ask of it. It isn't generating code from a "oh, I did something similar once. I can do this again" position.

That's wildly inaccurate. There is no such "verbatim" copy. It all becomes math mixed into neurons. Sometimes the signal of a particular piece of training is strong enough that the overall piece can be extracted verbatim, but very rarely and mostly only the very first bits of knowledge ingested as they serve as the first and strongest backpropagation applications.

A very, very small quantity of the training data of an LLM can be extracted verbatim, but only with substantial effort to specifically target that exact piece of data, and only the vanishingly few pieces that actually remain that intact.

It's generating code from everything that was stolen and shoved inside of it.

Yes, I would personally be in agreement that there's definitely some stolen data in there. Though at this point most of the data LLMs are trained on is actually created from scratch specifically for training material. There are a bunch of LLM training data orgs that hire devs to just build tons of apps to create training data they can sell. Certainly the stolen data was the bootstrap for a lot of these LLMs though. I personally disapprove of the approach that was taken to bootstrap these systems, but my dissatisfaction is not going to stop them from existing and being widely adopted. 🀷


* Pull request descriptions, review comments, issue discussion, or other
communication that is not part of the committed tree. Those are covered by
general expectations around good-faith participation and the
Member

This is contradictory. If an AI agent reviews a PR and makes an alternative code suggestion, and the PR author accepts that suggestion, then this policy is violated.

Member Author

I don't think that makes it contradictory? If an agent makes the suggestion, and the suggestion is accepted, then that becomes part of the committed tree, doesn't it?

Member

so the policy would be: you can use an agent to review code but you can't accept any of its suggestions?

Member Author

If they are strictly "please include this code" suggestions, via GitHub's mechanism for this or otherwise, then yeah, that would be the case.

I admit this is quite weird :/

The intent here was to provide some amount of compromise. It backfired into this corner here, but I think it would be consistent with the notion that committed code in the actual source tree is what's covered by the policy, so weird as it is, I'd keep it this way (unless folks don't need the compromise!).

and vendored code under `tools/`). That content is maintained upstream
under its own governance.
* AI-powered accessibility tools like screen readers or text-to-speech
software, provided they do not influence the content of the contribution.
Member

It makes no sense to add this here as screen readers and text-to-speech software do not write code contributions.

Member

@Qard Qard Mar 26, 2026

Well, speech-to-text can. I know someone that does that, actually.

Member

Which raises another interesting case here: use of AI as a disability-assistive device. This would make this policy completely unenforceable.

Member Author

I'm not sure I understand how this would impact the enforceability of the policy. If a contributor uses AI tools for accessibility reasons, and discloses this in their PR (or in some blanket statement clearly visible elsewhere), then they wouldn't be in violation. If they fail to disclose ahead of time and the PR is contested under this policy, they can disclose at that point, and there's still no problem.


I just realized that you may mean that providing for this exception provides an additional avenue for abuse of the policy, since folks can simply state that they're using AI tools for accessibility reasons. I think this is in the same category of (un-)enforceability as "they can just lie", so, still, I don't think this adds or subtracts any amount of enforceability.

Member

@jasnell jasnell Mar 27, 2026

No, it's not in the same category. Your assumption there is that someone saying they are using it for accessibility is equivalent to someone lying about using it, and that's not a valid equivalency.

to draw that out a bit more..

The "they can just lie" category is ... they'll use it and won't tell you.

The "accessibility" category is ... they'll use it and there's nothing you can do about it.

But this is beside the point. The larger issue is that the policy is unenforceable because it's simply not possible to reasonably and unambiguously tell the difference.

Member Author

Your assumption there is that someone saying they are using it for accessibility is equivalent to someone lying about using it, and that's not a valid equivalency.

Not quite. My assumption is that someone lying that they are using it for accessibility is equivalent to someone lying about not using it. You did clarify the categories and their differences though. Thanks for that πŸ‘.

But this is beside the point.

Agreed.

[consensus seeking][] process applies: collaborators discuss, the author can
respond, and a good-faith determination is made. If consensus is that the
contribution violates this policy, it must not land. If consensus cannot be
reached, the matter can be escalated to the TSC.
Member

And on what basis does one make this determination? Here, let me give you two examples... one of these snippets was written by me, the other by Claude: which one is acceptable and which one is not?

Option A:

    for (let n = 0; n < 1000; n++) {
      console.log('Am I AI generated?');
    }

Option B:

    let n = 0;
    while (n++ < 1000) {
      console.log('Or am I AI generated?');
    }

What this part of the policy basically boils down to is: any contribution can be contested, and our normal consensus-seeking rules are to be followed to resolve the contention. That's no different from what we have now, except for the added incentive for a contributor to just lie about whether they used an agent to help write the code. If the policy is unenforceable, it's not useful.

Member

Yeah, this specific point is my main concern. If the policy is not defined in a way that is clearly measurable/identifiable then it's likely to just encourage lying about it, and the ability to contest things arbitrarily could easily be abused.

Member Author

I'm gonna guess A because from what I've seen, most of these tools will prefer a for-loop when it comes to using a loop for simple repetition a static number of times.

But, yes, to your point, that's clearly not enough to go on here, and there's a reasonably good chance my guess was wrong, or even that they're both AI or neither is AI.

Obvious cases can, will, and do exist. The rest relies on good faith participation. I think it's clear that your point here is that that's insufficient, but there are other policies within Node.js core that require good faith participation, like the DCO itself.

Member

@jasnell jasnell Mar 27, 2026

Yes, and those existing other policies, including the DCO, are sufficient on their own. We do not need to layer on an unenforceable policy just to put on a show that "AI bad, Human good". Nothing in this document strengthens or improves our existing policies. Nothing in the debate has demonstrated that our existing policies are inadequate.

Member

We do not need to layer on an unenforceable policy just to put on a show that "AI bad, Human good". Nothing in this document strengthens or improves our existing policies.

This document guides contributors towards acceptable AI use. I don't believe we have an analogous policy with a similar level of comprehensiveness in this repository.

Nothing in the debate has demonstrated that our existing policies are inadequate.

This debate started with a 19k-LoC LLM-generated PR that was opened and positively reviewed. If we believe this PR can be merged, our existing policies are adequate; but if, on the other hand, we think that merging such a change harms Node.js in the long term and introduces risks, the policy change is warranted.

Clearly the debate isn't over, but this PR is a big milestone in closing it.

models includes material under licensing terms that may not be compatible
with the Node.js license. Contributors certify their submissions under the
[Developer's Certificate of Origin 1.1][DCO], which requires that the
contribution was created by them and that they have the right to submit it.
Member

This is misstating what the DCO says. There are three clauses separated by an OR statement.

You're focusing solely on the first clause and ignoring the "or"

 (a) The contribution was created in whole or in part by me and I
     have the right to submit it under the open source license
     indicated in the file; or

 (b) The contribution is based upon previous work that, to the best
     of my knowledge, is covered under an appropriate open source
     license and I have the right under that license to submit that
     work with modifications, whether created in whole or in part
     by me, under the same open source license (unless I am
     permitted to submit under a different license), as indicated
     in the file; or

 (c) The contribution was provided directly to me by some other
     person who certified (a), (b) or (c) and I have not modified
     it.

You're also making the assumption that the person using the tool has not performed their own thorough review of the output generated. For instance, I may be in the minority but I read every line generated by the AI agents I use.

This also ignores the fact that the Linux Foundation legal team has established that AI contributions do not violate the DCO.

Member Author

Is the Node.js project not free to have its own interpretation here? If it's not free to do so, I can remove all the language about the DCO. There are other rationales.

If it is free to do so, then it makes sense to respond to the rest of what you've said:

The language I put here does focus solely on clause (a), but I'm confused as to how (b) applies if the LLM did not provide the code to the contributor under an open source license. The LLM is not even an entity that can enter into any contract or hold copyright at all, so I'm not sure how either would apply.

Member:

Only people can attest to the DCO. Nothing in any of this absolves the person opening the PR from responsibility to ensure everything in the PR is appropriate to contribute. Your exact concern would apply to me hiring a contractor to write the code on my behalf. If I'm opening the PR, it becomes my responsibility, regardless of how the contribution was written. That's what the foundation legal team established.

Is the project free to have its own interpretation? Sure. But I'm not sure why we would.

Member Author:

> Your exact concern would apply to me hiring a contractor to write the code on my behalf.

I'm having trouble understanding this analogy. A contractor is (by definition) entering into a contract with you (which would presumably fall under (b)). An LLM can't do that (as you said).

Or, is it more because of (a), since you can say you authored it "in part" and you're asserting the right to submit it under the license regardless of its origin (i.e. the "it's a tool, no different from vim or a template-generator-thing" kind of argument)? That seems more plausible to me.

Member Author:

Alright, I re-read your blog post on this.

It seems like the contractor analogy hinges on the notion that you (not the contractor in this analogy) are the copyright holder. While that does cover the "right to submit" portion of (a), it doesn't cover the "created in whole or in part by me". But! That's okay, since "tool use" basically covers it for LLMs/AI. It just leaves me with more questions about contractors, but that's irrelevant to this conversation.

Now, whether you are the copyright holder in the LLM scenario is of course potentially suspect, but as you and @Qard have pointed out in this PR, LLMs rarely produce truly copyright-infringing material (despite how they themselves were created), and when they do, the usual copyright rules apply and no additional rules or policy is needed to cover that.


In short, it took me a while to get there, and I'm sorry about that, but I agree that the DCO is not inherently violated by LLM/AI-created contributions (though as with any other PR, it could be violated through other means). It shall be stricken from this PR when I get the chance.

Member:

It's fair to note that while it becomes your responsibility, the assumption is that you indeed have a legal arrangement with said contractor. It's impossible to have such an arrangement with an LLM, so I don't think your analogy holds.

LF legal's position here, while untested in court (and thus neither inherently correct nor incorrect), is that this is fine for the DCO. But that doesn't mean it's the same as with a contractor, which is something that has very much been tested in court.


## Enforcement

This policy cannot be fully enforced. There is no reliable way to determine
Member:
Suggested change:

    - This policy cannot be fully enforced. There is no reliable way to determine
    + This policy cannot be enforced. There is no reliable way to determine

You can strike the word "fully" here. It can't be enforced, period.

Member Author:

I don't think that's accurate. "Enforcement" in this context simply means a policy violation can be detected and consequences can be rendered. Sure, not all cases are detectable, but certainly self-identifying cases (among others) are. "fully" still applies because there are cases where enforcement is possible.

## Scope

This applies to content that lands in the repository: source code,
documentation, tests, tooling, and other files submitted via pull requests.
Member:

Currently, non-human bots make automated updates to files in test (e.g. WPT fixture updates, data syncs), .github (GitHub Actions workflow updates), src (version headers for undici, amaro, zlib, etc.), lib, and doc. The wording of this policy, stating that "All authorship of submitted changes must be human", would make these bot updates invalid.

Member Author:

Right. Non-LLM automation would need an exclusion. πŸ‘ (Saw your other comment about this as well.)

@indutny (Member) commented Mar 27, 2026

@Qard

> But it exists whether we like it or not.

You've probably heard me say this, but we are the ones normalizing the use of it. The fact that assault rifles exist doesn't mean that we should all be carrying them around. Some people would; many won't, and would consider that unethical.

> It's like being unhappy that cars killed walking or biking places. Yeah, they pollute, they're expensive, and they reshape the world in ways we might not like, but they are part of the modern reality just as LLMs are. We could decide to be one of those niche communities where cars are banned, but that will mean isolating us from most of the rest of the ecosystem and would thus likely significantly reduce contributions.

Yet there are cities (Paris, Barcelona) that ban or limit cars in certain areas, and those areas seem to thrive. There is a sense of FOMO instilled in us by LLM companies and their ardent followers. Partly this is fueled by the feeling of empowerment that an LLM gives the user. It is not hard to understand why, feeling this way, one would think others are missing out. While LLM supporters are very vocal, what I discovered over the last week is that there are a lot of people who feel differently and either can say so publicly, or are afraid of retribution from their employer should they object to the LLM hype.

> That's wildly inaccurate. There is no such "verbatim" copy. It all becomes math mixed into neurons. Sometimes the signal of a particular piece of training is strong enough that the overall piece can be extracted verbatim, but very rarely and mostly only the very first bits of knowledge ingested, as they serve as the first and strongest backpropagation applications.
>
> A very, very small quantity of the training data of an LLM can be extracted verbatim, but only with substantial effort to specifically target that exact piece of data, and only the vanishingly few pieces that actually remain that intact.

I don't think this is the case. There are studies that show otherwise. (And as a matter of common sense: training an LLM literally involves predicting the next token of the training data, so the model has to be capable of reproducing the training material.)
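The next-token-prediction point can be made concrete with a toy model. Below is a minimal sketch (a hand-rolled bigram counter with greedy decoding, illustrative only and in no way representative of a real LLM's scale or architecture): a model whose sole training objective is predicting the next token of its corpus ends up, under greedy decoding, regenerating that corpus verbatim.

```python
from collections import defaultdict

def train_bigram(tokens):
    # Count next-token frequencies: the bigram analogue of the
    # next-token-prediction objective used to train LLMs.
    counts = defaultdict(dict)
    for cur, nxt in zip(tokens, tokens[1:]):
        counts[cur][nxt] = counts[cur].get(nxt, 0) + 1
    return counts

def generate(model, start, length):
    # Greedy decoding: always emit the most frequent next token,
    # stopping early if the current token was never seen mid-corpus.
    out = [start]
    for _ in range(length):
        options = model.get(out[-1])
        if not options:
            break
        out.append(max(options, key=options.get))
    return out

corpus = "the quick brown fox jumps over the lazy dog".split()
model = train_bigram(corpus)
print(" ".join(generate(model, "the", 6)))
# prints "the quick brown fox jumps over the" -- a verbatim run of training text
```

Real models generalize far beyond this, of course, but the sketch shows why "the training objective is reproduction of training text" is not a controversial claim in itself.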
