doc: add policy on LLM-generated contributions #62447
bengl wants to merge 2 commits into nodejs:main
Conversation
The language I used here is admittedly strong. Perhaps "have a higher probability of not meeting that bar" is more accurate.
Strong and demonstrably false. I've used AI agents to assist in authoring PRs and I can always fully explain the changes, respond to review, and take full responsibility for the PRs I open.
I think we can weaken the language, indeed; that being said, as a north star this statement makes a lot of sense. Regardless of how we write it, it is going to be subject to interpretation. Choosing a concise and strong form over lengthy and vague statements is clearer.
I approve (even though GitHub won't let me).
Add doc/contributing/ai-contributions.md banning LLM-generated content from commits. Scoped to committed content only, excluding discussion, vendored deps, and accessibility tools. Enforcement uses consensus seeking. Rationale covers reviewer burden and DCO compatibility. References added in CONTRIBUTING.md and the collaborator guide.
doc/contributing/ai-contributions.md
Outdated
> LLM produced a draft that the contributor then edited. Contributors must be
> the sole human authors of the changes they submit.
This seems to imply having co-authors in the PR would be forbidden, which is not the case, or at least is not related to AI contributions.
Suggested change:

```diff
 LLM produced a draft that the contributor then edited. Contributors must be
-the sole human authors of the changes they submit.
+human after all.
```
Good point! I'll reword this.
This would make our automated dependency updates via the github bot invalid and would require that all dependency updates be performed manually.
I agree that the language might be too exclusive. That being said, saying that Dependabot PRs are subject to an AI contribution policy is a bit of a stretch. These are basically configuration file updates and arguably not code changes.
doc/contributing/ai-contributions.md
Outdated
> This policy can be revisited if the legal situation or the tools change.
I would remove mention of legal here, because it's IMO obvious and does not bring much
Suggested change:

```diff
-This policy can be revisited if the legal situation or the tools change.
```

Do you think the callout about revisiting it in the future as situations change is still worth it?
My opinion is that it goes without saying; docs are always a "snapshot of the current state of things", and I don't think this document is any different in that respect. That being said, if you feel strongly that it's worth explicitly saying, feel free to treat this as a nit and resolve the thread.
Your opinion here makes sense to me so I'll take the suggestion.
> * Pull request descriptions, review comments, issue discussion, or other
>   communication that is not part of the committed tree. Those are covered by
>   general expectations around good-faith participation and the
>   [Code of Conduct][].
Note that it somewhat conflicts with nodejs/admin@8b746bc, which would be worth revisiting if this PR lands as is.
Hmm, that's also somewhat conflicting with #62105 (and indeed, anything that seeks to solidify allowing LLM-generated PRs), so that moderation policy would have to change in either case.
That said, removing this exception would put it in line with that moderation policy, I think. I'm not sure on the best approach here yet.
Qard left a comment:
I worry such a policy leaves a lot of room for abuse if not defined very clearly and with clear and measurable criteria for the identification of such content. While detection tools may be unreliable, I think we do need some clear definition or test to make this not just a vague judgement call which itself can not be explained clearly.
It seems to me that the real issue, which the content already hints at, is a contributor lacking enough understanding of the code being submitted to be able to explain or properly maintain it. It therefore follows, to me, that denying a change should require equal rigor: clearly describing the criteria not met for the contribution to be considered acceptable.
Personally, I find LLMs can be helpful, and the scenarios in which they are most helpful are writing code or prose. I've used LLMs a bunch lately, and I'm confident in my ability not just to steer them, but also to have them build from my ideas rather than go off and build their own rough semblance of what I have described. As such, I feel I could generally quite effectively explain what I've had an LLM generate. I suspect many other contributors are in a similar position.
My worry is that such a policy would simply encourage dishonesty among those of us that know the things we build. I'm confident I could submit LLM-written code without raising suspicion. What if I'm a bit busy today and this bug needs fixing? Would it harm anyone if I just submit this one little PR and not tell anyone an LLM did it?
I don't want to be creating an environment which encourages people to tell little white lies about the origins of their code, fostering a community of subtle dishonesty. I would much prefer we feel welcome to be open and honest about the manner in which we've created our contributions and instead focus our attention on what actually are the problems emerging from the rise in LLM contributions and how can we address those things more directly.
Certainly some may have ethical qualms about LLM use, there are no shortage of reasons to be unhappy about it. But it exists whether we like it or not. It's like being unhappy that cars killed walking or biking places. Yeah, they pollute, they're expensive, and they reshape the world in ways we might not like, but they are part of the modern reality just as LLMs are. We could decide to be one of those niche communities where cars are banned, but that will mean isolating us from most of the rest of the ecosystem and would thus likely significantly reduce contributions. Maybe that's fine, maybe it's not. It's a choice we would have to make, and I think it's quite a significant choice to make.
Personally, I see a bunch of problematic behaviours arising from the emergence of LLMs, and banning LLMs does not actually address any of those problems; it just lumps them in with one source in which those problems happen to be represented disproportionately.
In my opinion, I think we should consider policies which more directly target these problems. For example, limiting how many subsystems are "acceptable" for a feature PR to touch, how many lines of code are allowed before it warrants an RFC doc or issue to discuss/explain the concept first before submitting a huge and challenging to review PR. Any change which needs to exceed these limits would then have to ask for a TSC exception. Though my example there lacks definition, that direction seems to me more like one which would adequately address the actual problems emerging.
That was a bit of a wall of text, but all that having been said, I am not against having a desire to reduce LLM contributions, but I'm not convinced banning them outright is reasonable or would even work. I think this is less a policy concern and more a quality standards concern, which we would generally address with linting rules, test coverage, and other mechanisms to validate contribution quality. What makes a fancy code generator so different that the same sorts of quality validation systems can not apply to it?
> Do not submit commits containing code, documentation, or other content
> written in whole or in part by a large language model (LLM), AI
> code-generation tool, or similar technology. This includes cases where an
"or similar technology" is fairly vague. For example, would this apply to our automation to rewrite REPLACEME markers? A human didn't write that replacement. To be clear, I know that's obviously not included, but only on my own intuition, we should make that clearer.
James also pointed this out. I'll reword this so that non-AI automations are explicitly excluded.
I see the distinction in reproducibility of changes. If the change was partly or fully automated, the source code for automation should be mentioned in the PR.
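To make the reproducibility point above concrete, here is a purely hypothetical sketch of the kind of deterministic, non-AI automation being discussed (the real Node.js release tooling that rewrites REPLACEME markers works differently; the function name and inputs here are invented for illustration):

```js
// Hypothetical sketch, not the actual Node.js tooling: rewriting
// REPLACEME markers with a concrete release version. Rerunning the
// script on the same input yields byte-identical output, which is what
// makes disclosing the automation source in the PR meaningful.
function substituteReplaceme(text, version) {
  return text.replace(/REPLACEME/g, `v${version}`);
}

console.log(substituteReplaceme('Added in: REPLACEME', '24.0.0'));
// Prints: Added in: v24.0.0
```

Unlike LLM output, anyone can audit the script and regenerate the change exactly.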
> If a collaborator suspects a pull request contains LLM-generated content,
> they should raise the concern in the pull request. The normal
> [consensus seeking][] process applies: collaborators discuss, the author can
> respond, and a good-faith determination is made. If consensus is that the
> contribution violates this policy, it must not land. If consensus cannot be
> reached, the matter can be escalated to the TSC.
This sounds a bit witch-hunt-y and also exploitable to block certain contributions someone might not like, or from a person another contributor might not like. Proving a negative is not really feasible so this could be used to get changes stuck in an uncertain limbo forever.
These last two sentences are not particularly different from any existing policy involving consensus around a PR and escalating to TSC. I'll adjust this to refer to the existing procedure for dealing with such disputes, rather than re-hash it here.
> submissions shift the verification burden onto those reviewers, with no
> assurance the contributor has done that work themselves.
> The copyright status of LLM output is unresolved. Training data for popular
That's not quite accurate. Because the training of the LLM on that content is done by a different person than the one using the LLM to produce the output, it's legally a clean-room design, for which there is already legal precedent from when Phoenix Technologies and American Megatrends copied the IBM BIOS.

It's possible the law could be revised to address LLMs specifically in a different way, but the current state is not an absence of definition.
I do not see the parallel you are suggesting.
The details of how a thing works get captured as some description by one person; that description then gets relayed to another. They can now build the thing described, and it is legally not considered derivative. A bit of a wacky edge case of the legal system, but that's how it was ruled.
None of us in this thread are lawyers (AFAIK), and this is the first time I'm hearing of this. If the copyright law is truly resolved at this point, then this can be dropped. Can you point to references indicating that this applies here?
The "clean room" aspect of LLM-generated code is at the very least debatable. Since an LLM can reproduce word-for-word the content it was trained on, we could argue very different things about it.
> ## [Policy on LLM-generated contributions](./doc/contributing/ai-contributions.md)
> Do not submit commits containing content written in whole or in part by a
I think this effectively means we cannot land more V8 updates, for example, #61898 contains https://chromium-review.googlesource.com/c/v8/v8/+/7269587 and a bunch of other code written by Gemini (you can search the V8 commit log to find a lot more).
Note that deps are an explicit exception (node/doc/contributing/ai-contributions.md, lines 27 to 29 in 7178b64).
I see; perhaps that needs to be clearer and mention that there is a scope statement in the TL;DR, considering probably ~90% of the code in the code base is vendored in.
I'll see what I can do to clarify this.

The fancy code generator has a verbatim "memory" of the code used to "train" it. It uses that bank of code to create the code you ask of it. It isn't generating code from an "oh, I did something similar once; I can do this again" position. It's generating code from everything that was stolen and shoved inside of it.
jasnell left a comment:
I'm generally and firmly -1 on any form of ban on AI assisted contributions
> Do not submit commits containing content written in whole or in part by a
> large language model (LLM) or AI code-generation tool.
The entire clause here is unenforceable. As models improve, how are you going to be able to tell the difference? All this does is provide incentive for contributors to lie.
It doesn't have to be enforceable, and this policy explicitly says so.
That's wildly inaccurate. There is no such "verbatim" copy. It all becomes math mixed into neurons. Sometimes the signal of a particular piece of training is strong enough that the overall piece can be extracted verbatim, but very rarely and mostly only the very first bits of knowledge ingested as they serve as the first and strongest backpropagation applications. A very, very small quantity of the training data of an LLM can be extracted verbatim, but only with substantial effort to specifically target that exact piece of data, and only the vanishingly few pieces that actually remain that intact.
Yes, I would personally be in agreement that there's definitely some stolen data in there. Though at this point, most of the data LLMs are trained on is actually created from scratch specifically as training material. There are a bunch of LLM training-data orgs that hire devs to just build tons of apps to create training data they can sell. Certainly the stolen data was the bootstrap for a lot of these LLMs, though. I personally disapprove of the approach that was taken to bootstrap these systems, but my dissatisfaction is not going to stop them from existing and being widely adopted.
> * Pull request descriptions, review comments, issue discussion, or other
>   communication that is not part of the committed tree. Those are covered by
>   general expectations around good-faith participation and the
This is contradictory. If an AI agent reviews a PR and makes an alternative code suggestion, and the PR author accepts that suggestion, then this policy is violated.
I don't think that makes it contradictory? If an agent makes the suggestion, and the suggestion is accepted, then that becomes part of the committed tree, doesn't it?
So the policy would be: you can use an agent to review code but you can't accept any of its suggestions?
There was a problem hiding this comment.
If they are strictly "please include this code" suggestions, via GitHub's mechanism for this or otherwise, then yeah, that would be the case.
I admit this is quite weird :/
The intent here was to provide some amount of compromise. It backfired into this corner here, but I think it would be consistent with the notion that committed code in the actual source tree is what's covered by the policy, so weird as it is, I'd keep it this way (unless folks don't need the compromise!).
> and vendored code under `tools/`). That content is maintained upstream
> under its own governance.
> * AI-powered accessibility tools like screen readers or text-to-speech
>   software, provided they do not influence the content of the contribution.
It makes no sense to add this here as screen readers and text-to-speech software do not write code contributions.
Well, speech-to-text can. I know someone that does that, actually.
Which raises another interesting case here: use of AI as disability-assistive device. This would make this policy completely unenforceable.
I'm not sure I understand how this would impact the enforceability of the policy. If a contributor uses AI tools for accessibility reasons, and discloses this in their PR (or in some blanket statement clearly visible elsewhere), then they wouldn't be in violation. If they fail to disclose ahead of time, and the PR is contested under this policy, then they disclose, then there's still no problem.
I just realized that you may mean that providing for this exception provides an additional avenue for abuse of the policy, since folks can simply state that they're using AI tools for accessibility reasons. I think this is the same category of (un-)enforceability as "they can just lie", so, still, I don't think this adds or subtracts any amount of enforceability.
No, it's not in the same category. Your assumption there is that someone saying they are using it for accessibility is equivalent to someone lying about using it, and that's not a valid equivalency.
To draw that out a bit more:
The "they can just lie" category is ... they'll use it and won't tell you.
The "accessibility" category is ... they'll use it and there's nothing you can do about it.
But this is beside the point. The larger issue is that the policy is unenforceable because it's simply not possible to reasonably and unambiguously tell the difference.
> Your assumption there is that someone saying they are using it for accessibility is equivalent to someone lying about using it, and that's not a valid equivalency.

Not quite. My assumption is that someone lying that they are using it for accessibility is equivalent to someone lying about not using it. You did clarify the categories and their differences though. Thanks for that.
> But this is beside the point.

Agreed.
> [consensus seeking][] process applies: collaborators discuss, the author can
> respond, and a good-faith determination is made. If consensus is that the
> contribution violates this policy, it must not land. If consensus cannot be
> reached, the matter can be escalated to the TSC.
And on what basis does one determine this decision? Here, let me give you two examples. One of these snippets was written by me, the other by Claude: which one is acceptable and which one is not?
Option A:

```js
for (let n = 0; n < 1000; n++) {
  console.log('Am I AI generated?');
}
```

Option B:

```js
let n = 0;
while (n++ < 1000) {
  console.log('Or am I AI generated?');
}
```

What this part of the policy basically boils down to is: any contribution can be contested, and our normal consensus-seeking rules are to be followed to resolve the contention. That's no different than what we have now, except for the added incentive for a contributor to just lie about whether they used an agent to help write the code. If the policy is unenforceable, it's not useful.
Yeah, this specific point is my main concern. If the policy is not defined in a way that is clearly measurable/identifiable then it's likely to just encourage lying about it, and the ability to contest things arbitrarily could easily be abused.
I'm gonna guess A, because from what I've seen, most of these tools will prefer a for-loop when it comes to using a loop for simple repetition a static number of times.

But, yes, to your point, that's clearly not enough to go on here, and there's a reasonably good chance my guess was wrong, or even that they're both AI or neither is.
Obvious cases can, will, and do exist. The rest relies on good faith participation. I think it's clear that your point here is that that's insufficient, but there are other policies within Node.js core that require good faith participation, like the DCO itself.
Yes, and those existing other policies, including the DCO, are sufficient on their own. We do not need to layer on an unenforceable policy just to put on a show that "AI bad, Human good". Nothing in this document strengthens or improves our existing policies. Nothing in the debate has demonstrated that our existing policies are inadequate.
> We do not need to layer on an unenforceable policy just to put on a show that "AI bad, Human good". Nothing in this document strengthens or improves our existing policies.

This document guides contributors towards acceptable AI use. I don't believe we have an analogous policy with a similar level of comprehensiveness in this repository.

> Nothing in the debate has demonstrated that our existing policies are inadequate.

This debate started with a 19k-LoC LLM-generated PR that was opened and positively reviewed. If we believe that PR can be merged, our existing policies are adequate; but if on the other hand we think that merging such a change harms Node.js in the long term and introduces risks, the policy change is warranted.
Clearly the debate isn't over, but this PR is a big milestone in closing it.
> models includes material under licensing terms that may not be compatible
> with the Node.js license. Contributors certify their submissions under the
> [Developer's Certificate of Origin 1.1][DCO], which requires that the
> contribution was created by them and that they have the right to submit it.
This is misstating what the DCO says. There are three clauses separated by an OR statement.
You're focusing solely on the first clause and ignoring the "or"
```text
(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.
```
You're also making the assumption that the person using the tool has not performed their own thorough review of the output generated. For instance, I may be in the minority but I read every line generated by the AI agents I use.
This also ignores the fact that the Linux Foundation legal team has established that AI contributions do not violate the DCO.
Is the Node.js project not free to have its own interpretation here? If it's not free to do so, I can remove all the language about the DCO. There are other rationales.
If it is free to do so, then it makes sense to respond to the rest of what you've said:
The language I put here does focus solely on clause (a), but I'm confused as to how (b) applies if the LLM did not provide the code to the contributor under an open source license. The LLM is not even an entity that can enter into any contract or hold copyright at all, so I'm not sure how either would apply.
Only people can attest to the DCO. Nothing in any of this absolves the person opening the PR from responsibility to ensure everything in the PR is appropriate to contribute. Your exact concern would apply to me hiring a contractor to write the code on my behalf. If I'm opening the PR, it becomes my responsibility, regardless of how the contribution was written. That's what the foundation legal team established.
Is the project free to have its own interpretation? Sure. But I'm not sure why we would.
> Your exact concern would apply to me hiring a contractor to write the code on my behalf.
I'm having trouble understanding this analogy. A contractor is (by definition) entering into a contract with you (which would presumably fall under (b)). An LLM can't do that (as you said).
Or, is it more because of (a), since you can say you authored it "in part" and you're asserting the right to submit it under the license regardless of its origin (i.e. the "it's a tool, no different from vim or a template-generator-thing" kind of argument)? That seems more plausible to me.
Alright, I re-read your blog post on this.
It seems like the contractor analogy hinges on the notion that you (not the contractor in this analogy) are the copyright holder. While that does cover the "right to submit" portion of (a), it doesn't cover the "created in whole or in part by me". But! That's okay, since "tool use" basically covers it for LLMs/AI. It just leaves me with more questions about contractors, but that's irrelevant to this conversation.
Now, whether you are the copyright holder in the LLM scenario is of course potentially suspect, but as you and @Qard have pointed out in this PR, LLMs rarely produce truly copyright-infringing material (despite how they themselves were created), and when they do, the usual copyright rules apply and no additional rules or policies are needed to cover that.
In short, it took me a while to get there, and I'm sorry about that, but I agree that the DCO is not inherently violated by LLM/AI-created contributions (though as with any other PR, it could be violated through other means). It shall be stricken from this PR when I get the chance.
It's fair to note that while it becomes your responsibility, the assumption is that you indeed have a legal arrangement with said contractor. It's impossible to have such an arrangement with an LLM, so I don't think your analogy holds.

LF Legal's position here, while untested in court (and thus neither inherently correct nor incorrect), is that this is fine for the DCO. But that doesn't mean it's the same as with a contractor, which is something that has very much been tested in court.
> ## Enforcement
> This policy cannot be fully enforced. There is no reliable way to determine
Suggested change:

```diff
-This policy cannot be fully enforced. There is no reliable way to determine
+This policy cannot be enforced. There is no reliable way to determine
```

You can strike the word "fully" here. It can't be enforced, period.
I don't think that's accurate. "Enforcement" in this context simply means a policy violation can be detected and consequences can be rendered. Sure, not all cases are detectable, but certainly self-identifying cases (among others) are. "fully" still applies because there are cases where enforcement is possible.
> ## Scope
> This applies to content that lands in the repository: source code,
> documentation, tests, tooling, and other files submitted via pull requests.
Currently non-human bots make automated updates to files in test (e.g. WPT fixture updates, data syncs), .github (GH action workflow updates), src (version headers for undici, amaro, zlib, etc), lib, and doc. The wording of this policy indicating "All authorship of submitted changes must be human" would make these bot updates invalid.
Right. Non-LLM automation would need an exclusion. (Saw your other comment about this as well.)
You've probably heard me saying this, but we are the ones normalizing its use. The fact that assault rifles exist doesn't mean that we should all be carrying them around. Some people would; many won't and would consider that unethical.
Yet, there are cities (Paris, Barcelona) that ban or limit cars in certain areas, and those areas seem to thrive. There is a sense of FOMO distilled into us by LLM companies and their ardent followers. Partly this is infused with the empowerment that an LLM makes the user feel; it is not hard to understand why, feeling this way, one would think others are missing out. While LLM supporters are very vocal, what I discovered since last week is that there are a lot of people who feel differently and either can't publicly say so, or are afraid of retribution from their employer in case they object to the LLM hype.
I don't think this is the case. There are studies that show otherwise. (And just as a matter of common sense: the training of an LLM literally includes predicting the next token of the training data, so they have to be capable of reproducing the training material.)
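The next-token point in the comment above can be made concrete with a deliberately tiny, hypothetical sketch (not a real LLM): a lookup-table "model" trained on a single string reproduces its training data verbatim. Real models average over a huge corpus, so verbatim recall is far rarer in practice; this only illustrates that the training objective permits it.

```js
// Toy next-token "model": maps each token to the token that followed it
// in the training data. With one training document and unique tokens,
// generation replays the document exactly.
function train(tokens) {
  const next = new Map();
  for (let i = 0; i < tokens.length - 1; i++) {
    next.set(tokens[i], tokens[i + 1]);
  }
  return next;
}

function generate(next, start, count) {
  const out = [start];
  for (let i = 0; i < count; i++) {
    out.push(next.get(out[out.length - 1]));
  }
  return out.join(' ');
}

const data = 'use strict mode in all new files'.split(' ');
const model = train(data);
console.log(generate(model, 'use', data.length - 1));
// Prints: use strict mode in all new files
```

The point of contention in the thread is how strongly this memorization property survives at real-model scale, not whether the objective allows it.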
A PR was opened with a more relaxed AI policy (#62105), so this one here is offered as an alternative. Note, however, that some parts of that PR seem compatible with this one (e.g. the behavioural guidelines).
Some key points:
This goes without saying, but I'm going to say it anyway, since this subject is pretty heated: This is not an invitation for personal attacks on myself, the author of the other PR, anyone who wants to ban AI in Node.js core, or anyone who doesn't want to ban AI in Node.js core. Or anyone else for that matter. Please observe Node.js' Code of Conduct.