How and why I attribute LLM-derived code
As I've written about before, I'm a self-described "cautious skeptic" of AI and Large Language Models, and I'm trying to do more with AI where it makes sense.
One thing I've noticed I do differently to other users of AI-generated code is that I work hard to document my usage of AI tools much more visibly, going as far as documenting it at the commit level.
This may be because I have thoughts about Git commits, but it also comes from the still-unknown legal risks of using AI-generated code.
While at Elastic, I was part of the Open Source Working Group, a cross Engineering + Legal team that worked towards contributing back to Open Source more effectively, ensuring we were well-managed with our license-compliance obligations, and supporting Engineering leadership in the rollout of AI usage across the organisation in a way that didn't lead to any legal concerns.
(Aside: I Am Not A Lawyer, and this is not legal advice)
As we were working on the rollout of AI tooling, one of the many interesting tidbits I learned about was Microsoft's AI indemnity protection. If you're unaware: if you sign on to GitHub Copilot Enterprise (not any other plan!) and someone sues you for copyright infringement due to your usage of LLMs, Microsoft will foot the bill for the legal fees!
That's pretty great, right? You get to perform copyright laundering, but if you're caught, you don't even have to pay to fight it in court!
Well, this only applies if you've followed the terms you agreed to when signing the Enterprise deal, by keeping records of your usage of GitHub Copilot - for instance, which line(s) in your codebase are LLM-derived. I seem to remember there may also have been requirements to note the model(s) used, but I can't quite remember nor confirm that.
(Microsoft have other stipulations to follow on top of this before the indemnity kicks in - again, this is not legal advice nor exhaustive protection)
If we take a step back and think about the wider software engineering ecosystem - what percentage of the code written do you think will have this level of detail?
This is one of the key reasons I consider documenting my LLM-derived code very important.
Note that this is my personal workflow; it works for me and, in my opinion, will future-proof my contributions.
How do I use LLMs?
Before I discuss in a bit more detail why I enforce this workflow on myself, it's worth mentioning how I'm currently using LLMs.
I'm a fair bit behind the curve, but I'm slowly increasing my LLM usage, giving me the opportunity to delegate small tasks ("here's a failing test, go and implement it" or "this code should work, go and work out why it's not"), as well as starting to trial it on slightly larger pieces of work.
I'll have conversations with the LLM in the classic chat interface, using CodeCompanion.nvim, and periodically use Ollama (through OpenWebUI) running locally on my Linux desktop or my work MacBook Pro.
As a user of GitHub Copilot Pro (as a prolific Open Source maintainer), I was excited to see that Charm's agentic tool crush landed support for using GitHub Copilot as a backend, so I've used crush as my "daily driver" for agent-driven tasks for a few months now.
I've also been doing some more with Claude Code, through access at work, so I can use Claude Code for internal codebases, and crush for my public-facing work.
Through my work, I'll take snippets of code from chat discussions (or ask the LLM to write them directly to the files), and have the agent write changes directly to the codebase. In these cases, I'm still very much human-in-the-loop, and I'll commit the code when I'm happy with it.
I may allow agents to commit for me in the future, but I'll still rewrite commit messages until I'm happy with them.
How do I attribute code?
That's a lot of preamble for what you're interested in, right? How exactly do I do this attribution?
Inline code comments
As mentioned, because I'm personally involved in committing the changes, and sometimes copy-pasting or implementing suggestions from an LLM, I know exactly which lines are LLM-derived.
If a whole function, or a complex one-liner, comes from an LLM, I'll document it with an inline comment, e.g.:
```
# Code snippet licensed under the Apache-2.0, and co-authored-by: gpt-oss:20b
jq '
{
  # -------- missing currentVersionTimestamp ----------
  missingCurrentVersionTimestamps: [
    .config
    # ...
'
```
This especially helps future readers be a bit more aware of the potential risk - though hopefully, during code review with another human, we'd have already teased out some of the risks and concerns with that code.
Although these comments would help answer "where are you using LLM-derived code?", I've found that keeping them correct over time can be cumbersome, especially given the Ship of Theseus problem: changes to the code over time may mean the original code no longer exists.
Co-authored-by
Instead of this, I've now settled on per-commit attribution to LLM model(s) that have introduced the code.
For instance, on Renovate we use squash-merge for PRs, which means that everything in the PR gets associated with the use of LLMs. As well as generally helping to break up large pieces of work, this also has the benefit of making sure that the "tainting" of LLM code is contained to the smallest unit of work.
The way I manage this commit-level attribution for LLMs is using the Co-authored-by Git trailer.
This trailer has existed for many years, and Git forges show multiple authors when browsing commits that contain Co-authored-by, which makes it slightly more visible than using custom metadata.
Additionally, because a commit can carry many of these trailers, I can disclose the usage of potentially many models used to generate a single commit of work.
The Co-authored-by trailer is of the format:

```
Co-authored-by: $authorName <$authorEmail>
Co-authored-by: $author2Name <$author2Email>
```
I take advantage of this format to use the model name as the author name, and associate the email with the LLM provider. This gives me full coverage of the key metadata we may need in the future to understand what was used to create code in the commit.
For example, one of my recent posts included code from the local model GPT-OSS, at 20 billion parameters, and so the commit message for that blog post was:
```
Blog: document jq for releaseTimestamps

Co-authored-by: gpt-oss:20b <ollama@...>
```
This tells me:
- I used gpt-oss
- It was the 20b parameter variant
- I used it through Ollama
For a more complex example, we can see:
```
refactor(manager/npm): extract function for resolving .npmrc

As part of future changes in #41215 and/or #41216, we will resolve the
.npmrc, if it can be found, in multiple places. Before we do this, we can
prefactor this logic to extract this out into its own function that can be
used independently.

Co-authored-by: Claude Sonnet 4.5 <jamie.tanna+github-copilot@...>
```
As you can see, this gives a similar level of information. As I'm using my work email, I use a + alias to tag it with GitHub Copilot.
In cases where I need to describe in more detail where the LLM derived data has been used, I can also call that out in the commit message, such as:
```
feat(util): log warning if file(s) contain hidden Unicode characters (#41353)

With the rise of prompt injection, there is an increased number of attacks
occurring using hidden Unicode characters.

[...]

The list of suspicious characters has been largely compiled by Claude Sonnet 4.5.

Co-authored-by: Claude Sonnet 4.5 <jamie.tanna+github-copilot@...>
```
Not a generic co-authored-by
Notice also that this isn't Claude Code or Crush adding the generic co-authored-by, which the Go team are currently discussing the value of. In this case, the co-authored-by is more specific, and because it is associated with my email, it's something that I personally can sign the CLA for, as I'm taking on the burden of authorship.
Why?
I'm sure many of you have jumped to this section with a "yeah, but why bother?"
For myself
I want to do right by myself.
As someone who prides themselves on attention to detail, I want to make sure that I've captured metadata that may be important in the future.
Also, as someone who's still unsure about the legal risks of AI usage, and who has ethical concerns about it, this makes me feel like I have some level of control.
Because it's a small barrier to usage
I've found that this is also a small psychological nudge to ask myself "do I need to be using an LLM for this?", as well as to consider whether the code that's been provided could do with more thought while I'm writing the commit message.
For legal-minded folks
As noted, there may be legal risks to using LLM-derived code, and I'd like to make sure that I'm doing what I can to cover myself (or the company I'm contributing on behalf of).
If there is ever a need to go back and find where LLM-derived code has been used, I'd love to be able to say that I've made it easier for my colleagues in legal, at least for my own work.
For reviewers
There's a lot of discourse about whether disclosing AI usage can lead to a higher/lower bar on a code review by another human.
Regardless of what others think, I believe it's fair to be up front with the humans who are spending time reading my code, trying to work out whether I've suddenly forgotten how to write a for loop the way our project expects, or whether it's because AGENTS.md doesn't mention that convention.
(And either way, it's up to me to review my code first)
Commit-level has better longevity than PRs
If the choices you have are "document AI usage at the PR level" and "don't document AI usage", I'd prefer you did it at the PR level than not at all.
My preference for my own contributions will always be the commit level, as that gives me the most control around disclosure and tracking. But I'm not forcing that on anyone else, as I know that enforcing Git commit standards is a battle.
Attributing the AI usage at the PR level is a great starting point, as projects like Ghostty have recently required, but as large projects like Zig move away from GitHub, any PR-based metadata will be lost as part of the move.
Keeping this metadata at the commit level is best for the longevity of the project, and for the ability of others to review the metadata themselves.
Here's to the future
This workflow works for me, and I'm glad I've been doing it. It gives me the ability to go back and see where I've been using AI, and I'm sure I'll be able to produce some interesting stats about the number of commits I'm authoring with AI over time.
I'm sure this workflow will change over time, especially as I lean on AI tools more, and I'm sure this post will later be revisited as a (2027 edition) or (winter 2026 edition).
Does this make you stop and think about how you're using AI? About some of the potential legal issues you weren't aware of, and how you may want to consider future-proofing your contributions going forwards?