Worries about Open Source in the age of LLMs
On Saturday, I was listening to an episode of Changelog and Friends and Jerod wondered whether Open Source would be needed/used in the world of Large Language Models (LLMs). He's mentioned this before - not necessarily "ooh this is great!" but considering whether that's the trend - so I've been noodling on it in the background for a bit.
Then, last night I was reading The fate of "small" open source by Nolan Lawson and realised I needed to get my thoughts down.
I know I'm not at all the first person to be thinking about this, but I wanted to get my thoughts out there - I'd be interested to hear your thoughts.
I'm very pro Open Source. My ideal licensing choice is the AGPL-3.0, for maximal user freedom, with no copyright-assignment Contributor License Agreement, but that isn't always possible (due to enterprises' fears of the license), so I'll otherwise use the Apache-2.0.
(Aside for those unaware: "Open Source" refers to projects that use a license that follows the Open Source Definition and is "blessed" by the Open Source Initiative, whereas "open source" is everything else, which could include "source available" options, licenses like The Elastic License v2, All-Rights-Reserved, the Blue Oak Model License, and "Fair Source", to name a few. For simplicity, I'll use "open source" below to refer to all code that is shared so people can at least read it, but maybe not modify it)
Like so many people in tech, I owe a lot to Open Source. If you consider my career trajectory and where I am right now, it has irrevocably changed where I've gone, and I'm so happy I've been able to be part of it, learn from others, shape some small part of the ecosystem, and hopefully make space for others to get some of the benefits I have.
This is exactly why I'm worried about where the ecosystem is going with the rise of LLMs and AI Agents - as Nolan mentions, it's looking like a lot of code that would previously be a composable library is now going to be re-written in many projects.
(Aside: do we think it's worth having 1000 developers asking an LLM to generate the same 100-line piece of code, when we could instead use a dependency for it? Considering the energy costs alone, that seems like a significant point in favour of code reuse)
I agree with the general viewpoint that "small" projects - maybe up to 100 lines of code - are arguably not worth having as a separate dependency, and that replacing them with inlined code could be fine. They maybe don't change that often, or have security patches to keep on top of.
Similarly with the Go Proverb "A little copying is better than a little dependency" (which I dislike): copying code from other projects leads to secret dependencies - you still depend on some external code written under a given license (has the copier checked it's an approved license? That you're complying with that license accordingly?) but you now lose e.g. package manager metadata that would flag this to tools doing surface-level investigation.
(I've worked on the team looking after open source license compliance, and have been interested in licenses for the best part of a decade - trust me, it's way harder than you'd expect to get engineers - even ones highly engaged with licensing - to care about doing the right thing!)
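To make that concrete, here's a deliberately simplified sketch (npm ecosystem; the package and version are illustrative): a dependency declared in a manifest carries machine-readable metadata - including the license published by its author - that compliance tooling can walk and report on, whereas the same hundred lines pasted into your own source tree carry none of that.

```json
{
  "name": "my-app",
  "version": "1.0.0",
  "dependencies": {
    "left-pad": "^1.3.0"
  }
}
```

A surface-level scanner (for instance, tools like license-checker in the npm world) can read `node_modules` and report each dependency's declared license; an inlined copy of the same function is invisible to it.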
I'll note that naturally I'm a little biased towards continuing to use "real" dependencies, given my work on Renovate, but it's still fair for me to ask "how are you planning on keeping an eye on upstream updates to that file/function?"
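For context, with a real dependency that question has a well-trodden answer: a bot like Renovate reads the package manifest and raises pull requests when upstream publishes new versions. The configuration below is a minimal sketch of the documented onboarding default, not necessarily what any given repository needs - the point is that there's no equivalent manifest for a function an LLM inlined for you.

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"]
}
```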
When we inline these dependencies, we also lose the opportunity to share and grow. With open source, if the upstream project doesn't want your feature you can keep your fork and use that. What will we do when we want to share our forks of LLM-rewritten open source code? Will anyone even care, given they've got their own custom function for the same logic, so they could never pull your changes in?
Losing out on this opportunity for sharing and growth is a huge loss, especially as it will reduce the ability for folks to learn the great interpersonal skills of working across timezone boundaries and companies, dealing with the "rough edges" that us squishies have, and building something that leads towards a little bit more social good in the world.
I do want to note that I'm probably best described as "cautiously skeptical" about AI, but am trying to "kick the tyres" and understand more about it, so I can have a more informed opinion. If you feel a similar level of skepticism, I'd recommend this episode of Fallthrough with Steve Klabnik.
Personally, my usage still sits at odds with the legal ramifications of using LLM-generated code (which I'd best describe as "copyright laundering"), and so I'm making it very clear which parts of code are LLM-generated, to make it easier to rip out, as well as making it clear to reviewers and myself where there may be code that needs increased scrutiny.
As an industry, we're hoping very hard that we can "own" output from the "plagiarism machine", and I'm sure companies like OpenAI, Microsoft and Anthropic are going to work as hard as possible to win the first copyright lawsuits about LLM-derived content. Worryingly, this all feels like it may be "too big to fail", and the risk of not being able to own the copyright is considerable.
(Related: Practical AI had a podcast episode about using AI to generate every possible combination of musical melody)
I know many maintainers who are taking their code off GitHub.com and moving to a smaller platform (one which is not training on their data), or making their code private (but still Open Source) altogether. I've personally had to be a bit more restrictive of traffic to this website, given the significantly increased level of scraping I'm seeing.
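For anyone wanting to do something similar, one common first step - a sketch only, and not necessarily what I've done here, since it relies on crawlers actually honouring it - is to disallow the published AI crawler user agents in robots.txt:

```text
# Block OpenAI's and Common Crawl's crawlers (user agent tokens as published by each project)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Scrapers that ignore robots.txt need rate limiting or blocking at the server or CDN level instead.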
Let's also consider the endgame of this shift - if we end up "killing" open source as a community by generating everything from scratch each time, what will LLMs train on? Will they start to get worse and worse, given they're only trained on their own content and code?
Or is this then where AI companies will need to start training on your company's "proprietary" codebase? Will you then empathise with the maintainers who've been crying out about this for years?
The for-profit drive is ruining a lot of the great things about open source, and I really hope we the people prevail.
Open source is rad - do more of it.
As I said, I know I'm not at all the first person to be thinking about this - but what do you think?