How I'm using Local Large Language Models

Prompted by Joe and Marcus after my last post about AI, I thought I'd share how I'm currently using locally-hosted Large Language Models (LLMs), as I'd teased it a little bit in the post.

The TL;DR is "not that much".

Why did I start?

Before I get into how I'm using them now, it's worth explaining how I got started using them.

In late 2024, some things were happening at work that made me feel like:

Ralph Wiggum from The Simpsons - in a crossover episode from Family Guy - where Ralph is at the back of the school bus while Peter Griffin and Homer Simpson are fighting (offscreen from this screenshot), and chuckles and says "I'm in danger"

Because of this, I started to look around at what the job market was like (it wasn't particularly great) and consider what I wanted to do next.

One thing that worried me - on top of the regular interviewing nerves - was how much focus the industry had on AI usage. At the time, the word "agent" wasn't something anyone would associate with AI for another month or so, but there were still many areas where I was behind.

With this in mind, I decided that a key area of interview prep would be to learn a bit more about AI and LLMs, and at least have some level of familiarity, even if it wasn't enough to be a practitioner.

I'd not really been using AI much because I did (and still do) have environmental, legal and ethical concerns about its usage, and at the time, neither chat-based interactions nor in-editor code completion was really interesting to me.

(In particular, I find that in-editor code completion can be particularly distracting for my ADHD)

Instead of risking a load of money on something I didn't particularly see the value in, spending it just trying to work out exactly how these things worked, I had a better idea.

My (Arch BTW) Linux desktop was running an NVIDIA Titan X that I bought with my starting bonus at Capital One in 2016, and I'd been pondering an upgrade for the last year or so. As I game in 4K, it'd been struggling to hit the top settings in recent games.

I took the chance to upgrade to the current top-of-the-line AMD GPU, which would benefit both my hobby and my work-adjacent hobby.

(I decided not to go for another NVIDIA - because they were hella expensive, and as a way to continue to support AMD's great fight against their competitors)

Local vs hosted LLMs?

As noted above, especially when I started out, I wasn't sure how best to make my prompts as effective as possible, and wanted to trial things without accidentally spending a chunk of my own personal money trying to "git good" at using AI.

Although I could have used GitHub Copilot (as I had Copilot Pro as a (prolific?) Open Source maintainer), I decided that going for a local-only approach would also give me a bit more understanding of what's going on, and how running models locally actually works.

(Although I don't have a reference to this, I remember Simon Willison mentioning a couple of times that testing LLMs out with local models is a great way to see things like hallucinations and get a feel for what the tech actually is)

I've found that if it's something that doesn't need to be the most accurate, or the most perfect code, or it's something where I can wait ~60 seconds for a response and may need a few back-and-forths, I'll use a local model.

I'm also using it in cases where the questions I'm asking or the data I'm providing absolutely shouldn't be sent to a hosted model provider (even if we've signed the enterprise agreement), for instance while filling in my tax return.

What's my software stack?

What's my hardware stack?

  • Personal Linux desktop: AMD Radeon RX 7900 XTX (Sapphire NITRO+, 24GB VRAM)
  • Work laptop: MacBook Pro M4 Pro (48GB RAM)

(I opted for a Mac at work - even though I dislike them and would've loved to continue running Linux - because I wanted the option of doing more local LLM things while at work, which is easier on Apple's more tuned hardware)

I even set up Tailscale so I could use OpenWebUI on my desktop while at Batch Bunch.

What works best for my hardware?

I've not really put any effort into understanding which models work best for my hardware, and instead opted for a few models I could fit in the VRAM. I had started looking into things like quantization, but put that on hold.
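The "will it fit in VRAM" check I've been doing by feel can be sketched as a rough back-of-the-envelope calculation. The 20% overhead factor below is an illustrative assumption for KV cache and activations, not a measured figure:

```python
# Rough VRAM estimate for a quantized model: weights plus some headroom.
# The overhead factor is an illustrative assumption, not a measured figure.

def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    """Approximate VRAM needed in GB for the quantized weights,
    with ~20% extra for KV cache and activations."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 20B-parameter model at 4-bit quantization:
print(round(estimate_vram_gb(20, 4), 1))  # 12.0 - fits comfortably in 24GB
# A 30B-parameter model at 4-bit quantization:
print(round(estimate_vram_gb(30, 4), 1))  # 18.0 - tight, but still fits
```

This is only a heuristic; real usage also depends on context length and the runtime's own allocations, which is exactly the sort of thing tools like llm-checker aim to work out properly.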

I've recently seen that llm-checker and llmfit exist, which would be a nice way to automagically find out what would work well for me, as I've kinda tested a few different options, but don't have any good heuristics or reasons for the model choices I've picked.

What models am I using?

For the most part, gpt-oss:20b is the model I use for most of my local usage, interspersed with qwen3:30b to mix things up.

As noted, I use local models for very little, which is why I don't really test out many others.

I'll take advantage of OpenWebUI's "ask multiple models" as a way to see what's different between them when I'm happy waiting a little longer for an answer.

In the past I've used qwen2.5-coder and qwen3-coder.

What am I using it for?

(It was very interesting looking back through my chats in OpenWebUI to see what sort of things I've been asking)

Personally

  • "Can you review this job contract for me?"
    • Surprising to a lot of people, I actually read legal contracts I need to sign
    • But I also used this when starting at Mend as a way to see if there's anything else I may have missed considering a few areas I wanted it to focus on ("risks around IP ownership", etc)
    • The PDF I uploaded had a few key bits of PII in it, so I was very glad that I didn't upload it anywhere else, like I know some folks do
  • "What murder mystery TV shows are like The Residence?"
    • I think a bad example, given the knowledge cutoff for the model
  • "Given these outputs from about:config and about:support, why is Firefox not using Hardware Acceleration?"

At work

  • "given JSON of this format, how would I write a jq to get these pieces of data out?"
  • as I was thinking about writing an agent for Renovate config changes, I'd been fleshing out some thoughts on Slack, so I took a couple of my messages and asked a couple of models (in a couple of different ways) what I should consider when doing this, or what a good starting point for the tech stack would be
  • "what options do I have for stacked PRs in Git?"
  • "can you convert these Markdown API docs into an OpenAPI spec for me?"
  • "can you review this PR for me", given a diff
  • "how would you go about solving {a given bug report}"
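The jq-style question above is the kind of thing a local model handles well. As a sketch in Python of the sort of extraction I'd be asking for (the JSON shape here is a made-up example, not the actual data I pasted into a prompt):

```python
import json

# Hypothetical input, shaped like the JSON I might paste into a prompt.
raw = """
{
  "repositories": [
    {"name": "oapi-codegen", "stars": 5000, "topics": ["go", "openapi"]},
    {"name": "renovate-graph", "stars": 120, "topics": ["renovate"]}
  ]
}
"""

data = json.loads(raw)
# Roughly equivalent to a jq filter like: .repositories[] | {name, stars}
names_and_stars = [(r["name"], r["stars"]) for r in data["repositories"]]
print(names_and_stars)  # [('oapi-codegen', 5000), ('renovate-graph', 120)]
```

For one-off extractions like this, the prompt mostly needs the JSON shape and the fields I want out; the model can usually produce a working jq filter from that.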

Looking forward

I'm not sure if my local model usage will increase over time.

I'd like to say it will, but considering I'm generally using LLMs for "I would like an answer sooner rather than later" or "can you please go and implement this thing for me" in parallel with another task, that's unlikely to happen on my local hardware.

Given there are a few local agent tools I can use with Ollama, maybe I'll try and give them a bit more of a go to compare to other tools.

I do think there are significant risks of tying your productivity to an API that you don't control, at a price point that is heavily subsidised, which is why I'm still working hard to keep myself in-the-loop and learning and improving.

Written by Jamie Tanna.

Content for this article is shared under the terms of the Creative Commons Attribution Non Commercial Share Alike 4.0 International, and code is shared under the Apache License 2.0.

#llm #ai #ollama

This post was filed under articles.

Interactions with this post

Below you can find the interactions that this page has had using WebMention.

Have you written a response to this post? Let me know the URL:

Do you not have a website set up with WebMention capabilities? You can use Comment Parade.