You can still use the IGP, which might be faster in some cases.
- 3 Posts
- 70 Comments
Oh actually that’s a great card for LLM serving!
Use the llama.cpp server from source, it has better support for Pascal cards than anything else:
https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md
Gemma 3 is a hair too big (like 17-18GB), so I’d start with InternVL 14B Q5K XL: https://huggingface.co/unsloth/InternVL3-14B-Instruct-GGUF
Or Mixtral 24B IQ4_XS for more ‘text’ intelligence than vision: https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF
I’m a bit ‘behind’ on the vision model scene, so I can look around more if they don’t feel sufficient, or walk you through setting up the llama.cpp server. Basically it provides an endpoint which you can hit with the same API as ChatGPT.
1650
You mean GPU? Yeah, it’s good, I was strictly talking about purchasing a laptop for LLM usage, as most are less than ideal for the money. Laptop vram pools are relatively small and SO-DIMMS are usually very slow.
Things will get much better once the “Max” AMD SKUs proliferate.
Yeah, just paying for LLM APIs is dirt cheap, and they (supposedly) don’t scrape data. Again I’d recommend Openrouter and Cerebras! And you get your pick of models to try from them.
Even a framework 16 is not good for LLMs TBH. The Framework desktop is (as it uses a special AMD chip), but it’s very expensive. Honestly the whole hardware market is so screwed up, hence most ‘local LLM enthusiasts’ buy a used RTX 3090 and stick them in desktops or servers, as no one wants to produce something affordable apparently :/
I was a bit mistaken, these are the models you should consider:
https://huggingface.co/mlx-community/Qwen3-4B-4bit-DWQ
https://huggingface.co/AnteriorAI/gemma-3-4b-it-qat-q4_0-gguf
https://huggingface.co/unsloth/Jan-nano-GGUF (specifically the UD-Q4 or UD-Q5 file)
they are state-of-the-art at this size, as far as I know.
8GB?
You might be able to run Qwen3 4B: https://huggingface.co/mlx-community/Qwen3-4B-4bit-DWQ/tree/main
But honestly you don’t have enough RAM to spare, and even a small model might bog things down. I’d run Open Web UI or LM Studio with a free LLM API, like Gemini Flash, or pay a few bucks for something off openrouter. Or maybe Cerebras API.
…Unfortunely, LLMs are very RAM intensive, and >4GB (more realistically like 2GB) is not going to be a good experience :(
Actually, to go ahead and answer, the “fastest” path would be LM Studio (which supports MLX quants natively and is not time intensive to install), and a DWQ quantization (which is a newer, higher quality variant of MLX models).
Hopefully one of these models, depending on how much RAM you have:
https://huggingface.co/mlx-community/Qwen3-14B-4bit-DWQ-053125
https://huggingface.co/mlx-community/Magistral-Small-2506-4bit-DWQ
https://huggingface.co/mlx-community/Qwen3-30B-A3B-4bit-DWQ-0508
https://huggingface.co/mlx-community/GLM-4-32B-0414-4bit-DWQ
With a bit more time invested, you could try to set up Open Web UI as an alterantive interface (which has its own built in web search like Gemini): https://openwebui.com/
And then use LM Studio (or some other MLX backend, or even free online API models) as the ‘engine’
Alternatively, especially if you have a small RAM pool, Gemma 12B QAT Q4_0 is quite good, and you can run it with LM Studio or anything else that supports a GGUF. Not sure about 12B-ish thinking models off the top of my head, I’d have to look around.
Honestly perplexity, the online service, is pretty good.
As for local running, one question first: how much RAM does your Mac have? This is basically the factor for what model you can and should run.
I don’t understand.
Ollama is not actually docker, right? It’s running the same llama.cpp engine, it’s just embedded inside the wrapper app, not containerized. It has a docker preset you can use, yeah.
And basically every LLM project ships a docker container. I know for a fact llama.cpp, TabbyAPI, Aphrodite, Lemonade, vllm and sglang do. It’s basically standard. There’s all sorts of wrappers around them too.
You are 100% right about security though, in fact there’s a huge concern with compromised Python packages. This one almost got me: https://pytorch.org/blog/compromised-nightly-dependency/
This is actually a huge advantage for llama.cpp, as it’s free of python and external dependencies by design. This is very unlike ComfyUI which pulls in a gazillian external repos. Theoretically the main llama.cpp git could be compromised, but it’s a single, very well monitored point of failure there, and literally every “outside” architecture and feature is implemented from scratch, making it harder to sneak stuff in.
OK.
Then LM Studio. With Qwen3 30B IQ4_XS, low temperature MinP sampling.
That’s what I’m trying to say though, there is no one click solution, that’s kind of a lie. LLMs work a bajillion times better with just a little personal configuration. They are not magic boxes, they are specialized tools.
Random example: on a Mac? Grab an MLX distillation, it’ll be way faster and better.
Nvidia gaming PC? TabbyAPI with an exl3. Small GPU laptop? ik_llama.cpp APU? Lemonade. Raspberry Pi? That’s important to know!
What do you ask it to do? Set timers? Look at pictures? Cooking recipes? Search the web? Look at documents? Do you need stuff faster or accurate?
This is one reason why ollama is so suboptimal, with the other being just bad defaults (Q4_0 quants, 2048 context, no imatrix or anything outside GGUF, bad sampling last I checked, chat template errors, bugs with certain models, I can go on). A lot of people just try “ollama run” I guess, then assume local LLMs are bad when it doesn’t work right.
Totally depends on your hardware, and what you tend to ask it. What are you running? What do you use it for? Do you prefer speed over accuracy?
TBH you should fold this into localllama? Or open source AI?
I have very mixed (mostly bad) feelings on ollama. In a nutshell, they’re kinda Twitter attention grabbers that give zero credit/contribution to the underlying framework (llama.cpp). And that’s just the tip of the iceberg, they’ve made lots of controversial moves, and it seems like they’re headed for commercial enshittification.
They’re… slimy.
They like to pretend they’re the only way to run local LLMs and blot out any other discussion, which is why I feel kinda bad about a dedicated ollama community.
It’s also a highly suboptimal way for most people to run LLMs, especially if you’re willing to tweak.
I would always recommend Kobold.cpp, tabbyAPI, ik_llama.cpp, Aphrodite, LM Studio, the llama.cpp server, sglang, the AMD lemonade server, any number of backends over them. Literally anything but ollama.
…TL;DR I don’t the the idea of focusing on ollama at the expense of other backends. Running LLMs locally should be the community, not ollama specifically.
brucethemoose@lemmy.worldto Ask Lemmy@lemmy.world•Whats a better name for 'graphics cards' that describes the kind of computational work it does2·9 days agoWell not everyone in the machine learning space is an AI Bro, either. Many (most?) researchers see Altman et al. as snake-oil grifters.
Same with the P2P/networking junkies. They didn’t ask for a mountain of pyramid schemes.
brucethemoose@lemmy.worldto Ask Lemmy@lemmy.world•Whats a better name for 'graphics cards' that describes the kind of computational work it does66·9 days agoThey are GPUs.
All of them, even the H100, B100, and MI300X all have texture units, pixel shaders, everything. They are graphics cards at a low level. Only the MI300X is missing ROPs, but the Nvidia cards have them (and can run realtime games on Linux), and they all can be used in Blender and such.
The compute programming languages they use are, fundamentally, hacked up abstractions to map to the same GPU hardware in consumer stuff.
That’s the whole point, they’re architected as GPUs so that they’re backwards compatible, as everything’s built on the days when consumer gaming GPUs were hacked to be used for compute.
Are there more dedicated accelerators? Yes. They’re called ASICs, or application specific integrated circuits. This is technically a broad term, but mostly its connotation is very purpose made compute.
brucethemoose@lemmy.worldto Ask Lemmy@lemmy.world•How many of you use Lemmy and ONLY use Lemmy vs Reddit?3·10 days agoOn the two subs I frequented:
-
/r/thelastairbender is just cultish and shallow now. I abandoned it. But it’s painful for me, as this is like the only sane place left the fandom has any critical mass. /c/thelastairbender is nice, but very quiet.
-
/r/localllama Has… lost its intelligence? Like no one seems to experiment or talk technically anymore, good talk seems to be on github, or shattered across Discords, while the ‘critical mass’ is in the AI Bro black hole of Twitter and Linkedin. I read it, but never post anymore. localllama here is better, but smaller and downvoted to hell.
Also, I’ve been shadowbanned on like 4 accounts in 3 different IPs/machines, no explanation, no recourse. I never post anything political or even remotely provocative (unless links to Lemmy count) and only visit those two subs, so… Yeah, kinda sick of that.
-
Funny thing is correct json is easy to “force” with grammar-based sampling (aka it literally can’t output invalid json) + completion prompting (aka start with the correct answer and let it fill in whats left, a feature now depreciated by OpenAI), but LLM UIs/corporate APIs are kinda shit, so no one does that…
A conspiratorial part of me thinks that’s on purpose. It encourages burning (read: buying) more tokens to get the right answer, encourages using big models (where smaller, dumber, (gasp) prompt-cached open weights ones could get the job done), and keeps the users dumb. And it fits the Altman narrative of “we’re almost at AGI, I just need another trillion to scale up with no other improvements!”
Some mod packs are just unstable.
Could be a specific area/item crashing it, and TBH all modded servers need regular maintenance like mob culling and regular server restarts. But it could also be a problem with your host, yeah, and the M3 Pro is going to be way faster than any CPU your host has. Plenty of RAM too.
I’d recommend running it with GraalVM EE as your JVM.
I tend to gravitate towards Enigmatica and ATM myself (as their devs/dev process is pretty good), but not sure about Ozone or skyblock mod packs.
It would be a fast host (and client) for more heavily modded Minecraft.
You could self-host an LLM, but unfortunately 18GB total system RAM (so less than 14GB usable by the GPU?) is pretty skinny.
You can do some stuff to an iPhone with one. Like sideloading, I think?
What @mierdabird@lemmy.dbzer0.com said, but the adapters arent cheap. You’re going to end up spending more than the 1060 is worth.
A used desktop to slap it in, that you turn on as needed, might make sense? Doubly so if you can find one with an RTX 3060, which would open up 32B models with TabbyAPI instead of ollama. Some configure them to wake on LAN and boot an LLM server.