

TBH I don’t know a single person that’s ever used Truth Social, or many that even know what it is. I think it’s more of a niche, not something affecting the masses.
I have very religious family who repeatedly told my 90-year-old grandma not to get vaccinated in the depths of COVID-19. I have other, not-at-all-religious family who works as a nurse… and is anti-vaccine.
It’s like a parody.
…But it is no joke. I can answer questions about them if you want.
If you’re wondering why, it’s because many Americans are inundated with really scary social media and TV. That part of my family is constantly on Facebook, watching Fox, doomscrolling whatever. Even their church preaches some really, uh, interesting things now.
It’s this way because there’s a lot of profiteering. For example, the current head of the FBI is apparently selling and promoting some kind of “brave anti-vaccine” health merchandise. The current head of the US health department made a lot of money and fame off vaccine skepticism. And their church clergy is crooked in ways I can’t even publicly discuss.
And Facebook still funds lots of its development afaik
React and PyTorch are pretty neat. Llama is too. So are some smaller software efforts.
Not that I disagree with the meme at all, but there is a tiny branch at Facebook that is symbiotic.
Honestly this is an age-old tradition. Elites, from the Founders’ associates to the Clintons, Bushes, Obamas and such, have probably profited off some insider information to some extent. But I dunno; SEC regulation for political billionaires is not my area of expertise, though it seems like his family is carrying on a lot of business.
Trump’s memecoin is already a ridiculously flagrant manipulation though: https://www.reuters.com/markets/currencies/trumps-meme-coin-made-nearly-100-million-trading-fees-small-traders-lost-money-2025-02-03/
Small insider trades or charity shenanigans pale in comparison.
If I were him… Well, why mess with insider trading when he can just blatantly profit in the open?
Yes! Try this model: https://huggingface.co/arcee-ai/Virtuoso-Small-v2
Or the 14B thinking model: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B
But for speed and coherence, instead of ollama, I’d recommend running it through Aphrodite or TabbyAPI as a backend, depending on whether you prioritize speed or long inputs. They both act as generic OpenAI endpoints.
I’ll even step you through it and upload a quantization for your card, if you want, as it looks like there’s not a good-sized exl2 on huggingface.
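For reference, here’s roughly what talking to either backend looks like once it’s running. Both expose an OpenAI-compatible API, so the standard openai Python client works; the URL, API key, and model name below are placeholders for whatever your local setup uses, not anything specific to these servers:

```python
# Minimal sketch: Aphrodite and TabbyAPI both expose an OpenAI-compatible API,
# so the standard openai client works against either. The base_url, api_key,
# and model name are placeholders for whatever your local server reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",  # your local server's address/port
    api_key="dummy-key",                  # TabbyAPI uses its own key; adjust as needed
)

response = client.chat.completions.create(
    model="Virtuoso-Small-v2",            # whatever model name your backend serves
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ],
    max_tokens=512,
    temperature=0.7,
)

print(response.choices[0].message.content)
```

Any frontend that speaks the OpenAI API can point at the same endpoint.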
I mean, if you have a huge GPU, sure. Or at least 12GB of free VRAM, or a big Mac.
Local LLMs for coding are kind of a niche because most people don’t have a 3090 or 7900 lying around, and you really need 12GB+ of free VRAM for the models to start being “smart” and even worth using over free LLM APIs, much less cheap paid ones.
But if you do have the hardware and the time to set a server up, the Deepseek R1 models or the FuseAI merges are great for “slow” answers where the model thinks things out before replying. Qwen 2.5 32B Coder is great for quick answers on 24GB of VRAM. Arcee’s 14B is great for 12GB of VRAM.
Sometimes running a small model on a “fast” but less VRAM-efficient backend is better for stuff like Cursor code completion.
My friend, the Chinese labs were releasing amazing models all last year; it just didn’t make headlines.
Tencent’s Hunyuan Video is incredible. Alibaba’s Qwen is still a go-to local model. I’ve used InternLM pretty regularly… Heck, Yi 34B was awesome in 2023, as the first decent long-context local model.
…The Janus models are actually kind of meh unless you’re captioning images, and FLUX/Hunyuan Video are still king in the diffusion world.
As implied above, the raw format fed to/output from Deepseek R1 is:
<|begin▁of▁sentence|>{system_prompt}<|User|>{prompt}<|Assistant|><think>The model rambles on to itself here, “thinking” before answering</think>The actual answer goes here.
It’s not a secret architecture, and there’s no window into its internal state. This is just a regular model trained to give an internal monologue before the “real” answer.
The point I’m making is that the monologue is totally dependent on the system prompt, the user prompt, and, honestly, a “randomness” factor. It’s not actually a good window into the LLM’s internal “thinking”; you’d want to look at specific tests and logit spreads for that.
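To make the “it’s all one text stream” point concrete, here’s a rough sketch of assembling that raw prompt by hand and splitting the monologue from the answer afterward. The special-token strings are the ones quoted above; how you actually run generation depends on your backend, so the example output here is just a stand-in:

```python
import re

# Sketch only: build DeepSeek R1's raw prompt format by hand, then split the
# "thinking" monologue from the final answer. The special token strings are
# the ones quoted above; actual generation depends on your backend.
def build_r1_prompt(system_prompt: str, user_prompt: str) -> str:
    return (
        "<|begin▁of▁sentence|>" + system_prompt +
        "<|User|>" + user_prompt +
        "<|Assistant|>"
    )

def split_thinking(raw_output: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>(.*)", raw_output, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", raw_output.strip()  # model skipped the monologue

# Stand-in output, just to show the split:
raw = "<think>Hmm, the user wants X, so...</think>Here is the actual answer."
thinking, answer = split_thinking(raw)
print("monologue:", thinking)
print("answer:", answer)
```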
Zero context to this…
My experience with Deepseek R1 is that it’s quite “unbound” by itself, but the chat UI (and maybe the API? Not 100% sure about that) does seem to be more aligned.
All the open Chinese LLMs (Alibaba’s Qwen, Tencent’s models, InternLM, Yi, GLM) have been like this, rambling on about Tiananmen Square as much as they can, especially if you ask in English. The Chinese tech devs seem to like “having their cake and eating it,” complying with the government through the most publicly visible portals while letting the model rip underneath.
Contrast this with OpenAI’s approach of opaquely censoring the model, probably at the weights level, which neuters its intelligence and prose even in other tasks. Oh, and keeping every single detail closed and proprietary.
The real excuse from Elon’s fans is “It’s just a joke.” Or “He’s trolling libs.”
Which is almost as bad, even if true.
It reminds me of school bullies who would make cutting, abusive “jokes” and follow them up with “Just kidding, bruh” when they didn’t land right. And these are some of the worst human beings I have ever personally encountered.
How can people worship something like that at such scale? Like, I wouldn’t even wish that on the most raging Nazi, it’s worse than breaking their jaw.
Honestly, most LLMs suck at the full 128K. Look up benchmarks like RULER.
In my personal tests over API, Llama 70B is bad out there. Qwen (and any fine-tune based on Qwen Instruct, with maybe an exception or two) not only sucks but is impractical past 32K once its internal rope scaling kicks in. Even GPT-4 is bad out there, with Gemini and some other very large models being the only usable ones I found.
So, ask yourself… do you really need 128K? Because 32K-64K is a boatload of code with modern tokenizers, and that is perfectly doable on a single 24GB GPU like a 3090 or 7900 XTX, and that’s where models actually perform well.
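If you want to sanity-check that, just tokenize your own codebase and see how far it actually goes. A minimal sketch, assuming the Hugging Face transformers tokenizer for a Qwen coder model; the model ID and source path are just examples:

```python
# Rough sketch: count how many tokens a source tree actually uses with a modern
# tokenizer, to see whether you need 128K context or 32K is plenty.
# The model ID and path are examples; swap in whatever you actually run.
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

total = 0
for path in Path("my_project/src").rglob("*.py"):  # adjust the glob to your language
    text = path.read_text(errors="ignore")
    total += len(tokenizer.encode(text))

print(f"{total} tokens total")  # compare against 32K / 64K / 128K
```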
Late to this post, but shoot for an AMD Strix Halo or Nvidia Digits mini PC.
Prompt processing is just too slow on Apple, and the Nvidia/AMD backends are so much faster with long context.
Otherwise, your only sane option for 128K context is a server with a bunch of big GPUs.
Also… what model are you trying to use? You can fit Qwen coder 32B with like 70K context on a single 3090, but honestly it’s not good above 32K tokens anyway.
I interpreted this as “depression sure is great” at first.
…Still valid.
Hard disagree, lol. That’s a classic.
To go into more detail:
Exllama is faster than llama.cpp with all other things being equal.
exllama’s quantized KV cache implementation is also far superior, and nearly lossless at Q4, while llama.cpp’s is nearly unusable at Q4 (and needs to be turned up to Q5_1/Q4_0 or Q8_0/Q4_1 for good quality). Rough numbers on why this matters are in the sketch at the end of this comment.
With ollama specifically, you get locked out of a lot of knobs like this enhanced llama.cpp KV cache quantization, more advanced quantization (like imatrix IQ quantizations or the ARM/AVX-optimized Q4_0_4_4/Q4_0_8_8 quantizations), advanced sampling like DRY, batched inference, and so on.
It’s not about evidence or options… it’s missing features; that’s my big issue with ollama. I simply get far worse, and far slower, LLM responses out of ollama than TabbyAPI/EXUI on the same hardware, and there’s no way around it.
Also, I’ve been frustrated with implementation bugs in llama.cpp specifically, like how Llama 3.1 (for instance) was bugged past 8K at launch because llama.cpp didn’t properly support its rope scaling yet. Ollama inherits all these quirks.
I don’t want to go into the issues I have with the ollama devs’ behavior, though, as that’s way more subjective.
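To put rough numbers on why KV cache bit-width matters so much at long context: cache size scales with layers × KV heads × head dim × context length × bits. A back-of-the-envelope sketch, where the architecture numbers are assumptions for a generic ~32B GQA model rather than any specific checkpoint:

```python
# Back-of-the-envelope KV cache sizing. The architecture numbers below are
# assumptions for a generic ~32B GQA model, not exact figures for any one model.
n_layers = 64
n_kv_heads = 8        # grouped-query attention keeps this small
head_dim = 128
context_len = 65536   # 64K tokens

def kv_cache_gib(bits_per_element: float) -> float:
    # 2x for keys and values
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * bits_per_element / 8 / 1024**3

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label}: {kv_cache_gib(bits):.1f} GiB")
# ~16 GiB at FP16 vs ~4 GiB at Q4 for this hypothetical config: that's the
# difference between fitting 64K context alongside the weights or not.
```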
It’s less optimal.
On a 3090, I simply can’t run Command-R or Qwen 2.5 32B well at 64K-80K context with ollama. It’s slow even at lower context, and the lack of DRY sampling and some other things majorly hits quality.
Ollama is meant to be turnkey, and that’s fine, but LLMs are extremely resource-intensive. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.
Even on CPU-only setups, you are missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, or the more advanced sampling kobold.cpp offers, or more fine-grained tuning of flash attention configs, or batched inference, just to start.
And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.
Your post is suggesting that the same models with the same parameters generate different results when run on different backends.
Yes… sort of. Different backends support different quantization schemes, for both the weights and the KV cache (the context). There are all sorts of tradeoffs.
There are even more exotic weight quantization schemes (AQLM, VPTQ) that are much more VRAM-efficient than llama.cpp’s or exllama’s, but I skipped mentioning them (unless someone asked) because they’re so clunky to set up.
Different backends also support different samplers. exllama and kobold.cpp tend to be at the cutting edge of this, with things like DRY for better long-form generation, or grammar-constrained output.
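For a rough sense of scale on the weight side, memory is basically parameter count × bits per weight. A quick sketch in pure arithmetic; real quantized files add some overhead for scales and metadata, so treat these as lower bounds:

```python
# Pure arithmetic: approximate weight memory for a ~32B-parameter model at
# different bits-per-weight. Real quantized files carry extra metadata,
# so these are lower bounds.
params = 32e9

for label, bpw in [("FP16", 16), ("~8 bpw", 8), ("~5 bpw", 5), ("~4 bpw", 4), ("~3 bpw (exotic)", 3)]:
    gib = params * bpw / 8 / 1024**3
    print(f"{label:>16}: {gib:5.1f} GiB")
```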
So there are multiple ways to split models across GPUs (layer splitting, which runs one GPU and then the other; expert parallelism, which puts different experts on different GPUs), but the one you’re interested in is “tensor parallelism,” which splits the work of each layer across both cards.
This requires a lot of communication between the GPUs, and NVLink speeds that up dramatically.
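If that’s abstract, here’s the idea in miniature, with numpy arrays standing in for the GPUs: each “device” holds a slice of a layer’s weight matrix, computes its partial result, and the partial results get gathered back together. That gather step is the communication NVLink accelerates.

```python
import numpy as np

# Toy illustration of tensor parallelism: split a weight matrix column-wise
# across two "devices", multiply each shard independently, then concatenate.
# On real GPUs that final gather is the communication step NVLink speeds up.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096))      # one token's activations
W = rng.standard_normal((4096, 8192))   # a full weight matrix

W0, W1 = np.split(W, 2, axis=1)         # shard columns across "GPU 0" and "GPU 1"
y0 = x @ W0                             # partial result on "GPU 0"
y1 = x @ W1                             # partial result on "GPU 1"
y = np.concatenate([y0, y1], axis=1)    # gather over the interconnect

assert np.allclose(y, x @ W)            # same answer as doing it on one device
```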
It comes down to this: If you’re more interested in raw generation speed, especially with parallel calls of smaller models, and/or you don’t care about long context (with 4K being plenty), use Aphrodite. It will ultimately be faster.
But if you simply want to stuff the best/highest-quality model you can into VRAM, especially at longer context (>4K), use TabbyAPI. Its tensor parallelism only works over PCIe, so it will be a bit slower, but it will still stream text much faster than you can read. It can simply hold bigger, better models at higher quality in the same 48GB VRAM pool.
Yeah, but Facebook and Twitter still have the critical mass.
Most of my family’s Trumpism is grassroots, either from their church or other real-world connections.