

  • I view it differently.

    In the US, there are either megacorps, or “people in garages” who honestly don’t have the resources (or stuff like legal support) to pull off big innovations. They publish cool papers that never get implemented, because they don’t have $200k+ for a bigger test and can’t work on it for a living. Any “garage devs” who get too big get smited or amalgamated into Big Tech gray goo, and whatever was interesting gets lost in oblivion.

    There’s no cooperation, no sharing, either.

    And OpenAI/Anthropic are way more conservative than you’d think. Same with Meta; they want results next quarter. Zuckerberg literally fired the whole Llama team, which put Meta on the AI map and basically founded the open-weights space, after one failed experiment. In other words, I’d argue clueless billionaires and the Tech Bro acolytes surrounding them are poisoning LLM development, and it’s starting to catch up.


    In China, things are different. The GPU sanctions forced gigantic companies like Alibaba or Tencent to be compute-thrifty, but they all seem to have access to suspiciously good training data… I would bet the Chinese govt is helping them under the table. Chinese devs also have an interesting attitude; I would characterize them as “cooperative,” with lots of private forum sharing going on, most models being open weights, and clearly not a lot of desire to censor their models for the government. But they have their own forms of dysfunction too, sometimes copying other firms a little too closely, plus corporate/personal drama like anywhere.


  • Okay, I fudged the part about “for free.” The problem is DeepSeekv4 is literally in preview, and its architecture is so new that engine support for its weights is poor.

    Right this second, you can either pay a few cents to try it from some API (there are many providers, since it’s open weights), or rent a GPU (or maybe a CPU) instance if you don’t trust the public tests and actually want to test resource usage yourself.

    Or you can quantize it and self-host it. I plan to do so on my 128GB RAM/RTX 3090 desktop, which is an affordable config to rent if you don’t have a desktop like that.

    But llama.cpp support is a work-in-progress. Same with other backends like Ktransformers. Realistically your options are:

    • Wait a week, maybe a few weeks, for the llama.cpp/ik_llama.cpp developers to implement the DSV4 architecture.

    • Try one of the janky GPU/Apple forks available right now.

    • Try one of the slightly-less-janky, but slow, CPU-only Chinese forks.

    But once it’s implemented, I’m going to make my own personal IQ3_KS mixed quantization for 128GB desktops, and see how it compares to older architectures myself.
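
    Once support lands, the self-hosting side is pretty mundane. A minimal sketch with llama-cpp-python, where the GGUF filename and the offload/thread numbers are hypothetical placeholders for a 128GB RAM + 24GB 3090 box:

    ```python
    # Minimal sketch: partial GPU offload of a big MoE quant on a 128GB RAM /
    # 24GB RTX 3090 box. Assumes llama.cpp (via llama-cpp-python) has gained
    # DSV4 support; the GGUF filename and layer/thread counts are hypothetical.
    from llama_cpp import Llama

    llm = Llama(
        model_path="DeepSeek-V4-IQ3_KS.gguf",  # hypothetical mixed ~3-bit quant
        n_gpu_layers=12,   # offload what fits in 24GB VRAM; the rest stays in RAM
        n_ctx=8192,        # context length; the KV cache competes for VRAM too
        n_threads=16,      # CPU threads for the layers left in system RAM
    )

    out = llm("Summarize mixture-of-experts inference in one paragraph.", max_tokens=256)
    print(out["choices"][0]["text"])
    ```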


    Another confounding factor: if you’re researching “AI farm inference costs,” that’s a very different question.

    Frugal providers like DeepSeek use complicated schemes to batch requests across many GPUs, with each GPU taking requests in parallel. In other words, the more GPUs they have, the more speed per GPU they can squeeze out. For DeepSeek V3, last I heard, around 300 GPUs was the ideal deployment size…
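
    As a toy illustration of why per-GPU throughput climbs with fleet size (every constant below is made up; only the shape of the curve matters):

    ```python
    # Toy model of batched MoE serving: more GPUs means experts are sharded
    # thinner, leaving more VRAM for KV cache, so each GPU can serve a bigger
    # batch. Every constant here is an illustrative assumption, not a real number.
    VRAM_GB = 80               # per GPU
    WEIGHTS_TOTAL_GB = 640     # hypothetical total expert weights
    KV_PER_REQUEST_GB = 0.5    # hypothetical KV cache per in-flight request
    TOKENS_PER_REQ_PER_S = 30  # hypothetical decode speed per request

    for n_gpus in (16, 64, 300):
        weights_per_gpu = WEIGHTS_TOTAL_GB / n_gpus   # expert-parallel sharding
        free_for_kv = max(VRAM_GB - weights_per_gpu, 0.0)
        batch = int(free_for_kv / KV_PER_REQUEST_GB)  # in-flight requests per GPU
        print(f"{n_gpus:4d} GPUs: {batch:4d} requests/GPU, "
              f"~{batch * TOKENS_PER_REQ_PER_S} tok/s per GPU")
    ```

    The returns flatten out once you’re in the hundreds of GPUs, which is roughly where that ~300 figure comes from.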

    And they aren’t even going to be using Nvidia GPUs anyway. I believe DeepSeek is switching to Huawei for inference.

    But however you slice it, they’re using orders of magnitude fewer resources than Tech Bro providers like OpenAI or Grok, and they have been for over a year.


  • Yes.

    It’s dropping, dramatically.

    Look at the history of open and closed releases, on benchmarks that aren’t totally gamed, and it’s easy to see. LLM capabilities are plateauing, and bigger models are getting more and more niche.

    But inference efficiency is increasing exponentially. Tiny models are getting closer and closer to frontier ones. See: Qwen 27B, and how it can do most of what mega models did just months ago.

    And there’s tons of unpicked efficiency fruit in papers. Bitnet is the big one, but I’ve seen dozens of proofs of concept, yet to be tried in a production model, that are dramatic efficiency boosts.



  • TurboQuant is total baloney.

    It’s just KV cache quantization, and we’ve had all sorts of that for ages. Backends, not just papers, have had 4-bit cache with Hadamard rotation (a major component of TurboQuant), at very low loss, since like 2023.
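
    For the curious, the core trick is small enough to sketch in NumPy. This is a generic illustration of rotate-then-quantize, not TurboQuant’s exact recipe:

    ```python
    # Sketch of the core idea: rotate KV vectors with a Hadamard transform to
    # flatten outliers, then quantize to 4 bits per group. Toy shapes and group
    # size; real backends fuse this into the attention kernels.
    import numpy as np

    def hadamard(n):
        """Naive Sylvester construction; n must be a power of two."""
        H = np.array([[1.0]])
        while H.shape[0] < n:
            H = np.block([[H, H], [H, -H]])
        return H / np.sqrt(n)  # scaled so the rotation is orthonormal

    def quant4(x, group=32):
        """Symmetric 4-bit quantization, one scale per group of values."""
        g = x.reshape(-1, group)
        scale = np.abs(g).max(axis=1, keepdims=True) / 7.0 + 1e-12
        q = np.clip(np.round(g / scale), -8, 7)
        return q, scale

    def dequant4(q, scale, shape):
        return (q * scale).reshape(shape)

    rng = np.random.default_rng(0)
    k = rng.standard_normal((64, 128)).astype(np.float32)
    k[:, 5] *= 20  # inject an outlier channel, the thing rotation helps with

    # Plain 4-bit: the outlier blows up its group's scale, drowning its neighbors.
    q0, s0 = quant4(k)
    err_plain = np.abs(k - dequant4(q0, s0, k.shape)).mean() / np.abs(k).mean()

    # Rotate first: outlier energy spreads across all dims, scales stay sane.
    H = hadamard(128)
    q1, s1 = quant4(k @ H)
    k_back = dequant4(q1, s1, k.shape) @ H.T  # dequantize, then rotate back
    err_rot = np.abs(k - k_back).mean() / np.abs(k).mean()

    print(f"relative error, plain 4-bit:   {err_plain:.4f}")
    print(f"relative error, with rotation: {err_rot:.4f}")
    ```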

    We’ve had proof that Bitnet works for over a year.

    And no one cares. No one uses that kind of quantization because it reduces batched throughput, just like TurboQuant.

    Besides, new architectures (like DeepSeek V4) render it obsolete, as they don’t use a traditional KV cache anymore. I honestly have no idea how TurboQuant became such a meme, other than major astroturfing.


    TL;DR All AI news is total bull. It’s chum for investors.

    You need to look at what the engines, papers and actual LLM weight architectures are doing.





  • It’s not as detrimental as you think.

    If you take a dedicated camera, put it on a tripod, and shoot at like f/22, yeah, you’ll clearly see a speck of dust or a smudge in a shot. But shoot with a wide-open aperture (like your phone usually does), and it’s essentially invisible.
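
    The geometry is easy to sanity-check yourself; the ~1mm filter-stack distance and 4µm pixel pitch below are ballpark assumptions:

    ```python
    # Why dust vanishes wide open: the speck sits on the filter stack, roughly
    # 1mm above the sensor, and its shadow is blurred over ~(distance / f-number).
    # The 1mm stack height and 4µm pixel pitch are ballpark assumptions.
    STACK_UM = 1000.0  # dust-to-sensor distance, in microns
    PIXEL_UM = 4.0     # pixel pitch

    for f_number in (22, 8, 1.8):
        blur_um = STACK_UM / f_number
        print(f"f/{f_number}: shadow spread over ~{blur_um:.0f}µm "
              f"(~{blur_um / PIXEL_UM:.0f}px)")

    # f/22  -> ~45µm  (~11px):  a crisp dark blob you'll notice.
    # f/1.8 -> ~556µm (~139px): contrast so low it disappears into the image.
    ```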

    And then that little imperfection gets AI’d away by all the computational frame-stacking your phone does for every shot.


    In other words, your phone’s images are so processed that one speck of dust doesn’t really matter.

    And even on RAWs from a mirrorless camera, it’s the least of your problems; shot noise, motion blur, lens distortion, botched settings, and other imperfections all have a much bigger impact on the final image.



  • To add to what others said:

    LPDDR, the same stuff you find in laptops and smartphones, is used in some inference hardware.

    Also, the servers need a whole lot of regular CPU DIMMs, since they’re still mostly EPYC/Xeon servers with 8 GPUs each. And why are they “wasting” so much money on CPU RAM that isn’t really needed, you ask? Same reason as a lot of AI: it’s immediately accessible, already targeted by devs, and AI dev is way more conservative and wasteful than you’d think.

    Same for SSDs: regular old servers (including AI servers) need them too. In a perfect world they’d use centralized storage for images/weights with near-“diskless” inference/training servers. Some AI servers do this, but most don’t.


    Basically, the waste is tremendous, for the same reason they run cheap gas generators on-site: it’s faster to market.







  • From my perspective in the local LLM scene:

    They’re getting better at being dumb tools doing mundane things. Formatting for MCP, tool use and stuff is all getting trained in now.
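
    For example, the kind of structured call that’s now getting baked in looks roughly like this (a generic OpenAI-style shape; individual models’ chat templates differ, and the tool name/arguments here are hypothetical):

    ```python
    # Roughly the structured output that "tool use" training bakes in: the model
    # emits a JSON function call instead of prose. OpenAI-style shape shown;
    # the tool name and arguments are hypothetical.
    import json

    tool_call = {
        "role": "assistant",
        "tool_calls": [{
            "type": "function",
            "function": {
                "name": "get_weather",                       # hypothetical tool
                "arguments": json.dumps({"city": "Berlin"}), # args as a JSON string
            },
        }],
    }
    print(json.dumps(tool_call, indent=2))
    ```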

    …They aren’t getting much smarter or more reliable, though.

    This is especially true of the big US AI houses. Incredible papers come out weekly addressing things like sampling errors, power use, the one-way serial autoregressive architecture, all these fundamental caps on capability, and… they aren’t even testing any of it? Contrary to what you hear, LLM development has been very, very, very conservative, and even a single failed experiment can destroy a whole division.


    The Chinese devs are at the forefront IMO, and are really pushing efficiency, but they’re falling into a similar trap elsewhere, unfortunately.


    So I don’t know how to answer your question.

    There is TONS of low-hanging fruit to be picked, on both the model and implementation side, but it doesn’t seem like anyone is picking it efficiently? It’s largely following the corporate enshittification model of “don’t improve, scale.”

    Labs doing anything interesting/not scammy are either bought out and smothered (mostly in the US), fall into a “herd mentality” (my observation looking into China), or are crushed by corpo nonsense (Korea/Middle East/US) or messed-up laws (Europe).


  • If you’re on a browser, I’d recommend:

    https://github.com/amitbl/blocktube

    And perhaps other YouTube client apps have a similar feature.

    I find that, for a given topic, there are a few common channels spamming hundreds and hundreds of junk videos. Block them as you find them, and it cleans up the feed immensely.

    It’s absolutely mind boggling that YT doesn’t include this as a default feature.


    Also, respectfully I would not get too invested in YT.

    The other day, I found my TV (with the stock app) auto-skipping sponsors. That’s just one of a bazillion ways Google is intentionally crushing creators who make anything but attention slop, so the kind of long-form content you like may not last.