Using Mac M2 Ultra 192GB to Self-Host LLMs?

shaserlark@sh.itjust.works · edit-2 9 months ago

shaserlark@sh.itjust.works · 9 months ago

Yeah, I found some stats now, and indeed you’re gonna wait like an hour for prompt processing if you throw 80-100k tokens into a powerful model. With APIs that kinda works instantly. Not surprising, but just to give a comparison. Bummer.
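For a rough sense of what that wait implies, here is a back-of-the-envelope sketch in Python. The function names are made up for illustration, and the one-hour figure is just the ballpark from this comment, not a benchmark.

```python
def implied_prefill_speed(prompt_tokens: int, wait_seconds: float) -> float:
    """Prompt-processing speed (tokens/s) implied by a given wait."""
    return prompt_tokens / wait_seconds


def prefill_wait_minutes(prompt_tokens: int, tokens_per_second: float) -> float:
    """Minutes spent processing the prompt at a given speed."""
    return prompt_tokens / tokens_per_second / 60


# "Like an hour" for 80-100k tokens, taken at face value (3600 s).
for n in (80_000, 100_000):
    speed = implied_prefill_speed(n, 3600)
    print(f"{n} tokens in 1 h -> about {speed:.1f} tokens/s of prompt processing")
```

Going the other way, `prefill_wait_minutes(100_000, tokens_per_second)` gives the wait for whatever prompt-processing speed you actually measure on your own setup.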

Boomkop3@reddthat.com · 9 months ago

Application Programming Interface, are you talking about something on the internet? On a gpu driver? On your phone?

Then also, what size model are you using? Defined with int32? fp4? Somewhere in between? That’s where the RAM requirements come from.
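To make the precision point concrete, here is a minimal sketch of the arithmetic: weight memory is roughly parameter count times bits per weight, divided by 8. The 70B model size is a hypothetical example, the 192 GB is the machine from the thread title, and the estimate ignores KV cache, activations, and whatever the OS keeps for itself.

```python
# Bits per weight for a few common storage formats.
BITS_PER_WEIGHT = {"int32": 32, "fp16": 16, "int8": 8, "fp4": 4}


def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


PARAMS_BILLIONS = 70   # hypothetical model size, for illustration only
RAM_GB = 192           # the M2 Ultra configuration from the thread title

for name, bits in BITS_PER_WEIGHT.items():
    gb = weight_memory_gb(PARAMS_BILLIONS, bits)
    verdict = "fits" if gb <= RAM_GB else "does not fit"
    print(f"{name}: {gb:.0f} GB of weights, {verdict} in {RAM_GB} GB")
```

Real usage will be higher than this, especially with long contexts, so treat the result as a lower bound.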

I get that you’re trying to do a mic drop or something, but you’re not being very clear

shaserlark@sh.itjust.works · 9 months ago

Are you drunk?

Boomkop3@reddthat.com · 9 months ago

No, just calling your bluff. git gud m8

shaserlark@sh.itjust.works · 8 months ago

You’re aware that there’s the OpenAI API library right? https://github.com/openai/openai-python

It’s really nothing fancy especially on Lemmy where like 99% of people are software engineers…

Boomkop3@reddthat.com · edit-2 8 months ago

Eyy, a web API! You could’ve just said that right away. There’s more than just web APIs.

How is this web API relevant to your choice of hardware for running these models locally?

shaserlark@sh.itjust.works · 8 months ago

Congrats on being that guy

Boomkop3@reddthat.com · 8 months ago

Throwing money at a problem works, next time try to know what you’re doing

Boomkop3@reddthat.com · 9 months ago

Anyways, the important thing is the “TOPS”, aka trillions of operations per second. Having enough RAM is important, but if you don’t have a fast processor then you’re wasting RAM, when you could just stream the model from a fast SSD.

One such case is when your system can’t handle more than 50 TOPS, like the Apple M systems. Try an old GPU and enjoy 1000s of TOPS.
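If you want to sanity-check the compute-versus-streaming tradeoff for your own setup, here is a rough sketch. It assumes a dense model that needs about 2 operations per parameter per generated token and has to read every weight once per token. The 50 TOPS figure is the one from this comment; the 70B size, the 35 GB (fp4) footprint, and the 7 GB/s SSD bandwidth are illustrative assumptions, not measurements.

```python
def compute_seconds_per_token(params_billions: float, tops: float) -> float:
    """Time to do the math for one token, assuming ~2 ops per parameter."""
    ops = 2 * params_billions * 1e9
    return ops / (tops * 1e12)


def stream_seconds_per_token(weights_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to read all weights once at the given bandwidth."""
    return weights_gb / bandwidth_gb_per_s


# Illustrative numbers only: hypothetical 70B model stored at 4 bits per weight.
compute = compute_seconds_per_token(params_billions=70, tops=50)
stream = stream_seconds_per_token(weights_gb=35.0, bandwidth_gb_per_s=7.0)

print(f"compute per token: {compute:.4f} s")
print(f"SSD streaming per token: {stream:.1f} s")
print("bottleneck:", "compute" if compute > stream else "weight streaming")
```

Swap in your own TOPS and bandwidth numbers to see which side ends up being the bottleneck.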