I know you said consumer GPU, but I run a used Tesla P40. It has 24 GB of vram. The price has gone up since I got it a couple years ago, there might be better options in the same price category. Still, it’s going to be cheaper than a modern full fat consumer gpu, with a reasonable performance hit.
My use case is text generation, chat kind of things. In most cases, the inference is more than fast enough, but it can get slow when swapping out large context lengths.
Mostly I run quantized 8-20B models with the sweet spot being around 12. For specialized use cases outside of general language, you can run more compact models. The general output is quite good, and I would have never had thought it was possible 10 years ago.
ETA: I paid about $200 USD for the P40 a couple years ago, plus the price for a fan and 3d printed shroud.
Have you been wronged by njalla?
I think having an external owner is preferable.