Navigating the bustling market of large language models (LLMs) today can feel like being a kid in a candy store… with a catch. The catch? There’s just too much choice, and unlike picking between gummy bears and chocolate bars, the stakes are higher and the options far more complex.
🤯 LLM Overload!
As of May 2024, Hugging Face hosts over 600,000 models. Yes, you read that correctly! With such a staggering array of models, spanning different developers, fine-tuned variants, model sizes, quantizations, and deployment backends, picking the right one can be downright daunting.
Our research peered into the depths of Hugging Face’s extensive model repository, analyzing the most popular models by downloads, likes, and trends. We found that NLP models are the reigning champions, accounting for 52% of all downloads. Audio and computer vision models trail behind, while multimodal models are just starting to make their mark. Further analysis showed that, as expected, the most downloaded open-source models were authored by universities and research institutions (39% of all downloads). The remainder is roughly evenly split between open-source communities, emerging AI organizations, and Big Tech.
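If you want to poke at these numbers yourself, the Hub exposes this metadata programmatically. Here’s a minimal sketch (assuming the `huggingface_hub` package is installed) that pulls the most-downloaded models along with the task category behind our NLP/audio/vision breakdown:

```python
# Minimal sketch: query the Hugging Face Hub for the most-downloaded models.
# Requires `pip install huggingface_hub`; no API key is needed for public data.
from huggingface_hub import list_models

# Top 10 models by downloads; each entry exposes the model id, its download
# count, and pipeline_tag (the task category, e.g. "text-generation").
for model in list_models(sort="downloads", direction=-1, limit=10):
    print(model.id, model.downloads, model.pipeline_tag)
```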
So, how does one sift through this mountain of models to find the right one? Well, fear not! We’ve devised a no-nonsense framework to help you select the perfect LLM for your needs.
🗺️ A Taxonomy to navigate LLM selection
To streamline this complex decision, we suggest categorizing the selection process into four key elements:
- 📖 Model Openness: How accessible are the model’s code, weights, and training datasets?
- Open-source: Everything is public.
- Open-weight: Only the model weights are available.
- Proprietary: A black box. Everything is under lock and key.
- ⚒️ Model Task Use Case: What job do you need the model to perform?
- NLP: For tasks like generating text, translating, and determining sentence similarity.
- Audio: For tasks like speech recognition or generating speech from text.
- Computer Vision: For generating images from text, recognizing images, etc.
- Multimodal: Combining different types of data, like visual question answering.
- 🎯 Model Precision: What level of performance do you need?
- Parameter count variants: From models with a few billion parameters to those reaching the trillions.
- Quantization variants: Several quantization techniques have emerged that compress models enough to deploy them effectively on consumer hardware (see the sketch after this list).
- 🏃🏻♂️ Model Deployment: How will the model run? Locally on consumer hardware, on cloud infrastructure you manage, or behind a hosted API?
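To make the quantization point concrete, here’s a minimal sketch of loading a model in 4-bit with the `transformers` and `bitsandbytes` libraries (the model id is just an example, and a CUDA GPU is assumed):

```python
# Minimal sketch: load a causal LM in 4-bit via transformers + bitsandbytes.
# Assumes `pip install transformers bitsandbytes accelerate` and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s)
)
```

The payoff: an 8B model that needs roughly 16 GB of memory in fp16 fits in roughly 5 to 6 GB at 4-bit, which is what makes consumer-hardware deployment feasible.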
❓Guiding Questions to Find Your LLM Match❓
With the taxonomy in place, try answering these questions to refine your choices:
- What task are you trying to accomplish? Depending on your needs, you might want a specific type of model. Here are some examples:
- 🔡 NLP: Consider GPT-4, GPT-3.5, Copilot, Claude 3, Llama 3, Gemini, or Mixtral.
- 🔉 Audio: For text-to-speech, consider MeloTTS or Deepgram. For speech-to-text, consider OpenAI Whisper (or Whisper JAX), Deepgram, or Google Cloud Speech-to-Text (see the sketch after this list).
- 👁️ Computer Vision: OpenAI DALL-E, Midjourney, Stable Diffusion
- 🔀 Multimodal: Gemini Advanced, OpenAI GPT-4 with Vision, LLaVA-NeXT
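As an example from the audio bucket, here’s a minimal speech-to-text sketch using OpenAI Whisper through the `transformers` pipeline (the audio filename is a placeholder):

```python
# Minimal sketch: local speech-to-text with Whisper via transformers.
# Assumes `pip install transformers` plus ffmpeg for audio decoding.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("meeting_recording.wav")  # placeholder filename
print(result["text"])
```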
- Do you need high precision in a specialized domain? Some use cases require fine-tuning on specific datasets, others work with retrieval-augmented generation (RAG), and for others general-purpose models suffice:
- If your use case requires fine-tuning to reach the required precision and domain knowledge: we recommend open-source models, since you control the weights.
- If you need high performance on general-purpose knowledge across diverse fields: big proprietary LLMs will most likely be your best bet.
- Otherwise, either proprietary or open-source works. This covers cases where extreme specialization isn’t required, since grounding knowledge can be added via RAG.
- What are your inference speed requirements? The required inference speed can dramatically influence your choice of model (read our Nvidia blog for our POV on AI hardware evolution):
- ⚡️ Ultra-fast response? Use Groq, one of the fastest LLM inference platforms at the time of writing, which runs Llama and Mixtral models at lightning speed (particularly useful for agentic workflows; see the sketch after this list).
- ⏩ Fast response? Try smaller models (7B-class or quantized) that fit on consumer hardware, such as Microsoft Phi-3 or Llama 3 8B. GPT-3.5 Turbo is quite fast as well.
- 🏃🏻♂️ Around human reading speed? Try GPT-4 Turbo or Llama 3 70B.
- 🐢 Response time is irrelevant? Run the biggest, best model you can on a cloud deployment.
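For the ultra-fast option, Groq exposes an OpenAI-compatible API. A minimal sketch (the model id reflects Groq’s catalog at the time of writing and may change):

```python
# Minimal sketch: fast Llama 3 inference on Groq's API.
# Assumes `pip install groq` and a GROQ_API_KEY environment variable.
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment
response = client.chat.completions.create(
    model="llama3-8b-8192",  # Groq model id at the time of writing
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)
print(response.choices[0].message.content)
```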
- What’s your budget? Always a critical question, as costs can vary widely based on the model and deployment method.
- Free, please: go open-source.
- Unknown, I’m testing an MVP: start with open-source and move to more performant models as needed.
- Good price/performance ratio: OpenAI’s GPT-3.5 Turbo. It’s fast, cheap, and generally performs well.
- Money isn’t a concern: run the biggest, best model you can on hosted hardware, or pay for best-in-class API alternatives.
✅ Applying the Framework: Real-Life Scenarios
Scenario 1: Building an Internal Support Ticket Chatbot
- Task Use Case: NLP text generation. General-purpose models like ChatGPT, Gemini, Claude, Llama 3, or Mixtral would work well.
- Precision requirements: Conversational abilities grounded in internal company documentation (support docs, previous tickets). Since specialized knowledge is required but you want to keep general conversational abilities, a general-purpose model with RAG is a good starting point.
- Inference speed requirements: Ideally human reading speed, but slower is acceptable. GPT-4-grade models are therefore a good target.
- Budget: Less than a third of the annual salary of an average IT support professional ($70,000 per year / 3 ≈ $23,300).
- Assumptions:
- 1 out of 3 tickets can be solved automatically by the chatbot
- 10 IT support tickets per day for your organization, i.e., 3,650 tickets per year
- On average, 1,000 words of input and 1,000 words of output per ticket (around 3,000 tokens per ticket)
- GPT-4 Turbo: ≈ $200 USD per year
- GPT-3.5 Turbo: ≈ $15 USD per year
- Gemini Pro 1.5: ≈ $15 USD per year
- In short, even the biggest and best models come in far below the cost of an IT support professional and fit comfortably within the budget; the quick script below reproduces the arithmetic.
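A back-of-the-envelope check for the GPT-4 Turbo figure (the per-token prices are OpenAI’s published rates as of writing; verify against the current pricing page before relying on them):

```python
# Back-of-the-envelope yearly cost for the chatbot under the assumptions above.
TICKETS_PER_YEAR = 10 * 365        # 10 tickets/day -> 3,650/year
TOKENS_IN = TOKENS_OUT = 1_500     # ~1,000 words each way per ticket
PRICE_IN, PRICE_OUT = 10.0, 30.0   # USD per 1M tokens (GPT-4 Turbo, as of writing)

yearly_cost = TICKETS_PER_YEAR * (
    TOKENS_IN / 1e6 * PRICE_IN + TOKENS_OUT / 1e6 * PRICE_OUT
)
print(f"${yearly_cost:,.0f} per year")  # ~$219, in line with the ~$200 above
```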
- Conclusion: Your best option is most likely GPT-4 Turbo with a RAG implementation via OpenAI’s API. A sketch of that setup follows.
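This is a sketch, not a full implementation: `retrieve_docs` is a hypothetical helper standing in for your vector-store lookup, and the official `openai` package with an OPENAI_API_KEY is assumed.

```python
# Minimal sketch: GPT-4 Turbo answering tickets grounded in retrieved docs.
# Assumes `pip install openai` and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def answer_ticket(question: str) -> str:
    # retrieve_docs is a hypothetical helper wrapping your vector store
    # (e.g., embeddings of support docs and past tickets).
    context = "\n\n".join(retrieve_docs(question, top_k=3))
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {
                "role": "system",
                "content": "Answer support questions using only this internal "
                           f"documentation:\n{context}",
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```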
Scenario 2: Creating a Custom Text to Voice Generator
- Task Use Case: Audio, text-to-speech generation. Most TTS models on Hugging Face would work (e.g., MeloTTS, Parler TTS).
- Precision requirements: Needs to convincingly mimic your boss’s voice. Open-source models are the better option here, since you will have to fine-tune the model to replicate the voice.
- Inference speed requirements: An hour’s delay is acceptable. You will probably need a cloud-based deployment for the fine-tuning phase.
- Budget: Less than what your boss would cost making the announcements over a year.
- One hour of your boss’s time costs $60 USD
- One announcement per week, i.e., 52 per year
- Therefore the budget needs to stay under $3,120 per year ($60 × 52)
- Conclusion: Fine-tune Parler TTS hosted in the cloud. A minimal sketch of running the base model follows.
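As a starting point before any fine-tuning, here’s a minimal sketch of generating speech with the base Parler TTS model. It follows the project’s README; double-check the current API before relying on it:

```python
# Minimal sketch: text-to-speech with the base Parler TTS model.
# Assumes `pip install parler-tts soundfile`; API follows the project README.
import soundfile as sf
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "parler-tts/parler_tts_mini_v0.1"

model = ParlerTTSForConditionalGeneration.from_pretrained(repo).to(device)
tokenizer = AutoTokenizer.from_pretrained(repo)

text = "Good morning team, here is this week's announcement."
description = "A calm male voice speaking clearly at a moderate pace."  # voice prompt

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

audio = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
sf.write("announcement.wav", audio.cpu().numpy().squeeze(), model.config.sampling_rate)
```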
🔮 LLM Commoditization and What Lies Ahead
LLMs are improving rapidly and will likely consolidate into a few big foundational models that excel at many tasks. We believe differentiation will then come mainly from how well an LLM can integrate and leverage contextual data: a model that knows you and your problem space incredibly well will serve your needs better. Choosing the right LLM can be a game-changer for your projects. Whether it’s powering a chatbot or generating realistic voice messages, we hope we’ve given you some tools to navigate this decision process!
So, which LLM will you choose? How do you plan to use it? Dive into the comments and let us know your thoughts and experiences!