Quantization, context size and the number of parameters of a model directly influence hardware requirements. Bonus: an interactive tool to save time.
Between the choice of model, architecture and quantization, defining the right configuration for a generative artificial intelligence model can quickly turn into a headache. However, a few simple rules make it possible to avoid over-sizing your infrastructure, or ending up with under-sized hardware unable to meet business needs. Explanations.
VRAM, the sinews of war
VRAM (for Video Random Access Memory) is the main parameter to monitor when deploying generative AI locally, whether for an LLM or diffusion models, for example. Built into GPU graphics cards, VRAM temporarily stores the model and its data during inference. Compared to classic RAM, it offers much higher bandwidth, essential for the massively parallel computations of neural networks. “VRAM directly determines the maximum size of the model that can be loaded and the length of the usable context window”, sums up Marie-Michel Maudet, managing director of Linagora. Three parameters directly influence VRAM: the number of parameters of the model, the quantization and the size of the context window.
The size of the model, the main criterion
This is the main parameter to analyze in order to anticipate VRAM needs. The more parameters a model has, the more capable it will be on sophisticated tasks, but the more resources it will consume. “For a model, you should plan roughly twice its parameter count in VRAM for comfortable use”, specifies Marie-Michel Maudet. Thus, a model of 24 billion parameters will require approximately 48 GB of VRAM under real operating conditions.
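As a minimal sketch of this rule of thumb (assuming 16-bit weights, i.e. roughly 2 bytes per parameter, with the margin for comfortable use folded into the factor), the estimate can be written as:

```python
def vram_rule_of_thumb(params_billions: float) -> float:
    """Rough VRAM estimate in GB for a 16-bit model, following the
    'roughly twice the parameter count' rule quoted above."""
    return 2.0 * params_billions

print(vram_rule_of_thumb(24))  # -> 48.0 GB, matching the 24-billion-parameter example
```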
Be careful, however: the architecture of the model can modify this rule. MoE (Mixture of Experts) models, increasingly common, may display 56 billion parameters on paper but activate only a fraction of them during inference. “We apply the same calculation rule, but only to the active partition, which significantly saves VRAM,” explains the managing director of Linagora. These architectures are nevertheless complex to deploy and require several aggregated GPU cards for optimal performance.
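For MoE architectures, the same rule can thus be applied to the active parameters only; a hedged sketch, where the active-parameter figure is purely hypothetical and used for illustration:

```python
def vram_moe_rule_of_thumb(active_params_billions: float) -> float:
    """Same 2x rule, applied only to the parameters activated at inference
    time in a Mixture-of-Experts model (an approximation: depending on the
    serving setup, inactive experts may still need to reside in VRAM)."""
    return 2.0 * active_params_billions

# Hypothetical MoE: 56 billion total parameters, of which 14 billion are active.
print(vram_moe_rule_of_thumb(14))  # -> 28.0 GB instead of 112 GB for the full count
```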
The choice of size must above all match the business use case. For a simple conversational chatbot or text classification tasks, a model of 3 to 7 billion parameters is more than enough. RAG applications benefit from models of 13 to 30 billion parameters to guarantee a better understanding of the context. Beyond 30 billion, use cases are more advanced: multimodal analysis, OCR, orchestrator agents…
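Purely as an indicative summary of these guidelines (the category labels and structure are ours), this can be captured in a simple lookup:

```python
# Indicative model-size ranges (in billions of parameters) per use case,
# summarizing the guidelines above.
MODEL_SIZE_BY_USE_CASE = {
    "simple chatbot / text classification": (3, 7),
    "RAG application": (13, 30),
    "multimodal analysis / OCR / orchestrator agent": (30, None),  # 30B and above
}
```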
Quantization, to be balanced between precision and cost
After the number of parameters, quantization is the second lever for optimizing a model's memory footprint. The principle is to “compress” the precision of the numbers that make up the model. Going from a very precise format (16 bits) to a reduced format (4 bits) divides the required memory space by four, with a moderate loss of quality depending on the use. A 24-billion-parameter model thus goes from 48 GB to 15 GB between its full version and its 4-bit quantized version.
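A minimal sketch of the arithmetic, counting only the weights (parameters × bits / 8); the extra few gigabytes observed in practice come from runtime buffers and the KV cache:

```python
def weight_vram_gb(params_billions: float, bits: int) -> float:
    """VRAM occupied by the weights alone, in GB: parameters x bits / 8.
    One billion parameters at 8 bits is roughly 1 GB."""
    return params_billions * bits / 8

print(weight_vram_gb(24, 16))  # 48.0 GB in 16-bit
print(weight_vram_gb(24, 4))   # 12.0 GB in 4-bit; ~15 GB once runtime buffers are added
```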
Marie-Michel Maudet recommends adapting the quantization level to the use case. “For pure text, the de facto standard is 4-bit quantization, which offers the best performance-consumption compromise,” he explains. For more demanding tasks such as OCR or image analysis, he recommends moving up to 5 or 8 bits depending on the available hardware capabilities. In production, the ideal remains 8-bit quantization, which halves the memory footprint compared to the 16-bit version while preserving optimal quality. Finally, non-quantized models remain reserved for critical cases where maximum precision is required. You then get the full capacity of the model, corresponding to its level in the benchmarks.
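As an indicative recap of these recommendations (the workload labels are ours), this gives roughly:

```python
# Quantization levels suggested above, per type of workload (indicative only).
QUANT_BITS_BY_USE_CASE = {
    "pure text": 4,                    # best performance/consumption compromise
    "OCR / image analysis": (5, 8),    # 5 to 8 bits, depending on available hardware
    "quality-sensitive production": 8,
    "critical, maximum precision": 16, # non-quantized, full benchmark-level capacity
}
```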
The importance of the context window
The context window defines the maximum amount of text (in tokens) that a model can process at once. It directly influences VRAM consumption: the larger the window, the more memory consumption increases. As a first approximation, each widening of the window results in a linear increase in the required memory. For professional uses, a minimum window of 16,000 to 32,000 tokens is recommended. “Below that, you quickly find yourself limited as soon as you want to analyze long documents or maintain a conversation with history,” recalls Marie-Michel Maudet.
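To make the linear relationship concrete, here is a hedged sketch of the usual KV-cache approximation for a standard transformer (the layer and head counts below are hypothetical, roughly those of a 7-billion-parameter model):

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB for a single sequence: keys and values
    are stored for every layer and every token, so memory grows linearly
    with the context length. Assumes 16-bit cache entries by default."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9

# Hypothetical 7B-class model: 32 layers, 8 KV heads, head dimension 128.
print(round(kv_cache_gb(32_000, 32, 8, 128), 2))  # ~4.19 GB on top of the weights
```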
From the RTX 4060 to the H100, some recommendations
For models up to 7 billion quantized parameters, Marie-Michel Maudet recommends entry-level cards with around 8 GB of VRAM such as the NVIDIA RTX 4060 and 4070, sufficient for simple chatbots or experimentation. For models up to 13 billion parameters (moderately quantized) with context windows of 16,000 tokens, 12 to 16 GB of VRAM will be necessary. You can then target an NVIDIA RTX 4070 Ti, RTX 3090 or 4090. For quantized models of 30 billion parameters, the specialist recommends cards with 24 GB of VRAM such as the RTX 4090 or the A6000.
Finally, for models up to 70 billion parameters, more than 48 GB of VRAM will be necessary. You will therefore opt for an NVIDIA L40 or the classic H100. For models exceeding 100 billion parameters, and some open source models even reach 600 billion, deployment requires GPU clusters. Several H100 or A100 cards are then aggregated via very high-speed interconnect technologies, making it possible to distribute the model over several dozen GPUs. These complex and expensive deployments currently concern only a very small number of companies.
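These recommendations can be summarized in a deliberately simplified lookup (our own helper, for illustration only; actual card choices also depend on budget, form factor and multi-GPU support):

```python
def suggest_gpus(required_vram_gb: float) -> list[str]:
    """Return indicative GPU options for a given VRAM requirement,
    following the tiers described above (simplified, non-exhaustive)."""
    tiers = [
        (8,  ["NVIDIA RTX 4060", "RTX 4070"]),          # up to ~7B quantized
        (16, ["RTX 4070 Ti", "RTX 3090", "RTX 4090"]),  # up to ~13B, 16k-token context
        (24, ["RTX 4090", "A6000"]),                    # ~30B quantized
        (80, ["L40", "H100"]),                          # up to ~70B (L40: 48 GB, H100: 80 GB)
    ]
    for vram, cards in tiers:
        if required_vram_gb <= vram:
            return cards
    return ["GPU cluster (several interconnected H100/A100)"]  # 100B+ models

print(suggest_gpus(15))  # ['RTX 4070 Ti', 'RTX 3090', 'RTX 4090']
```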
To help you grasp the hardware stakes more concretely, we have designed a small interactive widget. Based on a few simple parameters (use case, expected precision, context size), it illustrates the theoretical VRAM needs as well as the graphics cards likely to fit. Note, however, that this is only an indicative and approximate estimate.
In practice, the best approach consists of starting from an approximate VRAM estimate, taking into account the number of parameters, the context window and the quantization, then iterating afterwards based on the observed results. Finally, the inference server used (Ollama, vLLM, etc.) also plays a key role in optimizing resources. We will come back to this.




