TheMind's experiment explored running large language models (LLMs) on CPUs as a cost-effective and scalable alternative to GPUs. The results showed comparable accuracy, with longer processing times but manageable inference speeds, making CPUs a viable option for AI deployment. Future optimizations could further improve performance and efficiency and open the door to hybrid CPU-GPU solutions for AI applications.
Published May 9, 2023
Introduction
Recent advancements in artificial intelligence have been fueled by the incredible growth of large-scale language models like OpenAI’s GPT-4. While GPUs have traditionally been the go-to for training and deploying these cutting-edge models, a growing body of research is demonstrating the viability of CPUs in this domain. In this article, we explore the benefits of running large language models (LLMs) on CPUs, discuss the current results of our experiment in this area, and outline possible next steps to further optimize performance.
Advantages of Running LLMs on CPUs
The main advantages are cost-effectiveness, scalability, and energy efficiency: CPU capacity is generally cheaper and more widely available than GPU capacity, and it can be scaled out on commodity server hardware.
Current Results of the Experiment
The experiment aimed to assess the performance and efficiency of running the LLaMA model on CPUs compared to GPUs. For this experiment we used the computing resources of our partner Cato Digital, which provided machines with the following configuration:
- Type: Application Server
- Processor: 2 x Intel Xeon E5-2680v4
- Processor Speed: 2.4-3.3 GHz
- vCores: 56
- Memory: 256 GB
- Local Storage: 1 x 512 GB SSD
- Network: 10 Gbps
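Before benchmarking, it is worth confirming that the host actually matches this spec. The snippet below is a minimal, Linux-only sketch for doing so; it is an illustration on our part, not part of the original experiment:

```python
# Sanity-check that the benchmark host matches the spec above
# (56 vCores, 256 GB RAM). Linux-only: reads /proc/meminfo.
import os

vcores = os.cpu_count()
with open("/proc/meminfo") as f:
    mem_kb = int(f.readline().split()[1])  # first line is MemTotal, in kB

print(f"vCores: {vcores}")                     # expect 56
print(f"Memory: {mem_kb / 1024 ** 2:.0f} GB")  # expect ~256
```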
Preliminary results are promising:
Large model performance:

| Model | Time per token, ms | Time per run, ms | Memory required |
|---|---|---|---|
| 65B/ggml-model-f16.bin | 1278.02 | 3726.15 | 128109.20 MB (+ 5120.00 MB per state) |
| 65B/ggml-model-q8_0.bin | 904.73 | 2226.19 | 73631.70 MB (+ 5120.00 MB per state) |
| 65B/ggml-model-q4_0.bin | 621.35 | 1310.88 | 42501.70 MB (+ 5120.00 MB per state) |
Smaller model performance:

| Model | Time per token, ms | Time per run, ms | Memory required |
|---|---|---|---|
| 30B/ggml-model-f16.bin | 721.93 | 1909.48 | 64349.70 MB (+ 3124.00 MB per state) |
| 30B/ggml-model-q8_0.bin | 392.79 | 1069.48 | 37206.10 MB (+ 3124.00 MB per state) |
| 30B/ggml-model-q4_0.bin | 204.90 | 613.74 | 21695.48 MB (+ 3124.00 MB per state) |
| 13B/ggml-model-f16.bin | 290.64 | 821.60 | 26874.67 MB (+ 1608.00 MB per state) |
| 13B/ggml-model-q8_0.bin | 204.94 | 558.93 | 16013.73 MB (+ 1608.00 MB per state) |
| 13B/ggml-model-q4_0.bin | 145.74 | 368.89 | 9807.48 MB (+ 1608.00 MB per state) |
| 7B/ggml-model-f16.bin | 167.59 | 492.20 | 128109.20 MB (+ 5120.00 MB per state) |
| 7B/ggml-model-q8_0.bin | 158.99 | 339.44 | 9022.33 MB (+ 1026.00 MB per state) |
| 7B/ggml-model-q4_0.bin | 94.42 | 241.41 | 5809.33 MB (+ 1026.00 MB per state) |
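For context on how a "time per token" figure like those above can be reproduced, here is a minimal sketch using the llama-cpp-python bindings. This is an assumption on our part (the report itself was generated with llama.cpp's built-in timings), and the model path and prompt are placeholders:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: any of the quantized models from the tables above.
MODEL_PATH = "models/7B/ggml-model-q4_0.bin"

# n_threads matched to the 56 vCores of the test machine.
llm = Llama(model_path=MODEL_PATH, n_ctx=512, n_threads=56)

prompt = "Building a website can be done in 10 simple steps:"
start = time.perf_counter()
out = llm(prompt, max_tokens=64, echo=False)
elapsed = time.perf_counter() - start

n_generated = out["usage"]["completion_tokens"]
print(out["choices"][0]["text"])
print(f"~{elapsed / n_generated * 1000:.2f} ms per token "
      f"({n_generated} tokens in {elapsed:.2f} s)")
```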
You can read the full performance report for the experiment on this GitHub Gist page.
Possible Next Steps
- Further optimize CPU inference performance and efficiency, for example through more aggressive quantization.
- Explore hybrid CPU-GPU deployments that combine the strengths of both (see the sketch below).
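As a sketch of what a hybrid setup might look like, llama-cpp-python exposes an n_gpu_layers parameter that offloads a subset of transformer layers to the GPU while the rest run on CPU threads. The paths and values below are placeholders, and this assumes a build of the library with GPU (e.g. cuBLAS) support:

```python
from llama_cpp import Llama

# Hybrid CPU-GPU sketch: offload the first 32 transformer layers to the
# GPU and run the remainder on CPU threads.
llm = Llama(
    model_path="models/13B/ggml-model-q4_0.bin",  # placeholder path
    n_ctx=512,
    n_threads=56,     # CPU threads for the layers that stay on the CPU
    n_gpu_layers=32,  # layers offloaded to the GPU
)

out = llm("The main advantage of hybrid inference is", max_tokens=32)
print(out["choices"][0]["text"])
```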
Conclusion
Running large language models on CPUs is a promising alternative to GPU-based deployments, offering numerous advantages in cost-effectiveness, scalability, and energy efficiency. Our preliminary results show that LLM inference on CPUs, while slower than on GPUs, remains practical, and further optimization offers the potential for even greater gains. By embracing CPU computing and exploring new ways to harness its power, the AI community can continue to push the boundaries of what is possible with large language models.