
theMind Experiment: Running LLMs on CPUs | AI Consulting

TheMind's experiment explored running large language models (LLMs) on CPUs as a cost-effective and scalable alternative to GPUs. Results showed accuracy comparable to GPU runs, with longer training times but manageable inference speeds, making CPUs a viable option for AI deployment. Future work on CPU-specific optimization and hybrid CPU-GPU deployments could further improve performance and efficiency.

Published

May 9, 2023

Introduction

Recent advancements in artificial intelligence have been fueled by the incredible growth of large-scale language models like OpenAI’s GPT-4. While GPUs have traditionally been the go-to for training and deploying these cutting-edge models, a growing body of research is demonstrating the viability of CPUs in this domain. In this article, we explore the benefits of running large language models (LLMs) on CPUs, discuss the current results of our experiment in this area, and outline possible next steps to further optimize performance.

Advantages of Running LLMs on CPUs

  1. Cost-Effectiveness: GPUs have become increasingly expensive due to high demand and limited supply. Using CPUs to train and run LLMs offers a more cost-effective solution, especially for small and medium-sized businesses.
  2. Scalability: CPU infrastructure is more abundant and easily accessible, enabling researchers and organizations to scale their projects across more cores without the constraints of limited GPU availability.
  3. Energy Efficiency: CPUs generally consume less power than GPUs, leading to reduced energy costs and a smaller carbon footprint for AI projects.
  4. Wider Applicability: Many institutions and data centers have more CPUs than GPUs, making it easier to deploy LLMs in a broader range of environments.

Current Results of the Experiment

The experiment aimed to assess the performance and efficiency of running the LLaMA model on CPUs compared to GPUs. For this experiment we used the computing resources of our partner Cato Digital, which provided machines with the following configuration:

Type:            Application Server
Processor:       2 x Intel Xeon E5-2680v4
Processor Speed: 2.4-3.3 GHz
vCores:          56
Memory:          256 GB
Local Storage:   1 x 512 GB SSD
Network:         10 Gbps
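
For a concrete sense of how such a CPU-only run looks in practice, here is a minimal inference sketch using the llama-cpp-python bindings to a llama.cpp-style runtime. This is an illustration, not the exact harness used in the experiment; the model path, thread count, and context size are assumptions:

    # Minimal CPU-only inference sketch (pip install llama-cpp-python).
    # Illustrative only: model path, thread count, and context size are assumptions,
    # and recent llama-cpp-python releases expect GGUF files rather than the older
    # ggml .bin checkpoints benchmarked below.
    from llama_cpp import Llama

    llm = Llama(
        model_path="65B/ggml-model-q4_0.bin",  # one of the quantized checkpoints from the tables
        n_threads=56,                          # match the machine's 56 vCores
        n_ctx=512,                             # modest context window to bound memory use
    )

    output = llm("Why can CPUs be a viable platform for LLM inference?", max_tokens=64)
    print(output["choices"][0]["text"])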

Preliminary results are promising:

  1. Model Accuracy: The LLM running on CPUs achieved comparable accuracy to the GPU version in several benchmark tasks, demonstrating that there is minimal loss in performance.
  2. Training Time: Although training on CPUs took longer than on GPUs, the difference was manageable, making it a viable alternative for organizations with time flexibility.
  3. Inference Speed: Inference latency on CPUs was found to be within an acceptable range, making it suitable for real-time applications.

Large-model performance:

Model                      Time per token, ms   Time per run, ms   Memory required
65B/ggml-model-f16.bin     1278.02              3726.15            128109.20 MB (+ 5120.00 MB per state)
65B/ggml-model-q8_0.bin    904.73               2226.19            73631.70 MB (+ 5120.00 MB per state)
65B/ggml-model-q4_0.bin    621.35               1310.88            42501.70 MB (+ 5120.00 MB per state)

Smaller-model performance:

Model                      Time per token, ms   Time per run, ms   Memory required
30B/ggml-model-f16.bin     721.93               1909.48            64349.70 MB (+ 3124.00 MB per state)
30B/ggml-model-q8_0.bin    392.79               1069.48            37206.10 MB (+ 3124.00 MB per state)
30B/ggml-model-q4_0.bin    204.90               613.74             21695.48 MB (+ 3124.00 MB per state)
13B/ggml-model-f16.bin     290.64               821.60             26874.67 MB (+ 1608.00 MB per state)
13B/ggml-model-q8_0.bin    204.94               558.93             16013.73 MB (+ 1608.00 MB per state)
13B/ggml-model-q4_0.bin    145.74               368.89             9807.48 MB (+ 1608.00 MB per state)
7B/ggml-model-f16.bin      167.59               492.20             128109.20 MB (+ 5120.00 MB per state)
7B/ggml-model-q8_0.bin     158.99               339.44             9022.33 MB (+ 1026.00 MB per state)
7B/ggml-model-q4_0.bin     94.42                241.41             5809.33 MB (+ 1026.00 MB per state)

You can read the full experiment performance report at this GitHub Gist page.
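
To translate the per-token latencies above into throughput, divide 1000 ms by the time per token. A small sketch of that conversion for the 4-bit quantized checkpoints, using the values from the tables:

    # Convert per-token latency (ms) from the tables above into tokens/second.
    ms_per_token = {
        "65B/ggml-model-q4_0.bin": 621.35,
        "30B/ggml-model-q4_0.bin": 204.90,
        "13B/ggml-model-q4_0.bin": 145.74,
        "7B/ggml-model-q4_0.bin": 94.42,
    }

    for model, ms in ms_per_token.items():
        print(f"{model}: {1000.0 / ms:.2f} tokens/s")

This works out to roughly 1.61, 4.88, 6.86, and 10.59 tokens per second for the 65B, 30B, 13B, and 7B q4_0 models respectively, which is the sense in which the CPU inference latency falls within an acceptable range for many applications.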

Possible Next Steps

  1. Optimizing CPU Implementation: Further research should focus on optimizing the implementation of LLMs on CPUs to minimize training time and improve inference speed.
  2. Tailoring Models for CPU Performance: Developing LLMs specifically designed to run efficiently on CPUs could lead to significant performance gains without compromising on model accuracy.
  3. Exploring Hybrid Solutions: Combining CPU and GPU resources could allow for more efficient and cost-effective AI deployments, leveraging the strengths of both platforms; a sketch of this follows the list.
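
As a sketch of what such a hybrid deployment could look like, llama.cpp-style runtimes can offload a subset of transformer layers to a GPU while the remainder runs on the CPU. The layer split, model path, and thread count below are assumptions to be tuned per machine, not values from the experiment:

    # Hybrid CPU/GPU sketch with llama-cpp-python: offload some layers to the GPU
    # and keep the rest on the CPU. Requires a build compiled with GPU support
    # (e.g. cuBLAS). All parameter values here are illustrative assumptions.
    from llama_cpp import Llama

    llm = Llama(
        model_path="13B/ggml-model-q4_0.bin",
        n_threads=56,      # CPU threads for the layers that stay on the CPU
        n_gpu_layers=20,   # layers offloaded to the GPU; tune to fit available VRAM
    )

    print(llm("Hello from a hybrid deployment.", max_tokens=16)["choices"][0]["text"])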

Conclusion

Running large language models on CPUs is a promising alternative to GPU-based deployments, offering advantages in cost-effectiveness, scalability, and energy efficiency. Our experimental results show that model accuracy on CPUs matches GPU deployments, and that training and inference times, while longer, remain workable; further optimization offers the potential for even greater gains. By embracing CPU computing and exploring new ways to harness its power, the AI community can continue to push the boundaries of what is possible with large language models.

Become An Energy-Efficient Data Center With theMind

The evolution of data centers towards power efficiency and sustainability is not just a trend but a necessity. By adopting green energy, energy-efficient hardware, and AI technologies, data centers can drastically reduce their energy consumption and environmental impact. As leaders in this field, we are committed to helping our clients achieve these goals, ensuring a sustainable future for the industry.



For more information on how we can help your data center become more energy-efficient and sustainable, contact us today. Our experts are ready to assist you in making the transition towards a greener future.
