theMind Experiment: Running LLMs on CPUs


Recent advancements in artificial intelligence have been fueled by the incredible growth of large-scale language models like OpenAI’s GPT-4. While GPUs have traditionally been the go-to for training and deploying these cutting-edge models, a growing body of research is demonstrating the viability of CPUs in this domain. In this article, we explore the benefits of running large language models (LLMs) on CPUs, discuss the current results of our experiment in this area, and outline possible next steps to further optimize performance.

Advantages of Running LLMs on CPUs

  1. Cost-Effectiveness: GPUs have become increasingly expensive due to high demand and limited supply. Using CPUs to train and run LLMs offers a more cost-effective solution, especially for small and medium-sized businesses.
  2. Scalability: CPU infrastructure is more abundant and easily accessible, enabling researchers and organizations to scale their projects across more cores without the constraints of limited GPU availability.
  3. Energy Efficiency: CPUs generally consume less power than GPUs, leading to reduced energy costs and a smaller carbon footprint for AI projects.
  4. Wider Applicability: Many institutions and data centers have more CPUs than GPUs, making it easier to deploy LLMs in a broader range of environments.

Current Results of the Experiment

The experiment aimed to assess the performance and efficiency of running the LLaMA model on CPUs compared to GPUs. For this experiment we used the computing resources of our partner Cato Digital, which provided machines with the following configuration:

Type: Application Server
Processor: 2 x Intel Xeon E5-2680 v4
Processor Speed: 2.4-3.3 GHz
Local Storage: 1 x 512 GB SSD
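For CPU inference, throughput scales mostly with physical core count, so it is worth spelling out what this box offers. A quick sketch, using Intel's published spec for the E5-2680 v4 (14 cores per socket, Hyper-Threading) rather than anything stated in the report itself:

```python
# 2x Intel Xeon E5-2680 v4; 14 cores per socket is Intel's published
# spec for this SKU, not a figure taken from the report.
sockets = 2
cores_per_socket = 14
threads_per_core = 2  # Hyper-Threading

physical_cores = sockets * cores_per_socket          # 28 physical cores
logical_threads = physical_cores * threads_per_core  # 56 logical threads
print(physical_cores, logical_threads)
```

CPU inference engines are usually run with one thread per physical core (28 here); oversubscribing to all 56 logical threads often hurts rather than helps.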

Preliminary results are promising:

  1. Model Accuracy: The LLM running on CPUs achieved comparable accuracy to the GPU version in several benchmark tasks, demonstrating that there is minimal loss in performance.
  2. Training Time: Although training on CPUs took longer than on GPUs, the difference was manageable, making it a viable alternative for organizations with time flexibility.
  3. Inference Speed: Inference latency on CPUs was found to be within an acceptable range, making it suitable for real-time applications.
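To put the latency figures in the tables below into more familiar terms, the per-token times convert directly into throughput:

```python
# Per-token latencies (ms) taken from the results tables below.
latencies_ms = {
    "65B f16": 1278.02,
    "65B q4_0": 621.35,
    "7B q4_0": 94.42,
}

# e.g. 94.42 ms/token works out to roughly 10.6 tokens/s
for model, ms in latencies_ms.items():
    tokens_per_sec = 1000.0 / ms
    print(f"{model}: {tokens_per_sec:.2f} tokens/s")
```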

Performance of the large models:

Model                      Time per token, ms    Time per run, ms    Memory required
65B/ggml-model-f16.bin     1278.02               3726.15             128109.20 MB (+ 5120.00 MB per state)
65B/ggml-model-q8_0.bin    904.73                2226.19             73631.70 MB (+ 5120.00 MB per state)
65B/ggml-model-q4_0.bin    621.35                1310.88             42501.70 MB (+ 5120.00 MB per state)
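One takeaway from the table above: quantization buys speed as well as memory. The relative speedups can be computed directly from the per-token latencies:

```python
# Per-token latency (ms) for the 65B model, from the table above
latency_65b = {"f16": 1278.02, "q8_0": 904.73, "q4_0": 621.35}

# Speedup of each quantized variant relative to the f16 baseline
for quant in ("q8_0", "q4_0"):
    speedup = latency_65b["f16"] / latency_65b[quant]
    print(f"{quant}: {speedup:.2f}x faster than f16")
```

So on this hardware, 4-bit quantization roughly doubles generation speed for the 65B model.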

Performance of the smaller models:

Model                      Time per token, ms    Time per run, ms    Memory required
30B/ggml-model-f16.bin     721.93                1909.48             64349.70 MB (+ 3124.00 MB per state)
30B/ggml-model-q8_0.bin    392.79                1069.48             37206.10 MB (+ 3124.00 MB per state)
30B/ggml-model-q4_0.bin    204.90                613.74              21695.48 MB (+ 3124.00 MB per state)
13B/ggml-model-f16.bin     290.64                821.60              26874.67 MB (+ 1608.00 MB per state)
13B/ggml-model-q8_0.bin    204.94                558.93              16013.73 MB (+ 1608.00 MB per state)
13B/ggml-model-q4_0.bin    145.74                368.89              9807.48 MB (+ 1608.00 MB per state)
7B/ggml-model-f16.bin      167.59                492.20              128109.20 MB (+ 5120.00 MB per state)
7B/ggml-model-q8_0.bin     158.99                339.44              9022.33 MB (+ 1026.00 MB per state)
7B/ggml-model-q4_0.bin     94.42                 241.41              5809.33 MB (+ 1026.00 MB per state)
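The memory side of the same trade-off, using the 13B rows from the table above:

```python
# Memory required (MB) for the 13B model at each precision, from the table above
mem_13b = {"f16": 26874.67, "q8_0": 16013.73, "q4_0": 9807.48}

# Fraction of the f16 footprint that each quantized variant needs
for quant in ("q8_0", "q4_0"):
    ratio = mem_13b[quant] / mem_13b["f16"]
    print(f"{quant}: {ratio:.2f}x of f16 memory")
```

At q4_0 the 13B model fits in well under 10 GB of RAM, which is what makes commodity CPU servers practical for it.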

You can read the full experiment performance report at this GitHub Gist page.

Possible Next Steps

  1. Optimizing CPU Implementation: Further research should focus on optimizing the implementation of LLMs on CPUs to minimize training time and improve inference speed.
  2. Tailoring Models for CPU Performance: Developing LLMs specifically designed to run efficiently on CPUs could lead to significant performance gains without compromising on model accuracy.
  3. Exploring Hybrid Solutions: Combining CPU and GPU resources could allow for more efficient and cost-effective AI deployments, leveraging the strengths of both platforms.


Running large language models on CPUs is a promising alternative to GPU-based deployments, offering advantages in cost-effectiveness, scalability, and energy efficiency. Our early results suggest that LLM performance on CPUs can approach that of GPUs for many workloads, and further optimization offers the potential for even greater gains. By embracing CPU computing and exploring new ways to harness its power, the AI community can continue to push the boundaries of what is possible with large language models.
