Emerging Model Architectures and the Possibility of Cheaper, More Energy-Efficient AI Training
As AI continues to advance, several pressing challenges are becoming increasingly evident. The high cost, huge energy demands, and limited availability of computational resources pose significant barriers to widespread adoption and innovation. These issues are further compounded by reliance on third-party LLM providers, raising serious concerns about security, privacy, and control. Current models can also lack diversity and specialization, which limits their effectiveness in downstream tasks. While workarounds such as Retrieval-Augmented Generation can help customize these models, training them on custom data can be a more powerful and elegant way to fully leverage proprietary information and develop competitive, tailored AI products and services.
In this study, we explore 1.58-bit LLMs as a potential solution. Our approach uses an empirical evaluation of scaling laws to demonstrate that 1.58-bit LLMs can offer significant improvements over full-precision FP16 models in latency, memory usage, throughput, and energy efficiency.
This project is a continuation of the research presented in the 1.58-bit LLM paper published on arXiv.
The 1.58-bit LLM represents a groundbreaking approach to optimizing Large Language Models: parameters are encoded with a ternary set of values {-1, 0, +1}, which corresponds to log2(3) ≈ 1.58 bits of information per weight. This method not only matches the performance of full-precision FP16 models in terms of perplexity and task effectiveness but also delivers significant improvements in latency, memory usage, throughput, and energy consumption. These advancements are particularly important given the growing concerns about the deployment challenges and environmental impact of increasingly large LLMs.
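As a rough illustration of what ternary encoding means in practice, the sketch below quantizes a weight tensor to {-1, 0, +1} using an absmean-style scaling rule of the kind described in the 1.58-bit LLM literature; the function name and the epsilon constant are illustrative choices, not taken from a specific codebase.

```python
import torch

def quantize_weights_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Map a full-precision weight tensor to {-1, 0, +1} plus one scale.

    Absmean-style scheme: divide by the mean absolute value, round,
    and clip to [-1, 1]. The scale is kept so outputs can be rescaled.
    """
    scale = w.abs().mean().clamp(min=eps)          # per-tensor scaling factor
    w_ternary = (w / scale).round().clamp(-1, 1)   # every entry becomes -1, 0, or +1
    return w_ternary, scale

# A random 4x4 matrix collapses to at most three distinct values.
w = torch.randn(4, 4)
w_q, s = quantize_weights_ternary(w)
print(w_q.unique())   # subset of tensor([-1., 0., 1.])
```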
Built on the Llama architecture, the 1.58-bit LLM offers a more efficient alternative to traditional post-training quantization methods. By focusing on matrix multiplication, the most computationally expensive operation in LLMs, this approach replaces floating-point multiplications with integer additions, resulting in substantial energy savings. Furthermore, the use of 8-bit activations enables more efficient processing of longer sequences. As model sizes continue to grow, the efficiency gains of the 1.58-bit LLM become even more pronounced, offering significant reductions in memory usage and energy consumption compared to FP16 models.
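The following sketch shows why ternary weights remove the need for multiplications in the matrix product: each output element becomes a signed sum of (8-bit-quantized) activations. The absmax activation scheme and the helper names are our assumptions for illustration; a production kernel would operate on packed integer data rather than masked tensor sums.

```python
import torch

def quantize_activations_int8(x: torch.Tensor, eps: float = 1e-5):
    """Absmax-style 8-bit activation quantization (illustrative)."""
    scale = 127.0 / x.abs().max().clamp(min=eps)
    x_q = (x * scale).round().clamp(-128, 127)
    return x_q, scale

def ternary_matmul(x_q: torch.Tensor, w_ternary: torch.Tensor) -> torch.Tensor:
    """y = x @ W^T when every weight is -1, 0, or +1.

    Each output element is a signed sum of activations: columns with weight
    +1 are added, columns with weight -1 are subtracted, and zeros are
    skipped. A tuned kernel would implement this with integer additions
    only; here the same arithmetic is written with masked sums for clarity.
    """
    pos = x_q @ (w_ternary == 1).to(x_q.dtype).T    # sum of activations hit by +1 weights
    neg = x_q @ (w_ternary == -1).to(x_q.dtype).T   # sum of activations hit by -1 weights
    return pos - neg
```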
Scaling laws are essential tools for predicting the performance of LLMs as we increase their size, training data, and computational resources. They provide empirical relationships that help estimate the capabilities of models that are too large to train in practice, offering insights into how performance metrics improve with scaling. Given the significant computational demands of training very large LLMs, scaling laws guide researchers in making informed decisions about resource allocation and future model design, effectively bridging the gap between current capabilities and the potential of even larger, more advanced models.
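As a concrete example of the kind of relationship a scaling law captures, the snippet below evaluates a Chinchilla-style parametric form, L(N) = E + A / N^alpha, for a few model sizes; the coefficient values are placeholders for illustration, not results from this study.

```python
import numpy as np

def scaling_law_loss(n_params, E, A, alpha):
    """Chinchilla-style parametric form: predicted loss as a function of
    model size N, with E acting as the irreducible loss term."""
    return E + A / np.power(n_params, alpha)

# Placeholder coefficients purely for illustration (not fitted values).
E, A, alpha = 1.7, 400.0, 0.35
for n in [1e8, 1e9, 1e10]:
    print(f"N = {n:.0e}  ->  predicted loss ~ {scaling_law_loss(n, E, A, alpha):.3f}")
```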
We created customized 1.58-bit and FP16 variants of the Llama 2 model and conducted multiple training runs at different parameter sizes to derive scaling laws for each variant. These laws allowed us to forecast the performance of our models at larger sizes, which we then compared to evaluate how scaling affects each variant.
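Below is a minimal sketch of how such a law can be fitted per variant from a handful of (parameter count, final loss) pairs, assuming the same parametric form as above; the data points and starting guesses are hypothetical, not our measured results.

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_vs_params(n, E, A, alpha):
    """Same parametric form as above: L(N) = E + A / N**alpha."""
    return E + A / np.power(n, alpha)

# Hypothetical (parameter count, final training loss) pairs for one variant;
# in practice these come from the actual training runs at each model size.
n_params = np.array([70e6, 160e6, 410e6, 1.0e9])
losses   = np.array([3.45, 3.10, 2.80, 2.55])

popt, _ = curve_fit(loss_vs_params, n_params, losses,
                    p0=[1.5, 1e3, 0.3], maxfev=10000)

# Extrapolate the fitted law to a size we did not train.
print(f"Predicted loss at 7B params: {loss_vs_params(7e9, *popt):.3f}")
```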
Our research reveals that the 1.58-bit model performs remarkably well given its extremely low precision. Its loss remains on par with that of the FP16 model as the parameter size increases. This result indicates that the 1.58-bit model retains competitive performance on downstream tasks, effectively balancing precision and efficiency.
The 1.58-bit model also delivers a significant leap in memory and energy efficiency compared to its FP16 counterpart. When we applied our GPU measurements to the scaling laws, it became evident that as model size increases, the 1.58-bit variant yields a growing reduction in both memory footprint and energy consumption. This efficiency gain is not merely incremental: it enables the deployment of larger models with lower resource overhead, paving the way for scalable AI solutions that are both cost-effective and environmentally friendly.
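A back-of-the-envelope calculation helps convey why the memory gap widens with scale: FP16 stores 2 bytes per parameter, while ternary weights can in principle be packed at roughly 1.58 bits (about 0.2 bytes) per parameter. The figures below are illustrative weight-only estimates and ignore activations, the KV cache, and per-tensor scale factors.

```python
# Rough weight-memory comparison (weights only; ignores activations,
# KV cache, optimizer state, and per-tensor scale factors).
FP16_BYTES_PER_PARAM = 2.0
TERNARY_BYTES_PER_PARAM = 1.58 / 8   # ~0.2 bytes if packed near the entropy limit

for n_params in [1e9, 7e9, 70e9]:
    fp16_gb = n_params * FP16_BYTES_PER_PARAM / 1e9
    ternary_gb = n_params * TERNARY_BYTES_PER_PARAM / 1e9
    print(f"{n_params/1e9:>5.0f}B params: FP16 ~ {fp16_gb:>6.1f} GB, "
          f"1.58-bit ~ {ternary_gb:>5.1f} GB ({fp16_gb/ternary_gb:.0f}x smaller)")
```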