When designing and training an LLM for text generation, there are two general paradigms to focus on. Pre-training is mostly focused on structure, language, and knowledge, while fine-tuning helps to guide the style and presentation of the generated outputs.
Domain specific tasks
Tuning allows businesses to tailor the model to their specific needs and industry-specific terminology, leading to improved accuracy and relevance. For example, a customer service chatbot would want to be kind, helpful, and understanding, in which case you could help fine-tune it on a representative dataset in that style.
Fine-tuning also helps when using LLMs for APIs or code outputs. You can use a code-based dataset to help the model learn to output correct language syntax rather than natural language responses, helping it to better interact with other computer systems.
Security and Privacy
Businesses can ensure their proprietary or sensitive data never leaves their controlled environment. This is particularly crucial for industries where data privacy regulations are stringent. Moreover, a custom tuned LLM can be trained to align more closely with a company’s brand voice and communication style, ensuring consistency across digital interactions. In contrast, a generic model might not always reflect the nuances and specifics of individual business needs or industry standards. This could be important if you are working in a regulated domain such as health or legal fields.
Many companies have been popping up this past year that merely act as “GPT wrappers” and the question is if there is truly a moat you can build with this approach. If you are using an openly available LLM API, expect your competitors to do the same. So, if you have your own unique data to fine-tune a model on, that can help give you an edge.
The main blocker you may see with your hardware is the memory requirement (GPU or CPU); after that it is mostly down to how much time you want to spend training. For example, if you are using a GPU with only 8GB of VRAM, you may need to use some of the latest memory optimization tricks or stick to the smallest models. Typically, you see people using the larger consumer GPUs (3090 or 4090 with 24GB) or enterprise versions that may have 40–80GB each.
There is little reason to purchase an enterprise GPU though. The cost to performance is not there, and the main reason you see them in cloud services is due to licensing allowing them to be split amongst different VM environments in large server racks. For the best value typically start with a single 3090 then scale up more from there. You can typically fit up to 4 GPUs in most larger cases until you start needing specialized hardware. Fortunately, the crypto mining industry has faltered over the past year, and there is a lot of surplus hardware being moved.
Do I need a GPU?
The short answer is yes. While there have been considerable advancements recently (past ~6 months) with inferencing on simpler hardware, training itself is much more demanding. Tricks like offloading and quantization now make inference feasible on consumer CPUs, but training still greatly benefits from a GPU. With Apple’s recent M-series hardware, training even on a laptop may become reasonable once the software catches up.
Cloud vs. Local
For occasional training, cloud-based providers are going to be the most cost effective. You can rent anything from a single GPU to groups of 8 or 16, for just an hour or multiple days, while training locally requires large upfront costs that take a long time to recoup. Recently, though, the increasing popularity of LLMs and generative AI in general is creating another GPU shortage, especially for the higher-end chips. It is now common to check a cloud provider and find not a single GPU server available.
I think we will see an important distinction in some companies soon, where those that are well positioned right now are the ones that have locked-in reservation contracts already or have even purchased the hardware themselves, while the ‘GPU poor’ will begin to struggle.
Keep in mind that if you are using local hardware, a fast internet connection is quite important when working with the larger models. Even the smallest llama 7B parameter version means downloading 10GB+ of base weights, and if you are saving to something like HuggingFace Hub you will want fast upload as well.
Sometimes I upload multiple gigabytes a day, so fiber is preferred for good symmetrical speeds. I have my training machine hardwired to a 2.5Gb/s fiber connection. You may be able to download models fast with cable/DOCSIS but the upload speeds are usually quite poor (20–50Mb/s range).
Where you keep your data and models is something that you can completely ignore at first but will come back to bite you later, as the storage will continue to grow and expand over time. I keep a lot of models downloaded locally, with almost 2TB of space used across different base models, sizes, fine-tunes, etc.
For example, llama 7B (a small LLM) is 13GB, and once you fine-tune, you must begin saving the extra weights somewhere too.
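The 13GB figure falls out of simple arithmetic: parameter count times bytes per parameter. Here is a rough sketch, assuming a nominal 7 billion parameters (the real count is a little lower, which is why the files land closer to 13GB) and counting weights only. Training needs extra memory on top of this for gradients, optimizer states, and activations.

```python
# Bytes needed to store one parameter in some common datatypes.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_size_gb(num_params: float, dtype: str = "fp16") -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # 7e9 is a nominal parameter count, used here for illustration.
    for dtype in BYTES_PER_PARAM:
        print(f"7B model in {dtype}: ~{weight_size_gb(7e9, dtype):.1f} GB")
```

The same function gives a quick feel for why 4-bit loading matters: the nominal 7B model drops from about 14 GB in fp16 to about 3.5 GB.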
Tip: By default, HuggingFace will save to some local ~/.cache directory. If you are doing a lot of testing with the Transformers library, you can easily fill your disk without realizing it. And if you are checkpointing while training, all those weights will also need to be saved for each checkpoint. So be aware of how much space you begin using.
df -h is my go-to command to quickly check overall disk space availability on any Linux/Mac-based system.
You can also use du -h to check storage sizes within a specific directory, as in the example below, which shows the largest directories in your current location. This may help you catch directories blowing up in size due to some mismanaged training configs that save too often.
du -h --max-depth=1 | sort -hr
54G ./models--tiiuae--falcon-7b-instruct
49G ./models--NousResearch--Nous-Hermes-Llama2-13b
41G ./models--tiiuae--falcon-7b
25G ./models--meta-llama--Llama-2-13b-hf
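If you would rather run the same check from inside a training script or notebook, here is a minimal sketch using only the standard library. The function names are my own, and the sizes are apparent file sizes, so they can differ slightly from what du reports.

```python
import os
import shutil


def free_space_gb(path: str = ".") -> float:
    """Free space on the filesystem that holds `path`, in gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def dir_size_bytes(path: str) -> int:
    """Total size of all regular files under `path`, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def largest_subdirs(path: str = ".", top: int = 5) -> list[tuple[str, int]]:
    """Immediate subdirectories of `path`, largest first."""
    sizes = [
        (entry.name, dir_size_bytes(entry.path))
        for entry in os.scandir(path)
        if entry.is_dir(follow_symlinks=False)
    ]
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Calling largest_subdirs on your cache directory before and after a run is a cheap way to spot a config that checkpoints too often.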
I like to have a separate storage drive mounted to /data where I save all my datasets and models. Then depending on the library (transformers for example), I will place os.environ["TRANSFORMERS_CACHE"] = "/data/hf" at the top of any training jobs or notebooks, to make sure everything is going where I intend it to.
NVMe will be the best and fastest for quickly loading weights, and the prices are approaching parity with the older SATA-style SSDs. Avoid spinning disks at all costs, as they will become a large bottleneck even on inference. In an ideal situation you can load a llama-7b model in just a few seconds on the fastest NVMe drives, while a spinning disk could take minutes just to load the weights.
Some rough ideas of typical peak read speeds:
- HDD: 100 MB/s
- SSD (SATA): 500 MB/s
- NVMe (PCIe): 3,000–8,000 MB/s
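To see what those speeds mean in practice, here is a back-of-the-envelope sketch that divides file size by read speed. It assumes the 13GB llama 7B weights mentioned earlier and takes the low end of the NVMe range. Real load times also include deserialization and the copy to the GPU, so treat these as best-case numbers.

```python
# Peak sequential read speeds in MB/s, taken from the list above.
READ_SPEED_MB_S = {"HDD": 100, "SATA SSD": 500, "NVMe": 3000}


def load_seconds(size_gb: float, speed_mb_s: float) -> float:
    """Best-case time to read `size_gb` gigabytes at `speed_mb_s` MB/s."""
    return size_gb * 1000 / speed_mb_s


if __name__ == "__main__":
    for drive, speed in READ_SPEED_MB_S.items():
        print(f"{drive}: ~{load_seconds(13, speed):.0f} s to read 13 GB")
```

That works out to roughly 130 seconds on a spinning disk, 26 seconds on a SATA SSD, and about 4 seconds on NVMe.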
Hardware scalability and parallel processing
It is generally recommended to use at least a couple of GPUs unless you are working with the smaller ~7B model sizes, but this is no longer a requirement thanks to LoRA, PEFT, DeepSpeed, etc.
A single 24GB consumer GPU will work for only the smallest models, while an enterprise style 40–80GB GPU such as A6000 or H100 can work with the medium sized models.
Once you go for 30B+ models you will want to rent a server with 4–8 GPUs for reasonable training times. It is still reasonable in cost to use Lambda Labs for this, where you can rent 8xA100 GPUs for about $9/hour for a total of 320GB of VRAM. This would make it possible to run training for any open-source model size that I am aware of. If you are on AWS, expect to pay 3–4 times this cost, and the UX of finding availability and spinning something up quickly can be difficult.
Popular DL frameworks for LLMs
By far the most common software stack is going to be PyTorch with CUDA (for NVIDIA GPUs). Historically AMD did not pursue ML uses so they are still catching up with their ROCm software, but there have been some advances recently. If you want something to make your life easy, just stick with NVIDIA for now!
Other frameworks you still see around include TensorFlow and JAX. TensorFlow has a greater focus on production deep learning and has not been keeping up with the latest LLM research implementations, so for our purposes here you will not need to worry about it. Google has also been promoting JAX, a more recent framework that provides a lower-level implementation of many machine learning operations, but you typically will not see many state-of-the-art models using it yet. It is mostly seen in research and learning style projects.
Transformers makes everything easy! The library sets an incredibly low bar to jump in and test out models, including training, deployment, and evaluation. Its modular design means you do not even need to be aware of the differences between various LLM implementations: merely pick a type of task (text-generation, classification, etc.) and a specific model-id, and Transformers will handle everything for you.
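As a rough illustration of the "pick a task and a model-id" idea, here is a minimal sketch built on the library's pipeline helper. The wrapper function is my own, and the import is done lazily inside it so that the snippet can be read and imported without the dependency installed.

```python
def build_generator(model_id: str):
    """Create a text-generation pipeline for a causal LM hosted on the Hub."""
    # Lazy import: transformers is only needed once the function is called.
    from transformers import pipeline

    return pipeline("text-generation", model=model_id)
```

With a small model such as gpt2 (chosen here only because it downloads quickly), usage would look like build_generator("gpt2")("Fine-tuning an LLM is", max_new_tokens=20), which returns a list of dictionaries with a generated_text key.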
The HuggingFace Hub, a combination Python library and website, is where all the latest models, datasets, and more can be found. You can upload, download, and even test them out in your web browser. It’s basically GitHub for machine learning! It even enables the use of revisions/commits for different versions of your models, which makes it quite simple to create multiple different model training runs and then evaluate them together afterwards.
At the time of writing this is still in beta, but I wanted to include it as it could help consolidate some lower-level libraries. This upcoming library is a wrapper for TensorFlow/Torch/JAX and allows you to use the same code and building blocks regardless of the underlying deep learning framework. I am optimistic about where this is heading and think it could be worth reading up on now.
Docker can be the simplest method, as you do not need to worry about software/hardware/driver compatibility because everything is handled within the container. But it can get a bit unwieldy and frustrating, especially if you are training models and constantly worrying about losing code changes or outputs due to the ephemeral nature of containers.
Conda is my preferred method. The first thing I do upon setting up any computer or logging into any server is set up MiniConda and use it to manage different Python environments. Since this field is moving quite fast, you will often see varying and specific requirements for versions of frameworks and libraries, even down to the CUDA version that works with each one. This means keeping separate and distinct environments for each of your projects.
I will typically have some general environments for toying around doing quick research style work in say Jupyter Notebooks. Then when working on production focused projects you can more exactly contain all your requirements and versioning within your directory or repo.
CUDA is the software from NVIDIA that lets deep learning frameworks such as Torch run their low-level matrix operations on your GPU through its drivers. Back in the ‘old’ days of say 2017 it was common for me to mess up a driver update on my 1080ti and spend the next day or two reinstalling CUDA or even the entire OS. Fortunately, CUDA management has come far since then. These days I recommend handling CUDA versions from within Conda, so that each Python environment or project can carry its own version without the system-wide pollution of upgrading or downgrading CUDA all the time.
There is plenty of generic and high-quality text compiled from books, articles, code, etc. available online. This may be used for pre-training, or you can use something a bit more domain specific for fine-tuning. Some of it enters a weird legal grey area. It may be considered fine to have your model read over a collection of books to learn a specific style, but what if you originally obtained those books via piracy?
Copyright activists are currently crusading against some of these methods, here is a recent example going over some of the headwinds facing the people making these datasets.
Examples of datasets:
Make your own data
If you want to compile or use some data you already have, you may find you need less data than you imagined. We are not talking about gigabytes of data here; even a few hundred samples can be fine, and you can augment with GPT-3 or GPT-4 to expand your collection.
For example, let us say you want an LLM to interact with an API. If this is something you have been running in production, you can extract all the samples from your logs and use that for training, or even use some sort of rules-based system to automatically create some API inputs and outputs to be used in the training.
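To make the rules-based idea concrete, here is a small sketch that fills prompt templates and emits the matching API call as the training target. The weather endpoint, its parameters, and the templates are all hypothetical stand-ins for whatever API you actually run.

```python
import json
import random

# Hypothetical API: a weather lookup that takes a city and a unit.
CITIES = ["Paris", "Tokyo", "Denver"]
UNITS = ["celsius", "fahrenheit"]
TEMPLATES = [
    "What is the weather in {city} in {unit}?",
    "Give me the current temperature for {city}, using {unit}.",
]


def make_sample(rng: random.Random) -> dict:
    """Build one prompt/completion pair from randomly chosen slot values."""
    city, unit = rng.choice(CITIES), rng.choice(UNITS)
    prompt = rng.choice(TEMPLATES).format(city=city, unit=unit)
    call = {"endpoint": "/weather", "params": {"city": city, "unit": unit}}
    return {"prompt": prompt, "completion": json.dumps(call)}


def make_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate `n` samples; a fixed seed keeps the dataset reproducible."""
    rng = random.Random(seed)
    return [make_sample(rng) for _ in range(n)]
```

Because the completion is generated from the same values as the prompt, every sample is valid by construction, which is the main advantage over samples written by an LLM.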
Typically, I will take the highest quality LLM available (usually GPT-4) and create a clear prompt with my goals for augmentation. Then using a small script, I will feed in that same prompt multiple times, and provide a random sample of 2–3 examples from my original dataset. That way you can get a diverse set of augmentations that follows a similar distribution from your original data. There are a lot of little tricks and techniques to use for LLM data augmentation, and results can vary a lot depending on the prompt and methods you use. I have not seen a lot of documentation about these methods so it may require a bit of experimentation on your own data to find what works best for you.
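The small script described above might look something like this sketch. The generate argument is a placeholder for whichever LLM client you use (it only needs to take a prompt string and return text), and the prompt layout is an assumption of mine, not a recommended format.

```python
import random
from typing import Callable


def build_augmentation_prompt(
    instructions: str, dataset: list[str], rng: random.Random
) -> str:
    """Combine fixed instructions with 2-3 randomly chosen real examples."""
    if len(dataset) < 2:
        raise ValueError("need at least two seed examples")
    k = rng.randint(2, min(3, len(dataset)))
    examples = rng.sample(dataset, k)
    shots = "\n".join(
        f"Example {i}: {text}" for i, text in enumerate(examples, start=1)
    )
    return f"{instructions}\n\n{shots}\n\nNew example:"


def augment(
    instructions: str,
    dataset: list[str],
    generate: Callable[[str], str],  # placeholder for your LLM call
    n_calls: int,
    seed: int = 0,
) -> list[str]:
    """Send the same instructions n_calls times, each with fresh examples."""
    rng = random.Random(seed)
    return [
        generate(build_augmentation_prompt(instructions, dataset, rng))
        for _ in range(n_calls)
    ]
```

Resampling the examples on every call is what keeps the outputs diverse while still anchored to the distribution of the original data.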
Preprocessing and formatting data
A lot of practitioners treat datasets as a black box they dump onto their GPUs. The work isn’t as exciting, but there can be a lot of value in manually reviewing a lot of your data. Even popular open-source datasets will contain errors and wrong labels. As more datasets are being built with community sourced labels and created in bulk with Mechanical Turk, this is something to be aware of going forward. I have had ChatGPT create some data augmentation for me before, and then thought to go back and validate some of the responses. There were multiple examples of outputs that were not valid for the specific API I was training on!
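A cheap automated pass can catch the kind of invalid outputs mentioned above before you review anything by hand. Here is a minimal sketch, assuming the generated samples are JSON strings and using a made-up schema of endpoints and required parameters.

```python
import json

# Hypothetical schema: allowed endpoints and the parameters each one requires.
API_SCHEMA = {"/weather": {"city", "unit"}, "/time": {"city"}}


def is_valid_call(raw: str) -> bool:
    """True if `raw` parses as JSON and matches a known endpoint exactly."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    endpoint = call.get("endpoint")
    params = call.get("params")
    if not isinstance(endpoint, str) or not isinstance(params, dict):
        return False
    required = API_SCHEMA.get(endpoint)
    return required is not None and set(params) == required


def filter_valid(samples: list[str]) -> tuple[list[str], list[str]]:
    """Split samples into (valid, rejected), preserving order."""
    valid = [s for s in samples if is_valid_call(s)]
    rejected = [s for s in samples if not is_valid_call(s)]
    return valid, rejected
```

It is worth reading through the rejected list as well, since a high rejection rate usually points to a problem in the augmentation prompt.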
Sometimes you will want to add additional special tokens to help with formatting your code. This can be used to replace substantial amounts of boilerplate that may be represented a lot in your data, such as some API formatting. I have used this trick to help speed up inference as an output that previously may have taken 10 tokens can now be replaced with a single token.
Be aware it will take time for the model to learn these new tokens, so try to use more samples and a longer training time if the token is completely new. The model must learn from scratch what it means, which could be an issue if you are doing a quick LoRA training run.
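Here is a rough sketch of how that can look with a Transformers model and tokenizer. The helper names and the example token are my own. The two calls inside add_special_tokens are the tokenizer method of the same name and the model's resize_token_embeddings, and the new embedding rows start out untrained, which is why the extra training time matters.

```python
def add_special_tokens(model, tokenizer, new_tokens: list[str]) -> int:
    """Register new tokens and grow the embedding matrix to match."""
    added = tokenizer.add_special_tokens(
        {"additional_special_tokens": new_tokens}
    )
    if added > 0:
        # New rows are freshly initialised and must be learned in training.
        model.resize_token_embeddings(len(tokenizer))
    return added


def compress(text: str, boilerplate_to_token: dict[str, str]) -> str:
    """Replace repeated boilerplate in the training text with its token."""
    for boilerplate, token in boilerplate_to_token.items():
        text = text.replace(boilerplate, token)
    return text
```

You would run compress over the dataset before tokenizing, and expand the token back into the boilerplate after generation.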
The first option is to update all the model parameters, which is just like a continuation of the original pre-training. This will typically require much more GPU memory and compute power than the other techniques. For the latest billion+ parameter models you may even need multiple top-end GPUs just to fit everything in memory to begin the fine-tuning process.
Over the past couple of years some new techniques have started popping up that allow for specific layers to be updated only. This means you do not have to optimize and update over a substantial portion of the parameters. All the recent implementations you will see online and in the open-source community involve some sort of parameter efficient fine-tuning, and this is what we will focus on in this guide. The HuggingFace PEFT library is a good place to start.
Low rank adaptation (LoRA)
As the parameters of an LLM are simply layers and blocks of large matrices, we can use certain mathematical techniques to reduce the number of parameters that need training. Rather than updating a full weight matrix, LoRA freezes it and learns the update as the product of two much smaller low-rank matrices, and we train only those. When you are ready to deploy, these small layers are bolted on to the original LLM and inference can then be run over your specific version.
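The savings are easy to work out. For a frozen weight matrix W of shape d_out x d_in, LoRA trains B (d_out x r) and A (r x d_in), and the effective weight becomes W + (alpha / r) * B @ A. The sketch below just counts parameters. The 4096 x 4096 layer size and rank 8 are values I picked for illustration.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when updating the whole weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for the two low-rank factors B and A."""
    return r * (d_out + d_in)


if __name__ == "__main__":
    d, r = 4096, 8  # illustrative layer size and LoRA rank
    full, lora = full_params(d, d), lora_params(d, d, r)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For this layer that is 65,536 trainable parameters instead of 16,777,216, or about 0.39% of the original.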
This is the same method as above, but now instead of just making the matrices smaller, we also reduce the size of the data values themselves. It is common to go as low as 4-bits of information per parameter rather than the original FP16 or FP32 data types. This means the memory requirements are much lower, so one can begin to load and train models on simpler consumer hardware or even their laptops.
Be aware this area is advancing rapidly. A lot of hardware optimizations are being made to ensure quick performance, as many GPUs are not really designed to work well with these data types. While memory requirements are reduced, naive implementations may be slower than running a full FP16 inference or training update, and there is work being done to optimize these methods on current hardware.
BitsAndBytes — One of the top libraries right now for doing the work on making quantization quick and easy to implement. It has recently been incorporated into the HuggingFace library and can be enabled with a simple flag whenever you want to load a model with quantization.
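For reference, here is a minimal sketch of loading a model in 4-bit through Transformers. It assumes the bitsandbytes and accelerate packages are installed and a CUDA GPU is available. The nf4 quant type and fp16 compute dtype are common choices rather than requirements, and the wrapper function is my own.

```python
def load_in_4bit(model_id: str):
    """Load a causal LM with 4-bit weights via bitsandbytes."""
    # Lazy imports: heavy dependencies are only needed when this is called.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",  # assumed choice
        bnb_4bit_compute_dtype=torch.float16,  # assumed choice
    )
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
```

Older versions of the library also accepted load_in_8bit=True or load_in_4bit=True passed directly to from_pretrained, which is the simple flag referred to above.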
Other neat techniques
The past year has brought a lot of open-source help for those with limited resources trying to work with LLMs. One of the most popular is DeepSpeed/ZeRO, which brings along a few techniques to help manage the memory requirements and compute of all the different layers within larger models. Some benefits include:
- Split layers across multiple GPUs, or even offload to CPU and system RAM for some tasks.
- Partition up the optimizer states to reduce redundancy, further reducing memory requirements.
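- PEFT (Parameter Efficient Fine Tuning) trains only a small set of added or selected parameters, and is often combined with quantization, which converts the model parameters to smaller datatypes to help fit more on a single GPU at the cost of precision. (both are covered in the sections above)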
- GGML (for inference) allows for splitting between CPU and GPU, especially useful on unified mac m-series computers. Designed from the ground up in C to be efficient as possible on consumer CPU hardware.
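The first two bullets map onto a handful of DeepSpeed config keys. Here is a minimal sketch of a ZeRO stage 2 config with the optimizer offloaded to CPU, written as a Python dict so it can be dumped to JSON. The batch size and accumulation values are placeholders to tune for your own hardware.

```python
import json

# Minimal ZeRO stage 2 config; numeric values are illustrative placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # partition optimizer states and gradients across GPUs
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}


def write_config(path: str) -> None:
    """Save the config as JSON so a launcher or trainer can read it."""
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=2)
```

If you train through the Transformers Trainer, the path to the saved file can be passed as the deepspeed argument of TrainingArguments.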
Train the model
First you will need the model and tokenizer loaded into memory. You can grab something automatically from the HuggingFace Hub (the easiest method) by just providing a specific author and model id combination. Below is the most basic way to load a model. This will run on the original weight datatypes and may fill up all but the largest 24GB consumer GPUs.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
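Here is a scaled down recommended method that utilizes LoRA to reduce memory requirements:
from peft import LoraConfig
from trl import AutoModelForCausalLMWithValueHead
model_id = "meta-llama/Llama-2-7b-hf"
# The original arguments are not shown; the values below are illustrative assumptions.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
# Passing peft_config wraps the base model with LoRA adapters (assumed usage).
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_id, peft_config=lora_config)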
TIP: Set a permanent cache directory on a drive that is both fast and large. By default HF will use ~/.cache, which can be an issue if your home directory is not on your preferred disk. I purchased a secondary 2TB NVMe drive and mounted it to /data, on which I store all my datasets and models for machine learning. Then I just set the TRANSFORMERS_CACHE environment variable to match that directory.
It is important to first begin with simple and short runs to make sure all the pieces work together! I have had issues before where I spent hours training a model only to have it try and save to a non-existent directory at the end, crash, and lose all my progress.
Some ideas and tips:
- Start with just 10 steps, see if it works, any OOM issues?
- Train, save, upload, inference. Try to upload the model somewhere, then load it in a notebook and test that it works correctly.
- You do not want to spend a day training to find out the end of the script has a bug, or you run out of storage space!
- Depending on how your code is written you may be creating multiple checkpoints every single time you re-train. It is easy to overlook this until you start getting critical warnings from your operating system about your partitions being completely full!
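Several of these failures can be caught before the first training step. Here is a small pre-flight sketch using only the standard library. The function name and the free-space threshold are my own choices.

```python
import os
import shutil
import tempfile


def preflight(output_dir: str, min_free_gb: float) -> None:
    """Fail fast if the output directory is unusable or the disk is too full."""
    # Create the directory now instead of discovering it is missing at save time.
    os.makedirs(output_dir, exist_ok=True)

    # Prove we can actually write there; raises OSError if we cannot.
    with tempfile.NamedTemporaryFile(dir=output_dir):
        pass

    free_gb = shutil.disk_usage(output_dir).free / 1e9
    if free_gb < min_free_gb:
        raise RuntimeError(
            f"only {free_gb:.1f} GB free in {output_dir}, need {min_free_gb} GB"
        )
```

A reasonable threshold is the size of one checkpoint multiplied by the number of checkpoints you expect to keep.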
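Example: Here is purely an example of a minimum setup that will train for 10 steps. Try something like this first to make sure there are no issues with memory, storage, or your code. The original arguments are not shown, so the values below are illustrative assumptions, and train_dataset stands in for your own tokenized dataset.
from transformers import Trainer, TrainingArguments
trainer = Trainer(model=model, args=TrainingArguments(output_dir="/data/test-run", max_steps=10), train_dataset=train_dataset)  # assumed arguments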
During the training process, you will typically see some loss function outputs being printed. This can represent the model’s learning ability and confidence in its results but is not guaranteed to line up with real-world performance. Sometimes you may want to add task specific metrics, such as validating API inputs if you are building an LLM agent to interact with another system. Or checking that it is producing valid syntax if you are having it write code, such as correct JSON formatting.
Try and build a collection of test cases. This is common in many LLM research papers you will see where they have collections of tasks such as logic puzzles, coding challenges, or multiple-choice tests that can be automatically run against the LLM. This can be an extra step outside of just monitoring the loss of your model, to ensure it is heading in the direction you really want.
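A test collection does not need to be elaborate to be useful. Here is a minimal sketch where each case pairs a prompt with a check on the output, and generate is a placeholder for however you call your model. The two cases are made-up examples.

```python
import json
from typing import Callable


def is_valid_json(text: str) -> bool:
    """True if the model output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical test cases: each pairs a prompt with a check on the output.
TEST_CASES = [
    {
        "prompt": "What is 2 + 2? Answer with a number only.",
        "check": lambda out: out.strip() == "4",
    },
    {
        "prompt": "Return a JSON object with a 'city' key.",
        "check": is_valid_json,
    },
]


def pass_rate(generate: Callable[[str], str], cases: list[dict]) -> float:
    """Fraction of test cases whose check passes on the model output."""
    results = [bool(case["check"](generate(case["prompt"]))) for case in cases]
    return sum(results) / len(results)
```

Running pass_rate after each checkpoint gives you a second signal next to the loss curve, and the two do not always move together.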
I have seen issues of data contamination recently, where some of these common evaluation task answers have found their way into the training dataset. So, treat this like any other machine learning problem and look out for the typical data leakage traps. If you constantly tune and modify your model to perform well on a specific task you will begin leaking that information and over-fitting to said task.
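One simple way to look for contamination is to check whether long word sequences from your evaluation items also appear in the training text. The sketch below uses whitespace tokens and 8-word n-grams. Both are arbitrary choices of mine, and paraphrased leaks will slip through.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All n-word sequences in `text`, lowercased and whitespace-tokenized."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def contaminated(
    eval_items: list[str], train_texts: list[str], n: int = 8
) -> list[str]:
    """Evaluation items that share at least one n-gram with the training data."""
    train_grams: set[tuple[str, ...]] = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return [item for item in eval_items if ngrams(item, n) & train_grams]
```

Anything this flags is worth removing from either the training set or the evaluation set before you trust the scores.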
Nothing beats human evaluation
Just go through and talk to the LLM yourself, try some inputs you know well and see how it responds. Also, slight tweaks in input formatting can be important. What if you reword your question, do you get completely different responses from the model? It may have just learned to answer a specific style of question well.
Tip: I like to use non-augmented data for holdout. This may vary depending on your specific use-case, but I like to hold as true to real-world production uses as I can. Sometimes the data augmentation is not perfect and just creates easily predictable outputs (remember it was all created by an LLM already).
Inference and serving of LLMs is an entirely different area with as much or more development and discussion time being spent right now. As important as training can be, getting the models to produce new outputs is the final goal, and most people just want something that can run, and run fast. We will go over some basics of deploying the model here.
If you did a parameter efficient tune with LoRA as discussed above and then tried to save your model, you may notice it is only a few megabytes to a few hundred megabytes in size. That is because HF will not save the full original weights, just the small section you trained. Fortunately, there is a single line of code that will put everything back together.
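With the PEFT library, the likely candidate for that single line is merge_and_unload, which folds the adapter weights into the base model. Here is a minimal sketch around it, assuming you have the base model id and a directory holding the saved adapter. The wrapper function and its argument names are my own.

```python
def merge_lora(base_model_id: str, adapter_path: str, output_dir: str):
    """Fold a saved LoRA adapter into its base model and save the result."""
    # Lazy imports: heavy dependencies are only needed when this is called.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    model = PeftModel.from_pretrained(base, adapter_path)
    merged = model.merge_and_unload()  # the merge step itself
    merged.save_pretrained(output_dir)
    return merged
```

The merged model is a plain Transformers model again, so it saves at full size and loads without PEFT installed.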
Where do I put my models?
I typically recommend using the HuggingFace Hub as they provide free repositories to store all the massive models you would like, along with version control like any other git system. This is simpler and more reliable than just saving everything to disk and trying to keep track of which files belong to which runs.
Here is a quick example of saving your model to the cloud. I like saving to the Hub because it provides a seamless interface when loading the models later, on any system you want. Be aware this can take a LONG time if you have a slow internet connection. While a symmetrical gigabit fiber connection may only take 2–3 minutes for the smaller 7B models, if you have a cable connection with only 20Mb/s up, expect it to take an hour or two.
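A minimal sketch of such an upload, plus the arithmetic behind the timing estimates. It assumes you are already logged in to the Hub (for example via huggingface-cli login), and repo_id is a placeholder like your-username/my-model. The upload time is a best case that ignores protocol overhead.

```python
def upload_minutes(size_gb: float, upload_mbit_s: float) -> float:
    """Best-case minutes to upload `size_gb` gigabytes at `upload_mbit_s` Mb/s."""
    return size_gb * 8000 / upload_mbit_s / 60


def push_to_hub(model, tokenizer, repo_id: str) -> None:
    """Upload the model weights and tokenizer files to the same Hub repo."""
    model.push_to_hub(repo_id)
    tokenizer.push_to_hub(repo_id)


if __name__ == "__main__":
    for speed in (20, 1000):
        print(f"13 GB at {speed} Mb/s: ~{upload_minutes(13, speed):.0f} min")
```

For a 13 GB model that is about 87 minutes at 20 Mb/s and under 2 minutes at a full gigabit, which lines up with the ranges above.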
If you have saved it via the method above, you will have access to my favorite tool for inference, the text-generation-inference library. It comes in a standalone docker container that contains all the code and setup needed to run out of the box optimized inference on any Hub-hosted model.
With this method below you can either use your custom model, or a pre-trained model such as falcon, llama, etc. You can also choose to shard across more than 1 GPU if available, to fit the larger models, or if you would like to not use quantization. I also make sure to specify the cache directory for the container, so I do not duplicate models across various locations on my computer.
sudo docker run \
--gpus all --shm-size 1g -p 8080:80 \
-v $volume:/data ghcr.io/huggingface/text-generation-inference:latest \
--model-id $model --num-shard $num_shard --trust-remote-code \
--quantize bitsandbytes
Once the model is loaded and deployed from the code above, you can call it across the network from a simple curl command, or directly from within Python as well.
curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"Hello","parameters":{"max_new_tokens":128}}' \
-H 'Content-Type: application/json'
from text_generation import Client
client = Client("http://localhost:8080")
response = client.generate("Hello", max_new_tokens=128)
Better prompt instructions
As models get faster with greater context, it is possible to just include detailed prompts for exactly the style and type of outputs you want. You can add paragraphs or even pages into the input, and in many cases, this may be plenty for your task.
Along with more general prompting instructions, you can provide explicit inputs and outputs from the task you have. Say you want the LLM to respond in the style of Shakespeare. You can simply ask for it in the prompt or include some specific examples.
- Translate the following sentences into Shakespearean English:
- 1) “Hello, how are you?” — “Hail, how art thou?”
- 2) “I don’t understand.” — “I fathom not.”
- 3) “That’s a good idea.” — “That’s a noble thought.”
While this may be a simple example, it should help to get you to think of the ways this method could be useful for your purposes as well.
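If you keep your examples in a list, building a prompt like this takes only a few lines. This sketch reuses the Shakespeare pairs from above. The arrow separator and the numbering format are my own choices.

```python
EXAMPLES = [
    ("Hello, how are you?", "Hail, how art thou?"),
    ("I don't understand.", "I fathom not."),
    ("That's a good idea.", "That's a noble thought."),
]


def build_few_shot_prompt(
    instruction: str, examples: list[tuple[str, str]], query: str
) -> str:
    """Instruction, numbered input/output pairs, then the open-ended query."""
    lines = [instruction]
    for i, (source, target) in enumerate(examples, start=1):
        lines.append(f'{i}) "{source}" -> "{target}"')
    # Leave the last item unfinished so the model completes it.
    lines.append(f'{len(examples) + 1}) "{query}" ->')
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        build_few_shot_prompt(
            "Translate the following sentences into Shakespearean English:",
            EXAMPLES,
            "Where is the library?",
        )
    )
```

Ending the prompt on the unfinished item nudges the model to continue the pattern rather than comment on it.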
Retrieval-augmented generation (RAG)
While we mostly rely on internal model knowledge, many times it makes more sense to add some external source of data to the input to help it out. This is common for retrieving facts, where we can grab some high-level results from an internet search or Wikipedia to ensure the facts stay correct.
This is especially useful when answering questions over some specific corpus of documents or a knowledge base. We can dump in a few paragraphs or pages of information and let the LLM sift through that and summarize it before returning a final response.
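To show the shape of the idea, here is a toy sketch that ranks documents by word overlap with the question and pastes the best ones into the prompt. Real systems typically use embeddings and a vector index instead of word overlap. The function names and the prompt wording here are my own.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased alphanumeric words in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """The k documents sharing the most words with the query."""
    query_words = words(query)
    ranked = sorted(docs, key=lambda d: len(query_words & words(d)), reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Put the retrieved context ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, docs, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The resulting prompt goes to the model as usual, so this works with any LLM and needs no fine-tuning at all.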