Deep Dive: Demystifying LLM Technology Volume 2

Behind the Scenes of Training LLMs

Mornin’ miners⛏️,

Happy Tuesday!

Digger Insights is your easy-to-read daily digest about tech. We gather tech insights that help you gain a competitive advantage!

Let’s get to it!

Today’s Deep Dive: 🤖Demystifying LLM Technology Volume 2 - Behind the Scenes of Training LLMs💻

Demystifying LLM Technology Volume 2 - Behind the Scenes of Training LLMs

We talked about the fundamentals of LLMs last week in volume 1 of this Deep Dive series. We’ve learned that LLMs have integrated themselves into the digital landscape and have transformed industries with their ability to analyze human language.

By knowing more about what they are, how they work, how they have developed over the years, as well as their purpose, we’ve familiarized ourselves with LLMs a bit more. This time, we’ll be learning how they came to be the revolutionary technology they are now, along with the risks of creating and using LLMs without care.

Preprocessing and Training

LLMs are trained on massive amounts of data. As deep learning algorithms*, LLMs can study vast datasets, which allows them to execute language processing tasks that used to be unique to humans: generating, summarizing, predicting, and translating text. Before training begins, however, the data has to be collected and prepared, a stage known as preprocessing. This entails searching, sorting, and sharing subsets of data.

The data used to train LLMs mostly comes from books, articles, web pages, and open datasets, all of which are easy to find these days. Preprocessing also entails cleansing the data: removing junk and noisy data, fixing inconsistencies by changing letters to lowercase or correcting spellings and typos, eliminating stop words, and tokenizing, the process of turning text into sequences of tokens.
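As a rough illustration, here’s what a bare-bones cleansing step might look like in Python. This is a minimal sketch, not any real LLM pipeline’s code; the tiny stop-word list and the clean() helper are our own stand-ins.

```python
import re

# A tiny, illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}

def clean(text: str) -> list[str]:
    """Lowercase the text, strip noisy characters, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove junk punctuation and symbols
    return [word for word in text.split() if word not in STOP_WORDS]

print(clean("The QUICK brown fox!! jumps over the lazy dog..."))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
```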

We briefly discussed the concept of tokens last week, but to understand it more easily, you can use GPT’s token encoder and decoder to either enter text to tokenize it or convert tokens to text. The site also provides a list of tokens and their corresponding integers.
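If you’d rather experiment locally than in the browser, OpenAI’s open-source tiktoken library exposes the same kind of encoder and decoder. A minimal sketch (the sample sentence is ours):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by newer GPT models

tokens = enc.encode("LLMs turn text into sequences of tokens.")
print(tokens)              # a list of integers, one per token
print(enc.decode(tokens))  # converts the integers back into the original text
```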

*Deep Learning Algorithm: A type of machine learning that works based on the structure and function of the human brain.

Photo Courtesy of Simon Willison

After collecting and cleansing, the LLM can then start training. Training usually begins by feeding the model a sequence of words and having it predict the next word in that sequence. This process is repeated millions and millions of times until the model reaches its most optimal level of performance.
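To make “predict the next word” concrete, here is a heavily simplified PyTorch sketch of a single training step. The vocabulary size, model, and random data are toy stand-ins, and we use a small recurrent network purely to keep the sketch short; real LLMs are far larger transformer models.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64  # toy numbers; real LLMs are vastly larger

class TinyLM(nn.Module):
    """A toy next-token predictor: embed the context, encode it, project to vocabulary logits."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, token_ids):
        hidden, _ = self.rnn(self.embed(token_ids))
        return self.head(hidden)  # logits for the next token at every position

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step: the targets are simply the inputs shifted left by one token.
batch = torch.randint(0, vocab_size, (8, 32))   # 8 fake sequences of 32 token ids
inputs, targets = batch[:, :-1], batch[:, 1:]
loss = loss_fn(model(inputs).reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Repeat that step across billions of tokens and you have, in spirit, the pre-training loop.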

So, essentially, you take unstructured data and transform it into a format that the model can understand and learn from. Once the model learns from the preprocessed data, the foundation for the LLM is formed. The process of building a high-performing LLM doesn’t stop there, though.

Fine-Tuning and Transfer Learning

Once the model goes through basic training, an annotated test dataset is used to assess the model and measure its performance. If the test results show a sub-optimal performance from the model, it needs to undergo another training process referred to as fine-tuning. Fine-tuning can be done by adjusting a model’s parameters, changing a model’s architecture, or further training the model with more data to boost its performance quality.
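As a rough sketch of the “further training with more data” route: fine-tuning usually reuses the same next-token objective as pre-training, just on the new data and with a much smaller learning rate. The names below (the TinyLM from the earlier sketch and a hypothetical domain_batches iterable) are placeholders, not a particular library’s API.

```python
import torch
import torch.nn as nn

# Assumes `model` is an already pre-trained next-token predictor (e.g. the TinyLM above)
# and `domain_batches` yields tensors of token ids drawn from the new, more specific data.
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # much gentler than pre-training

model.train()
for batch in domain_batches:
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = model(inputs)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```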

If the default dataset and the boosted dataset still prove to not be enough to train the model properly, a different kind of fine-tuning method known as transfer learning can be used.

The transfer learning process is done by obtaining a pre-trained model that was trained on data similar to that of the model still in training. Pre-trained models are not too difficult to find; many, for example, are pre-trained on ImageNet, an image database that is free for researchers to use. Because its datasets cover real-life objects and scenes, models pre-trained on it already carry some sense of knowledge of the world.

Photo Courtesy of ImageNet

The pre-trained model is then loaded so researchers can reuse its structure and the knowledge stored in its layers. Those existing layers are frozen to prevent the knowledge in them from being destroyed. Once frozen, fine-tuning is performed by adding new layers on top of the pre-trained model and training only those, while the original layers stay intact. The result is a new, improved model that carries the pre-trained knowledge over to the task at hand.
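In PyTorch, the freeze-and-extend step looks roughly like the sketch below. We use a torchvision ResNet pre-trained on ImageNet only because that is the database the paragraph mentions; the 10-class output layer is an arbitrary example, and a recent torchvision is assumed.

```python
import torch
import torch.nn as nn
from torchvision import models  # pip install torchvision

# Load a model pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the existing layers so the knowledge already stored in them is preserved.
for param in model.parameters():
    param.requires_grad = False

# Add a new, trainable layer on top for the new task (here, 10 example classes).
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the new layer's parameters get updated during fine-tuning.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
```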

Data Sources

Most of the big tech companies that have developed chatbots and LLMs, like OpenAI with ChatGPT, Anthropic with Claude, and Google with Bard and PaLM, are quite resistant to revealing their data sources, or have at least been vague about them. So far, the companies that have been most open about their LLMs’ training data are Meta and OpenAI, at least for their earlier models.

OpenAI released a research paper alongside GPT-2 which says that GPT-2 was trained on WebText, limited to data up to December 2017. WebText is a dataset OpenAI built from about 45 million outbound Reddit links. GPT-3 uses several different datasets, including Common Crawl, WebText2, and Wikipedia.

OpenAI has been slightly vague about GPT-4’s data sources and has only revealed that it was pre-trained with publicly available data from the web and from third-party providers until September 2021.

When Meta released LLaMA last February, they revealed in a research paper that they used CommonCrawl, GitHub, Wikipedia, ArXiv, StackExchange, C4, and “Books.” What the company means by “Books” here is two book corpora: the Gutenberg Project, consisting of books in the public domain, and the Books3 section of The Pile, which consists of around 200,000 pirated eBooks, something that got Meta into a little trouble.

Photo Courtesy of Simon Willison

Author Sarah Silverman sued Meta for copyright infringement, with claims that Meta’s LLaMA was trained on illegally-acquired datasets containing her work. Silverman, along with authors Christopher Golden and Richard Kadrey, also sued OpenAI with the same claims.

Challenges

Legal issues aren’t the only challenge LLM developers have come to face. The performance of LLMs depends greatly on the amount of data and training they receive. But while larger datasets and larger models generally mean more capabilities, the quality of the training matters just as much.

With such diverse web and book corpora, it is unavoidable that models will encounter toxic and biased content, which is why filtering the data is exceedingly essential. Paying attention to this content helps ensure that models do not exhibit bias or produce harmful output, avoiding ethical issues. One way the filtering can be done is with tools like Perspective API.
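As a sketch of how that filtering might look: the hosted Perspective API can be queried through Google’s API client library to score each document for toxicity before it enters a training set. You need your own API key, and the 0.8 threshold below is an arbitrary choice for illustration.

```python
from googleapiclient import discovery  # pip install google-api-python-client

API_KEY = "your-perspective-api-key"

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def toxicity(text: str) -> float:
    """Return Perspective's TOXICITY score (0 to 1) for a piece of text."""
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    response = client.comments().analyze(body=body).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Drop documents whose toxicity score exceeds an (arbitrary) threshold.
documents = ["a perfectly friendly sentence", "..."]
kept = [doc for doc in documents if toxicity(doc) < 0.8]
```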

It is just as important to be attentive to personally identifiable information (PII). If models are pre-trained without masking or removing PII, including proper names, organization names, medical records, social identification numbers, and so on, copyright infringement will not be the only legal issue developers face. Tools like Presidio and pii-codex can be used to detect, analyze, and handle PII in datasets.
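Presidio, for instance, can flag and mask PII in a few lines. A minimal sketch based on its analyzer and anonymizer engines (the sample sentence is made up; Presidio also needs a spaCy English model installed):

```python
# pip install presidio-analyzer presidio-anonymizer
# plus a spaCy model, e.g.: python -m spacy download en_core_web_lg
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "John Smith's phone number is 212-555-0123."

# Detect PII entities (names, phone numbers, and so on) in the text.
analyzer = AnalyzerEngine()
findings = analyzer.analyze(text=text, language="en")

# Replace the detected spans with placeholders before the text reaches a training set.
anonymizer = AnonymizerEngine()
result = anonymizer.anonymize(text=text, analyzer_results=findings)
print(result.text)  # e.g. "<PERSON>'s phone number is <PHONE_NUMBER>."
```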

Photo Courtesy of Pypi

A process that follows the basic training of LLMs, known as Reinforcement Learning from Human Feedback (RLHF), has also become one of the main ways to make language models behave in a certain way. RLHF trains a “reward model” from the feedback of human testers, which then guides the language model, rewarding it when it responds well. This method helps prevent LLMs from giving unethical or dangerous responses.
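The reward model at the heart of RLHF is typically trained on human preferences between pairs of responses. Here is a toy sketch of that pairwise step; the random feature vectors stand in for real representations of a preferred and a rejected response.

```python
import torch
import torch.nn as nn

# A toy reward model: maps a response's feature vector to a single scalar score.
reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Stand-ins for representations of responses a human preferred vs. rejected.
chosen, rejected = torch.randn(16, 128), torch.randn(16, 128)

# Pairwise preference loss: push the preferred response's score above the rejected one's.
loss = -nn.functional.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

# The trained reward model then scores the LLM's outputs during reinforcement learning.
```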

Environmental Impact

The processes of LLM training, from ensuring your data is cleaned properly to customizing LLMs so they produce reliable outputs, require substantial computational power. Not only is this costly, it can also affect the environment significantly.

Training GPT-3 alone is estimated to have consumed about 185,000 gallons of water for data center cooling, roughly the amount needed to fill a nuclear reactor’s cooling tower.

Not an actual picture of training GPT-3

Aside from water, GPT-3’s training required an enormous amount of electricity and emitted an estimated 502 metric tons of carbon. That amount of electricity is equivalent to powering the average American home for hundreds of years, and the carbon emissions were about 500 times those of a round-trip flight between New York and San Francisco. This goes to show how important it is for LLM developers to start finding ways to continue their work without harming the environment.

LLMs, like any other form of technology in our modern society, come with beauty as well as repercussions, and it is the duty of creators and users to stay mindful, aware, and conscientious.

Meme & AI-Generated Picture

Not an actual picture of training GPT-3

Job Posting

  • BigCommerce - Tier 2 Technical Support Representative - Austin, TX (Remote/Hybrid)

  • GoPuff - Graphic Designer, Creative Series - Philadelphia, PA (Remote/Hybrid)

  • MemFault - Technical Documentation Lead - New York City, NY (Remote/Hybrid)

  • ServiceNow - Sr. Technical Consultant, Finance and Supply Chain Solutions - Portland, OR (Remote)

Promote your product/service to Digger Insights’ Community

Advertise with Digger Insights. Digger Insights’ Miners are professionals and business owners with diverse industry backgrounds who are looking for interesting and helpful tools, products, services, jobs, events, apps, and books. Email us at [email protected]

Give us feedback at [email protected]
