Unmasking LLMs: The AI Behind ChatGPT and Gemini

Discover how AI systems that understand and generate human-like text are taking your job.

So, you think these Large Language Models, these LLMs (ChatGPT, Gemini, Claude, and so forth), are the bee's knees? All the rage, transforming the way we interact with technology and whatnot? Well, let's not get ahead of ourselves, shall we? Before we start waxing lyrical about their supposed brilliance, let's have a look under the bonnet.

We build AI systems all the time here at Alia Systems, and one of the things we pride ourselves on is that our articles aren't written by LLMs. Hence, if you've met me, the way you read this will be the way I actually talk. We do use AI to generate some images, and we recently created our first AI video, but you won't find that here: this is all, for better or worse, from my own 20+ years of experience wrestling with electrons in cyberspace, and now with LLMs.

The Sexy Mysterious Beasts of AI

What exactly are these LLMs? How do they work? And where did they come from, anyway? Are they just another flash in the pan, a technological fad doomed to fade into obscurity? Or are they truly the harbingers of a new era in communication and education?

Well, I'm here to tell you that these LLMs are not to be underestimated. They're a force to be reckoned with, a formidable presence in the ever-evolving landscape of artificial intelligence. But before we get carried away with their potential, supplicating in front of our nascent Robot Overlords, let's delve into their origins and understand how they came to be.

So, put on your thinking caps, and let's get down to brass tacks.

Building Blocks: Neural Networks and Attention

At the heart of LLMs lies a technology called neural networks.

Simplified Neural Networks are recreated in software systems

Inspired by the human brain, these networks consist of interconnected nodes that process information and learn from data.
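To make "interconnected nodes" less abstract, here's a minimal sketch in Python of what a single node actually computes: a weighted sum of its inputs, a bias, and a squashing function. The weights here are made up purely for illustration; in a real network they are learned from data.

```python
import math

def neuron(inputs, weights, bias):
    """One 'node': a weighted sum of its inputs plus a bias,
    squashed into the range 0..1 by a sigmoid."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-total))

# A tiny two-layer network: three inputs -> two hidden nodes -> one output.
x = [0.5, -1.2, 3.0]  # an example input
hidden = [
    neuron(x, [0.1, 0.4, -0.2], bias=0.0),
    neuron(x, [-0.3, 0.2, 0.5], bias=0.1),
]
output = neuron(hidden, [0.7, -0.5], bias=0.2)
print(output)  # a single number between 0 and 1
```

Stack millions of these nodes in layers, let training adjust the weights, and you have the raw material of every model discussed below.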

The aim of this article is to help you, or your company, understand the state of play of AI today, and what led us to a moment in time where most, if not all, jobs are going to be eliminated or transformed by Artificial Intelligence. Those people and companies who adapt to this new industrial revolution will succeed; those who don't will miss out.

In recent years, a particular type of neural network architecture called the Transformer has revolutionised the field.

The Transformer on the left, with the attention mechanism detailed on the right. Image Credit: Google, and Lilian Weng.

This new neural network architecture was unveiled in Google's now-famous 2017 research paper, "Attention Is All You Need".

The Transformer's key innovation is the "attention mechanism", a process that allows the model to focus on the most relevant parts of text when generating responses. It highlights the most important words in a sentence, enabling the model to understand complex relationships and nuances in language. Here's an overview of the entire Transformer architecture from that famous paper.

This breakthrough was a turning point in AI research, enabling the training of larger and more powerful language models. Backpropagation, an algorithm for efficiently training neural networks, underpins all of this progress: it allows the systematic correction and retraining of neural networks to get closer and closer to the desired outcomes. You can see a lovely animation of it below.

Image Credit: machinelearningknowledge.ai
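Backpropagation computes, for every weight in the network, how much a small change in that weight would alter the final error; gradient descent then nudges each weight accordingly. Here's a deliberately tiny sketch of that correct-and-retrain loop, shrunk down to a single weight, with the numbers chosen purely for illustration:

```python
# We want w * x to equal the target, i.e. w should converge to 5.0.
w = 0.0               # the model's single parameter, initially wrong
x, target = 2.0, 10.0
lr = 0.05             # learning rate: how big each correction step is

for step in range(100):
    prediction = w * x
    error = prediction - target
    gradient = 2 * error * x   # d(loss)/dw, for loss = error ** 2
    w -= lr * gradient         # nudge the weight to shrink the loss

print(round(w, 3))  # ~5.0: the 'network' has learned
```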

The Rise of Large Language Models

The development of LLMs like GPT (Generative Pre-trained Transformer) by OpenAI, built atop Google's Transformer architecture, marked a significant leap forward. These models are trained on massive datasets of text and code, enabling them to generate coherent and contextually relevant responses to a wide range of prompts.

The LAION (Large-scale Artificial Intelligence Open Network) project further fuelled this progress by creating an open dataset of billions of image-text pairs, providing valuable training data for models to understand and generate images in addition to text.

The Grand Architecture of Transformers

Think of the Transformer as an enigmatic black box. You feed a sentence in one end, perhaps a bit of Molière, and out pops some English, perhaps a bit of Shakespeare.

Now, let's pry open this box, like Pandora, but hopefully with fewer unintended consequences.

Take a look at this animation, then go get some coffee, come back, and look again. Focus especially on how every node talks to its neighbours and how information flows through the system. If nothing else, it's very relaxing to view.

Inside the LLM, we find a structured duality: the encoder and decoder.

Imagine the encoder as a meticulous linguist, dissecting the input sentence, identifying its grammatical structures, its nuances, and the relationships between its words. Meanwhile, the decoder is the eloquent translator, weaving a tapestry of meaning in the target language, guided by the encoder's profound insights.

Both encoder and decoder are composed of stacks of identical layers, akin to a well-drilled regiment, each soldier performing their specialised task in perfect unison.

Each encoder layer is further divided into two specialised units: a self-attention mechanism and a feed-forward neural network. The self-attention mechanism acts as a magnifying glass, allowing the encoder to focus on the interplay between different words in the input sentence. It's as if each word is casting a vote on the importance of its companions, revealing the intricate web of meaning within the sentence.

The Encoder
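For the programmers in the audience, here's a minimal PyTorch sketch of one such encoder layer, with the 512-dimension and 2048-dimension sizes borrowed from the original paper; real implementations add dropout, masking, and other refinements omitted here.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention, then a feed-forward network,
    each wrapped in a residual connection and layer normalisation."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # each word attends to every other
        x = self.norm1(x + attn_out)       # residual connection + normalise
        x = self.norm2(x + self.ff(x))     # position-wise feed-forward
        return x

layer = EncoderLayer()
words = torch.randn(1, 6, 512)   # a batch of one six-word 'sentence'
print(layer(words).shape)        # torch.Size([1, 6, 512])
```

The `x + ...` additions are residual connections, which help signals (and corrections) flow cleanly through deep stacks of these layers.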

The decoder layers mirror this structure, but with an additional attention layer nestled between the self-attention and the feed-forward network. This layer, like a seasoned diplomat, facilitates communication between the encoder and decoder, ensuring that the translation remains faithful to the original text while adhering to the grammatical rules and idiomatic expressions of the target language.

The Dance of Vectors

The vector tells us how, and how closely, a word relates to the other words in the sentence.

In the realm of machine learning, words are not mere symbols; they are numerical entities, vectors dancing in a multi-dimensional space. Each word is meticulously encoded into a vector, a numerical fingerprint that captures its essence and its relationships with other words. Because there are many words, and many dimensions of relationship between them, each vector contains many numbers, so it is typically represented as a list.

If this seems a little fuzzy, think about all the places you could go out to eat today. You'll consider how hungry you are and what kinds of food you're inclined to enjoy; you'll also weigh cost, how busy each place is, and how close it is. These factors are impossible to represent as just one number, so we put them in a list (the vector), just as we do for words.

Each word (or token) is embedded into a list of numbers, a vector.
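Here's a toy illustration in Python. The three-number vectors are hand-picked for the demo; real embeddings have hundreds or thousands of dimensions and are learned during training.

```python
# Toy embeddings: each word becomes a short list of numbers.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.5],
}

def similarity(a, b):
    """Cosine similarity: how closely two word-vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / (norm(a) * norm(b))

print(similarity(embeddings["king"], embeddings["queen"]))  # ~0.99: closely related
print(similarity(embeddings["king"], embeddings["apple"]))  # ~0.45: not so much
```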

The bottom layer of the encoder is where this transformation takes place. It's the birthplace of word embeddings, where words are imbued with numerical meaning. Subsequent layers receive these embeddings, or the output of the layer below, as their input, propagating this meaning up the hierarchy.

Each layer transforms these 'fingerprint' vectors to establish their meaning and importance.

One of the Transformer's key innovations lies in its parallelisation. Unlike previous models, where words and their weightings were dealt with sequentially, like a queue at a post office, the Transformer allows each word to traverse its own independent path through the encoder. It's a multi-lane superhighway of information, where words zip along simultaneously, dramatically accelerating the translation process.

Each word can travel through any path between Encoder and Decoder

The Art of Self-Attention

Self Attention is Essential

Now, let's delve into the enigma of self-attention. It's a sophisticated mechanism that empowers the model to discern the most salient relationships between words in a sentence. Consider the phrase, "A quick brown fox jumps over the lazy dog."

This example is for information purposes only. No mammals were harmed in the production of this article.

When processing the word "jumps", self-attention enables the model to forge connections with "fox", "over", and "dog", like a detective piecing together clues, identifying the key players and their actions in a complex narrative. It does this thanks to its prior training on those exceedingly large datasets (such as LAION et al.) we mentioned earlier. A detective will have a hard time recognising patterns or associations without reference to her experience, or, as we call it here, training.

Key, Query, and Value provide insight. Image Credit: The Three Graces, Indiana Museum

The Three Sisters: Query, Key, and Value

At the heart of self-attention are three vectors associated with each word: the query vector, the key vector, and the value vector. Query is the eldest sister, and the pushiest. She asks her siblings to assist her in understanding each word from a given aspect, like discussing a DM she's just received. These vectors, derived from the word embedding through a series of matrix multiplications, provide a multi-faceted perspective on the word, like viewing a sculpture from different angles to appreciate its full form.

The query vector of a word is compared against the key vectors of all other words, resulting in a series of scores. Image Credit: Google

The query vector of a word is compared against the key vectors of all other words, resulting in a series of scores. These scores act as a measure of affinity, revealing the strength of association between words. It's as if each word is casting a vote on the relevance of its companions, with the highest scores signifying the most meaningful relationships.

Normalisation highlights the essential relationships, allowing us to jump to the prediction phase.

These scores are then normalised and used to weight the value vectors, producing a refined representation of the word that incorporates the contextual information gleaned from its neighbours. It's akin to a painter blending colours on a palette, creating a new hue that reflects the interplay of light and shadow. Each pixel, or point on the canvas, has a distinct colour, but only in relation to its surroundings does it begin to make sense, and as we pull back, more and more becomes clear.
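Putting the three sisters to work in code: below is a minimal NumPy sketch of this scaled dot-product self-attention, using random matrices where a trained model would use learned ones.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sentence.
    X has one row per word; Wq, Wk, Wv are the (learned) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # affinity of every word pair
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: scores become votes
    return weights @ V                              # blend in each word's context

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                 # 5 'words', 16-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 16): one refined vector per word
```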

A Multitude of Perspectives

The Transformer doesn't stop at a single set of query, key, and value matrices. It employs multiple sets, known as multi-head attention. This allows the model to attend to different aspects of the sentence simultaneously, like a team of experts each scrutinising a different facet of a problem. Imagine different trios of sisters in different households across the world, all reading DMs from the same famous person and pondering what is meant by their words. One head might focus on grammatical relationships, another on semantic associations, and yet another on idiomatic expressions.

Fortunately, unlike in social media, the outputs of these multiple attention heads are then combined, like weaving together the threads of a complex tapestry to reveal a unified picture.
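Continuing the sketch from above (reusing `self_attention`, `X`, and `rng`), multi-head attention simply runs several independent heads and stitches their outputs together. Real models add one more learned projection to blend the concatenated result, which I've left out here.

```python
def multi_head_attention(X, heads):
    """Run several independent attention heads and concatenate their outputs.
    `heads` is a list of (Wq, Wk, Wv) triples, one per head."""
    outputs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1)   # weave the threads back together

# Four heads, each with its own projections (random here, learned in practice).
heads = [tuple(rng.normal(size=(16, 8)) for _ in range(3)) for _ in range(4)]
print(multi_head_attention(X, heads).shape)   # (5, 32): 4 heads x 8 dims each
```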

The Significance of Order

In the world of language, order is paramount. The same words, arranged differently, can convey very different meanings. Let's imagine I say the following two sentences:

  1. Could you please let me know where I can find the door?
  2. Please let me know where I can find the door.

Try explaining to someone who is not a native speaker the difference between the two. Yes, yes, one is a question and the other is a statement, but they are both asking for the same thing. That's not the material point, though. The difference is the tone. And that key difference in urgency, impatience, stress, and even fear is often embedded in word order, i.e. the way the words are arranged in the sentence.

To capture this crucial aspect, the Transformer employs positional encoding, assigning each word a unique identifier, like a prisoner's number, that reveals its position in the sentence.
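The original paper used sinusoidal encodings for these 'prisoner numbers': each position gets a unique, fixed pattern of sine and cosine values, which is simply added to the word's embedding. A short NumPy sketch:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal position stamps, as in the original Transformer paper:
    each position gets a unique pattern of sine and cosine values."""
    positions = np.arange(n_positions)[:, None]   # 0, 1, 2, ...
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# The stamp is simply added to each word's embedding before the encoder.
pe = positional_encoding(n_positions=10, d_model=16)
print(pe[0][:4], pe[5][:4])   # position 0 and position 5 look different
```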

This is interesting because it opens a door into LLMs understanding, or at least recognising, some of the signs of emotion, or at the very least sentiment, which can be used to adjust the vocal tone of synthesised speech.

Be prepared for a flood of flirty, threatening, plaintive, and passive aggressive AI voices in your near future.

So, now that we've encoded our original words, handed them off to our supreme council of dancing vector sisters, and even accounted for their order and tone, we can finally pass our work safely over to the decoder to make something useful of it all.

The Decoder's Craft

The decoder, like a skilled artisan, then crafts the translated sentence one word at a time.

The Decoding Process. Image Credit: Google

It operates in a manner similar to the encoder, but with an additional attention layer that enables it to focus on the most relevant parts of the input sentence. It's like a sculptor meticulously chipping away at a block of marble, revealing the hidden figure within.

The output of the decoder is then channeled through a linear layer and a softmax layer. These layers transform the raw output into a probability distribution over all possible words in the target language, where each word has a certain likelihood of being selected, with the highest probability emerging as the winner.

These layers simply take the weightings and convert them into probabilities, and pick the biggest one.

And finally, it makes a choice and selects the most likely word. All that effort just to squeeze out one option, and even then it doesn't always get it right. LLMs, make no mistake, are auto-complete on steroids.
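In code, that final step is almost embarrassingly simple. A toy sketch with a five-word vocabulary and made-up scores:

```python
import numpy as np

vocabulary = ["the", "cat", "sat", "mat", "on"]
logits = np.array([2.1, 0.3, -1.0, 0.5, 1.2])  # raw scores from the linear layer

# Softmax: turn raw scores into probabilities that sum to 1.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

next_word = vocabulary[int(np.argmax(probs))]  # greedy choice: take the biggest
print(list(zip(vocabulary, probs.round(3))), "->", next_word)
```

In practice, chat systems usually sample from this distribution (controlled by a 'temperature' setting) rather than always taking the single biggest probability, which is why the same prompt can produce different answers.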

The entire Transformer, combined: the beating heart of every Large Language Model in Artificial Intelligence.

However, and this is important, the more context you have, the better the calculation, and the better the outcome. And yet even this is not enough: without pre-training the system on a sufficiently large and representative dataset, you're going to get garbage results.

This is why there's a land rush for data sets happening across all industries across the entire world. Your data is valuable, but once it's used, it's worthless, so if you don't know what your data situation or strategy is, definitely get in touch before it is too late.

Anti-Climactic, isn't it?

This is why Transformers alone are not sufficient; they must be pre-trained, which is where the P in GPT comes from. The G, incidentally, just stands for Generative (because it generates the next symbol/word/pixel, etc.), so pulling it all together you get Generative Pre-trained Transformer, or GPT. As in ChatGPT.

The Crucible of Training

Hephaestus forging a sword

Training the Transformer is an iterative process of refinement, much like a blacksmith forging a sword, repeatedly heating and hammering the metal to achieve the desired shape and strength.

It goes through the following stages:

Data Collection

First, the neural network model is fed a vast corpus of text, consisting of sentence pairs in both the source and target languages. It hoovers up vast swathes of text from every nook and cranny of the internet. Books, websites, social media drivel: the lot. A veritable feast for the digital mind, you might say.

This is its training ground, where it learns the intricate patterns and nuances of language. And to its credit, it does this while being entirely blind to the meaning of any of the symbols.

Tokenisation

Next, it's time for tokenisation. Now, don't let this fancy term bamboozle you. It's simply the process we mentioned earlier of chopping up the text into bite-sized chunks, like a butcher preparing a Sunday roast. These chunks are then converted into numbers (remember the vector sisters?).
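A toy sketch of the idea: real systems use subword schemes such as Byte Pair Encoding rather than whole words, but the principle is the same, text in, a list of integer IDs out. The vocabulary here is invented for the demo.

```python
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenise(text):
    """Map each word to its ID; unknown words fall back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenise("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]
```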

Model Architecture

Now we come to the model architecture, the blueprint for this digital brain. Large Language Model systems rely on the Transformer, which we saw was the neural network designed specifically for processing sequential data, like the words in a sentence.

Unsupervised Pre-Training

But, of course, building a brain is one thing; teaching it to think is quite another. So, the model undergoes a rigorous training regime. First, it's let loose on a vast corpus of text, learning to predict the next word in a sequence like a hyperactive autocomplete.
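To see the objective in miniature, here's a crude bigram counter standing in for a billion-parameter network. The goal is identical, predict the next word from what came before; only the machinery differs:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat slept".split()

# Count which word follows which: a minimal stand-in for 'learning' to
# predict the next word from what came before.
next_word_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_word_counts[current][nxt] += 1

# 'Autocomplete': the most frequently observed continuation of 'the'.
print(next_word_counts["the"].most_common(1))  # [('cat', 2)]
```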

Supervised Training

But that's just the warm-up act. The real training begins with supervised learning, where the model is fed carefully curated data and given feedback on its performance. It's like a stern headmaster drilling facts into a student's head, ensuring they learn the correct answers.

Reward Function

And then, there's the reward function, a clever trick used to fine-tune the model's responses. Humans review and rank the outputs so the system has some idea of what it's like to get the existential dread of a performance review. Much like a teacher giving gold stars for good answers and detentions for bad ones, encouraging the model to produce the most desirable output.


The LLM's output is usually compared to the reference material, and the discrepancy between the two, known as the loss, is used to adjust the model's internal parameters.
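A minimal sketch of that discrepancy, using the standard cross-entropy loss on a made-up four-word vocabulary:

```python
import numpy as np

vocabulary = ["the", "cat", "sat", "mat"]
predicted = np.array([0.1, 0.6, 0.2, 0.1])  # the model's probabilities
target = "sat"                              # what the reference text says

# Cross-entropy: low when the model put high probability on the right word,
# high when it was confidently wrong. Training adjusts parameters to shrink it.
loss = -np.log(predicted[vocabulary.index(target)])
print(round(float(loss), 3))  # 1.609: room for improvement
```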

This process, repeated countless times, gradually hones the model's ability to produce accurate and fluent translations. It's not so much that the model is ever 'right' as that it's less wrong. Most people in contemporary dating scenarios can possibly relate.

Deployment into Production

Finally, after all this intensive training, the model is ready for deployment. It's unleashed upon the world, answering questions, generating text, and generally trying to convince us it's sentient.

And there you have it, the Large Language Model in all its technical glory. It's a complex and sophisticated model, but hopefully this explanation has demystified some of its inner workings, revealing the ingenuity of its design and the power of its capabilities.

Now, if you'll excuse me, I have a rather pressing engagement with a queue of people who are suddenly curious about AI.

Image Credits: Google, Lilian Weng, Bea Stollnitz, Jay Alammar