Transformers
pipelines
This is one of the most basic elements of the transformers
library. The pipeline function connects a model with the necessary steps for its pre- and
postprocessing, so it can receive any text and return a legible answer.
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life!")
We can pass several sentences to the classifier at once by using a list, for example:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.",
     "I hate this so much!"]
)
By default, this pipeline selects a particular pretrained model, fine-tuned for
sentiment analysis in English. This model is downloaded and saved in the cache when the
classifier object is created. Subsequent executions of the command do not
require downloading the model again.
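As an illustration of what "a particular pretrained model" means, you can also pin the checkpoint explicitly instead of relying on the default choice; the checkpoint name below is an assumption, and any sentiment model from the Hub works the same way.
from transformers import pipeline

# Assumed English sentiment checkpoint; pinning it explicitly makes the
# download/cache behaviour reproducible across library versions.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
classifier("I've been waiting for a HuggingFace course my whole life!")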
Preprocessing and postprocessing
There are three steps occurring when passing a text to a pipeline (sketched by hand below):
- The text is preprocessed into a format that the model can understand.
- The preprocessed input is passed to the model.
- The model predictions are postprocessed, so a human can understand them.
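A minimal sketch of these three steps done by hand, assuming the English sentiment checkpoint named below (any sequence-classification checkpoint from the Hub would do):
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. Preprocessing: turn the raw text into token ids the model understands
inputs = tokenizer("I've been waiting for a HuggingFace course my whole life!", return_tensors="pt")

# 2. The preprocessed input is passed to the model, which returns raw logits
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Postprocessing: convert logits into human-readable label probabilities
probs = torch.softmax(logits, dim=-1)[0]
print({model.config.id2label[i]: float(p) for i, p in enumerate(probs)})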
Pipeline Examples
There are many available pipelines, for example:
- feature-extraction: obtains the vector representation of a text.
- fill-mask: used to fill blank spaces in a sentence. The argument top_k controls the number of possibilities that will be shown.
unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=2)
- ner (named entity recognition): used to identify which parts of the text correspond to people, locations, or organizations. An example is
ner = pipeline("ner", grouped_entities=True)
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
- question-answering: used to answer questions given a context. An example is
question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
- sentiment-analysis: classifies the sentiment of a text, as shown at the beginning.
- summarization: used to summarize text. Its usage is straightforward.
summarizer = pipeline("summarization")
summarizer("Text to be summarized.")
- text-generation: used to generate text. You provide a prompt, and the model completes it automatically, generating the remaining text. The argument num_return_sequences sets how many different sequences are generated, while max_new_tokens establishes the maximum number of tokens to generate. An example is
generator = pipeline("text-generation")
generator("In this course, we will teach you how to")
- translation: you can use the default model if you indicate a pair of languages in the name of the task, like "translation_en_to_fr", but the easiest way is to choose the desired model on the Hub. For example, translating from French to English:
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")
- zero-shot-classification: used to classify unlabeled texts. You can specify which labels to use for the classification (for example, in sentiment-analysis, either positive or negative). An example is
classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
which returns
{'sequence': 'This is a course about the Transformers library', 'labels': ['education', 'business', 'politics'], 'scores': [0.8445994257926941, 0.11197380721569061, 0.04342673346400261]}
Using any Hub model in a pipeline
Previous examples used the default model for each task, but you can also choose
a particular model from the Hub. For example,
we can use distilgpt2
for text generation.
from transformers import pipeline
generator = pipeline("text-generation", model = "distilgpt2")
generator("In this course, we will teach you how to",
max_new_tokens=5,
num_return_sequences=2,
truncation=True)
The Hub also provides an Inference API where you can test the models online.
How do transformers work
Transformers are generally grouped in three categories:
- The ones like GPT (known as auto-regressive models).
- The ones like BERT (known as auto-encoding models).
- The ones like BART/T5 (known as sequence-to-sequence models).
All these Transformers are language models. This means that they are trained on large amounts of raw text in a self-supervised way. Self-supervised learning is a type of training in which the objective is computed automatically from the model's input; that is, it does not require humans to label the data.
This leads to a statistical understanding of the language the model was trained on, which is not very useful for specific practical tasks. The general pretrained model therefore goes through a process called transfer learning, in which it is adjusted in a supervised way for a given task.
An example of a task is to predict the next word in a sentence based on the $n$ previous words. This is called causal language modeling, because the output depends on the previous and current inputs, but not on future ones.
my -> name
my name -> is
my name is -> Sylvain
my name is Sylvain -> .
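A small sketch of this next-word objective, using distilgpt2 (an arbitrary assumed choice) and taking the single most likely continuation of a prefix:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # assumed small causal LM
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("My name is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The logits at the last position score every candidate next token
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))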
Another example is masked language modeling, in which the model predicts a word hidden in a sentence.
My {MASK} is Sylvain.
MASK -> name
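The same idea sketched by hand with a masked language model; distilroberta-base is an assumed choice of checkpoint:
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")  # assumed MLM checkpoint
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

text = f"My {tokenizer.mask_token} is Sylvain."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and take its highest-scoring token
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))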
Phases of training. Pretraining is usually done with very large datasets; it is expensive and tends to take weeks. Fine-tuning is the training done after the model has been pretrained: you start with a pretrained language model and then perform additional training with a dataset specific to your task.
We do not train the model directly for the final task because:
- The pretrained model is already trained with a dataset similar to the fine-tuning dataset.
- As the pretrained model was trained with a lot of data, the fine-tuning will require less data to achieve decent results.
- Consequently, the time and resources required to get good results are much smaller.
Example: a pretrained model can be fine-tuned on an arXiv corpus, resulting in a model oriented toward scientific research. The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is transferred, hence the name transfer learning.
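A minimal sketch of the fine-tuning phase with the Trainer API, under the assumption that a small labeled dataset (here GLUE's sst2 via the datasets library) and bert-base-cased as the pretrained starting point fit the task:
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-cased"  # assumed pretrained starting point
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Assumed task-specific dataset; replace with your own labeled data
raw = load_dataset("glue", "sst2")
tokenized = raw.map(lambda batch: tokenizer(batch["sentence"], truncation=True), batched=True)

args = TrainingArguments(output_dir="finetuned-model", num_train_epochs=1)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()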
General Architecture
The model is composed of two blocks:
- Encoder: it receives an input and constructs a representation of it (its features). This means that the model is optimized to obtain an understanding of the input.
- Decoder: it uses the encoder's representation (features), together with other inputs, to generate a target sequence. This means that the model is optimized to generate outputs.
Each one of these parts can be used independently, depending on the task:
Models with encoders only
Good for tasks that require understanding of the input, such as sentence classification and named entity recognition.
In each step, the attention layers can access all words of the initial sentence. These models are characterized by generally having bidirectional attention, and they are usually called auto-encoding models.
Pretraining of these models generally consists of corrupting a given sentence (for example, hiding words in it) and asking the model to find or reconstruct the original sentence. This is called masked language modeling (MLM).
These models are adequate for tasks that require full-sentence understanding, such as sentence classification, named entity recognition (NER), and extractive question answering.
Examples: ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa.
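As an illustration of the representation an encoder builds, a minimal sketch (assuming distilbert-base-uncased as the checkpoint) that extracts the hidden states of a sentence:
from transformers import AutoTokenizer, AutoModel
import torch

checkpoint = "distilbert-base-uncased"  # assumed encoder-only checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

inputs = tokenizer("My name is Sylvain.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token: the "understanding" of the input a task head can build on
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)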
Models with decoders only
Good for generative tasks such as text generation. At each stage, for a given word, the attention layers can only access the words located before it in the sentence. These models are usually called auto-regressive.
Pretraining consists of predicting the next word of a sentence, which makes these models the most adequate for generating text. This is called causal language modeling (CLM).
Examples: CTRL, GPT, GPT-2, Transformer-XL.
Models with encoders and decoders, or sequence-to-sequence models
Good for generative tasks that require an input, such as translation or summarization. Examples: BART, T5.
Transformers are built with special layers called attention layers. These layers tell the model to pay special attention to certain parts of the given sentence (and more or less ignore the rest) when working with the representation of each word. This is because, in NLP, a word may have a meaning by itself, but that meaning is deeply affected by the context, which can be any word (or words) before or after the word being studied.
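As an illustration of the idea (a toy sketch, not the library's actual implementation), the standard scaled dot-product attention weighs every word of the sentence against every other word:
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # How much each position should "look at" every other position
    scores = query @ key.transpose(-2, -1) / key.size(-1) ** 0.5
    # Softmax turns the scores into attention weights that sum to 1
    weights = F.softmax(scores, dim=-1)
    # Each output is a weighted mix of the value vectors of all words
    return weights @ value

# Toy "sentence" of 4 tokens with 8-dimensional representations
x = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 4, 8])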
The Transformer architecture was originally designed for translation.
Architectures vs. checkpoints
- Architecture: this is the skeleton of the model, that is, the definition of each layer and operation occurring inside the model.
- Checkpoint: these are the weights that will be loaded into a given architecture.
- Model: this term is not as precise as architecture or checkpoint and can mean either of the two.
Example: BERT is an architecture, while bert-base-cased, a set of weights trained by Google's team for the first version of BERT, is a checkpoint. Nevertheless, we speak of both the BERT model and the bert-base-cased model.
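A small sketch of the difference: building the architecture from a config gives randomly initialized weights, while loading the checkpoint brings in the trained ones.
from transformers import BertConfig, BertModel

# Architecture only: layers are created from the config with random weights
config = BertConfig()
random_bert = BertModel(config)

# Architecture + checkpoint: the same layers, loaded with the trained weights
trained_bert = BertModel.from_pretrained("bert-base-cased")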