Install transformers¶

In [ ]:
!python -m pip install transformers --quiet

Transformers documentation: https://huggingface.co/docs/transformers/index

Sequence Classification¶

In [ ]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", 
                      model="distilbert-base-uncased-finetuned-sst-2-english")
In [ ]:
classifier("I love to hate you")
Out[ ]:
[{'label': 'NEGATIVE', 'score': 0.9974361062049866}]
In [ ]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

tokenizer = AutoTokenizer.from_pretrained(model_name)
In [ ]:
text = "I love you"

tokens = tokenizer.tokenize(text)

tokens
Out[ ]:
['i', 'love', 'you']
In [ ]:
sentence = tokenizer.convert_tokens_to_ids(tokens)

sentence
Out[ ]:
[1045, 2293, 2017]
In [ ]:
sentence = tokenizer(text, return_tensors="pt")

sentence
Out[ ]:
{'input_ids': tensor([[ 101, 1045, 2293, 2017,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
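Calling the tokenizer directly also adds BERT's special tokens: id 101 is [CLS] and id 102 is [SEP]. Decoding the ids makes this visible:

In [ ]:
# Decode the ids back to text to see the added special tokens.
tokenizer.decode(sentence["input_ids"][0])  # '[CLS] i love you [SEP]'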
In [ ]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)
In [ ]:
import torch

torch.softmax(model(**sentence).logits, dim=1)
Out[ ]:
tensor([[1.3436e-04, 9.9987e-01]], grad_fn=<SoftmaxBackward0>)
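The two probabilities follow the order of the model's labels; the mapping is stored in the model config:

In [ ]:
# Index-to-label mapping of the classification head.
model.config.id2label  # {0: 'NEGATIVE', 1: 'POSITIVE'}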

See an example of this classification model deployed on a HuggingFace Space at: https://huggingface.co/spaces/Donlapark/sample-text-classification

Exercise¶

  1. Choose your own task (it can be image- or audio-related) that can be performed by one of the HuggingFace models.
  2. Use the HuggingFace model to create a Streamlit app in a HuggingFace Space that asks for the user's input and then performs that task (a minimal app sketch is given after the task list below).
  3. Deploy the model on HuggingFace Space.

To see what Transformers can do, you might want to check out the links below:

https://huggingface.co/docs/transformers/v4.22.2/en/task_summary

https://huggingface.co/docs/transformers/index

Here is the list of all possible pipeline tasks:

['audio-classification', 'automatic-speech-recognition', 'conversational', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'image-to-text', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text2text-generation', 'token-classification', 'translation', 'visual-question-answering', 'vqa', 'zero-shot-classification', 'zero-shot-image-classification', 'translation_XX_to_YY']
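If you are unsure how to structure the app, here is a minimal sketch of a Streamlit text-classification app reusing the sentiment model from above; swap in the model and input widget appropriate for your chosen task (the st.cache_resource decorator assumes Streamlit 1.18 or newer):

In [ ]:
# app.py -- minimal Streamlit sketch for a HuggingFace Space.
import streamlit as st
from transformers import pipeline

@st.cache_resource  # load the model only once per session (Streamlit >= 1.18)
def load_classifier():
    return pipeline("sentiment-analysis",
                    model="distilbert-base-uncased-finetuned-sst-2-english")

classifier = load_classifier()

text = st.text_input("Enter a sentence to classify:")
if text:
    result = classifier(text)[0]
    st.write(f"Label: {result['label']} (score: {result['score']:.4f})")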

Insert your HuggingFace Space link here:¶

Fine-tuning¶

In [ ]:
!python -m pip install datasets evaluate --quiet
In [ ]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
In [ ]:
dataset["train"][100]
Out[ ]:
{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}
In [ ]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased",
                                          use_fast=True)


def tokenize_function(examples):
    # Pad every review to the model's maximum length and truncate longer ones.
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# Tokenize the whole dataset in batches, dropping the raw text column.
tokenized_datasets = dataset.map(tokenize_function,
                                 batched=True,
                                 remove_columns=["text"])
In [ ]:
tokenized_datasets["train"][100]['input_ids'][10]
In [ ]:
# For a quick demonstration, train and evaluate on 1,000 random examples each.
small_train_dataset = tokenized_datasets["train"].shuffle().select(range(1000))

small_eval_dataset = tokenized_datasets["test"].shuffle().select(range(1000))
In [ ]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [ ]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer",
                                  evaluation_strategy="epoch",
                                  learning_rate=2e-5,
                                  optim="adamw_torch")  # use PyTorch's AdamW optimizer
In [ ]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
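The evaluate library installed earlier can add an accuracy score to each evaluation pass. A minimal sketch; to use it, pass compute_metrics=compute_metrics when constructing the Trainer above:

In [ ]:
import numpy as np
import evaluate

# HuggingFace's evaluate library provides standard metrics such as accuracy.
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # The Trainer passes (logits, labels) as numpy arrays at evaluation time.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)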
In [ ]:
trainer.train()
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375
[375/375 06:42, Epoch 3/3]
Epoch  Training Loss  Validation Loss
1      No log         1.169685
2      No log         1.004425
3      No log         0.966987

Training completed. Do not forget to share your model on huggingface.co/models =)


Out[ ]:
TrainOutput(global_step=375, training_loss=1.0997899576822916, metrics={'train_runtime': 406.139, 'train_samples_per_second': 7.387, 'train_steps_per_second': 0.923, 'total_flos': 789354427392000.0, 'train_loss': 1.0997899576822916, 'epoch': 3.0})
In [ ]:
import torch

# The Trainer trained the model on the GPU, so the inputs must be moved there too.
sentence = tokenizer("I hate you", return_tensors="pt").to("cuda")

torch.softmax(model(**sentence).logits, dim=1)
Out[ ]:
tensor([[0.4654, 0.2268, 0.1085, 0.1172, 0.0822]], device='cuda:0',
       grad_fn=<SoftmaxBackward0>)
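In yelp_review_full the labels 0-4 correspond to 1-5 stars, so the predicted rating is the argmax plus one:

In [ ]:
# yelp_review_full labels 0-4 correspond to 1-5 stars.
pred = torch.argmax(model(**sentence).logits, dim=1).item()
print(f"Predicted rating: {pred + 1} star(s)")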

Upload model to HuggingFace Hub¶

We will upload the tokenizer and the model to the HuggingFace Hub. First we need to install a library that allows us to log in to our HuggingFace account from Colab.

In [ ]:
!python -m pip install huggingface_hub --quiet

Enter your access token to log in, then create a new model repository, which will be used to store your model.

In [ ]:
!huggingface-cli login
!huggingface-cli repo create finetuned_yelp --type model

        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
        
Token: 
Login successful
Your token has been saved to /root/.huggingface/token

You are about to create Donlapark/finetuned_yelp
Proceed? [Y/n] Y

Your repo now lives at:
  https://huggingface.co/Donlapark/finetuned_yelp

You can clone it locally with the command below, and commit/push as usual.

  git clone https://huggingface.co/Donlapark/finetuned_yelp

Finally, you can save your tokenizer and model to the new repository.

Afterwards, you can load the model within a HuggingFace Space using pipeline("sentiment-analysis", model="your_username/finetuned_yelp"), where your_username is your HuggingFace username; see the example after the upload below.

In [ ]:
tokenizer.push_to_hub("finetuned_yelp")
model.push_to_hub("finetuned_yelp")
tokenizer config file saved in finetuned_yelp/tokenizer_config.json
Special tokens file saved in finetuned_yelp/special_tokens_map.json
Uploading the following files to Donlapark/finetuned_yelp: tokenizer.json,vocab.txt,tokenizer_config.json,special_tokens_map.json
Configuration saved in finetuned_yelp/config.json
Model weights saved in finetuned_yelp/pytorch_model.bin
Uploading the following files to Donlapark/finetuned_yelp: config.json,pytorch_model.bin
Out[ ]:
CommitInfo(commit_url='https://huggingface.co/Donlapark/finetuned_yelp/commit/2115ed92cdc2be4342dce3d46efac36dbd5eeea3', commit_message='Upload BertForSequenceClassification', commit_description='', oid='2115ed92cdc2be4342dce3d46efac36dbd5eeea3', pr_url=None, pr_revision=None, pr_num=None)
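Once the upload finishes, the model can be loaded anywhere, including a HuggingFace Space, by its Hub name (a sketch; replace your_username with your HuggingFace username):

In [ ]:
from transformers import pipeline

# Load the fine-tuned model directly from the Hub (replace your_username).
classifier = pipeline("sentiment-analysis", model="your_username/finetuned_yelp")
classifier("I hate you")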

Exercise¶

  1. Fine-tune the uncased DistilBERT model (introduced in this paper) on the imdb dataset (https://huggingface.co/datasets/imdb). Note that:
    • There are only two classes in the imdb dataset: positive and negative.
    • The name of the model in the transformers library is "distilbert-base-uncased".
  2. After you finish training the model, write your own sentence and use the model to classify it. A starter sketch for loading the data and model is given below.
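As a starting point, loading the data and model mirrors the Yelp example above (a sketch; tokenization, training, and evaluation are left to you):

In [ ]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

dataset = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# imdb has only two classes (negative/positive), hence num_labels=2.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)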
In [ ]: