!python -m pip install transformers --quiet
Transformers documentation: https://huggingface.co/docs/transformers/index
from transformers import pipeline
classifier = pipeline("sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english")
classifier("I love to hate you")
[{'label': 'NEGATIVE', 'score': 0.9974361062049866}]
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "I love you"
tokens = tokenizer.tokenize(text)
tokens
['i', 'love', 'you']
sentence = tokenizer.convert_tokens_to_ids(tokens)
sentence
[1045, 2293, 2017]
sentence = tokenizer(text, return_tensors="pt")
sentence
{'input_ids': tensor([[ 101, 1045, 2293, 2017, 102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
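The two extra ids in `input_ids` (101 and 102) are the special tokens `[CLS]` and `[SEP]` that the tokenizer wraps around the sentence. The sketch below reproduces the output above by hand, in plain Python and without the library; the helper name `add_special_tokens` is made up for illustration and is not part of the Transformers API.

```python
# Token ids copied from the outputs above.
CLS_ID = 101  # [CLS] in the BERT uncased vocabulary
SEP_ID = 102  # [SEP] in the BERT uncased vocabulary


def add_special_tokens(token_ids):
    """Wrap plain token ids the way a single-sentence BERT input is wrapped."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]


plain_ids = [1045, 2293, 2017]           # result of convert_tokens_to_ids
input_ids = add_special_tokens(plain_ids)
attention_mask = [1] * len(input_ids)    # no padding, so every position is real

print(input_ids)
print(attention_mask)
```

In practice you would simply call `tokenizer(text)`, which adds the special tokens and builds the attention mask for you.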
model = AutoModelForSequenceClassification.from_pretrained(model_name)
import torch
torch.softmax(model(**sentence).logits, dim=1)
tensor([[1.3436e-04, 9.9987e-01]], grad_fn=<SoftmaxBackward0>)
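The softmax output holds one probability per class. To turn it into a readable prediction like the one returned by `pipeline`, take the index of the largest probability and look it up in the label mapping. The sketch below does this in plain Python with the numbers copied from the output above. The mapping `{0: "NEGATIVE", 1: "POSITIVE"}` is written out by hand here as an assumption; with a loaded model it should be read from `model.config.id2label` instead. The helper `top_label` is hypothetical.

```python
# Probabilities copied from the softmax output above.
probs = [1.3436e-04, 9.9987e-01]

# Assumed label mapping; read model.config.id2label in real code.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def top_label(probs, id2label):
    """Return the most probable class in the same shape the pipeline uses."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}


prediction = top_label(probs, id2label)
print(prediction)
```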
See an example of the classification model deployed on HuggingFace space at: https://huggingface.co/spaces/Donlapark/sample-text-classification
To see what Transformers can do, you might want to check out the links below:
https://huggingface.co/docs/transformers/v4.22.2/en/task_summary
https://huggingface.co/docs/transformers/index
This is the list of all tasks supported by pipeline:
['audio-classification', 'automatic-speech-recognition', 'conversational', 'document-question-answering', 'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'image-to-text', 'ner', 'object-detection', 'question-answering', 'sentiment-analysis', 'summarization', 'table-question-answering', 'text-classification', 'text-generation', 'text2text-generation', 'token-classification', 'translation', 'visual-question-answering', 'vqa', 'zero-shot-classification', 'zero-shot-image-classification', 'translation_XX_to_YY']
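Some names in this list are aliases of each other, which is why `"sentiment-analysis"` and `"text-classification"` both appear. The sketch below shows the idea with a small hand-written alias table; it is an illustration only, not the library's own lookup code, and `canonical_task` is a hypothetical helper.

```python
# Hand-written alias table for illustration (alias -> canonical task name).
TASK_ALIASES = {
    "sentiment-analysis": "text-classification",
    "ner": "token-classification",
    "vqa": "visual-question-answering",
}


def canonical_task(task):
    """Map an alias to its canonical task name; other names pass through."""
    return TASK_ALIASES.get(task, task)


for name in ["sentiment-analysis", "ner", "summarization"]:
    print(name, "->", canonical_task(name))
```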
!python -m pip install datasets evaluate --quiet
from datasets import load_dataset
dataset = load_dataset("yelp_review_full")
/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_deprecation.py:97: FutureWarning: Deprecated argument(s) used in 'dataset_info': token. Will not be supported from version '0.12'. warnings.warn(message, FutureWarning)
Downloading and preparing dataset yelp_review_full/yelp_review_full (download: 187.06 MiB, generated: 496.94 MiB, post-processed: Unknown size, total: 684.00 MiB) to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf...
Dataset yelp_review_full downloaded and prepared to /root/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf. Subsequent calls will reuse this data.
dataset["train"][100]
{'label': 0, 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased",
                                          use_fast=True)

def tokenize_function(examples):
    # Pad every review to the model's maximum length and truncate longer ones
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function,
                                 batched=True,
                                 remove_columns=["text"])
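With `padding="max_length"` and `truncation=True`, every review ends up with the same number of token ids: short reviews are filled with a padding id and long ones are cut off. The sketch below mimics that behaviour on plain lists. It is a simplification: the real tokenizer also keeps the closing `[SEP]` token when it truncates, and `pad_or_truncate` is a hypothetical helper, not a library function. The padding id 0 and a tiny `max_length` are assumptions chosen for readability.

```python
def pad_or_truncate(ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(ids)[:max_length]          # truncation: drop anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


short_ids, short_mask = pad_or_truncate([101, 1045, 102], max_length=5)
long_ids, long_mask = pad_or_truncate(range(10), max_length=4)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding everything to the maximum length is simple but wasteful for short texts; it is used here because it keeps all examples the same shape.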
tokenized_datasets["train"][100]['input_ids'][10]
small_train_dataset = tokenized_datasets["train"].shuffle().select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle().select(range(1000))
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight'] - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer",
                                  evaluation_strategy="epoch",
                                  learning_rate=2e-5,
                                  optim="adamw_torch")  # use PyTorch's AdamW optimizer
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
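As configured, the Trainer reports only the loss at each evaluation. If you also want accuracy, one option is to pass a `compute_metrics` function to `Trainer`. The sketch below is a minimal plain-Python version that needs no extra library; it assumes the argument unpacks into `(logits, labels)`, and it is one possible way to do this rather than the method used in this notebook.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return {"accuracy": correct / len(preds)}


# Tiny made-up example: 3 samples, 2 classes
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.0, 0.5]]
toy_labels = [1, 0, 0]
toy_metrics = compute_metrics((toy_logits, toy_labels))
print(toy_metrics)
```

To use it, add `compute_metrics=compute_metrics` to the `Trainer(...)` call; the accuracy should then appear as an extra column in the evaluation results.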
trainer.train()
***** Running training *****
Num examples = 1000
Num Epochs = 3
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 375
Epoch | Training Loss | Validation Loss |
---|---|---|
1 | No log | 1.169685 |
2 | No log | 1.004425 |
3 | No log | 0.966987 |
***** Running Evaluation *****
Num examples = 1000
Batch size = 8
***** Running Evaluation *****
Num examples = 1000
Batch size = 8
***** Running Evaluation *****
Num examples = 1000
Batch size = 8
Training completed. Do not forget to share your model on huggingface.co/models =)
TrainOutput(global_step=375, training_loss=1.0997899576822916, metrics={'train_runtime': 406.139, 'train_samples_per_second': 7.387, 'train_steps_per_second': 0.923, 'total_flos': 789354427392000.0, 'train_loss': 1.0997899576822916, 'epoch': 3.0})
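The numbers in the training log can be checked by hand: 1000 examples in batches of 8 give 125 steps per epoch, so 3 epochs give 375 optimization steps. The sketch below does this arithmetic, assuming a single device and no gradient accumulation as reported in the log. It also suggests why the Training Loss column shows "No log": the training loss is logged every 500 steps by default (the `logging_steps` default is an assumption about the installed version), and this run stops at step 375.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Steps per epoch (last partial batch included) times number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


steps = total_optimization_steps(num_examples=1000, batch_size=8, num_epochs=3)
print(steps)  # 375

DEFAULT_LOGGING_STEPS = 500  # assumed default of TrainingArguments.logging_steps
loss_was_logged = steps >= DEFAULT_LOGGING_STEPS
print(loss_was_logged)  # False
```

Passing a smaller `logging_steps` to `TrainingArguments` (for example 50) should make the training loss show up in the table.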
import torch
sentence = tokenizer("I hate you", return_tensors="pt").to("cuda")
torch.softmax(model(**sentence).logits, dim=1)
tensor([[0.4654, 0.2268, 0.1085, 0.1172, 0.0822]], device='cuda:0', grad_fn=<SoftmaxBackward0>)
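The five probabilities correspond to the five labels of the dataset, 0 to 4. Assuming, as in `yelp_review_full`, that label `k` stands for `k + 1` stars, the prediction can be converted to a star rating as sketched below. The helper `predicted_stars` is hypothetical and the probabilities are copied from the output above.

```python
# Probabilities copied from the output above (classes 0..4).
probs = [0.4654, 0.2268, 0.1085, 0.1172, 0.0822]


def predicted_stars(probs):
    """Most probable class index, shifted so that class 0 means 1 star."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best + 1


stars = predicted_stars(probs)
print(stars)  # 1
```

So after this short fine-tuning run, "I hate you" is rated as a 1-star text, although with less than 50% probability.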
We will upload the tokenizer and the model to the HuggingFace Hub. First, we need to install a library that lets us log in to our HuggingFace account from Colab.
!python -m pip install huggingface_hub --quiet
Enter your access token to log in, then create a new model repository on the Hub, which will be used to store your model.
!huggingface-cli login
!huggingface-cli repo create finetuned_yelp --type model
To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
Token:
Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default
git config --global credential.helper store
git version 2.17.1
Error: unknown flag: --version
Sorry, no usage text found for "git-lfs"
You are about to create Donlapark/finetuned_yelp
Proceed? [Y/n] Y
Your repo now lives at:
https://huggingface.co/Donlapark/finetuned_yelp
You can clone it locally with the command below, and commit/push as usual.
git clone https://huggingface.co/Donlapark/finetuned_yelp
Finally, you can now save your tokenizer and model to the Hub with push_to_hub.
Once they are uploaded, you can load the model and tokenizer from the Hub, for example within a HuggingFace Space, using pipeline("sentiment-analysis", model="your_username/finetuned_yelp") (change your_username to your HuggingFace username).
tokenizer.push_to_hub("finetuned_yelp")
model.push_to_hub("finetuned_yelp")
tokenizer config file saved in finetuned_yelp/tokenizer_config.json Special tokens file saved in finetuned_yelp/special_tokens_map.json Uploading the following files to Donlapark/finetuned_yelp: tokenizer.json,vocab.txt,tokenizer_config.json,special_tokens_map.json Configuration saved in finetuned_yelp/config.json Model weights saved in finetuned_yelp/pytorch_model.bin Uploading the following files to Donlapark/finetuned_yelp: config.json,pytorch_model.bin
CommitInfo(commit_url='https://huggingface.co/Donlapark/finetuned_yelp/commit/2115ed92cdc2be4342dce3d46efac36dbd5eeea3', commit_message='Upload BertForSequenceClassification', commit_description='', oid='2115ed92cdc2be4342dce3d46efac36dbd5eeea3', pr_url=None, pr_revision=None, pr_num=None)
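Once the upload has finished, the model is addressed by a repository id of the form `username/repo_name`. The sketch below builds that id and wraps the `pipeline` call mentioned earlier in a small function. Both helper names are hypothetical, `transformers` is imported only when the loader is actually called, and loading needs network access and an existing repository.

```python
def hub_repo_id(username, repo_name="finetuned_yelp"):
    """Build the repository id used to address a model on the Hub."""
    return f"{username}/{repo_name}"


def load_classifier(username):
    """Load the uploaded model as a pipeline (downloads from the Hub)."""
    from transformers import pipeline  # imported lazily; requires transformers
    return pipeline("sentiment-analysis", model=hub_repo_id(username))


repo_id = hub_repo_id("your_username")
print(repo_id)
```

For example, `load_classifier("your_username")("I hate you")` should return the predicted label and score, provided `your_username` is replaced with the account that owns the repository.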
transformers library is "distilbert-base-uncased".