
Using Hugging Face (1)


1. An introduction to the Hugging Face website

Hugging Face is essentially the GitHub of the AI world: the models it hosts are open source and free, which makes it very useful for AI developers. Before using it, install the transformers library.
Website: https://huggingface.co/

pip install transformers

2. A quick fine-tuning tutorial (using news classification as the example)

Find a model on the Hugging Face website that fits your task and download it. The default download paths are:

  • 1) On Windows, models are saved under C:\Users\[username]\.cache\torch\transformers\; what gets downloaded varies by model
  • 2) On Linux, models are saved under ~/.cache/torch/transformers/

(Note: recent versions of transformers cache downloads under ~/.cache/huggingface/ instead.)

A tokenizer converts text into numbers the model can understand; here, checkpoint is the name of the model we picked:

from transformers import AutoTokenizer

model_checkpoint = "uer/roberta-base-finetuned-chinanews-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

# The task is unstructured news; task_to_keys maps it to the prepared text column(s)
sentence1_key, sentence2_key = task_to_keys[task]


def preprocess_function(examples):
    return tokenizer(examples[sentence1_key], truncation=True, padding="max_length", max_length=max_length)


# Tokenize the whole dataset
encoded_dataset = dataset.map(preprocess_function, batched=True, load_from_cache_file=False)
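To see what truncation=True and padding="max_length" do to each example, here is a toy sketch with a hypothetical integer tokenizer. It is not the real transformers tokenizer (whose output may also include token-type ids); it only illustrates the fixed-length output shape:

```python
# Toy sketch, not the real transformers tokenizer: it only illustrates how
# truncation and padding to a fixed max_length shape the output.
def toy_encode(token_ids, max_length, pad_id=0):
    ids = token_ids[:max_length]               # truncation=True
    mask = [1] * len(ids)                      # real tokens
    ids += [pad_id] * (max_length - len(ids))  # padding="max_length"
    mask += [0] * (max_length - len(mask))     # padded positions are masked out
    return {"input_ids": ids, "attention_mask": mask}


print(toy_encode([101, 2769, 3221, 102], max_length=6))
# → {'input_ids': [101, 2769, 3221, 102, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0]}
```

Every example ends up with exactly max_length positions, so the batch can be stacked into one tensor; the attention mask tells the model which positions are padding.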

Model training likewise starts from the chosen checkpoint:

import numpy as np
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load with a sequence-classification head, since this is a classification task
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=label_num, ignore_mismatched_sizes=True
)

# Training hyperparameters
args = TrainingArguments(
    "test-glue",
    evaluation_strategy="no",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    # save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    # load_best_model_at_end=True,
    metric_for_best_model=metric_name,
)


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels, average='micro')


trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset['train'],
    # eval_dataset=encoded_dataset['test'],
    compute_metrics=compute_metrics,
)

trainer.train()  # train the model

# print(trainer.evaluate())

pred = trainer.predict(encoded_dataset['test'])
pred = np.argmax(pred.predictions, axis=1)  # final predicted labels
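The final np.argmax call turns each row of output logits into a class index. A self-contained illustration with made-up logits for four news categories:

```python
import numpy as np

# Made-up logits for 3 test examples over 4 news categories (illustrative only).
logits = np.array([
    [0.1, 2.3, -0.5, 0.0],
    [1.7, 0.2, 0.4, -1.0],
    [-0.3, 0.1, 0.0, 3.2],
])

pred = np.argmax(logits, axis=1)  # index of the highest logit in each row
print(pred)  # → [1 0 3]
```

Each predicted index can then be mapped back to a category name via the model's label mapping.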