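December 20, 2025, by 玄貓 (BlackCat)

Quantization and LoRA Fine-Tuning for Large Language Models

This article covers quantization and LoRA (Low-Rank Adaptation) fine-tuning for large language models (LLMs) and explains how they lower deployment resource requirements and improve efficiency. It discusses the benefits and challenges of quantization, the core ideas behind LoRA, preparing a model for int8 training, configuring and applying LoRA layers, and model training, evaluation, and a practical use case. It also uses PubMed ...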

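Large language models (LLMs) play an important role in natural language processing, but their sheer size makes them hard to deploy. Quantization and LoRA fine-tuning address this problem. Quantization shrinks the model's memory footprint and can reduce inference latency on hardware with efficient int8 kernels, at the cost of some potential accuracy loss. LoRA adds a small number of trainable parameters, which preserves most of the model's quality while sharply reducing the resources needed for training. This article shows how to combine the two techniques. We first load a pretrained model (flan-t5-xxl in this example) in int8 format and prepare it for LoRA fine-tuning. We then walk through the LoRA configuration, training, evaluation, and applying the model to a real task.

A Closer Look at LLM Quantization and LoRA Fine-Tuning

In deep learning, and in natural language processing (NLP) in particular, large language models (LLMs) have become a key technology. These models need large amounts of compute and memory, however, which limits deployment on resource-constrained devices. Quantization and LoRA (Low-Rank Adaptation) fine-tuning were proposed to address this by making models cheaper to run and to adapt.

Benefits and Challenges of Quantization

Quantizing a large language model to int8 has several benefits. First, the weights take far less memory (one byte per parameter instead of two for fp16 or four for fp32), so the model fits on devices with limited memory. Second, inference can be faster because less data has to be moved, although the actual speedup depends on the hardware and kernels. Quantization can also introduce some accuracy loss, because each weight is rounded to one of only 256 levels. It adds computational overhead as well, since extra operations are needed to convert values to and from int8. Finally, quantized models can be less portable, because they are usually optimized for a specific hardware platform.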
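The rounding error behind that accuracy loss can be seen in a minimal sketch of absmax int8 quantization. This is a generic illustration under simplifying assumptions (one scale for the whole list, hypothetical helper names quantize_absmax and dequantize), not the exact scheme that any particular library uses.

```python
def quantize_absmax(weights):
    """Map floats to int8 values using a single absmax scale."""
    # Fall back to 1.0 when every weight is zero to avoid dividing by zero
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


weights = [0.1234, -0.5, 0.3371, 1.27, -0.0449]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)          # [12, -50, 34, 127, -4]
print(max_error)  # never larger than half a quantization step (scale / 2)
```

Each weight is stored as one of the integer codes plus a shared scale, so the worst-case error per weight is half of one quantization step.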

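Loading and Preparing the Model

To start training, first load the model. This example uses philschmid/flan-t5-xxl-sharded-fp16, a sharded version of google/flan-t5-xxl. Sharding splits the checkpoint into smaller files so that the model can be loaded without exceeding memory limits.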

from transformers import AutoModelForSeq2SeqLM

model_id = "philschmid/flan-t5-xxl-sharded-fp16"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True, device_map="auto")

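Code walkthrough:

The AutoModelForSeq2SeqLM class from the Transformers library loads sequence-to-sequence language models.
model_id names the model to load, here philschmid/flan-t5-xxl-sharded-fp16.
load_in_8bit=True loads the weights as 8-bit integers to reduce memory use. It relies on the bitsandbytes library.
device_map="auto" places the model on the available devices automatically.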
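To get a feel for what 8-bit loading saves, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. The parameter count of roughly 11 billion is an assumption (it is consistent with the trainable-parameter ratio this article reports for the model), and activations, optimizer state, and framework overhead are ignored.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Memory needed to store the weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 11_000_000_000  # assumption: rough parameter count of flan-t5-xxl

for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {weight_memory_gib(num_params, dtype):.1f} GiB")
```

Under these assumptions the int8 weights need about a quarter of the fp32 footprint and half of the fp16 footprint.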

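LoRA Configuration and Fine-Tuning

With the model loaded, define a LoRA configuration that specifies the low-rank matrices: the rank r, lora_alpha, the target modules, lora_dropout, and the bias setting.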

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM
)

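Code walkthrough:

r=16: the inner dimension (rank) of the low-rank matrices, which controls how many parameters are trained. A higher r means more trainable parameters and more memory.
lora_alpha=32: the scaling factor for the LoRA update, which is multiplied by lora_alpha / r. A higher value gives the LoRA layers more influence over the base model.
target_modules=["q", "v"]: the modules that receive LoRA update matrices, here the query and value projections of the attention layers.
lora_dropout=0.05: the dropout rate applied inside the LoRA layers to reduce overfitting.
bias="none": no bias parameters are trained.
task_type=TaskType.SEQ_2_SEQ_LM: the task type, sequence-to-sequence language modeling.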
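How r and lora_alpha interact is easier to see in a toy forward pass. The sketch below is a pure-Python illustration of the LoRA formulation h = W x + (lora_alpha / r) * B (A x) on tiny matrices; the helper names matvec and lora_forward and all dimensions are hypothetical, and dropout is omitted.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, lora_alpha, r):
    """Frozen base projection plus the scaled low-rank update."""
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # low-rank path: project down to r, then up
    scaling = lora_alpha / r
    return [b + scaling * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, lora_alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]                                   # trainable, starts at zero
x = [1.0, -2.0, 0.5, 0.0, 3.0, 1.5]

h0 = lora_forward(W, A, B, x, lora_alpha, r)
print(h0 == matvec(W, x))  # True: with B at zero the adapter changes nothing

full_params = d_in * d_out
lora_params = r * (d_in + d_out)
print(full_params, lora_params)  # 24 20
```

At this toy size the saving is small, but the adapter grows with r * (d_in + d_out) while the full matrix grows with d_in * d_out, so the gap widens quickly for large layers.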

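Low-Rank Adaptation (LoRA) in Large Language Models: Application and Implementation

Core Concepts of LoRA

LoRA (Low-Rank Adaptation) is a technique for fine-tuning large pretrained language models efficiently. It freezes the original weights and adds a small number of trainable parameters, which adapts the model to a task while greatly reducing the memory and compute needed for training.

Key Components of a LoRA Layer

A LoRA layer is defined by the following settings:

r (rank): the rank of the LoRA matrices, which controls the number of added parameters. A smaller rank means fewer trainable parameters but may limit how much the model can adapt.
target_modules: the modules that LoRA is applied to, such as the query, key, and value projections or the output layer.
lora_alpha: the scaling factor applied to the LoRA output, which controls how strongly the LoRA layers affect the model's original output.
lora_dropout: the dropout rate of the LoRA layers, used to reduce overfitting; this article uses 0.05.
bias: whether bias parameters are trained; one of none, all, or lora_only.
task_type: the task type of the model; this article uses sequence-to-sequence language modeling (TaskType.SEQ_2_SEQ_LM).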

Setting Up the LoRA Configuration

The LoRA configuration controls how the LoRA layers behave. Here is an example:

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    target_modules=["q", "v"],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none"
)

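Preparing the Model for int8 Training

The model was already loaded with int8 weights by load_in_8bit=True. Before training, prepare it with PEFT's prepare_model_for_int8_training function:

from peft import prepare_model_for_int8_training

model = prepare_model_for_int8_training(model)

The function does not quantize the weights itself. It freezes the base model's parameters, casts the layer norms to float32 for training stability, and by default enables gradient checkpointing. Newer PEFT releases replace this function with prepare_model_for_kbit_training.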

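Adding LoRA Layers to the Model

The get_peft_model function adds the LoRA layers to the model:

from peft import get_peft_model

model = get_peft_model(model, lora_config)

After adding the LoRA layers, print the number of trainable parameters:

model.print_trainable_parameters()

The output shows that the model now has 18,874,368 trainable parameters, about 0.17% of all parameters.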
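The figure of 18,874,368 can be reproduced by hand. The sketch below assumes the architecture values from the public flan-t5-xxl configuration (hidden size 4096, 64 attention heads of size 64, 24 encoder and 24 decoder layers); these numbers are not stated in this article, so treat them as assumptions.

```python
# Assumed architecture values (from the public flan-t5-xxl config, not from this article)
d_model = 4096       # hidden size
inner_dim = 64 * 64  # num_heads * d_kv, the output size of the q and v projections
num_layers = 24      # layers in the encoder and in the decoder

r = 16               # LoRA rank from lora_config
targets_per_block = 2  # target_modules=["q", "v"]

# Encoder layers have self-attention; decoder layers have self- and cross-attention
attention_blocks = num_layers + 2 * num_layers

# Each adapted projection adds A (r x d_model) and B (inner_dim x r)
params_per_target = r * (d_model + inner_dim)
trainable = attention_blocks * targets_per_block * params_per_target

print(f"{trainable:,}")  # 18,874,368
```

The result matches the reported count, which suggests that every q and v projection in all 72 attention blocks receives an adapter.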

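Code walkthrough:

get_peft_model applies the LoRA configuration to the model and returns a model that is ready for fine-tuning.
print_trainable_parameters prints the number and share of trainable parameters.

Creating the Data Collator

The data collator pads the inputs and labels in each batch. The following example creates one with the DataCollatorForSeq2Seq class: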

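from transformers import AutoTokenizer, DataCollatorForSeq2Seq

# The tokenizer is not loaded earlier in this article; load the one that matches the base model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# -100 is the label id that the loss function ignores
label_pad_token_id = -100
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8
)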

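Code walkthrough:

DataCollatorForSeq2Seq pads the data for sequence-to-sequence tasks.
label_pad_token_id=-100 makes the loss computation ignore the padded label positions.
pad_to_multiple_of=8 pads inputs and labels to a length that is a multiple of 8, which helps GPU efficiency.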
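The padding behavior can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the collator's actual implementation; pad_batch and the token ids are hypothetical.

```python
def pad_batch(sequences, pad_value, multiple_of=8):
    """Pad every sequence to the batch maximum, rounded up to a multiple."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple_of) * multiple_of  # ceiling to the next multiple
    return [seq + [pad_value] * (target - len(seq)) for seq in sequences]


# Hypothetical label token ids for a batch of three examples
labels = [[71, 12, 5, 1], [9, 1], [3, 4, 5, 6, 7, 8, 9, 10, 1]]
padded = pad_batch(labels, pad_value=-100)

print([len(seq) for seq in padded])  # [16, 16, 16]

# Only positions that are not -100 contribute to the loss
counted = sum(tok != -100 for seq in padded for tok in seq)
print(counted)  # 15
```

The longest sequence has 9 tokens, so the batch is padded to 16, and the 33 padded positions are excluded from the loss.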

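Defining the Training Hyperparameters

Seq2SeqTrainingArguments defines the training hyperparameters, such as the learning rate, the number of training steps, and the batch size: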

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="lora-flan-t5-xxl",
    auto_find_batch_size=True,
    learning_rate=1e-3,
    max_steps=1,
    logging_dir="lora-flan-t5-xxl/logs",
    logging_strategy="steps",
    logging_steps=1,
    save_strategy="no",
    report_to="tensorboard"
)

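Code walkthrough:

output_dir sets the output directory for the model and logs.
auto_find_batch_size=True lets the Trainer search for a batch size that fits in memory.
learning_rate=1e-3 sets the learning rate.
max_steps=1 caps training at a single optimizer step, which is only enough for a smoke test; raise it for a real run.
logging_dir and logging_strategy control where and how often logs are written.
save_strategy="no" disables intermediate checkpoints, and report_to="tensorboard" sends the logs to TensorBoard.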
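The idea behind auto_find_batch_size can be sketched as a retry loop that shrinks the batch size whenever a step runs out of memory. The version below is a simplified stand-in, not the library's code: the halving rule, the starting size of 128, the MemoryError exception, and the fake_train function are all assumptions made for illustration.

```python
def find_batch_size_sketch(train_fn, starting_batch_size=128):
    """Retry train_fn with a smaller batch size after each out-of-memory failure."""
    batch_size = starting_batch_size
    while batch_size > 0:
        try:
            return train_fn(batch_size)
        except MemoryError:
            batch_size //= 2  # assumption: halve on failure
    raise RuntimeError("No executable batch size found.")


def fake_train(batch_size, limit=24):
    """Hypothetical training step that only fits batches of up to `limit` samples."""
    if batch_size > limit:
        raise MemoryError(f"batch size {batch_size} does not fit")
    return batch_size


print(find_batch_size_sketch(fake_train))  # 16
```

Starting from 128, the loop tries 64 and 32 before succeeding at 16, the first size under the simulated limit.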

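Training and Evaluation Workflow for Fine-Tuning a Large Language Model

Fine-tuning a large language model takes several steps to make sure the model learns the target task. This section uses LoRA (Low-Rank Adaptation) to fine-tune FLAN-T5-XXL on the PubMedQA task.

Step 3.5: Train the Model

With the hyperparameters defined, create a Seq2SeqTrainer instance to train the model. During training, metrics are logged to TensorBoard every logging_steps steps. Because save_strategy is set to "no", no intermediate checkpoints are saved.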

# Create the Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_train_dataset,
)
model.config.use_cache = False  # Disable the cache to silence warnings during training; re-enable it for inference

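Code walkthrough:

Seq2SeqTrainer is the Hugging Face Transformers class for training sequence-to-sequence models.
model is the model to train.
args holds the training hyperparameters, such as the learning rate and batch size.
data_collator assembles the training examples into padded batches in the format the model expects.
train_dataset is the tokenized training dataset.
Setting model.config.use_cache to False avoids warnings during training.

The Training Run

Call trainer.train() to start training. For stability, some layers of the T5 model are kept in float32 precision. Training time depends on the model size and the available compute.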

trainer.train()

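Code walkthrough:

trainer.train() starts the training loop.
The T5 model keeps some layers in float32 to improve training stability.

Step 3.6: Save the Model

After training, save the fine-tuned LoRA model and the tokenizer.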

# Save the fine-tuned LoRA model and the tokenizer
peft_model_id = "flan-t5-pubmed"
trainer.model.save_pretrained(peft_model_id)
tokenizer.save_pretrained(peft_model_id)

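Code walkthrough:

save_pretrained saves the configuration and weights of the model and the files of the tokenizer. For a PEFT model, this writes the LoRA adapter weights and configuration, not the full base model.
peft_model_id is the name of the directory the model is saved to.

Step 4: Evaluate the Model

After training, the model needs to be evaluated. Common metrics include the F1-score, BLEU score, ROUGE score, and a sentence similarity score.

Common Evaluation Metrics

Metric	Description
F1-score	Harmonic mean of token-level precision and recall between the model's answer and the reference answer
BLEU score	N-gram precision of the model's answer against the reference answer, with a brevity penalty
ROUGE score	N-gram and longest-common-subsequence overlap between the model's answer and the reference answer
Sentence similarity score	Similarity between the model's answer and the reference answer under a similarity measure such as cosine similarity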
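As a concrete look at what BLEU measures, the sketch below computes the clipped (modified) n-gram precision that forms the core of the score. It is a simplified illustration with hypothetical helper names and example sentences; the full metric also combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0


reference = "nontriploid partial moles do not exist".split()
candidate = "partial moles do not exist exist".split()

print(modified_precision(reference, candidate, 1))  # 5 of 6 unigrams match
print(modified_precision(reference, candidate, 2))  # 0.8
```

Clipping is what stops the repeated word "exist" from being counted twice: the reference contains it only once, so only one occurrence is credited.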

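Loading the Fine-Tuned Model

Before evaluation, load the fine-tuned LoRA model and its configuration.

# Load the LoRA configuration
# ...

Code walkthrough:

Loading the LoRA configuration retrieves the settings of the fine-tuned model, such as r and lora_alpha.

With these steps, a large language model can be fine-tuned and evaluated effectively so that it adapts to a specific natural language processing task.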

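Fine-Tuning a Large Language Model in Practice: The PubMed Dataset

Fine-tuning a large language model (LLM) is an important way to improve its performance on a target task. This section uses LoRA (Low-Rank Adaptation) to fine-tune a pretrained model and applies it to a medical research question-answering task built on the PubMed dataset.

Step 4.1: Load the Fine-Tuned Model

First, load the fine-tuned LoRA model and its configuration file. The following code shows how: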

from peft import PeftConfig, PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

peft_model_id = "flan-t5-pubmed"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, load_in_8bit=True, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
model.eval()

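Code walkthrough:

peft_model_id is the location where the fine-tuned LoRA model was saved.
PeftConfig.from_pretrained loads the LoRA model's configuration file.
AutoModelForSeq2SeqLM.from_pretrained loads the base LLM; load_in_8bit=True loads it in int8 format to reduce memory use.
AutoTokenizer.from_pretrained loads the tokenizer that matches the base model.
PeftModel.from_pretrained combines the base model with the LoRA adapter by loading the fine-tuned weights.
model.eval() puts the model in evaluation mode, which disables training-only behavior such as dropout.

Step 4.2: Test the Fine-Tuned Model

Before running the model on the whole test set, test it on a single example so that the result can be checked by hand.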

import torch

# Assumed value: the original does not show how maxResponseLength is set
maxResponseLength = 100

prompt = f"""
Answer the question based on the context below.
Context: To study whether nontriploid partial hydatidiform moles truly exist. We conducted a reevaluation of pathology and ploidy in 19 putative nontriploid partial hydatidiform moles using standardized histologic diagnostic criteria and repeat flow cytometric testing by the Hedley technique. On review of the 19 moles, 53% (10/19) were diploid nonpartial moles (initially pathologically misclassified), and 37% (7/19) were triploid partial moles (initial ploidy misclassifications). One additional case (5%) was a diploid early complete mole (initially pathologically misclassified).
Question: Do nontriploid partial hydatidiform moles exist?
""".strip()

# Tokenize the prompt and move the tensors to the model's device
encoding = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        input_ids=encoding.input_ids,
        attention_mask=encoding.attention_mask,
        max_new_tokens=maxResponseLength,
    )
    generated_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("Generated Output: ", generated_output)

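Code walkthrough:

The prompt string contains the context and the question.
The tokenizer encodes the prompt into the input format the model accepts.
The model generates an answer inside torch.inference_mode().
The generated tokens are decoded to text and printed.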
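When more than one example needs to be tested, it can help to build the prompt with a small function so that every example uses the same template. The build_prompt helper below is hypothetical; it only reproduces the prompt layout used in this step.

```python
def build_prompt(context: str, question: str) -> str:
    """Fill the question-answering template used in this article."""
    return (
        "Answer the question based on the context below.\n"
        f"Context: {context.strip()}\n"
        f"Question: {question.strip()}"
    )


example = build_prompt(
    context="  To study whether nontriploid partial hydatidiform moles truly exist. ",
    question="Do nontriploid partial hydatidiform moles exist?",
)
print(example)
```

Stripping the inputs keeps stray whitespace in the dataset from changing the prompt that the model sees.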

Sample output:

Generated Output: Nontriploid partial hydatidiform moles do not exist.
The initial pathologic diagnosis of these moles is often incorrect. Flow cytometric testing by the Hedley technique is the most reliable method for determining ploidy.

Step 4.3: Evaluate the Fine-Tuned Model on the Test Dataset

Next, evaluate the fine-tuned model on the entire test dataset.

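import collections

import evaluate
import numpy as np
from datasets import load_from_disk
from nltk.translate.bleu_score import corpus_bleu
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

# Load the saved test split and return PyTorch tensors
test_dataset = load_from_disk('data/eval/').with_format('torch')

rouge_metric = evaluate.load('rouge')
bleu_metric = corpus_bleu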

def compute_f1(a_gold, a_pred):
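    # Token-level F1 between the gold and predicted answers
    # (get_tokens is a tokenization helper that the original does not show)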
    gold_toks = get_tokens(a_gold)
    pred_toks = get_tokens(a_pred)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        return int(gold_toks == pred_toks)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

# Load the embedding model once instead of on every call
sentence_model = SentenceTransformer('average_word_embeddings_glove.6B.300d')

def calculate_sentencesim_score(generated_answer, actual_answer):
    # Cosine similarity between the sentence embeddings of the two answers
    embeddings1 = sentence_model.encode([generated_answer])[0]
    embeddings2 = sentence_model.encode([actual_answer])[0]
    similarity = np.dot(embeddings1, embeddings2) / (np.linalg.norm(embeddings1) * np.linalg.norm(embeddings2))
    return similarity

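# evaluate_peft_model is a helper that the original does not show; it is expected to
# return the decoded prediction and the reference answer for one sample
predictions, references = [], []
for sample in tqdm(test_dataset):
    p, l = evaluate_peft_model(sample)
    predictions.append(p)
    references.append(l)

# Compute the evaluation metrics
rouge = rouge_metric.compute(predictions=predictions, references=references)
# corpus_bleu expects token lists, so split the strings on whitespace
bleu = bleu_metric([[ref.split()] for ref in references], [pred.split() for pred in predictions], auto_reweigh=True)
f1_scores = [compute_f1(ref, pred) for ref, pred in zip(references, predictions)]
f1_avg = sum(f1_scores) / len(f1_scores)
sentencesim_scores = [calculate_sentencesim_score(pred, ref) for pred, ref in zip(predictions, references)]
sentencesim_avg = sum(sentencesim_scores) / len(sentencesim_scores)

# Recent versions of the evaluate ROUGE metric return plain floats
print(f"Rouge1: {rouge['rouge1'] * 100:.2f}%")
print(f"Rouge2: {rouge['rouge2'] * 100:.2f}%")
print(f"RougeL: {rouge['rougeL'] * 100:.2f}%")
print(f"RougeLsum: {rouge['rougeLsum'] * 100:.2f}%")
print(f"BLEU: {bleu * 100:.2f}%")
print(f"F1: {f1_avg * 100:.2f}%")
print(f"Sentence similarity: {sentencesim_avg * 100:.2f}%")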

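Code walkthrough:

The test dataset is loaded and converted to PyTorch format.
Functions are defined for several evaluation metrics, including ROUGE, BLEU, F1, and sentence similarity.
A prediction is generated for every sample in the test dataset, and the predictions and reference answers are collected.
The metric scores are computed and printed to quantify the performance of the fine-tuned model.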
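The compute_f1 function relies on a get_tokens helper that is not shown above. A minimal sketch, assuming SQuAD-style answer normalization (lowercasing, removing punctuation and English articles), could look like the following; normalize_answer and get_tokens here are hypothetical stand-ins, and compute_f1 is repeated so that the example runs on its own.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def get_tokens(text):
    """Whitespace tokens of the normalized answer (empty list for empty input)."""
    return normalize_answer(text).split() if text else []


def compute_f1(a_gold, a_pred):
    """Token-level F1, following the function shown in this step."""
    gold_toks = get_tokens(a_gold)
    pred_toks = get_tokens(a_pred)
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        return int(gold_toks == pred_toks)
    if num_same == 0:
        return 0
    precision = num_same / len(pred_toks)
    recall = num_same / len(gold_toks)
    return (2 * precision * recall) / (precision + recall)


gold = "Nontriploid partial moles do not exist."
pred = "The moles do not exist"
print(get_tokens(pred))        # ['moles', 'do', 'not', 'exist']
print(compute_f1(gold, pred))  # precision 1.0, recall 4/6, so F1 is about 0.8
```

All four predicted tokens appear in the six-token gold answer, so precision is perfect and the score is pulled down only by the two gold tokens the prediction misses.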