July 19, 2024, by 玄貓 (BlackCat)

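Techniques for Enhancing the Training Loop of Large Language Models

This article explores how to enhance the training loop for large language models, covering the learning rate warmup strategy, LoRA fine-tuning, and the details of the code. Adjusting the learning rate curve and applying LoRA improves training stability and efficiency. The article also covers fine-tuning on the Stanford Alpaca dataset and walks through the code so that readers can pick up LLM training techniques quickly.

Machine Learning, Deep Learning

Large language models, learning rate warmup, training loop, PyTorch, LoRA, instruction fine-tuning

Training a large language model is complex and time-consuming, and it requires careful hyperparameter tuning to stay stable and efficient. This article introduces learning rate warmup, which raises the learning rate gradually to avoid violent oscillations early in training, and pairs it with cosine decay to shape the learning rate curve and control the pace of training. In addition, LoRA greatly reduces the number of trainable parameters, which lowers compute cost and speeds up training. The article also covers fine-tuning on the Stanford Alpaca dataset and walks through the code in detail, including the InstructionDataset class, a custom collate function, and the LoRA setup, so that readers can understand how the code works and apply it to real LLM training tasks.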

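Exercise Solutions

Chapter 7 Exercise Solutions

Exercise 7.3

To fine-tune on the original Stanford Alpaca dataset, we only need to change the file URL. The original URL is: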

url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch07/01_main-chapter-code/instruction-data.json"

Change it to:

url = "https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json"

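The Stanford Alpaca dataset contains 52,000 entries (about 50 times as many as the dataset used in chapter 7), and the entries are longer than those in chapter 7. Running the training on a GPU is therefore strongly recommended. If you run into out-of-memory errors, consider reducing the batch size from 8 to 4, 2, or 1. Besides lowering the batch size, you can also reduce allowed_max_length from 1024 to 512 or 256.

Exercise 7.4

To instruction fine-tune the model with LoRA, use the relevant classes and functions from appendix E: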

from appendix_E import LoRALayer, LinearWithLoRA, replace_linear_with_lora

Add the following lines below the model-loading code from section 7.5:

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters before: {total_params:,}")
for param in model.parameters():
    param.requires_grad = False
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters after: {total_params:,}")
replace_linear_with_lora(model, rank=16, alpha=16)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable LoRA parameters: {total_params:,}")
model.to(device)

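On an Nvidia L4 GPU, fine-tuning with LoRA takes about 1.30 minutes, compared with 1.80 minutes for the original code, so LoRA is roughly 28% faster in this case. The score, evaluated with the Ollama Llama 3 method from chapter 7, is about 50, which is on par with the original model.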
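For readers who do not have appendix E at hand, the following is a minimal sketch of what the imported LoRA helpers typically look like. The class and function names mirror the import above, but the bodies are an assumption written for illustration and may differ from the actual appendix_E module; the initialization (random A, zero B) follows the common LoRA convention. The toy model at the end is hypothetical.

```python
import math

import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    """Low-rank update alpha * (x @ A @ B); an illustrative sketch."""

    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = nn.Parameter(torch.empty(in_dim, rank))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        # B starts at zero, so the wrapped layer's output is unchanged at first
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)


class LinearWithLoRA(nn.Module):
    """Wraps a (frozen) nn.Linear and adds the trainable low-rank update."""

    def __init__(self, linear, rank, alpha):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)


def replace_linear_with_lora(model, rank, alpha):
    """Recursively swap every nn.Linear for a LinearWithLoRA wrapper."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear):
            setattr(model, name, LinearWithLoRA(module, rank, alpha))
        else:
            replace_linear_with_lora(module, rank, alpha)


# Hypothetical toy model: freeze everything, then add LoRA adapters
torch.manual_seed(0)
toy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
for p in toy.parameters():
    p.requires_grad = False
replace_linear_with_lora(toy, rank=2, alpha=2)
trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
print(f"Trainable LoRA parameters: {trainable}")
```

Freezing the original weights before calling replace_linear_with_lora matters: only the A and B matrices created afterwards have requires_grad set, which is why the trainable count drops so sharply.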

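Appendix A Exercise Solutions

Exercise A.1

The network has two inputs and two outputs. It also has two hidden layers with 30 and 20 nodes, respectively. We can compute the number of parameters programmatically as follows: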

model = NeuralNetwork(2, 2)
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("Total number of trainable model parameters:", num_params)

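This returns 752. We can also calculate it by hand:

First hidden layer: 2 inputs × 30 hidden units + 30 bias units
Second hidden layer: 30 input units × 20 nodes + 20 bias units
Output layer: 20 input nodes × 2 output nodes + 2 bias units

Adding up the parameters of all layers gives 2 × 30 + 30 + 30 × 20 + 20 + 20 × 2 + 2 = 752.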
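The manual calculation above can be sketched as a small helper that needs no PyTorch. The function name count_mlp_params is hypothetical and exists only for this illustration.

```python
def count_mlp_params(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


# 2 inputs, hidden layers of 30 and 20 nodes, 2 outputs
print(count_mlp_params([2, 30, 20, 2]))  # 752
```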

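Exercise A.2

The exact runtimes depend on the hardware used for the experiment. In my experiments, I observed a significant speedup even for a small matrix multiplication like the following: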

a = torch.rand(100, 200)
b = torch.rand(200, 300)
%timeit a@b

On the CPU, this gives:

63.8 μs ± 8.7 μs per loop

When run on the GPU:

a, b = a.to("cuda"), b.to("cuda")
%timeit a @ b

The result is:

13.8 μs ± 425 ns per loop

In this case, on a V100, the computation is roughly four times faster.
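The %timeit magic is only available in IPython and Jupyter. Outside a notebook, a comparable measurement could be sketched as below. The helper name time_matmul and the number of runs are assumptions made for this illustration. Because CUDA kernels are launched asynchronously, the sketch calls torch.cuda.synchronize() before reading the clock when the tensors are on a GPU.

```python
import time

import torch


def time_matmul(a, b, n_runs=100):
    """Average seconds per a @ b, synchronizing if the tensors are on a GPU."""
    use_cuda = a.is_cuda
    a @ b  # warm-up run so one-time setup cost is not measured
    if use_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        a @ b
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued kernels to finish
    return (time.perf_counter() - start) / n_runs


a = torch.rand(100, 200)
b = torch.rand(200, 300)
cpu_time = time_matmul(a, b)
print(f"CPU: {cpu_time * 1e6:.1f} us per loop")

if torch.cuda.is_available():
    gpu_time = time_matmul(a.to("cuda"), b.to("cuda"))
    print(f"GPU: {gpu_time * 1e6:.1f} us per loop")
```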

Code Analysis

InstructionDataset class implementation

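import torch  # used by custom_collate_fn below
from torch.utils.data import Dataset

# format_input is the prompt-formatting helper defined in chapter 7


class InstructionDataset(Dataset):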
    def __init__(self, data, tokenizer):
        self.data = data
        self.instruction_lengths = []
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = format_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(tokenizer.encode(full_text))
            instruction_length = len(tokenizer.encode(instruction_plus_input))
            self.instruction_lengths.append(instruction_length)

    def __getitem__(self, index):
        return self.instruction_lengths[index], self.encoded_texts[index]

    def __len__(self):
        return len(self.data)

custom_collate_fn function implementation

def custom_collate_fn(
    batch,
    pad_token_id=50256,
    ignore_index=-100,
    allowed_max_length=None,
    device="cpu"
):
    batch_max_length = max(len(item)+1 for instruction_length, item in batch)
    inputs_lst, targets_lst = [], []
    for instruction_length, item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        inputs = torch.tensor(padded[:-1])
        targets = torch.tensor(padded[1:])
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index
        # Mask the instruction tokens so only the response contributes to the loss
        targets[:instruction_length-1] = ignore_index
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]
        inputs_lst.append(inputs)
        targets_lst.append(targets)
    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)
    return inputs_tensor, targets_tensor

LoRA implementation

from appendix_E import LoRALayer, LinearWithLoRA, replace_linear_with_lora

# ...

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters before: {total_params:,}")
for param in model.parameters():
    param.requires_grad = False
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable parameters after: {total_params:,}")
replace_linear_with_lora(model, rank=16, alpha=16)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total trainable LoRA parameters: {total_params:,}")
model.to(device)

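Code walkthrough:

InstructionDataset class: prepares the instruction dataset. It tokenizes each entry's full text (instruction plus response) and records the token length of the instruction part, so the collate function can later tell the two apart.
custom_collate_fn function: assembles a batch. It pads every sequence to the same length, shifts the targets by one position, and replaces the padding tokens after the first one, as well as the instruction tokens, with ignore_index so they are excluded from the loss.
LoRA implementation: freezes the original weights and fine-tunes the model through LoRA layers, which reduces the number of trainable parameters and improves training efficiency.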
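To make the masking performed by custom_collate_fn concrete, here is a minimal sketch that applies the same steps to a single sequence using plain Python lists. The function name build_targets and the toy token IDs are hypothetical; the real function operates on PyTorch tensors and whole batches.

```python
def build_targets(item, instruction_length, batch_max_length,
                  pad_token_id=50256, ignore_index=-100):
    """Pad one token list and derive (inputs, targets) the way the collate
    function does, but with plain lists instead of tensors."""
    # Pad to the batch length (which already includes one extra pad token)
    padded = item + [pad_token_id] * (batch_max_length - len(item))
    inputs = padded[:-1]   # drop the last token
    targets = padded[1:]   # shift by one position

    # Keep the first padding token as an end-of-text target, ignore the rest
    pad_positions = [i for i, tok in enumerate(targets) if tok == pad_token_id]
    for i in pad_positions[1:]:
        targets[i] = ignore_index

    # Ignore the instruction part so only the response is scored
    for i in range(min(instruction_length - 1, len(targets))):
        targets[i] = ignore_index
    return inputs, targets


# Tokens 11-13 stand for the instruction, 21-22 for the response
inputs, targets = build_targets(
    item=[11, 12, 13, 21, 22], instruction_length=3, batch_max_length=8
)
print(inputs)   # [11, 12, 13, 21, 22, 50256, 50256]
print(targets)  # [-100, -100, 21, 22, 50256, -100, -100]
```

Only instruction_length - 1 target positions are masked because the targets are shifted by one: the target at index instruction_length - 1 is already the first response token.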

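Adding Bells and Whistles to the Training Loop

Appendix D enhances the training function used for the pretraining and fine-tuning processes covered in chapters 5 to 7. In particular, it covers learning rate warmup, cosine decay, and gradient clipping. These techniques are then incorporated into the training function to pretrain a large language model (LLM).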
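Since cosine decay is mentioned above without code, the following is a minimal sketch of one common formulation of the decay phase, for steps at or after the end of warmup. The function name, the min_lr floor, and the step counts are assumptions chosen for illustration, not values taken from this article.

```python
import math


def cosine_decay_lr(step, warmup_steps, total_steps, peak_lr, min_lr):
    """Learning rate for a step at or after the end of warmup.

    The rate follows half a cosine wave from peak_lr down to min_lr.
    """
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))


peak_lr, min_lr = 0.01, 0.001  # min_lr is a hypothetical floor
start = cosine_decay_lr(20, 20, 120, peak_lr, min_lr)    # right after warmup
middle = cosine_decay_lr(70, 20, 120, peak_lr, min_lr)   # halfway through decay
end = cosine_decay_lr(120, 20, 120, peak_lr, min_lr)     # final step
print(start, middle, end)
```

The rate starts at peak_lr, passes the midpoint between peak_lr and min_lr halfway through, and ends at min_lr.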

Initializing the model and data loaders

First, we reinitialize the model that was trained in chapter 5:

import torch
from chapter04 import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 256,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
model.eval()

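Code walkthrough:

This code initializes a GPT model from the settings in GPT_CONFIG_124M. It first imports PyTorch and the custom GPTModel class. It then defines the model configuration: vocabulary size, context length, embedding dimension, number of attention heads, number of layers, dropout rate, and whether the query, key, and value projections use a bias. Next, it selects CUDA if a CUDA device is available and falls back to the CPU otherwise. The random seed is set before the model is created so that the weight initialization is reproducible. Finally, the model is moved to the selected device and put into evaluation mode.

Loading the data

Next, we load the short story "The Verdict":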

import os
import urllib.request

file_path = "the-verdict.txt"
url = (
    "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/"
    "main/ch02/01_main-chapter-code/the-verdict.txt"
)

if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

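Code walkthrough:

This code obtains the text of "The Verdict". It first checks whether the file already exists locally. If not, it downloads the text from the given URL and saves it to a local file. If the file exists, it reads the local copy directly, which avoids downloading the same data again.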
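The download-or-read logic described above can be wrapped in a reusable helper, sketched below. The name load_text is hypothetical. The demonstration only exercises the cache-hit path with a temporary file, so it does not touch the network, and the URL passed in is a placeholder that is never requested.

```python
import os
import tempfile
import urllib.request


def load_text(file_path, url):
    """Return the text at file_path, downloading it from url only if missing."""
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(text)
        return text
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()


# Demonstrate the cache-hit path without any network access
with tempfile.TemporaryDirectory() as tmp_dir:
    cached = os.path.join(tmp_dir, "sample.txt")
    with open(cached, "w", encoding="utf-8") as f:
        f.write("cached text")
    result = load_text(cached, url="https://example.invalid/unused.txt")

print(result)  # cached text
```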

Creating the data loaders

Then we load text_data into the data loaders:

from previous_chapters import create_dataloader_v1

train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))

torch.manual_seed(123)
train_loader = create_dataloader_v1(
    text_data[:split_idx],
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    text_data[split_idx:],
    batch_size=2,
    max_length=GPT_CONFIG_124M["context_length"],
    stride=GPT_CONFIG_124M["context_length"],
    drop_last=False,
    shuffle=False,
    num_workers=0
)

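Code walkthrough:

This code splits the text into a training set and a validation set. It sets the training ratio and computes the split index from it; note that the split is made on characters of the raw text, before tokenization. It then builds the training and validation data loaders with the custom create_dataloader_v1 function, passing the batch size, maximum sequence length, stride, whether to drop the last incomplete batch, whether to shuffle, and the number of worker processes.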
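The split can be sketched on a toy string as follows. The helper names are hypothetical. The count_windows function additionally assumes that the data loader starts a window at every index in range(0, n_tokens - max_length, stride); this is an assumption about create_dataloader_v1, which is not shown in this article.

```python
def split_text(text, train_ratio):
    """Split raw text into training and validation parts by character count."""
    split_idx = int(train_ratio * len(text))
    return text[:split_idx], text[split_idx:]


def count_windows(n_tokens, max_length, stride):
    """Number of training windows, assuming window starts at
    range(0, n_tokens - max_length, stride)."""
    return len(range(0, n_tokens - max_length, stride))


sample = "abcdefghij" * 10  # 100 characters of toy text
train_text, val_text = split_text(sample, 0.90)
print(len(train_text), len(val_text))  # 90 10

print(count_windows(n_tokens=1000, max_length=256, stride=256))  # 3
print(count_windows(n_tokens=200, max_length=256, stride=256))   # 0
```

Under that assumption, a split with fewer tokens than max_length yields no windows at all, so it is worth checking that the validation portion is long enough for the chosen context length.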

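Learning rate warmup

Implementing learning rate warmup can stabilize the training of complex models such as LLMs. The process gradually increases the learning rate from a very low initial value (initial_lr) to a user-specified maximum (peak_lr).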

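n_epochs = 15
initial_lr = 0.0001  # learning rate at the first step
peak_lr = 0.01       # learning rate reached at the end of warmup

# Derive the warmup length from the total number of training steps
total_steps = len(train_loader) * n_epochs
warmup_steps = int(0.2 * total_steps)  # 20% of all steps
print(warmup_steps)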

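Code walkthrough:

This code computes the total number of training steps and the number of warmup steps. It first defines the number of epochs, the initial learning rate, and the peak learning rate. The total number of steps is the length of the training data loader multiplied by the number of epochs. The number of warmup steps is typically set to between 0.1% and 20% of the total; here it is set to 20%.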
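The calculation above can be sketched as a small helper. The function name and the value of 9 batches per epoch are hypothetical and used only for illustration. The lower bound of one warmup step is an added safeguard, not part of the original code: it keeps a later division by warmup_steps from failing on very short runs.

```python
def compute_warmup_steps(n_batches, n_epochs, warmup_fraction=0.2):
    """Return (total_steps, warmup_steps) for a given warmup fraction."""
    if not 0.0 < warmup_fraction <= 1.0:
        raise ValueError("warmup_fraction must be in (0, 1]")
    total_steps = n_batches * n_epochs
    # At least one warmup step, so (peak_lr - initial_lr) / warmup_steps is defined
    warmup_steps = max(1, int(warmup_fraction * total_steps))
    return total_steps, warmup_steps


# Hypothetical example: 9 batches per epoch, 15 epochs
total_steps, warmup_steps = compute_warmup_steps(n_batches=9, n_epochs=15)
print(total_steps, warmup_steps)  # 135 27
```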

A simple training loop template

Next, we implement a simple training loop template to illustrate the warmup process:

optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.1)
lr_increment = (peak_lr - initial_lr) / warmup_steps  # linear increase per step
global_step = -1
track_lrs = []

for epoch in range(n_epochs):
    for input_batch, target_batch in train_loader:
        optimizer.zero_grad()
        global_step += 1

        # Linear warmup, then hold the peak learning rate
        if global_step < warmup_steps:
            lr = initial_lr + global_step * lr_increment
        else:
            lr = peak_lr

        # Apply the learning rate to every parameter group
        for param_group in optimizer.param_groups:
            param_group["lr"] = lr

        track_lrs.append(optimizer.param_groups[0]["lr"])
        # The loss computation and weight update are omitted in this template

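Code walkthrough:

This code implements a simple training loop template that performs learning rate warmup. It first initializes the optimizer and the per-step learning rate increment. At each training step, it checks whether training is still in the warmup phase. If so, it computes the learning rate from the current step; otherwise, it uses the peak learning rate. Finally, it applies the learning rate to the optimizer and records it. The template only tracks the learning rate and does not compute a loss or update the weights.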
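The schedule produced by the loop can be reproduced without a model or optimizer, which makes it easy to inspect. The sketch below uses the same formula as the loop; the function name and the short step counts are hypothetical.

```python
def warmup_schedule(total_steps, warmup_steps, initial_lr, peak_lr):
    """Learning rate for every step: linear warmup, then constant at peak_lr."""
    lr_increment = (peak_lr - initial_lr) / warmup_steps
    lrs = []
    for global_step in range(total_steps):
        if global_step < warmup_steps:
            lrs.append(initial_lr + global_step * lr_increment)
        else:
            lrs.append(peak_lr)
    return lrs


lrs = warmup_schedule(total_steps=10, warmup_steps=4,
                      initial_lr=0.0001, peak_lr=0.01)
for step, lr in enumerate(lrs):
    print(step, f"{lr:.6f}")
```

With this formula, the first step uses exactly initial_lr, the last warmup step is still one increment below peak_lr, and peak_lr is first reached at step warmup_steps.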

Visualizing the learning rate

After running the code above, we can plot the learning rate to verify that the warmup works as intended:

import matplotlib.pyplot as plt

plt.ylabel("Learning rate")
plt.xlabel("Step")
total_training_steps = len(train_loader) * n_epochs
plt.plot(range(total_training_steps), track_lrs)
plt.show()

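Code walkthrough:

This code plots the learning rate against the training step. It imports matplotlib.pyplot, labels the y-axis as the learning rate and the x-axis as the step, plots the recorded values, and displays the figure, which makes the change in learning rate over the course of training easy to inspect.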

The resulting plot shows the learning rate rising linearly from the low initial value during warmup and then staying at the peak learning rate for the rest of training. The following diagram summarizes the per-step logic of the loop:

@startuml
skinparam backgroundColor #FEFEFE

title Learning rate warmup inside the training loop

start
:global_step = -1;
repeat
  :global_step += 1;
  if (global_step < warmup_steps?) then (yes)
    :lr = initial_lr + global_step * lr_increment;
  else (no)
    :lr = peak_lr;
  endif
  :Set lr on every optimizer parameter group;
  :Append lr to track_lrs;
repeat while (More batches or epochs?) is (yes) not (no)
stop
@enduml

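In summary, this appendix showed how to enhance the training loop: initializing the model, loading the data, creating the data loaders, implementing learning rate warmup, and visualizing the learning rate. These techniques can improve the stability and results of LLM training.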