July 5, 2024, 玄貓 (BlackCat)

A Roundup of Resources on LLM Training, Fine-Tuning, and Related Techniques

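This article collects papers, code examples, and resources on large language model (LLM) training, fine-tuning, and related techniques, covering the PyTorch cross-entropy function, LLM pretraining datasets, hyperparameters, architecture details, preparing Project Gutenberg books, continued pretraining, domain-specific LLMs (BloombergGPT), efficient training methods (GaLore), large-scale pretraining datasets (Dolma, The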

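Training and fine-tuning large language models has become a central topic in natural language processing. This article gathers related technical resources, including academic papers, code examples, and public datasets, on topics ranging from the internal workings of models to practical techniques. It uses the PyTorch cross-entropy function to explain a core concept of model training, and it lists several LLM pretraining datasets, hyperparameters, and architecture details as a reference for building and optimizing models. It also covers continued pretraining, domain-specific LLMs, and methods that improve training efficiency, such as GaLore. In addition, it introduces sampling strategies, decoding algorithms, and fine-tuning techniques, and lists several instruction fine-tuning datasets. Finally, it touches on more advanced topics: model evaluation, knowledge acquisition, and preference fine-tuning.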

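GaLore: An Efficient LLM Training Method

GaLore is a research project that aims to make LLM pretraining more efficient. The required code change consists only of replacing PyTorch's AdamW optimizer with the GaLoreAdamW optimizer provided by the galore-torch Python package: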

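# Example: using the GaLoreAdamW optimizer
from galore_torch import GaLoreAdamW

# Define the model (YourModel is a placeholder for your own torch.nn.Module)
model = YourModel()
# GaLore's low-rank projection only applies to parameter groups that set "rank";
# as a simplifying assumption, all 2-D weight matrices form that group here
galore_params = [p for p in model.parameters() if p.dim() == 2]
other_params = [p for p in model.parameters() if p.dim() != 2]
param_groups = [{"params": other_params},
                {"params": galore_params, "rank": 128, "update_proj_gap": 200, "scale": 0.25, "proj_type": "std"}]
optimizer = GaLoreAdamW(param_groups, lr=1e-4)

Code walkthrough:

from galore_torch import GaLoreAdamW: imports the GaLoreAdamW optimizer.
model = YourModel(): instantiates your model.
param_groups: splits the parameters into a regular group and a GaLore group; a group without a "rank" entry is updated like ordinary AdamW. The rank, update_proj_gap, scale, and proj_type values are the ones shown in the package README and should be tuned for your model.
optimizer = GaLoreAdamW(param_groups, lr=1e-4): creates the GaLoreAdamW optimizer and sets the learning rate.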

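Top-p sampling is an alternative to top-k sampling. It selects the smallest set of most probable tokens whose cumulative probability exceeds a threshold p, whereas top-k sampling selects the k most probable tokens: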

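# Top-k sampling example (logits is a 1-D tensor of next-token scores)
import torch

def top_k_sampling(logits, k):
    # Keep the k largest logits and sample one token id from their renormalized distribution
    values, indices = torch.topk(logits, k)
    probs = torch.nn.functional.softmax(values, dim=-1)
    return indices[torch.multinomial(probs, num_samples=1)]

# Top-p (nucleus) sampling example
def top_p_sampling(logits, p):
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    sorted_probs = torch.nn.functional.softmax(sorted_logits, dim=-1)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the cumulative probability before it is still below p, so the
    # token that crosses the threshold is included and the set is non-empty for p > 0
    keep = (cumulative_probs - sorted_probs) < p
    nucleus_probs = sorted_probs[keep] / sorted_probs[keep].sum()
    return sorted_indices[keep][torch.multinomial(nucleus_probs, num_samples=1)]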

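Code walkthrough:

torch.topk(logits, k): returns the k largest elements of logits together with their indices.
torch.cumsum and torch.nn.functional.softmax: compute the cumulative probabilities used by top-p sampling.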
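To make the difference between the two candidate sets concrete, here is a minimal sketch in plain Python. The probabilities are made up for illustration, and the helper names top_k_candidates and top_p_candidates are hypothetical. The functions only compute which tokens survive the filter; sampling from the surviving set is left out.

```python
def top_k_candidates(probs, k):
    # Indices of the k most probable tokens, most probable first
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_candidates(probs, p):
    # Smallest prefix of the sorted tokens whose cumulative probability reaches p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

# Hypothetical next-token probabilities for a five-token vocabulary
probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
top_k_set = top_k_candidates(probs, k=2)    # fixed size: always two tokens
top_p_set = top_p_candidates(probs, p=0.8)  # size depends on the distribution
```

With this distribution, top-k with k=2 keeps tokens 0 and 1, while top-p with p=0.8 needs three tokens because the first two only cover 0.75 of the probability mass. A sharper distribution would shrink the top-p set but leave the top-k set at two tokens.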

Beam Search

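Beam search is an alternative decoding algorithm that balances efficiency and quality by keeping only the highest-scoring partial sequences at each step. The accompanying diagram illustrates the basic flow of beam search.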
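The pruning step described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a production decoder: the toy transition table, the function names, and the token symbols are all assumptions, and refinements such as length normalization are omitted.

```python
import math

def beam_search(next_log_probs, start, beam_width, max_len, eos):
    # Each beam is a (sequence, cumulative log-probability) pair
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                # Finished sequences are carried over unchanged
                candidates.append((seq, score))
                continue
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the beam_width highest-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams

# Hypothetical toy model: the next-token distribution depends only on the last token
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.3, "</s>": 0.6},
    "b": {"a": 0.05, "</s>": 0.95},
}

def toy_next_log_probs(seq):
    return {tok: math.log(p) for tok, p in TABLE[seq[-1]].items()}

best = beam_search(toy_next_log_probs, "<s>", beam_width=2, max_len=5, eos="</s>")
greedy = beam_search(toy_next_log_probs, "<s>", beam_width=1, max_len=5, eos="</s>")
```

In this toy setup a beam width of 1 (greedy decoding) commits to "a" first and ends with a sequence of probability 0.36, while a beam width of 2 also keeps "b" alive and finds the sequence with probability 0.38.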

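Fine-Tuning Techniques and Resources

Chapter 6 discussed different types of fine-tuning techniques. The following resources provide further information: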

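# Fine-tuning example
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the pretrained model and tokenizer ("your_model_name" is a placeholder)
model_name = "your_model_name"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Pick a device and move the model to it; the training loop moves each batch to the same device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)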

# Define a custom dataset
class YourDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = tokenizer(text, return_tensors='pt', truncation=True, padding='max_length')
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

    def __len__(self):
        return len(self.texts)

# Create the data loader and optimizer (your_texts and your_labels are placeholders)
dataset = YourDataset(your_texts, your_labels)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Training loop
for epoch in range(5):
    model.train()
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

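Code walkthrough:

AutoModelForSequenceClassification.from_pretrained(model_name): loads a pretrained sequence classification model.
tokenizer(text, return_tensors='pt', truncation=True, padding='max_length'): tokenizes the input text, truncating or padding it to the maximum length.
The custom YourDataset class loads and preprocesses the data.
The data loader feeds batches to the training loop, which fine-tunes the model with the Adam optimizer.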

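Instruction Fine-Tuning Datasets

The Alpaca dataset contains 52,000 instruction-response pairs and is one of the earliest and most popular publicly available instruction fine-tuning datasets: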

@startuml
skinparam backgroundColor #FEFEFE
skinparam componentStyle rectangle

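title Resources on LLM Training, Fine-Tuning, and Related Techniques

package "LLM training and fine-tuning resources" {
    package "Training techniques" {
        component [GaLore efficient training] as galore
        component [Top-k/top-p sampling] as sampling
        component [Cross-entropy function] as entropy
    }

    package "Model training" {
        component [Model selection] as select
        component [Hyperparameter tuning] as tune
        component [Cross-validation] as cv
    }

    package "Evaluation and deployment" {
        component [Model evaluation] as eval
        component [Model deployment] as deploy
        component [Monitoring and maintenance] as monitor
    }
}

collect --> clean : raw data
clean --> feature : clean data
feature --> select : feature vectors
select --> tune : base model
tune --> cv : best parameters
cv --> eval : trained model
eval --> deploy : validated model
deploy --> monitor : production model

note right of feature
  Feature engineering includes:
  - Feature selection
  - Feature transformation
  - Dimensionality reduction
end note

note right of eval
  Evaluation metrics:
  - Accuracy/recall
  - F1 Score
  - AUC-ROC
end note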

@enduml

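The diagram above gives a high-level overview of the training, tuning, evaluation, and deployment stages in which instruction fine-tuning datasets such as Alpaca are used.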
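To show what an instruction-response pair looks like in practice, the sketch below formats one Alpaca-style record into a training prompt. The field names instruction, input, and output and the template wording follow the Stanford Alpaca release as commonly reproduced; treat the exact wording as an assumption and check it against the dataset you use. The function name format_alpaca_prompt and the sample record are hypothetical.

```python
def format_alpaca_prompt(entry):
    # Alpaca-style records carry "instruction", an optional "input", and "output"
    if entry.get("input"):
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{entry['instruction']}\n\n"
            f"### Input:\n{entry['input']}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{entry['instruction']}\n\n"
        "### Response:\n"
    )

# Made-up record for illustration (not taken from the dataset)
example = {"instruction": "Convert the word to uppercase.", "input": "hello", "output": "HELLO"}
prompt = format_alpaca_prompt(example)
# During fine-tuning, the model is trained on the prompt followed by the reference output
training_text = prompt + example["output"]
```

Records with an empty input field use the shorter template, so the model never sees an empty "### Input:" section.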

Other publicly available instruction fine-tuning datasets include:

Dataset 1: LIMA

# Example: loading the LIMA dataset
from datasets import load_dataset

dataset = load_dataset("GAIR/lima")

Code walkthrough:

load_dataset("GAIR/lima"): loads the LIMA dataset with the Hugging Face datasets library.

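Training Datasets for Large Language Models

Researchers have released a number of large datasets for training and evaluating large language models. Some important ones are:

UltraChat (https://huggingface.co/datasets/openchat/ultrachat-sharegpt): a large-scale dataset of 805,000 instruction-response pairs. For more information, see the paper by Ding et al., "Enhancing Chat Language Models by Scaling High-quality Instructional Conversations" (https://arxiv.org/abs/2305.14233).
Alpaca GPT4 (https://mng.bz/Aa0p): a dataset similar to Alpaca, with 52,000 instruction-response pairs generated with GPT-4 instead of GPT-3.5.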

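The Phi-3 Model and Its Technical Report

Phi-3 is a 3.8-billion-parameter model whose instruction-fine-tuned variant is reported to be comparable to much larger proprietary models such as GPT-3.5. For more information, see the technical report by Abdin et al., "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" (2024) (https://arxiv.org/abs/2404.14219).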

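A Method for Generating Synthetic Instruction Data

Researchers have proposed a method for generating synthetic instruction data that produces 300,000 high-quality instruction-response pairs from an instruction-fine-tuned Llama-3 model. A pretrained Llama 3 base model fine-tuned on these examples performs comparably to the original instruction-fine-tuned Llama-3 model. For more information, see the paper by Xu et al., "Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing" (2024) (https://arxiv.org/abs/2406.08464).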

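Best Practices for Instruction Fine-Tuning

Research shows that not masking the instructions and inputs during instruction fine-tuning (that is, computing the loss over them as well as over the response) can improve performance on various NLP tasks and open-ended generation benchmarks, particularly when training on datasets with lengthy instructions and brief outputs or with only a small number of training examples. For more information, see the paper by Shi et al., "Instruction Tuning with Loss Over Instructions" (2024) (https://arxiv.org/abs/2405.14394).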
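The difference between masking and not masking the instruction comes down to how the label sequence is built. The sketch below assumes the common convention of marking ignored positions with -100, which is the default ignore_index of torch.nn.CrossEntropyLoss. The token ids and helper names are hypothetical, and the one-position shift used for next-token prediction is omitted for brevity.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(prompt_ids, response_ids, mask_instruction):
    # Labels mirror the input tokens; masked positions are excluded from the loss
    input_ids = prompt_ids + response_ids
    if mask_instruction:
        labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    else:
        labels = list(input_ids)
    return input_ids, labels

def count_loss_tokens(labels):
    # Number of positions that contribute to the loss
    return sum(1 for t in labels if t != IGNORE_INDEX)

# Hypothetical token ids: a long instruction followed by a short response
prompt_ids = [11, 12, 13, 14, 15, 16]
response_ids = [21, 22]

_, masked_labels = build_labels(prompt_ids, response_ids, mask_instruction=True)
_, unmasked_labels = build_labels(prompt_ids, response_ids, mask_instruction=False)
```

With masking, only the two response tokens contribute to the loss. Without it, all eight tokens do, which gives the model more training signal per example when instructions are long and responses are short.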

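Tools for Evaluating Large Language Models

Prometheus and PHUDGE are publicly available LLMs that can evaluate long-form responses against custom criteria, with performance comparable to GPT-4. This article does not use these tools, but interested readers can explore them further. For more information, see the paper by Kim et al., "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models" (2023) (https://arxiv.org/abs/2310.08491), and the paper by Deshwal and Chawla, "PHUDGE: Phi-3 as Scalable Judge" (2024) (https://arxiv.org/abs/2405.08029).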

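Fine-Tuning and Knowledge Acquisition in Large Language Models

The findings support the view that LLMs acquire factual knowledge primarily during pretraining, whereas fine-tuning mainly improves how efficiently they use that knowledge. For more information, see the paper by Gekhman et al., "Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?" (2024) (https://arxiv.org/abs/2405.05904).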

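Preference Fine-Tuning

Preference fine-tuning is an optional step after instruction fine-tuning that aligns an LLM more closely with human preferences. For more information, see the articles "LLM Training: RLHF and Its Alternatives" (https://mng.bz/ZVPm) and "Tips for LLM Pretraining and Evaluating Reward Models" (https://mng.bz/RNXj).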