March 12, 2024, 玄貓 (BlackCat)

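Implementing a YouTube Caption Summarization and Q&A System

This article walks through a YouTube caption summarization and question-answering system built on Streamlit, HuggingFace, and Langchain. The system automatically fetches a YouTube video's captions, uses HuggingFace's Falcon and GPT-2 models to generate a summary and answer questions, and presents the results through a Streamlit interface.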


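The system follows a modular design: video ID extraction, caption download, text cleanup, summary generation, and question answering are each wrapped in a separate function, which improves readability and maintainability. At its core, the system calls the HuggingFace Inference API and uses the pretrained Falcon and GPT-2 language models for summarization and question answering, respectively. A Streamlit interface lets users enter a YouTube video URL and a question, and the system returns the summary and the answer. For stability, the code checks the URL input and warns the user when it is missing, and it cleans the captions in several steps to strip noise and irrelevant content before the text reaches the models.

A Walkthrough of the YouTube Caption Summarization and Q&A System

System Architecture Overview

The system is built on the Streamlit framework and uses the HuggingFace Inference API to summarize YouTube captions and answer questions about them. It has four core modules: video ID extraction, caption retrieval, summary generation, and question answering.

Code Walkthrough

1. Importing the Required Libraries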

import streamlit as st
import requests
import urllib.parse
from langchain.document_loaders import YoutubeLoader
import json
import re

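The system imports the libraries it needs: Streamlit for the user interface, requests for API calls, urllib.parse for URL parsing, Langchain's YoutubeLoader for loading captions, json for decoding responses, and re for text cleanup.

2. Setting the API Endpoints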

SUMMARIZATION_ENDPOINT = "https://api-inference.huggingface.co/models/tiiuae/falcon-7b-instruct"
QA_ENDPOINT = "https://api-inference.huggingface.co/models/gpt2-large"

These constants define the HuggingFace Inference API endpoints used for summary generation and question answering.

3. Defining the Streamlit Main Function

def main():
    st.title("YouTube Caption Summarization and Q&A System")
    url = st.text_input("Enter a YouTube video URL:")
    question = st.text_input("Enter a question about the captions:")

The main function sets the application title and prompts the user for a YouTube video URL and a related question.

4. Handling the Button Click

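# Continues inside main(): runs when the button is clicked
if st.button("Summarize and Answer"):
    if url:
        video_id = extract_video_id(url)               # parse the ID from the URL
        transcript = get_youtube_captions(video_id)    # download and clean captions
        answer = answer_question(transcript, question)
        summary = generate_summary(transcript)
        st.subheader("Summary:")
        st.write(summary)
        st.subheader("Answer:")
        st.write(answer)
    else:
        st.warning("Please enter a valid YouTube URL.")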

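When the user clicks the "Summarize and Answer" button, the system does the following:

Extracts the video ID
Fetches the caption text
Generates an answer to the question
Generates the summary
Displays the results

5. Helper Functions

Extracting the Video ID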

def extract_video_id(url):
    parsed_url = urllib.parse.urlparse(url)
    query_string = urllib.parse.parse_qs(parsed_url.query)
    video_id = query_string["v"][0]
    return video_id

This function parses a YouTube URL and returns the video ID from its v query parameter.
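
Because the function above reads query_string["v"] directly, a short link such as https://youtu.be/<id> or a URL without a v parameter would raise a KeyError. The following is a minimal sketch of a more tolerant variant; the name extract_video_id_safe and the list of accepted hosts are assumptions for illustration and are not part of the original code.

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs


def extract_video_id_safe(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if none is found."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/").split("/")[0]
        return candidate or None

    # Standard watch links: https://www.youtube.com/watch?v=<id>
    # (hypothetical allow-list of hosts; extend as needed)
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        values = parse_qs(parsed.query).get("v")
        if values and values[0]:
            return values[0]

    return None
```

Returning None instead of raising lets the caller show a warning in the Streamlit interface rather than a stack trace.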

Fetching the Captions

def get_youtube_captions(video_id):
    loader = YoutubeLoader(video_id, language="en")
    summarization_docs = loader.load_and_split()
    summarization_text = summarization_docs[0].page_content[:2000]
    cleaned_text = re.sub(r"\[.*?\]", "", summarization_text)
    cleaned_text = re.sub(r"\(.*?\)", "", cleaned_text)
    # ... other cleanup steps
    return cleaned_text.strip()

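Explanation:

This function uses YoutubeLoader to load the captions for the given video, keeps the first 2,000 characters of the first document chunk, and cleans the text in several steps (removing bracketed and parenthesized content, special characters, and so on) before returning it.

Generating the Summary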

def generate_summary(context):
    headers = {"Authorization": "Bearer XXXXXXX"}
    model_input = f"Summarize the following transcript. Transcript: {context}"
    json_data = {
        "inputs": model_input,
        "parameters": {'temperature': 0.5, 'max_new_tokens': 100, 'return_full_text': False}
    }
    response = requests.post(SUMMARIZATION_ENDPOINT, headers=headers, json=json_data)
    summary = response.json()[0]['generated_text']
    return summary

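Explanation:

This function calls the HuggingFace Inference API with the cleaned captions as input to generate a summary of the video. The request sets specific generation parameters, such as the temperature and the maximum number of new tokens.

Answering Questions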

def answer_question(context, question):
    headers = {"Authorization": "Bearer XXXXXXX"}
    model_input = f"Answer the question based on the context below. Context: {context} Question: {question}"
    json_data = {
        "inputs": model_input,
        "parameters": {'temperature': 0.5, 'max_new_tokens': 100, 'return_full_text': False}
    }
    response = requests.post(QA_ENDPOINT, headers=headers, json=json_data)
    answer = response.json()[0]['generated_text']
    return answer

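Explanation:

This function likewise calls the HuggingFace Inference API, passing the captions and the user's question to generate an answer. As with summary generation, it sets specific request parameters.

System Characteristics and Technical Considerations

Modular design: The system is split into independent functions, each responsible for one task, such as video ID extraction or caption processing.
API integration: The system uses the HuggingFace Inference API for both summary generation and question answering.
Error handling: The main function checks that a URL was entered before processing and shows a warning otherwise, which makes the app more robust.
Text cleanup: The captions are cleaned in several steps after retrieval to improve the accuracy of later processing.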
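
The error-handling point can be taken one step further than the empty-URL check. The sketch below wraps the processing steps so that a failure is returned as a message instead of an exception. The function run_pipeline, its parameter names, and the choice of caught exceptions are all assumptions for illustration: KeyError matches the query_string["v"] lookup and IndexError matches summarization_docs[0], but the caption loader may raise other exception types.

```python
from typing import Callable, Dict


def run_pipeline(
    url: str,
    question: str,
    extract: Callable[[str], str],
    fetch: Callable[[str], str],
    summarize: Callable[[str], str],
    answer: Callable[[str, str], str],
) -> Dict[str, str]:
    """Run the steps in order and report failures as a message instead of raising."""
    if not url.strip():
        return {"error": "Please enter a YouTube URL."}
    try:
        video_id = extract(url)
        transcript = fetch(video_id)
    except (KeyError, IndexError, ValueError) as exc:
        return {"error": f"Could not read captions: {exc!r}"}
    if not transcript:
        return {"error": "The video has no usable captions."}

    result = {"summary": summarize(transcript)}
    # Only call the Q&A model when the user actually asked something.
    if question.strip():
        result["answer"] = answer(transcript, question)
    return result
```

In the app, the four callables would be extract_video_id, get_youtube_captions, generate_summary, and answer_question. Passing them as arguments also makes the flow easy to exercise with stub functions, without any network access.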

Diagram

@startuml
skinparam backgroundColor #FEFEFE
skinparam componentStyle rectangle

title YouTube Caption Summarization and Q&A System

package "Streamlit app" {
    component [URL and question input] as input
    component [Result display] as display
}

package "Caption processing" {
    component [Video ID extraction] as extract
    component [Caption retrieval] as fetch
    component [Text cleanup] as clean
}

package "HuggingFace Inference API" {
    component [Question answering] as qa
    component [Summary generation] as summary
}

input --> extract : YouTube URL
extract --> fetch : video ID
fetch --> clean : raw captions
clean --> qa : cleaned text and question
clean --> summary : cleaned text
qa --> display : answer
summary --> display : summary

note right of clean
  Cleanup includes:
  - Bracketed and parenthesized content
  - URLs
  - Punctuation and extra whitespace
end note

@enduml

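Diagram explanation: The diagram shows the system's main processing flow:

The user enters a YouTube URL
The system extracts the video ID
The captions are fetched and cleaned
Question answering and summary generation run on the cleaned captions
The results are displayed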

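Building a Video Summarization and Q&A System with Langchain and Hugging Face Models

This section shows how to build a video summarization and Q&A system with Langchain and Hugging Face models. The system fetches a YouTube video's captions, cleans and summarizes them, and answers questions entered by the user.

System Architecture

The system consists of the following modules:

YouTube caption retrieval: uses YoutubeLoader to fetch captions from a YouTube video.
Caption cleanup: removes unneeded characters, URLs, and extra whitespace from the captions.
Summary generation: uses a Hugging Face model to generate a summary of the video.
Question answering: uses a Hugging Face model to answer the user's question from the cleaned captions.

Code Implementation

1. Fetching and Cleaning YouTube Captions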

def get_youtube_captions(video_id):
    # Fetch the captions with YoutubeLoader
    loader = YoutubeLoader(video_id, language="en")
    summarization_docs = loader.load_and_split()
    summarization_text = summarization_docs[0].page_content
    summarization_text = summarization_text[0:2000]
    
    # Clean the captions
    cleaned_text = re.sub(r"\[.*?\]", "", summarization_text)
    cleaned_text = re.sub(r"\(.*?\)", "", cleaned_text)
    cleaned_text = re.sub(r"\'s", "", cleaned_text)
    cleaned_text = re.sub(r"\w+://\S+", "", cleaned_text)
    unwanted_chars = ["'", '"', ",", ".", "!", "?"]
    cleaned_text = ''.join(c for c in cleaned_text if c not in unwanted_chars)
    cleaned_text = re.sub(r"\s+", " ", cleaned_text).strip()
    return cleaned_text

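Explanation:

This function first uses YoutubeLoader to fetch the captions for the given video_id and keeps the first 2,000 characters. It then cleans the text by removing bracketed and parenthesized content, the possessive 's, URLs, and unwanted punctuation. Finally, it collapses runs of whitespace into a single space and strips leading and trailing whitespace.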
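
To see what the cleanup does without downloading anything, the same steps can be applied to a plain string. This is a minimal sketch: clean_caption_text is a hypothetical helper name, the sample sentence is made up, and the punctuation filter is written with str.translate, which removes the same six characters as the list-based filter in the original.

```python
import re


def clean_caption_text(text: str) -> str:
    """Apply the same cleanup steps as get_youtube_captions, without the download."""
    text = re.sub(r"\[.*?\]", "", text)    # bracketed cues such as [Music]
    text = re.sub(r"\(.*?\)", "", text)    # parenthesized asides
    text = re.sub(r"'s", "", text)         # possessive 's
    text = re.sub(r"\w+://\S+", "", text)  # URLs
    # Delete the remaining quotes and punctuation in one pass.
    text = text.translate(str.maketrans("", "", "'\",.!?"))
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


sample = "[Music] Hello, world! (applause) See https://example.com for Bob's notes."
print(clean_caption_text(sample))  # prints: Hello world See for Bob notes
```

Note that dropping periods also removes sentence boundaries, which may or may not help the downstream model, so it is worth checking the effect on summary quality.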

2. Summary Generation

def generate_summary(context):
    headers = {"Authorization": "Bearer XXXXXXX"}
    model_input = f"Summarize the following transcript. Transcript: " + context
    json_data = {
        "inputs": model_input,
        "parameters": {'temperature': 0.5, 'max_new_tokens': 100, 'return_full_text': False},
    }
    response = requests.post(SUMMARIZATION_ENDPOINT, headers=headers, json=json_data)
    json_response = json.loads(response.content.decode("utf-8"))
    summary = json_response[0]['generated_text']
    return summary

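Explanation:

This function takes the cleaned captions as input and uses a Hugging Face model to generate a summary. It sets the request headers and parameters, sends a POST request to the summarization endpoint, and extracts the generated summary from the response.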
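
Two details of the request are worth guarding. The token is hard-coded in the headers, and the response is indexed as a list even though the Inference API can return a JSON object with an error field (for example while a model is still loading). The sketch below shows one way to handle both. The helper names and the HF_API_TOKEN environment variable are assumptions for illustration, not part of the original code.

```python
import os
from typing import Any, Dict


def build_headers(env_var: str = "HF_API_TOKEN") -> Dict[str, str]:
    """Read the token from the environment rather than hard-coding it."""
    token = os.environ.get(env_var, "")
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {"Authorization": f"Bearer {token}"}


def extract_generated_text(payload: Any) -> str:
    """Return generated_text from a successful response, or raise on an error body."""
    if isinstance(payload, dict) and "error" in payload:
        raise RuntimeError(f"Inference API error: {payload['error']}")
    if (
        isinstance(payload, list)
        and payload
        and isinstance(payload[0], dict)
        and "generated_text" in payload[0]
    ):
        return payload[0]["generated_text"].strip()
    raise ValueError(f"Unexpected response shape: {payload!r}")
```

With these helpers, generate_summary could pass headers=build_headers() to requests.post and return extract_generated_text(response.json()), so that an API error surfaces as a readable message instead of a KeyError.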

3. Question Answering

def answer_question(context, question):
    headers = {"Authorization": "Bearer XXXXX"}
    model_input = f"Answer the question based on the context below. Context: " + context + " Question: " + question
    json_data = {
        "inputs": model_input,
        "parameters": {'temperature': 0.5, 'max_new_tokens': 100, 'return_full_text': False},
    }
    response = requests.post(QA_ENDPOINT, headers=headers, json=json_data)
    answer = response.json()[0]["generated_text"]
    return answer

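Explanation:

This function uses a Hugging Face model to answer the user's question from the cleaned captions. As with summary generation, it sets the request headers and parameters, sends a POST request to the Q&A endpoint, and extracts the generated answer from the response.

Running the System

These modules are combined in the main program, and Streamlit provides the user interface: the user enters a YouTube video URL and a question, and the system generates a summary of the video and answers the question.

Document Analysis and Q&A System Development with Langchain and OpenAI

Large language models (LLMs) such as the GPT series perform strongly on modern natural language processing (NLP) tasks. By combining Langchain with the OpenAI API, developers can build document analysis and Q&A systems with relatively little code. This section shows how to use these tools to process text, answer questions, and generate summaries.

Environment Setup and Data Processing

First, we set up the environment and import the required libraries. Langchain offers a convenient way to interact with a variety of LLMs, and the OpenAI API provides the language models themselves.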

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Initialize the embedding model
embedding_model = OpenAIEmbeddings()

# Assume transcripts holds our documents (a list of Langchain Document objects)
document_store = Chroma.from_documents(transcripts, embedding_model)

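Explanation:

This code initializes a Chroma vector store backed by the OpenAI embedding model, used to store and retrieve documents.
transcripts is the input document or text data; it is converted to vectors and stored in document_store.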
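
To make the idea of a vector store concrete, here is a toy illustration of similarity search in plain Python. It is not how Chroma or OpenAIEmbeddings work internally: the bag-of-words embed function is a deliberately crude stand-in for a learned embedding, and all names and sample sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, docs: List[str], k: int = 1) -> List[Tuple[float, str]]:
    """Rank stored documents by similarity to the query, highest first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in docs]
    return sorted(scored, reverse=True)[:k]


docs = [
    "the striker scored many goals this season",
    "the stadium was renovated last year",
]
print(top_k("who scored the most goals", docs))
```

A real vector store follows the same outline (embed, store, compare, return the nearest entries) but uses dense learned embeddings and an index built for fast lookup.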

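Setting Up the Language Model and the Q&A System

Next, we specify the language model used for analysis and set up the Q&A system.

# Specify the language model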
language_model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.9)

# Initialize the Q&A system
question_answer = RetrievalQA.from_chain_type(llm=language_model, chain_type="stuff", retriever=document_store.as_retriever())

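Explanation:

Here gpt-3.5-turbo is chosen as the language model, with temperature set to 0.9 to control the randomness of the output (higher values give more varied answers).
RetrievalQA is a question-answering chain that combines a retriever with a language model to answer questions from the document content. With chain_type="stuff", the retrieved documents are placed into a single prompt.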
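
As a rough mental model of the "stuff" strategy, the sketch below concatenates every retrieved chunk into one prompt and calls the model once. The prompt wording, the function stuff_answer, and the fake_llm stand-in are assumptions for illustration; Langchain's actual prompt template differs.

```python
from typing import Callable, List


def stuff_answer(question: str, retrieved: List[str], llm: Callable[[str], str]) -> str:
    """'Stuff' strategy: put every retrieved chunk into one prompt, call the model once."""
    context = "\n\n".join(retrieved)
    prompt = (
        "Use the context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)


# A stand-in model that only reports the prompt size, so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return f"prompt had {len(prompt)} characters"


print(stuff_answer("Who scored more goals?", ["chunk one", "chunk two"], fake_llm))
```

Since everything goes into a single prompt, this approach is simple and needs only one model call, but the combined text has to fit within the model's context window.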

Asking Questions

Now we can put questions to the model.

query = "Who scored more goals in one Champions League season?"
print(question_answer.run(query))

query = "Who scored more goals in one season between Ronaldo and Haaland?"
print(question_answer.run(query))

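Explanation:

Calling question_answer.run(query) passes the question to the Q&A system and returns an answer based on the document content.

Handling Incomplete Answers

In some cases the model's output may be cut short by token limits. To address this, try a different chain type (such as refine) or rephrase the question.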

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.2)
question_answer = RetrievalQA.from_chain_type(llm=llm, chain_type="refine", retriever=document_store.as_retriever())

Explanation:

Here the refine chain type is used to improve the Q&A results.
temperature is set to 0.2 to get more deterministic answers.
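
The refine strategy can be pictured as a loop: the model answers from the first chunk, then revises that answer once for each later chunk. The following is a minimal sketch of that idea with a pluggable model function. The prompt wording and the name refine_answer are assumptions, not Langchain's implementation.

```python
from typing import Callable, List


def refine_answer(question: str, chunks: List[str], llm: Callable[[str], str]) -> str:
    """'Refine' strategy: answer from the first chunk, then revise once per later chunk."""
    if not chunks:
        return ""
    answer = llm(f"Context: {chunks[0]}\nQuestion: {question}")
    for chunk in chunks[1:]:
        # Each call sees only the running answer and one new chunk.
        answer = llm(
            f"Existing answer: {answer}\n"
            f"New context: {chunk}\n"
            f"Question: {question}\n"
            "Refine the existing answer if the new context helps."
        )
    return answer
```

Each prompt contains a single chunk, so prompts stay small, at the cost of one model call per chunk.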

Generating Text Summaries

Besides question answering, we can also use the Hugging Face API to generate text summaries.

import requests
import json

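# Set the API URL and authorization header
API_URL = "https://api-inference.huggingface.co/models/google/flan-t5-xxl"
headers = {"Authorization": "Bearer XXXXX"}

# Define the query function
def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# Generate the summary
# summarization_docs is assumed to come from loader.load_and_split(), as in get_youtube_captions
truncated_doc = summarization_docs[0].page_content[:1000]  # text only, without metadata
model_input = "Provide a summary for the following document: " + truncated_doc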
json_data = {"inputs": model_input, "parameters": {'temperature': 0.5, 'max_new_tokens': 300}}
response = query(json_data)
model_output = response[0]['generated_text']
print(model_output)

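Explanation:

This code uses Hugging Face's flan-t5-xxl model to generate a document summary.
It first truncates the input document to 1,000 characters, then calls the API to generate the summary.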
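
Slicing at a fixed character count can cut the last word in half. A small helper can back up to the previous space instead. This is a sketch under the assumption that words are separated by whitespace; truncate_on_word is a hypothetical name.

```python
def truncate_on_word(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars characters without splitting the last word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the cut landed mid-word, back up to the previous space.
    if not text[max_chars].isspace() and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip()


print(truncate_on_word("hello world again", 8))  # prints: hello
```

The helper could be applied to the document text with a limit of 1,000 in place of the plain slice.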