May 22, 2025, 玄貓 (BlackCat)

Efficient Data Read/Write Format Support in pandas

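This article explores the pandas I/O system and covers reading and writing several data formats, including CSV, Excel, Apache Parquet, and JSON. It focuses on the advantages of the Parquet format, such as partitioning and efficient filtered reads, and on the flexibility of JSON handling, such as using the orient argument to control the JSON layout, as well as parsing…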

Data Science, Python

Pandas, Data I/O, Parquet, JSON, HTML, Data Analysis

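The pandas I/O system offers a rich set of read and write functions for many file formats. Apache Parquet, a columnar storage format, has advantages in read speed and storage size; with large datasets in particular, partitioning and filtered reads can reduce memory pressure. JSON is lightweight and widely supported, which makes it convenient for data exchange; pandas provides json_normalize for nested structures and an orient argument for controlling the JSON layout. pandas can also read HTML tables, which makes it easy to pull data from web pages. Together these features cover most needs of developers who work with data in different formats.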

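The pandas I/O system in detail: efficient reading, writing, and format support

Reading and writing data is a basic and important step in data analysis. The pandas library provides a wide range of I/O functions and supports many file formats, including CSV, Excel, Apache Parquet, and JSON. This article looks at read and write operations in pandas across these formats, with a focus on Apache Parquet and JSON.

Advantages and uses of the Apache Parquet format

Apache Parquet is an efficient columnar storage format widely used in big data processing and analytics. Compared with traditional row-oriented storage, Parquet has clear advantages in query performance and storage efficiency, because a query can read only the columns it needs and similar values within a column compress well.

Partitioning

Parquet supports partitioning, which spreads the data across separate directories and files. This organization makes the data easier to manage and can noticeably improve query performance. For example, suppose we partition by time and store each quarter or year in its own file: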

data/partitions/
├── 2022/
│   ├── q1_sales.parquet
│   └── q2_sales.parquet
└── 2023/
    ├── q1_sales.parquet
    └── q2_sales.parquet

Each Parquet file stores the sales data for one time period, with columns for year, quarter, region, and sales.

import pandas as pd

# Read one specific file
df = pd.read_parquet("data/partitions/2022/q1_sales.parquet")
print(df)

Output:

   year quarter   region  sales
0  2022      Q1   America      1
1  2022      Q1    Europe      2

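Reading partitioned data efficiently

pandas offers a convenient way to read partitioned data: pass the directory path and every Parquet file beneath it is loaded: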

df = pd.read_parquet("data/partitions/")
print(df)

Output:

   year quarter   region  sales
0  2022      Q1   America      1
1  2022      Q1    Europe      2
2  2022      Q2   America      4
3  2022      Q2    Europe      8
4  2023      Q1   America     16
5  2023      Q1    Europe     32
6  2023      Q2   America     64
7  2023      Q2    Europe    128

Filtered reads

With large datasets, reading everything at once can exhaust memory. Parquet lets you filter while reading, which reduces memory usage:

df_filtered = pd.read_parquet(
    "data/partitions/",
    filters=[("region", "==", "Europe")]
)
print(df_filtered)

Output:

   year quarter   region  sales
0  2022      Q1    Europe      2
1  2022      Q2    Europe      8
2  2023      Q1    Europe     32
3  2023      Q2    Europe    128

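Working with JSON data

JSON (JavaScript Object Notation) is a lightweight data interchange format widely used to transfer data over the network. pandas provides several functions for working with JSON data.

Using the `json` library to process JSON data

Python's standard library module json serializes Python objects to JSON strings and deserializes JSON strings back into Python objects: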

import json

beatles = {
    "first": ["Paul", "John", "Richard", "George"],
    "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
    "birth": [1942, 1940, 1940, 1943]
}

serialized = json.dumps(beatles)
print(f"Serialized JSON: {serialized}")

deserialized = json.loads(serialized)
print(f"Deserialized Python object: {deserialized}")

Output:

Serialized JSON: {"first": ["Paul", "John", "Richard", "George"], "last": ["McCartney", "Lennon", "Starkey", "Harrison"], "birth": [1942, 1940, 1940, 1943]}
Deserialized Python object: {'first': ['Paul', 'John', 'Richard', 'George'], 'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'], 'birth': [1942, 1940, 1940, 1943]}
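The same module also works with files rather than strings. The short sketch below uses json.dump and json.load to write the dictionary to disk and read it back; the temporary file path is an assumption made for the example.

```python
import json
import tempfile
from pathlib import Path

beatles = {
    "first": ["Paul", "John", "Richard", "George"],
    "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
    "birth": [1942, 1940, 1940, 1943],
}

# Temporary location used only for this sketch.
path = Path(tempfile.mkdtemp()) / "beatles.json"

# json.dump writes to an open file; indent makes the file human-readable.
with path.open("w", encoding="utf-8") as fh:
    json.dump(beatles, fh, indent=2)

# json.load parses the file back into Python objects.
with path.open(encoding="utf-8") as fh:
    restored = json.load(fh)

print(restored == beatles)
```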

Using pandas to process JSON data

pandas provides pd.read_json and DataFrame.to_json for reading and writing JSON data:

import pandas as pd
import io

# Read data from a JSON string
data = io.StringIO(serialized)
df = pd.read_json(data, dtype_backend="numpy_nullable")
print(df)

Output:

     first       last  birth
0     Paul  McCartney   1942
1     John     Lennon   1940
2  Richard    Starkey   1940
3    George   Harrison   1943

# Write the DataFrame out as a JSON string
df = pd.DataFrame(beatles)
print(df.to_json())

Output:

{"first":{"0":"Paul","1":"John","2":"Richard","3":"George"},"last":{"0":"McCartney","1":"Lennon","2":"Starkey","3":"Harrison"},"birth":{"0":1942,"1":1940,"2":1940,"3":1943}}
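In the default output the row labels become JSON object keys, which are always strings. The sketch below feeds that output back into pd.read_json to see what comes out on the other side; with default settings pandas usually converts numeric-looking keys back into numeric labels. The two-row DataFrame is a shortened version of the one above.

```python
import io

import pandas as pd

df = pd.DataFrame({"first": ["Paul", "John"], "birth": [1942, 1940]})

# Default layout: column name -> {row label -> value}, with labels as strings.
as_json = df.to_json()
print(as_json)

# Read the string back; StringIO wraps it as a file-like object.
roundtrip = pd.read_json(io.StringIO(as_json))
print(roundtrip)
print(roundtrip.index.tolist())
```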

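JSON layouts for storing and reading pandas data

JSON is widely used for data exchange and storage because it is flexible and universal. In practice, however, there are several different ways to represent tabular data as JSON. pandas handles this flexibly by letting you control the layout of a DataFrame in JSON through the orient argument.

The `orient` argument and its uses

The to_json method accepts several values for orient that control the output layout. Each value and its output format is described below:

columns (default)

This layout stores the data as a JSON object whose keys are column names; each value is another object that maps row index labels to the corresponding values.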

{
    "first": {"row 0": "Paul", "row 1": "John", "row 2": "Richard", "row 3": "George"},
    "last": {"row 0": "McCartney", "row 1": "Lennon", "row 2": "Starkey", "row 3": "Harrison"},
    "birth": {"row 0": 1942, "row 1": 1940, "row 2": 1940, "row 3": 1943}
}

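This layout is verbose, but it preserves both the row index labels and the column names.

records

This layout represents each row as a JSON object and collects all rows in an array.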

[
    {"first": "Paul", "last": "McCartney", "birth": 1942},
    {"first": "John", "last": "Lennon", "birth": 1940},
    {"first": "Richard", "last": "Starkey", "birth": 1940},
    {"first": "George", "last": "Harrison", "birth": 1943}
]

This layout is more compact, but it does not preserve the row index.
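The loss of the row index can be seen by writing a labeled DataFrame with orient="records" and reading it back. This is a minimal sketch using a two-row version of the data; the variable names are illustrative.

```python
import io
import json

import pandas as pd

df = pd.DataFrame(
    {"first": ["Paul", "John"], "birth": [1942, 1940]},
    index=["row 0", "row 1"],
)

# Each row becomes one object; the labels "row 0" and "row 1" are not written.
records_json = df.to_json(orient="records")
print(records_json)

# Reading back yields a fresh default index instead of the original labels.
restored = pd.read_json(io.StringIO(records_json), orient="records")
print(restored.index.tolist())
```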

split

This layout stores the column names, row index labels, and data as three separate arrays.

{
    "columns": ["first", "last", "birth"],
    "index": ["row 0", "row 1", "row 2", "row 3"],
    "data": [
        ["Paul", "McCartney", 1942],
        ["John", "Lennon", 1940],
        ["Richard", "Starkey", 1940],
        ["George", "Harrison", 1943]
    ]
}

This layout keeps everything needed to rebuild the DataFrame and is still fairly compact.
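A round trip illustrates the point. The sketch below writes a labeled DataFrame with orient="split" and reads it back with the same orient, after which the row labels and column names should match the original. The two-row data set is a shortened version of the one used in this article.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"first": ["Paul", "John"], "birth": [1942, 1940]},
    index=["row 0", "row 1"],
)

# Column names, row labels and values are written as three separate arrays.
split_json = df.to_json(orient="split")
print(split_json)

# The same orient must be given when reading the string back.
restored = pd.read_json(io.StringIO(split_json), orient="split")
print(restored)
```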

index

This layout is similar to columns, but the roles of the row index labels and the column names are swapped.

{
    "row 0": {"first": "Paul", "last": "McCartney", "birth": 1942},
    "row 1": {"first": "John", "last": "Lennon", "birth": 1940},
    "row 2": {"first": "Richard", "last": "Starkey", "birth": 1940},
    "row 3": {"first": "George", "last": "Harrison", "birth": 1943}
}

values

This layout stores only the data values, without column names or row index labels.

[
    ["Paul", "McCartney", 1942],
    ["John", "Lennon", 1940],
    ["Richard", "Starkey", 1940],
    ["George", "Harrison", 1943]
]

This layout is the most compact, but it keeps no row or column labels.

table

This layout follows the JSON Table Schema; it is more verbose but standardized.

{
    "schema": {
        "fields": [
            {"name": "index", "type": "string"},
            {"name": "first", "type": "any", "extDtype": "string"},
            {"name": "last", "type": "any", "extDtype": "string"},
            {"name": "birth", "type": "integer"}
        ]
    },
    "data": [
        {"index": "row 0", "first": "Paul", "last": "McCartney", "birth": 1942},
        {"index": "row 1", "first": "John", "last": "Lennon", "birth": 1940},
        {"index": "row 2", "first": "Richard", "last": "Starkey", "birth": 1940},
        {"index": "row 3", "first": "George", "last": "Harrison", "birth": 1943}
    ]
}

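Comparing and choosing an `orient`

Completeness: table and split preserve the full DataFrame, including column names, row index labels, and data.
Compactness: values and records are more compact but give up some information (such as the index).
Standardization: table conforms to the JSON Table Schema, so it is the most standardized.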
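One rough way to compare the layouts is to measure the length of the string each one produces. The sketch below does this for the Beatles data; character count is only a proxy for verbosity, and the ranking can shift with different data shapes.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "first": ["Paul", "John", "Richard", "George"],
        "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
        "birth": [1942, 1940, 1940, 1943],
    },
    index=["row 0", "row 1", "row 2", "row 3"],
)

orients = ["columns", "records", "split", "index", "values", "table"]

# Length in characters of the JSON string produced by each orient.
sizes = {orient: len(df.to_json(orient=orient)) for orient in orients}

# Print from most compact to most verbose.
for orient, size in sorted(sizes.items(), key=lambda item: item[1]):
    print(f"{orient:8s} {size} characters")
```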

Code example and walkthrough

The following simple example shows how to write a DataFrame to JSON with each orient:

import pandas as pd

# Build a simple DataFrame
data = {
    'first': ['Paul', 'John', 'Richard', 'George'],
    'last': ['McCartney', 'Lennon', 'Starkey', 'Harrison'],
    'birth': [1942, 1940, 1940, 1943]
}
df = pd.DataFrame(data, index=['row 0', 'row 1', 'row 2', 'row 3'])

# Write JSON using each orient
orients = ['columns', 'records', 'split', 'index', 'values', 'table']
for orient in orients:
    json_output = df.to_json(orient=orient)
    print(f'Orient: {orient}\n{json_output}\n')

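Code walkthrough:

Import pandas: first, we import the pandas library.
Build the DataFrame: we create a DataFrame holding information about the members of the Beatles.
Loop over the orient values: we iterate over the orient values and call to_json to produce the corresponding JSON string.
Print the results: we print the JSON produced by each orient so the differences can be compared.

Output analysis:

columns: the output uses column names as keys, and each inner object maps row index labels to values.
records: each row is represented as an object, and all rows form an array.
split: column names, row index labels, and data are stored in separate arrays.
index: similar to columns, but with the roles of row index labels and column names swapped.
values: contains only the data values, with no label information.
table: follows the JSON Table Schema and provides both the schema and the data.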
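The differences listed above can also be checked programmatically by parsing each output with the standard json module and looking at its top-level structure. This is a small sketch; the shapes dictionary is a name introduced here for illustration.

```python
import json

import pandas as pd

df = pd.DataFrame(
    {
        "first": ["Paul", "John", "Richard", "George"],
        "last": ["McCartney", "Lennon", "Starkey", "Harrison"],
        "birth": [1942, 1940, 1940, 1943],
    },
    index=["row 0", "row 1", "row 2", "row 3"],
)

shapes = {}
for orient in ["columns", "records", "split", "index", "values", "table"]:
    parsed = json.loads(df.to_json(orient=orient))
    if isinstance(parsed, dict):
        # JSON object: record its top-level keys.
        shapes[orient] = ("object", sorted(parsed))
    else:
        # JSON array: record the Python type of its elements.
        shapes[orient] = ("array", type(parsed[0]).__name__)

for orient, shape in shapes.items():
    print(orient, shape)
```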

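Processing JSON and HTML data with pandas

JSON data processing

JSON (JavaScript Object Notation) is a lightweight data interchange format widely used in web APIs and data storage. pandas provides several ways to work with JSON data.

JSON Table Schema

JSON Table Schema is a standardized JSON layout that pandas can use to store a DataFrame. It preserves metadata about the data, such as each column's data type, so the types do not have to be specified again when the data is read back.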

import pandas as pd
import io

# Build a DataFrame and convert it to the JSON Table Schema format
df = pd.DataFrame({
    "first": ["John", "Jane"],
    "last": ["Doe", "Smith"],
    "birth": [1990, 1995]
})
df["birth"] = df["birth"].astype(pd.UInt16Dtype())
serialized = df.to_json(orient="table")

# Read the DataFrame back from the JSON Table Schema
df_read = pd.read_json(
    io.StringIO(serialized),
    orient="table"
)
print(df_read.dtypes)

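Explanation:

to_json(orient="table"): converts the DataFrame to the JSON Table Schema format, which preserves data type information.
pd.read_json(orient="table"): reads data from the JSON Table Schema, with no need to specify data types separately.
dtype_backend="numpy_nullable": not needed in the call above, since the schema already records the column types; elsewhere in this article it asks pandas to return its nullable extension types.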
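To see where the type information lives, the serialized string can be parsed with the json module and its schema section inspected. The sketch below is a shortened version of the example above; the fields dictionary is a helper introduced here, and the exact set of keys pandas writes into the schema may vary between versions.

```python
import io
import json

import pandas as pd

df = pd.DataFrame({"first": ["John", "Jane"], "birth": [1990, 1995]})
df["birth"] = df["birth"].astype(pd.UInt16Dtype())

serialized = df.to_json(orient="table")

# The "schema" section describes every column, including the index.
schema = json.loads(serialized)["schema"]
fields = {field["name"]: field for field in schema["fields"]}
print(fields["birth"])

# Reading back with orient="table" applies the recorded types.
df_read = pd.read_json(io.StringIO(serialized), orient="table")
print(df_read.dtypes)
```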

Handling complex JSON structures

For JSON data with nested structures, pd.json_normalize flattens the data into columns.

data = {
    "records": [
        {"name": "human", "characteristics": {"num_leg": 2, "num_eyes": 2}},
        {"name": "dog", "characteristics": {"num_leg": 4, "num_eyes": 2}},
        {"name": "horseshoe crab", "characteristics": {"num_leg": 10, "num_eyes": 10}}
    ],
    "type": "animal",
    "pagination": {"next": "23978sdlkusdf97234u2io", "has_more": 1}
}

# Use record_path to select the part of the JSON to process
df_normalized = pd.json_normalize(
    data,
    record_path="records",
    meta="type"
).convert_dtypes(dtype_backend="numpy_nullable")

print(df_normalized)

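Explanation:

record_path="records": selects the part of the JSON to process, extracting the entries under records.
meta="type": keeps the type field and adds it as a column of the resulting DataFrame.
.convert_dtypes(dtype_backend="numpy_nullable"): converts the columns to pandas extension types.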
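Two further options of pd.json_normalize are sometimes useful with data like this: sep controls the separator used in flattened column names, and meta accepts a nested path given as a list. The sketch below applies both to a shortened copy of the data above; it is an illustration of these parameters, not part of the original example.

```python
import pandas as pd

data = {
    "records": [
        {"name": "human", "characteristics": {"num_leg": 2, "num_eyes": 2}},
        {"name": "dog", "characteristics": {"num_leg": 4, "num_eyes": 2}},
    ],
    "type": "animal",
    "pagination": {"next": "23978sdlkusdf97234u2io", "has_more": 1},
}

flat = pd.json_normalize(
    data,
    record_path="records",
    # A plain string names a top-level field; a list names a nested path.
    meta=["type", ["pagination", "has_more"]],
    # Join nested keys with "_" instead of the default ".".
    sep="_",
)

print(flat.columns.tolist())
print(flat)
```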

HTML table processing

The pandas read_html function can extract HTML tables from a web page.

Installing the required library

First, install the lxml parser:

python -m pip install lxml

Extracting tables from Wikipedia

url = "https://en.wikipedia.org/wiki/The_Beatles_discography"
dfs = pd.read_html(url, dtype_backend="numpy_nullable")
print(len(dfs))  # number of tables found

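Explanation:

pd.read_html(url): extracts all HTML tables from the given URL and returns a list of DataFrames.
dtype_backend="numpy_nullable": ensures the data is read into pandas extension types.
dfs[0]: indexing the returned list, for example with dfs[0], gives the first extracted DataFrame.

Locating a specific table

Because read_html returns every table, the attrs argument can be used to select a table by its HTML attributes.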

# Use attrs to specify the table's attributes
dfs = pd.read_html(url, attrs={"class": "wikitable plainrowheaders"})
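The effect of attrs can be tried without network access by parsing a small HTML string. The sketch below is a hand-written page with two tables, selected by an id attribute instead of the class used above; it assumes an HTML parser such as lxml is installed, and the markup is invented for the example.

```python
import io

import pandas as pd

# Hand-written HTML with two tables; only the first has id="albums".
html = """
<table id="albums">
  <tr><th>album</th><th>year</th></tr>
  <tr><td>Please Please Me</td><td>1963</td></tr>
  <tr><td>Abbey Road</td><td>1969</td></tr>
</table>
<table id="other">
  <tr><th>note</th></tr>
  <tr><td>unrelated table</td></tr>
</table>
"""

# Without attrs, every table on the page is returned.
all_tables = pd.read_html(io.StringIO(html))
print(len(all_tables))

# With attrs, only tables whose attributes match are returned.
albums_only = pd.read_html(io.StringIO(html), attrs={"id": "albums"})
print(albums_only[0])
```

Wrapping the string in io.StringIO passes it to read_html as a file-like object, which is the form recent pandas versions expect for literal HTML.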

Diagram:

The following diagram gives an overview of the core pandas data structures and the operations built on them:

@startuml
skinparam backgroundColor #FEFEFE
skinparam componentStyle rectangle

title Efficient Data Read/Write Format Support in pandas

package "pandas data processing" {
    package "Data structures" {
        component [Series\n1-D array] as series
        component [DataFrame\n2-D table] as df
        component [Index\nLabels] as index
    }

    package "Data operations" {
        component [Selection] as select
        component [Filtering] as filter
        component [GroupBy] as group
        component [Merge/Join] as merge
    }

    package "Data transformation" {
        component [Reshape] as reshape
        component [Pivot] as pivot
        component [Aggregation] as agg
    }
}

series --> df : composes
index --> df : indexes
df --> select : loc/iloc
df --> filter : boolean indexing
df --> group : grouped operations
group --> agg : aggregation functions
df --> merge : combine data
df --> reshape : melt/stack
reshape --> pivot : reorganize

note right of df
  Core data structure
  Similar to an Excel sheet
end note

@enduml
