2025-04-06 玄貓 (BlackCat)

Pandas Data Filtering and Assignment Techniques

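This article explores techniques for filtering and assigning data in pandas. It covers filtering by label, data type, and boolean array, and explains how to use `loc`, `iloc`, and `pd.IndexSlice` on Series and DataFrame objects with single-level and multi-level (MultiIndex) indexes. Practical cases include assigning new columns, handling sequences of a different length, and chaining assignments with the assign method, all aimed at more efficient data processing.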

Data Science, Python

Pandas, data filtering, data assignment, MultiIndex, DataFrame, Series

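Pandas offers a range of methods for filtering and assigning data, and using them well makes data processing considerably more efficient. Beyond the common filters on labels, data types, and boolean arrays, logical operators can be combined to express complex conditions. For MultiIndex structures, .loc and .iloc give precise access and modification without index ambiguity, and pd.IndexSlice simplifies slicing across index levels and keeps the code readable. For DataFrame column assignment, the assign method complements the [] operator by supporting method chaining. Pandas also defines how sequences of a different length are handled on assignment, which keeps the data consistent.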

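Data Filtering and Assignment

Filtering Data by Label

In pandas, DataFrame.filter accepts a like= or regex= argument for filtering by label. For example, to select the columns whose names contain a given substring, use like=: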

df.filter(like="_")

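Code walkthrough:

df.filter(like="_") selects every column whose name contains an underscore (_).
This is handy when you want to filter on part of a column name.

For more complex filters, pass a regular expression through the regex= argument: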

df.filter(regex=r"^Ja.*(?<!e)$", axis=0)

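Code walkthrough:

df.filter(regex=r"^Ja.*(?<!e)$", axis=0) selects every row label that starts with "Ja" and does not end with "e".
axis=0 means the filter is applied to row labels; to filter column labels, use axis=1.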
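The snippets above assume an existing df. As a minimal sketch, the following builds a small hypothetical DataFrame (the column names, row labels, and values are invented for illustration) and applies both filters to it:

```python
import pandas as pd

# Hypothetical data: person names as row labels, mixed column names
df = pd.DataFrame(
    {
        "first_name": ["Jane", "Jack", "Jamie"],
        "last_name": ["Doe", "Smith", "Lee"],
        "age": [30, 41, 25],
    },
    index=["Jane", "Jack", "Jamie"],
)

# Columns whose names contain an underscore
underscore_cols = df.filter(like="_")
print(list(underscore_cols.columns))

# Row labels that start with "Ja" and do not end with "e"
ja_rows = df.filter(regex=r"^Ja.*(?<!e)$", axis=0)
print(list(ja_rows.index))
```

Here only "Jack" survives the row filter, because "Jane" and "Jamie" both end with "e".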

Filtering by Data Type

pandas provides the select_dtypes method for filtering columns by data type. For example, to select the integer columns:

df.select_dtypes("int")

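Code walkthrough:

df.select_dtypes("int") selects all integer columns.
Pass a list to include= to select several types at once, for example df.select_dtypes(include=["int", "float"]).

To exclude types instead, use the exclude= argument: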

df.select_dtypes(exclude=["int", "float"])

Code walkthrough:

df.select_dtypes(exclude=["int", "float"]) selects every column that is neither an integer nor a floating-point type.
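To make include= and exclude= concrete, here is a small sketch on a hypothetical three-column DataFrame (the column names are made up for this example). It uses the "number" selector, which pandas accepts as shorthand for all numeric types, in place of listing "int" and "float" separately:

```python
import pandas as pd

# Hypothetical frame with one integer, one float, and one string column
df = pd.DataFrame(
    {
        "int_col": pd.Series([1, 2, 3], dtype="int64"),
        "float_col": [1.5, 2.5, 3.5],
        "string_col": ["a", "b", "c"],
    }
)

# Keep only numeric columns (integers and floats)
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))

# Keep everything that is not numeric
non_numeric = df.select_dtypes(exclude="number")
print(list(non_numeric.columns))
```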

Filtering with Boolean Arrays

A boolean array (also called a mask) is another common way to filter data. For example:

import pandas as pd

mask = [True, False, True]  # keep positions 0 and 2
ser = pd.Series(range(3))
ser[mask]

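Code walkthrough:

ser[mask] returns the rows whose mask entry is True.
The same approach works on a DataFrame; note that a mask passed to df[...] applies to the rows.

For a DataFrame, .loc lets you mask rows and columns at the same time: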

df.loc[mask, col_mask]

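Code walkthrough:

df.loc[mask, col_mask] filters the rows by mask and the columns by col_mask, a second boolean sequence with one entry per column.
.loc accepts labels, slices, and masks on each axis, which makes it the more flexible indexer.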
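Since col_mask is not defined in the snippet above, the following sketch fills in the missing pieces with hypothetical data so the row-and-column masking can be run end to end:

```python
import pandas as pd

# Hypothetical 3x3 frame with labeled rows
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["x", "y", "z"],
)

row_mask = [True, False, True]  # keep rows "x" and "z"
col_mask = [True, False, True]  # keep columns "a" and "c"

# One mask per axis: rows first, then columns
subset = df.loc[row_mask, col_mask]
print(subset)
```

Each mask must have exactly one entry per label on its axis.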

Combining Multiple Conditions

Real-world filters often combine several conditions. Use the operators | (or), & (and), and ~ (not) to combine boolean masks:

blue_eyes = df["eye_color"] == "blue"
green_eyes = df["eye_color"] == "green"
mask = blue_eyes | green_eyes
df[mask]

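Code walkthrough:

blue_eyes | green_eyes is True for rows whose eye color is blue or green, so df[mask] returns those rows.
Use & to require that several conditions hold at once, and ~ to invert a condition.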
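The & and ~ operators are described above but not shown. A minimal sketch, assuming a hypothetical DataFrame with name, eye_color, and age columns:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy", "Di"],
        "eye_color": ["blue", "green", "brown", "blue"],
        "age": [25, 40, 31, 52],
    }
)

blue_eyes = df["eye_color"] == "blue"
over_30 = df["age"] > 30

# & keeps rows where both conditions hold
blue_and_over_30 = df[blue_eyes & over_30]

# ~ inverts a mask: everyone whose eyes are not blue
not_blue = df[~blue_eyes]

# When conditions are written inline, wrap each one in parentheses,
# because & binds more tightly than == and >
inline = df[(df["eye_color"] == "blue") & (df["age"] > 30)]

print(blue_and_over_30["name"].tolist())
print(not_blue["name"].tolist())
```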

Filtering with a MultiIndex

When a Series or DataFrame is indexed by a MultiIndex, .loc allows more precise selection. For example:

index = pd.MultiIndex.from_tuples([("John", "Smith"), ("Jane", "Doe")])
ser = pd.Series(range(2), index=index)

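Code walkthrough:

pd.MultiIndex.from_tuples builds a MultiIndex from a list of tuples.
.loc can select by MultiIndex level and avoids the ambiguity that plain [] indexing can introduce.

Selecting and Assigning Data with `pd.MultiIndex`

Selecting and assigning data becomes both more involved and more flexible when a pandas object has a hierarchical index (MultiIndex). This section shows how to use .loc and .iloc for these operations.

Using `pd.Series.loc` with a `pd.MultiIndex`

When pd.Series.loc is used with a pd.MultiIndex, a single label is matched against the first index level by default. For example: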

index = pd.MultiIndex.from_tuples([
    ("John", "Smith"),
    ("John", "Doe"),
    ("Jane", "Doe"),
    ("Stephen", "Smith"),
], names=["first_name", "last_name"])

ser = pd.Series(range(4), index=index)
print(ser)

Output:

first_name  last_name
John        Smith      0
            Doe        1
Jane        Doe        2
Stephen     Smith      3
dtype: int64

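Code walkthrough:

This code builds a pd.Series with a two-level index, where first_name is the first level and last_name is the second.
The values of ser are [0, 1, 2, 3], one per index entry.

Selecting a Single Level with `.loc`

To select every entry whose first-level label is "John", use .loc: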

print(ser.loc["John"])

Output:

last_name
Smith    0
Doe      1
dtype: int64

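Code walkthrough:

This code selects every entry whose first-level label is "John" and returns a new pd.Series.
Note that "John" no longer appears in the result: selecting a single first-level label with .loc drops that level from the returned index.

Selecting Across Multiple Levels with `.loc`

To select on several levels at once, pass a tuple: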

print(ser.loc[("Jane", "Doe")])

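Output:

2

Code walkthrough:

This code selects the entry whose first-level label is "Jane" and whose second-level label is "Doe".
The result is the corresponding scalar value, 2.

Slicing Multiple Levels with `slice(None)`

To select every entry whose second-level label is "Doe", use slice(None):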

print(ser.loc[(slice(None), "Doe")])

Output:

first_name
John    1
Jane    2
dtype: int64

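Code walkthrough:

This code uses slice(None) to stand for every value of the first level, so it selects all entries whose second-level label is "Doe".
The second level is dropped from the index of the result.

Simplifying MultiIndex Slicing with `pd.IndexSlice`

pd.IndexSlice offers a more natural syntax for slicing a MultiIndex: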

ixsl = pd.IndexSlice
print(ser.loc[ixsl[:, ["Doe"]]])

Output:

first_name  last_name
John        Doe        1
Jane        Doe        2
dtype: int64

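Code walkthrough:

This code uses pd.IndexSlice to simplify the MultiIndex slicing syntax.
ixsl[:, ["Doe"]] selects every first-level label together with the second-level label "Doe". Because the second level is given as a list, both index levels are kept in the result.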
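The two previous outputs differ in how many index levels they keep. The sketch below reuses the same Series and compares a scalar selector with a list selector side by side; the variable names by_scalar and by_list are introduced here only for illustration:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("John", "Smith"), ("John", "Doe"), ("Jane", "Doe"), ("Stephen", "Smith")],
    names=["first_name", "last_name"],
)
ser = pd.Series(range(4), index=index)
ixsl = pd.IndexSlice

# Scalar selector on the second level: that level is dropped from the result
by_scalar = ser.loc[ixsl[:, "Doe"]]
print(list(by_scalar.index.names))

# List selector on the second level: both levels are kept
by_list = ser.loc[ixsl[:, ["Doe"]]]
print(list(by_list.index.names))
```

Both forms select the same values; the list form is useful when later steps still need the full index.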

Using `pd.MultiIndex` in a `pd.DataFrame`

A pd.DataFrame can carry a pd.MultiIndex on both its rows and its columns:

row_index = pd.MultiIndex.from_tuples([
    ("John", "Smith"),
    ("John", "Doe"),
    ("Jane", "Doe"),
    ("Stephen", "Smith"),
], names=["first_name", "last_name"])

col_index = pd.MultiIndex.from_tuples([
    ("music", "favorite"),
    ("music", "last_seen_live"),
    ("art", "favorite"),
], names=["art_type", "category"])

df = pd.DataFrame([
    ["Swift", "Swift", "Matisse"],
    ["Mozart", "T. Swift", "Van Gogh"],
    ["Beatles", "Wonder", "Warhol"],
    ["Jackson", "Dylan", "Picasso"],
], index=row_index, columns=col_index)

print(df)

Output:

art_type                 music                       art
category              favorite  last_seen_live  favorite
first_name last_name
John       Smith         Swift           Swift   Matisse
           Doe          Mozart        T. Swift  Van Gogh
Jane       Doe         Beatles          Wonder    Warhol
Stephen    Smith       Jackson           Dylan   Picasso

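Code walkthrough:

This code builds a pd.DataFrame with a MultiIndex on both axes: the row index holds first_name and last_name, and the column index holds art_type and category.
The cells are filled with string values, one for each combination of art type and category.

Selecting Across Multiple Levels in a `pd.DataFrame` with `.loc`

To select on multiple levels in a pd.DataFrame, pass two tuples, one for the rows and one for the columns: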

row_idxer = (slice(None), "Smith")
col_idxer = (slice(None), "favorite")
print(df.loc[row_idxer, col_idxer])

Output:

art_type                 music       art
category              favorite  favorite
first_name last_name
John       Smith         Swift   Matisse
Stephen    Smith       Jackson   Picasso

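Code walkthrough:

This code selects every row whose second-level label is "Smith" and every column whose second-level label is "favorite".
The result is a new pd.DataFrame containing only the matching data.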
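The same two-axis selection can also be written with pd.IndexSlice. This sketch rebuilds the DataFrame from the section above and checks that both spellings return the same result; via_tuples and via_index_slice are names chosen for this example:

```python
import pandas as pd

row_index = pd.MultiIndex.from_tuples(
    [("John", "Smith"), ("John", "Doe"), ("Jane", "Doe"), ("Stephen", "Smith")],
    names=["first_name", "last_name"],
)
col_index = pd.MultiIndex.from_tuples(
    [("music", "favorite"), ("music", "last_seen_live"), ("art", "favorite")],
    names=["art_type", "category"],
)
df = pd.DataFrame(
    [
        ["Swift", "Swift", "Matisse"],
        ["Mozart", "T. Swift", "Van Gogh"],
        ["Beatles", "Wonder", "Warhol"],
        ["Jackson", "Dylan", "Picasso"],
    ],
    index=row_index,
    columns=col_index,
)

ixsl = pd.IndexSlice

# Explicit tuples with slice(None) for "all values of the first level"
via_tuples = df.loc[(slice(None), "Smith"), (slice(None), "favorite")]

# Equivalent selection written with pd.IndexSlice
via_index_slice = df.loc[ixsl[:, "Smith"], ixsl[:, "favorite"]]

print(via_index_slice.equals(via_tuples))
```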

Assigning Data with `.loc` and `.iloc`

.loc and .iloc can be used to assign data as well as select it:

ser = pd.Series(range(3), index=list("abc"))
ser.loc["b"] = 42
print(ser)

Output:

a     0
b    42
c     2
dtype: int64

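Code walkthrough:

This code creates a simple pd.Series and then uses .loc to assign 42 to the entry labeled "b".
The output shows the pd.Series after the assignment.

This section showed how to select and assign data in pandas objects that use a pd.MultiIndex, using .loc and .iloc. These operations are especially useful for data with complex index structures.

Data Selection and Assignment

Assigning with `pd.Series.loc` and `pd.Series.iloc`

In pandas, pd.Series.loc and pd.Series.iloc can assign values as well as select them. Use .loc to assign by label and .iloc to assign by position.

How to do it

Start with a simple pd.Series: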

ser = pd.Series([0, 1, 2], index=list("abc"))
ser

Output:

a    0
b    1
c    2
dtype: int64

Use pd.Series.loc to assign by label:

ser.loc["b"] = 42
ser

Output:

a     0
b    42
c     2
dtype: int64

Use pd.Series.iloc to assign by position:

ser.iloc[2] = -42
ser

Output:

a     0
b    42
c   -42
dtype: int64
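Both indexers also accept slices on assignment, and the two differ in how they treat the end of the slice. A short sketch on a hypothetical four-element Series:

```python
import pandas as pd

ser = pd.Series([0, 1, 2, 3], index=list("abcd"))

# Label slices include both endpoints: this assigns to "b" and "c"
ser.loc["b":"c"] = 10

# Positional slices exclude the stop: this assigns to positions 0 and 1 only
ser.iloc[0:2] = -1

print(ser.tolist())
```

The second assignment overwrites "b" again, because "b" sits at position 1.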

Assigning Columns to a DataFrame

Assigning a column to a pd.DataFrame is a common operation.

How to do it

Create a simple pd.DataFrame:

df = pd.DataFrame({"col1": [1, 2, 3]})
df

Output:

   col1
0     1
1     2
2     3

Use the pd.DataFrame[] operator to assign a new column:

df["new_column1"] = 42
df

Output:

   col1  new_column1
0     1           42
1     2           42
2     3           42

You can also assign a pd.Series or another sequence, as long as its number of elements matches the number of rows in the pd.DataFrame:

df["new_column2"] = list("abc")
df["new_column3"] = pd.Series(["dog", "cat", "human"])
df

Output:

   col1  new_column1 new_column2 new_column3
0     1           42           a         dog
1     2           42           b         cat
2     3           42           c       human

If a list-like sequence has a different number of elements than the pd.DataFrame has rows, the assignment fails:

df["should_fail"] = ["too few", "rows"]

Error message:

ValueError: Length of values (2) does not match length of index (3)
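The length check above applies to unlabeled sequences such as lists. A pd.Series behaves differently, because pandas aligns it on index labels before assigning. The sketch below contrasts the two cases; the column name partial is made up for this example:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3]})

# A Series is aligned on its index labels, so a shorter one does not raise;
# rows without a matching label are filled with a missing value
df["partial"] = pd.Series(["x", "y"], index=[0, 2])
print(df)

# A plain list has no labels to align on, so its length must match
try:
    df["should_fail"] = ["too few", "rows"]
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

If silent missing values are not wanted, comparing lengths before assigning a Series is a simple safeguard.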

Column assignment also works on a pd.DataFrame that has a pd.MultiIndex.

Diagram: pandas data structures and operations

@startuml
skinparam backgroundColor #FEFEFE
skinparam componentStyle rectangle

title Pandas Data Filtering and Assignment Techniques

package "Pandas data processing" {
    package "Data structures" {
        component [Series\n1-D array] as series
        component [DataFrame\n2-D table] as df
        component [Index\nlabels] as index
    }

    package "Data operations" {
        component [Selection] as select
        component [Filtering] as filter
        component [GroupBy] as group
        component [Merge/Join] as merge
    }

    package "Data transformation" {
        component [Reshape] as reshape
        component [Pivot] as pivot
        component [Aggregation] as agg
    }
}

series --> df : composes
index --> df : labels
df --> select : loc/iloc
df --> filter : boolean indexing
df --> group : grouped computation
group --> agg : aggregation functions
df --> merge : combine data
df --> reshape : melt/stack
reshape --> pivot : reorganize

note right of df
  Core data structure,
  similar to an Excel sheet
end note

@enduml

Diagram notes: The diagram places the techniques in this article in context. Series and Index objects make up a DataFrame, which is accessed through loc/iloc selection and boolean-index filtering and which feeds grouping, merging, and reshaping operations.

Code example: chaining with the assign method

df = pd.DataFrame([[0, 1], [2, 4]], columns=list("ab"))
(
    df
    .mul(2)
    .add(42)
    .assign(chained_c=lambda df: df["b"] - 3)
)

Output:

    a   b  chained_c
0  42  44         41
1  46  50         47

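Code walkthrough:

This example shows how assign adds a new column inside a method chain. It first creates a DataFrame, multiplies every value by 2 and adds 42, and finally uses assign to add a new column, chained_c, computed from the existing b column. Because assign returns a new DataFrame, the whole pipeline stays a single readable expression.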
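assign can also take several keyword arguments in one call, and a later column may refer to one created earlier in the same call. A minimal sketch using the same starting DataFrame; the column names total and doubled_total are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [2, 4]], columns=list("ab"))

result = df.assign(
    # First new column: computed from the existing columns
    total=lambda d: d["a"] + d["b"],
    # Second new column: may use "total", created just above
    doubled_total=lambda d: d["total"] * 2,
)

print(result)

# assign returns a new object; the original frame keeps its two columns
print(list(df.columns))
```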

Code example: creating a DataFrame with a MultiIndex and assigning a column

row_index = pd.MultiIndex.from_tuples([
    ("John", "Smith"),
    ("John", "Doe"),
    ("Jane", "Doe"),
    ("Stephen", "Smith"),
], names=["first_name", "last_name"])

col_index = pd.MultiIndex.from_tuples([
    ("music", "favorite"),
    ("music", "last_seen_live"),
    ("art", "favorite"),
], names=["art_type", "category"])

df = pd.DataFrame([
    ["Swift", "Swift", "Matisse"],
    ["Mozart", "T. Swift", "Van Gogh"],
    ["Beatles", "Wonder", "Warhol"],
    ["Jackson", "Dylan", "Picasso"],
], index=row_index, columns=col_index)

df.loc[:, ("art", "museums_seen")] = [1, 2, 4, 8]
df

Output:

art_type                 music                       art
category              favorite  last_seen_live  favorite  museums_seen
first_name last_name
John       Smith         Swift           Swift   Matisse             1
           Doe          Mozart        T. Swift  Van Gogh             2
Jane       Doe         Beatles          Wonder    Warhol             4
Stephen    Smith       Jackson           Dylan   Picasso             8

Code walkthrough:

This example builds a DataFrame with a MultiIndex on both axes and then assigns a column to it. It first creates row_index and col_index and uses them to construct the DataFrame. It then uses .loc with the column key ("art", "museums_seen") to add a new column named museums_seen under the art group. The same .loc syntax used for selection therefore also handles assignment on hierarchical columns.
