2024-05-07 玄貓 (BlackCat)

Multidimensional Data Visualization and Processing Techniques

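This article explores techniques for visualizing and processing multidimensional data. It covers chart methods such as parallel coordinates and RadViz, along with data cleaning, slicing, and dicing using Pandas and JSON. Python code examples show how to apply these techniques to multidimensional data such as the Iris and stock datasets, and charts illustrate the practical effect of data scaling methods such as mean centering and normalization.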

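Multidimensional data is common in data analysis and scientific computing, and it calls for effective visualization and processing methods. This article introduces parallel coordinates and RadViz plots, which reveal clusters and patterns in multivariate data. It also demonstrates data cleaning with Pandas and JSON, for example correcting bad values in a CSV file and converting the result to JSON for later processing. Slicing and dicing are then used to extract subsets of the data for more precise analysis and presentation. Finally, the article introduces the data cube and uses Python and JSON to build and store a simple stock data cube.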

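Visualizing and processing multidimensional data is a significant technical challenge in data analysis and scientific computing. This article presents several methods for displaying and analyzing such data: parallel coordinates, RadViz, data cleaning, and slicing.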

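Parallel Coordinates

Parallel coordinates are an effective way to visualize multivariate data, clearly showing clusters and statistical features. In a parallel coordinates plot, each data point is drawn as a polyline of connected segments, and each vertical axis represents one attribute. Clusters show up where the lines bundle densely together.

Code example: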

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

if __name__ == "__main__":
    data = pd.read_csv('data/iris.csv')
    plt.figure()
    parallel_coordinates(data, 'Name', color=['b', 'mediumspringgreen', 'r'])
    plt.show()

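Code walkthrough:

Import the required libraries: matplotlib.pyplot for plotting and pandas for data handling.
Read the iris dataset into a pandas DataFrame with pd.read_csv.
Call parallel_coordinates to draw the plot; the 'Name' column distinguishes the classes, and the color argument sets one color per class.
Finally, call plt.show() to display the figure.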
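
To make the geometry concrete, here is a minimal sketch of how a record becomes the polyline that a parallel coordinates plot draws. It is not how pandas implements `parallel_coordinates`: the rows and the helper name `polyline_vertices` are made up for illustration, and rescaling each attribute to [0, 1] per axis is an assumption of this sketch (pandas plots the raw values on a shared y-axis).

```python
def polyline_vertices(rows, attributes):
    """Return one list of (axis_index, scaled_value) vertices per row."""
    # Per-attribute min and max, used to rescale every axis to [0, 1]
    lows = {a: min(r[a] for r in rows) for a in attributes}
    highs = {a: max(r[a] for r in rows) for a in attributes}
    lines = []
    for r in rows:
        line = []
        for i, a in enumerate(attributes):
            span = highs[a] - lows[a]
            # A constant attribute has zero span; place it mid-axis
            y = (r[a] - lows[a]) / span if span else 0.5
            line.append((i, y))
        lines.append(line)
    return lines


# Hypothetical measurements, loosely shaped like Iris rows
rows = [
    {"SepalLength": 5.0, "PetalLength": 1.5, "Name": "setosa"},
    {"SepalLength": 6.0, "PetalLength": 4.5, "Name": "versicolor"},
    {"SepalLength": 7.0, "PetalLength": 6.0, "Name": "virginica"},
]
lines = polyline_vertices(rows, ["SepalLength", "PetalLength"])
for row, line in zip(rows, lines):
    print(row["Name"], line)
```

Records of the same class tend to produce similar vertex lists, which is why their polylines bundle together on the plot.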

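RadViz Plots

RadViz is another method for visualizing multidimensional data. It maps each data point to a single point in a two-dimensional plane, which exposes the structure of the data and helps identify patterns and relationships.

Code example: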

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import radviz

if __name__ == "__main__":
    data = pd.read_csv('data/iris.csv')
    plt.figure()
    radviz(data, 'Name', color=['b', 'mediumspringgreen', 'r'])
    plt.show()

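Code walkthrough:

Import the required libraries, as in the parallel coordinates example.
Read the iris dataset into a pandas DataFrame.
Call radviz to draw the RadViz plot, again using the 'Name' column to distinguish classes and specifying the colors.
Call plt.show() to display the figure.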
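
The mapping to 2D can be sketched in a few lines. The usual RadViz construction places one anchor per attribute evenly on a unit circle and positions each record at the weighted average of the anchors, with the record's normalized attribute values as weights. The function name `radviz_point` is hypothetical, the input is assumed to be already scaled to [0, 1], and sending an all-zero record to the center is this sketch's own choice.

```python
import math


def radviz_point(weights):
    """Project normalized attribute values (each in [0, 1]) to 2D."""
    m = len(weights)
    # One anchor per attribute, evenly spaced on the unit circle
    anchors = [(math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m))
               for j in range(m)]
    total = sum(weights)
    if total == 0:
        return (0.0, 0.0)  # no pull from any anchor: this sketch's choice
    x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
    return (x, y)


print(radviz_point([1.0, 0.0, 0.0, 0.0]))  # sits on the first anchor
print(radviz_point([0.5, 0.5, 0.5, 0.5]))  # equal pull: lands near the center
```

A record dominated by one attribute is pulled toward that attribute's anchor, so classes with different dominant attributes separate into different regions of the circle.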

Cleaning a CSV File with Pandas and JSON

In practice, data often contains errors or inconsistencies. This section shows how to clean dirty data in a CSV file using Pandas and JSON.

Code example:

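import csv
import json

import pandas as pd


def to_dict(rows):
    # Convert csv.DictReader rows into a list of plain dicts
    return [dict(row) for row in rows]


def dump_json(path, data):
    # Write data to a JSON file
    with open(path, 'w') as f:
        json.dump(data, f)


def read_json(path):
    # Load data back from a JSON file
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    df = pd.read_csv("data/audio.csv")
    print(df, '\n')
    # Read the same file as dicts; the with block closes it afterwards
    with open('data/audio.csv', newline='') as f:
        d = to_dict(csv.DictReader(f))
    # Clean the data, e.g. repair specific field values
    for row in d:
        if row['pno'][0] not in ['a', 'c', 'p', 's']:
            # Apply a correction that depends on the condition
            if row['pno'][0] == '8':
                row['pno'] = 'a' + row['pno']
            # Other conditions...
    json_file = 'data/audio.json'
    dump_json(json_file, d)
    data = read_json(json_file)
    # Print the first five rows to verify the result
    for row in data[:5]:
        print(row)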

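Code walkthrough:

Import the required libraries: csv, pandas, and json.
Define the helper functions to_dict, dump_json, and read_json, which convert the row format, write a JSON file, and read a JSON file, respectively.
In the main program, first read the CSV file with pd.read_csv and print the raw data.
Read the CSV file with csv.DictReader and convert it to a list of dictionaries so it can be cleaned.
Clean the data, for example by correcting certain field values when a condition is met.
Write the cleaned data to a JSON file, then read the file back to verify the result.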
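
Since `data/audio.csv` is not reproduced in the article, the following sketch runs the same clean-then-serialize flow on a tiny in-memory CSV. The two rows are invented for illustration; only the column names (`pno`, `desc`, `price`) and the "prefix with 'a' when the part number starts with '8'" rule come from the code above. Converting `price` to an integer is an extra step assumed here, because `csv` returns every field as a string.

```python
import csv
import io
import json

# Hypothetical rows; the real data/audio.csv is not reproduced here
raw = "pno,desc,price\n8001,amplifier,5500\nc2002,cable,40\n"

rows = [dict(r) for r in csv.DictReader(io.StringIO(raw))]
for row in rows:
    # Same rule as above: a part number starting with '8' gets an 'a' prefix
    if row["pno"][0] == "8":
        row["pno"] = "a" + row["pno"]
    # csv yields strings only, so convert numeric fields explicitly
    row["price"] = int(row["price"])

text = json.dumps(rows)
print(text)
restored = json.loads(text)
```

Doing the type conversion before writing JSON means later numeric filters on `price` compare numbers rather than strings.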

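Slicing and Dicing

Slicing and dicing break data into smaller parts or views so that it is easier to understand and present. Slicing filters the data on a particular attribute value, while dicing selects a subset of specific values across multiple dimensions.

Code example: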

import pandas as pd

if __name__ == "__main__":
    df = pd.read_json("data/audio.json")
    amps = df[df.desc == 'amplifier']
    print(amps, '\n')
    price = df.query('price >= 40000')
    print(price, '\n')
    between = df.query('4999 < price < 6000')
    print(between, '\n')
    row = df.loc[[0, 10, 19]]
    print(row)

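Code walkthrough:

Import the pandas library.
Read the JSON file into a DataFrame with pd.read_json.
Slice the data, for example by selecting the rows whose desc is 'amplifier' or by filtering on a price range.
Use df.loc to select rows by specific row index.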
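
The distinction between a slice and a dice is easy to see without pandas. The sketch below uses plain list comprehensions over invented records (the part numbers and prices are hypothetical; the column names follow the example above).

```python
# Hypothetical records mirroring the columns used above
records = [
    {"pno": "a1001", "desc": "amplifier", "price": 5500},
    {"pno": "a1002", "desc": "amplifier", "price": 42000},
    {"pno": "c2001", "desc": "cable", "price": 40},
    {"pno": "s3001", "desc": "speaker", "price": 5200},
]

# Slice: fix one attribute to a single value
amps = [r for r in records if r["desc"] == "amplifier"]

# Dice: restrict several dimensions at once to chosen values or ranges
diced = [r for r in records
         if r["desc"] in {"amplifier", "speaker"} and 4999 < r["price"] < 6000]

print([r["pno"] for r in amps])
print([r["pno"] for r in diced])
```

In pandas the dice would typically be written as a single boolean mask or `query` string that combines both conditions.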

Data Cubes

A data cube is an n-dimensional array used to store and analyze multidimensional data. In practice, most data cubes are three-dimensional.

Code example:

import json


def dump_json(path, data):
    # Write data to a JSON file
    with open(path, 'w') as f:
        json.dump(data, f)


def read_json(path):
    # Load data back from a JSON file
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    d = dict()
    googl = dict()
    # Add five days of trading data for GOOGL
    googl['2017-09-25'] = {'Open': 939.450012, 'High': 939.750000, 'Low': 924.510010, 'Close': 934.280029, 'Adj Close': 934.280029, 'Volume': 1873400}
    # Data for the other dates...
    d['GOOGL'] = googl
    # Add data for the other stocks...
    json_file = 'data/stock.json'
    dump_json(json_file, d)
    data = read_json(json_file)
    # Print or analyze the data that was read back

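Code walkthrough:

Define the functions dump_json and read_json for working with JSON files.
In the main program, create a dictionary d to hold the stock data and add several days of trading information for each stock.
Write the finished dictionary to a JSON file.
Read the data back from the JSON file for further analysis or printing.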
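
A nested dictionary only behaves like a cube if every ticker has the same dates and every date has the same measures. The sketch below checks that and reports the three axes. The tickers `AAA` and `BBB`, their prices, and the helper name `cube_shape` are placeholders invented for illustration.

```python
# Hypothetical two-ticker, two-day cube; prices are placeholders
cube = {
    "AAA": {"2017-09-25": {"Open": 10.0, "Close": 10.5},
            "2017-09-26": {"Open": 10.5, "Close": 10.2}},
    "BBB": {"2017-09-25": {"Open": 20.0, "Close": 19.8},
            "2017-09-26": {"Open": 19.8, "Close": 20.4}},
}


def cube_shape(cube):
    """Return (tickers, dates, measures) axis labels, checking the cube is regular."""
    tickers = sorted(cube)
    dates = sorted(cube[tickers[0]])
    measures = sorted(cube[tickers[0]][dates[0]])
    for t in tickers:
        assert sorted(cube[t]) == dates, f"{t}: date axis differs"
        for d in dates:
            assert sorted(cube[t][d]) == measures, f"{t}/{d}: measures differ"
    return tickers, dates, measures


tickers, dates, measures = cube_shape(cube)
print(len(tickers), len(dates), len(measures))
# One cell is addressed by a coordinate on each of the three axes
print(cube["BBB"]["2017-09-26"]["Close"])
```

Here the three dimensions are ticker, date, and measure, which matches the layout of the stock dictionary built above.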

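Building a Data Cube and Storing It as JSON

In data science, storing and reading data are basic but important steps. This section shows how to build a simple data cube in Python and store it in JSON format.

Building the Data Cube

First, we build the data cube. In this example it holds price data for three stocks (GOOGL, AMZN, MKL) over five trading days.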

import json

# Define the stock data
googl = dict()
googl['2017-09-25'] = {'Open': 974.900024, 'High': 978.790039, 'Low': 968.599976, 'Close': 974.290039, 'Adj Close': 974.290039, 'Volume': 1597600}
googl['2017-09-26'] = {'Open': 968.000000, 'High': 974.750000, 'Low': 962.500000, 'Close': 970.040039, 'Adj Close': 970.040039, 'Volume': 1236300}
googl['2017-09-27'] = {'Open': 969.000000, 'High': 974.500000, 'Low': 964.099976, 'Close': 969.230005, 'Adj Close': 969.230005, 'Volume': 1175700}
googl['2017-09-28'] = {'Open': 967.099976, 'High': 975.530029, 'Low': 966.000000, 'Close': 973.719971, 'Adj Close': 973.719971, 'Volume': 1266300}
googl['2017-09-29'] = {'Open': 966.000000, 'High': 975.809998, 'Low': 966.000000, 'Close': 973.719971, 'Adj Close': 973.719971, 'Volume': 2031100}

amzn = dict()
amzn['2017-09-25'] = {'Open': 949.309998, 'High': 949.419983, 'Low': 932.890015, 'Close': 939.789978, 'Adj Close': 939.789978, 'Volume': 5124000}
amzn['2017-09-26'] = {'Open': 945.489990, 'High': 948.630005, 'Low': 931.750000, 'Close': 937.429993, 'Adj Close': 938.599976, 'Volume': 3564800}
amzn['2017-09-27'] = {'Open': 948.000000, 'High': 955.299988, 'Low': 943.299988, 'Close': 950.869995, 'Adj Close': 950.869995, 'Volume': 3148900}
amzn['2017-09-28'] = {'Open': 951.859985, 'High': 959.700012, 'Low': 950.099976, 'Close': 956.400024, 'Adj Close': 956.400024, 'Volume': 2522600}
amzn['2017-09-29'] = {'Open': 960.109985, 'High': 964.830017, 'Low': 958.380005, 'Close': 961.349976, 'Adj Close': 961.349976, 'Volume': 2543800}

mkl = dict()
mkl['2017-09-25'] = {'Open': 1056.199951, 'High': 1060.089966, 'Low': 1047.930054, 'Close': 1050.250000, 'Adj Close': 1050.250000, 'Volume': 23300}
mkl['2017-09-26'] = {'Open': 1052.729980, 'High': 1058.520020, 'Low': 1045.000000, 'Close': 1045.130005, 'Adj Close': 1045.130005, 'Volume': 25800}
mkl['2017-09-27'] = {'Open': 1047.560059, 'High': 1069.099976, 'Low': 1047.010010, 'Close': 1064.040039, 'Adj Close': 1064.040039, 'Volume': 21100}
mkl['2017-09-28'] = {'Open': 1064.130005, 'High': 1073.000000, 'Low': 1058.079956, 'Close': 1070.550049, 'Adj Close': 1070.550049, 'Volume': 23500}
mkl['2017-09-29'] = {'Open': 1068.439941, 'High': 1073.000000, 'Low': 1060.069946, 'Close': 1067.979980, 'Adj Close': 1067.979980, 'Volume': 20700}

# Collect the per-stock dictionaries into one large dictionary
d = dict()
d['GOOGL'], d['AMZN'], d['MKL'] = googl, amzn, mkl

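Data handling notes:

1. The code above first imports the json module, which handles JSON data.
2. It defines three dictionaries, googl, amzn, and mkl, each holding the historical price data for one stock.
3. In each dictionary the key is a date and the value is another dictionary with that day's open, high, low, close, adjusted close, and volume.
4. Finally, the three dictionaries are added to one large dictionary, d.

Saving the Data Cube as JSON

Next, we save the data cube to a JSON file.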

```python
def dump_json(file_path, data):
    with open(file_path, "w") as f:
        json.dump(data, f)

json_file = "data/cube.json"
dump_json(json_file, d)
```

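JSON saving notes:

1. Define a function, dump_json, that saves data to a JSON file.
2. Open the file with a with open statement and write the data with json.dump.

Reading the Data Back from JSON

Finally, we read the data back from the JSON file.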

def read_json(file_path):
    with open(file_path, "r") as f:
        return json.load(f)

d = read_json(json_file)

JSON reading notes:

1. Define a function, read_json, that reads data from a JSON file.
2. Open the file with a with open statement and use json.load to parse the JSON in the file into Python objects.
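
A quick way to gain confidence in the save and load helpers is a round trip: write a structure, read it back, and compare. The sketch below does this in a temporary directory (the use of `tempfile` is an addition here, not part of the article's code) with the GOOGL values shown earlier. It also demonstrates one JSON property worth remembering: object keys are always strings, which is why string dates work well as cube keys.

```python
import json
import os
import tempfile

cube = {"GOOGL": {"2017-09-25": {"Adj Close": 934.280029, "Volume": 1873400}}}

# Write to a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cube.json")
    with open(path, "w") as f:
        json.dump(cube, f)
    with open(path) as f:
        restored = json.load(f)

print(restored == cube)
# JSON object keys are always strings, so an int key comes back as a str
print(json.loads(json.dumps({1: "x"})))
```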

Printing the 'Adj Close' Slice

s = ' '
print("'Adj Close' slice:")
print(10 * s, 'AMZN', s, 'GOOGL', s, 'MKL')
print('Date')
for date in ['2017-09-25', '2017-09-26', '2017-09-27', '2017-09-28', '2017-09-29']:
    print(date, round(d['AMZN'][date]['Adj Close'], 2),
          round(d['GOOGL'][date]['Adj Close'], 2),
          round(d['MKL'][date]['Adj Close'], 2))

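Slice printing notes:

1. Print the title "'Adj Close' slice:" to indicate that the adjusted close slice follows.
2. Print the table header, which contains the date label and the three ticker symbols.
3. Loop over the five trading days and print each date along with the adjusted close of the three stocks.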
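
The same slice can be returned as data instead of being printed, which makes it reusable for any measure. This is a sketch under the assumption that the cube is a ticker → date → measure dictionary as built above. The helper name `slice_cube` is hypothetical, and only two tickers and two dates from the cube are repeated here to keep it short.

```python
def slice_cube(cube, measure):
    """Return {date: {ticker: value}} for one measure of a ticker/date/measure cube."""
    table = {}
    for ticker, days in cube.items():
        for date, measures in days.items():
            table.setdefault(date, {})[ticker] = round(measures[measure], 2)
    return table


# Two dates and two tickers taken from the cube defined earlier
cube = {
    "AMZN": {"2017-09-25": {"Adj Close": 939.789978},
             "2017-09-26": {"Adj Close": 938.599976}},
    "MKL": {"2017-09-25": {"Adj Close": 1050.250000},
            "2017-09-26": {"Adj Close": 1045.130005}},
}
adj = slice_cube(cube, "Adj Close")
for date in sorted(adj):
    print(date, adj[date])
```

Fixing the measure axis leaves a two-dimensional table of date by ticker, which is exactly what the printed slice shows.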

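Data Scaling and Wrangling

In data science, scaling converts data with different scales or distributions into a comparable form. Common scaling methods include mean centering, normalization, and standardization.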
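
For reference, the three methods are usually defined as follows, where $\bar{x}$ is the sample mean and $\sigma$ the standard deviation of the values $x_1, \dots, x_n$. The first two correspond to the `ctr` and `nrml` functions in this section; the third is the common z-score definition of standardization, which the article names but does not implement, so treat it as a standard convention rather than the author's code.

```latex
x_i^{\text{centered}} = x_i - \bar{x}, \qquad
x_i^{\text{normalized}} = \frac{x_i - \min(x)}{\max(x) - \min(x)}, \qquad
z_i = \frac{x_i - \bar{x}}{\sigma}
```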

Mean Centering

Mean centering subtracts the mean from every data point, so that the mean of the resulting data is zero.

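import numpy as np


def rnd_nrml(m, s, n):
    # Draw n samples from a normal distribution with mean m and std dev s
    return np.random.normal(m, s, n)


def ctr(d):
    # Subtract the mean (computed once) from every value
    mean = np.mean(d)
    return [x - mean for x in d]


if __name__ == "__main__":
    mu, sigma, n = 10, 15, 100
    s = rnd_nrml(mu, sigma, n)
    sc = ctr(s)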

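Mean centering notes:

1. Define a function, rnd_nrml, that generates an array of normally distributed random numbers.
2. Define a function, ctr, that mean-centers the input array.
3. In the main program, generate a normally distributed random array s and mean-center it to obtain sc.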
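
The two properties of mean centering, a mean of zero and an unchanged spread, can be checked numerically. This sketch uses only the standard library (`random.gauss` in place of NumPy, with a fixed seed chosen here for repeatability) and the same parameters as above.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
# Same parameters as above: mean 10, standard deviation 15, 100 samples
s = [random.gauss(10, 15) for _ in range(100)]

mean = statistics.fmean(s)
sc = [x - mean for x in s]

print(statistics.fmean(sc))                       # approximately 0
print(statistics.stdev(s), statistics.stdev(sc))  # spread is unchanged
```

The centered mean is zero only up to floating-point rounding, so comparisons should use a small tolerance rather than exact equality.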

Normalization

Normalization scales data into a specific range, usually [0, 1].

def nrml(d):
    # Min-max scaling; reuses np and rnd_nrml from the mean-centering example
    lo, hi = np.amin(d), np.amax(d)
    return [(x - lo) / (hi - lo) for x in d]


if __name__ == "__main__":
    mu, sigma, n = 10, 15, 100
    s = rnd_nrml(mu, sigma, n)
    sn = nrml(s)

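Normalization notes:

1. Define a function, nrml, that normalizes the input array.
2. In the main program, generate a normally distributed random array s and normalize it to obtain sn.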
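
One edge case the formula does not cover is input where every value is the same: the denominator becomes zero. The following standard-library sketch adds a guard for that. The function name `min_max` is hypothetical, and returning zeros for constant input is a choice made for this sketch, not an established convention.

```python
def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: the formula would divide by zero.
        # Returning zeros is this sketch's choice, not a standard.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


print(min_max([10, 15, 20, 40]))
print(min_max([7, 7, 7]))
```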

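Visualizing and Interpreting the Results

This section uses Matplotlib to plot the scaling results above, to make the practical effect of each method easier to see.

Visualizing Mean Centering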

import matplotlib.pyplot as plt

if __name__ == "__main__":
    # ... (code that generates the random data and mean-centers it is omitted)
    plt.figure()
    ax = plt.subplot(211)
    ax.set_title('normal distribution')
    # density=True replaces the normed=True argument removed in newer Matplotlib
    count, bins, ignored = plt.hist(s, 30, color='pink', density=True)
    ax = plt.subplot(212)
    ax.set_title('normal distribution "centered"')
    count, bins, ignored = plt.hist(sc, color='springgreen', density=True)
    plt.tight_layout()
    plt.show()

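Mean centering visualization notes:

1. Use Matplotlib's hist function to draw histograms of the original data and the centered data.
2. Comparing the two histograms shows that mean centering leaves the shape of the distribution unchanged while shifting its mean to zero.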
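
The histogram calls above ask Matplotlib to scale bar heights so that the total bar area is 1 (the `density=True` argument in current Matplotlib; older releases called it `normed`). What that scaling means can be sketched without any plotting library. The helper name `density_hist` and the sample data are made up for illustration, and the sketch assumes the values are not all identical.

```python
def density_hist(values, bins):
    """Equal-width histogram whose bar areas sum to 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum falls on the last edge; keep it in the last bin
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    heights = [c / (len(values) * width) for c in counts]
    return heights, width


data = [1, 2, 2, 3, 3, 3, 4, 4, 5]
heights, width = density_hist(data, 4)
print(heights)
print(sum(h * width for h in heights))  # total area, approximately 1
```

Because heights are divided by the bin width, two histograms of differently scaled data remain comparable as densities even though their x-axes differ.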

Visualizing Normalization

if __name__ == "__main__":
    # ... (code that generates the random data and normalizes it is omitted)
    plt.figure()
    ax = plt.subplot(211)
    ax.set_title('normal distribution')
    # density=True replaces the normed=True argument removed in newer Matplotlib
    count, bins, ignored = plt.hist(s, color='orchid', density=True)
    ax = plt.subplot(212)
    ax.set_title('normal distribution "normalized"')
    count, bins, ignored = plt.hist(sn, color='royalblue', density=True)
    plt.tight_layout()
    plt.show()

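Normalization visualization notes:

1. Again use Matplotlib to draw histograms of the original data and the normalized data.
2. Comparing the two shows that normalization rescales the data into the range [0, 1].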