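What this article is

These are notes from checking how to run statistical calculations and preprocessing on a specific column of a pandas.DataFrame (that is, data in pandas.Series form). Everything here is based on what is described in the pandas.Series documentation.

Please also refer to the following page.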

General

This article uses Python 3.8.5 and loads its data from scikit-learn. Only numpy, pandas, and scikit-learn are used.

import numpy as np
import pandas as pd
from sklearn import datasets

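# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
boston = datasets.load_boston()
X = pd.DataFrame(boston['data'])
y = pd.DataFrame(boston['target'])
# Rename the columns to feature_0, feature_1, ...
X.columns = [f"feature_{i}" for i in range(X.shape[1])]
features = X.columns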

X.head()

Getting the dtype, the number of elements, and the column name

feature = X["feature_0"]
dtype = feature.dtype
dnum = feature.size
name = feature.name

print(f"dtype={dtype}, dnum={dnum}, name={name}")

Output:

dtype=float64, dnum=506, name=feature_0

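Checking for missing values

References:
- pandas.Series.hasnans — pandas 1.4.1 documentation
- pandas.Series.empty — pandas 1.4.1 documentation

hasnans tells you whether the Series contains any NaN, and empty tells you whether the Series has no elements at all. Note that a Series containing only NaN is not considered empty.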

feature = X["feature_0"]
ishasnan = feature.hasnans
isempty = feature.empty

print(f"ishasnan={ishasnan}, isempty={isempty}")

Output:

ishasnan=False, isempty=False
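
To make the difference between the two flags concrete, here is a minimal sketch on toy Series. The variable names and values are made up for illustration and are not part of the Boston data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series, not taken from the Boston data
all_nan = pd.Series([np.nan, np.nan])
no_rows = pd.Series([], dtype="float64")
partly_nan = pd.Series([1.0, np.nan, 3.0])

print(all_nan.hasnans, all_nan.empty)  # True False: only NaN, but not empty
print(no_rows.hasnans, no_rows.empty)  # False True: no elements at all
print(partly_nan.isna().sum())         # 1: number of missing entries
```

If you need the number of missing entries rather than a flag, isna().sum() as in the last line is one option.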

Applying a function to each value in a column

References:
- pandas.Series.apply — pandas 1.4.1 documentation

apply runs the given function on each value in the column, one at a time.

https://yuru-d.com/series-apply-lambda/
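
As a small illustration of the element-wise behaviour, the sketch below uses a made-up price column and a hypothetical helper clip_upper. Neither appears in the original data.

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0], name="price")  # hypothetical data

# apply calls the function once per element
with_tax = prices.apply(lambda v: v * 1.1)

# Extra positional arguments can be forwarded with args=
def clip_upper(v, limit):
    return min(v, limit)

clipped = prices.apply(clip_upper, args=(25.0,))
print(clipped.tolist())  # [10.0, 20.0, 25.0]
```

For simple arithmetic like the tax example, a vectorized expression such as prices * 1.1 gives the same result and is usually faster. apply is mainly useful when the logic cannot be written that way.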

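Converting the data of a column that has object dtype

References:
- pandas.Series.infer_objects — pandas 1.4.1 documentation
- pandas.Series.convert_dtypes — pandas 1.4.1 documentation

If string data is mixed in when the data is loaded, the column ends up with object dtype even when almost all of its values are ints. In that case you can convert the dtype by first selecting only the rows that contain ints and then running convert_dtypes. Methods for converting to one specific dtype are also available, so see the documentation above for those.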

feature = X["feature_1"][:10]
print(feature.convert_dtypes())
print("========")

feature = X["feature_1"][:5]
print(feature.convert_dtypes())
print("========")

The first five rows of X["feature_1"] contain only whole-number values, so convert_dtypes() converts them to Int64.

0    18.0
1     0.0
2     0.0
3     0.0
4     0.0
5     0.0
6    12.5
7    12.5
8    12.5
9    12.5
Name: feature_1, dtype: Float64
========
0    18
1     0
2     0
3     0
4     0
Name: feature_1, dtype: Int64
========
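
The Boston column above is float from the start, so here is a minimal sketch of the object-dtype situation described in the text. The column contents are invented, and pd.to_numeric is shown only as one possible alternative when you would rather keep the rows.

```python
import pandas as pd

# Hypothetical column: mostly ints, with one stray string
raw = pd.Series([1, 2, "unknown", 4])
print(raw.dtype)  # object

# Keep only the rows that hold ints, then let pandas infer the dtype
only_ints = raw[raw.map(lambda v: isinstance(v, int))]
converted = only_ints.convert_dtypes()
print(converted.dtype)  # Int64

# Alternative: turn non-numeric values into NaN instead of dropping the rows
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced.tolist())  # [1.0, 2.0, nan, 4.0]
```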

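Extracting data from specified rows and columns

Series.at: returns the single value that matches the given label
Series.iloc: returns a collection of values selected by integer position
DataFrame.loc: returns the collection of values that match the given labels
DataFrame.iat: gets a single value, with row and column given as integer positions
DataFrame.iloc: returns a collection of values, with rows and columns given as integer positions
DataFrame.xs: returns a cross-section; see pandas.DataFrame.xs — pandas 1.4.1 documentation

To filter by a regular expression on the label names, use filter: pandas.Series.filter — pandas 1.4.1 documentation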
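
The accessors listed above can be compared side by side on a tiny frame. This is only a sketch: the frame, its string row labels, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels
df = pd.DataFrame(
    {"feature_0": [1.0, 2.0, 3.0], "feature_1": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)
s = df["feature_0"]

print(s.at["b"])                                 # 2.0, single value by label
print(s.iloc[0:2].tolist())                      # [1.0, 2.0], by integer position
print(df.loc[["a", "c"], "feature_1"].tolist())  # [10.0, 30.0], by label
print(df.iat[2, 1])                              # 30.0, single value by position
print(df.xs("b").tolist())                       # [2.0, 20.0], the row labelled "b"
print(s.filter(regex="^[ab]$").tolist())         # [1.0, 2.0], labels matching a regex
```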

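Output in Markdown or LaTeX format

References:

A handy feature that is occasionally useful when writing documentation. to_markdown depends on a library called tabulate, so install it if you want to use to_markdown.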

print(X["feature_1"].to_markdown(tablefmt="grid"))

Output (excerpt):

+-----+-------------+
|  52 |        21   |
+-----+-------------+
|  53 |        21   |
+-----+-------------+
|  54 |        75   |
+-----+-------------+
|  55 |        90   |
+-----+-------------+
|  56 |        85   |
+-----+-------------+
|  57 |       100   |
+-----+-------------+

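Operations on specific kinds of data

Numeric data

Element-wise addition, subtraction, and comparison between pandas.Series

References:

Methods such as add, sub, and gt (greater than) are provided. How missing values are handled is set with the fill_value parameter. Let's add the feature_1 and feature_2 columns element by element.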

X["feature_1"].add(X["feature_2"])

Output:

0      20.31
1       7.07
2       7.07
3       2.18
4       2.18
       ...  
501    11.93
502    11.93
503    11.93
504    11.93
505    11.93
Length: 506, dtype: float64

To check whether values fall within a given interval, use pandas.Series.between.
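
The Boston columns have no missing values, so the effect of fill_value does not show up there. The sketch below uses two invented Series with NaN in different positions to show it, together with gt and between.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with missing values in different positions
a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([10.0, 20.0, np.nan])

plain = a + b                    # NaN wherever either side is missing
filled = a.add(b, fill_value=0)  # a missing side is treated as 0
print(filled.tolist())           # [11.0, 20.0, 3.0]

print(a.gt(2).tolist())            # [False, False, True]; NaN compares as False
print(b.between(10, 20).tolist())  # [True, True, False]; both ends are included
```

If both sides are missing at the same position, the result stays NaN even with fill_value.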

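Computing statistics such as the mean and median of a column in one go

References:
- pandas.Series.aggregate — pandas 1.4.1 documentation
- pandas.Series.transform — pandas 1.4.1 documentation

Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply. (Source: pandas.Series.aggregate — pandas 1.4.1 documentation)

If you pass func several functions that each aggregate a list of values, the calculation is run for each of them. It is convenient to define your frequently used functions ahead of time and compute them all together with aggregate. The example below computes the minimum, maximum, mean, median, and a check for whether 1 is present. Note that the expression 1.0 in s on a Series tests the index labels rather than the values, so the last entry is True here because 1 is one of the row labels; (s == 1.0).any() would test the values instead.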

feature = X["feature_0"]
feature.agg([min, max, np.mean, np.median, lambda s: 1.0 in s])

Output:

min          0.00632
max          88.9762
mean        3.613524
median       0.25651
<lambda>        True
Name: feature_0, dtype: object

To apply a conversion to each individual value rather than aggregate, use transform.
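
The following is a minimal sketch contrasting agg and transform on made-up numbers. It passes function names as strings to agg, which is an assumption about style rather than the approach used above, and it checks value membership separately from index membership.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 16.0], name="x")  # hypothetical data

# Aggregation: several statistics at once, one entry per function name
stats = s.agg(["min", "max", "mean", "median"])
print(stats["mean"], stats["median"])  # 7.5 6.5

# Membership: "in" looks at the index labels (0..3), not at the values
print(16.0 in s)                       # False
has_value = bool((s == 16.0).any())
print(has_value)                       # True

# transform: same length as the input, one converted value per element
roots = s.transform(np.sqrt)
print(roots.tolist())                  # [1.0, 2.0, 3.0, 4.0]
```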

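Computing a moving average over a column with a given window width

References:
- pandas.Series.rolling — pandas 1.4.1 documentation

This is sometimes used for things like imputation on data sorted in time-series order. The example below computes the sum and the mean for window widths of 2 and 3.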

sample = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

print("====")
print(sample.rolling(2).sum().T)
print("====")
print(sample.rolling(3).sum().T)
print("====")
print(sample.rolling(2).mean().T)
print("====")
print(sample.rolling(3).mean().T)

Output:

====
   0    1    2    3    4    5     6     7     8     9     10
A NaN  1.0  3.0  5.0  7.0  9.0  11.0  13.0  15.0  17.0  19.0
====
   0   1    2    3    4     5     6     7     8     9     10
A NaN NaN  3.0  6.0  9.0  12.0  15.0  18.0  21.0  24.0  27.0
====
   0    1    2    3    4    5    6    7    8    9    10
A NaN  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5
====
   0   1    2    3    4    5    6    7    8    9    10
A NaN NaN  1.0  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
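
The leading NaN values in the output above come from windows that are not yet full. As a sketch of how that can be adjusted, the example below uses the min_periods and center parameters of rolling on a short made-up Series.

```python
import pandas as pd

sample = pd.Series([0, 1, 2, 3, 4])  # hypothetical data

# Default: the first window-1 positions are NaN
strict = sample.rolling(3).mean()
# min_periods=1 produces a result as soon as one value is available
relaxed = sample.rolling(3, min_periods=1).mean()
# center=True aligns each window on its middle element
centered = sample.rolling(3, center=True).mean()

print(relaxed.tolist())  # [0.0, 0.5, 1.0, 2.0, 3.0]
```

With center=True the NaN values move to both ends of the result instead of sitting only at the start.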

Computing the values at given quantiles of a column

References:
- pandas.Series.quantile — pandas 1.4.1 documentation

Conveniently, passing the quantiles as a list returns the value at each of those points.

feature = X["feature_0"]
feature.quantile([.1, .25, .5, .75, .9])

Output:

0.10     0.038195
0.25     0.082045
0.50     0.256510
0.75     3.677083
0.90    10.753000
Name: feature_0, dtype: float64
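
When a requested quantile falls between two data points, quantile interpolates linearly by default. The sketch below shows this on four made-up values and uses the interpolation parameter to pick a neighbouring data point instead.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # hypothetical data

median_linear = s.quantile(0.5)                        # halfway between 2.0 and 3.0
median_lower = s.quantile(0.5, interpolation="lower")  # the lower neighbour
median_higher = s.quantile(0.5, interpolation="higher")  # the upper neighbour
print(median_linear, median_lower, median_higher)      # 2.5 2.0 3.0

# A list of quantiles returns a Series indexed by the requested quantiles
quartiles = s.quantile([0.25, 0.75])
print(quartiles.index.tolist())  # [0.25, 0.75]
```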

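Categorical variables and strings

Operations on strings are provided under pandas.Series.str.

We try these out on data downloaded from openml using fetch_openml, documented below.

sklearn.datasets.fetch_openml — scikit-learn 1.0.2 documentation

You can look up a dataset name with the search on openml, then download the data by specifying either the dataset name or the ID associated with the dataset.

In this section we use the data at https://www.openml.org/d/6332 as a dataset that contains categorical variables.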

from sklearn.datasets import fetch_openml

X_categ, y_categ = fetch_openml(data_id=6332, return_X_y=True)
X_categ = pd.DataFrame(X_categ)
X_categ["target"] = pd.Series(y_categ)

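Occurrence counts of a categorical variable

References:
- pandas.Series.value_counts — pandas 1.4.1 documentation
- pandas.Series.count — pandas 1.4.1 documentation

The number of occurrences per category, which you need often, can be obtained in a single call with value_counts.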

feature = X_categ["customer"]
unique = feature.unique()
nunique = feature.nunique()
count = feature.count()
value_count = feature.value_counts()

print(f"> Unique categories: unique={unique}")
print(f"> Number of unique categories: nunique={nunique}")
print(f"> Number of non-missing values: count={count}")
print(f"> Occurrences per category: value_count={value_count}")

Output:

> Unique categories: unique=['tvguide', 'modmat', 'massey', 'kmart', 'roses', ..., 'cvs', 'venture', 'jfk', 'colorfulimage', 'best']
Length: 71
Categories (71, object): ['tvguide', 'modmat', 'massey', 'kmart', ..., 'venture', 'jfk', 'colorfulimage', 'best']
> Number of unique categories: nunique=71
> Number of non-missing values: count=540
> Occurrences per category: value_count=kmart            67
modmat           64
target           41
tvguide          38
wards            33
                 ..
global            1
galls             1
colorfulimage     1
adco              1
1910              0
Name: customer, Length: 72, dtype: int64
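
In the output above, nunique reports 71 while value_counts has 72 rows, one of them with a count of 0. A likely explanation is that the column has category dtype and one declared category never occurs. The sketch below reproduces that behaviour on an invented column.

```python
import pandas as pd

# Hypothetical categorical column; "c" is a declared category that never occurs
cust = pd.Series(
    pd.Categorical(["a", "b", "a", None], categories=["a", "b", "c"]),
    name="customer",
)

print(cust.nunique())          # 2: only observed, non-missing values
print(cust.count())            # 3: non-missing entries
counts = cust.value_counts()   # includes "c" with a count of 0
print(len(counts), counts["c"])  # 3 0

# normalize=True returns relative frequencies instead of counts
shares = cust.value_counts(normalize=True)
print(shares["a"])
```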

Replacing values in a categorical variable

References:
- pandas.Series.str.replace — pandas 1.4.1 documentation
- pandas.Series.str.translate — pandas 1.4.1 documentation

Replace tvguide with REPLACED.

regex: bool, default True

Determines if assumes the passed-in pattern is a regular expression

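When the regex parameter is True, strings that match the regular expression are replaced. The default is True in pandas 1.4; it changed to False in pandas 2.0, so passing regex explicitly is the safer choice.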

feature = X_categ["customer"].head()  # first five rows, as shown in the output
print("=====")
print(feature)
print("=====")
print(feature.str.replace("tvguide", "REPLACED"))

Output:

=====
0    tvguide
1    tvguide
2     modmat
3     massey
4      kmart
Name: customer, dtype: category
Categories (72, object): ['1910', 'abbey', 'abbeypress', 'abbypress', ..., 'wards', 'woolworth', 'woolwrth', 'yieldhouse']
=====
0    REPLACED
1    REPLACED
2      modmat
3      massey
4       kmart
Name: customer, dtype: object
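
The tvguide pattern contains no special characters, so the regex setting makes no difference there. Here is a minimal sketch, on made-up strings, of a case where it does.

```python
import pandas as pd

s = pd.Series(["a.b", "acb"])  # hypothetical strings

# regex=False: the pattern is matched literally, so only "a.b" is replaced
literal = s.str.replace("a.b", "X", regex=False)
# regex=True: "." matches any character, so "acb" is replaced as well
pattern = s.str.replace("a.b", "X", regex=True)

print(literal.tolist())  # ['X', 'acb']
print(pattern.tolist())  # ['X', 'X']
```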

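To replace multiple characters in one pass according to a predefined translation table, use pandas.Series.str.translate. Like Python's str.translate, it maps individual characters rather than whole strings. See the following page.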
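
As a small illustration of the translation-table idea, the sketch below builds a table with Python's str.maketrans and applies it to two values borrowed from the customer column. The particular character mapping is made up.

```python
import pandas as pd

s = pd.Series(["kmart", "tvguide"])  # sample values from the customer column

# Character-level table: k -> K, t -> T, and "e" is deleted
table = str.maketrans({"k": "K", "t": "T", "e": None})
translated = s.str.translate(table)
print(translated.tolist())  # ['KmarT', 'Tvguid']
```

Every occurrence of a mapped character is converted, which is why the final t in kmart is also upper-cased. To map whole values instead, passing a dict to Series.replace is one possible alternative.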