When building a classification model, feature engineering often involves discretizing continuous variables, i.e., converting continuous fields into discrete ones. During discretization the continuous variable is re-encoded, which tends to make the model more stable and lowers the risk of overfitting. This article introduces three common binning strategies: equal-width, equal-frequency, and clustering-based binning.
The following uses a small simulated score dataset to walk through the idea and the code.
import numpy as np
import pandas as pd

np.random.seed(1)                                   # reproducible simulation
n = 20
ID = np.arange(1, n + 1)                            # sample IDs 1..20
SCORE = np.random.normal(80, 10, n).astype('int')   # simulated scores around 80
df = pd.DataFrame({'ID': ID, 'SCORE': SCORE})
|    | ID | SCORE |
|---:|---:|------:|
| 0  | 1  | 96 |
| 1  | 2  | 73 |
| 2  | 3  | 74 |
| 3  | 4  | 69 |
| 4  | 5  | 88 |
| 5  | 6  | 56 |
| 6  | 7  | 97 |
| 7  | 8  | 72 |
| 8  | 9  | 83 |
| 9  | 10 | 77 |
| 10 | 11 | 94 |
| 11 | 12 | 59 |
| 12 | 13 | 76 |
| 13 | 14 | 76 |
| 14 | 15 | 91 |
| 15 | 16 | 69 |
| 16 | 17 | 78 |
| 17 | 18 | 71 |
| 18 | 19 | 80 |
| 19 | 20 | 85 |
The binning above can be carried out with the KBinsDiscretizer class from scikit-learn.
Its main parameters are:

- n_bins: the number of bins per feature, 5 by default.
- strategy: the binning strategy; KBinsDiscretizer implements several, selected through this parameter ("uniform", "quantile" or "kmeans", with "quantile" as the default).
- encode: how the binned field is encoded afterwards, e.g. "ordinal" integer labels or one-hot encoding ("onehot" is the default).

KBinsDiscretizer only accepts 2-D input (one column per feature), so the DataFrame column has to be reshaped first:

score = df['SCORE'].values.reshape(-1, 1)
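To make the encode options concrete, here is a minimal sketch (using a small hand-written score column rather than the DataFrame above) contrasting "ordinal" with "onehot-dense":

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

scores = np.array([56, 69, 77, 97], dtype=float).reshape(-1, 1)

# encode="ordinal": a single column of integer bin labels 0..n_bins-1
ordinal = KBinsDiscretizer(n_bins=3, encode="ordinal",
                           strategy="uniform").fit_transform(scores)

# encode="onehot-dense": one 0/1 indicator column per bin
onehot = KBinsDiscretizer(n_bins=3, encode="onehot-dense",
                          strategy="uniform").fit_transform(scores)

print(ordinal.ravel())  # bin label per sample
print(onehot)           # same information, one-hot encoded
```

Both carry the same information; "ordinal" is convenient when the labels feed a tree model, while one-hot output suits linear models.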
Equal-width binning (strategy="uniform") splits the value range into intervals of equal size. Below the data is split into 3 bins:
from sklearn.preprocessing import KBinsDiscretizer
dis = KBinsDiscretizer(n_bins=3,
encode="ordinal",
strategy="uniform"
)
label_uniform = dis.fit_transform(score)  # fit and transform in one step
The simulated scores range from a minimum of 56 to a maximum of 97, so the three equal-width bins end up with edges [56., 69.66666667, 83.33333333, 97.] (available as dis.bin_edges_ after fitting).
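These equal-width edges are easy to reproduce by hand: with 3 bins, they are just 4 evenly spaced points between the minimum and the maximum (a sketch of the idea, not sklearn's internal code):

```python
import numpy as np

scores = np.array([96, 73, 74, 69, 88, 56, 97, 72, 83, 77,
                   94, 59, 76, 76, 91, 69, 78, 71, 80, 85])

# 3 equal-width bins -> 4 edges evenly spaced over [min, max]
edges = np.linspace(scores.min(), scores.max(), num=3 + 1)
print(edges)  # edges 56, 69.67, 83.33, 97
```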
Equal-frequency binning (strategy="quantile") instead makes each interval contain (approximately) the same number of samples. Again split into 3 bins:
dis = KBinsDiscretizer(n_bins=3,
encode="ordinal",
strategy="quantile"
)
label_quantile = dis.fit_transform(score)
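For equal-frequency binning the edges come from the empirical quantiles of the data rather than from its range. A sketch of the idea with np.quantile (sklearn's exact edge handling may differ slightly):

```python
import numpy as np

scores = np.array([96, 73, 74, 69, 88, 56, 97, 72, 83, 77,
                   94, 59, 76, 76, 91, 69, 78, 71, 80, 85])

# 3 equal-frequency bins -> edges at the 0, 1/3, 2/3 and 1 quantiles
edges = np.quantile(scores, [0, 1/3, 2/3, 1])
labels = np.digitize(scores, edges[1:-1])  # assign each score to a bin
print(edges)
print(np.bincount(labels))  # bin sizes are as equal as the data allows
```

With 20 samples and repeated values the counts cannot be exactly equal, but they stay within one sample of each other.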
Clustering-based binning (strategy="kmeans") first clusters the continuous variable and then replaces each value with the label of the cluster it belongs to.
dis = KBinsDiscretizer(n_bins=3,
encode="ordinal",
strategy="kmeans"
)
label_kmeans = dis.fit_transform(score)  # fit_transform, not transform: dis was just re-created and is unfitted
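Internally, strategy="kmeans" runs a one-dimensional k-means and places the bin edges halfway between neighbouring cluster centers. A rough equivalent built directly on sklearn.cluster.KMeans (a sketch; the initialization sklearn actually uses differs):

```python
import numpy as np
from sklearn.cluster import KMeans

scores = np.array([96, 73, 74, 69, 88, 56, 97, 72, 83, 77,
                   94, 59, 76, 76, 91, 69, 78, 71, 80, 85], dtype=float)

# cluster the 1-D values into 3 groups
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores.reshape(-1, 1))
centers = np.sort(km.cluster_centers_.ravel())

# edges sit halfway between neighbouring centers, plus the data min/max
edges = np.r_[scores.min(), (centers[:-1] + centers[1:]) / 2, scores.max()]
labels = np.digitize(scores, edges[1:-1])
print(edges)
print(labels)
```

Sorting the centers first guarantees that bin 0 is the lowest score group and bin 2 the highest, matching the ordinal labels KBinsDiscretizer produces.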
df["label_uniform"] = label_uniform
df["label_quantile"] = label_quantile
df["label_kmeans"] = label_kmeans
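A quick way to contrast the three strategies is to count how many samples fall into each bin; the equal-frequency bins should be the most balanced. A sketch on the same simulated scores:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

np.random.seed(1)
score = np.random.normal(80, 10, 20).astype('int').reshape(-1, 1)

results = {}
for strategy in ("uniform", "quantile", "kmeans"):
    dis = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    labels = dis.fit_transform(score).ravel()
    # bin sizes, ordered by bin label 0, 1, 2
    results[strategy] = pd.Series(labels).value_counts().sort_index().tolist()
    print(strategy, results[strategy])
```

On this data the uniform bins hold 4, 10 and 6 samples while the quantile bins hold 7, 6 and 7, which illustrates how equal-width bins can be badly unbalanced when the values cluster.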
df now holds the labels from all three strategies side by side:

|    | ID | SCORE | label_uniform | label_quantile | label_kmeans |
|---:|---:|------:|--------------:|---------------:|-------------:|
| 0  | 1  | 96 | 2.0 | 2.0 | 2.0 |
| 1  | 2  | 73 | 1.0 | 0.0 | 1.0 |
| 2  | 3  | 74 | 1.0 | 1.0 | 1.0 |
| 3  | 4  | 69 | 0.0 | 0.0 | 0.0 |
| 4  | 5  | 88 | 2.0 | 2.0 | 2.0 |
| 5  | 6  | 56 | 0.0 | 0.0 | 0.0 |
| 6  | 7  | 97 | 2.0 | 2.0 | 2.0 |
| 7  | 8  | 72 | 1.0 | 0.0 | 1.0 |
| 8  | 9  | 83 | 1.0 | 2.0 | 1.0 |
| 9  | 10 | 77 | 1.0 | 1.0 | 1.0 |
| 10 | 11 | 94 | 2.0 | 2.0 | 2.0 |
| 11 | 12 | 59 | 0.0 | 0.0 | 0.0 |
| 12 | 13 | 76 | 1.0 | 1.0 | 1.0 |
| 13 | 14 | 76 | 1.0 | 1.0 | 1.0 |
| 14 | 15 | 91 | 2.0 | 2.0 | 2.0 |
| 15 | 16 | 69 | 0.0 | 0.0 | 0.0 |
| 16 | 17 | 78 | 1.0 | 1.0 | 1.0 |
| 17 | 18 | 71 | 1.0 | 0.0 | 1.0 |
| 18 | 19 | 80 | 1.0 | 1.0 | 1.0 |
| 19 | 20 | 85 | 2.0 | 2.0 | 2.0 |