Informally, an outlier (also called an anomalous value) is an observation that lies far away from the other observations. This definition is deliberately loose, because it does not quantify how far is "far". This article looks at outlier detection and handling techniques, and at how outliers affect different kinds of machine learning models.
Many machine learning models, such as linear and logistic regression, are easily affected by outliers in the training data. Models like AdaBoost increase the weights of misclassified points at every iteration, so they may assign very high weights to outliers, which tend to be misclassified. If the outliers are some kind of error, or if we want the model to generalize well rather than chase extreme values, it is worth detecting and handling them before training.
A common approach to outlier detection is to assume that the normal data come from a known distribution (for example, that the data are Gaussian). The quickest and simplest way to spot outliers, however, is to visualize them. If your dataset is not too large (roughly up to 10k observations and 100 features), it is strongly recommended to build scatter plots and box plots of the variables. Even if there turn out to be no outliers, you will still gain other insights, such as correlations, variability, or the impact of external factors such as world wars or recessions on economic variables. For high-dimensional data that is hard to visualize, this approach is not recommended.
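For instance, a minimal sketch of such a visual check (the synthetic DataFrame and the column names here are purely illustrative, not from the original text):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# illustrative data: two roughly Gaussian features with a few injected extreme points
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(0, 1, 500), 'y': rng.normal(0, 1, 500)})
df.loc[:4, ['x', 'y']] = 8.0

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(df['x'], df['y'], s=10)   # outliers sit far from the main cloud
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax2.boxplot(df['x'])                  # outliers appear beyond the whiskers
ax2.set_ylabel('x')
plt.show()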
Under a normal distribution, the 3-sigma rule treats any value more than three standard deviations away from the mean as an outlier.
import numpy as np

def three_sigma(s):
    """Return the 3-sigma lower/upper bounds of a 1-D array or Series."""
    mu, std = np.mean(s), np.std(s)
    lower, upper = mu - 3 * std, mu + 3 * std
    return lower, upper
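For example, the bounds can be used to flag values in a pandas Series (a minimal sketch; the Series below is synthetic and only for illustration):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.normal(loc=0, scale=1, size=1000))
s.iloc[0] = 10.0                           # inject one extreme value

lower, upper = three_sigma(s)
outliers = s[(s < lower) | (s > upper)]    # values outside mean ± 3*std
print(outliers)                            # includes the injected value (and any natural points beyond 3 sigma)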
The box plot finds outliers based on the interquartile range (IQR).
def boxplot(s):
    """Return the IQR-based lower/upper bounds of a pandas Series."""
    q1, q3 = s.quantile(.25), s.quantile(.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return lower, upper
We first determine the quartiles Q1 and Q3. $$\text{IQR} = Q3 - Q1$$ $$\text{Upper bound} = Q3 + 1.5 \times \text{IQR}$$ $$\text{Lower bound} = Q1 - 1.5 \times \text{IQR}$$ Any value below the lower bound or above the upper bound is considered an outlier.
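As with the 3-sigma rule, the IQR bounds can be used to flag or drop extreme values (a minimal sketch on a synthetic Series, for illustration only):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=1000))
s.iloc[:3] = [8.0, -9.0, 12.0]             # inject a few extreme values

lower, upper = boxplot(s)
print(s[(s < lower) | (s > upper)])        # flagged outliers
s_clean = s[(s >= lower) & (s <= upper)]   # or drop them before modelling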
The k-nearest-neighbours approach computes, for each sample, the average distance to its K nearest neighbours and compares that distance with a threshold; if the distance exceeds the threshold, the point is treated as an outlier. Its advantage is that it makes no assumption about the data distribution; its drawback is that it only finds global outliers, not local ones. (Here we use the pyod library.)
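Before turning to pyod, here is a minimal sketch of that idea using scikit-learn's NearestNeighbors; the synthetic data and the choice to flag the top 5% of scores are assumptions made for illustration:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),        # dense inlier cluster
               rng.uniform(-6, 6, size=(10, 2))])      # a few scattered points

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)        # +1 because each point is its own nearest neighbour
dist, _ = nn.kneighbors(X)
scores = dist[:, 1:].mean(axis=1)                      # average distance to the k nearest neighbours

threshold = np.quantile(scores, 0.95)                  # flag the top 5% of scores as outliers
outlier_mask = scores > threshold
print(outlier_mask.sum(), 'points flagged')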
Besides the KNN algorithm, the pyod library also implements a number of other detectors:
| # | Type | Abbr | Algorithm | Year |
|---|------|------|-----------|------|
| 0 | Probabilistic | ECOD | Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions | 2022 |
| 1 | Probabilistic | ABOD | Angle-Based Outlier Detection | 2008 |
| 2 | Probabilistic | FastABOD | Fast Angle-Based Outlier Detection using approximation | 2008 |
| 3 | Probabilistic | COPOD | COPOD: Copula-Based Outlier Detection | 2020 |
| 4 | Probabilistic | MAD | Median Absolute Deviation (MAD) | 1993 |
| 5 | Probabilistic | SOS | Stochastic Outlier Selection | 2012 |
| 6 | Probabilistic | KDE | Outlier Detection with Kernel Density Functions | 2007 |
| 7 | Probabilistic | Sampling | Rapid distance-based outlier detection via sampling | 2013 |
| 8 | Probabilistic | GMM | Probabilistic Mixture Modeling for Outlier Analysis | |
| 9 | Linear Model | PCA | Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes) | 2003 |
| 10 | Linear Model | MCD | Minimum Covariance Determinant (use the mahalanobis distances as the outlier scores) | 1999 |
| 11 | Linear Model | CD | Use Cook's distance for outlier detection | 1977 |
| 12 | Linear Model | OCSVM | One-Class Support Vector Machines | 2001 |
| 13 | Linear Model | LMDD | Deviation-based Outlier Detection (LMDD) | 1996 |
| 14 | Proximity-Based | LOF | Local Outlier Factor | 2000 |
| 15 | Proximity-Based | COF | Connectivity-Based Outlier Factor | 2002 |
| 16 | Proximity-Based | (Incremental) COF | Memory Efficient Connectivity-Based Outlier Factor (slower but reduce storage complexity) | 2002 |
| 17 | Proximity-Based | CBLOF | Clustering-Based Local Outlier Factor | 2003 |
| 18 | Proximity-Based | LOCI | LOCI: Fast outlier detection using the local correlation integral | 2003 |
| 19 | Proximity-Based | HBOS | Histogram-based Outlier Score | 2012 |
| 20 | Proximity-Based | kNN | k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score) | 2000 |
| 21 | Proximity-Based | AvgKNN | Average kNN (use the average distance to k nearest neighbors as the outlier score) | 2002 |
| 22 | Proximity-Based | MedKNN | Median kNN (use the median distance to k nearest neighbors as the outlier score) | 2002 |
| 23 | Proximity-Based | SOD | Subspace Outlier Detection | 2009 |
| 24 | Proximity-Based | ROD | Rotation-based Outlier Detection | 2020 |
| 25 | Outlier Ensembles | IForest | Isolation Forest | 2008 |
| 26 | Outlier Ensembles | INNE | Isolation-based Anomaly Detection Using Nearest-Neighbor Ensembles | 2018 |
| 27 | Outlier Ensembles | FB | Feature Bagging | 2005 |
| 28 | Outlier Ensembles | LSCP | LSCP: Locally Selective Combination of Parallel Outlier Ensembles | 2019 |
| 29 | Outlier Ensembles | XGBOD | Extreme Boosting Based Outlier Detection (Supervised) | 2018 |
| 30 | Outlier Ensembles | LODA | Lightweight On-line Detector of Anomalies | 2016 |
| 31 | Outlier Ensembles | SUOD | SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection (Acceleration) | 2021 |
| 32 | Neural Networks | AutoEncoder | Fully connected AutoEncoder (use reconstruction error as the outlier score) | |
| 33 | Neural Networks | VAE | Variational AutoEncoder (use reconstruction error as the outlier score) | 2013 |
| 34 | Neural Networks | Beta-VAE | Variational AutoEncoder (all customized loss term by varying gamma and capacity) | 2018 |
| 35 | Neural Networks | SO_GAAL | Single-Objective Generative Adversarial Active Learning | 2019 |
| 36 | Neural Networks | MO_GAAL | Multiple-Objective Generative Adversarial Active Learning | 2019 |
| 37 | Neural Networks | DeepSVDD | Deep One-Class Classification | 2018 |
| 38 | Neural Networks | AnoGAN | Anomaly Detection with Generative Adversarial Networks | 2017 |
| 39 | Graph-based | R-Graph | Outlier detection by R-graph | 2017 |
| 40 | Graph-based | LUNAR | LUNAR: Unifying Local Outlier Detection Methods via Graph Neural Networks | 2022 |
Here is the library's quick-start code (using the KNN model as an example):
from pyod.models.knn import KNN                # kNN detector
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

# train a kNN detector (X_train, X_test, y_train, y_test are assumed to be prepared already)
clf_name = 'KNN'
clf = KNN()
clf.fit(X_train)

# get the prediction labels and outlier scores of the training data
y_train_pred = clf.labels_                     # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_          # raw outlier scores

# get the prediction on the test data
y_test_pred = clf.predict(X_test)              # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test)  # outlier scores

# it is possible to get the prediction confidence as well
# outlier labels (0 or 1) and confidence in the range of [0, 1]
y_test_pred, y_test_pred_confidence = clf.predict(X_test, return_confidence=True)

# evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)

# visualize the results
visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
          y_test_pred, show_figure=True, save_figure=False)
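Every detector in the table shares this fit / labels_ / decision_scores_ / decision_function interface, so swapping algorithms is essentially a one-line change. A minimal sketch with Isolation Forest (the choice of IForest and the contamination value are illustrative, not prescribed by the library):

from pyod.models.iforest import IForest

iforest = IForest(contamination=0.1)                 # same constructor pattern as KNN()
iforest.fit(X_train)
iforest_scores = iforest.decision_function(X_test)   # raw outlier scores on the test data
iforest_labels = iforest.predict(X_test)             # binary labels (0: inlier, 1: outlier)

The fuller example below uses pyod's generate_data utility to build a synthetic two-dimensional dataset and walks through training, evaluating, and visualizing a KNN detector end to end.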
import numpy as np
import matplotlib.pyplot as plt
from pyod.models.knn import KNN
from pyod.utils.data import generate_data

outlier_fraction = 0.1   # proportion of outliers
n_train = 200            # number of training samples
n_test = 100             # number of test samples

# generate random 2-D data with the given contamination rate
X_train, X_test, y_train, y_test = generate_data(n_train=n_train, n_test=n_test, contamination=outlier_fraction)

# plot the data
feature_1_train = X_train[:, 0].reshape(-1, 1)
feature_2_train = X_train[:, 1].reshape(-1, 1)
feature_1_test = X_test[:, 0].reshape(-1, 1)
feature_2_test = X_test[:, 1].reshape(-1, 1)

# scatter plot of the training and test sets
plt.scatter(feature_1_train, feature_2_train)
plt.scatter(feature_1_test, feature_2_test)
plt.xlabel('feature_1')
plt.ylabel('feature_2')
plt.show()
knn = KNN(contamination=outlier_fraction)
knn.fit(X_train)
# prediction labels and outlier scores of the training data
y_train_pred = knn.labels_
y_train_scores = knn.decision_scores_
# prediction on the test data
y_test_pred = knn.predict(X_test)
y_test_scores = knn.decision_function(X_test)
# errors in test set
n_errors = (y_test_pred != y_test).sum()
print('No of Errors in test set: {}'.format(n_errors))
# accuracy in test set
print('Accuracy in test set: {}'.format((n_test-n_errors)/n_test))
No of Errors in test set: 0
Accuracy in test set: 1.0
from pyod.utils.example import visualize
visualize('KNN', X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)