Informally, an outlier (also called an anomaly) is an observation that lies far away from the other observations. This definition is vague, though, because it does not quantify how far "far" is. This article looks at outlier detection and handling techniques and at how outliers affect different kinds of machine learning models.

Many machine learning models, such as linear and logistic regression, are easily affected by outliers in the training data. Models like AdaBoost increase the weight of misclassified points at every iteration and may therefore assign very high weights to outliers, since outliers tend to be misclassified. If the outliers are the result of some kind of error, or if we want our model to generalize well without being dominated by extreme values, then it is worth detecting and handling them.

Common Methods for Outlier Detection¶

A common approach to outlier detection is to assume that the regular data come from a known distribution (for example, that the data are Gaussian). The quickest and simplest way to identify outliers is to visualize them graphically. If your dataset is not very large (roughly up to 10k observations and 100 features), it is strongly recommended that you build scatter plots and box plots of the variables. Even if there are no outliers, you will still gain other insights, such as correlations, variability, or the influence of external factors such as world wars/recessions on economic indicators. For high-dimensional data that is hard to visualize, however, this approach is not recommended.
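
For example, a scatter plot and a box plot can be produced in a few lines (a minimal sketch assuming pandas and matplotlib; the DataFrame df and its columns are made-up placeholders):

import pandas as pd
import matplotlib.pyplot as plt

# df is a hypothetical DataFrame with two numeric columns
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 100], 'y': [2, 4, 6, 8, 10, 12]})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(df['x'], df['y'])  # scatter plot: points far from the bulk stand out
ax1.set_xlabel('x')
ax1.set_ylabel('y')
df['x'].plot.box(ax=ax2)       # box plot: points beyond the whiskers are suspect
plt.tight_layout()
plt.show()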

The 3-$\sigma$ Rule¶

Assuming the data follow a normal distribution, the 3-$\sigma$ rule treats any point that lies more than three standard deviations from the mean as an outlier.

import numpy as np

def three_sigma(s):
    # treat values outside mean ± 3 standard deviations as outliers
    mu, std = np.mean(s), np.std(s)
    lower, upper = mu - 3*std, mu + 3*std
    return lower, upper
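
A quick usage sketch (the sample data below is made up for illustration):

rng = np.random.default_rng(0)
s = np.append(rng.normal(10, 1, 1000), 30.0)  # 1000 regular points plus one extreme value
lower, upper = three_sigma(s)
outliers = s[(s < lower) | (s > upper)]       # 30.0 lies well outside mean ± 3*std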

Box Plot¶

The box plot method identifies outliers based on the interquartile range (IQR).

def boxplot(s):
    # s is a pandas Series; values outside the returned bounds are outliers
    q1, q3 = s.quantile(.25), s.quantile(.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
    return lower, upper

We first determine the quartiles $Q_1$ and $Q_3$. $$IQR = Q_3 - Q_1$$ $$\text{upper bound} = Q_3 + 1.5 \times IQR$$ $$\text{lower bound} = Q_1 - 1.5 \times IQR$$ Any value below the lower bound or above the upper bound is considered an outlier.
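
A quick usage sketch with the boxplot function above (assuming pandas; the sample data is made up):

import pandas as pd

s = pd.Series([3, 4, 5, 5, 6, 6, 7, 7, 8, 40])  # 40 lies far above the rest
lower, upper = boxplot(s)                       # Q1=5, Q3=7, so the bounds are 2 and 10
outliers = s[(s < lower) | (s > upper)]         # only 40 is flagged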

KNN¶

The K-nearest-neighbors approach computes, for each sample, the average distance to its K nearest neighbors and compares that distance with a threshold; if the distance exceeds the threshold, the sample is treated as an outlier. Its advantage is that it makes no assumption about the data distribution. Its drawback is that it can only find global outliers, not local ones. (Here we use the pyod library.)
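
To make the idea concrete, here is a minimal sketch of the average-distance-to-K-neighbors score using scikit-learn (only an illustration of the mechanism, and the helper name is made up; the actual detection below uses pyod):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    # distance to the k nearest neighbors, excluding the point itself
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nbrs.kneighbors(X)
    return dist[:, 1:].mean(axis=1)  # average neighbor distance as the outlier score

X = np.vstack([np.random.randn(100, 2), [[8.0, 8.0]]])  # one obvious outlier at (8, 8)
scores = knn_outlier_scores(X)
threshold = np.percentile(scores, 99)  # e.g. flag the top 1% of scores
outliers = X[scores > threshold]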

Besides KNN, the pyod library contains a number of other algorithms, all sharing the same interface, so swapping detectors is straightforward (see the sketch after the table):

Type Abbr Algorithm Year
0 Probabilistic ECOD Unsupervised Outlier Detection Using Empirical Cumulative Distribution Functions 2022
1 Probabilistic ABOD Angle-Based Outlier Detection 2008
2 Probabilistic FastABOD Fast Angle-Based Outlier Detection using approximation 2008
3 Probabilistic COPOD COPOD: Copula-Based Outlier Detection 2020
4 Probabilistic MAD Median Absolute Deviation (MAD) 1993
5 Probabilistic SOS Stochastic Outlier Selection 2012
6 Probabilistic KDE Outlier Detection with Kernel Density Functions 2007
7 Probabilistic Sampling Rapid distance-based outlier detection via sampling 2013
8 Probabilistic GMM Probabilistic Mixture Modeling for Outlier Analysis nan
9 Linear Model PCA Principal Component Analysis (the sum of weighted projected distances to the eigenvector hyperplanes) 2003
10 Linear Model MCD Minimum Covariance Determinant (use the mahalanobis distances as the outlier scores) 1999
11 Linear Model CD Use Cook's distance for outlier detection 1977
12 Linear Model OCSVM One-Class Support Vector Machines 2001
13 Linear Model LMDD Deviation-based Outlier Detection (LMDD) 1996
14 Proximity-Based LOF Local Outlier Factor 2000
15 Proximity-Based COF Connectivity-Based Outlier Factor 2002
16 Proximity-Based (Incremental) COF Memory Efficient Connectivity-Based Outlier Factor (slower but reduce storage complexity) 2002
17 Proximity-Based CBLOF Clustering-Based Local Outlier Factor 2003
18 Proximity-Based LOCI LOCI: Fast outlier detection using the local correlation integral 2003
19 Proximity-Based HBOS Histogram-based Outlier Score 2012
20 Proximity-Based kNN k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score) 2000
21 Proximity-Based AvgKNN Average kNN (use the average distance to k nearest neighbors as the outlier score) 2002
22 Proximity-Based MedKNN Median kNN (use the median distance to k nearest neighbors as the outlier score) 2002
23 Proximity-Based SOD Subspace Outlier Detection 2009
24 Proximity-Based ROD Rotation-based Outlier Detection 2020
25 Outlier Ensembles IForest Isolation Forest 2008
26 Outlier Ensembles INNE Isolation-based Anomaly Detection Using Nearest-Neighbor Ensembles 2018
27 Outlier Ensembles FB Feature Bagging 2005
28 Outlier Ensembles LSCP LSCP: Locally Selective Combination of Parallel Outlier Ensembles 2019
29 Outlier Ensembles XGBOD Extreme Boosting Based Outlier Detection (Supervised) 2018
30 Outlier Ensembles LODA Lightweight On-line Detector of Anomalies 2016
31 Outlier Ensembles SUOD SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection (Acceleration) 2021
32 Neural Networks AutoEncoder Fully connected AutoEncoder (use reconstruction error as the outlier score) nan
33 Neural Networks VAE Variational AutoEncoder (use reconstruction error as the outlier score) 2013
34 Neural Networks Beta-VAE Variational AutoEncoder (all customized loss term by varying gamma and capacity) 2018
35 Neural Networks SO_GAAL Single-Objective Generative Adversarial Active Learning 2019
36 Neural Networks MO_GAAL Multiple-Objective Generative Adversarial Active Learning 2019
37 Neural Networks DeepSVDD Deep One-Class Classification 2018
38 Neural Networks AnoGAN Anomaly Detection with Generative Adversarial Networks 2017
39 Graph-based R-Graph Outlier detection by R-graph 2017
40 Graph-based LUNAR LUNAR: Unifying Local Outlier Detection Methods via Graph Neural Networks 2022
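
Because every detector in pyod exposes the same fit / decision_function / predict interface, switching algorithms only changes the import and the constructor. A minimal sketch using Isolation Forest instead of KNN (assuming X_train and X_test are defined as in the quick-start code below):

from pyod.models.iforest import IForest  # Isolation Forest detector in pyod

clf = IForest(contamination=0.1)  # expected fraction of outliers
clf.fit(X_train)
train_scores = clf.decision_scores_  # raw outlier scores on the training data
test_labels = clf.predict(X_test)    # 0: inlier, 1: outlier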

Below is quick-start code for the library (using the KNN model as an example):

  • Step 1: initialize and train the model
from pyod.models.knn import KNN   # kNN detector

# train kNN detector
clf_name = 'KNN'
clf = KNN()
clf.fit(X_train)

# get the prediction label and outlier scores of the training data
y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
y_train_scores = clf.decision_scores_  # raw outlier scores

# get the prediction on the test data
y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
y_test_scores = clf.decision_function(X_test)  # outlier scores

# it is possible to get the prediction confidence as well
y_test_pred, y_test_pred_confidence = clf.predict(X_test, return_confidence=True)  # outlier labels (0 or 1) and confidence in the range of [0,1]
  • Step 2: evaluate the model
from pyod.utils.data import evaluate_print

# evaluate and print the results
print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)
  • Step 3: visualize the results
from pyod.utils.example import visualize  # plotting helper shipped with pyod

visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
    y_test_pred, show_figure=True, save_figure=False)

References¶

  • https://mp.weixin.qq.com/s/Xw7YE6scG2avLvG4D1I_jA
  • https://github.com/yzhao062/pyod

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from pyod.models.knn import KNN
from pyod.utils.data import generate_data

outlier_fraction = 0.1  # fraction of outliers
n_train = 200  # number of training samples
n_test = 100  # number of test samples

X_train, X_test, y_train, y_test = generate_data(n_train=n_train, n_test=n_test, contamination=outlier_fraction)  # generate synthetic data
# plot the data
feature_1_train = X_train[:,0].reshape(-1,1)
feature_2_train = X_train[:,1].reshape(-1,1)
feature_1_test = X_test[:,0].reshape(-1,1)
feature_2_test = X_test[:,1].reshape(-1,1)
# scatter plot
plt.scatter(feature_1_train,feature_2_train)
plt.scatter(feature_1_test,feature_2_test)
plt.xlabel('feature_1')
plt.ylabel('feature_2')
Out[1]:
Text(0, 0.5, 'feature_2')
In [2]:
knn = KNN(contamination=outlier_fraction)
knn.fit(X_train)
# prediction labels and outlier scores of the training data
y_train_pred = knn.labels_  
y_train_scores = knn.decision_scores_ 
# prediction on the test data
y_test_pred = knn.predict(X_test)  
y_test_scores = knn.decision_function(X_test)
# errors in test set
n_errors = (y_test_pred != y_test).sum()
print('No of Errors in test set: {}'.format(n_errors))
# accuracy in test set
print('Accuracy in test set: {}'.format((n_test-n_errors)/n_test))
No of Errors in test set: 0
Accuracy in test set: 1.0
In [3]:
from pyod.utils import example
example.visualize(knn, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)