1 What is an outlier?
In machine learning, anomaly detection and handling is a relatively small branch, or rather a by-product, of machine learning. In general prediction problems, a model is a representation of the structure of the overall sample data and usually captures the general properties of that sample; the points whose behavior is completely inconsistent with those properties are called outliers. Outliers are usually unwelcome in prediction problems, because prediction is concerned with the nature of the overall sample while the mechanism that generates the outliers is inconsistent with it. If the algorithm is sensitive to outliers, the resulting model will not represent the overall sample well and the predictions will be inaccurate.
On the other hand, in certain scenarios anomalies are of great interest to analysts. For example, in disease prediction the body indicators of healthy people are usually similar along some dimensions; if a person's indicators are abnormal, their physical condition must have changed in some respect. Of course, this change is not necessarily caused by disease (such points are often called noise points), but the occurrence and detection of abnormalities is an important starting point for disease prediction. Similar scenarios apply to credit fraud, cyber attacks, and so on.
2 Detection methods for outliers
General outlier detection methods include statistics-based methods, clustering-based methods, and some methods specialized for detecting outliers; they are described in the relevant sections below.
1. Simple statistics
If we use pandas, we can call describe() directly to get a statistical summary of the data (just a cursory look at some statistics for the continuous variables), as follows:
df.describe()
Alternatively, the presence of outliers can be clearly observed by simply using a scatterplot. This is shown below:
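As a minimal sketch, assuming a pandas DataFrame df with the numeric column 'length' that is used later in this article, such a scatter plot can be drawn with matplotlib:

```python
import matplotlib.pyplot as plt

# Scatter the values against their row index; isolated points far from
# the main band are candidate outliers.
plt.figure(figsize=(10, 6))
plt.scatter(range(len(df)), df['length'])
plt.xlabel('index')
plt.ylabel('length')
plt.show()
```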
2. The 3σ principle
There is a precondition for this principle: the data needs to follow a normal distribution. Under the 3σ principle, a value can be considered an outlier if it lies more than 3 standard deviations from the mean. The probability of falling within ±3σ of the mean is 99.7%, so the probability that a value lies more than 3σ from the mean is P(|x-μ| > 3σ) ≤ 0.003, an extremely rare small-probability event. If the data does not follow a normal distribution, outliers can still be described by how many standard deviations they lie away from the mean.
(Figure: the red arrows point to the outliers.)
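A minimal sketch of the 3σ rule on the same df['length'] column (the column name is just the running example of this article):

```python
import numpy as np

values = df['length']
mean, std = values.mean(), values.std()

# Flag points lying more than 3 standard deviations from the mean
outliers = df[np.abs(values - mean) > 3 * std]
print(outliers)
```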
3. Box plots
This method uses the interquartile range (IQR) of a box plot to detect outliers, and is also called Tukey's test. The box plot rule is defined as follows:
The interquartile range (IQR) is the difference between the upper and lower quartiles, and 1.5 times the IQR is used as the criterion: points more than 1.5 × IQR above the upper quartile, or more than 1.5 × IQR below the lower quartile, are outliers. Here is the code implementation in Python, which mainly uses numpy's percentile method.
import numpy as np

# 0th, 25th, 50th, 75th and 100th percentiles of the 'length' column
Percentile = np.percentile(df['length'], [0, 25, 50, 75, 100])
IQR = Percentile[3] - Percentile[1]        # interquartile range
UpLimit = Percentile[3] + IQR * 1.5        # upper fence
DownLimit = Percentile[1] - IQR * 1.5      # lower fence
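The fences can then be used to flag the outlying rows, for example (a hypothetical follow-up line reusing the UpLimit and DownLimit computed above, not part of the original code):

```python
# Rows whose 'length' falls outside the 1.5 * IQR fences
outliers = df[(df['length'] > UpLimit) | (df['length'] < DownLimit)]
```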
This can also be done using seaborn's visualization method boxplot:
import matplotlib.pyplot as plt
import seaborn as sns

f, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(y='length', data=df, ax=ax)
plt.show()
(Figure: the red arrows point to the outliers.)
The above are simple ways of identifying outliers. The sections below introduce some more complex outlier detection algorithms; since the details go beyond the scope of this article, only the core ideas are given, and interested readers can research them in depth on their own.
4. Model-based detection
This approach generally builds a probability distribution model and computes the probability that an object conforms to the model, treating objects with low probability as outliers. If the model is a collection of clusters, the anomalies are objects that do not significantly belong to any cluster; if the model is a regression, the anomalies are objects relatively far from the predicted values.
Probabilistic definition of an outlier: an outlier is an object that has a low probability under a probability distribution model of the data. This presupposes knowing what distribution the data set follows; if the estimate is wrong, the result can be a heavy-tailed distribution.
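To make the probabilistic idea concrete, here is a minimal sketch that fits a normal distribution to the same df['length'] data and flags the lowest-density points; the Gaussian model and the 1% density cutoff are illustrative assumptions only:

```python
import numpy as np
from scipy import stats

values = df['length']
# Fit a normal distribution to the data (assumes the data is roughly Gaussian)
mu, sigma = stats.norm.fit(values)

# Probability density of each point under the fitted model;
# points with very low density are treated as outliers
density = stats.norm.pdf(values, loc=mu, scale=sigma)
threshold = np.quantile(density, 0.01)   # lowest 1% of densities, an arbitrary cutoff
outliers = df[density < threshold]
```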
For example, the RobustScaler method in feature engineering uses the quantile distribution of a feature when scaling its values: it divides the data into segments by quantile and scales using only the middle segment, for example only the data between the 25% and 75% quantiles. This reduces the impact of abnormal data.
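A minimal usage sketch of scikit-learn's RobustScaler, which by default centers on the median and scales by the 25%–75% interquartile range (the df['length'] column is just the running example of this article):

```python
from sklearn.preprocessing import RobustScaler

# Scale using the median and the 25%-75% interquartile range,
# so extreme values have little influence on the scaling parameters
scaler = RobustScaler(quantile_range=(25.0, 75.0))
scaled = scaler.fit_transform(df[['length']])
```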
Pros and Cons:
(1) Statistical methods have a solid theoretical foundation, and these tests can be very effective when there is sufficient data and knowledge of the type of test to apply;
(2) For multivariate data fewer options are available, and for high-dimensional data detection quality tends to be poor.
5. Proximity-based outlier detection
Statistical methods use the distribution of the data to spot outliers, and some even require the data to satisfy distributional conditions; in practice it is often hard to meet such assumptions, which limits their use.
It is easier to determine a meaningful proximity measure for a data set than to determine its statistical distribution. This approach is more general and easier to apply than the statistical approach, because an object's outlier score is given by the distance to its k nearest neighbors (KNN).
It is important to note that the outlier score is highly sensitive to the value of k. If k is too small, a small number of nearby outliers may lead to a low outlier score; if k is too large, all objects in a cluster with fewer than k points may become outliers. To make the scheme more robust to the choice of k, the average distance of the k nearest neighbors can be used, as in the sketch below.
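A minimal sketch of this scheme using scikit-learn's NearestNeighbors on the same df['length'] data; the value k = 5 and the "top 10" cut are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = df[['length']].values
k = 5  # illustrative choice; the score is sensitive to k

# Query k+1 neighbors because the nearest neighbor of each training point is itself
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nbrs.kneighbors(X)

# Average distance to the k nearest neighbors (excluding the point itself)
outlier_score = distances[:, 1:].mean(axis=1)
top_outliers = np.argsort(outlier_score)[-10:]   # indices of the 10 highest scores
```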
Pros and Cons:
(1) Simple;
(2) Disadvantages: proximity-based methods take O(m²) time and are not suitable for large datasets;
(3) The method is also sensitive to the choice of parameters;
(4) Cannot handle datasets with regions of varying density, because it uses a global threshold and cannot account for such variations in density.
6. Density-based outlier detection
From a density-based point of view, an outlier is an object in a low-density region. Density-based outlier detection is closely related to proximity-based outlier detection, because density is usually defined in terms of proximity. A common way to define density is as the reciprocal of the average distance to the k nearest neighbors: if that distance is small, the density is high, and vice versa. Another definition is the one used by the DBSCAN clustering algorithm, where the density around an object is the number of objects within a specified distance d of that object.
Pros and Cons:
(1) gives a quantitative measure that the object is an outlier and handles well even if the data has different regions;
(2) As with distance-based methods, these methods inevitably have O(m²) time complexity; O(m log m) can be achieved for low-dimensional data using specific data structures;
(3) Parameter selection is difficult. Although the LOF algorithm handles this by trying different values of k and taking the maximum outlier score, upper and lower bounds for these values still need to be chosen; a usage sketch follows below.
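For reference, a minimal usage sketch of scikit-learn's LocalOutlierFactor on the same df['length'] data; note that this implementation takes a single n_neighbors value (20 here, purely illustrative) rather than scanning a range of k as described above:

```python
from sklearn.neighbors import LocalOutlierFactor

# LOF compares each point's local density with that of its neighbors;
# fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(df[['length']])
lof_outliers = df[labels == -1]
```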
7. Clustering-based anomaly detection
Clustering-based definition of an outlier: an object is an outlier if it does not strongly belong to any cluster.
Effect of outliers on the initial clustering: if outliers are detected by clustering, there is a question of whether the resulting structure is valid, since the outliers themselves affect the clustering. This is also a disadvantage of the k-means algorithm, which is sensitive to outliers. To deal with this problem, the following approach can be used: cluster the objects, remove the outliers, and then cluster again (this is not guaranteed to produce optimal results); a sketch is given below.
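A minimal sketch of this idea with k-means on the same df['length'] data, using the distance to the assigned centroid as the outlier score; the number of clusters (3) and the 1% cutoff are illustrative choices only:

```python
import numpy as np
from sklearn.cluster import KMeans

X = df[['length']].values
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distance of each point to the centroid of its assigned cluster
centroids = kmeans.cluster_centers_[kmeans.labels_]
dist = np.linalg.norm(X - centroids, axis=1)

# Treat, say, the 1% of points farthest from their centroid as outliers,
# then recluster on the remaining data
mask = dist < np.quantile(dist, 0.99)
kmeans_clean = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[mask])
```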
Pros and Cons:
(1) Clustering techniques with linear or near-linear complexity (such as k-means) can be highly effective at discovering outliers;
(2) Clusters are usually defined as the complements of outliers, so it is possible to find both clusters and outliers;
(3) The resulting set of outlier points and their scores can be very dependent on the number of clusters used and the presence of outlier points in the data;
(4) The quality of the clusters produced by a clustering algorithm has a very strong influence on the quality of the outliers produced by that algorithm.
8. Specialized outlier detection
In fact, the clustering methods mentioned above were designed for unsupervised classification, not for finding outliers; outlier detection just happens to be something they enable, a kind of derived capability.
In addition to the methods mentioned above, there are two commonly used methods designed specifically for detecting anomalies: One-Class SVM and Isolation Forest. Their details are not studied in depth here.
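For completeness, a minimal usage sketch of the two scikit-learn estimators on the same df['length'] data; the contamination and nu values are illustrative guesses, not recommendations:

```python
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

X = df[['length']].values

# Both estimators label outliers as -1 and inliers as 1
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
iso_outliers = df[iso.predict(X) == -1]

ocsvm = OneClassSVM(nu=0.01, kernel='rbf', gamma='scale').fit(X)
svm_outliers = df[ocsvm.predict(X) == -1]
```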
3 Handling of outliers
When an outlier is detected, we need to handle it in some way. The general methods for handling outliers can be broadly categorized as follows (a small code sketch follows the list):
- Deleting records containing outliers: Delete records containing outliers directly;
- Considered as a missing value: The outliers are treated as missing values and are handled using the methods of missing value handling;
- Mean value correction: The outlier can be corrected by averaging the two observations before and after;
- Do not handle them: perform data mining directly on the data set containing the outliers.
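A minimal pandas sketch of the first three options, assuming the same df['length'] data; the 3σ mask is just one possible detection rule, and linear interpolation corresponds to averaging the neighboring observations when a single value is missing:

```python
import numpy as np

corrected = df.copy()
values = corrected['length']
mask = np.abs(values - values.mean()) > 3 * values.std()   # any detection rule works here

# Option 1: delete the records containing outliers
dropped = corrected[~mask]

# Options 2 and 3: treat outliers as missing, then correct them with the average of
# the neighboring observations (linear interpolation between the points before and after)
corrected.loc[mask, 'length'] = np.nan
corrected['length'] = corrected['length'].interpolate(method='linear')
```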
Whether to remove outliers can be decided according to the actual situation. Some models are not very sensitive to outliers, so even if outliers are present they do not affect the model's performance; but other models, such as logistic regression (LR), are very sensitive to outliers, and if they are not handled, very poor results such as overfitting may occur.
4 Summary of outliers
The above is a summary of outlier detection and handling methods.
Through the various detection methods we can find outliers, but the results are not absolutely correct; the specific situation still needs to be judged based on one's own understanding of the business. Similarly, how to handle outliers, whether to delete, correct, or leave them alone, also needs to be considered in the context of the actual situation; there is no fixed rule.
The above is based on personal experience only. I hope it can serve as a reference, and I hope you will continue to support me.