From financial fraud detection to healthcare insurance, anomaly detection is growing in significance as a technique for data analysis and alerting. Based on the assumption that similar units of data within a single dataset should be relatively homogeneous, anomaly detection makes it possible to identify deviant entries. Such entries may signal cyber-security intrusions, fraud attempts, insurance forgery, and other fraudulent or hazardous activities.
Thus, anomaly detection is now broadly used in a variety of industries to improve security. Here we explore the concept and meaning of an anomaly, examine anomaly detection methods and algorithms, and review industry use cases.
Let's start the review with the definition of 'anomaly'. An anomaly is a value that deviates from the norm considerably enough to be regarded as a rare exception. Systems that detect anomalies rely on the assumption that such outliers, or exceptions, stand apart from the bulk of the data in the dataset. Thus, detection presupposes first establishing patterns and then identifying the units that violate those patterns.
To appreciate the significance of anomaly detection in modern information systems and information security, one should first ask: what does it mean to be an anomaly? Overall, a standard system functions within some predetermined limits, with all incoming and outgoing data fitting a range of values typical for it. Thus, when individual deviations from these typical expectations are identified, they serve as red flags for security personnel and analysts.
Some areas in which anomaly detection is popular include:
In a nutshell, detecting anomalous data within a system suggests that something may have gone wrong. For instance, if buyers at an e-shop pay $100 on average for a pair of shoes and one client pays $1,000 for the same purchase, that is an anomaly; it may point to a problem with the client's bank or a glitch in the merchant's system. Similarly, if an insurer usually receives MRI bills of $300-400 from patients and suddenly gets a $550 bill for the same procedure, it should raise an alert about a potentially fraudulent transaction that requires closer investigation.
Another example is anomaly detection in a computer network. Anomalous traffic patterns there could be a sign of sensitive data leaking from a hacked computer. Anomalies in nerve signal transmission seen on an MRI scan may be a sign of a serious degenerative disease. Likewise, unusual purchases and cash withdrawals with a client's credit card may indicate that the card has been stolen.
Thus, anomaly detection is helpful in many industries and fields, helping specialists identify deviations from the norm, investigate them more closely, and determine their cause. The key to successful detection practices in any organization is to define 'anomalous' for its own datasets, to set specific detection signals for the analytical systems, and to feed back information about correct and incorrect anomaly labels so that the system can learn.
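The shoe-price example can be turned into a simple statistical check. The sketch below is only an illustration: the price list is made up, and the modified z-score (based on the median and the median absolute deviation) with a cutoff of 3.5 is one conventional choice assumed here, not a method prescribed by this article.

```python
from statistics import median


def modified_z_scores(values):
    """Return the modified z-score of each value, using median and MAD."""
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the values are too uniform for this check")
    # 0.6745 rescales MAD so the score is comparable to a standard z-score
    return [0.6745 * (v - med) / mad for v in values]


# Hypothetical shoe prices in dollars; the last one mirrors the $1,000 purchase
prices = [95, 102, 99, 110, 98, 105, 101, 1000]
scores = modified_z_scores(prices)

# Cutoff of 3.5 is an assumed, commonly used rule of thumb
flagged = [p for p, s in zip(prices, scores) if abs(s) > 3.5]
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value such as $1,000 would otherwise inflate the very statistics used to judge it.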
As anomalies in information systems most often suggest security breaches or violations, anomaly detection has been applied in a variety of industries to advance IT safety and detect potential abuse or attacks. Here are a couple of use cases showing how anomaly detection is applied.
By detecting anomalies early, anomaly detection can help to prevent system failures. This can save businesses money and prevent loss of data.
The choice of anomaly detection approach depends on how much labeled data is available for training and testing. Three setups are commonly distinguished:
Unsupervised anomaly detection is the most flexible of the three: it presents no labels to the system and draws no distinction between the training and test datasets. The system scores data based only on the characteristics of its units, without any predetermined notion of what is normal.
Anomaly detection approaches are most often contrasted along two main categories, supervised and unsupervised, with semi-supervised setups sitting between them.
Supervised anomaly detection uses a labeled dataset to train a model that can identify anomalies. The labeled dataset contains both normal and anomalous data points, and the model learns to distinguish between the two. This type of anomaly detection is typically more accurate than unsupervised anomaly detection, but it requires a labeled dataset, which can be difficult to obtain.
Unsupervised anomaly detection does not use a labeled dataset. Instead, it uses statistical methods to identify data points that deviate from the norm. This type of anomaly detection is typically less accurate than supervised anomaly detection, but it does not require a labeled dataset.
Local anomalies and micro-clusters of deviant data can be difficult to identify with both supervised and unsupervised anomaly detection. These anomalies are not as different from the rest of the dataset as global anomalies, so they may not be detected by statistical methods.
A more sensitive way to detect anomalies is to score them in terms of anomaly intensity. This means assigning each data point a score that indicates how anomalous it is. This approach can be used to identify both global and local anomalies.
As for supervised anomaly detection, decision trees (like C4.5) tend to handle imbalanced data poorly, and Isolation Forest, although also tree-based, is an unsupervised method rather than a supervised one. So, for supervised setups, Support Vector Machines and Artificial Neural Networks are preferable. Semi-supervised anomaly detection setups work well with One-class SVMs and autoencoders. Other helpful algorithms include Gaussian Mixture Models and Kernel Density Estimation.
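To illustrate how a Gaussian Mixture Model can serve in a semi-supervised setup, here is a minimal sketch using scikit-learn's GaussianMixture. It assumes the training set contains only normal data; the synthetic data, the two components, and the 1st-percentile threshold are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)

# Synthetic "normal-only" training data drawn from two groups (assumed)
normal_train = np.vstack([
    rng.normal([0.0, 0.0], 1.0, size=(200, 2)),
    rng.normal([6.0, 0.0], 1.0, size=(200, 2)),
])

gmm = GaussianMixture(n_components=2, random_state=0).fit(normal_train)

# score_samples returns the log-likelihood of each point under the model.
# Assumed rule: anything less likely than the 1st percentile of the
# training data is treated as anomalous.
threshold = np.percentile(gmm.score_samples(normal_train), 1)

new_points = np.array([[0.2, 0.1],    # close to one of the normal groups
                       [3.0, 9.0]])   # far from both groups
is_anomaly = gmm.score_samples(new_points) < threshold
print(is_anomaly)
```

The percentile controls how strict the detector is; in practice it would be tuned against whatever labeled feedback is available.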
Isolation Forest is one of the ML algorithms used for unsupervised anomaly detection with anomaly scoring. This method is flexible in that it does not label units as normal or anomalous outright but assigns an anomaly score to each of them instead. As a tree-based method, it isolates points with random splits, and points that can be isolated in fewer splits receive more anomalous scores; the outlier/non-outlier classification is then derived from those scores, and the regions where outliers fall can be visualized. Other popular unsupervised algorithms include K-means, autoencoders, GMMs, PCA, and hypothesis-test-based analysis.
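The scoring behavior of Isolation Forest can be sketched with scikit-learn's IsolationForest. The data below is synthetic and the parameter values are assumptions picked for illustration, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 200 synthetic "normal" points plus two planted outliers (assumed data)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([normal, outliers])

model = IsolationForest(n_estimators=100, contamination="auto", random_state=0)
model.fit(X)

# score_samples: the lower the score, the more anomalous the point
scores = model.score_samples(X)
# predict: -1 for outliers, 1 for inliers, derived from the scores
labels = model.predict(X)

# Indices of the two most anomalous points according to the scores
lowest_two = {int(i) for i in np.argsort(scores)[:2]}
print(lowest_two)
```

Working with the raw scores rather than the -1/1 labels makes it possible to rank points and choose a cutoff that fits the dataset.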
Machine learning (ML), an area of artificial intelligence (AI), has proven highly helpful for improving anomaly detection accuracy and helping companies and organizations manage big data. The ability of ML systems to learn from experience, refining their analytical and predictive capacity on their own, is a valuable feature for accurate anomaly detection.
So, what is the advantage of anomaly detection enriched with ML technology? The first benefit is the ML system's ability to handle unlabeled and unstructured data, determining what is normal and what may be regarded as a data anomaly. Second, ML systems are better at distinguishing data anomalies from noise, which allows them to differentiate data units by the degree of their deviation from the norm. The most common ML-based approaches to anomaly detection used today are:
The density-based approach uses the k-nearest neighbor (k-NN) algorithm, a simple, non-parametric lazy learning technique. Data points are assessed by their distance to their nearest neighbors, measured with a metric such as Euclidean, Manhattan, Minkowski, or Hamming distance. The density of data is established based on the reachability distance, and the local outlier factor (LOF) is applied to label data as abnormal or normal.
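A minimal sketch of the local outlier factor using scikit-learn's LocalOutlierFactor follows. The data is synthetic, and the choice of 20 neighbors with Euclidean distance (Minkowski with p=2) is an assumption for the example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)

# 100 synthetic points in one cluster plus a single distant point (assumed)
cluster = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
far_point = np.array([[6.0, 6.0]])
X = np.vstack([cluster, far_point])

# metric="minkowski" with p=2 is Euclidean distance; p=1 would be Manhattan
lof = LocalOutlierFactor(n_neighbors=20, metric="minkowski", p=2)

# fit_predict: -1 for points labeled abnormal, 1 for normal
labels = lof.fit_predict(X)

# negative_outlier_factor_: the more negative, the lower the local density
# of a point relative to its neighbors
most_anomalous = int(np.argmin(lof.negative_outlier_factor_))
print(most_anomalous)
```

Because LOF compares each point's density with that of its neighbors, it can also pick up local anomalies that a single global threshold would miss.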
Clustering is a typical approach in the area of unsupervised learning. Here, the system groups data points with the help of the K-means algorithm, and points whose distance to their cluster center is much larger than the average distance within that cluster are labeled as anomalous.
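The clustering idea can be sketched with scikit-learn's KMeans by measuring each point's distance to its own cluster center. The data is synthetic, and the cutoff of mean plus three standard deviations is an assumed rule of thumb, since the text does not fix a specific threshold.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Two tight synthetic clusters and one stray point between them (assumed)
cluster_a = rng.normal([0.0, 0.0], 0.5, size=(100, 2))
cluster_b = rng.normal([10.0, 10.0], 0.5, size=(100, 2))
stray = np.array([[5.0, 5.0]])
X = np.vstack([cluster_a, cluster_b, stray])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance from every point to the center of the cluster it was assigned to
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Assumed cutoff: mean distance plus three standard deviations
threshold = dist.mean() + 3 * dist.std()
anomalies = np.where(dist > threshold)[0]
print(anomalies)
```

A stricter or looser cutoff changes how many borderline points get flagged, so the multiplier is something to tune per dataset.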
A support vector machine (SVM) learns a soft boundary around the training data and treats all data falling within that boundary as normal. Units falling outside the boundary are labeled as abnormal.
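The soft-boundary idea corresponds to a one-class SVM, sketched here with scikit-learn's OneClassSVM. The training data is synthetic and assumed to contain only normal points; the RBF kernel and nu=0.05 are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)

# Synthetic training data, assumed to represent normal behavior only
train = rng.normal(loc=0.0, scale=1.0, size=(300, 2))

# nu is an upper bound on the fraction of training points allowed to
# fall outside the learned boundary
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train)

# predict: 1 for points inside the boundary, -1 for points outside it
new_points = np.array([[0.1, -0.2],   # near the bulk of the training data
                       [6.0, 6.0]])   # far away from it
pred = model.predict(new_points)

# Share of the training data that the model itself considers normal
train_inlier_share = float((model.predict(train) == 1).mean())
print(pred, train_inlier_share)
```

Raising nu tightens the boundary and flags more points, while lowering it makes the detector more permissive.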
With anomaly detection methods able to give a competitive edge to any business, Datrics offers numerous setups suiting a variety of goals and dealing with different datasets. You can customize an anomaly detection product from Datrics depending on your business needs and characteristics of your data. Take advantage of ML technology to get a better understanding of your data, to enhance security protection, and to inform your anomaly-related decisions.