Anomaly detection | Sort and Rank techniques

General StatisticsLeave a Comment on Anomaly detection | Sort and Rank techniques

Anomaly detection | Sort and Rank techniques

In most cases, there are complex and specific algorithms for abnormal or anomaly detection that have been implemented through various software in statistical or data analysis. However, in this article from the Arman Computer magazine, we aim to discuss the application of sorting and identifying abnormal observations, which, in addition to being simple, is also efficient and reliable. we will discuss sort or order tools in this article of armancomputer magazin.


Sorting and anomaly detection

As you know, sorting is a tool in a timeline when we want to quickly search for values in one of the table’s columns. On the other hand, sorting also allows for the categorization of observations. This means that observations with the same values in one column will be placed in consecutive rows, creating categories or groups. Now, suppose that our 2D observations consist of two columns or variables representing the amount of received salary and the base salary. We want to know if there are any individuals in our employees list whose received salary does not correspond to the base salary. The presence of such individuals is likely to indicate a problem in the data and may lead us to irregular payments or anomaly detection.

Ranking and Sorting for anomaly detection

If, after sorting a table based on a column, we assign a row number for each observation, this number can be considered the rank of the observations. Thus, ranking allows us to create a unique number for each individual based on their position in the sorted table. It is important to note that selecting or specifying a column as the basis for sorting the rows of the table is mandatory.

For example, the employee table can be sorted based on employee number or in alphabetical order by last name. Changing the column related to sorting will alter the order of the rows and consequently the rank of the observations.

With the explanations provided, we expect that someone who has a rank of 10 in the table sorted by received salary will also have a rank similar to 10 or around that in the table sorted by base salary. This is because we have experienced that there is a relationship between received salary and base salary. The higher the base salary, the higher the received salary will be, and vice versa.

The anomaly detection with rank

A table of employee salary information has been provided to us, which we have implemented in Excel as shown in the image below.

In columns D and E, the employees’ rankings based on base salary and received salary have been specified. We expect that the difference in these rankings should not exceed 2 or 3 units. Except for Mr/MrS. “F,” whose difference exceeds 3 units, the other rankings are proportional to each other.

Therefore, it seems that this individual should have been reviewed, and perhaps their overpayment is related to a loan they received. In any cases, from an auditor’s perspective, this matter appears suspicious.

Note: The formula related to the RANK function in Excel can be seen in the toolbar of the image above.

Calculation of the Correlation Coefficient between Ranks

Another method for identifying outlier data among observations is by using the Correlation Coefficient between ranks. However, if we directly calculate the Pearson correlation coefficient between the ranks, such an index is obtained. Yet, we can derive the Spearman correlation coefficient based on the actual salary values, which is the same as the correlation coefficient between ranks.

According to the data and the Excel worksheet, the Pearson correlation coefficient between ranks is 0.7, indicating a relatively good linear relationship between the two variables. However, if there were a precise linear relationship between these two variables, we would expect this index to be greater than 0.9. Therefore, we become suspicious of the existence of an outlier observation and attempt to search for the outlier using the previous method.

Loading

Leave a Reply

Your email address will not be published. Required fields are marked *

Back To Top