Checking for outliers in a numeric column


An outlier is an observation that is numerically distant from the rest of the data. While the presence of outliers often indicates an error in the measurements, a small proportion of outliers must be expected and they should not be discarded automatically, but need to be reviewed for validity.

There are four queries involved in getting the outliers. The first two queries adult_bmi and summary_bmi are discussed separately as they are more generally applicable. The queries deviation_bmi and outlier_bmi are discussed here. The query deviation_bmi calculates for each records the deviation of the body mass index from a group mean and expresses the deviation in units of standard deviations. The query outlier_bmi takes the results from the query deviation_bmi as input and performs a simple sub-setting and sorting.

EXAMPLE DATABASE: Nhanes

DEVIATION QUERY

In the deviation_bmi query the deviation from the group mean is calculated and the results are placed in an additional column named dev.


DATA INPUT

The query adult_bmi provides the bmi value per respondent that is needed for the computation of the deviation.



The query summary_bmi provides the corresponding group average (avg_bmi) and standard deviation (std_bmi) that are needed for the computation of the deviation.



QUERY deviation_bmi

SELECT adult_bmi.rseqn, adult_bmi.fseqn, adult_bmi.ethn, 
       adult_bmi.sex, adult_bmi.age, adult_bmi.ageunit, 
       adult_bmi.weight, adult_bmi.height, adult_bmi.bmi, 
       summary_bmi.avg_bmi, summary_bmi.std_bmi, 
       ROUND((adult_bmi.bmi - summary_bmi.avg_bmi)/summary_bmi.std_bmi, 1) AS dev
FROM adult_bmi, summary_bmi
WHERE adult_bmi.sex = summary_bmi.sex AND adult_bmi.ethn = summary_bmi.ethn;

The deviation_bmi query joins fields from the adult_bmi and summary_bmi queries and computes the deviation of the body mass index from the corresponding group mean, expressed in units of standard deviation. The query results in a total number of 17,030 records.



DISCUSSION

Joining tables is discussed in detail here and creating new columns here. In this query the body mass index from the query adult_bmi (adult_bmi.bmi) needs to be joined with the corresponding record containing the group average and standard deviation from the query summary_bmi (summary_bmi.avg_bmi, summary_bmi.std_bmi). The correspondence is by sex and ethnic group, as this is the basis for the computation of the group averages and standard deviations in the query summary_bmi. The correct correspondence is determined in the WHERE condition of the query.

1. the values in the fields sex from adult_bmi and summary_bmi must be the same, and
2. the values in the fields ethn from adult_bmi and summary_bmi must also be the same.

The expression (adult_bmi.bmi - summary_bmi.avg_bmi) / summary_bmi.std_bmi takes the difference between a respondents' body mass index (adult_bmi.bmi) and the corresponding group mean (summary_bmi.avg_bmi), and divides it by the corresponding group standard deviation (summary_bmi.std_bmi). This measure of deviation is rounded off to 1 decimal place and stored in the dev column. For a normal distribution, this measure has the following properties:

Source: Wikipedia
Enlarge
Source: Wikipedia

For the normal distribution, the dark blue area is less than one standard deviation from the mean. This accounts for about 68% of the set of values. Two standard deviations from the mean (medium and dark blue) account for about 95% of the values and three standard deviations (light, medium, and dark blue) account for about 99.7% of the values.

OUTLIER QUERY

The outlier_bmi query creates an ordered subset of the records returned by the deviation_bmi query.


DATA INPUT

The data input is from the deviation_bmi query. The dev field is used for creating the subset and the ordering.



QUERY outlier_bmi

SELECT *
FROM deviation_bmi
WHERE ABS(dev >= 3)
ORDER BY ABS(dev) DESC;

The outlier_bmi query returns all respondent records where the body mass index is 3 or more standard deviations away from the group mean, and the output is sorted in descending order of deviation. According to the example threshold, 208 records out of the 17,030 records are considered outliers.



DISCUSSION

This is a simple query that sets a threshold for the absolute value of the deviation and orders the records by the magnitude of deviation, showing the records with the largest deviation on top. The threshold condition can easily be changed in magnitude, or restricted to only positive or negative deviations. The above output shows that all deviations returned as outliers are positive, which is an indication that the body mass index values are not normally distributed. Nevertheless, this query can be helpful in spotting data errors. There is a noticeable delay in the execution of this query, as it depends on three other queries that need to be executed first in the background.


Last modified September 23, 2008 9:29 am