|
Checking for outliers in a numeric column
An outlier is an observation that is numerically distant from the rest of the data. While the presence of outliers often indicates an error in the measurements, a small proportion of outliers must be expected and they should not be discarded automatically, but need to be reviewed for validity. There are four queries involved in getting the outliers. The first two queries adult_bmi and summary_bmi are discussed separately as they are more generally applicable. The queries deviation_bmi and outlier_bmi are discussed here. The query deviation_bmi calculates for each records the deviation of the body mass index from a group mean and expresses the deviation in units of standard deviations. The query outlier_bmi takes the results from the query deviation_bmi as input and performs a simple sub-setting and sorting. EXAMPLE DATABASE: NhanesDEVIATION QUERYIn the deviation_bmi query the deviation from the group mean is calculated and the results are placed in an additional column named dev.
DATA INPUTThe query adult_bmi provides the bmi value per respondent that is needed for the computation of the deviation.
QUERY deviation_bmiSELECT adult_bmi.rseqn, adult_bmi.fseqn, adult_bmi.ethn,
adult_bmi.sex, adult_bmi.age, adult_bmi.ageunit,
adult_bmi.weight, adult_bmi.height, adult_bmi.bmi,
summary_bmi.avg_bmi, summary_bmi.std_bmi,
ROUND((adult_bmi.bmi - summary_bmi.avg_bmi)/summary_bmi.std_bmi, 1) AS dev
FROM adult_bmi, summary_bmi
WHERE adult_bmi.sex = summary_bmi.sex AND adult_bmi.ethn = summary_bmi.ethn;
The deviation_bmi query joins fields from the adult_bmi and summary_bmi queries and computes the deviation of the body mass index from the corresponding group mean, expressed in units of standard deviation. The query results in a total number of 17,030 records.
DISCUSSIONJoining tables is discussed in detail here and creating new columns here. In this query the body mass index from the query adult_bmi (adult_bmi.bmi) needs to be joined with the corresponding record containing the group average and standard deviation from the query summary_bmi (summary_bmi.avg_bmi, summary_bmi.std_bmi). The correspondence is by sex and ethnic group, as this is the basis for the computation of the group averages and standard deviations in the query summary_bmi. The correct correspondence is determined in the WHERE condition of the query.
The expression (adult_bmi.bmi - summary_bmi.avg_bmi) / summary_bmi.std_bmi takes the difference between a respondents' body mass index (adult_bmi.bmi) and the corresponding group mean (summary_bmi.avg_bmi), and divides it by the corresponding group standard deviation (summary_bmi.std_bmi). This measure of deviation is rounded off to 1 decimal place and stored in the dev column. For a normal distribution, this measure has the following properties:
OUTLIER QUERYThe outlier_bmi query creates an ordered subset of the records returned by the deviation_bmi query.
DATA INPUTThe data input is from the deviation_bmi query. The dev field is used for creating the subset and the ordering.
QUERY outlier_bmiSELECT * FROM deviation_bmi WHERE ABS(dev >= 3) ORDER BY ABS(dev) DESC; The outlier_bmi query returns all respondent records where the body mass index is 3 or more standard deviations away from the group mean, and the output is sorted in descending order of deviation. According to the example threshold, 208 records out of the 17,030 records are considered outliers.
DISCUSSIONThis is a simple query that sets a threshold for the absolute value of the deviation and orders the records by the magnitude of deviation, showing the records with the largest deviation on top. The threshold condition can easily be changed in magnitude, or restricted to only positive or negative deviations. The above output shows that all deviations returned as outliers are positive, which is an indication that the body mass index values are not normally distributed. Nevertheless, this query can be helpful in spotting data errors. There is a noticeable delay in the execution of this query, as it depends on three other queries that need to be executed first in the background.
Last modified September 23, 2008 9:29 am
|