Using an additional table


Frequently the data stored in a numeric column need to be categorized into a number of discreet classes.

In the following example the numeric column bmi (body mass index) will be categorized into the classes underweight, normal, overweight, and obese, according to the terminology and definitions of the CDC for categorizing the body mass index of adults BMI.


EXAMPLE DATABASE: Nhanes

DATA INPUT

Query adult_bmi in the database Nhanes.mdb is used as input and contains the continuous values for body mass index.



Table bmi_cat_cdc in the database Nhanes.mdb is used a input to define the categories and category labels for the four categories of body mass index. Note that biologically impossible values are given as the lower limit of the underweight class and as the upper limit of the obese class.



QUERY categorize1

SELECT adult_bmi.rseqn, adult_bmi.fseqn, adult_bmi.ethn, 
       adult_bmi.sex, adult_bmi.age, adult_bmi.ageunit, 
       adult_bmi.weight, adult_bmi.height, adult_bmi.bmi, 
       bmi_cat_cdc.status
FROM adult_bmi, bmi_cat_cdc
WHERE adult_bmi.bmi >= bmi_cat_cdc.from_bmi AND
      adult_bmi.bmi < bmi_cat_cdc.to_bmi;

Query categorize1 produces a table that applies the categorization to the continuous bmi values and adds the column status.



DISCUSSION

This query joins two table not on the usual equality of column values in the two tables, but on the result of arithmetic comparisons.

Note that the categories as defined in the table bmi_cat_cdc have shared endpoints. This possible ambiguity in classification is avoided by using the inclusive comparison operator (>=) for the lower endpoint, and the exclusive comparison operator (<) for the upper endpoint in the query. Care must be taken that the categories are not overlapping, as this might result in some records appearing multiple times in different classes.

Note that the number of records resulting from the query categorize1 are 61 less that from the query adult_bmi. This is due to those records having missing bmi values, which results in the records being excluded by this type of query.


A more complex example for categorizing body mass index can be found here. There are now six classes and they are defined differently for men and women. The new categories are captured in the table bmi_cat_iupui as shown below:



The following query categorize2 now links the two table based on the equality of the values in the column sex in both tables as well as the arithmetic comparisons of body mass index as introduced above.

SELECT adult_bmi.rseqn, adult_bmi.fseqn, adult_bmi.ethn, 
       adult_bmi.sex, adult_bmi.age, adult_bmi.ageunit, 
       adult_bmi.weight, adult_bmi.height, adult_bmi.bmi, 
       bmi_cat_iupui.status
FROM adult_bmi, bmi_cat_iupui
WHERE adult_bmi.sex = bmi_cat_iupui.sex AND
      adult_bmi.bmi >= bmi_cat_iupui.from_bmi AND
      adult_bmi.bmi < bmi_cat_iupui.to_bmi;

The advantage of using this solution for categorizing continuous variables is that both the additional table as well as the query are not very complex.


Last modified September 23, 2008 8:45 am