Changing parallel to serial data representation


In the parallel representation of data, some factors are implied by using multiple columns for the same trait variable. The factor levels are either implied by the column position, or encoded in the column name. In the example below, the factor replicate is encoded in the column names of the traits variables yield (yld) and dry matter (dm) and is also reflected in their column positions. The conversion from parallel to serial representation needs to create new columns for implied factors and extract the implied factor levels from either column names or column positions. By adding explicit factors and additional rows of data, parallel columns for the same trait variable can be reduced to a single column for each trait variable.

In the following example the factor replicate with two levels is implied for two trait variables (yld, dm) by using two columns for each of the trait variables. The factor levels can be derived from the respective column names of the trait variables.


EXAMPLE DATABASE: Training1

DATA INPUT

Table parallel, with two factors (plot, date) having two levels each, trait yield measured twice (yld1, yld2), and trait dry matter measured twice (dm1, dm2). There are 16 (4 x 4) trait data points, one of them missing.



QUERY parallel2serial

SELECT plot, date, 1 AS rep, yld1 AS yld, dm1 AS dm 
FROM parallel

UNION ALL

SELECT plot, date, 2 AS rep, yld2 AS yld, dm2 AS dm
FROM parallel

ORDER BY date, plot, rep;

Query parallel2serial produces a table with three factors (plot, date, rep) and two traits (yld, dm) with one measurement each. The number of trait data points has remained the same (16), but in a serial layout (2 x 8).



DISCUSSION

This is a UNION query that appends the records of one query to the records of another query. Both queries have to return the same number of columns of the same type, but not necessarily the same number of rows.

The first query retrieves from table parallel the columns plot and date, generates the new column rep to which the value 1 is assigned, retrieves the column yld1 and renames it to yld, and retrieves the column dm1 and renames it to dm.

The second query retrieves from table parallel the columns plot and date, generates the new column rep to which the value 2 is assigned, retrieves the column yld2 and renames it to yld, and retrieves the column dm2 and renames it to dm.

The records resulting of the second query are appended to the records resulting from the first query by the UNION statement. All records are then sorted according to the ORDER BY statement.

You can run both queries separately to verify that they return columns that match by name and type.

The query can easily be extended to more factors (plot, date) with more levels (2 x 2) and to more traits (yld, dm). If the query is extended to more replicate observations for each trait, each replicate requires a separate query and all queries need to be joined by UNION statements. Below is the outline of a query for three replicate observations, requiring three separate queries, joined by two UNION statements.

SELECT ...
UNION ALL
SELECT ...
UNION ALL
SELECT ...
ORDER BY ...

Note that missing values are converted from parallel to serial representation.


Last modified September 23, 2008 8:06 am