Equivalence of Tidy Data and Third Normal Form

Question

In Hadley Wickham's "Tidy Data" paper, he states that

In tidy data:

Each variable forms a column.
  Each observation forms a row.
  Each type of observational unit forms a table.

This is Codd’s 3rd normal form (Codd 1990), but with the constraints
  framed in statistical language, and the focus put on a single dataset
  rather than the many connected datasets common in relational
  databases.

Codd's Third Normal Form can be described as:

A table is in 1st normal form if

It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.
Each column contains atomic values, and there are no repeating groups of columns.

A table is in 2nd normal form if

The table is in 1st normal form, and     
All the non-key columns are dependent on the table’s primary key.

A table is in 3rd normal form if

It is in 2nd normal form, and 
It contains only columns that are non-transitively dependent on the primary key

I am trying to understand how these two sets of rules are equivalent.

I believe that the first Tidy Data rule maps to 1st normal form.  Specifically, "Each variable forms a column" maps to "Each column contains atomic values, and there are no repeating groups of columns".

I believe the Tidy Data rule "Each observation forms a row" maps to "All the non-key columns are dependent on the table’s primary key" and also "It contains only columns that are non-transitively dependent on the primary key".

I suspect that "Each type of observational unit forms a table" maps to "The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row", but I'm least sure about this mapping.

I believe my above analysis is not (quite) correct, so I'm hoping someone can clarify the link between these two sets of rules.

My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?

soundofsilence · Answer

I believe it goes a like this, based on thinking about this myself:

Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.

The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:

The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.

Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.

I think someone could probably do a more thorough proof with more time.

HTH

Equivalence of Tidy Data and Third Normal Form

One Answer

Add your own answers!

Ask a Question