AnswerBun.com

Equivalence of Tidy Data and Third Normal Form

In Hadley Wickham’s “Tidy Data” paper, he states that

In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

This is Codd’s 3rd normal form (Codd 1990), but with the constraints
framed in statistical language, and the focus put on a single dataset
rather than the many connected datasets common in relational
databases.

Codd’s Third Normal Form can be described as:

  1. A table is in 1st normal form if

    • It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.
    • Each column contains atomic values, and there are no repeating groups of columns.
  2. A table is in 2nd normal form if

    • The table is in 1st normal form, and
    • All the non-key columns are dependent on the table’s primary key.
  3. A table is in 3rd normal form if

    • It is in 2nd normal form, and
    • It contains only columns that are non-transitively dependent on the primary key

I am trying to understand how these two sets of rules are equivalent.

I believe that the first Tidy Data rule maps to 1st normal form. Specifically, “Each variable forms a column” maps to “Each column contains atomic values, and there are no repeating groups of columns”.

I believe the Tidy Data rule “Each observation forms a row” maps to “All the non-key columns are dependent on the table’s primary key” and also “It contains only columns that are non-transitively dependent on the primary key”.

I suspect that “Each type of observational unit forms a table” maps to “The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row”, but I’m least sure about this mapping.

I believe my above analysis is not (quite) correct, so I’m hoping someone can clarify the link between these two sets of rules.

My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?

Data Science Asked by Greg Thatcher on January 2, 2021

1 Answers

One Answer

I believe it goes a like this, based on thinking about this myself:

Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.

The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:

The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.

Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.

I think someone could probably do a more thorough proof with more time.

HTH

Answered by soundofsilence on January 2, 2021

Add your own answers!

Related Questions

feature importance with CNN

0  Asked on February 9, 2021 by user110020

   

What is the defining Set in NLP

1  Asked on February 9, 2021

   

How to use a dataset with only one category of data

3  Asked on February 8, 2021 by finn-williams

   

Ask a Question

Get help from others!

© 2022 AnswerBun.com. All rights reserved.