# Equivalence of Tidy Data and Third Normal Form

In Hadley Wickham’s “Tidy Data” paper, he states that

In tidy data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

This is Codd’s 3rd normal form (Codd 1990), but with the constraints
framed in statistical language, and the focus put on a single dataset
rather than the many connected datasets common in relational
databases.

Codd’s Third Normal Form can be described as:

1. A table is in 1st normal form if

• It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.
• Each column contains atomic values, and there are no repeating groups of columns.
2. A table is in 2nd normal form if

• The table is in 1st normal form, and
• All the non-key columns are dependent on the table’s primary key.
3. A table is in 3rd normal form if

• It is in 2nd normal form, and
• It contains only columns that are non-transitively dependent on the primary key

I am trying to understand how these two sets of rules are equivalent.

I believe that the first Tidy Data rule maps to 1st normal form. Specifically, “Each variable forms a column” maps to “Each column contains atomic values, and there are no repeating groups of columns”.

I believe the Tidy Data rule “Each observation forms a row” maps to “All the non-key columns are dependent on the table’s primary key” and also “It contains only columns that are non-transitively dependent on the primary key”.

I suspect that “Each type of observational unit forms a table” maps to “The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row”, but I’m least sure about this mapping.

I believe my above analysis is not (quite) correct, so I’m hoping someone can clarify the link between these two sets of rules.

My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?

Data Science Asked by Greg Thatcher on January 2, 2021

Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.

The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:

The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.

Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.

I think someone could probably do a more thorough proof with more time.

HTH

Answered by soundofsilence on January 2, 2021

## Related Questions

### How will my CNN results be affected by large discrepancies between the number of samples in some of the classes?

0  Asked on February 10, 2021 by jamil-zaitouny

### Train NN for balanced accuracy

0  Asked on February 10, 2021 by karu

### Horizontal Bar Char Issues in Tableau with Excel Spreadsheet Beginner Visualization Question

0  Asked on February 10, 2021 by ruben-gavidia

### Should Principal components be normalized before applying K means on them?

1  Asked on February 9, 2021 by udaykumarverma

### Naive Bayes and Support Vector Machine (NBSVM) Classification

1  Asked on February 9, 2021 by oldtimerambler

### feature importance with CNN

0  Asked on February 9, 2021 by user110020

### How to find whether a dataset is blanced or imbalanced?

3  Asked on February 9, 2021 by data-bee

### What is the defining Set in NLP

1  Asked on February 9, 2021

### How to add Error Bar to Matplotlib line plot?

1  Asked on February 9, 2021

### Why some papers write false positive rate per case instead of percentage rate?

1  Asked on February 9, 2021

### how to find parameters used in decision tree algorithm

2  Asked on February 9, 2021 by a-r-seif

### What is the difference between “Adding more LSTM layers” or “Adding more units on existence layers”?

3  Asked on February 9, 2021

### Add variables/features to already trained tensorflow model when new variables come in. Possible?

0  Asked on February 9, 2021 by wiggalicious

### Time series stationarize vs normalization

0  Asked on February 8, 2021

### How to use a dataset with only one category of data

3  Asked on February 8, 2021 by finn-williams

### How do we standardize arrays with NaN?

4  Asked on February 8, 2021 by uharsha33

### Data augmentation for recommendation systems

0  Asked on February 8, 2021

### Omnibus and R square improvements for OLS model

0  Asked on February 8, 2021

### Why is it important to have stable activations in the forward pass?

0  Asked on February 8, 2021