TransWikia.com

When to choose character instead of factor in R?

Data Science Asked by lupi5 on December 16, 2020

I am currently working on a dataset that contains a name attribute holding each person's first name. After reading the CSV file with read.csv, the variable is a factor by default (stringsAsFactors=TRUE) with ~10k levels. Since name does not reflect any group membership, I am unsure whether to leave it as a factor.

Is it necessary to convert name to character? Are there some advantages in doing (or not doing) this? Does it even matter?

2 Answers

Factors are stored as numbers and a table of levels. If you have categorical data, storing it as a factor may save lots of memory.

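For example, if you have a vector of length 1,000 stored as character and the strings are all 100 characters long, a naive estimate is about 100,000 bytes of text. If you store it as a factor, it will take about 4,000 bytes for the integer codes (R integers are 4 bytes each) plus the storage for the distinct levels. R also caches identical strings internally, so a character vector with many repeated values costs less than the naive estimate and the saving is smaller in practice.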
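A rough way to check this on your own machine is to build both versions and compare them with object.size. This is only a sketch: the vector names and the choice of 10 distinct strings are made up for illustration, and the exact byte counts depend on your R version and platform.

```r
set.seed(1)

# 10 distinct strings of 100 characters each (illustrative data, not from the question)
lvls <- vapply(
  1:10,
  function(i) paste(sample(letters, 100, replace = TRUE), collapse = ""),
  character(1)
)

# A vector of length 1,000 drawn from those strings, as character and as factor
x_chr <- sample(lvls, 1000, replace = TRUE)
x_fct <- factor(x_chr)

# Compare the reported sizes; a factor stores 4-byte integer codes plus its levels
print(object.size(x_chr), units = "Kb")
print(object.size(x_fct), units = "Kb")
```

The gap grows with the length of the vector, not with the length of the strings, since repeated strings are shared in both representations.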

Comparisons with factors should be quicker too because equality is tested by comparing the numbers, not the character values.

The advantage of keeping it as character shows when you want to add new items. With a factor, a value that is not already one of the levels cannot simply be assigned: you have to change the levels first.
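A small sketch of the difference, using made-up names. Assigning an unknown value into a factor gives NA with a warning (suppressed here so the script runs quietly), while the character vector just accepts it.

```r
ch <- c("Anna", "Ben")
f  <- factor(c("Anna", "Ben"))

# Character: a new value can be appended directly
ch[3] <- "Cara"

# Factor: "Cara" is not a level, so the new element becomes NA
# (R would normally warn about an invalid factor level here)
f_bad <- f
suppressWarnings(f_bad[3] <- "Cara")
is.na(f_bad[3])

# Factor: extend the levels first, then the assignment works
f_ok <- f
levels(f_ok) <- c(levels(f_ok), "Cara")
f_ok[3] <- "Cara"
as.character(f_ok)
```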

Store them as whatever makes the most sense for what the data represent. If name is not categorical, and it sounds like it isn't, then use character.
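If you decide on character, two common ways to get there are sketched below: convert the existing column with as.character, or ask read.csv not to create factors in the first place. The inline CSV text and the data frame names are placeholders for illustration; with a real file you would pass the file path instead of text.

```r
# Placeholder data standing in for the real CSV file
csv_text <- "name,age\nAnna,31\nBen,45\nAnna,28"

# Option 1: read with factors, then convert the column afterwards
df1 <- read.csv(text = csv_text, stringsAsFactors = TRUE)
df1$name <- as.character(df1$name)

# Option 2: keep strings as character while reading
df2 <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(df1$name)
class(df2$name)
```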

Correct answer by Spacedman on December 16, 2020

A few thoughts on the question above:

  • I find this link about Factors in R very useful.
  • If you want to create a classification model, or if you would like to convert the character to numeric, you have to convert the character to a factor first: as.numeric(as.factor(name)). In your case the categories could be, for example, names with more or fewer than 4 letters, or names starting with a specific letter.
  • As mentioned before, converting the character to a factor saves memory!
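The conversion and the two grouping ideas from the list above could look roughly like this. The sample names and the variable names are invented for illustration, and the "long"/"short" cut-off at 4 letters is just one reading of the suggestion.

```r
name <- c("Anna", "Ben", "Christopher", "Anna")

# Integer codes per distinct name; codes follow the sorted order of the levels
codes <- as.numeric(as.factor(name))

# Group 1: names with more than 4 letters vs. the rest
name_length <- factor(
  nchar(name) > 4,
  levels = c(FALSE, TRUE),
  labels = c("short", "long")
)

# Group 2: first letter of the name
first_letter <- factor(substr(name, 1, 1))

data.frame(name, codes, name_length, first_letter)
```

Note that the integer codes carry no meaning beyond identifying the level, so they are only useful where a model or function expects a numeric encoding of categories.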

Happy coding!

Answered by Anna-Marie Tomm on December 16, 2020
