Strings as factors in R

Factors in R

I recently mentioned the important change of default that arrived with R 4.0.0 (the latest release is now version 4.0.3) in a post titled Default preference reversal in R, and how that change can impact existing code.

Recently I had two more encounters with this issue, and I thought I’d write down how I found the solutions to these two problems. Both commands used to work in R 3.6.3.

Hunch

I had a hunch that the errors were once again related to R’s notion of “strings as factors”. Simply put, this means parsing the columns of a data table that contain strings (words or characters) as categorical variables, which R calls factors with levels. For example, a table could have many rows and a column with two possible values, e.g. Wild Type and mutant. The previous post explained that the new R version reads such values as plain “words” (character data, as opposed to numbers) and no longer turns them into categories by default.
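To make the distinction concrete, here is a minimal sketch in base R. The toy genotype column is made up for illustration, and the stringsAsFactors argument of data.frame() is set explicitly so that the result does not depend on the R version.

```r
# Hypothetical toy column with two possible values
genotype <- c("Wild Type", "mutant", "mutant", "Wild Type")

# Same data, parsed the "new" way (text) and the "old" way (categories)
as_text   <- data.frame(genotype = genotype, stringsAsFactors = FALSE)
as_factor <- data.frame(genotype = genotype, stringsAsFactors = TRUE)

class(as_text$genotype)      # "character"
class(as_factor$genotype)    # "factor"

# Only the factor version carries levels
levels(as_text$genotype)     # NULL
nlevels(as_factor$genotype)  # 2
```

Any code that relies on levels() silently gets NULL when the column is plain text.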

Alpha Diversity

Alpha diversity: the diversity (variation) within a particular sample.

The first error was related to the function plot_anova_diversity() from the package microbiomeSeq. The error reported was cryptic at best:

Error in data.frame(x = c(which(levels(df[, grouping_column]) == as.character(df_pw[i,  : arguments imply differing number of rows: 0, 2

After searches with a search engine (Google) and confusing posts on various forums, I finally found a GitHub “issue” page that seemed to indicate that the problem was fixed in March 2019. The “commit” page allowed me to spot the solution: adding one line within the function to convert the grouping column to a factor:

df[, grouping_column] <- as.factor(df[, grouping_column])
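The sketch below shows one plausible way this error can arise. It is a reconstruction from the error message, not the package's actual code, and the toy data frame and group names are hypothetical.

```r
# Hypothetical metadata with a character (not factor) grouping column
df <- data.frame(Groups = c("A", "B", "A", "B"), stringsAsFactors = FALSE)
grouping_column <- "Groups"

# levels() of a character vector is NULL, so no position is ever found
pos_chr <- which(levels(df[, grouping_column]) == "A")
length(pos_chr)   # 0

# Combining that empty result with a length-2 vector triggers the error
msg <- tryCatch(
  data.frame(x = pos_chr, y = c(0, 0)),
  error = function(e) conditionMessage(e)
)
msg

# After the one-line fix, the level is found
df[, grouping_column] <- as.factor(df[, grouping_column])
pos_fct <- which(levels(df[, grouping_column]) == "A")
pos_fct           # 1
```

The "0" in "differing number of rows: 0, 2" would then be the empty lookup on a column that has no levels.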

But then why was this command not working?

  • The GitHub page on which error #29 was reported as fixed is dated “Mar 30, 2019”.
  • In the R directory, the last commit to the file plot_anova_diversity.R reads “Latest commit d3e3591 on Feb 10, 2019”.

Therefore, at the moment, installing the package still delivers the old, “unfixed” file, which does not contain the as.factor(df[, grouping_column]) line. The user can create a fixed version of the function locally by taking the plot_anova_diversity.R file and adding the extra line.

Differential abundance

The R package DESeq2 was designed for RNA-Seq experiments but can be used with other appropriate tables of numbers. In that sense, DESeq2 is used internally by the function differential_abundance() in the package microbiomeSeq. A Google search for the main error message, Error in y - ymean : non-numeric argument to binary operator, yields many links to forums such as “Stack Exchange” that only (or mostly) refer to the notion of “random forest” (a classification algorithm consisting of many decision trees; quoted from Understanding Random Forest).

Typing differential_abundance without parentheses shows the function code, which reveals this command within:

rf_res <- randomforest_res(subset.data, meta_table$Groups)

Changing this line to make sure that the groups are read as factors will fix the function:

rf_res <- randomforest_res(subset.data, as.factor(meta_table$Groups))
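The error message itself can be reproduced in base R without any package. The sketch below assumes, based only on the message, that a non-factor response ends up being centred on its mean somewhere inside the random forest code. The names y and ymean simply mirror the message, and the group labels are made up.

```r
# Hypothetical group labels stored as plain text
y <- c("Wild Type", "mutant", "mutant", "Wild Type")

# mean() of a character vector is NA (R also emits a warning)
ymean <- suppressWarnings(mean(y))

# Subtracting from text is what raises the error
msg <- tryCatch(y - ymean, error = function(e) conditionMessage(e))
msg

# Wrapping the labels in as.factor() marks them as categories
y_fixed <- as.factor(y)
is.factor(y_fixed)   # TRUE
```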

The user can create a new function that incorporates this modification and then there will no longer be an error.
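As a minimal sketch of that workflow, the example below uses a hypothetical stand-in function (count_levels is not part of microbiomeSeq) to show the pattern: print the source, copy it under a new name, and add the conversion.

```r
# Hypothetical stand-in for a package function that expects a factor
count_levels <- function(groups) {
  length(levels(groups))
}

# Typing the name without parentheses prints the source to copy from
count_levels

# Local copy under a new name, with the conversion added up front
count_levels_fixed <- function(groups) {
  groups <- as.factor(groups)
  length(levels(groups))
}

groups <- c("Wild Type", "mutant", "mutant")
count_levels(groups)        # 0: character input has no levels
count_levels_fixed(groups)  # 2
```

For a real package function, the copied code may call helpers that the package does not export. Setting environment(my_copy) <- asNamespace("microbiomeSeq") is one way to keep those reachable, though it may or may not be needed here.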

Conclusion

A simple change of default can have catastrophic results, breaking code that should work.

Well, it just shows that code is fragile! A bioinformatician colleague replied to my previous blog post about the change of default that, for a long time already, he had always started his code with stringsAsFactors = F, which in fact put R in the state that is now the default in R 4.0.x. But indeed, one has to be “in the know” to implement such things.

Since these two commands worked in the previous version, R 3.6.3, it seems reasonable that the change in the reading of “strings as factors” is the culprit. But adding stringsAsFactors = T did not fix the errors. So I can’t be sure.
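One possible explanation, offered as a guess rather than a diagnosis: stringsAsFactors only acts when a data frame is created or read (for example by data.frame() or read.table()), so it cannot change character columns in an object that already exists or that was built by other code. Converting the columns explicitly does not depend on any default. A minimal sketch, with a hypothetical metadata table:

```r
# Hypothetical metadata table whose text column was read as character
meta_table <- data.frame(
  Groups = c("Wild Type", "mutant", "mutant"),
  Depth  = c(1200, 950, 1430),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors
is_chr <- vapply(meta_table, is.character, logical(1))
meta_table[is_chr] <- lapply(meta_table[is_chr], as.factor)

str(meta_table)
```

Numeric columns are left untouched, so the conversion can be applied to a whole table before it is handed to a function that expects factors.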


Credits: labyrinth image from Peggy_Marco