Factors in R
I recently mentioned the important change of default that came with the release of the latest version of R (4.0.3) in a post titled Default preference reversal in R, and how that change can impact existing code.
Recently I had 2 more encounters with this notion, and I thought I’d write here how I found the solutions to these 2 problems. The commands involved used to work in R 3.6.3.
I had a hunch that the errors were once again related to the R notion of “strings as factors”. Put simply, this means that columns of a data table that contain strings (words or characters) are parsed as categorical variables, which R calls factors with levels. For example, a table of data could have multiple rows and a column with 2 possible values, e.g. Wild Type and mutant. The previous post explained how the new R version just reads these as “words” (character-based items as opposed to numbers) and no longer makes them into categories by default.
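As a small illustration of the difference, the sketch below builds the same table twice, passing stringsAsFactors explicitly so that it behaves the same on any R version. The table and the column name genotype are made up for this example.

```r
# Hypothetical data: the column name and values are illustrative only.
genotype <- c("Wild Type", "mutant", "Wild Type", "mutant")

# Default behaviour in R >= 4.0.0: strings stay character vectors.
df_chr <- data.frame(genotype = genotype, stringsAsFactors = FALSE)

# Default behaviour in R < 4.0.0: strings are converted to factors.
df_fct <- data.frame(genotype = genotype, stringsAsFactors = TRUE)

class(df_chr$genotype)            # "character"
is.null(levels(df_chr$genotype))  # TRUE: a character vector has no levels
class(df_fct$genotype)            # "factor"
nlevels(df_fct$genotype)          # 2
```

Any code that relies on levels() will therefore see nothing at all when the column was read as character.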
Alpha diversity: the variance within a particular sample.
The first error was related to the function plot_anova_diversity() from the package microbiomeSeq. The error reported was cryptic at best:
Error in data.frame(x = c(which(levels(df[, grouping_column]) == as.character(df_pw[i, : arguments imply differing number of rows: 0, 2
After searches with a search engine (Google) and confusing posts on various forums, I finally found a GitHub “issue” page that seemed to indicate that the problem was fixed in March 2019. The “commit page” allowed me to spot the solution: adding one line that converts the “grouping column” to a factor within the function:
df[, grouping_column] <- as.factor(df[, grouping_column])
But then why was this command not working?
- The GitHub page on which error #29 was reported as fixed is dated: “Mar 30, 2019”
- In the R directory, the last “commit” (saved) date of the file plot_anova_diversity.R is: “Latest commit d3e3591 on Feb 10, 2019”
Therefore, at the moment, installation still uses the old, “unfixed” file, which does not contain the line with as.factor(df[, grouping_column]). The user can create a new version of the function locally by taking the plot_anova_diversity.R file and adding the extra line.
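The error message itself hints at why that one line matters. The sketch below is not the package's code: it only reproduces, in base R, the expression visible in the error message, using a made-up data frame df with a hypothetical grouping column named Groups.

```r
# Hypothetical data frame; "Groups" is an illustrative column name.
df <- data.frame(Groups = c("A", "B", "A", "B"), stringsAsFactors = FALSE)
grouping_column <- "Groups"

# On a character column there are no levels, so the lookup comes back empty.
idx_chr <- which(levels(df[, grouping_column]) == "A")
length(idx_chr)  # 0

# Combining a zero-length column with a two-element column fails.
msg <- tryCatch(
  {
    data.frame(x = idx_chr, y = c(1, 2))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg  # "arguments imply differing number of rows: 0, 2"

# The one-line fix from the commit page: convert the column to a factor.
df[, grouping_column] <- as.factor(df[, grouping_column])
idx_fct <- which(levels(df[, grouping_column]) == "A")
length(idx_fct)  # 1
```

With the column converted to a factor, the lookup finds one matching level and the data frame can be built.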
DESeq2 was designed for RNA-Seq experiments but can be used with other appropriate tables of numbers. In that sense DESeq2 is used internally within the microbiomeSeq package. A Google search for the main error message
Error in y - ymean : non-numeric argument to binary operator
yields many links on forums such as “Stack Exchange” that only (or mostly) refer to the notion of “random forest” (a classification algorithm consisting of many decision trees, quoted from Understanding Random Forest).
Typing differential_abundance without parentheses shows the function code, which reveals this command within:
rf_res <- randomforest_res(subset.data, meta_table$Groups)
Changing this line to make sure that the groups are read as factors will fix the function:
rf_res <- randomforest_res(subset.data, as.factor(meta_table$Groups))
The user can create a new function that incorporates this modification and then there will no longer be an error.
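One plausible reading of the error, as far as I understand it, is that a random forest routine treats a response that is not a factor as a regression target and centres it on its mean, which cannot work for text. The sketch below imitates that logic in base R only. toy_fit() is a hypothetical stand-in written for this post, not the randomForest implementation.

```r
# toy_fit() is a hypothetical stand-in, NOT the randomForest code.
toy_fit <- function(y) {
  if (is.factor(y)) {
    return("classification")
  }
  # Regression branch: centre the response on its mean.
  ymean <- suppressWarnings(mean(y))  # NA when y is a character vector
  y <- y - ymean                      # fails when y is character
  "regression"
}

groups <- c("Wild Type", "mutant", "Wild Type", "mutant")

# Character groups fall into the regression branch and fail.
res_chr <- tryCatch(toy_fit(groups), error = function(e) conditionMessage(e))
res_chr  # "non-numeric argument to binary operator"

# Factor groups are treated as classes.
res_fct <- toy_fit(as.factor(groups))
res_fct  # "classification"
```

This is consistent with the fix above: wrapping the groups in as.factor() steers the call away from the arithmetic that fails.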
Simple defaults can have catastrophic results, breaking code that should work.
Well, it just shows that code is fragile! A bioinformatician colleague replied to my previous blog post about the change of default that, for a long time already, he had always started his code with stringsAsFactors = F, which in fact put R in the state that is now the default in R 4.0.x. But indeed, one has to be “in the know” to implement such things.
Since these 2 commands worked in the previous version, R 3.6.3, it seems reasonable that the change in the reading of “strings as factors” is the culprit. But adding stringsAsFactors = T did not fix the errors. So I can’t be sure.
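One possible explanation, offered here as an assumption rather than a diagnosis: the stringsAsFactors argument only affects the data.frame() or read.table() call it is passed to, so it changes neither columns that were already read as character nor data frames built inside package functions. A more direct workaround is to convert the character columns explicitly. The table and column names below are hypothetical.

```r
# Hypothetical metadata table; column names are illustrative only.
meta_table <- data.frame(
  Groups = c("Wild Type", "mutant", "Wild Type", "mutant"),
  Depth = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Find the character columns and convert only those to factors.
is_chr <- vapply(meta_table, is.character, logical(1))
meta_table[is_chr] <- lapply(meta_table[is_chr], as.factor)

class(meta_table$Groups)  # "factor"
class(meta_table$Depth)   # "numeric"
```

Numeric columns are left untouched, so only the grouping information changes type.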