Indicating local functions in R scripts

In R I’m usually using a combination of base functions, functions from loaded packages, and functions I’ve defined in a script in my workflow or sourced from a helperfunctions.r-type file. If you type the name of any R function into the console, you can see which package it comes from (if any); if no package is listed, you can assume it is ‘local’. But that isn’t always obvious when scrolling through code, so I think it’d be useful to delineate. Some best practices (much of which is already in common use) might be:

  • use functions from base (or stats, utils, …) without any indication, for example, with(), lm()
  • always use :: if you are calling a function from a package, for example lme4::glmer()
  • always load all libraries a script requires somewhere near the top of the script, with a comment detailing which functions are used
  • if it is a tiny helper function or a locally sourced function then indicate some way… but how?

I think the first three points are obvious, but what about the fourth? I’m thinking of these ‘local’ functions as sort of similar to private functions or methods in an OO context. And actually, a lot of ink has been spilled on SO/SE-type forums about what naming conventions or indications you should use for private functions.

In C# I guess the convention (?) is to use leading underscores for private fields, so a lot of people suggest that for private functions too. R isn’t really supposed to have leading underscores though (a name like _f isn’t even syntactically valid without backticks). Some folks use thisCase versus ThatCase, which I think is just hideous. The tidyverse seems to like underscore-separated words instead of this or that case ::shruggyemoji::.

I don’t really like any of the above options. In particular, I don’t think capitalization is clear or obvious enough. One exception: I like constants to be capitalized and underscore-separated. Prepending local_ or private_ to everything doesn’t seem specific or clear enough either, but it could work?

One idea to mark these functions a little more dramatically might be to collect them in their own environment, which acts as a kind of namespace (I was thinking of this as ‘lexical scoping’, though namespacing is probably the better term). So, something like this:

localutils <- new.env(parent = emptyenv())
localutils$f1 <- function(x) 1
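For what it’s worth, calling into such an environment looks like this (f1 here is a toy stand-in, not a real helper):

```r
# collect helpers in their own environment, acting as a namespace
localutils <- new.env(parent = emptyenv())
localutils$f1 <- function(x) x + 1

localutils$f1(1)   # 2
ls(localutils)     # "f1" -- lists what lives in the environment
exists("f1")       # FALSE (in a fresh session): f1 is not in the global environment
```

The nice side effect is that every call site reads `localutils$f1(...)`, so the ‘localness’ is visible wherever the function is used, not just where it is defined.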

Is this too clunky? Should the localutils environment be capitalized because it is kind of like a constant? You can go one step further and save an RDS containing an environment with the functions you want available, then load that RDS file near where you make your library calls:

# load libraries and functions
library(mylibrary)
LOCALUTILS <- readRDS("01_helperfunctions/localutils.rds")

# ...

x <- 1
y <- LOCALUTILS$f1(x)
z <- mylibrary::f2(y)
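The saving side isn’t shown above; a minimal sketch of the one-time setup, with an illustrative path and a toy function, might be:

```r
# One-time setup script: build the environment and serialize it
LOCALUTILS <- new.env(parent = emptyenv())
LOCALUTILS$f1 <- function(x) x^2

dir.create("01_helperfunctions", showWarnings = FALSE)
saveRDS(LOCALUTILS, "01_helperfunctions/localutils.rds")

# Later, in any analysis script in the workflow:
LOCALUTILS <- readRDS("01_helperfunctions/localutils.rds")
LOCALUTILS$f1(3)  # 9
```

Environments (and the closures inside them) serialize fine with saveRDS, so the whole bundle round-trips through the .rds file.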

This is OK. One downside is that it kind of hides the fact that it is an environment, but if you have a moderate number of functions to define, or want to reuse them across multiple scripts in a workflow, this would be more convenient. Also, maybe sourcing an R file with the local functions is better if you want people to be able to browse the functions in a text editor, but the syntax of readRDS is kind of nice because you can see the assignment to the name ‘LOCALUTILS’. It turns out source() returns (invisibly) a list whose $value element is the value of the last expression in the file, so you could do something pretty similar to readRDS. Imagine a file localutils.r:

LOCALUTILS <- new.env(parent = emptyenv())
LOCALUTILS$f1 <- function(x) 1
LOCALUTILS

and then just as above you can load that file when you load libraries:

# load libraries and functions
library(mylibrary)
LOCALUTILS <- source("localutils.r")$value

This works, but note that if your environment has a different name inside localutils.r, you’ll end up with two names bound to it after sourcing: LOCALUTILS, plus whatever it was called in the file (source() evaluates the file in the global environment by default). These are two bindings to the same environment rather than two copies, since environments have reference semantics.
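A quick demonstration of the stray-binding behavior, assuming source()’s default of evaluating in the global environment (the file name and contents here are illustrative):

```r
# Suppose the file used the name UTILS internally instead of LOCALUTILS
writeLines(c(
  "UTILS <- new.env(parent = emptyenv())",
  "UTILS$f1 <- function(x) x + 1",
  "UTILS"
), "localutils_demo.r")

LOCALUTILS <- source("localutils_demo.r")$value

# sourcing created UTILS in the global environment too;
# both names point at the *same* environment, not copies
identical(LOCALUTILS, UTILS)  # TRUE
rm(UTILS)                     # tidy up the stray binding
```

So the mismatch is harmless memory-wise, but it does leave clutter in the global environment unless you rm() the extra name.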

An obvious alternative is just to always put your helper functions in a package, but that’s a pain for one-off code, especially if you want to hand code to someone without worrying about a lot of dependencies. And sometimes it’s just one or two helper functions you want to split out, which usually isn’t worth making a package for.

Finally, should we add comments with sort-of function declarations for these local functions near where they are sourced, to help the reader, since there isn’t going to be a help file for them? Something like this, maybe:

# load libraries and functions
library(mylibrary) # for f1, f2
LOCALUTILS <- source("localutils.r")$value

# LOCALUTILS$f3( n ) returns a vector of n frog names
# LOCALUTILS$f4( frog_names ) returns a matrix of frog name similarity

# ...

Do these things make code more readable and sharable and maintainable or is it just confusing?

On co-first authors.

I recently saw a scientific paper with co-first authors which further stated that each author reserved the right to place their own name first in their CVs. Although I’ve read lots of papers with co-first authors, I’d never seen the comment about name order before and it seemed like an interesting approach.

I was curious how common this practice is, and in searching for more examples on the internet I found some views on the matter. One set of examples was in a Stack Exchange thread, and another, more nuanced \insertirony[here]{} conversation occurred on the bird app.

At least one person mentioned they already engage in this practice where appropriate. There were some objections, a couple of which I wanted to think about, boiling down to:

  • Hiring committees won’t like that and won’t bother to read your fine print justification.
  • Won’t that be confusing?

I’ll start with the second. I don’t think it has to be confusing. We have advanced search engines and DOIs now, and organization by a single author name is much less common than it once was. When I used to photocopy or print out articles, I organized them in folders based on topic, and I do basically the same now digitally in my ref manager. How they’re organized in my mind, I’m not sure; I think it varies. We’d also have to think about conventions for in-text citation styles like APA, but I think it is surmountable.

Is it worth taking this concept a step further, de-emphasizing author order altogether and being more descriptive and honest about contributions? A project called CRediT (Contributor Roles Taxonomy) seeks to standardize this process and has succeeded in integrating its system into some journal submission workflows. Probably crucial is to get hiring committees, funders, and the like to pay attention to it and ask for this type of information to be listed on CVs.

Once contributions are more transparent, we can dispense with author order altogether and ref managers and search engines can shuffle the order every time a paper comes up.

If you’re a PI and this terrifies you, maybe it should.

Giada Venturino | Ducks Swimming on Water | Casier, Veneto, Italia