Organize blocks of code in R with with()?

In their “Object-Oriented Programming is Bad” video, Brian Will mentions a desired reserved word: use (timestamp) that could look something like this:

a = use x, y {
    ...
    return 3;
}

The variables you specify come from the enclosing scope and would be available as copies within a separate scope defined by the curly braces. Basically, it’s an anonymous function with more intuitive (?) syntax. One benefit is that if you did want to reuse the code block later, it would be very easy to convert to a full-fledged function, and if it was one-off code you could just leave it. You also get some scoping control and added readability (maybe?).

My first thought was that this is actually already easy to do in R with with():

x <- 1
y <- with(x, {
  x <- x * 2
  x + 1
})

print(paste(x, y))
# [1] "1 3"

Is this pleasantly predictable behavior? Not sure. The man page for with() warns:

For interactive use this is very effective and nice to read. For programming however, i.e., in one’s functions, more care is needed, and typically one should refrain from using with(), as, e.g., variables in data may accidentally override local variables, see the reference.

‘with’ help page, package base, R version 4.3.1 (R Core Team, 2023)

For Brian Will’s purposes these are actually desirable qualities, I think. But note that you have to pass with() an environment, list, data frame, or integer. That’s not too cumbersome, but does it spoil the simplicity?

x <- 1
z <- 2
y <- with(list(x = x, z = z), {
  x <- x * 2
  x + 1 + z
})

print(paste(x, z, y))
# [1] "1 2 5"

You can get the same effect with do.call, but the syntax isn’t as nice looking, because (1) the input comes after the expression and (2) you need to use an actual function; an expression doesn’t work. You also need to pass a list, as with with():

x <- 1
y <- do.call(function(x) {
  x + 2
}, list(x))

# or

f1 <- function(x) { x + 2 }
y <- do.call('f1', list(x))

I think readability and analogy to a simple function call aren’t quite as good for do.call in this particular use case compared to with(), but I’m not knocking do.call; it is generally great.

Is this worth it either with do.call or with()? Does this make things easier or harder to debug? Would it be better to just use local functions to organize code into blocks like this even if they are only used once? Is it confusing if those functions live in a separate file?
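
For comparison, the ‘tiny local function’ version of the first with() example, which is what the local-function question above is getting at, might look like this (double_plus_one is just a made-up name):

double_plus_one <- function(x) {
  x <- x * 2
  x + 1
}

x <- 1
y <- double_plus_one(x)
print(paste(x, y))
# [1] "1 3"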

Library of Congress, Prints & Photographs Division, LC-DIG-ppmsc-01765 (digital file from original)

Indicating local functions in R scripts

In R I’m usually using a combination of base functions, functions from loaded packages, and functions I’ve defined in a script in my workflow or sourced from a helperfunctions.r type file. If you type the name of any R function into the console, you can see which package it comes from, and if there is no package you can assume it is ‘local’ (see the example after the list below), but that isn’t always obvious when scrolling through code. For these reasons, I think it’d be useful to delineate. Some best practices (much of which is already in common use) might be:

  • use functions from base (or stats, utils, …) without any indication, for example, with(), lm()
  • always use :: if you are calling a function from a package, for example lme4::glmer()
  • always load all libraries required in a script somewhere near the top of the script with a comment detailing which functions are used
  • if it is a tiny helper function or a locally sourced function, then indicate it in some way… but how?
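
(On that first point: printing a function at the console reports its namespace, and environment() does the same for your own definitions. Output abbreviated; f_helper is a made-up example.)

lm
# ...function body...
# <bytecode: 0x...>
# <environment: namespace:stats>

f_helper <- function(x) x + 1
environment(f_helper)
# <environment: R_GlobalEnv>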

I think the first three points are obvious, but what about the fourth? I’m thinking of these ‘local’ functions as sort of similar to private functions or methods in an OO context. And actually, a lot of ink has been spilled on SO/SE type forums about what naming conventions or indications you should use for private functions.

In C# I guess the convention (?) is to use leading underscores for private fields, so a lot of people suggest that for private functions too. R isn’t really supposed to have leading underscores though. Some folks use thisCase versus ThatCase which I think is just hideous. The tidyverse seems to like underscore separated words instead of this or that case ::shruggyemoji::.

I don’t really like any of the above options? In particular, I don’t think capitalization is clear / obvious enough. One exception is that I like constants to be capitalized and underscore separated. Prepending local_ or private_ to everything doesn’t seem specific or clear enough either, but it could work?

One idea to mark these functions a little more dramatically might be to use environments for ‘lexical scoping’ (is that the right term?). So, something like this:

localutils <- new.env(parent = emptyenv())
localutils$f1 <- function(x) 1

Is this too clunky? Should the localutils environment be capitalized because it is kind of like a constant? You can go one step further and make a saved RDS containing an environment with the functions you want available, then load that RDS file near where you make your library calls:

# load libraries and functions
library(mylibrary)
LOCALUTILS <- readRDS("01_helperfunctions/localutils.rds")

# ...

x <- 1
y <- LOCALUTILS$f1(x)
z <- mylibrary::f2(y)
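
Creating that localutils.rds would be a one-time step in some other script; a sketch (the path and f1 are the same made-up examples as above):

# build the environment of helpers and serialize it to disk
LOCALUTILS <- new.env(parent = emptyenv())
LOCALUTILS$f1 <- function(x) 1
dir.create("01_helperfunctions", showWarnings = FALSE)
saveRDS(LOCALUTILS, "01_helperfunctions/localutils.rds")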

This is OK. One downside is that it kind of hides the fact that it is an environment, but if you have a medium number of functions to define, or want to reuse them across multiple scripts in a workflow, this is more convenient. Also, maybe sourcing an R file with the local functions is better if you want people to be able to browse the functions in a text editor more easily, but the syntax of readRDS is kind of nice because you can see the assignment to the name ‘LOCALUTILS’. source() actually returns the value of the last expression in whatever you source, so you can do something pretty similar to readRDS. Imagine a file localutils.r:

LOCALUTILS <- new.env(parent = emptyenv())
LOCALUTILS$f1 <- function(x) 1
LOCALUTILS

and then just as above you can load that file when you load libraries:

# load libraries and functions
library(mylibrary)
LOCALUTILS <- source("localutils.r")$value

This works, but note that if your environment has a different name inside localutils.r, you’ll end up with two names bound to it after you source: LOCALUTILS and whatever it was called in localutils.r.
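
To make that concrete: environments are reference objects, so the two names end up pointing at the very same thing. A made-up illustration (HELPERS is a hypothetical name):

# pretend localutils.r had used the name HELPERS internally
HELPERS <- new.env(parent = emptyenv())
HELPERS$f1 <- function(x) 1
LOCALUTILS <- HELPERS   # roughly what LOCALUTILS <- source(...)$value hands back

identical(LOCALUTILS, HELPERS)
# [1] TRUE

LOCALUTILS$f2 <- function(x) 2
ls(HELPERS)
# [1] "f1" "f2"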

An obvious alternative is to just always put your helper functions in a package, but that’s a pain for one-off code, especially if you want to hand code to someone without worrying about a lot of dependencies. Also, sometimes it is just one or two helper functions you want to split out, and that’s usually not worth making a package for.

Finally, should we add comments with sort-of function declarations for these local functions somewhere near where they are sourced, to help the reader, since there isn’t going to be a help file for them? Something like this maybe:

# load libraries and functions
library(mylibrary) # for f1, f2
LOCALUTILS <- source("localutils.r")$value

# LOCALUTILS$f3( n ) returns a vector of n frog names
# LOCALUTILS$f4( frog_names ) returns a matrix of frog name similarity

# ...

Do these things make code more readable and sharable and maintainable or is it just confusing?

Is rowSums slow?

I guess it might have been obvious, but I was surprised that rowSums appeared much slower than directly doing the matrix operation. At first I assumed it was because it has some overhead in the form of input handling/casting and other odds and ends.

set.seed(1079711697)  # footnote 1
n <- 1e6
m <- 3
m1 <- matrix(rnorm(n*m), n, m)
f1 <- function(m) m %*% cbind(rep(1, ncol(m)))
f2 <- function(m) rowSums(m)

microbenchmark::microbenchmark(f1(m1), f2(m1))
# Unit: milliseconds
#    expr       min        lq      mean    median        uq     max neval
#  f1(m1)  5.204801  5.335951  6.070637  5.433501  5.659101 20.2690   100
#  f2(m1)  9.992201 10.181401 11.014871 10.254801 10.403851 31.1915   100

The above was run on a 6700K with R 4.3.1 (x86_64-w64-mingw32/x64). When I ran this on an Apple computer with an M1, though, I was surprised again: the pattern was reversed, and both were much faster and closer to one another; means were 3.3 ms and 2.7 ms for f1 and f2 respectively. That’s R 4.3.2 (aarch64-apple-darwin20). I haven’t messed with Rblas, but maybe you can get additional benefits there?
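
As far as I understand it, %*% goes through whatever BLAS R is linked against, while rowSums uses its own compiled loop, so a tuned BLAS could shuffle these timings again. sessionInfo() reports what you are linked to; the paths below are just an example of what the output can look like and vary by platform and install:

sessionInfo()
# ...
# Matrix products: default
# BLAS:   /usr/lib/R/lib/libRblas.so    (example path)
# LAPACK: /usr/lib/R/lib/libRlapack.so  (example path)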

It also looks like there are some slight precision differences between the two methods:

summary(f1(m1) - f2(m1))
#        V1
#  Min.   :-1.776e-15
#  1st Qu.: 0.000e+00
#  Median : 0.000e+00
#  Mean   :-1.190e-19
#  3rd Qu.: 0.000e+00
#  Max.   : 8.882e-16

-10^-19 is pretty small. It happens to be about the charge of a single electron in SI units.
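
As a reminder of why tiny discrepancies like this show up at all: floating point addition isn’t associative, and the two methods almost certainly accumulate the sums in a different order or at different intermediate precision. The classic illustration:

(0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
# [1] FALSE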

  1. I didn’t use my ORCID, because I don’t want you to know who I am. ↩︎
Death to eponyms!

Better help window in RStudio

RStudio takes up a lot of screen real estate with its panes. I prefer to have two floating windows for script and console, plus help when needed. Unfortunately, entering ?pch, for example, opens a pane and squeezes everything out of the way, leaving a too-tiny help window by default.

I prefer the help in a separate pop out window (for instance, in a browser). options(help_type = "html") does nothing as far as I can tell in RStudio.

For a while, I had taken to opening a separate instance of R (directly from Rgui.exe) and using it just for accessing help, but this is kind of annoying for obvious reasons.

By the way, you can make a shortcut for Rgui.exe (or change the existing shortcut in the start menu) and add the option --sdi to get rid of the antiquated MDI (multiple-document interface) default on Windows. I actually really like the default macOS IDE, but it doesn’t seem to be under active development anymore and has had quite a few crashes and bugs for me lately.

A full IDE alternative on Windows is using notepad++ and nppToR. This sends code directly from notepad++ to an R window with a keystroke (default F8). Even if this doesn’t appeal to you, notepad++ is great, has syntax highlighting and doesn’t die when opening very large text files.

You can also do something similar with vim and tmux and a plugin such as Nvim-R. (Other than RStudio, this seems to be the only other way to set up the windows like this on Linux; let me know if you have alternatives though!) I’m not a complete lunatic: this was a pain to get going, and even then it’s kind of clunky. There are some advantages to a GUI after all. Can you tell I’m a ratpoison fan? (If you’re curious, StumpWM is probably a better bet these days since it is a little more active. Also if you’re interested in lisp.)

If you really want to use RStudio, there is a suggestion on Stack Overflow from geotheory to edit Options.R and remove a few lines. It looks like the code has changed slightly since that post; this is what I found in my options file for RStudio 2022.02.3 Build 492:

# custom browseURL implementation.
.rs.setOption("browser", function(url)
{
   .Call("rs_browseURL", url, PACKAGE = "(embedding)")
})

Note you might need admin privileges to change this file. As geotheory points out you also have to repeat this every time the options file gets updated by an RStudio install.

Seems to work for me.

The Rat Catcher, 1632, Johannes van Vliet

See also:

  1. La Londe KB, Mahoney A, Edwards TL, Cox C, Weetjens B, Durgin A, Poling A. Training pouched rats to find people. J Appl Behav Anal. 2015 Spring;48(1):1-10. doi: 10.1002/jaba.181. Epub 2014 Dec 2. PMID: 25451685.
  2. Schaverien A. “Magawa, Rat That Hunted Land Mines, Dies in Retirement.” The New York Times. 12 Jan 2022. Accessed: 2 Nov 2022.

stringsAsFactors = FALSE

R is changing the way it deals with converting strings to factors in functions like data.frame(). There is a detailed post about the plan, but that post was created before version 4.0.0 so I’m not sure if anything has changed.

I’m running R 4.0.5 right now. I know I’m behind, but I’m in the middle of a project and I don’t want to update until I finish it. Anyway, default.stringsAsFactors() is now FALSE. And this is nice. I can also see:

args(data.frame)

# function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = default.stringsAsFactors())

That’s nice. But look at:

args(expand.grid)

# function (..., KEEP.OUT.ATTRS = TRUE, stringsAsFactors = TRUE)

That’s less nice and generated a rather confusing bug for me recently. Also look at this:

letterframe1 <- data.frame(cbind(LETTERS, 1))
class(letterframe1[, 1])
# character

letterframe2 <- data.frame(table(LETTERS))
class(letterframe2[, 1])
# factor

letterframe3 <- data.frame(table(LETTERS), stringsAsFactors = FALSE)
class(letterframe3[, 1])
# factor

letterframe4 <- as.data.frame(table(LETTERS), stringsAsFactors = FALSE)
class(letterframe4[, 1])
# character

And for the record:

letterframe5 <- as.data.frame(table(LETTERS))
class(letterframe5[, 1])
# factor

By the way, I ran all the above examples in R 3.6.1 and everything returned factor except for class(letterframe4[, 1]).

This was unexpected. I’m sure there’s a sensible reason for all that, but I don’t know enough to guess exactly what it could be.

The character input must be getting converted to a factor somewhere along the way, either inside table() or in the conversion to a data frame, but I’m not sure why there’s a difference between as.data.frame() and data.frame(). If it were truly already a factor, then both functions should return a factor regardless of stringsAsFactors. Although to me that isn’t really an optimal solution either, because it isn’t obvious from the table() output or from the documentation that this conversion of my initial character input should take place.
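
Poking at the table itself, the letters still look like plain character dimnames at that point, which makes me suspect the factor appears during the conversion to a data frame (as.data.frame.table()) rather than inside table(), though that still doesn’t explain why data.frame() ignores stringsAsFactors = FALSE here. Output roughly:

str(table(LETTERS))
#  'table' int [1:26(1d)] 1 1 1 1 1 1 1 1 1 1 ...
#  - attr(*, "dimnames")=List of 1
#   ..$ LETTERS: chr [1:26] "A" "B" "C" "D" ...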

If I correctly understood the dev blog linked above, changing default.stringsAsFactors() might be a transition phase to a new system that works in a different way, so maybe this new system will encompass some of these other scenarios when implemented.

On co-first authors.

I recently saw a scientific paper with co-first authors which further stated that each author reserved the right to place their own name first in their CVs. Although I’ve read lots of papers with co-first authors, I’d never seen the comment about name order before and it seemed like an interesting approach.

I was curious how common this practice is and in searching for more examples on the internet, I found some views on the matter. One set of examples was on a stack-exchange exchange and another more nuanced \insertirony[here]{} conversation occurred on the bird app.

At least one person mentioned they already engage in this practice where appropriate. There were some objections; a couple that I wanted to think about boiled down to:

  • Hiring committees won’t like that and won’t bother to read your fine print justification.
  • Won’t that be confusing?

I’ll start with the second. I don’t think it has to be confusing. We have advanced search engines and DOIs now, and organization by a single author name is much less common than it once was. When I used to photocopy or print out articles, I organized them in folders based on topic, and I do basically the same now digitally in my ref manager. How they’re organized in my mind I’m not sure; I think it is variable. We’d also have to think about conventions for in-text citation styles like APA, but I think it is surmountable.

Is it worth taking this concept a step further, de-emphasizing author order altogether, and being more descriptive and honest about contributions? A project called CRediT seeks to standardize this process and has succeeded in integrating its system into some journal submissions. Probably the crucial part is getting hiring committees, funders, and the like to pay attention to it and ask for this type of information to be listed on CVs.

Once contributions are more transparent, we can dispense with author order altogether and ref managers and search engines can shuffle the order every time a paper comes up.

If you’re a PI and this terrifies you, maybe it should.

Giada Venturino | Ducks Swimming on Water | Casier, Veneto, Italia

Basic processing of bibtex output from Zotero

basically:

awk '!/^\tabstract/' mybib.bib | awk '!/^\tfile/' | sed '/^@/y/_/-/' | bibtool -s > mybib_edited_sorted.bib

Why?

Zotero is great, but it has a couple quirks that I don’t like when you export a collection to bibtex:

  • cite key has underscores

It doesn’t really seem to be a problem in practice, but in general latex seems to abhor underscores, so I’d rather just replace them with hyphens.

  • @article includes the entire abstract

Adds clutter and usually (but not always!) I don’t need it for the kind of bibliography I’m making.

  • @article includes a ‘file’ field

Gives away the directory structure of my computer and my username and again adds clutter.

  • the entries aren’t sorted

None of this really matters, and unless you’re sharing your .bib or .tex files, only you will ever know. But I do think it makes the files much easier to read even if I’m not sharing them, which is helpful for me.

The fix

It looks like someone else on the internet shares at least one of my very specific mild complaints. There is a decent solution suggested on the forum: edit the js file that generates the bibtex output. The processor seems reasonably readable, but the changes might get obliterated when Zotero updates? Anyway, I prefer to use a couple of command line utilities to clean things up. The longer the file, the more helpful these are.

Workflow

  1. awk. It comes with macOS and most *nix distributions. On Windows you already have it if you have git-bash, the Linux subsystem, or MobaXterm installed. I’m sure there are many other versions you can install. These commands get rid of any line that starts with a tab followed by abstract or file. Note that the second command writes to a new file: you can’t read from and redirect to the same file in one step, because the shell truncates the output file before awk reads it.
awk '!/^\tabstract/' mybib.bib > mybib_edited.bib
awk '!/^\tfile/' mybib_edited.bib > mybib_edited2.bib

! means not

^ means start of the line

\t means a tab

mybib.bib is the input file

> means save output to the file mybib_edited.bib (redirects the stdout)

  2. sed (stream editor) is another command line program, with pretty much the same availability as awk. The following command replaces every _ with - but only in lines starting with ‘@’.
sed '/^@/y/_/-/' mybib_edited2.bib > mybib_edited3.bib

/^@/ matches lines starting with @

y/_/-/ replaces every instance of _ with -

> redirects the output to mybib_edited3.bib

  3. bibtool can alphabetize the entries. The bibtool source is available at https://github.com/ge-ne/bibtool, and it is also in the Debian/Ubuntu package repositories. Not sure if there are binaries for other systems.
bibtool -s -i mybib_edited3.bib -o mybib_edited_sorted.bib

-s sorts

-i introduces the input file

-o introduces the output file

You can also do more specific things like sorting on particular fields; see this stack exchange question and answer for an example.

bibtool also regularizes the format of the entries a little bit, but it doesn’t look horrible so I don’t really mind.

The quick way

I kind of like going step by step to make sure I haven’t made a horrible mistake. With awk and sed, for example, if you omit the redirect (>) then you just see everything echoed to the terminal, so you can make sure you’re getting the desired effect before you overwrite anything.

But if you’re satisfied with your pattern matching, you can go straight to the end product using the pipe operator |, which takes the output of one command and pipes it in as the input to the next. For example:

awk '!/^\tabstract/' mybib.bib | awk '!/^\tfile/' | sed '/^@/y/_/-/' | bibtool -s > mybib_edited_sorted.bib
CC BY 4.0 University of Texas at Arlington Libraries, Special Collections

Should you always use your ORCID as a random seed?

There’s some interesting discussion in response to the cross validated question: Am I creating bias by using the same random seed over and over?

whuber caught my attention in the accepted answer:

…there is extraordinary merit in having a well-known “personal seed”…

I wonder if your ORCID is a good candidate for a personal seed. It’s already a unique number, directly linked to your scholarly output, and easily verified, thereby providing some accountability against seed picking.

Some folks have alphanumeric ORCIDs (the last character is a checksum digit that can be an X), though I don’t think they are very common, and one could easily come up with a consistent way to convert the letters to integers. One more serious possible drawback is that having your ORCID in your code could break the blinding on peer review. ORCIDs tend to be long and complicated enough, though, that an editor or reviewer would need an unusual memory to accidentally un-blind you.
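
For example, one made-up convention, sketched below: strip the hyphens, turn a trailing X into 10, and fold the result into the integer range that set.seed() accepts (orcid_to_seed and the example iD are hypothetical):

orcid_to_seed <- function(orcid) {
  digits <- gsub("-", "", orcid)      # drop the hyphens
  digits <- sub("X$", "10", digits)   # arbitrary choice: a trailing X becomes 10
  # a full 16-digit ORCID is too large for a 32-bit integer, so fold it down;
  # this passes through double precision, so it is a convention, not exact arithmetic
  as.integer(as.numeric(digits) %% .Machine$integer.max)
}

orcid_to_seed("0000-0001-2345-678X")   # a made-up iD, not anyone's real ORCID
# [1] 1234567810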

There was also some interesting chitchat about whether a personal seed that produces rare sequences is desirable for testing purposes. In the case of using your ORCID you wouldn’t be able to control this aspect.

Just for fun, I get 53 heads if I use my ORCID to do 100 coin flips in R:

set.seed(my_orcid)
sum(rbinom(100, 1, .5))
Peter Häger (CC0 Public Domain)