Basic processing of bibtex output from Zotero

basically:

awk '!/^\tabstract/' mybib.bib | awk '!/^\tfile/' | sed '/^@/y/_/-/' | bibtool -s > mybib_edited_sorted.bib

Why?

Zotero is great, but when you export a collection to bibtex the output has a few quirks that I don’t like:

  • cite key has underscores

It doesn’t really seem to be a problem in practice, but in general latex seems to abhor underscores, so I’d rather just replace them with hyphens.

  • @article includes the entire abstract

Adds clutter and usually (but not always!) I don’t need it for the kind of bibliography I’m making.

  • @article includes a ‘file’ field

Gives away the directory structure of my computer and my username and again adds clutter.

  • the entries aren’t sorted

None of this really matters, and unless you’re sharing your .bib or .tex files only you will ever know. But I do think it makes the files much easier to read even if I’m not sharing them, which is helpful for me.

The fix

It looks like someone else on the internet shares at least one of my very specific mild complaints. There is a decent solution suggested on the forum to edit the js file that generates the bibtex output. The processor seems reasonably readable, but the changes might get obliterated upon Zotero updating? Anyway I prefer to use a couple of command line utilities to clean things up. The longer the file, the more helpful these are.

Workflow

  1. awk. It comes with macOS and most *nix distributions. On Windows you already have it if you have git-bash, the Windows Subsystem for Linux, or MobaXterm installed. I’m sure there are many other versions you can install. These commands will get rid of any line that starts with a tab and then abstract or file. Note that each step writes to a different file than it reads: redirecting to the file you are reading from empties it before awk ever sees it.
awk '!/^\tabstract/' mybib.bib > mybib_noabstract.bib
awk '!/^\tfile/' mybib_noabstract.bib > mybib_edited.bib

! means not

^ means start of the line

\t means a tab

mybib.bib is the input file

> means save output to the file mybib_edited.bib (redirects the stdout)
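The two filters can also be folded into a single awk call with alternation. This is only a minimal sketch: the sample entry and the sample file names are made up, and it assumes the fields are indented with a tab.

```shell
# Build a tiny, made-up sample file with tab-indented fields.
printf '@article{doe_example_2020,\n\ttitle = {An Example},\n\tabstract = {Long text.},\n\tfile = {/home/me/paper.pdf},\n\tyear = {2020},\n}\n' > sample.bib

# Drop any line that starts with a tab followed by "abstract" or "file".
awk '!/^\t(abstract|file)/' sample.bib > sample_edited.bib

# Show what is left: the @article line, title, year, and closing brace.
cat sample_edited.bib
```

Because the input and output names differ, there is no risk of emptying the file being read.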

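  2. sed (stream editor) is another command line program. Pretty much same availability as awk. The following command will replace every _ with - only in lines starting with ‘@’. The output goes to a temporary file first and is then moved into place, because redirecting straight back to the input file would empty it.
sed '/^@/y/_/-/' mybib_edited.bib > mybib_tmp.bib && mv mybib_tmp.bib mybib_edited.bib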

/^@/ restricts the command to lines starting with @

y/_/-/ swaps every _ for - on those lines (y is sed’s character-for-character replace command)

> redirects the output to a file
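To see that only the cite key line is touched, here is a small sketch on made-up input (the cite key, title, and file names are hypothetical):

```shell
# Made-up lines: an entry header and a field that also contains an underscore.
printf '@article{doe_some_title_2020,\n\ttitle = {snake_case in titles},\n' > sample_keys.bib

# Only the line starting with @ is changed; the title keeps its underscore.
sed '/^@/y/_/-/' sample_keys.bib > sample_keys_edited.bib

cat sample_keys_edited.bib
```

The first line comes out as @article{doe-some-title-2020, while the title line is left alone.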

  3. bibtool can alphabetize the entries. The bibtool source is available at https://github.com/ge-ne/bibtool, and it is also available through the native Debian/Ubuntu package manager. Not sure if there are binaries for other systems.
bibtool -s -i mybib_edited.bib -o mybib_edited_sorted.bib

-s sorts

-i introduces the input file

-o introduces the output file

You can also do more specific things, like sorting on particular fields; see this Stack Exchange question and answer for an example.

bibtool also regularizes the format of the entries a little bit, but it doesn’t look horrible so I don’t really mind.

The quick way

I kind of like going step by step to make sure I haven’t made a horrible mistake. With awk and sed, for example, if you omit the redirect (>) you can just see everything echoed to the terminal and make sure you’re getting the desired effect before you overwrite anything.
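For a long file, scrolling through the whole output is tedious, so one option is to pipe the result into diff and look only at what changed. A rough sketch, with a made-up sample file:

```shell
# Made-up sample input.
printf '@article{doe_example_2020,\n\ttitle = {An Example},\n\tabstract = {Long text.},\n}\n' > preview.bib

# Compare the original with the filtered stream ("-" tells diff to read stdin).
# diff exits with status 1 when the inputs differ, so "|| true" keeps a
# script running under "set -e" from stopping here.
awk '!/^\tabstract/' preview.bib | diff preview.bib - > preview.diff || true

# Lines starting with "<" are the ones the filter would remove.
cat preview.diff
```

Nothing is overwritten here; preview.bib stays as it was.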

But if you’re satisfied with your pattern matching, you can just go straight to the end product using the pipe operator |, which takes the output of one command and pipes it in as the input to the next. For example:

awk '!/^\tabstract/' mybib.bib | awk '!/^\tfile/' | sed '/^@/y/_/-/' | bibtool -s > mybib_edited_sorted.bib
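
If you run this often, the pipeline could be wrapped in a small shell function. This is just a sketch: clean_bib is a made-up name, the sample entry is invented, and falling back to unsorted output when bibtool is missing is my own assumption about what is convenient. It also assumes mktemp is available, which is common but not guaranteed.

```shell
# Hypothetical helper: clean_bib INPUT OUTPUT
clean_bib() {
    in=$1
    out=$2
    tmp=$(mktemp) || return 1
    # Same filters as the one-liner: drop abstract/file lines, fix cite keys.
    awk '!/^\t(abstract|file)/' "$in" | sed '/^@/y/_/-/' > "$tmp"
    if command -v bibtool >/dev/null 2>&1; then
        # Sort with bibtool when it is installed.
        bibtool -s -i "$tmp" -o "$out"
    else
        echo "bibtool not found; writing unsorted output" >&2
        cp "$tmp" "$out"
    fi
    rm -f "$tmp"
}

# Usage on a made-up sample file.
printf '@article{doe_example_2020,\n\ttitle = {An Example},\n\tabstract = {Long text.},\n\tfile = {/home/me/paper.pdf},\n}\n' > wrapper_sample.bib
clean_bib wrapper_sample.bib cleaned.bib
cat cleaned.bib
```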
CC BY 4.0 University of Texas at Arlington Libraries, Special Collections