Marie Kondo (Netflix via Giphy)
Thank you package-makers
I've used a lot of packages in 2019 and many have brought great joy to my R experience. Thank you to everyone who has created, maintained or contributed to a package this year.
Some particular packages of note for me have been:
- {usethis} by Hadley Wickham and Jenny Bryan
- {drake} by Will Landau
- {purrr} by Lionel Henry and Hadley Wickham
And some honourable mentions are:
- {blogdown} by Yihui Xie
- {xaringan} by Yihui Xie
- {polite} by Dmytro Perepolkin
- {arsenal} by Ethan Heinzen, Jason Sinnwell, Elizabeth Atkinson, Tina Gunderson and Gregory Dougherty
Click the package name to jump to that section.
Packages of note
{usethis}
The format and content of R packages are objectively odd. What files are necessary? What structure should it have? The {usethis} package from RStudio's Hadley Wickham and Jenny Bryan makes package development far easier for newcomers and experienced useRs alike.
In fact, you can make a minimal package in two lines:
- create_package() to create the necessary package structure
- use_r() to create in the right place an R script for your functions
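Here's a minimal sketch of those two calls. The package name (demopkg), its throwaway location in tempdir() and the script name (hello) are all made up for illustration, and I'm assuming a reasonably recent {usethis} that exports with_project(). In an interactive session you'd normally just call create_package() and use_r() and let them open the project and file for you.

```r
library(usethis)

# Hypothetical package name and location: a throwaway folder in tempdir()
pkg_path <- file.path(tempdir(), "demopkg")

# 1. Create the package skeleton (DESCRIPTION, NAMESPACE and an R/ folder)
create_package(pkg_path, open = FALSE)

# 2. Create R/hello.R inside that package, ready for your functions
with_project(pkg_path, use_r("hello", open = FALSE))

# Take a look at what was created
list.files(pkg_path, recursive = TRUE)
```

The open = FALSE arguments and the with_project() wrapper are only there so the sketch runs without switching your current project.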
But there are many more functions to help you set up your package. To name a few more that I use regularly:
- use_vignette() and use_readme_md() for more documentation
- use_testthat() and use_test() for setting up tests
- use_package() to add packages to the Imports section of the DESCRIPTION file
- use_data() and use_data_raw() to add data sets to the package and the code used to create them
- use_*_license() to add a license
There are also other flavours of function like git_*() and pr_*() to work with version control and proj_*() for working with RStudio Projects.
I focused this year on making different types of package. {usethis} made it much easier to develop:
- {altcheckr} to read and assess image alt text from web pages
- {oystr} to handle London travel-history data from an Oyster card
- {gdstheme} to use a {xaringan} presentation theme and template
- {blogsnip} to insert blog-related code snippets via an RStudio addin (there's even a use_addin() function to create the all-important inst/rstudio/addins.dcf file)
For more package-development info, I recommend Emil Hvitfeldt's {usethis} workflow, as well as Karl Broman's R Package Primer and Hadley Wickham's R Packages book. To help me remember this stuff, I also wrote some slides about developing a package from scratch with {usethis} functions.
{drake}
Your analysis has got 12 input data files. They pass through 15 functions. There are some computationally intensive, long-running processes. Plots and tables are produced and R Markdown files are rendered. How do you keep on top of this? Is it enough to have a set of numbered script files (01_read.R, etc) or a single script file that sources the rest? What if something changes? Do you have to re-run everything from scratch?
You need a workflow manager. Save yourself some hassle and use Will Landau's {drake} package, backed by rOpenSci's peer review process. {drake} 'remembers' all the dependencies between files and only re-runs what needs to be re-run if any errors are found or changes are made. It also provides visualisations of your workflow and allows for high-performance computing.
In short, you:
- Supply the steps of your analysis as functions to drake_plan(), which generates a data frame of commands (functions) to operate over a set of targets (objects)
- Run make() on your plan to run the steps and generate the outputs
- If required, make changes anywhere in your workflow and re-make() the plan; {drake} will only re-run things that are dependent on what you changed
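Those steps can be sketched with a toy plan. The target names (small_cars, model, slope) are invented for illustration and use the built-in mtcars data; a real plan would call your own functions. Note that make() writes a hidden .drake/ cache folder into the working directory.

```r
library(drake)

# Illustrative plan: each target (left) is built by its command (right).
# {drake} works out that model depends on small_cars, and slope on model.
plan <- drake_plan(
  small_cars = mtcars[mtcars$cyl == 4, ],
  model      = lm(mpg ~ wt, data = small_cars),
  slope      = coef(model)[["wt"]]
)

plan          # a data frame with 'target' and 'command' columns

make(plan)    # builds small_cars, then model, then slope
readd(slope)  # read a finished target back from the cache

make(plan)    # nothing has changed, so nothing needs rebuilding
```

If you edited the command for model and ran make(plan) again, I'd expect model and slope to be rebuilt while small_cars is left alone.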
Below is an extreme example from a happy customer (click through to the image if you can't see the embedded tweet). Each point on the graph is an object or function; black ones are out of date and will be updated when make() is next run.
I'm *so* glad {drake} is tracking those dependencies between #rstats computations for me. pic.twitter.com/QsqCAH8Kg7
— FrederikAust@fediscience.org (@FrederikAust) December 12, 2019
It's hard to do {drake} justice in just a few paragraphs, but luckily it's one of the best-documented packages out there. Take a look at:
- the {drake} rOpenSci website
- the thorough user manual
- the learndrake GitHub repo, which can be launched in the cloud
- the drakeplanner Shiny app
- Will's {drake} examples page
- this rOpenSci community call
- a Journal of Open Source Software (JOSS) paper
- more things listed in the documentation section of the user manual
I wrote about {drake} earlier in the year and made a demo and some slides. I think it could be useful for reproducibility of statistical publications in particular.
{purrr}
You want to apply a function over the elements of some list or vector.
The map() family of functions from the {purrr} package, by Lionel Henry and Hadley Wickham of RStudio, has a concise and consistent syntax for doing this.
You can choose what gets returned from your iterations by selecting the appropriate map_*() variant: map() for a list, map_df() for a data frame, map_chr() for a character vector and so on. Here's a trivial example that counts the number of Street Fighter characters from selected countries. First, the list:
# Create the example list
street_fighter <- list(
china = "Chun Li", japan = c("Ryu", "E Honda"),
usa = c("Ken", "Guile", "Balrog"), `???` = "M Bison"
)
street_fighter # take a look at the list
## $china
## [1] "Chun Li"
##
## $japan
## [1] "Ryu" "E Honda"
##
## $usa
## [1] "Ken" "Guile" "Balrog"
##
## $`???`
## [1] "M Bison"
Now to map the length() function to each element of the list and return a named integer vector.
library(purrr) # load the package
# Get the length of each list element
purrr::map_int(
street_fighter, # list
length # function
)
## china japan usa ???
## 1 2 3 1
But what if you want to iterate over two or more elements? You can use map2() or pmap(). And what if you want to get the side effects? walk() and pwalk().
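As a rough sketch of those variants, here's map2_chr() iterating over two vectors in parallel, pmap_chr() iterating over a list of three, and walk() called purely for its side effect. The fighters, moves and hit counts are made-up example data.

```r
library(purrr)

# Made-up example data: three parallel vectors
fighters <- c("Ryu", "Ken", "Guile")
moves    <- c("Hadouken", "Shoryuken", "Sonic Boom")
hits     <- c(3L, 2L, 1L)

# map2_*(): iterate over two inputs together (.x and .y)
shouts <- map2_chr(fighters, moves, ~ paste0(.x, ": ", .y, "!"))
shouts

# pmap_*(): iterate over any number of inputs, matched by name
combos <- pmap_chr(
  list(name = fighters, move = moves, n = hits),
  function(name, move, n) paste(name, "lands", n, "x", move)
)
combos

# walk(): call a function for its side effect (printing here);
# the input is returned invisibly rather than a list of results
walk(shouts, print)
```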
{purrr} is also great for working with data frames with columns that contain lists (listcols), like the starwars data from the {dplyr} package. Let's use the length() function again, but in the context of a listcol, to get the characters in the most films.
# Load packages
suppressPackageStartupMessages(library(dplyr))
library(purrr)
# map() a listcol within a mutate() call
starwars %>%
mutate(films_count = map_int(films, length)) %>%
select(name, films, films_count) %>%
arrange(desc(films_count)) %>% head()
## # A tibble: 6 x 3
## name films films_count
## <chr> <list> <int>
## 1 R2-D2 <chr [7]> 7
## 2 C-3PO <chr [6]> 6
## 3 Obi-Wan Kenobi <chr [6]> 6
## 4 Luke Skywalker <chr [5]> 5
## 5 Leia Organa <chr [5]> 5
## 6 Chewbacca <chr [5]> 5
Why not just write a loop or use the *apply functions? Jenny Bryan has a good {purrr} tutorial that explains why you might consider either choice. Basically, do what you feel; I like the syntax consistency and the ability to predict what function I need based on its name.
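For comparison, here's a small sketch of the same count done three ways, reusing the Street Fighter list from earlier. It isn't meant to settle the argument, just to show that the results match and how the syntax differs.

```r
library(purrr)

street_fighter <- list(
  china = "Chun Li", japan = c("Ryu", "E Honda"),
  usa = c("Ken", "Guile", "Balrog"), `???` = "M Bison"
)

# 1. A for loop: pre-allocate the output, then fill it in
loop_out <- integer(length(street_fighter))
names(loop_out) <- names(street_fighter)
for (i in seq_along(street_fighter)) {
  loop_out[i] <- length(street_fighter[[i]])
}

# 2. Base R's vapply(): the return type is declared with integer(1)
apply_out <- vapply(street_fighter, length, integer(1))

# 3. {purrr}: the return type is declared in the function name
purrr_out <- map_int(street_fighter, length)

# All three give the same named integer vector
identical(loop_out, apply_out)
identical(apply_out, purrr_out)
```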
Check out the excellent {purrr} cheatsheet for some prompts and visual guidance.
Honourable mentions
{blogdown}
This blog, and I'm sure many others, wouldn't exist without {blogdown} by Yihui Xie. {blogdown} lets you write and render R Markdown files into blog posts via static site generators like Hugo. This is brilliant if you're trying to get R output into a blog post with minimal fuss. The {blogdown} book by Yihui, Amber Thomas and Alison Presmanes Hill is particularly helpful.
{xaringan}
{xaringan} is another great package from Yihui Xie that lets you turn R Markdown into a slideshow using remark.js. It's very customisable via CSS, to the extent that I was able to mimic the house style of my organisation this year. One of my favourite functions [1] is inf_mr() (Infinite Moon Reader), which lets you live-preview your outputs as they're written.
{polite}
Web scraping is ethically dubious if you fail to respect the terms of the sites you're visiting. Dmytro Perepolkin has made it easy to be a good citizen of the internet with the {polite} package, which has just hit version 1.0.0 and is on CRAN (congratulations!). First you introduce yourself to the site with a bow() and collect any information about limits and no-go pages from the robots.txt file, then you can modify search paths with a nod() and collect information from them with a scrape(). Very responsible.
{arsenal}
I've been using the handy [2] {arsenal} package to compare data frames as part of a quality assurance process. First, you supply two data frames to comparedf() to create a 'compare' object. Run diffs() on that object to create a new data frame where each row is a mismatch, given a tolerance, with columns for the location and values that are causing problems. We managed to quality assure nearly a million values with this method in next to no time. Check out their vignette on how to do this.
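As a tiny sketch of that process, here are two made-up data frames (before and after) that differ in a single value. They're matched on a hypothetical id column and compared with the default tolerance.

```r
library(arsenal)

# Made-up data: 'after' differs from 'before' in one value (id 2)
before <- data.frame(id = 1:4, value = c(1.0, 2.0, 3.0, 4.0))
after  <- data.frame(id = 1:4, value = c(1.0, 2.5, 3.0, 4.0))

# Build the comparison object, matching rows on the id column
cmp <- comparedf(before, after, by = "id")

# One row per mismatch, with the variable, id and both values
mismatches <- diffs(cmp)
mismatches
```

Calling summary(cmp) should also give a fuller report of what was and wasn't shared between the two data frames.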
Bonus!
{govdown}
Aha, well done for reading this far. As a bonus, I'm calling out Duncan Garmonsway's {govdown} package. Duncan grappled with the complexities of things like Pandoc and Lua filters to build a package that applies the accessibility-friendly GOV.UK design system to R Markdown. This means you can create things like the Reproducible Analytical Pipelines (RAP) website in the style of GOV.UK. Endorsed by Yihui Xie himself! Check out Duncan's {tidyxl} and {unpivotr} packages for handling nightmare Excel files while you're at it.
Session info
## [1] "Last updated 2020-01-02"
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.6.1 (2019-07-05)
## os macOS Sierra 10.12.6
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_GB.UTF-8
## ctype en_GB.UTF-8
## tz Europe/London
## date 2020-01-02
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
## backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.0)
## blogdown 0.17 2019-11-13 [1] CRAN (R 3.6.0)
## bookdown 0.16 2019-11-22 [1] CRAN (R 3.6.0)
## cli 2.0.0 2019-12-09 [1] CRAN (R 3.6.1)
## crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
## digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.0)
## dplyr * 0.8.3 2019-07-04 [1] CRAN (R 3.6.0)
## emo 0.0.0.9000 2019-12-23 [1] Github (hadley/emo@3f03b11)
## evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
## fansi 0.4.0 2018-10-05 [1] CRAN (R 3.6.0)
## glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
## htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.0)
## knitr 1.26 2019-11-12 [1] CRAN (R 3.6.0)
## lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.6.0)
## magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
## pillar 1.4.3 2019-12-20 [1] CRAN (R 3.6.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
## purrr * 0.3.3 2019-10-18 [1] CRAN (R 3.6.0)
## R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.0)
## Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.0)
## rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.0)
## rmarkdown 2.0 2019-12-12 [1] CRAN (R 3.6.0)
## rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.6.0)
## sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
## stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
## tibble 2.1.3 2019-06-06 [1] CRAN (R 3.6.0)
## tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.6.0)
## utf8 1.1.4 2018-05-24 [1] CRAN (R 3.6.0)
## vctrs 0.2.1 2019-12-17 [1] CRAN (R 3.6.1)
## withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
## xfun 0.11 2019-11-12 [1] CRAN (R 3.6.0)
## yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0)
## zeallot 0.1.0 2018-01-28 [1] CRAN (R 3.6.0)
##
## [1] /Users/matt.dray/Library/R/3.6/library
## [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
[1] Along with yolo: true, of course.
[2] Unlike Arsenal FC in 2019, rofl.