Chapter 9 Data

Sometimes it is necessary to include a dataset in a package. For example, you might need a dataset with a specific format to run the examples for your functions, or to test them. In other cases, the package itself requires a specific dataset in order to run. Depending on how the dataset is used, there are different ways to include it in the package:

9.1 Example Datasets

An example dataset is visible to users: they can load it and use it as input to the functions in your package, or to functions from other packages. Example datasets are stored in the data subdirectory. Several formats are possible, but *.rda or *.RData are recommended. It is normally best to distribute large datasets as .rda images prepared by save(..., compress = TRUE).
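As a minimal sketch of what a compressed .rda image is, the snippet below saves a small data frame with save() and reads it back with load(). The object name example_df and the temporary file are assumptions made for illustration; in a real package the file would live in the data subdirectory.

```r
# Hypothetical dataset, used only for this illustration.
example_df <- data.frame(id = 1:5, names = LETTERS[1:5])

# Save it as a compressed .rda image. A temporary file stands in for
# a path such as data/example_df.rda inside a package.
rda_path <- tempfile(fileext = ".rda")
save(example_df, file = rda_path, compress = TRUE)

# load() restores the object under its original name. Loading into a
# fresh environment avoids overwriting anything in the workspace.
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)

loaded_names                                  # name(s) of the restored objects
all.equal(restored$example_df, example_df)    # compare with the original
```

Note that an .rda file stores the object together with its name, so the name chosen at save time is the name users will see.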

Suppose we created a dataset mydataset:

mydataset <- data.frame(id=1:5, names=LETTERS[1:5])

To add this dataset to the package (so that other users can load it from our package), we can execute:

usethis::use_data(mydataset)
✓ Setting active project to '/projectnb2/krcs/rpackage/myutils'
✓ Adding 'R' to Depends field in DESCRIPTION
✓ Creating 'data/'
✓ Saving 'mydataset' to 'data/mydataset.rda'
• Document your data (see 'https://r-pkgs.org/data.html')

By default, the DESCRIPTION file of a package has the following line:

LazyData: true

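This means that when someone loads our package (using the library() function), the dataset is not loaded into memory right away. It is loaded lazily, the first time it is used, and it is available by name without a call to data().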
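The idea of lazy loading can be illustrated with base R's delayedAssign(), which binds a name to an expression that is only evaluated on first use. This is a rough analogy under stated assumptions, not the package mechanism itself; the names lazy_dataset and built are hypothetical.

```r
# Flag that records whether the dataset has actually been built.
built <- FALSE

# Bind the name lazy_dataset to an expression; nothing is evaluated yet.
delayedAssign("lazy_dataset", {
  built <- TRUE
  data.frame(id = 1:5, names = LETTERS[1:5])
})

built_before_use <- built       # the data frame has not been created yet
n_rows <- nrow(lazy_dataset)    # first access forces the evaluation
built_after_use <- built        # the data frame exists from now on

c(before = built_before_use, after = built_after_use)
```

The practical consequence is the same as with LazyData: true: a dataset that is never touched costs no memory.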

If we would like to keep a record of how this dataset was created, we should also run the following function:

usethis::use_data_raw("mydataset")
✓ Writing 'data-raw/mydataset.R'
• Modify 'data-raw/mydataset.R'
• Finish the data preparation script in 'data-raw/mydataset.R'
• Use `usethis::use_data()` to add prepared data to package

This function creates the file mydataset.R in the data-raw subdirectory, where we can record how the dataset was generated. Usually, we do not want the script that creates the dataset to be included in the built package, so the data-raw directory is added to the .Rbuildignore file.
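A data preparation script for the dataset above might look like the following sketch. The guard around use_data() is an assumption added only so that the snippet can also be run outside a package project; a real data-raw script would normally call usethis::use_data() directly.

```r
## data-raw/mydataset.R
# Code that prepares the `mydataset` dataset.
mydataset <- data.frame(id = 1:5, names = LETTERS[1:5])

# Inside the package project this writes data/mydataset.rda.
# overwrite = TRUE lets the script be re-run after the data changes.
# The guard below exists only so this sketch can run outside a package.
if (file.exists("DESCRIPTION") && requireNamespace("usethis", quietly = TRUE)) {
  usethis::use_data(mydataset, overwrite = TRUE)
}
```

Keeping the script next to the package source makes the dataset reproducible: anyone can re-run it to regenerate data/mydataset.rda.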

9.2 Data documentation

In the output of the use_data() function, we can see the following message: “Document your data.”

First we need to create a data.R file in the R subdirectory:

usethis::use_r("data")

Then, for each dataset we store in the data subdirectory, we include a documentation block:

#' First five letters of the alphabet
#'
#' A dataset containing the first five letters of the English alphabet.
#'
#' @format A data frame with 5 rows and 2 variables:
#' \describe{
#'   \item{id}{integer from 1 to 5}
#'   \item{names}{letter, from A to E}
#'   ...
#' }
"mydataset"

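Before writing the @format section, it can help to inspect the dataset so that the documented dimensions and column types match the data. The following is a small sketch using only base R; stringsAsFactors = FALSE is set explicitly here (an assumption, not part of the original example) so that the result does not depend on the R version.

```r
# Recreate the example dataset from this chapter.
mydataset <- data.frame(id = 1:5, names = LETTERS[1:5],
                        stringsAsFactors = FALSE)

# Number of rows and columns, for "A data frame with ... rows and ... variables".
dim(mydataset)

# Column names and classes, one \item{}{} entry per column.
column_classes <- vapply(mydataset, class, character(1))
column_classes

# Compact overview of the whole object.
str(mydataset)
```
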
9.3 Internal Datasets

If your package needs a dataset that it uses internally and that does not need to be exposed to users, use the usethis::use_data() function with the argument internal set to TRUE:

usethis::use_data(int_data, internal = TRUE)
✓ Saving 'int_data' to 'R/sysdata.rda'

This dataset will be saved in the R/sysdata.rda file. You do not need to document it. If you would like to keep the code that created the dataset, use the use_data_raw() function as we did for the example dataset above.
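Objects stored in R/sysdata.rda are available to the package's own functions by name, without being exported. The sketch below imitates this outside a package: int_data is defined inline as a stand-in with hypothetical contents, and lookup_name() is a hypothetical package function.

```r
# Stand-in for the object stored in R/sysdata.rda (contents are hypothetical).
int_data <- data.frame(id = 1:5, names = LETTERS[1:5],
                       stringsAsFactors = FALSE)

# Hypothetical package function. Inside a package, int_data would be found
# in the package namespace, with no call to data() or load() needed.
lookup_name <- function(id) {
  int_data$names[match(id, int_data$id)]
}

lookup_name(c(2, 5))   # names for ids 2 and 5
lookup_name(9)         # unknown ids give NA
```
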

9.4 Size of the datasets

If you are planning to publish your package on CRAN, make sure your package data is smaller than 1 MB. Otherwise, you will need to present a strong argument for why a larger dataset is necessary.
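One way to check the size of a dataset before publishing is to save it with different compression methods and compare the resulting file sizes. The sketch below uses a hypothetical object example_df and temporary files instead of the package's data subdirectory; tools::checkRdaFiles() reports the size and compression of existing .rda files.

```r
# Hypothetical dataset, used only for this illustration.
example_df <- data.frame(id = 1:5, names = LETTERS[1:5])

# Size of the object in memory.
format(object.size(example_df), units = "Kb")

# Save the same object with each compression method supported by save().
methods <- c("gzip", "bzip2", "xz")
paths <- vapply(methods, function(m) {
  path <- tempfile(fileext = ".rda")
  save(example_df, file = path, compress = m)
  path
}, character(1))

# File sizes on disk, in bytes, named by compression method.
sizes <- file.size(paths)
names(sizes) <- methods
sizes

# Are all files below the 1 MB guideline?
all(sizes < 1e6)

# Summary of size and compression type for each file.
tools::checkRdaFiles(paths)
```

For a tiny dataset the differences are negligible, but for larger data the choice of compression method can change the file size noticeably.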