Package 'upstartr' reference manual

Title:	Utilities Powering the Globe and Mail's Data Journalism Template
Description:	Core functions necessary for using The Globe and Mail's R data journalism template, 'startr', along with utilities for day-to-day data journalism tasks, such as reading and writing files, producing graphics and cleaning up datasets.
Authors:	Tom Cardoso [aut, cre] (creator and maintainer), Michael Pereira [ctb], The Globe and Mail Inc. [cph]
Maintainer:	Tom Cardoso <[email protected]>
License:	MIT + file LICENSE
Version:	0.1.2
Built:	2025-03-12 05:33:27 UTC
Source:	https://github.com/globeandmail/upstartr

Opposite of %in%

Description

Given vectors A and B, returns only the entities from vector A that don't occur in vector B.

Usage

x %not_in% table

Arguments

`x`	The vector you want to check.
`table`	The vector in which to look up the values of x.

Value

Same form of return as %in%, except it will return only elements on the left-hand side that aren't present on the right-hand side.

Examples

c(1, 2, 3, 4, 5) %not_in% c(4, 5, 6, 7, 8)
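
As a rough illustration of the idea, the sketch below negates base R's %in% and then uses the result to subset the left-hand vector. The operator name `%nin_sketch%` is hypothetical, and this is not upstartr's own implementation, which the manual does not show.

```r
# Hypothetical stand-in for illustration only; not upstartr's own code.
`%nin_sketch%` <- function(x, table) !(x %in% table)

a <- c(1, 2, 3, 4, 5)
b <- c(4, 5, 6, 7, 8)

# Logical vector with one entry per element of `a`
flags <- a %nin_sketch% b

# Subsetting with the flags keeps the elements of `a` that are absent from `b`
only_in_a <- a[flags]
```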

Runs the pre-processing step on a startr project.

Description

The pre-processing step, run as part of upstartr::run_process during the process.R stage of a startr project, logs all variables currently in the global environment, which will then be removed during the post-processing step to keep the startr environment unpolluted.

Usage

begin_processing(should_clean_processing_variables = TRUE)

Arguments

`should_clean_processing_variables`	Either TRUE, FALSE, or pulled from the environment if set.

Value

A list of all environment variables present before the function was run

Index values

Description

Indexes a numeric vector to its first value. By default, the index base is 0, which turns regular values into percentage change from the first value. In some cases, you may want to index to a different base, such as 100 when looking at financial data.

Usage

calc_index(m, base = 0)

Arguments

`m`	Numeric vector to index to first value.
`base`	Base to index against. (Default: 0)

Value

A vector of indexed values.

Examples

calc_index(c(5, 2, 8, 17, 7, 3, 1, -4))
calc_index(c(5, 2, 8, 17, 7, 3, 1, -4), base = 100)
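
The arithmetic behind indexing can be sketched in base R as follows. The helper name `index_to_first` and both formulas are assumptions based on one reading of the description (fractional change from the first value for base 0, rescaling so the first value equals the base otherwise). The manual does not spell out the exact computation.

```r
# Hypothetical helper; the formulas are an assumed reading of the description.
index_to_first <- function(m, base = 0) {
  first_value <- m[1]
  if (base == 0) {
    # Change relative to the first value, expressed as a fraction
    (m - first_value) / first_value
  } else {
    # Rescale so that the first value equals `base`
    m / first_value * base
  }
}

pct_change <- index_to_first(c(5, 2, 8))
rebased <- index_to_first(c(5, 2, 8), base = 100)
```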

Calculate mode

Description

Calculates the mode of a given vector.

Usage

calc_mode(x)

Arguments

`x`	Any kind of vector — numeric, character, logical.

Value

The mode(s) of that vector.

Examples

calc_mode(c(1, 1, 2, 3, 4))
calc_mode(c('the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog'))
calc_mode(c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))
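
For readers curious how a mode can be computed for any vector type, here is a minimal base R sketch. The name `mode_sketch` is hypothetical, and returning every tied value is an assumption drawn from the "mode(s)" wording above.

```r
# Hypothetical helper; not upstartr's own code.
mode_sketch <- function(x) {
  distinct <- unique(x)
  # Count how often each distinct value occurs
  counts <- tabulate(match(x, distinct))
  # Keep every value that reaches the highest count (ties included)
  distinct[counts == max(counts)]
}

numeric_mode <- mode_sketch(c(1, 1, 2, 3, 4))
tied_modes <- mode_sketch(c("a", "b", "b", "a", "c"))
logical_mode <- mode_sketch(c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))
```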

Cleans up column names by forcing them into tidyverse style

Description

Zero-configuration function that takes unwieldy column names and coerces them into tidyverse-styled column names.

Usage

clean_columns(x)

Arguments

`x`	A vector of column names.

Value

A character vector of column names.

Examples

clean_columns(c("Date of Purchase", "Item No.", "description", "",
  "Transaction at Jane's Counter?", "Auditing - Worth it?"))

Combine CSVs in a directory

Description

Given a directory (and, optionally, a pattern to search against), concatenate all CSV files into a single tibble.

Usage

combine_csvs(dir, pattern = "*.csv", ...)

Arguments

`dir`	Path to the directory to look at for files.
`pattern`	Pattern to use for detecting files. (Default: '*.csv')
`...`	Parameters to pass to `readr::read_csv`.

Value

A tibble of concatenated data from multiple CSV files.
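
The general pattern (list the matching files, read each one, bind the rows) can be sketched with base R alone. The function name `combine_csvs_sketch` is hypothetical, it returns a plain data frame rather than a tibble, and it assumes the pattern is a glob, since the default '*.csv' reads like one.

```r
# Hypothetical base R equivalent; returns a data frame rather than a tibble.
combine_csvs_sketch <- function(dir, pattern = "*.csv", ...) {
  # list.files() expects a regular expression, so translate the glob first
  files <- list.files(dir, pattern = utils::glob2rx(pattern), full.names = TRUE)
  tables <- lapply(files, utils::read.csv, ...)
  do.call(rbind, tables)
}

# Demo with two small files in a temporary directory
demo_dir <- file.path(tempdir(), "combine_csvs_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, city = c("Toronto", "Ottawa")),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3L, city = "Halifax"),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

combined <- combine_csvs_sketch(demo_dir)
```

Row-binding this way requires every file to share the same columns.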

Combine Excel files in a directory

Description

Given a directory (and, optionally, a pattern to search against), concatenate all Excel files into a single tibble.

Usage

combine_excels(dir, pattern = "*.xls[x]?", all_sheets = FALSE, ...)

Arguments

`dir`	Path to the directory to look at for files.
`pattern`	Pattern to use for detecting files. (Default: '*.xls[x]?')
`all_sheets`	Should this function also concatenate all sheets within each Excel file into one long tibble? (Default: FALSE)
`...`	Parameters to pass to `readxl::read_excel`.

Value

A tibble of concatenated data from multiple Excel files.

Converts a character vector to logicals

Description

Takes a character vector and converts it to logicals, optionally using a vector of patterns to match against for truthy and falsy values.

Usage

convert_str_to_logical(
  x,
  truthy = c("T", "TRUE", "Y", "YES"),
  falsy = c("F", "FALSE", "N", "NO")
)

Arguments

`x`	A character vector.
`truthy`	A vector of case-insensitive truthy values to turn into TRUE.
`falsy`	A vector of case-insensitive falsy values to turn into FALSE.

Value

A logical vector.

Examples

convert_str_to_logical(c('YES', 'Y', 'No', 'N', 'YES', 'yes', 'no', 'Yes', 'NO', 'Y', 'y'))
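
A minimal base R sketch of case-insensitive matching against truthy and falsy lists is shown below. The name `str_to_logical_sketch` is hypothetical, and mapping unmatched strings to NA is an assumption, since the manual does not say what happens to them.

```r
# Hypothetical helper; not upstartr's own code.
str_to_logical_sketch <- function(x,
                                  truthy = c("T", "TRUE", "Y", "YES"),
                                  falsy = c("F", "FALSE", "N", "NO")) {
  upper <- toupper(x)
  result <- rep(NA, length(x))  # assumption: unmatched values become NA
  result[upper %in% toupper(truthy)] <- TRUE
  result[upper %in% toupper(falsy)] <- FALSE
  result
}

flags <- str_to_logical_sketch(c("YES", "No", "y", "maybe"))
```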

Get path within cached data directory.

Description

Constructs a path within startr's data/cache/ directory.

Usage

dir_data_cache(...)

Arguments

`...`	Any number of path strings, passed in the same fashion as here::here.

Value

A path string.

Get path within disposable data outputs directory.

Description

Constructs a path within startr's data/out/ directory.

Usage

dir_data_out(...)

Arguments

`...`	Any number of path strings, passed in the same fashion as here::here.

Value

A path string.

Get path within processed data directory.

Description

Constructs a path within startr's data/processed/ directory.

Usage

dir_data_processed(...)

Arguments

`...`	Any number of path strings, passed in the same fashion as here::here.

Value

A path string.

Get path within raw data directory.

Description

Constructs a path within startr's data/raw/ directory.

Usage

dir_data_raw(...)

Arguments

`...`	Any number of path strings, passed in the same fashion as here::here.

Value

A path string.

Construct an arbitrary path.

Description

Convenience function that constructs a path. Wraps here::here.

Usage

dir_path(...)

Arguments

`...`	Any number of path strings, passed in the same fashion as here::here.

Value

A path string.
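
The way the directory helpers layer a fixed prefix on top of a general path builder can be sketched with base R's file.path. Everything here is hypothetical: `project_root` stands in for the root that here::here would detect, and the two `_sketch` functions only mimic the shape of dir_path and dir_data_raw.

```r
# Hypothetical helpers; `project_root` stands in for the root here::here would find.
project_root <- "my-startr-project"

# General builder: joins any number of path strings under the project root
dir_path_sketch <- function(...) file.path(project_root, ...)

# Directory-specific helper: same builder with a fixed "data/raw" prefix
dir_data_raw_sketch <- function(...) dir_path_sketch("data", "raw", ...)

raw_csv <- dir_data_raw_sketch("survey", "responses.csv")
```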

Get path within plots directory.

Description

Constructs a path within startr's plots/ directory.

Usage

dir_plots(...)

Arguments

`...`	Any number of path strings, passed in the same fashion as here::here.

Value

A path string.

Get path within reports directory.

Description

Constructs a path within startr's reports/ directory.

Usage

dir_reports(...)

Arguments

`...`	Any number of path strings, passed in the same fashion as here::here.

Value

A path string.

Get path within scrape directory.

Description

Constructs a path within startr's scrape/ directory.

Usage

dir_scrape(...)

Arguments

`...`	Any number of path strings, passed in the same fashion as here::here.

Value

A path string.

Get path within src directory

Description

Constructs a path within startr's main R/ directory.

Usage

dir_src(...)

Arguments

`...`	Any number of path strings, passed in the same fashion as here::here.

Value

A path string.

Runs the post-processing step on a startr project.

Description

The post-processing step, run as part of upstartr::run_process during the process.R stage of a startr project, removes all variables saved by upstartr::begin_processing and then beeps to announce it's finished.

Usage

end_processing(
  should_clean_processing_variables = TRUE,
  should_beep = TRUE,
  logged_vars = NULL
)

Arguments

`should_clean_processing_variables`	Either TRUE, FALSE, or pulled from the environment if set.
`should_beep`	Either TRUE, FALSE, or pulled from the environment if set.
`logged_vars`	A list of variables that existed before the processing step began.

Value

No return value, called for side effects
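
The log-then-clean pattern behind the pre- and post-processing steps can be sketched in base R. This sketch assumes the intent is to drop whatever the processing step created while keeping what existed beforehand. It uses a throwaway environment instead of the global one, and all variable names are hypothetical.

```r
# Work in a throwaway environment so the demo does not touch the global one
env <- new.env()
assign("config_value", 42, envir = env)

# Pre-processing: log the names that already exist
logged_vars <- ls(envir = env)

# Processing: temporary objects get created along the way
assign("tmp_raw", data.frame(x = 1:3), envir = env)
assign("tmp_lookup", c(a = 1), envir = env)

# Post-processing: remove everything that was not logged beforehand
created <- setdiff(ls(envir = env), logged_vars)
rm(list = created, envir = env)

remaining <- ls(envir = env)
```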

Initialize startr project

Description

Used to initialize a startr template for analysis. Will enforce some startr-required standards for analysis (such as removing scientific notation, setting timezones, and writing some project configs to 'options').

Usage

initialize_startr(
  author = "Firstname Lastname <[email protected]>",
  title = "startr",
  scipen = 999,
  timezone = "America/Toronto",
  should_render_notebook = FALSE,
  should_process_data = TRUE,
  should_timestamp_output_files = FALSE,
  should_clean_processing_variables = TRUE,
  should_beep = TRUE,
  set_minimal_graphics_theme = TRUE,
  packages = c()
)

Arguments

`author`	Name and email of the startr project author
`title`	Title of the startr project
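`scipen`	Penalty applied against printing numbers in scientific notation, as in R's 'scipen' option; a high value effectively disables scientific notation. (Default: 999)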
`timezone`	The timezone for analysis. (Default: 'America/Toronto')
`should_render_notebook`	Whether the RMarkdown notebook should be rendered. (Default: FALSE)
`should_process_data`	Whether startr's process step should be run. (Default: TRUE)
`should_timestamp_output_files`	Whether write_excel's output files should be timestamped. (Default: FALSE)
`should_clean_processing_variables`	Whether processing variables should be cleaned from the environment after processing is complete. (Default: TRUE)
`should_beep`	Whether startr should beep after tasks like processing or knitting RMarkdown notebooks. (Default: TRUE)
`set_minimal_graphics_theme`	Whether the minimal graphics theme should be used. (Default: TRUE)
`packages`	Vector of package names from CRAN, GitHub or Bioconductor to be installed. If using GitHub, package names should be in the format 'user/repo', e.g. 'globeandmail/upstartr'.

Value

No return value, called for side effects
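
To show what two of these settings do in plain R, the sketch below sets the 'scipen' option and the TZ environment variable by hand and then restores the previous values. How upstartr applies them internally is not shown in the manual, so the base calls here are an assumption.

```r
# Remember the current settings so the demo can restore them afterwards
old_options <- options(scipen = 999)
old_tz <- Sys.getenv("TZ", unset = NA)
Sys.setenv(TZ = "America/Toronto")

# With a large scipen penalty, big numbers print in full
formatted <- format(1e10)
tz_during <- Sys.getenv("TZ")

# Restore the previous state
options(old_options)
if (is.na(old_tz)) Sys.unsetenv("TZ") else Sys.setenv(TZ = old_tz)
```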

Opposite of is.na

Description

Given a vector, returns TRUE for all entities that aren't NA.

Usage

not.na(x)

Arguments

`x`	A vector to check for NAs against.

Value

A vector of elements that aren't NA

Examples

not.na(c(1, NA, 2, NA))

Opposite of is.null

Description

Given a list, returns TRUE for all entities that aren't NULL.

Usage

not.null(x)

Arguments

`x`	A vector to check for NULLs against.

Value

Elements that aren't NULL

Examples

not.null(list(1, NULL, 2, NULL))

Combine all sheets in an Excel file

Description

Reads all sheets in a single Excel file using readxl::read_excel and concatenates them into a single, long tibble.

Usage

read_all_excel_sheets(filepath, ...)

Arguments

`filepath`	Path to the Excel file.
`...`	Parameters to pass to `readxl::read_excel`.

Value

A tibble of data concatenated from all sheets in an Excel file.

Removes non-UTF-8 characters

Description

Removes non-UTF-8 characters in a given character vector.

Usage

remove_non_utf8(x)

Arguments

`x`	A character vector.

Value

A character vector of strings without non-UTF-8 characters.

Examples

non_utf8 <- 'fa\xE7ile'
Encoding(non_utf8) <- 'latin1'
remove_non_utf8(non_utf8)

Renders out an RMarkdown notebook.

Description

Renders an RMarkdown notebook using upstartr::render_notebook and then beeps.

Usage

render_notebook(notebook_file, output_dir = dir_reports())

Arguments

`notebook_file`	The path for the RMarkdown notebook you're rendering.
`output_dir`	The directory to write the outputs to.

Value

No return value, called for side effects

Runs the analysis step for a startr project.

Description

Sources analyze.R.

Usage

run_analyze()

Value

No return value, called for side effects

Configures an existing startr project

Description

Sources config.R and functions.R in turn.

Usage

run_config()

Value

No return value, called for side effects

Runs the notebook rendering step for a startr project.

Description

Renders an RMarkdown notebook using upstartr::render_notebook and then beeps.

Usage

run_notebook(
  filename = "notebook.Rmd",
  should_beep = TRUE,
  should_render_notebook = TRUE
)

Arguments

`filename`	The filename for the RMarkdown notebook you want to render.
`should_beep`	Either TRUE, FALSE, or pulled from the environment if set.
`should_render_notebook`	Either TRUE, FALSE, or pulled from the environment if set.

Value

No return value, called for side effects

Runs the processing step on a startr project.

Description

Runs the pre-processing step (see upstartr::begin_processing for details), then sources process.R, then runs the post-processing step (see upstartr::end_processing for details).

Usage

run_process(should_process_data = TRUE)

Arguments

`should_process_data`	Either TRUE, FALSE, or pulled from the environment if set.

Value

No return value, called for side effects

Runs the visualization step for a startr project.

Description

Sources visualize.R.

Usage

run_visualize()

Value

No return value, called for side effects

Create a continuous x-axis scale using percentages

Description

Convenience function to return a scale_x_continuous function using percentage labels.

Usage

scale_x_percent(...)

Arguments

`...`	All your usual continuous x-axis scale parameters.

Value

A scale object to be consumed by ggplot2.

Create a continuous y-axis scale using percentages

Description

Convenience function to return a scale_y_continuous function using percentage labels.

Usage

scale_y_percent(...)

Arguments

`...`	All your usual continuous y-axis scale parameters.

Value

A scale object to be consumed by ggplot2.

Simplifies strings for analysis

Description

Takes a character vector and "simplifies" it by uppercasing, removing most non-alphabetic (or non-alphanumeric) characters, removing accents, forcing UTF-8 encoding, removing excess spaces, and optionally removing stop words. Useful in cases where you have two large vectors of person or business names you need to compare, but where misspellings may be common.

Usage

simplify_string(
  x,
  alpha = TRUE,
  digits = FALSE,
  unaccent = TRUE,
  utf8_only = TRUE,
  case = "upper",
  trim = TRUE,
  stopwords = NA
)

Arguments

`x`	A character vector.
`alpha`	Should alphabetic characters be included in the cleaned up string? (Default: TRUE)
`digits`	Should digits be included in the cleaned up string? (Default: FALSE)
`unaccent`	Should characters be de-accented? (Default: TRUE)
`utf8_only`	Should characters be UTF-8 only? (Default: TRUE)
`case`	What casing should characters use? Can be one of 'upper', 'lower', 'sentence', 'title', or 'keep' for the existing casing (Default: 'upper')
`trim`	Should strings be trimmed of excess spaces? (Default: TRUE)
`stopwords`	An optional vector of stop words to be removed.

Value

A character vector of simplified strings.

Examples

simplify_string(c('J. Jonah Jameson', 'j jonah jameson',
  'j   jonah 123   jameson', 'J Jónah Jameson...'))
simplify_string(c('123 Business Inc.', '123 business incorporated',
  '123 ... Business ... Inc.'), digits = TRUE, stopwords = c('INC', 'INCORPORATED'))
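
A stripped-down base R sketch of the core steps (uppercase, drop unwanted characters, remove stop words, collapse spaces) is given below. The name `simplify_sketch` is hypothetical, the sketch leaves out de-accenting and UTF-8 handling, and replacing removed characters with a space is an assumption rather than documented behaviour.

```r
# Hypothetical helper covering only part of what the description lists.
simplify_sketch <- function(x, digits = FALSE, stopwords = NULL) {
  out <- toupper(x)
  # Replace everything except letters (and optionally digits) with a space
  drop_pattern <- if (digits) "[^A-Z0-9 ]" else "[^A-Z ]"
  out <- gsub(drop_pattern, " ", out)
  # Drop stop words that appear as whole words
  for (word in stopwords) {
    out <- gsub(paste0("\\b", word, "\\b"), " ", out, perl = TRUE)
  }
  # Collapse runs of spaces and trim the ends
  trimws(gsub(" +", " ", out))
}

people <- simplify_sketch(c("J. Jonah Jameson", "j jonah jameson",
                            "j   jonah 123   jameson"))
firms <- simplify_sketch(c("123 Business Inc.", "123 business incorporated",
                           "123 ... Business ... Inc."),
                         digits = TRUE, stopwords = c("INC", "INCORPORATED"))
```

Once the variants collapse to a single form, they can be compared or joined directly.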

De-accents strings

Description

Replace accented characters with their non-accented versions. Useful when dealing with languages like French, Spanish or Portuguese, where accents can lead to compatibility issues during data analysis.

Usage

unaccent(x, remove.nonconverted = FALSE, ...)

Arguments

`x`	A character vector.
`remove.nonconverted`	Should the function remove unmapped encodings? (Default: FALSE)
`...`	Parameters passed to `textclean::replace_non_ascii`

Value

A character vector of strings without accents.

Examples

unaccent('façile')
unaccent('Montréal')

Write out an Excel file with minimal configuration

Description

Takes a tibble or dataframe variable and saves it out as an Excel file using the variable name as the filename.

Usage

write_excel(
  variable,
  output_dir = dir_data_out(),
  should_timestamp_output_files = FALSE
)

Arguments

`variable`	A tibble or dataframe object.
`output_dir`	The directory to save the file out to.
`should_timestamp_output_files`	Either TRUE, FALSE, or pulled from the environment if set.

Value

No return value, called for side effects
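
Deriving a filename from a variable's name is commonly done in R with deparse(substitute(...)). The sketch below shows that technique with a hypothetical function, `write_named_csv`, which writes a CSV to a temporary directory so that it needs no Excel dependency. Whether upstartr uses the same mechanism is an assumption.

```r
# Hypothetical helper; writes a CSV rather than an Excel file.
write_named_csv <- function(variable, output_dir = tempdir()) {
  # Capture the name of the object as it was written in the call
  name <- deparse(substitute(variable))
  path <- file.path(output_dir, paste0(name, ".csv"))
  utils::write.csv(variable, path, row.names = FALSE)
  invisible(path)
}

quarterly_sales <- data.frame(quarter = 1:4, total = c(10, 12, 9, 15))
written_path <- write_named_csv(quarterly_sales)
```

This only yields a sensible filename when the argument is a plain variable name rather than an expression.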

Write out a ggplot2 graphic with minimal configuration

Description

Takes a ggplot2 object and writes it to disk via ggplot2::ggsave using the variable name as the filename.

Usage

write_plot(variable, format = "png", output_dir = dir_plots(), ...)

Arguments

`variable`	A ggplot2 object.
`format`	The desired format for the plot, be it 'png', 'pdf', etc. Accepts formats you'd pass to `ggplot2::ggsave`'s 'device' parameter.
`output_dir`	The directory to save the plot out to.
`...`	Other settings to pass to ggsave, such as width, height or dpi.

Value

No return value, called for side effects

Write a shapefile to disk

Description

Utility function that wraps sf::st_write, but first removes a previous version of the shapefile if it exists (by default, sf::st_write will throw an error if the file already exists).

Usage

write_shp(shp, path, ...)

Arguments

`shp`	A spatial object.
`path`	The desired filepath for the shapefile.
`...`	Other settings to pass to sf::st_write.

Value

No return value, called for side effects
