Title: | Utilities Powering the Globe and Mail's Data Journalism Template |
---|---|
Description: | Core functions necessary for using The Globe and Mail's R data journalism template, 'startr', along with utilities for day-to-day data journalism tasks, such as reading and writing files, producing graphics and cleaning up datasets. |
Authors: | Tom Cardoso [aut, cre] (creator and maintainer), Michael Pereira [ctb], The Globe and Mail Inc. [cph] |
Maintainer: | Tom Cardoso <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.2 |
Built: | 2024-11-12 05:57:52 UTC |
Source: | https://github.com/globeandmail/upstartr |
Given vectors A and B, returns only the entities from vector A that don't occur in vector B.
x %not_in% table
x |
The vector you want to check. |
table |
Table in which to do lookups against x. |
Same form of return as %in%, except that it returns only the elements on the left-hand side that aren't present on the right-hand side.
c(1, 2, 3, 4, 5) %not_in% c(4, 5, 6, 7, 8)
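The negated lookup can be sketched in base R. This is a minimal illustration, not the package's source: the operator name `%outside%` is hypothetical, and the sketch assumes the result is a logical vector, which is how "same form of return as %in%" reads.

```r
# Hypothetical stand-in operator: TRUE where an element of x is absent from table.
`%outside%` <- function(x, table) !(x %in% table)

a <- c(1, 2, 3, 4, 5)
b <- c(4, 5, 6, 7, 8)

flags <- a %outside% b  # TRUE TRUE TRUE FALSE FALSE
only_in_a <- a[flags]   # 1 2 3
```

Subsetting with the logical result, as in the last line, recovers the elements themselves.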
The pre-processing step, run as part of upstartr::run_process during the process.R stage of a startr project, logs all variables currently in the global environment, which will then be removed during the post-processing step to keep the startr environment unpolluted.
begin_processing(should_clean_processing_variables = TRUE)
should_clean_processing_variables |
Either TRUE, FALSE, or pulled from the environment if set. |
A list of all environment variables present before the function was run
Index numeric vector to first value. By default, the index base will be 0, turning regular values into percentage change. In some cases, you may want to index to a different base, like 100, such as if you're looking at financial data.
calc_index(m, base = 0)
m |
Numeric vector to index to first value. |
base |
Base to index against. (Default: 0) |
A vector of indexed values.
calc_index(c(5, 2, 8, 17, 7, 3, 1, -4))
calc_index(c(5, 2, 8, 17, 7, 3, 1, -4), base = 100)
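One plausible reading of indexing to the first value is sketched below in base R. The function name `index_to_first` is hypothetical, and the exact scaling used by calc_index is not stated here, so this is an assumption: base 0 yields fractional change from the first value, and any other base rescales the series so the first value equals that base.

```r
# Hypothetical sketch, not the package's implementation.
index_to_first <- function(m, base = 0) {
  first <- m[1]
  if (base == 0) {
    (m - first) / first  # change relative to the first value
  } else {
    base * m / first     # series rescaled so the first value equals `base`
  }
}

prices <- c(50, 55, 45, 100)
as_change <- index_to_first(prices)              # 0, 0.1, -0.1, 1
as_index  <- index_to_first(prices, base = 100)  # 100, 110, 90, 200
```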
Calculates the mode of a given vector.
calc_mode(x)
x |
Any kind of vector — numeric, character, logical. |
The mode(s) of that vector.
calc_mode(c(1, 1, 2, 3, 4))
calc_mode(c('the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog'))
calc_mode(c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))
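A mode calculation that works for any vector type can be sketched in base R as follows. The helper name `mode_of` is hypothetical, and returning every tied value is an assumption based on the "mode(s)" wording above.

```r
# Hypothetical sketch: count each distinct value, keep the most frequent one(s).
mode_of <- function(x) {
  distinct <- unique(x)
  counts <- tabulate(match(x, distinct))
  distinct[counts == max(counts)]
}

num_mode <- mode_of(c(1, 1, 2, 3, 4))                          # 1
lgl_mode <- mode_of(c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE))  # TRUE FALSE (a tie)
chr_mode <- mode_of(c("the", "quick", "the", "dog"))           # "the"
```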
Zero-configuration function that takes unwieldy column names and coerces them into tidyverse-styled column names.
clean_columns(x)
x |
A vector of column names. |
A character vector of column names.
clean_columns(c("Date of Purchase", "Item No.", "description", "", "Transaction at Jane's Counter?", "Auditing - Worth it?"))
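To give a rough idea of what coercing names toward a tidyverse style can involve, here is a base R sketch. The helper `to_snake` is hypothetical and only approximates the idea: it lowercases and joins words with underscores, and it does not handle empty or duplicate names, which clean_columns may treat differently.

```r
# Hypothetical sketch: lowercase, turn runs of other characters into "_",
# then strip any leading or trailing underscores.
to_snake <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x, perl = TRUE)
  gsub("^_+|_+$", "", x, perl = TRUE)
}

cleaned_names <- to_snake(c("Date of Purchase", "Item No.", "Auditing - Worth it?"))
# "date_of_purchase" "item_no" "auditing_worth_it"
```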
Given a directory (and, optionally, a pattern to search against), concatenate all CSV files into a single tibble.
combine_csvs(dir, pattern = "*.csv", ...)
dir |
Path to the directory to look at for files. |
pattern |
Pattern to use for detecting files. (Default: '*.csv') |
... |
Parameters to pass to |
A tibble of concatenated data from multiple CSV files.
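The concatenation idea can be sketched with base R alone. This is not how combine_csvs is implemented: it returns a plain data frame rather than a tibble, uses a regular expression instead of the default pattern shown above, and the directory and file names are made up for the demonstration.

```r
# Create two small CSV files in a temporary, hypothetical directory.
csv_dir <- file.path(tempdir(), "combine_csvs_demo")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c("a", "b")),
          file.path(csv_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c("c", "d")),
          file.path(csv_dir, "part2.csv"), row.names = FALSE)

# Find every CSV in the directory, read each one, and stack the rows.
paths <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(paths, read.csv))
```

Stacking with rbind assumes every file shares the same columns.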
Given a directory (and, optionally, a pattern to search against), concatenate all Excel files into a single tibble.
combine_excels(dir, pattern = "*.xls[x]?", all_sheets = FALSE, ...)
dir |
Path to the directory to look at for files. |
pattern |
Pattern to use for detecting files. (Default: '*.xls[x]?') |
all_sheets |
Should this function also concatenate all sheets within each Excel file into one long tibble? (Default: FALSE) |
... |
Parameters to pass to |
A tibble of concatenated data from multiple Excel files.
Takes a character vector and converts it to logicals, optionally using a vector of patterns to match against for truthy and falsy values.
convert_str_to_logical( x, truthy = c("T", "TRUE", "Y", "YES"), falsy = c("F", "FALSE", "N", "NO") )
x |
A character vector. |
truthy |
A vector of case-insensitive truthy values to turn into TRUE. |
falsy |
A vector of case-insensitive falsy values to turn into FALSE. |
A logical vector.
convert_str_to_logical(c('YES', 'Y', 'No', 'N', 'YES', 'yes', 'no', 'Yes', 'NO', 'Y', 'y'))
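The case-insensitive matching described above can be sketched in base R. The helper name `str_to_lgl` is hypothetical, and mapping unmatched strings to NA is an assumption rather than documented behaviour.

```r
# Hypothetical sketch: compare uppercased input against the two lookup vectors.
str_to_lgl <- function(x,
                       truthy = c("T", "TRUE", "Y", "YES"),
                       falsy = c("F", "FALSE", "N", "NO")) {
  upper <- toupper(x)
  out <- rep(NA, length(x))  # logical NA for anything unmatched (assumption)
  out[upper %in% toupper(truthy)] <- TRUE
  out[upper %in% toupper(falsy)] <- FALSE
  out
}

answers <- str_to_lgl(c("YES", "y", "No", "n", "maybe"))
# TRUE TRUE FALSE FALSE NA
```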
Constructs a path within startr's data/cache/ directory.
dir_data_cache(...)
... |
Any number of path strings, passed in the same fashion as |
A path string.
Constructs a path within startr's data/out/ directory.
dir_data_out(...)
... |
Any number of path strings, passed in the same fashion as |
A path string.
Constructs a path within startr's data/processed/ directory.
dir_data_processed(...)
... |
Any number of path strings, passed in the same fashion as |
A path string.
Constructs a path within startr's data/raw/ directory.
dir_data_raw(...)
... |
Any number of path strings, passed in the same fashion as |
A path string.
Convenience function that constructs a path. Wraps here::here.
dir_path(...)
... |
Any number of path strings, passed in the same fashion as |
A path string.
Constructs a path within startr's plots/ directory.
dir_plots(...)
... |
Any number of path strings, passed in the same fashion as |
A path string.
Constructs a path within startr's reports/ directory.
dir_reports(...)
... |
Any number of path strings, passed in the same fashion as |
A path string.
Constructs a path within startr's scrape/ directory.
dir_scrape(...)
... |
Any number of path strings, passed in the same fashion as |
A path string.
Constructs a path within startr's main R/ directory.
dir_src(...)
... |
Any number of path strings, passed in the same fashion as |
A path string.
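The directory helpers above all follow one pattern: a fixed folder prefix plus any further path pieces. A base R sketch of that pattern is shown below. The names `project_root`, `dir_path_sketch` and `dir_data_raw_sketch` are hypothetical, and file.path stands in for here::here so the example has no dependencies.

```r
# Hypothetical project root; a real startr project would resolve its own root.
project_root <- tempdir()

# Generic builder, then a folder-specific helper layered on top of it.
dir_path_sketch <- function(...) file.path(project_root, ...)
dir_data_raw_sketch <- function(...) dir_path_sketch("data", "raw", ...)

raw_file <- dir_data_raw_sketch("survey", "responses.csv")
# ends in "data/raw/survey/responses.csv"
```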
The post-processing step, run as part of upstartr::run_process during the process.R stage of a startr project, removes all variables saved by upstartr::begin_processing and then beeps to announce it's finished.
end_processing( should_clean_processing_variables = TRUE, should_beep = TRUE, logged_vars = NULL )
should_clean_processing_variables |
Either TRUE, FALSE, or pulled from the environment if set. |
should_beep |
Either TRUE, FALSE, or pulled from the environment if set. |
logged_vars |
A list of variables that existed before the processing step began. |
No return value, called for side effects
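The snapshot-then-clean pattern behind the pre- and post-processing steps can be sketched in base R. This assumes the goal is to drop variables created during processing, meaning names absent from the earlier snapshot. It uses a throwaway environment called `workspace` instead of the global environment, and all variable names are hypothetical.

```r
# A stand-in for the global environment, holding one long-lived variable.
workspace <- new.env()
assign("raw_data", data.frame(x = 1:3), envir = workspace)

# Pre-processing: record which names already exist.
logged_vars <- ls(workspace)

# Processing: temporary helpers pile up.
assign("tmp_lookup", c(a = 1), envir = workspace)
assign("tmp_counter", 0L, envir = workspace)

# Post-processing: remove everything that was not in the snapshot.
leftovers <- setdiff(ls(workspace), logged_vars)
rm(list = leftovers, envir = workspace)

remaining <- ls(workspace)  # "raw_data"
```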
Used to initialize a startr template for analysis. Will enforce some startr-required standards for analysis (such as removing scientific notation, setting timezones, and writing some project configs to 'options').
initialize_startr( author = "Firstname Lastname <[email protected]>", title = "startr", scipen = 999, timezone = "America/Toronto", should_render_notebook = FALSE, should_process_data = TRUE, should_timestamp_output_files = FALSE, should_clean_processing_variables = TRUE, should_beep = TRUE, set_minimal_graphics_theme = TRUE, packages = c() )
author |
Name and email of the startr project author |
title |
Title of the startr project |
scipen |
Value for R's scipen option, which controls how strongly fixed notation is preferred over scientific notation. (Default: 999)
timezone |
The timezone for analysis. (Default: 'America/Toronto') |
should_render_notebook |
Whether the RMarkdown notebook should be rendered. (Default: FALSE) |
should_process_data |
Whether startr's process step should be run. (Default: TRUE) |
should_timestamp_output_files |
Whether write_excel's output files should be timestamped. (Default: FALSE) |
should_clean_processing_variables |
Whether processing variables should be cleaned from the environment after processing is complete. (Default: TRUE) |
should_beep |
Whether startr should beep after tasks like processing or knitting RMarkdown notebooks. (Default: TRUE) |
set_minimal_graphics_theme |
Whether the minimal graphics theme should be used. (Default: TRUE) |
packages |
Vector of package names, from CRAN, GitHub or Bioconductor, to be installed. If using GitHub, package names should be in the format 'user/repo', e.g. 'globeandmail/upstartr'.
No return value, called for side effects
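Two of the standards mentioned above, disabling scientific notation and fixing a timezone, can be set with base R alone. The sketch below shows one way to do that. Whether initialize_startr uses exactly these calls is an assumption, and the sketch restores the previous settings at the end so it leaves the session unchanged.

```r
# Remember the current settings so they can be restored afterwards.
old_options <- options(scipen = 999)
old_tz <- Sys.getenv("TZ", unset = NA)

Sys.setenv(TZ = "America/Toronto")

plain <- format(1e10)        # "10000000000" rather than "1e+10"
active_tz <- Sys.getenv("TZ")  # "America/Toronto"

# Restore the earlier state.
options(old_options)
if (is.na(old_tz)) Sys.unsetenv("TZ") else Sys.setenv(TZ = old_tz)
```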
Given a vector, returns TRUE for all entities that aren't NA.
not.na(x)
x |
A vector to check for NAs against. |
A vector of elements that aren't NA
not.na(c(1, NA, 2, NA))
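The same check can be written in base R by negating is.na. The sketch below assumes the result is a logical vector, following the description above, and shows how it is typically used for filtering.

```r
values <- c(1, NA, 2, NA)

present <- !is.na(values)  # TRUE FALSE TRUE FALSE
kept <- values[present]    # 1 2
```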
Given a list, returns TRUE for all entities that aren't NULL.
not.null(x)
x |
A vector to check for NULLs against. |
Elements that aren't NULL
not.null(list(1, NULL, 2, NULL))
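An element-wise NULL check over a list can be sketched in base R as follows. Whether not.null works element by element or on the object as a whole is not stated here, so the element-wise form is an assumption.

```r
items <- list(1, NULL, 2, NULL)

# Test each element separately; vapply guarantees a logical vector back.
has_value <- !vapply(items, is.null, logical(1))  # TRUE FALSE TRUE FALSE
kept_items <- items[has_value]                    # list(1, 2)
```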
Reads all sheets in a single Excel file using readxl::read_excel and concatenates them into a single, long tibble.
read_all_excel_sheets(filepath, ...)
filepath |
Path to the Excel file. |
... |
Parameters to pass to |
A tibble of data concatenated from all sheets in an Excel file.
Removes non-UTF-8 characters in a given character vector.
remove_non_utf8(x)
x |
A character vector. |
A character vector of strings without non-UTF-8 characters.
non_utf8 <- 'fa\xE7ile'
Encoding(non_utf8) <- 'latin1'
remove_non_utf8(non_utf8)
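Dropping invalid bytes can also be done with base R's iconv, which is one common way to get this effect. The sketch below is not necessarily how remove_non_utf8 works internally. It treats the input as UTF-8 and substitutes an empty string for any byte that is not valid.

```r
raw_text <- "fa\xE7ile"  # contains a lone 0xE7 byte, which is invalid UTF-8

was_valid <- validUTF8(raw_text)  # FALSE

# Re-encode UTF-8 to UTF-8, replacing each invalid byte with nothing.
cleaned <- iconv(raw_text, from = "UTF-8", to = "UTF-8", sub = "")

now_valid <- validUTF8(cleaned)  # TRUE
```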
Renders an RMarkdown notebook using upstartr::render_notebook and then beeps.
render_notebook(notebook_file, output_dir = dir_reports())
notebook_file |
The path for the RMarkdown notebook you're rendering. |
output_dir |
The directory to write the outputs to. |
No return value, called for side effects
Sources analyze.R.
run_analyze()
No return value, called for side effects
Sources config.R and functions.R in turn.
run_config()
No return value, called for side effects
Renders an RMarkdown notebook using upstartr::render_notebook and then beeps.
run_notebook( filename = "notebook.Rmd", should_beep = TRUE, should_render_notebook = TRUE )
filename |
The filename for the RMarkdown notebook you want to render. |
should_beep |
Either TRUE, FALSE, or pulled from the environment if set. |
should_render_notebook |
Either TRUE, FALSE, or pulled from the environment if set. |
No return value, called for side effects
Runs the pre-processing step (see upstartr::begin_processing for details), then sources process.R, then runs the post-processing step (see upstartr::end_processing for details).
run_process(should_process_data = TRUE)
should_process_data |
Either TRUE, FALSE, or pulled from the environment if set. |
No return value, called for side effects
Sources visualize.R.
run_visualize()
No return value, called for side effects
Convenience function to return a scale_x_continuous function using percentage labels.
scale_x_percent(...)
... |
All your usual continuous x-axis scale parameters. |
A scale object to be consumed by ggplot2.
Convenience function to return a scale_y_continuous function using percentage labels.
scale_y_percent(...)
... |
All your usual continuous y-axis scale parameters. |
A scale object to be consumed by ggplot2.
Takes a character vector and "simplifies" it by uppercasing, removing most non-alphabetic (or non-alphanumeric) characters, removing accents, forcing UTF-8 encoding, removing excess spaces, and optionally removing stop words. Useful when you need to compare two large vectors of person or business names in which misspellings may be common.
simplify_string( x, alpha = TRUE, digits = FALSE, unaccent = TRUE, utf8_only = TRUE, case = "upper", trim = TRUE, stopwords = NA )
x |
A character vector. |
alpha |
Should alphabetic characters be included in the cleaned up string? (Default: TRUE) |
digits |
Should digits be included in the cleaned up string? (Default: FALSE) |
unaccent |
Should characters be de-accented? (Default: TRUE) |
utf8_only |
Should characters be UTF-8 only? (Default: TRUE) |
case |
What casing should characters use? Can be one of 'upper', 'lower', 'sentence', 'title', or 'keep' for the existing casing (Default: 'upper') |
trim |
Should strings be trimmed of excess spaces? (Default: TRUE) |
stopwords |
An optional vector of stop words to be removed. |
A character vector of simplified strings.
simplify_string(c('J. Jonah Jameson', 'j jonah jameson', 'j jonah 123 jameson', 'J Jónah Jameson...'))
simplify_string(c('123 Business Inc.', '123 business incorporated', '123 ... Business ... Inc.'), digits = TRUE, stopwords = c('INC', 'INCORPORATED'))
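Several of the steps listed above (uppercasing, stripping unwanted characters, removing stop words, collapsing spaces) can be sketched in base R. The helper `simplify_sketch` is hypothetical and deliberately partial: it leaves out accent removal and encoding handling, so it is only reliable on plain ASCII input.

```r
# Hypothetical sketch of a subset of the simplification steps.
simplify_sketch <- function(x, digits = FALSE, stopwords = character(0)) {
  x <- toupper(x)
  # Keep letters and spaces, plus digits when requested.
  unwanted <- if (digits) "[^A-Z0-9 ]" else "[^A-Z ]"
  x <- gsub(unwanted, "", x, perl = TRUE)
  if (length(stopwords) > 0) {
    # Remove stop words only when they appear as whole words.
    stop_pattern <- paste0("\\b(", paste(toupper(stopwords), collapse = "|"), ")\\b")
    x <- gsub(stop_pattern, "", x, perl = TRUE)
  }
  x <- gsub(" +", " ", x)  # collapse runs of spaces
  trimws(x)
}

people <- simplify_sketch(c("J. Jonah Jameson", "j jonah jameson", "j jonah 123 jameson"))
firms <- simplify_sketch(
  c("123 Business Inc.", "123 business incorporated", "123 ... Business ... Inc."),
  digits = TRUE, stopwords = c("INC", "INCORPORATED")
)
```

After simplification the variants collapse to a single spelling, which is what makes the comparison of messy name lists practical.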
Replace accented characters with their non-accented versions. Useful when dealing with languages like French, Spanish or Portuguese, where accents can lead to compatibility issues during data analysis.
unaccent(x, remove.nonconverted = FALSE, ...)
x |
A character vector. |
remove.nonconverted |
Should the function remove unmapped encodings? (Default: FALSE) |
... |
Parameters passed to |
A character vector of strings without accents.
unaccent('façile')
unaccent('Montréal')
Takes a tibble or dataframe variable and saves it out as an Excel file using the variable name as the filename.
write_excel( variable, output_dir = dir_data_out(), should_timestamp_output_files = FALSE )
variable |
A tibble or dataframe object. |
output_dir |
The directory to save the file out to. |
should_timestamp_output_files |
Either TRUE, FALSE, or pulled from the environment if set. |
No return value, called for side effects
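Deriving a filename from the variable name passed in is usually done in R with deparse(substitute(...)). The sketch below illustrates that technique under stated assumptions: the helper `write_named_csv` is hypothetical, it writes CSV instead of Excel to stay within base R, and it defaults to a temporary directory.

```r
# Hypothetical sketch: the file is named after the variable passed in.
write_named_csv <- function(variable, output_dir = tempdir()) {
  name <- deparse(substitute(variable))  # captures the caller's variable name
  path <- file.path(output_dir, paste0(name, ".csv"))
  write.csv(variable, path, row.names = FALSE)
  invisible(path)
}

sales <- data.frame(item = c("a", "b"), total = c(10, 20))
out_path <- write_named_csv(sales)
out_name <- basename(out_path)  # "sales.csv"
```

The approach only gives a sensible filename when a bare variable name is passed, not an inline expression.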
Takes a ggplot2 object and writes it to disk via ggplot2::ggsave using the variable name as the filename.
write_plot(variable, format = "png", output_dir = dir_plots(), ...)
variable |
A ggplot2 object. |
format |
The desired format for the plot, be it 'png', 'pdf', etc. Accepts formats
you'd pass to |
output_dir |
The directory to save the plot out to. |
... |
Other settings to pass to ggsave, such as format, width, height or dpi. |
No return value, called for side effects
Utility function that wraps sf::st_write, but first removes a previous version of the shapefile if it exists (by default, sf::st_write will throw an error).
write_shp(shp, path, ...)
shp |
A spatial object. |
path |
The desired filepath for the shapefile. |
... |
Other settings to pass to st_write. |
No return value, called for side effects