Browse Source

(v1.1.0.9020) updated taxonomy

main
parent
commit
86d44054f0
  1. 4
      DESCRIPTION
  2. 16
      NEWS.md
  3. 3
      R/aa_helper_functions.R
  4. 2
      R/ab.R
  5. 10
      R/catalogue_of_life.R
  6. 18
      R/data.R
  7. 2
      R/disk.R
  8. 109
      R/eucast_rules.R
  9. 2
      R/join_microorganisms.R
  10. 2
      R/mic.R
  11. 92
      R/mo.R
  12. 18
      R/mo_property.R
  13. 2
      R/rsi.R
  14. 4
      R/rsi_calc.R
  15. BIN
      R/sysdata.rda
  16. 4
      data-raw/antibiotics.txt
  17. BIN
      data-raw/data_dsmz.rds
  18. 16
      data-raw/eucast_rules.tsv
  19. BIN
      data-raw/microorganisms.translation.rds
  20. 136556
      data-raw/microorganisms.txt
  21. 7
      data-raw/reproduction_of_antibiotics.R
  22. 658
      data-raw/reproduction_of_microorganisms.R
  23. 446
      data-raw/rsi_translation.txt
  24. 127
      data-raw/snomed.R
  25. BIN
      data/antibiotics.rda
  26. BIN
      data/example_isolates.rda
  27. BIN
      data/microorganisms.codes.rda
  28. BIN
      data/microorganisms.old.rda
  29. BIN
      data/microorganisms.rda
  30. BIN
      data/rsi_translation.rda
  31. 2
      docs/404.html
  32. 2
      docs/LICENSE-text.html
  33. 2
      docs/articles/index.html
  34. 2
      docs/authors.html
  35. 6
      docs/index.html
  36. 28
      docs/news/index.html
  37. 2
      docs/pkgdown.yml
  38. 2
      docs/reference/as.disk.html
  39. 2
      docs/reference/as.mic.html
  40. 3
      docs/reference/as.mo.html
  41. 2
      docs/reference/as.rsi.html
  42. 14
      docs/reference/catalogue_of_life.html
  43. 6
      docs/reference/index.html
  44. 12
      docs/reference/microorganisms.codes.html
  45. 14
      docs/reference/microorganisms.html
  46. 10
      docs/reference/microorganisms.old.html
  47. 4
      index.md
  48. 1
      man/as.mo.Rd
  49. 10
      man/catalogue_of_life.Rd
  50. 9
      man/microorganisms.Rd
  51. 4
      man/microorganisms.codes.Rd
  52. 7
      man/microorganisms.old.Rd
  53. 4
      tests/testthat/test-data.R
  54. 15
      tests/testthat/test-mo.R
  55. 3
      tests/testthat/test-mo_property.R

4
DESCRIPTION

@ -1,6 +1,6 @@ @@ -1,6 +1,6 @@
Package: AMR
Version: 1.1.0.9019
Date: 2020-05-25
Version: 1.1.0.9020
Date: 2020-05-27
Title: Antimicrobial Resistance Analysis
Authors@R: c(
person(role = c("aut", "cre"),

16
NEWS.md

@ -1,5 +1,5 @@ @@ -1,5 +1,5 @@
# AMR 1.1.0.9019
## <small>Last updated: 25-May-2020</small>
# AMR 1.1.0.9020
## <small>Last updated: 27-May-2020</small>
### Breaking
* Removed code dependency on all other R packages, making this package fully independent of the development process of others. This is a major code change, but will probably not be noticeable by most users.
@ -13,8 +13,17 @@ @@ -13,8 +13,17 @@
* For developers: classes `mo` and `ab` now both also inherit class `character`, to support any data transformation. This change invalidates code that checks for class length == 1.
### Changed
* Taxonomy:
* Updated the taxonomy of microorganisms tot May 2020, using the Catalogue of Life (CoL), the Global Biodiversity Information Facility (GBIF) and the List of Prokaryotic names with Standing in Nomenclature (LPSN, hosted by DSMZ since February 2020)
* Removed the Catalogue of Life IDs (like 776351), since they now work with a species ID (hexadecimal string)
* EUCAST rules:
* The `eucast_rules()` function no longer applies "other" rules at default that are made available by this package (like setting ampicillin = R when ampicillin + enzym inhibitor = R). The default input value for `rules` is now `c("breakpoints", "expert")` instead of `"all"`, but this can be changed by the user. To return to the old behaviour, set `options(AMR.eucast_rules = "all")`.
* The `eucast_rules()` function no longer applies "other" rules at default that are made available by this package (like setting ampicillin = R when ampicillin + enzyme inhibitor = R). The default input value for `rules` is now `c("breakpoints", "expert")` instead of `"all"`, but this can be changed by the user. To return to the old behaviour, set `options(AMR.eucast_rules = "all")`.
* Fixed a bug where checking antimicrobial results in the original data were not regarded as valid R/SI values
* All "other" rules now apply for all drug combinations in the `antibiotics` data set these two rules:
1. A drug **with** enzyme inhibitor will be set to S if the drug **without** enzyme inhibitor is S
2. A drug **without** enzyme inhibitor will be set to R if the drug **with** enzyme inhibitor is R
This works for all drug combinations, such as ampicillin/sulbactam, ceftazidime/avibactam, trimethoprim/sulfamethoxazole, etc.
* Added official drug names to verbose output of `eucast_rules()`
* Added function `ab_url()` to return the direct URL of an antimicrobial agent from the official WHO website
* Improvements for algorithm in `as.ab()`, so that e.g. `as.ab("ampi sul")` and `ab_name("ampi sul")` work
@ -22,6 +31,7 @@ @@ -22,6 +31,7 @@
* Small fix for some text input that could not be coerced as valid MIC values
* Fix for interpretation of generic CLSI interpretation rules (thanks to Anthony Underwood)
* Fix for `set_mo_source()` to make sure that column `mo` will always be the second column
* Added abbreviation "cfsc" for Cefoxitin and "cfav" for Ceftazidime/avibactam
### Other
* Removed previously deprecated function `p.symbol()` - it was replaced with `p_symbol()`

3
R/aa_helper_functions.R

@ -78,8 +78,7 @@ check_dataset_integrity <- function() { @@ -78,8 +78,7 @@ check_dataset_integrity <- function() {
check_microorganisms <- all(c("mo", "fullname", "kingdom", "phylum",
"class", "order", "family", "genus",
"species", "subspecies", "rank",
"col_id", "species_id", "source",
"ref", "prevalence", "snomed") %in% colnames(microorganisms),
"species_id", "source", "ref", "prevalence") %in% colnames(microorganisms),
na.rm = TRUE) & NROW(microorganisms) == NROW(MO_lookup)
check_antibiotics <- all(c("ab", "atc", "cid", "name", "group",
"atc_group1", "atc_group2", "abbreviations",

2
R/ab.R

@ -347,7 +347,7 @@ is.ab <- function(x) { @@ -347,7 +347,7 @@ is.ab <- function(x) {
#' @export
#' @noRd
print.ab <- function(x, ...) {
cat("Class 'ab'\n")
cat("Class <ab>\n")
print(as.character(x), quote = FALSE)
}

10
R/catalogue_of_life.R

@ -50,9 +50,9 @@ @@ -50,9 +50,9 @@
#'
#'
#' # Get a note when a species was renamed
#' mo_shortname("Chlamydia psittaci")
#' # Note: 'Chlamydia psittaci' (Page, 1968) was renamed
#' # 'Chlamydophila psittaci' (Everett et al., 1999)
#' mo_shortname("Chlamydophila psittaci")
#' # Note: 'Chlamydophila psittaci' (Everett et al., 1999) was renamed back to
#' # 'Chlamydia psittaci' (Page, 1968)
#' # [1] "C. psittaci"
#'
#' # Get any property from the entire taxonomic tree for all included species
@ -70,9 +70,9 @@ @@ -70,9 +70,9 @@
#'
#' # Do not get mistaken - this package is about microorganisms
#' mo_kingdom("C. elegans")
#' # [1] "Bacteria" # Bacteria?!
#' # [1] "Fungi" # Fungi?!
#' mo_name("C. elegans")
#' # [1] "Chroococcus limneticus elegans" # Because a microorganism was found
#' # [1] "Cladosporium elegans" # Because a microorganism was found
NULL
#' Version info of included Catalogue of Life

18
R/data.R

@ -82,7 +82,6 @@ @@ -82,7 +82,6 @@
#' @inheritSection catalogue_of_life Catalogue of Life
#' @format A [`data.frame`] with `r format(nrow(microorganisms), big.mark = ",")` observations and `r ncol(microorganisms)` variables:
#' - `mo`\cr ID of microorganism as used by this package
#' - `col_id`\cr Catalogue of Life ID
#' - `fullname`\cr Full name, like `"Escherichia coli"`
#' - `kingdom`, `phylum`, `class`, `order`, `family`, `genus`, `species`, `subspecies`\cr Taxonomic rank of the microorganism
#' - `rank`\cr Text of the taxonomic rank of the microorganism, like `"species"` or `"genus"`
@ -113,6 +112,8 @@ @@ -113,6 +112,8 @@
#'
#' From: <https://www.dsmz.de/services/online-tools/prokaryotic-nomenclature-up-to-date/complete-list-readme>
#' @source Catalogue of Life: Annual Checklist (public online taxonomic database), <http://www.catalogueoflife.org> (check included annual version with [catalogue_of_life_version()]).
#'
#' Parte, A.C. (2018). LPSN — List of Prokaryotic names with Standing in Nomenclature (bacterio.net), 20 years on. International Journal of Systematic and Evolutionary Microbiology, 68, 1825-1829; doi: 10.1099/ijsem.0.002786
#'
#' Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures, Germany, Prokaryotic Nomenclature Up-to-Date, <https://www.dsmz.de/services/online-tools/prokaryotic-nomenclature-up-to-date> (check included version with [catalogue_of_life_version()]).
#' @inheritSection AMR Read more on our website!
@ -120,11 +121,11 @@ @@ -120,11 +121,11 @@
"microorganisms"
catalogue_of_life <- list(
year = 2018,
year = 2019,
version = "Catalogue of Life: {year} Annual Checklist",
url_CoL = "http://www.catalogueoflife.org/annual-checklist/{year}/",
url_DSMZ = "https://www.dsmz.de/services/online-tools/prokaryotic-nomenclature-up-to-date/prokaryotic-nomenclature-up-to-date/genus-search",
yearmonth_DSMZ = "August 2019"
url_CoL = "http://www.catalogueoflife.org/col/",
url_DSMZ = "https://lpsn.dsmz.de",
yearmonth_DSMZ = "May 2020"
)
#' Data set with previously accepted taxonomic names
@ -132,17 +133,18 @@ catalogue_of_life <- list( @@ -132,17 +133,18 @@ catalogue_of_life <- list(
#' A data set containing old (previously valid or accepted) taxonomic names according to the Catalogue of Life. This data set is used internally by [as.mo()].
#' @inheritSection catalogue_of_life Catalogue of Life
#' @format A [`data.frame`] with `r format(nrow(microorganisms.old), big.mark = ",")` observations and `r ncol(microorganisms.old)` variables:
#' - `col_id`\cr Catalogue of Life ID that was originally given
#' - `col_id_new`\cr New Catalogue of Life ID that responds to an entry in the [microorganisms] data set
#' - `fullname`\cr Old full taxonomic name of the microorganism
#' - `fullname_new`\cr New full taxonomic name of the microorganism
#' - `ref`\cr Author(s) and year of concerning scientific publication
#' - `prevalence`\cr Prevalence of the microorganism, see [as.mo()]
#' @source Catalogue of Life: Annual Checklist (public online taxonomic database), <http://www.catalogueoflife.org> (check included annual version with [catalogue_of_life_version()]).
#'
#' Parte, A.C. (2018). LPSN — List of Prokaryotic names with Standing in Nomenclature (bacterio.net), 20 years on. International Journal of Systematic and Evolutionary Microbiology, 68, 1825-1829; doi: 10.1099/ijsem.0.002786
#' @inheritSection AMR Read more on our website!
#' @seealso [as.mo()] [mo_property()] [microorganisms]
"microorganisms.old"
#' Translation table for common microorganism codes
#' Translation table with `r format(nrow(microorganisms.codes), big.mark = ",")` common microorganism codes
#'
#' A data set containing commonly used codes for microorganisms, from laboratory systems and WHONET. Define your own with [set_mo_source()]. They will all be searched when using [as.mo()] and consequently all the [`mo_*`][mo_property()] functions.
#' @format A [`data.frame`] with `r format(nrow(microorganisms.codes), big.mark = ",")` observations and `r ncol(microorganisms.codes)` variables:

2
R/disk.R

@ -99,7 +99,7 @@ is.disk <- function(x) { @@ -99,7 +99,7 @@ is.disk <- function(x) {
#' @export
#' @noRd
print.disk <- function(x, ...) {
cat("Class 'disk'\n")
cat("Class <disk>\n")
print(as.integer(x), quote = FALSE)
}

109
R/eucast_rules.R

@ -245,6 +245,7 @@ eucast_rules <- function(x, @@ -245,6 +245,7 @@ eucast_rules <- function(x,
}
warned <- FALSE
warn_lacking_rsi_class <- FALSE
txt_error <- function() {
if (info == TRUE) cat("", font_red_bg(font_white(" ERROR ")), "\n\n")
@ -410,6 +411,7 @@ eucast_rules <- function(x, @@ -410,6 +411,7 @@ eucast_rules <- function(x,
RID <- cols_ab["RID"]
RIF <- cols_ab["RIF"]
RXT <- cols_ab["RXT"]
SAM <- cols_ab["SAM"]
SIS <- cols_ab["SIS"]
SXT <- cols_ab["SXT"]
TCY <- cols_ab["TCY"]
@ -440,7 +442,9 @@ eucast_rules <- function(x, @@ -440,7 +442,9 @@ eucast_rules <- function(x,
cols <- unique(cols[!is.na(cols) & !is.null(cols)])
if (length(rows) > 0 & length(cols) > 0) {
before_df <- x_original
if (any(!sapply(x[, cols, drop = FALSE], is.rsi), na.rm = TRUE)) {
warn_lacking_rsi_class <<- TRUE
}
tryCatch(
# insert into original table
x_original[rows, cols] <<- to,
@ -599,14 +603,79 @@ eucast_rules <- function(x, @@ -599,14 +603,79 @@ eucast_rules <- function(x,
}
}
if (info == TRUE & !any(c("other", "all") %in% rules, na.rm = TRUE)) {
cat(font_red("\nSkipping inheritance rules defined by this package, such as setting trimethoprim (TMP) = R where trimethoprim/sulfamethoxazole (SXT) = R.\nUse eucast_rules(..., rules = \"all\") to also apply those rules.\n"))
as.rsi_no_warning <- function(x) suppressWarnings(as.rsi(x))
no_added <- 0
no_changed <- 0
# Other rules: enzyme inhibitors ------------------------------------------
if (any(c("all", "other") %in% rules)) {
if (info == TRUE) {
cat(font_bold(paste0("\nRules by this AMR package (",
font_red(paste0("v", utils::packageVersion("AMR"), ", ",
format(utils::packageDate("AMR"), "%Y"))), ")\n")))
}
ab_enzyme <- subset(antibiotics, name %like% "/")[, c("ab", "name")]
ab_enzyme$base_name <- gsub("^([a-zA-Z0-9]+).*", "\\1", ab_enzyme$name)
ab_enzyme$base_ab <- as.ab(ab_enzyme$base_name)
for (i in seq_len(nrow(ab_enzyme))) {
if (all(c(ab_enzyme[i, ]$ab, ab_enzyme[i, ]$base_ab) %in% names(cols_ab), na.rm = TRUE)) {
ab_name_base <- ab_name(cols_ab[ab_enzyme[i, ]$base_ab], language = NULL, tolower = TRUE)
ab_name_enzyme <- ab_name(cols_ab[ab_enzyme[i, ]$ab], language = NULL, tolower = TRUE)
# Set base to R where base + enzyme inhibitor is R
rule_current <- paste0("Set ", ab_name_base, " (", cols_ab[ab_enzyme[i, ]$base_ab], ") = R where ",
ab_name_enzyme, " (", cols_ab[ab_enzyme[i, ]$ab], ") = R")
if (info == TRUE) {
cat(rule_current)
}
run_changes <- edit_rsi(to = "R",
rule = c(rule_current, "Other rules", ""),
rows = which(as.rsi_no_warning(x[, cols_ab[ab_enzyme[i, ]$ab]]) == "R"),
cols = cols_ab[ab_enzyme[i, ]$base_ab])
no_added <- no_added + run_changes$added
no_changed <- no_changed + run_changes$changed
# Print number of new changes
if (info == TRUE) {
# print only on last one of rules in this group
txt_ok(no_added = no_added, no_changed = no_changed)
# and reset counters
no_added <- 0
no_changed <- 0
}
# Set base + enzyme inhibitor to S where base is S
rule_current <- paste0("Set ", ab_name_enzyme, " (", cols_ab[ab_enzyme[i, ]$ab], ") = S where ",
ab_name_base, " (", cols_ab[ab_enzyme[i, ]$base_ab], ") = S")
if (info == TRUE) {
cat(rule_current)
}
run_changes <- edit_rsi(to = "S",
rule = c(rule_current, "Other rules", ""),
rows = which(as.rsi_no_warning(x[, cols_ab[ab_enzyme[i, ]$base_ab]]) == "S"),
cols = cols_ab[ab_enzyme[i, ]$ab])
no_added <- no_added + run_changes$added
no_changed <- no_changed + run_changes$changed
# Print number of new changes
if (info == TRUE) {
# print only on last one of rules in this group
txt_ok(no_added = no_added, no_changed = no_changed)
# and reset counters
no_added <- 0
no_changed <- 0
}
}
}
} else {
if (info == TRUE) {
cat(font_red("\nSkipping inheritance rules defined by this package, such as setting trimethoprim (TMP) = R where trimethoprim/sulfamethoxazole (SXT) = R.\nUse eucast_rules(..., rules = \"all\") to also apply those rules.\n"))
}
}
# Official EUCAST rules ---------------------------------------------------
eucast_notification_shown <- FALSE
eucast_rules_df <- eucast_rules_file # internal data file
no_added <- 0
no_changed <- 0
for (i in seq_len(nrow(eucast_rules_df))) {
rule_previous <- eucast_rules_df[max(1, i - 1), "reference.rule"]
@ -637,18 +706,14 @@ eucast_rules <- function(x, @@ -637,18 +706,14 @@ eucast_rules <- function(x,
if (rule_group_current %like% "expert" & !any(c("all", "expert") %in% rules)) {
next
}
if (rule_group_current %like% "other" & !any(c("all", "other") %in% rules)) {
next
}
if (info == TRUE & !rule_group_current %like% "other" & eucast_notification_shown == FALSE) {
cat(paste0(
"\n----\nRules by the ", font_bold("European Committee on Antimicrobial Susceptibility Testing (EUCAST)"),
cat(paste0("\n", font_grey(strrep("-", options()$width - 1)),
"\nRules by the ", font_bold("European Committee on Antimicrobial Susceptibility Testing (EUCAST)"),
"\n", font_blue("http://eucast.org/"), "\n"))
eucast_notification_shown <- TRUE
}
if (info == TRUE) {
# Print rule (group) ------------------------------------------------------
if (rule_group_current != rule_group_previous) {
@ -662,7 +727,7 @@ eucast_rules <- function(x, @@ -662,7 +727,7 @@ eucast_rules <- function(x,
rule_group_current %like% "expert",
paste0("\nEUCAST Expert Rules, Intrinsic Resistance and Exceptional Phenotypes (",
font_red(paste0("v", EUCAST_VERSION_EXPERT_RULES)), ")\n"),
"\nOther rules by this AMR package\n"))))
""))))
}
# Print rule -------------------------------------------------------------
if (rule_current != rule_previous) {
@ -733,18 +798,18 @@ eucast_rules <- function(x, @@ -733,18 +798,18 @@ eucast_rules <- function(x,
rows <- integer(0)
} else if (length(source_antibiotics) == 1) {
rows <- tryCatch(which(x[, col_mo_property] %like% mo_value
& x[, source_antibiotics[1L]] == source_value[1L]),
& as.rsi_no_warning(x[, source_antibiotics[1L]]) == source_value[1L]),
error = function(e) integer(0))
} else if (length(source_antibiotics) == 2) {
rows <- tryCatch(which(x[, col_mo_property] %like% mo_value
& x[, source_antibiotics[1L]] == source_value[1L]
& x[, source_antibiotics[2L]] == source_value[2L]),
& as.rsi_no_warning(x[, source_antibiotics[1L]]) == source_value[1L]
& as.rsi_no_warning(x[, source_antibiotics[2L]]) == source_value[2L]),
error = function(e) integer(0))
} else if (length(source_antibiotics) == 3) {
rows <- tryCatch(which(x[, col_mo_property] %like% mo_value
& x[, source_antibiotics[1L]] == source_value[1L]
& x[, source_antibiotics[2L]] == source_value[2L]
& x[, source_antibiotics[3L]] == source_value[3L]),
& as.rsi_no_warning(x[, source_antibiotics[1L]]) == source_value[1L]
& as.rsi_no_warning(x[, source_antibiotics[2L]]) == source_value[2L]
& as.rsi_no_warning(x[, source_antibiotics[3L]]) == source_value[3L]),
error = function(e) integer(0))
} else {
stop("only 3 antibiotics supported for source_antibiotics ", call. = FALSE)
@ -784,7 +849,7 @@ eucast_rules <- function(x, @@ -784,7 +849,7 @@ eucast_rules <- function(x,
arrange(row, rule_group, rule_name, col)
cat(paste0("\n", font_grey(strrep("-", options()$width - 1)), "\n"))
cat(font_bold(paste("EUCAST rules", paste0(wouldve, "affected"),
cat(font_bold(paste("The rules", paste0(wouldve, "affected"),
formatnr(n_distinct(verbose_info$row)),
"out of", formatnr(nrow(x_original)),
"rows, making a total of", formatnr(nrow(verbose_info)), "edits\n")))
@ -846,6 +911,12 @@ eucast_rules <- function(x, @@ -846,6 +911,12 @@ eucast_rules <- function(x,
}
}
if (isTRUE(warn_lacking_rsi_class)) {
warning("Not all columns with antimicrobial results are of class <rsi>.\n",
"Transform eligible columns to class <rsi> on beforehand: your_data %>% mutate_if(is.rsi.eligible, as.rsi)",
call. = FALSE)
}
# Return data set ---------------------------------------------------------
if (verbose == TRUE) {
rownames(verbose_info) <- NULL

2
R/join_microorganisms.R

@ -154,7 +154,7 @@ joins_check_df <- function(x, by) { @@ -154,7 +154,7 @@ joins_check_df <- function(x, by) {
by <- "mo"
x[, "mo"] <- as.mo(x[, "mo"])
} else {
stop("Cannot join - no column found with name or class `mo`.", call. = FALSE)
stop("Cannot join - no column found with name or class <mo>.", call. = FALSE)
}
}
message('Joining, by = "', by, '"') # message same as dplyr::join functions

2
R/mic.R

@ -174,7 +174,7 @@ droplevels.mic <- function(x, exclude = ifelse(anyNA(levels(x)), NULL, NA), ...) @@ -174,7 +174,7 @@ droplevels.mic <- function(x, exclude = ifelse(anyNA(levels(x)), NULL, NA), ...)
#' @export
#' @noRd
print.mic <- function(x, ...) {
cat("Class 'mic'\n")
cat("Class <mic>\n")
print(as.character(x), quote = FALSE)
}

92
R/mo.R

@ -126,7 +126,6 @@ @@ -126,7 +126,6 @@
#' as.mo("MRSA") # Methicillin Resistant S. aureus
#' as.mo("VISA") # Vancomycin Intermediate S. aureus
#' as.mo("VRSA") # Vancomycin Resistant S. aureus
#' as.mo(22242419) # Catalogue of Life ID
#' as.mo(115329001) # SNOMED CT code
#'
#' # Dyslexia is no problem - these all work:
@ -556,20 +555,44 @@ exec_as.mo <- function(x, @@ -556,20 +555,44 @@ exec_as.mo <- function(x,
if (initial_search == TRUE) {
progress$tick()
}
if (x_backup[i] %like_case% "\\(unknown [a-z]+\\)" | tolower(x_backup_without_spp[i]) %in% c("other", "none", "unknown")) {
# empty and nonsense values, ignore without warning
x[i] <- lookup(mo == "UNKNOWN")
# valid MO code ----
found <- lookup(mo == toupper(x_backup[i]))
if (!is.na(found)) {
x[i] <- found[1L]
next
}
# valid MO code ---
found <- lookup(mo == toupper(x_backup[i]))
# valid fullname ----
found <- lookup(fullname_lower %in% gsub("[^a-zA-Z0-9_. -]", "", tolower(c(x_backup[i], x_backup_without_spp[i]))))
# added the gsub() for "(unknown fungus)", since fullname_lower does not contain brackets
if (!is.na(found)) {
x[i] <- found[1L]
next
}
# old fullname ----
found <- lookup(fullname_lower %in% tolower(c(x_backup[i], x_backup_without_spp[i])),
column = NULL, # all columns
haystack = MO.old_lookup)
if (!all(is.na(found))) {
# when property is "ref" (which is the case in mo_ref, mo_authors and mo_year), return the old value, so:
# mo_ref() of "Chlamydia psittaci" will be "Page, 1968" (with warning)
# mo_ref() of "Chlamydophila psittaci" will be "Everett et al., 1999"
if (property == "ref") {
x[i] <- found["ref"]
} else {
x[i] <- lookup(fullname == found["fullname_new"], haystack = MO_lookup)
}
options(mo_renamed_last_run = found["fullname"])
was_renamed(name_old = found["fullname"],
name_new = lookup(fullname == found["fullname_new"], "fullname", haystack = MO_lookup),
ref_old = found["ref"],
ref_new = lookup(fullname == found["fullname_new"], "ref", haystack = MO_lookup),
mo = lookup(fullname == found["fullname_new"], "mo", haystack = MO_lookup))
next
}
# old mo code, used in previous versions of this package ----
if (x_backup[i] %in% microorganisms.translation$mo_old) {
old_mo_warning <- TRUE
@ -582,10 +605,9 @@ exec_as.mo <- function(x, @@ -582,10 +605,9 @@ exec_as.mo <- function(x,
}
}
found <- lookup(fullname_lower %in% tolower(c(x_backup[i], x_backup_without_spp[i])))
# most probable: is exact match in fullname
if (!is.na(found)) {
x[i] <- found[1L]
if (x_backup[i] %like_case% "\\(unknown [a-z]+\\)" | tolower(x_backup_without_spp[i]) %in% c("other", "none", "unknown")) {
# empty and nonsense values, ignore without warning
x[i] <- lookup(mo == "UNKNOWN")
next
}
@ -614,13 +636,6 @@ exec_as.mo <- function(x, @@ -614,13 +636,6 @@ exec_as.mo <- function(x,
next
}
# valid Catalogue of Life ID ---
found <- lookup(col_id == x_backup[i])
if (!is.na(found)) {
x[i] <- found[1L]
next
}
# WHONET and other common LIS codes ----
found <- lookup(code %in% toupper(c(x_backup_untouched[i], x_backup[i], x_backup_without_spp[i])),
column = "mo",
@ -943,21 +958,20 @@ exec_as.mo <- function(x, @@ -943,21 +958,20 @@ exec_as.mo <- function(x,
column = NULL, # all columns
haystack = data.old_to_check)
if (!all(is.na(found))) {
col_id_new <- found["col_id_new"]
# when property is "ref" (which is the case in mo_ref, mo_authors and mo_year), return the old value, so:
# mo_ref() of "Chlamydia psittaci" will be "Page, 1968" (with warning)
# mo_ref() of "Chlamydophila psittaci" will be "Everett et al., 1999"
if (property == "ref") {
x[i] <- found["ref"]
} else {
x[i] <- lookup(col_id == found["col_id_new"], haystack = MO_lookup)
x[i] <- lookup(fullname == found["fullname_new"], haystack = MO_lookup)
}
options(mo_renamed_last_run = found["fullname"])
was_renamed(name_old = found["fullname"],
name_new = lookup(col_id == found["col_id_new"], "fullname", haystack = MO_lookup),
name_new = lookup(fullname == found["fullname_new"], "fullname", haystack = MO_lookup),
ref_old = found["ref"],
ref_new = lookup(col_id == found["col_id_new"], "ref", haystack = MO_lookup),
mo = lookup(col_id == found["col_id_new"], "mo", haystack = MO_lookup))
ref_new = lookup(fullname == found["fullname_new"], "ref", haystack = MO_lookup),
mo = lookup(fullname == found["fullname_new"], "mo", haystack = MO_lookup))
return(x[i])
}
@ -997,18 +1011,18 @@ exec_as.mo <- function(x, @@ -997,18 +1011,18 @@ exec_as.mo <- function(x,
# mo_ref("Chlamydophila psittaci) = "Everett et al., 1999"
x <- found["ref"]
} else {
x <- lookup(col_id == found["col_id_new"], haystack = MO_lookup)
x <- lookup(fullname == found["fullname_new"], haystack = MO_lookup)
}
was_renamed(name_old = found["fullname"],
name_new = lookup(col_id == found["col_id_new"], "fullname", haystack = MO_lookup),
name_new = lookup(fullname == found["fullname_new"], "fullname", haystack = MO_lookup),
ref_old = found["ref"],
ref_new = lookup(col_id == found["col_id_new"], "ref", haystack = MO_lookup),
mo = lookup(col_id == found["col_id_new"], "mo", haystack = MO_lookup))
ref_new = lookup(fullname == found["fullname_new"], "ref", haystack = MO_lookup),
mo = lookup(fullname == found["fullname_new"], "mo", haystack = MO_lookup))
options(mo_renamed_last_run = found["fullname"])
uncertainties <<- rbind(uncertainties,
format_uncertainty_as_df(uncertainty_level = now_checks_for_uncertainty_level,
input = a.x_backup,
result_mo = lookup(col_id == found["col_id_new"], "mo", haystack = MO_lookup)))
result_mo = lookup(fullname == found["fullname_new"], "mo", haystack = MO_lookup)))
return(x)
}
@ -1366,6 +1380,10 @@ exec_as.mo <- function(x, @@ -1366,6 +1380,10 @@ exec_as.mo <- function(x,
failures <- c(failures, x_backup[i])
}
}
if (initial_search == TRUE) {
close(progress)
}
}
# handling failures ----
@ -1494,7 +1512,7 @@ exec_as.mo <- function(x, @@ -1494,7 +1512,7 @@ exec_as.mo <- function(x,
if (property == "mo") {
x <- to_class_mo(x)
}
if (length(mo_renamed()) > 0) {
print(mo_renamed())
}
@ -1552,7 +1570,7 @@ format_uncertainty_as_df <- function(uncertainty_level, @@ -1552,7 +1570,7 @@ format_uncertainty_as_df <- function(uncertainty_level,
#' @export
#' @noRd
print.mo <- function(x, ...) {
cat("Class 'mo'\n")
cat("Class <mo>\n")
x_names <- names(x)
x <- as.character(x)
names(x) <- x_names
@ -1711,6 +1729,9 @@ print.mo_renamed <- function(x, ...) { @@ -1711,6 +1729,9 @@ print.mo_renamed <- function(x, ...) {
font_italic(x$old_name[i]), ifelse(x$old_ref[i] %in% c("", NA), "",
paste0(" (", gsub("et al.", font_italic("et al."), x$old_ref[i]), ")")),
" was renamed ",
ifelse(as.integer(gsub("[^0-9]", "", x$new_ref[i])) < as.integer(gsub("[^0-9]", "", x$old_ref[i])),
font_bold("back to "),
""),
font_italic(x$new_name[i]), ifelse(x$new_ref[i] %in% c("", NA), "",
paste0(" (", gsub("et al.", font_italic("et al."), x$new_ref[i]), ")")),
" [", x$mo[i], "]")))
@ -1747,9 +1768,14 @@ translate_allow_uncertain <- function(allow_uncertain) { @@ -1747,9 +1768,14 @@ translate_allow_uncertain <- function(allow_uncertain) {
}
get_mo_failures_uncertainties_renamed <- function() {
list(failures = getOption("mo_failures"),
uncertainties = getOption("mo_uncertainties"),
renamed = getOption("mo_renamed"))
remember <- list(failures = getOption("mo_failures"),
uncertainties = getOption("mo_uncertainties"),
renamed = getOption("mo_renamed"))
# empty them, otherwise mo_shortname("Chlamydophila psittaci") will give 3 notes
options("mo_failures" = NULL)
options("mo_uncertainties" = NULL)
options("mo_renamed" = NULL)
remember
}
load_mo_failures_uncertainties_renamed <- function(metadata) {

18
R/mo_property.R

@ -149,6 +149,7 @@ mo_fullname <- mo_name @@ -149,6 +149,7 @@ mo_fullname <- mo_name
#' @export
mo_shortname <- function(x, language = get_locale(), ...) {
x.mo <- as.mo(x, ...)
metadata <- get_mo_failures_uncertainties_renamed()
replace_empty <- function(x) {
@ -158,7 +159,7 @@ mo_shortname <- function(x, language = get_locale(), ...) { @@ -158,7 +159,7 @@ mo_shortname <- function(x, language = get_locale(), ...) {
# get first char of genus and complete species in English
shortnames <- paste0(substr(mo_genus(x.mo, language = NULL), 1, 1), ". ", replace_empty(mo_species(x.mo, language = NULL)))
# exceptions for Staphylococci
shortnames[shortnames == "S. coagulase-negative"] <- "CoNS"
shortnames[shortnames == "S. coagulase-positive"] <- "CoPS"
@ -315,9 +316,9 @@ mo_synonyms <- function(x, ...) { @@ -315,9 +316,9 @@ mo_synonyms <- function(x, ...) {
x <- as.mo(x, ...)
metadata <- get_mo_failures_uncertainties_renamed()
IDs <- mo_property(x = x, property = "col_id", language = NULL)
syns <- lapply(IDs, function(col_id) {
res <- sort(microorganisms.old[which(microorganisms.old$col_id_new == col_id), "fullname"])
IDs <- mo_name(x = x, language = NULL)
syns <- lapply(IDs, function(newname) {
res <- sort(microorganisms.old[which(microorganisms.old$fullname_new == newname), "fullname"])
if (length(res) == 0) {
NULL
} else {
@ -368,14 +369,9 @@ mo_url <- function(x, open = FALSE, ...) { @@ -368,14 +369,9 @@ mo_url <- function(x, open = FALSE, ...) {
df <- data.frame(mo, stringsAsFactors = FALSE) %>%
left_join(select(microorganisms, mo, source, species_id), by = "mo")
df$url <- ifelse(df$source == "CoL",
paste0(gsub("{year}",
catalogue_of_life$year,
catalogue_of_life$url_CoL,
fixed = TRUE),
"details/species/id/",
df$species_id),
paste0(catalogue_of_life$url_CoL, "details/species/id/", df$species_id, "/"),
ifelse(df$source == "DSMZ",
paste0(catalogue_of_life$url_DSMZ, "/", unlist(lapply(strsplit(mo_names, ""), function(x) x[1]))),
paste0(catalogue_of_life$url_DSMZ, "/advanced_search?adv[taxon-name]=", gsub(" ", "+", mo_names), "/"),
NA_character_))
u <- df$url
names(u) <- mo_names

2
R/rsi.R

@ -533,7 +533,7 @@ is.rsi.eligible <- function(x, threshold = 0.05) { @@ -533,7 +533,7 @@ is.rsi.eligible <- function(x, threshold = 0.05) {
#' @export
#' @noRd
print.rsi <- function(x, ...) {
cat("Class 'rsi'\n")
cat("Class <rsi>\n")
print(as.character(x), quote = FALSE)
}

4
R/rsi_calc.R

@ -128,7 +128,7 @@ rsi_calc <- function(..., @@ -128,7 +128,7 @@ rsi_calc <- function(...,
}
if (print_warning == TRUE) {
warning("Increase speed by transforming to class `rsi` on beforehand: df %>% mutate_if(is.rsi.eligible, as.rsi)",
warning("Increase speed by transforming to class <rsi> on beforehand: your_data %>% mutate_if(is.rsi.eligible, as.rsi)",
call. = FALSE)
}
@ -177,7 +177,7 @@ rsi_calc_df <- function(type, # "proportion", "count" or "both" @@ -177,7 +177,7 @@ rsi_calc_df <- function(type, # "proportion", "count" or "both"
}
if (!any(sapply(data, is.rsi), na.rm = TRUE)) {
stop("No columns with class 'rsi' found. See ?as.rsi.", call. = FALSE)
stop("No columns with class <rsi> found. See ?as.rsi.", call. = FALSE)
}
if (as.character(translate_ab) %in% c("TRUE", "official")) {

BIN
R/sysdata.rda

Binary file not shown.

4
data-raw/antibiotics.txt

@ -89,7 +89,7 @@ @@ -89,7 +89,7 @@
"CTF" "J01DC07" 43708 "Cefotiam" "Cephalosporins (2nd gen.)" "Other beta-lactam antibacterials" "Second-generation cephalosporins" "" "c(\"cefotiam\", \"cefotiam?\", \"cefotiamum\", \"ceradolan\", \"ceradon\", \"haloapor\")" 1.2 "g" 4 "g"
"CHE" 125846 "Cefotiam hexetil" "Cephalosporins (3rd gen.)" "" "c(\"cefotiam cilexetil\", \"pansporin t\")"
"FOV" 9578573 "Cefovecin" "Cephalosporins (3rd gen.)" "" ""
"FOX" "J01DC01" 441199 "Cefoxitin" "Cephalosporins (2nd gen.)" "Other beta-lactam antibacterials" "Second-generation cephalosporins" "c(\"cfox\", \"cfx\", \"cfxt\", \"cx\", \"fox\", \"fx\")" "c(\"cefoxitin\", \"cefoxitina\", \"cefoxitine\", \"cefoxitinum\", \"cefoxotin\", \"cephoxitin\", \"mefoxin\", \"mefoxitin\", \"rephoxitin\")" 6 "g" "c(\"25240-3\", \"3448-8\")"
"FOX" "J01DC01" 441199 "Cefoxitin" "Cephalosporins (2nd gen.)" "Other beta-lactam antibacterials" "Second-generation cephalosporins" "c(\"cfox\", \"cfsc\", \"cfx\", \"cfxt\", \"cx\", \"fox\", \"fx\")" "c(\"cefoxitin\", \"cefoxitina\", \"cefoxitine\", \"cefoxitinum\", \"cefoxotin\", \"cephoxitin\", \"mefoxin\", \"mefoxitin\", \"rephoxitin\")" 6 "g" "c(\"25240-3\", \"3448-8\")"
"ZOP" 9571080 "Cefozopran" "Cephalosporins (4th gen.)" "" "cefozopran"
"CFZ" 68597 "Cefpimizole" "Cephalosporins (3rd gen.)" "" "c(\"cefpimizol\", \"cefpimizole\", \"cefpimizole sodium\", \"cefpimizolum\")"
"CPM" "J01DD11" 636405 "Cefpiramide" "Cephalosporins (3rd gen.)" "Other beta-lactam antibacterials" "Third-generation cephalosporins" "" "c(\"cefpiramide\", \"cefpiramide acid\", \"cefpiramido\", \"cefpiramidum\")" 2 "g"
@ -105,7 +105,7 @@ @@ -105,7 +105,7 @@
"CPT" "J01DI02" 56841980 "Ceftaroline" "Cephalosporins (5th gen.)" "c(\"\", \"cfro\")" "c(\"teflaro\", \"zinforo\")"
"CPA" "Ceftaroline/avibactam" "Cephalosporins (5th gen.)" "" ""
"CAZ" "J01DD02" 5481173 "Ceftazidime" "Cephalosporins (3rd gen.)" "Other beta-lactam antibacterials" "Third-generation cephalosporins" "c(\"caz\", \"cefta\", \"cfta\", \"cftz\", \"taz\", \"tz\", \"xtz\")" "c(\"ceftazidim\", \"ceftazidima\", \"ceftazidime\", \"ceftazidimum\", \"ceptaz\", \"fortaz\", \"fortum\", \"pentacef\", \"tazicef\", \"tazidime\")" 4 "g" "c(\"21151-6\", \"3449-6\", \"80960-8\")"
"CZA" "Ceftazidime/avibactam" "Cephalosporins (3rd gen.)" "" ""
"CZA" "Ceftazidime/avibactam" "Cephalosporins (3rd gen.)" "c(\"\", \"cfav\")" ""
"CCV" "J01DD52" 9575352 "Ceftazidime/clavulanic acid" "Cephalosporins (3rd gen.)" "Other beta-lactam antibacterials" "Third-generation cephalosporins" "c(\"czcl\", \"xtzl\")" ""
"CEM" 6537431 "Cefteram" "Cephalosporins (3rd gen.)" "" "c(\"cefteram\", \"cefterame\", \"cefteramum\", \"ceftetrame\")"
"CPL" 5362114 "Cefteram pivoxil" "Cephalosporins (3rd gen.)" "" "c(\"cefteram pivoxil\", \"tomiron\")"

BIN
data-raw/data_dsmz.rds

Binary file not shown.

16
data-raw/eucast_rules.tsv

@ -9,22 +9,6 @@ @@ -9,22 +9,6 @@
# >>>>> IF YOU WANT TO IMPORT THIS FILE INTO YOUR OWN SOFTWARE, HAVE THE FIRST 10 LINES SKIPPED <<<<<
# -------------------------------------------------------------------------------------------------------------------------------
if_mo_property like.is.one_of this_value and_these_antibiotics have_these_values then_change_these_antibiotics to_value reference.rule reference.rule_group
genus like .* AMP S AMX S Non-EUCAST: inherit ampicillin results for unavailable amoxicillin Other rules
genus like .* AMP I AMX I Non-EUCAST: inherit ampicillin results for unavailable amoxicillin Other rules
genus like .* AMP R AMX R Non-EUCAST: inherit ampicillin results for unavailable amoxicillin Other rules
genus like .* AMX S AMP S Non-EUCAST: inherit amoxicillin results for unavailable ampicillin Other rules
genus like .* AMX I AMP I Non-EUCAST: inherit amoxicillin results for unavailable ampicillin Other rules
genus like .* AMX R AMP R Non-EUCAST: inherit amoxicillin results for unavailable ampicillin Other rules
genus like .* AMC R AMP, AMX R Non-EUCAST: set ampicillin = R where amoxicillin/clav acid = R Other rules
genus like .* SAM R AMP, AMX R Non-EUCAST: set ampicillin = R where ampicillin/sulbactam = R Other rules
genus like .* TZP R PIP R Non-EUCAST: set piperacillin = R where piperacillin/tazobactam = R Other rules
genus like .* SXT R TMP R Non-EUCAST: set trimethoprim = R where trimethoprim/sulfa = R Other rules
genus like .* AMP S AMC S Non-EUCAST: set amoxicillin/clav acid = S where ampicillin = S Other rules
genus like .* AMX S AMC S Non-EUCAST: set amoxicillin/clav acid = S where ampicillin = S Other rules
genus like .* AMP S SAM S Non-EUCAST: set ampicillin/sulbactam = S where ampicillin = S Other rules
genus like .* AMX S SAM S Non-EUCAST: set ampicillin/sulbactam = S where ampicillin = S Other rules
genus like .* PIP S TZP S Non-EUCAST: set piperacillin/tazobactam = S where piperacillin = S Other rules
genus like .* TMP S SXT S Non-EUCAST: set trimethoprim/sulfa = S where trimethoprim = S Other rules
order is Enterobacterales AMP S AMX S Enterobacterales (Order) Breakpoints
order is Enterobacterales AMP I AMX I Enterobacterales (Order) Breakpoints
order is Enterobacterales AMP R AMX R Enterobacterales (Order) Breakpoints

Can't render this file because it contains an unexpected character in line 6 and column 96.

BIN
data-raw/microorganisms.translation.rds

Binary file not shown.

136556
data-raw/microorganisms.txt

File diff suppressed because it is too large Load Diff

7
data-raw/reproduction_of_antibiotics.R

@ -322,7 +322,7 @@ antibiotics[which(antibiotics$ab == as.ab("cefuroxim")), "abbreviations"][[1]] < @@ -322,7 +322,7 @@ antibiotics[which(antibiotics$ab == as.ab("cefuroxim")), "abbreviations"][[1]] <
antibiotics[which(antibiotics$ab == as.ab("cefotaxim")), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == as.ab("cefotaxim")), "abbreviations"][[1]], "cftx"))
antibiotics[which(antibiotics$ab == as.ab("ceftazidime")), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == as.ab("ceftazidime")), "abbreviations"][[1]], "cftz"))
antibiotics[which(antibiotics$ab == as.ab("cefepime")), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == as.ab("cefepime")), "abbreviations"][[1]], "cfpi"))
antibiotics[which(antibiotics$ab == as.ab("cefoxitin")), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == as.ab("cefoxitin")), "abbreviations"][[1]], "cfxt"))
antibiotics[which(antibiotics$ab == as.ab("cefoxitin")), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == as.ab("cefoxitin")), "abbreviations"][[1]], "cfxt", "cfsc"))
# More GLIMS codes
antibiotics[which(antibiotics$ab == "CAZ"), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == "CAZ"), "abbreviations"][[1]], "cftz"))
antibiotics[which(antibiotics$ab == "CRO"), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == "CRO"), "abbreviations"][[1]], "cftr"))
@ -377,6 +377,7 @@ antibiotics[which(antibiotics$ab == "CTX"), "abbreviations"][[1]] <- list(c(anti @@ -377,6 +377,7 @@ antibiotics[which(antibiotics$ab == "CTX"), "abbreviations"][[1]] <- list(c(anti
antibiotics[which(antibiotics$ab == "CAZ"), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == "CAZ"), "abbreviations"][[1]], "cftz"))
antibiotics[which(antibiotics$ab == "CFM"), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == "CFM"), "abbreviations"][[1]], "cfxm"))
antibiotics[which(antibiotics$ab == "FOX"), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == "FOX"), "abbreviations"][[1]], "cfxt"))
antibiotics[which(antibiotics$ab == "CZA"), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == "CZA"), "abbreviations"][[1]], "cfav"))
antibiotics[which(antibiotics$ab == "CZO"), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == "CZO"), "abbreviations"][[1]], "cfzl"))
antibiotics[which(antibiotics$ab == "CZX"), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == "CZX"), "abbreviations"][[1]], "cfzx"))
antibiotics[which(antibiotics$ab == "CHL"), "abbreviations"][[1]] <- list(c(antibiotics[which(antibiotics$ab == "CHL"), "abbreviations"][[1]], "chlo"))
@ -577,10 +578,10 @@ antibiotics <- antibiotics %>% @@ -577,10 +578,10 @@ antibiotics <- antibiotics %>%
# set as data.frame again
antibiotics <- as.data.frame(antibiotics, stringsAsFactors = FALSE)
class(antibiotics$ab) <- "ab"
class(antibiotics$ab) <- c("ab", "character")
antibiotics <- antibiotics %>% arrange(name)
# make all abbreviations and synonyms lower case, unique and alphabetically sorted
# make all abbreviations and synonyms lower case, unique and alphabetically sorted ----
for (i in 1:nrow(antibiotics)) {
abb <- sort(unique(tolower(antibiotics[i, "abbreviations"][[1]])))
syn <- sort(unique(tolower(antibiotics[i, "synonyms"][[1]])))

658
data-raw/reproduction_of_microorganisms.R

@ -23,94 +23,134 @@ @@ -23,94 +23,134 @@
# Data retrieved from the Catalogue of Life (CoL) through the Encyclopaedia of Life:
# https://opendata.eol.org/dataset/catalogue-of-life/
# Data retrieved from the Global Biodiversity Information Facility (GBIF):
# https://doi.org/10.15468/rffz4x
# (download the resource file with a name like "Catalogue of Life yyyy-mm-dd")
# and from the Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures
# https://www.dsmz.de/support/bacterial-nomenclature-up-to-date-downloads.html
# (download the latest "Complete List" as xlsx file)
#
# And from the Leibniz Institute: German Collection of Microorganisms and Cell Cultures (DSMZ)
# (register first at https://bacdive.dsmz.de/api/pnu/registration/register/ and use API as done below)
library(dplyr)
library(AMR)
# unzip and extract taxon.tab (around 1.5 GB) from the CoL archive, then:
# data_col <- data.table::fread("data-raw/taxon.tab")
data_col <- data.table::fread("data-raw/taxa.txt", quote = "")
# read the xlsx file from DSMZ (only around 2.5 MB):
data_dsmz <- readxl::read_xlsx("data-raw/DSMZ_bactnames.xlsx")
# also needed: data.table, httr, jsonlite, cleaner, stringr
# unzip and extract taxa.txt (both around 1.5 GB, 3.7-3.9M rows) from Col and GBIF, then:
data_col_raw <- data.table::fread("data-raw/taxon.tab", quote = "")
data_gbif <- data.table::fread("data-raw/taxa.txt", quote = "")
# merge the two
data_col <- data_gbif %>%
rename(referenceID = identifier) %>%
bind_rows(data_col_raw) %>%
distinct(scientificName, kingdom, genus, specificEpithet, infraspecificEpithet, .keep_all = TRUE)
rm(data_col_raw)
rm(data_gbif)
# read the data from the DSMZ API (around 19000 rows)
dsmz_username <- ""
dsmz_password <- ""
GET_df <- function(url) {
result <- httr::GET(url, httr::authenticate(dsmz_username, dsmz_password))
httr::stop_for_status(result)
result %>%
httr::content(type = "text", encoding = "UTF-8") %>%
jsonlite::fromJSON(flatten = TRUE)
}
dsmz_first <- GET_df("https://bacdive.dsmz.de/api/pnu/species?page=1&format=json")
data_dsmz <- dsmz_first$results
# this next process will take appr. `dsmz_first$count / 100 * 5 / 60` minutes
for (i in 2:round((dsmz_first$count / 100) + 0.5)) {
data_dsmz <<- rbind(data_dsmz,
GET_df(paste0("https://bacdive.dsmz.de/api/pnu/species/?page=", i, "&format=json"))$results)
cat(i, "-", AMR:::percentage(i / round((dsmz_first$count / 100) + 0.5)), "\n")
}
rm(dsmz_first)
# the CoL data is over 3.7M rows:
data_col %>% freq(kingdom)
data_col %>% cleaner::freq(kingdom)
# Item Count Percent Cum. Count Cum. Percent
# --- ---------- ---------- -------- ----------- -------------
# 1 Animalia 2,225,627 59.1% 2,225,627 59.1%
# 2 Plantae 1,177,412 31.3% 3,403,039 90.4%
# 3 Fungi 290,145 7.7% 3,693,184 98.1%
# 4 Chromista 47,126 1.3% 3,740,310 99.3%
# 5 Bacteria 14,478 0.4% 3,754,788 99.7%
# 6 Protozoa 6,060 0.2% 3,760,848 99.9%
# 7 Viruses 3,827 0.1% 3,764,675 100.0%
# 8 Archaea 610 0.0% 3,765,285 100.0%
# 1 Animalia 2,494,992 55.43% 2,494,992 55.43%
# 2 Plantae 1,379,674 30.65% 3,874,666 86.08%
# 3 Fungi 547,619 12.17% 4,422,285 98.24%
# 4 Chromista 51,475 1.14% 4,473,760 99.39%
# 5 Bacteria 14,442 0.32% 4,488,202 99.71%
# 6 Protozoa 8,750 0.19% 4,496,952 99.90%
# 7 Viruses 3,805 0.08% 4,500,757 99.99%
# 8 Archaea 609 0.01% 4,501,366 100.00%
# clean data_col
data_col.bak <- data_col
data_col_old <- data_col %>%
# filter: has new accepted name
filter(!is.na(acceptedNameUsageID)) %>%
as_tibble() %>%
transmute(fullname = trimws(stringr::str_replace(scientificName,
pattern = stringr::fixed(scientificNameAuthorship),
replacement = "")),
fullname_new = trimws(paste(ifelse(is.na(genus), "", genus),
ifelse(is.na(specificEpithet), "", specificEpithet),
ifelse(is.na(infraspecificEpithet), "", infraspecificEpithet))),
ref = scientificNameAuthorship,
prevalence = NA_integer_)
data_col <- data_col %>%
# filter: has no new accepted name
filter(is.na(acceptedNameUsageID)) %>%
as_tibble() %>%
select(col_id = taxonID,
col_id_new = acceptedNameUsageID,
fullname = scientificName,
kingdom,
phylum,
class,
order,
family,
genus,
species = specificEpithet,
subspecies = infraspecificEpithet,
rank = taxonRank,
ref = scientificNameAuthorship,
species_id = references)
data_col$source <- "CoL"
transmute(fullname = "",
kingdom,
phylum,
class,
order,
family,
genus,
species = specificEpithet,
subspecies = infraspecificEpithet,
rank = taxonRank,
ref = scientificNameAuthorship,
species_id = referenceID,
source = "CoL")
# clean data_dsmz
data_dsmz.bak <- data_dsmz
data_dsmz_old <- data_dsmz %>%
# filter: correct name is not NULL
filter(!sapply(correct_name, is.null)) %>%
as_tibble() %>%
transmute(fullname = trimws(paste(ifelse(is.na(genus), "", genus),
ifelse(is.na(species_epithet), "", species_epithet),
ifelse(is.na(subspecies_epithet), "", subspecies_epithet))),
fullname_new = sapply(correct_name, function(x) x[2L]),
ref = authors,
prevalence = NA_integer_)
data_dsmz <- data_dsmz %>%
# filter: correct name is NULL
filter(sapply(correct_name, is.null)) %>%
as_tibble() %>%
transmute(col_id = NA_integer_,
col_id_new = NA_integer_,
fullname = "",
# kingdom = "",
# phylum = "",
# class = "",
# order = "",
# family = "",
genus = ifelse(is.na(GENUS), "", GENUS),
species = ifelse(is.na(SPECIES), "", SPECIES),
subspecies = ifelse(is.na(SUBSPECIES), "", SUBSPECIES),
transmute(fullname = "",
kingdom = regio,
phylum,
class = classis,
# order = "", # does not contain order, will add later based on CoL
family = familia,
genus = ifelse(is.na(genus), "", genus),
species = ifelse(is.na(species_epithet), "", species_epithet),
subspecies = ifelse(is.na(subspecies_epithet), "", subspecies_epithet),
rank = ifelse(species == "", "genus", "species"),
ref = AUTHORS,
species_id = as.character(RECORD_NO),
ref = authors,
species_id = as.character(pnu_no),
source = "DSMZ")
# DSMZ only contains genus/(sub)species, try to find taxonomic properties based on genus and data_col
ref_taxonomy <- data_col %>%
filter(genus %in% data_dsmz$genus,
kingdom %in% c("Bacteria", "Chromista", "Archaea", "Protozoa", "Fungi"),
family != "") %>%
mutate(kingdom = factor(kingdom,
# in the left_join following, try Bacteria first, then Chromista, ...
levels = c("Bacteria", "Chromista", "Archaea", "Protozoa", "Fungi"),
ordered = TRUE)) %>%
filter(family %in% data_dsmz$family & family != "") %>%
arrange(kingdom) %>%
distinct(genus, .keep_all = TRUE) %>%
select(kingdom, phylum, class, order, family, genus)
distinct(family, .keep_all = TRUE) %>%
select(family, order)
data_dsmz <- data_dsmz %>%
left_join(ref_taxonomy, by = "genus") %>%
mutate(kingdom = "Bacteria",
phylum = ifelse(is.na(phylum), "(unknown phylum)", phylum),
class = ifelse(is.na(class), "(unknown class)", class),
order = ifelse(is.na(order), "(unknown order)", order),
family = ifelse(is.na(family), "(unknown family)", family),
)
left_join(ref_taxonomy, by = "family") # NAs will later become "(unknown ...)"
# combine everything
data_total <- data_col %>%
@ -119,6 +159,8 @@ data_total <- data_col %>% @@ -119,6 +159,8 @@ data_total <- data_col %>%
rm(data_col)
rm(data_dsmz)
rm(ref_taxonomy)
rm(data_col.bak)
rm(data_dsmz.bak)
mo_found_in_NL <- c("Absidia", "Acremonium", "Actinotignum", "Aedes", "Alternaria", "Anaerosalibacter", "Ancylostoma",
"Angiostrongylus", "Anisakis", "Anopheles", "Apophysomyces", "Arachnia", "Ascaris", "Aspergillus",
@ -158,8 +200,6 @@ MOs <- data_total %>% @@ -158,8 +200,6 @@ MOs <- data_total %>%
)
# or the genus has to be one of the genera we found in our hospitals last decades (Northern Netherlands, 2002-2018)
| genus %in% mo_found_in_NL
# or the taxonomic entry is old - the species was renamed
| !is.na(col_id_new)
) %>%
# really no Plantae (e.g. Dracunculus exist both as worm and as plant)
filter(kingdom != "Plantae") %>%
@ -174,59 +214,56 @@ MOs <- MOs %>% bind_rows(data_total %>% @@ -174,59 +214,56 @@ MOs <- MOs %>% bind_rows(data_total %>%
| (family %in% MOs$family & rank == "family")
| (genus %in% MOs$genus & rank == "genus")))
# filter old taxonomic names so only the ones with an existing reference will be kept
MOs <- MOs %>%
filter(is.na(col_id_new) | (!is.na(col_id_new) & col_id_new %in% MOs$col_id))
MOs <- MOs %>%
# remove text if it contains 'Not assigned' like phylum in viruses
mutate_all(~gsub("(Not assigned|\\[homonym\\]|\\[mistake\\])", "", ., ignore.case = TRUE))
MOs <- MOs %>%
# Only keep first author, e.g. transform 'Smith, Jones, 2011' to 'Smith et al., 2011':
mutate(authors2 = iconv(ref, from = "UTF-8", to = "ASCII//TRANSLIT"),
# remove leading and trailing brackets
authors2 = gsub("^[(](.*)[)]$", "\\1", authors2),
# only take part after brackets if there's a name
authors2 = ifelse(grepl(".*[)] [a-zA-Z]+.*", authors2),
gsub(".*[)] (.*)", "\\1", authors2),
authors2),
# get year from last 4 digits
lastyear = as.integer(gsub(".*([0-9]{4})$", "\\1", authors2)),
# can never be later than now
lastyear = ifelse(lastyear > as.integer(format(Sys.Date(), "%Y")),
NA,
lastyear),
# get authors without last year
authors = gsub("(.*)[0-9]{4}$", "\\1", authors2),
# remove nonsense characters from names
authors = gsub("[^a-zA-Z,'& -]", "", authors),
# remove trailing and leading spaces
authors = trimws(authors),
# only keep first author and replace all others by 'et al'
authors = gsub("(,| and| et| &| ex| emend\\.?) .*", " et al.", authors),
# et al. always with ending dot
authors = gsub(" et al\\.?", " et al.", authors),
authors = gsub(" ?,$", "", authors),
# don't start with 'sensu' or 'ehrenb'
authors = gsub("^(sensu|Ehrenb.?) ", "", authors, ignore.case = TRUE),
# no initials, only surname
authors = gsub("^([A-Z]+ )+", "", authors, ignore.case = FALSE),
# combine author and year if year is available
ref = ifelse(!is.na(lastyear),
paste0(authors, ", ", lastyear),
authors),
# fix beginning and ending
ref = gsub(", $", "", ref),
ref = gsub("^, ", "", ref),
ref = gsub("^(emend|et al.,?)", "", ref),
ref = trimws(ref)
)
# a lot start with a lowercase character - fix that
MOs$ref[!grepl("^d[A-Z]", MOs$ref)] <- gsub("^([a-z])", "\\U\\1", MOs$ref[!grepl("^d[A-Z]", MOs$ref)], perl = TRUE)
# specific one for the French that are named dOrbigny
MOs$ref[grepl("^d[A-Z]", MOs$ref)] <- gsub("^d", "d'", MOs$ref[grepl("^d[A-Z]", MOs$ref)])
MOs <- MOs %>% mutate(ref = gsub(" +", " ", ref))
get_author_year <- function(ref) {
# Only keep first author, e.g. transform 'Smith, Jones, 2011' to 'Smith et al., 2011'
authors2 <- iconv(ref, from = "UTF-8", to = "ASCII//TRANSLIT")
# remove leading and trailing brackets
authors2 <- gsub("^[(](.*)[)]$", "\\1", authors2)
# only take part after brackets if there's a name
authors2 <- ifelse(grepl(".*[)] [a-zA-Z]+.*", authors2),
gsub(".*[)] (.*)", "\\1", authors2),
authors2)
# get year from last 4 digits
lastyear = as.integer(gsub(".*([0-9]{4})$", "\\1", authors2))
# can never be later than now
lastyear = ifelse(lastyear > as.integer(format(Sys.Date(), "%Y")),
NA,
lastyear)
# get authors without last year
authors <- gsub("(.*)[0-9]{4}$", "\\1", authors2)
# remove nonsense characters from names
authors <- gsub("[^a-zA-Z,'& -]", "", authors)
# remove trailing and leading spaces
authors <- trimws(authors)
# only keep first author and replace all others by 'et al'
authors <- gsub("(,| and| et| &| ex| emend\\.?) .*", " et al.", authors)
# et al. always with ending dot
authors <- gsub(" et al\\.?", " et al.", authors)
authors <- gsub(" ?,$", "", authors)
# don't start with 'sensu' or 'ehrenb'
authors <- gsub("^(sensu|Ehrenb.?) ", "", authors, ignore.case = TRUE)
# no initials, only surname
authors <- gsub("^([A-Z]+ )+", "", authors, ignore.case = FALSE)
# combine author and year if year is available
ref <- ifelse(!is.na(lastyear),
paste0(authors, ", ", lastyear),
authors)
# fix beginning and ending
ref <- gsub(", $", "", ref)
ref <- gsub("^, ", "", ref)
ref <- gsub("^(emend|et al.,?)", "", ref)
ref <- trimws(ref)
# a lot start with a lowercase character - fix that
ref[!grepl("^d[A-Z]", ref)] <- gsub("^([a-z])", "\\U\\1", ref[!grepl("^d[A-Z]", ref)], perl = TRUE)
# specific one for the French that are named dOrbigny
ref[grepl("^d[A-Z]", ref)] <- gsub("^d", "d'", ref[grepl("^d[A-Z]", ref)])
ref <- gsub(" +", " ", ref)
ref
}
MOs <- MOs %>% mutate(ref = get_author_year(ref))
# Remove non-ASCII characters (these are not allowed by CRAN)
MOs <- MOs %>%
@ -235,53 +272,58 @@ MOs <- MOs %>% @@ -235,53 +272,58 @@ MOs <- MOs %>%
# remove invalid characters
mutate_all(~gsub("[\"'`]+", "", .))
# Split old taxonomic names - they refer in the original data to a new `taxonID` with `acceptedNameUsageID`
MOs.old <- MOs %>%
filter(!is.na(col_id_new),
ref != "",
source != "DSMZ") %>%
transmute(col_id,
col_id_new,
fullname =
trimws(
gsub("(.*)[(].*", "\\1",
stringr::str_replace(
string = fullname,
pattern = stringr::fixed(authors2),
replacement = "")) %>%
gsub(" (var|f|subsp)[.]", "", .)),
ref) %>%
filter(!is.na(fullname)) %>%
# set new fullnames
MOs <- MOs %>%
mutate(fullname = trimws(case_when(rank == "family" ~ family,
rank == "order" ~ order,
rank == "class" ~ class,
rank == "phylum" ~ phylum,
rank == "kingdom" ~ kingdom,
TRUE ~ paste(genus, species, subspecies))),
fullname = gsub(" (var|f|subsp)[.]", "", fullname)) %>%
# remove text if it contains 'Not assigned', etc.
mutate_all(function(x) ifelse(x %like% "(not assigned|homonym|mistake)", NA, x)) %>%
# clean taxonomy
mutate(kingdom = ifelse(is.na(kingdom) | trimws(kingdom) == "", "(unknown kingdom)", trimws(kingdom)),
phylum = ifelse(is.na(phylum) | trimws(phylum) == "", "(unknown phylum)", trimws(phylum)),
class = ifelse(is.na(class) | trimws(class) == "", "(unknown class)", trimws(class)),
order = ifelse(is.na(order) | trimws(order) == "", "(unknown order)", trimws(order)),
family = ifelse(is.na(family) | trimws(family) == "", "(unknown family)", trimws(family)))
# Split old taxonomic names
MOs.old <- data_col_old %>%
filter(!gsub(" (var|f|subsp)[.]", "", fullname_new) %in% data_dsmz_old$fullname) %>%
bind_rows(data_dsmz_old) %>%
mutate(fullname_new = gsub(" (var|f|subsp)[.]", "", fullname_new),
fullname = gsub(" (var|f|subsp)[.]", "", fullname)) %>%
# for cases like Chlamydia pneumoniae -> Chlamydophila pneumoniae -> Chlamydia pneumoniae:
filter(!fullname %in% fullname_new &
fullname_new %in% MOs$fullname &
!is.na(fullname) &
fullname != fullname_new) %>%
distinct(fullname, .keep_all = TRUE) %>%
arrange(col_id)
MO.bak <- MOs
arrange(fullname) %>%
mutate(ref = get_author_year(ref))
MOs <- MOs %>%
filter(is.na(col_id_new) | source == "DSMZ") %>%
transmute(col_id,
fullname = trimws(case_when(rank == "family" ~ family,
rank == "order" ~ order,
rank == "class" ~ class,
rank == "phylum" ~ phylum,
rank == "kingdom" ~ kingdom,
TRUE ~ paste(genus, species, subspecies))),
# remove entries that are old and in MOs.old
filter(!fullname %in% MOs.old$fullname) %>%
# mark up
transmute(fullname,
kingdom,
phylum,
class,
order,
family,
genus = gsub(":", "", genus),
genus,
species,
subspecies,