# ==================================================================== #
# TITLE #
# Antimicrobial Resistance (AMR) Data Analysis for R #
# #
# SOURCE #
# https://github.com/msberends/AMR #
# #
# LICENCE #
# (c) 2018-2021 Berends MS, Luz CF et al. #
# Developed at the University of Groningen, the Netherlands, in #
# collaboration with non-profit organisations Certe Medical #
# Diagnostics & Advice, and University Medical Center Groningen. #
# #
# This R package is free software; you can freely use and distribute #
# it for both personal and commercial purposes under the terms of the #
# GNU General Public License version 2.0 (GNU GPL-2), as published by #
# the Free Software Foundation. #
# We created this package for both routine data analysis and academic #
# research and it was publicly released in the hope that it will be #
# useful, but it comes WITHOUT ANY WARRANTY OR LIABILITY. #
# #
# Visit our website for the full manual and a complete tutorial about #
# how to conduct AMR data analysis: https://msberends.github.io/AMR/ #
# ==================================================================== #
#' Transform Input to a Microorganism ID
#'
#' Use this function to determine a valid microorganism ID ([`mo`]). Determination is done using intelligent rules and the complete taxonomic kingdoms Bacteria, Chromista, Protozoa, Archaea and most microbial species from the kingdom Fungi (see *Source*). The input can be almost anything: a full name (like `"Staphylococcus aureus"`), an abbreviated name (such as `"S. aureus"`), an abbreviation known in the field (such as `"MRSA"`), or just a genus. See *Examples*.
#' @inheritSection lifecycle Stable Lifecycle
#' @param x a character vector or a [data.frame] with one or two columns
#' @param Becker a logical to indicate whether staphylococci should be categorised into coagulase-negative staphylococci ("CoNS") and coagulase-positive staphylococci ("CoPS") instead of their own species, according to Karsten Becker *et al.* (1,2,3).
#'
#' This excludes *Staphylococcus aureus* at default, use `Becker = "all"` to also categorise *S. aureus* as "CoPS".
#' @param Lancefield a logical to indicate whether beta-haemolytic *Streptococci* should be categorised into Lancefield groups instead of their own species, according to Rebecca C. Lancefield (4). These *Streptococci* will be categorised in their first group, e.g. *Streptococcus dysgalactiae* will be group C, although officially it was also categorised into groups G and L.
#'
#' This excludes *Enterococci* at default (who are in group D), use `Lancefield = "all"` to also categorise all *Enterococci* as group D.
#' @param allow_uncertain a number between `0` (or `"none"`) and `3` (or `"all"`), or `TRUE` (= `2`) or `FALSE` (= `0`) to indicate whether the input should be checked for less probable results, see *Details*
#' @param reference_df a [data.frame] to be used for extra reference when translating `x` to a valid [`mo`]. See [set_mo_source()] and [get_mo_source()] to automate the usage of your own codes (e.g. used in your analysis or organisation).
#' @param ignore_pattern a regular expression (case-insensitive) of which all matches in `x` must return `NA`. This can be convenient to exclude known non-relevant input and can also be set with the option `AMR_ignore_pattern`, e.g. `options(AMR_ignore_pattern = "(not reported|contaminated flora)")`.
#' @param language language to translate text like "no growth", which defaults to the system language (see [get_locale()])
#' @param info a [logical] to indicate if a progress bar should be printed if more than 25 items are to be coerced, defaults to `TRUE` only in interactive mode
#' @param ... other arguments passed on to functions
#' @rdname as.mo
#' @aliases mo
#' @keywords mo Becker becker Lancefield lancefield guess
#' @details
#' ## General Info
#'
#' A microorganism ID from this package (class: [`mo`]) is human readable and typically looks like these examples:
#' ```
#' Code Full name
#' --------------- --------------------------------------
#' B_KLBSL Klebsiella
#' B_KLBSL_PNMN Klebsiella pneumoniae
#' B_KLBSL_PNMN_RHNS Klebsiella pneumoniae rhinoscleromatis
#' | | | |
#' | | | |
#' | | | \---> subspecies, a 4-5 letter acronym
#' | | \----> species, a 4-5 letter acronym
#' | \----> genus, a 5-7 letter acronym
#' \----> taxonomic kingdom: A (Archaea), AN (Animalia), B (Bacteria),
#' C (Chromista), F (Fungi), P (Protozoa)
#' ```
#'
#' Values that cannot be coerced will be considered 'unknown' and will get the MO code `UNKNOWN`.
#'
#' Use the [`mo_*`][mo_property()] functions to get properties based on the returned code, see *Examples*.
#'
#' The algorithm uses data from the Catalogue of Life (see below) and from one other source (see [microorganisms]).
#'
#' The [as.mo()] function uses several coercion rules for fast and logical results. It assesses the input matching criteria in the following order:
#'
#' 1. Human pathogenic prevalence: the function starts with more prevalent microorganisms, followed by less prevalent ones;
#' 2. Taxonomic kingdom: the function starts with determining Bacteria, then Fungi, then Protozoa, then others;
#' 3. Breakdown of input values to identify possible matches.
#'
#' This will lead to the effect that e.g. `"E. coli"` (a microorganism highly prevalent in humans) will return the microbial ID of *Escherichia coli* and not *Entamoeba coli* (a microorganism less prevalent in humans), although the latter would alphabetically come first.
#'
#' ## Coping with Uncertain Results
#'
#' In addition, the [as.mo()] function can differentiate four levels of uncertainty to guess valid results:
#' - Uncertainty level 0: no additional rules are applied;
#' - Uncertainty level 1: allow previously accepted (but now invalid) taxonomic names and minor spelling errors;
#' - Uncertainty level 2: allow all of level 1, strip values between brackets, inverse the words of the input, strip off text elements from the end keeping at least two elements;
#' - Uncertainty level 3: allow all of level 1 and 2, strip off text elements from the end, allow any part of a taxonomic name.
#'
#' The level of uncertainty can be set using the argument `allow_uncertain`. The default is `allow_uncertain = TRUE`, which is equal to uncertainty level 2. Using `allow_uncertain = FALSE` is equal to uncertainty level 0 and will skip all rules. You can also use e.g. `as.mo(..., allow_uncertain = 1)` to only allow up to level 1 uncertainty.
#'
#' With the default setting (`allow_uncertain = TRUE`, level 2), below examples will lead to valid results:
#' - `"Streptococcus group B (known as S. agalactiae)"`. The text between brackets will be removed and a warning will be thrown that the result *Streptococcus group B* (``r as.mo("Streptococcus group B")``) needs review.
#' - `"S. aureus - please mind: MRSA"`. The last word will be stripped, after which the function will try to find a match. If it does not, the second last word will be stripped, etc. Again, a warning will be thrown that the result *Staphylococcus aureus* (``r as.mo("Staphylococcus aureus")``) needs review.
#' - `"Fluoroquinolone-resistant Neisseria gonorrhoeae"`. The first word will be stripped, after which the function will try to find a match. A warning will be thrown that the result *Neisseria gonorrhoeae* (``r as.mo("Neisseria gonorrhoeae")``) needs review.
#'
#' There are three helper functions that can be run after using the [as.mo()] function:
#' - Use [mo_uncertainties()] to get a [data.frame] that prints in a pretty format with all taxonomic names that were guessed. The output contains the matching score for all matches (see *Matching Score for Microorganisms* below).
#' - Use [mo_failures()] to get a [character] [vector] with all values that could not be coerced to a valid value.
#' - Use [mo_renamed()] to get a [data.frame] with all values that could be coerced based on old, previously accepted taxonomic names.
#'
#' ## Microbial Prevalence of Pathogens in Humans
#'
#' The intelligent rules consider the prevalence of microorganisms in humans grouped into three groups, which is available as the `prevalence` columns in the [microorganisms] and [microorganisms.old] data sets. The grouping into human pathogenic prevalence is explained in the section *Matching Score for Microorganisms* below.
#' @inheritSection mo_matching_score Matching Score for Microorganisms
#' @inheritSection catalogue_of_life Catalogue of Life
# (source as a section here, so it can be inherited by other man pages:)
#' @section Source:
#' 1. Becker K *et al.* **Coagulase-Negative Staphylococci**. 2014. Clin Microbiol Rev. 27(4): 870–926; \doi{10.1128/CMR.00109-13}
#' 2. Becker K *et al.* **Implications of identifying the recently defined members of the *S. aureus* complex, *S. argenteus* and *S. schweitzeri*: A position paper of members of the ESCMID Study Group for staphylococci and Staphylococcal Diseases (ESGS).** 2019. Clin Microbiol Infect; \doi{10.1016/j.cmi.2019.02.028}
#' 3. Becker K *et al.* **Emergence of coagulase-negative staphylococci** 2020. Expert Rev Anti Infect Ther. 18(4):349-366; \doi{10.1080/14787210.2020.1730813}
#' 4. Lancefield RC **A serological differentiation of human and other groups of hemolytic streptococci**. 1933. J Exp Med. 57(4): 571–95; \doi{10.1084/jem.57.4.571}
#' 5. `r gsub("{year}", CATALOGUE_OF_LIFE$year, CATALOGUE_OF_LIFE$version, fixed = TRUE)`, <http://www.catalogueoflife.org>
#' 6. List of Prokaryotic names with Standing in Nomenclature (`r CATALOGUE_OF_LIFE$yearmonth_LPSN`), \doi{10.1099/ijsem.0.004332}
#' 7. `r SNOMED_VERSION$current_source`, retrieved from the `r SNOMED_VERSION$title`, OID `r SNOMED_VERSION$current_oid`, version `r SNOMED_VERSION$current_version`; url: <`r SNOMED_VERSION$url`>
#' @export
#' @return A [character] [vector] with additional class [`mo`]
#' @seealso [microorganisms] for the [data.frame] that is being used to determine ID's.
#'
#' The [`mo_*`][mo_property()] functions (such as [mo_genus()], [mo_gramstain()]) to get properties based on the returned code.
#' @inheritSection AMR Reference Data Publicly Available
#' @inheritSection AMR Read more on Our Website!
#' @examples
#' \donttest{
#' # These examples all return "B_STPHY_AURS", the ID of S. aureus:
#' as.mo("sau") # WHONET code
#' as.mo("stau")
#' as.mo("STAU")
#' as.mo("staaur")
#' as.mo("S. aureus")
#' as.mo("S aureus")
#' as.mo("Staphylococcus aureus")
#' as.mo("Staphylococcus aureus (MRSA)")
#' as.mo("Zthafilokkoockus oureuz") # handles incorrect spelling
#' as.mo("MRSA") # Methicillin Resistant S. aureus
#' as.mo("VISA") # Vancomycin Intermediate S. aureus
#' as.mo("VRSA") # Vancomycin Resistant S. aureus
#' as.mo(115329001) # SNOMED CT code
#'
#' # Dyslexia is no problem - these all work:
#' as.mo("Ureaplasma urealyticum")
#' as.mo("Ureaplasma urealyticus")
#' as.mo("Ureaplasmium urealytica")
#' as.mo("Ureaplazma urealitycium")
#'
#' as.mo("Streptococcus group A")
#' as.mo("GAS") # Group A Streptococci
#' as.mo("GBS") # Group B Streptococci
#'
#' as.mo("S. epidermidis") # will remain species: B_STPHY_EPDR
#' as.mo("S. epidermidis", Becker = TRUE) # will not remain species: B_STPHY_CONS
#'
#' as.mo("S. pyogenes") # will remain species: B_STRPT_PYGN
#' as.mo("S. pyogenes", Lancefield = TRUE) # will not remain species: B_STRPT_GRPA
#'
#' # All mo_* functions use as.mo() internally too (see ?mo_property):
#' mo_genus("E. coli") # returns "Escherichia"
#' mo_gramstain("E. coli") # returns "Gram negative"
#' mo_is_intrinsic_resistant("E. coli", "vanco") # returns TRUE
#' }
as.mo <- function ( x ,
Becker = FALSE ,
Lancefield = FALSE ,
allow_uncertain = TRUE ,
reference_df = get_mo_source ( ) ,
ignore_pattern = getOption ( " AMR_ignore_pattern" ) ,
language = get_locale ( ) ,
info = interactive ( ) ,
... ) {
meet_criteria ( x , allow_class = c ( " mo" , " data.frame" , " list" , " character" , " numeric" , " integer" , " factor" ) , allow_NA = TRUE )
meet_criteria ( Becker , allow_class = c ( " logical" , " character" ) , has_length = 1 )
meet_criteria ( Lancefield , allow_class = c ( " logical" , " character" ) , has_length = 1 )
meet_criteria ( allow_uncertain , allow_class = c ( " logical" , " numeric" , " integer" ) , has_length = 1 )
meet_criteria ( reference_df , allow_class = " data.frame" , allow_NULL = TRUE )
meet_criteria ( ignore_pattern , allow_class = " character" , has_length = 1 , allow_NULL = TRUE )
meet_criteria ( language , has_length = 1 , is_in = c ( LANGUAGES_SUPPORTED , " " ) , allow_NULL = TRUE , allow_NA = TRUE )
meet_criteria ( info , allow_class = " logical" , has_length = 1 )
check_dataset_integrity ( )
if ( tryCatch ( all ( x [ ! is.na ( x ) ] %in% MO_lookup $ mo )
& isFALSE ( Becker )
& isFALSE ( Lancefield ) , error = function ( e ) FALSE ) ) {
# don't look into valid MO codes, just return them
# is.mo() won't work - MO codes might change between package versions
return ( set_clean_class ( x , new_class = c ( " mo" , " character" ) ) )
}
# start off with replaced language-specific non-ASCII characters with ASCII characters
x <- parse_and_convert ( x )
# replace mo codes used in older package versions
x <- replace_old_mo_codes ( x , property = " mo" )
# ignore cases that match the ignore pattern
x <- replace_ignore_pattern ( x , ignore_pattern )
# WHONET: xxx = no growth
x [tolower ( as.character ( paste0 ( x , " " ) ) ) %in% c ( " " , " xxx" , " na" , " nan" ) ] <- NA_character_
# Laboratory systems: remove (translated) entries like "no growth", etc.
x [trimws2 ( x ) %like% translate_AMR ( " no .*growth" , language = language ) ] <- NA_character_
x [trimws2 ( x ) %like% paste0 ( " ^(" , translate_AMR ( " no|not" , language = language ) , " ) [a-z]+" ) ] <- " UNKNOWN"
uncertainty_level <- translate_allow_uncertain ( allow_uncertain )
if ( tryCatch ( all ( x == " " | gsub ( " .*(unknown ).*" , " unknown name" , tolower ( x ) , perl = TRUE ) %in% MO_lookup $ fullname_lower , na.rm = TRUE )
& isFALSE ( Becker )
& isFALSE ( Lancefield ) , error = function ( e ) FALSE ) ) {
# to improve speed, special case for taxonomically correct full names (case-insensitive)
return ( MO_lookup [match ( gsub ( " .*(unknown ).*" , " unknown name" , tolower ( x ) , perl = TRUE ) , MO_lookup $ fullname_lower ) , " mo" , drop = TRUE ] )
}
if ( ! is.null ( reference_df )
&& check_validity_mo_source ( reference_df )
&& isFALSE ( Becker )
&& isFALSE ( Lancefield )
&& all ( x %in% unlist ( reference_df ) , na.rm = TRUE ) ) {
reference_df <- repair_reference_df ( reference_df )
suppressWarnings (
y <- data.frame ( x = x , stringsAsFactors = FALSE ) %pm>%
pm_left_join ( reference_df , by = " x" ) %pm>%
pm_pull ( mo )
)
} else if ( all ( x [ ! is.na ( x ) ] %in% MO_lookup $ mo )
& isFALSE ( Becker )
& isFALSE ( Lancefield ) ) {
y <- x
} else {
# will be checked for mo class in validation and uses exec_as.mo internally if necessary
y <- mo_validate ( x = x , property = " mo" ,
Becker = Becker , Lancefield = Lancefield ,
allow_uncertain = uncertainty_level ,
reference_df = reference_df ,
ignore_pattern = ignore_pattern ,
language = language ,
info = info ,
... )
}
set_clean_class ( y ,
new_class = c ( " mo" , " character" ) )
}
#' @rdname as.mo
#' @export
is.mo <- function ( x ) {
inherits ( x , " mo" )
}
# param property a column name of microorganisms
# param initial_search logical - is FALSE when coming from uncertain tries, which uses exec_as.mo internally too
# param dyslexia_mode logical - also check for characters that resemble others
# param debug logical - show different lookup texts while searching
# param reference_data_to_use data.frame - the data set to check for
# param actual_uncertainty - (only for initial_search = FALSE) the actual uncertainty level used in the function for score calculation (sometimes passed as 2 or 3 by uncertain_fn())
# param actual_input - (only for initial_search = FALSE) the actual, original input
# param language - used for translating "no growth", etc.
exec_as.mo <- function ( x ,
Becker = FALSE ,
Lancefield = FALSE ,
allow_uncertain = TRUE ,
reference_df = get_mo_source ( ) ,
info = interactive ( ) ,
property = " mo" ,
initial_search = TRUE ,
dyslexia_mode = FALSE ,
debug = FALSE ,
ignore_pattern = getOption ( " AMR_ignore_pattern" ) ,
reference_data_to_use = MO_lookup ,
actual_uncertainty = 1 ,
actual_input = NULL ,
language = get_locale ( ) ) {
meet_criteria ( x , allow_class = c ( " mo" , " data.frame" , " list" , " character" , " numeric" , " integer" , " factor" ) , allow_NA = TRUE )
meet_criteria ( Becker , allow_class = c ( " logical" , " character" ) , has_length = 1 )
meet_criteria ( Lancefield , allow_class = c ( " logical" , " character" ) , has_length = 1 )
meet_criteria ( allow_uncertain , allow_class = c ( " logical" , " numeric" , " integer" ) , has_length = 1 )
meet_criteria ( reference_df , allow_class = " data.frame" , allow_NULL = TRUE )
meet_criteria ( property , allow_class = " character" , has_length = 1 , is_in = colnames ( microorganisms ) )
meet_criteria ( initial_search , allow_class = " logical" , has_length = 1 )
meet_criteria ( dyslexia_mode , allow_class = " logical" , has_length = 1 )
meet_criteria ( debug , allow_class = " logical" , has_length = 1 )
meet_criteria ( ignore_pattern , allow_class = " character" , has_length = 1 , allow_NULL = TRUE )
meet_criteria ( reference_data_to_use , allow_class = " data.frame" )
meet_criteria ( actual_uncertainty , allow_class = " numeric" , has_length = 1 )
meet_criteria ( actual_input , allow_class = " character" , allow_NULL = TRUE )
meet_criteria ( language , has_length = 1 , is_in = c ( LANGUAGES_SUPPORTED , " " ) , allow_NULL = TRUE , allow_NA = TRUE )
check_dataset_integrity ( )
if ( isTRUE ( debug ) && initial_search == TRUE ) {
time_start_tracking ( )
}
lookup <- function ( needle ,
column = property ,
haystack = reference_data_to_use ,
n = 1 ,
debug_mode = debug ,
initial = initial_search ,
uncertainty = actual_uncertainty ,
input_actual = actual_input ) {
if ( ! is.null ( input_actual ) ) {
input <- input_actual
} else {
input <- tryCatch ( x_backup [i ] , error = function ( e ) " " )
}
# `column` can be NULL for all columns, or a selection
# returns a character (vector) - if `column` > length 1 then with columns as names
if ( isTRUE ( debug_mode ) ) {
cat ( font_silver ( " Looking up: " , substitute ( needle ) , collapse = " " ) ,
" \n " , time_track ( ) )
}
if ( length ( column ) == 1 ) {
res_df <- haystack [which ( eval ( substitute ( needle ) , envir = haystack , enclos = parent.frame ( ) ) ) , , drop = FALSE ]
if ( NROW ( res_df ) > 1 & uncertainty != -1 ) {
# sort the findings on matching score
scores <- mo_matching_score ( x = input ,
n = res_df [ , " fullname" , drop = TRUE ] )
res_df <- res_df [order ( scores , decreasing = TRUE ) , , drop = FALSE ]
}
res <- as.character ( res_df [ , column , drop = TRUE ] )
if ( length ( res ) == 0 ) {
if ( isTRUE ( debug_mode ) ) {
cat ( font_red ( " (no match)\n" ) )
}
NA_character_
} else {
if ( isTRUE ( debug_mode ) ) {
cat ( font_green ( paste0 ( " MATCH (" , NROW ( res_df ) , " results)\n" ) ) )
}
if ( ( length ( res ) > n | uncertainty > 1 ) & uncertainty != -1 ) {
# save the other possible results as well, but not for forced certain results (then uncertainty == -1)
uncertainties <<- rbind ( uncertainties ,
format_uncertainty_as_df ( uncertainty_level = uncertainty ,
input = input ,
result_mo = res_df [1 , " mo" , drop = TRUE ] ,
candidates = as.character ( res_df [ , " fullname" , drop = TRUE ] ) ) ,
stringsAsFactors = FALSE )
}
res [seq_len ( min ( n , length ( res ) ) ) ]
}
} else {
if ( is.null ( column ) ) {
column <- names ( haystack )
}
res <- haystack [which ( eval ( substitute ( needle ) , envir = haystack , enclos = parent.frame ( ) ) ) , , drop = FALSE ]
res <- res [seq_len ( min ( n , nrow ( res ) ) ) , column , drop = TRUE ]
if ( NROW ( res ) == 0 ) {
if ( isTRUE ( debug_mode ) ) {
cat ( font_red ( " (no rows)\n" ) )
}
res <- rep ( NA_character_ , length ( column ) )
} else {
if ( isTRUE ( debug_mode ) ) {
cat ( font_green ( paste0 ( " MATCH (" , NROW ( res ) , " rows)\n" ) ) )
}
}
res <- as.character ( res )
names ( res ) <- column
res
}
}
# start off with replaced language-specific non-ASCII characters with ASCII characters
x <- parse_and_convert ( x )
# replace mo codes used in older package versions
x <- replace_old_mo_codes ( x , property )
# ignore cases that match the ignore pattern
x <- replace_ignore_pattern ( x , ignore_pattern )
# WHONET: xxx = no growth
x [tolower ( as.character ( paste0 ( x , " " ) ) ) %in% c ( " " , " xxx" , " na" , " nan" ) ] <- NA_character_
# Laboratory systems: remove (translated) entries like "no growth", etc.
x [trimws2 ( x ) %like% translate_AMR ( " no .*growth" , language = language ) ] <- NA_character_
x [trimws2 ( x ) %like% paste0 ( " ^(" , translate_AMR ( " no|not" , language = language ) , " ) [a-z]+" ) ] <- " UNKNOWN"
if ( initial_search == TRUE ) {
# keep track of time - give some hints to improve speed if it takes a long time
start_time <- Sys.time ( )
pkg_env $ mo_failures <- NULL
pkg_env $ mo_uncertainties <- NULL
pkg_env $ mo_renamed <- NULL
}
pkg_env $ mo_renamed_last_run <- NULL
failures <- character ( 0 )
uncertainty_level <- translate_allow_uncertain ( allow_uncertain )
uncertainties <- data.frame ( uncertainty = integer ( 0 ) ,
input = character ( 0 ) ,
fullname = character ( 0 ) ,
renamed_to = character ( 0 ) ,
mo = character ( 0 ) ,
candidates = character ( 0 ) ,
stringsAsFactors = FALSE )
x_input <- x
# already strip leading and trailing spaces
x <- trimws ( x )
# only check the uniques, which is way faster
x <- unique ( x )
# remove empty values (to later fill them in again with NAs)
# ("xxx" is WHONET code for 'no growth')
x <- x [ ! is.na ( x )
& ! is.null ( x )
& ! identical ( x , " " )
& ! identical ( x , " xxx" ) ]
# defined df to check for
if ( ! is.null ( reference_df ) ) {
check_validity_mo_source ( reference_df )
reference_df <- repair_reference_df ( reference_df )
}
# all empty
if ( all ( identical ( trimws ( x_input ) , " " ) | is.na ( x_input ) | length ( x ) == 0 ) ) {
if ( property == " mo" ) {
return ( set_clean_class ( rep ( NA_character_ , length ( x_input ) ) ,
new_class = c ( " mo" , " character" ) ) )
} else {
return ( rep ( NA_character_ , length ( x_input ) ) )
}
} else if ( all ( x %in% reference_df [ , 1 ] [ [1 ] ] ) ) {
# all in reference df
colnames ( reference_df ) [1 ] <- " x"
suppressWarnings (
x <- MO_lookup [match ( reference_df [match ( x , reference_df $ x ) , " mo" , drop = TRUE ] , MO_lookup $ mo ) , property , drop = TRUE ]
)
} else if ( all ( x %in% reference_data_to_use $ mo ) ) {
x <- MO_lookup [match ( x , MO_lookup $ mo ) , property , drop = TRUE ]
} else if ( all ( tolower ( x ) %in% reference_data_to_use $ fullname_lower ) ) {
# we need special treatment for very prevalent full names, they are likely!
# e.g. as.mo("Staphylococcus aureus")
x <- MO_lookup [match ( tolower ( x ) , MO_lookup $ fullname_lower ) , property , drop = TRUE ]
} else if ( all ( x %in% reference_data_to_use $ fullname ) ) {
# we need special treatment for very prevalent full names, they are likely!
# e.g. as.mo("Staphylococcus aureus")
x <- MO_lookup [match ( x , MO_lookup $ fullname ) , property , drop = TRUE ]
} else if ( all ( toupper ( x ) %in% microorganisms.codes $ code ) ) {
# commonly used MO codes
x <- MO_lookup [match ( microorganisms.codes [match ( toupper ( x ) ,
microorganisms.codes $ code ) ,
" mo" ,
drop = TRUE ] ,
MO_lookup $ mo ) ,
property ,
drop = TRUE ]
} else if ( ! all ( x %in% microorganisms [ , property ] ) ) {
strip_whitespace <- function ( x , dyslexia_mode ) {
# all whitespaces (tab, new lines, etc.) should be one space
# and spaces before and after should be left blank
trimmed <- trimws2 ( x )
# also, make sure the trailing and leading characters are a-z or 0-9
# in case of non-regex
if ( dyslexia_mode == FALSE ) {
trimmed <- gsub ( " ^[^a-zA-Z0-9)(]+" , " " , trimmed , perl = TRUE )
trimmed <- gsub ( " [^a-zA-Z0-9)(]+$" , " " , trimmed , perl = TRUE )
}
trimmed
}
x_backup_untouched <- x
x <- strip_whitespace ( x , dyslexia_mode )
# translate 'unknown' names back to English
if ( any ( x %like% " unbekannt|onbekend|desconocid|sconosciut|iconnu|desconhecid" , na.rm = TRUE ) ) {
trns <- subset ( translations_file , pattern %like% " unknown" | affect_mo_name == TRUE )
langs <- LANGUAGES_SUPPORTED [LANGUAGES_SUPPORTED != " en" ]
for ( l in langs ) {
for ( i in seq_len ( nrow ( trns ) ) ) {
if ( ! is.na ( trns [i , l , drop = TRUE ] ) ) {
x <- gsub ( pattern = trns [i , l , drop = TRUE ] ,
replacement = trns $ pattern [i ] ,
x = x ,
ignore.case = TRUE ,
perl = TRUE )
}
}
}
}
x_backup <- x
# from here on case-insensitive
x <- tolower ( x )
x_backup [x %like_case% " ^(fungus|fungi)$" ] <- " (unknown fungus)" # will otherwise become the kingdom
x_backup [x_backup_untouched == " Fungi" ] <- " Fungi" # is literally the kingdom
# Fill in fullnames and MO codes at once
known_names <- tolower ( x_backup ) %in% MO_lookup $ fullname_lower
x [known_names ] <- MO_lookup [match ( tolower ( x_backup ) [known_names ] , MO_lookup $ fullname_lower ) , property , drop = TRUE ]
known_codes <- toupper ( x_backup ) %in% MO_lookup $ mo
x [known_codes ] <- MO_lookup [match ( toupper ( x_backup ) [known_codes ] , MO_lookup $ mo ) , property , drop = TRUE ]
already_known <- known_names | known_codes
# now only continue where the right taxonomic output is not already known
if ( any ( ! already_known ) ) {
x_known <- x [already_known ]
# remove spp and species
x <- gsub ( " +(spp.?|ssp.?|sp.? |ss ?.?|subsp.?|subspecies|biovar |serovar |species)" , " " , x )
x <- gsub ( " (spp.?|subsp.?|subspecies|biovar|serovar|species)" , " " , x )
x <- gsub ( " ^([a-z]{2,4})(spe.?)$" , " \\1" , x , perl = TRUE ) # when ending in SPE instead of SPP and preceded by 2-4 characters
x <- strip_whitespace ( x , dyslexia_mode )
x_backup_without_spp <- x
x_species <- paste ( x , " species" )
# translate to English for supported languages of mo_property
x <- gsub ( " (gruppe|groep|grupo|gruppo|groupe)" , " group" , x , perl = TRUE )
# no groups and complexes as ending
x <- gsub ( " (complex|group)$" , " " , x , perl = TRUE )
x <- gsub ( " (^|[^a-z])((an)?aero+b)[a-z]*" , " " , x , perl = TRUE )
x <- gsub ( " ^atyp[a-z]*" , " " , x , perl = TRUE )
x <- gsub ( " (vergroen)[a-z]*" , " viridans" , x , perl = TRUE )
x <- gsub ( " [a-z]*diff?erent[a-z]*" , " " , x , perl = TRUE )
x <- gsub ( " (hefe|gist|gisten|levadura|lievito|fermento|levure)[a-z]*" , " yeast" , x , perl = TRUE )
x <- gsub ( " (schimmels?|mofo|molde|stampo|moisissure|fungi)[a-z]*" , " fungus" , x , perl = TRUE )
x <- gsub ( " fungus[ph|f]rya" , " fungiphrya" , x , perl = TRUE )
# no contamination
x <- gsub ( " (contamination|kontamination|mengflora|contaminaci.n|contamina..o)" , " " , x , perl = TRUE )
# remove non-text in case of "E. coli" except dots and spaces
x <- trimws ( gsub ( " [^.a-zA-Z0-9/ \\-]+" , " " , x , perl = TRUE ) )
# but make sure that dots are followed by a space
x <- gsub ( " [.] ?" , " . " , x , perl = TRUE )
# replace minus by a space
x <- gsub ( " -+" , " " , x , perl = TRUE )
# replace hemolytic by haemolytic
x <- gsub ( " ha?emoly" , " haemoly" , x , perl = TRUE )
# place minus back in streptococci
x <- gsub ( " (alpha|beta|gamma).?ha?emoly" , " \\1-haemoly" , x , perl = TRUE )
# remove genus as first word
x <- gsub ( " ^genus " , " " , x , perl = TRUE )
# remove 'uncertain'-like texts
x <- trimws ( gsub ( " (uncertain|susp[ie]c[a-z]+|verdacht)" , " " , x , perl = TRUE ) )
# allow characters that resemble others = dyslexia_mode ----
if ( dyslexia_mode == TRUE ) {
x <- tolower ( x )
x <- gsub ( " [iy]+" , " [iy]+" , x )
x <- gsub ( " (c|k|q|qu|s|z|x|ks)+" , " (c|k|q|qu|s|z|x|ks)+" , x )
x <- gsub ( " (ph|hp|f|v)+" , " (ph|hp|f|v)+" , x )
x <- gsub ( " (th|ht|t)+" , " (th|ht|t)+" , x )
x <- gsub ( " a+" , " a+" , x )
x <- gsub ( " u+" , " u+" , x )
# allow any ending of -um, -us, -ium, -icum, -ius, -icus, -ica, -ia and -a (needs perl for the negative backward lookup):
x <- gsub ( " (u\\+\\(c\\|k\\|q\\|qu\\+\\|s\\|z\\|x\\|ks\\)\\+)(?![a-z])" ,
" (u[s|m]|[iy][ck]?u[ms]|[iy]?[ck]?a)" , x , perl = TRUE )
x <- gsub ( " (\\[iy\\]\\+\\(c\\|k\\|q\\|qu\\+\\|s\\|z\\|x\\|ks\\)\\+a\\+)(?![a-z])" ,
" (u[s|m]|[iy][ck]?u[ms]|[iy]?[ck]?a)" , x , perl = TRUE )
x <- gsub ( " (\\[iy\\]\\+u\\+m)(?![a-z])" ,
" (u[s|m]|[iy][ck]?u[ms]|[iy]?[ck]?a)" , x , perl = TRUE )
x <- gsub ( " (\\[iy\\]\\+a\\+)(?![a-z])" ,
" ([iy]*a+|[iy]+a*)" , x , perl = TRUE )
x <- gsub ( " e+" , " e+" , x )
x <- gsub ( " o+" , " o+" , x )
x <- gsub ( " (.)\\1+" , " \\1+" , x )
# allow multiplication of all other consonants
x <- gsub ( " ([bdgjlnrw]+)" , " \\1+" , x , perl = TRUE )
# allow ending in -en or -us
x <- gsub ( " e\\+n(?![a-z[])" , " (e+n|u+(c|k|q|qu|s|z|x|ks)+)" , x , perl = TRUE )
# if the input is longer than 10 characters, allow any forgotten consonant between all characters, as some might just have forgotten one...
# this will allow "Pasteurella damatis" to be correctly read as "Pasteurella dagmatis".
consonants <- paste ( letters [ ! letters %in% c ( " a" , " e" , " i" , " o" , " u" ) ] , collapse = " " )
x [nchar ( x_backup_without_spp ) > 10 ] <- gsub ( " [+]" , paste0 ( " +[" , consonants , " ]?" ) , x [nchar ( x_backup_without_spp ) > 10 ] )
# allow au and ou after all above regex implementations
x <- gsub ( " a+[bcdfghjklmnpqrstvwxyz]?u+[bcdfghjklmnpqrstvwxyz]?" , " (a+u+|o+u+)[bcdfghjklmnpqrstvwxyz]?" , x , fixed = TRUE )
x <- gsub ( " o+[bcdfghjklmnpqrstvwxyz]?u+[bcdfghjklmnpqrstvwxyz]?" , " (a+u+|o+u+)[bcdfghjklmnpqrstvwxyz]?" , x , fixed = TRUE )
}
x <- strip_whitespace ( x , dyslexia_mode )
# make sure to remove regex overkill (will lead to errors)
x <- gsub ( " ++" , " +" , x , fixed = TRUE )
x <- gsub ( " ?+" , " ?" , x , fixed = TRUE )
x_trimmed <- x
x_trimmed_species <- paste ( x_trimmed , " species" )
x_trimmed_without_group <- gsub ( " gro.u.p$" , " " , x_trimmed , perl = TRUE )
# remove last part from "-" or "/"
x_trimmed_without_group <- gsub ( " (.*)[-/].*" , " \\1" , x_trimmed_without_group )
# replace space and dot by regex sign
x_withspaces <- gsub ( " [ .]+" , " .* " , x , perl = TRUE )
x <- gsub ( " [ .]+" , " .*" , x , perl = TRUE )
# add start en stop regex
x <- paste0 ( " ^" , x , " $" )
x_withspaces_start_only <- paste0 ( " ^" , x_withspaces )
x_withspaces_end_only <- paste0 ( x_withspaces , " $" )
x_withspaces_start_end <- paste0 ( " ^" , x_withspaces , " $" )
if ( isTRUE ( debug ) ) {
cat ( paste0 ( font_blue ( " x" ) , ' "' , x , ' "\n' ) )
cat ( paste0 ( font_blue ( " x_species" ) , ' "' , x_species , ' "\n' ) )
cat ( paste0 ( font_blue ( " x_withspaces_start_only" ) , ' "' , x_withspaces_start_only , ' "\n' ) )
cat ( paste0 ( font_blue ( " x_withspaces_end_only" ) , ' "' , x_withspaces_end_only , ' "\n' ) )
cat ( paste0 ( font_blue ( " x_withspaces_start_end" ) , ' "' , x_withspaces_start_end , ' "\n' ) )
cat ( paste0 ( font_blue ( " x_backup" ) , ' "' , x_backup , ' "\n' ) )
cat ( paste0 ( font_blue ( " x_backup_without_spp" ) , ' "' , x_backup_without_spp , ' "\n' ) )
cat ( paste0 ( font_blue ( " x_trimmed" ) , ' "' , x_trimmed , ' "\n' ) )
cat ( paste0 ( font_blue ( " x_trimmed_species" ) , ' "' , x_trimmed_species , ' "\n' ) )
cat ( paste0 ( font_blue ( " x_trimmed_without_group" ) , ' "' , x_trimmed_without_group , ' "\n' ) )
}
if ( initial_search == TRUE ) {
progress <- progress_ticker ( n = length ( x [ ! already_known ] ) , n_min = 25 , print = info ) # start if n >= 25
on.exit ( close ( progress ) )
}
for ( i in which ( ! already_known ) ) {
if ( initial_search == TRUE ) {
progress $ tick ( )
}
# valid MO code ----
found <- lookup ( mo == toupper ( x_backup [i ] ) )
if ( ! is.na ( found ) ) {
x [i ] <- found [1L ]
next
}
# valid fullname ----
found <- lookup ( fullname_lower %in% gsub ( " [^a-zA-Z0-9_. -]" , " " , tolower ( c ( x_backup [i ] , x_backup_without_spp [i ] ) ) , perl = TRUE ) )
# added the gsub() for "(unknown fungus)", since fullname_lower does not contain brackets
if ( ! is.na ( found ) ) {
x [i ] <- found [1L ]
next
}
# old fullname ----
found <- lookup ( fullname_lower %in% tolower ( c ( x_backup [i ] , x_backup_without_spp [i ] ) ) ,
column = NULL , # all columns
haystack = MO.old_lookup )
if ( ! all ( is.na ( found ) ) ) {
# when property is "ref" (which is the case in mo_ref, mo_authors and mo_year), return the old value, so:
# mo_ref() of "Chlamydia psittaci" will be "Page, 1968" (with warning)
# mo_ref() of "Chlamydophila psittaci" will be "Everett et al., 1999"
if ( property == " ref" ) {
x [i ] <- found [ " ref" ]
} else {
x [i ] <- lookup ( fullname == found [ " fullname_new" ] , haystack = MO_lookup )
}
pkg_env $ mo_renamed_last_run <- found [ " fullname" ]
was_renamed ( name_old = found [ " fullname" ] ,
name_new = lookup ( fullname == found [ " fullname_new" ] , " fullname" , haystack = MO_lookup ) ,
ref_old