String tools: magic edition

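This vignette describes stringmagic tools for handling character vectors. It details: pattern detection with string_is, string_which and string_get; chaining string operations with string_ops; cleaning character vectors with string_clean; creating small character vectors with string_vec; and splitting vectors into data frames with string_split2df.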

Detection of regex patterns

Detecting a single regex pattern is straightforward with standard tools like base::grepl or stringr::str_detect. Things become more complicated when we want to detect several patterns at once.

stringmagic offers three functions with an intuitive syntax to deal with complex pattern detection: string_is, string_which and string_get.

Pattern detection with string_is, string_which and string_get

Use string_is, string_which and string_get to detect patterns in character vectors and obtain, respectively, a logical vector, an integer vector, or the matching values.

In this section we give examples for string_get, which should be explicit enough to illustrate how it works. Since string_get uses string_is internally, the same patterns work identically with string_is and string_which.
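As a point of comparison, the three return types have close base R analogues. The sketch below uses only base R and a single pattern; it is meant to illustrate the logical / integer / values distinction, not to reproduce the stringmagic syntax.

```r
cars <- row.names(mtcars)

# logical vector, one element per car (the role played by string_is)
is_merc <- grepl("^Merc", cars)

# integer positions of the matches (the role played by string_which)
which_merc <- which(is_merc)

# the matching values themselves (the role played by string_get)
get_merc <- cars[is_merc]
```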

Ex.1: series of examples using the recommended syntax.

cars = row.names(mtcars)
cat_magic("All cars from mtcars:\n{C, 60 swidth ? cars}")
#> All cars from mtcars:
#> Mazda RX4, Mazda RX4 Wag, Datsun 710, Hornet 4 Drive, Hornet
#> Sportabout, Valiant, Duster 360, Merc 240D, Merc 230, Merc
#> 280, Merc 280C, Merc 450SE, Merc 450SL, Merc 450SLC,
#> Cadillac Fleetwood, Lincoln Continental, Chrysler Imperial,
#> Fiat 128, Honda Civic, Toyota Corolla, Toyota Corona, Dodge
#> Challenger, AMC Javelin, Camaro Z28, Pontiac Firebird, Fiat
#> X1-9, Porsche 914-2, Lotus Europa, Ford Pantera L, Ferrari
#> Dino, Maserati Bora and Volvo 142E

# cars with an 'a', an 'e', an 'i', and an 'o', all in lower case
string_get(cars, "a & e & i & o")
#> [1] "Cadillac Fleetwood"  "Lincoln Continental" "Pontiac Firebird"   
#> [4] "Ferrari Dino"        "Maserati Bora"

# cars with no 'e' and at least one digit
string_get(cars, "!e & \\d")
#> [1] "Mazda RX4"     "Mazda RX4 Wag" "Datsun 710"    "Fiat 128"     
#> [5] "Camaro Z28"    "Fiat X1-9"     "Volvo 142E"

# flags apply to all
# contains the 'words' 2, 9 or l
# alternative syntax for flags: "wi/2 | 9 | l"
string_get(cars, "word, ignore/2 | 9 | l")
#> [1] "Fiat X1-9"      "Porsche 914-2"  "Ford Pantera L"

The default syntax is string_get(x, ...) (same for string_is and string_which), where ... contains any number of patterns to detect. By default the results of these pattern detections are combined with a logical AND. To combine them with a logical OR, you need to use the argument or = TRUE. You can also pass the flags as regular function arguments. They then apply to all patterns.

Ex.2: replication of Ex.1 using an alternative syntax.

# string_get(cars, "a & e & i & o")
# cars with an 'a', an 'e', an 'i', and an 'o', all in lower case
string_get(cars, "a", "e", "i", "o")
#> [1] "Cadillac Fleetwood"  "Lincoln Continental" "Pontiac Firebird"   
#> [4] "Ferrari Dino"        "Maserati Bora"

# string_get(cars, "!e & \\d")
# cars with no 'e' and at least one digit
string_get(cars, "!e", "\\d")
#> [1] "Mazda RX4"     "Mazda RX4 Wag" "Datsun 710"    "Fiat 128"     
#> [5] "Camaro Z28"    "Fiat X1-9"     "Volvo 142E"

# string_get(cars, "!/e & \\d")
# This example cannot be replicated directly, we need to apply a logical equivalence
# (De Morgan's law): NOT (e AND \\d) is the same as (NOT e) OR (NOT \\d)
string_get(cars, "!e", "!\\d", or = TRUE)
#>  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
#>  [4] "Hornet Sportabout"   "Valiant"             "Cadillac Fleetwood" 
#>  [7] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
#> [10] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
#> [13] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
#> [16] "Pontiac Firebird"    "Fiat X1-9"           "Lotus Europa"       
#> [19] "Ford Pantera L"      "Ferrari Dino"        "Maserati Bora"      
#> [22] "Volvo 142E"

# string_get(cars, "wi/2 | 9 | l")
# contains the 'words' 2, 9 or l
string_get(cars, "2", "9", "l", or = TRUE, word = TRUE, ignore.case = TRUE)
#> [1] "Fiat X1-9"      "Porsche 914-2"  "Ford Pantera L"
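For readers used to base R, the logical combinations above can be approximated with grepl. This is only a sketch for comparison: the helper has_all is hypothetical (it is not part of stringmagic), and the flags (word, ignore) are not covered.

```r
cars <- row.names(mtcars)

# Hypothetical helper: TRUE where x matches every pattern (logical AND)
has_all <- function(x, patterns) {
  Reduce(`&`, lapply(patterns, function(p) grepl(p, x)))
}

# counterpart of "a & e & i & o"
vowels <- cars[has_all(cars, c("a", "e", "i", "o"))]

# counterpart of "!e & \\d": no 'e' and at least one digit
no_e_digit <- cars[!grepl("e", cars) & grepl("[0-9]", cars)]

# counterpart of "!e", "!\\d", or = TRUE:
# by De Morgan's law this is NOT (has an 'e' AND has a digit)
not_both <- cars[!(grepl("e", cars) & grepl("[0-9]", cars))]
```

Each pattern needs its own grepl call and the combination logic has to be spelled out by hand, which is the verbosity the compact "a & e & i & o" syntax avoids.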

Specificities of string_get

On top of the detection previously described, the function string_get changes its behavior with the arguments seq or seq.unik. It also supports automatic caching.

Sequentially appending results

As seen previously, patterns in ... are combined with a logical AND. If you set seq = TRUE, this behavior changes: the results of each pattern are stacked sequentially. Schematically, you obtain the vector c(x_that_contains_pat1, x_that_contains_pat2, etc), with pat1 the first pattern in ..., pat2 the second pattern, etc.

Using seq.unik = TRUE is like seq but applies the function unique() at the end.

Ex: sequentially combining results.

# cars without digits, then cars with 2 'a's or 2 'e's and a digit
string_get(cars, "!\\d", "i/a.+a | e.+e & \\d", seq = TRUE)
#>  [1] "Hornet Sportabout"   "Valiant"             "Cadillac Fleetwood" 
#>  [4] "Lincoln Continental" "Chrysler Imperial"   "Honda Civic"        
#>  [7] "Toyota Corolla"      "Toyota Corona"       "Dodge Challenger"   
#> [10] "AMC Javelin"         "Pontiac Firebird"    "Lotus Europa"       
#> [13] "Ford Pantera L"      "Ferrari Dino"        "Maserati Bora"      
#> [16] "Mazda RX4"           "Mazda RX4 Wag"       "Hornet 4 Drive"     
#> [19] "Merc 450SE"          "Camaro Z28"

# let's get the first word of each car name
car_first = string_ops(cars, "extract.first")
# we select car brands ending with 'a', then ending with 'i'
string_get(car_first, "a$", "i$", seq = TRUE)
#> [1] "Mazda"    "Mazda"    "Honda"    "Toyota"   "Toyota"   "Ferrari"  "Maserati"
# seq.unik is similar to seq but applies unique()
string_get(car_first, "a$", "i$", seq.unik = TRUE)
#> [1] "Mazda"    "Honda"    "Toyota"   "Ferrari"  "Maserati"
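The stacking behavior can be pictured with a few lines of base R. The sketch below is an approximation for illustration only; it assumes that taking the first alphanumeric run of each name gives the same first words as the "extract.first" operation used above.

```r
cars <- row.names(mtcars)

# first alphanumeric word of each car name (every name has one)
car_first <- regmatches(cars, regexpr("[[:alnum:]]+", cars))

patterns <- c("a$", "i$")

# seq = TRUE: matches of pattern 1, followed by matches of pattern 2, etc.
stacked <- unlist(lapply(patterns, function(p) car_first[grepl(p, car_first)]))

# seq.unik = TRUE: same stacking, then unique()
stacked_unique <- unique(stacked)
```

Note that the order of the result follows the order of the patterns, not the order of the elements in the input vector.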

Caching

At the exploration stage, we often run the same command with a few variations on the same data set. Acknowledging this, string_get supports the caching of the data argument in interactive use. This means that the user can concentrate on the pattern to find and need not bother writing the data from which to fetch the values. Note that string_get is the only stringmagic function to have this ability.

Caching is always enabled; you don’t need to do anything.

Ex: caching of the data.

# Since we used `car_first` in the previous example, we don't need to provide
# it explicitly now
# => brands containing 'M' and ending with 'a' or 'i'; brands containing 'M'
string_get("M & [ai]$", "M", seq.unik = TRUE)
#> [1] "Mazda"    "Maserati" "Merc"     "AMC"

Chaining string operations with string_ops

Formatting text data often requires applying many functions (be it for parsing, text analysis, etc). Even for simple tasks, the number of operations can quickly balloon, adding many lines of code, reducing readability, and all this for basic processing.

The function string_ops tries to solve this problem. It has access to all (50+) string_magic operations, allowing for a compact and readable way to chain basic operations on character strings.

Below are a few motivating examples.

Ex.1: Parsing data.

# parsing an input: extracting the numbers
input = "8.5in, 5.5, .5 cm"
string_ops(input, "','split, tws, '^\\. => 0.'replace, '^\\D+|\\D+$'replace, num")
#> [1] 8.5 5.5 0.5


# Explanation------------------------------------------------------------------|
# ','split: splitting w.r.t. ','                                               |
# tws: trimming the whitespaces                                                |
# '^\\. => 0.'replace: adds a 0 to strings starting with '.'                   |
# '^\\D+|\\D+$'replace: removes non-digits on both ends of the string          |
# num: converts to numeric                                                     |


# now extracting the units
string_ops(input, "','split, '^[ \\d.]+'replace, tws")
#> [1] "in" ""   "cm"


# Explanation------------------------------------------------------------------|
# ','split: splitting w.r.t. ','                                               |
# '^[ \\d.]+'replace: removes the ' ', digit                                   |
#                     and '.' at the beginning of the string                   |
# tws: trimming the whitespaces                                                |

Ex.2: extracting information from text.

# Now using the car data
cars = row.names(mtcars)

# let's get the brands starting with an "m"
string_ops(cars, "'i/^m'get, x, unik")
#> [1] "Mazda"    "Merc"     "Maserati"


# Explanation------------------------------------------------------------------|
# 'i/^m'get: keeps only the elements starting with an m,                       |
#            i/ is the 'regex-flag' "ignore" to ignore the case                |
#            ^m means "starts with an m" in regex language                     |
# x: extracts the first pattern. The default pattern is "[[:alnum:]]+"         |
#    which means an alpha-numeric word                                         |
# unik: applies unique() to the vector                                         |


# let's get the 3 largest numbers appearing in the car models
string_ops(cars, "'\\d+'x, rm, unik, num, dsort, 3 first")
#> [1] 914 710 450


# Explanation------------------------------------------------------------------|
# '\\d+'x: extracts the first pattern, the pattern meaning "a succession"      |
#          of digits in regex language                                         |
# rm: removes elements equal to the empty string (default behavior)            |
# unik: applies unique() to the vector                                         |
# num: converts to numeric                                                     |
# dsort: sorts in decreasing order                                             |
# 3 first: keeps only the first three elements                                 |

As you can see, an operation that would take multiple lines to read and understand now can be read from left to right in a single line.
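To make the comparison concrete, here is one possible base R version of two of the examples above. It is a sketch written for this comparison (the intermediate variable names are arbitrary), not the code string_ops runs internally.

```r
# Ex.1 in base R: extracting the numbers
input <- "8.5in, 5.5, .5 cm"
parts <- strsplit(input, ",", fixed = TRUE)[[1]]  # split on ','
parts <- trimws(parts)                            # trim the whitespaces
parts <- sub("^[.]", "0.", parts)                 # add a 0 before a leading '.'
parts <- gsub("^[^0-9]+|[^0-9]+$", "", parts)     # drop non-digits at both ends
values <- as.numeric(parts)                       # convert to numeric

# Ex.2 in base R: the 3 largest numbers appearing in the car models
cars <- row.names(mtcars)
first_num <- regmatches(cars, regexpr("[0-9]+", cars))  # names without digits are dropped
largest <- head(sort(unique(as.numeric(first_num)), decreasing = TRUE), 3)
```

The steps are the same, but each one requires a separate call and an intermediate assignment (or deep nesting).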

string_clean: One function to clean them all

The function string_clean streamlines the cleaning of character vectors by providing:

    1. a specialized syntax to replace multiple regex patterns,
    2. a direct access to many low level string operations, and
    3. the ability to chain these two operations.

Cleaning syntax

This function is of the form string_clean(x, ...) with x the vector to clean and ... any number of cleaning operations which can be of two types:

  1. use "pat1, pat2 => replacement" to replace the regex patterns pat1 and pat2 with the value replacement.
  2. use "@op1, op2" to perform any arbitrary sequence of string_magic operations.

In the operation "pat1, pat2 => replacement", the pattern is first split with respect to the pipe, " => " (change it with argument pipe), to get replacement. Then the pattern is split with respect to commas (i.e. ",[ \t\n]+", change it with argument sep) to get pat1 and pat2. A sequence of base::gsub calls is performed to replace each patx with replacement.

By default the replacement is the empty string. This means that writing "pat1, pat2" will lead to erasing these two patterns.

If a pattern starts with an "@", the subsequent character string is sent to string_ops. For example "@ascii, lower" is equivalent to string_ops(x, "ascii, lower") which turns x to ASCII and lowers the case.
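The parsing of a "pat1, pat2 => replacement" operation can be sketched in base R as follows. The function replace_sketch is hypothetical and deliberately minimal: it mirrors the pipe and sep defaults described above, but ignores flags, "@" operations and any other feature of string_clean.

```r
# Hypothetical, simplified illustration of one replacement operation
replace_sketch <- function(x, op, pipe = " => ", sep = ",[ \t\n]+") {
  # 1) split on the pipe to isolate the replacement (default: empty string)
  parts <- strsplit(op, pipe, fixed = TRUE)[[1]]
  replacement <- if (length(parts) > 1) parts[2] else ""

  # 2) split the left-hand side on commas to get the individual patterns
  patterns <- strsplit(parts[1], sep)[[1]]

  # 3) apply one gsub call per pattern, in order
  for (pat in patterns) {
    x <- gsub(pat, replacement, x)
  }
  x
}

stemmed <- replace_sketch(c("makes", "takes"), "(m|t)akes => \\1ake")
erased <- replace_sketch("a1b2 c3", "[0-9], b")
```

Here stemmed is c("make", "take"), and erased is "a c" because both patterns are replaced by the default empty string.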

Example of text cleaning

monologue = c("For who would bear the whips and scorns of time",
              "Th' oppressor's wrong, the proud man's contumely,",
              "The pangs of despis'd love, the law's delay,",
              "The insolence of office, and the spurns",
              "That patient merit of th' unworthy takes,",
              "When he himself might his quietus make",
              "With a bare bodkin? Who would these fardels bear,",
              "To grunt and sweat under a weary life,",
              "But that the dread of something after death-",
              "The undiscover'd country, from whose bourn",
              "No traveller returns- puzzles the will,",
              "And makes us rather bear those ills we have",
              "Than fly to others that we know not of?")

# Cleaning a text
string_clean(monologue, 
          # use string_magic to: lower the case and remove basic stopwords
          "@lower, stopword",
          # remove a few extra stopwords (we use the flag word 'w/')
          "w/th, 's",
          # manually stem some verbs
          "despis'd => despise", "undiscover'd => undiscover", "(m|t)akes => \\1ake",
          # still stemming: dropping the ending 's' for words of 5+ letters, except for quietus
          "(\\w{3,}[^u])s\\b => \\1",
          # normalizing the whitespaces + removing punctuation
          "@ws.punct")
#>  [1] "bear whip scorn time"                "oppressor wrong proud man contumely"
#>  [3] "pang despise love law delay"         "insolence office spurn"             
#>  [5] "patient merit unworthy take"         "might quietus make"                 
#>  [7] "bare bodkin fardel bear"             "grunt sweat weary life"             
#>  [9] "dread something death"               "undiscover country whose bourn"     
#> [11] "traveller return puzzle will"        "make us rather bear ills"           
#> [13] "fly other know"

Create simple character vectors with string_vec

The function string_vec is dedicated to the creation of small character vectors. You feed it a comma-separated list of values in a string and it turns it into a vector.

Ex.1: creating a simple vector.

fruits = string_vec("orange, apple, pineapple, strawberry")
fruits
#> [1] "orange"     "apple"      "pineapple"  "strawberry"

Within the enumeration, you can use interpolation, with curly brackets ({}), to insert the elements from a vector into the current string.

Ex.2: adding a vector into an enumeration.

more_fruits = string_vec("lemon, {fruits}, peach")
more_fruits
#> [1] "lemon"      "orange"     "apple"      "pineapple"  "strawberry"
#> [6] "peach"

The interpolation is performed with string_magic. This means that any string_magic operation can be applied on-the-fly.

Ex.3: replicating Ex.2 but shortening long fruit names.

more_fruits = string_vec("lemon, {6 Shorten ? fruits}, peach")
more_fruits
#> [1] "lemon"   "orange"  "apple"   "pinea.." "straw.." "peach"

Since interpolations are resolved with string_magic, you can add any text before/after the interpolation:

Ex.4: adding text before the interpolation.

pkgs = string_vec("pandas, os, time, re")
imports = string_vec("import numpy as np, import {pkgs}")
imports
#> [1] "import numpy as np" "import pandas"      "import os"         
#> [4] "import time"        "import re"
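For reference, the vectors of Ex.1, Ex.2 and Ex.4 can be built in base R as shown below. This is only a comparison sketch; it does not reproduce the on-the-fly operations of Ex.3.

```r
# Ex.1: split a comma-separated string and trim the pieces
fruits <- trimws(strsplit("orange, apple, pineapple, strawberry", ",", fixed = TRUE)[[1]])

# Ex.2: inserting a vector in the middle of an enumeration
more_fruits <- c("lemon", fruits, "peach")

# Ex.4: text placed before the interpolated vector is repeated for each element
pkgs <- c("pandas", "os", "time", "re")
imports <- c("import numpy as np", paste0("import ", pkgs))
```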

Creating small matrices or data frames

You can transform the returned vector into a matrix or a data frame using the arguments .cmat, .nmat (character or numeric matrix) or .df.

Ex.5: returning a matrix.

string_vec("1, 5,
            3, 2,
            5, 12", .nmat = TRUE)
#>      [,1] [,2]
#> [1,]    1    5
#> [2,]    3    2
#> [3,]    5   12

The number of rows is guessed from the number of newlines in the string. You can avoid using character strings, but in that case you need to explicitly give the number of rows.

Ex.5-bis: returning a numeric matrix, giving .nmat the number of rows.

string_vec(1, 5,
           3, 2,
           5, 12, .nmat = 3)
#>      [,1] [,2]
#> [1,]    1    5
#> [2,]    3    2
#> [3,]    5   12
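The idea of inferring the number of rows from the newlines can be sketched in base R. This is an assumption about the general principle, written for illustration, and not the actual implementation of string_vec.

```r
txt <- "1, 5,
        3, 2,
        5, 12"

# one row per line of the input string
n_rows <- length(strsplit(txt, "\n", fixed = TRUE)[[1]])

# split on commas, trim spaces and newlines, convert to numeric
values <- as.numeric(trimws(strsplit(txt, ",", fixed = TRUE)[[1]]))

# fill row by row so that the layout matches the text
m <- matrix(values, nrow = n_rows, byrow = TRUE)
```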

If you want to return a data.frame, you can add the column names in the .df argument, either as a regular vector or as a comma-separated list. Note that columns looking like numeric values are always converted.

Ex.6: returning a data frame.

# you can add the column names directly in the argument .df
df = string_vec("1, john,
                 3, marie,
                 5, harry", .df = "id, name")
df
#>   id  name
#> 1  1  john
#> 2  3 marie
#> 3  5 harry

# automatic conversion of numeric values
df$id * 5
#> [1]  5 15 25

Split vectors and turn the result into a data frame, and vice versa

The function string_split2df (and string_split2dt) splits a vector using a regular expression pattern and turns it into a data frame, remembering the original identifiers. You can get the original vectors back (almost) with the function paste_conditional.

Ex.1: breaking up two sentences with respect to punctuation and spaces; then merging them back.

x = c("Nor rain, wind, thunder, fire are my daughters.",
      "When my information changes, I alter my conclusions.")

# we split at each word
sentences_split = string_split2df(x, "[[:punct:] ]+")
sentences_split
#>    obs           x
#> 1    1         Nor
#> 2    1        rain
#> 3    1        wind
#> 4    1     thunder
#> 5    1        fire
#> 6    1         are
#> 7    1          my
#> 8    1   daughters
#> 9    2        When
#> 10   2          my
#> 11   2 information
#> 12   2     changes
#> 13   2           I
#> 14   2       alter
#> 15   2          my
#> 16   2 conclusions

# recovering the original vectors (we only lose the punctuation)
paste_conditional(sentences_split$x, sentences_split$obs)
#>                                                    1 
#>        "Nor rain wind thunder fire are my daughters" 
#>                                                    2 
#> "When my information changes I alter my conclusions"
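The split-then-merge round trip of Ex.1 can be approximated in base R with strsplit and tapply. The sketch below is for comparison only; the column names obs and x are chosen to match the output shown above.

```r
x <- c("Nor rain, wind, thunder, fire are my daughters.",
       "When my information changes, I alter my conclusions.")

# split each sentence at punctuation and spaces
pieces <- strsplit(x, "[[:punct:] ]+")

# long data frame: 'obs' remembers which sentence each word came from
split_df <- data.frame(obs = rep(seq_along(pieces), lengths(pieces)),
                       x = unlist(pieces))

# merging back: paste the words of each observation together
merged <- tapply(split_df$x, split_df$obs, paste, collapse = " ")
```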

If identifiers are associated with the elements of the vector, you can provide them so that the returned data frame contains them.

Ex.2: splitting with identifiers and merging back with a formula.

id = c("ws", "jmk")
# we add the identifier
base_words = string_split2df(x, "[[:punct:] ]+", id = list(author = id))

# merging back using a formula
paste_conditional(x ~ author, base_words)
#>                                          author: jmk 
#> "When my information changes I alter my conclusions" 
#>                                           author: ws 
#>        "Nor rain wind thunder fire are my daughters"