2 Transform
2.1 Logical vectors
2.2 Numbers
2.3 Strings
2.3.1 Creating strings, escapes and raw strings
Use ' or " to create strings. The tidyverse style guide recommends ", except when the string contains multiple double quotes. To include a literal single or double quote, use \ to escape it: \' and \". \ is also used to escape special characters such as newline \n and tab \t, and \\ gives a backslash itself. See ?Quotes for all special characters.
The printed representation of a string is different from the string itself: the printed string object shows the escapes. To see the raw content of a string, use str_view().
When you have many escapes \ in a string (so-called "leaning toothpick syndrome"), you can use raw strings.
tricky <- "double_quote <- \"\\\"\" # or '\"'
single_quote <- '\\'' # or \"'\""
str_view(tricky)
#> [1] │ double_quote <- "\"" # or '"'
#>     │ single_quote <- '\'' # or "'"

tricky_raw <- r"(double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'")"
str_view(tricky_raw)
#> [1] │ double_quote <- "\"" # or '"'
#>     │ single_quote <- '\'' # or "'"
A raw string starts with r"( and ends with )", and everything in between is taken literally, including quotes and backslashes. If the string contains )", you can use r"[ ]" or r"{ }", or insert any number of dashes to make the opening and closing delimiters unique, e.g. r"--( )--" or r"---( )---".
str_view() uses curly braces for tabs {\t} to make them more visible.
2.3.2 Exercises
Create strings that contain the following values:
1. He said "That's amazing!"
2. \a\b\c\d
3. \\\\\\
Create the string in your R session and print it.
What happens to the special “\u00a0”?
How does str_view() display it?
Can you do a little googling to figure out what this special character is?
x <- "This\u00a0is\u00a0tricky"
`\u00a0` is a non-breaking space character.
2.3.3 Strings from data: str_c(), str_glue(), str_flatten()
2.3.4 str_c()
str_c():
- takes any number of vectors and returns a character vector, similar to base::paste0().
- obeys the tidyverse rules for recycling and for propagating missing values (NA): it is designed to be used with mutate().
- if missing values should display another way, use coalesce() inside or outside str_c().
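A minimal sketch of these rules (the vector x is made up for illustration):

```r
library(stringr)
library(dplyr)

x <- c("Hadley", NA)
str_c("Hi ", x, "!")                    # NA propagates through str_c()
#> [1] "Hi Hadley!" NA
str_c("Hi ", coalesce(x, "you"), "!")   # coalesce() supplies a fallback
#> [1] "Hi Hadley!" "Hi you!"
```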
2.3.5 str_glue()
str_glue() from the {glue} package allows the use of {} to mix fixed and variable strings. This improves readability compared to str_c(). But str_glue() does not propagate missing values NA by default, so you may need to use coalesce().
To escape { or }, use double braces {{ or }}.
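For example (a small sketch; x is a made-up variable):

```r
library(stringr)

x <- "values"
# doubled braces are printed literally; single braces interpolate
str_glue("{{literal braces}} around {x}")
#> {literal braces} around values
```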
2.3.6 str_flatten()
str_c() and str_glue() work well with mutate() because their output is the same length as their inputs. str_flatten() takes a character vector, combines its elements, and returns a single string: it works well with summarize().
str_flatten(
  string,
  collapse = "",
  last = NULL,
  na.rm = FALSE
)

str_flatten_comma(
  string,
  last = NULL,
  na.rm = FALSE
)

An example of how str_flatten() works well with summarise():
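A sketch with a made-up data frame:

```r
library(dplyr)
library(stringr)

df <- tribble(
  ~name,    ~fruit,
  "Carmen", "banana",
  "Carmen", "apple",
  "Marvin", "nectarine"
)
# one row per group, with all fruits collapsed into a single string
df |>
  group_by(name) |>
  summarize(fruits = str_flatten(fruit, ", "))
#> # A tibble: 2 × 2
#>   name   fruits
#>   <chr>  <chr>
#> 1 Carmen banana, apple
#> 2 Marvin nectarine
```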
2.3.7 Exercises
Compare and contrast the results of paste0() with str_c() for the following inputs:
str_c("hi ", NA)
str_c(letters[1:2], letters[1:3])

What’s the difference between `paste()` and `paste0()`? How can you recreate the equivalent of `paste()` with `str_c()`?
paste (..., sep = " ", collapse = NULL, recycle0 = FALSE)
paste0(..., collapse = NULL, recycle0 = FALSE)
Convert the following expressions from str_c() to str_glue() or vice versa:
a. `str_c("The price of ", food, " is ", price)`
b. `str_glue("I'm {age} years old and live in {country}")`
c. `str_c("\\section{", title, "}")`
2.3.8 Data from strings
If multiple variables are crammed into a single string, you can use four tidyr functions to separate them:
- separate_longer_delim(col, delim): creates new rows; delim splits the string on a delimiter, e.g. ", " or " ".
- separate_longer_position(col, width): creates new rows; position splits the string at specified widths, e.g. c(3, 2).
- separate_wider_delim(col, delim, names): creates new columns.
- separate_wider_position(col, widths): creates new columns.
[!Tip] Look at these too: str_split(), str_split_fixed(), str_extract(), and str_match().
2.3.9 separate_longer_delim() and separate_longer_position()
2.3.10 separate_wider_delim() and separate_wider_position()
Here the x object is made up of a code, an edition, and a year separated by ".".
If a specific piece is not useful, you can use an NA name to omit it from the results:
separate_wider_position() works a little differently: its key argument is widths = c(name = value, ...), a named integer vector where each name gives the name of a new column and the value is the number of characters it occupies. You can omit pieces from the output by not naming them:
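A sketch of both wider variants on made-up data:

```r
library(tidyr)

# delim variant: split on a separator into named columns
tibble(x = c("a.1", "b.2")) |>
  separate_wider_delim(x, delim = ".", names = c("code", "n"))

# position variant: split at fixed widths; the unnamed width (2) is dropped
tibble(x = c("202215TX", "202122LA")) |>
  separate_wider_position(x, widths = c(year = 4, 2, state = 2))
```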
2.3.11 Diagnosing widening problems
separate_wider_delim() requires a fixed and known set of columns. If some rows don’t have the expected number of pieces, the function raises an error. To resolve this, the function has two arguments: too_few and too_many.
The error message even suggests how to proceed: debug or silence it.
"Debug" mode helps identify the inputs that succeeded (_ok is TRUE) and those that failed (_ok is FALSE). The _pieces column shows how many pieces were found compared to the expected number (the length of the names argument). _remainder is more useful when there are too many pieces. _ok can be used to filter the rows that failed vs. succeeded.
You may want to fill the missing pieces with NA: too_few = "align_start" fills missing pieces at the end with NA, while too_few = "align_end" fills missing pieces at the start with NA.
The same principles apply when you have too many pieces: too_many = "debug" shows the inputs with extra pieces in the _remainder column.
You can either drop the extra pieces with too_many = "drop" or merge them into the last column with too_many = "merge".
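A sketch of both arguments in action (df is made up):

```r
library(tidyr)

df <- tibble(x = c("1-1-1", "1-3", "1-3-5-6"))
df |>
  separate_wider_delim(
    x, delim = "-", names = c("a", "b", "c"),
    too_few = "align_start",  # "1-3" becomes a = 1, b = 3, c = NA
    too_many = "merge"        # "1-3-5-6" becomes a = 1, b = 3, c = "5-6"
  )
```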
2.3.12 Letters
2.3.13 str_length()
str_length() counts the number of characters (incl. spaces) in a string.
- Use with count() to find the distribution of string lengths, e.g. of baby names in the US.
- Use with filter() to find the shortest or longest strings/names.
2.3.14 Subsetting with str_sub()
str_sub(string, start, end) extracts substrings from string, starting at position start and ending at position end. Negative values for start or end count backwards from the end of the string. The start and end arguments are inclusive so the length of the returned substring is end - start + 1.
If the string is too short, it returns as many characters as possible.
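A quick sketch of these rules:

```r
library(stringr)

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"
str_sub(x, -3, -1)   # negative positions count back from the end
#> [1] "ple" "ana" "ear"
str_sub("a", 1, 5)   # too short: returns what it can
#> [1] "a"
```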
2.3.15 Exercises
When computing the distribution of the length of babynames, why did we use wt = n?
Use str_length() and str_sub() to extract the middle letter from each baby name. What will you do if the string has an even number of characters?
Are there any major trends in the length of babynames over time? What about the popularity of first and last letters?
Length of babynames over time:
Popularity of first letters over time:
2.3.16 Non-English text
There are several challenges with non-English text: encoding, letter variations, and locale-dependent functions.
charToRaw("Hadley")
#> [1] 48 61 64 6c 65 79
Each of these hexadecimal numbers represents one letter. The mapping from hexadecimal numbers to letters is called an encoding; here the encoding is ASCII. Europe alone had two common encodings, Latin1 and Latin2. Fortunately, today there is one standard supported almost everywhere: UTF-8, which can encode just about every character used by humans, including symbols and emoji. {readr} uses UTF-8 everywhere. UTF-8 is a good default, but it will fail for data produced by older systems; when this happens, your strings will look weird.
To read such data correctly, specify the encoding via the locale argument of read_csv(), e.g. locale(encoding = "latin1").
To find the encoding, use readr::guess_encoding(). You may need to try a few candidates before finding the right one.
To learn more on encoding, see https://kunststube.net/encoding/.
2.3.17 Letter variations
Working with languages that use accents is challenging because str_length() and str_sub() may not behave as expected: the same rendered character can be encoded as a single code point or as a base letter plus a combining accent.
The two encodings differ in length and in their first character.
Also, comparing these strings with == shows they are different, while str_equal() recognizes them as equal.
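For example, with ü encoded two ways (str_equal() requires stringr >= 1.5.0):

```r
library(stringr)

# ü as a single code point vs. u + combining diaeresis (U+0308)
u <- c("\u00fc", "u\u0308")
str_length(u)
#> [1] 1 2
str_sub(u, 1, 1)        # the first "character" differs
u[1] == u[2]
#> [1] FALSE
str_equal(u[1], u[2])   # Unicode-aware comparison
#> [1] TRUE
```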
2.3.18 Locale-dependent functions
There are some stringr functions that behave differently depending on the locale. A locale is specified by a lower-case language abbreviation, optionally followed by a _ and an upper-case region identifier, e.g. "en_US" for English as used in the United States, or "fr_FR" for French as used in France.
To see a list of supported locales in stringr, call stringi::stri_locale_list().
Base R automatically uses the system locale. Base R string functions will do what you expect in your language, but your code may behave differently if you share it with someone in another country. To avoid this, stringr functions default to the "en" locale and require you to specify the locale argument to override it.
There are only two sets of functions where the locale matters: changing case and sorting.
For example, Turkish has two different versions of the letter "i": a dotted "i" and a dotless "ı". Since they are two different letters, they are capitalized differently:
Another example is sorting: sorting depends on the order of the alphabet which is not the same in every language. In Czech, “ch” is considered a single letter that comes after “h”.
dplyr::arrange() has a .locale argument.
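Both cases in one sketch:

```r
library(stringr)

# case: Turkish dotted/dotless i
str_to_upper("i")
#> [1] "I"
str_to_upper("i", locale = "tr")
#> [1] "İ"

# sorting: Czech treats "ch" as a letter that sorts after "h"
str_sort(c("a", "c", "ch", "h", "z"))
#> [1] "a"  "c"  "ch" "h"  "z"
str_sort(c("a", "c", "ch", "h", "z"), locale = "cs")
#> [1] "a"  "c"  "h"  "ch" "z"
```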
2.4 Regex
2.4.1 Introduction to regex
The chapter will use regex functions from stringr and tidyr, data from the babynames package, and three character vectors from stringr: fruit (the names of 80 fruits), words (980 common English words), and sentences (720 short sentences).
2.4.2 Pattern basics
We will use str_view() to see how regex patterns work. We will use it with its second argument, a regex.
str_view() will show only the elements of the string that match, surrounding it with <>.
Letters and numbers match exactly and are called literal characters. Punctuation characters like ., (, ), \, |, ?, *, +, {, }, ^, and $ have special meaning in regex and are called metacharacters:
- . matches any character except a newline.
- ^ and $ are anchors that match the start and end of a string.
- [] defines a character class, matching any character inside the brackets; ^ at the start of a character class negates it, matching any character not in the brackets.
- | defines alternation and means "or" (pick between alternative patterns).
- ?, *, +, {} are quantifiers that specify how many times the preceding element should match:
  - ? matches 0 or 1 time (makes a pattern optional).
  - * matches 0 or more times.
  - + matches 1 or more times.
  - {} matches a specified number of times.
Quantifiers:
Character classes:
Alternation:
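A few quick illustrations of each building block:

```r
library(stringr)

# quantifier: u? makes the "u" optional, matching both spellings
str_view(c("color", "colour"), "colou?r")
# character class: [ea] matches either letter
str_view(c("gray", "grey", "groy"), "gr[ea]y")
# alternation: apple or melon
str_view(c("apple", "banana", "melon"), "apple|melon")
```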
2.4.3 str_detect() detects matches
str_detect():
- returns a logical vector indicating whether each string contains a match to the pattern.
- the output is the same length as the input vector.
- pairs well with filter().
- can be used with summarize() by pairing it with sum() and mean(): sum(str_detect(x, pattern)) gives the number of observations that match the pattern, and mean(str_detect(x, pattern)) gives the proportion that match.
This code finds all the most popular names containing a lower-case “x”:
The proportion of baby names that contain “x”, broken down by year:
Two other functions related to str_detect() are str_which() and str_subset(): str_which() return an integer vector giving the positions of the strings that match the pattern, while str_subset() returns a character vector containing only the strings that match the pattern.
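For example:

```r
library(stringr)

x <- c("apple", "banana", "pear")
str_which(x, "p")    # positions of the matching strings
#> [1] 1 3
str_subset(x, "p")   # the matching strings themselves
#> [1] "apple" "pear"
```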
2.4.4 str_count() counts matches
str_count() tells how many matches there are in each string.
x <- c("apple", "banana", "pear")
str_count(x, "p")
#> [1] 2 0 1
Each match starts at the end of the previous match, so regex matches never overlap.
str_count("abababa", "aba")
#> [1] 2
str_view("abababa", "aba")
#> [1] │ <aba>b<aba>
str_count() can be used with mutate(). The example below counts the number of vowels and consonants in each name.
babynames |>
count(name) |>
mutate(
vowels = str_count(name, "[aeiou]"),
consonants = str_count(name, "[^aeiou]")
)
# A tibble: 97,310 × 4
name n vowels consonants
<chr> <int> <int> <int>
1 Aaban 10 2 3
2 Aabha 5 2 3
3 Aabid 2 2 3
4 Aabir 1 2 3
5 Aabriella 5 4 5
6 Aada 1 2 2
7 Aadam 26 2 3
8 Aadan 11 2 3
9 Aadarsh 17 2 5
10 Aaden 18 2 3
# ℹ 97,300 more rows
The code found only 2 vowels in Aaban instead of 3 because regex matching is case-sensitive. To make it case-insensitive:
- Add the upper-case vowels to the character class: str_count(name, "[aeiouAEIOU]").
- Wrap the pattern in regex() with ignore_case = TRUE: str_count(name, regex("[aeiou]", ignore_case = TRUE)).
- Use str_to_lower() or str_to_upper() to convert the strings before counting: str_count(str_to_lower(name), "[aeiou]").
babynames |>
count(name) |>
mutate(
name = str_to_lower(name),
vowels = str_count(name, "[aeiou]"),
consonants = str_count(name, "[^aeiou]")
)

2.4.5 str_replace() and str_replace_all() modify matches
str_replace() replaces the first match, and str_replace_all() replaces all matches.
x <- c("apple", "pear", "banana")
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"
str_replace(x, "[aeiou]", "-")
#> [1] "-pple"  "p-ar"   "b-nana"
str_remove() and str_remove_all() are shortcuts for str_replace(x, pattern, "") and str_replace_all(x, pattern, "").
2.4.6 separate_wider_regex() extracts from columns into new columns
The family of separate_wider_*() functions lives in tidyr because they operate on columns of data frames rather than on individual vectors.
Below are some data derived from babynames where we have the name, gender, and age of a bunch of people in a weird format:
df <- tribble(
~str ,
"<Sheryl>-F_34" ,
"<Kisha>-F_45" ,
"<Brandon>-N_33" ,
"<Sharon>-F_38" ,
"<Penny>-F_58" ,
"<Justin>-M_41" ,
"<Patricia>-F_84" ,
)

Using separate_wider_regex(), we can extract different parts of df$str into new columns:
df |>
separate_wider_regex(
str,
pattern = c(
"<",
name = "[A-Za-z]+",
">-",
gender = "[NMF]",
"_",
age = "[0-9]+"
)
)
# A tibble: 7 × 3
name gender age
<chr> <chr> <chr>
1 Sheryl F 34
2 Kisha F 45
3 Brandon N 33
4 Sharon F 38
5 Penny F 58
6 Justin M 41
7 Patricia F 84
If the match fails, use too_few = "debug" to diagnose the problem.
2.4.7 Exercises
What baby name has the most vowels? What name has the highest proportion of vowels? (Hint: what is the denominator?)
1. Show most vowels at the top: 15.
2. Show highest proportion of vowels at the top: 0.85 (after the names that consist only of vowels, whose proportion is 1).
Replace all forward slashes in “a/b/c/d/e” with backslashes. What happens if you attempt to undo the transformation by replacing all backslashes with forward slashes? (We’ll discuss the problem very soon.)
Note that the backslash must be escaped in both arguments: in the pattern, the regex \\ is written as the string "\\\\", and in the replacement a literal backslash is also written "\\\\", because \ introduces back references there. Forward slashes need no escaping in either argument.
Implement a simple version of str_to_lower() using str_replace_all().
babynames |>
  count(name) |>
  mutate(
    # replace each upper-case letter with its lower-case counterpart
    lower = str_replace_all(name, setNames(letters, LETTERS))
  )

Create a regular expression that will match telephone numbers as commonly written in your country.
1. Start of string ^, followed by +46, 0046, or 0. Note that \+ escapes the + and | is alternation.
2. \s? matches any whitespace (space, tab); ? makes it optional. -? is an optional dash. \(? is an optional literal (. \d matches a digit 0-9; the quantifier {1,3} specifies 1 to 3 digits. \)? is an optional literal ).
3. ([\s-]?\d{2,3}){2,3}$ matches the second part of the phone number: 2 to 3 groups of 2 to 3 digits, each preceded by an optional space or dash. $ anchors the match at the end of the string.
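Assembling the pieces described above (a sketch for Swedish-style numbers; adapt the prefix alternatives to your country):

```r
library(stringr)

phone <- "^(\\+46|0046|0)[\\s-]?\\(?\\d{1,3}\\)?([\\s-]?\\d{2,3}){2,3}$"
str_detect(c("+46 70-123 45 67", "0701-234567", "12345"), phone)
#> [1]  TRUE  TRUE FALSE
```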
() specifies capturing groups. A regex works without (), in which case str_match() returns only the full match; with (), str_match() returns a matrix whose first column is the complete match and whose subsequent columns are the capturing groups.
2.4.8 Pattern details
2.4.9 Escaping metacharacters
dot <- "\\."
str_view(dot)
#> [1] │ \.
str_view(c("abc", "a.c", "bef"), "a\\.c")
#> [2] │ <a.c>
To match a literal backslash \, you need to escape it creating regex \\. To create this regex, you need to use a string which also needs to escape each backslash, resulting in four backslashes \\\\ to match a single backslash.
In this book, we’ll usually write regular expression without quotes, like \.. If we need to emphasize what you’ll actually type, we’ll surround it with quotes and add extra escapes, like "\\.".
x <- "a\\b"
str_view(x)
#> [1] │ a\b
str_view(x, "\\\\")
#> [1] │ a<\>b
Raw strings can be easier to use:
str_view(x, r"{\\}")
#> [1] │ a<\>b
Another alternative, when trying to match a literal ., $, *, |, +, ?, (, ), {, or }, is to use a character class: [.], [$], [|], etc.
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
#> [2] │ <a.c>
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
#> [3] │ <a*c>
2.4.10 Anchors: ^ and $
head(fruit)
#> [1] "apple"       "apricot"     "avocado"     "banana"      "bell pepper"
#> [6] "bilberry"

str_view(fruit, "^a")
#> [1] │ <a>pple
#> [2] │ <a>pricot
#> [3] │ <a>vocado

str_view(fruit, "a$")
#> [4] │ banan<a>
#> [15] │ cherimoy<a>
#> [30] │ feijo<a>
#> [36] │ guav<a>
#> [56] │ papay<a>
#> [74] │ satsum<a>
You can match the boundary between words with \b. \bsum\b matches sum() but not summary() or summarize().
x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
#> [1] │ <sum>mary(x)
#> [2] │ <sum>marize(df)
#> [3] │ row<sum>(x)
#> [4] │ <sum>(x)
str_view(x, "\\bsum\\b")
#> [4] │ <sum>(x)
When used alone, anchors will produce a zero-width match:
str_view("abc", c("$", "^", "\\b"))
#> [1] │ abc<>
#> [2] │ <>abc
#> [3] │ <>abc<>
See what happens when you replace a standalone anchor:
str_replace_all("abc", c("$", "^", "\\b"), "--")
#> [1] "abc--"   "--abc"   "--abc--"
2.4.11 Character classes
A character class, or character set, matches any character from a set of characters defined inside [].
- [abc] matches any of the letters a, b, or c; [^abc] matches any character except a, b, or c.
- - defines a range: [a-z] matches any lower-case letter and [0-9] matches any digit.
- \ escapes special characters inside a class: [\^\-\]] matches a caret ^, hyphen -, or closing bracket ].
- \d matches a digit (equivalent to [0-9]); \D matches anything that is not a digit.
- \w matches a word character (equivalent to [A-Za-z0-9_], i.e. excluding spaces, punctuation, symbols, and special characters); \W matches anything that is not a word character.
- \s matches a whitespace character (space, tab, newline); \S matches anything that is not whitespace.
2.4.11.1 Hint
1. matches one or more digits
2. matches anything that is not a digit, one or more times
3. matches one or more whitespace characters
4. matches anything that is not whitespace, one or more times
5. matches one or more word characters
6. matches anything that is not a word character, one or more times (space, newline, tab, punctuation, symbols, special characters)
2.4.12 Quantifiers: ?, *, +, {}
Quantifiers control how many times a pattern matches.
- ? matches 0 or 1 time (makes a pattern optional).
- * matches 0 or more times.
- + matches 1 or more times.
- {n} matches exactly n times.
- {n,} matches n or more times.
- {n,m} matches between n and m times.
2.4.13 Operator precedence and ()
In regex, quantifiers have higher precedence than alternation: ab+ is equivalent to a(b+), not (ab)+, while ab|cd is equivalent to (ab)|(cd), not a(b|c)d. Feel free to use parentheses () to override operator precedence or simply to make the precedence explicit.
2.4.14 Grouping and capturing with ()
Parentheses () also allow you to create capturing groups, which let you work with sub-components of the match.
The first way to use capturing group is to refer back to it within a match with back reference: \1, \2, etc. refer to the first, second, etc. parentheses (capturing group).
str_view(fruit, "(..)\\1")
#> [4] │ b<anan>a
#> [20] │ <coco>nut
#> [22] │ <cucu>mber
#> [41] │ <juju>be
#> [56] │ <papa>ya
#> [73] │ s<alal> berry

(..) captures any two characters, and \1 refers back to the first capturing group, so the pattern matches any two characters followed by the same two characters.
The code below finds any word whose first two letters match its last two: ^(..) captures the first two characters, \\1$ matches the same two characters at the end, and .* matches any characters in between.
You can use back references in str_replace(). The code below switches the order of the second and third words in sentences:
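A sketch of that swap using numbered back references in the replacement:

```r
library(stringr)

str_replace(
  head(sentences, 3),
  "(\\w+) (\\w+) (\\w+)",
  "\\1 \\3 \\2"   # keep word 1, swap words 2 and 3
)
```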
The matches can be extracted with str_match(), which returns the full match plus each capturing group; but it returns a matrix, which is not easy to work with. The matrix can be converted to a tibble with named columns:
This is how separate_wider_regex()works behind the scenes. It converts the vector of patterns to a single regex that uses grouping to capture the named components.
When you want to use () for grouping without creating a capturing group, use a non-capturing group (?:).
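For example:

```r
library(stringr)

x <- "gray"
str_match(x, "gr(e|a)y")     # capturing: full match plus the group
#>      [,1]   [,2]
#> [1,] "gray" "a"
str_match(x, "gr(?:e|a)y")   # non-capturing: only the full match
#>      [,1]
#> [1,] "gray"
```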
2.4.15 Exercises
How would you match literal string "'\"? How about "$^$"?
To match "'\ you need to escape the backslash: the regex is \"'\\, written as the string "\"'\\\\". To match "$^$", escape the metacharacters $ and ^: the regex is \"\$\^\$\", written as the string "\"\\$\\^\\$\"".
1. To create \ in a string, it must be escaped with another \, since \ is the string escape character. ' does not need to be escaped inside "...".
2. The regex \\ (string "\\\\") matches a literal \, and \' matches a literal '. Since ' can be written inside "..." without escaping, a plain ' in the pattern also matches a literal '.
Explain why each of these patterns don’t match a \: "\", "\\", "\\\".
Given the corpus of common words in stringr::words, create regular expressions that find all words that:
- Start with “y”.
- Don’t start with “y”.
- End with “x”.
- Are exactly three letters long. (Don’t cheat by using str_length()!)
- Have seven letters or more.
- Contain a vowel-consonant pair.
- Contain at least two vowel-consonant pairs in a row.
- Only consist of repeated vowel-consonant pairs.
Create 11 regular expressions that match the British or American spellings for each of the following words: airplane/aeroplane, aluminum/aluminium, analog/analogue, ass/arse, center/centre, defense/defence, donut/doughnut, gray/grey, modeling/modelling, skeptic/sceptic, summarize/summarise. Try to make the shortest possible regex!
Switch the first and last letters in words. Which of those strings are still words?
Describe in words what these regular expressions match: (read carefully to see if each entry is a regular expression or a string that defines a regular expression.)
^.*$
"\\{.+\\}"
\d{4}-\d{2}-\d{2}
"\\\\{4}"
\<\.\.\.\>
(.)\1\1
"(..)\\1"

- matches any string (including the empty string).
- matches any string of one or more characters enclosed in { }.
- matches dates in the format yyyy-mm-dd: 4 digits, 2 digits, and 2 digits separated by dashes.
- matches a string of four backslashes.
- matches three dots enclosed in < >.
- matches any character repeated three times in a row.
- matches any two characters repeated twice in a row.
2.4.15.1 Question 7
Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.
2.4.16 Pattern control
Flags can be used to control the default behavior of regex. In stringr, flags can be used by wrapping the pattern in a call to regex(). Some useful flags are:
ignore_case = TRUE: matches both upper- and lowercase.
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
#> [1] │ <banana>
str_view(bananas, regex("banana", ignore_case = TRUE))
#> [1] │ <banana>
#> [2] │ <Banana>
#> [3] │ <BANANA>
dotall = TRUE: lets . match everything, including \n.
x <- "Line 1\nLine 2\nLine 3"
str_view(x, ".Line")        # no matches: . does not cross the newlines
str_view(x, regex(".Line", dotall = TRUE))
#> [1] │ Line 1<
#> │ Line> 2<
#> │ Line> 3
multiline = TRUE: makes ^ and $ match the start and end of each line rather than the start/end of the whole string.
x <- "Line 1\nLine 2\nLine 3"
str_view(x, "^Line")
#> [1] │ <Line> 1
#> │ Line 2
#> │ Line 3
str_view(x, regex("^Line", multiline = TRUE))
#> [1] │ <Line> 1
#> │ <Line> 2
#> │ <Line> 3
comments = TRUE: tweaks the pattern language to ignore spaces and newlines, and everything after #.
phone <- regex(
r"(
\(? # optional opening parens
(\d{3}) # area code
[)\-]? # optional closing parens or dash
\ ? # optional space
(\d{3}) # another three numbers
[\ -]? # optional space or dash
(\d{4}) # four more numbers
)",
comments = TRUE
)
str_extract(c("514-791-8141", "(123) 456 7890", "123456"), phone)
#> [1] "514-791-8141"   "(123) 456 7890" NA
If you are using comments = TRUE and want to match a space, newline, or #, you need to escape it with \.
2.4.17 Fixed matches
fixed() can be used instead of regex() to opt out of the regular expression rules and match the pattern exactly as written.
str_view(c("", "a", "."), fixed("."))
#> [3] │ <.>
str_view("x X", "X")
#> [1] │ x <X>
str_view("x X", fixed("X", ignore_case = TRUE))
#> [1] │ <x> <X>
If you are working with non-English text, use coll() instead of fixed(), as it implements the full rules for capitalization and locale.
str_view("i İ ı I", fixed("İ", ignore_case = TRUE))
#> [1] │ i <İ> ı I
str_view("i İ ı I", coll("İ", ignore_case = TRUE, locale = "tr"))
#> [1] │ <i> <İ> ı I
2.4.18 Practice
- Checking your work
library(tidyverse)
library(babynames)
# let’s find all sentences that start with “The”.
str_view(head(sentences), "^The")
[1] │ <The> birch canoe slid on the smooth planks.
[4] │ <The>se days a chicken leg is a rare dish.
[6] │ <The> juice of lemons makes fine punch.
# We need to make sure that the “e” is the last letter in the word, which we can do by adding a word boundary:
str_view(sentences, "^The\\b")
[1] │ <The> birch canoe slid on the smooth planks.
[6] │ <The> juice of lemons makes fine punch.
[7] │ <The> box was thrown beside the parked truck.
[8] │ <The> hogs were fed chopped corn and garbage.
[11] │ <The> boy was there when the sun rose.
[13] │ <The> source of the huge river is the clear spring.
[18] │ <The> soft cushion broke the man's fall.
[19] │ <The> salt breeze came across from the sea.
[20] │ <The> girl at the booth sold fifty bonds.
[21] │ <The> small pup gnawed a hole in the sock.
[22] │ <The> fish twisted and turned on the bent hook.
[24] │ <The> swan dive was far short of perfect.
[25] │ <The> beauty of the view stunned the young boy.
[28] │ <The> colt reared and threw the tall rider.
[36] │ <The> wrist was badly strained and hung limp.
[37] │ <The> stray cat gave birth to kittens.
[38] │ <The> young girl gave no clear response.
[39] │ <The> meal was cooked before the bell rang.
[42] │ <The> ship was torn apart on the sharp reef.
[44] │ <The> wide road shimmered in the hot sun.
... and 236 more
# What about finding all sentences that begin with a pronoun?
str_view(sentences, "^(She\\b|He\\b|We\\b|I\\b|You\\b|They\\b)")
[61] │ <We> talked of the side show in the circus.
[63] │ <He> ran half way to the hardware store.
[90] │ <He> lay prone and hardly moved a limb.
[116] │ <He> ordered peach pie with ice cream.
[120] │ <We> find joy in the simplest things.
[132] │ <He> said the same phrase thirty times.
[139] │ <We> frown when events take a bad turn.
[153] │ <He> broke a new shoelace that day.
[158] │ <We> tried to replace the coin but failed.
[159] │ <She> sewed the torn coat quite neatly.
[168] │ <He> knew the skill of the great young actress.
[179] │ <They> felt gay when the ship arrived in port.
[187] │ <We> admire and love a good cook.
[189] │ <He> carved a head from the round block of marble.
[190] │ <She> has a smart way of wearing clothes.
[199] │ <They> could laugh although they were sad.
[209] │ <He> wrote his last novel there at the inn.
[225] │ <They> took the axe and the saw to the forest.
[232] │ <They> are pushed back each time they attack.
[233] │ <He> broke his ties with groups of former friends.
... and 37 more
str_view(sentences, "^(She|He|We|I|You|They)\\b")
[61] │ <We> talked of the side show in the circus.
[63] │ <He> ran half way to the hardware store.
[90] │ <He> lay prone and hardly moved a limb.
[116] │ <He> ordered peach pie with ice cream.
[120] │ <We> find joy in the simplest things.
[132] │ <He> said the same phrase thirty times.
[139] │ <We> frown when events take a bad turn.
[153] │ <He> broke a new shoelace that day.
[158] │ <We> tried to replace the coin but failed.
[159] │ <She> sewed the torn coat quite neatly.
[168] │ <He> knew the skill of the great young actress.
[179] │ <They> felt gay when the ship arrived in port.
[187] │ <We> admire and love a good cook.
[189] │ <He> carved a head from the round block of marble.
[190] │ <She> has a smart way of wearing clothes.
[199] │ <They> could laugh although they were sad.
[209] │ <He> wrote his last novel there at the inn.
[225] │ <They> took the axe and the saw to the forest.
[232] │ <They> are pushed back each time they attack.
[233] │ <He> broke his ties with groups of former friends.
... and 37 more
To make sure the regex works and to spot errors, create some positive and negative matches and use them to test that your pattern works as expected:
pos <- c("He is a boy", "She had a good time")
neg <- c("Shells come from the sea", "Hadley said 'It's a great day'")
pattern <- "^(She|He|It|They)\\b"
str_detect(pos, pattern)
#> [1] TRUE TRUE
str_detect(neg, pattern)
#> [1] FALSE FALSE
- Boolean operations
When you want to find words that only contain consonants, you can create a character class that contains all letters except for the vowels ([^aeiou]), then allow that to match any number of letters ([^aeiou]+), then force it to match the whole string by anchoring to the beginning and the end (^[^aeiou]+$).
str_view(words, "^[^aeiou]+$")
[123] │ <by>
[249] │ <dry>
[328] │ <fly>
[538] │ <mrs>
[895] │ <try>
[952] │ <why>
Another alternative: instead of matching only consonants, match words that do not contain any vowels.
words[!str_detect(words, "[aeiou]")]
#> [1] "by"  "dry" "fly" "mrs" "try" "why"
Whenever you are dealing with logical combinations, particularly those involving "and" or "or", or whenever you get stuck trying to create a single regex to solve your problem, step back and consider breaking the problem into smaller pieces, solving each challenge before moving on to the next, and combining the pieces with several str_detect() calls.
To find words that contain both "a" and "b", you can match words that contain "a" followed by "b" or "b" followed by "a" with str_view(words, "a.*b|b.*a"), or use the simpler approach of two calls to str_detect(): str_detect(words, "a") & str_detect(words, "b").
What if we wanted to see if there was a word that contains all vowels? If we did it with patterns we’d need to generate 5! (120) different patterns:
words[str_detect(words, "a.*e.*i.*o.*u")]
# ...
words[str_detect(words, "u.*o.*i.*e.*a")]

But it is much simpler to combine five calls to str_detect():
words[
str_detect(words, "a") &
str_detect(words, "e") &
str_detect(words, "i") &
str_detect(words, "o") &
str_detect(words, "u")
]

- Creating a pattern with code
If we want to find all sentences that mention a color, we can:
str_view(sentences, "\\b(red|green|blue)\\b")
[2] │ Glue the sheet to the dark <blue> background.
[26] │ Two <blue> fish swam in the tank.
[92] │ A wisp of cloud hung in the <blue> air.
[148] │ The spot on the blotter was made by <green> ink.
[160] │ The sofa cushion is <red> and of light weight.
[174] │ The sky that morning was clear and bright <blue>.
[204] │ A <blue> crane is a tall wading bird.
[217] │ It is hard to erase <blue> or <red> ink.
[224] │ The lamp shone with a steady <green> flame.
[247] │ The box is held by a bright <red> snapper.
[256] │ The houses are built of <red> clay bricks.
[274] │ The <red> tape bound the smuggled food.
[288] │ Hedge apples may stain your hands <green>.
[302] │ The plant grew large and <green> in the window.
[330] │ Bathe and relax in the cool <green> grass.
[368] │ The lake sparkled in the <red> hot sun.
[372] │ Mark the spot with a sign painted <red>.
[452] │ The couch cover and hall drapes were <blue>.
[491] │ A man in a <blue> sweater sat at the desk.
[551] │ The small <red> neon lamp went out.
... and 6 more
But writing this gets cumbersome if the list of colors is long. We can instead put the colors in a vector and use str_c() and/or str_flatten() to create the pattern.
rgb <- c("red", "green", "blue")
pattern <- str_c("\\b(", str_flatten(rgb, "|"), ")\\b")

Say we want to use the colors that come with R via the colors() function:
1. Exclude color names with digits.
2. Build the pattern.
3. Apply the pattern and find the matches.
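A minimal sketch of these three steps, using base R's grepl() and paste() so it is self-contained (the example sentences are made up):

```r
# 1. Start from R's built-in color names and exclude those with digits.
cols <- colors()
cols <- cols[!grepl("[0-9]", cols)]

# 2. Build one big alternation pattern, wrapped in word boundaries.
pattern <- paste0("\\b(", paste(cols, collapse = "|"), ")\\b")

# 3. Apply the pattern to find matches.
x <- c("Glue the sheet to the dark blue background.", "No color here.")
x[grepl(pattern, x)]
```

The same shape works with str_c()/str_flatten() and str_detect() in stringr.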
Whenever you create patterns from existing strings, it’s wise to run them through str_escape() to ensure they match literally.
2.4.19 Exercises
For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
- Find all words that start or end with x.
- Find all words that start with a vowel and end with a consonant.
- Are there any words that contain at least one of each different vowel?
#a.
str_view(words, "^x.*|.*x$")
[108] │ <box>
[747] │ <sex>
[772] │ <six>
[841] │ <tax>
words[str_detect(words, "^x") | str_detect(words, "x$")]
[1] "box" "sex" "six" "tax"
#b.
str_view(words, "^[aeiou].+[^aeiou]$")
[3] │ <about>
[5] │ <accept>
[6] │ <account>
[8] │ <across>
[9] │ <act>
[11] │ <actual>
[12] │ <add>
[13] │ <address>
[14] │ <admit>
[16] │ <affect>
[17] │ <afford>
[18] │ <after>
[19] │ <afternoon>
[20] │ <again>
[21] │ <against>
[23] │ <agent>
[26] │ <air>
[27] │ <all>
[28] │ <allow>
[29] │ <almost>
... and 93 more
words[str_detect(words, "^[aeiou]") & !str_detect(words, "[aeiou]$")]
[1] "about" "accept" "account" "across" "act"
[6] "actual" "add" "address" "admit" "affect"
[11] "afford" "after" "afternoon" "again" "against"
[16] "agent" "air" "all" "allow" "almost"
[21] "along" "already" "alright" "although" "always"
[26] "amount" "and" "another" "answer" "any"
[31] "apart" "apparent" "appear" "apply" "appoint"
[36] "approach" "arm" "around" "art" "as"
[41] "ask" "at" "attend" "authority" "away"
[46] "awful" "each" "early" "east" "easy"
[51] "eat" "economy" "effect" "egg" "eight"
[56] "either" "elect" "electric" "eleven" "employ"
[61] "end" "english" "enjoy" "enough" "enter"
[66] "environment" "equal" "especial" "even" "evening"
[71] "ever" "every" "exact" "except" "exist"
[76] "expect" "explain" "express" "identify" "if"
[81] "important" "in" "indeed" "individual" "industry"
[86] "inform" "instead" "interest" "invest" "it"
[91] "item" "obvious" "occasion" "odd" "of"
[96] "off" "offer" "often" "okay" "old"
[101] "on" "only" "open" "opportunity" "or"
[106] "order" "original" "other" "ought" "out"
[111] "over" "own" "under" "understand" "union"
[116] "unit" "university" "unless" "until" "up"
[121] "upon" "usual"
#c. Note: "[aeiou]+" only checks for the presence of *some* vowel; it cannot tell you whether a word contains all five.
str_view(words, "[aeiou]+")
[1] │ <a>
[2] │ <a>bl<e>
[3] │ <a>b<ou>t
[4] │ <a>bs<o>l<u>t<e>
[5] │ <a>cc<e>pt
[6] │ <a>cc<ou>nt
[7] │ <a>ch<ie>v<e>
[8] │ <a>cr<o>ss
[9] │ <a>ct
[10] │ <a>ct<i>v<e>
[11] │ <a>ct<ua>l
[12] │ <a>dd
[13] │ <a>ddr<e>ss
[14] │ <a>dm<i>t
[15] │ <a>dv<e>rt<i>s<e>
[16] │ <a>ff<e>ct
[17] │ <a>ff<o>rd
[18] │ <a>ft<e>r
[19] │ <a>ft<e>rn<oo>n
[20] │ <a>g<ai>n
... and 954 more
words[str_detect(words, "[aeiou]")]
[1] "a" "able" "about" "absolute" "accept"
[6] "account" "achieve" "across" "act" "active"
[11] "actual" "add" "address" "admit" "advertise"
[16] "affect" "afford" "after" "afternoon" "again"
[21] "against" "age" "agent" "ago" "agree"
[26] "air" "all" "allow" "almost" "along"
[31] "already" "alright" "also" "although" "always"
[36] "america" "amount" "and" "another" "answer"
[41] "any" "apart" "apparent" "appear" "apply"
[46] "appoint" "approach" "appropriate" "area" "argue"
[51] "arm" "around" "arrange" "art" "as"
[56] "ask" "associate" "assume" "at" "attend"
[61] "authority" "available" "aware" "away" "awful"
[66] "baby" "back" "bad" "bag" "balance"
[71] "ball" "bank" "bar" "base" "basis"
[76] "be" "bear" "beat" "beauty" "because"
[81] "become" "bed" "before" "begin" "behind"
[86] "believe" "benefit" "best" "bet" "between"
[91] "big" "bill" "birth" "bit" "black"
[96] "bloke" "blood" "blow" "blue" "board"
[101] "boat" "body" "book" "both" "bother"
[106] "bottle" "bottom" "box" "boy" "break"
[111] "brief" "brilliant" "bring" "britain" "brother"
[116] "budget" "build" "bus" "business" "busy"
[121] "but" "buy" "cake" "call" "can"
[126] "car" "card" "care" "carry" "case"
[131] "cat" "catch" "cause" "cent" "centre"
[136] "certain" "chair" "chairman" "chance" "change"
[141] "chap" "character" "charge" "cheap" "check"
[146] "child" "choice" "choose" "Christ" "Christmas"
[151] "church" "city" "claim" "class" "clean"
[156] "clear" "client" "clock" "close" "closes"
[161] "clothe" "club" "coffee" "cold" "colleague"
[166] "collect" "college" "colour" "come" "comment"
[171] "commit" "committee" "common" "community" "company"
[176] "compare" "complete" "compute" "concern" "condition"
[181] "confer" "consider" "consult" "contact" "continue"
[186] "contract" "control" "converse" "cook" "copy"
[191] "corner" "correct" "cost" "could" "council"
[196] "count" "country" "county" "couple" "course"
[201] "court" "cover" "create" "cross" "cup"
[206] "current" "cut" "dad" "danger" "date"
[211] "day" "dead" "deal" "dear" "debate"
[216] "decide" "decision" "deep" "definite" "degree"
[221] "department" "depend" "describe" "design" "detail"
[226] "develop" "die" "difference" "difficult" "dinner"
[231] "direct" "discuss" "district" "divide" "do"
[236] "doctor" "document" "dog" "door" "double"
[241] "doubt" "down" "draw" "dress" "drink"
[246] "drive" "drop" "due" "during" "each"
[251] "early" "east" "easy" "eat" "economy"
[256] "educate" "effect" "egg" "eight" "either"
[261] "elect" "electric" "eleven" "else" "employ"
[266] "encourage" "end" "engine" "english" "enjoy"
[271] "enough" "enter" "environment" "equal" "especial"
[276] "europe" "even" "evening" "ever" "every"
[281] "evidence" "exact" "example" "except" "excuse"
[286] "exercise" "exist" "expect" "expense" "experience"
[291] "explain" "express" "extra" "eye" "face"
[296] "fact" "fair" "fall" "family" "far"
[301] "farm" "fast" "father" "favour" "feed"
[306] "feel" "few" "field" "fight" "figure"
[311] "file" "fill" "film" "final" "finance"
[316] "find" "fine" "finish" "fire" "first"
[321] "fish" "fit" "five" "flat" "floor"
[326] "follow" "food" "foot" "for" "force"
[331] "forget" "form" "fortune" "forward" "four"
[336] "france" "free" "friday" "friend" "from"
[341] "front" "full" "fun" "function" "fund"
[346] "further" "future" "game" "garden" "gas"
[351] "general" "germany" "get" "girl" "give"
[356] "glass" "go" "god" "good" "goodbye"
[361] "govern" "grand" "grant" "great" "green"
[366] "ground" "group" "grow" "guess" "guy"
[371] "hair" "half" "hall" "hand" "hang"
[376] "happen" "happy" "hard" "hate" "have"
[381] "he" "head" "health" "hear" "heart"
[386] "heat" "heavy" "hell" "help" "here"
[391] "high" "history" "hit" "hold" "holiday"
[396] "home" "honest" "hope" "horse" "hospital"
[401] "hot" "hour" "house" "how" "however"
[406] "hullo" "hundred" "husband" "idea" "identify"
[411] "if" "imagine" "important" "improve" "in"
[416] "include" "income" "increase" "indeed" "individual"
[421] "industry" "inform" "inside" "instead" "insure"
[426] "interest" "into" "introduce" "invest" "involve"
[431] "issue" "it" "item" "jesus" "job"
[436] "join" "judge" "jump" "just" "keep"
[441] "key" "kid" "kill" "kind" "king"
[446] "kitchen" "knock" "know" "labour" "lad"
[451] "lady" "land" "language" "large" "last"
[456] "late" "laugh" "law" "lay" "lead"
[461] "learn" "leave" "left" "leg" "less"
[466] "let" "letter" "level" "lie" "life"
[471] "light" "like" "likely" "limit" "line"
[476] "link" "list" "listen" "little" "live"
[481] "load" "local" "lock" "london" "long"
[486] "look" "lord" "lose" "lot" "love"
[491] "low" "luck" "lunch" "machine" "main"
[496] "major" "make" "man" "manage" "many"
[501] "mark" "market" "marry" "match" "matter"
[506] "may" "maybe" "mean" "meaning" "measure"
[511] "meet" "member" "mention" "middle" "might"
[516] "mile" "milk" "million" "mind" "minister"
[521] "minus" "minute" "miss" "mister" "moment"
[526] "monday" "money" "month" "more" "morning"
[531] "most" "mother" "motion" "move" "much"
[536] "music" "must" "name" "nation" "nature"
[541] "near" "necessary" "need" "never" "new"
[546] "news" "next" "nice" "night" "nine"
[551] "no" "non" "none" "normal" "north"
[556] "not" "note" "notice" "now" "number"
[561] "obvious" "occasion" "odd" "of" "off"
[566] "offer" "office" "often" "okay" "old"
[571] "on" "once" "one" "only" "open"
[576] "operate" "opportunity" "oppose" "or" "order"
[581] "organize" "original" "other" "otherwise" "ought"
[586] "out" "over" "own" "pack" "page"
[591] "paint" "pair" "paper" "paragraph" "pardon"
[596] "parent" "park" "part" "particular" "party"
[601] "pass" "past" "pay" "pence" "pension"
[606] "people" "per" "percent" "perfect" "perhaps"
[611] "period" "person" "photograph" "pick" "picture"
[616] "piece" "place" "plan" "play" "please"
[621] "plus" "point" "police" "policy" "politic"
[626] "poor" "position" "positive" "possible" "post"
[631] "pound" "power" "practise" "prepare" "present"
[636] "press" "pressure" "presume" "pretty" "previous"
[641] "price" "print" "private" "probable" "problem"
[646] "proceed" "process" "produce" "product" "programme"
[651] "project" "proper" "propose" "protect" "provide"
[656] "public" "pull" "purpose" "push" "put"
[661] "quality" "quarter" "question" "quick" "quid"
[666] "quiet" "quite" "radio" "rail" "raise"
[671] "range" "rate" "rather" "read" "ready"
[676] "real" "realise" "really" "reason" "receive"
[681] "recent" "reckon" "recognize" "recommend" "record"
[686] "red" "reduce" "refer" "regard" "region"
[691] "relation" "remember" "report" "represent" "require"
[696] "research" "resource" "respect" "responsible" "rest"
[701] "result" "return" "rid" "right" "ring"
[706] "rise" "road" "role" "roll" "room"
[711] "round" "rule" "run" "safe" "sale"
[716] "same" "saturday" "save" "say" "scheme"
[721] "school" "science" "score" "scotland" "seat"
[726] "second" "secretary" "section" "secure" "see"
[731] "seem" "self" "sell" "send" "sense"
[736] "separate" "serious" "serve" "service" "set"
[741] "settle" "seven" "sex" "shall" "share"
[746] "she" "sheet" "shoe" "shoot" "shop"
[751] "short" "should" "show" "shut" "sick"
[756] "side" "sign" "similar" "simple" "since"
[761] "sing" "single" "sir" "sister" "sit"
[766] "site" "situate" "six" "size" "sleep"
[771] "slight" "slow" "small" "smoke" "so"
[776] "social" "society" "some" "son" "soon"
[781] "sorry" "sort" "sound" "south" "space"
[786] "speak" "special" "specific" "speed" "spell"
[791] "spend" "square" "staff" "stage" "stairs"
[796] "stand" "standard" "start" "state" "station"
[801] "stay" "step" "stick" "still" "stop"
[806] "story" "straight" "strategy" "street" "strike"
[811] "strong" "structure" "student" "study" "stuff"
[816] "stupid" "subject" "succeed" "such" "sudden"
[821] "suggest" "suit" "summer" "sun" "sunday"
[826] "supply" "support" "suppose" "sure" "surprise"
[831] "switch" "system" "table" "take" "talk"
[836] "tape" "tax" "tea" "teach" "team"
[841] "telephone" "television" "tell" "ten" "tend"
[846] "term" "terrible" "test" "than" "thank"
[851] "the" "then" "there" "therefore" "they"
[856] "thing" "think" "thirteen" "thirty" "this"
[861] "thou" "though" "thousand" "three" "through"
[866] "throw" "thursday" "tie" "time" "to"
[871] "today" "together" "tomorrow" "tonight" "too"
[876] "top" "total" "touch" "toward" "town"
[881] "trade" "traffic" "train" "transport" "travel"
[886] "treat" "tree" "trouble" "true" "trust"
[891] "tuesday" "turn" "twelve" "twenty" "two"
[896] "type" "under" "understand" "union" "unit"
[901] "unite" "university" "unless" "until" "up"
[906] "upon" "use" "usual" "value" "various"
[911] "very" "video" "view" "village" "visit"
[916] "vote" "wage" "wait" "walk" "wall"
[921] "want" "war" "warm" "wash" "waste"
[926] "watch" "water" "way" "we" "wear"
[931] "wednesday" "wee" "week" "weigh" "welcome"
[936] "well" "west" "what" "when" "where"
[941] "whether" "which" "while" "white" "who"
[946] "whole" "wide" "wife" "will" "win"
[951] "wind" "window" "wish" "with" "within"
[956] "without" "woman" "wonder" "wood" "word"
[961] "work" "world" "worry" "worse" "worth"
[966] "would" "write" "wrong" "year" "yes"
[971] "yesterday" "yet" "you" "young"
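To actually answer exercise c we need one check per vowel, combined with &. A self-contained base-R sketch with grepl() and a small stand-in vector (running the same test on stringr::words itself comes back empty, so no word there contains all five vowels):

```r
# Small stand-in sample for stringr::words
w <- c("education", "about", "sequoia", "argument")
has_all <- grepl("a", w) & grepl("e", w) & grepl("i", w) &
  grepl("o", w) & grepl("u", w)
w[has_all]  # words containing every vowel at least once
```

With stringr, the equivalent is five str_detect() calls joined with &, as in the earlier example.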
Construct patterns to find evidence for and against the rule “i before e except after c”?
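One way to look for evidence, sketched with base R's grepl() on a tiny illustrative vector (run the same patterns over stringr::words for the real counts):

```r
w <- c("believe", "receive", "science", "weigh")
w[grepl("[^c]ie", w)]  # "i before e" not after c: supports the rule
w[grepl("cei", w)]     # "e before i" after c: supports the rule
w[grepl("cie", w)]     # "i before e" after c: violates the rule
w[grepl("[^c]ei", w)]  # "e before i" not after c: violates the rule
```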
colors() contains a number of modifiers like “lightgray” and “darkblue”. How could you automatically identify these modifiers? (Think about how you might detect and then remove the colors that are modified).
colrs <- colors()
colrs[!str_detect(colrs, "\\d")]
[1] "white" "aliceblue" "antiquewhite"
[4] "aquamarine" "azure" "beige"
[7] "bisque" "black" "blanchedalmond"
[10] "blue" "blueviolet" "brown"
[13] "burlywood" "cadetblue" "chartreuse"
[16] "chocolate" "coral" "cornflowerblue"
[19] "cornsilk" "cyan" "darkblue"
[22] "darkcyan" "darkgoldenrod" "darkgray"
[25] "darkgreen" "darkgrey" "darkkhaki"
[28] "darkmagenta" "darkolivegreen" "darkorange"
[31] "darkorchid" "darkred" "darksalmon"
[34] "darkseagreen" "darkslateblue" "darkslategray"
[37] "darkslategrey" "darkturquoise" "darkviolet"
[40] "deeppink" "deepskyblue" "dimgray"
[43] "dimgrey" "dodgerblue" "firebrick"
[46] "floralwhite" "forestgreen" "gainsboro"
[49] "ghostwhite" "gold" "goldenrod"
[52] "gray" "green" "greenyellow"
[55] "grey" "honeydew" "hotpink"
[58] "indianred" "ivory" "khaki"
[61] "lavender" "lavenderblush" "lawngreen"
[64] "lemonchiffon" "lightblue" "lightcoral"
[67] "lightcyan" "lightgoldenrod" "lightgoldenrodyellow"
[70] "lightgray" "lightgreen" "lightgrey"
[73] "lightpink" "lightsalmon" "lightseagreen"
[76] "lightskyblue" "lightslateblue" "lightslategray"
[79] "lightslategrey" "lightsteelblue" "lightyellow"
[82] "limegreen" "linen" "magenta"
[85] "maroon" "mediumaquamarine" "mediumblue"
[88] "mediumorchid" "mediumpurple" "mediumseagreen"
[91] "mediumslateblue" "mediumspringgreen" "mediumturquoise"
[94] "mediumvioletred" "midnightblue" "mintcream"
[97] "mistyrose" "moccasin" "navajowhite"
[100] "navy" "navyblue" "oldlace"
[103] "olivedrab" "orange" "orangered"
[106] "orchid" "palegoldenrod" "palegreen"
[109] "paleturquoise" "palevioletred" "papayawhip"
[112] "peachpuff" "peru" "pink"
[115] "plum" "powderblue" "purple"
[118] "red" "rosybrown" "royalblue"
[121] "saddlebrown" "salmon" "sandybrown"
[124] "seagreen" "seashell" "sienna"
[127] "skyblue" "slateblue" "slategray"
[130] "slategrey" "snow" "springgreen"
[133] "steelblue" "tan" "thistle"
[136] "tomato" "turquoise" "violet"
[139] "violetred" "wheat" "whitesmoke"
[142] "yellow" "yellowgreen"
rgb <- c("red" , "green ", "blue") # note: the stray trailing space in "green " skews the matches below
pattern <- str_c("(\\d+$)|", "(\\w+)(", str_flatten(rgb, "|"), ")", "(\\w*)")
pattern
[1] "(\\d+$)|(\\w+)(red|green |blue)(\\w*)"
colrs[str_detect(colrs, pattern)]
[1] "aliceblue" "antiquewhite1" "antiquewhite2" "antiquewhite3"
[5] "antiquewhite4" "aquamarine1" "aquamarine2" "aquamarine3"
[9] "aquamarine4" "azure1" "azure2" "azure3"
[13] "azure4" "bisque1" "bisque2" "bisque3"
[17] "bisque4" "blue1" "blue2" "blue3"
[21] "blue4" "brown1" "brown2" "brown3"
[25] "brown4" "burlywood1" "burlywood2" "burlywood3"
[29] "burlywood4" "cadetblue" "cadetblue1" "cadetblue2"
[33] "cadetblue3" "cadetblue4" "chartreuse1" "chartreuse2"
[37] "chartreuse3" "chartreuse4" "chocolate1" "chocolate2"
[41] "chocolate3" "chocolate4" "coral1" "coral2"
[45] "coral3" "coral4" "cornflowerblue" "cornsilk1"
[49] "cornsilk2" "cornsilk3" "cornsilk4" "cyan1"
[53] "cyan2" "cyan3" "cyan4" "darkblue"
[57] "darkgoldenrod1" "darkgoldenrod2" "darkgoldenrod3" "darkgoldenrod4"
[61] "darkolivegreen1" "darkolivegreen2" "darkolivegreen3" "darkolivegreen4"
[65] "darkorange1" "darkorange2" "darkorange3" "darkorange4"
[69] "darkorchid1" "darkorchid2" "darkorchid3" "darkorchid4"
[73] "darkred" "darkseagreen1" "darkseagreen2" "darkseagreen3"
[77] "darkseagreen4" "darkslateblue" "darkslategray1" "darkslategray2"
[81] "darkslategray3" "darkslategray4" "deeppink1" "deeppink2"
[85] "deeppink3" "deeppink4" "deepskyblue" "deepskyblue1"
[89] "deepskyblue2" "deepskyblue3" "deepskyblue4" "dodgerblue"
[93] "dodgerblue1" "dodgerblue2" "dodgerblue3" "dodgerblue4"
[97] "firebrick1" "firebrick2" "firebrick3" "firebrick4"
[101] "gold1" "gold2" "gold3" "gold4"
[105] "goldenrod1" "goldenrod2" "goldenrod3" "goldenrod4"
[109] "gray0" "gray1" "gray2" "gray3"
[113] "gray4" "gray5" "gray6" "gray7"
[117] "gray8" "gray9" "gray10" "gray11"
[121] "gray12" "gray13" "gray14" "gray15"
[125] "gray16" "gray17" "gray18" "gray19"
[129] "gray20" "gray21" "gray22" "gray23"
[133] "gray24" "gray25" "gray26" "gray27"
[137] "gray28" "gray29" "gray30" "gray31"
[141] "gray32" "gray33" "gray34" "gray35"
[145] "gray36" "gray37" "gray38" "gray39"
[149] "gray40" "gray41" "gray42" "gray43"
[153] "gray44" "gray45" "gray46" "gray47"
[157] "gray48" "gray49" "gray50" "gray51"
[161] "gray52" "gray53" "gray54" "gray55"
[165] "gray56" "gray57" "gray58" "gray59"
[169] "gray60" "gray61" "gray62" "gray63"
[173] "gray64" "gray65" "gray66" "gray67"
[177] "gray68" "gray69" "gray70" "gray71"
[181] "gray72" "gray73" "gray74" "gray75"
[185] "gray76" "gray77" "gray78" "gray79"
[189] "gray80" "gray81" "gray82" "gray83"
[193] "gray84" "gray85" "gray86" "gray87"
[197] "gray88" "gray89" "gray90" "gray91"
[201] "gray92" "gray93" "gray94" "gray95"
[205] "gray96" "gray97" "gray98" "gray99"
[209] "gray100" "green1" "green2" "green3"
[213] "green4" "grey0" "grey1" "grey2"
[217] "grey3" "grey4" "grey5" "grey6"
[221] "grey7" "grey8" "grey9" "grey10"
[225] "grey11" "grey12" "grey13" "grey14"
[229] "grey15" "grey16" "grey17" "grey18"
[233] "grey19" "grey20" "grey21" "grey22"
[237] "grey23" "grey24" "grey25" "grey26"
[241] "grey27" "grey28" "grey29" "grey30"
[245] "grey31" "grey32" "grey33" "grey34"
[249] "grey35" "grey36" "grey37" "grey38"
[253] "grey39" "grey40" "grey41" "grey42"
[257] "grey43" "grey44" "grey45" "grey46"
[261] "grey47" "grey48" "grey49" "grey50"
[265] "grey51" "grey52" "grey53" "grey54"
[269] "grey55" "grey56" "grey57" "grey58"
[273] "grey59" "grey60" "grey61" "grey62"
[277] "grey63" "grey64" "grey65" "grey66"
[281] "grey67" "grey68" "grey69" "grey70"
[285] "grey71" "grey72" "grey73" "grey74"
[289] "grey75" "grey76" "grey77" "grey78"
[293] "grey79" "grey80" "grey81" "grey82"
[297] "grey83" "grey84" "grey85" "grey86"
[301] "grey87" "grey88" "grey89" "grey90"
[305] "grey91" "grey92" "grey93" "grey94"
[309] "grey95" "grey96" "grey97" "grey98"
[313] "grey99" "grey100" "honeydew1" "honeydew2"
[317] "honeydew3" "honeydew4" "hotpink1" "hotpink2"
[321] "hotpink3" "hotpink4" "indianred" "indianred1"
[325] "indianred2" "indianred3" "indianred4" "ivory1"
[329] "ivory2" "ivory3" "ivory4" "khaki1"
[333] "khaki2" "khaki3" "khaki4" "lavenderblush1"
[337] "lavenderblush2" "lavenderblush3" "lavenderblush4" "lemonchiffon1"
[341] "lemonchiffon2" "lemonchiffon3" "lemonchiffon4" "lightblue"
[345] "lightblue1" "lightblue2" "lightblue3" "lightblue4"
[349] "lightcyan1" "lightcyan2" "lightcyan3" "lightcyan4"
[353] "lightgoldenrod1" "lightgoldenrod2" "lightgoldenrod3" "lightgoldenrod4"
[357] "lightpink1" "lightpink2" "lightpink3" "lightpink4"
[361] "lightsalmon1" "lightsalmon2" "lightsalmon3" "lightsalmon4"
[365] "lightskyblue" "lightskyblue1" "lightskyblue2" "lightskyblue3"
[369] "lightskyblue4" "lightslateblue" "lightsteelblue" "lightsteelblue1"
[373] "lightsteelblue2" "lightsteelblue3" "lightsteelblue4" "lightyellow1"
[377] "lightyellow2" "lightyellow3" "lightyellow4" "magenta1"
[381] "magenta2" "magenta3" "magenta4" "maroon1"
[385] "maroon2" "maroon3" "maroon4" "mediumblue"
[389] "mediumorchid1" "mediumorchid2" "mediumorchid3" "mediumorchid4"
[393] "mediumpurple1" "mediumpurple2" "mediumpurple3" "mediumpurple4"
[397] "mediumslateblue" "mediumvioletred" "midnightblue" "mistyrose1"
[401] "mistyrose2" "mistyrose3" "mistyrose4" "navajowhite1"
[405] "navajowhite2" "navajowhite3" "navajowhite4" "navyblue"
[409] "olivedrab1" "olivedrab2" "olivedrab3" "olivedrab4"
[413] "orange1" "orange2" "orange3" "orange4"
[417] "orangered" "orangered1" "orangered2" "orangered3"
[421] "orangered4" "orchid1" "orchid2" "orchid3"
[425] "orchid4" "palegreen1" "palegreen2" "palegreen3"
[429] "palegreen4" "paleturquoise1" "paleturquoise2" "paleturquoise3"
[433] "paleturquoise4" "palevioletred" "palevioletred1" "palevioletred2"
[437] "palevioletred3" "palevioletred4" "peachpuff1" "peachpuff2"
[441] "peachpuff3" "peachpuff4" "pink1" "pink2"
[445] "pink3" "pink4" "plum1" "plum2"
[449] "plum3" "plum4" "powderblue" "purple1"
[453] "purple2" "purple3" "purple4" "red1"
[457] "red2" "red3" "red4" "rosybrown1"
[461] "rosybrown2" "rosybrown3" "rosybrown4" "royalblue"
[465] "royalblue1" "royalblue2" "royalblue3" "royalblue4"
[469] "salmon1" "salmon2" "salmon3" "salmon4"
[473] "seagreen1" "seagreen2" "seagreen3" "seagreen4"
[477] "seashell1" "seashell2" "seashell3" "seashell4"
[481] "sienna1" "sienna2" "sienna3" "sienna4"
[485] "skyblue" "skyblue1" "skyblue2" "skyblue3"
[489] "skyblue4" "slateblue" "slateblue1" "slateblue2"
[493] "slateblue3" "slateblue4" "slategray1" "slategray2"
[497] "slategray3" "slategray4" "snow1" "snow2"
[501] "snow3" "snow4" "springgreen1" "springgreen2"
[505] "springgreen3" "springgreen4" "steelblue" "steelblue1"
[509] "steelblue2" "steelblue3" "steelblue4" "tan1"
[513] "tan2" "tan3" "tan4" "thistle1"
[517] "thistle2" "thistle3" "thistle4" "tomato1"
[521] "tomato2" "tomato3" "tomato4" "turquoise1"
[525] "turquoise2" "turquoise3" "turquoise4" "violetred"
[529] "violetred1" "violetred2" "violetred3" "violetred4"
[533] "wheat1" "wheat2" "wheat3" "wheat4"
[537] "yellow1" "yellow2" "yellow3" "yellow4"
colrs[!str_detect(colrs, pattern)]
[1] "white" "antiquewhite" "aquamarine"
[4] "azure" "beige" "bisque"
[7] "black" "blanchedalmond" "blue"
[10] "blueviolet" "brown" "burlywood"
[13] "chartreuse" "chocolate" "coral"
[16] "cornsilk" "cyan" "darkcyan"
[19] "darkgoldenrod" "darkgray" "darkgreen"
[22] "darkgrey" "darkkhaki" "darkmagenta"
[25] "darkolivegreen" "darkorange" "darkorchid"
[28] "darksalmon" "darkseagreen" "darkslategray"
[31] "darkslategrey" "darkturquoise" "darkviolet"
[34] "deeppink" "dimgray" "dimgrey"
[37] "firebrick" "floralwhite" "forestgreen"
[40] "gainsboro" "ghostwhite" "gold"
[43] "goldenrod" "gray" "green"
[46] "greenyellow" "grey" "honeydew"
[49] "hotpink" "ivory" "khaki"
[52] "lavender" "lavenderblush" "lawngreen"
[55] "lemonchiffon" "lightcoral" "lightcyan"
[58] "lightgoldenrod" "lightgoldenrodyellow" "lightgray"
[61] "lightgreen" "lightgrey" "lightpink"
[64] "lightsalmon" "lightseagreen" "lightslategray"
[67] "lightslategrey" "lightyellow" "limegreen"
[70] "linen" "magenta" "maroon"
[73] "mediumaquamarine" "mediumorchid" "mediumpurple"
[76] "mediumseagreen" "mediumspringgreen" "mediumturquoise"
[79] "mintcream" "mistyrose" "moccasin"
[82] "navajowhite" "navy" "oldlace"
[85] "olivedrab" "orange" "orchid"
[88] "palegoldenrod" "palegreen" "paleturquoise"
[91] "papayawhip" "peachpuff" "peru"
[94] "pink" "plum" "purple"
[97] "red" "rosybrown" "saddlebrown"
[100] "salmon" "sandybrown" "seagreen"
[103] "seashell" "sienna" "slategray"
[106] "slategrey" "snow" "springgreen"
[109] "tan" "thistle" "tomato"
[112] "turquoise" "violet" "wheat"
[115] "whitesmoke" "yellow" "yellowgreen"
Create a regular expression that finds any base R dataset. You can get a list of these datasets via a special use of the data() function: data(package = "datasets")$results[, "Item"]. Note that a number of old datasets are individual vectors; these contain the name of the grouping “data frame” in parentheses, so you’ll need to strip those off.
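One way to handle the parenthetical grouping names, sketched with a few hypothetical entries rather than the full data() output:

```r
# Hypothetical entries in the style of data(package = "datasets")$results[, "Item"]
items <- c("beaver1 (beavers)", "UKgas", "airmiles")
nms <- sub(" \\(.*\\)$", "", items)  # drop a trailing " (group)" suffix
pattern <- paste0("\\b(", paste(nms, collapse = "|"), ")\\b")
grepl(pattern, "The UKgas series is quarterly.")
```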
library(tidyverse)
library(babynames)
data()
2.5 Factors
In addition to base R, we will use the {forcats} package in the core tidyverse.
2.5.1 Basics
factor() converts values missing from the levels to NA without a warning, which is risky.
forcats::fct() errors if a value is missing from the levels; without setting levels, forcats::fct() orders the levels by first appearance.
If you omit the levels, factor() will take them in alphabetical order, but forcats::fct() will order by appearance.
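A base-R illustration of the default alphabetical ordering, and of supplying levels explicitly:

```r
x <- c("Dec", "Apr", "Jan", "Mar")
levels(factor(x))  # alphabetical by default: "Apr" "Dec" "Jan" "Mar"

f <- factor(x, levels = month.abb)  # explicit, meaningful order
sort(f)                             # Jan Mar Apr Dec
```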
You can also create a factor when reading your data with readr by using col_factor():
2.5.3 Modify factor order: fct_reorder() and fct_relevel()
This plot, presenting the average number of hours spent watching TV by religious affiliation, is hard to read because the levels are in no meaningful order.
We may want to modify this plot by changing the order of the levels: We can use fct_reorder() and fct_relevel().
fct_reorder() takes three arguments: 1. .f, the factor whose levels you want to modify. 2. .x, a numeric vector that you want to use to reorder the levels. 3. Optionally, .fun, a function that’s used if there are multiple values of .x for each value of .f. The default value is median.
The factor-reordering can be done with mutate() before ggplot.
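Base R's reorder() captures the same idea as fct_reorder() (reorder one factor by a numeric summary of another vector); a minimal sketch with made-up data:

```r
df <- data.frame(
  group = factor(c("A", "B", "C")),
  hours = c(3.5, 1.2, 2.8)
)
df$group <- reorder(df$group, df$hours)  # default summary function is mean
levels(df$group)  # "B" "C" "A": levels ordered by increasing hours
```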
Plot how average age varies by income bracket:
fct_reorder() is suited for factors whose levels are arbitrarily ordered. When the levels have a natural ordering and you want to pull one level to the front, use fct_relevel(). It takes the factor .f, and then any number of levels that you want to move to the front of the line.
Another type of reordering is useful when you are coloring the lines on a plot. fct_reorder2(.f, .x, .y) reorders the factor .f by the .y values associated with the largest .x values. This makes the plot easier to read because the colors of the line at the far right of the plot will line up with the legend.
In bar plots, use fct_infreq() to show order in decreasing frequency: it can be combined with fct_rev() to show order in increasing frequency.
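fct_infreq() amounts to putting the levels in decreasing frequency order; a self-contained base-R sketch of that idea:

```r
x <- c("b", "a", "a", "c", "a", "b")
f <- factor(x, levels = names(sort(table(x), decreasing = TRUE)))
levels(f)  # "a" "b" "c": most frequent level first

# Reversing the levels mimics fct_rev(): increasing frequency
levels(factor(x, levels = rev(levels(f))))
```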
2.5.3.1 Exercises
- There are some suspiciously high numbers in tvhours. Is the mean a good summary?
- For each factor in gss_cat identify whether the order of the levels is arbitrary or principled.
- Why did moving “Not applicable” to the front of the levels move it to the bottom of the plot?
2.5.4 Modify values of factor levels: fct_recode()
fct_recode() takes the factor .f as its first argument, followed by the levels to recode, written as "new level" = "old level". It leaves unmentioned levels unchanged, and warns if you refer to old levels that do not exist.
To combine groups, you can assign multiple old levels to the same new level.
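In base R, the same recode-and-combine move can be sketched by assigning a named list to levels(); the party labels here are made up:

```r
f <- factor(c("rep", "dem", "ind", "rep", "dem"))
levels(f) <- list(
  Republican = "rep",
  Democrat   = "dem",
  Other      = c("ind")  # listing several old levels here combines groups
)
table(f)  # Republican: 2, Democrat: 2, Other: 1
```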
If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode(). For each new variable, you can provide a vector of old levels:
Sometimes one may want to lump small groups together: this can be achieved with the fct_lump_*() family of functions.
fct_lump_lowfreq() progressively lumps the smallest groups into an "Other" category, always keeping "Other" as the smallest category.
fct_lump_lowfreq() can lump groups in ways that are not useful for your problem: fct_lump_n() can be used to specify the exact number of categories you want to keep.
See also fct_lump_min() and fct_lump_prop().
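The idea behind fct_lump_n() can be sketched in base R: keep the n most frequent levels and lump everything else into "Other":

```r
x <- c("a", "a", "a", "b", "b", "c", "d")
keep <- names(sort(table(x), decreasing = TRUE))[1:2]  # two biggest groups
f <- factor(ifelse(x %in% keep, x, "Other"), levels = c(keep, "Other"))
table(f)  # a: 3, b: 2, Other: 2
```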
2.5.4.1 Exercises
- How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
- How could you collapse rincome into a small set of categories?
- Notice there are 9 groups (excluding Other) in the fct_lump example above. Why not 10? (Hint: type ?fct_lump and note that the default for the argument other_level is "Other".)
2.5.5 Ordered factors: ordered()
Created with the ordered() function, ordered factors imply a strict ordering between levels, but don't specify anything about the magnitude of the differences between the levels. You use ordered factors when you know the levels are ranked, but there's no precise numerical ranking.
Ordered factors use < symbols between levels when printed.
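A quick illustration:

```r
grade <- ordered(c("low", "high", "med"), levels = c("low", "med", "high"))
grade
# prints: Levels: low < med < high
grade[1] < grade[3]  # TRUE: comparisons respect the level ordering
```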
In both base R and the tidyverse, ordered factors behave very similarly to regular factors. There are only two places where you might notice different behavior:
- If you map an ordered factor to color or fill in ggplot2, it will default to scale_color_viridis()/scale_fill_viridis(), a color scale that implies a ranking.
- If you use an ordered predictor in a linear model, it will use "polynomial contrasts". These are mildly useful, but you are unlikely to have heard of them unless you have a PhD in Statistics, and even then you probably don't routinely interpret them. If you want to learn more, we recommend vignette("contrasts", package = "faux") by Lisa DeBruine.
2.6 Dates and times
- How to create date-times from various inputs.
- How to extract date-time components.
- How to work with time spans.
- How to work with time zones.
This section will use the {lubridate} package (part of the core tidyverse) and the nycflights13 dataset.
2.6.1 Creating date/times
Three types of date/time data:
- A date: tibbles print it as <date>.
- A time within a day: tibbles print it as <time>.
- A date-time (a date plus a time): identifies an instant in time, typically to the nearest second. Tibbles print it as <dttm>; base R calls these POSIXct.
R does not have a native class for storing times: use {hms} package if need be.
Use the simplest date type that works for your needs. Date-times are more complicated because they need to handle time zones. Use a date instead if it is sufficient for your needs!
library(tidyverse)
library(nycflights13)
library(gt)
today() # current date
[1] "2026-03-05"
now() # current date-time
[1] "2026-03-05 16:02:45 CET"
Four ways to create a date/time:
- During file import with {readr}.
- From a string.
- From individual date-time components.
- From an existing date/time object.
2.6.1.1 During file import
If your CSV file contains an ISO8601 date or date-time, readr will recognize it automatically.
ISO8601 is an international standard for writing dates in which the components are organised from biggest to smallest (year, month, day) separated by -. ISO8601 can also include hour, minute, and second separated by :, with the date and time components separated by a T or a space.
For example, 4:26pm on May 3 2022 is 2022-05-03 16:26 or 2022-05-03T16:26.
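The same string can also be parsed in base R by spelling the format out explicitly (readr does this automatically for ISO8601):

```r
x <- as.POSIXct("2022-05-03T16:26", format = "%Y-%m-%dT%H:%M", tz = "UTC")
format(x, "%Y-%m-%d %H:%M")  # "2022-05-03 16:26"
```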
For date/time formats not covered by ISO8601, use readr's col_types argument together with col_date() or col_datetime(). Table 2.1 shows the date formats understood by readr.
| code | description | example |
|---|---|---|
| Year | ||
| `%Y` | four-digit year | 2010 |
| `%y` | two-digit year | 10 |
| Month | ||
| `%m` | Number | 2 |
| `%b` | Abbreviated month name | Jan |
| `%B` | Full month name | January |
| Day | ||
| `%d` | Two digits | 02 |
| `%e` | One or two digits | 2 |
| Time | ||
| `%H` | 24-hour hour | 13 |
| `%I` | 12-hour hour | 1 |
| `%p` | AM/PM indicator | pm |
| `%M` | Minutes | 45 |
| `%S` | Seconds | 35 |
| `%OS` | Seconds with decimal component | 45.35 |
| `%z` | Timezone offset from UTC | +0800 |
| `%Z` | Timezone name | America/Chicago |
| Other | ||
| `%.` | Skip one non-digit | : |
| `%*` | Skip any number of non-digits | |
No matter how the date is specified, it is always displayed the same way once it has been read into R.
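Most of these codes match base R's strptime() codes (`%.` and `%*` are readr-specific), so you can try them out quickly with as.Date():

```r
as.Date("01/02/15", format = "%m/%d/%y")  # parses as 2015-01-02
```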
2.6.1.2 From a string
Use the lubridate functions ymd(), mdy(), and dmy() for dates and ymd_hms(), mdy_hms(), and dmy_hms() for date-times, choosing the one that matches the order in which year, month, day, hour, minute, and second appear in the string. These helper functions are powerful because they can automatically recognize and parse a wide variety of date formats. They also accept a tz timezone argument, e.g. ymd("2017-01-31", tz = "UTC").
ymd("2017-01-31")
[1] "2017-01-31"
mdy("January 31st, 2017")
[1] "2017-01-31"
dmy("31-Jan-2017")
[1] "2017-01-31"
mdy_hm("01/31/2017 08:01")
[1] "2017-01-31 08:01:00 UTC"
ymd_hms("2017-01-31 20:01:00", tz = "UTC")
[1] "2017-01-31 20:01:00 UTC"
2.6.1.3 From individual date-time components
Individual date-time components spread across multiple columns can be combined using make_date() for dates and make_datetime() for date-times, which take numeric vectors of year, month, day, hour, minute, and second. make_datetime() can also take a character tz timezone argument.
flights |>
select(year, month, day, hour, minute) |>
mutate(departure = make_datetime(year, month, day, hour, minute))
# A tibble: 336,776 × 6
year month day hour minute departure
<int> <int> <int> <dbl> <dbl> <dttm>
1 2013 1 1 5 15 2013-01-01 05:15:00
2 2013 1 1 5 29 2013-01-01 05:29:00
3 2013 1 1 5 40 2013-01-01 05:40:00
4 2013 1 1 5 45 2013-01-01 05:45:00
5 2013 1 1 6 0 2013-01-01 06:00:00
6 2013 1 1 5 58 2013-01-01 05:58:00
7 2013 1 1 6 0 2013-01-01 06:00:00
8 2013 1 1 6 0 2013-01-01 06:00:00
9 2013 1 1 6 0 2013-01-01 06:00:00
10 2013 1 1 6 0 2013-01-01 06:00:00
# ℹ 336,766 more rows
flights |>
filter(!is.na(dep_time), !is.na(arr_time)) |>
  select(year, month, day, dep_time, arr_time, sched_dep_time, sched_arr_time)
# A tibble: 328,063 × 7
year month day dep_time arr_time sched_dep_time sched_arr_time
<int> <int> <int> <int> <int> <int> <int>
1 2013 1 1 517 830 515 819
2 2013 1 1 533 850 529 830
3 2013 1 1 542 923 540 850
4 2013 1 1 544 1004 545 1022
5 2013 1 1 554 812 600 837
6 2013 1 1 554 740 558 728
7 2013 1 1 555 913 600 854
8 2013 1 1 557 709 600 723
9 2013 1 1 557 838 600 846
10 2013 1 1 558 753 600 745
# ℹ 328,053 more rows
In this dataset, dep_time, arr_time, sched_dep_time, and sched_arr_time are stored as integers in the format HHMM, so that 830 means 8:30. Integer (floor) division (%/%), which rounds down and discards the fractional part, extracts the hour: 830 %/% 100 gives 8. The modulus operator (%%), which returns the remainder after division, extracts the minute: 830 %% 100 gives 30.
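The HHMM arithmetic can be checked interactively before applying it to the whole table:

```r
# Split an integer time stored as HHMM into its hour and minute components
830 %/% 100  # integer division by 100 drops the last two digits: 8 (the hour)
830 %% 100   # remainder after division by 100 keeps them: 30 (the minutes)
```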
flights |>
select(year, month, day, dep_time, arr_time, sched_dep_time, sched_arr_time) |>
filter(!is.na(dep_time), !is.na(arr_time)) |>
mutate(
dep_time = make_datetime(year, month, day, dep_time %/% 100, dep_time %% 100),
arr_time = make_datetime(year, month, day, arr_time %/% 100, arr_time %% 100),
sched_dep_time = make_datetime(year, month, day, sched_dep_time %/% 100, sched_dep_time %% 100),
sched_arr_time = make_datetime(year, month, day, sched_arr_time %/% 100, sched_arr_time %% 100)
)# A tibble: 328,063 × 7
year month day dep_time arr_time sched_dep_time
<int> <int> <int> <dttm> <dttm> <dttm>
1 2013 1 1 2013-01-01 05:17:00 2013-01-01 08:30:00 2013-01-01 05:15:00
2 2013 1 1 2013-01-01 05:33:00 2013-01-01 08:50:00 2013-01-01 05:29:00
3 2013 1 1 2013-01-01 05:42:00 2013-01-01 09:23:00 2013-01-01 05:40:00
4 2013 1 1 2013-01-01 05:44:00 2013-01-01 10:04:00 2013-01-01 05:45:00
5 2013 1 1 2013-01-01 05:54:00 2013-01-01 08:12:00 2013-01-01 06:00:00
6 2013 1 1 2013-01-01 05:54:00 2013-01-01 07:40:00 2013-01-01 05:58:00
7 2013 1 1 2013-01-01 05:55:00 2013-01-01 09:13:00 2013-01-01 06:00:00
8 2013 1 1 2013-01-01 05:57:00 2013-01-01 07:09:00 2013-01-01 06:00:00
9 2013 1 1 2013-01-01 05:57:00 2013-01-01 08:38:00 2013-01-01 06:00:00
10 2013 1 1 2013-01-01 05:58:00 2013-01-01 07:53:00 2013-01-01 06:00:00
# ℹ 328,053 more rows
# ℹ 1 more variable: sched_arr_time <dttm>
When using date-times in numeric contexts (like in graphs), 1 means 1 second. Specifying a binwidth of 86400 therefore means 1 day (60 * 60 * 24 = 86400 seconds).
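For example, assuming the mutated table above were assigned to a name such as flights_dt (not defined in the text, used here only for illustration), a frequency polygon with one bin per day could be drawn like this (ggplot2 assumed loaded via the tidyverse):

```r
# One bin per day: binwidth is given in seconds for date-time axes
flights_dt |>
  ggplot(aes(x = dep_time)) +
  geom_freqpoly(binwidth = 86400)  # 60 * 60 * 24 seconds = 1 day
```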