1  Introduction

1.1 Visualization

1.2 Data transformation

1.2.1 Rows

Had an arrival delay of two or more hours:

Flew to Houston (IAH or HOU):

Were operated by United, American, or Delta:

Departed in summer (July, August, and September)

Were delayed by at least an hour, but made up over 30 minutes in flight
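A minimal sketch of how each condition above could be written with filter(). The tibble flights_toy is hypothetical toy data that only borrows column names from nycflights13::flights; the carrier codes UA, AA, and DL stand for United, American, and Delta.

```r
library(dplyr)

# Hypothetical toy data with the same column names as nycflights13::flights
flights_toy <- tibble(
  month     = c(1, 7, 8, 12),
  carrier   = c("UA", "AA", "B6", "DL"),
  dest      = c("IAH", "HOU", "BOS", "ATL"),
  dep_delay = c(5, 70, 0, 130),
  arr_delay = c(125, 35, -10, 140)
)

# Arrival delay of two or more hours (delays are in minutes)
late_arrivals <- flights_toy |> filter(arr_delay >= 120)

# Flew to Houston (IAH or HOU)
to_houston <- flights_toy |> filter(dest %in% c("IAH", "HOU"))

# Operated by United, American, or Delta
big_three <- flights_toy |> filter(carrier %in% c("UA", "AA", "DL"))

# Departed in summer (July, August, and September)
summer <- flights_toy |> filter(month %in% 7:9)

# Delayed by at least an hour, but made up over 30 minutes in flight
made_up_time <- flights_toy |>
  filter(dep_delay >= 60, dep_delay - arr_delay > 30)
```

Replacing flights_toy with flights should give the answers for the real data.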

2. Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.

3. Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

4. Was there a flight on every day of 2013?

Yes!

5. Which flights traveled the farthest distance? Which traveled the least distance?

6. Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

1.3 Columns

1.3.1 mutate()

mutate(df, new_var = ...)
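As an illustration of the pattern above, here is a small sketch that adds two computed columns. The data and the names gain and speed are assumptions made for the example; .before = 1 places the new columns at the front.

```r
library(dplyr)

# Hypothetical toy data (times in minutes, distance in miles)
flights_toy <- tibble(
  dep_delay = c(10, 60),
  arr_delay = c(0, 75),
  distance  = c(1400, 200),
  air_time  = c(210, 40)
)

with_new_vars <- flights_toy |>
  mutate(
    gain  = dep_delay - arr_delay,     # minutes made up in the air
    speed = distance / air_time * 60,  # miles per hour
    .before = 1
  )
```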

1.3.2 select()

There are a number of helper functions you can use within select():

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • num_range("x", 1:3): matches x1, x2 and x3.
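A quick sketch of the four helpers in action. The tibble df and its column names are made up so that each helper matches something.

```r
library(dplyr)

# Hypothetical one-row tibble whose column names exercise each helper
df <- tibble(
  abc_1 = 1, val_xyz = 2, an_ijk_col = 3,
  x1 = 4, x2 = 5, x3 = 6, other = 7
)

begins  <- df |> select(starts_with("abc"))     # abc_1
ends    <- df |> select(ends_with("xyz"))       # val_xyz
has_ijk <- df |> select(contains("ijk"))        # an_ijk_col
x_range <- df |> select(num_range("x", 1:3))    # x1, x2, x3
```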

1.3.3 rename()

rename(df, new_var = old_var)

select(df, new_var = old_var)
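The two calls differ in what they keep: rename() keeps every column, while select() keeps only the columns it mentions. A small sketch with a hypothetical tibble df:

```r
library(dplyr)

# Hypothetical data frame with one column to rename and one bystander
df <- tibble(old_var = 1:2, other = c("a", "b"))

renamed  <- rename(df, new_var = old_var)  # all columns kept
selected <- select(df, new_var = old_var)  # only the renamed column kept
```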

1.3.4 relocate()

1.3.5 Exercises

1. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

2. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from {flights}.

3. What happens if you specify the name of the same variable multiple times in a select() call?

4. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

5. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

Yes, it does surprise me since the variable names are lowercase but the string in contains() is uppercase. By default, contains() ignores case.

To change this default behavior, set ignore.case = FALSE.
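A sketch of the case behavior described above, using a hypothetical tibble with lowercase column names:

```r
library(dplyr)

# Hypothetical data with lowercase column names
df <- tibble(dep_time = 1, arr_time = 2, carrier = "UA")

# Default: case is ignored, so "TIME" matches dep_time and arr_time
ignoring_case <- df |> select(contains("TIME"))

# Case-sensitive matching: nothing matches, so no columns are returned
case_sensitive <- df |> select(contains("TIME", ignore.case = FALSE))
```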

6. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

7. Why doesn’t the following work, and what does the error mean?

1.4 The pipe

1.5 Groups

1.5.1 group_by()

1.5.2 summarize()

1.5.3 The slice_ functions

There are five handy functions that allow you to extract specific rows within each group:

  • df |> slice_head(n = 1) takes the first row from each group.
  • df |> slice_tail(n = 1) takes the last row in each group.
  • df |> slice_min(x, n = 1) takes the row with the smallest value of column x.
  • df |> slice_max(x, n = 1) takes the row with the largest value of column x.
  • df |> slice_sample(n = 1) takes one random row.

You can vary n to select more than one row, or instead of n =, you can use prop = 0.1 to select (e.g.) 10% of the rows in each group.

[OBS] By default, slice_min() and slice_max() keep tied values, so n = 1 means “give us all rows with the highest value”. If you want exactly one row per group, you can set with_ties = FALSE.
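The tie behavior is easiest to see on a tiny example. The tibble below is hypothetical; group "a" has two rows tied for the largest x.

```r
library(dplyr)

# Hypothetical data: group "a" has a tie for the maximum of x
df <- tibble(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, 3, 3, 2, 5)
)

by_g <- df |> group_by(g)

first_rows    <- by_g |> slice_head(n = 1)                      # one row per group
max_with_ties <- by_g |> slice_max(x, n = 1)                    # keeps both tied rows in "a"
max_no_ties   <- by_g |> slice_max(x, n = 1, with_ties = FALSE) # exactly one row per group
```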

1.5.4 Grouping by multiple variables

Make a group for each date:

When you summarize a tibble grouped by more than one variable, each summary peels off the last group. In hindsight, this wasn’t a great way to make this function work, but it’s difficult to change without breaking existing code. To make it obvious what’s happening, dplyr displays a message that tells you how you can change this behavior:
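A sketch of the peeling behavior and of the .groups argument that the message points to. flights_toy is hypothetical data; group_vars() is used only to inspect which grouping remains.

```r
library(dplyr)

# Hypothetical data: three distinct dates
flights_toy <- tibble(
  year      = 2013,
  month     = c(1, 1, 1, 2),
  day       = c(1, 1, 2, 1),
  dep_delay = c(5, 15, 0, 30)
)

daily <- flights_toy |> group_by(year, month, day)

# Default: the last grouping variable (day) is peeled off, with a message
daily_flights <- daily |> summarize(n = n())

# Explicit choices silence the message
dropped <- daily |> summarize(n = n(), .groups = "drop")  # no groups left
kept    <- daily |> summarize(n = n(), .groups = "keep")  # all groups kept

group_vars(daily_flights)  # year, month
```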

1.5.5 Ungrouping with ungroup()

1.5.6 .by argument for grouping

Or if you want to group by multiple variables:

[OBS.] .by works with all verbs and has the advantage that you don’t need to use the .groups argument to suppress the grouping message or ungroup() when you’re done.
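A minimal sketch of per-operation grouping with .by, assuming dplyr 1.1.0 or later. flights_toy is hypothetical data; note that the result comes back ungrouped.

```r
library(dplyr)

# Hypothetical data with one missing delay
flights_toy <- tibble(
  month     = c(1, 1, 2),
  origin    = c("JFK", "JFK", "LGA"),
  dest      = c("IAH", "IAH", "ATL"),
  dep_delay = c(10, NA, 30)
)

# Group by a single variable
by_month <- flights_toy |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .by = month
  )

# Group by multiple variables
by_route <- flights_toy |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n(),
    .by = c(origin, dest)
  )
```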

1.5.6.1 Exercises

1. Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

2. Find the flights that are most delayed upon departure from each destination.

3. How do delays vary over the course of the day? Illustrate your answer with a plot.

4. What happens if you supply a negative n to slice_min() and friends?

5. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

If sort = TRUE, the largest group is shown at the top.
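One way to read count() is as shorthand for group_by() plus summarize(n = n()), with sort = TRUE adding a descending arrange(). A sketch on hypothetical data:

```r
library(dplyr)

# Hypothetical data: "b" is the largest group
df <- tibble(g = c("a", "b", "b", "b", "c", "c"))

counted <- df |> count(g, sort = TRUE)

# Roughly equivalent pipeline written with the basic verbs
manual <- df |>
  group_by(g) |>
  summarize(n = n()) |>
  arrange(desc(n))
```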

6. Suppose we have the following tiny data frame:

1.6 Case study: aggregates and sample size

1.6.1 Summary

To manipulate rows: filter(), arrange()

For columns: select(), mutate(), group_by(), and summarize().

1.7 Workflow: code style

  1. Names
  2. Spaces: Put spaces on either side of mathematical operators apart from ^ (i.e. +, -, ==, <, …), and around the assignment operator (<-).
  3. Pipes: |> should always have a space before it and should typically be the last thing on a line.
    1. If the function you’re piping into has named arguments (like mutate() or summarize()), put each argument on a new line.
    2. If the function doesn’t have named arguments (like select() or filter()), keep everything on one line unless it doesn’t fit, in which case you should put each argument on its own line.
    3. After the first step of the pipeline, indent each line by two spaces. RStudio will automatically put the spaces in for you after a line break following a |> . If you’re putting each argument on its own line, indent by an extra two spaces. Make sure ) is on its own line, and un-indented to match the horizontal position of the function name.
    4. Be wary of writing very long pipes, say longer than 10-15 lines. Try to break them up into smaller sub-tasks, giving each task an informative name.

  4. The same basic rules that apply to the pipe also apply to {ggplot2}; just treat + the same way as |>. Again, if you can’t fit all of the arguments to a function on to a single line, put each argument on its own line.

  5. Sectioning comments
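The rules above can be sketched in one short script. The data and object names are hypothetical; the point is the layout: sectioning comments, one named argument per line, |> and + at the end of lines, and ) un-indented to match the function name. The linewidth argument assumes ggplot2 3.4.0 or later.

```r
library(dplyr)
library(ggplot2)

# Load data ---------------------------------------------------------------

# Hypothetical toy data
flights_toy <- tibble(
  month     = c(1, 1, 2, 2),
  dep_delay = c(10, NA, 30, 50)
)

# Summarize delays --------------------------------------------------------

monthly_delays <- flights_toy |>
  group_by(month) |>
  summarize(
    delay = mean(dep_delay, na.rm = TRUE),
    n = n()
  )

# Plot --------------------------------------------------------------------

delay_plot <- monthly_delays |>
  ggplot(aes(x = month, y = delay)) +
  geom_point() +
  geom_line(
    linewidth = 1,
    linetype = "dashed"
  )
```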

1.7.1 Exercises

1.8 Data tidying

1.8.1 Exercises

1.9 Data import

We focus on importing CSV files. In a CSV file, the first row, commonly called the header row, gives the column names, and the following rows provide the data. The columns are separated, aka delimited, by commas.

By default, read_csv() only recognizes empty strings ("") in this dataset as NAs; we want it to also recognize the character string "N/A".
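A sketch of the na argument on a small inline CSV. The contents are made up for illustration, and I() (readr 2.0 or later) marks the string as literal data rather than a file path.

```r
library(readr)

# Hypothetical CSV text: one "N/A" and one empty field in the AGE column
csv_text <- "Student ID,Full Name,AGE\n1,Sunil Huffmann,4\n2,Barclay Lynn,N/A\n3,Jayendra Lyne,\n"

# Default: "N/A" is kept as text, so AGE is read as a character column
students_default <- read_csv(I(csv_text), show_col_types = FALSE)

# Treat both "N/A" and empty strings as missing: AGE becomes numeric
students <- read_csv(I(csv_text), na = c("N/A", ""), show_col_types = FALSE)
```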

An alternative approach is janitor::clean_names(), which uses some heuristics to turn all the column names into snake case at once:

Another common task after reading in data is to consider variable types. For example, meal_plan is a categorical variable, which in R should be represented as a factor:
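A minimal sketch of that conversion with mutate() and factor(). The students tibble here is hypothetical and stands in for the imported data.

```r
library(dplyr)

# Hypothetical stand-in for the imported data
students <- tibble(
  meal_plan = c("Lunch only", "Breakfast and lunch", "Lunch only")
)

# Convert the character column to a factor
students <- students |>
  mutate(meal_plan = factor(meal_plan))

levels(students$meal_plan)
```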

1.9.1 Other arguments

Usually, read_csv() uses the first line of the data for the column names. If the first few lines contain text other than the column names, you can use skip = n to skip the first n lines or use comment = "#" to drop all lines that start with (e.g.) #:
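Both options can be tried on inline text. The strings below are made-up examples, wrapped in I() (readr 2.0 or later) so they are read as literal data.

```r
library(readr)

# skip = 2 drops the two metadata lines before the header
skipped <- read_csv(
  I("The first line of metadata\nThe second line of metadata\nx,y,z\n1,2,3"),
  skip = 2,
  show_col_types = FALSE
)

# comment = "#" drops every line that starts with #
commented <- read_csv(
  I("# A comment I want to skip\nx,y,z\n1,2,3"),
  comment = "#",
  show_col_types = FALSE
)
```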

1.9.2 Other file types

Once you’ve mastered read_csv(), it’s just a matter of knowing which function to reach for:

  • read_csv2() reads semicolon-separated files. These use ; instead of , to separate fields and are common in countries that use , as the decimal marker.

  • read_tsv() reads tab-delimited files.

  • read_delim() reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it.

  • read_fwf() reads fixed-width files. You can specify fields by their widths with fwf_widths() or by their positions with fwf_positions().

  • read_table() reads a common variation of fixed-width files where columns are separated by white space.

  • read_log() reads Apache-style log files.

1.9.3 Exercises

1

2, 3

6

1.10 Getting help

1.11 Visualize

1.11.0.1 Layers

You can also set the visual properties of your geom manually as an argument of your geom function (outside of aes()) instead of relying on a variable mapping to determine the appearance.
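For example, a sketch using the mpg dataset that ships with ggplot2: the color is set to a fixed value outside aes(), so every point is blue and no legend is created. layer_data() is used here only to inspect the computed aesthetics.

```r
library(ggplot2)

# color is set manually (outside aes()), not mapped to a variable
blue_points <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Inspect the aesthetics computed for the first layer
point_data <- layer_data(blue_points, 1)
```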

1.11.1 Exercises

1

2

3

The stroke aesthetic controls the width of the border of points for the filled shapes 21-24 (circle, square, diamond, and triangle).

1.11.2 Geometric objects

Not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line.

If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers. You can use the same idea to specify different data for each layer.
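A sketch of both ideas on the mpg dataset from ggplot2: the first layer adds a local color mapping, and the second layer gets its own data. The subset two_seaters is a name chosen for this example.

```r
library(ggplot2)

# Data used only by the second layer
two_seaters <- subset(mpg, class == "2seater")

layered <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +   # local mapping extends the global one
  geom_point(
    data = two_seaters,            # layer-specific data
    color = "red",
    shape = "circle open",
    size = 3
  )
```

The global x and y mappings apply to both layers; the color mapping applies to the first layer only.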

1.11.3 Exercises

1.11.3.1 Facets

To facet your plot with the combination of two variables, switch from facet_wrap() to facet_grid().

By default, each facet shares the same scale and range for the x and y axes. This is useful when you want to compare data across facets, but it can be limiting when you want to visualize the relationship within each facet better. Setting the scales argument in a faceting function to "free_x" allows different x-axis scales across columns, "free_y" allows different y-axis scales across rows, and "free" allows both.
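A sketch of the scales argument with facet_grid() on the mpg dataset from ggplot2; the object names are chosen for this example.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Rows are defined by drv, columns by cyl; both axes may vary
free_both <- base_plot + facet_grid(drv ~ cyl, scales = "free")

# Only the x axis is allowed to vary; y stays shared
free_x_only <- base_plot + facet_grid(drv ~ cyl, scales = "free_x")
```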