6  Data tidying

6.1 Notes


## Solutions

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

6.2 6.2.1 Exercises

1. Using prose, describe how the variables and observations are organised in each of the sample tables.

2. Sketch out the process you’d use to calculate the rate for table2 and table4a + table4b. You will need to perform four operations:

  1. Extract the number of TB cases per country per year.

  2. Extract the matching population per country per year.

  3. Divide cases by population, and multiply by 10000.

  4. Store back in the appropriate place.

    table2 |>
      pivot_wider(
        names_from = type,
        values_from = count
      ) |>
      mutate(rate = cases / population * 10000)
    # A tibble: 6 × 5
      country      year  cases population  rate
      <chr>       <dbl>  <dbl>      <dbl> <dbl>
    1 Afghanistan  1999    745   19987071 0.373
    2 Afghanistan  2000   2666   20595360 1.29 
    3 Brazil       1999  37737  172006362 2.19 
    4 Brazil       2000  80488  174504898 4.61 
    5 China        1999 212258 1272915272 1.67 
    6 China        2000 213766 1280428583 1.67 
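
The pivot above computes the rate in one pass. For comparison, the four listed operations can also be carried out one at a time while keeping table2 in its long layout. This is only a sketch: the names `t2_cases`, `t2_population`, `t2_rate`, and `table2_with_rate` are illustrative and do not come from the exercise, and storing the result as extra rows with `type = "rate"` is one assumed reading of "the appropriate place".

```r
library(tidyverse)

# Step 1: extract the number of TB cases per country per year.
t2_cases <- table2 |>
  filter(type == "cases") |>
  select(country, year, cases = count)

# Step 2: extract the matching population per country per year.
t2_population <- table2 |>
  filter(type == "population") |>
  select(country, year, population = count)

# Step 3: divide cases by population and multiply by 10000.
t2_rate <- t2_cases |>
  inner_join(t2_population, by = c("country", "year")) |>
  mutate(type = "rate", count = cases / population * 10000) |>
  select(country, year, type, count)

# Step 4: store the rates back as additional rows in table2's layout
# (assumption: rows tagged type = "rate" are the "appropriate place").
table2_with_rate <- bind_rows(table2, t2_rate) |>
  arrange(country, year, type)

table2_with_rate
```

Compared with `pivot_wider()`, this version needs two filters and a join to line up cases with population, which is one way to see why the long layout of table2 is awkward for this calculation.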

For table4a, start by pivoting the year columns into rows so that each row holds one count per country per year:

    table4a |>
      pivot_longer(
        cols = -country,
        names_to = "year",
        values_to = "n"
      )
    # A tibble: 6 × 3
      country     year       n
      <chr>       <chr>  <dbl>
    1 Afghanistan 1999     745
    2 Afghanistan 2000    2666
    3 Brazil      1999   37737
    4 Brazil      2000   80488
    5 China       1999  212258
    6 China       2000  213766
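
The output above covers only the cases from table4a. To reach a rate, the population in table4b has to be reshaped the same way and matched to it. A minimal sketch of one way to do that follows. The object names `cases_long`, `population_long`, and `rates` are illustrative, and the value columns are named `cases` and `population` here rather than `n` so that the join result is easier to read.

```r
library(tidyverse)

# table4a holds TB cases with one column per year.
cases_long <- table4a |>
  pivot_longer(cols = -country, names_to = "year", values_to = "cases")

# table4b holds population in the same wide layout.
population_long <- table4b |>
  pivot_longer(cols = -country, names_to = "year", values_to = "population")

# Match cases to population by country and year, then compute the rate.
rates <- cases_long |>
  inner_join(population_long, by = c("country", "year")) |>
  mutate(
    year = as.integer(year),  # pivot_longer() leaves the year as a character
    rate = cases / population * 10000
  )

rates
```

Because the two tables are pivoted before joining, the rate ends up as a new column in a tidy table rather than being written back into a wide table with one rate column per year.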