Data wrangling within the #tidyverse – the design philosophy behind the sjmisc-package #rstats

I’m pleased to announce sjmisc 2.3.0, which was just updated on CRAN. The update might break existing code – however, functions were largely revised to work seamlessly within the tidyverse. In the long run, consistent design makes working with sjmisc more intuitive.

Basically, sjmisc covers two domains of functionality:

  • Reading and writing data between R and other statistical software packages like SPSS, SAS or Stata and working with labelled data; this includes easy ways to get and set label attributes, to convert labelled vectors into factors (and vice versa), or to deal with multiple declared missing values etc.
  • Data transformation tasks like recoding, dichotomizing or grouping variables, setting and replacing missing values. The data transformation functions also support labelled data.

This post briefly describes some of the changes to the design of the functions that perform data transformation tasks.

The design of data transformation functions

Where appropriate, the design of the data transformation functions in this package follows the tidyverse-approach: the first argument of a function is always the data (either a data frame or a vector), followed by the names of the variables that the function should process. If no variables are specified, the function is applied to the complete data passed as the first argument.

The data-argument

A slight difference from dplyr-functions like select() or filter() is that the data-argument (the first argument of each function) may be either a data frame or a vector. The type of the returned object matches the type of the data-argument:

  • If the data-argument is a vector, the function returns a vector.
  • If the data-argument is a data frame, the function returns a data frame.

library(sjmisc)
data(efc)

# returns a vector
x <- rec(efc$e42dep, recodes = "1,2=1; 3,4=2")
str(x)
#>  atomic [1:908] 2 2 2 2 2 2 2 2 2 2 ...
#>  - attr(*, "label")= chr "elder's dependency"

# returns a data frame (a tibble, to be exact)
rec(efc, e42dep, recodes = "1,2=1; 3,4=2")

#> # A tibble: 908 × 1
#>    e42dep_r
#>       <dbl>
#> 1         2
#> 2         2
#> 3         2
#> 4         2
#> 5         2
#> 6         2
#> 7         2
#> 8         2
#> 9         2
#> 10        2
#> # ... with 898 more rows

This design choice was made mainly for reasons of compatibility and convenience. It does not affect the usual “tidyverse-workflow” or the use of pipe-chains.
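The type-preserving behaviour can be sketched in base R. The helper collapse_dep() below is hypothetical and not part of sjmisc; it only illustrates, under that assumption, how a single function might return a vector for vector input and a data frame for data frame input.

```r
# Hypothetical helper, not part of sjmisc:
# collapses the values 1,2 into 1 and 3,4 into 2.
collapse_dep <- function(x) {
  if (is.data.frame(x)) {
    # data frame in, data frame out: recode every column, keep the frame
    x[] <- lapply(x, collapse_dep)
    return(x)
  }
  # vector in, vector out; values outside 1 to 4 (and NA) become NA
  out <- rep(NA_real_, length(x))
  out[x %in% c(1, 2)] <- 1
  out[x %in% c(3, 4)] <- 2
  out
}

dep <- c(3, 4, 1, 2, NA)
collapse_dep(dep)                     # a numeric vector
collapse_dep(data.frame(dep = dep))   # a data frame with one column
```

Because the return type follows the input type, the same function can be used on a single column inside mutate() and on a whole data frame inside a pipe-chain.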

The …-ellipses-argument

The selection of variables specified in the ...-ellipses-argument is powered by dplyr’s select(). This means you can use the : operator to select a range of variables, or dplyr’s select_helpers, like contains() or one_of(). However, select_helpers must be prefixed with a ~-sign.

# select all variables with "cop" in their names, and also
# the range from c161sex to c175empl
rec(efc, ~contains("cop"), c161sex:c175empl, recodes = "0,1=0; else=1")

#> # A tibble: 908 × 12
#>    c161sex_r c172code_r c175empl_r c82cop1_r c83cop2_r c84cop3_r c85cop4_r
#>        <dbl>      <dbl>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#> 1          1          1          0         1         1         1         1
#> 2          1          1          0         1         1         1         1
#> 3          0          0          0         1         1         0         1
#> 4          0          1          0         1         0         1         0
#> 5          1          1          0         1         1         0         1
#> 6          0          1          0         1         1         1         1
#> 7          1          1          0         1         1         1         0
#> 8          1          1          0         1         1         1         0
#> 9          1         NA          0         1         1         1         1
#> 10         1          1          0         1         1         0         1
#> # ... with 898 more rows, and 5 more variables: c86cop5_r <dbl>,
#> #   c87cop6_r <dbl>, c88cop7_r <dbl>, c89cop8_r <dbl>, c90cop9_r <dbl>

# center all variables with "age" in name, variable c12hour
# and all variables from column 19 to 21
center(efc, c12hour, ~contains("age"), 19:21)

#> # A tibble: 908 × 6
#>     c12hour_c barthtot_c  neg_c_7_c pos_v_4_c    e17age_c  c160age_c
#>         <dbl>      <dbl>      <dbl>     <dbl>       <dbl>      <dbl>
#> 1  -26.399113  10.453001  0.1502242 -0.476731   3.8787879  2.5371809
#> 2  105.600887  10.453001  8.1502242 -1.476731   8.8787879  0.5371809
#> 3   27.600887 -29.546999 -0.8497758  0.523269   2.8787879 26.5371809
#> 4  125.600887 -64.546999 -1.8497758  2.523269 -12.1212121 15.5371809
#> 5  125.600887 -39.546999  0.1502242  2.523269   4.8787879 -6.4628191
#> 6  -26.399113  -4.546999  7.1502242 -3.476731   5.8787879  2.5371809
#> 7  118.600887 -59.546999  3.1502242  0.523269  -5.1212121  7.5371809
#> 8   67.600887 -29.546999 -0.8497758  1.523269   7.8787879 13.5371809
#> 9  -14.399113 -49.546999  3.1502242  0.523269  -0.1212121  5.5371809
#> 10  -2.399113 -64.546999 -1.8497758  0.523269   3.8787879 -4.4628191
#> # ... with 898 more rows
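To illustrate how bare variable names, ranges like c161sex:c175empl and column positions like 19:21 can all be resolved to columns, here is a minimal base R sketch. The helper pick_columns() is hypothetical; it is not how sjmisc or dplyr implement selection, but it uses the same idea as the select argument of base::subset(). Select_helpers such as contains() are not covered by this sketch.

```r
# Hypothetical helper, not sjmisc's actual implementation.
pick_columns <- function(data, ...) {
  caller <- parent.frame()
  # capture the expressions in ... without evaluating them
  exprs <- eval(substitute(alist(...)))
  # every column name becomes a variable that holds its position
  positions <- as.list(seq_along(data))
  names(positions) <- names(data)
  # evaluate each expression with names standing in for positions
  idx <- unlist(lapply(exprs, function(e) eval(e, positions, caller)))
  data[, unique(idx), drop = FALSE]
}

df <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)
names(pick_columns(df, d, a:b))   # a bare name plus a range of names
names(pick_columns(df, 2:3))      # numeric column positions
```

Since a name like a evaluates to its position, a:b is simply 1:2, which is why ranges of names and ranges of numbers can be mixed in one call.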

The function-types

There are two types of function designs:

coercing/converting functions

Functions like to_factor() or add_labels(), which convert variables into other types or add information like variable or value labels as attributes, typically return the complete data frame that was given as first argument. The variables specified in the ...-ellipses argument are converted; all other variables remain unchanged.

to_factor(efc, e42dep, e16sex)
#> # A tibble: 908 × 26
#>    c12hour e15relat e16sex e17age e42dep c82cop1 c83cop2 c84cop3 c85cop4
#>      <dbl>    <dbl> <fctr>  <dbl> <fctr>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1       16        2      2     83      3       3       2       2       2
#> 2      148        2      2     88      3       3       3       3       3
#> 3       70        1      2     82      3       2       2       1       4
#> 4      168        1      2     67      4       4       1       3       1
#> 5      168        2      2     84      4       3       2       1       2
#> 6       16        2      2     85      4       2       2       3       3
#> 7      161        1      1     74      4       4       2       4       1
#> 8      110        4      2     87      4       3       2       2       1
#> 9       28        2      2     79      4       3       2       3       2
#> 10      40        2      2     83      4       3       2       1       2
#> # ... with 898 more rows, and 17 more variables: c86cop5 <dbl>,
#> #   c87cop6 <dbl>, c88cop7 <dbl>, c89cop8 <dbl>, c90cop9 <dbl>,
#> #   c160age <dbl>, c161sex <dbl>, c172code <dbl>, c175empl <dbl>,
#> #   barthtot <dbl>, neg_c_7 <dbl>, pos_v_4 <dbl>, quol_5 <dbl>,
#> #   resttotn <dbl>, tot_sc_e <dbl>, n4pstu <dbl>, nur_pst <dbl>
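The “convert in place, return everything” pattern can be sketched in base R as follows. The helper as_factor_cols() is hypothetical and not part of sjmisc, and it takes column names as a character vector instead of bare names to keep the sketch short.

```r
# Hypothetical helper, not part of sjmisc: converts the named columns
# to factors and returns the complete data frame.
as_factor_cols <- function(data, cols) {
  data[cols] <- lapply(data[cols], factor)
  data
}

df <- data.frame(hours = c(16, 148, 70), sex = c(2, 2, 1), dep = c(3, 3, 4))
res <- as_factor_cols(df, c("sex", "dep"))
sapply(res, class)   # hours stays numeric, sex and dep become factors
```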

transformation/recoding functions

Functions like rec() or dicho(), which transform or recode variables, typically do not return the complete data frame that was given as first argument, but only the transformed or recoded variables specified in the ...-ellipses argument. The reason for this is to preserve the original variables instead of overwriting them with completely new values.

rec(efc, c82cop1, c83cop2, recodes = "1,2=0; 3:4=2")
#> # A tibble: 908 × 2
#>    c82cop1_r c83cop2_r
#>        <dbl>     <dbl>
#> 1          2         0
#> 2          2         2
#> 3          0         0
#> 4          2         0
#> 5          2         0
#> 6          0         0
#> 7          2         0
#> 8          2         0
#> 9          2         0
#> 10         2         0
#> # ... with 898 more rows

The returned variables usually get a suffix, so you can bind them as new columns to a data frame, for instance with add_columns(). add_columns() is useful if you want to add columns to the end of a data frame within a pipe-chain.

efc %>% 
  rec(c82cop1, c83cop2, recodes = "1,2=0; 3:4=2") %>% 
  add_columns(efc)
#> # A tibble: 908 × 28
#>    c12hour e15relat e16sex e17age e42dep c82cop1 c83cop2 c84cop3 c85cop4
#>      <dbl>    <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#> 1       16        2      2     83      3       3       2       2       2
#> 2      148        2      2     88      3       3       3       3       3
#> 3       70        1      2     82      3       2       2       1       4
#> 4      168        1      2     67      4       4       1       3       1
#> 5      168        2      2     84      4       3       2       1       2
#> 6       16        2      2     85      4       2       2       3       3
#> 7      161        1      1     74      4       4       2       4       1
#> 8      110        4      2     87      4       3       2       2       1
#> 9       28        2      2     79      4       3       2       3       2
#> 10      40        2      2     83      4       3       2       1       2
#> # ... with 898 more rows, and 19 more variables: c86cop5 <dbl>,
#> #   c87cop6 <dbl>, c88cop7 <dbl>, c89cop8 <dbl>, c90cop9 <dbl>,
#> #   c160age <dbl>, c161sex <dbl>, c172code <dbl>, c175empl <dbl>,
#> #   barthtot <dbl>, neg_c_7 <dbl>, pos_v_4 <dbl>, quol_5 <dbl>,
#> #   resttotn <dbl>, tot_sc_e <dbl>, n4pstu <dbl>, nur_pst <dbl>,
#> #   c82cop1_r <dbl>, c83cop2_r <dbl>
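The “return only the new, suffixed columns, then bind them back” pattern can be sketched in base R. The helper recode_cols() is hypothetical and not part of sjmisc; cbind() stands in for add_columns() here, and the recode rule mirrors the "1,2=0; 3:4=2" example above.

```r
# Hypothetical helper, not part of sjmisc: recodes 1,2 to 0 and 3,4 to 2
# and returns only the new columns, named with a suffix.
recode_cols <- function(data, cols, suffix = "_r") {
  out <- lapply(data[cols], function(x) {
    new <- rep(NA_real_, length(x))
    new[x %in% c(1, 2)] <- 0
    new[x %in% c(3, 4)] <- 2
    new
  })
  names(out) <- paste0(cols, suffix)
  as.data.frame(out)
}

df <- data.frame(cop1 = c(3, 3, 2, 4), cop2 = c(2, 3, 2, 1), age = c(83, 88, 82, 67))
recoded <- recode_cols(df, c("cop1", "cop2"))
names(recoded)              # only the two recoded columns
names(cbind(df, recoded))   # original columns first, recoded columns at the end
```

The suffix keeps the new names distinct from the originals, so binding never overwrites an existing column.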

sjmisc and dplyr

The functions of sjmisc are designed to work together seamlessly with other packages from the tidyverse, like dplyr. For instance, you can use the functions from sjmisc both within a pipe-workflow to manipulate data frames and to create new variables with mutate():

# select() and mutate() come from dplyr
library(dplyr)

efc %>% 
  select(c82cop1, c83cop2) %>% 
  rec(recodes = "1,2=0; 3:4=2")
#> # A tibble: 908 × 2
#>    c82cop1_r c83cop2_r
#>        <dbl>     <dbl>
#> 1          2         0
#> 2          2         2
#> 3          0         0
#> 4          2         0
#> 5          2         0
#> 6          0         0
#> 7          2         0
#> 8          2         0
#> 9          2         0
#> 10         2         0
#> # ... with 898 more rows

efc %>% 
  select(c82cop1, c83cop2) %>% 
  mutate(
    c82cop1_dicho = rec(c82cop1, recodes = "1,2=0; 3:4=2"),
    c83cop2_dicho = rec(c83cop2, recodes = "1,2=0; 3:4=2")
  ) %>% 
  head()
#>   c82cop1 c83cop2 c82cop1_dicho c83cop2_dicho
#> 1       3       2             2             0
#> 2       3       3             2             2
#> 3       2       2             0             0
#> 4       4       1             2             0
#> 5       3       2             2             0
#> 6       2       2             0             0

Final words

I am trying to make my packages work more naturally together with other tidyverse-packages. In my experience, this philosophy makes absolute sense (see tidyverse-manifesto). If packages (and their functions) follow a certain logic (or “grammar”), it’s much easier to work with R and especially to switch to R from other statistical packages (to cite from the manifesto: “once you’ve mastered one [R package], you have a head start on mastering the others”). This meant that past and current updates of my packages introduced major changes in function design, most likely breaking existing code, but I think that with the latest updates of my packages, this process is complete. Future updates will probably focus on enhancements or new functions rather than major revisions of function designs.