The sjmisc-package

My last posting was about reading and writing data between R and other statistical packages like SPSS, Stata or SAS. After that, I decided to bundle all functions that are not directly related to plotting or printing tables into a new package called sjmisc.

Basically, this package covers three domains of functionality:

  • reading and writing data between other statistical packages (like SPSS) and R, based on the haven and foreign packages; hence, sjmisc also includes functions to work with labelled data
  • frequently used statistical tests, or at least convenient wrappers for such test functions
  • frequently applied recoding and variable conversion tasks

In this posting, I want to give a short introduction to the labelling features.

Labelled Data

In software like SPSS, it is common to have value and variable labels as variable attributes. Variable values, even if categorical, are mostly numeric. In R, however, you may use labels as values directly:

> factor(c("low", "high", "mid", "high", "low"))
[1] low  high mid  high low 
Levels: high low mid

Reading SPSS data (with haven, foreign or sjmisc) keeps the numeric values of the variables and adds the value and variable labels as attributes. See the following example from the sample dataset efc, which is part of the sjmisc package:

library(sjmisc)
data(efc)
str(efc$e42dep)

> atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
> - attr(*, "label")= chr "how dependent is the elder? - subjective perception of carer"
> - attr(*, "labels")= Named num [1:4] 1 2 3 4
>  ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
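
Because the labels are stored as ordinary R attributes, they can also be inspected with base R alone. The following is only a minimal sketch: the vector dep is a hand-built stand-in for efc$e42dep (with a shortened variable label), not the actual data.

```r
# Hypothetical stand-in for efc$e42dep: numeric codes plus label attributes
dep <- c(3, 3, 3, 4, 1, 2)
attr(dep, "label") <- "how dependent is the elder?"
attr(dep, "labels") <- c("independent" = 1, "slightly dependent" = 2,
                         "moderately dependent" = 3, "severely dependent" = 4)

# The variable label is a single string
var_label <- attr(dep, "label", exact = TRUE)

# The value labels are a named vector: names are the labels, values the codes
val_labels <- attr(dep, "labels", exact = TRUE)
label_of_3 <- names(val_labels)[val_labels == 3]
```

Note the exact = TRUE argument: without it, attr() matches attribute names partially, so asking for "label" on a vector that only has "labels" would return the value labels instead.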

While all plotting and table functions of the sjPlot package make use of these attributes (see many examples here), many other packages and functions ignore them, for example R base graphics:

library(sjmisc)
data(efc)
barplot(table(efc$e42dep, efc$e16sex), 
        beside = TRUE,
        legend.text = TRUE)


Adding value labels as factor values

to_label is an sjmisc function that converts a numeric variable into a factor and uses the value labels stored in its attributes as factor levels. Using factors with labelled levels, the bar plot is labelled as well:

library(sjmisc)
data(efc)
barplot(table(to_label(efc$e42dep),
              to_label(efc$e16sex)), 
        beside = TRUE,
        legend.text = TRUE)


to_fac is a convenient replacement for as.factor: it converts a numeric vector into a factor, but keeps the value and variable label attributes.
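
To make the difference between the two conversions more tangible, here is a rough base R approximation of both. This is a sketch of the idea only, not the code sjmisc actually uses, and dep is again a hypothetical stand-in for a labelled variable like efc$e42dep.

```r
# Hypothetical stand-in for a labelled numeric variable such as efc$e42dep
dep <- c(3, 4, 1, 2, 3)
attr(dep, "label") <- "how dependent is the elder?"
attr(dep, "labels") <- c("independent" = 1, "slightly dependent" = 2,
                         "moderately dependent" = 3, "severely dependent" = 4)

# Rough equivalent of to_label(): the value labels become the factor levels
labs <- attr(dep, "labels", exact = TRUE)
dep_labelled <- factor(dep, levels = labs, labels = names(labs))

# Rough equivalent of to_fac(): a plain factor with the numeric codes as
# levels; as.factor() drops the label attributes, so they are copied back
dep_fac <- as.factor(dep)
attr(dep_fac, "label") <- attr(dep, "label", exact = TRUE)
attr(dep_fac, "labels") <- attr(dep, "labels", exact = TRUE)
```

Here levels(dep_labelled) holds the four value labels, while levels(dep_fac) holds the codes "1" to "4" with the labels still attached as attributes.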

Getting and setting value and variable labels

There are four functions that let you easily set or get value and variable labels of either a single vector or a complete data frame:

  • get_var_labels() to get variable labels
  • get_val_labels() to get value labels
  • set_var_labels() to set variable labels (add them as vector attribute)
  • set_val_labels() to set value labels (add them as vector attribute)

library(sjmisc)
data(efc)
barplot(table(to_label(efc$e42dep),
              to_label(efc$e16sex)), 
        beside = TRUE,
        legend.text = TRUE,
        main = get_var_labels(efc$e42dep))


get_var_labels(efc) would return the variable labels of all variables in the data frame, and get_val_labels(efc) would return a list with the value labels of all variables in the data frame.
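
For readers who want to see what this amounts to without the package, the same information can be collected in base R by reading the attributes of each column. This is a sketch under assumptions: the data frame df and its labels are made up for illustration, and the return format of the sjmisc functions may differ from what is shown here.

```r
# Hypothetical two-column data frame with label attributes
df <- data.frame(age = c(74, 68, 80), sex = c(1, 2, 2))
attr(df$age, "label") <- "elder's age"
attr(df$sex, "label") <- "elder's gender"
attr(df$sex, "labels") <- c("male" = 1, "female" = 2)

# Variable labels of every column, as a named character vector
var_labels <- vapply(df, function(x) attr(x, "label", exact = TRUE),
                     character(1))

# Value labels of every column, as a named list (NULL where none are set)
val_labels <- lapply(df, function(x) attr(x, "labels", exact = TRUE))
```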

Restore labels from subsetted data

The base subset function as well as dplyr’s filter and select functions (at least up to dplyr 0.4.1) drop label attributes (or vector attributes in general) when subsetting data. The current development snapshot of sjmisc on GitHub (which will most likely become version 1.0.3 and be released in June or July) has two handy functions to deal with this problem: add_labels and remove_labels.

add_labels adds labels back to a subsetted data frame, based on the original data frame. remove_labels removes all label attributes. This might be necessary when working with dplyr up to 0.4.1, which sometimes throws an error when working with labelled data; this issue should be addressed in the next dplyr update.

Losing labels during subset

library(sjmisc)
data(efc)
efc.sub <- subset(efc, subset = e16sex == 1, select = c(4:8))
str(efc.sub)

> 'data.frame':	296 obs. of  5 variables:
> $ e17age : num  74 68 80 72 94 79 67 80 76 88 ...
> $ e42dep : num  4 4 1 3 3 4 3 4 2 4 ...
> $ c82cop1: num  4 3 3 4 3 3 4 2 2 3 ...
> $ c83cop2: num  2 4 2 2 2 2 1 3 2 2 ...
> $ c84cop3: num  4 4 1 1 1 4 2 4 2 4 ...

Add back labels

efc.sub <- add_labels(efc.sub, efc)
str(efc.sub)

> 'data.frame':	296 obs. of  5 variables:
>  $ e17age : atomic  74 68 80 72 94 79 67 80 76 88 ...
>   ..- attr(*, "label")= Named chr "elder' age"
>   .. ..- attr(*, "names")= chr "e17age"
>  $ e42dep : atomic  4 4 1 3 3 4 3 4 2 4 ...
>   ..- attr(*, "label")= Named chr "how dependent is the elder? - subjective perception of carer"
>   .. ..- attr(*, "names")= chr "e42dep"
>   ..- attr(*, "labels")= Named chr  "1" "2" "3" "4"
>   .. ..- attr(*, "names")= chr  "independent" "slightly dependent" "moderately dependent" "severely dependent"

# truncated output
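
The mechanism behind this can be reproduced in a few lines of base R. The following is a minimal sketch, not the sjmisc implementation: restore_labels is a hypothetical helper written for this example, and df is a made-up data frame. It simply copies the "label" and "labels" attributes back, matching columns by name.

```r
# Hypothetical labelled data frame
df <- data.frame(age = c(74, 68, 80, 72), sex = c(1, 2, 1, 2))
attr(df$age, "label") <- "elder's age"
attr(df$sex, "label") <- "elder's gender"

# Row subsetting drops the custom attributes
df_sub <- subset(df, sex == 1)
lost <- is.null(attr(df_sub$age, "label", exact = TRUE))

# Hypothetical helper: copy "label" and "labels" back, matching by column name
restore_labels <- function(sub, original) {
  for (col in intersect(names(sub), names(original))) {
    for (a in c("label", "labels")) {
      attr(sub[[col]], a) <- attr(original[[col]], a, exact = TRUE)
    }
  }
  sub
}

df_sub <- restore_labels(df_sub, df)
```

Matching by column name rather than by position means the helper still works when the subset also selects or reorders columns.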

So, when working with labelled data, especially with data sets imported from other software packages, it comes in very handy to make use of the label attributes. The sjmisc package supports this feature and offers some useful functions for these tasks…