sjmisc – package for working with (labelled) data #rstats

The sjmisc-package

My last posting was about reading and writing data between R and other statistical packages like SPSS, Stata or SAS. After that, I decided to bundle all functions that are not directly related to plotting or printing tables, into a new package called sjmisc.

Basically, this package covers three domains of functionality:

  • reading and writing data between other statistical packages (like SPSS) and R, based on the haven and foreign packages; hence, sjmisc also includes functions to work with labelled data.
  • frequently used statistical tests, or at least convenient wrappers for such test functions
  • frequently applied recoding and variable conversion tasks

In this posting, I want to give a short introduction to the labelling features.

Labelled Data

In software like SPSS, it is common to have value and variable labels as variable attributes. Variable values, even if categorical, are mostly numeric. In R, however, you may use labels as values directly:

> factor(c("low", "high", "mid", "high", "low"))
[1] low  high mid  high low 
Levels: high low mid

Reading SPSS data (with haven, foreign or sjmisc) keeps the numeric values of variables and adds the value and variable labels as attributes. See the following example from the sample dataset efc, which is part of the sjmisc package:


> atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
> - attr(*, "label")= chr "how dependent is the elder? - subjective perception of carer"
> - attr(*, "labels")= Named num [1:4] 1 2 3 4
>  ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
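
To make the attribute structure above more tangible, here is a minimal base R sketch. The vector dep and its values are made up for illustration and no sjmisc functions are used; it only assumes that the labels are stored in the attributes "label" and "labels", as in the output above.

```r
# Hypothetical vector mimicking an imported SPSS variable (made-up values)
dep <- c(3, 3, 4, 1, 2)
attr(dep, "label") <- "how dependent is the elder?"
attr(dep, "labels") <- c("independent" = 1, "slightly dependent" = 2,
                         "moderately dependent" = 3, "severely dependent" = 4)

# the variable label is a plain character attribute;
# exact = TRUE avoids partial matching between "label" and "labels"
var_label <- attr(dep, "label", exact = TRUE)

# the value labels are the names of the "labels" attribute,
# so each observation can be mapped to its label via match()
val_labels <- attr(dep, "labels", exact = TRUE)
dep_text <- names(val_labels)[match(dep, val_labels)]

print(var_label)
print(dep_text)
```

The data itself stays numeric; the labels are only looked up when needed.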

While all plotting and table functions of the sjPlot-package make use of these attributes (see many examples here), many packages and/or functions do not consider these attributes, e.g. R base graphics:

barplot(table(efc$e42dep, efc$e16sex), 
        beside = T, 
        legend.text = T)


Adding value labels as factor values

to_label is a sjmisc function that converts a numeric variable into a factor and uses the value labels stored in its attributes as factor levels. With labelled factor levels, the bar plot gets proper labels.

barplot(table(to_label(efc$e42dep),
              to_label(efc$e16sex)), 
        beside = T, 
        legend.text = T)


to_fac is a convenient replacement for as.factor: it converts a numeric vector into a factor, but keeps the value and variable label attributes.
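
The idea behind both conversions can be sketched in base R. This is not the sjmisc implementation; the helper names labels_to_factor and factor_keep_labels are hypothetical, the data is made up, and the sketch assumes the haven-style attributes "label" and "labels".

```r
# Hypothetical labelled vector (values and labels are made up)
x <- c(1, 2, 2, 3)
attr(x, "label") <- "dependency level"
attr(x, "labels") <- c("low" = 1, "mid" = 2, "high" = 3)

# idea behind to_label: the value labels become the factor levels
labels_to_factor <- function(v) {
  lab <- attr(v, "labels", exact = TRUE)
  factor(as.vector(v), levels = unname(lab), labels = names(lab))
}

# idea behind to_fac: plain factor conversion, then re-attach the labels
factor_keep_labels <- function(v) {
  f <- factor(as.vector(v))
  attr(f, "label") <- attr(v, "label", exact = TRUE)
  attr(f, "labels") <- attr(v, "labels", exact = TRUE)
  f
}

f1 <- labels_to_factor(x)
f2 <- factor_keep_labels(x)
print(levels(f1))
print(levels(f2))
```

The first variant is what a base graphics function needs to print readable legends, the second keeps the numeric codes as levels and carries the labels along for later use.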

Getting and setting value and variable labels

There are four functions that let you easily set or get value and variable labels of either a single vector or a complete data frame:

  • get_var_labels() to get variable labels
  • get_val_labels() to get value labels
  • set_var_labels() to set variable labels (add them as vector attribute)
  • set_val_labels() to set value labels (add them as vector attribute)

barplot(table(to_label(efc$e42dep),
              to_label(efc$e16sex)), 
        beside = T, 
        legend.text = T,
        main = get_var_labels(efc$e42dep))


get_var_labels(efc) would return the variable labels of all variables in the data frame, and get_val_labels(efc) would return a list with the value labels of all variables in the data frame.
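
For readers who want to see what such getters have to do for a whole data frame, here is a rough base R sketch. It is an illustration under the assumption that labels live in the "label" and "labels" attributes, not the sjmisc code; the data frame df is made up.

```r
# Hypothetical data frame with label attributes (made-up data)
df <- data.frame(age = c(74, 68, 80), dep = c(4, 4, 1))
attr(df$age, "label") <- "elder's age"
attr(df$dep, "label") <- "how dependent is the elder?"
attr(df$dep, "labels") <- c("independent" = 1, "severely dependent" = 4)

# all variable labels as a named character vector (NA if a label is missing)
var_labels <- vapply(df, function(v) {
  lab <- attr(v, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else lab
}, character(1))

# all value labels as a named list (NULL where a variable has none)
val_labels <- lapply(df, function(v) names(attr(v, "labels", exact = TRUE)))

print(var_labels)
print(val_labels)
```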

Restore labels from subsetted data

The base subset function as well as dplyr’s filter and select functions (at least up to dplyr 0.4.1) drop label attributes (and vector attributes in general) when subsetting data. The current development snapshot of sjmisc on GitHub (which will most likely become version 1.0.3 and be released in June or July) has two handy functions to deal with this problem: add_labels and remove_labels.

add_labels adds labels back to a subsetted data frame, based on the original data frame. remove_labels removes all label attributes. This might be necessary when working with dplyr up to 0.4.1, which sometimes throws an error when working with labelled data; this issue should be addressed in the next dplyr update.

Losing labels during subset

efc.sub <- subset(efc, subset = e16sex == 1, select = c(4:8))

> 'data.frame':	296 obs. of  5 variables:
> $ e17age : num  74 68 80 72 94 79 67 80 76 88 ...
> $ e42dep : num  4 4 1 3 3 4 3 4 2 4 ...
> $ c82cop1: num  4 3 3 4 3 3 4 2 2 3 ...
> $ c83cop2: num  2 4 2 2 2 2 1 3 2 2 ...
> $ c84cop3: num  4 4 1 1 1 4 2 4 2 4 ...

Add back labels

efc.sub <- add_labels(efc.sub, efc)

> 'data.frame':	296 obs. of  5 variables:
>  $ e17age : atomic  74 68 80 72 94 79 67 80 76 88 ...
>   ..- attr(*, "label")= Named chr "elder' age"
>   .. ..- attr(*, "names")= chr "e17age"
>  $ e42dep : atomic  4 4 1 3 3 4 3 4 2 4 ...
>   ..- attr(*, "label")= Named chr "how dependent is the elder? - subjective perception of carer"
>   .. ..- attr(*, "names")= chr "e42dep"
>   ..- attr(*, "labels")= Named chr  "1" "2" "3" "4"
>   .. ..- attr(*, "names")= chr  "independent" "slightly dependent" "moderately dependent" "severely dependent"

# truncated output
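
The principle of restoring labels can be sketched in a few lines of base R. The helper restore_labels is hypothetical and only illustrates the idea of copying attributes column by column; it is not how add_labels is implemented, and the data is made up.

```r
# Hypothetical original data with label attributes (made-up values)
orig <- data.frame(sex = c(1, 2, 1, 2), dep = c(4, 4, 1, 3))
attr(orig$dep, "label") <- "how dependent is the elder?"
attr(orig$dep, "labels") <- c("independent" = 1, "severely dependent" = 4)

# row subsetting typically drops such custom vector attributes
sub <- subset(orig, subset = sex == 1, select = dep)

# copy the label attributes back, matching columns by name
restore_labels <- function(subsetted, original) {
  for (nm in intersect(names(subsetted), names(original))) {
    for (a in c("label", "labels")) {
      value <- attr(original[[nm]], a, exact = TRUE)
      if (!is.null(value)) attr(subsetted[[nm]], a) <- value
    }
  }
  subsetted
}

sub <- restore_labels(sub, orig)
str(sub)
```

Matching by column name means the subset may contain fewer columns, or columns in a different order, than the original data frame.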

So, when working with labelled data, especially with data sets imported from other software packages, it comes in very handy to make use of the label attributes. The sjmisc package supports this feature and offers some useful functions for these tasks…

Reading from and writing to SPSS, SAS and STATA with R #rstats #sjPlot

On CRAN now

My sjPlot-package was updated on CRAN (binaries will be available soon, I guess). This update contains, besides many small improvements and fixes, two major features:

  1. First, new features to print table summaries of linear models and generalized linear models (for sjt.glm, the same new features were added as to sjt.lm – however, the manual page is not finished yet). I have introduced these features in a former posting.
  2. Second, functions for reading data from and writing to other statistical packages like SPSS, SAS or STATA have been revamped or new features have been added. Furthermore, there are improved getters and setters to extract and set variable and value labels. A short introduction is available online.

The haven package

There are two reasons why this update focuses on reading and writing data as well as getting and setting value and variable labels. First, I wanted to rename all functions that formerly had the prefixes sji. or sju. in order to have more “intuitive” function names, so people better understand what these functions do.

The second reason is the release of the haven package, which supports fast reading and writing from or to different file formats (like SPSS, SAS or STATA). I believe this package will become widely used for reading or writing data from/to other formats, so I wanted to ensure compatibility between sjPlot and data imported with haven.

The haven package reads data to a data frame where all variables (vectors) are of class type labelled, which means these variables are atomic (i.e. they have numeric values, even if they are categorical or factors, see this introduction on RStudio) and each variable has – where applicable – a variable label and value labels attribute.
An example:

## Class 'labelled'  atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
##   ..- attr(*, "label")= chr "how dependent is the elder?"
##   ..- attr(*, "labels")= Named int [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"

Until recently, the sjPlot package solely used the read.spss function from the foreign package to read data from SPSS. The foreign package uses following structure to import value and variable labels:

##  atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
##  - attr(*, "value.labels")= Named chr [1:4] "1" "2" "3" "4"
##   ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
##  - attr(*, "variable.label")= chr "how dependent is the elder?"

Since version 1.7, sjPlot can also read data using the haven read-functions (simply use my_dataframe <- read_spss("path/to/spss-file.sav", option = "haven")).

These kinds of attributes, whether from haven or foreign, are a huge advantage when you want to plot or print (summaries of) variables without manually setting axis labels or titles, because you can extract this information from any variable’s attributes. This is a core feature of all sjPlot plotting and table printing functions:

# load sample data
# set plot theme
sjp.setTheme(theme = "539")
# plot frequencies


The new sjPlot update can now deal with both structures of either haven or foreign imported data. It doesn’t matter whether efc2$e42dep from the above example was read with foreign, or is a labelled class vector from haven.

Also, reading value and variable labels works for both vector types. get_var_labels() and get_val_labels() extract variable and value labels from both haven-data and foreign-data.
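
How a getter can support both attribute layouts is easy to sketch in base R. The helpers any_var_label and any_val_labels are hypothetical names for illustration, not sjPlot functions; the sketch only assumes the attribute names shown in the two str() outputs above.

```r
# look up a variable label under either naming scheme
any_var_label <- function(v) {
  lab <- attr(v, "label", exact = TRUE)            # haven style
  if (is.null(lab)) {
    lab <- attr(v, "variable.label", exact = TRUE) # foreign style
  }
  lab
}

# in both schemes the value labels are the names of the attribute
any_val_labels <- function(v) {
  lab <- attr(v, "labels", exact = TRUE)           # haven style
  if (is.null(lab)) {
    lab <- attr(v, "value.labels", exact = TRUE)   # foreign style
  }
  names(lab)
}

# made-up vectors, one per style
h <- c(1L, 2L)
attr(h, "label") <- "haven style"
attr(h, "labels") <- c("no" = 1L, "yes" = 2L)

f <- c(1, 2)
attr(f, "variable.label") <- "foreign style"
attr(f, "value.labels") <- c("no" = "1", "yes" = "2")

print(any_var_label(h)); print(any_val_labels(h))
print(any_var_label(f)); print(any_val_labels(f))
```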

Writing data

The constructor of the labelled class only supports creating value labels, not variable labels. Thus, writing data back to SPSS or STATA does not write variable labels by default (at least for newly created variables; variables that were read with haven and already have the variable label attribute label will save their variable labels correctly).

So I wrote two wrapper functions for writing data, write_spss and write_stata. These functions convert your data, regardless of whether it was imported with the foreign or haven package or whether you created new variables manually, into a format that keeps value and variable labels when writing data to SPSS or STATA.

When you create new variables, make sure you use set_val_labels and set_var_labels to add the necessary label attributes to new variables:

# create dummy variable
dummy <- sample(1:4, 40, replace=TRUE)
# manually attach value and variable labels
dummy <- set_val_labels(dummy, c("very low", "low", "mid", "hi"))
dummy <- set_var_labels(dummy, "This is a dummy")
# check structure of dummy
str(dummy)
##  atomic [1:40] 2 2 2 3 3 2 1 4 4 2 ...
##  - attr(*, "value.labels")= Named chr [1:4] "1" "2" "3" "4"
##   ..- attr(*, "names")= chr [1:4] "very low" "low" "mid" "hi"
##  - attr(*, "variable.label")= chr "This is a dummy"

Finally, I’d just like to mention the convenient conversion functions, e.g. to convert atomic variables into factors without losing the label attributes. These are to_fac, to_label and to_value. Further notes on the read and write functions of the sjPlot package are in the online manual.

Beautiful table outputs in R, part 2 #rstats #sjPlot

First of all, I’d like to thank my readers for all the feedback on my last post on beautiful outputs in R. I tried to consider all suggestions, updated the existing table-output-functions and added some new ones, which will be described in this post. The updated package is already available on CRAN.

This posting is divided into two major parts:

  1. the new functions are described, and
  2. the new features of all table-output-functions are introduced (including knitr-integration and office-import)

Read on …

No need for SPSS – beautiful output in R #rstats

Note: There’s a second part of this series here.

About one year ago, I seriously started migrating from SPSS to R. Though I’m still using SPSS (because I have to in some situations), I’m quite comfortable and happy with R now and have learnt a lot in the past months. But since SPSS is still very widespread in the social sciences, I get asked every now and then whether I really needed to learn R, because SPSS meets all my needs…

Well, learning R had at least two major benefits for me: 1.) I could improve my statistical knowledge a lot, simply by using formulas, asking why certain R commands do not automatically give the same results as SPSS, reading R resources and papers etc. and 2.) the possibilities of data visualization are way better in R than in SPSS (though SPSS can do well, too…). Of course, there are many more reasons to use R.

Still, one thing I often miss in R is a beautiful output of simple statistics or maybe even advanced statistics. Not always as plot or graph, but neither as “cryptic” console output. I’d like to have a simple table view, just like the SPSS output window (though the SPSS output is not “beautiful”). That’s why I started writing functions that put the results of certain statistics in HTML tables. These tables can be saved to disk or, even better for quick inspection, shown in a web browser or viewer pane (like the RStudio viewer pane).

All of the following functions are available in my sjPlot-package on CRAN.

Read on …

Easily plotting grouped bars with ggplot #rstats

This tutorial shows how to create diagrams with grouped bar charts or dot plots with ggplot. The groups can also be displayed as facet grids.

Importing the data from SPSS
All following examples are based on an imported SPSS data set. Refer to this posting for more details on how to do that and to my script page to download the scripts. This is important to know because the way the variable and value labels are accessed may depend on whether you use an imported SPSS dataset or not (i.e. you may have to change parameters to get the sample running).

You can, for instance, import your SPSS data like this, if you are using my script:

efc <- importSPSS("GER_Services_FU_PV_dt.sav")
efc_vars <- getVariableLabels(efc)
efc_labels <- getValueLabels(efc)

The R script
You can download the script from my script page. I will not describe the code in detail because the source code is (hopefully) well commented. Basically, the script just transforms the data from two variables (one count variable with categories and one grouping variable) to fit the ggplot requirements for plotting bar charts. You can use a lot of parameters to change the style of the output, e.g. you can plot bars or dots, dodged or stacked bars, change colors etc., and you don’t need to know how this works in ggplot. You simply pass your “preferred settings” as parameters.

You can include the script via this single line:


Continue reading this post…

Simplify frequency plots with ggplot in R #rstats

Update March 5th
All downloads are now accessible from my script page!

This posting shows how to plot frequency plots using the ggplot-package in R. Compared to SPSS standard outputs, you will learn how to create appealing diagrams ready for use in your papers.

Frequency plots in SPSS
In SPSS, you can create frequencies of variables by using this short script:


which gives you following overview:


If you add another line to your syntax script, you can plot either bar charts (/BARCHARTS) or histograms (/HIST), too:


which gives you following results:



It seems to be more effort creating graphs like the ones above in R, but actually it’s almost easier – and you even have more beautiful plots. The only preparation you need is a general function for plotting frequencies in R.

Continue reading this post…

Simplify your R workflow with functions #rstats

Update/ Thanks to Bernd I could improve the function of how to import the data, so here’s the updated script! /Update

In R, you often may have scripts or code snippets that will be reused. In such cases, you can write functions for your every-day-tasks. For instance, importing and converting data is such a task. I have written a small function importSPSS.R to do this:

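importSPSS <- function(path, enc=NA) {
  # init foreign package
  library(foreign)
  # import data as data frame
  data.spss <- read.spss(path, to.data.frame=TRUE, use.value.labels=FALSE, reencode=enc)
  # return data frame
  return(data.spss)
}
# value labels of all variables in a data frame, as a list
getValueLabels <- function(dat) {
  a <- lapply(dat, FUN = getValLabels)
  return (a)
}
# value labels of a single variable, in reversed order
getValLabels <- function(x){
  rev(names(attr(x, "value.labels")))
}
# variable labels of all variables in a data frame
getVariableLabels <- function(dat) {
  return(attr(dat, "variable.labels"))
}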
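
To try the getters without an SPSS file at hand, the following sketch uses a made-up data frame that mimics the attributes the getters rely on ("value.labels" on each variable, "variable.labels" on the data frame). The object mock and its labels are assumptions for illustration; the three getters are repeated in compact form only to keep the snippet self-contained.

```r
# made-up stand-in for an imported SPSS data frame
mock <- data.frame(sex = c(1, 2, 2))
attr(mock$sex, "value.labels") <- c("female" = "2", "male" = "1")
attr(mock, "variable.labels") <- c(sex = "elder's gender")

# same getters as in the script, in compact form
getValLabels <- function(x) rev(names(attr(x, "value.labels")))
getValueLabels <- function(dat) lapply(dat, FUN = getValLabels)
getVariableLabels <- function(dat) attr(dat, "variable.labels")

vars <- getVariableLabels(mock)
labs <- getValueLabels(mock)
print(vars)
print(labs)
```

Note that rev() reverses whatever order the value labels are stored in, so in this mock the labels come back as "male", "female".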

This small function saves only a little typing. Compared to the code example under “Migration, step 3: Importing (SPSS) variable and value labels”, the following things will change:

Continue reading this posting…

Migrating from SPSS to R #rstats

Every now and then, I will post about my experience with R, a software environment for statistical analyses. I try to show solutions for common types of analyses or problems you face when you start working with R. These “tutorials” especially address people who are used to working with SPSS or maybe also Stata.

Since I myself am new to R, my solutions probably are not the most elegant ones! Thus, any feedback is welcome!

This post just shows how to properly import SPSS data and get access to data values, variable labels and value labels. We need these basics for later tutorials where I focus on proper graphical output.

Why R?
I recently started using the statistical package R to do my statistical analyses at work. We all have SPSS licences at work, but still I was interested in testing R for some reasons:

  • It’s free and runs on Windows, Mac and Linux
  • The large number of different statistical analyses and modeling approaches
  • The various possibilities of creating graphics (see, e.g., here, here or here)

Migration, step 1: Installation
First of all, R only provides a console for input and output and has no GUI (graphical user interface). This is probably the biggest obstacle to migrating from SPSS to R, because calculating cross tabs on the fly, for instance, is not as easy as in SPSS. So, the first step after you have downloaded R is to download a nice editor for it, too.

I would recommend RStudio, because it’s also free, runs on Windows/Mac/Linux, is beautiful and makes working with R much easier.

Continue reading this posting…
