sjmisc – package for working with (labelled) data #rstats

The sjmisc-package

My last posting was about reading and writing data between R and other statistical packages like SPSS, Stata or SAS. After that, I decided to bundle all functions that are not directly related to plotting or printing tables, into a new package called sjmisc.

Basically, this package covers three domains of functionality:

  • reading and writing data between other statistical packages (like SPSS) and R, based on the haven and foreign packages; hence, sjmisc also includes function to work with labelled data.
  • frequently used statistical tests, or at least convenient wrappers for such test functions
  • frequently applied recoding and variable conversion tasks

In this posting, I want to give a quick and short introduction into the labeling features.

Labelled Data

In software like SPSS, it is common to have value and variable labels as variable attributes. Variable values, even if categorical, are mostly numeric. In R, however, you may use labels as values directly:

> factor(c("low", "high", "mid", "high", "low"))
[1] low  high mid  high low 
Levels: high low mid

Reading SPSS-data (from haven, foreign or sjmisc), keeps the numeric values for variables and adds the value and variable labels as attributes. See following example from the sample-dataset efc, which is part of the sjmisc-package:

library(sjmisc)
data(efc)
str(efc$e42dep)

> atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
> - attr(*, "label")= chr "how dependent is the elder? - subjective perception of carer"
> - attr(*, "labels")= Named num [1:4] 1 2 3 4
>  ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"

While all plotting and table functions of the sjPlot-package make use of these attributes (see many examples here), many packages and/or functions do not consider these attributes, e.g. R base graphics:

library(sjmisc)
data(efc)
barplot(table(efc$e42dep, efc$e16sex), 
        beside = T, 
        legend.text = T)

barplot_1

Adding value labels as factor values

to_label is a sjmisc-function that converts a numeric variable into a factor and sets attribute-value-labels as factor levels. Using factors with valued levels, the bar plot is labelled.

library(sjmisc)
data(efc)
barplot(table(to_label(efc$e42dep),
              to_label(efc$e16sex)), 
        beside = T, 
        legend.text = T)

Rplot

to_fac is a convenient replacement of as.factor, which converts a numeric vector into a factor, but keeps the value and variable label attributes.

Getting and setting value and variable labels

There are four functions that let you easily set or get value and variable labels of either a single vector or a complete data frame:

  • get_var_labels() to get variable labels
  • get_val_labels() to get value labels
  • set_var_labels() to set variable labels (add them as vector attribute)
  • set_val_labels() to set value labels (add them as vector attribute)
library(sjmisc)
data(efc)
barplot(table(to_label(efc$e42dep),
              to_label(efc$e16sex)), 
        beside = T, 
        legend.text = T,
        main = get_var_labels(efc$e42dep))

Rplot01

get_var_labels(efc) would return all data.frame’s variable labels. And get_val_labels(etc) would return a list with all value labels of all data.frame’s variables.

Restore labels from subsetted data

The base subset function as well as dplyr’s (at least up to 0.4.1) filter and select functions omit label attributes (or vector attributes in general) when subsetting data. In the current development-snapshot of sjmisc at GitHub (which will most likely become version 1.0.3 and released in June or July), there are handy functions to deal with this problem: add_labels and remove_labels.

add_labels adds back labels to a subsetted data frame based on the original data frame. And remove_labels removes all label attributes (this might be necessary when working with dplyr up to 0.4.1, dplyr sometimes throws an error when working with labelled data – this issue should be addressed for the next dplyr-update).

Losing labels during subset

library(sjmisc)
data(efc)
efc.sub <- subset(efc, subset = e16sex == 1, select = c(4:8))
str(efc.sub)

> 'data.frame':	296 obs. of  5 variables:
> $ e17age : num  74 68 80 72 94 79 67 80 76 88 ...
> $ e42dep : num  4 4 1 3 3 4 3 4 2 4 ...
> $ c82cop1: num  4 3 3 4 3 3 4 2 2 3 ...
> $ c83cop2: num  2 4 2 2 2 2 1 3 2 2 ...
> $ c84cop3: num  4 4 1 1 1 4 2 4 2 4 ...

Add back labels

efc.sub <- add_labels(efc.sub, efc)
str(efc.sub)

> 'data.frame':	296 obs. of  5 variables:
>  $ e17age : atomic  74 68 80 72 94 79 67 80 76 88 ...
>   ..- attr(*, "label")= Named chr "elder' age"
>   .. ..- attr(*, "names")= chr "e17age"
>  $ e42dep : atomic  4 4 1 3 3 4 3 4 2 4 ...
>   ..- attr(*, "label")= Named chr "how dependent is the elder? - subjective perception of carer"
>   .. ..- attr(*, "names")= chr "e42dep"
>   ..- attr(*, "labels")= Named chr  "1" "2" "3" "4"
>   .. ..- attr(*, "names")= chr  "independent" "slightly dependent" "moderately dependent" "severely dependent"

# truncated output

So, when working with labelled data, especially when working with data sets imported from other software packages, it comes very handy to make use of the label attributes. The sjmisc package supports this feature and offers some useful functions for these tasks…

sjmisc – package for working with (labelled) data #rstats

Reading from and writing to SPSS, SAS and STATA with R #rstats #sjPlot

On CRAN now

My sjPlot-package was updated on CRAN (binaries will be available soon, I guess). This update contains, besides many small improvements and fixes, two major features:

  1. First, new features to print table summaries of linear models and generalized linear models (for sjt.glm, the same new features were added as to sjt.lm – however, the manual page is not finished yet). I have introduced these features in a former posting.
  2. Second, functions for reading data from and writing to other statistical packages like SPSS, SAS or STATA have been revamped or new features have been added. Furthermore, there are improved getters and setters to extract and set variable and value labels. A short introduction is available online.

The haven package

There are two reasons why this update focuses on reading and writing data as well as getting and setting value and variable labels. First, I wanted to rename all functions who formerly had the prefixes sji. or sju. in order to have more “intuitive” function names, so people better understand what these functions may do.

The second reason is the release of the haven package, which supports fast reading and writing from or to different file formats (like SPSS, SAS or STATA). I believe, this package will become frequently used when reading or writing data from/to other formats, so I wanted to ensure compatibility between sjPlot and haven imported data.

The haven package reads data to a data frame where all variables (vectors) are of class type labelled, which means these variables are atomic (i.e. they have numeric values, even if they are categorical or factors, see this introduction on RStudio) and each variable has – where applicable – a variable label and value labels attribute.
An example:

## Class 'labelled'  atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
##   ..- attr(*, "label")= chr "how dependent is the elder?"
##   ..- attr(*, "labels")= Named int [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"

Until recently, the sjPlot package solely used the read.spss function from the foreign package to read data from SPSS. The foreign package uses following structure to import value and variable labels:

##  atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
##  - attr(*, "value.labels")= Named chr [1:4] "1" "2" "3" "4"
##   ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
##  - attr(*, "variable.label")= chr "how dependent is the elder?"

Since version 1.7, sjPlot can also read data using the haven read-functions (simply use my_dataframe <- read_spss("path/to/spss-file.sav", option = "haven")).

These kind of attributes, whether from haven or foreign, provide huge advantages in case you want to plot or print (summaries of) variables and don’t want to manually set axis labels or titles, because you can extract these information from any variable’s attributes. This is one of the core functionality of all sjPlot plotting and table printing functions:

library(sjPlot)
# load sample data
data(efc2)
# set plot theme
sjp.setTheme(theme = "539")
# plot frequencies
sjp.frq(efc2$e42dep)

label_example

The new sjPlot update can now deal with both structures of either haven or foreign imported data. It doesn’t matter whether efc2$e42dep from the above example was read with foreign, or is a labelled class vector from haven.

Also, reading value and variable labels works for both vector types. get_var_labels() and get_val_labels() extract variable and value labels from both haven-data and foreign-data.

Writing data

The constructor of the labelled class only supports creating value labels, not variable labels. Thus, writing data back to SPSS or STATA do not write variable labels by default (at least for new created variables – variables that have been read with haven and already have the variable label attribute label will correctly save back variable labels).

So I wrote a wrapper class to write data, called write_spss and write_stata. These functions convert your data, independent whether it was imported with the foreign or haven package, or if you manually created new variables, into a format that will keep value and variable labels when writing data to SPSS or STATA.

When you create new variables, make sure you use set_val_labels and set_var_labels to add the necessary label attributes to new variables:

# create dummy variable
dummy <- sample(1:4, 40, replace=TRUE)
# manually attach value and variable labels
dummy <- set_val_labels(dummy, c("very low", "low", "mid", "hi"))
dummy <- set_var_labels(dummy, "This is a dummy")
# check structure of dummy
str(dummy)
##  atomic [1:40] 2 2 2 3 3 2 1 4 4 2 ...
##  - attr(*, "value.labels")= Named chr [1:4] "1" "2" "3" "4"
##   ..- attr(*, "names")= chr [1:4] "very low" "low" "mid" "hi"
##  - attr(*, "variable.label")= chr "This is a dummy"

Finally, I just like to mention convenient conversion functions, e.g. to convert atomic variables into factors without losing the label attributes. These are to_fac, to_label or to_value. Further notes on the read and write functions of the sjPlot package are in the online manual.

Reading from and writing to SPSS, SAS and STATA with R #rstats #sjPlot

CRAN download statistics of any packages #rstats

Hadley Wickham announced at Twitter that RStudio now provides CRAN package download logs. I was wondering about the download numbers of my package and wrote some code to extract that information from the logs…

The first code snippet is taken from the log website itself:

# Here's an easy way to get all the URLs in R
start <- as.Date('2013-11-28')
today <- as.Date('2015-03-04')

all_days <- seq(start, today, by = 'day')

year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')

Then I downloaded all files into a folder:

for (i in 1:length(urls)) {
  download.file(urls[i], sprintf("~/Desktop/rstats/temp%i.csv.gz", i))
}

Unzipping did not work with unzip, so I just “opened” all files with the OS X unarchiver, which was quite convenient.

Than I read all csv-files and extracted the information for my package, sjPlot, from each csv-file and merged everything into one data frame:

sjPlot.df <- data.frame()
library(dplyr)
pb <- txtProgressBar(min=0, max=length(urls), style=3)

for (i in 1:length(urls)) {
  df.csv <- read.csv(sprintf("~/Desktop/rstats/temp%i.csv", i))
  pack <- tolower(as.character(df.csv$package))
  my.package <- which(pack == "sjplot")
  if (length(my.package) > 0 ) {
    dummy.df <- df.csv %>% dplyr::slice(my.package) %>% dplyr::select(date, package, version, country)
    sjPlot.df <- dplyr::bind_rows(sjPlot.df, dummy.df)
  }
  setTxtProgressBar(pb, i)
}
close(pb)
sjPlot.df$date.short <- strftime(sjPlot.df$date, format="%Y-%m")

Finally, the download-stats as plot:

library(sjPlot)
library(ggplot2)

mydf <- sjPlot.df %>% dplyr::count(date.short)

sjp.setTheme(theme = "539", axis.angle.x = 90)
ggplot(mydf, aes(x = date.short, y = n)) +
  geom_bar(stat = "identity", width = .5, alpha = .5, fill = "#3399cc") +
  scale_y_continuous(expand = c(0, 0), breaks = seq(250, 1500, 250)) +
  labs(x = sprintf("Monthly CRAN-downloads of sjPlot package since first release until 4th March (total download: %i)", sum(mydf$n)), y = NULL)

sjPlot-downloads

By the way, there’s already a shiny app for this…

CRAN download statistics of any packages #rstats

Beautiful tables for linear model summaries #rstats

Beautiful HTML tables of linear models

In this blog post I’d like to show some (old and) new features of the sjt.lm function from my sjPlot-package. These functions are currently only implemented in the development snapshot on GitHub. A package update is planned to be submitted soon to CRAN.

There are two new major features I added to this function: Comparing models with different predictors (e.g. stepwise regression) and automatic grouping of categorical predictors. There are examples below that demonstrate these features.

The sjt.lm function prints results and summaries of linear models as HTML-table. These tables can be viewed in the RStudio Viewer pane, web browser or easily exported to office applications. See also my former posts on the table printing functions of my package here and here.

Please note: The following tables may look a bit cluttered – this is because I just pasted the HTML-code created by knitr into this blog post, so style sheets may interfere. The original online-manual for this function can be found here.

All following tables can be reproduced with the sjPlot package and the sample data set from this package.

Linear model summaries as HTML table

The sjt.lm function prints summaries of linear models (fitted with the lm function) as nicely formatted html-tables.

Before starting, sample data is loaded and sample models are fitted:

# sample data
data(efc)
# set variable labels
efc <- set_var_labels(efc, get_var_labels(efc))
# fit first model
fit1 <- lm(barthtot ~ c160age + c12hour + c161sex + c172code, data=efc)
# fit second model
fit2 <- lm(neg_c_7 ~ c160age + c12hour + c161sex + c172code, data=efc)
# Note that both models share the same predictors and only differ 
# in their dependent variable. See examples of stepwise models 
# later...

The simplest way of producing the table output is by passing the fitted models as parameter. By default, estimates (B), confidence intervals (CI) and p-values (p) are reported. The models are named Model 1 and Model 2.

sjt.lm(fit1, fit2)
Model 1 Model 2
  B CI p B CI p
(Intercept) 90.06 77.95 – 102.18 < 0.001 8.46 6.67 – 10.24 < 0.001
carer’ age -0.22 -0.36 – -0.08 0.002 0.01 -0.01 – 0.03 0.206
average number of hours of care for the elder in a week -0.28 -0.31 – -0.24 < 0.001 0.02 0.01 – 0.02 < 0.001
carer’s gender -0.26 -4.36 – 3.83 0.900 0.57 -0.03 – 1.17 0.061
carer’s level of education: recoding of variable c172edu1 -0.76 -3.55 – 2.02 0.592 0.44 0.03 – 0.86 0.034
Observations 821 832
R2 / adj. R2 0.270 / 0.266 0.079 / 0.075

Custom labels

You can specify the ‘model’ label via labelDependentVariables parameter:

sjt.lm(fit1,
       fit2,
       labelDependentVariables = c("Barthel-Index",
                                   "Negative Impact"))
Barthel-Index Negative Impact
  B CI p B CI p
(Intercept) 90.06 77.95 – 102.18 < 0.001 8.46 6.67 – 10.24 < 0.001
carer’ age -0.22 -0.36 – -0.08 0.002 0.01 -0.01 – 0.03 0.206
average number of hours of care for the elder in a week -0.28 -0.31 – -0.24 < 0.001 0.02 0.01 – 0.02 < 0.001
carer’s gender -0.26 -4.36 – 3.83 0.900 0.57 -0.03 – 1.17 0.061
carer’s level of education: recoding of variable c172edu1 -0.76 -3.55 – 2.02 0.592 0.44 0.03 – 0.86 0.034
Observations 821 832
R2 / adj. R2 0.270 / 0.266 0.079 / 0.075

More custom labels

Here is an example how to change the other labels. Note that showHeaderStrings makes the two labels on top and top left corner appear in the table.

sjt.lm(fit1,
       fit2,
       showHeaderStrings = TRUE,
       stringB = "Estimate",
       stringCI = "Conf. Int.",
       stringP = "p-value",
       stringDependentVariables = "Response",
       stringPredictors = "Coefficients",
       stringIntercept = "Konstante",
       labelDependentVariables = c("Barthel-Index",
                                   "Negative Impact"))
Coefficients Response
Barthel-Index Negative Impact
  Estimate Conf. Int. p-value Estimate Conf. Int. p-value
Konstante 90.06 77.95 – 102.18 < 0.001 8.46 6.67 – 10.24 < 0.001
carer’ age -0.22 -0.36 – -0.08 0.002 0.01 -0.01 – 0.03 0.206
average number of hours of care for the elder in a week -0.28 -0.31 – -0.24 < 0.001 0.02 0.01 – 0.02 < 0.001
carer’s gender -0.26 -4.36 – 3.83 0.900 0.57 -0.03 – 1.17 0.061
carer’s level of education: recoding of variable c172edu1 -0.76 -3.55 – 2.02 0.592 0.44 0.03 – 0.86 0.034
Observations 821 832
R2 / adj. R2 0.270 / 0.266 0.079 / 0.075

Changing summary style and content

You can change the table style with specific parameters, e.g. to include CI into the same table cell as the estimates, print asterisks instead of numeric p-values etc.

sjt.lm(fit1, fit2,
       separateConfColumn = FALSE, # ci in same cell as estimates
       showStdBeta = TRUE,         # also show standardized beta values
       pvaluesAsNumbers = FALSE)   # "*" instead of numeric values
Model 1 Model 2
  B (CI) std. Beta (CI) B (CI) std. Beta (CI)
(Intercept) 90.06
(77.95 – 102.18) ***
8.46
(6.67 – 10.24) ***
carer’ age -0.22
(-0.36 – -0.08) **
-0.10
(-0.16 – -0.04)
0.01
(-0.01 – 0.03)
0.05
(-0.03 – 0.12)
average number of hours of care for the elder in a week -0.28
(-0.31 – -0.24) ***
-0.48
(-0.54 – -0.41)
0.02
(0.01 – 0.02) ***
0.25
(0.18 – 0.32)
carer’s gender -0.26
(-4.36 – 3.83)
-0.00
(-0.06 – 0.06)
0.57
(-0.03 – 1.17)
0.06
(-0.00 – 0.13)
carer’s level of education: recoding of variable c172edu1 -0.76
(-3.55 – 2.02)
-0.02
(-0.08 – 0.04)
0.44
(0.03 – 0.86) *
0.07
(0.01 – 0.14)
Observations 821 832
R2 / adj. R2 0.270 / 0.266 0.079 / 0.075
Notes * p<0.05   ** p<0.01   *** p<0.001

Custom variable labels

In the above example, the original variable labels are long and not much pretty. You can change variable labels either with set_var_labels (see this page for more detaila), which will affect all future plots and tables, or pass own labels via labelPredictors.

sjt.lm(fit1, fit2,
       labelPredictors = c("Carer's Age",
                           "Hours of Care",
                           "Carer's Sex",
                           "Educational Status"))
Model 1 Model 2
  B CI p B CI p
(Intercept) 90.06 77.95 – 102.18 < 0.001 8.46 6.67 – 10.24 < 0.001
Carer’s Age -0.22 -0.36 – -0.08 0.002 0.01 -0.01 – 0.03 0.206
Hours of Care -0.28 -0.31 – -0.24 < 0.001 0.02 0.01 – 0.02 < 0.001
Carer’s Sex -0.26 -4.36 – 3.83 0.900 0.57 -0.03 – 1.17 0.061
Educational Status -0.76 -3.55 – 2.02 0.592 0.44 0.03 – 0.86 0.034
Observations 821 832
R2 / adj. R2 0.270 / 0.266 0.079 / 0.075

Compare models with different predictors

In some cases, for instance stepwise regressions, you have different predictors on the same response. The proper grouping of predictors, resp. rows, is done automatically.

First, let’s fit some example models.

# fit first model
fit1 <- lm(neg_c_7 ~ c160age + c172code + c161sex, data=efc)
# fit second model
fit2 <- lm(neg_c_7 ~ c160age + c172code + c161sex + c12hour, data=efc)
# fit second model
fit3 <- lm(neg_c_7 ~ c160age + c172code + e42dep + tot_sc_e, data=efc)

Note that printing tables with fitted models, which have different predictors do not automatically detect variable labels (maybe this will be implemented in a future package version).

sjt.lm(fit1, fit2, fit3, 
       separateConfColumn = FALSE,
       showAIC = TRUE,
       showFStat = TRUE)
Model 1 Model 2 Model 3
  B (CI) p B (CI) p B (CI) p
(Intercept) 7.82
(6.00 – 9.65)
< 0.001 8.46
(6.67 – 10.24)
< 0.001 6.23
(4.76 – 7.69)
< 0.001
c160age 0.04
(0.02 – 0.06)
< 0.001 0.01
(-0.01 – 0.03)
0.206 0.01
(-0.01 – 0.03)
0.271
c172code 0.39
(-0.03 – 0.81)
0.071 0.44
(0.03 – 0.86)
0.034 0.24
(-0.15 – 0.64)
0.230
c161sex 0.69
(0.07 – 1.31)
0.028 0.57
(-0.03 – 1.17)
0.061
c12hour 0.02
(0.01 – 0.02)
< 0.001
e42dep 1.50
(1.23 – 1.77)
< 0.001
tot_sc_e 0.21
(0.01 – 0.41)
0.038
Observations 832 832 833
R2 / adj. R2 0.025 / 0.022 0.079 / 0.075 0.153 / 0.148
F-statistics 7.107*** 17.730*** 37.250***
AIC 4611.921 4566.622 4502.333

Automatic grouping of categorical predictors

In case you have categorical variables with more than two factor levels, the sjt.lm function automatically groups the category levels to give a better overview of predictors in the table.

By default, automatic grouping is activated. To disable this feature, use group.pred = FALSE as parameter.

To demonstrate this feature, we first convert two predictors to factors (what they actually are, indeed). To do this, we use the to_fac function, which converts numerical variables into factors, however, does not remove the variable and value label attributes.

# make education categorical
efc$c172code <- to_fac(efc$c172code)
# make dependency categorical
efc$e42dep <- to_fac(efc$e42dep)
# fit first model again (with c172code as factor)
fit1 <- lm(barthtot ~ c160age + c12hour + c172code + c161sex + e42dep, data=efc)
# fit second model again (with c172code as factor)
fit2 <- lm(neg_c_7 ~ c160age + c12hour + c172code + c161sex + e42dep, data=efc)

Now we can print the table.

sjt.lm(fit1, fit2)
Model 1 Model 2
  B CI p B CI p
(Intercept) 97.17 88.37 – 105.97 < 0.001 7.76 5.97 – 9.55 < 0.001
carer’ age -0.06 -0.16 – 0.03 0.203 0.00 -0.02 – 0.02 0.683
average number of hours of care for the elder in a week -0.07 -0.10 – -0.04 < 0.001 0.01 0.00 – 0.01 0.015
carer’s level of education: recoding of variable c172edu1
intermediate level of education 1.50 -1.60 – 4.60 0.343 0.13 -0.50 – 0.76 0.689
high level of education 0.66 -3.20 – 4.52 0.738 0.72 -0.07 – 1.51 0.074
carer’s gender 0.09 -2.74 – 2.93 0.949 0.56 -0.02 – 1.13 0.058
how dependent is the elder? – subjective perception of carer
slightly dependent -7.85 -12.86 – -2.83 0.002 1.11 0.09 – 2.13 0.033
moderately dependent -19.49 -24.42 – -14.57 < 0.001 2.37 1.37 – 3.37 < 0.001
severely dependent -56.87 -62.12 – -51.63 < 0.001 3.92 2.86 – 4.99 < 0.001
Observations 821 832
R2 / adj. R2 0.653 / 0.650 0.160 / 0.152

Removing estimates from the output

With remove.estmates, specific estimates can be removed from the table output. This may make sense in case you have stepwise regression models and only want to compare the varying predictors but not the controls. remove.estmates either accepts the row indices of the rows of the table output that should be removed, or the coefficient’s names.

When using numeric indices, the estimates’ index number relates to the same order as coef(fit). Note that currently the intercept cannot be removed from the model output!

data(efc)
# attach variable labels to each variable of the data
# frame - useful for automatic label detection
efc <- set_var_labels(efc, get_var_labels(efc))
# make education categorical
efc$c172code <- to_fac(efc$c172code)
# make education categorical
efc$e42dep <- to_fac(efc$e42dep)
# make prettier variable labels
efc$c172code <- set_var_labels(efc$c172code, "Education")
efc$e42dep <- set_var_labels(efc$e42dep, "Dependency")
# fit first model
fit1 <- lm(neg_c_7 ~ c160age + c172code + c161sex, data=efc)
# fit second model
fit2 <- lm(neg_c_7 ~ c160age + c172code + c161sex + c12hour, data=efc)
# fit third model
fit3 <- lm(neg_c_7 ~ c160age + c172code + e42dep + tot_sc_e, data=efc)

Example 1: Complete table output

Here you have the complete table output. This helps you identify the row index numbers. Especially when you have multiple models with different predictors, the estimate’s position in the last model may differ from this estimate’s position in the table output.

sjt.lm(fit1, fit2, fit3)
Model 1 Model 2 Model 3
  B CI p B CI p B CI p
(Intercept) 8.40 6.72 – 10.08 < 0.001 9.18 7.53 – 10.83 < 0.001 8.48 6.99 – 9.97 < 0.001
c160age 0.04 0.02 – 0.06 < 0.001 0.01 -0.01 – 0.03 0.306 0.01 -0.01 – 0.03 0.384
Education
c172code2 0.16 -0.52 – 0.83 0.652 0.12 -0.54 – 0.78 0.728 0.08 -0.56 – 0.72 0.806
c172code3 0.79 -0.05 – 1.64 0.066 0.91 0.09 – 1.74 0.030 0.52 -0.28 – 1.32 0.203
c161sex 0.70 0.09 – 1.32 0.025 0.59 -0.01 – 1.19 0.053
c12hour 0.02 0.01 – 0.02 < 0.001
Dependency
e42dep2 1.18 0.16 – 2.20 0.024
e42dep3 2.53 1.53 – 3.52 < 0.001
e42dep4 4.32 3.31 – 5.33 < 0.001
tot_sc_e 0.21 0.01 – 0.41 0.042
Observations 832 832 833
R2 / adj. R2 0.026 / 0.021 0.081 / 0.075 0.154 / 0.147

Example 2: Remove first coefficient (after intercept)

sjt.lm(fit1, fit2, fit3,
       remove.estimates = 2)
Model 1 Model 2 Model 3
  B CI p B CI p B CI p
(Intercept) 8.40 6.72 – 10.08 < 0.001 9.18 7.53 – 10.83 < 0.001 8.48 6.99 – 9.97 < 0.001
Education
c172code2 0.16 -0.52 – 0.83 0.652 0.12 -0.54 – 0.78 0.728 0.08 -0.56 – 0.72 0.806
c172code3 0.79 -0.05 – 1.64 0.066 0.91 0.09 – 1.74 0.030 0.52 -0.28 – 1.32 0.203
c161sex 0.70 0.09 – 1.32 0.025 0.59 -0.01 – 1.19 0.053
c12hour 0.02 0.01 – 0.02 < 0.001
Dependency
e42dep2 1.18 0.16 – 2.20 0.024
e42dep3 2.53 1.53 – 3.52 < 0.001
e42dep4 4.32 3.31 – 5.33 < 0.001
tot_sc_e 0.21 0.01 – 0.41 0.042
Observations 832 832 833
R2 / adj. R2 0.026 / 0.021 0.081 / 0.075 0.154 / 0.147

Example 3: Remove age and sex

sjt.lm(fit1, fit2, fit3,
       remove.estimates = c("c160age", "c161sex"))
Model 1 Model 2 Model 3
  B CI p B CI p B CI p
(Intercept) 8.40 6.72 – 10.08 < 0.001 9.18 7.53 – 10.83 < 0.001 8.48 6.99 – 9.97 < 0.001
Education
c172code2 0.16 -0.52 – 0.83 0.652 0.12 -0.54 – 0.78 0.728 0.08 -0.56 – 0.72 0.806
c172code3 0.79 -0.05 – 1.64 0.066 0.91 0.09 – 1.74 0.030 0.52 -0.28 – 1.32 0.203
c12hour 0.02 0.01 – 0.02 < 0.001
Dependency
e42dep2 1.18 0.16 – 2.20 0.024
e42dep3 2.53 1.53 – 3.52 < 0.001
e42dep4 4.32 3.31 – 5.33 < 0.001
tot_sc_e 0.21 0.01 – 0.41 0.042
Observations 832 832 833
R2 / adj. R2 0.026 / 0.021 0.081 / 0.075 0.154 / 0.147

Example 4: Remove many esimates

sjt.lm(fit1, fit2, fit3,
       remove.estimates = c(2,5,6,10))
Model 1 Model 2 Model 3
  B CI p B CI p B CI p
(Intercept) 8.40 6.72 – 10.08 < 0.001 9.18 7.53 – 10.83 < 0.001 8.48 6.99 – 9.97 < 0.001
Education
c172code2 0.16 -0.52 – 0.83 0.652 0.12 -0.54 – 0.78 0.728 0.08 -0.56 – 0.72 0.806
c172code3 0.79 -0.05 – 1.64 0.066 0.91 0.09 – 1.74 0.030 0.52 -0.28 – 1.32 0.203
Dependency
e42dep2 1.18 0.16 – 2.20 0.024
e42dep3 2.53 1.53 – 3.52 < 0.001
e42dep4 4.32 3.31 – 5.33 < 0.001
Observations 832 832 833
R2 / adj. R2 0.026 / 0.021 0.081 / 0.075 0.154 / 0.147

Example 5: Custom predictor labels

In most cases you need to define your own labels when removing estimates, especially when you have grouped categorical predictors, because automatic label detection in quite tricky in such situations. If you provide own labels, please note that grouped predictors’ headings (the variable name of the grouped, categorical variable) are still automatically set by the sjt.lm function (variable labels are used, so use set_var_labels for those categorical predictors). All data rows in the table, i.e. for each coefficient appearing in the model, you need to specify a label string.

In the next example, we have seven table rows with data (excluding intercept): mid and hi education (categories of the variable Education), Hours of Care, slight, moderate and severe dependency (categories of the variable Dependency) and Service Usage. These ‘rows’ need to be labelled.

sjt.lm(fit1, fit2, fit3,
       labelPredictors = c("mid education", 
                           "hi education", 
                           "Hours of Care", 
                           "slight dependency", 
                           "moderate dependency", 
                           "severe dependency", 
                           "Service Usage"),
       remove.estimates = c("c160age", "c161sex"))
Model 1 Model 2 Model 3
  B CI p B CI p B CI p
(Intercept) 8.40 6.72 – 10.08 < 0.001 9.18 7.53 – 10.83 < 0.001 8.48 6.99 – 9.97 < 0.001
Education
mid education 0.16 -0.52 – 0.83 0.652 0.12 -0.54 – 0.78 0.728 0.08 -0.56 – 0.72 0.806
hi education 0.79 -0.05 – 1.64 0.066 0.91 0.09 – 1.74 0.030 0.52 -0.28 – 1.32 0.203
Hours of Care 0.02 0.01 – 0.02 < 0.001
Dependency
slight dependency 1.18 0.16 – 2.20 0.024
moderate dependency 2.53 1.53 – 3.52 < 0.001
severe dependency 4.32 3.31 – 5.33 < 0.001
Service Usage 0.21 0.01 – 0.41 0.042
Observations 832 832 833
R2 / adj. R2 0.026 / 0.021 0.081 / 0.075 0.154 / 0.147
Beautiful tables for linear model summaries #rstats

sjPlot package and related online manuals updated #rstats # ggplot

My sjPlot package for data visualization has just been updated on CRAN. I’ve added some features to existing function, which I want to introduce here.

Plotting linear models

So far, plotting model assumptions of linear models or plotting slopes for each estimate of linear models were spread over several functions. Now, these plot types have been integrated into the sjp.lm function, where you can select the plot type with the type parameter. Furthermore, plotting standardized coefficients now also plot the related confidence intervals.

Detailed examples can be found here:
www.strengejacke.de/sjPlot/sjp.lm

Plotting generalized linear models

Beside odds ratios, you now can also plot the predicted probabilities of the outcome for each predictor of generalized linear models. In case you have continuous variables, these kind of plots may be more intuitive than an odds ratio value.

Detailed examples can be found here:
www.strengejacke.de/sjPlot/sjp.glm

Plotting (generalized) linear mixed effects models

The plotting function for creating plots of (generalized) linear mixed effects models (sjp.lmer and sjp.glmer) also got new plot types over the course of the last weeks.

For sjp.lmer, we have

  • re (default) for estimates of random effects
  • fe for estimates of fixed effects
  • fe.std for standardized estimates of fixed effects
  • fe.cor for correlation matrix of fixed effects
  • re.qq for a QQ-plot of random effects (random effects quantiles against standard normal quantiles)
  • fe.ri for fixed effects slopes depending on the random intercept.

and for sjp.glmer, we have

  • re (default) for odds ratios of random effects
  • fe for odds ratios of fixed effects
  • fe.cor for correlation matrix of fixed effects
  • re.qq for a QQ-plot of random effects (random effects quantiles against standard normal quantiles)
  • fe.pc or fe.prob to plot probability curves (predicted probabilities) of all fixed effects coefficients. Use facet.grid to decide whether to plot each coefficient as separate plot or as integrated faceted plot.
  • ri.pc or ri.prob to plot probability curves (predicted probabilities) of random intercept variances for all fixed effects coefficients. Use facet.grid to decide whether to plot each coefficient as separate plot or as integrated faceted plot.

Detailed examples can be found here:
www.strengejacke.de/sjPlot/sjp.lmer and www.strengejacke.de/sjPlot/sjp.glmer

Plotting interaction terms of (generalized) linear (mixed effects) models

Another function, where new features were added, is sjp.int (formerly known as sjp.lm.int). This function is now kind of generic and can plot interactions of

  • linar models (lm)
  • generalized linar models (glm)
  • linar mixed effects models (lme4::lmer)
  • generalized linar mixed effects models (lme4::glmer)

For linear models (both normal and mixed effects), slopes of interaction terms are plotted. For generalized linear models, the predicted probabilities of the outcome towards the interaction terms is plotted.

Detailed examples can be found here:
www.strengejacke.de/sjPlot/sjp.int

Plotting Likert scales

Finally, a comprehensive documentation for the sjp.likert function is finsihed, which can be found here:
www.strengejacke.de/sjPlot/sjp.likert

sjPlot package and related online manuals updated #rstats # ggplot

Systems Theory and the Sociology of Health and Illness #Luhmann

Today I got my copy from Systems Theory and the Sociology of Health and Illness – Observing Healthcare, a book edited by Morten Knudsen and Werner Vogd. The book contains very interesting chapters on this topic, and though it is not so cheap for private purchasing, you may consider ordering this book through your library.

Here is a short abstract of my contribution, Sustainability in Integrated Care Partnerships: a systems and network theoretical approach to analyse co-operation networks. The chapter highlights the importance and role of networks in the stabilization of meaningful arrangements between different contextures that refer to the basic conflict between medical and economic systems. Control mechanisms, which are applied to arrange these polycontextural conditions, have been analysed. Qualitative interviews were conducted with actors involved integrated care partnerships. Results show that loosely linked networks can take on important control functions in the sustainable balancing of financial, nursing and medical demands. Too tightly forged links prevent dynamic balancing and relating of different contextures. In such cases, networks can lose beneficial self-regulation capacities to eliminate drawbacks.

Systems Theory and the Sociology of Health and Illness #Luhmann

Visualizing (generalized) linear mixed effects models, part 2 #rstats #lme4

In the first part on visualizing (generalized) linear mixed effects models, I showed examples of the new functions in the sjPlot package to visualize fixed and random effects (estimates and odds ratios) of (g)lmer results. Meanwhile, I added further features to the functions, which I like to introduce here. This posting is based on the online manual of the sjPlot package.

In this posting, I’d like to give examples for diagnostic and probability plots of odds ratios. The latter examples, of course, only refer to the sjp.glmer function (generalized mixed models). To reproduce these examples, you need the version 1.59 (or higher) of the package, which can be found at GitHub. A submission to CRAN is planned for the next days…

Fitting example models

The following examples are based on two fitted mixed models:

# fit model
library(lme4)
# create binary response
sleepstudy$Reaction.dicho <- sju.dicho(sleepstudy$Reaction, 
                                       dichBy = "md")
# fit first model
fit <- glmer(Reaction.dicho ~ Days + (Days | Subject),
             sleepstudy,
             family = binomial("logit"))

data(efc)
# create binary response
efc$hi_qol <- sju.dicho(efc$quol_5)
# prepare group variable
efc$grp = as.factor(efc$e15relat)
levels(x = efc$grp) <- sji.getValueLabels(efc$e15relat)
# data frame for 2nd fitted model
mydf <- na.omit(data.frame(hi_qol = as.factor(efc$hi_qol),
                           sex = as.factor(efc$c161sex),
                           c12hour = as.numeric(efc$c12hour),
                           neg_c_7 = as.numeric(efc$neg_c_7),
                           grp = efc$grp))
# fit 2nd model
fit2 <- glmer(hi_qol ~ sex + c12hour + neg_c_7 + (1|grp),
              data = mydf,
              family = binomial("logit"))

Summary fit1

Formula: Reaction.dicho ~ Days + (Days | Subject)
   Data: sleepstudy

     AIC      BIC   logLik deviance df.resid 
   158.7    174.7    -74.4    148.7      175 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.2406 -0.2726 -0.0198  0.2766  2.9705 

Random effects:
 Groups  Name        Variance Std.Dev. Corr 
 Subject (Intercept) 8.0287   2.8335        
         Days        0.2397   0.4896   -0.19
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.8159     1.1728  -3.254 0.001139 ** 
Days          0.8908     0.2347   3.796 0.000147 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
     (Intr)
Days -0.694

Summary fit2

Formula: hi_qol ~ sex + c12hour + neg_c_7 + (1 | grp)
   Data: mydf

     AIC      BIC   logLik deviance df.resid 
  1065.3   1089.2   -527.6   1055.3      881 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.7460 -0.8139 -0.2688  0.7706  6.6464 

Random effects:
 Groups Name        Variance Std.Dev.
 grp    (Intercept) 0.08676  0.2945  
Number of obs: 886, groups:  grp, 8

Fixed effects:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.179036   0.333940   9.520  < 2e-16 ***
sex2        -0.545282   0.178974  -3.047  0.00231 ** 
c12hour     -0.005148   0.001720  -2.992  0.00277 ** 
neg_c_7     -0.219586   0.024108  -9.109  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
        (Intr) sex2   c12hor
sex2    -0.410              
c12hour -0.057 -0.048       
neg_c_7 -0.765 -0.009 -0.116

Diagnostic plots

Two new functions are added to both sjp.lmer and sjp.glmer, hence they apply to linear and generalized linear mixed models, fitted with the lme4 package. The examples only refer to the sjp.glmer function.

Currently, there are two type options to plot diagnostic plots: type = "fe.cor" to plot a correlation matrix between fixed effects and type = "re.qq" to plot a qq-plot of random effects.

Correlation matrix of fixed effects

To plot a correlation matrix of the fixed effects, use type = "fe.cor".

# plot fixed effects correlation matrix
sjp.glmer(fit2, type = "fe.cor")

unnamed-chunk-11-1

qq-plot of random effects

Another diagnostic plot is the qq-plot for random effects. Use type = "re.qq" to plot random against standard quantiles. The dots should be plotted along the line.

# plot qq-plot of random effects
sjp.glmer(fit, type = "re.qq")

unnamed-chunk-13-1

Probability curves of odds ratios

These plotting functions have been implemented to easier interprete odds ratios, especially for continuous covariates, by plotting the probabilities of predictors.

Probabilities of fixed effects

With type = "fe.pc" (or type = "fe.prob"), probability plots for each covariate can be plotted. These probabilties are based on the fixed effects intercept. One plot per covariate is plotted.

The model fit2 has one binary and two continuous covariates:

# plot probability curve of fixed effects
sjp.glmer(fit2, type = "fe.pc")

unnamed-chunk-9-1

Probabilities of fixed effects depending on grouping level (random intercept)

With type = "ri.pc" (or type = "ri.prob"), probability plots for each covariate can be plotted, depending on the grouping level from the random intercept. Thus, for each covariate a plot for each grouping levels is plotted. Furthermore, with the show.se the standard error of probabilities can be shown. In this example, only the plot for one covariate is shown, not for all.

# plot probability curves for each covariate
# grouped by random intercepts
sjp.glmer(fit2,
          type = "ri.pc",
          show.se = TRUE)

unnamed-chunk-8-2

Instead of faceting plots, all grouping levels can be shown in one plot:

# plot probability curves for each covariate
# grouped by random intercepts
sjp.glmer(fit2,
          type = "ri.pc",
          facet.grid = FALSE)

unnamed-chunk-10-2

Outlook

These will be the new features for the next package update. For later updates, I’m also planning to plot interaction terms of (generalized) linear mixed models, similar to the existing function for visualizing interaction terms in linear models.

Visualizing (generalized) linear mixed effects models, part 2 #rstats #lme4

Patient centredness in integrated care (from systems theoretical perspective) #Luhmann #Systemstheory

My paper Patient centredness in integrated care: results of a qualitative study based on a systems theoretical framework has just been published in the International Journal of Integrated Care. It’s an open acces journal, so there’s no paywall to read it.
Let me provide you the abstract:

Introduction: Health care providers seek to improve patient-centred care. Due to fragmentation of services, this can only be achieved by establishing integrated care partnerships. The challenge is both to control costs while enhancing the quality of care and to coordinate this process in a setting with many organisations involved. The problem is to establish control mechanisms, which ensure sufficiently consideration of patient centredness.

Theory and methods: Seventeen qualitative interviews have been conducted in hospitals of metropolitan areas in northern Germany. The documentary method, embedded into a systems theoretical framework, was used to describe and analyse the data and to provide an insight into the specific perception of organisational behaviour in integrated care.

Results: The findings suggest that integrated care partnerships rely on networks based on professional autonomy in the context of reliability. The relationships of network partners are heavily based on informality. This correlates with a systems theoretical conception of organisations, which are assumed autonomous in their decision-making.

Conclusion and discussion: Networks based on formal contracts may restrict professional autonomy and competition. Contractual bindings that suppress the competitive environment have negative consequences for patient-centred care. Drawbacks remain due to missing self-regulation of the network. To conclude, less regimentation of integrated care partnerships is recommended.

The full text is also available as PDF file.

Patient centredness in integrated care (from systems theoretical perspective) #Luhmann #Systemstheory

Visualizing (generalized) linear mixed effects models with ggplot #rstats #lme4

In the past week, colleagues of mine and me started using the lme4-package to compute multi level models. This inspired me doing two new functions for visualizing random effects (as retrieved by ranef()) and fixed effects (as retrieved by fixef()) of (generalized) linear mixed effect models.

The upcoming version of my sjPlot package will contain two new functions to plot fitted lmer and glmer models from the lme4 package: sjp.lmer and sjp.glmer (not that surprising function names). Since I’m new to mixed effects models, I would appreciate any suggestions on how to improve the functions, which results are important to report (plot) and so on. Furthermore, I’m not sure whether my approach of computing confident intervals for random effects is the best?

I have used following code to compute confident intervals for the estimates returned by the lme4::ranef() function (bases on this stackoverflow answer):

coev <- as.matrix(lme4::vcov.merMod(fit))
tmp <- as.data.frame(cbind(OR = exp(mydf.ef[,i]),
                     lower.CI = exp(mydf.ef[,i] - (1.96 * sqrt(diag(coev))[i])),
                     upper.CI = exp(mydf.ef[,i] + (1.96 * sqrt(diag(coev))[i]))))

The update to version 1.6 of sjPlot is still in development (feature-freeze, mostly fixes now), however, you can download the latest snapshot from GitHub (see also this post for further information). Now to some examples. First, an example model is fitted and the random effects (default) for each predictor are plotted as “forest plot”:

# fit model
library(lme4)
fit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
# simple plot
sjp.lmer(fit)

lmer1

Sorting a predictor (i.e. estimates of a facet) is done by specifying the predictor’s name as sort parameter.

sjp.lmer(fit, sort = "Days")

lmer2

Each facet plot can also be plotted as single plot, when facet.grid is set to FALSE. In this case, it is possible to sort the estimates for each plots. See following example from the sjp.glmer function:

library(lme4)
# create binary response
sleepstudy$Reaction.dicho <- sju.dicho(sleepstudy$Reaction, 
                                       dichBy = "md")
# fit model
fit <- glmer(Reaction.dicho ~ Days + (Days | Subject),
             sleepstudy,
             family = binomial("logit"))
sjp.setTheme(theme = "forest")
sjp.glmer(fit, 
          facet.grid = FALSE, 
          sort = "sort.all")

glmer1 glmer2

Plotting the fixed effects is not much spectacular, because we only have one estimate beside intercept here.

sjp.glmer(fit, 
          type = "fe", 
          sort = TRUE)

glmer3

To summarize, you can plot random and fixed effects in the way as shown above. Are there any other or better plot options for visualizing mixed effects models?

Any suggestions are welcome…

Disclaimer: all misspellings belong to Safari’s autocorrect feature!

Visualizing (generalized) linear mixed effects models with ggplot #rstats #lme4

sjPlot 1.6 – major revisions, anyone for beta testing? #rstats

In the last couple of weeks I have rewritten some core parts of my sjPlot-package and also revised the package- and online documentation.

Most notably are the changes that affect theming and appearance of plots and figures. There’s a new function called sjp.setTheme which now sets theme-options for all sjp-functions, which means

  1. you only need to specify theme / appearance option once and no longer need to repeat these parameter for each sjp-function call you make
  2. due to this change, all sjp-functions have much less parameters, making the functions and documentation clearer

Furthermore, due to some problems with connecting / updating to the RPubs server, I decided to upload my online documentation for the package to my own site. You will now find the latest, comprehensive documentation and examples for various functions of the sjPlot package at www.strengejacke.de/sjPlot/. For instance, take a look at customizing plot appearance and see how the new theming feature of the package allows both easier customization of plots as well as better integration of theming packages like ggthemr or ggthemes.

Updating the sjPlot package to CRAN is planned soon, however, I kindly ask you to test the current development snapshot, which is hosted on GitHub. You can easily install the package from there using the devtools-package (devtools::install_github("devel", "sjPlot")). The current snapshot is (very) stable and I appreciate any feedbacks or bug reports (if possible, use the issue tracker from GitHub).

The current change log with all new function, changes and bug fixes can also be found on GitHub.

sjPlot 1.6 – major revisions, anyone for beta testing? #rstats