Reading from and writing to SPSS, SAS and STATA with R #rstats #sjPlot

On CRAN now

My sjPlot-package was updated on CRAN (binaries will be available soon, I guess). This update contains, besides many small improvements and fixes, two major features:

  1. First, new features to print table summaries of linear models and generalized linear models (for sjt.glm, the same new features were added as for sjt.lm – however, the manual page is not finished yet). I introduced these features in a previous post.
  2. Second, functions for reading data from and writing to other statistical packages like SPSS, SAS or STATA have been revamped or new features have been added. Furthermore, there are improved getters and setters to extract and set variable and value labels. A short introduction is available online.

The haven package

There are two reasons why this update focuses on reading and writing data as well as getting and setting value and variable labels. First, I wanted to rename all functions that formerly had the prefixes sji. or sju., in order to have more “intuitive” function names, so people can better understand what these functions do.

The second reason is the release of the haven package, which supports fast reading of and writing to different file formats (like SPSS, SAS or STATA). I believe this package will become widely used for reading and writing data from/to other formats, so I wanted to ensure compatibility between sjPlot and haven-imported data.

The haven package reads data into a data frame where all variables (vectors) are of class labelled, which means these variables are atomic (i.e. they have numeric values, even if they represent categories or factors; see this introduction on RStudio) and each variable has – where applicable – a variable label and a value labels attribute.
An example:

## Class 'labelled'  atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
##   ..- attr(*, "label")= chr "how dependent is the elder?"
##   ..- attr(*, "labels")= Named int [1:4] 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"

Until recently, the sjPlot package solely used the read.spss function from the foreign package to read data from SPSS. The foreign package uses the following structure for imported value and variable labels:

##  atomic [1:908] 3 3 3 4 4 4 4 4 4 4 ...
##  - attr(*, "value.labels")= Named chr [1:4] "1" "2" "3" "4"
##   ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
##  - attr(*, "variable.label")= chr "how dependent is the elder?"

Since version 1.7, sjPlot can also read data using the haven read-functions (simply use my_dataframe <- read_spss("path/to/spss-file.sav", option = "haven")).

These kinds of attributes, whether from haven or foreign, provide huge advantages when you want to plot or print (summaries of) variables and don’t want to set axis labels or titles manually, because this information can be extracted from each variable’s attributes. This is one of the core features of all sjPlot plotting and table printing functions:

library(sjPlot)
# load sample data
data(efc2)
# set plot theme
sjp.setTheme(theme = "539")
# plot frequencies
sjp.frq(efc2$e42dep)

(Figure: frequency plot of efc2$e42dep, with axis labels and title taken from the label attributes)

The new sjPlot update can now deal with both structures, i.e. with data imported via either haven or foreign. It doesn’t matter whether efc2$e42dep from the above example was read with foreign or is a labelled-class vector from haven.

Also, reading value and variable labels works for both vector types. get_var_labels() and get_val_labels() extract variable and value labels from both haven-data and foreign-data.
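
For instance, a minimal sketch using the sample data shipped with the package (applying get_val_labels() to a single vector is my assumption of its usage):

library(sjPlot)
data(efc2)
# variable labels of all variables in the data frame
get_var_labels(efc2)
# value labels of a single variable (assumed usage, see above)
get_val_labels(efc2$e42dep)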

Writing data

The constructor of the labelled class only supports creating value labels, not variable labels. Thus, writing data back to SPSS or STATA does not include variable labels by default (at least for newly created variables – variables that were read with haven and already have the variable label attribute label will correctly keep their variable labels when saved).
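
To illustrate, a small sketch of what the constructor roughly looks like – the exact signature is an assumption based on the haven documentation at the time of writing:

library(haven)
# a labelled vector gets value labels via a named vector;
# there is no argument for a variable label (assumption, see above)
x <- labelled(sample(1:4, 20, replace = TRUE),
              c("independent" = 1, "slightly dependent" = 2,
                "moderately dependent" = 3, "severely dependent" = 4))
str(x)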

So I wrote wrapper functions to write data, called write_spss and write_stata. These functions convert your data – regardless of whether it was imported with the foreign or haven package or whether you created new variables manually – into a format that keeps value and variable labels when writing data to SPSS or STATA.
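
A short sketch of the intended usage – the file paths are just placeholders and the argument order is an assumption, so check the function documentation:

library(sjPlot)
data(efc)
# write a data frame, keeping value and variable labels
write_spss(efc, "path/to/efc.sav")
write_stata(efc, "path/to/efc.dta")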

When you create new variables, make sure you use set_val_labels and set_var_labels to add the necessary label attributes to new variables:

# create dummy variable
dummy <- sample(1:4, 40, replace=TRUE)
# manually attach value and variable labels
dummy <- set_val_labels(dummy, c("very low", "low", "mid", "hi"))
dummy <- set_var_labels(dummy, "This is a dummy")
# check structure of dummy
str(dummy)
##  atomic [1:40] 2 2 2 3 3 2 1 4 4 2 ...
##  - attr(*, "value.labels")= Named chr [1:4] "1" "2" "3" "4"
##   ..- attr(*, "names")= chr [1:4] "very low" "low" "mid" "hi"
##  - attr(*, "variable.label")= chr "This is a dummy"

Finally, I’d just like to mention the convenient conversion functions, e.g. for converting atomic variables into factors without losing the label attributes. These are to_fac, to_label and to_value. Further notes on the read and write functions of the sjPlot package can be found in the online manual.
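
A short sketch with the dummy variable from the example above; my understanding is that to_label converts values into their associated value labels, while to_value converts a factor back into a numeric vector:

# convert the atomic vector into a factor, keeping the label attributes
dummy.f <- to_fac(dummy)
# values replaced by their value labels (assumption, see above)
dummy.l <- to_label(dummy)
# factor converted back into numeric values (assumption, see above)
dummy.n <- to_value(dummy.f)
str(dummy.f)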


CRAN download statistics of any packages #rstats

Hadley Wickham announced on Twitter that RStudio now provides CRAN package download logs. I was wondering about the download numbers of my package and wrote some code to extract that information from the logs…

The first code snippet is taken from the log website itself:

# Here's an easy way to get all the URLs in R
start <- as.Date('2013-11-28')
today <- as.Date('2015-03-04')

all_days <- seq(start, today, by = 'day')

year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')

Then I downloaded all files into a folder:

for (i in 1:length(urls)) {
  download.file(urls[i], sprintf("~/Desktop/rstats/temp%i.csv.gz", i))
}

Unzipping did not work with unzip, so I just “opened” all files with the OS X unarchiver, which was quite convenient.
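
If you prefer to stay within R, the compressed files can also be read directly via a gzfile() connection – a small sketch of this alternative (not what I did here):

# read a gzipped csv directly, without unpacking it first
df.csv <- read.csv(gzfile("~/Desktop/rstats/temp1.csv.gz"))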

Then I read all csv-files, extracted the information for my package, sjPlot, from each csv-file and merged everything into one data frame:

sjPlot.df <- data.frame()
library(dplyr)
pb <- txtProgressBar(min=0, max=length(urls), style=3)

for (i in 1:length(urls)) {
  df.csv <- read.csv(sprintf("~/Desktop/rstats/temp%i.csv", i))
  pack <- tolower(as.character(df.csv$package))
  my.package <- which(pack == "sjplot")
  if (length(my.package) > 0 ) {
    dummy.df <- df.csv %>% dplyr::slice(my.package) %>% dplyr::select(date, package, version, country)
    sjPlot.df <- dplyr::bind_rows(sjPlot.df, dummy.df)
  }
  setTxtProgressBar(pb, i)
}
close(pb)
sjPlot.df$date.short <- strftime(sjPlot.df$date, format="%Y-%m")

Finally, the download-stats as plot:

library(sjPlot)
library(ggplot2)

mydf <- sjPlot.df %>% dplyr::count(date.short)

sjp.setTheme(theme = "539", axis.angle.x = 90)
ggplot(mydf, aes(x = date.short, y = n)) +
  geom_bar(stat = "identity", width = .5, alpha = .5, fill = "#3399cc") +
  scale_y_continuous(expand = c(0, 0), breaks = seq(250, 1500, 250)) +
  labs(x = sprintf("Monthly CRAN-downloads of sjPlot package since first release until 4th March (total download: %i)", sum(mydf$n)), y = NULL)

(Figure: bar chart of monthly CRAN downloads of the sjPlot package)

By the way, there’s already a shiny app for this…


Beautiful tables for linear model summaries #rstats

Beautiful HTML tables of linear models

In this blog post I’d like to show some (old and) new features of the sjt.lm function from my sjPlot-package. These features are currently only implemented in the development snapshot on GitHub. A package update is planned to be submitted to CRAN soon.

There are two new major features I added to this function: Comparing models with different predictors (e.g. stepwise regression) and automatic grouping of categorical predictors. There are examples below that demonstrate these features.

The sjt.lm function prints results and summaries of linear models as HTML tables. These tables can be viewed in the RStudio viewer pane or a web browser, or easily exported to office applications. See also my former posts on the table printing functions of my package here and here.

Please note: The following tables may look a bit cluttered – this is because I just pasted the HTML-code created by knitr into this blog post, so style sheets may interfere. The original online-manual for this function can be found here.

All following tables can be reproduced with the sjPlot package and the sample data set from this package.

Linear model summaries as HTML table

The sjt.lm function prints summaries of linear models (fitted with the lm function) as nicely formatted html-tables.

Before starting, sample data is loaded and sample models are fitted:

# sample data
data(efc)
# set variable labels
efc <- set_var_labels(efc, get_var_labels(efc))
# fit first model
fit1 <- lm(barthtot ~ c160age + c12hour + c161sex + c172code, data=efc)
# fit second model
fit2 <- lm(neg_c_7 ~ c160age + c12hour + c161sex + c172code, data=efc)
# Note that both models share the same predictors and only differ 
# in their dependent variable. See examples of stepwise models 
# later...

The simplest way of producing the table output is by passing the fitted models as parameters. By default, estimates (B), confidence intervals (CI) and p-values (p) are reported. The models are named Model 1 and Model 2.

sjt.lm(fit1, fit2)
Model 1 Model 2
  B CI p B CI p
(Intercept) 90.06 77.95 – 102.18 < 0.001 8.46 6.67 – 10.24 < 0.001
carer’ age -0.22 -0.36 – -0.08 0.002 0.01 -0.01 – 0.03 0.206
average number of hours of care for the elder in a week -0.28 -0.31 – -0.24 < 0.001 0.02 0.01 – 0.02 < 0.001
carer’s gender -0.26 -4.36 – 3.83 0.900 0.57 -0.03 – 1.17 0.061
carer’s level of education: recoding of variable c172edu1 -0.76 -3.55 – 2.02 0.592 0.44 0.03 – 0.86 0.034
Observations 821 832
R2 / adj. R2 0.270 / 0.266 0.079 / 0.075

Custom labels

You can specify the ‘model’ label via the labelDependentVariables parameter:

sjt.lm(fit1,
       fit2,
       labelDependentVariables = c("Barthel-Index",
                                   "Negative Impact"))
Barthel-Index Negative Impact
  B CI p B CI p
(Intercept) 90.06 77.95 – 102.18 < 0.001 8.46 6.67 – 10.24 < 0.001
carer’ age -0.22 -0.36 – -0.08 0.002 0.01 -0.01 – 0.03 0.206
average number of hours of care for the elder in a week -0.28 -0.31 – -0.24 < 0.001 0.02 0.01 – 0.02 < 0.001
carer’s gender -0.26 -4.36 – 3.83 0.900 0.57 -0.03 – 1.17 0.061
carer’s level of education: recoding of variable c172edu1 -0.76 -3.55 – 2.02 0.592 0.44 0.03 – 0.86 0.034
Observations 821 832
R2 / adj. R2 0.270 / 0.266 0.079 / 0.075

More custom labels

Here is an example of how to change the other labels. Note that showHeaderStrings makes the two labels at the top and in the top-left corner appear in the table.

sjt.lm(fit1,
       fit2,
       showHeaderStrings = TRUE,
       stringB = "Estimate",
       stringCI = "Conf. Int.",
       stringP = "p-value",
       stringDependentVariables = "Response",
       stringPredictors = "Coefficients",
       stringIntercept = "Konstante",
       labelDependentVariables = c("Barthel-Index",
                                   "Negative Impact"))
Coefficients Response
Barthel-Index Negative Impact
  Estimate Conf. Int. p-value Estimate Conf. Int. p-value
Konstante 90.06 77.95 – 102.18 < 0.001 8.46 6.67 – 10.24 < 0.001
carer’ age -0.22 -0.36 – -0.08 0.002 0.01 -0.01 – 0.03 0.206
average number of hours of care for the elder in a week -0.28 -0.31 – -0.24 < 0.001 0.02 0.01 – 0.02 < 0.001
carer’s gender -0.26 -4.36 – 3.83 0.900 0.57 -0.03 – 1.17 0.061
carer’s level of education: recoding of variable c172edu1 -0.76 -3.55 – 2.02 0.592 0.44 0.03 – 0.86 0.034
Observations 821 832
R2 / adj. R2 0.270 / 0.266 0.079 / 0.075

Changing summary style and content

You can change the table style with specific parameters, e.g. to include the CI in the same table cell as the estimates, print asterisks instead of numeric p-values, etc.

sjt.lm(fit1, fit2,
       separateConfColumn = FALSE, # ci in same cell as estimates
       showStdBeta = TRUE,         # also show standardized beta values
       pvaluesAsNumbers = FALSE)   # "*" instead of numeric values
Model 1 Model 2
  B (CI) std. Beta (CI) B (CI) std. Beta (CI)
(Intercept) 90.06 (77.95 – 102.18) *** 8.46 (6.67 – 10.24) ***
carer’ age -0.22 (-0.36 – -0.08) ** -0.10 (-0.16 – -0.04) 0.01 (-0.01 – 0.03) 0.05 (-0.03 – 0.12)
average number of hours of care for the elder in a week -0.28 (-0.31 – -0.24) *** -0.48 (-0.54 – -0.41) 0.02 (0.01 – 0.02) *** 0.25 (0.18 – 0.32)
carer’s gender -0.26 (-4.36 – 3.83) -0.00 (-0.06 – 0.06) 0.57 (-0.03 – 1.17) 0.06 (-0.00 – 0.13)
carer’s level of education: recoding of variable c172edu1 -0.76 (-3.55 – 2.02) -0.02 (-0.08 – 0.04) 0.44 (0.03 – 0.86) * 0.07 (0.01 – 0.14)
Observations 821 832
R2 / adj. R2 0.270 / 0.266 0.079 / 0.075
Notes * p<0.05   ** p<0.01   *** p<0.001

Custom variable labels

In the above example, the original variable labels are long and not very pretty. You can change variable labels either with set_var_labels (see this page for more details), which will affect all future plots and tables, or pass your own labels via labelPredictors.

sjt.lm(fit1, fit2,
       labelPredictors = c("Carer's Age",
                           "Hours of Care",
                           "Carer's Sex",
                           "Educational Status"))
Model 1 Model 2
  B CI p B CI p
(Intercept) 90.06 77.95 – 102.18 < 0.001 8.46 6.67 – 10.24 < 0.001
Carer’s Age -0.22 -0.36 – -0.08 0.002 0.01 -0.01 – 0.03 0.206
Hours of Care -0.28 -0.31 – -0.24 < 0.001 0.02 0.01 – 0.02 < 0.001
Carer’s Sex -0.26 -4.36 – 3.83 0.900 0.57 -0.03 – 1.17 0.061
Educational Status -0.76 -3.55 – 2.02 0.592 0.44 0.03 – 0.86 0.034
Observations 821 832
R2 / adj. R2 0.270 / 0.266 0.079 / 0.075

Compare models with different predictors

In some cases, for instance with stepwise regressions, you have different predictors for the same response. The proper grouping of predictors, i.e. rows, is done automatically.

First, let’s fit some example models.

# fit first model
fit1 <- lm(neg_c_7 ~ c160age + c172code + c161sex, data=efc)
# fit second model
fit2 <- lm(neg_c_7 ~ c160age + c172code + c161sex + c12hour, data=efc)
# fit third model
fit3 <- lm(neg_c_7 ~ c160age + c172code + e42dep + tot_sc_e, data=efc)

Note that when printing tables of fitted models with different predictors, variable labels are not detected automatically (maybe this will be implemented in a future package version).

sjt.lm(fit1, fit2, fit3, 
       separateConfColumn = FALSE,
       showAIC = TRUE,
       showFStat = TRUE)
Model 1 Model 2 Model 3
  B (CI) p B (CI) p B (CI) p
(Intercept) 7.82 (6.00 – 9.65) < 0.001 8.46 (6.67 – 10.24) < 0.001 6.23 (4.76 – 7.69) < 0.001
c160age 0.04 (0.02 – 0.06) < 0.001 0.01 (-0.01 – 0.03) 0.206 0.01 (-0.01 – 0.03) 0.271
c172code 0.39 (-0.03 – 0.81) 0.071 0.44 (0.03 – 0.86) 0.034 0.24 (-0.15 – 0.64) 0.230
c161sex 0.69 (0.07 – 1.31) 0.028 0.57 (-0.03 – 1.17) 0.061
c12hour 0.02 (0.01 – 0.02) < 0.001
e42dep 1.50 (1.23 – 1.77) < 0.001
tot_sc_e 0.21 (0.01 – 0.41) 0.038
Observations 832 832 833
R2 / adj. R2 0.025 / 0.022 0.079 / 0.075 0.153 / 0.148
F-statistics 7.107*** 17.730*** 37.250***
AIC 4611.921 4566.622 4502.333

Automatic grouping of categorical predictors

In case you have categorical variables with more than two factor levels, the sjt.lm function automatically groups the category levels to give a better overview of predictors in the table.

By default, automatic grouping is activated. To disable this feature, use group.pred = FALSE as parameter.
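
For example, applied to the models fitted in the next step:

# print the table without grouping the levels of categorical predictors
sjt.lm(fit1, fit2, group.pred = FALSE)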

To demonstrate this feature, we first convert two predictors to factors (which is what they actually are). To do this, we use the to_fac function, which converts numerical variables into factors but does not remove the variable and value label attributes.

# make education categorical
efc$c172code <- to_fac(efc$c172code)
# make dependency categorical
efc$e42dep <- to_fac(efc$e42dep)
# fit first model again (with c172code as factor)
fit1 <- lm(barthtot ~ c160age + c12hour + c172code + c161sex + e42dep, data=efc)
# fit second model again (with c172code as factor)
fit2 <- lm(neg_c_7 ~ c160age + c12hour + c172code + c161sex + e42dep, data=efc)

Now we can print the table.

sjt.lm(fit1, fit2)
Model 1 Model 2
  B CI p B CI p
(Intercept) 97.17 88.37 – 105.97 < 0.001 7.76 5.97 – 9.55 < 0.001
carer’ age -0.06 -0.16 – 0.03 0.203 0.00 -0.02 – 0.02 0.683
average number of hours of care for the elder in a week -0.07 -0.10 – -0.04 < 0.001 0.01 0.00 – 0.01 0.015
carer’s level of education: recoding of variable c172edu1
intermediate level of education 1.50 -1.60 – 4.60 0.343 0.13 -0.50 – 0.76 0.689
high level of education 0.66 -3.20 – 4.52 0.738 0.72 -0.07 – 1.51 0.074
carer’s gender 0.09 -2.74 – 2.93 0.949 0.56 -0.02 – 1.13 0.058
how dependent is the elder? – subjective perception of carer
slightly dependent -7.85 -12.86 – -2.83 0.002 1.11 0.09 – 2.13 0.033
moderately dependent -19.49 -24.42 – -14.57 < 0.001 2.37 1.37 – 3.37 < 0.001
severely dependent -56.87 -62.12 – -51.63 < 0.001 3.92 2.86 – 4.99 < 0.001
Observations 821 832
R2 / adj. R2 0.653 / 0.650 0.160 / 0.152

Removing estimates from the output

With remove.estimates, specific estimates can be removed from the table output. This may make sense in case you have stepwise regression models and only want to compare the varying predictors but not the controls. remove.estimates accepts either the row indices of the table output rows that should be removed, or the coefficients’ names.

When using numeric indices, the estimates’ index number relates to the same order as coef(fit). Note that currently the intercept cannot be removed from the model output!

data(efc)
# attach variable labels to each variable of the data
# frame - useful for automatic label detection
efc <- set_var_labels(efc, get_var_labels(efc))
# make education categorical
efc$c172code <- to_fac(efc$c172code)
# make dependency categorical
efc$e42dep <- to_fac(efc$e42dep)
# make prettier variable labels
efc$c172code <- set_var_labels(efc$c172code, "Education")
efc$e42dep <- set_var_labels(efc$e42dep, "Dependency")
# fit first model
fit1 <- lm(neg_c_7 ~ c160age + c172code + c161sex, data=efc)
# fit second model
fit2 <- lm(neg_c_7 ~ c160age + c172code + c161sex + c12hour, data=efc)
# fit third model
fit3 <- lm(neg_c_7 ~ c160age + c172code + e42dep + tot_sc_e, data=efc)
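
When using numeric indices for remove.estimates, a quick base-R check shows the order of the estimates (here for fit3):

# names and positions of the estimates, in the same order as coef(fit)
data.frame(index = seq_along(coef(fit3)), term = names(coef(fit3)))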

Example 1: Complete table output

Here you have the complete table output. This helps you identify the row index numbers. Especially when you have multiple models with different predictors, an estimate’s position in a particular model may differ from its position in the table output.

sjt.lm(fit1, fit2, fit3)
Model 1 Model 2 Model 3
  B CI p B CI p B CI p
(Intercept) 8.40 6.72 – 10.08 < 0.001 9.18 7.53 – 10.83 < 0.001 8.48 6.99 – 9.97 < 0.001
c160age 0.04 0.02 – 0.06 < 0.001 0.01 -0.01 – 0.03 0.306 0.01 -0.01 – 0.03 0.384
Education
c172code2 0.16 -0.52 – 0.83 0.652 0.12 -0.54 – 0.78 0.728 0.08 -0.56 – 0.72 0.806
c172code3 0.79 -0.05 – 1.64 0.066 0.91 0.09 – 1.74 0.030 0.52 -0.28 – 1.32 0.203
c161sex 0.70 0.09 – 1.32 0.025 0.59 -0.01 – 1.19 0.053
c12hour 0.02 0.01 – 0.02 < 0.001
Dependency
e42dep2 1.18 0.16 – 2.20 0.024
e42dep3 2.53 1.53 – 3.52 < 0.001
e42dep4 4.32 3.31 – 5.33 < 0.001
tot_sc_e 0.21 0.01 – 0.41 0.042
Observations 832 832 833
R2 / adj. R2 0.026 / 0.021 0.081 / 0.075 0.154 / 0.147

Example 2: Remove first coefficient (after intercept)

sjt.lm(fit1, fit2, fit3,
       remove.estimates = 2)
Model 1 Model 2 Model 3
  B CI p B CI p B CI p
(Intercept) 8.40 6.72 – 10.08 < 0.001 9.18 7.53 – 10.83 < 0.001 8.48 6.99 – 9.97 < 0.001
Education
c172code2 0.16 -0.52 – 0.83 0.652 0.12 -0.54 – 0.78 0.728 0.08 -0.56 – 0.72 0.806
c172code3 0.79 -0.05 – 1.64 0.066 0.91 0.09 – 1.74 0.030 0.52 -0.28 – 1.32 0.203
c161sex 0.70 0.09 – 1.32 0.025 0.59 -0.01 – 1.19 0.053
c12hour 0.02 0.01 – 0.02 < 0.001
Dependency
e42dep2 1.18 0.16 – 2.20 0.024
e42dep3 2.53 1.53 – 3.52 < 0.001
e42dep4 4.32 3.31 – 5.33 < 0.001
tot_sc_e 0.21 0.01 – 0.41 0.042
Observations 832 832 833
R2 / adj. R2 0.026 / 0.021 0.081 / 0.075 0.154 / 0.147

Example 3: Remove age and sex

sjt.lm(fit1, fit2, fit3,
       remove.estimates = c("c160age", "c161sex"))
Model 1 Model 2 Model 3
  B CI p B CI p B CI p
(Intercept) 8.40 6.72 – 10.08 < 0.001 9.18 7.53 – 10.83 < 0.001 8.48 6.99 – 9.97 < 0.001
Education
c172code2 0.16 -0.52 – 0.83 0.652 0.12 -0.54 – 0.78 0.728 0.08 -0.56 – 0.72 0.806
c172code3 0.79 -0.05 – 1.64 0.066 0.91 0.09 – 1.74 0.030 0.52 -0.28 – 1.32 0.203
c12hour 0.02 0.01 – 0.02 < 0.001
Dependency
e42dep2 1.18 0.16 – 2.20 0.024
e42dep3 2.53 1.53 – 3.52 < 0.001
e42dep4 4.32 3.31 – 5.33 < 0.001
tot_sc_e 0.21 0.01 – 0.41 0.042
Observations 832 832 833
R2 / adj. R2 0.026 / 0.021 0.081 / 0.075 0.154 / 0.147

Example 4: Remove many estimates

sjt.lm(fit1, fit2, fit3,
       remove.estimates = c(2,5,6,10))
Model 1 Model 2 Model 3
  B CI p B CI p B CI p
(Intercept) 8.40 6.72 – 10.08 < 0.001 9.18 7.53 – 10.83 < 0.001 8.48 6.99 – 9.97 < 0.001
Education
c172code2 0.16 -0.52 – 0.83 0.652 0.12 -0.54 – 0.78 0.728 0.08 -0.56 – 0.72 0.806
c172code3 0.79 -0.05 – 1.64 0.066 0.91 0.09 – 1.74 0.030 0.52 -0.28 – 1.32 0.203
Dependency
e42dep2 1.18 0.16 – 2.20 0.024
e42dep3 2.53 1.53 – 3.52 < 0.001
e42dep4 4.32 3.31 – 5.33 < 0.001
Observations 832 832 833
R2 / adj. R2 0.026 / 0.021 0.081 / 0.075 0.154 / 0.147

Example 5: Custom predictor labels

In most cases you need to define your own labels when removing estimates, especially when you have grouped categorical predictors, because automatic label detection is quite tricky in such situations. If you provide your own labels, please note that the grouped predictors’ headings (the variable name of the grouped categorical variable) are still set automatically by the sjt.lm function (variable labels are used, so use set_var_labels for those categorical predictors). For all data rows in the table, i.e. for each coefficient appearing in the models, you need to specify a label string.

In the next example, we have seven table rows with data (excluding intercept): mid and hi education (categories of the variable Education), Hours of Care, slight, moderate and severe dependency (categories of the variable Dependency) and Service Usage. These ‘rows’ need to be labelled.

sjt.lm(fit1, fit2, fit3,
       labelPredictors = c("mid education", 
                           "hi education", 
                           "Hours of Care", 
                           "slight dependency", 
                           "moderate dependency", 
                           "severe dependency", 
                           "Service Usage"),
       remove.estimates = c("c160age", "c161sex"))
Model 1 Model 2 Model 3
  B CI p B CI p B CI p
(Intercept) 8.40 6.72 – 10.08 < 0.001 9.18 7.53 – 10.83 < 0.001 8.48 6.99 – 9.97 < 0.001
Education
mid education 0.16 -0.52 – 0.83 0.652 0.12 -0.54 – 0.78 0.728 0.08 -0.56 – 0.72 0.806
hi education 0.79 -0.05 – 1.64 0.066 0.91 0.09 – 1.74 0.030 0.52 -0.28 – 1.32 0.203
Hours of Care 0.02 0.01 – 0.02 < 0.001
Dependency
slight dependency 1.18 0.16 – 2.20 0.024
moderate dependency 2.53 1.53 – 3.52 < 0.001
severe dependency 4.32 3.31 – 5.33 < 0.001
Service Usage 0.21 0.01 – 0.41 0.042
Observations 832 832 833
R2 / adj. R2 0.026 / 0.021 0.081 / 0.075 0.154 / 0.147

sjPlot package and related online manuals updated #rstats #ggplot

My sjPlot package for data visualization has just been updated on CRAN. I’ve added some features to existing functions, which I want to introduce here.

Plotting linear models

So far, plotting model assumptions of linear models and plotting slopes for each estimate were spread over several functions. Now these plot types have been integrated into the sjp.lm function, where you can select the plot type with the type parameter. Furthermore, plotting standardized coefficients now also plots the related confidence intervals.

Detailed examples can be found here:
www.strengejacke.de/sjPlot/sjp.lm

Plotting generalized linear models

Besides odds ratios, you can now also plot the predicted probabilities of the outcome for each predictor of generalized linear models. In case you have continuous predictors, this kind of plot may be more intuitive than an odds ratio value.

Detailed examples can be found here:
www.strengejacke.de/sjPlot/sjp.glm

Plotting (generalized) linear mixed effects models

The plotting functions for creating plots of (generalized) linear mixed effects models (sjp.lmer and sjp.glmer) also got new plot types over the course of the last weeks; a short usage sketch follows the two lists below.

For sjp.lmer, we have

  • re (default) for estimates of random effects
  • fe for estimates of fixed effects
  • fe.std for standardized estimates of fixed effects
  • fe.cor for correlation matrix of fixed effects
  • re.qq for a QQ-plot of random effects (random effects quantiles against standard normal quantiles)
  • fe.ri for fixed effects slopes depending on the random intercept.

and for sjp.glmer, we have

  • re (default) for odds ratios of random effects
  • fe for odds ratios of fixed effects
  • fe.cor for correlation matrix of fixed effects
  • re.qq for a QQ-plot of random effects (random effects quantiles against standard normal quantiles)
  • fe.pc or fe.prob to plot probability curves (predicted probabilities) of all fixed effects coefficients. Use facet.grid to decide whether to plot each coefficient as separate plot or as integrated faceted plot.
  • ri.pc or ri.prob to plot probability curves (predicted probabilities) of random intercept variances for all fixed effects coefficients. Use facet.grid to decide whether to plot each coefficient as separate plot or as integrated faceted plot.
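
For instance, a minimal sketch for sjp.lmer, using the type values from the list above (the model formula and the grouping variable e15relat are just assumptions for illustration):

library(lme4)
library(sjPlot)
data(efc)
# a simple random-intercept model on the sample data
fit <- lmer(neg_c_7 ~ c160age + c12hour + (1 | e15relat), data = efc)
# estimates of random effects (default)
sjp.lmer(fit)
# estimates of fixed effects
sjp.lmer(fit, type = "fe")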

Detailed examples can be found here:
www.strengejacke.de/sjPlot/sjp.lmer and www.strengejacke.de/sjPlot/sjp.glmer

Plotting interaction terms of (generalized) linear (mixed effects) models

Another function that got new features is sjp.int (formerly known as sjp.lm.int). This function is now more or less generic and can plot interactions of

  • linear models (lm)
  • generalized linear models (glm)
  • linear mixed effects models (lme4::lmer)
  • generalized linear mixed effects models (lme4::glmer)

For linear models (both normal and mixed effects models), the slopes of the interaction terms are plotted. For generalized linear models, the predicted probabilities of the outcome for the interaction terms are plotted.
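
A minimal sketch for a linear model – the interaction term is picked arbitrarily from the sample data, so see the linked documentation for the exact requirements of sjp.int:

library(sjPlot)
data(efc)
# fit a linear model with an interaction term
fit <- lm(neg_c_7 ~ barthtot * c161sex, data = efc)
# plot the slopes of the interaction term
sjp.int(fit)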

Detailed examples can be found here:
www.strengejacke.de/sjPlot/sjp.int

Plotting Likert scales

Finally, a comprehensive documentation for the sjp.likert function is finished, which can be found here:
www.strengejacke.de/sjPlot/sjp.likert


sjPlot 1.6 – major revisions, anyone for beta testing? #rstats

In the last couple of weeks I have rewritten some core parts of my sjPlot-package and also revised the package- and online documentation.

Most notable are the changes that affect theming and appearance of plots and figures. There’s a new function called sjp.setTheme, which now sets theme options for all sjp-functions, which means

  1. you only need to specify theme / appearance options once and no longer need to repeat these parameters for each sjp-function call you make
  2. due to this change, all sjp-functions have far fewer parameters, making the functions and their documentation clearer (a short example follows below)
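
For example, a short sketch of the idea (theme name and parameters are taken from examples elsewhere on this blog) – set the theme once, then call the plotting functions without any appearance parameters:

library(sjPlot)
data(efc)
# set theme options once for all following sjp-function calls
sjp.setTheme(theme = "539", axis.angle.x = 90)
# no appearance parameters needed here any more
sjp.frq(efc$e42dep)
sjp.xtab(efc$e16sex, efc$e42dep)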

Furthermore, due to some problems with connecting / updating to the RPubs server, I decided to upload my online documentation for the package to my own site. You will now find the latest, comprehensive documentation and examples for various functions of the sjPlot package at www.strengejacke.de/sjPlot/. For instance, take a look at customizing plot appearance and see how the new theming feature of the package allows both easier customization of plots as well as better integration of theming packages like ggthemr or ggthemes.

An update of the sjPlot package on CRAN is planned soon; however, I kindly ask you to test the current development snapshot, which is hosted on GitHub. You can easily install the package from there using the devtools-package (devtools::install_github("devel", "sjPlot")). The current snapshot is (very) stable and I appreciate any feedback or bug reports (if possible, use the issue tracker on GitHub).

The current change log with all new functions, changes and bug fixes can also be found on GitHub.


Visualize pre-post comparison of intervention #rstats

My sjPlot-package was just updated on CRAN, introducing a new function called sjp.emm.int to plot estimated marginal means (least-squares means) of linear models with interaction terms. Or: plotting adjusted means of an ANCOVA.

The idea for this function came up when we wanted to analyze the effect of an intervention (an educational programme on knowledge about mental disorders and associated stigma) between two groups: a “treatment” city, where a campaign on mental disorders was conducted, and another city without this campaign. People from both cities were asked about their attitudes towards and knowledge about specific mental disorders at t0, before the campaign started in the one city. Some months later (t1), people from both cities were asked the same questions again. The intention was to see a) whether there were differences in knowledge and pro-social attitudes of people towards mental disorders and b) whether the campaign successfully reduced stigma and increased knowledge.

To analyse these questions, we used an ANCOVA with the knowledge and stigma scores as dependent variables, “city” and “time” (t0 versus t1) as predictors, adjusted for covariates like age, sex, education etc. The estimated marginal means (or least-squares means) show you the differences in the dependent variable between the groups and time points.
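
In code, such a model and the call to sjp.emm.int might look roughly like this – the data is simulated and the variable names are made up, since the original data is not publicly available:

library(sjPlot)
# simulated stand-in for the original (non-public) data
set.seed(1)
mydata <- data.frame(stigma = rnorm(200),
                     city = factor(sample(c("campaign", "control"), 200, replace = TRUE)),
                     time = factor(sample(c("t0", "t1"), 200, replace = TRUE)),
                     age = rnorm(200, 50, 10))
# ANCOVA-style model with the city-by-time interaction, adjusted for age
fit <- lm(stigma ~ city * time + age, data = mydata)
# plot estimated marginal means of the interaction
sjp.emm.int(fit)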

Here’s an example plot, quickly done with the sjp.emm.int function:
(Figure: estimated marginal means of the group-by-time interaction, plotted with sjp.emm.int)

Since the data is not publicly available, I’ve set up a documentation with reproducible examples (though those examples do not fit very well…).

The latest development snapshot of my package is available on GitHub.

BTW: You may have noticed that this function is quite similar to the sjp.lm.int function for visually interpreting interaction terms in linear models…


sjPlot: New options for creating beautiful tables and documentation #rstats

A new update of my sjPlot package was just released on CRAN. This release again focused on improving existing functions and fixing bugs. Especially the table output functions (see my previous blog posts on table output functions here and here) have improved a lot. Tables now offer more and better options for style customization and knitr integration. A basic introduction to the new features is given in this document.

To make it easier to understand all features, I started to set up comprehensive documentation for all sjPlot functions on strengejacke.de.


sjPlot 1.3 available #rstats #sjPlot

I just submitted my package update (version 1.3) to CRAN. The download is already available (currently source, binaries follow). While the last two updates included new functions for table outputs (see here and here for details on these functions), the current update mostly provides small helper functions. The focus of this update was to improve existing functions and make their handling easier and more comfortable.

Automatic label detection

One major feature is that many functions now automatically detect variable and value labels, if possible. For instance, if you have imported an SPSS dataset (e.g. with the function sji.SPSS), value labels are automatically attached to all variables of the data frame. With the autoAttachVarLabels parameter set to TRUE, variable labels will also be attached to the data frame after importing the SPSS data. These labels are now detected automatically by most functions of the package. And this does not only apply to imported SPSS data: if you have factors with specified factor levels, these will also automatically be used as value labels. Furthermore, you can manually attach value and variable labels using the new functions sji.setVariableLabels and sji.setValueLabels.
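
For instance, a short sketch (the file path is a placeholder, and applying the label setters to a single variable is an assumption – see the package documentation for the exact usage):

library(sjPlot)
# import an SPSS dataset; value labels are attached automatically,
# variable labels via the autoAttachVarLabels parameter
mydf <- sji.SPSS("path/to/data.sav", autoAttachVarLabels = TRUE)
# manually attach labels to a newly created variable (assumed usage)
mydf$dummy <- sample(1:3, nrow(mydf), replace = TRUE)
mydf$dummy <- sji.setValueLabels(mydf$dummy, c("low", "mid", "hi"))
mydf$dummy <- sji.setVariableLabels(mydf$dummy, "A dummy variable")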

But what exactly are the benefits of this new feature? Let me give an example. To plot a proportional table with axis and legend labels, prior to sjPlot 1.3 you needed the following code:

data(efc)
efc.val <- sji.getValueLabels(efc)
efc.var <- sji.getVariableLabels(efc)
sjp.xtab(efc$e16sex,
         efc$e42dep,
         axisLabels.x=efc.val[['e16sex']],
         legendTitle=efc.var['e42dep'],
         legendLabels=efc.val[['e42dep']])

Since version 1.3, you only need to write:

data(efc)
sjp.xtab(efc$e16sex, efc$e42dep)

Reliability check for index scores

One new table output function included in this update is sjt.itemanalysis, which helps you perform an item analysis on a scale or data frame if you want to develop index scores.

Let’s say you have several items and you want to compute a principal component analysis in order to identify different components that can be combined into index scores. In such cases, you might want to perform reliability and item discrimination tests. This is shown in the following example, which performs a PCA on the COPE-Index scale, followed by a reliability and item analysis of each extracted “score”:

data(efc)
# retrieve first item of COPE-index scale
start <- which(colnames(efc)=="c82cop1")
# retrieve last item of COPE-index scale
end <- which(colnames(efc)=="c90cop9")
# create data frame of cope-index-items
df <- as.data.frame(efc[,c(start:end)])
colnames(df) <- sji.getVariableLabels(efc)[c(start:end)]
# compute PCA on cope index and return
# "group classifications" of factors
factor.groups <- sjt.pca(df, no.output=TRUE)$factor.index
# perform item analysis
sjt.itemanalysis(df, factor.groups)

The result is the following table, where two components have been extracted via the PCA and the variables belonging to each component are treated as one “index score” (note that you don’t need to define groups; you can also treat a data frame as one single “index”):
(Figure: item analysis table of the two extracted index scores)

The output of the computed PCA was suppressed by no.output=TRUE. To better understand the above figure, take a look at the PCA results, where two components have been extracted:
(Figure: PCA results table with the two extracted components)

Besides that, many functions – especially the table output functions – got new parameters to change the appearance of the output (number of digits, inclusion of NAs, additional information in tables etc.). Refer to the package news for a complete overview of what has changed since the last version.

The latest developer build can be found on github.


sjPlot – data visualization for statistics (in social science) #rstats

I’d like to announce the release of version 0.7 of my R package for data visualization and give a small overview of this package (download and installation instructions can be found on the package page, detailed examples on RPubs).

What does this package do?
In short, the functions in this package mostly do two things:

  1. compute basic or advanced statistical analyses
  2. either plot the results as ggplot-diagram or print them as html-table

However, the number of functions has increased over time, hence you’ll also find some utility functions besides the plotting functions.

How does this package help me?
Basically, this package helps those users…

  • who have difficulties using and/or understanding all possibilities that ggplot offers to create plots, simply by providing intuitive function parameters, which allow for manipulating the appearance of plots; or
  • who don’t want to set up complex ggplot-objects from scratch each time; or
  • who like quick inspections of (basic) statistics via (HTML-)tables that are shown in the GUI viewer pane or default browser; or
  • who want to easily create beautiful table outputs that can be imported into office applications.

Furthermore, for advanced users, each function returns either the prepared ggplot-object (in case of the sjp-plotting functions) or the HTML tables (in case of the sjt-table-output functions), which can then be manipulated even further (for instance, for ggplot-objects you can specify certain parameters that cannot be modified via the sjPlot package, or HTML tables can be integrated into knitr-documents).

What are all these functions about?
There’s a certain naming convention for the functions:

  • sjc – collection of functions useful for carrying out cluster analyses
  • sji – collection of functions for data import and manipulation
  • sjp – collection of plotting functions, the “core” of this package
  • sjt – collection of functions that create (HTML) table outputs (instead of ggplot-graphics)
  • sju – collection of statistical utility functions

Use cases?

  • You can plot results of ANOVA, correlations, histograms, box plots, bar plots, (generalized) linear models, Likert scales, PCA, proportional tables as bar charts etc.
  • You can create plots to analyse model assumptions (lm, glm), predictor interactions, multiple contingency tables etc.
  • You can create table outputs instead of graphs for most plotting functions
  • With the import and utility functions, you can, for instance, extract beta coefficients of linear models, convert numeric scales into grouped factors, perform statistical tests, import SPSS data sets (and retrieve variable and value labels from the imported data), convert factors to numeric variables (and vice versa)…

Final remarks
At the bottom of my package page you’ll find some examples of selected functions that have been published on this blog before I created the package. Furthermore, the package includes a sample dataset from one of my research projects. Once the package is installed, you can test each function by running the examples. All news and recent changes can be found in the NEWS section of the package help (type ?sjPlot to access the help file inside R).

I tried to write very comprehensive documentation for each function and its parameters; hopefully this will help you use my package…

Any comments, suggestions etc. are very welcome!


sjPlotting functions now as package available #rstats

This weekend I had some time to deal with package building in R. After some struggling, I now managed to set up RStudio, Roxygen and MiKTeX properly, so that I can compile my collection of R-scripts into a package that even passes the package check.

Downloads (package and manual) as well as package description are available at the package information page!

Since the package successfully passed the package check and a manual could also be created, I’ll probably submit it to CRAN. Currently, I’m only able to compile the source and the Windows binaries of the package, because at home I use RStudio on my Mac with OS X 10.9 Mavericks, and it seems there’s an issue with GNU tar on Mavericks, which is needed to compile the OS X binaries… I’m not sure whether it’s enough to just submit the source to CRAN.

Anyway, please check out my package and let me know if you encounter any problems or if you have suggestions on improving the documentation etc.

Open questions

  • How do I write an “ü” in the R-documentation (needed for my family name in author information)? The documentation is inside the R-files, the RD-files are created using Roxygen.
  • How do I include datasets inside an R-package? I would like to include an SPSS-dataset (.sav-file), so I can make the examples of my sji.XYZ functions run… (currently they’re commented out so the package will compile and pass its check properly)
  • How to include a change log inside R-packages?