My set of packages for (daily) data analysis #rstats

I started writing my first package as a collection of various functions that I needed for (almost) daily work. Over time, the packages grew, and bit by bit I moved functions out into new packages. Although this means more work for the CRAN maintainers, who have more packages to manage on their network, from a user perspective it is much better if packages have a clear focus and a well-defined set of functions. That’s why I have now released a new package on CRAN, sjlabelled, which contains all functions that deal with labelled data. These functions used to live in the sjmisc-package, where they are now deprecated and will be removed in a future update.

My aim is to provide packages that not only have a clear focus, but also share a consistent design and philosophy, making them easier and more intuitive to use (see also here). I prefer to follow the so-called “tidyverse” approach here. It is still work in progress, but so far I think I’m on the right track…

So, what are the packages I use for (almost) daily work?

  • sjlabelled – reading, writing and working with labelled data (especially since I collaborate a lot with people who use Stata or SPSS)
  • sjmisc – data and variable transformation utilities (the complement to dplyr and tidyr, when it comes down from data frames to variables within the data wrangling process)
  • sjstats – out-of-the-box statistics that are not already provided by base R or other packages
  • sjPlot – to quickly generate tables and plots
  • ggeffects – to visualize regression models

The next step is creating cheat sheets for my packages. I think if you can map the scope and idea of your package (functions) on a cheat sheet, its focus is probably well defined.

Btw, if you also use some of the above packages more or less regularly, you can install the “strengejacke” package to load them all in one step. This package is not on CRAN, because its only purpose is to load other packages.

Disclaimer: Of course I use other packages every day as well – this post just focuses on the packages I created because I frequently needed these kinds of functions.

ggeffects: Create Tidy Data Frames of Marginal Effects for ‚ggplot‘ from Model Outputs #rstats

Aim of the ggeffects-package

The aim of the ggeffects-package is similar to that of the broom-package: transforming “untidy” input into a tidy data frame, especially for further use with ggplot. However, ggeffects does not return model summaries; rather, this package computes marginal effects at the mean or average marginal effects from statistical models and returns the result as a tidy data frame (as tibbles, to be more precise).

Since the focus lies on plotting the data (the marginal effects), at least one model term needs to be specified for which the effects are computed. It is also possible to compute marginal effects for model terms, grouped by the levels of another model’s predictor. The package also allows plotting marginal effects for two- or three-way-interactions, or for specific values of a model term only. Examples are shown below.
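As a minimal sketch of this idea (not taken from the original post): the snippet below assumes the ggeffects-package is installed and calls its ggpredict() function on a linear model fitted to R’s built-in mtcars data. The model, the choice of variables and the object names are illustrative assumptions only.

```r
library(ggeffects)

# illustrative data: treat the number of cylinders as a factor
d <- transform(mtcars, cyl = factor(cyl))

# simple linear model, chosen only for demonstration
fit <- lm(mpg ~ wt + cyl, data = d)

# predicted values for the model term "wt",
# grouped by the levels of the second term "cyl"
pr <- ggpredict(fit, terms = c("wt", "cyl"))

# the result is a data frame with one row per combination
# of "wt" value and "cyl" level
names(pr)
nrow(pr)
```

Because the result is an ordinary data frame, its columns (such as x, predicted and group) can be mapped directly to aesthetics in a ggplot() call.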

Continue reading “ggeffects: Create Tidy Data Frames of Marginal Effects for ‘ggplot’ from Model Outputs #rstats”

Negative Binomial Regression for Complex Samples (Surveys) #rstats

The survey-package by Thomas Lumley is a great toolkit for analyzing complex samples. It provides svyglm() to fit generalised linear models to data from a complex survey design. svyglm() covers all families that are also provided by R’s glm(). However, the survey-package has no function to fit negative binomial models, which might be useful for overdispersed count data. Yet, the package provides a generic svymle() to fit user-specified likelihood estimations. In his book, Appendix E, Thomas Lumley describes how to write your own likelihood function, passed to svymle(), to fit negative binomial models for complex samples. So I wrote a small “wrapper” and implemented a function svyglm.nb() in my sjstats-package.

# ------------------------------------------
# This example reproduces the results from
# Lumley 2010, figure E.7 (Appendix E, p256)
# ------------------------------------------
library(sjstats)
library(survey)
data(nhanes_sample)

# create survey design
des <- svydesign(
  id = ~ SDMVPSU,
  strat = ~ SDMVSTRA,
  weights = ~ WTINT2YR,
  nest = TRUE,
  data = nhanes_sample
)

# fit negative binomial regression
fit <- svyglm.nb(total ~ factor(RIAGENDR) * (log(age) + factor(RIDRETH1)), des)

# print coefficients and standard errors
round(cbind(coef(fit), survey::SE(fit)), 2)

#>                                          [,1] [,2]
#> theta.(Intercept)                        0.81 0.05
#> eta.(Intercept)                          2.29 0.16
#> eta.factor(RIAGENDR)2                   -0.80 0.18
#> eta.log(age)                             1.07 0.23
#> eta.factor(RIDRETH1)2                    0.08 0.15
#> eta.factor(RIDRETH1)3                    0.09 0.18
#> eta.factor(RIDRETH1)4                    0.82 0.30
#> eta.factor(RIDRETH1)5                    0.06 0.38
#> eta.factor(RIAGENDR)2:log(age)          -1.22 0.27
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)2 -0.18 0.26
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)3  0.60 0.19
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)4  0.06 0.37
#> eta.factor(RIAGENDR)2:factor(RIDRETH1)5  0.38 0.44

The function returns an object of class svymle, so all methods provided by the survey-package for this class work. There are only a few of them, however, and common methods like predict() are currently not implemented. Maybe, hopefully, future updates of the survey-package will include such features.

Descriptive summary: Proportions of values in a vector #rstats

When describing a sample, researchers in my field often report proportions of specific characteristics, for instance the proportion of female persons or the proportion of persons with higher or lower income. Since I often want to know these characteristics when exploring data, I decided to write a function, prop(), which is part of my sjstats-package – a package dedicated to summary functions, mostly for fit or association measures of regression models or descriptive statistics.
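To illustrate what such a proportion means, here is a small base R sketch. It is not the implementation of prop() from sjstats; the vectors gender and income are made-up example data, and the only assumption is that the mean of a logical condition gives the share of cases where the condition holds.

```r
# hypothetical example data, including missing values
gender <- c("female", "male", "female", NA, "female")
income <- c(1200, 3400, 2100, 5000, NA)

# proportion of female persons among the non-missing cases
prop_female <- mean(gender == "female", na.rm = TRUE)

# proportion of persons with an income above 2000
prop_high_income <- mean(income > 2000, na.rm = TRUE)

prop_female
prop_high_income
```

Both conditions hold for three of the four non-missing values, so each proportion is 0.75.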

Continue reading “Descriptive summary: Proportions of values in a vector #rstats”

Data wrangling within the #tidyverse – the design philosophy behind the sjmisc-package #rstats

I’m pleased to announce sjmisc 2.3.0, which was just updated on CRAN. The update might break existing code – however, functions were largely revised to work seamlessly within the tidyverse. In the long run, consistent design makes working with sjmisc more intuitive.

Basically, sjmisc covers two domains of functionality:

  • Reading and writing data between R and other statistical software packages like SPSS, SAS or Stata and working with labelled data; this includes easy ways to get and set label attributes, to convert labelled vectors into factors (and vice versa), or to deal with multiple declared missing values etc.
  • Data transformation tasks like recoding, dichotomizing or grouping variables, setting and replacing missing values. The data transformation functions also support labelled data.

This post briefly describes some of the changes to the design of the functions that perform data transformation tasks.

Continue reading “Data wrangling within the #tidyverse – the design philosophy behind the sjmisc-package #rstats”

sjPlot-update: b&w-Figures for Print Journals and Package Vignettes #rstats #dataviz

My sjPlot-package was just updated on CRAN with some – as I think – useful new features.

First, I have added some vignettes to the package (based on the existing online-documentation) that cover some core features and principles of the sjPlot-package, so you have direct access to these manuals within R. The vignettes are also online on CRAN.

Continue reading “sjPlot-update: b&w-Figures for Print Journals and Package Vignettes #rstats #dataviz”

Exploring the European Social Survey (ESS) – pipe-friendly workflow with sjmisc, part 2 #rstats #tidyverse

This is another post of my series about how my packages integrate into a pipe-friendly workflow. The post focusses on my sjmisc-package, which was just updated on CRAN, and highlights some of the new features. Examples are based on data from the European Social Survey, which are freely available.

Please note: The statistical analyses at the end of this post mainly serve the purpose of demonstrating some features of the sjmisc-package that target “real life” problems! For the sake of clarity, I ran a quick-and-dirty model, which is not of high statistical quality or standard!

Continue reading “Exploring the European Social Survey (ESS) – pipe-friendly workflow with sjmisc, part 2 #rstats #tidyverse”

Pipe-friendly workflow with sjPlot, sjmisc and sjstats, part 1 #rstats #tidyverse

Recent developments in R packages increasingly focus on the philosophy of tidy data and a common package design and API. Tidy data is an important part of data exploration and analysis, as shown in the following figure:

(Source: http://r4ds.had.co.nz/explore-intro.html)

Tidying data includes not only data cleaning, but also data transformation, both being necessary to perform the core steps of data analysis and visualization. This is a complex process, which involves many steps. You need many packages and functions to perform those tasks. This is where a common package design and API comes into play: “A powerful strategy for solving complex problems is to combine many simple pieces”, says the tidyverse manifesto. For a coding workflow, this means:

  • compose single functions with the pipe
  • design your API so that it is easy to use by humans

The latter bullet point is helpful to achieve the first bullet point.
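A minimal sketch of what composing single functions with the pipe looks like in practice, assuming the dplyr-package is installed. The example uses R’s built-in mtcars data rather than anything from the original post, and the column names mean_mpg and n are illustrative choices.

```r
library(dplyr)

# each step is one simple function; the pipe passes the
# result of one step as first argument to the next step
result <- mtcars %>%
  group_by(cyl) %>%                 # group cars by number of cylinders
  summarise(
    mean_mpg = mean(mpg),           # average fuel economy per group
    n = n()                         # number of cars per group
  ) %>%
  arrange(desc(mean_mpg))           # sort groups by average fuel economy

result
```

Every function here takes a data frame as its first argument and returns a data frame, which is the kind of consistent API design that makes the steps easy to chain.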

Continue reading “Pipe-friendly workflow with sjPlot, sjmisc and sjstats, part 1 #rstats #tidyverse”

Tagged NA values and labelled data #rstats

sjmisc-package: Working with labelled data

A major update of my sjmisc-package was just released on CRAN. A major change (see changelog for all changes) is the support of the latest release of the haven-package, a package to import and export SPSS, SAS or Stata files.

The sjmisc-package mainly addresses three domains:

  • reading and writing data between other statistical packages and R
  • functions to make working with labelled data easier
  • frequently applied recoding and variable transformation tasks, also with support for labelled data

In this post, I want to introduce the topic of labelled data and give some examples of what the sjmisc-package can do, with a special focus on tagged NA values.
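For readers unfamiliar with tagged NA values, here is a minimal sketch using functions from the haven-package itself (tagged_na(), is_tagged_na() and na_tag()), not from sjmisc. The vector x and the tags "a" and "z" are made-up examples; the sketch assumes haven is installed.

```r
library(haven)

# a numeric vector with two tagged missing values and one regular NA;
# the tags could stand for different reasons for missingness,
# e.g. "a" = refused, "z" = not at home (hypothetical meanings)
x <- c(1, 2, tagged_na("a"), tagged_na("z"), NA)

# base R treats all three as ordinary missing values
is.na(x)

# haven can tell tagged NA values apart from the regular NA
is_tagged_na(x)

# extract the tag of each value (NA where there is no tag)
na_tag(x)
```

The tags survive inside the double vector, so different kinds of missing values, as declared in SPSS, SAS or Stata files, can be distinguished in R while still behaving as NA in computations.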

Continue reading “Tagged NA values and labelled data #rstats”