This is another post of my series about how my packages integrate into a pipe-friendly workflow. The post focusses on my sjmisc-package, which was just updated on CRAN, and highlights some of the new features. Examples are based on data from the European Social Survey, which are freely available.

Please note: The statistical analyses at the end of this post mainly serve the purpose of demonstrating some features of the sjmisc-package that target „real life“ problems! For clarity reasons, I ran a quick-and-dirty model, which is not of high statistical quality or standard!

Steps of the data analysis process

Let’s recall the last post, where I showed a figure from Hadley Wickham’s R for Data Science book:

(Source: http://r4ds.had.co.nz/explore-intro.html)

Tidying up, transforming and exploring data is an important part of data analysis, and you can manage many common tasks in this process with the tidyverse- or related packages. The sjmisc-package fits into this workflow, especially when you work with labelled data (like many public use data sets, and the European Social Survey as well), because it offers functions for data transformation and labelled data utility functions. This is what I’m going to show next, based on some „real life“ initial steps when beginning with data exploration.

The European Social Survey

As mentioned above, data from the ESS is freely available. Following examples are based on the SPSS-dataset from the current 7th round. Importing foreign data is easy in R – I think, there’s no other bigger statistical software package, which let’s you import so many data formats.

First, load the required packages and the SPSS-data set. Note that I use the read_spss-function from my sjmisc-package, which wraps the read-function from the haven-package (but it does not return labelled class vectors, for some reasons…). Since the SPSS-data is labelled, the read-function returns a labelled data frame.

library(sjmisc)
library(tidyverse)
ess <- read_spss("ESS7e02.sav")

frq() – printing frequency tables

The first thing that may be of interest might be the response rates in each country. You can plot frequencies for labelled data with the frq()-function. This function requires either a vector or data frame as input and prints the variable label as first line, followed by a frequency-table with values, labels, counts and percentages of the vector.

frq(ess$cntry)

# Country

 val          label  frq raw.prc valid.prc cum.prc
   1        Austria 1795    4.47      4.47    4.47
   2        Belgium 1769    4.40      4.40    8.87
   3    Switzerland 1532    3.81      3.81   12.68
   4 Czech Republic 2148    5.35      5.35   18.03
   5        Germany 3045    7.58      7.58   25.60

... (omitted rows)

find_var() – finding variables in a data frame

Ok, next, let’s look at the distribution of gender across countries. To compute cross tables, you can use the flat_table() function. It requires the data as first argument, followed by any number of variable names. But first, we need to know the name of the gender-variable. It is obviously not „gender“. This is where find_var() comes into play. It searches for variables in a data frame by 1) variable names, 2) variable labels, 3) value labels 4) or any combination of these. By default, it looks for variable name and labels. The function also supports regex-patterns. By default, find_var() returns the column-indices, but you can also print a small „summary“ with the as.varlab-argument.

# find all variables with "gender" in name or label
find_var(ess, "gender", as.varlab = T)

# A tibble: 16 × 3
   col.nr var.name                                    var.label
    <int>    <chr>                                        <chr>
1     150  dscrgnd Discrimination of respondent's group: gender
2     213  icgndra       Interviewer code, gender of respondent
3     297     gndr                                       Gender
4     298    gndr2         Gender of second person in household
5     299    gndr3          Gender of third person in household
6     300    gndr4         Gender of fourth person in household
7     301    gndr5          Gender of fifth person in household

... (omitted rows)

Ok, variable in column 297, named „gndr“, is what we are looking for.

flat_table() – printing cross tables

Now we can look at the distribution of gender across countries:

flat_table(ess, cntry, gndr)
# flat_table also works in a pipe-workflow:
ess %>% select(cntry, gndr) %>% flat_table()

               gndr Male Female No answer
cntry                                    
Austria              853    942         0
Belgium              896    873         0
Switzerland          766    766         0
Czech Republic       998   1128         0
Germany             1545   1500         0

... (omitted rows)

get_labels() – getting value labels

Next thing we want to know is where to find the educational level of participants. We know that the ESS uses the ISCED-classification. So let’s find „isced“.

find_var(ess, "isced", as.varlab = T)

# A tibble: 4 × 3
  col.nr var.name                                        var.label
   <int>    <chr>                                            <chr>
1    355   eisced           Highest level of education, ES - ISCED
2    447  eiscedp Partner's highest level of education, ES - ISCED
3    496  eiscedf  Father's highest level of education, ES - ISCED
4    526  eiscedm  Mother's highest level of education, ES - ISCED

ISCED has several categories, but we just want three groups of educational level: low, intermediate and high. So we first need some more information about eisced and its coding. You can get value labels with the get_labels()-function. The simple function call returns the value labels of a variable as character vector, but in this case we also use include.values = "p" to see the related values of the label, and drop.unused = T to drop all labels of those values that don’t occur in the vector.

get_labels(ess$eisced, 
           include.values = "p", 
           drop.unused = T)

[1] "[1] ES-ISCED I , less than lower secondary"             
[2] "[2] ES-ISCED II, lower secondary"                       
[3] "[3] ES-ISCED IIIb, lower tier upper secondary"          
[4] "[4] ES-ISCED IIIa, upper tier upper secondary"          
[5] "[5] ES-ISCED IV, advanced vocational, sub-degree"       
[6] "[6] ES-ISCED V1, lower tertiary education, BA level"    
[7] "[7] ES-ISCED V2, higher tertiary education, >= MA level"
[8] "[55] Other"

rec() – recoding variables

We just want three categories. The ISCED-classification can be recoded into lower, intermediate and higher educational levels by combining categories 1 and 2, 3 to 5 and 6 and 7. All other values (in the above example, we have a „55“ for „Other“) should be set to missing. We can easily recode and label vectors with the rec()-function. This function does not only recode vectors, it also allows direct labelling of categories inside the recode-syntax (this is optional, you can also use the val.labels-argument). We now recode eisced into a new variable education.

# ok, let's recode
ess$education <- rec(
  ess$eisced,
  recodes = c("1:2=1 [low]; 3:5=2 [intermediate]; 6:7=3 [high]; else=NA"),
  var.label = "Educational level (eisced)"
)
# print frequencies
frq(ess$education)

# Educational level (eisced)

 val        label   frq raw.prc valid.prc cum.prc
   1          low 10845   26.99     27.17   27.17
   2 intermediate 19978   49.72     50.05   77.21
   3         high  9096   22.64     22.79  100.00
   4           NA   266    0.66        NA      NA

You can see the the vector education has a variable label (Educational level (eisced)), which was set inside the rec()-function, as well as three values with labels („low“, „intermediate“ and „high“). All values 1 to 2 of eisced were recoded into 1 in education, values 3 to 5 into 2 and so on. All remaining values are set to missing (else=NA – for details on the recode-syntax, see ?rec).

Grouped data frames and sjmisc-package

Now, how is education distributed by gender? We can group the ESS-data and print frequencies using the frq()-function for this as well, as this function also accepts grouped data frames. Frequencies for grouped data frames first print the group-details (variable name and category), followed by the frequency table. Thanks to labelled data, the output is easy to understand.

ess %>% 
  select(education, gndr) %>% 
  group_by(gndr) %>% 
  frq()

Grouped by:
Gender: Male
# Educational level (eisced)

 val        label  frq raw.prc valid.prc cum.prc
   1          low 5023   26.62     26.79   26.79
   2 intermediate 9687   51.33     51.66   78.44
   3         high 4042   21.42     21.56  100.00
   4           NA  119    0.63        NA      NA


Grouped by:
Gender: Female
# Educational level (eisced)

 val        label   frq raw.prc valid.prc cum.prc
   1          low  5813   27.30     27.49   27.49
   2 intermediate 10278   48.27     48.61   76.10
   3         high  5054   23.74     23.90  100.00
   4           NA   147    0.69        NA      NA

And what about education and gender across countries? But this time we would like to see the row percentages…

ess %>% 
  select(gndr, cntry, education) %>% 
  group_by(gndr) %>% 
  flat_table(margin = "row")

Grouped by:
Gender: Male
               education   low intermediate  high
cntry                                            
Austria                  17.86        69.21 12.93
Belgium                  33.93        44.38 21.69
Switzerland              18.19        61.39 20.42
Czech Republic           12.32        74.24 13.43
... (omitted rows)

Grouped by:
Gender: Female
               education   low intermediate  high
cntry                                            
Austria                  25.45        61.24 13.31
Belgium                  27.68        42.91 29.41
Switzerland              27.58        57.91 14.51
Czech Republic           12.41        75.36 12.23
... (omitted rows)

Nested data frames and models

Next, we want to apply some basic analyses, to see the relation for self-rated health and education in different countries, adjusted for gender and age of respondents. First, we need to find the health-variable. We see that this variable starts with „very good health“ for value 1, and „very bad health“ for value 5. We can reverse this scale, to have better health with higher values, and recode this into a new variable srh (self-rated health):

# find all health-variables
find_var(ess, "health", as.varlab = T)
# ok, found it. it's named "health". now get coding 
# of this variable - we see that better health is
# coded with lower values
get_labels(ess$health)
# reverse the scale. there is a "shortcut" pattern
# for doing this: "rev"
ess$srh <- rec(ess$health, "rev")

So, now we have our dependent variable srh and our independent variables education, gndr and agea.

Let’s get a quick impression about the relations between health and education across countries, by fitting linear models for each country. We can do this using nested data frames (see also this post on bootstrapping with list-variables). The nest()-function from the tidyr package can create subsets of a data frame, based on grouping criteria, and create a new list-variable, where each element itself is a data frame (so it’s nested, because we have data frames inside a data frame).

In the following example, we group the ESS-data by country, and „nest“ the groups. Now we get a data frame with two columns: First, the grouping variable (country) and second, the datasets (subsets) for each country as data frame, store in the list-variable „data“. The data frames in the subsets (in „data“) all contain the selected variables gndr, education, srh and agea.

# convert variable to labelled factor,
# because we then have the labels as
# factor levels in the output
ess$country <- to_label(ess$cntry, drop.levels = T)
ess %>%
  select(country, gndr, education, srh, agea) %>%
  group_by(country) %>%
  nest()

# A tibble: 21 × 2
          country                 data
           <fctr>               <list>
1         Austria <tibble [1,795 × 4]>
2         Belgium <tibble [1,769 × 4]>
3     Switzerland <tibble [1,532 × 4]>
4  Czech Republic <tibble [2,148 × 4]>

... (omitted rows)

spread_coef() – get back coefficients of nested models

Using the map()-function from the purrr-package, we can iterate this list and apply any function on each data frame in the list-variable „data“. We want to apply the lm()-function to the list-variable, to run linear models for all country datasets. The results of these linear regressions are stored in another list-variable, „models“ (created with the mutate()-function). To quickly access and look at the coefficients, we can use the spread_coef()-function:

ess$country <- to_label(ess$cntry, drop.levels = T)
ess %>%
  select(country, gndr, education, srh, agea) %>%
  group_by(country) %>%
  nest() %>%
  mutate(models = purrr::map(data, ~ 
    lm(srh ~ education + gndr + agea, 
       data = .))) %>%
  spread_coef(models)

# A tibble: 21 × 7
          country                 data   models `(Intercept)`         agea education         gndr
           <fctr>               <list>   <list>         <dbl>        <dbl>     <dbl>        <dbl>
1         Austria <tibble [1,795 × 4]> <S3: lm>      4.682988 -0.018950774 0.1577497 -0.021005919
2         Belgium <tibble [1,769 × 4]> <S3: lm>      4.325110 -0.012303883 0.1697290 -0.086008553
3     Switzerland <tibble [1,532 × 4]> <S3: lm>      4.435222 -0.011108490 0.1437029 -0.042177216
4  Czech Republic <tibble [2,148 × 4]> <S3: lm>      5.094801 -0.029581163 0.1405284 -0.049518435
5         Germany <tibble [3,045 × 4]> <S3: lm>      4.046202 -0.014770345 0.2021526 -0.051458728
6         Denmark <tibble [1,502 × 4]> <S3: lm>      4.240526 -0.008165113 0.1880305 -0.115486662
7         Estonia <tibble [2,051 × 4]> <S3: lm>      4.098305 -0.023430708 0.2406452  0.003303693
8           Spain <tibble [1,925 × 4]> <S3: lm>      4.613964 -0.018838560 0.1452714 -0.169171898
9         Finland <tibble [2,087 × 4]> <S3: lm>      4.306064 -0.015847129 0.1823691 -0.030827809
10         France <tibble [1,917 × 4]> <S3: lm>      4.143245 -0.013747901 0.2183592 -0.119741127
# ... with 11 more rows

Ok, we see that higher education is positively associated with higher self-rated health, while higher age is negatively correlated to higher self-rated health. What about gender? Forgot the coding of female and male…

get_labels(ess$gndr, include.values = T)

     1           2           9 
"Male"    "Female" "No answer"

Male persons seem to rate their health better than female respondents.

Final words

In this post, I wanted to show a „real life“ approach on exploring unknown datasets. In this example, a labelled dataset was used and shown, how you can work with labelled data in R. The on my sjmisc-package can be useful when performing data analysis, especially in the step of data transformation and exploration. A main focus of this package is to make life with labelled data and data transformation easier, and to integrate into modern workflows (like working with the pipe-operator).

This post demonstrated only a few of the functions the sjmisc-package offers, but I hope you got an impression of its application and functionality.

11 Kommentare zu „Exploring the European Social Survey (ESS) – pipe-friendly workflow with sjmisc, part 2 #rstats #tidyverse“

ajdamico sagt:

22. Dezember 2016 um 20:51

sorry i think this is wrong? the european social survey is a complex sample that should not be analyzed like a simple random sample. you can use hadley’s functions with the srvyr package, but you are leading your readers to calculate the wrong results off of the microdata. if you want to match numbers published by the creators of ESS, start here https://github.com/ajdamico/asdfree/tree/master/European%20Social%20Survey sorry 😦
1. Daniel sagt:
  
  22. Dezember 2016 um 21:21
  
  I know there are variables for design-weights and post-stratification weights, both for descriptive and multivariate statistics, which I did not mention here. The purpose was more to show how to start data exploration, not how to conduct final analyses including weighting or more sophisticated methods. So, please read this posting cum grano salis.
Joe sagt:

23. Dezember 2016 um 12:11

It looks like a nice package, and I’ll certainly take a closer look at it. I was wondering whether the package contains any functions for publication-ready tables (someting better than pander() or kable(), which are pretty limited (no spanned columns, for instance). Your output looks nice, but how you move from console output to presentation-ready tables?
1. Daniel sagt:
  
  23. Dezember 2016 um 12:45
  
  The package for figures and tables is sjPlot, which will also be updated soon. You can find some of the function more detailled online here. And some sjPlot-related postings here as well.
  1. Joe sagt:
    
    23. Dezember 2016 um 13:29
    
    Oh, right, I thought the blog sounded familiar. I’ve used that package (and really liked it) but it’s basically HTML output only, isn’t it? (that’s not a criticism, because I know how difficult it is for all formats). If so, what do you for other formats? Do you copy the table into, say, Word. Can you get a nice PDF?
    
    Happy holidays, by the way.
  2. Daniel sagt:
    
    23. Dezember 2016 um 18:20
    
    HTML-tables are, in my experience, the best way to get them into word. Indeed, just copy & paste, or save it as file and load the file into word. If I need a PDF, I create that PDF from Word/Libre Office. From R Markdown or knitr, PDF’s with included HTML-tables are difficult. You should then better use packages like stargazer or xtable.
Joel Gombin sagt:

23. Dezember 2016 um 14:22

Also, the linear model is wrong, right? I think it uses categorical variables (including the dependant variable) as if they were continuous ones, is that right?
Just like Anthony, sorry to sound so negative, as I find your package ad your work great besides that!
1. Daniel sagt:
  
  23. Dezember 2016 um 18:15
  
  Yes, education and gender should usually be categorical, a 5-point-scale is sometimes used as continuous, sometimes not (in various social science research). The model is quick and dirty, anyway. Just to demonstrate the spread_coef()-function (however, even if education is categorical, the message is the same: you find better self-rated health in higher educated groups).
2. Daniel sagt:
  
  24. Dezember 2016 um 08:40
  
  Btw, you don’t sound negative, it’s justified comments… perhaps I should have thought about better statistical application or being more clear about the example character of some code parts above.
Antonio Rodriguez sagt:

27. April 2018 um 20:11

Daniel, this is quite interesting indeed. But I find that if we deal with all waves of the ESS, we need to use other R pacakges and functions. The only way is to use fread from data.table pacakge, and save it as binary file later. We do not keep labels for the variables with this option.
1. Daniel sagt:
  
  27. April 2018 um 19:32
  
  Antonio, why do you need fread from data.table? We have just submitted a paper using all 8 waves and read the SPSS data with haven resp. sjlabelled.

Kommentare sind geschlossen.