Easily plotting grouped bars with ggplot #rstats

Summary
This tutorial shows how to create diagrams with grouped bar charts or dot plots with ggplot. The groups can also be displayed as facet grids.

Importing the data from SPSS
All of the following examples are based on an imported SPSS data set. Refer to this posting for more details on how to do that and to my script page to download the scripts. This is important to know because the way the variable and value labels are accessed may depend on whether you use an imported SPSS data set or not (i.e. you may have to change parameters to get the examples running).

You can, for instance, import your SPSS data like this, if you are using my script:
source("sjImportSPSS.R")

efc <- importSPSS("GER_Services_FU_PV_dt.sav")
efc_vars <- getVariableLabels(efc)
efc_labels <- getValueLabels(efc)

The R script
You can download the script from my script page. I will not describe the code in detail because the source code is (hopefully) well commented. Basically, the script just transforms the data from two variables (one count variable with categories and one grouping variable) to fit ggplot’s requirements for plotting bar charts. You can use many parameters to change the style of the output, e.g. you can plot bars or dots, dodged or stacked bars, change colors etc., and you don’t need to know how this works in ggplot. You simply pass your “preferred settings” as parameters.
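
To give a rough idea of what such a transformation can look like, here is a minimal sketch. It is not taken from the script itself: the two vectors are made-up stand-ins for a count variable and a grouping variable, and the ggplot call is only one plausible way to draw dodged bars from the reshaped data. The plot is only drawn if ggplot2 is installed.

```r
# Hypothetical stand-in data, made up purely for illustration.
dependency <- c(1, 2, 2, 3, 4, 4, 4, 1, 3, 2)   # count variable (categories)
group      <- c(0, 0, 1, 1, 0, 1, 1, 0, 0, 1)   # grouping variable

# Cross-tabulate both variables and reshape the table into long format:
# one row per category/group combination, plus its frequency.
counts <- as.data.frame(table(category = dependency, group = group))

# 'counts' now has the columns category, group and Freq.
print(counts)

# Draw dodged bars, but only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts, aes(x = category, y = Freq, fill = group)) +
    geom_bar(stat = "identity", position = "dodge")
  print(p)
}
```

The long format (one row per bar) is what makes it easy to map the category to the x-axis and the group to the fill color.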

You can include the script via this single line:

source("sjPlotGroupFrequencies.R")

Examples
The minimal requirements for this function to work are two variables: one whose frequencies should be plotted and one that indicates the groups. Once you’ve imported a data set and included the script as shown above, you can plot a bar graph with default settings like this:

sjp.grpfrq(efc$e42dep,
           efc$n1pv_ovs)

The varCount parameter requires the variable containing the frequencies that should be plotted. varGroup indicates the groups. The result is a kind of “graphical” cross tabulation of varCount and varGroup:

Default grouped bar charts from my function, using ggplot in R

This graph provides no information on category or variable labels. However, you can pass this information as parameters as well:

sjp.grpfrq(efc$e42dep, 
           efc$n1pv_ovs, 
           title = efc_vars['e42dep'], 
           axisLabels.x = efc_labels[['e42dep']],
           legendLabels = efc_labels[['n1pv_ovs']],
           legendTitle = efc_vars[['n1pv_ovs']],
           barColor = "brewer",
           colorPalette = "PuBu",
           showPercentageValues = FALSE)
Grouped bars with different color palette, category and legend labels and diagram title.

As you can see above, category labels can be passed as a parameter, typically a character vector. If you have used my script to import data from SPSS, you simply access the category labels with efc_labels[['--variablename--']]. The variable labels are stored in efc_vars. Furthermore, you can see that the bar colors have changed. You can either use your own color values, gs for a greyscale, or brewer in combination with colorPalette to use any of the pre-defined color brewer palettes supported by ggplot. You will find further information in the documentation of the R script.

You can also plot dots instead of bars:

sjp.grpfrq(efc$e42dep, 
           efc$n1pv_ovs, 
           title = efc_vars['e42dep'], 
           axisLabels.x = efc_labels[['e42dep']],
           legendLabels = efc_labels[['n1pv_ovs']],
           legendTitle = efc_vars[['n1pv_ovs']],
           barColor = "brewer",
           type = "dots",
           showPercentageValues = FALSE)
Dot plot with different color palette, no percentage values shown.

As you can see in the graph above, the dots are dodged to avoid overlapping. Since this may make it harder to tell which dot belongs to which group, shaded rectangles surrounding each group are plotted by default. You can, of course, switch them off as well.

If you prefer stacked bars, you can do this with the barPosition parameter:

sjp.grpfrq(efc$e42dep, 
           efc$n1pv_ovs, 
           title = efc_vars['e42dep'], 
           axisLabels.x = efc_labels[['e42dep']],
           legendLabels = efc_labels[['n1pv_ovs']],
           legendTitle = efc_vars[['n1pv_ovs']],
           barPosition = "stack",
           showPercentageValues = FALSE,
           upperYlim = 350)
Stacked group bar chart, with default color palette and fixed upper y-axis limit.

If you like, you can also display each category in a single diagram using facet grids. Please note that facet grids calculate category percentages, i.e. each category sums up to 100%, while the other plots use group percentages, i.e. each group sums up to 100%!

sjp.grpfrq(efc$e42dep, 
           efc$n1pv_ovs, 
           title = efc_vars['e42dep'], 
           legendLabels = efc_labels[['n1pv_ovs']],
           legendTitle = efc_vars[['n1pv_ovs']],
           useFacetGrid=TRUE)
Facet grid with one diagram per group, each diagram representing one category.
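
The difference between category percentages and group percentages can be sketched in base R. This is only an illustration with made-up data, not the code the script uses: prop.table() is applied once per row (category) and once per column (group) of a cross tabulation.

```r
# Hypothetical stand-in data, made up purely for illustration.
dependency <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)   # count variable (categories)
group      <- c(0, 0, 1, 0, 1, 0, 1, 1, 1, 1)   # grouping variable

tab <- table(category = dependency, group = group)

# Category percentages: each row (category) sums up to 100%.
category_pct <- 100 * prop.table(tab, margin = 1)

# Group percentages: each column (group) sums up to 100%.
group_pct <- 100 * prop.table(tab, margin = 2)

print(round(category_pct, 1))
print(round(group_pct, 1))
```

The same cell therefore shows a different percentage depending on which margin it is related to, which is why the facet grid values differ from those in the other plots.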

This example demonstrates what dots in a facet grid may look like.

sjp.grpfrq(efc$e42dep, 
           convertToLabel(efc$n1pv_ovs), 
           title = efc_vars['e42dep'], 
           axisLabels.x = efc_labels[['e42dep']],
           useFacetGrid=TRUE,
           type="dots",
           omitNA=FALSE,
           hideLegend=TRUE,
           axisLabelAngle.x=45,
           axisLabelSize=0.9)
Facet grid with dot plots, using grid title instead of legend.

As you can see, I have replaced the factor values of the grouping variable (which were “0”, “1”, “2” etc.) with the related category or factor labels. This can easily be done with the convertToLabel function, which is also included in the sjImportSPSS script. As a result, the group variable now contains “CL 1” or “CL 2” as values instead of “1” or “2”. This makes it possible to plot the group titles into each facet, which makes the legend unnecessary. In this case, I would recommend removing the legend, because the facets are ordered alphabetically, while the legend is not. This may lead to differences between legend colors and facet colors.
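
The idea behind replacing values with their labels can be sketched in base R as follows. This is not the implementation of convertToLabel, just a minimal illustration with made-up values and labels, using factor() to map each value onto its label.

```r
# Hypothetical values and value labels, made up purely for illustration.
values       <- c(1, 2, 2, 3, 1)
value_labels <- c("CL 1", "CL 2", "CL 3")

# Map each numeric value onto its label. The levels are the sorted unique
# values, so the i-th label belongs to the i-th smallest value.
group_labelled <- factor(values,
                         levels = sort(unique(values)),
                         labels = value_labels)

print(group_labelled)
```

Note that this sketch assumes every value actually occurs in the data and that the labels are given in the order of the sorted values.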

And finally, three examples for plotting distributions of metric scales, e.g. age. First, a grouped histogram chart, with mean intercept lines for each group.

sjp.grpfrq(efc$e17age, 
           efc$e16sex,
           legendLabels = efc_labels[['e16sex']],
           type="hist",
           showValueLabels=FALSE,
           axisTitle.x ="Age of Elderly Dependent",
           showMeanIntercept=TRUE)
Grouped histogram with mean intercept line for each group
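
The mean intercept lines mark the mean of each group. How such group means can be computed is sketched below with made-up data; this is an assumption about the general approach, not code from the script.

```r
# Hypothetical stand-in data, made up purely for illustration.
age <- c(70, 82, 91, 65, 77, 88)
sex <- c("male", "female", "female", "male", "male", "female")

# Mean age per group; the result is a named vector with one mean per group.
group_means <- tapply(age, sex, mean)

print(group_means)
```

In ggplot, values like these could then be drawn as vertical lines, for instance with geom_vline.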

If you wish, you can also display box plots for each group. Each group’s mean value is displayed as a white point inside the box.

sjp.grpfrq(efc$e17age, 
           efc$e16sex,
           legendLabels = efc_labels[['e16sex']],
           type="box",
           barColor="brewer",
           colorPalette="Pastel1",
           axisTitle.y="Age of Elderly Dependent")
Box plot with white mean-point

Or, if you don’t want to lose information on the distribution, you may want to use violin plots. These are “mirrored” density curves for each group. Additionally, a small box plot is plotted inside the violins.

sjp.grpfrq(efc$e17age, 
           efc$e16sex,
           legendLabels = efc_labels[['e16sex']],
           type="v",
           barColor="brewer",
           colorPalette="Pastel1",
           axisTitle.y="Age of Elderly Dependent")
Violin plot with small box plot inside. Mean value is indicated by the black point.

You can, of course, change many more parameters. A complete overview with short information on each parameter is in the documentation inside the R script.

Have fun!

9 thoughts on “Easily plotting grouped bars with ggplot #rstats”

  1. An observation: this kind of thing discourages me from using ggplot, because it seems so complicated. In base graphics, on the other hand, this gets almost the same result, using a data frame “cars” as below:

    > str(cars)
    'data.frame': 38 obs. of 6 variables:
    $ Car : Factor w/ 38 levels "AMC Concord D/L",..: 7 18 28 20 30 37 31 26 6 3 ...
    $ MPG : num 28.4 30.9 20.8 37.3 16.2 31.9 34.2 34.1 16.9 20.3 ...
    $ Weight : num 2.67 2.23 3.07 2.13 3.41 ...
    $ Cylinders : int 4 4 6 4 6 4 4 4 8 5 ...
    $ Horsepower: int 90 75 85 69 133 71 70 65 155 103 ...
    $ Country : Factor w/ 6 levels "France","Germany",..: 6 6 6 3 1 2 6 4 6 2 ...
    > tbl=with(cars,
    + table(Cylinders,Country),
    + )
    > tbl
             Country
    Cylinders France Germany Italy Japan Sweden U.S.
            4      0       4     1     6      1    7
            5      0       1     0     0      0    0
            6      1       0     0     1      1    7
            8      0       0     0     0      0    8
    > barplot(tbl)
    > barplot(tbl,beside=T)

    The first barplot is stacked bars, the second one is bars beside each other.

    • Well, it’s not that complicated, actually. ggplot offers quick plots “on the fly” as well which give the same result, so there’s no difference between “standard” plotting functions and ggplot.

      ggplot, however, shows its strength when it comes to fine-tuning a graph’s appearance. That’s where ggplot starts looking complicated. But if you want more graph manipulation with the standard functions, it looks complicated as well: let’s say you want to add lines or values to the bar plot, then you have to use further commands to get the desired result.

      Here is an example for plotting VIF values (Variance Inflation Factors) as bar charts to check the predictors of a linear regression for multicollinearity. This bar chart should include the value labels and horizontal lines that indicate the acceptable limits for the VIF values. Now the standard bar plot starts to look more complicated, too.

          # val (the VIF values) and upperLimit (the upper y-axis limit)
          # must be defined beforehand
          barX <- barplot(val, # VIF values
                          main="Variance Inflation Factors",
                          las=2, # rotate axis labels perpendicular to the axis
                          cex.axis=.7, # decrease axis char size
                          cex.names=.7, # decrease label char size
                          ylim=c(0,upperLimit),
                          col=rgb(128,172,200,maxColorValue=255))
          # set horizontal line to indicate strong limit of VIF
          abline(h=5.0, lty=2, col="darkgreen")
          # set 2nd horizontal line to indicate weak, but still tolerable limit of VIF
          abline(h=10.0, lty=2, col="darkred")
          # print VIF values above the bars
          # xpd=TRUE means that the text is plotted even if it lies outside 
          # of the plot area, and par("cxy") gives the size of a typical 
          # character in the current user coordinate system. 
          text(cex=.7, x=barX, y=val+par("cxy")[2]/2, round(val,2), col="black", xpd=TRUE)

      Possible reasons why my source code might look complicated:
      1.) I’m new to R, so my code, and probably the ggplot term in particular, may not be written in the most optimized way.
      2.) I have to deal with all arguments passed as parameters to create the ggplot object. If you just used “fixed” values, you could shorten the ggplot term a lot.

      Still, it seems to me that ggplot is, at least at the beginning, a bit more complicated. But once you know how it works, it gives you many options to manipulate your graphical output, which is one of the main reasons I started learning R and ggplot as well. The results you get with ggplot go beyond what is possible with the basic plotting functions; that’s at least my impression.

      • That’s a good answer. The examples that I have seen with ggplot have used (many) options, but there is no need to use them at first. I looked at your example with VIFs, and that looked less complicated to me, but I realized that this was because I know about the options to “plot”. In terms of complexity of code, it is about the same as your ggplot (which accomplishes more).

        I have seen a lot of people using ggplot, so that it does seem worth learning. As with anything new, there is a “learning curve” that has to be climbed.

        I wanted to write this reply in German, but I realized that my German is too weak!

  2. [...] remark In between I have also updated my other scripts. For instance, the sjPlotGroupFrequencies.R function can now also plot box plots or violin plots (see examples at the end of that posting). So [...]

  3. Hi,
    could you maybe post your dataset, or provide an example with a standard dataset from R.

    I’m trying to reproduce the first plot but I’m having a hard time because I don’t have your data and I don’t know exactly how the input data should be formatted.

    Thx

    • Well, here we probably see the influence of my “social science” background on how I approach visualizing data. I mainly deal with categorical or metric variables, Likert scales etc., so my functions are especially suited to data sets or variables that have “quasi metric” or Likert scales. If you look at the webpage of the research project whose data I refer to in my examples: http://www.uke.de/extern/eurofamcare/deli.php
      you find the questionnaire and get an impression of how variables can / should look like: http://www.uke.de/extern/eurofamcare/documents/deliverables/cat_uk.pdf

      Unfortunately, our data set is not yet public (though we are planning to publish it, and actually wanted to have done so by now). But I have found a public test data set, see my comment here:

      http://strengejacke.wordpress.com/2013/02/25/simplify-frequency-plots-with-ggplot-in-r-rstats/#comment-375

      If you download that data set, you can do the following:

      disease <- importSPSS("/Users/danielludecke/Desktop/disease.sav")
      source("lib/sjPlotGroupFrequencies.R")
      sjgfrq(disease$sector, disease$class)

      or you use

      disease <- importSPSS("/Users/danielludecke/Desktop/disease.sav")
      dislab <- getValueLabels(disease)
      disvars <- getVariableLabels(disease)
      source("lib/sjPlotGroupFrequencies.R")
      sjgfrq(disease$sector, disease$class, title=disvars['sector'], categoryLabels=dislab[['sector']], legendLabels=dislab[['class']])

      (the sector variable has no variable label and no value labels in the SPSS data set, so the title and categoryLabels parameters have no effect here).

      Hope that helps!

      Best wishes
      Daniel

  4. [...] Plotting grouped frequencies: sjPlotGroupFrequencies.R The grouped frequencies script has also been described in a separate posting. [...]

  5. […] Basic Introduction to ggplot2 – How to customize ggplot2 graphics – Easily plotting grouped bars with ggplot – Plotting lm and glm models with ggplot – Kalkalash! Pinpointing the Moments “The Simpsons” […]
