Data Visualization in R: a ggplot2 primer

Visualization is one of the most efficient ways to present results. In statistical analysis, R is a popular programming language for initial exploratory analysis and statistical modelling. ggplot2 is one of the most popular packages in R’s tidyverse and is primarily used to build visualizations. This blog post provides a quick guide to getting started with the package.

The “gg” in ggplot2 stands for “grammar of graphics”. Think of it as the grammar of a visualization language: the package is built so that constructing a plot resembles constructing a meaningful sentence. As in a sentence, you combine different types of elements to create any kind of graphic display. Let’s look at the most prominent building blocks of a plot in ggplot2:

1. Data

The goal of any visualization is to present data in an effective manner. There is a feedback loop: to have an effective visualization, it’s imperative to have the data in a well-formatted and concise form. The dataset needs to be in a ‘tidy’ state (avid R users will get the reference). What do I mean when I say tidy data? Each row should contain all of the information you want to plot for a particular point. Design decisions directly drive the formatting of the data: should the dataset be wide or long? Is scaling required? Would faceting or grouping by certain columns make a better input for the visualization?

Building a data manipulation pipeline is highly recommended, and I could dedicate a whole other blog post to it, so stay tuned! For the current exercise I will provide the R commands used to manipulate the data for each plot being discussed.

2. Aesthetic mapping

In the ggplot2 universe, aesthetic means “something you can see”. This building block refers to the primary elements of a visualization: what goes on the x and y axes? What colors should the marks be? What is the shape of each data point? In summary, here are some aesthetic elements of a ggplot2 command:

·       Position (i.e., on the x and y axes)

·       Color (“outside” color)

·       Fill (“inside” color)

·       Shape (of points)

·       Line type

·       Size
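To make the list above concrete, here is a minimal sketch of how aesthetics show up in a command. It uses the mpg dataset that ships with ggplot2 purely as a stand-in (it is not part of the original walkthrough), and the object names are my own. The idea to notice is the difference between mapping an aesthetic to a column inside aes() and setting it to a fixed value outside aes().

```r
library(ggplot2)

# mpg ships with ggplot2; it is used here only as a stand-in dataset.
# Mapping: aesthetics inside aes() are driven by columns of the data.
p_mapped <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point()

# Setting: aesthetics outside aes() are fixed values applied to every point.
p_fixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue", shape = 17, size = 3)

# Fill is the "inside" color of the bars; color and linetype style the outline.
p_fill <- ggplot(data = mpg, mapping = aes(x = class, fill = drv)) +
  geom_bar(color = "black", linetype = "dashed")

print(p_mapped)
```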

 

3. Geometric objects

After you have finalized the data and decided on the aesthetic elements, it’s time to put them together using a geometric block. These are the actual marks that define what kind of plot will be produced. If you have some background in visualization, you will recognize the ggplot2 geometric elements very easily:

·       Points (geom_point, for scatter plots, dot plots, etc.),

·       Lines (geom_line, for time series, trend lines, etc.),

·       Boxplot (geom_boxplot)

·       And so much more…
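As a rough illustration of the geoms listed above, the sketch below builds one plot per geom. The mpg and economics datasets ship with ggplot2 and are stand-ins chosen for this example only; the object names are hypothetical.

```r
library(ggplot2)

# Points: engine size against highway mileage (mpg ships with ggplot2).
p_points <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Lines: a time series (economics also ships with ggplot2).
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Boxplot: distribution of highway mileage per drive type.
p_box <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()

print(p_box)
```

The data and aesthetic mapping follow the same pattern each time; only the geom changes the kind of plot you get.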

Now that you have a basic understanding of how ggplot2 works, it’s time to start making plots!

We will be using the popular nycflights13 dataset, which contains on-time airline data for all flights departing NYC in 2013. Also included is useful metadata on airlines, airports, weather and planes.

1.png
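If you want to follow along, a minimal setup could look like the sketch below. It assumes the nycflights13, dplyr and ggplot2 packages are already installed; glimpse() is just one convenient way to peek at the columns.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

# The main table is flights; the package also ships the metadata tables
# airlines, airports, weather and planes.
glimpse(flights)
```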

Scatterplot:

We will use a scatterplot to identify any relationship between the arrival and departure delays of United Airlines flights. Here is the initial data preparation required to do that:

2.png
3.png
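A sketch of what this preparation could look like is shown below. It is an assumption on my part that the filter is on the carrier code "UA" (United Airlines, as mentioned in the text); the data frame name all_alaska_flights is kept because that is the name used in the walkthrough of the plotting command.

```r
library(nycflights13)
library(dplyr)

# Assumption: the text talks about United Airlines, whose carrier code is "UA".
# The name all_alaska_flights matches the data frame name used in this post.
all_alaska_flights <- flights %>%
  filter(carrier == "UA")

glimpse(all_alaska_flights)
```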

And here is the scatterplot command:

4.png
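Put together from the pieces this section describes (the data argument, the aes() mapping and geom_point()), the command might look like the sketch below. The data preparation is repeated so the snippet runs on its own, and the "UA" filter is an assumption based on the airline named in the text.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

# Assumed preparation step: keep only United Airlines ("UA") flights.
all_alaska_flights <- flights %>%
  filter(carrier == "UA")

# Data + aesthetic mapping in ggplot(), then a layer of points added with +.
delay_scatter <- ggplot(data = all_alaska_flights,
                        mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point()

# Rows with missing delays are dropped by geom_point() with a warning.
print(delay_scatter)
```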

Let us understand what is happening in the above command. Within the ggplot() function call, we specify two of the components of the grammar:

  • The data frame, all_alaska_flights, by setting data = all_alaska_flights

  • Aesthetic mapping by setting aes(x = dep_delay, y = arr_delay)

We add a layer to the ggplot() function call using the + sign. The layer in question specifies the third component of the grammar: the geometric object. In this case the geometric objects are points, set by specifying geom_point().

Some notes on layers:

·       The + sign comes at the end of lines, not at the beginning. You'll get an error in R if you put it at the beginning.

·       When adding layers to a plot, you are encouraged to hit Return on your keyboard after entering the + so that the code for each layer is on a new line. As we add more and more layers to plots, you'll see this will greatly improve the legibility of your code.

And here is the output of the command:

5.png

There is clearly some over-plotting in our plot: many points sit on top of one another. The first way of relieving over-plotting is to change the alpha argument in geom_point(), which controls the transparency of the points. By default, this value is set to 1. We can change it to any value between 0 and 1, where 0 makes the points 100% transparent and 1 makes them 100% opaque.

6.png
7.png
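As a sketch, the transparency tweak could be written as follows. The value 0.2 is an arbitrary choice for illustration, and the "UA" filter is again an assumption based on the airline named in the text.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

# Assumed preparation step: keep only United Airlines ("UA") flights.
all_alaska_flights <- flights %>%
  filter(carrier == "UA")

# alpha is set outside aes(), so the same transparency applies to every point.
delay_scatter_alpha <- ggplot(data = all_alaska_flights,
                              mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2)

print(delay_scatter_alpha)
```

With semi-transparent points, dense regions appear darker, which makes it easier to see where most flights fall.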

Another popular method is to add a color component. In the given dataset, we can map the origin airport to color as a differentiating dimension in the scatterplot.

8.png
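One way this could be written is sketched below: origin is mapped to color inside aes(), so each departure airport gets its own color and a legend. Combining it with alpha is my own addition, and the "UA" filter remains an assumption.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

# Assumed preparation step: keep only United Airlines ("UA") flights.
all_alaska_flights <- flights %>%
  filter(carrier == "UA")

# color = origin sits inside aes() because it depends on a column of the data.
delay_scatter_color <- ggplot(data = all_alaska_flights,
                              mapping = aes(x = dep_delay, y = arr_delay,
                                            color = origin)) +
  geom_point(alpha = 0.2)

print(delay_scatter_color)
```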

Line Chart:

Let’s plot a line chart to understand variations in early January (first 15 days) weather at the Newark airport.

As usual, we need to do some data prep for the graphic:

10.png
11.png
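A possible version of this preparation is sketched below. It filters the weather table from nycflights13 to Newark (origin code "EWR") and the first 15 days of January; the name early_january_weather is a hypothetical choice.

```r
library(nycflights13)
library(dplyr)

# Keep hourly weather records for Newark ("EWR") from January 1 to January 15.
early_january_weather <- weather %>%
  filter(origin == "EWR", month == 1, day <= 15)

glimpse(early_january_weather)
```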

Here is what the ggplot command looks like:

12.png
13.png
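For reference, a line chart along these lines could be written as in the sketch below. Plotting temp against time_hour is an assumption about which weather variable is shown; both are columns of the weather table.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

# Assumed preparation step: Newark weather for the first 15 days of January.
early_january_weather <- weather %>%
  filter(origin == "EWR", month == 1, day <= 15)

# geom_line() connects the hourly observations in time order.
temp_line <- ggplot(data = early_january_weather,
                    mapping = aes(x = time_hour, y = temp)) +
  geom_line()

print(temp_line)
```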

Box Plot:

We will use ggplot2 to create a box plot of temperature variations over the year.

14.png

Do you see something amiss with the plot above? We are plotting a numerical variable, not a categorical one, on the x-axis. This gives us a single overall boxplot without any groupings. We can get around this by wrapping our x variable in the factor() function:

16.png
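Both versions can be sketched as follows, assuming the box plot uses the month and temp columns of the weather table (the object names are hypothetical). The only difference between the two commands is the factor() call around month.

```r
library(nycflights13)
library(ggplot2)

# Numeric x: ggplot2 draws one overall box and warns about the continuous x.
temp_box_numeric <- ggplot(data = weather, mapping = aes(x = month, y = temp)) +
  geom_boxplot()

# Categorical x: factor(month) yields one box per month.
temp_box_factor <- ggplot(data = weather,
                          mapping = aes(x = factor(month), y = temp)) +
  geom_boxplot()

print(temp_box_factor)
```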

Bar Chart:

Next, we will plot a bar chart to compare the number of flights from each airline at the New York airports.

18.png
19.png
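A minimal sketch of such a bar chart is shown below. It relies on geom_bar() counting the rows per carrier, so no separate counting step is needed; the original commands may have prepared the counts differently.

```r
library(nycflights13)
library(ggplot2)

# geom_bar() counts the number of rows (flights) for each carrier code.
carrier_bar <- ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar()

print(carrier_bar)
```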

We can modify the aesthetic elements of the bar chart to plot different variations of the same dataset. We can also use bar charts to compare two categorical variables; this variation is called a stacked bar chart.

20.png

Here is the new ggplot2 bar chart command:

21.png
22.png
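As a sketch, a stacked bar chart can be obtained by mapping a second categorical variable to fill. Using origin as that second variable is an assumption on my part.

```r
library(nycflights13)
library(ggplot2)

# fill = origin splits each carrier's bar into stacked segments per airport.
carrier_bar_stacked <- ggplot(data = flights,
                              mapping = aes(x = carrier, fill = origin)) +
  geom_bar()

print(carrier_bar_stacked)
```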

Another variation on the stacked bar chart is the side-by-side bar chart, also called a dodged bar chart.

23.png
24.png
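The dodged version might look like the sketch below: the mapping is unchanged and only the position argument of geom_bar() differs. As before, origin as the fill variable is an assumption.

```r
library(nycflights13)
library(ggplot2)

# position = "dodge" places the per-airport bars next to each other
# instead of stacking them.
carrier_bar_dodged <- ggplot(data = flights,
                             mapping = aes(x = carrier, fill = origin)) +
  geom_bar(position = "dodge")

print(carrier_bar_dodged)
```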

We can also create a faceted bar chart, which draws a separate bar plot for each level of a categorical variable and makes it easier to compare values of a common variable across panels.

25.png
26.png
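A faceted bar chart could be sketched as below, assuming origin is the variable used to split the panels. The single-column layout (ncol = 1) is an illustrative choice that keeps the carrier axis aligned across panels.

```r
library(nycflights13)
library(ggplot2)

# facet_wrap() draws one panel per origin airport, sharing the same axes.
carrier_bar_faceted <- ggplot(data = flights, mapping = aes(x = carrier)) +
  geom_bar() +
  facet_wrap(~ origin, ncol = 1)

print(carrier_bar_faceted)
```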

Heat map:

We will build a heat map of departure delays for all days of the week at JFK airport. We will use the color aspect of the heat map to compare delays across all hours of each day of the week. It sounds intense, but we can achieve it easily through a heat map.

Let’s do the data preparation first:

27.png
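One way to prepare the data is sketched below. It is an assumption that the heat map shows the average departure delay per weekday and scheduled hour; the names day_labels, jfk_delays, weekday and avg_dep_delay are hypothetical.

```r
library(nycflights13)
library(dplyr)

day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

jfk_delays <- flights %>%
  # Keep JFK departures with a recorded departure delay.
  filter(origin == "JFK", !is.na(dep_delay)) %>%
  mutate(
    # as.POSIXlt()$wday counts days from 0 (Sunday) to 6 (Saturday).
    weekday = factor(day_labels[as.POSIXlt(time_hour)$wday + 1],
                     levels = day_labels)
  ) %>%
  # One row per weekday and scheduled departure hour.
  group_by(weekday, hour) %>%
  summarise(avg_dep_delay = mean(dep_delay), .groups = "drop")

glimpse(jfk_delays)
```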

Here is the command:
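A heat map command in this spirit could look like the sketch below, using geom_tile() with the average delay mapped to fill. The summary table, the column names and the white-to-red color scale are all assumptions made for illustration.

```r
library(nycflights13)
library(dplyr)
library(ggplot2)

day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Assumed preparation: average departure delay per weekday and hour at JFK.
jfk_delays <- flights %>%
  filter(origin == "JFK", !is.na(dep_delay)) %>%
  mutate(weekday = factor(day_labels[as.POSIXlt(time_hour)$wday + 1],
                          levels = day_labels)) %>%
  group_by(weekday, hour) %>%
  summarise(avg_dep_delay = mean(dep_delay), .groups = "drop")

# Each tile is one weekday/hour cell; its fill encodes the average delay.
delay_heatmap <- ggplot(data = jfk_delays,
                        mapping = aes(x = hour, y = weekday,
                                      fill = avg_dep_delay)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red")

print(delay_heatmap)
```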

Again, as with the other plots, there are ample ways you can manipulate the heat map above. Hopefully this blog post has equipped you with the skills to start exploring and using ggplot2 in your work.

Do you have any questions on the commands above? Or do you want to connect and learn more about how to use ggplot2 for your work? Please comment on this post and I will reach out to you.

All content in this post has been sourced from the Eastside chapter of the Seattle useR group, of which ProCogia is a co-organizer. Here is the group’s GitHub page: https://github.com/SeattleRUser/2018-08-23_Data_Visualization_in_R.