5 Descriptive Statistics and Visualisations

5.1 Introduction to Descriptive Statistics

Descriptive statistics provide a summary of the central tendency, dispersion, and shape of a dataset’s distribution. In SPSS, you may have used functions like FREQUENCIES or DESCRIPTIVES. In R, these tasks are just as straightforward, with added flexibility and power.

In this chapter we will utilise a fictional dataset called police_activity_data. You can download a copy by clicking on the link.

# Load the dataset from a CSV file
police_activity_data <- read.csv('data/police_activity_data.csv')

# Show the first 5 rows
head(police_activity_data, 5)
##   IncidentID       Date  Time       IncidentType ResponseTime OfficersInvolved   Outcome Borough IncidentSeverity
## 1     INC001 2024-08-31 15:32           Burglary           13                1 No Action   North                1
## 2     INC002 2024-08-15 19:15 Public Disturbance           12                4    Arrest    East                2
## 3     INC003 2024-08-19 07:27 Public Disturbance            7                2    Arrest    West                5
## 4     INC004 2024-08-14 02:48       Traffic Stop           14                2   Warning   South                2
## 5     INC005 2024-08-03 03:11       Traffic Stop           13                4   Warning    East                5

5.1.1 Understanding Descriptive Statistics

  • Measures of Central Tendency: These describe the center of the data (mean, median, mode).
  • Measures of Dispersion: These describe the spread of the data (range, variance, standard deviation, interquartile range).
  • Shape of Distribution: Skewness and kurtosis help describe the shape of the data distribution.
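As a rough illustration of the three groups of measures above, the sketch below computes each one with base R on a small made-up vector. The vector response_times and its values are assumptions for demonstration only and are not taken from police_activity_data. Base R has no built-in function for the statistical mode (mode() reports an object's storage type), so the example derives it from a frequency table. The skewness and kurtosis shown use one common moment-based definition, not the only one.

```r
# Hypothetical response times in minutes (illustrative values only)
response_times <- c(5, 6, 7, 7, 7, 8, 9, 10, 14, 20)

# Central tendency
mean_rt   <- mean(response_times)
median_rt <- median(response_times)

# Statistical mode: tabulate the values and keep the most frequent one(s)
freq    <- table(response_times)
mode_rt <- as.numeric(names(freq)[freq == max(freq)])

# Dispersion
range_rt <- range(response_times)   # minimum and maximum
var_rt   <- var(response_times)     # sample variance
sd_rt    <- sd(response_times)      # sample standard deviation
iqr_rt   <- IQR(response_times)     # interquartile range

# Shape: moment-based skewness and kurtosis (one of several definitions)
deviations <- response_times - mean_rt
skew_rt <- mean(deviations^3) / mean(deviations^2)^1.5
kurt_rt <- mean(deviations^4) / mean(deviations^2)^2
```

In this made-up example the mean (9.3) sits above the median (7.5) and the skewness is positive, the usual pattern for a distribution with a long right tail.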

5.1.2 Basic Descriptive Statistics in R

R provides multiple ways to compute descriptive statistics. Here are some basic functions.

The summary() function provides a summary of each variable in a dataset.

# Basic summary of all variables in a data frame
summary(police_activity_data)
##   IncidentID            Date               Time           IncidentType        ResponseTime   OfficersInvolved   Outcome            Borough          IncidentSeverity
##  Length:200         Length:200         Length:200         Length:200         Min.   : 5.00   Min.   :1.00     Length:200         Length:200         Min.   :1.00    
##  Class :character   Class :character   Class :character   Class :character   1st Qu.: 9.00   1st Qu.:2.00     Class :character   Class :character   1st Qu.:2.00    
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   Median :13.00   Median :3.00     Mode  :character   Mode  :character   Median :3.00    
##                                                                              Mean   :12.42   Mean   :3.03                                           Mean   :2.98    
##                                                                              3rd Qu.:16.00   3rd Qu.:4.00                                           3rd Qu.:4.00    
##                                                                              Max.   :20.00   Max.   :5.00                                           Max.   :5.00
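In the output above, summary() reports only Length, Class and Mode for character columns such as Outcome and Borough. For categorical variables a frequency table is usually more informative, similar in spirit to FREQUENCIES in SPSS. The sketch below uses a short made-up vector named outcome as a stand-in for a character column; the values are assumptions, not rows from the dataset.

```r
# Hypothetical stand-in for a character column such as Outcome
outcome <- c("Arrest", "Warning", "No Action", "Arrest", "Warning", "Arrest")

# Counts per category
outcome_counts <- table(outcome)
outcome_counts

# Proportions per category (they sum to 1)
outcome_share <- prop.table(outcome_counts)
round(outcome_share, 2)

# Converting to a factor makes summary() report counts per level
summary(factor(outcome))
```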

You can also use individual functions such as mean(), median() and sd() to calculate a single statistic. Note the na.rm = TRUE argument, which tells R to ignore NA (missing) values.

# Mean of a numeric variable
mean(police_activity_data$ResponseTime, na.rm = TRUE)
## [1] 12.42
# Median of a numeric variable
median(police_activity_data$ResponseTime, na.rm = TRUE)
## [1] 13
# Standard deviation of a numeric variable
sd(police_activity_data$ResponseTime, na.rm = TRUE)
## [1] 4.290178
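To see why na.rm = TRUE matters, here is a minimal sketch using a made-up vector with one missing value. The name times_with_gap and its contents are assumptions for illustration.

```r
# Hypothetical vector containing one missing value
times_with_gap <- c(8, 12, NA, 16)

# Without na.rm, a single NA makes the result NA
mean_default <- mean(times_with_gap)
mean_default
## [1] NA

# With na.rm = TRUE, the NA is dropped before the mean is calculated
mean_clean <- mean(times_with_gap, na.rm = TRUE)
mean_clean
## [1] 12

# Counting the missing values shows how much data is being ignored
n_missing <- sum(is.na(times_with_gap))
n_missing
## [1] 1
```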

Exercise!

Download the police_activity_data.csv file and load it into an R data frame. Produce a basic summary of the ResponseTime variable. What does the difference between the mean and the median tell you about the skewness of the data?

5.2 Creating Visualisations with ggplot2

Visualisations are key to understanding and presenting data. The ggplot2 package is a powerful tool for creating a wide variety of plots, from simple bar charts to complex multi-layered visualisations.

5.2.1 Introduction to ggplot2

ggplot2 is part of the tidyverse collection of packages and is based on the grammar of graphics. The basic structure of a ggplot2 plot involves:

  • Data: The dataset being used.
  • Aesthetics (aes): Mapping variables to visual properties like x, y, color, size.
  • Geometries (geom): The type of plot (e.g., geom_bar for bar charts, geom_point for scatter plots).

Install ggplot2 if you haven’t already:

install.packages("ggplot2")

5.2.2 Creating Basic Plots

Here’s how you can create some basic plots in R using ggplot2:

5.2.2.1 Bar Charts

Bar charts are used to display the frequency of categorical data.

  • aes(x = categorical_variable): Maps the categorical variable to the x-axis.
  • geom_bar(): Creates the bar chart.

# Load the ggplot2 library
library(ggplot2)

# Bar chart for a categorical variable
ggplot(your_data, aes(x = categorical_variable)) +
  geom_bar() +
  labs(title = "Bar Chart of Categorical Variable", x = "Category", y = "Count")
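The template above can be filled in as follows. This is a sketch on a small made-up data frame, assuming ggplot2 is installed. The data frame incidents and its Outcome values are stand-ins for illustration, not the real dataset.

```r
library(ggplot2)

# Hypothetical stand-in data with one categorical column
incidents <- data.frame(
  Outcome = c("Arrest", "Warning", "No Action", "Arrest", "Warning", "Arrest")
)

# geom_bar() counts the rows in each category, so no y aesthetic is needed
outcome_plot <- ggplot(incidents, aes(x = Outcome)) +
  geom_bar() +
  labs(title = "Incidents by Outcome", x = "Outcome", y = "Count")

outcome_plot
```

Storing the plot in an object such as outcome_plot is optional, but it lets you add layers or save the plot later.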

Exercise!

Create a bar chart of the Borough variable in the police_activity_data dataset. Which borough has the greatest number of incidents?

5.2.2.2 Histograms

Histograms show the distribution of a continuous variable.

  • geom_histogram(binwidth = 10): Creates the histogram with specified bin width.
  • fill and color: Customize the appearance.

ggplot(your_data, aes(x = continuous_variable)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(title = "Histogram of Continuous Variable", x = "Value", y = "Frequency")
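The choice of binwidth can change how a histogram reads. The sketch below draws the same made-up values with a narrow and a wide bin width so you can compare them. It assumes ggplot2 is installed, and the data frame demo is illustrative only.

```r
library(ggplot2)

# Hypothetical continuous values (illustrative only)
demo <- data.frame(value = c(5, 7, 8, 9, 11, 12, 13, 13, 14, 16, 17, 20))

# Narrow bins: more bars, more detail, but a noisier shape
narrow_bins <- ggplot(demo, aes(x = value)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Histogram with binwidth = 1", x = "Value", y = "Frequency")

# Wide bins: fewer bars, a smoother shape, but less detail
wide_bins <- ggplot(demo, aes(x = value)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  labs(title = "Histogram with binwidth = 5", x = "Value", y = "Frequency")

narrow_bins
wide_bins
```

A reasonable starting point is a bin width that gives somewhere between 5 and 20 bars across the range of the data, adjusted by eye from there.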

Exercise!

Create a histogram of the ResponseTime variable in the police_activity_data dataset, setting the bin width to 4. How is the response time distributed?

5.2.2.3 Boxplots

Boxplots display the distribution of a variable and its potential outliers.

  • aes(x = factor_variable, y = continuous_variable): Maps the factor variable to the x-axis and the continuous variable to the y-axis.
  • geom_boxplot(): Creates the boxplot.

ggplot(your_data, aes(x = factor_variable, y = continuous_variable)) +
  geom_boxplot() +
  labs(title = "Boxplot of Continuous Variable by Factor", x = "Factor", y = "Value")
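It can help to know what a boxplot actually draws. By default, geom_boxplot() draws the box from the first to the third quartile with a line at the median. Its whiskers reach the most extreme points within 1.5 times the interquartile range of the box, and anything beyond is plotted as an outlier. The base R sketch below reproduces those numbers for a made-up vector; the name wait_times and its values are assumptions for illustration.

```r
# Hypothetical values with one unusually large observation
wait_times <- c(5, 7, 8, 9, 10, 11, 12, 13, 14, 40)

# The box: first quartile, median and third quartile
quartiles <- quantile(wait_times, probs = c(0.25, 0.5, 0.75))
q1 <- quartiles[["25%"]]
q3 <- quartiles[["75%"]]
iqr <- q3 - q1

# The fences: points outside these limits are drawn as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- wait_times[wait_times < lower_fence | wait_times > upper_fence]
outliers
## [1] 40
```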

Exercise!

Create a boxplot of the ResponseTime variable for each Borough in the police_activity_data dataset. Which Borough has the lowest median response time? Which Borough has the smallest range of response times?

5.2.3 Customising Your Plots

One of the strengths of ggplot2 is its flexibility in customising plots. You can add further layers and settings to a plot with the + operator.

Adding Titles, Labels, and Themes

  • labs(): Adds titles and axis labels.
  • theme_minimal(): Applies a clean, minimalistic theme to the plot.

ggplot(your_data, aes(x = categorical_variable)) +
  geom_bar(fill = "lightblue", color = "black") +
  labs(title = "Bar Chart of Categorical Variable", x = "Category", y = "Count") +
  theme_minimal()

Using Colors to Enhance Visualisations

You can differentiate categories or highlight data points using color.

  • scale_fill_brewer(palette = "Pastel1"): Applies a color palette to the fill of the boxplots.

ggplot(your_data, aes(x = factor_variable, y = continuous_variable, fill = factor_variable)) +
  geom_boxplot() +
  labs(title = "Boxplot of Continuous Variable by Factor", x = "Factor", y = "Value") +
  scale_fill_brewer(palette = "Pastel1")

Exercise!

Take your boxplot of the ResponseTime variable for each Borough in the police_activity_data dataset and add some colour!

5.3 Descriptive Statistics with dplyr

dplyr is a powerful tool for data manipulation and is also useful for summarising data. It works well alongside ggplot2 for data exploration and visualisation.

5.3.1 Using dplyr to Summarise Data

You can summarise data by calculating various descriptive statistics for different groups.

  • group_by(factor_variable): Groups the data by the specified factor variable.
  • summarize(): Calculates summary statistics for each group (here, the mean and standard deviation).

# Load the dplyr library
library(dplyr)

# Summarise data: mean and standard deviation by group
summary_data <- your_data %>%
  group_by(factor_variable) %>%
  summarize(
    mean_value = mean(continuous_variable, na.rm = TRUE),
    sd_value = sd(continuous_variable, na.rm = TRUE)
  )
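Here is the same pattern applied to a small made-up data frame, assuming dplyr is installed. The data frame incidents and its values are stand-ins for illustration. The sketch also adds n() to count the rows in each group, which helps you judge how much weight each mean deserves.

```r
library(dplyr)

# Hypothetical stand-in data (illustrative values only)
incidents <- data.frame(
  Outcome      = c("Arrest", "Arrest", "Warning", "Warning", "No Action"),
  ResponseTime = c(10, 14, 8, NA, 12)
)

outcome_summary <- incidents %>%
  group_by(Outcome) %>%
  summarize(
    n_incidents   = n(),                               # rows in the group
    n_missing     = sum(is.na(ResponseTime)),          # missing response times
    mean_response = mean(ResponseTime, na.rm = TRUE),
    sd_response   = sd(ResponseTime, na.rm = TRUE)
  )

outcome_summary
```

Note that sd() returns NA for a group with fewer than two non-missing values, as happens for the Warning and No Action groups here.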

Exercise!

Using the police_activity_data dataset, calculate the mean and standard deviation of the ResponseTime based on the IncidentType. Which Incident Type had the greatest mean response time?

5.3.2 Combining dplyr with ggplot2

You can easily combine the power of dplyr and ggplot2 to create insightful visualisations.

# Example: Create a summary and plot it
summary_data <- your_data %>%
  group_by(factor_variable) %>%
  summarize(sd_value = sd(continuous_variable, na.rm = TRUE))

ggplot(summary_data, aes(x = factor_variable, y = sd_value)) +
  geom_col() +  # plots the values as given; equivalent to geom_bar(stat = "identity")
  labs(title = "Standard Deviation of Continuous Variable by Factor",
       x = "Factor",
       y = "Standard Deviation")
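You do not have to store the summary in an intermediate object. As a sketch, the pipeline below passes the summarised data straight into ggplot(). It assumes both packages are installed, and the data frame incidents is made up for illustration. The operator changes from %>% to + once ggplot() is called, because plot layers are added rather than piped.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in data (illustrative values only)
incidents <- data.frame(
  Outcome      = c("Arrest", "Arrest", "Arrest", "Warning", "Warning", "No Action"),
  ResponseTime = c(10, 14, 12, 8, 9, 15)
)

median_plot <- incidents %>%
  group_by(Outcome) %>%
  summarize(median_response = median(ResponseTime, na.rm = TRUE)) %>%
  ggplot(aes(x = Outcome, y = median_response)) +   # data arrives via the pipe
  geom_col() +
  labs(title = "Median Response Time by Outcome",
       x = "Outcome",
       y = "Median Response Time")

median_plot
```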

Exercise!

Building on the previous exercise, create a plot of the mean ResponseTime for each IncidentType. By default, the categories along the x-axis are sorted alphabetically. Can you sort them in ascending order of their mean instead?

5.4 Advanced Visualisation Techniques

Once you’re comfortable with basic plots, ggplot2 offers many advanced features for more complex visualisations.

5.4.1 Faceting

Faceting allows you to create multiple plots based on the values of one or more variables.

  • facet_wrap(~factor_variable): Creates separate plots for each level of factor_variable.

ggplot(your_data, aes(x = continuous_variable)) +
  geom_histogram(binwidth = 10) +
  facet_wrap(~factor_variable) +
  labs(title = "Histogram Faceted by Factor")
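facet_wrap() also lets you control the layout of the panels. The sketch below, on made-up data and assuming ggplot2 is installed, stacks the panels in a single column with ncol = 1. This makes it easier to compare distributions along a shared x-axis. The data frame incidents and its values are illustrative only.

```r
library(ggplot2)

# Hypothetical stand-in data (illustrative values only)
incidents <- data.frame(
  Outcome          = rep(c("Arrest", "Warning", "No Action"), each = 4),
  OfficersInvolved = c(1, 2, 2, 4, 3, 3, 4, 5, 1, 1, 2, 3)
)

facet_plot <- ggplot(incidents, aes(x = OfficersInvolved)) +
  geom_histogram(binwidth = 1, fill = "lightblue", color = "black") +
  facet_wrap(~Outcome, ncol = 1) +   # one panel per Outcome, stacked vertically
  labs(title = "Officers Involved, Faceted by Outcome",
       x = "Officers Involved", y = "Frequency")

facet_plot
```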

Exercise!

Use facet_wrap() to create a series of histograms of ResponseTime across the four different boroughs, using a binwidth of 3. How does the response time vary across the four boroughs?

5.4.2 Combining Multiple Geoms

You can layer multiple geometries to create complex plots.

  • geom_point(): Adds a scatter plot.
  • geom_smooth(method = "lm", se = FALSE): Adds a linear regression line; se = FALSE hides the confidence interval band.

ggplot(your_data, aes(x = continuous_variable, y = another_continuous_variable)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatter Plot with Regression Line", x = "X Variable", y = "Y Variable")

Exercise!

Using dplyr, produce a count of the number of incidents that occurred on each day. Use this information to create a scatterplot with a regression line. What do you notice about the trend?

  • Hint I: Use the dplyr pipeline to group the data in conjunction with the summarise(count = n()) function.
  • Hint II: If you can’t see a trendline you may need to review your Date variable data type.
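On Hint II: the summary() output earlier in the chapter shows the Date column with Class: character, so R is treating the dates as text, not as points in time. A minimal sketch of the conversion with base R's as.Date() follows. The vector date_text is a stand-in using a few date strings in the same format as the dataset.

```r
# Date strings in the same YYYY-MM-DD format as the Date column
date_text <- c("2024-08-31", "2024-08-15", "2024-08-19")
class(date_text)
## [1] "character"

# Convert the text to Date objects, stating the format explicitly
incident_dates <- as.Date(date_text, format = "%Y-%m-%d")
class(incident_dates)
## [1] "Date"

# Dates can now be ordered and subtracted like numbers
days_span <- as.numeric(diff(range(incident_dates)))
days_span
## [1] 16
```

Once a column is stored as a Date, ggplot2 treats it as a continuous axis, which is what a fitted trend line needs.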

5.4.3 Saving Your Plots

Once you’ve created a plot, you might want to save it for later use.

  • ggsave(): Saves the last plot with specified dimensions.

# Save the plot to a file
ggsave("my_plot.png", width = 8, height = 6)
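By default, ggsave() saves whichever plot was displayed most recently, which can be surprising in longer scripts. As a sketch, assuming ggplot2 is installed, you can pass a stored plot explicitly through the plot argument. The plot demo_plot, its data and the file name are made up for illustration, and the file is written to a temporary folder so that nothing in your project is overwritten.

```r
library(ggplot2)

# Hypothetical plot stored in an object (illustrative data only)
demo_plot <- ggplot(data.frame(x = 1:5, y = c(2, 4, 5, 4, 6)), aes(x = x, y = y)) +
  geom_point()

# Write to a temporary folder; replace with your own path when needed
out_file <- file.path(tempdir(), "demo_plot.png")

# width and height are in inches by default; dpi sets the resolution
ggsave(out_file, plot = demo_plot, width = 8, height = 6, dpi = 300)

file.exists(out_file)
```

The file type is taken from the extension, so changing the name to end in .pdf would produce a PDF instead.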

5.5 Conclusion

In this chapter, we explored the fundamentals of descriptive statistics and data visualisation in R, tools essential for any data analysis, including crime analysis. Starting with basic summary statistics, such as measures of central tendency and dispersion, we demonstrated how to gain quick insights into your data. We then explored the power of visualisations, learning how to create bar charts, histograms, boxplots, and scatterplots using the ggplot2 package, one of R’s most versatile and widely-used visualisation libraries.

These techniques allow you to uncover patterns, trends, and potential outliers in your data, transforming raw numbers into visual stories that can be more easily interpreted and communicated. As you continue to work with crime data or any other datasets, these descriptive statistics and visualisation skills will form the foundation for more advanced analyses. Whether summarising crime rates across boroughs or visualising the distribution of incident response times, these tools will help you to turn data into actionable insights.

In the next chapters, we will build upon these concepts, diving into inferential statistics and regression analysis, where you’ll learn to make predictions and draw conclusions beyond mere descriptions. The knowledge you’ve gained here will be crucial as we move forward, so make sure to revisit these techniques regularly as you become more familiar with R.