3 Data Management in R

3.1 Data Import and Export

Effective data management is crucial for any analysis. In this section, we’ll explore how to import data from various formats, such as SPSS, Excel, and CSV files, and how to export your cleaned and prepared data to different formats. We will also compare the R methods to their SPSS equivalents.

3.1.1 Importing Data

R offers several packages to import data from different sources:

SPSS Files: Use the haven package to import SPSS data files (.sav).

# Load the haven package
library(haven)
  
# Import SPSS data file
data <- read_sav("path_to_your_file.sav")

SPSS Equivalent: In SPSS, you would use the File > Open > Data menu or the GET FILE command.

Excel Files: Use the readxl package to import Excel files (.xlsx).

# Load the readxl package
library(readxl)

# Import Excel file
data <- read_excel("path_to_your_file.xlsx")

SPSS Equivalent: You would typically use File > Open > Data with Excel files in SPSS.

CSV Files: Use base R’s read.csv() function or the readr package for more advanced options.

# Base R method to import CSV file
data <- read.csv("path_to_your_file.csv")

SPSS Equivalent: In SPSS, importing CSV files can be done via File > Open > Data or with the GET DATA command.
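As a minimal, self-contained illustration (using made-up data rather than a real file), read.csv() can also parse CSV text supplied directly via its text argument, which is handy for quick experiments without touching the file system:

```r
# Toy CSV content supplied as a string (hypothetical column names)
csv_text <- "id,crime_type\n1,Burglary\n2,Drugs\n3,Drugs"

# read.csv() accepts literal text through the `text` argument
data <- read.csv(text = csv_text)

nrow(data)    # number of observations: 3
names(data)   # column names: "id" and "crime_type"
```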

3.1.2 Exporting Data

After cleaning and analysing your data, you may need to export it to various formats:

To SPSS Format:

# Load the haven package (provides write_sav())
library(haven)

# Export data to SPSS format
write_sav(data, "path_to_your_file.sav")

SPSS Equivalent: Use File > Save As in SPSS to export data.

To Excel Format:

# Load the openxlsx package
library(openxlsx)

# Export data to Excel format using the openxlsx package
write.xlsx(data, "path_to_your_file.xlsx")

SPSS Equivalent: Use File > Save As in SPSS with Excel files.

To CSV Format:

# Export data to CSV format
write.csv(data, "path_to_your_file.csv", row.names = FALSE)

SPSS Equivalent: In SPSS, you can use File > Save As to save data as a CSV file.

Exercise!

Go to https://data.police.uk/data/ and download a CSV file of Metropolitan Police Service data for January 2024. Load the CSV file into an R data frame called data_mps. How many observations and variables are there?

  • Hint: You can find this in the environment pane.

3.2 Data Cleaning and Preparation

Data cleaning is an essential step in preparing your data for analysis. This section covers handling missing data, recoding variables, filtering, subsetting, and performing data transformations. We’ll introduce the dplyr package, which is one of the most popular R packages for data manipulation.

3.2.1 Handling Missing Data

Identify Missing Data:

We can identify missing data by looking for NA (not available) values. Combining the is.na() function with sum() tells us how many values are missing in total.

# Check for missing data
sum(is.na(data_mps))

# Check for missing data in a specific column
sum(is.na(data_mps$Month))

SPSS Equivalent: Use the MISSING VALUES command or the Data > Define Missing Values menu.

Handling Missing Data:

There are numerous approaches for handling missing data. Here we will cover the simplest methods.

# Remove any rows with missing data
data_mps_clean <- na.omit(data_mps)

# Remove any columns with missing data
data_mps_clean <- data_mps[, colSums(is.na(data_mps)) == 0]

# Replace missing values with a specific value, such as 0.
data_mps[is.na(data_mps)] <- 0

# Replace missing values with a summary of the column (e.g. the mean or mode)
data_mps$Latitude[is.na(data_mps$Latitude)] <- mean(data_mps$Latitude, na.rm = TRUE)

SPSS Equivalent: Use the RECODE command or the Transform > Replace Missing Values function.
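The commands above can be seen in action on a small toy data frame (the values below are invented, not taken from the MPS dataset):

```r
# Toy data frame with two missing values
toy <- data.frame(
  Month    = c("2024-01", "2024-01", NA),
  Latitude = c(51.5, NA, 51.6)
)

# Count the missing values
sum(is.na(toy))                      # 2

# na.omit() drops every row containing at least one NA
toy_clean <- na.omit(toy)            # only row 1 survives

# Mean imputation for the Latitude column
toy$Latitude[is.na(toy$Latitude)] <- mean(toy$Latitude, na.rm = TRUE)
toy$Latitude                         # 51.50 51.55 51.60
```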

Exercise!

Some of the commands above are composite commands encompassing multiple functions. Can you break them down to understand what each part of the function is doing?

Exercise!

In the second example above we replaced all missing values with 0, and in the fourth we replaced any missing latitudes with the mean latitude across the dataset. Were these the best imputation choices? What method would you have chosen for dealing with these missing values?

3.2.2 Filtering and Subsetting Data

Filtering and subsetting data allow you to focus on specific observations of interest.

Filtering Data in R:

# Load dplyr, which provides filter() and select()
library(dplyr)

# Filter data where a variable equals a specific value
data_filtered <- filter(data, variable == "value")

SPSS Equivalent: Use the SELECT IF command…

SELECT IF (variable = value).

Subsetting Data in R:

# Subset columns in R
data_subset <- select(data, column1, column2)

SPSS Equivalent: Use the MATCH FILES / KEEP command or Data > Select Cases.
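A minimal sketch of both operations on a toy data frame (the column names and values are hypothetical):

```r
library(dplyr)

# Toy data frame standing in for a crime dataset
crimes <- data.frame(
  id         = 1:4,
  crime_type = c("Drugs", "Burglary", "Drugs", "Robbery"),
  context    = NA
)

# Keep only the rows where crime_type is "Drugs"
crimes_drugs <- filter(crimes, crime_type == "Drugs")

# Keep only the id and crime_type columns
crimes_small <- select(crimes, id, crime_type)

# A negative selection drops the named column and keeps the rest
crimes_no_context <- select(crimes, -context)
```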

Exercise!

The Context column is full of NA values. Can you subset the dataset to remove this column? Can you then filter the dataset to create a new dataframe called data_mps_drugs containing only crimes with the Crime Type recorded as ‘Drugs’.

  • Hint: You do not need to list all of the columns you want to keep. Can you find a way to specify only the column you want to drop?

3.2.3 Data Transformations

Data transformations involve creating new variables or modifying existing ones. This can include mathematical operations, merging datasets, and more.

Creating New Variables:

# Create a new variable
data <- data %>%
  mutate(new_variable = old_variable1 + old_variable2)

SPSS Equivalent: Use the COMPUTE command:

COMPUTE new_variable = old_variable1 + old_variable2.
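As a concrete, runnable sketch of mutate() (with invented column names), summing two component columns into a total:

```r
library(dplyr)

# Toy data frame with two component scores
scores <- data.frame(part1 = c(10, 20), part2 = c(5, 15))

# mutate() adds a new column computed from existing ones
scores <- scores %>%
  mutate(total = part1 + part2)

scores$total   # 15 35
```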

Merging Datasets:

# Merge two datasets by a common key
merged_data <- merge(data1, data2, by = "common_column")

SPSS Equivalent: Use the MATCH FILES command.
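A small worked example with two toy data frames sharing the key column id (the data are invented): by default merge() performs an inner join, keeping only keys present in both datasets, while all.x = TRUE gives a left join.

```r
# Two toy data frames sharing the key column `id`
offences <- data.frame(id = c(1, 2, 3),
                       crime_type = c("Drugs", "Burglary", "Drugs"))
outcomes <- data.frame(id = c(1, 3),
                       outcome = c("Charged", "No suspect"))

# Inner join (default): only ids 1 and 3 appear in both
merged <- merge(offences, outcomes, by = "id")

# Left join: keep every offence, filling missing outcomes with NA
merged_left <- merge(offences, outcomes, by = "id", all.x = TRUE)
```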

Exercise!

How would you merge two datasets if they didn’t have a common key?

  • Hint: Google is your friend!

3.2.4 The dplyr Pipeline

You may have noticed in the previous section that we snuck in a funny-looking bit of code – the pipeline operator, %>% – when creating a new variable. The dplyr package uses the pipeline operator %>% to streamline the process of applying a sequence of operations to data frames. The %>% operator allows you to pass the result of one function directly into the next function, creating a clear and readable chain of commands.

Pipe Data: The object on the left side of %>% is passed as the first argument to the function on the right side.

data %>% function()

Chaining Multiple Operations: You can chain multiple functions together, where the output of each function becomes the input for the next. This allows you to perform multiple operations on your data frame without needing to create intermediate variables.

data %>%
  function1() %>%
  function2() %>%
  function3()

Benefits

  • Readability: Makes code more readable by laying out operations in a clear, top-down sequence.
  • Efficiency: Reduces the need for temporary variables or nested function calls.
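Putting the pieces together, the filter-then-select steps from earlier can be expressed as a single pipeline on a toy data frame (hypothetical column names and values):

```r
library(dplyr)

# Toy crime data with a column of NA values
crimes <- data.frame(
  crime_type = c("Drugs", "Burglary", "Drugs"),
  month      = c("2024-01", "2024-01", "2024-02"),
  context    = NA
)

# One readable chain: filter the rows, then drop the context column,
# with no intermediate variables needed
crimes_drugs <- crimes %>%
  filter(crime_type == "Drugs") %>%
  select(-context)
```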

Exercise!

Using the original data_mps dataframe can you recreate the data_mps_drugs dataframe using the dplyr pipeline operator with the relevant select and filter commands?

3.3 Working with Categorical Data

Categorical data, or factors in R, are a common type of data in survey analysis. Understanding how to manipulate and analyse this type of data is essential.

3.3.1 Creating and Manipulating Factors

Factors in R are used to represent categorical data and can be ordered or unordered. Here’s how to create and manipulate them:

Creating a Factor:

When creating a categorical factor variable you can manually specify the levels to provide an ordering or you can leave it blank to produce an unordered categorical variable.

# Create a factor variable
data$factor_variable <- factor(data$categorical_variable,
                               levels = c("Category 1", "Category 2", "Category 3"))

SPSS Equivalent: This is similar to defining value labels and assigning them in SPSS.

Reordering Factor Levels:

You can reorder the factor levels by respecifying the ordering.

# Reorder levels of a factor
data$factor_variable <- factor(data$factor_variable,
                               levels = c("Category 3", "Category 1", "Category 2"))

SPSS Equivalent: SPSS orders categories by their underlying numeric codes, so you would RECODE the values to change their order.
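A minimal, runnable sketch with made-up severity ratings shows why the level ordering matters: tables and plots follow the level order, not alphabetical order.

```r
# Toy categorical data as plain text
severity <- c("Low", "High", "Low", "Medium")

# Supplying the levels fixes their order
severity_f <- factor(severity, levels = c("Low", "Medium", "High"))

levels(severity_f)   # "Low" "Medium" "High"
table(severity_f)    # counts reported in level order, not alphabetical
```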

3.3.2 Recoding Variables

Recoding variables is a common task, such as transforming a continuous variable into a categorical one.

Recoding Variables in R:

# Load the dplyr library
library(dplyr)

# Recoding a variable using dplyr
data <- data %>%
  mutate(new_variable = recode(old_variable,
                               "1" = "Category 1",
                               "2" = "Category 2"))

SPSS Equivalent: Use the RECODE command…

RECODE old_variable (1='Category 1') (2='Category 2') INTO new_variable.
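The following toy example (invented codes and labels) shows recode() in action; note that when the old variable is numeric, the old values are written in backticks:

```r
library(dplyr)

# Toy data frame with numeric survey codes
responses <- data.frame(code = c(1, 2, 1))

# Recode the numeric codes into text labels
responses <- responses %>%
  mutate(label = recode(code,
                        `1` = "Category 1",
                        `2` = "Category 2"))

responses$label   # "Category 1" "Category 2" "Category 1"
```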

Exercise!

Using the unfiltered data_mps dataset, convert the Crime Type column into a factor.

3.3.3 Frequency Tables and Cross-Tabulations

Understanding the distribution of categorical data is often done through frequency tables and cross-tabulations.

Frequency Tables:

# Create a frequency table
table(data$factor_variable)

SPSS Equivalent: Use the FREQUENCIES command:

FREQUENCIES VARIABLES=factor_variable.

Cross-Tabulations:

# Create a cross-tabulation
table(data$factor_variable1, data$factor_variable2)

SPSS Equivalent: Use the CROSSTABS command:

CROSSTABS /TABLES=factor_variable1 BY factor_variable2.
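Both commands can be tried on a pair of toy factors (the categories below are invented stand-ins for crime types and outcomes):

```r
# Toy factors standing in for crime types and outcomes
crime  <- factor(c("Drugs", "Burglary", "Drugs", "Drugs"))
result <- factor(c("Charged", "No suspect", "No suspect", "Charged"))

# Frequency table: Burglary 1, Drugs 3
table(crime)

# Cross-tabulation: a 2x2 table of counts
table(crime, result)
```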

Exercise!

Convert the Last Outcome Category to a categorical variable and view a cross-tabulation of the outcomes with the crime types. Which Crime Type has the highest number of Complete Investigations Where No Suspect Was Identified?

3.4 Conclusion

In this chapter, we’ve covered essential data management tasks in R, including importing and exporting data, cleaning and preparing data, and working with categorical data. By comparing these processes with their SPSS equivalents, you can see how R can effectively replace or complement SPSS in your analytical workflow. In the next chapter, we’ll delve into descriptive statistics and data visualizations in R, continuing to build your skills in this versatile programming language.