Data Wrangling in R: A Practical Guide

A practical overview of essential data wrangling practices

Data wrangling is the unsung hero of every data project. It's where messy data gets turned into something you can actually analyze, and in R you've got a whole toolbox to do it right. Let's go over the most useful wrangling techniques.

Importing Data

Before you wrangle, you’ve got to get the data imported. Garbage in, garbage out. Clean imports set the stage for everything else.

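readr::read_csv(), readxl::read_excel(), jsonlite::read_json() – These functions make it easy to bring in structured data from files. read_csv() and read_excel() guess column types for you, which saves time; read_json() returns a list unless you ask it to simplify.
DBI::dbReadTable() – If you're pulling a whole table from a SQL database, this one's your go-to. Use DBI::dbGetQuery() when you only need the result of a query.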
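
As a minimal sketch of the import step, the snippet below writes a tiny made-up CSV to a temporary file and reads it back with read_csv(). The file contents and the name `orders` are assumptions for illustration only.

```r
library(readr)

# Write a small hypothetical CSV to a temp file so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(
  c("id,amount,signup_date",
    "1,10.5,2024-01-15",
    "2,20,2024-02-01"),
  path
)

# read_csv() guesses a type for each column from the values it sees
orders <- read_csv(path, show_col_types = FALSE)

# Inspect the guessed column specification
spec(orders)
```

If a guess is wrong, the `col_types` argument of read_csv() lets you state the types explicitly.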

Cleaning Column Names

Clean names mean cleaner code and fewer typos later on.

janitor::clean_names() – Converts weird column names into something usable, like Total Revenue → total_revenue.
rename() – For fixing just a few column names.
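
A short sketch of both approaches, using an invented data frame `raw` with awkward column names:

```r
library(dplyr)
library(janitor)

# Hypothetical data with spaces and capitals in the column names
raw <- data.frame(
  `Total Revenue` = c(100, 250),
  `Customer ID`   = c("a1", "b2"),
  check.names = FALSE
)

cleaned <- raw |>
  clean_names() |>                 # Total Revenue -> total_revenue, Customer ID -> customer_id
  rename(customer = customer_id)   # rename(new_name = old_name)

names(cleaned)
```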

Filtering and Subsetting Rows

Focus on the rows that matter most. You don’t always need everything.

filter(df, condition) – Grab only what you need.
slice(df, 1:10) – Useful for previewing or working with sample data.
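
For instance, with a made-up `sales` table (the column names are assumptions), filtering and slicing might look like this:

```r
library(dplyr)

# Hypothetical example data
sales <- tibble(
  region  = c("north", "south", "north", "east"),
  revenue = c(120, 80, 300, 50)
)

# Multiple conditions in filter() are combined with AND
big_north <- filter(sales, region == "north", revenue > 100)

# slice() picks rows by position, here the first two
first_two <- slice(sales, 1:2)
```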

Selecting and Reordering Columns

Keeps your workspace manageable and your logic focused.

select() – Choose only the columns you care about.
relocate() – Move important columns to the front.
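
A small sketch on an invented `sales` table: keep three columns, then move one of them to the front.

```r
library(dplyr)

# Hypothetical example data
sales <- tibble(
  id      = 1:3,
  notes   = c("a", "b", "c"),
  revenue = c(10, 20, 30),
  region  = c("n", "s", "e")
)

trimmed <- sales |>
  select(id, revenue, region) |>  # drop everything else, including notes
  relocate(region)                # with no .before/.after, moves to the front

names(trimmed)
```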

Creating New Columns with `mutate()`

New insights often come from combining existing data in new ways.

mutate() – Add new calculated columns.
case_when() – Like if-else, but cooler and vectorized.
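
The sketch below derives a total and then buckets it with case_when(). The `orders` data and the size thresholds are made up for illustration.

```r
library(dplyr)

# Hypothetical example data
orders <- tibble(price = c(5, 20, 100), qty = c(2, 1, 3))

orders <- orders |>
  mutate(
    total = price * qty,
    # Conditions are checked in order; the first match wins
    size = case_when(
      total >= 100 ~ "large",
      total >= 20  ~ "medium",
      TRUE         ~ "small"   # fallback for everything else
    )
  )
```

Because a later column in the same mutate() call can use an earlier one, `size` can be built from `total` in a single step.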

Type Conversion

Math on strings? No thanks. Get your types right.

as.numeric(), as.factor(), etc. – Convert your data to what it’s supposed to be.
parse_number() – Extract numbers from messy strings.
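
A quick illustration with invented strings. Treating "n/a" as a missing-value marker is an assumption of this example, passed through the `na` argument.

```r
library(readr)

# Hypothetical messy input
raw_amounts <- c("$1,200.50", "85%", "n/a")

# parse_number() drops non-numeric characters around the number;
# "n/a" is declared as a missing-value marker here
amounts <- parse_number(raw_amounts, na = c("", "NA", "n/a"))

# Plain conversions for already-clean values
codes <- as.numeric(c("42", "3.14"))
grade <- as.factor(c("low", "high", "low"))
levels(grade)
```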

String Manipulation with `stringr`

Text data is often messy and inconsistent. These tools bring order.

str_to_lower(), str_trim() – Standardize text.
str_replace(), str_detect() – Clean and filter with regex.
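
As a sketch, here is how a column of inconsistently typed city names (made up for this example) could be standardized and then searched:

```r
library(stringr)

# Hypothetical messy text values
cities <- c("  New York ", "new york", "BOSTON", "Boston  ")

clean <- cities |>
  str_trim() |>       # remove leading/trailing whitespace
  str_to_lower()      # consistent case

# Replace the first run of whitespace with an underscore
clean <- str_replace(clean, "\\s+", "_")

# Logical vector: which values start with "new"?
is_new <- str_detect(clean, "^new")
```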

Date and Time Handling with `lubridate`

Time-based analysis depends on correct, consistent date formats.

ymd(), dmy() – Convert strings to proper date objects.
year(), month() – Pull out components.
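
A minimal sketch: the same date written in two different orders parses to the same Date object, and components can then be extracted. The date itself is arbitrary.

```r
library(lubridate)

# The parser name states the order of the components in the string
d1 <- ymd("2024-03-15")
d2 <- dmy("15/03/2024")

same_day <- d1 == d2

# Pull out components as numbers
yr <- year(d1)
mo <- month(d1)
```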

Reshaping with `pivot_longer()` and `pivot_wider()`

Some analyses or visuals require a different layout. Reshape as needed.

pivot_longer() – Turn wide data into long format.
pivot_wider() – Go the other way.
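
To illustrate the round trip, the sketch below takes an invented table with one column per quarter, lengthens it, and widens it again:

```r
library(tidyr)

# Hypothetical wide data: one column per quarter
wide <- tibble::tibble(
  store = c("A", "B"),
  q1    = c(10, 20),
  q2    = c(15, 25)
)

# Wide -> long: one row per store/quarter pair
long <- pivot_longer(
  wide,
  cols      = c(q1, q2),
  names_to  = "quarter",
  values_to = "sales"
)

# Long -> wide: back to one column per quarter
back <- pivot_wider(long, names_from = quarter, values_from = sales)
```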

Grouping and Summarizing

Most insights are found in the group-level patterns.

group_by() + summarise() – Aggregate your data by categories.
count() – Quick value counts.
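
A short sketch on made-up `sales` data, showing a grouped summary next to a quick count:

```r
library(dplyr)

# Hypothetical example data
sales <- tibble(
  region  = c("north", "south", "north", "south", "east"),
  revenue = c(100, 50, 200, 150, 75)
)

by_region <- sales |>
  group_by(region) |>
  summarise(
    total = sum(revenue),
    avg   = mean(revenue),
    .groups = "drop"    # return an ungrouped tibble
  )

# count() is shorthand for group_by() + summarise(n = n())
counts <- count(sales, region, sort = TRUE)
```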

Sorting and Ranking

Highlight top performers, trends, or outliers.

arrange() – Sort your data.
rank(), dense_rank() – Assign positions or ranks.
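
The sketch below sorts an invented score table and shows how the two ranking styles treat ties. Using `ties.method = "min"` with base rank() is a choice made for this example.

```r
library(dplyr)

# Hypothetical example data with a tie at the top
scores <- tibble(
  name  = c("a", "b", "c", "d"),
  score = c(90, 85, 90, 70)
)

ranked <- scores |>
  arrange(desc(score)) |>
  mutate(
    # Base rank(): ties share the lowest rank, then a gap follows
    rank_min   = rank(-score, ties.method = "min"),
    # dense_rank(): ties share a rank, with no gap afterwards
    rank_dense = dense_rank(desc(score))
  )
```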

Handling Missing Data

Missing data can break your analysis. Tidy it up before moving on.

is.na(), drop_na() – Detect or remove missing values.
replace_na() – Fill in gaps.
coalesce() – Take the first non-missing value across several columns or fallback defaults.
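
Here is a sketch of the four tools side by side. The table `df` and its `backup` column are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with gaps
df <- tibble(
  id     = 1:4,
  score  = c(10, NA, 30, NA),
  backup = c(NA, 20, NA, NA)
)

n_missing <- sum(is.na(df$score))          # count missing values

complete <- drop_na(df, score)             # drop rows where score is NA

filled <- replace_na(df, list(score = 0))  # fill NA with a constant

# coalesce(): first non-missing value, checked left to right
merged <- mutate(df, best = coalesce(score, backup, 0))
```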

Joining Data

Real-world data is scattered across sources. Joins bring the pieces together.

left_join(), inner_join(), etc. – Combine tables by keys.
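
A small sketch with two invented tables that share a `customer_id` key, to show how the join type changes which rows survive:

```r
library(dplyr)

# Hypothetical lookup and fact tables
customers <- tibble(
  customer_id = c(1, 2, 3),
  name        = c("Ana", "Ben", "Cy")
)
orders <- tibble(
  order_id    = c(101, 102, 103),
  customer_id = c(1, 1, 4),
  amount      = c(25, 40, 15)
)

# left_join(): keep every customer; unmatched ones get NA order columns
left <- left_join(customers, orders, by = "customer_id")

# inner_join(): keep only customers that have at least one order
inner <- inner_join(customers, orders, by = "customer_id")
```

Stating `by` explicitly avoids accidental joins on columns that merely happen to share a name.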

Row-wise Operations

Not everything is column-wise. Sometimes, each row is its own little puzzle.

rowwise() + mutate() – Calculations across columns in a single row.
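
As an illustration, the sketch below computes a per-student best and average across three test columns. The data are made up, and c_across() is used here to gather the row's values.

```r
library(dplyr)

# Hypothetical example data: one row per student
scores <- tibble(
  student = c("a", "b"),
  test1   = c(80, 60),
  test2   = c(90, 70),
  test3   = c(70, 95)
)

result <- scores |>
  rowwise() |>
  mutate(
    best = max(c_across(test1:test3)),   # max within this row only
    avg  = mean(c_across(test1:test3))
  ) |>
  ungroup()                              # go back to column-wise behavior
```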

Combining Data

You may get your data in parts. Combine them to see the whole picture.

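bind_rows(), bind_cols() – Stack tables vertically or place them side by side. bind_cols() matches rows by position, not by key, so use a join when you need key matching.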
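
A minimal sketch with two invented monthly extracts:

```r
library(dplyr)

# Hypothetical monthly extracts with the same columns
jan <- tibble(id = 1:2, sales = c(10, 20))
feb <- tibble(id = 3:4, sales = c(30, 40))

# Stack rows; .id records which input each row came from
stacked <- bind_rows(jan = jan, feb = feb, .id = "month")

# Place columns side by side; rows are paired by position only
extra   <- tibble(region = c("n", "s"))
widened <- bind_cols(jan, extra)
```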

Dealing with Duplicates and Outliers

Duplicates and extremes can skew your results. Keep them in check.

duplicated(), distinct() – Spot and remove repeats.
filter() + quantile() – Manage outliers by keeping only values within chosen quantile bounds.
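
One possible approach, sketched on made-up data: drop exact duplicate rows, then keep values within 1.5 × IQR of the quartiles. The 1.5 × IQR rule is an assumption of this example; other cutoffs may suit your data better.

```r
library(dplyr)

# Hypothetical data with one repeated row and one extreme value
df <- tibble(
  id    = c(1, 2, 2, 3, 4, 5),
  value = c(10, 12, 12, 11, 13, 500)
)

is_dup  <- duplicated(df)   # TRUE for rows that repeat an earlier row
deduped <- distinct(df)     # keep one copy of each row

# Quartiles and interquartile range of the de-duplicated values
q1  <- quantile(deduped$value, 0.25)[["25%"]]
q3  <- quantile(deduped$value, 0.75)[["75%"]]
iqr <- q3 - q1

# Keep rows inside the fences
trimmed <- filter(
  deduped,
  value >= q1 - 1.5 * iqr,
  value <= q3 + 1.5 * iqr
)
```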

Recoding Values

Sometimes categories need to be merged, renamed, or reclassified.

recode(), case_when() – Replace values based on logic. In dplyr 1.1.0 and later, recode() is superseded by case_match(), though it still works.
fct_recode() – For factor variables.
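
A sketch of each option on invented values. Note the argument order differs: recode() takes `old = "new"`, while fct_recode() takes `new = "old"`.

```r
library(dplyr)
library(forcats)

# Hypothetical inconsistent status codes
status <- c("Y", "N", "y", "unknown")

# recode(): old value on the left, new value on the right
via_recode <- recode(status, Y = "yes", y = "yes", N = "no", .default = "other")

# case_when(): the same mapping expressed as conditions
via_case <- case_when(
  status %in% c("Y", "y") ~ "yes",
  status == "N"           ~ "no",
  TRUE                    ~ "other"
)

# fct_recode(): rename factor levels, new name on the left
size  <- factor(c("sm", "md", "lg", "sm"))
size2 <- fct_recode(size, small = "sm", medium = "md", large = "lg")
```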

Nesting & Unnesting

Allows more complex workflows without breaking things into separate tables.

nest(), unnest() – Work with grouped data as mini-dataframes.
map() – Apply functions to nested data.
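
As a rough sketch, the example below nests an invented `sales` table by region, computes a per-region value with purrr, and then flattens it back:

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical example data
sales <- tibble(
  region  = c("n", "n", "s", "s"),
  month   = c(1, 2, 1, 2),
  revenue = c(1, 2, 3, 4)
)

# One row per region; the other columns live in a list-column of tibbles
nested <- nest(sales, data = c(month, revenue))

# Apply a function to each mini-dataframe
nested <- nested |>
  mutate(
    n_rows = map_int(data, nrow),
    total  = map_dbl(data, ~ sum(.x$revenue))
  )

# Expand the list-column back into regular rows
flat <- nested |>
  select(region, data) |>
  unnest(data)
```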

Functional Programming for Wrangling

Repetition is inevitable. These help you automate cleanly.

map(), walk(), across() – Repeat yourself the smart way.
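
A small sketch of the three helpers on a made-up table: across() applies one function to many columns, map_dbl() (a typed variant of map()) returns one number per column, and walk() is for side effects such as printing.

```r
library(dplyr)
library(purrr)

# Hypothetical example data
df <- tibble(
  a     = c(1.234, 2.345),
  b     = c(3.456, 4.567),
  label = c("x", "y")
)

# across(): round every numeric column in one go
rounded <- mutate(df, across(where(is.numeric), ~ round(.x, 1)))

# map_dbl(): one mean per selected column, as a named numeric vector
means <- map_dbl(df[c("a", "b")], mean)

# walk(): call a function for its side effect (printing); no result is kept
walk(names(means), ~ cat(.x, ":", means[[.x]], "\n"))
```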

Inspecting and Validating Data

A quick check now saves hours of debugging later.

str(), summary(), glimpse() – Know what you're working with.
skimr::skim() – For compact per-column summaries, including missing-value counts.
assertthat or checkmate – Add sanity checks.
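
A minimal sketch of a look-then-assert routine using assertthat. The table `df` and the specific rules (non-negative ages, unique non-missing ids) are assumptions for illustration.

```r
library(dplyr)
library(assertthat)

# Hypothetical example data
df <- tibble(id = 1:3, age = c(34, 27, 45))

# Look first: structure, types, and basic statistics
glimpse(df)
summary(df$age)

# Then assert what must hold; assert_that() stops with an error if any check fails
checks_ok <- assert_that(
  is.numeric(df$age),
  all(df$age >= 0),
  noNA(df$id),
  !anyDuplicated(df$id)
)
```

Placing checks like these right after import or after a join tends to surface problems close to where they were introduced.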
