Project: Automating IRS Tax Compliance with R & the Tidyverse

Introduction

I have a modest collection of merchandise (vinyl, LEGO, toys) that I sell as a hobbyist throughout the year, primarily to cull my inventory while iteratively curating the collection. Profit from the sale of collectibles is generally taxable by the IRS as capital gains. Form 8949 (Sales and Other Dispositions of Capital Assets) is the IRS tax form used to report the details of your capital asset transactions. Sales data for the year can be downloaded from various online marketplaces and compiled into an asset transaction datasheet, but additional formatting is required to make this data Form 8949-compliant. Managing high-volume collectible sales requires more than just a spreadsheet – it requires a reproducible workflow. This page summarizes how I leveraged the R Tidyverse to convert raw PayPal and other marketplace exports into a ‘Tidy’ format, ultimately generating professional, IRS-ready Form 8949 statements with the push of a button. View the Source Code on GitHub.


Prerequisites

To run this project you need R and RStudio installed. You will also need a LaTeX distribution (like TinyTeX) to compile the PDF.

Required R Packages

# Install the required R packages from CRAN

install.packages(c("tidyverse", "knitr", "kableExtra", "rmarkdown"))


LaTeX Setup

LaTeX is a high-quality typesetting system that is the standard for the communication and publication of scientific documents. If you don’t have LaTeX installed already, run this in your R console:

install.packages("tinytex")
tinytex::install_tinytex()

Data Preparation

After the sales data for the entire year is exported from each online marketplace, all of the source data needs to be aggregated into a single CSV file that matches the IRS Form 8949 column requirements.
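Aggregating per-marketplace exports into one table can be sketched with dplyr's bind_rows(). The rows below are hypothetical stand-ins for records parsed from two marketplace exports; real exports would first be read with read_csv() and normalized to a common set of columns:

```r
library(dplyr)

# Hypothetical stand-ins for rows parsed from two marketplace exports
ebay_sales <- tibble(
  Description = "Richard Avedon: Photographs 1946-2004",
  Proceeds    = 49.74,
  Marketplace = "ebay"
)
bricklink_sales <- tibble(
  Description = "10265 Ford Mustang",
  Proceeds    = 92.01,
  Marketplace = "bricklink"
)

# Stack the per-marketplace tables into one data frame for Form 8949 prep
all_sales <- bind_rows(ebay_sales, bricklink_sales)
```

Because bind_rows() matches columns by name, each export only needs to be renamed to the shared schema before stacking.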


CSV File Header Data

The input file must contain the following columns:

  • Description (e.g., "10265 Ford Mustang")
  • Date_Acquired (MM/DD/YYYY)
  • Date_Sold (MM/DD/YYYY)
  • Proceeds (Sale price)
  • Cost_Basis (Original purchase price)
  • Gain_Loss (Difference between sale price and original purchase price)
  • Code (Form 8949 column (f) adjustment code, if any)
  • Adjustment (Form 8949 column (g) adjustment amount, if any)
  • Marketplace (Source of the sale, e.g., "ebay", "bricklink")

An excerpt from this populated input file 2025_sales_8949.csv is displayed below.

Description                             Date_Acquired  Date_Sold   Proceeds  Cost_Basis  Gain_Loss  Marketplace
Richard Avedon: Photographs 1946-2004   1/23/2022      11/22/2025  $49.74    $75.00      -$25.26    ebay
10265 Ford Mustang                      3/3/2019       10/21/2025  $92.01    $160.41     -$68.40    bricklink
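A quick sanity check that an input file carries the required columns can be written in base R. Here `present` is a hypothetical stand-in for `names(read_csv("2025_sales_8949.csv"))`:

```r
# Columns the Form 8949 pipeline expects in the input CSV
required <- c("Description", "Date_Acquired", "Date_Sold",
              "Proceeds", "Cost_Basis", "Gain_Loss")

# Hypothetical stand-in for names(read_csv("2025_sales_8949.csv"))
present <- c("Description", "Date_Acquired", "Date_Sold",
             "Proceeds", "Cost_Basis", "Gain_Loss", "Marketplace")

# Any required column not present in the file triggers an early failure
missing_cols <- setdiff(required, present)
if (length(missing_cols) > 0) {
  stop("Input file is missing columns: ", paste(missing_cols, collapse = ", "))
}
```

Failing fast here is cheaper than debugging a cryptic mutate() error later in the pipeline.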

Implementation

Our goal for this project is to build a reproducible data pipeline that can:

  • Read the CSV file containing the year's collectible sales data
  • Sanitize the data for public release
  • Apply logic to categorize transactions as short-term or long-term based on holding period
  • Apply professional formatting to generate a PDF that meets the layout requirements of IRS Form 8949


Data Sanitization

When sharing work for public consumption, we must ensure that personal information – transaction IDs, buyer names, buying patterns, or exact purchase dates – is scrubbed while keeping the data "real" enough to remain useful for analysis. The following sanitization strategies are applied in this project:

  • Hashing Identifiers: Hash the item name for each transaction so the reader can see that transactions are unique without revealing the actual marketplace ID.
  • Date Generalization: Exact dates can sometimes be used to link data back to public marketplace listings. Moving all dates to the first of the month preserves the long-term vs. short-term categorization while protecting privacy.
  • Price Obfuscation: Add uniform random ‘noise’ to price data to hide exact earnings.
  • Personal Data Masking: Remove any columns containing names or addresses.


R Script for Data Sanitization

The script data_scrubber.R creates a public releasable version of the input file 2025_sales_8949.csv containing the sales data. It uses the digest package to mask sensitive strings and adds "noise" to the price data. It also generalizes the dates to mask business activity while maintaining the long-term vs. short-term logic. Lastly, the Description column is dropped before outputting the sanitized CSV for further use in the data pipeline.

library(tidyverse)  # Load collection of R tools designed specifically for
                    # data science
library(lubridate)  # Date handling (mdy, floor_date); attached by tidyverse
                    # >= 2.0, loaded explicitly here for older versions
library(digest)     # Load tool used to create cryptographic hash digests of
                    # R objects

# Load your real data
real_data <- read_csv("C:/School/R_wd/tax_2025/data/Private/2025_sales_8949.csv")

# Sanitize the data
public_data <- real_data %>%
  mutate(across(c(Proceeds, Cost_Basis, Gain_Loss), 
                ~ as.numeric(gsub("[\\$,]", "", .)))) %>%
  mutate(
    # 1. Mask Order IDs using a hash (reproducible but anonymous)
    Order_ID = sapply(Description, function(x) digest(x, algo="crc32")),
    
    # 2. Generalize Dates (Keep month/year, but set day to 1st)
    Date_Acquired = as.character(floor_date(mdy(Date_Acquired), "month")),
    Date_Sold = as.character(floor_date(mdy(Date_Sold), "month")),
    
    # 3. Optional: Add 'Noise' to prices (e.g., +/- 5%) to hide exact earnings
    Proceeds = Proceeds * runif(n(), 0.75, 1.25),
    Cost_Basis = Cost_Basis * runif(n(), 0.75, 1.25),
    Gain_Loss = Proceeds - Cost_Basis
  ) %>%
  
  # 4. Remove any columns with Names or Addresses (not used in this case)
  #select(-matches("Name|Address|Email|Phone|Description"))
  
  # 4a. Return a set of columns as a new table
  select(Order_ID, Date_Acquired, Date_Sold, Proceeds, Cost_Basis, Code, 
         Adjustment, Gain_Loss, Marketplace)

# 5. Re-format back to currency
public_data <- public_data |> 
  mutate(across(c(Proceeds, Cost_Basis, Gain_Loss), 
                ~ scales::dollar(.)))

# 6. Write data frame to file
write_csv(public_data, "data/public_demo_sales.csv")

Depicted below is an excerpt from the sanitized dataset, public_demo_sales.csv. Compare this with the same observations from the original dataset, 2025_sales_8949.csv, above.

Order_ID  Date_Acquired  Date_Sold  Proceeds  Cost_Basis  Gain_Loss  Marketplace
fbaee11b  1/1/2022       11/1/2025  $47.84    $65.81      -$17.97    ebay
343f0827  3/1/2019       10/1/2025  $111.86   $190.87     -$79.01    bricklink


Data Configuration and Generation

Data configuration and generation is the primary engine of this data pipeline. The key features of this workflow include:

  • Categorization and Summary: Logic that sorts transactions into short-term and long-term tables based on holding period, then automatically calculates grand totals for Proceeds, Cost Basis, and Gain/Loss.
  • String Processing: Use regular expressions (regex) to strip currency symbols so the strings can be temporarily converted to numeric for computation.
  • Professional Formatting: Generate tables for each transaction type with longtable support, ensuring rows span multiple pages without breaking layout constraints.
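The string-processing step hinges on one regex: removing '$' and ',' so that currency strings survive the cast to numeric. A minimal base-R illustration:

```r
# Currency strings as they appear in the marketplace exports
raw_prices <- c("$49.74", "$1,234.56", "-$25.26")

# "[\\$,]" matches a literal dollar sign or a comma; gsub removes every match,
# leaving a plain decimal string that as.numeric() can parse
clean_prices <- as.numeric(gsub("[\\$,]", "", raw_prices))
# clean_prices is now c(49.74, 1234.56, -25.26)
```

The same pattern appears in both the sanitization script and the .Rmd template before totals are summed, and scales::dollar() reverses it for display.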


R Script for Loading Sanitized Data and Generating Form 8949

The script generate_8949.R loads the sanitized dataset, public_demo_sales.csv, cleans the currency data, and "knits" the R Markdown template to render the output PDF. The knitting is accomplished by the render() command, which reads the R Markdown template template_8949.Rmd, executes its code chunks in a fresh R session, captures their output, and dynamically generates a report interweaving the Markdown content with the R output. Once the knitting is done, it hands the Markdown file off to Pandoc, a tool that converts Markdown into standard output formats (HTML, Word, PDF).

# This script prepares your data and calls a markdown template to generate the PDF.

library(tidyverse)  # Load collection of R tools designed specifically for
                    # data science
library(rmarkdown)  # Load tool for adding r code chunks to standard Markdown
library(lubridate)  # Load tool for handling dates and times

# 1. Load your Tidyverse-compliant CSV
# Ensure columns: Description, Date_Acquired, Date_Sold, Proceeds, Cost_Basis
data <- read_csv("C:/School/R_wd/tax_2025/data/public_demo_sales.csv")

# 2. Calculate Gain/Loss and format for the IRS
data_clean <- data %>%
  mutate(
    # IRS requires dates in MM/DD/YYYY for the form
    Date_Acquired = format(Date_Acquired, "%m/%d/%Y"),
    Date_Sold = format(Date_Sold, "%m/%d/%Y")
  )

# 3. Generate the PDF using an R Markdown template
render("scripts/template_8949.Rmd", 
       output_file = paste0("../output/Form_8949_", Sys.Date(), ".pdf"),
       params = list(df = data_clean))

The R Markdown (.Rmd) template, template_8949.Rmd, contains the metadata, Markdown text, and R code chunks needed to generate a Form 8949 equivalent. The .Rmd acts as the "printing press": it holds the formatting logic and ordering so that the narrative (text) and results (tables) appear in the single, high-quality document the IRS expects for Form 8949.

---
title: "Form 8949 - Sales and Other Dispositions of Capital Assets"
author: "JKLM Data"
date: "`r Sys.Date()`"
output: 
  pdf_document:
    latex_engine: pdflatex
params:
  df: NA
header-includes:
  - \usepackage{booktabs}
  - \usepackage{longtable}
  - \usepackage{array}
  - \usepackage{pdflscape} # Added for landscape
  - \usepackage{colortbl}
---

```{r setup, include=FALSE}
library(tidyverse)
library(kableExtra)
library(lubridate)

# Ensure dates are in the correct format
df_processed <- params$df %>%
  mutate(
    Acq_Date = mdy(Date_Acquired),
    Sold_Date = mdy(Date_Sold),
    Holding_Period = as.numeric(Sold_Date - Acq_Date)
  )

# Split the data: the IRS defines short-term as a holding period of one
# year or less (365 days is used here as an approximation)
short_term <- df_processed %>% filter(Holding_Period <= 365)
long_term <- df_processed %>% filter(Holding_Period > 365)
```

\begin{landscape}

Part I: Short-Term Capital Gains and Losses
Assets held one year or less.

```{r short_term, echo=FALSE, message=FALSE, warning=FALSE}
# Create total row
# 1. Clean the data: Remove '$' and ',' then convert to numeric
short_term_numeric <- short_term %>%
  mutate(across(c(Proceeds, Cost_Basis, Gain_Loss), 
                ~ as.numeric(gsub("[\\$,]", "", .))))

# 2. Create the total row
short_total <- short_term_numeric %>%
  summarise(
    Order_ID = "GRAND TOTAL", 
    Date_Acquired = "", 
    Date_Sold = "",
    Proceeds = sum(Proceeds, na.rm = TRUE), 
    Cost_Basis = sum(Cost_Basis, na.rm = TRUE), 
    Gain_Loss = sum(Gain_Loss, na.rm = TRUE),
    Marketplace = ""
  )

# 3. Combine and format back to currency for the table
short_final <- bind_rows(short_term_numeric, short_total) %>%
  mutate(across(c(Proceeds, Cost_Basis, Gain_Loss), 
                ~ scales::dollar(.)))

# 4. Generate the table
knitr::kable(short_final |>
               select(Order_ID, Date_Acquired, Date_Sold, Proceeds, Cost_Basis, Gain_Loss, Marketplace),
             format = "latex", 
             booktabs = TRUE, 
             longtable = TRUE) %>%
  kable_styling(latex_options = c("striped", "repeat_header"), font_size = 10) %>%
  column_spec(1, width = "9cm") %>% # Forces text wrapping in the Order_ID column
  row_spec(nrow(short_final), bold = TRUE, background = "yellow")
```

Part II: Long-Term Capital Gains and Losses
Assets held more than one year

```{r long_term, echo=FALSE, message=FALSE, warning=FALSE}
# 1. Clean data: Remove symbols and cast to numeric
long_term_numeric <- long_term %>%
  mutate(across(c(Proceeds, Cost_Basis, Gain_Loss), 
                ~ as.numeric(gsub("[\\$,]", "", .))))

# 2. Create the total row
long_total <- long_term_numeric %>%
  summarise(
    Order_ID = "GRAND TOTAL", 
    Date_Acquired = "", 
    Date_Sold = "",
    Proceeds = sum(Proceeds, na.rm = TRUE), 
    Cost_Basis = sum(Cost_Basis, na.rm = TRUE), 
    Gain_Loss = sum(Gain_Loss, na.rm = TRUE),
    Marketplace = ""
  )

# 3. Combine and format for display
long_final <- bind_rows(long_term_numeric, long_total) %>%
  mutate(across(c(Proceeds, Cost_Basis, Gain_Loss), 
                ~ scales::dollar(.)))

# 4. Generate the landscape, multi-page table
knitr::kable(long_final |>
               select(Order_ID, Date_Acquired, Date_Sold, Proceeds, Cost_Basis, Gain_Loss, Marketplace), 
             format = "latex", 
             booktabs = TRUE, 
             longtable = TRUE) %>%
  kable_styling(latex_options = c("striped", "repeat_header"), font_size = 10) %>%
  column_spec(1, width = "9cm") %>% # Adjust width as needed for landscape
  row_spec(nrow(long_final), bold = TRUE, background = "yellow")
```

\end{landscape}


Final Generated Tax Statement

The final output of the data pipeline, Form_8949_YYYY-MM-DD.pdf, is depicted below. Unlike updating a spreadsheet, this workflow is non-destructive: the original data remains entirely untouched throughout the process. You can re-run the script anytime you find a new receipt or need to update a cost basis, and the entire PDF will regenerate instantly with correct totals. The current date is appended to the output filename on each run, giving you simple version control over successive statements.
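The date-stamping shown in generate_8949.R reduces to a single paste0() call:

```r
# Append today's date (YYYY-MM-DD, the default format of Sys.Date()) to the
# output filename, exactly as the render() call above does
output_name <- paste0("Form_8949_", Sys.Date(), ".pdf")
```

Each run therefore produces a distinct, sortable filename rather than overwriting the previous statement (unless the script is run twice in one day).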



Disclaimer

I am a Data Analyst, not a CPA. This software is provided for informational and record-keeping purposes only. It does not constitute professional tax advice. Always verify your final numbers with a qualified tax professional before filing with the IRS.


Summary

The Form 8949 Generator provides a reproducible data pipeline for independent consultants and hobbyist sellers to generate IRS-compliant Form 8949 statements. By transforming disparate marketplace data sources to meet complex IRS tax requirements, this project delivers an automated, non-destructive pipeline that replaces hours of manual entry with a single, reproducible command. It also demonstrates how rigorous Tidyverse workflows and LaTeX typesetting can streamline compliance and provide audit-ready financial transparency for independent consultants and small businesses alike.


Maintained by JKLM Data Analytics