Amazon Raw Data Audit: #Project-FSA

Task: Amazon Ingestion & Sanitization

Author

Larry Mannings

Published

March 25, 2026

Purpose

Construct a notebook for exploring the data in the raw Amazon data to help us focus on the key variables and “grimey” data (HTML, encoding issues, and outliers) before proceeding to Step 5: Parse and Filter.

Code

library(tidyverse)    # Core data manipulation (dplyr, stringr, lubridate, etc.)
library(readr)        # reads table data into R
library(skimr)        # Provides descriptive statistics and data quality summaries supporting tidyverse
library(stringi)      # R package for efficient string processing
library(knitr)        # Power R Markdown by handling the execution of embedded R code
library(kableExtra)   # Extends the basic functionality of tables produced by `knitr` package
library(lubridate)    # Library that contains functions for working with date-time data
library(scales)       # Provides functions for human readable labels for axes and legends
set.seed(1)

Data Exploration

Profile the dataset

First we load the data and look at the completeness of the dataset using skimr()

Code

amazon_combined <- read.csv("../data/amazon_combined.csv")
skim(amazon_combined)

Data summary
Name	amazon_combined
Number of rows	9407
Number of columns	29
_______________________
Column type frequency:
character	26
numeric	3
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
drive_id	1	33	33	2
asin	1	10	10	5278
billing_address	1	13	79	20
carrier_name_._tracking_number	1	5	424	3305
currency	1	3	3	2
gift_message	1	13	189	57
gift_recipient_contact	1	13	25	6
gift_sender_name	1	5	31	18
item_serial_number	1	13	121	82
order_date	1	20	24	2596
order_id	1	19	19	3100
order_status	1	6	10	3
payment_method_type	1	11	43	29
product_condition	1	3	13	4
product_name	1	4	458	5897
purchase_order_number	1	14	14	1
ship_date	1	13	70	3816
shipment_item_subtotal	1	1	13	1930
shipment_item_subtotal_tax	1	1	13	608
shipment_status	1	7	31	4
shipping_address	1	13	96	39
shipping_option	1	3	23	45
total_amount	1	1	8	2719
total_discounts	1	1	8	267
unit_price	1	1	8	2093
website	1	7	10	4

Variable type: numeric

skim_variable	complete_rate	mean	sd	p25	p50	p75	p100	hist
original_quantity	1	1.02	0.37	1	1	1.00	10.00	▇▁▁▁▁
shipping_charge	1	0.23	1.92	0	0	0.00	120.93	▇▁▁▁▁
unit_price_tax	1	0.82	3.80	0	0	0.77	223.79	▇▁▁▁▁

Key Variables

A quick review of the skim statistics indicate that there are 7 of 29 variables that are useful to this project, namely:

drive_id Unique identifier for the csv file ingested from Google Drive

currency Unit of exchage for purchasing goods. Primarily USD but may be CAD on occasion

order_id An Amazon-ascribed unique 17-digit number that identifies a specific purchase

order_date The timestamp when a purchase is officially placed and confirmed

order_status Status of purchase. “Closed” is the status for confirmed purchases

product_name A descriptive, often long-form title designed to identify products

total_amount The cumulative cost of the item including price and taxes

`drive_id`

The drive_id variable contains unique values corresponding to the files of individual household members before the data frames were merged as one. These values will be used later to construct distinct user names associated with the observations, namely, customer_1 and customer_2.

Code

unique(amazon_combined$drive_id)

[1] "10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF" "1Vu8KtCY_nIuCJtYe3oH1QAn-49ys6vzA"

`currency`

The currency variable is likely populated with “USD” for this case example but it is prudent to capture other forms of currency for other customer use cases so that currency conversion can be conducted based on conditional case.

Code

unique(amazon_combined$currency)

[1] "USD" "CAD"

`order_id`

The order_id variable must be retained with the scrubbed dataset as it maintains traceability to the original purchase, ensuring that the customer can retrieve supporting documentation such as paid invoices. It should be noted that order_id has no context outside of the customer’s Amazon account so it can be distributed in a public-facing data enviroment without masking.

`order_date`

The order_date timestamp contains information that could be mined for consumer behavior and thus must be scrubbed for generalization as the month and year of purchase is sufficient for a public facing dataset. If desired, the end user can retrieve the actual timestamp for an item purchase referencing the order_id. Below is a randomly selected order_date observation.

Code

sample(amazon_combined$`order_date`, 1)

[1] "2020-07-06T03:11:23Z"

`order_status`

order_status does not contain any privacy data, but must be retained in the dataset for the purpose of validating a purchase. For the purpose of this project only observations with order status “Closed” will remain after a filter as this status indicates the the payment was finished.

Code

unique(amazon_combined$`order_status`)

[1] "Closed"     "Authorized" "Cancelled"

`product_name`

Variable ‘product_name’ contains information that could be mined for consumer behavior and thus must be scrubbed for generalization to avoid identifying specific purchases with a user. Below is a randomly selected product_name observation.

Code

sample(amazon_combined$`product_name`, 1)

[1] "JoyCat 24 Pack Tic Tac Toe Board Game,Christmas Party Favors for Kids,Classroom Prizes for Goodie Bag Stuffers,Gifts for Students with Envelope,Easy to Store(Farm)"

`total_amount`

Lastly, the variable total_amount contains price data that can be mined for consumer behavior and thus must be generalized to avoid the data being used for unintended purposes. Below is a randomly selected total_amount observation.

Code

sample(amazon_combined$`total_amount`, 1)

[1] "5.27"

Key Variable Summary

In summary, below is a table of the variables essential to the FSA Eligible reimbursement criteria summary and their data senstitivity impact:

Code

FSA_summary <- data.frame(drive_id = "Customer Name Keyword", currency = "No Impact", order_id = "No Impact", 
                          order_date = "Month/Date Generalize", order_status = "No Impact", 
                          product_name = "Category Generalize", total_amount = "Mask with Randomness")
kable(FSA_summary) |> kable_styling(bootstrap_options = c("striped", "bordered"), full_width = FALSE)

drive_id	currency	order_id	order_date	order_status	product_name	total_amount
Customer Name Keyword	No Impact	No Impact	Month/Date Generalize	No Impact	Category Generalize	Mask with Randomness

Data Sanitization Techniqes

It is worthwhile to explore how the data sanitization techniques will look before developing them in the production script. Let’s reduce the dataset to include only the relevant variables first. Note: Variables order_id and order_status have no sanitization impact and thus are not displayed in this section to save room.

Code

FSA_reduced <- amazon_combined |> select(drive_id, order_date, product_name, currency, total_amount) |>
    mutate(product_name = str_trunc(product_name, width = 20))
head(FSA_reduced)

                           drive_id           order_date         product_name
1 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2024-06-08T03:54:26Z SHOKZ OpenRun Pro...
2 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2024-06-08T03:54:26Z JBL Endurance Pea...
3 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2024-06-08T03:54:26Z Triple Strength O...
4 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2024-01-14T00:29:01Z Soft Scrub All Pu...
5 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2025-04-06T16:21:29Z Honeydew Melon, 1...
6 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2025-04-06T16:21:29Z SEAFOOD Flounder ...
  currency total_amount
1      USD       192.55
2      USD       106.95
3      USD        32.05
4      USD         5.11
5      USD         3.99
6      USD        19.25

`drive_id`

Recall that, for drive_id, we want to convert the Google Drive file id corresponding to Customer 1 or Customer 2 to the descriptor cust_1 or cust_2. See the code below followed by the summary of the beginning and the end of the dataset. The variable product_name is withheld for the moment until privacy scrubbing can be applied.

Code

# Create function that converts google drive name to a corresponding customer id
change2custid <- function(drive_names, drive_vals){
    for(i in 1:length(drive_names)){
        for(j in 1:length(drive_vals)){
        if(drive_vals[j] == drive_names[i]) {
            drive_names[i] <- paste0("cust_",j)   # convert drive id name to customer id name and return
        }
    }
        
    }
    return(drive_names)
    
}
FSA_reduced <- amazon_combined |> select(drive_id, order_date, currency, total_amount)
df <- FSA_reduced |> rename(cust_id = drive_id) |>   # change column name to reflect customer ids
    mutate(cust_id = change2custid(cust_id, unique(cust_id)))
head(df, 4)

  cust_id           order_date currency total_amount
1  cust_1 2024-06-08T03:54:26Z      USD       192.55
2  cust_1 2024-06-08T03:54:26Z      USD       106.95
3  cust_1 2024-06-08T03:54:26Z      USD        32.05
4  cust_1 2024-01-14T00:29:01Z      USD         5.11

Code

tail(df,4)

     cust_id           order_date currency total_amount
9404  cust_2 2024-12-20T22:34:24Z      USD        134.8
9405  cust_2 2024-06-12T00:31:54Z      USD       145.51
9406  cust_2 2025-12-05T06:21:12Z      USD         48.1
9407  cust_2 2023-01-19T00:42:12Z      USD         8.55

`order_date`

The variable order_date should be generalized to display month and year only in the sanitized dataset. If you recall from the order_date Key Variable Summary , the ‘order-date’ format is “YYYY-MM-DDTHH:MM:SSz”, which is a string type that can be easily handled by the lubridate library to generalize the ymd_hms format to a MM-YYYY format. This is accomplished as follows:

Code

df <- df |> mutate(order_date = ymd_hms(order_date)) |>
    mutate(order_date = paste0(month(order_date), "-", year(order_date)))
head(df)

  cust_id order_date currency total_amount
1  cust_1     6-2024      USD       192.55
2  cust_1     6-2024      USD       106.95
3  cust_1     6-2024      USD        32.05
4  cust_1     1-2024      USD         5.11
5  cust_1     4-2025      USD         3.99
6  cust_1     4-2025      USD        19.25

`product_name`

Filter for FSA Eligibity

The objective here is to construct a Regular Expression (regex) pattern and using this pattern in conjunction with stringi library tools to detect pattern matches within the product_name column. Thus reducing the dataset to FSA Eligible items should simply be a matter of filtering based on a string search.

Code

# Add `product_name` and `order_id` back to the data frame.
df <- df |> add_column(product_name = amazon_combined$product_name) |>
    add_column(order_id = amazon_combined$order_id) |>
    select(cust_id, order_id, order_date, product_name, currency, total_amount)

fsa_keywords <- regex("face\\smask|FSA\\sHSA|N95\\smask|thermometer|first\\said|bandage|sunscreen|light\\stherapy|medical|
              brace|sanitizer|tylenol|\\sadvil\\s|covid|therapy", ignore_case = TRUE) 
fsa_candidates <- df |> filter(str_detect(product_name, fsa_keywords))  # filter dataset for FSA Eligible matches
message("Number of FSA Eligible items found: ", nrow(fsa_candidates))

Number of FSA Eligible items found: 57

Generalize FSA Products by Category

In the Filter for FSA Eligibility section above, we included about 15 keywords for pattern detection, but we actually want to group them into about 4 or 5 clean, broad buckets (e.g., putting all masks and sanitizers into a “Personnel Protective Equipment” category). We will use case_when() logic to help us in this regard. The fsa_category column generated by this step now permits us to represent the products in a pulblic facing dataset. The first 10 observations of this FSA-eligible filter result are tabulated.

Code

fsa_candidates <- fsa_candidates |>
  mutate(
    fsa_category = case_when(
      # Bucket 1: PPE & Prevention
      str_detect(product_name, regex("face\\smask|N95\\smask|sanitizer|covid", ignore_case = TRUE)) ~ "PPE & Prevention",
      
      # Bucket 2: Medical Devices
      str_detect(product_name, regex("thermometer|brace|light\\stherapy|therapy|medical", ignore_case = TRUE)) ~ "Medical Devices",
      
      # Bucket 3: First Aid & OTC Meds
      str_detect(product_name, regex("first\\said|bandage|tylenol|advil", ignore_case = TRUE)) ~ "First Aid & OTC",
      
      # Bucket 4: General FSA/HSA
      str_detect(product_name, regex("FSA\\sHSA|sunscreen", ignore_case = TRUE)) ~ "General FSA/HSA",
      
      # Default: If it doesn't match above, label it NA
      TRUE ~ NA_character_
    )
  ) |> select(cust_id, order_id, order_date, product_name, fsa_category, currency, total_amount)

# Prepare for tabulation
df <- fsa_candidates |> select(-order_id, -product_name)

kable(df[1:10, ]) |> kable_styling(bootstrap_options = c("striped", "bordered"), full_width = FALSE) #display fsa-eligible observations

cust_id	order_date	fsa_category	currency	total_amount
cust_1	6-2024	General FSA/HSA	USD	10.69
cust_1	2-2018	Medical Devices	USD	17.95
cust_1	10-2018	First Aid & OTC	USD	9.38
cust_1	10-2018	PPE & Prevention	USD	23.78
cust_1	8-2024	General FSA/HSA	USD	17.08
cust_1	4-2024	Medical Devices	USD	8.49
cust_1	12-2024	Medical Devices	USD	0
cust_1	7-2022	Medical Devices	USD	11.29
cust_1	8-2015	Medical Devices	USD	11.88
cust_1	8-2025	Medical Devices	USD	10.69

`total_amount`

Add Noise to mask exact price data

For the final variable in our dataset, we generate random uniform random noise (+/25%) to the total_amount variable. This should mitigate the risk of a bad faith actor attempting to indirectly identify item purchased based on item price.

Code

# modify 'total_amount` from it's original value by a randomly selected +/- 25% margin
fsa_candidates <- fsa_candidates |> mutate(total_amount = as.numeric(total_amount) * runif(n(), 0.75, 1.25)) |>
    mutate_at(vars(total_amount), funs(dollar(.)))    # add currency symbol

Concluding Remarks

The fully scrubbed dataset is now ready for final formatting in preparation for delivery to the customer. The completed table is depicted below. Note that the public facing dataset has the majority of the product_name data masked.

Code

 df2 <- fsa_candidates |>   arrange(fsa_category) # Alphabetically sort the dataset by `fsa_category`

# Prepare to mask the product_name data for public facing dataset
df3 <- df2 |> mutate(product_name = str_replace(product_name, "^(.{4}).*(.{4})$", "\\1**********\\2"))

#display fsa-eligible observations
kable(df3) |> kable_styling(bootstrap_options = c("striped", "bordered"), full_width = FALSE) |>
# pack_rows looks at how many times each category appears and draws the headers
  pack_rows(index = table(df2$fsa_category))

cust_id	order_id	order_date	product_name	fsa_category	currency	total_amount
First Aid & OTC
cust_1	113-8609955-6621848	10-2018	Infa**********ry )	First Aid & OTC	USD	$11.25
cust_1	113-3239294-7324261	1-2020	Chil**********. oz	First Aid & OTC	USD	$23.88
cust_1	113-7744157-8873047	7-2020	Swis**********ival	First Aid & OTC	USD	$36.46
cust_1	112-9687788-0137844	1-2019	Infa**********l oz	First Aid & OTC	USD	$10.76
cust_2	112-8086631-9349047	11-2019	Hego**********e-1)	First Aid & OTC	USD	$62.66
General FSA/HSA
cust_1	112-6986889-9915463	6-2024	Baby**********3 oz	General FSA/HSA	USD	$12.87
cust_1	111-9510385-2246650	8-2024	Comf**********acks	General FSA/HSA	USD	$18.45
cust_1	114-8332615-1848259	5-2018	Baby**********f 4)	General FSA/HSA	USD	$19.45
cust_1	113-7744157-8873047	7-2020	Thin**********nce)	General FSA/HSA	USD	$16.48
cust_1	113-8212141-8198637	5-2019	Baby**********f 2)	General FSA/HSA	USD	$17.71
cust_1	112-6169043-3177055	9-2024	LAO ********** Net	General FSA/HSA	USD	$12.79
cust_1	111-7957159-2074606	12-2021	Pamp**********lbs)	General FSA/HSA	USD	$22.31
cust_1	111-1800947-2808231	7-2025	Sun **********6 oz	General FSA/HSA	USD	$13.37
cust_1	111-9334120-6185810	11-2021	Pamp**********lbs)	General FSA/HSA	USD	$28.69
cust_1	112-3277915-3871452	9-2021	Pamp**********lbs)	General FSA/HSA	USD	$28.41
cust_1	111-4683115-0581816	1-2022	Pamp**********lbs)	General FSA/HSA	USD	$27.65
cust_2	111-4093589-5019409	4-2019	Baby**********Pack	General FSA/HSA	USD	$22.24
cust_2	112-0058571-5953862	2-2026	Kieh**********l oz	General FSA/HSA	USD	$48.37
cust_2	113-1926323-0442610	6-2024	Kieh**********l oz	General FSA/HSA	USD	$44.44
cust_2	114-2342275-6469811	12-2024	Kieh**********l oz	General FSA/HSA	USD	$54.32
cust_2	113-3928775-7080222	5-2025	Kieh**********l oz	General FSA/HSA	USD	$39.90
cust_2	112-0653424-6762657	8-2025	Kieh**********l oz	General FSA/HSA	USD	$35.92
cust_2	112-7518733-2752222	11-2025	Kieh**********l oz	General FSA/HSA	USD	$41.32
Medical Devices
cust_1	114-9033572-9880253	2-2018	Newk**********eter	Medical Devices	USD	$15.27
cust_1	113-7458070-8221843	4-2024	JA-R**********6-2p	Medical Devices	USD	$9.04
cust_1	112-7988003-9562665	12-2024	Skil**********ults	Medical Devices	USD	$0.00
cust_1	113-6826678-0824251	7-2022	TEA **********2 OZ	Medical Devices	USD	$9.63
cust_1	002-0908886-0308264	8-2015	Nube**********able	Medical Devices	USD	$9.96
cust_1	111-6938660-0891453	8-2025	Sess**********book	Medical Devices	USD	$11.69
cust_1	114-2627810-2825050	8-2018	Avee**********. oz	Medical Devices	USD	$18.12
cust_1	112-1130020-3678605	1-2020	Frac**********6 OZ	Medical Devices	USD	$16.55
cust_1	110-4770389-1029840	10-2015	Hair**********Wrap	Medical Devices	USD	$19.57
cust_1	105-0151318-9022604	12-2015	Esse**********rapy	Medical Devices	USD	$77.44
cust_1	111-7068540-9605846	12-2025	Craz********** USA	Medical Devices	USD	$11.27
cust_1	105-8202098-5233056	10-2016	Arom**********ack)	Medical Devices	USD	$10.37
cust_1	111-7068540-9605846	12-2025	Craz********** USA	Medical Devices	USD	$12.11
cust_1	114-3551914-3148242	8-2018	Aqua**********unce	Medical Devices	USD	$13.55
cust_1	111-3085155-2599469	9-2020	Maje**********ary)	Medical Devices	USD	$15.89
cust_1	113-0010372-5409850	8-2019	Medi**********hite	Medical Devices	USD	$308.88
cust_1	107-8733716-1693847	2-2017	Empt**********ings	Medical Devices	USD	$8.83
cust_2	112-9689746-5453061	4-2023	Ruth********** Set	Medical Devices	USD	$274.43
cust_2	114-0110855-3553024	12-2022	Kipi********** Gym	Medical Devices	USD	$17.97
cust_2	114-9249680-8065813	10-2023	Aoli**********ancy	Medical Devices	USD	$23.94
cust_2	113-9123562-4762628	5-2025	Huma**********lers	Medical Devices	USD	$25.66
cust_2	113-2625042-7384207	8-2025	LESC**********lmet	Medical Devices	USD	$321.56
cust_2	104-7416017-0699423	7-2016	EvoN**********ack)	Medical Devices	USD	$84.29
PPE & Prevention
cust_1	113-8609955-6621848	10-2018	PURE**********ttle	PPE & Prevention	USD	$29.07
cust_1	114-5677546-6950667	6-2018	1 X **********ttle	PPE & Prevention	USD	$6.65
cust_1	113-3286887-8268203	7-2020	Hand**********0 ml	PPE & Prevention	USD	$24.47
cust_1	114-2077704-9365062	8-2021	Pers**********f 3)	PPE & Prevention	USD	$12.99
cust_1	112-9791026-4815436	9-2018	1 X **********ttle	PPE & Prevention	USD	$5.52
cust_1	113-4671227-4000201	8-2019	PURE**********4-EC	PPE & Prevention	USD	$23.89
cust_1	114-6107407-2375452	9-2022	Pure**********DECO	PPE & Prevention	USD	$18.81
cust_1	111-4733609-4858638	7-2020	Wedd**********Grey	PPE & Prevention	USD	$13.91
cust_1	111-5624520-3829059	7-2020	Supp**********Pack	PPE & Prevention	USD	$21.51
cust_2	113-3309505-4635404	3-2021	COVI**********ated	PPE & Prevention	USD	$13.44
cust_2	114-7004198-5218612	12-2023	Purj**********rey)	PPE & Prevention	USD	$13.55