Amazon Raw Data Audit: #Project-FSA

Task: Amazon Ingestion & Sanitization

Author

Larry Mannings

Published

March 25, 2026

Purpose

Construct a notebook for exploring the data in the raw Amazon data to help us focus on the key variables and “grimey” data (HTML, encoding issues, and outliers) before proceeding to Step 5: Parse and Filter.

Code
library(tidyverse)    # Core data manipulation (dplyr, stringr, lubridate, etc.)
library(readr)        # reads table data into R
library(skimr)        # Provides descriptive statistics and data quality summaries supporting tidyverse
library(stringi)      # R package for efficient string processing
library(knitr)        # Power R Markdown by handling the execution of embedded R code
library(kableExtra)   # Extends the basic functionality of tables produced by `knitr` package
library(lubridate)    # Library that contains functions for working with date-time data
library(scales)       # Provides functions for human readable labels for axes and legends
set.seed(1)

Data Exploration

Profile the dataset

First we load the data and look at the completeness of the dataset using skimr()

Code
amazon_combined <- read.csv("../data/amazon_combined.csv")
skim(amazon_combined)
Data summary
Name amazon_combined
Number of rows 9407
Number of columns 29
_______________________
Column type frequency:
character 26
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
drive_id 0 1 33 33 0 2 0
asin 0 1 10 10 0 5278 0
billing_address 0 1 13 79 0 20 0
carrier_name_._tracking_number 0 1 5 424 0 3305 0
currency 0 1 3 3 0 2 0
gift_message 0 1 13 189 0 57 0
gift_recipient_contact 0 1 13 25 0 6 0
gift_sender_name 0 1 5 31 0 18 0
item_serial_number 0 1 13 121 0 82 0
order_date 0 1 20 24 0 2596 0
order_id 0 1 19 19 0 3100 0
order_status 0 1 6 10 0 3 0
payment_method_type 0 1 11 43 0 29 0
product_condition 0 1 3 13 0 4 0
product_name 0 1 4 458 0 5897 0
purchase_order_number 0 1 14 14 0 1 0
ship_date 0 1 13 70 0 3816 0
shipment_item_subtotal 0 1 1 13 0 1930 0
shipment_item_subtotal_tax 0 1 1 13 0 608 0
shipment_status 0 1 7 31 0 4 0
shipping_address 0 1 13 96 0 39 0
shipping_option 0 1 3 23 0 45 0
total_amount 0 1 1 8 0 2719 0
total_discounts 0 1 1 8 0 267 0
unit_price 0 1 1 8 0 2093 0
website 0 1 7 10 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
original_quantity 0 1 1.02 0.37 0 1 1 1.00 10.00 ▇▁▁▁▁
shipping_charge 0 1 0.23 1.92 0 0 0 0.00 120.93 ▇▁▁▁▁
unit_price_tax 0 1 0.82 3.80 0 0 0 0.77 223.79 ▇▁▁▁▁

Key Variables

A quick review of the skim statistics indicate that there are 7 of 29 variables that are useful to this project, namely:

drive_id Unique identifier for the csv file ingested from Google Drive

currency Unit of exchage for purchasing goods. Primarily USD but may be CAD on occasion

order_id An Amazon-ascribed unique 17-digit number that identifies a specific purchase

order_date The timestamp when a purchase is officially placed and confirmed

order_status Status of purchase. “Closed” is the status for confirmed purchases

product_name A descriptive, often long-form title designed to identify products

total_amount The cumulative cost of the item including price and taxes

drive_id

The drive_id variable contains unique values corresponding to the files of individual household members before the data frames were merged as one. These values will be used later to construct distinct user names associated with the observations, namely, customer_1 and customer_2.

Code
unique(amazon_combined$drive_id)
[1] "10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF" "1Vu8KtCY_nIuCJtYe3oH1QAn-49ys6vzA"

currency

The currency variable is likely populated with “USD” for this case example but it is prudent to capture other forms of currency for other customer use cases so that currency conversion can be conducted based on conditional case.

Code
unique(amazon_combined$currency)
[1] "USD" "CAD"

order_id

The order_id variable must be retained with the scrubbed dataset as it maintains traceability to the original purchase, ensuring that the customer can retrieve supporting documentation such as paid invoices. It should be noted that order_id has no context outside of the customer’s Amazon account so it can be distributed in a public-facing data enviroment without masking.

order_date

The order_date timestamp contains information that could be mined for consumer behavior and thus must be scrubbed for generalization as the month and year of purchase is sufficient for a public facing dataset. If desired, the end user can retrieve the actual timestamp for an item purchase referencing the order_id. Below is a randomly selected order_date observation.

Code
sample(amazon_combined$`order_date`, 1)
[1] "2020-07-06T03:11:23Z"

order_status

order_status does not contain any privacy data, but must be retained in the dataset for the purpose of validating a purchase. For the purpose of this project only observations with order status “Closed” will remain after a filter as this status indicates the the payment was finished.

Code
unique(amazon_combined$`order_status`)
[1] "Closed"     "Authorized" "Cancelled" 

product_name

Variable ‘product_name’ contains information that could be mined for consumer behavior and thus must be scrubbed for generalization to avoid identifying specific purchases with a user. Below is a randomly selected product_name observation.

Code
sample(amazon_combined$`product_name`, 1)
[1] "JoyCat 24 Pack Tic Tac Toe Board Game,Christmas Party Favors for Kids,Classroom Prizes for Goodie Bag Stuffers,Gifts for Students with Envelope,Easy to Store(Farm)"

total_amount

Lastly, the variable total_amount contains price data that can be mined for consumer behavior and thus must be generalized to avoid the data being used for unintended purposes. Below is a randomly selected total_amount observation.

Code
sample(amazon_combined$`total_amount`, 1)
[1] "5.27"

Key Variable Summary

In summary, below is a table of the variables essential to the FSA Eligible reimbursement criteria summary and their data senstitivity impact:

Code
FSA_summary <- data.frame(drive_id = "Customer Name Keyword", currency = "No Impact", order_id = "No Impact", 
                          order_date = "Month/Date Generalize", order_status = "No Impact", 
                          product_name = "Category Generalize", total_amount = "Mask with Randomness")
kable(FSA_summary) |> kable_styling(bootstrap_options = c("striped", "bordered"), full_width = FALSE)
drive_id currency order_id order_date order_status product_name total_amount
Customer Name Keyword No Impact No Impact Month/Date Generalize No Impact Category Generalize Mask with Randomness

Data Sanitization Techniqes

It is worthwhile to explore how the data sanitization techniques will look before developing them in the production script. Let’s reduce the dataset to include only the relevant variables first. Note: Variables order_id and order_status have no sanitization impact and thus are not displayed in this section to save room.

Code
FSA_reduced <- amazon_combined |> select(drive_id, order_date, product_name, currency, total_amount) |>
    mutate(product_name = str_trunc(product_name, width = 20))
head(FSA_reduced)
                           drive_id           order_date         product_name
1 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2024-06-08T03:54:26Z SHOKZ OpenRun Pro...
2 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2024-06-08T03:54:26Z JBL Endurance Pea...
3 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2024-06-08T03:54:26Z Triple Strength O...
4 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2024-01-14T00:29:01Z Soft Scrub All Pu...
5 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2025-04-06T16:21:29Z Honeydew Melon, 1...
6 10AxRBXfFP4Yel1gJ_vs85rvWLYbWIEKF 2025-04-06T16:21:29Z SEAFOOD Flounder ...
  currency total_amount
1      USD       192.55
2      USD       106.95
3      USD        32.05
4      USD         5.11
5      USD         3.99
6      USD        19.25

drive_id

Recall that, for drive_id, we want to convert the Google Drive file id corresponding to Customer 1 or Customer 2 to the descriptor cust_1 or cust_2. See the code below followed by the summary of the beginning and the end of the dataset. The variable product_name is withheld for the moment until privacy scrubbing can be applied.

Code
# Create function that converts google drive name to a corresponding customer id
change2custid <- function(drive_names, drive_vals){
    for(i in 1:length(drive_names)){
        for(j in 1:length(drive_vals)){
        if(drive_vals[j] == drive_names[i]) {
            drive_names[i] <- paste0("cust_",j)   # convert drive id name to customer id name and return
        }
    }
        
    }
    return(drive_names)
    
}
FSA_reduced <- amazon_combined |> select(drive_id, order_date, currency, total_amount)
df <- FSA_reduced |> rename(cust_id = drive_id) |>   # change column name to reflect customer ids
    mutate(cust_id = change2custid(cust_id, unique(cust_id)))
head(df, 4)
  cust_id           order_date currency total_amount
1  cust_1 2024-06-08T03:54:26Z      USD       192.55
2  cust_1 2024-06-08T03:54:26Z      USD       106.95
3  cust_1 2024-06-08T03:54:26Z      USD        32.05
4  cust_1 2024-01-14T00:29:01Z      USD         5.11
Code
tail(df,4)
     cust_id           order_date currency total_amount
9404  cust_2 2024-12-20T22:34:24Z      USD        134.8
9405  cust_2 2024-06-12T00:31:54Z      USD       145.51
9406  cust_2 2025-12-05T06:21:12Z      USD         48.1
9407  cust_2 2023-01-19T00:42:12Z      USD         8.55

order_date

The variable order_date should be generalized to display month and year only in the sanitized dataset. If you recall from the order_date Key Variable Summary , the ‘order-date’ format is “YYYY-MM-DDTHH:MM:SSz”, which is a string type that can be easily handled by the lubridate library to generalize the ymd_hms format to a MM-YYYY format. This is accomplished as follows:

Code
df <- df |> mutate(order_date = ymd_hms(order_date)) |>
    mutate(order_date = paste0(month(order_date), "-", year(order_date)))
head(df)
  cust_id order_date currency total_amount
1  cust_1     6-2024      USD       192.55
2  cust_1     6-2024      USD       106.95
3  cust_1     6-2024      USD        32.05
4  cust_1     1-2024      USD         5.11
5  cust_1     4-2025      USD         3.99
6  cust_1     4-2025      USD        19.25

product_name

Filter for FSA Eligibity

The objective here is to construct a Regular Expression (regex) pattern and using this pattern in conjunction with stringi library tools to detect pattern matches within the product_name column. Thus reducing the dataset to FSA Eligible items should simply be a matter of filtering based on a string search.

Code
# Add `product_name` and `order_id` back to the data frame.
df <- df |> add_column(product_name = amazon_combined$product_name) |>
    add_column(order_id = amazon_combined$order_id) |>
    select(cust_id, order_id, order_date, product_name, currency, total_amount)

fsa_keywords <- regex("face\\smask|FSA\\sHSA|N95\\smask|thermometer|first\\said|bandage|sunscreen|light\\stherapy|medical|
              brace|sanitizer|tylenol|\\sadvil\\s|covid|therapy", ignore_case = TRUE) 
fsa_candidates <- df |> filter(str_detect(product_name, fsa_keywords))  # filter dataset for FSA Eligible matches
message("Number of FSA Eligible items found: ", nrow(fsa_candidates))
Number of FSA Eligible items found: 57

Generalize FSA Products by Category

In the Filter for FSA Eligibility section above, we included about 15 keywords for pattern detection, but we actually want to group them into about 4 or 5 clean, broad buckets (e.g., putting all masks and sanitizers into a “Personnel Protective Equipment” category). We will use case_when() logic to help us in this regard. The fsa_category column generated by this step now permits us to represent the products in a pulblic facing dataset. The first 10 observations of this FSA-eligible filter result are tabulated.

Code
fsa_candidates <- fsa_candidates |>
  mutate(
    fsa_category = case_when(
      # Bucket 1: PPE & Prevention
      str_detect(product_name, regex("face\\smask|N95\\smask|sanitizer|covid", ignore_case = TRUE)) ~ "PPE & Prevention",
      
      # Bucket 2: Medical Devices
      str_detect(product_name, regex("thermometer|brace|light\\stherapy|therapy|medical", ignore_case = TRUE)) ~ "Medical Devices",
      
      # Bucket 3: First Aid & OTC Meds
      str_detect(product_name, regex("first\\said|bandage|tylenol|advil", ignore_case = TRUE)) ~ "First Aid & OTC",
      
      # Bucket 4: General FSA/HSA
      str_detect(product_name, regex("FSA\\sHSA|sunscreen", ignore_case = TRUE)) ~ "General FSA/HSA",
      
      # Default: If it doesn't match above, label it NA
      TRUE ~ NA_character_
    )
  ) |> select(cust_id, order_id, order_date, product_name, fsa_category, currency, total_amount)

# Prepare for tabulation
df <- fsa_candidates |> select(-order_id, -product_name)

kable(df[1:10, ]) |> kable_styling(bootstrap_options = c("striped", "bordered"), full_width = FALSE) #display fsa-eligible observations
cust_id order_date fsa_category currency total_amount
cust_1 6-2024 General FSA/HSA USD 10.69
cust_1 2-2018 Medical Devices USD 17.95
cust_1 10-2018 First Aid & OTC USD 9.38
cust_1 10-2018 PPE & Prevention USD 23.78
cust_1 8-2024 General FSA/HSA USD 17.08
cust_1 4-2024 Medical Devices USD 8.49
cust_1 12-2024 Medical Devices USD 0
cust_1 7-2022 Medical Devices USD 11.29
cust_1 8-2015 Medical Devices USD 11.88
cust_1 8-2025 Medical Devices USD 10.69

total_amount

Add Noise to mask exact price data

For the final variable in our dataset, we generate random uniform random noise (+/25%) to the total_amount variable. This should mitigate the risk of a bad faith actor attempting to indirectly identify item purchased based on item price.

Code
# modify 'total_amount` from it's original value by a randomly selected +/- 25% margin
fsa_candidates <- fsa_candidates |> mutate(total_amount = as.numeric(total_amount) * runif(n(), 0.75, 1.25)) |>
    mutate_at(vars(total_amount), funs(dollar(.)))    # add currency symbol

Concluding Remarks

The fully scrubbed dataset is now ready for final formatting in preparation for delivery to the customer. The completed table is depicted below. Note that the public facing dataset has the majority of the product_name data masked.

Code
 df2 <- fsa_candidates |>   arrange(fsa_category) # Alphabetically sort the dataset by `fsa_category`

# Prepare to mask the product_name data for public facing dataset
df3 <- df2 |> mutate(product_name = str_replace(product_name, "^(.{4}).*(.{4})$", "\\1**********\\2"))

#display fsa-eligible observations
kable(df3) |> kable_styling(bootstrap_options = c("striped", "bordered"), full_width = FALSE) |>
# pack_rows looks at how many times each category appears and draws the headers
  pack_rows(index = table(df2$fsa_category))
cust_id order_id order_date product_name fsa_category currency total_amount
First Aid & OTC
cust_1 113-8609955-6621848 10-2018 Infa**********ry ) First Aid & OTC USD $11.25
cust_1 113-3239294-7324261 1-2020 Chil**********. oz First Aid & OTC USD $23.88
cust_1 113-7744157-8873047 7-2020 Swis**********ival First Aid & OTC USD $36.46
cust_1 112-9687788-0137844 1-2019 Infa**********l oz First Aid & OTC USD $10.76
cust_2 112-8086631-9349047 11-2019 Hego**********e-1) First Aid & OTC USD $62.66
General FSA/HSA
cust_1 112-6986889-9915463 6-2024 Baby**********3 oz General FSA/HSA USD $12.87
cust_1 111-9510385-2246650 8-2024 Comf**********acks General FSA/HSA USD $18.45
cust_1 114-8332615-1848259 5-2018 Baby**********f 4) General FSA/HSA USD $19.45
cust_1 113-7744157-8873047 7-2020 Thin**********nce) General FSA/HSA USD $16.48
cust_1 113-8212141-8198637 5-2019 Baby**********f 2) General FSA/HSA USD $17.71
cust_1 112-6169043-3177055 9-2024 LAO ********** Net General FSA/HSA USD $12.79
cust_1 111-7957159-2074606 12-2021 Pamp**********lbs) General FSA/HSA USD $22.31
cust_1 111-1800947-2808231 7-2025 Sun **********6 oz General FSA/HSA USD $13.37
cust_1 111-9334120-6185810 11-2021 Pamp**********lbs) General FSA/HSA USD $28.69
cust_1 112-3277915-3871452 9-2021 Pamp**********lbs) General FSA/HSA USD $28.41
cust_1 111-4683115-0581816 1-2022 Pamp**********lbs) General FSA/HSA USD $27.65
cust_2 111-4093589-5019409 4-2019 Baby**********Pack General FSA/HSA USD $22.24
cust_2 112-0058571-5953862 2-2026 Kieh**********l oz General FSA/HSA USD $48.37
cust_2 113-1926323-0442610 6-2024 Kieh**********l oz General FSA/HSA USD $44.44
cust_2 114-2342275-6469811 12-2024 Kieh**********l oz General FSA/HSA USD $54.32
cust_2 113-3928775-7080222 5-2025 Kieh**********l oz General FSA/HSA USD $39.90
cust_2 112-0653424-6762657 8-2025 Kieh**********l oz General FSA/HSA USD $35.92
cust_2 112-7518733-2752222 11-2025 Kieh**********l oz General FSA/HSA USD $41.32
Medical Devices
cust_1 114-9033572-9880253 2-2018 Newk**********eter Medical Devices USD $15.27
cust_1 113-7458070-8221843 4-2024 JA-R**********6-2p Medical Devices USD $9.04
cust_1 112-7988003-9562665 12-2024 Skil**********ults Medical Devices USD $0.00
cust_1 113-6826678-0824251 7-2022 TEA **********2 OZ Medical Devices USD $9.63
cust_1 002-0908886-0308264 8-2015 Nube**********able Medical Devices USD $9.96
cust_1 111-6938660-0891453 8-2025 Sess**********book Medical Devices USD $11.69
cust_1 114-2627810-2825050 8-2018 Avee**********. oz Medical Devices USD $18.12
cust_1 112-1130020-3678605 1-2020 Frac**********6 OZ Medical Devices USD $16.55
cust_1 110-4770389-1029840 10-2015 Hair**********Wrap Medical Devices USD $19.57
cust_1 105-0151318-9022604 12-2015 Esse**********rapy Medical Devices USD $77.44
cust_1 111-7068540-9605846 12-2025 Craz********** USA Medical Devices USD $11.27
cust_1 105-8202098-5233056 10-2016 Arom**********ack) Medical Devices USD $10.37
cust_1 111-7068540-9605846 12-2025 Craz********** USA Medical Devices USD $12.11
cust_1 114-3551914-3148242 8-2018 Aqua**********unce Medical Devices USD $13.55
cust_1 111-3085155-2599469 9-2020 Maje**********ary) Medical Devices USD $15.89
cust_1 113-0010372-5409850 8-2019 Medi**********hite Medical Devices USD $308.88
cust_1 107-8733716-1693847 2-2017 Empt**********ings Medical Devices USD $8.83
cust_2 112-9689746-5453061 4-2023 Ruth********** Set Medical Devices USD $274.43
cust_2 114-0110855-3553024 12-2022 Kipi********** Gym Medical Devices USD $17.97
cust_2 114-9249680-8065813 10-2023 Aoli**********ancy Medical Devices USD $23.94
cust_2 113-9123562-4762628 5-2025 Huma**********lers Medical Devices USD $25.66
cust_2 113-2625042-7384207 8-2025 LESC**********lmet Medical Devices USD $321.56
cust_2 104-7416017-0699423 7-2016 EvoN**********ack) Medical Devices USD $84.29
PPE & Prevention
cust_1 113-8609955-6621848 10-2018 PURE**********ttle PPE & Prevention USD $29.07
cust_1 114-5677546-6950667 6-2018 1 X **********ttle PPE & Prevention USD $6.65
cust_1 113-3286887-8268203 7-2020 Hand**********0 ml PPE & Prevention USD $24.47
cust_1 114-2077704-9365062 8-2021 Pers**********f 3) PPE & Prevention USD $12.99
cust_1 112-9791026-4815436 9-2018 1 X **********ttle PPE & Prevention USD $5.52
cust_1 113-4671227-4000201 8-2019 PURE**********4-EC PPE & Prevention USD $23.89
cust_1 114-6107407-2375452 9-2022 Pure**********DECO PPE & Prevention USD $18.81
cust_1 111-4733609-4858638 7-2020 Wedd**********Grey PPE & Prevention USD $13.91
cust_1 111-5624520-3829059 7-2020 Supp**********Pack PPE & Prevention USD $21.51
cust_2 113-3309505-4635404 3-2021 COVI**********ated PPE & Prevention USD $13.44
cust_2 114-7004198-5218612 12-2023 Purj**********rey) PPE & Prevention USD $13.55