Sample messy data

Point-of-sale transactions with realistic mess layered in: missing fields, duplicate rows, typos and inconsistent dates — a believable test for cleaning and validation pipelines.

Retail POS | Seeded, reproducible | CSV / Excel / JSON / SQL | 100% in-browser

Generate & download

About this dataset

This is a free, reproducible Retail POS dataset you can generate and download right here as CSV, Excel, JSON or SQL. It is built for data cleaning, pipeline validation and anomaly vs. error triage. Because the fields are correlated rather than independently random (line_total, for example, is quantity × unit_price), the numbers hold together when you analyze them.

The catalog is organized into real affinity groups (e.g. chips + salsa + soda) that co-occur within baskets, so an association-rule miner actually surfaces lift — exactly what a market-basket exercise needs.
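As a rough illustration of the market-basket idea above, the sketch below computes support and lift for every product pair that shares a basket. It assumes the export has separate `transaction_id` and `product` columns; the function name `pair_lift` and the tiny sample frame are made up for the example and stand in for the real file.

```python
from collections import Counter
from itertools import combinations

import pandas as pd


def pair_lift(df: pd.DataFrame) -> pd.DataFrame:
    """Support and lift for each product pair that co-occurs in a basket."""
    # One set of distinct products per basket; rows with no product are skipped.
    baskets = df.dropna(subset=["product"]).groupby("transaction_id")["product"].apply(set)
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rows = []
    for (a, b), count in pair_counts.items():
        support = count / n
        # lift = P(a and b) / (P(a) * P(b)); values above 1 suggest affinity.
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        rows.append({"a": a, "b": b, "support": support, "lift": lift})
    table = pd.DataFrame(rows, columns=["a", "b", "support", "lift"])
    return table.sort_values("lift", ascending=False, ignore_index=True)


# Hand-made sample (assumption): four baskets, chips and salsa bought together twice.
sample = pd.DataFrame({
    "transaction_id": [1, 1, 2, 2, 3, 4],
    "product": ["chips", "salsa", "chips", "salsa", "milk", "chips"],
})
lift_table = pair_lift(sample)
print(lift_table)
```

On the real export you would pass the loaded DataFrame instead of `sample`. Cleaning product typos first is probably worthwhile, since misspelled names split an item's counts.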

Columns in this dataset

Schema for the Retail POS export (the anomaly column appears only when labels are switched on):

Column | Type | Description
transaction_id | integer | The basket; item rows share it.
datetime | datetime | Timestamp with realistic hour-of-day weighting.
store_id | text | Which store rang the sale.
product / department | text | Item and its aisle/department.
quantity / unit_price | number | Units and shelf price.
line_total | number | quantity × unit_price.
payment | text | Card / Cash / Mobile.
anomaly | 0/1 | Present with labels on; flags suspicious transactions.
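The schema lends itself to a simple validation pass. The sketch below checks for missing columns and flags rows where line_total does not match quantity × unit_price. It assumes the paired schema entries are exported as separate columns named `product`, `department`, `quantity` and `unit_price`; the helper names and the tolerance are illustrative choices, not part of the generator.

```python
import pandas as pd

# Assumed column names, derived from the schema table (anomaly is optional).
EXPECTED_COLUMNS = [
    "transaction_id", "datetime", "store_id", "product", "department",
    "quantity", "unit_price", "line_total", "payment",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from the frame."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


def check_line_totals(df: pd.DataFrame, tol: float = 0.01) -> pd.DataFrame:
    """Return rows whose line_total is missing or disagrees with quantity * unit_price."""
    # Coerce to numbers so typos such as "2x" become NaN instead of raising.
    qty = pd.to_numeric(df["quantity"], errors="coerce")
    price = pd.to_numeric(df["unit_price"], errors="coerce")
    total = pd.to_numeric(df["line_total"], errors="coerce")
    expected = qty * price
    bad = expected.isna() | total.isna() | ((total - expected).abs() > tol)
    return df[bad]


# Hand-made sample (assumption): row 2 has a wrong total, row 3 a missing quantity.
sample = pd.DataFrame({
    "quantity": [2, 1, 3, None],
    "unit_price": [1.50, 4.00, 2.00, 1.00],
    "line_total": [3.00, 4.00, 7.00, 1.00],
})
bad_rows = check_line_totals(sample)
print(bad_rows)
print(missing_columns(sample))
```

When labels are switched on, comparing the flagged rows against the anomaly column is one way to start separating injected data errors from suspicious but well-formed transactions.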

Load it with pandas

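import pandas as pd

# Read the CSV export; date strings stay as text until you clean them.
df = pd.read_csv("retail_pos.csv")
print(df.head())  # preview the first five rows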
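Once the file is loaded, a first cleaning pass over the mess described at the top of this page (duplicate rows, missing fields, inconsistent dates) might look like the sketch below. The date formats in `DATE_FORMATS` are guesses for illustration only; inspect the actual file and adjust them. The helper names are hypothetical.

```python
import pandas as pd

# Assumed formats; the generator's real date variants may differ.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m/%d/%Y %H:%M", "%d-%b-%Y %H:%M")


def parse_mixed_dates(series: pd.Series, formats=DATE_FORMATS) -> pd.Series:
    """Try each format in turn; values matching none of them become NaT."""
    parsed = pd.to_datetime(series, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        # Only fill positions that earlier formats failed to parse.
        parsed = parsed.fillna(pd.to_datetime(series, format=fmt, errors="coerce"))
    return parsed


def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and normalize the datetime column."""
    out = df.drop_duplicates().copy()
    out["datetime"] = parse_mixed_dates(out["datetime"])
    return out


# Hand-made sample (assumption): one duplicate row, two date styles, one bad date.
sample = pd.DataFrame({
    "transaction_id": [1, 1, 2, 3],
    "datetime": ["2024-01-05 09:30:00", "2024-01-05 09:30:00",
                 "01/06/2024 14:05", "not a date"],
    "product": ["chips", "chips", "salsa", None],
})
cleaned = basic_clean(sample)
print(cleaned)
print(cleaned.isna().sum())  # missing values per column after cleaning
```

Trying explicit formats one at a time is slower than a single inferred parse, but it makes ambiguous dates such as 01/06/2024 resolve the same way every run.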

Good for

Data cleaning
Pipeline validation
Anomaly vs. error triage
ETL robustness tests

FAQ

How big is this dataset?

Around 9,000 rows by default. Change the row count in the generator above and re-export — anything up to ~200k works in the browser.

What formats can I download?

CSV, Excel (.xlsx), JSON, and SQL (a CREATE TABLE plus INSERT statements). Pick whatever fits your workflow.
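If you pick the SQL export, one quick way to try it is to load the script into an in-memory SQLite database. This is a sketch under the assumption that the generated CREATE TABLE and INSERT statements are SQLite-compatible; the table name `retail_pos` and the inline sample script are invented for the example and may not match the real export.

```python
import sqlite3


def load_sql_export(sql_text: str) -> sqlite3.Connection:
    """Run a CREATE TABLE + INSERT script against a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(sql_text)
    return conn


# Stand-in for the downloaded script (assumption); with a real file you would
# read its text first, e.g. open("retail_pos.sql", encoding="utf-8").read().
sample_sql = """
CREATE TABLE retail_pos (
    transaction_id INTEGER,
    product TEXT,
    quantity REAL,
    unit_price REAL,
    line_total REAL
);
INSERT INTO retail_pos VALUES (1, 'chips', 2, 1.50, 3.00);
INSERT INTO retail_pos VALUES (1, 'salsa', 1, 4.00, 4.00);
"""

conn = load_sql_export(sample_sql)
row_count = conn.execute("SELECT COUNT(*) FROM retail_pos").fetchone()[0]
basket_total = conn.execute(
    "SELECT SUM(line_total) FROM retail_pos WHERE transaction_id = 1"
).fetchone()[0]
print(row_count, basket_total)
```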

Will I get the same file every time?

Yes. This page uses the fixed seed messy-demo, so the download is byte-identical on every machine. Clear the seed in the generator for fresh random data.
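To confirm for yourself that two downloads are byte-identical, you can compare file hashes. The sketch below is one way to do that with the standard library; the file names are placeholders, and the two temporary files only stand in for two real downloads.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large exports do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with two temporary files standing in for downloads from two machines.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "download_a.csv"
    second = Path(tmp) / "download_b.csv"
    first.write_bytes(b"transaction_id,product\n1,chips\n")
    second.write_bytes(b"transaction_id,product\n1,chips\n")
    same = file_sha256(first) == file_sha256(second)

print(same)
```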

Can I get separate tables, messy data, or other formats?

Yes. Use Tables → Excel/SQL for a normalized multi-table export, switch on Messy / dirty data in Advanced options for nulls, typos and inconsistent dates, and choose CSV, Excel, JSON or SQL on any download.

Is the data real?

No — it is 100% synthetic, generated in your browser, with no real people or companies. Free to use commercially.