Sample data for data cleaning practice

Realistic sales data with intentional quality problems — blanks, duplicate rows, typos and inconsistent dates — so you can practice the full cleaning workflow on data that still makes sense underneath.

B2B distribution · Seeded - reproducible · CSV / Excel / JSON / SQL · 100% in-browser

Generate & download

About this dataset

This is a free, reproducible B2B distribution dataset you can generate and download right here as CSV, Excel, JSON or SQL. It is built for data cleaning workflows, deduplication and type coercion — and because every field is correlated rather than random, the numbers actually hold together when you analyze them.

Each customer is assigned a segment per product category that drives how often and how much they buy, with relationship momentum, occasional large-buy spikes, and category-specific markups — so the file behaves like a real distributor sales export, not random noise.

Columns in this dataset

Schema for the B2B distribution export (the anomaly column appears only when labels are switched on):

| Column | Type | Description |
| --- | --- | --- |
| order_date | date | Business day the order was placed. |
| invoice_no | integer | Invoice id; multiple lines share one invoice. |
| customer_id / customer | int / text | The buying business. |
| product_id / product | int / text | The SKU ordered. |
| category | text | One of five distributor categories. |
| segment | text | Customer role for that category (segA largest to segD smallest). |
| quantity | integer | Units ordered; scales with segment and volume. |
| unit_cost / unit_price | number | Your cost and the price charged. |
| revenue / cost / margin | number | Line economics. |
| ship_date | date | Fulfilment date (usually next business day). |
| anomaly | 0/1 | Present with labels on; flags inflated-price large orders. |
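
The schema above maps fairly directly onto pandas dtypes. Here is a minimal sketch of type coercion, assuming every field arrives as text and that clean dates use the ISO `YYYY-MM-DD` format (an assumption; the export may use another format). The sample rows are invented for illustration and cover only a few of the columns.

```python
import pandas as pd

# Hypothetical raw rows, as they might look when every field is read as text.
raw = pd.DataFrame({
    "order_date": ["2024-03-04", "not a date"],
    "invoice_no": ["1001", "1002"],
    "quantity": ["12", "n/a"],
    "unit_price": ["4.50", "19.99"],
    "ship_date": ["2024-03-05", "2024-03-06"],
})

DATE_COLS = ["order_date", "ship_date"]
NUMERIC_COLS = ["invoice_no", "quantity", "unit_price"]


def coerce_types(df):
    """Return a copy with dates and numbers coerced; bad values become NaT/NaN."""
    out = df.copy()
    for col in DATE_COLS:
        # Assumed date format; values that do not match become NaT.
        out[col] = pd.to_datetime(out[col], format="%Y-%m-%d", errors="coerce")
    for col in NUMERIC_COLS:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


typed = coerce_types(raw)
print(typed.dtypes)
```

Coercing with `errors="coerce"` turns unparseable values into missing values instead of raising, so the rows that need attention can be found afterwards with `isna()`.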

Load it with pandas

import pandas as pd

# Load the CSV export downloaded from this page.
df = pd.read_csv("b2b_invoices.csv")
df.head()  # preview the first five rows
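
Once the file is loaded, a first cleaning pass might look like the sketch below. It uses a small invented frame in place of the downloaded file so that it runs on its own; the specific defects shown (stray whitespace, inconsistent casing, a blank, an exact duplicate row) are assumptions about what a messy export can contain.

```python
import pandas as pd

# Invented stand-in for a slice of the downloaded file.
df = pd.DataFrame({
    "invoice_no": [1001, 1001, 1002, 1003],
    "category": ["Tools", "Tools", " tools ", None],
    "quantity": [5, 5, 2, 7],
})

# Normalize text first so values that differ only by spacing or case match.
df["category"] = df["category"].str.strip().str.title()

# Count blanks per column before deciding how to handle them.
blank_counts = df.isna().sum()

# Drop exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates().reset_index(drop=True)

print(blank_counts)
print(deduped)
```

Duplicates are dropped on the full row rather than on invoice_no alone, because several legitimate lines can share one invoice.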

Good for

Data cleaning workflows · Deduplication · Type coercion · Teaching data quality

FAQ

How big is this dataset?

Around 6,000 rows by default. Change the row count in the generator above and re-export — anything up to ~200k works in the browser.

What formats can I download?

CSV, Excel (.xlsx), JSON, and SQL (a CREATE TABLE plus INSERT statements). Pick whatever fits your workflow.
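
The SQL export should load into any database that accepts plain CREATE TABLE and INSERT statements. Here is a minimal sketch using Python's built-in sqlite3 module. The table name, the column subset and the rows are invented stand-ins for the downloaded script, and the real export's dialect may need small adjustments.

```python
import sqlite3

# Stand-in for the contents of a downloaded .sql file (hypothetical table and rows).
sql_script = """
CREATE TABLE invoices (
    invoice_no INTEGER,
    product TEXT,
    quantity INTEGER,
    unit_price REAL
);
INSERT INTO invoices VALUES (1001, 'Widget', 5, 4.50);
INSERT INTO invoices VALUES (1001, 'Gasket', 2, 1.25);
INSERT INTO invoices VALUES (1002, 'Widget', 1, 4.50);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(sql_script)  # runs every statement in the script

# Line count and total value per invoice.
rows = conn.execute(
    "SELECT invoice_no, COUNT(*), SUM(quantity * unit_price) "
    "FROM invoices GROUP BY invoice_no ORDER BY invoice_no"
).fetchall()
print(rows)
conn.close()
```

For a real download, the script text would be read from the file, for example with `open(path).read()`, before being passed to `executescript`.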

Will I get the same file every time?

Yes. This page uses the fixed seed cleanprac-demo, so the download is byte-identical on every machine. Clear the seed in the generator for fresh random data.
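
One way to confirm that two downloads really are byte-identical is to compare their SHA-256 digests. This is a small sketch using only the standard library; the temporary files and their contents are placeholders for two separately downloaded copies.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two temp files stand in for copies downloaded on different machines.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "copy_a.csv"
    second = Path(tmp) / "copy_b.csv"
    first.write_bytes(b"invoice_no,quantity\n1001,5\n")
    second.write_bytes(b"invoice_no,quantity\n1001,5\n")
    same = sha256_of(first) == sha256_of(second)

print(same)
```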

Can I get separate tables, messy data, or other formats?

Yes. Use Tables → Excel/SQL for a normalized multi-table export, switch on Messy / dirty data in Advanced options for nulls, typos and inconsistent dates, and choose CSV, Excel, JSON or SQL on any download.
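
With the normalized multi-table export, the flat view can be rebuilt by joining tables on their ids. Below is a sketch with pandas, assuming a customers table and an order-lines table that share customer_id. The actual table names and columns in the export may differ, and the rows here are invented.

```python
import pandas as pd

# Invented stand-ins for two tables from the normalized export.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "customer": ["Acme Supply", "Bolt & Nut Co"],
})
lines = pd.DataFrame({
    "invoice_no": [1001, 1001, 1002],
    "customer_id": [1, 1, 2],
    "quantity": [5, 2, 1],
})

# validate="many_to_one" raises if customer_id is not unique in customers,
# which catches duplicated dimension rows before they multiply order lines.
flat = lines.merge(customers, on="customer_id", how="left", validate="many_to_one")
print(flat)
```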

Is the data real?

No — it is 100% synthetic, generated in your browser, with no real people or companies. Free to use commercially.