Data Cleaning Basics: What to Do Before You Analyse
You can't get reliable insights from messy data. Before you analyse, you need to clean. Here are the basics.
Remove Duplicates
Duplicate rows skew counts and averages. Use your spreadsheet's built-in duplicate removal or a simple deduplication step. Decide in advance which columns define a "duplicate"—often a combination of ID, email, or date.
Fix Inconsistent Formatting
Dates like "15/08/2024", "Aug 15 2024", and "2024-08-15" in the same column cause problems. Standardise to one format. Same for text: "Yes" and "yes" and "YES" should be consistent. Use find-and-replace or formulas to normalise.
Handle Missing Values
Empty cells break calculations and charts. Decide how to handle them: remove the row, fill with a default (e.g. 0 or "Unknown"), or flag for manual review. Document your choice so others understand the data.
Check for Outliers
A single typo—"10000" instead of "100"—can distort averages and totals. Scan for values that don't make sense. Use simple checks: min, max, and a quick visual scan of the distribution.
Validate Against Rules
If a column should only contain certain values—e.g. status as "Active" or "Inactive"—check that it does. Invalid entries often come from manual entry or imports. Fix or flag them before analysis.
Data cleaning is not glamorous, but it's essential. A few hours spent cleaning can save days of wrong conclusions later.