Data Cleaning Basics: What to Do Before You Analyse

Data Cleaning, Analysis
Analysis is only as good as the data you put in. Here are the essential cleaning steps before you run any numbers.

You can't get reliable insights from messy data. Before you analyse, you need to clean. Here are the basics.

Remove Duplicates

Duplicate rows skew counts and averages. Use your spreadsheet's built-in duplicate removal or a simple deduplication step. Decide in advance which columns define a "duplicate"—often a combination of ID, email, or date.

Fix Inconsistent Formatting

Dates like "15/08/2024", "Aug 15 2024", and "2024-08-15" in the same column cause problems. Standardise to one format. Same for text: "Yes" and "yes" and "YES" should be consistent. Use find-and-replace or formulas to normalise.

Handle Missing Values

Empty cells break calculations and charts. Decide how to handle them: remove the row, fill with a default (e.g. 0 or "Unknown"), or flag for manual review. Document your choice so others understand the data.

Check for Outliers

A single typo—"10000" instead of "100"—can distort averages and totals. Scan for values that don't make sense. Use simple checks: min, max, and a quick visual scan of the distribution.

Validate Against Rules

If a column should only contain certain values—e.g. status as "Active" or "Inactive"—check that it does. Invalid entries often come from manual entry or imports. Fix or flag them before analysis.

Data cleaning is not glamorous, but it's essential. A few hours spent cleaning can save days of wrong conclusions later.

© 2024 Vahdettin Karataş. All rights reserved.
Data, reporting & automation.