Filtering and Cleaning Data

Real-world data is messy. Before you can analyze effectively, you often need to filter out irrelevant records, handle missing values, and standardize inconsistent formats. This guide covers the essential data cleaning techniques you’ll use in almost every analysis.

The principle is simple: garbage in, garbage out. Analysis on dirty data produces misleading results.

Common problems dirty data causes:

  • Duplicate rows inflate your totals
  • Null values break calculations or get silently excluded
  • Inconsistent formats (“CA” vs “California”) split what should be one category
  • Irrelevant records add noise to your analysis

A few minutes of cleaning saves hours of debugging wrong results.

Filtering removes rows you don’t need, focusing your analysis on what matters.

Filter by a single condition:

"Filter to 2024 transactions only"
"Show only active customers"
"Keep rows where status is 'Complete'"
"Remove rows where amount is zero"

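Under the hood, prompts like these usually become simple row filters. If you want to sanity-check the logic yourself, here’s a minimal pandas sketch; the DataFrame and its status and amount columns are made up for illustration:

import pandas as pd

# Hypothetical sample data standing in for your real dataset
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "status": ["Complete", "Pending", "Complete", "Cancelled"],
    "amount": [120.0, 0.0, 45.5, 80.0],
})

complete = df[df["status"] == "Complete"]   # keep rows where status is 'Complete'
nonzero = df[df["amount"] != 0]             # remove rows where amount is zero
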
Use comparisons for numeric and date filtering:

"Filter to orders over $100"
"Keep transactions from the last 30 days"
"Show customers who joined before 2023"
"Remove products with negative inventory"

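Comparisons translate just as directly. A small sketch, again with invented data, mirroring the “over $100” and “last 30 days” prompts:

import pandas as pd

df = pd.DataFrame({
    "amount": [250.0, 40.0, 130.0],
    "order_date": pd.to_datetime(["2024-01-05", "2024-06-20", "2025-01-30"]),
})

over_100 = df[df["amount"] > 100]                       # orders over $100
cutoff = pd.Timestamp.today() - pd.Timedelta(days=30)
recent = df[df["order_date"] >= cutoff]                 # transactions from the last 30 days
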
Combine conditions with AND/OR logic:

"Filter to California customers from 2024"
"Keep orders over $100 from active accounts"
"Show Premium or Enterprise customers only"
"Remove cancelled orders and refunds"

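In code, AND becomes & and OR becomes |, with each condition wrapped in parentheses. A sketch with invented state, year, and plan columns:

import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "NY", "CA"],
    "year": [2024, 2024, 2023],
    "plan": ["Premium", "Basic", "Enterprise"],
})

# AND: both conditions must hold
ca_2024 = df[(df["state"] == "CA") & (df["year"] == 2024)]
# OR: either condition may hold
premium_or_ent = df[(df["plan"] == "Premium") | (df["plan"] == "Enterprise")]
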
Include or exclude specific values:

"Keep only these categories: Electronics, Clothing, Home"
"Filter to customers in CA, NY, or TX"
"Exclude these product IDs: [list]"
"Remove test accounts: test@, demo@, internal@"

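Include/exclude lists usually map to a membership check. A sketch where the category and product_id columns and the ID list are placeholders:

import pandas as pd

df = pd.DataFrame({
    "category": ["Electronics", "Toys", "Home"],
    "product_id": ["P-1", "P-2", "P-3"],
})

keep = df[df["category"].isin(["Electronics", "Clothing", "Home"])]   # include these values
drop = df[~df["product_id"].isin(["P-2"])]                            # exclude these IDs
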
Match text patterns:

"Keep rows where email contains @gmail.com"
"Filter to product names starting with 'Pro'"
"Remove entries where notes contain 'TEST'"

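Pattern matching typically comes down to substring and prefix checks. A sketch with invented email, product_name, and notes columns:

import pandas as pd

df = pd.DataFrame({
    "email": ["a@gmail.com", "b@yahoo.com"],
    "product_name": ["Pro Widget", "Basic Widget"],
    "notes": ["ok", "TEST order"],
})

gmail = df[df["email"].str.contains("@gmail.com", regex=False, na=False)]   # contains
pro = df[df["product_name"].str.startswith("Pro", na=False)]                # starts with
no_test = df[~df["notes"].str.contains("TEST", regex=False, na=False)]      # does not contain
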
Missing values (nulls) are one of the most common data quality issues.

First, understand where your nulls are:

"How many null values are in each column?"
"Which columns have missing data?"
"Show rows where email is missing"
"Count how many orders have no customer ID"

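If you’d rather inspect the nulls directly, a quick sketch (the columns are illustrative):

import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C1", None, "C3"],
    "email": ["a@x.com", "b@x.com", None],
})

print(df.isna().sum())                     # null count per column
missing_email = df[df["email"].isna()]     # rows where email is missing
print(df["customer_id"].isna().sum())      # how many rows have no customer ID
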
When missing data makes rows unusable, remove them:

"Remove rows where revenue is null"
"Filter to rows with complete addresses"
"Keep only records where all required fields are filled"
"Remove entries missing customer_id"

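Dropping incomplete rows is usually a one-liner. A sketch assuming customer_id and revenue columns:

import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C1", None, "C3"],
    "revenue": [100.0, 250.0, None],
})

has_revenue = df.dropna(subset=["revenue"])               # remove rows where revenue is null
complete = df.dropna(subset=["customer_id", "revenue"])   # require every listed field
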
Replace nulls with appropriate values:

With a specific value:

"Replace null categories with 'Unknown'"
"Fill missing prices with zero"
"Set null dates to 1900-01-01"

With calculated values:

"Fill missing prices with the median price"
"Replace null ages with the average age"
"Fill missing revenue with zero"

Forward or backward fill (for time series):

"Forward-fill the status column"
"Fill missing values with the previous row's value"

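All three fill strategies are small operations. A sketch with placeholder category, price, and status columns:

import pandas as pd

df = pd.DataFrame({
    "category": ["Books", None, "Toys"],
    "price": [10.0, None, 8.0],
    "status": ["open", None, "closed"],
})

df["category"] = df["category"].fillna("Unknown")         # fill with a specific value
df["price"] = df["price"].fillna(df["price"].median())    # fill with a calculated value
df["status"] = df["status"].ffill()                       # forward-fill from the previous row
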
Duplicates can significantly skew your analysis.

Check if duplicates exist:

"Are there duplicate rows?"
"How many duplicate order IDs are there?"
"Compare count of unique customer IDs to total rows"
"Find rows where order_id appears more than once"

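A quick way to run these checks yourself, sketched with made-up order data:

import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 1, 2],
    "customer_id": ["C1", "C1", "C2"],
})

print(df.duplicated().sum())                         # exact duplicate rows
print(df["order_id"].duplicated().sum())             # repeated order IDs
print(df["customer_id"].nunique(), "unique IDs vs", len(df), "rows")
dupes = df[df["order_id"].duplicated(keep=False)]    # every row whose order_id repeats
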
Remove exact duplicate rows:

"Remove duplicate rows"
"Deduplicate the data"

Keep one row per unique key:

"Keep only the first order per customer"
"Deduplicate based on email address"
"Remove duplicates by transaction_id, keeping the most recent"

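Keeping one row per key usually means sorting, then dropping duplicates on that key. A sketch assuming a transaction_id and an updated_at timestamp:

import pandas as pd

df = pd.DataFrame({
    "transaction_id": ["T1", "T1", "T2"],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
})

exact = df.drop_duplicates()   # removes only fully identical rows

# keep the most recent row per transaction_id
latest = (df.sort_values("updated_at")
            .drop_duplicates(subset="transaction_id", keep="last"))
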
Sometimes apparent duplicates are legitimate:

"Show me the duplicate customer IDs and their details"

Review before removing—multiple orders per customer isn’t a duplicate problem!

Inconsistent formats cause grouping problems and join failures.

Fix common text issues:

"Trim whitespace from all text columns"
"Remove extra spaces from customer names"
"Clean up the product_name column"

Make text consistent:

"Convert all categories to uppercase"
"Standardize state names to title case"
"Make email addresses lowercase"

Consolidate variations:

"Replace 'CA', 'Calif', 'California' with 'California'"
"Standardize phone number formats"
"Normalize company names to official names"

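These fixes are mostly string operations plus a small lookup table. A sketch where the columns and the state mapping are invented:

import pandas as pd

df = pd.DataFrame({
    "name": ["  Acme Corp ", "acme corp"],
    "state": ["CA", "Calif"],
    "email": ["A@X.COM", "b@x.com"],
})

df["name"] = df["name"].str.strip()                      # trim whitespace
df["email"] = df["email"].str.lower()                    # consistent casing
state_map = {"CA": "California", "Calif": "California"}  # consolidate variations
df["state"] = df["state"].replace(state_map)
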
Make dates consistent:

"Convert all dates to YYYY-MM-DD format"
"Standardize the date column format"
"Parse dates from the text column"

Ensure columns have the right data type:

"Convert the amount column to numbers"
"Parse the date strings as dates"
"Convert zip codes to text (to preserve leading zeros)"

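Date and type fixes typically come down to explicit conversions. A sketch with made-up amount, order_date, and zip columns:

import pandas as pd

df = pd.DataFrame({
    "amount": ["1,200", "350"],
    "order_date": ["2024-03-15", "2024-04-02"],
    "zip": [2134, 90210],
})

df["amount"] = pd.to_numeric(df["amount"].str.replace(",", ""), errors="coerce")  # text to numbers
df["order_date"] = pd.to_datetime(df["order_date"])   # text to real dates (YYYY-MM-DD)
df["zip"] = df["zip"].astype(str).str.zfill(5)        # text, preserving leading zeros
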
A typical cleaning workflow before analysis:

Step 1: "Remove rows where customer_id is null"
Step 2: "Deduplicate by order_id"
Step 3: "Standardize state names to abbreviations"
Step 4: "Filter to completed orders from 2024"
Step 5: Now analyze...
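
Chained together, that workflow might look roughly like this; the sample data and the state-abbreviation mapping are invented:

import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C1", None, "C2", "C2"],
    "order_id": [1, 2, 3, 3],
    "state": ["California", "New York", "California", "California"],
    "status": ["Complete", "Complete", "Complete", "Pending"],
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-01-10", "2024-01-10"]),
})
state_abbrev = {"California": "CA", "New York": "NY"}   # hypothetical mapping

clean = (df.dropna(subset=["customer_id"])              # step 1: drop rows missing the key
           .drop_duplicates(subset="order_id")          # step 2: one row per order
           .assign(state=lambda d: d["state"].replace(state_abbrev)))   # step 3
clean = clean[(clean["status"] == "Complete") & (clean["order_date"].dt.year == 2024)]  # step 4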

Standardize keys before joining datasets:

Step 1: "Trim whitespace and convert customer_id to uppercase in both datasets"
Step 2: "Join orders to customers on customer_id"

Without cleaning, “ABC123” won’t match “ abc123 ”.
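
The fix is to normalize the key on both sides before merging. A sketch with two tiny invented tables:

import pandas as pd

orders = pd.DataFrame({"customer_id": [" abc123 "], "amount": [50.0]})
customers = pd.DataFrame({"customer_id": ["ABC123"], "name": ["Acme"]})

for frame in (orders, customers):
    frame["customer_id"] = frame["customer_id"].str.strip().str.upper()

joined = orders.merge(customers, on="customer_id", how="left")   # now the keys line up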

Remove noise before summarizing:

Step 1: "Remove test transactions (where notes contain 'TEST')"
Step 2: "Filter out refunds and cancellations"
Step 3: "Calculate total revenue by month"
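
A rough sketch of that filter-then-summarize flow, with invented columns:

import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "amount": [100.0, -40.0, 250.0],
    "type": ["sale", "refund", "sale"],
    "notes": ["", "refund issued", "TEST order"],
})

sales = df[(df["type"] == "sale") & ~df["notes"].str.contains("TEST", na=False)]   # drop noise
monthly = sales.groupby(sales["order_date"].dt.to_period("M"))["amount"].sum()     # revenue by month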

Sometimes you discover issues as you go:

"Show me unique values in the category column"
→ Discover 'Electronics' and 'Electronic' both exist
"Replace 'Electronic' with 'Electronics'"
→ Check again
"Show me unique values in the category column"
→ Now it's clean

Extreme values can skew analysis:

"Show me orders over $10,000—are these legitimate?"
"Find the top 10 highest transactions"
"Remove orders where amount is more than 3 standard deviations from mean"

Consider whether outliers should be:

  • Kept — they’re legitimate large orders
  • Removed — they’re data errors
  • Capped — set to a maximum reasonable value
  • Analyzed separately — understand them before including
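
If you decide the extremes are errors and should be removed, here is a sketch of the inspect-then-trim approach (the amounts are invented):

import pandas as pd

df = pd.DataFrame({"amount": [120.0, 95.0, 130.0, 88.0, 25000.0]})

print(df.nlargest(10, "amount"))   # eyeball the biggest transactions first

mean, std = df["amount"].mean(), df["amount"].std()
trimmed = df[(df["amount"] - mean).abs() <= 3 * std]   # keep rows within 3 standard deviations

On small samples a single extreme value inflates the standard deviation, so inspect the results before trusting this rule.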

Null and zero mean different things:

  • Null: Value is unknown or not recorded
  • Zero: Value is known to be zero
"Replace null revenue with zero for products with no sales"
"Keep nulls for revenue where we don't have data yet"

Text columns can contain both empty strings and nulls:

"Replace empty strings with null in the notes column"
"Find rows where name is empty or null"

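One common trick is to convert empty strings to nulls so a single check finds both. A sketch with an invented notes column:

import pandas as pd

df = pd.DataFrame({"notes": ["follow up", "", None]})

df["notes"] = df["notes"].replace("", pd.NA)   # treat empty strings as missing
blank_or_null = df[df["notes"].isna()]         # one check now finds both cases
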
Your filter might be too aggressive:

"How many rows are removed by each filter condition?"

Check each condition separately to find the culprit.

You might be deduplicating on the wrong columns:

"Show me what columns differ between duplicate order_ids"

Maybe the same order_id has multiple line items—that’s not a duplicate!

“Join returned fewer rows than expected”


Key values might not match due to formatting:

"Show unique customer_ids from each table"
"Check for whitespace or case differences in the keys"

“Aggregations include unexpected nulls”


Some rows might have nulls you didn’t notice:

"How many null values are in the column I'm summing?"

Decide whether to filter them out or fill them first.

  1. Explore before cleaning — Understand what you’re fixing
  2. Clean incrementally — One issue at a time, verify each fix
  3. Document your changes — Note what you cleaned and why
  4. Keep original data — Work on copies, preserve the source
  5. Verify after cleaning — Check that your cleaning worked as expected
  6. Be consistent — Use the same cleaning logic across similar datasets

With clean data, you’re ready for analysis.