Filtering and Cleaning Data

Real-world data is messy. Before you can analyze effectively, you often need to filter out irrelevant records, handle missing values, and standardize inconsistent formats. This guide covers the essential data cleaning techniques you’ll use in almost every analysis.

The principle is simple: garbage in, garbage out. Analysis on dirty data produces misleading results.

Common problems dirty data causes:

  • Duplicate rows inflate your totals
  • Null values break calculations or get silently excluded
  • Inconsistent formats (“CA” vs “California”) split what should be one category
  • Irrelevant records add noise to your analysis

A few minutes of cleaning saves hours of debugging wrong results.

Filtering removes rows you don’t need, focusing your analysis on what matters.

Filter by a single condition:

"Filter to 2024 transactions only"
"Show only active customers"
"Keep rows where status is 'Complete'"
"Remove rows where amount is zero"

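Under the hood, prompts like these usually become simple row filters. If you want to sanity-check the logic yourself, here’s a minimal pandas sketch; the DataFrame and its status and amount columns are made up for illustration:

import pandas as pd

# Hypothetical sample data standing in for your real dataset
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "status": ["Complete", "Pending", "Complete", "Cancelled"],
    "amount": [120.0, 0.0, 45.5, 80.0],
})

complete = df[df["status"] == "Complete"]   # keep rows where status is 'Complete'
nonzero = df[df["amount"] != 0]             # remove rows where amount is zero
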
Use comparisons for numeric and date filtering:

"Filter to orders over $100"
"Keep transactions from the last 30 days"
"Show customers who joined before 2023"
"Remove products with negative inventory"

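Comparisons translate just as directly. A small sketch, again with invented data, mirroring the “over $100” and “last 30 days” prompts:

import pandas as pd

df = pd.DataFrame({
    "amount": [250.0, 40.0, 130.0],
    "order_date": pd.to_datetime(["2024-01-05", "2024-06-20", "2025-01-30"]),
})

over_100 = df[df["amount"] > 100]                       # orders over $100
cutoff = pd.Timestamp.today() - pd.Timedelta(days=30)
recent = df[df["order_date"] >= cutoff]                 # transactions from the last 30 days
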
Combine conditions with AND/OR logic:

"Filter to California customers from 2024"
"Keep orders over $100 from active accounts"
"Show Premium or Enterprise customers only"
"Remove cancelled orders and refunds"

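In code, AND becomes & and OR becomes |, with each condition wrapped in parentheses. A sketch with invented state, year, and plan columns:

import pandas as pd

df = pd.DataFrame({
    "state": ["CA", "NY", "CA"],
    "year": [2024, 2024, 2023],
    "plan": ["Premium", "Basic", "Enterprise"],
})

# AND: both conditions must hold
ca_2024 = df[(df["state"] == "CA") & (df["year"] == 2024)]
# OR: either condition may hold
premium_or_ent = df[(df["plan"] == "Premium") | (df["plan"] == "Enterprise")]
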
Include or exclude specific values:

"Keep only these categories: Electronics, Clothing, Home"
"Filter to customers in CA, NY, or TX"
"Exclude these product IDs: [list]"
"Remove test accounts: test@, demo@, internal@"

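Include/exclude lists usually map to a membership check. A sketch where the category and product_id columns and the ID list are placeholders:

import pandas as pd

df = pd.DataFrame({
    "category": ["Electronics", "Toys", "Home"],
    "product_id": ["P-1", "P-2", "P-3"],
})

keep = df[df["category"].isin(["Electronics", "Clothing", "Home"])]   # include these values
drop = df[~df["product_id"].isin(["P-2"])]                            # exclude these IDs
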
Match text patterns:

"Keep rows where email contains @gmail.com"
"Filter to product names starting with 'Pro'"
"Remove entries where notes contain 'TEST'"

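Pattern matching typically comes down to substring and prefix checks. A sketch with invented email, product_name, and notes columns:

import pandas as pd

df = pd.DataFrame({
    "email": ["a@gmail.com", "b@yahoo.com"],
    "product_name": ["Pro Widget", "Basic Widget"],
    "notes": ["ok", "TEST order"],
})

gmail = df[df["email"].str.contains("@gmail.com", regex=False, na=False)]   # contains
pro = df[df["product_name"].str.startswith("Pro", na=False)]                # starts with
no_test = df[~df["notes"].str.contains("TEST", regex=False, na=False)]      # does not contain
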
Missing values (nulls) are one of the most common data quality issues.

First, understand where your nulls are:

"How many null values are in each column?"
"Which columns have missing data?"
"Show rows where email is missing"
"Count how many orders have no customer ID"

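If you’d rather inspect the nulls directly, a quick sketch (the columns are illustrative):

import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C1", None, "C3"],
    "email": ["a@x.com", "b@x.com", None],
})

print(df.isna().sum())                     # null count per column
missing_email = df[df["email"].isna()]     # rows where email is missing
print(df["customer_id"].isna().sum())      # how many rows have no customer ID
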
When missing data makes rows unusable, remove them:

"Remove rows where revenue is null"
"Filter to rows with complete addresses"
"Keep only records where all required fields are filled"
"Remove entries missing customer_id"

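Dropping incomplete rows is usually a one-liner. A sketch assuming customer_id and revenue columns:

import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C1", None, "C3"],
    "revenue": [100.0, 250.0, None],
})

has_revenue = df.dropna(subset=["revenue"])               # remove rows where revenue is null
complete = df.dropna(subset=["customer_id", "revenue"])   # require every listed field
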
Replace nulls with appropriate values:

With a specific value:

"Replace null categories with 'Unknown'"
"Fill missing prices with zero"
"Set null dates to 1900-01-01"

With calculated values:

"Fill missing prices with the median price"
"Replace null ages with the average age"
"Fill missing revenue with zero"

Forward or backward fill (for time series):

"Forward-fill the status column"
"Fill missing values with the previous row's value"

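All three fill strategies are small operations. A sketch with placeholder category, price, and status columns:

import pandas as pd

df = pd.DataFrame({
    "category": ["Books", None, "Toys"],
    "price": [10.0, None, 8.0],
    "status": ["open", None, "closed"],
})

df["category"] = df["category"].fillna("Unknown")         # fill with a specific value
df["price"] = df["price"].fillna(df["price"].median())    # fill with a calculated value
df["status"] = df["status"].ffill()                       # forward-fill from the previous row
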
Duplicates can significantly skew your analysis.

Check if duplicates exist:

"Are there duplicate rows?"
"How many duplicate order IDs are there?"
"Compare count of unique customer IDs to total rows"
"Find rows where order_id appears more than once"

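A quick way to run these checks yourself, sketched with made-up order data:

import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 1, 2],
    "customer_id": ["C1", "C1", "C2"],
})

print(df.duplicated().sum())                         # exact duplicate rows
print(df["order_id"].duplicated().sum())             # repeated order IDs
print(df["customer_id"].nunique(), "unique IDs vs", len(df), "rows")
dupes = df[df["order_id"].duplicated(keep=False)]    # every row whose order_id repeats
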
Remove exact duplicate rows:

"Remove duplicate rows"
"Deduplicate the data"

Keep one row per unique key:

"Keep only the first order per customer"
"Deduplicate based on email address"
"Remove duplicates by transaction_id, keeping the most recent"

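Keeping one row per key usually means sorting, then dropping duplicates on that key. A sketch assuming a transaction_id and an updated_at timestamp:

import pandas as pd

df = pd.DataFrame({
    "transaction_id": ["T1", "T1", "T2"],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
})

exact = df.drop_duplicates()   # removes only fully identical rows

# keep the most recent row per transaction_id
latest = (df.sort_values("updated_at")
            .drop_duplicates(subset="transaction_id", keep="last"))
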
Sometimes apparent duplicates are legitimate:

"Show me the duplicate customer IDs and their details"

Review before removing—multiple orders per customer isn’t a duplicate problem!

Inconsistent formats cause grouping problems and join failures.

Fix common text issues:

"Trim whitespace from all text columns"
"Remove extra spaces from customer names"
"Clean up the product_name column"

Make text consistent:

"Convert all categories to uppercase"
"Standardize state names to title case"
"Make email addresses lowercase"

Consolidate variations:

"Replace 'CA', 'Calif', 'California' with 'California'"
"Standardize phone number formats"
"Normalize company names to official names"

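These fixes are mostly string operations plus a small lookup table. A sketch where the columns and the state mapping are invented:

import pandas as pd

df = pd.DataFrame({
    "name": ["  Acme Corp ", "acme corp"],
    "state": ["CA", "Calif"],
    "email": ["A@X.COM", "b@x.com"],
})

df["name"] = df["name"].str.strip()                      # trim whitespace
df["email"] = df["email"].str.lower()                    # consistent casing
state_map = {"CA": "California", "Calif": "California"}  # consolidate variations
df["state"] = df["state"].replace(state_map)
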
Make dates consistent:

"Convert all dates to YYYY-MM-DD format"
"Standardize the date column format"
"Parse dates from the text column"

Ensure columns have the right data type:

"Convert the amount column to numbers"
"Parse the date strings as dates"
"Convert zip codes to text (to preserve leading zeros)"

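Date and type fixes typically come down to explicit conversions. A sketch with made-up amount, order_date, and zip columns:

import pandas as pd

df = pd.DataFrame({
    "amount": ["1,200", "350"],
    "order_date": ["2024-03-15", "2024-04-02"],
    "zip": [2134, 90210],
})

df["amount"] = pd.to_numeric(df["amount"].str.replace(",", ""), errors="coerce")  # text to numbers
df["order_date"] = pd.to_datetime(df["order_date"])   # text to real dates (YYYY-MM-DD)
df["zip"] = df["zip"].astype(str).str.zfill(5)        # text, preserving leading zeros
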
A typical cleaning workflow before analysis:

Step 1: "Remove rows where customer_id is null"
Step 2: "Deduplicate by order_id"
Step 3: "Standardize state names to abbreviations"
Step 4: "Filter to completed orders from 2024"
Step 5: Now analyze...
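
Chained together, that workflow might look roughly like this; the sample data and the state-abbreviation mapping are invented:

import pandas as pd

df = pd.DataFrame({
    "customer_id": ["C1", None, "C2", "C2"],
    "order_id": [1, 2, 3, 3],
    "state": ["California", "New York", "California", "California"],
    "status": ["Complete", "Complete", "Complete", "Pending"],
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-01-10", "2024-01-10"]),
})
state_abbrev = {"California": "CA", "New York": "NY"}   # hypothetical mapping

clean = (df.dropna(subset=["customer_id"])              # step 1: drop rows missing the key
           .drop_duplicates(subset="order_id")          # step 2: one row per order
           .assign(state=lambda d: d["state"].replace(state_abbrev)))   # step 3
clean = clean[(clean["status"] == "Complete") & (clean["order_date"].dt.year == 2024)]  # step 4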

Standardize keys before joining datasets:

Step 1: "Trim whitespace and convert customer_id to uppercase in both datasets"
Step 2: "Join orders to customers on customer_id"

Without cleaning, “ABC123” won’t match “ abc123 ”.
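
The fix is to normalize the key on both sides before merging. A sketch with two tiny invented tables:

import pandas as pd

orders = pd.DataFrame({"customer_id": [" abc123 "], "amount": [50.0]})
customers = pd.DataFrame({"customer_id": ["ABC123"], "name": ["Acme"]})

for frame in (orders, customers):
    frame["customer_id"] = frame["customer_id"].str.strip().str.upper()

joined = orders.merge(customers, on="customer_id", how="left")   # now the keys line up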

Remove noise before summarizing:

Step 1: "Remove test transactions (where notes contain 'TEST')"
Step 2: "Filter out refunds and cancellations"
Step 3: "Calculate total revenue by month"
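
A rough sketch of that filter-then-summarize flow, with invented columns:

import pandas as pd

df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "amount": [100.0, -40.0, 250.0],
    "type": ["sale", "refund", "sale"],
    "notes": ["", "refund issued", "TEST order"],
})

sales = df[(df["type"] == "sale") & ~df["notes"].str.contains("TEST", na=False)]   # drop noise
monthly = sales.groupby(sales["order_date"].dt.to_period("M"))["amount"].sum()     # revenue by month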

Sometimes you discover issues as you go:

"Show me unique values in the category column"
→ Discover 'Electronics' and 'Electronic' both exist
"Replace 'Electronic' with 'Electronics'"
→ Check again
"Show me unique values in the category column"
→ Now it's clean

Extreme values can skew analysis:

"Show me orders over $10,000—are these legitimate?"
"Find the top 10 highest transactions"
"Remove orders where amount is more than 3 standard deviations from mean"

Consider whether outliers should be:

  • Kept — they’re legitimate large orders
  • Removed — they’re data errors
  • Capped — set to a maximum reasonable value
  • Analyzed separately — understand them before including
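
If you decide the extremes are errors and should be removed, here is a sketch of the inspect-then-trim approach (the amounts are invented):

import pandas as pd

df = pd.DataFrame({"amount": [120.0, 95.0, 130.0, 88.0, 25000.0]})

print(df.nlargest(10, "amount"))   # eyeball the biggest transactions first

mean, std = df["amount"].mean(), df["amount"].std()
trimmed = df[(df["amount"] - mean).abs() <= 3 * std]   # keep rows within 3 standard deviations

On small samples a single extreme value inflates the standard deviation, so inspect the results before trusting this rule.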

Null and zero mean different things:

  • Null: Value is unknown or not recorded
  • Zero: Value is known to be zero
"Replace null revenue with zero for products with no sales"
"Keep nulls for revenue where we don't have data yet"

Text columns can contain both empty strings and nulls:

"Replace empty strings with null in the notes column"
"Find rows where name is empty or null"

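One common trick is to convert empty strings to nulls so a single check finds both. A sketch with an invented notes column:

import pandas as pd

df = pd.DataFrame({"notes": ["follow up", "", None]})

df["notes"] = df["notes"].replace("", pd.NA)   # treat empty strings as missing
blank_or_null = df[df["notes"].isna()]         # one check now finds both cases
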
Your filter might be too aggressive:

"How many rows are removed by each filter condition?"

Check each condition separately to find the culprit.

You might be deduplicating on the wrong columns:

"Show me what columns differ between duplicate order_ids"

Maybe the same order_id has multiple line items—that’s not a duplicate!

“Join returned fewer rows than expected”


Key values might not match due to formatting:

"Show unique customer_ids from each table"
"Check for whitespace or case differences in the keys"

“Aggregations include unexpected nulls”


Some rows might have nulls you didn’t notice:

"How many null values are in the column I'm summing?"

Decide whether to filter them out or fill them first.

  1. Explore before cleaning — Understand what you’re fixing
  2. Clean incrementally — One issue at a time, verify each fix
  3. Document your changes — Note what you cleaned and why
  4. Keep original data — Work on copies, preserve the source
  5. Verify after cleaning — Check that your cleaning worked as expected
  6. Be consistent — Use the same cleaning logic across similar datasets

With clean data, you’re ready for analysis.