Filtering and Cleaning Data
Real-world data is messy. Before you can analyze effectively, you often need to filter out irrelevant records, handle missing values, and standardize inconsistent formats. This guide covers the essential data cleaning techniques you’ll use in almost every analysis.
Why Clean Data Matters
The principle is simple: garbage in, garbage out. Analysis on dirty data produces misleading results.
Common problems dirty data causes:
- Duplicate rows inflate your totals
- Null values break calculations or get silently excluded
- Inconsistent formats (“CA” vs “California”) split what should be one category
- Irrelevant records add noise to your analysis
A few minutes of cleaning saves hours of debugging wrong results.
Filtering Rows
Filtering removes rows you don’t need, focusing your analysis on what matters.
Basic Filters
Filter by a single condition:
"Filter to 2024 transactions only""Show only active customers""Keep rows where status is 'Complete'""Remove rows where amount is zero"Comparison Operators
Comparison Operators
Use comparisons for numeric and date filtering:
"Filter to orders over $100""Keep transactions from the last 30 days""Show customers who joined before 2023""Remove products with negative inventory"Multiple Conditions
Multiple Conditions
Combine conditions with AND/OR logic:
"Filter to California customers from 2024""Keep orders over $100 from active accounts""Show Premium or Enterprise customers only""Remove cancelled orders and refunds"Filter by List
Filter by List
Include or exclude specific values:
"Keep only these categories: Electronics, Clothing, Home""Filter to customers in CA, NY, or TX""Exclude these product IDs: [list]""Remove test accounts: test@, demo@, internal@"Filter by Pattern
Filter by Pattern
Match text patterns:
"Keep rows where email contains @gmail.com""Filter to product names starting with 'Pro'""Remove entries where notes contain 'TEST'"Handling Missing Values
Handling Missing Values
Missing values (nulls) are one of the most common data quality issues.
Finding Nulls
First, understand where your nulls are:
"How many null values are in each column?""Which columns have missing data?""Show rows where email is missing""Count how many orders have no customer ID"Removing Rows with Nulls
Removing Rows with Nulls
When missing data makes rows unusable:
"Remove rows where revenue is null""Filter to rows with complete addresses""Keep only records where all required fields are filled""Remove entries missing customer_id"Filling Null Values
Filling Null Values
Replace nulls with appropriate values:
With a specific value:
"Replace null categories with 'Unknown'""Fill missing prices with zero""Set null dates to 1900-01-01"With calculated values:
"Fill missing prices with the median price""Replace null ages with the average age""Fill missing revenue with zero"Forward or backward fill (for time series):
"Forward-fill the status column""Fill missing values with the previous row's value"Removing Duplicates
Removing Duplicates
Duplicates can significantly skew your analysis.
Finding Duplicates
Check if duplicates exist:
"Are there duplicate rows?""How many duplicate order IDs are there?""Compare count of unique customer IDs to total rows""Find rows where order_id appears more than once"Removing Duplicates
Removing Duplicates
Remove exact duplicate rows:
"Remove duplicate rows""Deduplicate the data"Deduplicating by Key
Deduplicating by Key
Keep one row per unique key:
"Keep only the first order per customer""Deduplicate based on email address""Remove duplicates by transaction_id, keeping the most recent"Handling Intentional Duplicates
Handling Intentional Duplicates
Sometimes apparent duplicates are legitimate:
"Show me the duplicate customer IDs and their details"Review before removing—multiple orders per customer isn’t a duplicate problem!
Standardizing Data
Inconsistent formats cause grouping problems and join failures.
Text Cleanup
Fix common text issues:
"Trim whitespace from all text columns""Remove extra spaces from customer names""Clean up the product_name column"Case Standardization
Case Standardization
Make text consistent:
"Convert all categories to uppercase""Standardize state names to title case""Make email addresses lowercase"Value Standardization
Value Standardization
Consolidate variations:
"Replace 'CA', 'Calif', 'California' with 'California'""Standardize phone number formats""Normalize company names to official names"Date Formatting
Date Formatting
Make dates consistent:
"Convert all dates to YYYY-MM-DD format""Standardize the date column format""Parse dates from the text column"Type Conversion
Type Conversion
Ensure columns have the right data type:
"Convert the amount column to numbers""Parse the date strings as dates""Convert zip codes to text (to preserve leading zeros)"Common Cleaning Patterns
Common Cleaning Patterns
Pattern 1: Clean Then Analyze
A typical cleaning workflow before analysis:
Step 1: "Remove rows where customer_id is null"Step 2: "Deduplicate by order_id"Step 3: "Standardize state names to abbreviations"Step 4: "Filter to completed orders from 2024"Step 5: Now analyze...Pattern 2: Clean Before Joining
Pattern 2: Clean Before Joining
Standardize keys before joining datasets:
Step 1: "Trim whitespace and convert customer_id to uppercase in both datasets"Step 2: "Join orders to customers on customer_id"Without cleaning, “ABC123” won’t match ” abc123 ”.
Pattern 3: Clean Before Aggregating
Remove noise before summarizing:
Step 1: "Remove test transactions (where notes contain 'TEST')"Step 2: "Filter out refunds and cancellations"Step 3: "Calculate total revenue by month"Pattern 4: Iterative Cleaning
Pattern 4: Iterative Cleaning
Sometimes you discover issues as you go:
"Show me unique values in the category column"→ Discover 'Electronics' and 'Electronic' both exist
"Replace 'Electronic' with 'Electronics'"→ Check again
"Show me unique values in the category column"→ Now it's cleanHandling Edge Cases
Handling Edge Cases
Outliers
Extreme values can skew analysis:
"Show me orders over $10,000—are these legitimate?""Find the top 10 highest transactions""Remove orders where amount is more than 3 standard deviations from mean"Consider whether outliers should be:
- Kept — they’re legitimate large orders
- Removed — they’re data errors
- Capped — set to a maximum reasonable value
- Analyzed separately — understand them before including
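For the "remove" and "cap" options above, here is a rough pandas sketch using the mean plus three standard deviations as the cutoff (the column name and the 3-sigma rule are assumptions; pick a threshold that fits your data):

```python
mean, std = df["amount"].mean(), df["amount"].std()
threshold = mean + 3 * std

without_outliers = df[df["amount"] <= threshold]          # remove extreme orders
df["amount_capped"] = df["amount"].clip(upper=threshold)  # or cap them instead
```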
Zero vs Null
These mean different things:
- Null: Value is unknown or not recorded
- Zero: Value is known to be zero
"Replace null revenue with zero for products with no sales""Keep nulls for revenue where we don't have data yet"Empty Strings vs Null
Empty Strings vs Null
Text columns can have both:
"Replace empty strings with null in the notes column""Find rows where name is empty or null"Troubleshooting
Troubleshooting
“Too many rows were removed”
Your filter might be too aggressive:
"How many rows are removed by each filter condition?"Check each condition separately to find the culprit.
“I still see duplicates”
You might be deduplicating on the wrong columns:
"Show me what columns differ between duplicate order_ids"Maybe the same order_id has multiple line items—that’s not a duplicate!
“Join returned fewer rows than expected”
Key values might not match due to formatting:
"Show unique customer_ids from each table""Check for whitespace or case differences in the keys"“Aggregations include unexpected nulls”
“Aggregations include unexpected nulls”
Some rows might have nulls you didn’t notice:
"How many null values are in the column I'm summing?"Decide whether to filter them out or fill them first.
Cleaning Best Practices
- Explore before cleaning — Understand what you’re fixing
- Clean incrementally — One issue at a time, verify each fix
- Document your changes — Note what you cleaned and why
- Keep original data — Work on copies, preserve the source
- Verify after cleaning — Check that your cleaning worked as expected
- Be consistent — Use the same cleaning logic across similar datasets
Next Steps
With clean data, you’re ready for analysis:
- Understanding Your Data — Explore before cleaning
- How to Connect Your Data — Join clean datasets
- Aggregating Data — Summarize clean data
- Working with Dates — Handle time data