Data Cleaning
Let AI handle data cleaning tasks—from removing duplicates to standardizing formats. Simply describe what you want cleaned, and Querri’s cleaner tool takes care of the rest.
Using AI for Data Cleaning
Querri’s AI assistant can clean data from any source: uploaded files, connected databases, cloud storage, or business applications.
Simple Cleaning Requests
Ask in natural language:
"Clean this customer data""Remove duplicates from the orders table""Fix formatting issues in the sales spreadsheet""Standardize phone numbers in contacts.csv"The AI automatically:
- Detects common data quality issues
- Applies appropriate cleaning operations
- Shows what was cleaned
- Returns the cleaned dataset
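Querri’s internal implementation isn’t shown here, but the steps above roughly correspond to a small cleaning pass like the following pandas sketch. The function name, the sample data, and the specific operations are illustrative assumptions, not the tool’s actual code:

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    """Illustrative only: apply common cleaning steps and report what changed."""
    df = df.copy()
    report = {"rows_before": len(df)}
    # Trim whitespace in text columns (a common formatting issue)
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    # Drop exact duplicate rows and record how many were removed
    df = df.drop_duplicates()
    report["duplicates_removed"] = report["rows_before"] - len(df)
    report["rows_after"] = len(df)
    return df, report

raw = pd.DataFrame({
    "name": [" Ada ", "Ada", "Bob"],
    "email": ["ada@x.com", "ada@x.com", "bob@x.com"],
})
cleaned, report = basic_clean(raw)
```

After trimming, the first two rows become identical and one is removed, so the report shows what was cleaned alongside the cleaned dataset.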
Specific Cleaning Instructions
Be specific about what to clean:
"Remove rows with missing email addresses""Fill blank values in the 'city' column with 'Unknown'""Convert all state names to two-letter codes""Standardize date format to YYYY-MM-DD""Trim whitespace from all text columns"The more specific you are, the more targeted the cleaning.
Common Cleaning Operations
Removing Duplicates
Identify and remove duplicate records:
Simple duplicate removal:
"Remove duplicate rows from customers.csv""Find duplicate email addresses and keep only the first""Deduplicate orders based on order_id"Advanced deduplication:
"Remove duplicates based on email and phone number""Keep the most recent record for each customer_id""Deduplicate by name, but keep different addresses separate"What the cleaner does:
- Identifies duplicate rows based on specified columns
- Keeps one instance (first, last, or most complete)
- Reports how many duplicates were removed
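For readers curious what these options correspond to in code, here is a minimal pandas sketch of simple and advanced deduplication. The sample data is hypothetical; the `subset` and `keep` arguments mirror the "based on" and "keep the most recent" phrasing above:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "updated":  ["2024-01-01", "2024-02-01", "2024-01-05", "2024-01-07", "2024-03-01"],
})

# "Deduplicate orders based on order_id" — keep the first occurrence
first_kept = orders.drop_duplicates(subset="order_id", keep="first")

# "Keep the most recent record for each order_id": sort, then keep the last
latest = (orders.sort_values("updated")
                .drop_duplicates(subset="order_id", keep="last"))

removed = len(orders) - len(first_kept)  # number of duplicates removed
```

Reporting `removed` alongside the result matches the summary the cleaner returns.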
Example result:
Found 47 duplicate customers
Removed 47 rows
Kept 1,203 unique customers

Handling Missing Values
Deal with null, blank, or missing data:
Removing missing data:
"Remove rows where 'email' is blank""Delete records with null values in critical fields""Drop rows with more than 3 missing columns"Filling missing data:
"Fill missing prices with the average price""Replace blank cities with 'Not specified'""Fill missing dates with the previous row's date""Use 0 for null quantity values"Imputation strategies:
- Mean/Average: For numeric data
- Median: For numeric data with outliers
- Mode: For categorical data
- Forward fill: Copy from previous row
- Backward fill: Copy from next row
- Default value: Specify a constant
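The imputation strategies above map directly onto standard pandas fill operations. A short sketch with hypothetical data (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "price":    [10.0, None, 30.0, None],
    "city":     ["Boston", None, "Boston", None],
    "quantity": [None, 2, None, 4],
})

df["price"] = df["price"].fillna(df["price"].mean())        # mean: numeric data
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])   # mode: categorical data
df["quantity"] = df["quantity"].fillna(0)                   # default constant value
# Forward/backward fill copy from the previous/next row:
# df["date"] = df["date"].ffill()   or   df["date"].bfill()
```

Median imputation works the same way with `df["price"].median()`, which is less sensitive to outliers than the mean.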
Example:
"For each product, fill missing stock quantities with the average stock for that product category"Standardizing Formats
Convert data to consistent formats:
Date standardization:
"Convert all dates to YYYY-MM-DD format""Standardize date_created to ISO 8601""Parse dates in 'MM/DD/YYYY' format to date objects"Phone number formatting:
"Format phone numbers as (123) 456-7890""Standardize phone numbers to E.164 format (+1234567890)""Remove special characters from phone numbers"Text standardization:
"Convert all emails to lowercase""Capitalize first letter of names""Standardize state abbreviations (California → CA)""Convert 'Y'/'N' to 'Yes'/'No'"Numeric formatting:
"Remove dollar signs and commas from prices""Convert currency to numeric values""Round decimals to 2 places""Format numbers with thousand separators"Trimming and Cleaning Text
Remove unwanted characters and whitespace:
Whitespace:
"Trim leading and trailing spaces from all columns""Remove extra spaces between words""Clean up whitespace in address fields"Special characters:
"Remove non-alphanumeric characters from product codes""Strip HTML tags from descriptions""Remove emoji from customer comments"Case normalization:
"Convert all text to lowercase""Uppercase all product SKUs""Title case for customer names"Example:
Before: " ACME Corp "After: "ACME Corp"
Before: "contact@example.COM"After: "contact@example.com"
Before: "$1,234.56"After: 1234.56Fixing Data Types
Ensure columns have the correct data type:
Type conversion:
"Convert 'price' column to decimal""Change 'order_date' to date type""Make 'quantity' an integer""Convert 'is_active' to boolean"Detecting issues:
"Find non-numeric values in the 'revenue' column""Show rows where 'date' isn't a valid date""Identify text in numeric fields"Coercion strategies:
"Convert to number, use null for invalid values""Parse dates, mark invalid dates as null""Cast 'Yes'/'No' to true/false boolean"Splitting and Combining Columns
Restructure data by splitting or merging columns:
Splitting columns:
"Split 'full_name' into 'first_name' and 'last_name'""Separate 'address' into street, city, state, zip""Extract area code from phone number""Split 'datetime' into separate date and time columns"Combining columns:
"Combine 'first_name' and 'last_name' into 'full_name'""Merge city and state into 'location'""Concatenate year, month, day into a date string"Example:
Before: full_name = "John Doe"After: first_name = "John", last_name = "Doe"
Before: street = "123 Main St", city = "Boston"After: address = "123 Main St, Boston"Cleaner Tool Capabilities
Querri’s cleaner_tool.py performs sophisticated data cleaning:
Automatic Issue Detection
The cleaner automatically scans for:
- Duplicate records
- Missing or null values
- Inconsistent formatting
- Invalid data types
- Outliers and anomalies
- Encoding issues
- Whitespace problems
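A scan like this can be sketched in a few lines of pandas. The `quality_scan` function and sample data below are illustrative assumptions, not the actual cleaner_tool.py implementation:

```python
import pandas as pd

def quality_scan(df: pd.DataFrame) -> dict:
    """Illustrative sketch: summarize common data quality issues."""
    return {
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Missing values per column
        "null_counts": df.isna().sum().to_dict(),
        # Text cells with leading/trailing whitespace
        "whitespace_cells": int(
            sum(df[c].astype(str).str.strip().ne(df[c].astype(str)).sum()
                for c in df.select_dtypes(include="object"))
        ),
    }

sample = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", None],
    "name":  [" Ada", " Ada", "Bob"],
})
report = quality_scan(sample)
```

A production scanner would add checks for invalid types, outliers, and encoding problems on top of these basics.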
Example request:
"Analyze the data quality in sales.csv"Response:
Data Quality Report:
- 342 duplicate rows found
- 'email' column has 15 null values
- 'phone' has inconsistent formatting
- 'date' column has 3 invalid dates
- 'price' contains 5 non-numeric values

Batch Cleaning
Apply multiple cleaning operations at once:
"Clean the customer data by:1. Removing duplicates based on email2. Filling missing phone numbers with 'N/A'3. Standardizing all state names to abbreviations4. Converting emails to lowercase5. Trimming whitespace from all fields"The cleaner executes all operations in sequence and shows a summary.
Column-Specific Cleaning
Target specific columns for cleaning:
"In the 'email' column, remove invalid email addresses""For 'price', remove dollar signs and convert to float""Clean the 'description' field by removing HTML tags"Conditional Cleaning
Apply cleaning based on conditions:
"For rows where country is 'USA', format phone as (123) 456-7890""Fill missing 'discount' with 0 only for retail customers""Standardize dates to ISO format, but only for dates after 2020"Data Validation
Validate data quality after cleaning or identify issues before cleaning:
Validation Rules
Define what “clean” data should look like:
Format validation:
"Check if all emails are valid email addresses""Verify phone numbers match the pattern (###) ###-####""Ensure all dates are in YYYY-MM-DD format"Range validation:
"Flag prices less than 0 or greater than 10,000""Find ages outside the range 0-120""Identify quantities that aren't positive integers"Completeness validation:
"Find rows missing required fields (email, name, address)""Check for null values in critical columns""Report records with incomplete contact information"Uniqueness validation:
"Verify all customer_ids are unique""Check for duplicate email addresses""Ensure no order_ids are repeated"Validation Reports
Get comprehensive data quality reports:
"Run a data quality check on the customers table"Report includes:
- Total rows and columns
- Missing value counts per column
- Duplicate row count
- Data type mismatches
- Formatting inconsistencies
- Suggested cleaning operations
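The kinds of checks behind such a report (format, range, and uniqueness validation, as described under Validation Rules) can be sketched as pandas boolean masks. Column names, the simple email pattern, and the thresholds are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["ada@x.com", "not-an-email", "bob@x.com"],
    "age":   [34, 250, 41],
    "customer_id": [1, 2, 2],
})

# Format validation: flag emails that don't match a simple pattern
# (real email validation is considerably stricter than this regex)
bad_email = ~df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Range validation: ages outside 0-120
bad_age = ~df["age"].between(0, 120)

# Uniqueness validation: repeated customer_ids (flag every copy)
dup_ids = df["customer_id"].duplicated(keep=False)
```

Counting `True` values in each mask yields the per-issue totals a quality report summarizes.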
Example report:
Data Quality Summary:
Total rows: 10,000
Total columns: 15

Missing Values:
- email: 234 (2.3%)
- phone: 567 (5.7%)
- address: 89 (0.9%)

Duplicates: 45 rows (0.45%)

Format Issues:
- phone: 3 different formats detected
- date_created: mix of formats (MM/DD/YYYY and YYYY-MM-DD)

Recommendations:
1. Remove duplicate rows based on customer_id
2. Standardize phone number format
3. Fill missing emails or remove those rows
4. Convert all dates to ISO format

Quality Checks
Perform specific quality checks:
Consistency Checks
Ensure data is internally consistent:
"Check if start_date is always before end_date""Verify total equals sum of line items""Ensure parent_id references exist in the parents table""Validate that age matches birth_date"Referential Integrity
Check relationships between datasets:
"Verify all customer_ids in orders exist in customers table""Find orphaned records (orders without matching customers)""Check if all product_ids have corresponding product details"Business Rule Validation
Apply domain-specific rules:
"Flag orders with quantity less than minimum order quantity""Find customers under 18 years old (invalid for this service)""Identify transactions outside business hours""Check if discount percentage exceeds maximum allowed (30%)"Statistical Outlier Detection
Identify unusual values:
"Find outliers in the 'price' column""Detect anomalous transaction amounts (beyond 3 standard deviations)""Show unusually high or low values in revenue data"Cleaning Best Practices
Before Cleaning
1. Backup your data:
"Save a copy of the original data to backup.csv"Always preserve the original before making changes.
2. Understand the data:
"Show me a sample of 10 rows""What are the column names and types?""Describe the data distribution for key columns"Know what you’re working with before cleaning.
3. Identify issues:
"Run a data quality report on this dataset"Understand what needs cleaning.
During Cleaning
4. Clean incrementally: Clean one issue at a time and verify results:
"Remove duplicates" → Review →"Fill missing emails" → Review →"Standardize dates" → Review5. Document changes: Keep notes on what was cleaned and why:
- Which duplicates were removed
- How missing values were filled
- What format standards were applied
6. Validate as you go:
"Show me a sample of the cleaned data""How many rows were changed?""Are there any remaining null values?"After Cleaning
7. Final validation:
"Run a final data quality check""Verify all required fields are populated""Check for any remaining duplicates"8. Compare before/after:
"Compare row counts before and after cleaning""Show summary statistics for original vs. cleaned data"9. Export clean data:
"Export the cleaned data to cleaned_customers.csv""Save cleaned data to the database"Common Cleaning Scenarios
Cleaning Customer Data
"Clean the customer data:
- Remove duplicates based on email
- Standardize phone numbers to E.164 format
- Fill missing states with 'Unknown'
- Convert all emails to lowercase
- Trim whitespace from names and addresses
- Capitalize first and last names
- Remove invalid email addresses"

Cleaning Sales Transactions
"Clean sales data:
- Remove rows with negative quantities
- Fill missing prices with product's average price
- Standardize date format to YYYY-MM-DD
- Convert currency columns to numeric (remove $ and commas)
- Remove duplicate orders (same order_id)
- Validate that total = price × quantity"

Cleaning Survey Responses
"Clean survey data:
- Standardize Yes/No responses (handle Y, N, yes, no, 1, 0)
- Remove rows with no answers at all
- Fill missing age with median age
- Trim whitespace from text responses
- Remove HTML/special characters from comments
- Categorize open-ended responses where possible"

Cleaning Product Catalogs
"Clean product data:
- Remove duplicate SKUs (keep most recent)
- Fill missing descriptions with 'No description available'
- Standardize category names
- Ensure prices are positive numbers
- Remove products with null SKU
- Standardize unit of measure abbreviations"

Advanced Cleaning Techniques
Regular Expressions
Use regex patterns for complex string cleaning:
"Extract 5-digit zip codes from addresses using regex""Remove all non-alphanumeric characters except spaces""Parse order numbers matching pattern ORD-YYYYMMDD-####"Custom Functions
Apply custom transformations:
"Apply this cleaning logic: If status is 'Pending' for more than 30 days, change to 'Expired'"
"For product names, remove brand prefixes but keep the core name"Machine Learning-Based Cleaning
Let AI learn patterns:
"Standardize company names (handle abbreviations, capitalization, Inc vs Incorporated)""Categorize products into standard categories based on descriptions""Detect and correct likely typos in city names"Troubleshooting Cleaning Issues
Section titled “Troubleshooting Cleaning Issues”Cleaning Removes Too Much Data
Problem: Aggressive cleaning deletes valid data.
Solution:
- Be more specific in cleaning instructions
- Review the data before mass deletions
- Use conditional cleaning (only clean if…)
- Start with a small sample
Cleaning Doesn’t Work as Expected
Problem: AI misunderstands what to clean.
Solution:
- Provide clearer, more specific instructions
- Show examples of before/after
- Clean one operation at a time
- Use explicit column names
Cleaned Data Still Has Issues
Problem: Some issues remain after cleaning.
Solution:
- Run another quality check to identify remaining issues
- Apply additional targeted cleaning
- Some issues may require manual review
- Complex cleaning may need multiple passes
Next Steps
- Data Enrichment - Enhance cleaned data with external information
- File Uploads - Best practices for preparing files before upload
- Managing Connections - Connect data sources to clean
Tips for Effective Cleaning
Start with the basics:
- Remove duplicates first
- Handle missing values early
- Standardize formats before analysis
Use consistent standards:
- Date format: YYYY-MM-DD
- Phone format: E.164 (+1234567890)
- Email: lowercase
- Text: trim whitespace
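Applying these standards in pandas is a few one-liners. The column names and sample values below are hypothetical, and the `+1` phone prefix assumes US numbers:

```python
import pandas as pd

contacts = pd.DataFrame({
    "email":  ["Ada@Example.COM", " bob@x.com "],
    "phone":  ["(555) 010-0199", "555.010.0100"],
    "joined": ["03/15/2024", "04/01/2024"],
})

# Email: lowercase and trimmed
contacts["email"] = contacts["email"].str.strip().str.lower()

# Phone: strip non-digit characters, then prefix +1 (E.164, assuming US numbers)
contacts["phone"] = "+1" + contacts["phone"].str.replace(r"\D", "", regex=True)

# Dates: parse the known MM/DD/YYYY format, emit YYYY-MM-DD
contacts["joined"] = (pd.to_datetime(contacts["joined"], format="%m/%d/%Y")
                        .dt.strftime("%Y-%m-%d"))
```

Applying one standardization pass like this before analysis keeps joins and comparisons from failing on cosmetic differences.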
Validate frequently:
- Check after each major cleaning step
- Compare row counts before/after
- Sample the cleaned data visually
Document your process:
- Note what was cleaned and why
- Record decisions about missing values
- Keep a changelog of transformations
Know when to stop:
- Aim for “good enough” not “perfect”
- Some data quality issues may be acceptable
- Manual review may be needed for edge cases
- Focus on issues that impact your analysis