Data Cleaning
Let AI handle data cleaning tasks—from removing duplicates to standardizing formats. Simply describe what you want cleaned, and Querri’s cleaner tool takes care of the rest.
Using AI for Data Cleaning
Querri’s AI assistant can clean data from any source: uploaded files, connected databases, cloud storage, or business applications.
Simple Cleaning Requests
Ask in natural language:
"Clean this customer data""Remove duplicates from the orders table""Fix formatting issues in the sales spreadsheet""Standardize phone numbers in contacts.csv"The AI automatically:
- Detects common data quality issues
- Applies appropriate cleaning operations
- Shows what was cleaned
- Returns the cleaned dataset
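Querri’s internal implementation isn’t shown here, but the steps above roughly correspond to a small cleaning pass like the following pandas sketch. The function name, the sample data, and the specific operations are illustrative assumptions, not the tool’s actual code:

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> tuple[pd.DataFrame, dict]:
    """Illustrative only: apply common cleaning steps and report what changed."""
    df = df.copy()
    report = {"rows_before": len(df)}
    # Trim whitespace in text columns (a common formatting issue)
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    # Drop exact duplicate rows and record how many were removed
    df = df.drop_duplicates()
    report["duplicates_removed"] = report["rows_before"] - len(df)
    report["rows_after"] = len(df)
    return df, report

raw = pd.DataFrame({
    "name": [" Ada ", "Ada", "Bob"],
    "email": ["ada@x.com", "ada@x.com", "bob@x.com"],
})
cleaned, report = basic_clean(raw)
```

After trimming, the first two rows become identical and one is removed, so the report shows what was cleaned alongside the cleaned dataset.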
Specific Cleaning Instructions
Be specific about what to clean:
"Remove rows with missing email addresses""Fill blank values in the 'city' column with 'Unknown'""Convert all state names to two-letter codes""Standardize date format to YYYY-MM-DD""Trim whitespace from all text columns"The more specific you are, the more targeted the cleaning.
Common Cleaning Operations
Removing Duplicates
Identify and remove duplicate records:
Simple duplicate removal:
"Remove duplicate rows from customers.csv""Find duplicate email addresses and keep only the first""Deduplicate orders based on order_id"Advanced deduplication:
"Remove duplicates based on email and phone number""Keep the most recent record for each customer_id""Deduplicate by name, but keep different addresses separate"What the cleaner does:
- Identifies duplicate rows based on specified columns
- Keeps one instance (first, last, or most complete)
- Reports how many duplicates were removed
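For readers curious what these options correspond to in code, here is a minimal pandas sketch of simple and advanced deduplication. The sample data is hypothetical; the `subset` and `keep` arguments mirror the "based on" and "keep the most recent" phrasing above:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "updated":  ["2024-01-01", "2024-02-01", "2024-01-05", "2024-01-07", "2024-03-01"],
})

# "Deduplicate orders based on order_id" — keep the first occurrence
first_kept = orders.drop_duplicates(subset="order_id", keep="first")

# "Keep the most recent record for each order_id": sort, then keep the last
latest = (orders.sort_values("updated")
                .drop_duplicates(subset="order_id", keep="last"))

removed = len(orders) - len(first_kept)  # number of duplicates removed
```

Reporting `removed` alongside the result matches the summary the cleaner returns.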
Example result:
Found 47 duplicate customers
Removed 47 rows
Kept 1,203 unique customers

Handling Missing Values
Deal with null, blank, or missing data:
Removing missing data:
"Remove rows where 'email' is blank""Delete records with null values in critical fields""Drop rows with more than 3 missing columns"Filling missing data:
"Fill missing prices with the average price""Replace blank cities with 'Not specified'""Fill missing dates with the previous row's date""Use 0 for null quantity values"Imputation strategies:
- Mean/Average: For numeric data
- Median: For numeric data with outliers
- Mode: For categorical data
- Forward fill: Copy from previous row
- Backward fill: Copy from next row
- Default value: Specify a constant
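The imputation strategies above map directly onto standard pandas fill operations. A short sketch with hypothetical data (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "price":    [10.0, None, 30.0, None],
    "city":     ["Boston", None, "Boston", None],
    "quantity": [None, 2, None, 4],
})

df["price"] = df["price"].fillna(df["price"].mean())        # mean: numeric data
df["city"] = df["city"].fillna(df["city"].mode().iloc[0])   # mode: categorical data
df["quantity"] = df["quantity"].fillna(0)                   # default constant value
# Forward/backward fill copy from the previous/next row:
# df["date"] = df["date"].ffill()   or   df["date"].bfill()
```

Median imputation works the same way with `df["price"].median()`, which is less sensitive to outliers than the mean.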
Example:
"For each product, fill missing stock quantities with the average stock for that product category"Standardizing Formats
Convert data to consistent formats:
Date standardization:
"Convert all dates to YYYY-MM-DD format""Standardize date_created to ISO 8601""Parse dates in 'MM/DD/YYYY' format to date objects"Phone number formatting:
"Format phone numbers as (123) 456-7890""Standardize phone numbers to E.164 format (+1234567890)""Remove special characters from phone numbers"Text standardization:
"Convert all emails to lowercase""Capitalize first letter of names""Standardize state abbreviations (California → CA)""Convert 'Y'/'N' to 'Yes'/'No'"Numeric formatting:
"Remove dollar signs and commas from prices""Convert currency to numeric values""Round decimals to 2 places""Format numbers with thousand separators"Trimming and Cleaning Text
Remove unwanted characters and whitespace:
Whitespace:
"Trim leading and trailing spaces from all columns""Remove extra spaces between words""Clean up whitespace in address fields"Special characters:
"Remove non-alphanumeric characters from product codes""Strip HTML tags from descriptions""Remove emoji from customer comments"Case normalization:
"Convert all text to lowercase""Uppercase all product SKUs""Title case for customer names"Example:
Before: " ACME Corp "After: "ACME Corp"
Before: "contact@example.COM"After: "contact@example.com"
Before: "$1,234.56"After: 1234.56Fixing Data Types
Ensure columns have the correct data type:
Type conversion:
"Convert 'price' column to decimal""Change 'order_date' to date type""Make 'quantity' an integer""Convert 'is_active' to boolean"Detecting issues:
"Find non-numeric values in the 'revenue' column""Show rows where 'date' isn't a valid date""Identify text in numeric fields"Coercion strategies:
"Convert to number, use null for invalid values""Parse dates, mark invalid dates as null""Cast 'Yes'/'No' to true/false boolean"Splitting and Combining Columns
Restructure data by splitting or merging columns:
Splitting columns:
"Split 'full_name' into 'first_name' and 'last_name'""Separate 'address' into street, city, state, zip""Extract area code from phone number""Split 'datetime' into separate date and time columns"Combining columns:
"Combine 'first_name' and 'last_name' into 'full_name'""Merge city and state into 'location'""Concatenate year, month, day into a date string"Example:
Before: full_name = "John Doe"After: first_name = "John", last_name = "Doe"
Before: street = "123 Main St", city = "Boston"After: address = "123 Main St, Boston"Cleaner Tool Capabilities
Querri’s cleaner_tool.py performs sophisticated data cleaning:
Automatic Issue Detection
The cleaner automatically scans for:
- Duplicate records
- Missing or null values
- Inconsistent formatting
- Invalid data types
- Outliers and anomalies
- Encoding issues
- Whitespace problems
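A scan like this can be sketched in a few lines of pandas. The `quality_scan` function and sample data below are illustrative assumptions, not the actual cleaner_tool.py implementation:

```python
import pandas as pd

def quality_scan(df: pd.DataFrame) -> dict:
    """Illustrative sketch: summarize common data quality issues."""
    return {
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Missing values per column
        "null_counts": df.isna().sum().to_dict(),
        # Text cells with leading/trailing whitespace
        "whitespace_cells": int(
            sum(df[c].astype(str).str.strip().ne(df[c].astype(str)).sum()
                for c in df.select_dtypes(include="object"))
        ),
    }

sample = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", None],
    "name":  [" Ada", " Ada", "Bob"],
})
report = quality_scan(sample)
```

A production scanner would add checks for invalid types, outliers, and encoding problems on top of these basics.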
Example request:
"Analyze the data quality in sales.csv"Response:
Data Quality Report:
- 342 duplicate rows found
- 'email' column has 15 null values
- 'phone' has inconsistent formatting
- 'date' column has 3 invalid dates
- 'price' contains 5 non-numeric values

Batch Cleaning
Apply multiple cleaning operations at once:
"Clean the customer data by:1. Removing duplicates based on email2. Filling missing phone numbers with 'N/A'3. Standardizing all state names to abbreviations4. Converting emails to lowercase5. Trimming whitespace from all fields"The cleaner executes all operations in sequence and shows a summary.
Column-Specific Cleaning
Target specific columns for cleaning:
"In the 'email' column, remove invalid email addresses""For 'price', remove dollar signs and convert to float""Clean the 'description' field by removing HTML tags"Conditional Cleaning
Apply cleaning based on conditions:
"For rows where country is 'USA', format phone as (123) 456-7890""Fill missing 'discount' with 0 only for retail customers""Standardize dates to ISO format, but only for dates after 2020"Data Validation
Validate data quality after cleaning or identify issues before cleaning:
Validation Rules
Define what “clean” data should look like:
Format validation:
"Check if all emails are valid email addresses""Verify phone numbers match the pattern (###) ###-####""Ensure all dates are in YYYY-MM-DD format"Range validation:
"Flag prices less than 0 or greater than 10,000""Find ages outside the range 0-120""Identify quantities that aren't positive integers"Completeness validation:
"Find rows missing required fields (email, name, address)""Check for null values in critical columns""Report records with incomplete contact information"Uniqueness validation:
"Verify all customer_ids are unique""Check for duplicate email addresses""Ensure no order_ids are repeated"Validation Reports
Get comprehensive data quality reports:
"Run a data quality check on the customers table"Report includes:
- Total rows and columns
- Missing value counts per column
- Duplicate row count
- Data type mismatches
- Formatting inconsistencies
- Suggested cleaning operations
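The kinds of checks behind such a report (format, range, and uniqueness validation, as described under Validation Rules) can be sketched as pandas boolean masks. Column names, the simple email pattern, and the thresholds are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["ada@x.com", "not-an-email", "bob@x.com"],
    "age":   [34, 250, 41],
    "customer_id": [1, 2, 2],
})

# Format validation: flag emails that don't match a simple pattern
# (real email validation is considerably stricter than this regex)
bad_email = ~df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Range validation: ages outside 0-120
bad_age = ~df["age"].between(0, 120)

# Uniqueness validation: repeated customer_ids (flag every copy)
dup_ids = df["customer_id"].duplicated(keep=False)
```

Counting `True` values in each mask yields the per-issue totals a quality report summarizes.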
Example report:
Data Quality Summary:
Total rows: 10,000
Total columns: 15

Missing Values:
- email: 234 (2.3%)
- phone: 567 (5.7%)
- address: 89 (0.9%)

Duplicates: 45 rows (0.45%)

Format Issues:
- phone: 3 different formats detected
- date_created: mix of formats (MM/DD/YYYY and YYYY-MM-DD)

Recommendations:
1. Remove duplicate rows based on customer_id
2. Standardize phone number format
3. Fill missing emails or remove those rows
4. Convert all dates to ISO format

Quality Checks
Perform specific quality checks:
Consistency Checks
Ensure data is internally consistent:
"Check if start_date is always before end_date""Verify total equals sum of line items""Ensure parent_id references exist in the parents table""Validate that age matches birth_date"Referential Integrity
Check relationships between datasets:
"Verify all customer_ids in orders exist in customers table""Find orphaned records (orders without matching customers)""Check if all product_ids have corresponding product details"Business Rule Validation
Apply domain-specific rules:
"Flag orders with quantity less than minimum order quantity""Find customers under 18 years old (invalid for this service)""Identify transactions outside business hours""Check if discount percentage exceeds maximum allowed (30%)"Statistical Outlier Detection
Identify unusual values:
"Find outliers in the 'price' column""Detect anomalous transaction amounts (beyond 3 standard deviations)""Show unusually high or low values in revenue data"Cleaning Best Practices
Before Cleaning
1. Backup your data:
"Save a copy of the original data to backup.csv"Always preserve the original before making changes.
2. Understand the data:
"Show me a sample of 10 rows""What are the column names and types?""Describe the data distribution for key columns"Know what you’re working with before cleaning.
3. Identify issues:
"Run a data quality report on this dataset"Understand what needs cleaning.
During Cleaning
4. Clean incrementally: Clean one issue at a time and verify results:
"Remove duplicates" → Review →"Fill missing emails" → Review →"Standardize dates" → Review5. Document changes: Keep notes on what was cleaned and why:
- Which duplicates were removed
- How missing values were filled
- What format standards were applied
6. Validate as you go:
"Show me a sample of the cleaned data""How many rows were changed?""Are there any remaining null values?"After Cleaning
7. Final validation:
"Run a final data quality check""Verify all required fields are populated""Check for any remaining duplicates"8. Compare before/after:
"Compare row counts before and after cleaning""Show summary statistics for original vs. cleaned data"9. Export clean data:
"Export the cleaned data to cleaned_customers.csv""Save cleaned data to the database"Common Cleaning Scenarios
Cleaning Customer Data
"Clean the customer data:
- Remove duplicates based on email
- Standardize phone numbers to E.164 format
- Fill missing states with 'Unknown'
- Convert all emails to lowercase
- Trim whitespace from names and addresses
- Capitalize first and last names
- Remove invalid email addresses"

Cleaning Sales Transactions
"Clean sales data:
- Remove rows with negative quantities
- Fill missing prices with product's average price
- Standardize date format to YYYY-MM-DD
- Convert currency columns to numeric (remove $ and commas)
- Remove duplicate orders (same order_id)
- Validate that total = price × quantity"

Cleaning Survey Responses
"Clean survey data:
- Standardize Yes/No responses (handle Y, N, yes, no, 1, 0)
- Remove rows with no answers at all
- Fill missing age with median age
- Trim whitespace from text responses
- Remove HTML/special characters from comments
- Categorize open-ended responses where possible"

Cleaning Product Catalogs
"Clean product data:
- Remove duplicate SKUs (keep most recent)
- Fill missing descriptions with 'No description available'
- Standardize category names
- Ensure prices are positive numbers
- Remove products with null SKU
- Standardize unit of measure abbreviations"

Advanced Cleaning Techniques
Regular Expressions
Use regex patterns for complex string cleaning:
"Extract 5-digit zip codes from addresses using regex""Remove all non-alphanumeric characters except spaces""Parse order numbers matching pattern ORD-YYYYMMDD-####"Custom Functions
Apply custom transformations:
"Apply this cleaning logic: If status is 'Pending' for more than 30 days, change to 'Expired'"
"For product names, remove brand prefixes but keep the core name"Machine Learning-Based Cleaning
Let AI learn patterns:
"Standardize company names (handle abbreviations, capitalization, Inc vs Incorporated)""Categorize products into standard categories based on descriptions""Detect and correct likely typos in city names"Troubleshooting Cleaning Issues
Section titled “Troubleshooting Cleaning Issues”Cleaning Removes Too Much Data
Problem: Aggressive cleaning deletes valid data.
Solution:
- Be more specific in cleaning instructions
- Review the data before mass deletions
- Use conditional cleaning (only clean if…)
- Start with a small sample
Cleaning Doesn’t Work as Expected
Problem: AI misunderstands what to clean.
Solution:
- Provide clearer, more specific instructions
- Show examples of before/after
- Clean one operation at a time
- Use explicit column names
Cleaned Data Still Has Issues
Problem: Some issues remain after cleaning.
Solution:
- Run another quality check to identify remaining issues
- Apply additional targeted cleaning
- Some issues may require manual review
- Complex cleaning may need multiple passes
Next Steps
- Data Enrichment - Enhance cleaned data with external information
- File Uploads - Best practices for preparing files before upload
- Managing Connections - Connect data sources to clean
Tips for Effective Cleaning
Start with the basics:
- Remove duplicates first
- Handle missing values early
- Standardize formats before analysis
Use consistent standards:
- Date format: YYYY-MM-DD
- Phone format: E.164 (+1234567890)
- Email: lowercase
- Text: trim whitespace
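Applying these standards in pandas is a few one-liners. The column names and sample values below are hypothetical, and the `+1` phone prefix assumes US numbers:

```python
import pandas as pd

contacts = pd.DataFrame({
    "email":  ["Ada@Example.COM", " bob@x.com "],
    "phone":  ["(555) 010-0199", "555.010.0100"],
    "joined": ["03/15/2024", "04/01/2024"],
})

# Email: lowercase and trimmed
contacts["email"] = contacts["email"].str.strip().str.lower()

# Phone: strip non-digit characters, then prefix +1 (E.164, assuming US numbers)
contacts["phone"] = "+1" + contacts["phone"].str.replace(r"\D", "", regex=True)

# Dates: parse the known MM/DD/YYYY format, emit YYYY-MM-DD
contacts["joined"] = (pd.to_datetime(contacts["joined"], format="%m/%d/%Y")
                        .dt.strftime("%Y-%m-%d"))
```

Applying one standardization pass like this before analysis keeps joins and comparisons from failing on cosmetic differences.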
Validate frequently:
- Check after each major cleaning step
- Compare row counts before/after
- Sample the cleaned data visually
Document your process:
- Note what was cleaned and why
- Record decisions about missing values
- Keep a changelog of transformations
Know when to stop:
- Aim for “good enough” not “perfect”
- Some data quality issues may be acceptable
- Manual review may be needed for edge cases
- Focus on issues that impact your analysis