Data Cleaning

Let AI handle data cleaning tasks—from removing duplicates to standardizing formats. Simply describe what you want cleaned, and Querri’s cleaner tool takes care of the rest.

Querri’s AI assistant can clean data from any source: uploaded files, connected databases, cloud storage, or business applications.

Ask in natural language:

"Clean this customer data"
"Remove duplicates from the orders table"
"Fix formatting issues in the sales spreadsheet"
"Standardize phone numbers in contacts.csv"

The AI automatically:

  • Detects common data quality issues
  • Applies appropriate cleaning operations
  • Shows what was cleaned
  • Returns the cleaned dataset

Be specific about what to clean:

"Remove rows with missing email addresses"
"Fill blank values in the 'city' column with 'Unknown'"
"Convert all state names to two-letter codes"
"Standardize date format to YYYY-MM-DD"
"Trim whitespace from all text columns"

The more specific you are, the more targeted the cleaning.

Identify and remove duplicate records:

Simple duplicate removal:

"Remove duplicate rows from customers.csv"
"Find duplicate email addresses and keep only the first"
"Deduplicate orders based on order_id"

Advanced deduplication:

"Remove duplicates based on email and phone number"
"Keep the most recent record for each customer_id"
"Deduplicate by name, but keep different addresses separate"

What the cleaner does:

  • Identifies duplicate rows based on specified columns
  • Keeps one instance (first, last, or most complete)
  • Reports how many duplicates were removed

Example result:

Found 47 duplicate customers
Removed 47 rows
Kept 1,203 unique customers
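
For reference, this kind of deduplication maps onto a standard drop-duplicates operation. A minimal pandas sketch, not Querri's internal implementation (the file and column names, including updated_at, are illustrative):

import pandas as pd

df = pd.read_csv("customers.csv")

# Simple: keep the first occurrence of each email/phone combination
before = len(df)
df = df.drop_duplicates(subset=["email", "phone"], keep="first")
print(f"Removed {before - len(df)} duplicates, kept {len(df)} unique customers")

# Advanced: keep the most recent record per customer_id
df = df.sort_values("updated_at").drop_duplicates(subset=["customer_id"], keep="last")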

Handle null, blank, or missing data:

Removing missing data:

"Remove rows where 'email' is blank"
"Delete records with null values in critical fields"
"Drop rows with more than 3 missing columns"

Filling missing data:

"Fill missing prices with the average price"
"Replace blank cities with 'Not specified'"
"Fill missing dates with the previous row's date"
"Use 0 for null quantity values"

Imputation strategies:

  • Mean/Average: For numeric data
  • Median: For numeric data with outliers
  • Mode: For categorical data
  • Forward fill: Copy from previous row
  • Backward fill: Copy from next row
  • Default value: Specify a constant

Example:

"For each product, fill missing stock quantities with the average stock for that product category"

Convert data to consistent formats:

Date standardization:

"Convert all dates to YYYY-MM-DD format"
"Standardize date_created to ISO 8601"
"Parse dates in 'MM/DD/YYYY' format to date objects"

Phone number formatting:

"Format phone numbers as (123) 456-7890"
"Standardize phone numbers to E.164 format (+1234567890)"
"Remove special characters from phone numbers"

Text standardization:

"Convert all emails to lowercase"
"Capitalize first letter of names"
"Standardize state abbreviations (California → CA)"
"Convert 'Y'/'N' to 'Yes'/'No'"

Numeric formatting:

"Remove dollar signs and commas from prices"
"Convert currency to numeric values"
"Round decimals to 2 places"
"Format numbers with thousand separators"

Remove unwanted characters and whitespace:

Whitespace:

"Trim leading and trailing spaces from all columns"
"Remove extra spaces between words"
"Clean up whitespace in address fields"

Special characters:

"Remove non-alphanumeric characters from product codes"
"Strip HTML tags from descriptions"
"Remove emoji from customer comments"

Case normalization:

"Convert all text to lowercase"
"Uppercase all product SKUs"
"Title case for customer names"

Example:

Before: " ACME Corp "
After: "ACME Corp"
Before: "contact@example.COM"
After: "contact@example.com"
Before: "$1,234.56"
After: 1234.56
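
As a sketch, these cleanups are typically a few vectorized string operations (column names are illustrative):

import pandas as pd

df = pd.read_csv("customers.csv")

# Trim leading/trailing spaces and collapse runs of internal whitespace
df["company"] = df["company"].str.strip().str.replace(r"\s+", " ", regex=True)

# Strip HTML tags from descriptions
df["description"] = df["description"].str.replace(r"<[^>]+>", "", regex=True)

# Keep only alphanumerics in product codes, then uppercase
df["sku"] = df["sku"].str.replace(r"[^A-Za-z0-9]", "", regex=True).str.upper()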

Ensure columns have the correct data type:

Type conversion:

"Convert 'price' column to decimal"
"Change 'order_date' to date type"
"Make 'quantity' an integer"
"Convert 'is_active' to boolean"

Detecting issues:

"Find non-numeric values in the 'revenue' column"
"Show rows where 'date' isn't a valid date"
"Identify text in numeric fields"

Coercion strategies:

"Convert to number, use null for invalid values"
"Parse dates, mark invalid dates as null"
"Cast 'Yes'/'No' to true/false boolean"

Restructure data by splitting or merging columns:

Splitting columns:

"Split 'full_name' into 'first_name' and 'last_name'"
"Separate 'address' into street, city, state, zip"
"Extract area code from phone number"
"Split 'datetime' into separate date and time columns"

Combining columns:

"Combine 'first_name' and 'last_name' into 'full_name'"
"Merge city and state into 'location'"
"Concatenate year, month, day into a date string"

Example:

Before: full_name = "John Doe"
After: first_name = "John", last_name = "Doe"
Before: street = "123 Main St", city = "Boston"
After: address = "123 Main St, Boston"

Querri’s cleaner tool (cleaner_tool.py) performs sophisticated data cleaning.

The cleaner automatically scans for:

  • Duplicate records
  • Missing or null values
  • Inconsistent formatting
  • Invalid data types
  • Outliers and anomalies
  • Encoding issues
  • Whitespace problems

Example request:

"Analyze the data quality in sales.csv"

Response:

Data Quality Report:
- 342 duplicate rows found
- 'email' column has 15 null values
- 'phone' has inconsistent formatting
- 'date' column has 3 invalid dates
- 'price' contains 5 non-numeric values
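
A rough sketch of the checks such a scan might run, assuming a pandas DataFrame (this is illustrative, not Querri's actual implementation):

import pandas as pd

df = pd.read_csv("sales.csv")

print(f"{df.duplicated().sum()} duplicate rows found")
print(df.isna().sum())  # null counts per column

# Values that are present but fail numeric/date parsing
bad_price = df["price"].notna() & pd.to_numeric(df["price"], errors="coerce").isna()
bad_date = df["date"].notna() & pd.to_datetime(df["date"], errors="coerce").isna()
print(f"{bad_price.sum()} non-numeric prices, {bad_date.sum()} invalid dates")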

Apply multiple cleaning operations at once:

"Clean the customer data by:
1. Removing duplicates based on email
2. Filling missing phone numbers with 'N/A'
3. Standardizing all state names to abbreviations
4. Converting emails to lowercase
5. Trimming whitespace from all fields"

The cleaner executes all operations in sequence and shows a summary.
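
Conceptually, the batch maps onto a sequence of transformations like this pandas sketch (the state map is truncated for illustration):

import pandas as pd

df = pd.read_csv("customers.csv")

df = df.drop_duplicates(subset=["email"])        # 1. dedupe by email
df["phone"] = df["phone"].fillna("N/A")          # 2. fill missing phones
state_map = {"California": "CA", "Texas": "TX"}  # 3. (truncated map)
df["state"] = df["state"].replace(state_map)
df["email"] = df["email"].str.lower()            # 4. lowercase emails
for col in df.select_dtypes("object"):           # 5. trim whitespace everywhere
    df[col] = df[col].str.strip()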

Target specific columns for cleaning:

"In the 'email' column, remove invalid email addresses"
"For 'price', remove dollar signs and convert to float"
"Clean the 'description' field by removing HTML tags"

Apply cleaning based on conditions:

"For rows where country is 'USA', format phone as (123) 456-7890"
"Fill missing 'discount' with 0 only for retail customers"
"Standardize dates to ISO format, but only for dates after 2020"

Validate data quality after cleaning or identify issues before cleaning:

Define what “clean” data should look like:

Format validation:

"Check if all emails are valid email addresses"
"Verify phone numbers match the pattern (###) ###-####"
"Ensure all dates are in YYYY-MM-DD format"

Range validation:

"Flag prices less than 0 or greater than 10,000"
"Find ages outside the range 0-120"
"Identify quantities that aren't positive integers"

Completeness validation:

"Find rows missing required fields (email, name, address)"
"Check for null values in critical columns"
"Report records with incomplete contact information"

Uniqueness validation:

"Verify all customer_ids are unique"
"Check for duplicate email addresses"
"Ensure no order_ids are repeated"

Get comprehensive data quality reports:

"Run a data quality check on the customers table"

Report includes:

  • Total rows and columns
  • Missing value counts per column
  • Duplicate row count
  • Data type mismatches
  • Formatting inconsistencies
  • Suggested cleaning operations

Example report:

Data Quality Summary:
Total rows: 10,000
Total columns: 15
Missing Values:
- email: 234 (2.3%)
- phone: 567 (5.7%)
- address: 89 (0.9%)
Duplicates: 45 rows (0.45%)
Format Issues:
- phone: 3 different formats detected
- date_created: mix of formats (MM/DD/YYYY and YYYY-MM-DD)
Recommendations:
1. Remove duplicate rows based on customer_id
2. Standardize phone number format
3. Fill missing emails or remove those rows
4. Convert all dates to ISO format

Perform specific quality checks:

Ensure data is internally consistent:

"Check if start_date is always before end_date"
"Verify total equals sum of line items"
"Ensure parent_id references exist in the parents table"
"Validate that age matches birth_date"

Check relationships between datasets:

"Verify all customer_ids in orders exist in customers table"
"Find orphaned records (orders without matching customers)"
"Check if all product_ids have corresponding product details"

Apply domain-specific rules:

"Flag orders with quantity less than minimum order quantity"
"Find customers under 18 years old (invalid for this service)"
"Identify transactions outside business hours"
"Check if discount percentage exceeds maximum allowed (30%)"

Identify unusual values:

"Find outliers in the 'price' column"
"Detect anomalous transaction amounts (beyond 3 standard deviations)"
"Show unusually high or low values in revenue data"

A recommended step-by-step cleaning workflow:

1. Back up your data:

"Save a copy of the original data to backup.csv"

Always preserve the original before making changes.

2. Understand the data:

"Show me a sample of 10 rows"
"What are the column names and types?"
"Describe the data distribution for key columns"

Know what you’re working with before cleaning.

3. Identify issues:

"Run a data quality report on this dataset"

Understand what needs cleaning.

4. Clean incrementally: Address one issue at a time and verify the results:

"Remove duplicates" → Review →
"Fill missing emails" → Review →
"Standardize dates" → Review

5. Document changes: Keep notes on what was cleaned and why:

  • Which duplicates were removed
  • How missing values were filled
  • What format standards were applied

6. Validate as you go:

"Show me a sample of the cleaned data"
"How many rows were changed?"
"Are there any remaining null values?"

7. Final validation:

"Run a final data quality check"
"Verify all required fields are populated"
"Check for any remaining duplicates"

8. Compare before/after:

"Compare row counts before and after cleaning"
"Show summary statistics for original vs. cleaned data"

9. Export clean data:

"Export the cleaned data to cleaned_customers.csv"
"Save cleaned data to the database"
"Clean the customer data:
- Remove duplicates based on email
- Standardize phone numbers to E.164 format
- Fill missing states with 'Unknown'
- Convert all emails to lowercase
- Trim whitespace from names and addresses
- Capitalize first and last names
- Remove invalid email addresses"
"Clean sales data:
- Remove rows with negative quantities
- Fill missing prices with product's average price
- Standardize date format to YYYY-MM-DD
- Convert currency columns to numeric (remove $ and commas)
- Remove duplicate orders (same order_id)
- Validate that total = price × quantity"
"Clean survey data:
- Standardize Yes/No responses (handle Y, N, yes, no, 1, 0)
- Remove rows with no answers at all
- Fill missing age with median age
- Trim whitespace from text responses
- Remove HTML/special characters from comments
- Categorize open-ended responses where possible"
"Clean product data:
- Remove duplicate SKUs (keep most recent)
- Fill missing descriptions with 'No description available'
- Standardize category names
- Ensure prices are positive numbers
- Remove products with null SKU
- Standardize unit of measure abbreviations"

Use regex patterns for complex string cleaning:

"Extract 5-digit zip codes from addresses using regex"
"Remove all non-alphanumeric characters except spaces"
"Parse order numbers matching pattern ORD-YYYYMMDD-####"

Apply custom transformations:

"Apply this cleaning logic:
If status is 'Pending' for more than 30 days, change to 'Expired'"
"For product names, remove brand prefixes but keep the core name"

Let AI learn patterns:

"Standardize company names (handle abbreviations, capitalization, Inc vs Incorporated)"
"Categorize products into standard categories based on descriptions"
"Detect and correct likely typos in city names"

Problem: Aggressive cleaning deletes valid data.

Solution:

  • Be more specific in cleaning instructions
  • Review the data before mass deletions
  • Use conditional cleaning (only clean if…)
  • Start with a small sample

Problem: AI misunderstands what to clean.

Solution:

  • Provide clearer, more specific instructions
  • Show examples of before/after
  • Clean one operation at a time
  • Use explicit column names

Problem: Some issues remain after cleaning.

Solution:

  • Run another quality check to identify remaining issues
  • Apply additional targeted cleaning
  • Some issues may require manual review
  • Complex cleaning may need multiple passes

Start with the basics:

  • Remove duplicates first
  • Handle missing values early
  • Standardize formats before analysis

Use consistent standards:

  • Date format: YYYY-MM-DD
  • Phone format: E.164 (+1234567890)
  • Email: lowercase
  • Text: trim whitespace

Validate frequently:

  • Check after each major cleaning step
  • Compare row counts before/after
  • Sample the cleaned data visually

Document your process:

  • Note what was cleaned and why
  • Record decisions about missing values
  • Keep a changelog of transformations

Know when to stop:

  • Aim for “good enough” not “perfect”
  • Some data quality issues may be acceptable
  • Manual review may be needed for edge cases
  • Focus on issues that impact your analysis