Skip to content

How to Connect Your Data

Real analysis often requires combining data from multiple sources. Whether you need to match customers to their orders, merge monthly reports into one view, or enrich your data with classifications, Querri makes connecting data simple through natural language prompts.

This guide covers two fundamental ways to connect data:

  • Joining — matching rows from different datasets based on a shared key
  • Stacking — combining similar datasets with the same columns into one

Joining connects rows from two datasets when they share a common value—like a customer ID, product code, or email address.

ScenarioExample
Add details to your dataMatch orders to customer info
Filter based on another listKeep only products in your catalog
Enrich with classificationsAdd industry labels to companies
Combine related tablesConnect invoices to payments

Imagine you have two lists:

  • List A: Your main data (e.g., 10,000 orders)
  • List B: Reference data you want to add (e.g., 500 customers)

A join finds matching rows and combines them. The key question is: what happens when there’s no match?

Join TypePlain EnglishWhen to Use
Left Join”Keep all of List A, add matching info from List B”You want all your main records, even if some don’t match
Inner Join”Only keep rows that exist in both lists”You only want records that have matches on both sides
Full Join”Keep everything from both lists”You want a complete picture of all data

Basic join — add customer details to orders:

"Join the orders data with customer info on customer_id"

Left join — keep all orders even if customer is unknown:

"Add customer details to orders, keeping all orders even if there's no customer match"

Inner join — only orders with known customers:

"Show me only orders where we have customer information"

Join on multiple keys:

"Match sales to inventory on both product_id and warehouse_id"

Join with different column names:

"Join orders to customers where orders.cust_id matches customers.customer_id"

A common pattern is joining to add classifications or lookups:

"I have a list of companies with industries I classified.
Join those industry labels back to my full transaction data on company_name"

This is especially powerful after using the Researcher tool to classify a subset of your data.

Fuzzy Matching: When Keys Don’t Match Exactly

Section titled “Fuzzy Matching: When Keys Don’t Match Exactly”

Real-world data is messy. “Acme Corp”, “ACME Corporation”, and “Acme Corp.” are the same company, but won’t match in a standard join.

Querri’s Data Query tool supports fuzzy matching using text similarity functions.

ScenarioExample
Name variations”John Smith” vs “Smith, John”
Company name differences”Microsoft” vs “Microsoft Corporation”
Typos in data entry”Califronia” vs “California”
Address matching”123 Main St” vs “123 Main Street”

Match companies with similar names:

"Join these two company lists, matching on company name with fuzzy matching.
Use a similarity threshold of 0.8 to catch spelling variations."

Find duplicate customers:

"Find potential duplicate customers where names are similar but not identical.
Show me pairs with similarity above 0.85 so I can review them."

Match addresses approximately:

"Match shipping addresses to our warehouse locations,
allowing for minor differences in street abbreviations"

Different data types need different thresholds:

Data TypeRecommended ThresholdNotes
Person names0.85 - 0.95Higher to avoid false matches
Company names0.80 - 0.90Allow for Inc/Corp/LLC variations
Addresses0.70 - 0.85Street abbreviations vary widely
Product names0.75 - 0.85Model numbers need precision
Catching typos0.90+Very similar strings only
  1. Standardize first — Remove extra spaces, convert to lowercase

    "Clean up company names by trimming whitespace and converting to uppercase,
    then do the fuzzy match"
  2. Add exact filters — Reduce comparison volume with exact criteria

    "Fuzzy match customer names, but only compare customers in the same state"
  3. Review results — Fuzzy matching can produce false positives

    "Show me the fuzzy matches with similarity scores so I can verify them"

Stacking (also called concatenating or appending) combines datasets that have the same structure—like monthly reports, regional data, or yearly exports.

ScenarioExample
Combine time periodsJanuary + February + March sales
Merge regional dataNorth + South + East + West territories
Consolidate file exportsMultiple CSV exports from same system
Append new recordsAdd this week’s data to historical data

Unlike joining (which adds columns), stacking adds rows:

Dataset A: 1,000 rows (January sales)
Dataset B: 1,200 rows (February sales)
Dataset C: 950 rows (March sales)
─────────────────────────────────────
Stacked: 3,150 rows (Q1 sales)

The key requirement: columns should match (or at least overlap significantly).

Basic stack — combine monthly data:

"Stack January, February, and March sales data into one dataset"

Stack with source tracking:

"Combine all quarterly reports and add a column showing which quarter each row came from"

Stack files with slightly different columns:

"Combine these three regional exports.
They have mostly the same columns but some regions have extra fields—just include all columns."

Append new data to existing:

"Add this week's transactions to the historical transactions data"
  1. Check column alignment first

    "Show me the columns in each dataset so I can verify they match"
  2. Add a source column — Track where rows came from

    "Stack these files and add a 'source_file' column"
  3. Handle column differences — Decide what to do with mismatched columns

    "Stack these datasets, filling in blanks where columns don't exist in some files"
  4. Watch for duplicates — Stacking can create duplicates if data overlaps

    "Stack the datasets and remove any duplicate rows based on transaction_id"
AspectJoiningStacking
What it doesAdds columns from another datasetAdds rows from another dataset
RequiresA matching key (ID, name, code)Similar column structure
Result sizeUsually same row count as main dataSum of all input rows
Use whenEnriching or filtering dataCombining similar data
ExampleOrders + Customer detailsJan sales + Feb sales + Mar sales

When you need to classify values that repeat across many rows:

  1. Extract unique values:

    "Get unique company names from the transactions data"
  2. Classify the unique list:

    "Classify each company by industry"
  3. Join classifications back:

    "Join the industry classifications to the full transactions data on company name"

This avoids classifying the same company hundreds of times.

When your data arrives in pieces:

  1. Stack all sources:

    "Combine Q1, Q2, Q3, and Q4 sales files into one dataset"
  2. Clean if needed:

    "Remove duplicates and standardize the date format"
  3. Analyze the combined data:

    "Show me monthly revenue trends for the full year"

When you need data from several reference tables:

  1. Start with main data:

    "Load the transactions data"
  2. Join customer info:

    "Add customer details by matching customer_id"
  3. Join product info:

    "Add product categories by matching product_id"
  4. Analyze enriched data:

    "Show revenue by customer segment and product category"

Cause: Keys don’t match exactly (different formats, extra spaces, case differences).

Fix: Standardize keys first or use fuzzy matching:

"Clean up the ID columns by trimming spaces and converting to uppercase, then join"

“My join has way more rows than expected”

Section titled ““My join has way more rows than expected””

Cause: One-to-many relationship—one record in List A matches multiple in List B.

Fix: This might be correct! If not, deduplicate first:

"Remove duplicate customer_ids from the customer list before joining"

“Some columns are missing after stacking”

Section titled ““Some columns are missing after stacking””

Cause: Column names differ slightly between datasets.

Fix: Rename columns to match before stacking:

"Rename 'sales_amount' to 'revenue' in the Q1 data, then stack with Q2"

Cause: Comparing every row to every other row is computationally expensive.

Fix: Add exact filters to reduce comparisons:

"Fuzzy match companies, but only compare those in the same country first"