Categorize Tool

The Categorize tool automatically discovers categories in your text data when you don’t know what the groups should be ahead of time. It uses AI-powered clustering to find natural groupings, then generates human-readable labels for each one.

The Categorize tool analyzes a text column in your data, finds patterns across all rows, and groups similar items together. Under the hood, it:

  1. Embeds your text into numerical representations that capture meaning
  2. Clusters similar items together using density-based algorithms
  3. Labels each cluster with a descriptive, human-readable category name
  4. Optionally generates sub-categories for finer-grained detail
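
To make the first three steps above concrete, here is a toy Python sketch of an embed, cluster, and label pipeline. It is not Querri's implementation. The bag-of-words `embed`, the simplified DBSCAN-style `cluster`, the `STOPWORDS` list, and the `min_sim` and `min_pts` thresholds are all assumptions, chosen so the example runs with the standard library alone. A production system would use learned embeddings and a tuned clustering library.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, used only when naming clusters.
STOPWORDS = {"the", "a", "on", "for", "my", "not", "your"}

def embed(text):
    """Toy embedding: bag-of-words counts (real systems use learned vectors)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(vectors, min_sim=0.3, min_pts=2):
    """Simplified DBSCAN-style clustering. Returns one label per vector; -1 is noise."""
    n = len(vectors)
    neighbors = [
        [j for j in range(n) if cosine(vectors[i], vectors[j]) >= min_sim]
        for i in range(n)
    ]
    labels = [None] * n
    next_label = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # not dense enough to start a cluster
            continue
        labels[i] = next_label
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = next_label  # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = next_label
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep expanding
        next_label += 1
    return labels

def label_clusters(texts, labels):
    """Name each cluster after its most frequent non-stopword token."""
    names = {}
    for cluster_id in sorted(set(labels) - {-1}):
        tokens = Counter()
        for text, label in zip(texts, labels):
            if label == cluster_id:
                tokens.update(t for t in embed(text).elements() if t not in STOPWORDS)
        names[cluster_id] = tokens.most_common(1)[0][0] if tokens else "unlabeled"
    return names

texts = [
    "App crashes on login",
    "Login crashes the app",
    "Refund for my order",
    "Order refund not received",
    "Love your new design",
]
labels = cluster([embed(t) for t in texts])
names = label_clusters(texts, labels)
print(labels, names)
```

In this sketch the last text has no close neighbor, so it is left as noise (`-1`). The Categorize tool instead assigns such rows to the nearest group and flags them as estimated.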

Key capabilities:

  • Discover themes in survey responses, support tickets, or feedback
  • Group similar items without defining categories upfront
  • Choose your granularity: broad themes, medium topics, or specific sub-categories
  • Handle ambiguous items: low-confidence items are assigned to the nearest group and given a confidence score

| Scenario | Example |
| --- | --- |
| Discovering feedback themes | “What topics are customers talking about in these reviews?” |
| Grouping support tickets | “Categorize these support tickets by theme” |
| Exploring survey responses | “Find the main themes in our open-ended survey responses” |
| Topic discovery | “What kinds of feature requests do we get?” |
| Content organization | “Group these articles by topic” |

The Categorize tool isn’t the right choice when:

  • You already know the categories — use the Researcher to classify into predefined groups
  • Your data isn’t text-based — Categorize works on text columns, not numbers or dates
  • You have fewer than 20 rows — there isn’t enough data to discover meaningful clusters
  • You need exact keyword matching — use filters or formulas instead

Describe what you want to explore in natural language. The agent recognizes when your request calls for category discovery and invokes the Categorize tool automatically.

Theme discovery:

  • “What are the main themes in these customer reviews?”
  • “Categorize these support tickets by topic”
  • “Group these survey responses into themes”

Exploratory analysis:

  • “What kinds of feedback are people leaving?”
  • “Find patterns in these product descriptions”
  • “Discover the main topics in these comments”

Specific data:

  • “Categorize the feature requests in the description column”
  • “Group these job postings by role type”
  • “Find themes in the notes field”

The Categorize tool uses a preview-then-execute approach so you can choose the right level of detail before processing your full dataset.

When you ask to categorize data, Querri analyzes a sample and presents three options:

| Level | Description | Best For |
| --- | --- | --- |
| Broad | ~10–15 high-level themes | Executive summaries, quick overviews |
| Medium | ~30–80 topic areas | Working analysis, dashboards |
| Specific | Fine-grained with sub-categories | Detailed breakdowns, deep dives |

Each option includes a tailored description based on your actual data, so you can pick the level that fits your analysis.

After you choose a granularity level, the tool processes your entire dataset. You’ll see progress updates as it works through the pipeline — this can take a few minutes for larger datasets.

The Categorize tool adds new columns to your data:

| Column | Description |
| --- | --- |
| Category column (e.g., theme) | The discovered category label for each row |
| {column}_confidence | How well the row fits its assigned category (0–1) |
| {column}_estimated | true if the row was ambiguous and assigned to the nearest group |
| {column}_detail | Sub-category within the main group (only with Specific granularity) |

  • High confidence (0.8–1.0): The row is a clear fit for its category
  • Medium confidence (0.5–0.8): Reasonable fit, but the row has some overlap with other categories
  • Low confidence (below 0.5): The row was ambiguous; check the _estimated flag
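
If you export the categorized data and want to bucket rows by these bands yourself, a small helper like the one below may be enough. The function name `confidence_band` is hypothetical, and treating a score of exactly 0.8 as high is an assumption, since the ranges above overlap at that boundary.

```python
def confidence_band(score: float) -> str:
    """Map a 0-1 confidence score to the bands described above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1, got {score}")
    if score >= 0.8:  # assumption: the 0.8 boundary counts as high
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"

for score in (0.93, 0.65, 0.31):
    print(score, confidence_band(score))
```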

The tool needs meaningful text to cluster. Short labels or single-word entries won’t produce useful groupings.

Works well:

  • Customer feedback paragraphs
  • Support ticket descriptions
  • Survey open-ended responses
  • Product reviews

Won’t work well:

  • Single-word tags
  • Numeric codes
  • Empty or null-heavy columns

Better input produces better categories:

Remove irrelevant rows:

"Filter out rows where the description is empty"

Focus on the right subset:

"Filter to feedback from 2025, then categorize by theme"

Deduplicate if needed:

"Remove duplicate descriptions, then categorize"

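If you prefer to clean the data before uploading it, the same preparation can be sketched in a few lines of Python. The helper name `prepare_texts` is hypothetical, and treating texts that differ only in case or spacing as duplicates is an assumption. The prompts above do not specify how the tool itself matches duplicates.

```python
def prepare_texts(values):
    """Drop empty entries and duplicates (ignoring case and extra whitespace)."""
    seen = set()
    cleaned = []
    for value in values:
        text = (value or "").strip()
        key = " ".join(text.lower().split())  # normalized form used for comparison
        if not key or key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned

raw = ["Great app", "", None, "great  app ", "Slow checkout"]
print(prepare_texts(raw))
```
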
The Categorize tool automatically detects which column contains the most meaningful text. It prioritizes columns with names like description, text, body, message, feedback, and review. If your text is in a differently named column, just mention it in your prompt:

"Categorize the data based on the comments column"

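As a rough illustration of how this kind of column detection could work, the sketch below checks the preferred names first and otherwise falls back to the column with the most words per row. Only the list of preferred names comes from this guide. The function name `pick_text_column`, the exact-match rule, and the word-count fallback are assumptions, not a description of the tool's actual heuristic.

```python
# Column names the guide says are prioritized, in the order listed.
PREFERRED_NAMES = ["description", "text", "body", "message", "feedback", "review"]

def pick_text_column(rows):
    """Pick a likely free-text column from a list of row dicts."""
    if not rows:
        raise ValueError("need at least one row")
    columns = list(rows[0].keys())
    # First choice: a column whose name matches a preferred name.
    for name in PREFERRED_NAMES:
        for column in columns:
            if column.lower() == name:
                return column

    # Fallback (assumption): the column with the most words per row on average.
    def average_words(column):
        return sum(len(str(row.get(column) or "").split()) for row in rows) / len(rows)

    return max(columns, key=average_words)

tickets = [{"id": "1", "status": "open", "Body": "Cannot reset my password"}]
notes = [{"id": "7", "notes": "Customer asked about bulk pricing"}]
print(pick_text_column(tickets), pick_text_column(notes))
```
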
If you’re exploring unfamiliar data, start with the Broad granularity to get an overview. You can always re-run with Specific once you understand the landscape.

Processing fewer rows is faster and often produces cleaner categories. Filter, deduplicate, or sample before categorizing:

"Filter to the last 6 months of tickets, then categorize by theme"

If the information you need is spread across multiple columns, the tool can analyze them together. Mention which columns matter:

"Categorize based on both the subject and description fields"

Once your data is categorized, use the new columns for downstream analysis:

"Show a bar chart of ticket count by theme"
"What's the average satisfaction score per category?"
"Which themes have the most critical-priority tickets?"

Starting data: 3,000 customer feedback responses

Step 1: Prepare

"Filter to responses where feedback is not empty"

Step 2: Categorize

"Categorize this feedback by theme"

Step 3: Choose granularity — select Medium for a working analysis

Step 4: Analyze

"Show the top 10 themes by volume as a bar chart"

Starting data: 8,000 support tickets with subject and body

Step 1: Focus

"Filter to tickets from Q4 2025"

Step 2: Categorize

"Categorize these tickets by topic using the subject and body"

Step 3: Choose granularity — select Specific for sub-categories

Step 4: Drill down

"Show ticket count by category and detail as a stacked bar chart"

Starting data: 1,500 open-ended survey responses

Step 1: Categorize

"What are the main themes in these survey responses?"

Step 2: Choose granularity — select Broad for an executive summary

Step 3: Summarize

"Create a table showing each theme, its row count, and a representative example"

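For readers who export the categorized data, the summary requested in Step 3 can be approximated in Python as follows. This is a sketch under assumptions: the column names `theme`, `theme_confidence`, and `feedback` are illustrative, the helper `summarize_themes` is hypothetical, and choosing the highest-confidence row as the representative example is one reasonable option rather than what Querri necessarily does.

```python
from collections import defaultdict

def summarize_themes(rows, category_col="theme", text_col="feedback"):
    """Return one summary row per theme: row count plus a representative example."""
    confidence_col = f"{category_col}_confidence"  # assumed naming pattern
    groups = defaultdict(list)
    for row in rows:
        groups[row[category_col]].append(row)

    summary = []
    for theme, members in groups.items():
        # Assumption: the highest-confidence row is the most representative.
        best = max(members, key=lambda r: r[confidence_col])
        summary.append({"theme": theme, "rows": len(members), "example": best[text_col]})
    summary.sort(key=lambda item: item["rows"], reverse=True)
    return summary

rows = [
    {"feedback": "Checkout is slow", "theme": "Performance", "theme_confidence": 0.91},
    {"feedback": "Pages take a while to load sometimes", "theme": "Performance", "theme_confidence": 0.62},
    {"feedback": "Please add dark mode", "theme": "Feature requests", "theme_confidence": 0.88},
]
summary = summarize_themes(rows)
for item in summary:
    print(item)
```
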
Cause: Fewer than 20 rows with meaningful text content.

Fix:

  • Check for empty or null values: “How many rows have empty descriptions?”
  • Make sure the right column is being used: “Use the notes column instead”
  • Add more data if available
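
You can also run this check yourself before uploading. The sketch below counts rows with usable text and compares the total against the 20-row minimum stated in this guide. The three-word cutoff for "meaningful" (`MIN_WORDS`) and both function names are assumptions made for illustration.

```python
MIN_ROWS = 20   # minimum stated in this guide
MIN_WORDS = 3   # hypothetical cutoff for "meaningful" text

def meaningful_row_count(values):
    """Count values that contain at least MIN_WORDS words."""
    return sum(1 for value in values if len(str(value or "").split()) >= MIN_WORDS)

def has_enough_rows(values):
    """True if there is enough usable text to attempt category discovery."""
    return meaningful_row_count(values) >= MIN_ROWS

column = ["The export button does nothing"] * 19 + ["", None, "bug"]
print(meaningful_row_count(column), has_enough_rows(column))
```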

“Categories are too broad or too narrow”

Cause: The granularity level doesn’t match your needs.

Fix: Re-run with a different granularity. Start with Medium if unsure.

Cause: Those rows didn’t fit neatly into any cluster and were assigned to the nearest one.

Fix: This is expected. Check the _confidence column — low-confidence estimated rows may genuinely be outliers or mixed topics. You can filter them out for cleaner analysis:

"Filter to rows where estimated is false"

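The same filter can be applied to exported data. In this sketch the flag column is assumed to be named `theme_estimated` and to hold booleans, and the helper `drop_estimated` is hypothetical. It also reports the share of rows removed, which helps you judge how much of the dataset was ambiguous.

```python
def drop_estimated(rows, category_col="theme"):
    """Return (rows that were not estimated, share of rows that were estimated)."""
    flag_col = f"{category_col}_estimated"  # assumed naming pattern
    kept = [row for row in rows if not row[flag_col]]
    share = (len(rows) - len(kept)) / len(rows) if rows else 0.0
    return kept, share

rows = [
    {"theme": "Billing", "theme_estimated": False},
    {"theme": "Billing", "theme_estimated": True},
    {"theme": "Login", "theme_estimated": False},
    {"theme": "Login", "theme_estimated": False},
]
kept, share = drop_estimated(rows)
print(len(kept), share)
```
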
Cause: Large dataset or high granularity level.

Fix:

  • Reduce row count by filtering or deduplicating first
  • Use Broad granularity for faster results
  • The tool shows progress updates — longer processing often means better categories