Item: regex-vs-llm-structured-text
Rating: 8.2
Author: Implexa

Decision framework for choosing between regex and LLM when parsing structured text — start with regex, add LLM only for low-confidence edge cases.

SkillRank score: 8.2/10

evaluated by implexa, claude-haiku-4-5 · 2026-06-08

regex-vs-llm-structured-text provides a cost-optimized decision framework and hybrid pipeline for parsing structured text. it starts with deterministic regex (95-98% coverage) and routes low-confidence cases to llm validation, reducing api calls by ~95% vs all-llm approaches.

structure: 9.0
trigger phrases: 8.0
procedure: 9.0
edge cases: 7.0
documentation: 8.0

original SKILL.md (from skills.sh)

Regex vs LLM for Structured Text Parsing

A practical decision framework for parsing structured text (quizzes, forms, invoices, documents). The key insight: when the format is consistent, regex handles 95-98% of cases cheaply and deterministically. Reserve expensive LLM calls for the remaining edge cases.

When to Activate

Parsing structured text with repeating patterns (questions, forms, tables)

Deciding between regex and LLM for text extraction

Building hybrid pipelines that combine both approaches

Optimizing cost/accuracy tradeoffs in text processing

Decision Framework
Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│   ├── Regex handles 95%+ → Done, no LLM needed
│   └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directly
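The tree above can be expressed as a small function. This is a minimal sketch, not part of the original skill: `choose_strategy` and its return labels are hypothetical names, and both rates are assumed to be fractions measured on a representative sample.

```python
def choose_strategy(pattern_match_rate: float, regex_success_rate: float) -> str:
    """Map the decision tree onto a strategy label.

    pattern_match_rate: fraction of sampled items that follow a repeating pattern.
    regex_success_rate: fraction of items the regex parses cleanly.
    """
    if pattern_match_rate <= 0.90:
        # Free-form or highly variable text: regex is not worth writing.
        return "llm_only"
    if regex_success_rate >= 0.95:
        # Regex covers enough on its own.
        return "regex_only"
    # Consistent format, but regex misses too many items.
    return "regex_plus_llm_edge_cases"


print(choose_strategy(0.97, 0.98))  # regex_only
print(choose_strategy(0.97, 0.90))  # regex_plus_llm_edge_cases
print(choose_strategy(0.50, 0.00))  # llm_only
```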

Architecture Pattern

Source Text
    │
    ▼
[Regex Parser] ─── Extracts structure (95-98% accuracy)
    │
    ▼
[Text Cleaner] ─── Removes noise (markers, page numbers, artifacts)
    │
    ▼
[Confidence Scorer] ─── Flags low-confidence extractions
    │
    ├── High confidence (≥0.95) → Direct output
    │
    └── Low confidence (<0.95) → [LLM Validator] → Output
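The Text Cleaner stage in the diagram has no implementation in this document. The sketch below shows one way it might look, assuming the noise consists of control characters and standalone page-number lines such as "Page 3 of 10" or "- 12 -". `clean_text` and both patterns are hypothetical; adjust them to the artifacts in your own sources.

```python
import re

# Assumed noise patterns (not from the original skill).
PAGE_LINE = re.compile(
    r"^\s*(?:page\s+\d+(?:\s+of\s+\d+)?|-\s*\d+\s*-)\s*$", re.IGNORECASE
)
# Control characters except tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_text(text: str) -> str:
    """Return a cleaned copy of text; the input string is left unchanged."""
    text = CONTROL_CHARS.sub("", text)
    kept = [
        line.rstrip()
        for line in text.splitlines()
        if not PAGE_LINE.match(line)  # drop lines that are only a page number
    ]
    return "\n".join(kept)


print(clean_text("What is 2+2?\nPage 3 of 10\nA. 3\x0c"))
```

Because the function returns a new string, it fits the "never mutate parsed items" rule when applied to individual fields.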

Implementation

1. Regex Parser (Handles the Majority)

import re
from dataclasses import dataclass

@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0

def parse_structured_text(content: str) -> list[ParsedItem]:
    """Parse structured text using regex patterns."""
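    # ^ anchors (enabled by re.MULTILINE) keep a number inside running text,
    # such as "3.14", from being mistaken for the start of an item.
    pattern = re.compile(
        r"^[ \t]*(?P<id>\d+)\.\s*(?P<text>.+?)\n"      # "1. Question text"
        r"(?P<choices>(?:^[ \t]*[A-D]\..+?\n)+)"       # one or more "A. ..." lines
        r"^[ \t]*Answer:\s*(?P<answer>[A-D])",         # "Answer: B"
        re.MULTILINE | re.DOTALL,
    )
    items = []
    for match in pattern.finditer(content):
        choices = tuple(
            c.strip()
            for c in re.findall(
                r"^[ \t]*[A-D]\.\s*(.+)$", match.group("choices"), re.MULTILINE
            )
        )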
        items.append(ParsedItem(
            id=match.group("id"),
            text=match.group("text").strip(),
            choices=choices,
            answer=match.group("answer"),
        ))
    return items

2. Confidence Scoring

Flag items that may need LLM review:

@dataclass(frozen=True)
class ConfidenceFlag:
    item_id: str
    score: float
    reasons: tuple[str, ...]

def score_confidence(item: ParsedItem) -> ConfidenceFlag:
    """Score extraction confidence and flag issues."""
    reasons = []
    score = 1.0

    if len(item.choices) < 3:
        reasons.append("few_choices")
        score -= 0.3

    if not item.answer:
        reasons.append("missing_answer")
        score -= 0.5

    if len(item.text) < 10:
        reasons.append("short_text")
        score -= 0.2

    return ConfidenceFlag(
        item_id=item.id,
        score=max(0.0, score),
        reasons=tuple(reasons),
    )

def identify_low_confidence(
    items: list[ParsedItem],
    threshold: float = 0.95,
) -> list[ConfidenceFlag]:
    """Return items below confidence threshold."""
    flags = [score_confidence(item) for item in items]
    return [f for f in flags if f.score < threshold]

3. LLM Validator (Edge Cases Only)

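import json
from dataclasses import replace

def validate_with_llm(
    item: ParsedItem,
    original_text: str,
    client,
) -> ParsedItem:
    """Use LLM to fix low-confidence extractions."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Cheapest model for validation
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "Extract the question, choices, and answer from this text.\n\n"
                f"Text: {original_text}\n\n"
                f"Current extraction: {item}\n\n"
                "Return 'CORRECT' if the extraction is accurate. Otherwise return "
                'only a JSON object with keys "text", "choices", and "answer".'
            ),
        }],
    )
    reply = response.content[0].text.strip()
    if reply.upper().startswith("CORRECT"):
        return item
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return item  # Unparseable reply: keep the regex extraction
    if not isinstance(data, dict):
        return item
    # The JSON key names are an assumption of this sketch. Build a new
    # instance instead of mutating the frozen dataclass.
    return replace(
        item,
        text=str(data.get("text", item.text)),
        choices=tuple(data.get("choices", item.choices)),
        answer=str(data.get("answer", item.answer)),
    )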

4. Hybrid Pipeline

def process_document(
    content: str,
    *,
    llm_client=None,
    confidence_threshold: float = 0.95,
) -> list[ParsedItem]:
    """Full pipeline: regex -> confidence check -> LLM for edge cases."""
    # Step 1: Regex extraction (handles 95-98%)
    items = parse_structured_text(content)

    # Step 2: Confidence scoring
    low_confidence = identify_low_confidence(items, confidence_threshold)

    if not low_confidence or llm_client is None:
        return items

    # Step 3: LLM validation (only for flagged items)
    low_conf_ids = {f.item_id for f in low_confidence}
    result = []
    for item in items:
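        if item.id in low_conf_ids:
            # NOTE: this passes the whole document as context for every
            # flagged item. To keep token costs down, pass only the item's
            # source snippet if your parser tracks match spans.
            result.append(validate_with_llm(item, content, llm_client))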
        else:
            result.append(item)

    return result

Real-World Metrics

From a production quiz parsing pipeline (410 items):

Metric                     Value
Regex success rate         98.0%
Low confidence items       8 (2.0%)
LLM calls needed           ~5
Cost savings vs all-LLM    ~95%
Test coverage              93%
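The percentages above can be checked with a few lines of arithmetic. In this sketch `pipeline_rates` is a hypothetical helper and "~5" is treated as exactly 5 calls. It computes the reduction in call count only; the ~95% cost-savings figure is not derived here, because the source gives no per-call token costs.

```python
def pipeline_rates(total_items: int, flagged: int, llm_calls: int) -> dict[str, float]:
    """Derive rates from raw counts; all-LLM is assumed to mean one call per item."""
    flag_rate = flagged / total_items
    return {
        "flag_rate": flag_rate,
        "regex_success_rate": 1 - flag_rate,
        "llm_call_reduction": 1 - llm_calls / total_items,
    }


rates = pipeline_rates(total_items=410, flagged=8, llm_calls=5)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
# flag_rate: 2.0%
# regex_success_rate: 98.0%
# llm_call_reduction: 98.8%
```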

Best Practices

Start with regex — even imperfect regex gives you a baseline to improve

Use confidence scoring to programmatically identify what needs LLM help

Use the cheapest LLM for validation (Haiku-class models are sufficient)

Never mutate parsed items — return new instances from cleaning/validation steps

TDD works well for parsers — write tests for known patterns first, then edge cases

Log metrics (regex success rate, LLM call count) to track pipeline health
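The "never mutate parsed items" practice can be illustrated with the frozen dataclass used in the implementation. This is a minimal sketch: `normalize_whitespace` is a hypothetical cleaning step, and `ParsedItem` is repeated here only so the example runs on its own.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0


def normalize_whitespace(item: ParsedItem) -> ParsedItem:
    """Hypothetical cleaning step: returns a new item, leaves the input as-is."""
    return replace(item, text=" ".join(item.text.split()))


original = ParsedItem(id="1", text="What  is\t2+2?", choices=("3", "4"), answer="B")
cleaned = normalize_whitespace(original)

print(cleaned.text)          # What is 2+2?
print(repr(original.text))   # original is unchanged

try:
    original.text = "changed"  # frozen dataclasses reject in-place assignment
except FrozenInstanceError:
    print("in-place mutation rejected")
```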

Anti-Patterns to Avoid

Sending all text to an LLM when regex handles 95%+ of cases (expensive and slow)

Using regex for free-form, highly variable text (LLM is better here)

Skipping confidence scoring and hoping regex "just works"

Mutating parsed objects during cleaning/validation steps

Not testing edge cases (malformed input, missing fields, encoding issues)

When to Use

Quiz/exam question parsing

Form data extraction

Invoice/receipt processing

Document structure parsing (headers, sections, tables)

Any structured text with repeating patterns where cost matters


added explicit inputs for llm client setup, confidence threshold tuning, and regex pattern guidance; expanded procedure into 7 discrete steps with clear inputs/outputs; extracted decision logic into 5 if-else branches covering consistency check, regex success threshold, llm availability, json parsing, and zero-flags case; documented output contract with field-level success criteria and storage format; added outcome signal for regex-only, hybrid, and failure modes.

intent

use regex first for structured text parsing (quizzes, forms, invoices, documents) because it handles 95-98% of cases cheaply and deterministically. reserve expensive llm calls for the remaining edge cases where confidence scoring flags extraction problems. this skill teaches you when to combine both approaches and how to measure which one to use.

inputs

text source

structured text with repeating patterns (required). examples: quiz questions with answer keys, form responses with consistent field order, invoice line items, multi-page documents with headers and sections.

confidence threshold (optional)

float between 0.0 and 1.0, default 0.95. items scoring below this trigger llm validation.

llm client (optional)

anthropic client or compatible api. required only if you want llm validation for low-confidence items. set the key via the environment variable ANTHROPIC_API_KEY or pass it directly. the key must be allowed to call the messages api (client.messages.create). use haiku-class models to minimize cost.
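one way to make the llm client truly optional is to build it only when a key and the sdk are both available. this is a sketch under assumptions: `make_llm_client` is a hypothetical helper, and returning None is taken to mean "run in regex-only mode".

```python
import os
from typing import Mapping, Optional


def make_llm_client(env: Mapping[str, str] = os.environ) -> Optional[object]:
    """Return an Anthropic client, or None when no key or SDK is available."""
    api_key = env.get("ANTHROPIC_API_KEY")
    if not api_key:
        return None  # no key: caller should fall back to regex-only mode
    try:
        import anthropic  # optional dependency, imported lazily
    except ImportError:
        return None
    return anthropic.Anthropic(api_key=api_key)


client = make_llm_client({})  # empty environment for illustration
print(client)  # None
```

passing the environment as a parameter keeps the helper easy to test without touching real credentials.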

regex patterns (optional)

pre-compiled regex patterns for your specific text format. if not provided, you must write them for your domain. patterns must include named capture groups for each field you want to extract.
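the named-group requirement can be checked before parsing starts. a minimal sketch, assuming the four field names used elsewhere in this skill; `missing_groups` and `REQUIRED_FIELDS` are hypothetical names.

```python
import re

# Field names assumed from the ParsedItem structure in this skill.
REQUIRED_FIELDS = ("id", "text", "choices", "answer")


def missing_groups(pattern: re.Pattern, required=REQUIRED_FIELDS) -> list[str]:
    """List required field names that the pattern has no named group for."""
    return [name for name in required if name not in pattern.groupindex]


incomplete = re.compile(r"(?P<id>\d+)\.\s*(?P<text>.+)")
print(missing_groups(incomplete))  # ['choices', 'answer']
```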

edge case logs (optional)

historical data on what regex misses in your domain. use this to tune confidence scoring thresholds and identify which edge cases matter most.

procedure

1. examine your text format for consistency
- input: raw structured text sample (at least 10-20 items).
- look for repeating patterns: numbering schemes (1., 2., 3. or Q1, Q2), field ordering (question, then choices, then answer), delimiters (newlines, markers, whitespace).
- output: consistency estimate (≥90% follows pattern = proceed to regex; <90% = skip regex, use llm directly).
2. write regex patterns with named groups
- input: identified repeating pattern, list of fields to extract (id, text, choices, answer, etc.).
- build regex using re.compile() with named capture groups (?P<name>...). add re.MULTILINE if the pattern uses ^ or $ to anchor at line boundaries, and re.DOTALL if a field can span multiple lines.
- test regex against 5-10 representative samples to confirm it matches expected fields.
- output: compiled regex pattern object.
3. parse text using regex
- input: regex pattern from step 2, raw text.
- iterate over pattern.finditer(text) to extract all matches.
- for each match, extract named groups and post-process (strip whitespace, normalize case, split comma-separated fields).
- construct a data object for each item (e.g., ParsedItem with fields: id, text, choices, answer, confidence=1.0).
- output: list of parsed items, all with confidence=1.0 initially.
4. score confidence for each item
- input: list of parsed items.
- for each item, compute a confidence score (start at 1.0) and deduct points for risk factors:
  - fewer than 3 choices: -0.3.
  - missing answer field: -0.5.
  - text length < 10 characters: -0.2.
  - malformed choice formatting: -0.2.
  - any other domain-specific anomaly: -0.1 to -0.3.
- clamp final score to [0.0, 1.0].
- output: one (item_id, score, [reasons]) record per item.
5. identify low-confidence items
- input: list of confidence scores, threshold (default 0.95).
- filter items where score < threshold.
- output: list of item ids and reasons flagged for llm review.
6. optionally validate low-confidence items with llm
- input: low-confidence item ids, original text snippet, llm client, model name (e.g., claude-haiku-4-5-20251001).
- for each flagged item, call client.messages.create() with:
  - model: haiku-class, or the cheapest tier that is accurate enough.
  - max_tokens: 500.
  - prompt: ask llm to extract question, choices, answer and compare to regex extraction. ask it to return "CORRECT" if regex result is valid, or corrected json otherwise.
- parse llm response (handle json errors gracefully).
- if llm suggests corrections, replace item with corrected version. if llm confirms "CORRECT", keep original.
- output: list of corrected items (for flagged ids only).
7. merge and return
- input: original parsed items, corrected items from step 6.
- build final result: for each item in original list, substitute corrected version if id was flagged, else use original.
- output: final list of parsed items.
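the scoring step lists a "malformed choice formatting" deduction that the SKILL.md implementation does not cover. the sketch below shows how it might be added, assuming "malformed" means an empty or duplicated choice; that definition and both helper names are hypothetical.

```python
def choice_format_penalty(choices: tuple[str, ...]) -> float:
    """Return 0.2 if any choice is empty or duplicated, else 0.0 (assumed rule)."""
    cleaned = [c.strip() for c in choices]
    has_empty = any(not c for c in cleaned)
    has_duplicate = len(set(cleaned)) != len(cleaned)
    return 0.2 if (has_empty or has_duplicate) else 0.0


def apply_penalties(base: float, penalties: list[float]) -> float:
    """Subtract all penalties from base and clamp the result to [0.0, 1.0]."""
    return min(1.0, max(0.0, base - sum(penalties)))


penalty = choice_format_penalty(("Paris", "London", "Paris"))
print(penalty)                                         # 0.2
print(apply_penalties(1.0, [0.3, 0.5, 0.2, penalty]))  # 0.0 (clamped)
```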

decision points

is text format consistent and repeating (≥90% matches a pattern)?

yes: proceed to regex (steps 2-3). regex will handle the majority cheaply and fast.
no: skip regex entirely, use llm directly on full text. regex will fail too often and waste developer time.
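the consistency check can be estimated on a sample before any parser is written. a rough sketch, assuming items are separated by blank lines and that a well-formed item starts with a number and a period; `split_blocks`, `pattern_match_rate`, and `ITEM_START` are hypothetical.

```python
import re

# Assumed shape of a well-formed item's first line, e.g. "1. Question..."
ITEM_START = re.compile(r"^\d+\.\s+\S")


def split_blocks(text: str) -> list[str]:
    """Split a sample into blocks on blank lines (an assumption about layout)."""
    return [b for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]


def pattern_match_rate(blocks: list[str], pattern: re.Pattern = ITEM_START) -> float:
    """Fraction of blocks whose first line matches the expected pattern."""
    if not blocks:
        return 0.0
    hits = sum(1 for block in blocks if pattern.match(block.lstrip()))
    return hits / len(blocks)


sample = (
    "1. What is 2+2?\nA. 3\nB. 4\nAnswer: B\n\n"
    "Question two: name a color\n\n"
    "3. Capital of France?\nA. Paris\nAnswer: A"
)
rate = pattern_match_rate(split_blocks(sample))
print(f"{rate:.0%} of blocks match")  # 67% of blocks match
```

a rate of 0.90 or higher on 10-20 items would send you down the regex branch; this tiny sample would not.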

does regex succeed on ≥95% of items? (measure this with confidence scoring, steps 4-5.)

yes: at most 5% of items are flagged. if none are flagged, skip llm entirely and return regex results as final output.
no: use the flagged items and their reasons to identify the edge cases, then validate those items with llm (step 6).

is llm_client provided and low-confidence items exist?

yes: run llm validation on flagged items only (step 6).
no: return regex results as-is, even if some items are below confidence threshold. log a warning that validation was skipped.

does llm response parse as valid json?

yes: extract corrected fields and replace original item.
no: assume llm returned plain text (e.g., "CORRECT"). if response contains "CORRECT" or "valid" (case-insensitive), keep original item. if response suggests an error, log it and keep original with confidence score halved.
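the reply-handling branch above might look like the sketch below. assumptions: the corrected json uses the keys "text", "choices", and "answer", and `apply_llm_reply` is a hypothetical name. one deliberate deviation: it matches "correct" and "valid" as whole words, because a plain substring check would also accept "incorrect" and "invalid".

```python
import json
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0


CONFIRMATION = re.compile(r"\b(?:correct|valid)\b", re.IGNORECASE)


def apply_llm_reply(item: ParsedItem, reply: str) -> ParsedItem:
    """Interpret an LLM reply: corrected JSON, a confirmation, or something else."""
    reply = reply.strip()
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = None
    if isinstance(data, dict):
        # Valid JSON object: take corrected fields, keep the rest.
        return replace(
            item,
            text=str(data.get("text", item.text)),
            choices=tuple(data.get("choices", item.choices)),
            answer=str(data.get("answer", item.answer)),
        )
    if CONFIRMATION.search(reply):
        return item  # LLM confirmed the regex extraction
    # Unusable reply: keep the extraction but halve its confidence.
    return replace(item, confidence=item.confidence / 2)


item = ParsedItem("1", "What is 2+2?", ("3", "4"), "A")
print(apply_llm_reply(item, '{"answer": "B"}').answer)                       # B
print(apply_llm_reply(item, "The extraction looks incorrect.").confidence)   # 0.5
```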

did confidence scoring identify zero low-confidence items?

yes: skip llm calls entirely, return regex results immediately.
no: validate only flagged items, not entire dataset.

output contract

data structure for each parsed item:

id (string): unique identifier from text (e.g., "1", "Q5").
text (string): main content, whitespace normalized.
choices (tuple of strings): multiple-choice options, cleaned.
answer (string): correct answer or selection.
confidence (float): 1.0 after regex; lowered when a later step finds a problem (for example, halved when the llm reply cannot be interpreted).

final output format:

list of ParsedItem objects (or equivalent dict/dataclass in your language).
order: same as input text order.
encoding: utf-8, all control characters stripped.

success criteria for a single item:

id is non-empty and matches source text.
text is at least 3 characters.
choices is a non-empty tuple (at least 1 choice).
answer matches one of the choices or is blank (if optional).
confidence >= 0.0 and <= 1.0.
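the per-item criteria can be turned into a checker that returns the list of failed rules. this is a sketch under assumptions: `check_item` and the problem labels are hypothetical, the answer is assumed to be a letter (A, B, ...) indexing into choices as in the SKILL.md parser, and the "matches source text" part of the id rule is not checked because it needs the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0


def check_item(item: ParsedItem) -> list[str]:
    """Return labels for every success criterion the item fails."""
    problems = []
    if not item.id:
        problems.append("empty_id")
    if len(item.text) < 3:
        problems.append("short_text")
    if len(item.choices) < 1:
        problems.append("no_choices")
    if item.answer:  # a blank answer is allowed
        # Assumed convention: "A" is the first choice, "B" the second, ...
        index = ord(item.answer) - ord("A") if len(item.answer) == 1 else -1
        if not 0 <= index < len(item.choices):
            problems.append("answer_not_in_choices")
    if not 0.0 <= item.confidence <= 1.0:
        problems.append("confidence_out_of_range")
    return problems


good = ParsedItem("1", "What is 2+2?", ("3", "4", "5"), "B")
bad = ParsedItem("", "Hi", (), "D", 1.5)
print(check_item(good))  # []
print(check_item(bad))
```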

success criteria for full output:

all items pass above checks.
regex success rate >= 95% (i.e., items flagged for llm are 5% of total items or fewer).
if llm was used, all flagged items were validated and returned (none dropped).
no items mutated in-place; all transformations return new instances.

file/storage location:

write parsed items to json (a single array, or json lines with one object per line) or csv (headers: id, text, choices, answer, confidence).
log metrics to stdout or file: regex_success_rate, llm_calls_made, total_items, avg_confidence.
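the storage and metrics lines above could be implemented roughly as follows. a minimal sketch: `to_json_lines` and `build_metrics` are hypothetical helpers, the output is json lines, and the metric names are the ones listed above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0


def to_json_lines(items: list[ParsedItem]) -> str:
    """Serialize items as JSON Lines, preserving input order."""
    return "\n".join(json.dumps(asdict(i), ensure_ascii=False) for i in items)


def build_metrics(items: list[ParsedItem], flagged: int, llm_calls_made: int) -> dict:
    """Compute the pipeline health metrics named in the output contract."""
    total = len(items)
    return {
        "total_items": total,
        "regex_success_rate": (total - flagged) / total if total else 0.0,
        "llm_calls_made": llm_calls_made,
        "avg_confidence": sum(i.confidence for i in items) / total if total else 0.0,
    }


items = [
    ParsedItem("1", "What is 2+2?", ("3", "4"), "B"),
    ParsedItem("2", "Hi", ("x",), "A", confidence=0.5),
]
print(to_json_lines(items))
print(json.dumps(build_metrics(items, flagged=1, llm_calls_made=1)))
```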

outcome signal

regex success signal: 95%+ of items parse without confidence flags, zero llm calls made, metrics log shows regex_success_rate >= 0.95.
hybrid success signal: low-confidence items identified (2-5% of total), llm calls made only for those, corrections applied, final confidence scores all >= 0.95.
failure signal: regex success rate < 80% (indicates pattern too inconsistent; consider pure llm instead), or llm calls exceed 10-15% of total items (indicates regex thresholds need tuning, or text format is not actually structured).
user experience check: compare regex-only output to llm-corrected output on a test set. if corrections are minimal (< 5% of flagged items changed), regex alone is sufficient and you can disable llm permanently.
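the signals above can be folded into a single classifier for dashboards or alerts. this is a sketch, not a definitive rule: `classify_outcome` is hypothetical, the failure cutoff uses the lower end (10%) of the 10-15% range, and combinations the signals do not describe are reported as "needs_review".

```python
def classify_outcome(regex_success_rate: float, llm_call_rate: float) -> str:
    """Classify a pipeline run from two rates, both fractions of total items."""
    if regex_success_rate < 0.80 or llm_call_rate > 0.10:
        return "failure"          # pattern too inconsistent, or too many LLM calls
    if regex_success_rate >= 0.95 and llm_call_rate == 0:
        return "regex_success"    # regex alone was enough
    if 0 < llm_call_rate <= 0.05:
        return "hybrid_success"   # LLM used only for a small flagged share
    return "needs_review"         # not covered by the documented signals


print(classify_outcome(0.98, 0.00))  # regex_success
print(classify_outcome(0.97, 0.02))  # hybrid_success
print(classify_outcome(0.70, 0.00))  # failure
print(classify_outcome(0.90, 0.08))  # needs_review
```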
