Audits whether test results can be trusted: flakiness, isolation, real external dependencies, time/random/order dependency, and shared state. Use when auditing…
Paths: File paths (references/, ../ln-*) are relative to this skill directory.
Trustworthiness Auditor (L3 Worker)
Type: L3 Worker
Specialized worker auditing whether automated test results are deterministic, isolated, and trustworthy.
Purpose & Scope
Audit Test Trustworthiness (Category 5: Medium Priority)
Check determinism, isolation, and dependency control
Detect flaky tests, time/random/order dependency, shared state, and real external dependencies
Emit REWRITE_FOR_DETERMINISM or DELETE_IF_LOW_VALUE
Calculate compliance score (X/10)
Inputs
MANDATORY READ: Load references/audit_worker_core_contract.md.
Receives contextStore with: tech_stack, testFilesMetadata, codebase_root, output_dir.
Workflow
Detection policy: use two-layer detection (candidate scan, then context verification); load references/two_layer_detection.md only when the verification method is ambiguous.
Parse Context: Extract tech stack, trustworthiness checklist, test file list, output_dir from contextStore
Check Isolation (Layer 1): Scan for isolation violations across 6 categories (APIs, DB, FS, Time, Random, Network)
2b) Context Analysis (Layer 2 -- MANDATORY): For each isolation violation, ask:
Is this an integration test? (real dependencies are intentional) -> do NOT flag. Only flag isolation issues in unit tests
Is in-memory DB configured via test config (not visible in grep)? -> skip
Is this a test helper that sets up mocks for other tests? -> skip
Check Determinism: Check for flaky tests, time-dependent assertions, order-dependent tests, shared mutable state
Evaluate trust action: Use REWRITE_FOR_DETERMINISM by default; use DELETE_IF_LOW_VALUE only when the test is both untrustworthy and low-value according to obvious local evidence
Collect Findings: Record each violation with severity, location (file:line), effort estimate (S/M/L), action, recommendation
Calculate Score: Count violations by severity, calculate compliance score (X/10)
Write Report: Build the full markdown report in memory per references/templates/audit_worker_report_template.md, then write it to {output_dir}/ln-635--global.md in a single Write call
Return Summary: Return minimal summary to coordinator (see Output Format)
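The two-layer detection described in the workflow can be sketched as follows. This is a minimal illustration, not the contract from references/two_layer_detection.md: the Candidate shape, the function names, and the path conventions used to recognise integration tests and mock helpers are all assumptions. The third Layer 2 question (in-memory DB configured via test config) needs project config and is not covered here.

```typescript
// Hypothetical shapes for illustration; the real contract lives in the referenced files.
interface Candidate {
  file: string;
  line: number; // 1-based line number within the test file
  pattern: string; // source of the regex that matched
}

// Layer 1: cheap textual scan. These patterns mirror the grep hints in the audit rules.
const CANDIDATE_PATTERNS: RegExp[] = [
  /\bfetch\(/,
  /axios\.get/,
  /http\.request/,
  /new Date\(\)/,
  /Date\.now\(\)/,
  /Math\.random\(\)/,
];

function scanCandidates(file: string, source: string): Candidate[] {
  const found: Candidate[] = [];
  source.split("\n").forEach((text, index) => {
    CANDIDATE_PATTERNS.forEach((pattern) => {
      if (pattern.test(text)) {
        found.push({ file, line: index + 1, pattern: pattern.source });
      }
    });
  });
  return found;
}

// Layer 2: context verification. The path conventions below are assumptions.
function isIntegrationTest(file: string): boolean {
  return /integration/i.test(file);
}

function isMockHelper(file: string): boolean {
  return /(^|\/)(helpers?|mocks?|__mocks__)\//i.test(file);
}

// Keep only candidates that survive the context questions.
function verifyCandidates(candidates: Candidate[]): Candidate[] {
  return candidates.filter(
    (c) => !isIntegrationTest(c.file) && !isMockHelper(c.file)
  );
}
```

A candidate from Layer 1 becomes a finding only if it survives verifyCandidates; a match inside an integration test or a mock helper is dropped without being reported.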
Audit Rules: Test Isolation
1. External APIs
Good: Mocked (jest.mock, sinon, nock)
Bad: Real HTTP calls to external APIs
Detection:
Grep for axios.get, fetch(, http.request without mocks
Check if test makes actual network calls
Severity: HIGH
Recommendation: Ensure external API calls are controlled (mock, stub, or test server). Tool choice depends on project stack. Exception: Integration tests are EXPECTED to use real dependencies -- do NOT flag
Effort: M
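As one way to read "controlled" in the recommendation above, the sketch below passes the HTTP call in as a parameter so a unit test can substitute a stub. It uses a hand-rolled stub rather than jest.mock, sinon, or nock so that it stays self-contained; HttpGetJson, fetchUserName, stubGetJson, and the api.example.com URL are hypothetical names, not part of any audited project.

```typescript
// Hypothetical port: the only way the code under test reaches the network.
type HttpGetJson = (url: string) => Promise<unknown>;

interface User {
  id: number;
  name: string;
}

// Code under test: depends on the port, not on axios/fetch directly.
async function fetchUserName(getJson: HttpGetJson, id: number): Promise<string> {
  const body = await getJson(`https://api.example.com/users/${id}`);
  const user = body as User; // sketch only: a real client would validate the payload
  return user.name;
}

// Test double: records requested URLs and returns a canned payload.
function stubGetJson(calls: string[]): HttpGetJson {
  return (url) => {
    calls.push(url);
    return Promise.resolve({ id: 7, name: "Ada" });
  };
}
```

A test built on stubGetJson never opens a socket, so its result does not depend on the availability or latency of the remote API.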
2. Database
Good: In-memory DB (sqlite :memory:) or mocked
Bad: Real database (PostgreSQL, MySQL)
Detection:
Check DB connection strings (localhost:5432, real DB URL)
Grep for beforeAll(async () => { await db.connect() }) without :memory:
Severity: MEDIUM
Recommendation: Ensure DB state is controlled and isolated between test runs. Exception: Integration tests with in-memory DB via config -> skip
Effort: M-L
3. File System
Good: Mocked (mock-fs, vol)
Bad: Real file reads/writes
Detection:
Grep for fs.readFile, fs.writeFile without mocks
Check if test creates/deletes real files
Severity: MEDIUM
Recommendation: Ensure file system operations are isolated (mock, temp directory, or cleanup). Tool choice depends on project stack
Effort: S-M
4. Time/Date
Good: Mocked (jest.useFakeTimers, sinon.useFakeTimers)
Bad: new Date(), Date.now() without mocks
Detection:
Grep for new Date() in test files without useFakeTimers
Severity: MEDIUM
Recommendation: Ensure time-dependent logic uses controlled clock (fake timers, injected clock, or time provider). Tool choice depends on project stack
Effort: S
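The "injected clock" option from the recommendation above might look like the sketch below. Clock, Session, isExpired, and fixedClock are illustrative names; a project using jest or sinon would more likely reach for their fake timers.

```typescript
// A clock is any function returning epoch milliseconds.
type Clock = () => number;

interface Session {
  expiresAt: number; // epoch milliseconds
}

// Production callers omit `now` and get the real clock; tests pass a fixed one.
function isExpired(session: Session, now: Clock = Date.now): boolean {
  return now() >= session.expiresAt;
}

// Test helper: a clock frozen at a known instant.
const fixedClock = (ms: number): Clock => () => ms;
```

With a fixed clock, boundary cases such as "exactly at expiry" can be asserted directly instead of depending on when the test happens to run.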
5. Random
Good: Seeded random (Math.seedrandom, fixed seed)
Bad: Math.random() without seed
Detection:
Grep for Math.random() without seed setup
Severity: LOW
Recommendation: Use seeded random for deterministic tests
Effort: S
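To illustrate what a seeded generator buys a test, the sketch below uses a Park-Miller "minimal standard" linear congruential generator (multiplier 48271, modulus 2^31 - 1) as a stand-in for whatever seeded source the project actually uses, such as the seedrandom library. createSeededRandom and rollDice are hypothetical names, and this generator is for tests only, not for security-sensitive code.

```typescript
// Park-Miller LCG: a stand-in for the project's seeded random source.
function createSeededRandom(seed: number): () => number {
  const MODULUS = 2147483647; // 2^31 - 1
  const MULTIPLIER = 48271;
  // Normalise the seed into [1, MODULUS - 1]; a state of 0 would stay at 0 forever.
  const range = MODULUS - 1;
  let state = (((Math.floor(seed) % range) + range) % range) + 1;
  return () => {
    state = (state * MULTIPLIER) % MODULUS; // product stays below 2^53, so it is exact
    return (state - 1) / range; // value in [0, 1)
  };
}

// Code under test takes the random source as a parameter instead of calling Math.random().
function rollDice(count: number, random: () => number): number[] {
  const rolls: number[] = [];
  for (let i = 0; i < count; i++) {
    rolls.push(1 + Math.floor(random() * 6));
  }
  return rolls;
}
```

Two generators created with the same seed produce the same sequence, so a failing test can be replayed exactly by reusing its seed.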
6. Network
Good: In-process requests (supertest for Express, no fixed port)
Bad: Real network requests (localhost:3000, binding to port)
Detection:
Grep for app.listen(3000) in tests
Check for real HTTP requests
Severity: MEDIUM
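Recommendation: Use supertest against the app object instead of a hard-coded port (supertest binds an ephemeral port when the app is not already listening)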
Effort: M
Audit Rules: Determinism
1. Flaky Tests
What: Tests that pass or fail nondeterministically on unchanged code
Detection:
Run tests multiple times, check for inconsistent results
Grep for setTimeout, setInterval without proper awaits
Check for race conditions (async operations not awaited)
Severity: HIGH
Recommendation: Fix race conditions, use proper async/await
Effort: M-L
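The "async operations not awaited" pattern from the detection list can be reduced to the sketch below. Store, racyCheck, and awaitedCheck are hypothetical names, and the resolved promise inside save stands in for real I/O.

```typescript
class Store {
  private value = 0;

  // Simulates an asynchronous write: the value lands only after the await.
  async save(next: number): Promise<void> {
    await Promise.resolve();
    this.value = next;
  }

  read(): number {
    return this.value;
  }
}

// Racy: the promise is dropped, so the read happens before the write lands.
function racyCheck(store: Store): number {
  void store.save(42);
  return store.read();
}

// Fixed: the write is awaited before the read.
async function awaitedCheck(store: Store): Promise<number> {
  await store.save(42);
  return store.read();
}
```

In this reduced form racyCheck always observes the stale value. With real I/O and timers in play, the same mistake shows up as a test that passes on some runs and fails on others.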
2. Time-Dependent Assertions
What: Assertions on current time (expect(timestamp).toBeCloseTo(Date.now()))
Detection:
Grep for Date.now(), new Date() in assertions
Severity: MEDIUM
Recommendation: Mock time (fake timers or an injected clock) and assert against the fixed value
Effort: S
3. Order-Dependent Tests
What: Tests that fail when run in different order
Detection:
Run tests in random order, check for failures
Grep for shared mutable state between tests
Severity: MEDIUM
Recommendation: Isolate tests, reset state in beforeEach
Effort: M
4. Shared Mutable State
What: Global variables modified across tests
Detection:
Grep for let globalVar at module level
Check for state shared between tests
Severity: MEDIUM
Recommendation: Use beforeEach to reset state
Effort: S-M
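The link between shared mutable state and order dependence is easier to see in a small sketch. Here plain functions stand in for test cases and resetCart stands in for a beforeEach hook; cart, addItem, and runTests are hypothetical names.

```typescript
// Module-level state shared by every "test" below: the smell being audited.
let cart: string[] = [];

function addItem(item: string): number {
  cart.push(item);
  return cart.length;
}

// Stand-in for a beforeEach hook.
function resetCart(): void {
  cart = [];
}

// Each test assumes it starts from an empty cart.
const testAddsApple = (): boolean => addItem("apple") === 1;
const testAddsPear = (): boolean => addItem("pear") === 1;

// Tiny runner: returns pass/fail per test, optionally resetting state between tests.
function runTests(tests: Array<() => boolean>, resetBetween: boolean): boolean[] {
  resetCart(); // every run starts clean
  return tests.map((test) => {
    if (resetBetween) resetCart();
    return test();
  });
}
```

Without the reset, whichever test runs second fails, so the failing test changes with execution order. With the reset, both tests pass in either order.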
Audit Rules: Trustworthiness Drag
1. Overlarge Test With Shared Setup (>100 lines)
What: Test with >100 lines, testing too many scenarios
Detection:
Count lines per test
If >100 lines -> flag as overlarge
Severity: MEDIUM
Recommendation: Split into focused tests (one scenario per test)
Effort: S-M
2. Slow Poke (>5 seconds)
What: Test taking >5 seconds to run
Detection:
Measure test duration
If >5s -> Slow Poke
Severity: MEDIUM
Recommendation: Control external deps with test doubles or in-memory services selected from the project stack; parallelize only after isolation is verified
Effort: M
3. Conjoined Twins (Unit test without controlled dependencies)
What: Test labeled "Unit" but not mocking dependencies
Detection:
Check if test name includes "Unit"
Verify all dependencies are mocked
If no mocks -> actually Integration test
Severity: LOW
Recommendation: Either mock dependencies OR rename to Integration test
Effort: S
4. Default Value Blindness (Tests with default config)
What: Tests that exercise only default config values, so they still pass when the code ignores its configuration. Use the non-default config rule from references/risk_based_testing_guide.md; load references/risk_based_testing_methodology.md only when examples are needed.
Detection:
Grep for common defaults in test setup: :8080, :3000, 30000, limit: 20, offset: 0
Check if test config values match framework/library defaults
Look for || DEFAULT patterns in source code with matching test values
Severity: HIGH
Effort: S
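A small sketch of why a default-valued test is blind, assuming a config lookup with a || DEFAULT fallback as mentioned in the detection hints. RawConfig, DEFAULT_PORT, and both resolver functions are hypothetical; the buggy variant reads a misspelled key.

```typescript
type RawConfig = { [key: string]: number | undefined };

const DEFAULT_PORT = 8080;

// Buggy: reads a misspelled key, so the configured port is silently ignored.
function resolvePortBuggy(config: RawConfig): number {
  return config["prot"] || DEFAULT_PORT;
}

// Correct: reads the intended key and falls back only when it is missing.
function resolvePort(config: RawConfig): number {
  return config["port"] || DEFAULT_PORT;
}

// A test that configures the default value cannot tell the two apart...
const defaultValuedTestPasses =
  resolvePortBuggy({ port: 8080 }) === 8080 && resolvePort({ port: 8080 }) === 8080;

// ...while a non-default value exposes the bug.
const nonDefaultExposesBug =
  resolvePortBuggy({ port: 9123 }) !== 9123 && resolvePort({ port: 9123 }) === 9123;
```

Both expressions evaluate to true: the default-valued assertion passes against the broken resolver, and only the non-default value distinguishes the broken resolver from the correct one.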
Scoring Algorithm
MANDATORY READ: Load references/audit_scoring.md.
Severity mapping:
Flaky tests, External API not controlled, Default Value Blindness -> HIGH
Real database, File system, Time/Date, Network, Time-dependent assertions, Order-dependent, Shared mutable state, Overlarge shared setup, Slow Poke -> MEDIUM
Random without seed, Conjoined Twins -> LOW
Output Format
MANDATORY READ: Load references/templates/audit_worker_report_template.md.
Write JSON summary per references/audit_summary_contract.md. In managed mode the caller passes both runId and summaryArtifactPath; in standalone mode the worker generates its own run-scoped artifact path per shared contract.
Write report to {output_dir}/ln-635--global.md with category: "Test Trustworthiness" and checks: api_isolation, db_isolation, fs_isolation, time_isolation, random_isolation, network_isolation, flaky_tests, order_dependency, shared_state, default_value_blindness.
Return summary per references/audit_summary_contract.md.
When summaryArtifactPath is absent, write the standalone runtime summary under .hex-skills/runtime-artifacts/runs/{run_id}/evaluation-worker/{worker}--{identifier}.json and optionally echo the same summary in structured output.
Report written: .hex-skills/runtime-artifacts/runs/{run_id}/audit-report/ln-635--global.md
Score: X.X/10 | Issues: N (C:N H:N M:N L:N)
Note: Findings are flattened into a single array. Use the principle field prefix (Isolation / Determinism / Dependency Control) to identify the issue category. Each finding includes action: "REWRITE_FOR_DETERMINISM" or action: "DELETE_IF_LOW_VALUE".
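The flat findings array and the summary line above could be modelled roughly as follows. This is a sketch under assumptions: only the principle and action fields are named in this document, the remaining field names are illustrative, "C" is read as CRITICAL, and the score is taken as an input because the penalty algorithm lives in references/audit_scoring.md.

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
type Action = "REWRITE_FOR_DETERMINISM" | "DELETE_IF_LOW_VALUE";

// Illustrative finding shape; the authoritative schema is in the referenced contracts.
interface Finding {
  principle: string; // starts with "Isolation", "Determinism", or "Dependency Control"
  severity: Severity;
  location: string; // file:line
  effort: "S" | "M" | "L";
  action: Action;
  recommendation: string;
}

// Builds the "Score: X.X/10 | Issues: N (C:N H:N M:N L:N)" line from a flat findings array.
function formatSummary(score: number, findings: Finding[]): string {
  const count = (severity: Severity): number =>
    findings.filter((f) => f.severity === severity).length;
  return (
    `Score: ${score.toFixed(1)}/10 | Issues: ${findings.length} ` +
    `(C:${count("CRITICAL")} H:${count("HIGH")} M:${count("MEDIUM")} L:${count("LOW")})`
  );
}

// Recovers one category from the flat array by principle prefix.
function byPrinciple(findings: Finding[], prefix: string): Finding[] {
  return findings.filter((f) => f.principle.indexOf(prefix) === 0);
}
```

Keeping the array flat and filtering by prefix means a consumer does not need a separate structure for each audit group.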
Critical Rules
Apply the already-loaded references/audit_worker_core_contract.md.
Do not auto-fix: Report only
Effort realism: S = <1h, M = 1-4h, L = >4h
Flat findings: Merge isolation, determinism, and dependency-control findings into a single findings array; use the principle prefix to distinguish them
Context-aware: Supertest with real Express app is acceptable for integration tests
Unique angle: Only audit whether test results can be trusted. Do not evaluate product behavior, E2E journey value, portfolio value, missing coverage, oracle strength, manual evidence, or structure.
Action required: Every finding uses REWRITE_FOR_DETERMINISM unless evidence shows the test is also low-value enough to use DELETE_IF_LOW_VALUE.
Monitor (2.1.98+): For repeated test runs expected >30s each, use Monitor. Fallback: Bash(run_in_background=true).
Definition of Done
Apply the already-loaded references/audit_worker_core_contract.md.
contextStore parsed successfully (including output_dir)
All 3 audit groups completed:
Isolation (6 categories: APIs, DB, FS, Time, Random, Network)
Determinism (4 checks: flaky, time-dependent, order-dependent, shared state)
Dependency control (overlarge shared setup, slow tests, conjoined dependencies, default-value blindness)
Findings collected with severity, location, effort, action, recommendation
Score calculated using penalty algorithm
Report written to {output_dir}/ln-635--global.md (atomic single Write call)
Summary written per contract
Version: 3.0.0
Last Updated: 2025-12-23