Definition: Engineering practice where every time an agent makes a mistake, a solution is designed to ensure the agent never makes that mistake again. Concept popularized by Mitchell Hashimoto in Ghostty development.
— Source: NERVICO, Product Development Consultancy
Harness Engineering
Definition
Harness Engineering is the engineering practice where, every time an AI agent makes a mistake, the engineer takes the time to design a solution that ensures the agent never makes that specific mistake again. Instead of simply correcting the error manually, infrastructure (guardrails, tests, validations, constraints) is built to prevent its recurrence.

The concept was popularized by Mitchell Hashimoto (co-founder of HashiCorp) during the development of Ghostty, his terminal emulator project, where he documented his experience working intensively with AI agents.

Harness metaphor: like a climber's safety harness, the engineering "harness" protects the agent from falls, allowing it to work at greater heights (complexity) without risk.

Core philosophy: don't fix the error, fix the system that allowed the error.
Why It Matters
Exponential continuous improvement: each error corrected through harness engineering permanently increases the agent's capability. In 3 months of working with agents, Hashimoto built a harness so robust that agents could handle tasks that initially required continuous supervision.

Agent scalability: without harness engineering, each new agent makes the same mistakes. With harness engineering, each agent inherits the accumulated protections, dramatically reducing training time and error rate.

Compound ROI: the initial investment in a harness (2-4 hours per error) pays dividends every time the agent handles a similar task. Instead of supervising 100 future tasks, you invest once and automate.

Mindset shift: harness engineering changes the engineer's role from "code writer" to "system designer". You don't write the code; you design the constraints within which the agent can work safely.
Real Examples
Ghostty Development (Mitchell Hashimoto)
Context: Mitchell Hashimoto built Ghostty, a modern terminal emulator, using AI agents intensively with harness engineering.

Typical errors and the harnesses created for them:

Error 1: Agent modifies critical files without tests
- Harness: Pre-commit hook that blocks commits below 80% test coverage
- Result: Agent is forced to write tests before changes

Error 2: Agent introduces breaking changes in the public API
- Harness: API contract tests with automatic semantic versioning
- Result: CI/CD fails if the API changes without a version bump

Error 3: Agent generates code that doesn't compile
- Harness: GitHub Actions runs the build on every push
- Result: Agent receives immediate feedback and self-corrects

Progress over 3 months:
- Month 1: Continuous supervision, 40% of commits require correction
- Month 2: Reduced supervision, 15% of commits require correction
- Month 3: Occasional supervision, 3% of commits require correction
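The first harness above, a pre-commit coverage gate, can be sketched in a few lines. This is an illustrative stand-in, not code from the Ghostty repository; the `CoverageSummary` shape loosely mirrors the "total" entry of a Jest coverage-summary.json file:

```typescript
// Illustrative pre-commit coverage gate (hypothetical, not from Ghostty).
// `CoverageSummary` loosely mirrors Jest's coverage-summary.json "total" entry.
interface CoverageSummary {
  lines: { pct: number };
  branches: { pct: number };
}

// Returns true only when both line and branch coverage meet the bar,
// so a pre-commit hook can block the commit otherwise.
function coverageGate(summary: CoverageSummary, minPct = 80): boolean {
  return summary.lines.pct >= minPct && summary.branches.pct >= minPct;
}

// 85% lines but only 78% branches: the commit is blocked.
console.log(coverageGate({ lines: { pct: 85 }, branches: { pct: 78 } })); // → false
```

In a real setup, a pre-commit hook would read the coverage report produced by the test runner and exit non-zero when the gate returns false.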
E-commerce API Development
Context: An e-commerce startup using Devin to build REST APIs.

Recurring error: the agent didn't validate input correctly.

Implemented harness:

```typescript
import { z } from 'zod';

// Mandatory Zod schema for all endpoints
const productSchema = z.object({
  name: z.string().min(3).max(100),
  price: z.number().positive(),
  stock: z.number().int().nonnegative(),
});

// Middleware that rejects requests whose endpoint has not attached a
// validation schema (`app` is the Express application)
app.use((req, res, next) => {
  if (!req.validationSchema) {
    throw new Error('Endpoint must define validation schema');
  }
  next();
});
```

Result: after implementing the harness, 0 input validation vulnerabilities in 6 months vs 12-15 per month previously.
Fintech Payment Processing
Context: An agent implementing payment logic with Stripe.

Critical error: the agent processed refunds without verifying payment state.

Implemented harness:
- Explicit state machine for the payment lifecycle
- Property-based testing (QuickCheck-style)
- Mandatory sandbox for Stripe API calls in development
- Human approval required for any code touching production Stripe keys

Result: zero payment bugs in production in 8 months post-harness vs 3 incidents in 2 months pre-harness.
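The state-machine harness can be sketched as follows. All state names and the `transition` helper are illustrative assumptions, not Stripe's actual API; the point is that an out-of-order refund becomes impossible rather than merely discouraged:

```typescript
// Illustrative payment state machine; state names are hypothetical, not Stripe's.
type PaymentState = "created" | "authorized" | "captured" | "refunded" | "failed";

// The only legal transitions out of each state.
const allowed: Record<PaymentState, PaymentState[]> = {
  created: ["authorized", "failed"],
  authorized: ["captured", "failed"],
  captured: ["refunded"],
  refunded: [],
  failed: [],
};

function transition(from: PaymentState, to: PaymentState): PaymentState {
  if (!allowed[from].includes(to)) {
    // The harness: refunding an unverified payment throws instead of executing.
    throw new Error(`Illegal payment transition: ${from} -> ${to}`);
  }
  return to;
}

transition("captured", "refunded");   // legal: refund a captured payment
// transition("created", "refunded"); // throws: the original bug, now impossible
```

The agent can still write buggy refund logic, but the harness converts a silent money-losing bug into an immediate, loud failure.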
How to Implement Harness Engineering
1. Identify Error Pattern
When the agent makes an error, ask yourself:
- Is this error repeatable?
- Could it occur in similar contexts?
- What type of error is it? (logic, security, performance, tests)
2. Design the Harness
Harness options by error type:

Security errors:
- Linters with custom rules (ESLint, Semgrep)
- SAST tools in CI/CD (SonarQube, Snyk)
- Secret scanning (GitGuardian)

Logic errors:
- Property-based testing
- Contract testing between services
- Mutation testing to validate test quality

Performance errors:
- Performance budgets in CI/CD
- Lighthouse CI with thresholds
- Automatic load testing

API design errors:
- OpenAPI schema validation
- Breaking change detection
- Enforced API versioning
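Property-based testing, listed above for logic errors, can be approximated even without a framework like fast-check or Hypothesis: generate many random inputs and assert an invariant that must always hold. A minimal sketch, where `applyDiscount` is a hypothetical function under test:

```typescript
// Hypothetical function under test: applies a percentage discount.
function applyDiscount(price: number, pct: number): number {
  return price * (1 - pct / 100);
}

// Hand-rolled property-based check, a stand-in for fast-check/Hypothesis:
// random inputs, one invariant that must hold for all of them.
function checkDiscountProperty(runs = 1000): void {
  for (let i = 0; i < runs; i++) {
    const price = Math.random() * 1000;
    const pct = Math.random() * 100;
    const result = applyDiscount(price, pct);
    // Invariant: a discount never yields a negative price or one above the original.
    if (result < 0 || result > price) {
      throw new Error(`Property violated for price=${price}, pct=${pct}`);
    }
  }
}

checkDiscountProperty(); // silent when the invariant holds for every run
```

Unlike example-based tests, the agent cannot satisfy this check by hard-coding the handful of cases it saw in the prompt.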
3. Automate the Harness
The harness must execute automatically:
- Pre-commit hooks (local code)
- CI/CD pipelines (remote code)
- Deployment gates (production)
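A deployment gate can be as simple as a script that aggregates harness results and refuses to proceed if any check failed. A minimal sketch; the `Check` shape and the check names are assumptions, not a real CI API:

```typescript
// Illustrative deployment gate: each entry would come from a real harness
// (linter, contract tests, secret scan) in an actual CI pipeline.
type Check = { name: string; passed: boolean };

function deploymentGate(checks: Check[]): void {
  const failed = checks.filter((c) => !c.passed).map((c) => c.name);
  if (failed.length > 0) {
    // Blocking here means a failing harness can never reach production.
    throw new Error(`Deployment blocked by failing harnesses: ${failed.join(", ")}`);
  }
}

deploymentGate([
  { name: "lint", passed: true },
  { name: "contract-tests", passed: true },
]); // passes silently; the deploy may proceed
```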
4. Iterate and Refine
Monitor harness effectiveness:
- Is agent still making similar errors?
- Does harness generate false positives?
- Does it need to be stricter or more flexible?
Relationship with Slam Dunk Tasks
Harness engineering enables Slam Dunk Tasks: once you've built sufficient harnesses around a type of task, you can delegate that task completely to the agent with confidence it won't fail.

Typical progression:
- New task → 100% supervision
- Harness built → 50% supervision
- Harness refined → 10% supervision
- Task becomes Slam Dunk → 0% supervision
Tools and Technologies
Linters and Validators:
- ESLint / Prettier (JavaScript/TypeScript)
- Ruff / Black (Python)
- Clippy (Rust)
- Custom rules via AST parsing

Testing Frameworks:
- Jest / Vitest (unit tests)
- Playwright (E2E tests)
- Hypothesis / QuickCheck (property-based)

CI/CD Harnesses:
- GitHub Actions with custom actions
- Pre-commit framework
- Husky (git hooks)
- Danger (PR automation)

Security Harnesses:
- Semgrep (SAST)
- Snyk / Dependabot (dependencies)
- GitGuardian (secrets)
- OWASP ZAP (DAST)
Related Terms
- Slam Dunk Tasks - Tasks that agents can execute with high confidence
- Agentic Coding - Development where agents execute code autonomously
- Agent-Ops - Role that designs and maintains harnesses for agents
- Auto-Healing - Systems that self-repair when detecting problems
Challenges and Considerations
Over-engineering the harness: not every error needs a harness. If an error occurs once and is trivial to fix, don't build complex infrastructure around it.

Harness maintenance: harnesses require maintenance. When the codebase evolves, the harnesses must evolve too. Budget 10-15% of engineering time for harness maintenance.

Balance with speed: strict harnesses can slow initial development. For MVPs, consider minimal harnesses (security + critical bugs) and expand later.

False sense of security: harness engineering reduces errors but doesn't eliminate them. Maintain code reviews for critical architectural decisions.
Additional Resources
- Mitchell Hashimoto: My AI Adoption Journey
- Agentic Engineering in Action — Zed’s Blog
- The Agent Harness — Michael Livs
- AI Agents in Production: The Harness Dissected
Last updated: February 2026
Category: AI Development
Popularized by: Mitchell Hashimoto (HashiCorp, Ghostty)
Related to: Slam Dunk Tasks, Agentic Coding, Agent-Ops, Continuous Improvement
Keywords: harness engineering, mitchell hashimoto, agent improvement, agentic coding, ghostty, ai agent frameworks, agent guardrails