Prompt Engineering for Quality Engineering
Most QA teams use AI the same way they Googled things in 2010 — a vague question, a generic answer. This guide changes that. 50+ production-tested prompts for the workflows you actually do every day.
What's inside
- Test case generation prompts — covering functional, edge, negative, and boundary scenarios from acceptance criteria or user stories
- AI-powered failure analysis — prompts to diagnose flaky tests, interpret stack traces, and surface root causes faster
- BDD scenario writing — structured prompts for Gherkin, Given-When-Then, and scenario outline generation
- Exploratory testing charters — generate risk-based charters and session notes with LLMs
- AI-assisted code review — prompts for reviewing test code quality, coverage gaps, and anti-patterns
- Prompt patterns that actually work — chain-of-thought, role prompting, and few-shot techniques adapted for QA contexts
A practitioner's hands-on guide to AI-augmented testing. 50+ production-tested prompts. Real patterns from enterprise QE.
By Pankaj Nakhat · Director of Quality Engineering · 22 years · 25+ teams · Abu Dhabi, UAE
Contents (16 Chapters + Appendix)
Why AI Prompting Is the Most Valuable Skill in QA Right Now
Quality Engineering is going through its biggest shift since Agile. AI and large language models have moved from "interesting experiment" to "serious part of the engineering toolkit" faster than most of us expected. But most QA teams are still using AI the same way they used Google in 2010. Vague question in. Generic answer out. Nothing changes.
This guide is different. Everything in here comes from real work: leading quality, release, and reliability across 25+ product teams in a Spotify-aligned engineering organisation, building multi-agent test automation systems, integrating LLMs into CI/CD pipelines, and evaluating AI outputs against real production defects. The examples are real. The anti-patterns are real. The cost considerations are real.
01 · What Is Prompt Engineering for QA Engineers?
Prompt engineering is the discipline of crafting instructions that reliably get high-quality, useful output from large language models. For QE practitioners, think of it as test design for AI systems: a poorly constructed prompt produces ambiguous, incomplete, or hallucinated output, just as a poorly designed test produces unreliable results.
Anatomy of a Prompt
| Element | What it does | QE Example |
|---|---|---|
| Role / Persona | Sets the model's context and expertise | "You are a senior SDET specialising in REST API contract testing..." |
| Context | Background the model needs | OpenAPI spec, user story, existing test code |
| Task | The specific instruction | "Generate Playwright test cases covering all acceptance criteria..." |
| Format | Structure of expected output | "Output as TypeScript using Page Object Model, with AAA comments" |
| Constraints | Boundaries the model must respect | "Do not use data-testid selectors. Use ARIA roles only." |
| Examples | One or more input/output demos | A sample test showing the exact structure you want |
| Evaluation | How output will be judged | "Each test must have exactly one assertion per logical expectation" |
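As a rough sketch of how the elements in the table above could be assembled in code, the snippet below joins them into one prompt string. The `PromptParts` interface and `buildPrompt` function are hypothetical names invented for illustration, and placing constraints last is one possible ordering, not a rule from the guide.

```typescript
// Hypothetical shape mirroring the table's elements; all names are illustrative.
interface PromptParts {
  role: string;
  context: string;
  task: string;
  format: string;
  constraints: string[];
  examples?: string[];
  evaluation?: string;
}

function buildPrompt(parts: PromptParts): string {
  const sections: string[] = [
    `ROLE:\n${parts.role}`,
    `CONTEXT:\n${parts.context}`,
    `TASK:\n${parts.task}`,
    `FORMAT:\n${parts.format}`,
  ];
  if (parts.examples && parts.examples.length > 0) {
    sections.push(`EXAMPLES:\n${parts.examples.join("\n---\n")}`);
  }
  if (parts.evaluation) {
    sections.push(`EVALUATION:\n${parts.evaluation}`);
  }
  // Constraints go last so they are the most recent thing the model reads.
  const constraintLines = parts.constraints.map((c) => `- ${c}`).join("\n");
  sections.push(`CONSTRAINTS:\n${constraintLines}`);
  return sections.join("\n\n");
}

const prompt = buildPrompt({
  role: "You are a senior SDET specialising in REST API contract testing.",
  context: "User story and acceptance criteria go here.",
  task: "Generate Playwright test cases covering all acceptance criteria.",
  format: "Output as TypeScript using Page Object Model, with AAA comments.",
  constraints: ["Do not use data-testid selectors. Use ARIA roles only."],
  evaluation: "Each test must have exactly one assertion per logical expectation.",
});
console.log(prompt);
```

Keeping the parts typed means a missing role or task fails at compile time rather than producing a silently weaker prompt.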
Core Prompting Strategies
Model Behaviours Every QE Must Know
| Behaviour | What it means for your prompts |
|---|---|
| Recency bias | Put your most important constraint at the END of the prompt, not just the beginning. |
| Sycophancy | The model will agree with incorrect assertions. Prompt it to find problems, not confirm assumptions. |
| Hallucination | Models confidently fabricate API methods and test IDs. Every generated test must be executed before you trust it. |
| Token limits | Long prompts cost more and degrade quality. Inject only the relevant section of a spec, not the whole file. |
| Instruction following | "Must include" outperforms "should include" every time. Be explicit. |
02 · AI Prompts for Test Case Generation from Acceptance Criteria
Effective test generation prompts follow a consistent meta-pattern: ROLE → CONTEXT → TASK → CONSTRAINTS → FORMAT → EVALUATION. High-value patterns include:
Give the model the user story, the acceptance criteria (ACs), and your existing Page Object. Ask it to cover every Given/When/Then, use the Page Object Model (POM), follow the Arrange-Act-Assert (AAA) pattern, and flag any untestable ACs.
Provide structured field specs (type, min, max, required). For each field: minimum valid, maximum valid, one below minimum, one above maximum, null/missing, type mismatch.
Prompt the model to act as "a hostile user trying to break this API." Cover invalid data types, boundary violations, malformed JSON, SQLi/XSS payloads, race conditions, and authentication bypass attempts.
Feed the PR diff and existing test inventory. Output JSON: critical_tests[], smoke_suite[], coverage_gaps[]. Identify tests covering changed code paths and transitively dependent code.
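The boundary-value pattern above (minimum valid, maximum valid, one below, one above, null/missing, type mismatch) is deterministic, so it can also be computed directly and used to check that the model's output covers every case. The sketch below assumes integer fields with a step of 1. `NumericFieldSpec`, `BoundaryCase`, and `boundaryCases` are hypothetical names.

```typescript
// Hypothetical field spec; assumes an integer field, so "one below" means min - 1.
interface NumericFieldSpec {
  name: string;
  min: number;
  max: number;
  required: boolean;
}

interface BoundaryCase {
  label: string;
  value: unknown;
  expectValid: boolean;
}

function boundaryCases(spec: NumericFieldSpec): BoundaryCase[] {
  return [
    { label: "minimum valid", value: spec.min, expectValid: true },
    { label: "maximum valid", value: spec.max, expectValid: true },
    { label: "one below minimum", value: spec.min - 1, expectValid: false },
    { label: "one above maximum", value: spec.max + 1, expectValid: false },
    // A missing value is only invalid when the field is required.
    { label: "null/missing", value: null, expectValid: !spec.required },
    { label: "type mismatch", value: "not-a-number", expectValid: false },
  ];
}

const quantityCases = boundaryCases({ name: "quantity", min: 1, max: 99, required: true });
for (const c of quantityCases) {
  console.log(`${c.label}: ${JSON.stringify(c.value)} (valid: ${c.expectValid})`);
}
```

A list like this can be pasted into the prompt as the expected case inventory, or compared against the generated tests as a coverage check.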
03 · How to Generate API Tests with AI Prompts
Extract only the endpoint you need, not the whole spec. Generate complete Supertest/Jest suites covering all documented response codes, schema validation, auth flows (valid/expired/missing token), and Zod contract validation.
Generate auth helper modules for OAuth2 client_credentials and authorization_code flows with token caching, pre-expiry refresh, exponential backoff, and test-isolation pools.
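To make "token caching with pre-expiry refresh" concrete, here is a minimal sketch of the kind of helper such a prompt might produce. It is an assumption-laden illustration: `createTokenCache`, the 60-second default margin, and the injected `fetchToken` and `now` functions are all hypothetical. Exponential backoff, concurrent-request deduplication, and the actual OAuth2 HTTP call are deliberately left out.

```typescript
interface CachedToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry time, epoch milliseconds
}

type TokenFetcher = () => Promise<CachedToken>;

// Returns a getter that reuses the token until it is within
// refreshMarginMs of expiry, then fetches a fresh one.
function createTokenCache(
  fetchToken: TokenFetcher,
  refreshMarginMs: number = 60000,
  now: () => number = () => Date.now()
): () => Promise<string> {
  let cached: CachedToken | null = null;
  return async () => {
    if (cached === null || now() >= cached.expiresAtMs - refreshMarginMs) {
      cached = await fetchToken();
    }
    return cached.accessToken;
  };
}

// Demo with a simulated clock and a fake token endpoint (no network calls).
async function demoTokenCache(): Promise<string[]> {
  let clock = 0;
  let fetchCount = 0;
  const getToken = createTokenCache(
    async () => {
      fetchCount += 1;
      return { accessToken: `token-${fetchCount}`, expiresAtMs: clock + 300000 };
    },
    60000,
    () => clock
  );
  const seen: string[] = [];
  seen.push(await getToken()); // first call fetches
  clock = 200000;
  seen.push(await getToken()); // outside the refresh margin: served from cache
  clock = 250000;
  seen.push(await getToken()); // inside the 60 s margin: refreshed early
  return seen;
}

demoTokenCache().then((seen) => console.log(seen));
```

Injecting the clock keeps the helper testable without real waits, which matters when the generated helper itself needs regression tests.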
Dedicated prompts per validation type: null handling, enum values, nested objects, array fields, date/time formats. Each prompt targets one validation concern for maximum specificity.
04 · AI Prompts for Playwright UI and E2E Test Generation
UI test generation requires maximum structural context: DOM snapshots, component docs, or accessibility trees. Don't ask the model to guess what your UI looks like.
Feed the React component. Generate TypeScript POM with ARIA-first locators (getByRole, getByLabel, getByText), methods per user interaction (not raw locators), JSDoc comments. No CSS selectors, XPath, or nth-child.
Feed journey steps and existing POMs. Set up test data via API, execute steps using POMs, assert after each critical step (not just final state), tear down via API, tag @e2e and @feature.
Target WCAG 2.1 Level AA: keyboard navigation, focus indicators, ARIA labels, heading hierarchy, and color contrast flagging. Use @axe-core/playwright for automated checks, supplemented by comments marking the checks that need manual review.
05 · AI-Assisted Performance Testing and OWASP Security Prompts
Design five test types: Baseline (10 users, 5 min), Load (ramp to peak over 10 min, hold 20 min), Stress (exceed peak by 50%), Spike (sudden 10x traffic for 1 min), and Soak (80% load for 4 hours to detect memory leaks).
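One way to keep these numbers consistent across prompts is to derive them from a single peak figure and inject the result as context. The sketch below is illustrative only: `LoadProfile` and `buildLoadProfiles` are hypothetical names, and it assumes the spike multiplier and the soak percentage are both relative to peak users, which the text does not state. The stress hold time is left as `null` because no duration is given.

```typescript
interface LoadProfile {
  name: string;
  targetUsers: number;
  rampMin?: number;       // minutes to ramp up, only where the text specifies one
  holdMin: number | null; // minutes at target; null where the text gives no duration
}

// Assumption: spike (10x) and soak (80%) are computed relative to peakUsers.
function buildLoadProfiles(peakUsers: number): LoadProfile[] {
  return [
    { name: "baseline", targetUsers: 10, holdMin: 5 },
    { name: "load", targetUsers: peakUsers, rampMin: 10, holdMin: 20 },
    { name: "stress", targetUsers: Math.round(peakUsers * 1.5), holdMin: null },
    { name: "spike", targetUsers: peakUsers * 10, holdMin: 1 },
    { name: "soak", targetUsers: Math.round(peakUsers * 0.8), holdMin: 240 },
  ];
}

const profiles = buildLoadProfiles(200);
console.log(JSON.stringify(profiles, null, 2));
```

Passing the serialised profiles to the model, rather than describing them in prose, leaves less room for it to invent its own durations or user counts.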
For authorised security testing only. Generate test cases covering: broken object-level authorisation (BOLA), broken authentication (expired/malformed JWTs), broken property-level authorisation (mass assignment), unrestricted resource consumption (missing rate limits), and injection (SQLi/NoSQLi).
06 · Building Multi-Agent QA Systems with LLMs
Multi-agent architectures solve single-agent limitations by decomposing the QA workflow into specialised, collaborating agents.
Approval Gates
- Spec Review Gate: Analyst output reviewed before Writer runs
- Diff Review Gate: Healer fixes reviewed before merge
- Data Gate: Any test creating/modifying data requires human approval
- Commit Gate: Generated tests go to feature branch pending PR review
07 · How to Test AI Features: LLM Evaluation for QA Teams
When your organisation ships AI-powered features, the QE team must own the quality gate for LLM outputs. Hallucination, relevance drift, and toxicity are production defects. Treat the LLM as a system under test.
Adversarial testing vectors: prompt injection, jailbreaking (roleplay framings), data extraction, scope violation, indirect injection via RAG documents.
08 · Automating QA in CI/CD Pipelines with AI Prompt Engineering
| Stage | LLM Assistance |
|---|---|
| PR Created | Analyse diff, suggest relevant tests |
| Build | Generate missing unit tests for new functions |
| Test Execution | Classify failures as PRODUCT_BUG, TEST_BUG, ENVIRONMENT, or FLAKY |
| Code Review | Comment on testability of new code, suggest refactors |
| Release Gate | Synthesise results into go/no-go risk assessment |
Failure Triage Prompt Output
JSON: { classification, confidence, rationale, nextAction }
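Because a pipeline will act on this JSON, it is worth validating before use. The following is a minimal sketch of such a check. `parseTriage` is a hypothetical helper, and the assumption that `confidence` is a number between 0 and 1 is mine, not something the guide specifies.

```typescript
const CLASSIFICATIONS = ["PRODUCT_BUG", "TEST_BUG", "ENVIRONMENT", "FLAKY"] as const;
type Classification = (typeof CLASSIFICATIONS)[number];

interface TriageResult {
  classification: Classification;
  confidence: number; // assumed to be in the range 0..1
  rationale: string;
  nextAction: string;
}

// Parses the model's raw reply and rejects anything that does not match the contract.
function parseTriage(raw: string): TriageResult {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Triage output is not a JSON object");
  }
  const record = data as Record<string, unknown>;
  const rawClassification = record["classification"];
  const confidence = record["confidence"];
  const rationale = record["rationale"];
  const nextAction = record["nextAction"];

  const classification = CLASSIFICATIONS.find((c) => c === rawClassification);
  if (classification === undefined) {
    throw new Error(`Unknown classification: ${String(rawClassification)}`);
  }
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    throw new Error("confidence must be a number between 0 and 1");
  }
  if (typeof rationale !== "string" || typeof nextAction !== "string") {
    throw new Error("rationale and nextAction must be strings");
  }
  return { classification, confidence, rationale, nextAction };
}

const triage = parseTriage(
  '{"classification":"FLAKY","confidence":0.82,' +
    '"rationale":"Passed on retry in 3 of the last 5 runs",' +
    '"nextAction":"Quarantine and open a ticket"}'
);
console.log(triage.classification, triage.confidence);
```

Rejecting malformed output at this boundary means a hallucinated category can never open a Jira ticket or trigger an automated fix.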
09 · Controlling AI Token Costs and Prompt Governance in QA
A prompt is code. It belongs in source control with a changelog, reviews, and regression tests. Structure the prompt library as a typed TypeScript module consumed by all teams.
Cost controls: Context compression · Response caching · Model tiering (smaller models for structured tasks) · Batch processing (Azure Batch API: 50% cost reduction).
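Of the cost controls listed, response caching is the simplest to sketch. The wrapper below is an in-memory illustration under stated assumptions: `CompleteFn` stands in for whatever function calls your model, and `withResponseCache` is a hypothetical name. A real setup would likely need a shared, persistent store to be useful across CI runs.

```typescript
// Hypothetical signature for "call the model with a prompt and its context".
type CompleteFn = (prompt: string, context: string) => Promise<string>;

function withResponseCache(complete: CompleteFn): CompleteFn {
  const cache = new Map<string, Promise<string>>();
  return (prompt, context) => {
    const key = JSON.stringify([prompt, context]);
    const cached = cache.get(key);
    if (cached !== undefined) {
      return cached;
    }
    const pending = complete(prompt, context);
    cache.set(key, pending);
    // Do not keep failed calls in the cache, so a later retry can succeed.
    pending.catch(() => {
      cache.delete(key);
    });
    return pending;
  };
}

// Demo with a fake model that counts how often it is actually called.
async function demoResponseCache(): Promise<number> {
  let apiCalls = 0;
  const cachedComplete = withResponseCache(async (prompt) => {
    apiCalls += 1;
    return `response to: ${prompt}`;
  });
  await cachedComplete("Classify this failure", "stack trace A");
  await cachedComplete("Classify this failure", "stack trace A"); // cache hit
  await cachedComplete("Classify this failure", "stack trace B"); // different context
  return apiCalls;
}

demoResponseCache().then((calls) => console.log(`API calls made: ${calls}`));
```

Caching the promise rather than the resolved string also collapses identical requests that arrive while the first one is still in flight.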
10 · Advanced Prompting: RAG, Self-Correction, and Chaining for QA
Query the vector DB for the top-K most similar existing tests and inject them as context. Instruct: "Do NOT duplicate these. Fill the coverage gaps they leave." This reduces redundancy without stuffing the entire knowledge base into the prompt.
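The retrieval step can be illustrated without a real vector database. The sketch below ranks a few in-memory tests by cosine similarity and builds the prompt from the top K. The two-dimensional embeddings are toy values chosen for readability, and `IndexedTest`, `topKSimilar`, and `buildRagPrompt` are hypothetical names. In practice the vector DB performs the ranking.

```typescript
interface IndexedTest {
  title: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Stand-in for the vector DB query: sort a copy by similarity, keep the first k.
function topKSimilar(query: number[], tests: IndexedTest[], k: number): IndexedTest[] {
  return tests
    .slice()
    .sort((x, y) => cosineSimilarity(query, y.embedding) - cosineSimilarity(query, x.embedding))
    .slice(0, k);
}

function buildRagPrompt(task: string, similar: IndexedTest[]): string {
  const existing = similar.map((t) => `- ${t.title}`).join("\n");
  return [
    task,
    "EXISTING TESTS (most similar first):",
    existing,
    "Do NOT duplicate these. Fill the coverage gaps they leave.",
  ].join("\n\n");
}

const indexedTests: IndexedTest[] = [
  { title: "login succeeds with valid credentials", embedding: [1, 0] },
  { title: "login fails with wrong password", embedding: [0.6, 0.8] },
  { title: "profile page shows avatar", embedding: [0, 1] },
];
const nearest = topKSimilar([1, 0], indexedTests, 2);
const ragPrompt = buildRagPrompt("Generate tests for the login story.", nearest);
console.log(ragPrompt);
```

Only the two login tests reach the prompt here; the unrelated profile test stays out, which is the point of retrieving instead of injecting everything.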
After generating, instruct the model to review against a checklist: async/await correctness, no hardcoded data, one logical assertion group, ARIA-first selectors, no cross-test dependencies. "If any criterion fails, revise before outputting."
5-step chain: (1) Dependency Mapping → (2) State Design → (3) Test Code Generation → (4) Review against Quality Constitution → (5) BDD documentation generation. Each step uses the previous output as input.
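The chaining idea, where each step consumes the previous step's output, can be sketched as a small runner. This is an illustration under assumptions: `ChainModel` is a hypothetical stand-in for the model call, the step instructions are placeholder wording, and the demo uses a fake model so it runs without any API.

```typescript
// Hypothetical signature for a single model call.
type ChainModel = (prompt: string) => Promise<string>;

interface ChainStep {
  name: string;
  instruction: string;
}

// Step names follow the five-step chain; the instruction text is illustrative.
const QA_CHAIN: ChainStep[] = [
  { name: "Dependency Mapping", instruction: "List the dependencies this feature touches." },
  { name: "State Design", instruction: "Design the test states needed for these dependencies." },
  { name: "Test Code Generation", instruction: "Generate test code for these states." },
  { name: "Quality Review", instruction: "Review this code against the Quality Constitution and revise." },
  { name: "BDD Documentation", instruction: "Write BDD documentation for the reviewed tests." },
];

async function runChain(
  steps: ChainStep[],
  initialInput: string,
  model: ChainModel
): Promise<string[]> {
  const outputs: string[] = [];
  let input = initialInput;
  for (const step of steps) {
    const output = await model(`${step.instruction}\n\nINPUT:\n${input}`);
    outputs.push(output);
    input = output; // the next step receives this step's output
  }
  return outputs;
}

// Demo with a fake model that records the prompts it receives.
async function demoChain(): Promise<{ prompts: string[]; outputs: string[] }> {
  const prompts: string[] = [];
  const fakeModel: ChainModel = async (prompt) => {
    prompts.push(prompt);
    return `output-${prompts.length}`;
  };
  const outputs = await runChain(QA_CHAIN, "User story: password reset", fakeModel);
  return { prompts, outputs };
}

demoChain().then((result) => console.log(result.outputs));
```

Keeping every intermediate output makes it possible to inspect or gate any step, for example pausing for human review after the Quality Review step.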
11 · 10 Dangerous AI Prompting Mistakes QA Teams Make
12 · How to Build a Prompt Engineering Practice in Your QA Team
- Identify 3 high-value, low-risk use cases (API test generation from specs is a great start)
- Select 2–3 pilot engineers with strong testing fundamentals
- Establish shared prompt library repository with version control
- Create first 3 team-standard prompt templates
- Integrate failure triage prompting into CI/CD for pilot teams
- Establish token cost monitoring dashboard
- Run prompt regression tests for first 10 production prompts
- Retrospective: measure compilation rate, execution rate, coverage improvement
- Roll out to all teams with training and shared prompt library access
- Implement approval gates for AI-generated tests in CRITICAL risk areas
- Launch internal prompt engineering community of practice (bi-weekly 30-minute session)
- Set 6-month targets: test generation time reduction, coverage improvement
| KPI | Target (6 months) |
|---|---|
| Test authoring time | 50% reduction |
| Test coverage (new features) | +10 percentage points |
| Defect escape rate | 20% reduction |
| Time to triage CI failure | 30 min → 5 min (AI-assisted) |
| Prompt quality score (compile + run) | 90%+ within 90 days |
| AI testing cost per sprint | <$150 per team |
13 · Using Claude Code for Agentic QA Automation
Claude Code is Anthropic's agentic coding assistant that operates directly in your terminal, reads and writes files in your repository, executes commands, and runs tests. For QE teams, this is a qualitative shift from chat-based prompt engineering to agentic automation.
CLAUDE.md essentials for QE teams: Framework version, CI platform, auth type · Selector strategy (ARIA-first, never CSS/XPath) · Test data rules (always use DataFactory) · Test independence requirements · Approval gates for @critical tests.
14 · Building QA Automation Agents with the Anthropic Claude API
A Claude-powered QA agent uses the Anthropic API in a loop: call Claude with a task and tools → Claude calls a tool or produces a final answer → execute the tool call and return the result → continue until task completion.
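The loop described above can be sketched independently of any SDK. The code below does not use the Anthropic API: `AgentModel`, `ModelTurn`, and `runAgent` are hypothetical stand-ins that only show the control flow, and the demo swaps in a scripted fake model and a fake `run_tests` tool. The `maxTurns` cap is my addition to keep the loop bounded.

```typescript
// Hypothetical types; a real integration would map these onto the SDK's messages.
type ModelTurn =
  | { kind: "tool_call"; tool: string; input: string }
  | { kind: "final"; answer: string };

interface AgentMessage {
  role: "user" | "assistant" | "tool";
  content: string;
}

type AgentModel = (history: AgentMessage[]) => Promise<ModelTurn>;
type AgentTool = (input: string) => Promise<string>;

async function runAgent(
  task: string,
  model: AgentModel,
  tools: Map<string, AgentTool>,
  maxTurns: number = 10
): Promise<string> {
  const history: AgentMessage[] = [{ role: "user", content: task }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const step = await model(history);
    if (step.kind === "final") {
      return step.answer;
    }
    const tool = tools.get(step.tool);
    const result =
      tool === undefined
        ? `ERROR: unknown tool "${step.tool}"`
        : await tool(step.input);
    // Record the call and its result so the model sees them on the next turn.
    history.push({ role: "assistant", content: `call ${step.tool}(${step.input})` });
    history.push({ role: "tool", content: result });
  }
  throw new Error(`Agent did not finish within ${maxTurns} turns`);
}

async function demoAgent(): Promise<string> {
  const tools = new Map<string, AgentTool>([
    ["run_tests", async (suite) => `2 passed, 0 failed in ${suite}`],
  ]);
  // Scripted model: call the tool once, then answer using the tool result.
  const scriptedModel: AgentModel = async (history) => {
    const last = history[history.length - 1];
    if (last.role === "tool") {
      return { kind: "final", answer: `Suite result: ${last.content}` };
    }
    return { kind: "tool_call", tool: "run_tests", input: "smoke" };
  };
  return runAgent("Run the smoke suite and report.", scriptedModel, tools);
}

demoAgent().then((answer) => console.log(answer));
```

Returning an error string for unknown tools, instead of throwing, lets the model see its mistake and correct course on the next turn.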
The Xray BDD Generator agent fetches the story from Jira, checks existing Xray tests to avoid duplication, generates Gherkin (happy path + negative + boundary), creates each scenario linked to the story, and reports coverage and untestable ACs.
The Healer agent monitors CI for selector-related failures. Its scope is strict: it may only change locators, never assertions or test logic. It reads the current DOM via a snapshot tool before fixing and outputs a unified diff, which a human reviews before any merge. If the correct selector is unclear, it outputs CANNOT_HEAL plus a reason.
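The locator-only scope can be partly enforced mechanically before a human ever sees the diff. The check below is a rough heuristic rather than a real parser: `healerScopeViolations` is a hypothetical helper that flags any changed line in a unified diff that lacks a locator call or that touches an `expect(` assertion. The list of locator method names is an assumption and would need to match your codebase.

```typescript
// Heuristic patterns; extend the locator list to match your own Page Objects.
const LOCATOR_CALL = /\b(getByRole|getByLabel|getByText|locator)\(/;
const ASSERTION_CALL = /\bexpect\(/;

// Returns the changed lines that fall outside the Healer's allowed scope.
function healerScopeViolations(unifiedDiff: string): string[] {
  return unifiedDiff
    .split("\n")
    // Keep added/removed lines, skip the "+++" and "---" file headers.
    .filter((line) => /^[+-]/.test(line) && !/^(\+\+\+|---)/.test(line))
    .filter((line) => {
      const code = line.slice(1);
      if (code.trim() === "") {
        return false; // blank-line changes are harmless
      }
      return !LOCATOR_CALL.test(code) || ASSERTION_CALL.test(code);
    });
}

const locatorOnlyDiff = [
  "--- a/pages/search.page.ts",
  "+++ b/pages/search.page.ts",
  "@@ -10,3 +10,3 @@",
  "-    this.searchButton = page.getByRole('button', { name: 'Search' });",
  "+    this.searchButton = page.getByRole('button', { name: 'Find' });",
].join("\n");

const assertionDiff = [
  "--- a/tests/search.spec.ts",
  "+++ b/tests/search.spec.ts",
  "@@ -20,3 +20,3 @@",
  "-    await expect(page.getByText('3 results')).toBeVisible();",
  "+    await expect(page.getByText('results')).toBeVisible();",
].join("\n");

console.log(healerScopeViolations(locatorOnlyDiff).length); // 0
console.log(healerScopeViolations(assertionDiff).length); // 2
```

A non-empty result would be grounds to reject the diff automatically. An empty result still goes to the human reviewer, since a heuristic like this cannot prove the change is safe.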
15 · Production Playwright TypeScript Patterns for Enterprise QE
Patterns from enterprise QE teams: ARIA-first selectors, fixture-based test isolation, API-driven setup/teardown, typed Page Objects.
Typed Page Objects: the constructor assigns typed Locators via getByRole with accessible names. Methods expose interactions, not raw locators. Search methods wait for the spinner to be hidden before returning.
Fixtures: the token cache is scoped to the worker (not the test) for performance. The authenticated page injects a Bearer header. The test item fixture creates the item via the API, yields it, then always deletes it, so cleanup is guaranteed even on test failure.
fullyParallel: true · 2 retries in CI · 4 workers in CI · HTML + JUnit + playwright-spec-doc-reporter · projects: Chromium Desktop + Pixel 7 Mobile · globalSetup for shared state.
16 · The End-to-End AI-Augmented Quality Engineering Workflow
| Stage | Who / What | Action |
|---|---|---|
| 1. Story grooming | Claude Code + CLAUDE.md | Reads story, flags untestable ACs, asks PO for clarification |
| 2. Analyst Agent | Xray BDD Generator | Fetches story, checks duplicates, generates Gherkin |
| 3. Approval Gate | QA Lead (human) | Reviews Analyst JSON. Nothing proceeds without sign-off. |
| 4. Writer Agent | Claude Code subagent | Reads approved plan and POMs, generates Playwright spec, runs tests |
| 5. PR Review | Claude Code + GitHub MCP | Auto-generates review comments on selector quality, missing negatives |
| 6. CI Execution | Azure DevOps + Sentinel | PRODUCT_BUG = auto Jira ticket; TEST_BUG = Healer proposes diff |
| 7. Healer Gate | QA Lead (human) | Reviews Healer diff. Approves or rejects. Diffs never auto-applied. |
| 8. Release Gate | Oracle Agent | Synthesises sprint results + DORA metrics into go/no-go risk report |
The 12 Golden Rules
- Run everything. No AI-generated code goes to production without being executed.
- Version your prompts. A prompt is code. Belongs in source control with a changelog.
- Measure output quality. Track compilation rate, execution rate, defect detection rate per prompt.
- Keep humans at gates. AI generates. Humans approve for anything CRITICAL, data-destructive, or regulatory.
- Never inject PII. Anonymise all data before it enters any external LLM API call.
- Tier your models. Save expensive models for complex reasoning.
- Cache aggressively. Identical prompt and context must never call the API twice in the same sprint.
- Scope your Healer. Self-healing agents touch only locators. Assertions and logic are human territory.
- Govern your agents. Every agent needs a written constitution it cannot violate.
- Monitor drift. LLM output quality changes with model updates. Run prompt regression before upgrading.
- Own the cost. AI token spend is an engineering cost. Track it, budget it, optimise it.
- Invest in CLAUDE.md. Time spent writing team standards there compounds across every engineer and session.
Appendix: QA Prompt Engineering Quick-Reference Templates
ROLE · STORY · ACs · POM → Playwright TypeScript tests covering all ACs (tagged with AC ID), AAA pattern, ARIA-first selectors, independent, 1 happy + 2 negative + 1 boundary per AC. Flag untestable ACs.
METHOD + PATH + endpoint spec only + auth type → Supertest/Jest tests for all response codes, schema validation, auth (valid/expired/missing), type violations. Compilable TypeScript only. No explanations.
Test name + error + stack trace + last 5 results + recent commits → PRODUCT_BUG | TEST_BUG | ENVIRONMENT | FLAKY. Output JSON: { classification, confidence, rationale, nextAction }
"Quality Engineering has always been the discipline that asks the hard questions before the user has to. With prompt engineering in your toolkit, the scope of those questions has expanded dramatically."
— Pankaj Nakhat
Get the Full 34-Page PDF
All 50+ production-tested prompts, code examples, and the complete prompt engineering checklist — formatted for printing and team distribution.