Free PDF Guide

Prompt Engineering for Quality Engineering

Most QA teams use AI the same way they Googled things in 2010 — a vague question, a generic answer. This guide changes that. 50+ production-tested prompts for the workflows you actually do every day.

What's inside

Test case generation prompts — covering functional, edge, negative, and boundary scenarios from acceptance criteria or user stories
AI-powered failure analysis — prompts to diagnose flaky tests, interpret stack traces, and surface root causes faster
BDD scenario writing — structured prompts for Gherkin, Given-When-Then, and scenario outline generation
Exploratory testing charters — generate risk-based charters and session notes with LLMs
AI-assisted code review — prompts for reviewing test code quality, coverage gaps, and anti-patterns
Prompt patterns that actually work — chain-of-thought, role prompting, and few-shot techniques adapted for QA contexts

Pankaj Nakhat

QA Evangelist with 22+ years in quality engineering. Built AI-assisted testing pipelines and prompt workflows used across 25+ engineering teams globally.

A practitioner's hands-on guide to AI-augmented testing. 50+ production-tested prompts. Real patterns from enterprise QE.

By Pankaj Nakhat · Director of Quality Engineering · 22 years · 25+ teams · Abu Dhabi, UAE

Contents (16 Chapters + Appendix)

Why AI Prompting Is the Most Valuable Skill in QA Right Now
01 · Foundations
02 · Test Case Generation
03 · API Testing
04 · UI & E2E Testing
05 · Performance & Security
06 · Multi-Agent QA
07 · LLM Evaluation
08 · CI/CD Integration
09 · Governance & Token Cost
10 · Advanced Techniques
11 · Anti-Patterns
12 · Building a Practice
13 · Claude Code for QE
14 · Building QA Agents
15 · Production Playwright
16 · Complete AI-Augmented Workflow
A · Quick Reference Templates

Why AI Prompting Is the Most Valuable Skill in QA Right Now

Quality Engineering is going through its biggest shift since Agile. AI and large language models have moved from "interesting experiment" to "serious part of the engineering toolkit" faster than most of us expected. But most QA teams are still using AI the same way they used Google in 2010. Vague question in. Generic answer out. Nothing changes.

This guide is different. Everything in here comes from real work: leading quality, release, and reliability across 25+ product teams in a Spotify-aligned engineering organisation, building multi-agent test automation systems, integrating LLMs into CI/CD pipelines, and evaluating AI outputs against real production defects. The examples are real. The anti-patterns are real. The cost considerations are real.

01 · What Is Prompt Engineering for QA Engineers?

Prompt engineering is the discipline of crafting instructions that reliably get high-quality, useful output from large language models. For QE people, think of it as test design for AI systems: just as a poorly designed test gives ambiguous results, a poorly constructed prompt produces ambiguous, incomplete, or hallucinated output. Same problem, different medium.

Anatomy of a Prompt

Element	What it does	QE Example
Role / Persona	Sets the model's context and expertise	"You are a senior SDET specialising in REST API contract testing..."
Context	Background the model needs	OpenAPI spec, user story, existing test code
Task	The specific instruction	"Generate Playwright test cases covering all acceptance criteria..."
Format	Structure of expected output	"Output as TypeScript using Page Object Model, with AAA comments"
Constraints	Boundaries the model must respect	"Do not use data-testid selectors. Use ARIA roles only."
Examples	One or more input/output demos	A sample test showing the exact structure you want
Evaluation	How output will be judged	"Each test must have exactly one assertion per logical expectation"
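
The seven elements in the table can be kept as a typed structure and rendered into a single prompt string. The following is a minimal sketch, not the guide's own implementation: the `PromptParts` and `buildPrompt` names, the `##` section labels, and the sample values are assumptions made for illustration.

```typescript
// The seven prompt elements from the table, as a typed structure.
interface PromptParts {
  role: string;
  context: string;
  task: string;
  format: string;
  constraints: string[];
  examples: string[]; // few-shot input/output demos; may be empty
  evaluation: string;
}

// Render the parts into one prompt with labelled sections.
// The labels and their order are illustrative, not a required format.
function buildPrompt(p: PromptParts): string {
  const sections: Array<[string, string]> = [
    ["ROLE", p.role],
    ["CONTEXT", p.context],
    ["TASK", p.task],
    ["FORMAT", p.format],
    ["CONSTRAINTS", p.constraints.map((c) => `- ${c}`).join("\n")],
    ["EXAMPLES", p.examples.join("\n---\n")],
    ["EVALUATION", p.evaluation],
  ];
  return sections
    .filter(([, body]) => body.trim().length > 0) // skip empty sections
    .map(([label, body]) => `## ${label}\n${body}`)
    .join("\n\n");
}

const prompt = buildPrompt({
  role: "You are a senior SDET specialising in REST API contract testing.",
  context: "User story and acceptance criteria go here.",
  task: "Generate Playwright test cases covering all acceptance criteria.",
  format: "Output as TypeScript using Page Object Model, with AAA comments.",
  constraints: ["Do not use data-testid selectors.", "Use ARIA roles only."],
  examples: [],
  evaluation: "Each test must have exactly one assertion per logical expectation.",
});
```

Keeping the parts separate makes it easy to review or swap one element, such as the constraints, without touching the rest of the prompt.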

Core Prompting Strategies

Zero-shot — Ask with no examples. Works for simple tasks. Quality ceiling is lower for complex QE work.

Few-shot — Provide 1–3 input/output examples. The single highest-leverage technique for QE. Encodes your team's coding standards directly in the prompt. If only one technique: this one.

Chain-of-thought (CoT) — Tell the model to reason step-by-step before producing output. Essential for complex test scenario design.

Role prompting — Give the model a specific expert persona. A "penetration tester" probes differently than an "API contract testing specialist."

Model Behaviours Every QE Must Know

Behaviour	What it means for your prompts
Recency bias	Put your most important constraint at the END of the prompt, not just the beginning.
Sycophancy	The model will agree with incorrect assertions. Prompt it to find problems, not confirm assumptions.
Hallucination	Models confidently fabricate API methods and test IDs. Every generated test must be executed before you trust it.
Token limits	Long prompts cost more and degrade quality. Inject only the relevant section of a spec, not the whole file.
Instruction following	"Must include" outperforms "should include" every time. Be explicit.

02 · AI Prompts for Test Case Generation from Acceptance Criteria

Effective test generation prompts follow a consistent meta-pattern: ROLE → CONTEXT → TASK → CONSTRAINTS → FORMAT → EVALUATION. Four high-value patterns:

Acceptance Criteria → Test Cases

Give the model the user story, ACs, and your existing Page Object. Ask it to cover every Given/When/Then, use the POM, follow AAA pattern, and flag any untestable ACs.

Boundary Value Analysis

Provide structured field specs (type, min, max, required). For each field: minimum valid, maximum valid, one below minimum, one above maximum, null/missing, type mismatch.
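
To check what the model returns, it helps to know the expected set of boundary cases up front. The sketch below derives the six cases listed above for one field. It assumes an integer-valued numeric field (so "one below" means minus 1); `NumericFieldSpec`, `BoundaryCase`, and `boundaryCases` are hypothetical names.

```typescript
interface NumericFieldSpec {
  name: string;
  min: number;
  max: number;
  required: boolean;
}

interface BoundaryCase {
  label: string;
  value: unknown;
  expectValid: boolean;
}

// Derive the six boundary cases for an integer field.
function boundaryCases(spec: NumericFieldSpec): BoundaryCase[] {
  return [
    { label: "minimum valid", value: spec.min, expectValid: true },
    { label: "maximum valid", value: spec.max, expectValid: true },
    { label: "one below minimum", value: spec.min - 1, expectValid: false },
    { label: "one above maximum", value: spec.max + 1, expectValid: false },
    // A missing value is only invalid when the field is required.
    { label: "null/missing", value: null, expectValid: !spec.required },
    { label: "type mismatch", value: "not-a-number", expectValid: false },
  ];
}
```

For a required `quantity` field with `min: 1` and `max: 100`, this yields the values 1, 100, 0, 101, null, and a string. That list can be diffed against the generated tests to spot missing cases.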

Negative Testing

Prompt as "a hostile user trying to break this API." Cover invalid data types, boundary violations, malformed JSON, SQLi/XSS payloads, race conditions, authentication bypass attempts.

Regression Test Prioritisation

Feed the PR diff and existing test inventory. Output JSON: critical_tests[], smoke_suite[], coverage_gaps[]. Identify tests covering changed code paths and transitively dependent code.

03 · How to Generate API Tests with AI Prompts

OpenAPI Spec-Driven Generation

Extract only the endpoint you need, not the whole spec. Generate complete Supertest/Jest suites covering all documented response codes, schema validation, auth flows (valid/expired/missing token), and Zod contract validation.
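
Extracting a single endpoint can be done before the prompt is built. This is a rough sketch under simplifying assumptions: the spec is already parsed into an object, `extractOperation` is a hypothetical helper, and `$ref` pointers into `components` are not resolved (a real version would need to pull in the referenced schemas too).

```typescript
type HttpMethod = "get" | "put" | "post" | "delete" | "patch";

interface OpenApiDoc {
  openapi: string;
  paths: Record<string, Partial<Record<HttpMethod, unknown>>>;
  [key: string]: unknown; // info, components, and so on
}

// Return a trimmed document containing only one operation.
// Note: the result is prompt context, not a complete valid OpenAPI file.
function extractOperation(doc: OpenApiDoc, path: string, method: HttpMethod): OpenApiDoc {
  const operation = doc.paths[path]?.[method];
  if (operation === undefined) {
    throw new Error(`No ${method.toUpperCase()} ${path} in spec`);
  }
  const pathItem: Partial<Record<HttpMethod, unknown>> = {};
  pathItem[method] = operation;
  return { openapi: doc.openapi, paths: { [path]: pathItem } };
}
```

Serialising the result with `JSON.stringify` gives a context block that contains one operation instead of the whole specification.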

Authentication Flow Testing

Generate auth helper modules for OAuth2 client_credentials and authorization_code flows with token caching, pre-expiry refresh, exponential backoff, and test-isolation pools.

Schema Validation Prompts

Dedicated prompts per validation type: null handling, enum values, nested objects, array fields, date/time formats. Each prompt targets one validation concern for maximum specificity.

04 · AI Prompts for Playwright UI and E2E Test Generation

UI test generation requires maximum structural context: DOM snapshots, component docs, or accessibility trees. Don't ask the model to guess what your UI looks like.

Page Object Model Generation

Feed the React component. Generate TypeScript POM with ARIA-first locators (getByRole, getByLabel, getByText), methods per user interaction (not raw locators), JSDoc comments. No CSS selectors, XPath, or nth-child.

User Journey to E2E Conversion

Feed journey steps and existing POMs. Set up test data via API, execute steps using POMs, assert after each critical step (not just final state), tear down via API, tag @e2e and @feature.

Accessibility Test Generation

Target WCAG 2.1 Level AA: keyboard navigation, focus indicators, ARIA labels, heading hierarchy, and colour-contrast flagging. Use @axe-core/playwright for the automated checks, and have the model add comments marking the checks that still need to be done manually.

05 · AI-Assisted Performance Testing and OWASP Security Prompts

k6 Performance Test Strategy

Design five test types: Baseline (10 users, 5 min), Load (ramp to peak over 10 min, hold 20 min), Stress (exceed peak by 50%), Spike (10x sudden traffic, 1 min), Soak (80% load, 4 hours for memory leak detection).
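
The five profiles map naturally onto k6's `vus`/`duration` and `stages` options. The sketch below only builds the option objects. Several values are assumptions because the text does not specify them: the ramp times for stress, spike, and soak, the stress hold time, the reading of "10x" as ten times normal traffic, and the reading of "80% load" as 80% of peak. `k6Profiles` and `LoadProfiles` are hypothetical names.

```typescript
interface Stage {
  duration: string; // k6 duration string such as "10m" or "4h"
  target: number;   // number of virtual users at the end of the stage
}

interface LoadProfiles {
  baseline: { vus: number; duration: string };
  load: { stages: Stage[] };
  stress: { stages: Stage[] };
  spike: { stages: Stage[] };
  soak: { stages: Stage[] };
}

function k6Profiles(peakUsers: number, normalUsers: number): LoadProfiles {
  // Ramp to a target, then hold it. Ramp-down stages are omitted for brevity.
  const rampAndHold = (target: number, ramp: string, hold: string): Stage[] => [
    { duration: ramp, target },
    { duration: hold, target },
  ];
  return {
    baseline: { vus: 10, duration: "5m" },
    load: { stages: rampAndHold(peakUsers, "10m", "20m") },
    // Assumed: 10 min ramp and 10 min hold for stress.
    stress: { stages: rampAndHold(Math.round(peakUsers * 1.5), "10m", "10m") },
    // Assumed: 10 s ramp, spike measured against normal traffic.
    spike: { stages: rampAndHold(normalUsers * 10, "10s", "1m") },
    // Assumed: 10 min ramp, 80% of peak.
    soak: { stages: rampAndHold(Math.round(peakUsers * 0.8), "10m", "4h") },
  };
}
```

Each entry can be exported as the `options` object of a separate k6 script, which keeps the five test types comparable run to run.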

OWASP API Security Top 10 Coverage

For authorised security testing only. Generate test cases covering: broken object level authorisation (BOLA), broken authentication (expired/malformed JWTs), broken object property level authorisation (mass assignment), unrestricted resource consumption (missing rate limits), and injection (SQLi/NoSQLi).

06 · Building Multi-Agent QA Systems with LLMs

Multi-agent architectures solve single-agent limitations by decomposing the QA workflow into specialised, collaborating agents.

Agent	Responsibility
Analyst	Reads requirements → structured test plan JSON (chain-of-thought)
Writer	Consumes test plan → executable test code (few-shot)
Sentinel	Runs tests → classifies failures (flaky vs genuine)
Healer	Analyses broken selectors → proposes minimal-diff fixes
Oracle	Synthesises results → executive quality report (JSON schema)

Approval Gates

Spec Review Gate: Analyst output reviewed before Writer runs
Diff Review Gate: Healer fixes reviewed before merge
Data Gate: Any test creating/modifying data requires human approval
Commit Gate: Generated tests go to feature branch pending PR review

07 · How to Test AI Features: LLM Evaluation for QA Teams

When your organisation ships AI-powered features, the QE team must own the quality gate for LLM outputs. Hallucination, relevance drift, and toxicity are production defects. Treat the LLM as a system under test.

Metric	What it checks
Answer Relevancy	Does the output answer the question?
Faithfulness	Does the output make only claims supported by the provided context? (RAG systems)
Contextual Precision	Is the retrieved context relevant to the query?
Hallucination	Does the output contradict the source context?
Toxicity	Does user-facing AI output contain harmful content?

Adversarial testing vectors: prompt injection, jailbreaking (roleplay framings), data extraction, scope violation, indirect injection via RAG documents.

08 · Automating QA in CI/CD Pipelines with AI Prompt Engineering

Stage	LLM Assistance
PR Created	Analyse diff, suggest relevant tests
Build	Generate missing unit tests for new functions
Test Execution	Classify failures: PRODUCT_BUG | TEST_BUG | ENVIRONMENT | FLAKY
Code Review	Comment on testability of new code, suggest refactors
Release Gate	Synthesise results into go/no-go risk assessment

Failure Triage Prompt Output

JSON: { classification, confidence, rationale, nextAction }
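
Because the triage output feeds automation, it is worth validating before anything acts on it. The following sketch parses and checks the four fields. The field types are assumptions (in particular, `confidence` is treated as a score between 0 and 1), and `parseTriage` is a hypothetical helper name.

```typescript
const CLASSIFICATIONS = ["PRODUCT_BUG", "TEST_BUG", "ENVIRONMENT", "FLAKY"] as const;
type Classification = (typeof CLASSIFICATIONS)[number];

interface TriageResult {
  classification: Classification;
  confidence: number; // assumed to be a 0..1 score
  rationale: string;
  nextAction: string;
}

// Parse the model's raw JSON reply and reject anything off-schema.
function parseTriage(raw: string): TriageResult {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Expected a JSON object");
  }
  const o = data as Record<string, unknown>;
  const classification = CLASSIFICATIONS.find((c) => c === o["classification"]);
  if (classification === undefined) {
    throw new Error(`Unknown classification: ${String(o["classification"])}`);
  }
  const confidence = o["confidence"];
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    throw new Error("confidence must be a number between 0 and 1");
  }
  const rationale = o["rationale"];
  const nextAction = o["nextAction"];
  if (typeof rationale !== "string" || typeof nextAction !== "string") {
    throw new Error("rationale and nextAction must be strings");
  }
  return { classification, confidence, rationale, nextAction };
}
```

A reply that fails validation can be retried or routed to a human instead of silently creating a ticket.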

09 · Controlling AI Token Costs and Prompt Governance in QA

A prompt is code. It belongs in source control with a changelog, reviews, and regression tests. Structure as a typed TypeScript module consumed by all teams.
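
One possible shape for such a module is sketched below. It is an illustration only: the `PromptTemplate` interface, the `{{name}}` placeholder syntax, the `render` function, and the sample template and version numbers are all assumptions, not the structure used in the full guide.

```typescript
interface PromptTemplate<V extends string> {
  id: string;
  version: string;      // bumped on every change, like any other code
  changelog: string[];
  variables: readonly V[];
  template: string;     // uses {{name}} placeholders
}

// Fill placeholders; fail loudly if the template uses an undeclared one.
function render<V extends string>(t: PromptTemplate<V>, values: Record<V, string>): string {
  return t.template.replace(/\{\{(\w+)\}\}/g, (_match, name: string) => {
    const known = (t.variables as readonly string[]).includes(name);
    if (!known) {
      throw new Error(`${t.id}@${t.version}: unknown placeholder "${name}"`);
    }
    return values[name as V];
  });
}

const apiTestFromOpenApi: PromptTemplate<"method" | "path" | "spec"> = {
  id: "api-test-from-openapi",
  version: "1.1.0",
  changelog: ["1.0.0 initial template", "1.1.0 require compilable TypeScript only"],
  variables: ["method", "path", "spec"],
  template:
    "Generate Supertest/Jest tests for {{method}} {{path}}.\nSpec:\n{{spec}}\nCompilable TypeScript only. No explanations.",
};
```

Because the variable names are part of the type, a caller that forgets `spec` gets a compile error rather than a prompt with a hole in it.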

Metric	Target
Cost per test generated	<$0.05
Cost per CI build	<$0.50
Cache hit rate	>40%
Monthly ROI	>10x

Cost controls: Context compression · Response caching · Model tiering (smaller models for structured tasks) · Batch processing (Azure OpenAI Batch API: 50% cost reduction).
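
Response caching is the simplest of these controls to prototype. The sketch below is an in-memory illustration, assuming the cache key is the exact model, prompt, and context; `ResponseCache` and `getOrCall` are hypothetical names, and a shared deployment would need a persistent store rather than a `Map`.

```typescript
class ResponseCache {
  private store = new Map<string, string>();
  private hits = 0;
  private lookups = 0;

  // Identical model + prompt + context produce the same key.
  private key(model: string, prompt: string, context: string): string {
    return JSON.stringify([model, prompt, context]);
  }

  // Return the cached response, or call the API once and remember the result.
  async getOrCall(
    model: string,
    prompt: string,
    context: string,
    call: () => Promise<string>,
  ): Promise<string> {
    this.lookups++;
    const k = this.key(model, prompt, context);
    const cached = this.store.get(k);
    if (cached !== undefined) {
      this.hits++;
      return cached;
    }
    const fresh = await call();
    this.store.set(k, fresh);
    return fresh;
  }

  // Fraction of lookups served from cache, for the hit-rate target.
  hitRate(): number {
    return this.lookups === 0 ? 0 : this.hits / this.lookups;
  }
}
```

Exposing `hitRate()` makes the cache hit rate target measurable instead of aspirational.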

10 · Advanced Prompting: RAG, Self-Correction, and Chaining for QA

RAG-Enhanced Test Generation

Query the vector DB for the top-K existing tests most similar to the new requirement and inject them as context. Instruct: "Do NOT duplicate these. Fill the coverage gaps they leave." This cuts redundancy without stuffing the entire knowledge base into the prompt.
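
The retrieval step itself is a nearest-neighbour search. A vector database does this at scale; the in-memory sketch below only shows the idea, using cosine similarity over embeddings that are assumed to have been computed elsewhere. `IndexedTest`, `cosine`, and `topK` are hypothetical names.

```typescript
interface IndexedTest {
  id: string;
  embedding: number[]; // produced by an embedding model, not shown here
}

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Return the k tests most similar to the query, best match first.
function topK(query: number[], tests: IndexedTest[], k: number): IndexedTest[] {
  return tests
    .map((t) => ({ t, score: cosine(query, t.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.t);
}
```

The selected tests are what gets injected as context, followed by the "do not duplicate" instruction.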

Self-Correcting Prompts

After generating, instruct the model to review against a checklist: async/await correctness, no hardcoded data, one logical assertion group, ARIA-first selectors, no cross-test dependencies. "If any criterion fails, revise before outputting."

Prompt Chaining for Complex Scenarios

5-step chain: (1) Dependency Mapping → (2) State Design → (3) Test Code Generation → (4) Review against Quality Constitution → (5) BDD documentation generation. Each step uses the previous output as input.
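
The chaining mechanism can be sketched independently of any model provider. In the code below, `callModel` is a stand-in for whatever client the team uses, and `ChainStep` and `runChain` are hypothetical names; the only behaviour shown is that each step's prompt is built from the previous step's output.

```typescript
type CallModel = (prompt: string) => Promise<string>;

interface ChainStep {
  name: string;
  buildPrompt: (previousOutput: string) => string;
}

interface StepResult {
  name: string;
  output: string;
}

// Run the steps in order, feeding each output into the next prompt.
async function runChain(
  steps: ChainStep[],
  input: string,
  callModel: CallModel,
): Promise<StepResult[]> {
  const results: StepResult[] = [];
  let previous = input;
  for (const step of steps) {
    previous = await callModel(step.buildPrompt(previous));
    results.push({ name: step.name, output: previous });
  }
  return results;
}
```

Keeping every intermediate output in the result makes it possible to review, for example, the dependency map on its own before trusting the generated test code.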

11 · 10 Dangerous AI Prompting Mistakes QA Teams Make

01 The vague task — "Write tests for our login page." Fix: always specify framework, language, pattern, selector strategy, and provide a concrete example.

02 Hallucination acceptance — Trusting AI-generated tests without running them. Fix: non-negotiable rule — every generated test must execute before committing.

03 Context overload — Injecting your entire codebase into one prompt. Fix: extract only the minimum necessary context.

04 One-shot for complex scenarios — Asking for a 20-scenario E2E suite in one prompt. Fix: 5 scenarios maximum per prompt; use chaining.

05 Ignoring model drift — Assuming a prompt produces equivalent results after a model update. Fix: maintain prompt regression test suite.

06 No human at high-risk gates — Fully automating generation for HIGH-risk or regulatory coverage. Fix: mandatory human review for HIGH and CRITICAL regardless of AI confidence.

07 Self-healing as communication substitute — Using self-healing instead of fixing unstable selectors. Fix: every self-heal triggers a team notification and a stabilisation ticket.

08 Sharing sensitive data in prompts — Including production user data or PII in prompts to external LLM APIs. Fix: anonymise all data before injection.

09 The magic wand expectation — Expecting prompt engineering to replace experienced QA judgment. Fix: it's a force multiplier, not a replacement.

10 No performance measurement — Using prompts in production without measuring quality over time. Fix: track compilation rate, execution rate, defect detection rate.
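
For mistake 08, anonymisation can run as a filter just before prompt assembly. The sketch below handles email addresses only and is deliberately narrow: real PII scrubbing also has to cover names, phone numbers, account identifiers, and free-text fields. The `anonymise` function and the `<EMAIL_n>` placeholder format are assumptions.

```typescript
// Replace each distinct email address with a stable placeholder so that
// relationships in the data survive while the identities do not.
function anonymise(text: string): string {
  const seen = new Map<string, string>();
  const emailPattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  return text.replace(emailPattern, (email) => {
    let token = seen.get(email);
    if (token === undefined) {
      token = `<EMAIL_${seen.size + 1}>`;
      seen.set(email, token);
    }
    return token;
  });
}
```

Using the same placeholder for repeated occurrences lets the model still reason about "the same user did X twice" without seeing who the user is.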

12 · How to Build a Prompt Engineering Practice in Your QA Team

Days 1–30: Foundation

Identify 3 high-value, low-risk use cases (API test generation from specs is a great start)
Select 2–3 pilot engineers with strong testing fundamentals
Establish shared prompt library repository with version control
Create first 3 team-standard prompt templates

Days 31–60: Integration

Integrate failure triage prompting into CI/CD for pilot teams
Establish token cost monitoring dashboard
Run prompt regression tests for first 10 production prompts
Retrospective: measure compilation rate, execution rate, coverage improvement

Days 61–90: Scale

Roll out to all teams with training and shared prompt library access
Implement approval gates for AI-generated tests in CRITICAL risk areas
Launch internal prompt engineering community of practice (bi-weekly 30-minute session)
Set 6-month targets: test generation time reduction, coverage improvement

KPI	Target (6 months)
Test authoring time	50% reduction
Test coverage (new features)	+10 percentage points
Defect escape rate	20% reduction
Time to triage CI failure	30 min → 5 min (AI-assisted)
Prompt quality score (compile + run)	90%+ within 90 days
AI testing cost per sprint	<$150 per team

13 · Using Claude Code for Agentic QA Automation

Claude Code is Anthropic's agentic coding assistant that operates directly in your terminal, reads and writes files in your repository, executes commands, and runs tests. For QE teams, this is a qualitative shift from chat-based prompt engineering to agentic automation.

Capability	What it gives QE teams
File read/write	Creates and updates Page Objects, test files, fixtures directly in your repo
Command execution	Runs Playwright, Jest, k6 and reads actual test output to inform next steps
Repository awareness	Understands your existing test architecture before generating
CLAUDE.md	Persistent team coding standards across all sessions; time spent writing it compounds

CLAUDE.md essentials for QE teams: Framework version, CI platform, auth type · Selector strategy (ARIA-first, never CSS/XPath) · Test data rules (always use DataFactory) · Test independence requirements · Approval gates for @critical tests.

14 · Building QA Automation Agents with the Anthropic Claude API

A Claude-powered QA agent uses the Anthropic API in a loop: call Claude with a task and tools → Claude calls a tool or produces a final answer → execute the tool call and return the result → continue until task completion.
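
The loop described above can be sketched without committing to a specific SDK. In the code below, `Model`, `ModelTurn`, `Message`, and `runAgent` are hypothetical stand-ins, not the Anthropic SDK's types or functions; a real agent would wrap the SDK call behind the `Model` function. The `maxTurns` cap is an added safety assumption.

```typescript
type ModelTurn =
  | { type: "tool_call"; tool: string; input: string }
  | { type: "final"; answer: string };

interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

type Model = (history: Message[]) => Promise<ModelTurn>;
type Tool = (input: string) => Promise<string>;

// Call the model, execute any requested tool, feed the result back,
// and repeat until the model produces a final answer.
async function runAgent(
  task: string,
  model: Model,
  tools: Map<string, Tool>,
  maxTurns = 10,
): Promise<string> {
  const history: Message[] = [{ role: "user", content: task }];
  for (let turn = 0; turn < maxTurns; turn++) {
    const next = await model(history);
    if (next.type === "final") return next.answer;

    const tool = tools.get(next.tool);
    const result =
      tool === undefined ? `Unknown tool: ${next.tool}` : await tool(next.input);
    history.push({ role: "assistant", content: `call ${next.tool}(${next.input})` });
    history.push({ role: "tool", content: result });
  }
  throw new Error(`No final answer after ${maxTurns} turns`);
}
```

Capping the number of turns keeps a confused agent from looping indefinitely and running up token cost.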

Xray BDD Test Generator Agent

Fetches story from Jira, checks existing Xray tests to avoid duplication, generates Gherkin (happy path + negative + boundary), creates each scenario linked to the story, reports coverage and untestable ACs.

Playwright Healer Agent

Monitors CI for selector-related failures. Strict scope: may only change locators, never assertions or test logic. Reads current DOM via snapshot tool before fixing. Outputs unified diff — human reviews before any merge. If correct selector is unclear: outputs CANNOT_HEAL + reason.
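
The "locators only" rule can be enforced mechanically before a human ever sees the diff. The following is a rough text-based heuristic, not a parser: it assumes the fix keeps the same number of lines, and it conservatively rejects any changed line that contains `expect(`, even if only the locator inside it changed. `healerScopeViolations` is a hypothetical name.

```typescript
// Locator-producing calls that a Healer fix is allowed to touch.
const LOCATOR_CALL = /\b(getByRole|getByLabel|getByText|locator)\(/;

// Compare the file before and after a proposed fix and list every
// changed line that goes beyond swapping one locator for another.
function healerScopeViolations(before: string, after: string): string[] {
  const oldLines = before.split("\n");
  const newLines = after.split("\n");
  if (oldLines.length !== newLines.length) {
    return ["line count changed"];
  }
  const violations: string[] = [];
  oldLines.forEach((oldLine, i) => {
    const newLine = newLines[i];
    if (oldLine === newLine) return;
    const touchesLocator = LOCATOR_CALL.test(oldLine) && LOCATOR_CALL.test(newLine);
    const touchesAssertion = oldLine.includes("expect(") || newLine.includes("expect(");
    if (!touchesLocator || touchesAssertion) {
      violations.push(`line ${i + 1}: ${newLine.trim()}`);
    }
  });
  return violations;
}
```

An empty result means the diff can go to the human reviewer; anything else is rejected before review.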

15 · Production Playwright TypeScript Patterns for Enterprise QE

Patterns from enterprise QE teams: ARIA-first selectors, fixture-based test isolation, API-driven setup/teardown, typed Page Objects.

Page Object Pattern

The constructor assigns typed Locators via getByRole with accessible names. Methods expose interactions, not raw locators. Search methods wait for the loading spinner to be hidden before returning.

Typed Fixtures

Token cache scoped to worker (not test) for performance. Authenticated page injects Bearer header. Test item fixture creates via API, yields, then always deletes — guaranteed cleanup even on test failure.
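
The "create, yield, always delete" lifecycle is the part most worth getting right. The sketch below shows it in framework-free TypeScript with `try`/`finally`; it is not the Playwright fixture API, and `withResource` is a hypothetical helper. In a Playwright fixture the same shape appears as setup code, the call that hands the value to the test, and teardown code after it.

```typescript
// Create a resource, hand it to the body, and always clean up afterwards.
async function withResource<T, R>(
  create: () => Promise<T>,
  destroy: (resource: T) => Promise<void>,
  body: (resource: T) => Promise<R>,
): Promise<R> {
  const resource = await create();
  try {
    return await body(resource);
  } finally {
    // Runs whether the body resolved or threw.
    await destroy(resource);
  }
}
```

If the body throws, the error still propagates to the caller, but only after the cleanup has run, so a failing test does not leave data behind.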

playwright.config.ts

fullyParallel: true · 2 retries in CI · 4 workers in CI · HTML + JUnit + playwright-spec-doc-reporter · projects: Chromium Desktop + Pixel 7 Mobile · globalSetup for shared state.
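
Written out, those settings might look like the config below. This is a sketch under assumptions: the JUnit output path and the `./global-setup` file name are invented for illustration, the local (non-CI) values for retries and workers are not specified in the text, and `playwright-spec-doc-reporter` is referenced by name only since its options are not described here.

```typescript
import { defineConfig, devices } from "@playwright/test";

const isCI = !!process.env.CI;

export default defineConfig({
  fullyParallel: true,
  retries: isCI ? 2 : 0,           // 2 retries in CI; local value is an assumption
  workers: isCI ? 4 : undefined,   // 4 workers in CI; Playwright default locally
  reporter: [
    ["html"],
    ["junit", { outputFile: "results/junit.xml" }], // path is an assumption
    ["playwright-spec-doc-reporter"],
  ],
  globalSetup: "./global-setup",   // hypothetical file holding shared-state setup
  projects: [
    { name: "Chromium Desktop", use: { ...devices["Desktop Chrome"] } },
    { name: "Pixel 7 Mobile", use: { ...devices["Pixel 7"] } },
  ],
});
```

Branching on `process.env.CI` keeps local runs fast while CI gets the retries and fixed worker count.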

16 · The End-to-End AI-Augmented Quality Engineering Workflow

Stage	Who / What	Action
1. Story grooming	Claude Code + CLAUDE.md	Reads story, flags untestable ACs, asks PO for clarification
2. Analyst Agent	Xray BDD Generator	Fetches story, checks duplicates, generates Gherkin
3. Approval Gate	QA Lead (human)	Reviews Analyst JSON. Nothing proceeds without sign-off.
4. Writer Agent	Claude Code subagent	Reads approved plan and POMs, generates Playwright spec, runs tests
5. PR Review	Claude Code + GitHub MCP	Auto-generates review comments on selector quality, missing negatives
6. CI Execution	Azure DevOps + Sentinel	PRODUCT_BUG = auto Jira ticket; TEST_BUG = Healer proposes diff
7. Healer Gate	QA Lead (human)	Reviews Healer diff. Approves or rejects. Diffs never auto-applied.
8. Release Gate	Oracle Agent	Synthesises sprint results + DORA metrics into go/no-go risk report

The 12 Golden Rules

Run everything. No AI-generated code goes to production without being executed.
Version your prompts. A prompt is code. Belongs in source control with a changelog.
Measure output quality. Track compilation rate, execution rate, defect detection rate per prompt.
Keep humans at gates. AI generates. Humans approve for anything CRITICAL, data-destructive, or regulatory.
Never inject PII. Anonymise all data before it enters any external LLM API call.
Tier your models. Save expensive models for complex reasoning.
Cache aggressively. Identical prompt and context must never call the API twice in the same sprint.
Scope your Healer. Self-healing agents touch only locators. Assertions and logic are human territory.
Govern your agents. Every agent needs a written constitution it cannot violate.
Monitor drift. LLM output quality changes with model updates. Run prompt regression before upgrading.
Own the cost. AI token spend is an engineering cost. Track it, budget it, optimise it.
Invest in CLAUDE.md. Time spent writing team standards there compounds across every engineer and session.

Appendix: QA Prompt Engineering Quick-Reference Templates

A.1 Acceptance Criteria → Test Cases

ROLE · STORY · ACs · POM → Playwright TypeScript tests covering all ACs (tagged with AC ID), AAA pattern, ARIA-first selectors, independent, 1 happy + 2 negative + 1 boundary per AC. Flag untestable ACs.

A.2 API Test from OpenAPI

METHOD + PATH + endpoint spec only + auth type → Supertest/Jest tests for all response codes, schema validation, auth (valid/expired/missing), type violations. Compilable TypeScript only. No explanations.

A.3 Failure Triage

Test name + error + stack trace + last 5 results + recent commits → PRODUCT_BUG | TEST_BUG | ENVIRONMENT | FLAKY. Output JSON: { classification, confidence, rationale, nextAction }

A.4 Model Settings by Task

Test generation (code): temp 0.1–0.3

Test strategy / brainstorm: temp 0.7–0.9

Failure classification: temp 0

Executive reporting: temp 0.4

"Quality Engineering has always been the discipline that asks the hard questions before the user has to. With prompt engineering in your toolkit, the scope of those questions has expanded dramatically."
— Pankaj Nakhat

Get the Full 34-Page PDF

All 50+ production-tested prompts, code examples, and the complete prompt engineering checklist — formatted for printing and team distribution.