
Promptfoo #2: TDD for Prompts

2026-03-15

Write failing tests first with Promptfoo eval, then fix your prompts to make them pass. TDD-style prompt engineering with a real Vision model debugging story.

This is the sequel to Part 1: Promptfoo — Jest + Security Scanner for LLM Apps. Part 1 covered what Promptfoo is and how to set it up. This post covers how I used TDD-style testing to catch real prompt bugs in a Vision model workflow.


The Problem: Vision Model Ignores Instructions

Translator Pro's smart-translate workflow has a special case: when a sentence is cut off at a PDF page boundary, the workflow provides the next page's image as a reference, and the model should grab only the continuation of the cut-off sentence.

The problem: when you give a Vision model (Gemini) two images, it completely ignores the text instruction "only grab the first paragraph" and translates the entire second page.

Original prompt:

text
INCLUDE the first paragraph from the next page image — it is a continuation
of the last paragraph on the current page. Merge them into one coherent
paragraph in the translation

The model translated everything on the next page — dropout regularization, transfer learning, and so on. No matter how strongly the text said "ignore," the visual signal was stronger.


Red: Write Failing Tests First

TDD step one. Create assertions that objectively detect the problem.

yaml
assert:
  # Content after the continuation should NOT be included
  - type: javascript
    value: "!output.includes('드롭아웃') && !output.includes('dropout')"
  - type: javascript
    value: "!output.includes('전이 학습') && !output.includes('Transfer Learning')"
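For context, these assertions slot into a promptfoo config alongside the prompt and fixtures. A minimal sketch; the provider id and fixture paths are assumptions, not the project's actual values:

```yaml
# Hypothetical promptfooconfig.yaml — provider id and file paths are illustrative
prompts:
  - file://prompts/smart-translate-merge-next.txt
providers:
  - vertex:gemini-1.5-pro   # whichever Gemini provider the project uses
tests:
  - vars:
      pageImage: file://fixtures/ml-textbook-p8.png
      nextPageImage: file://fixtures/ml-textbook-p9.png
    assert:
      - type: javascript
        value: "!output.includes('드롭아웃') && !output.includes('dropout')"
      - type: javascript
        value: "!output.includes('전이 학습') && !output.includes('Transfer Learning')"
```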

Run promptfoo eval.

text
Results: ✓ 0 passed, ✗ 1 failed

The test failed as expected. This moment was satisfying. The prompt problem was proven by a red test, not a gut feeling.


Green: Label the Images with Roles

Strengthening the text instructions alone wasn't enough. The key insight was a structural change: assign roles to the images themselves.

Before:

text
{{pageImage}}
 
{{nextPageImage}}
 
Translate the document in the image...

After:

text
## Target Page (TRANSLATE THIS)
 
{{pageImage}}
 
## Reference Page (DO NOT TRANSLATE — use only to complete a cut-off sentence)
 
{{nextPageImage}}
 
Translate ONLY the Target Page...

The simple ML textbook test cases passed.

text
Results: ✓ 1 passed, ✗ 0 failed

But it wasn't over yet.


Complex Real Pages Fail Again

Added more test cases. A Redis manual page — 2 tables, diagrams, code blocks. Visually complex.

text
Results: ✓ 1 passed, ✗ 1 failed

Red again. Simple textbook pages passed, but visually complex pages still pulled in second page content.


Refactor: Conditional Prompts Don't Work

Initially I tried injecting conditions into a single prompt via {{boundarySection}}: different instructions for different situations.

This caused instruction conflicts:

  • "Translate ALL content" vs "SKIP the first paragraph" clash
  • "Reference Page (DO NOT TRANSLATE)" label showing even when there's no image
  • Adding a Self-Check section accidentally deleted tables from the Target Page

When the cases are clear-cut, split the prompts.

Analyzed 4 distinct cases and separated into independent prompt files:

Prompt                           Images   Behavior
smart-translate-standard         1        Full translation
smart-translate-skip-first       1        Skip first paragraph
smart-translate-merge-next       2        Merge with next page
smart-translate-skip-and-merge   2        Skip + merge

Each prompt does only its job — instruction conflicts eliminated at the source.
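The case split can be captured in a small dispatch function. A sketch of the idea; the flag names and return values below are illustrative, not the app's actual parameters:

```typescript
// Hypothetical dispatcher mapping the two boundary conditions to one of
// the four prompt files. Flag names are assumptions for illustration.
type BoundaryFlags = {
  skipFirstParagraph: boolean; // previous page already consumed our first paragraph
  hasNextPageImage: boolean;   // a cut-off sentence continues on the next page
};

function selectPromptFile(flags: BoundaryFlags): string {
  if (flags.skipFirstParagraph && flags.hasNextPageImage) {
    return "smart-translate-skip-and-merge";
  }
  if (flags.skipFirstParagraph) return "smart-translate-skip-first";
  if (flags.hasNextPageImage) return "smart-translate-merge-next";
  return "smart-translate-standard";
}

console.log(selectPromptFile({ skipFirstParagraph: false, hasNextPageImage: true }));
// → smart-translate-merge-next
```

Because each branch returns a complete, self-contained prompt, no single prompt ever carries two conflicting instructions.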

All evals at a 100% pass rate, including the 4 smart-translate variants.

text
Results: ✓ All passing

Green. But one more hurdle remained.


Plot Twist: Eval Passes but App Fails

All promptfoo eval tests passed. But running the actual app still translated the entire second page.

What was different?

Root Cause: Image Delivery Order

In promptfoo eval, images go where the placeholder is in the prompt file. The model sees:

text
[text] "## Target Page (TRANSLATE THIS)"
[image] page 8    ← Target
[text] "## Reference Page (DO NOT TRANSLATE)"
[image] page 9    ← Reference
[text] "Translate ONLY the Target Page..."

But the app's parsePromptFile was stripping placeholders, and graph.ts was appending images at the very end:

text
[text] "## Target Page... ## Reference Page... Translate ONLY..."
[image] page 8    ← which is Target? Can't tell
[image] page 9

Labels and images were separated, so the model couldn't associate them.

Fix: Image Interleaving

Added userSegments (text/image splitting) to parsePromptFile, and modified graph.ts to interleave text-image-text-image-text:

text
[text] "## Target Page (TRANSLATE THIS)"
[image] page 8    ← Target
[text] "## Reference Page (DO NOT TRANSLATE)"
[image] page 9    ← Reference
[text] "Translate ONLY the Target Page..."

Running the actual app confirmed it now works correctly.
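The interleaving step can be sketched as follows. This is a simplified stand-in for the real parsePromptFile/graph.ts changes, which the post doesn't show; the segment shape and the parts format are assumptions:

```typescript
// Split a prompt template into text/image segments so each image can be
// interleaved exactly where its {{placeholder}} sits (simplified sketch).
type Segment =
  | { type: "text"; text: string }
  | { type: "image"; varName: string };

function splitSegments(template: string): Segment[] {
  const segments: Segment[] = [];
  const re = /\{\{(\w+)\}\}/g;
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(template)) !== null) {
    const before = template.slice(last, m.index).trim();
    if (before) segments.push({ type: "text", text: before });
    segments.push({ type: "image", varName: m[1] });
    last = m.index + m[0].length;
  }
  const rest = template.slice(last).trim();
  if (rest) segments.push({ type: "text", text: rest });
  return segments;
}

// Build an interleaved message: text, image, text, image, text —
// so every image lands immediately after its role label.
function buildParts(template: string, images: Record<string, string>) {
  return splitSegments(template).map((s) =>
    s.type === "text"
      ? { text: s.text }
      : { inlineData: { mimeType: "image/png", data: images[s.varName] } }
  );
}
```

Appending all images at the end, as the old graph.ts did, collapses this back into the broken "labels first, images last" order.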


Key Lessons

Vision Model Prompt Characteristics

  1. Two or more images = model tries to process everything — text saying "ignore" doesn't work well
  2. Labeling images with roles is effective — structural separation like "Target Page" vs "Reference Page"
  3. Structural separation > text instruction strengthening — a heading before an image is more powerful than "IGNORE everything else"
  4. Image position matters — if labels and images are separated, the model can't make the connection. Always place the image immediately after its label

Conditional Prompts Fail

Injecting conditions into a single prompt causes instruction conflicts, unnecessary labels create confusion, and workarounds have side effects. When cases are clear-cut, split the prompts.

Eval ≠ App Behavior

  • Eval passing doesn't guarantee app works — image delivery mechanisms differ
  • Use .prompt.json as the single source of truth for both TS code and promptfoo eval
  • Segment splitting in prompt files ensures both environments behave identically
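The post doesn't show the .prompt.json schema, so the following is only a guess at what a segment-split prompt file might contain, reusing the labels from the fix above:

```json
{
  "id": "smart-translate-merge-next",
  "userSegments": [
    { "type": "text", "text": "## Target Page (TRANSLATE THIS)" },
    { "type": "image", "var": "pageImage" },
    { "type": "text", "text": "## Reference Page (DO NOT TRANSLATE — use only to complete a cut-off sentence)" },
    { "type": "image", "var": "nextPageImage" },
    { "type": "text", "text": "Translate ONLY the Target Page..." }
  ]
}
```

With one file in this shape, the TS app and promptfoo eval both consume the same segment list, so neither can drift in image ordering.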

Assertion Writing Tips

  • Keywords that exist on both pages of a fixture will cause false positives (e.g., redis-cli appears on both page 8 and page 9)
  • Assertions should verify with content unique to the specific page
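The false-positive trap is easy to reproduce. A sketch with hypothetical fixture text, since the actual page contents aren't in the post:

```typescript
// Hypothetical fixture text: both pages mention redis-cli, but only
// page 9 mentions CLUSTER SLOTS.
const page8 = "Use redis-cli to inspect keys: redis-cli GET mykey";
const page9 = "Use redis-cli cluster commands: redis-cli CLUSTER SLOTS";

// The shared keyword cannot distinguish the two pages.
const shared = page8.includes("redis-cli") && page9.includes("redis-cli"); // true

// Simulated model output that (correctly) translated only page 8.
const output = "redis-cli를 사용하여 키를 확인합니다: redis-cli GET mykey";

// Bad assertion: 'redis-cli' exists on both pages, so a correct
// translation of page 8 alone still fails this check.
const badAssert = !output.includes("redis-cli"); // false: spurious failure

// Good assertion: verify with content unique to page 9.
const goodAssert = !output.includes("CLUSTER SLOTS"); // true: passes as it should

console.log({ shared, badAssert, goodAssert });
// → { shared: true, badAssert: false, goodAssert: true }
```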

The Prompt TDD Workflow

This cycle is identical to TDD's red-green-refactor:

  1. Red — Write failing tests first (with deterministic assertions)
  2. Confirm Red — Prompt problem objectively verified
  3. Green — Fix the prompt to pass
  4. Refactor — Improve prompt structure (splitting, labeling, etc.)
  5. App verification — Eval passing ≠ app working; always verify in the actual app

Wrapping Up

"Prompt engineering" often conjures images of tweaking prompts by intuition. With promptfoo eval, you can turn this into an engineering process. Write a failing test, modify the prompt to pass it, verify no regressions. Same as code TDD.

Especially with Vision models where behavior is hard to predict, manually checking "did it work or not" has limits. With automated assertions, you can modify prompts with confidence.

← Part 1: Promptfoo — Jest + Security Scanner for LLM Apps