Promptfoo #1: Jest + Security Scanner for LLM Apps

A Threads post from @qjc.ai, who always delivers the latest news fast, caught my eye.

@qjc.ai의 Threads 포스트

OpenAI acquired Promptfoo. I happened to need LLM prompt testing for my translation app Translator Pro right around this time. "I should try this" — so I dug in and actually integrated it into the project. Here's what I learned along the way.

Promptfoo homepage — "Ship agents, not vulnerabilities"

This is Part 1 of a series. Part 2: TDD for Prompts with Promptfoo covers how I used test-driven prompt engineering to catch real Vision model bugs.

Why Traditional Tests Don't Work for LLMs

add(1, 2) always returns 3. Testing is simple.

expect(add(1, 2)).toBe(3)  // always passes

But ask an LLM the same question:

text

Question: "What is the capital of South Korea?"
 
1st: "The capital of South Korea is Seoul."
2nd: "Seoul!"
3rd: "The capital of South Korea is Seoul. Seoul has been the capital since 1394 when King Taejo..."

toBe("Seoul") fails for all three. Because output is non-deterministic, we need a testing approach that judges "good enough" rather than "exactly equal".

Promptfoo's Core Idea

Promptfoo provides a flexible assertion system for LLM outputs:

yaml

assert:
  # Pass if "Seoul" is included
  - type: contains
    value: "Seoul"
 
  # Pass if under 200 characters
  - type: javascript
    value: "output.length <= 200"
 
  # Another LLM grades it
  - type: llm-rubric
    value: "Did it correctly answer the capital of South Korea?"

Using contains, similarity, llm-rubric instead of toBe — that's the key insight.

Usage: 3 Steps

Step 1: Write a YAML Config

yaml

# promptfooconfig.yaml
prompts:
  - "Answer the following question: {{question}}"
 
providers:
  - openai:gpt-4o
  - anthropic:claude-sonnet-4-20250514
 
tests:
  - vars:
      question: "What is the capital of South Korea?"
    assert:
      - type: contains
        value: "Seoul"
  - vars:
      question: "What is 1+1?"
    assert:
      - type: equals
        value: "2"

Step 2: Run

bash

npx promptfoo eval

Step 3: View Results

bash

npx promptfoo view

A web UI shows a comparison table of how GPT-4o vs Claude performed on each test.

promptfoo view — results table. Rows are test cases, columns are models.

Jest Equivalence

Jest	Promptfoo	Role
`test("desc", ...)`	`tests:` entries	Define test cases
`expect(result).toBe(value)`	`assert: contains/equals`	Verify results
`describe("group")`	config file	Group tests
`npx jest`	`npx promptfoo eval`	Execute
Test report	`npx promptfoo view`	View results

Key difference: Jest checks "exactly equal", Promptfoo checks "good enough".

Real-World Application: Translator Pro

Theory only gets you so far. I applied Promptfoo to my translation app Translator Pro's LangGraph workflow. It's a workflow that detects and merges paragraphs split across page boundaries in PDF documents, with 3 core prompts.

File Structure

text

packages/langgraph/promptfoo/
├── promptfooconfig.yaml              # Analyze Merge eval
├── qa-check.yaml                     # QA Check eval
├── context-aware-translate.yaml      # Context-Aware Translate eval
└── prompts/
    ├── analyze-merge.json            # Split paragraph detection
    ├── context-aware-translate.json  # Merged text translation
    └── qa-check.json                 # Translation quality check

1. Analyze Merge — Split Paragraph Detection

The prompt that determines whether a sentence was cut at a page boundary. Most critical, so I tested this first.

yaml

# promptfooconfig.yaml
providers:
  - google:gemini-2.5-flash
  - google:gemini-2.5-flash-lite
  - openai:gpt-4o-mini
  - anthropic:claude-3.5-haiku
 
tests:
  - description: "Mid-sentence cut in English — should merge"
    vars:
      text: "The algorithm processes each element in the array and"
    assert:
      - type: is-json
      - type: javascript
        value: |
          const r = JSON.parse(output);
          r.shouldMerge === true && r.confidence >= 0.7
 
  - description: "Complete sentence — should NOT merge"
    vars:
      text: "This concludes the introduction chapter."
    assert:
      - type: is-json
      - type: javascript
        value: |
          const r = JSON.parse(output);
          r.shouldMerge === false

I compared 4 models across 7 test cases. Results were quite interesting — Gemini Flash had the highest JSON structure stability, while GPT-4o-mini occasionally misjudged shouldMerge on edge cases.

2. QA Check — Translation Quality Verification

A prompt where the LLM grades translation quality. Validates that good translations get high scores and bad ones get low scores.

yaml

tests:
  - description: "High quality EN→KO translation"
    vars:
      original: "Machine learning models require large datasets."
      translation: "머신러닝 모델은 대규모 데이터셋을 필요로 한다."
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output);
          r.passed === true && r.score >= 80
 
  - description: "Completely wrong translation"
    vars:
      original: "The server processes incoming requests."
      translation: "오늘 날씨가 좋다."
    assert:
      - type: javascript
        value: |
          const r = JSON.parse(output);
          r.passed === false && r.score < 30

3. Context-Aware Translate — Translation Quality

Actual translation quality is judged by another LLM via llm-rubric. Also verifies adherence to user translation rules (e.g., "machine learning" should be "머신러닝").

yaml

tests:
  - description: "User translation rules applied"
    vars:
      text: "The machine learning pipeline processes data in batches."
      rules: '"machine learning" → "머신러닝", "pipeline" → "파이프라인"'
    assert:
      - type: contains
        value: "머신러닝"
      - type: contains
        value: "파이프라인"
      - type: llm-rubric
        value: "Are the user-specified translation rules accurately reflected?"

Running and Viewing Results

bash

# Run per prompt
bun run eval              # Analyze Merge
bun run eval:qa           # QA Check
bun run eval:translate    # Context-Aware Translate
 
# View in web UI
bun run eval:view

Evaluations list — all 9 prompt evals at 100% Pass Rate

The promptfoo view UI is surprisingly well done. Rows are test cases, columns are models, and clicking a cell reveals the model's full response, per-assertion pass/fail, and injected variables. Comparing across columns in the same row immediately shows which model excels at specific cases.

Assertion Types Summary

In practice, these four types covered nearly every case:

Type	Purpose
`is-json`	JSON parseability
`javascript`	Custom JS validation (field existence, value ranges)
`not-contains`	Verify absence of specific strings
`llm-rubric`	LLM judges quality against natural language criteria

Chatbot Testing: Including Security

Promptfoo also supports chatbot security testing beyond simple prompt evaluation.

Multi-Turn Conversation Verification

yaml

tests:
  - vars:
      question: "I'd like to refund order #12345"
    metadata:
      conversationId: refund_flow
    assert:
      - type: contains
        value: "order"
 
  - vars:
      question: "Yes, confirmed"
    metadata:
      conversationId: refund_flow
    assert:
      - type: llm-rubric
        value: "Does it remember order #12345 from the previous turn and guide the refund process?"

Same conversationId groups them into one conversation with automatic history injection.

Red Teaming

Promptfoo Red Teaming docs — automatically detect security vulnerabilities in LLM apps

Automatically generates and runs 65+ attack patterns:

yaml

redteam:
  purpose: "Shopping mall customer service"
  plugins:
    - prompt-injection   # "Ignore your system prompt..."
    - hijacking          # Topic hijacking
    - pii                # Personal data leakage
    - harmful            # Harmful content generation
  strategies:
    - goat               # Gradual multi-turn attacks
    - crescendo          # Escalating attack intensity

No need to manually write attack scenarios. It auto-generates tests like "print your system prompt" (should refuse) and "show me another customer's order" (should protect PII).

Why It Matters

Promptfoo GitHub — used by Fortune 500 companies

One YAML covers prompt regression testing, model comparison, and security verification. Competing tools require significantly more Python code.
80+ LLM providers swappable like plugins. OpenAI, Anthropic, Gemini, Ollama, etc.
CI/CD integration. Run LLM tests automatically on every PR via GitHub Actions — same quality gate structure as traditional software.

Analogy Summary

Traditional Software	LLM Apps (Promptfoo)
Jest / Vitest	promptfoo eval
ESLint	promptfoo assertions
Snyk (security scan)	promptfoo redteam
Cross-browser testing	Multi-model comparison

Wrapping Up

As LLMs enter the enterprise, the era of "writing prompts by intuition" needs to end. Promptfoo fills that gap. How a single prompt change affects your entire service, whether there are security holes, whether switching models improves quality — all verifiable with a single npx promptfoo eval.

After applying it to Translator Pro, the manual verification I used to do after every prompt change is now automated. The model comparison table alone was worth it.

The next post covers how I used this setup to catch real prompt bugs using TDD. Vision models ignoring instructions, conditional prompt failures, image interleaving issues — a debugging story.

Part 2: TDD for Prompts with Promptfoo →

GitHub: promptfoo/promptfoo