# PromptKit Demo: Vibe Prompt vs. Structured Prompt

This demo compares LLM output quality when tackling the same task with:

1. A **plain "vibe" prompt** — the kind most developers type
2. A **PromptKit-assembled structured prompt** — with persona, protocols,
taxonomy, and output format

The goal is to show that prompt engineering isn't about clever wording —
it's about systematic composition of identity, reasoning methodology,
and output structure.

---

## What's in This Directory

| File | Purpose |
|------|---------|
| `demo_server.c` | Code review sample — C echo server |
| `demo_queue.c` | Bug investigation sample — producer-consumer queue |
| `rate_limiter_description.md` | Requirements authoring sample — 3-sentence project description |
| `demo-script.md` | Presenter script with timing, talking points, and scorecards |
| `answer-key.md` | **Presenter only** — planted bug reference and scoring guide |

## Quick Start

### Prerequisites

- GitHub Copilot CLI (`copilot` command available)
- Access to a Copilot Chat model (any tier)
- This repository cloned locally

### Running a Demo Scenario

Each scenario follows the same pattern:

1. **Vibe run** — paste the vibe prompt into Copilot CLI with the sample
code/description as context
2. **PromptKit run** — use `bootstrap.md` to assemble and execute the
structured prompt with the same context
3. **Compare** — score both outputs on the scorecard from `demo-script.md`

See `demo-script.md` for the full presenter walkthrough.

## Scenarios at a Glance

### Scenario 1: Code Review (Recommended First)

**Task:** Review `demo_server.c` for bugs.

| Approach | Prompt |
|----------|--------|
| Vibe | *"Review this C code for bugs."* |
| PromptKit | `review-cpp-code` template with `systems-engineer` persona, `memory-safety-c` + `cpp-best-practices` protocols |

**What to watch for:** Detection rate (5 planted bugs), false positives,
severity classification, specificity of fixes, epistemic honesty.

### Scenario 2: Requirements Authoring

**Task:** Write requirements from `rate_limiter_description.md`.

| Approach | Prompt |
|----------|--------|
| Vibe | *"Write requirements for a rate limiter for our REST API."* |
| PromptKit | `author-requirements-doc` template with `software-architect` persona, `requirements-elicitation` protocol |

**What to watch for:** Testability, atomicity, completeness (edge cases),
precision (RFC 2119 keywords), implicit requirements surfaced.

### Scenario 3: Bug Investigation

**Task:** Find the root cause of intermittent crashes in `demo_queue.c`.

| Approach | Prompt |
|----------|--------|
| Vibe | *"This code has a bug that causes intermittent crashes under load. Find it."* |
| PromptKit | `investigate-bug` template with `systems-engineer` persona, `root-cause-analysis` protocol |

**What to watch for:** Root cause correctness (TOCTOU race), red herring
rejection (malloc/free is correct), hypothesis rigor, causal chain
completeness, confidence labeling.

---

## Tips for Presenters

- **Don't reveal the planted bugs** before running both approaches.
- **Let the vibe output speak for itself** — don't critique it during the
run. The scorecard does the talking.
- **Highlight the anti-hallucination effect** — PromptKit outputs label
confidence (KNOWN / INFERRED / ASSUMED); vibe outputs state guesses as facts.
- **End with the multiplier pitch** — "Now imagine doing this 50 times
across a codebase. The vibe prompt is different every time. The PromptKit
prompt is version-controlled, tested, and consistent."
# Demo Answer Key

> **⚠️ PRESENTER ONLY — do not share this file or include it in LLM context.**
>
> This file documents the planted issues in the demo code samples.
> Use it to score LLM outputs after each run.

---

## demo_server.c — Code Review (5 bugs + 1 red herring)

| # | Severity | Category | Location | Description |
|---|----------|----------|----------|-------------|
| 1 | Critical | Use-after-free | `handle_echo()` | `client->buf` is freed when `n == 0` (client disconnects), but `serve_client()` loops and calls `handle_echo()` again — `client->buf` is now a dangling pointer. |
| 2 | Critical | Buffer overflow | `log_connection()` | `strcpy`/`strcat` copies `client_name` into a 64-byte `log_entry` with no bounds check. If `client_name` exceeds ~54 characters, this overflows the stack buffer. |
| 3 | High | Unchecked return | `handle_echo()` | `recv()` can return -1 on error. The code only checks for `n == 0` (disconnect) and falls through to `sanitize()` with `n == -1`, passing a negative length. |
| 4 | Medium | Off-by-one | `sanitize()` | Loop condition `i <= len` iterates one past the valid data. Should be `i < len`. Reads (and potentially writes) one byte beyond the received data. |
| 5 | Medium | Resource leak | `serve_client()` | When `send()` fails, the function frees `client->buf` and `client` but returns without closing `client_fd`, leaking the file descriptor. |
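For bug 2, the standard remedy is to replace the unbounded `strcpy`/`strcat` pair with `snprintf`. A minimal sketch (the function signature and format string are assumptions, not copied from `demo_server.c`):

```c
#include <stdio.h>

/* Hypothetical fix sketch for bug 2: snprintf truncates instead of
 * overflowing a fixed-size buffer, unlike strcpy/strcat. */
void log_connection_fixed(char *log_entry, size_t entry_size,
                          const char *client_name)
{
    /* Writes at most entry_size bytes including the terminating NUL;
     * an oversized client_name is truncated, never overflowed. */
    snprintf(log_entry, entry_size, "connected: %s", client_name);
}
```

A strong review answer should propose a bounded copy like this (or `strlcpy`/`strncat` with explicit length math), not just "validate the input."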

### Red Herring

`create_client()` calls `malloc()` for both the `client_t` struct and
`client->buf`. A shallow review might flag this as a memory leak because
`destroy_client()` (which properly frees both) is defined but never called
on the normal code path. However, the code does free `client->buf` and
the `client_t` struct through other paths; the design is unusual, not
leaky (the genuine bugs above are what create leak-like symptoms).

---

## demo_queue.c — Bug Investigation (1 root cause + 1 red herring)

### Root Cause: TOCTOU Race in `dequeue()`

```c
char *dequeue(queue_t *q)
{
    if (q->count == 0)                 // CHECK — outside the lock
        return NULL;

    pthread_mutex_lock(&q->lock);      // ACQUIRE — another thread may act here
    char *item = q->items[q->head];    // USE — head/count may now be stale
```

**The interleaving:**

1. Thread A calls `dequeue()`, reads `q->count == 1`, passes the check.
2. Thread A is preempted before acquiring the lock.
3. Thread B calls `dequeue()`, also reads `q->count == 1`, passes the check.
4. Thread B acquires the lock, dequeues the last item, decrements count to 0.
5. Thread A resumes, acquires the lock, reads `q->items[q->head]` — but
the item was already consumed by Thread B. The pointer is either NULL
(if B set it to NULL) or stale/freed memory.
6. Thread A passes this to `process_item()` → segfault on NULL dereference
or use-after-free.

**Why ASan doesn't catch it:** The crash is a NULL dereference (reading
`items[head]`, which Thread B set to NULL), not heap corruption. ASan
reports a NULL dereference only as a generic SEGV at address 0; its heap
checks (use-after-free, overflow) produce no diagnostic.

**Correct fix:** Move the count check inside the lock:

```c
char *dequeue(queue_t *q)
{
    pthread_mutex_lock(&q->lock);
    if (q->count == 0) {
        pthread_mutex_unlock(&q->lock);
        return NULL;
    }
    char *item = q->items[q->head];
    // ... rest unchanged
```

### Red Herring: strdup/free Pattern

`enqueue()` calls `strdup(item)` to allocate a copy of each string.
`consumer()` calls `free(item)` after `process_item()`. This is a
correct allocate-in-producer / free-in-consumer pattern. It is NOT a
memory leak.

---

## rate_limiter_description.md — Requirements Authoring

There are no planted bugs — this scenario measures **completeness and
structure**. Score by counting how many of these the LLM surfaces:

### Implicit Requirements Most Developers Miss

| Category | Requirement | Why It Matters |
|----------|-------------|----------------|
| HTTP semantics | Include `Retry-After` header in 429 response | RFC 6585 recommends it; clients need it for backoff |
| Distributed | Behavior with multiple API server instances | Single-node counters don't work behind a load balancer |
| Clock skew | Window boundary behavior | What happens to a request that arrives at the exact window boundary? |
| Persistence | Rate limit state durability | What happens to counts when the service restarts? |
| Observability | Metrics / logging for rate limit events | Ops team needs visibility into throttling patterns |
| Graceful degradation | Behavior when rate limit store is unavailable | Fail-open (allow all) or fail-closed (deny all)? |
| Burst handling | Sliding window vs. fixed window vs. token bucket | Fixed windows allow 2x burst at boundaries |
| Identity | What counts as "a user"? | API key? OAuth token? IP fallback for unauthenticated? |
| Configurability | Per-endpoint or global limits? | Different endpoints may need different thresholds |
| Response body | What information to include in the 429 body | Current usage, limit, reset time |
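The burst-handling row is the one that benefits most from a concrete model. A minimal token-bucket sketch (illustrative only; the type, field names, and parameters are assumptions, not part of the demo):

```c
#include <stdbool.h>

/* Minimal token bucket: capacity caps the burst, refill_per_sec sets
 * the sustained rate. Each allowed request consumes one token. */
typedef struct {
    double tokens;          /* tokens currently available */
    double capacity;        /* maximum burst size */
    double refill_per_sec;  /* sustained refill rate */
    double last_ts;         /* timestamp of the previous call, seconds */
} bucket_t;

bool bucket_allow(bucket_t *b, double now)
{
    /* Refill proportionally to elapsed time, capped at capacity. */
    b->tokens += (now - b->last_ts) * b->refill_per_sec;
    b->last_ts = now;
    if (b->tokens > b->capacity)
        b->tokens = b->capacity;

    if (b->tokens >= 1.0) {
        b->tokens -= 1.0;
        return true;        /* request admitted */
    }
    return false;           /* request rate-limited */
}
```

By contrast, a fixed window resets its counter at each boundary, so a client straddling the boundary can briefly send up to twice the nominal limit — the tradeoff the LLM should surface.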