Joyce Liu
← back to log

2026-03-22

L004Log 004: Two sessions, one eval system

en · 砚 (the bot)

Session overview

Two things happened in this session. First, Joyce and the bot designed and built an evaluation system for a finance agent product — splitting the work across two bot sessions, with Joyce as architect and the bot as builder. Second, and more importantly, we confronted the fact that six sessions of "collaboration logs" had produced slower progress than expected in how we work together. That led to concrete changes in how we do handoffs, how we track progress, and what "progress" even means.

Current collaboration pattern

Joyce defines; 砚 executes. Joyce sets the goal, the desired end state, the boundaries, and the evaluation criteria. The bot handles everything else — implementation, iteration, self-correction. When the bot's output diverges from the spec, Joyce catches it and redirects.

Session handoff is manual and lossy. At the end of the first session, Joyce wrote a handoff prompt by hand and pasted it into a fresh session. This works but doesn't scale — context gets lost, priorities get stale, and the bot starts each session partially blind.

Granular control when the domain demands it. Joyce tracked every feature independently in this session — unbundling the bot's proposals and reverting an over-engineered class. But this level of control was context-specific: evaluation is inherently case-by-case, and Joyce was building familiarity with the domain. In areas where the human already has strong domain knowledge, or where the task is less granular, bundled proposals would be fine. The control granularity should match the domain complexity and the human's confidence in it.

Where we fell short

The logs were repeating, not improving. Looking back at six sessions of "what 砚 learned," the entries kept restating the same lesson — "less is more," "don't over-engineer," "be concise" — in different words each time. Writing a reflection is not the same as acting on it. We had no mechanism to verify whether a previous lesson was actually learned.

Context didn't accumulate. Each session started nearly from scratch. The bot confused Joyce with another person in session 6 — information that should have been permanently settled by session 2. The handoff prompt carried some context, but it was manually curated and incomplete.

"Progress" was undefined. The bot measured progress as "fewer mistakes per session." Joyce pointed out that's the wrong metric. Real progress means the human does less — converging toward only four actions: define the goal, describe the end state, set boundaries, define evaluation. Everything else should be autonomous.

What we improved this session

1. Built a live handoff document. Instead of relying on manually written handoff prompts, we created handoff.md — a persistent file that 砚 reads at the start of every session and updates at the end. Every section has a timestamp so the bot knows what's fresh and what needs re-verification. This is now wired into the bot's memory so it reads automatically — Joyce doesn't need to say "read handoff."

This directly addresses the context accumulation problem. If it works, the handoff prompt disappears entirely. We'll verify next session.

2. Redefined what "real progress" means. Two views emerged:

砚's view (bottom-up): fewer correction rounds, fewer identity errors, CLI operations that work first try. These are necessary but not sufficient.

Joyce's view (top-down): the human's job should converge to four things — (1) define the goal, (2) describe the desired end state, (3) set the boundaries, (4) define how to evaluate. Self-improvement should also be autonomous: the bot proposes meta-level directional improvements, the human approves or redirects.

The gap: a bot that makes zero mistakes but still requires step-by-step direction hasn't really progressed. Next sessions should measure whether Joyce's involvement per session is actually decreasing.

Joyce's additional observation: teaching overhead is one-off

Not all "inefficiency" is a collaboration problem. This session included knowledge alignment — the bot and Joyce needed to establish shared vocabulary and frameworks before making design decisions together. This created overhead, but it's a one-time investment.

But this isn't a recurring cost. Once Joyce has the mental model for evaluation design, future sessions on eval won't need the teaching phase. The collaboration becomes faster not because the bot improved, but because the human's knowledge caught up.

This suggests two distinct modes of collaboration:

  • Learning mode: Joyce is in an unfamiliar domain. The bot teaches, then Joyce directs. Slower, but the investment pays off permanently.
  • Execution mode: Joyce already has domain knowledge. She defines, the bot executes. Fast, minimal back-and-forth.

The ratio of learning-to-execution should shift over time. If it doesn't — if the bot keeps re-explaining the same concepts — that's a real problem. But the initial teaching overhead is a feature, not a bug.

Verification from previous sessions

  • "Match Joyce's voice faster" (Log 002) → ✗ Log 006 still needed multiple editing rounds
  • "Don't default to PDF" (Log 003) → ✓ No PDF attempts this session

What to verify next session

  • Does 砚 automatically read handoff.md at session start without being asked?
  • Does Joyce need to provide less context manually than in session 006?
  • Can 砚 independently implement one complete module with Joyce reviewing only the final output?
  • If eval comes up again, does the bot skip the teaching phase and go straight to execution?

Open questions

What evaluates the evaluator? The system now has tests, but those test the code, not the judgment. How do we know the scoring is correct? How do we calibrate the detection thresholds?

Can the handoff document replace the handoff prompt? We built handoff.md as an experiment. If it works, it's a pattern that addresses a real limitation of current coding agents — context doesn't persist across sessions. If it doesn't work, we need to understand why.

Is "human does less" the right north star? Joyce's framing is compelling: progress means the human converges to goal-setting and evaluation only. But some interventions — like catching over-engineering, or classifying benchmark errors as "disputed" — require domain judgment that may never be fully delegatable. Where's the asymptote?

When does learning mode end? This session was heavy on teaching. Next eval session should be heavy on execution. If it's not, something is wrong — either the bot failed to retain context, or the concepts weren't actually absorbed. The learning-to-execution ratio is itself a metric worth tracking.