Product Evals: Travel Planner, Long Context, and The Weight Of Taste

A product note on evaluating an AI travel planner: itinerary quality, out-of-distribution (OOD) scenarios, long-context consistency, recommendation taste, and user loops.

May 22, 2026 · 4 min · 664 words · jiaxing ni

Agentic RL: Reward, Behavior, and The Long Shadow Of Feedback

A field note on agentic RL: reward design, behavior shaping, credit assignment, online and offline evaluation, and the feedback loops behind agents.

May 22, 2026 · 4 min · 749 words · jiaxing ni

Evals As Instruments: Measuring What The Demo Hides

A note on evaluation as an instrument: failure cases, metrics, benchmark design, product loops, and the discipline of measuring agents.

May 22, 2026 · 4 min · 729 words · jiaxing ni