Product Evals: Travel Planner, Long Context, and the Weight of Taste
A product note on evaluating an AI travel planner: itinerary quality, out-of-distribution (OOD) scenarios, long-context consistency, recommendation taste, and user feedback loops.
A note on evaluation as an instrument: failure cases, metrics, benchmark design, product loops, and the discipline of measuring agents.
A field note on designing agents as observable loops, with tools, memory, failure recovery, and product boundaries.