Event

Winning the OpenAI Codex Hackathon: Moving from Outputs to Outcomes - The Decision Layer

Illustration of the OpenAI Codex Hackathon winning project: a champion character holding an OpenAI flag and an Outcomes lightbulb, alongside a funnel processing ideas through a decision graph, representing the shift from raw AI outputs to structured outcomes through evaluation infrastructure

OpenAI Hackathon

At OpenAI HQ on 02/05/26, Vineet and I won the OpenAI Codex Hackathon by building a continuously running multi-agent system for scientific discovery.

Thanks to Sam for sharing a few words, and to the jury: Greg Brockman, Sonya Huang, Thibault Sottiaux, Lenny Rachitsky, and Peter Steinberger.

The win itself is not the point, but it is a useful signal. It validated that the problem we were chasing is real, and that the approach is worth taking seriously.

The system we built was not a chatbot. It was not a prompt loop. It was a background system designed to run continuously and accumulate work over time.

And the moment you build something that runs continuously, one question becomes unavoidable.

When agents run in the background, how do you decide which idea to pursue, and how do you evaluate whether the outcome is satisfactory and relevant beyond the tokens generated? In our case: does the research paper the system writes present breakthrough, novel, and relevant ideas?

Because in continuous systems, the hard part is not generating outputs. The hard part is turning them into outcomes.

One-shot agents fail under continuity

Most agent systems today are built as episodic workflows: a human triggers them, the system runs, the system produces something, and the human evaluates the result. This works because the human is still doing the actual decision-making, including filtering, prioritization, and selecting what happens next. The agent is not.

This model breaks as soon as agents run continuously, as soon as the system explores an open-ended space, and as soon as it produces more candidates than a human can realistically inspect. At that point, outputs accumulate while progress does not. The system becomes productive in the wrong way: it generates constantly, but it never converges.

Continuous agents make generation cheap

In our prototype, agents were always active, which meant they could continuously propose research directions, refine hypotheses across iterations, suggest experiments, draft partial notes, and branch into alternative threads. This is exactly what you want from a system designed to run in the background, and it is also exactly what breaks it.

The reason is simple: once agents run continuously, generation stops being scarce. The system does not struggle to find ideas anymore; it struggles to decide which ideas deserve attention, which ones deserve compute, and which ones should be dropped. In continuous systems, creativity is not the bottleneck. Selection and evaluation are.

Outputs are not outcomes

A one-shot agent produces an output that a user can read, evaluate, and turn into an outcome. A continuous system cannot rely on that loop. It produces outputs all the time, including hypotheses, plans, drafts, and candidate directions. If every output is treated as an outcome, the system does not scale; it simply expands.

At that point, the system becomes a factory for plausible artifacts rather than a machine for progress. This is why we made the separation explicit. Agents generate candidates, and the system decides what deserves action.
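
To make that separation concrete, here is a minimal sketch of the pattern in Python. It assumes a simple in-memory pool and a placeholder scorer; the names (Candidate, DecisionLayer, toy_scorer) are illustrative, not the hackathon code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    """One generated output: a hypothesis, plan, or draft direction."""
    idea: str
    source_agent: str
    score: float = 0.0

@dataclass
class DecisionLayer:
    """Evaluates candidates; it never generates them."""
    scorer: Callable[[Candidate], float]
    budget: int = 3  # how many candidates earn further attention per cycle

    def decide(self, candidates: List[Candidate]) -> List[Candidate]:
        for c in candidates:
            c.score = self.scorer(c)
        ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
        return ranked[: self.budget]  # everything else is parked, not actioned

# Toy usage: agents keep producing, the decision layer keeps filtering.
def toy_scorer(c: Candidate) -> float:
    return float(len(c.idea))  # stand-in for a real, graph-based evaluation

pool = [
    Candidate("Test hypothesis A against dataset X", "agent-1"),
    Candidate("Rewrite the abstract", "agent-2"),
    Candidate("Branch: compare A with baseline B under noise", "agent-3"),
]
layer = DecisionLayer(scorer=toy_scorer, budget=1)
print(layer.decide(pool))  # only the selected candidate becomes an outcome
```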

OpenAI Hackathon Winners

The decision layer is the missing layer

We introduced a decision and evaluation layer. This layer did not generate ideas; it evaluated them. It did not produce text, but it brought control, which is necessary for any agent system and even more so for multi-agent systems.

Concretely, when the system generated a research paper draft, we did not treat the draft as the outcome. We treated it as raw material. We extracted each idea from the paper as a separate unit, inserted it into a graph, and evaluated it against everything that already existed.
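
For illustration, here is a sketch of that ingestion step using networkx. The extract_ideas function is a hypothetical stand-in for the real extraction (in practice an LLM call), and the citation format below is an assumption, not the exact schema from the hackathon system.

```python
import networkx as nx

# The graph is persistent state shared across iterations.
graph = nx.DiGraph()

# Hypothetical extraction step: in practice this would be an LLM call
# that splits a generated draft into discrete, separately evaluable ideas.
def extract_ideas(draft: str) -> list[str]:
    return [line.strip() for line in draft.splitlines() if line.strip()]

def ingest_draft(draft_id: str, draft: str, citations: list[tuple[int, str]]):
    """Add each idea as its own node, then link it to ideas already in the graph."""
    new_nodes = []
    for i, idea in enumerate(extract_ideas(draft)):
        node_id = f"{draft_id}:{i}"
        graph.add_node(node_id, text=idea, source=draft_id)
        new_nodes.append(node_id)
    # citations: (index of new idea, id of an existing node it builds on)
    for idea_index, existing in citations:
        graph.add_edge(new_nodes[idea_index], existing, relation="cites")

# Toy usage: seed the graph, then ingest a new draft that cites an older idea.
graph.add_node("paper-1:0", text="Baseline method under noise", source="paper-1")
ingest_draft("paper-2", "Extend the baseline with a filter\nAn unrelated aside",
             citations=[(0, "paper-1:0")])
print(list(graph.edges(data=True)))
```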

In practice, this looked like citation-aware scoring. Ideas that were repeatedly cited by other generated artifacts gained weight. Ideas that connected strongly to previous work gained weight. Ideas that were isolated, redundant, or weakly supported lost weight. We ranked the entire set using a PageRank-inspired algorithm, so the system could prioritize ideas based on structure rather than intuition.
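
Here is a minimal sketch of that ranking, using networkx's built-in pagerank as the PageRank-inspired component; the toy graph and edge weights are illustrative only.

```python
import networkx as nx

# Toy citation graph between generated ideas: an edge u -> v means
# "artifact u cites / builds on idea v", so frequently cited ideas gain weight.
g = nx.DiGraph()
g.add_edge("draft-2:0", "draft-1:0", weight=1.0)   # strong connection to prior work
g.add_edge("draft-3:1", "draft-1:0", weight=1.0)   # cited again by a later artifact
g.add_edge("draft-3:0", "draft-2:0", weight=0.5)   # weaker supporting link
g.add_node("draft-2:1")                            # isolated idea: no support at all

# PageRank accumulates score on nodes with incoming support; isolated or
# weakly supported ideas stay near the baseline and fall to the bottom.
scores = nx.pagerank(g, alpha=0.85, weight="weight")
for node, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {score:.3f}")
```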

This mattered because evaluation must be contextual. A hypothesis cannot be judged in isolation. It must be judged relative to alternatives, relative to dependencies, and relative to the evidence and contradictions around it. The decision layer made that possible, and it turned a stream of outputs into a structured state the system could act on.

Why a graph instead of another model call: explainability

A graph provides explicit state that persists across iterations. It preserves relationships between ideas, not just text. It makes dependencies, contradictions, and supporting evidence first-class objects. It lets the system compare many alternatives at once instead of re-evaluating them in isolation.

Most importantly, a graph makes decisions explainable. You can inspect what was selected, what was ignored, and why. You can trace the path from an action back to the ideas and evidence that justified it.
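
A small sketch of what such a trace can look like, assuming the same kind of typed idea graph; the relation labels here are assumptions chosen for illustration, not the exact vocabulary of the hackathon system.

```python
import networkx as nx

# Small idea graph with typed edges.
g = nx.DiGraph()
g.add_edge("exp-plan", "hypothesis-A", relation="depends_on")
g.add_edge("draft-2:0", "hypothesis-A", relation="supports")
g.add_edge("draft-3:1", "hypothesis-A", relation="supports")
g.add_edge("draft-3:0", "hypothesis-B", relation="contradicts")

def explain(graph: nx.DiGraph, selected: str) -> list[str]:
    """Return a human-readable trace of what justified (or undermined) a node."""
    return [
        f"{src} --{data['relation']}--> {selected}"
        for src, _, data in graph.in_edges(selected, data=True)
    ]

# Usage: the decision to act on hypothesis-A can be inspected after the fact.
for line in explain(g, "hypothesis-A"):
    print(line)
```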

What the hackathon validated

Winning the OpenAI Codex Hackathon was not the interesting part. The interesting part was what the system forced us to build.

It confirmed a practical conclusion: continuous agent systems require explicit evaluation. Prompting alone does not provide sufficient control. Structured decision logic scales better than ad-hoc heuristics.

This is also a small glimpse of Rippletide's convictions and technology. We do not bet on agents as the product. We bet on the infrastructure that makes continuous systems reliable, explainable, and safe to run.

Why this matters beyond research

This pattern is not specific to scientific discovery. It is what happens whenever a system is expected to run continuously and produce actions, not just suggestions.

It shows up in enterprise workflows, where systems must handle long-lived processes and messy state. It shows up in long-running automations, where small errors compound over time. It shows up in regulated environments, where every decision must be traceable and defensible.

The moment a system acts continuously, evaluation stops being optional. It becomes infrastructure. Not because it is elegant, but because without it the system cannot converge, cannot be trusted, and cannot be deployed.

OpenAI Hackathon Jury

Why Rippletide focuses on decision infrastructure

This experiment is one of the reasons Rippletide focuses on decision and evaluation infrastructure rather than on building agents themselves.

The future is not a world with more agents. The future is a world where agents generate continuously, but systems decide selectively.

Outcomes require decisions.

Want to try our hypergraph for controlled decisions? Click Here
