Will the first AI model that saturates Humanity's Last Exam be employable as a software engineer?
Will the first AI model that receives a score of 75 or higher on Humanity's Last Exam be capable (with agent scaffolding) of replacing a software engineer?

Resolves based on my personal judgement, in particular whether it is cost- and time-effective for ZeroPath to use the model to replace one of our engineers (or to accomplish the same amount with fewer people).

Example tasks it should be capable of:
"Fix this error we're getting on BetterStack."
"Move our Redis cache from DigitalOcean to AWS."
"Add and implement a cancellation feature for ZeroPath scans."
"Add the results of this evaluation to our internal benchmark."

I will not be betting, but let it be known that I am pessimistic about the state of current evals.

Update 2025-03-17 (PST) (AI summary of creator comment): Clarification on team size reduction: "fewer people" in the parenthetical (or accomplish the same amount with fewer people) is defined to mean eighty percent of the original team size. That is, the AI would allow us to do the same amount of engineering work with 20% fewer people than was possible in March 2024.

Update 2025-07-31 (PST) (AI summary of creator comment): In response to a question about the market's timeline, the creator has stated that the closing date will be extended as needed.