I would consider any system that solves similar problems to be AGI. What I suspect will happen is that this benchmark will saturate long before any such system exists.
That's basically why they built this benchmark. By posing challenging, seemingly unfair questions, the system is forced to exhibit at least some generalisation ability that it previously did not possess.
When I look at the benchmark questions, they all seem to exploit the fact that LLMs are bad at composing subtasks. They might be able to solve each individual subtask, but not the combination of them.
Solving that would still be a far cry from AGI.
Because of benchmark leakage / contamination?