AI Benchmarks Outdated for Measuring Chatbot Abilities for Average Users
- Current AI benchmarks often test narrow, niche capabilities that have little bearing on how most people use AI chatbots in real life, so they fail to capture typical use cases.
- Many popular benchmarks are three or more years old and no longer reflect how AI systems are used now that they serve broad consumer purposes beyond research.
- Benchmark questions can contain errors such as typos or nonsensical phrasing, calling into question what exactly these benchmarks measure.
- Some benchmarks appear to test rote memorization more than genuine reasoning and causal understanding; models can score well simply by associating keywords.
- Experts recommend combining benchmarks with human evaluation of model responses to genuine user queries, though some question whether benchmarks can ever properly evaluate real-world AI impact.