sforge added to PyPI
SForge: Evaluation harness for frontier agents
Overview
EdgeBench is a benchmark of 134 real-world tasks for evaluating how autonomous AI agents learn from real-world environments. Instead of measuring one-shot performance, EdgeBench places agen… [+19162 chars]