passwedge added to PyPI
Reliability-science metrics for repeated-attempt evaluation of long-horizon LLM agents: pass@k, pass^k, Bayesian posteriors, RDC/VAF/GDS/MOP.
pass@1 tells you whether a model can do a task once. It says nothing about whether an agent does so consisten…
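To make the difference between the two headline metrics concrete, here is a minimal sketch of the commonly used combinatorial estimators for pass@k (at least one of k attempts succeeds) and pass^k (all k attempts succeed), computed from n recorded attempts of which c succeeded. This is an illustration under stated assumptions, not passwedge's API: the function names `pass_at_k` and `pass_hat_k` are hypothetical, and the package may define or compute these metrics differently.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n attempts were run, c of them succeeded, and we draw k of them.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts succeeds).

    Hypothetical helper: one minus the fraction of size-k subsets of the
    n attempts that contain no success.
    """
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed), often written pass^k.

    Hypothetical helper: the fraction of size-k subsets of the n attempts
    that consist only of successes.
    """
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 10, 7  # made-up numbers for illustration: 7 successes in 10 attempts
    print(f"pass@1 = {pass_at_k(n, c, 1):.4f}")   # 0.7000
    print(f"pass@3 = {pass_at_k(n, c, 3):.4f}")   # 0.9917
    print(f"pass^3 = {pass_hat_k(n, c, 3):.4f}")  # 0.2917
```

With the same 70% single-attempt success rate, pass@3 is close to 1 while pass^3 falls below 0.3, which is the gap between "can do it" and "does it consistently" that repeated-attempt evaluation is meant to expose.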