N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?
N-Day-Bench tests whether frontier LLMs can find known security vulnerabilities in real repository code. Each month it pulls fresh cases from GitHub security advisories, checks out the repo at the last commit before the patch, and gives models a sandboxed bas…
N-Day-Bench measures the capability of frontier language models to find real-world vulnerabilities or "N-Days" disclosed post their respective knowledge cut-off date. All models are given the same ha… [+413 chars]