reward-lens added to PyPI

  • PyPI.org
  • published date: 2026-04-12 13:34:36 UTC

Mechanistic interpretability toolkit for reward models

reward-lens describes itself as the first comprehensive open-source library for understanding what happens inside reward models, the models that define the RLHF training signal. Reward-lens…