reward-lens added to PyPI
Mechanistic interpretability toolkit for reward models. The first comprehensive open-source library for understanding what happens inside the models that define the RLHF training signal. Reward-lens…