News

Training a Reward Hacker Despite Perfect Labels

  • ariana_azarbal--Lesswrong.com
  • published date: 2025-08-14 23:57:21 UTC

Published on August 14, 2025 11:57 PM GMT

Summary: Perfectly labeled outcomes in training can still boost reward hacking tendencies in generalization. This can hold even when the train/test sets are drawn from the exact same distribution. We induce this surpr…
