Training a Reward Hacker Despite Perfect Labels
Published on August 14, 2025 11:57 PM GMT

Summary: Perfectly labeled outcomes in training can still boost reward hacking tendencies in generalization. This can hold even when the train/test sets are drawn from the exact same distribution. We induce this surpr…