Human Feedback Attack on Online RLHF: Attack and Robust Defense

Chenye Yang, Mo Lyu, Guanlin Liu, Lifeng Lai. Human Feedback Attack on Online RLHF: Attack and Robust Defense. IEEE Transactions on Signal Processing, 73:3886-3901, 2025.

Abstract

Abstract is missing.