Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang. Interpretable Reward Model via Sparse Autoencoder. In Sven Koenig, Chad Jenkins, Matthew E. Taylor, editors, Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational Advances in Artificial Intelligence, AAAI 2026, Singapore, January 20-27, 2026. Pages 34808-34816, AAAI Press, 2026.