This project uses sparse autoencoders to track feature development in language models during training, laying the groundwork for understanding and preventing the emergence of competent deceptive behaviors.