Rogan Inglis

Sparse Features Through Time

This project uses sparse autoencoders to track feature development in language models during training, laying the groundwork for understanding and preventing the emergence of competent deceptive behaviors.

5 days ago