#LLM | Patrick’s Blog

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (ICLR ’25)

Posted on Tue, Jan 14, 2025 📖 Note LLM Interpretability Causal

Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training

Posted on Fri, Jan 10, 2025 📖 Note LLM Alignment

Stage-Wise Model Diffing

Posted on Thu, Jan 2, 2025 📖 Note LLM Interpretability

Trained Transformers Learn Linear Models In-context (JMLR ’24)

Posted on Mon, Oct 21, 2024 📖 Note LLM