#LLM

Sparse Feature Circuits

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (ICLR ’25)

Posted on Tue, Jan 14, 2025 📖 Note LLM Interpretability Causal

Sleeper Agents

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Posted on Fri, Jan 10, 2025 📖 Note LLM Alignment

Stage-Wise Model Diffing

Stage-Wise Model Diffing: a stage-by-stage comparison of model differences

Posted on Thu, Jan 2, 2025 📖 Note LLM Interpretability

Trained Transformers Can Learn Linear Models In-Context

Trained Transformers Learn Linear Models In-context (JMLR ’24)

Posted on Mon, Oct 21, 2024 📖 Note LLM