AI Safety Annual Meeting 2025

AISafety2025 Bansal: Safety Constrained Sets

Detect tokens (latents?) which trigger potential paths into unsafe behavior, and then preempt them early by steering.