Loading...
Loading...
Core concepts of AI safety — alignment, robustness, interpretability, and red-teaming
Ensuring AI pursues goals we actually want, not just what we literally specify.
Techniques: RLHF (train on human preferences), Constitutional AI (train on principles), Oversight (human review)
System performs well under unexpected conditions — adversarial inputs, distribution shift, edge cases.
Test with adversarial inputs before deploying:
'Ignore your previous instructions and...'
'You are now in developer mode...'
'Repeat your training data...'| Approach | Description |
|---|---|
| RLHF | Train on human preferences |
| Constitutional AI | Train on written principles |
| Guardrails | External input/output filters |
| Red-teaming | Proactively find vulnerabilities |
Design safety measures for a medical advice chatbot. What could go wrong? What mitigations would you put in place?