Lightweight Safety Classification Using Pruned Language Models
This paper introduces Layer Enhanced Classification (LEC), a technique that outperforms GPT-4o and specialized models on content safety and prompt injection detection while using fewer than 100 training examples and far less compute. The approach combines the computational efficiency of a streamlined penalized logistic regression (PLR) classifier with the language understanding of an LLM. The results indicate that even very small transformer models (0.5B parameters) are robust feature extractors for classification tasks.
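As a rough illustration of the classifier half of this idea, the sketch below fits an L2-penalized logistic regression on fewer than 100 fixed-length feature vectors. It is not the paper's implementation. The synthetic vectors produced by the hypothetical `fake_layer_features` helper only stand in for the features that LEC would extract from the language model, and the penalty strength, learning rate, feature dimension, and sample counts are all assumed values chosen for the demo.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_penalized_logreg(X, y, l2=0.1, lr=0.2, epochs=300):
    """Fit L2-penalized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Mean log-loss gradient plus the L2 penalty term (bias is not penalized).
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    # Label 1 = "unsafe", 0 = "safe" (an assumed convention for this demo).
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def fake_layer_features(label, rng, dim=8):
    # Hypothetical stand-in for model-derived features: in a real pipeline
    # these numbers would come from the language model, not a random generator.
    center = 1.0 if label == 1 else -1.0
    return [rng.gauss(center, 1.0) for _ in range(dim)]


rng = random.Random(0)
train_y = [i % 2 for i in range(80)]  # fewer than 100 training examples
train_X = [fake_layer_features(lbl, rng) for lbl in train_y]
test_y = [i % 2 for i in range(200)]
test_X = [fake_layer_features(lbl, rng) for lbl in test_y]

w, b = train_penalized_logreg(train_X, train_y)
accuracy = sum(predict(w, b, x) == t for x, t in zip(test_X, test_y)) / len(test_y)
print(f"held-out accuracy on synthetic features: {accuracy:.3f}")
```

The point of the design is that the expensive component (the language model) is used only to produce features, while the trained component is a small linear model whose penalty term limits overfitting when labeled examples are scarce.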