Show HN: Improving Prompt Injection Detection with Weighted Ensembles

Hi HN — I’m the author.

This project started as an open-source system for detecting prompt injections in LLMs. The goal is to flag adversarial prompts before they reach a model, while keeping latency low and probabilities well-calibrated.

The main insight came from ensembles: not all models are equally good at every case. Instead of just averaging outputs, I:

1. Benchmarked each candidate model first to see what it actually contributes.

2. Remove models that don’t improve the ensemble through ablation studies (e.g., ProtectAI's Deberta finetune was dropped as it only contributed 0.5% to ECE and actually decreased accuracy).

3. Weight predictions by each model’s accuracy, letting models specialize in what they’re good at.

With this approach, the ensemble is smaller (~237M parameters vs ~600M for the leading baseline), 2x faster, and more calibrated (lower Expected Calibration Error) while still achieving competitive accuracy. Lower confidence on wrong predictions makes it safer for “human-in-the-loop” fallback systems.

For more info, you can check it out here: https://github.com/appleroll-research/promptforest

This project is open to all forms of contributions, and I’d love to hear feedback from the HN community — especially on ideas to further improve calibration, robustness, or ensemble design.