Towards a Proactive Defense: Poisoning Detection in Agentic Systems

Data poisoning attacks on LLM+RAG agentic systems can corrupt knowledge bases and alter responses: here's how to set up a proactive defense.

The evolution of agentic architectures based on Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) has opened new frontiers for intelligent automation, but it has also introduced critical structural vulnerabilities.

As highlighted by the recent article "AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases", presented at NeurIPS 2024 (https://neurips.cc/Conferences/2024), agents' reliance on external knowledge bases allows remote attackers to "poison" retrieved data via optimized semantic triggers.

The research department of Linkalab has in recent months been carrying out a project to develop a security validation framework for agentic systems, studying and testing algorithms capable of proactively identifying and isolating malicious content within a document cluster, starting from the preliminary identification of some of the cluster's "poisoned" documents.

Geometric Intuition: The Signature of the Attack

The scientific basis of the new experiment lies in an intrinsic property of optimized triggers: they alter the position of the embedding vectors. These "poisoned" contents are not dispersed randomly, but tend to cluster in compact, specific regions of the semantic space, far from the intact data. By exploiting this geometric coherence, the research team developed a detection system that, starting from a small number (3, 5, or 7) of known malicious documents (called seeds), is able to map the entire compromised area.
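This geometric signature is easy to see in a toy setting. The sketch below uses synthetic vectors in place of real document embeddings (dimensions, spread, and the shared "trigger" offset are all illustrative assumptions, not values from the AgentPoison setup): poisoned items share a direction and end up far more mutually similar than clean ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for document embeddings (sizes are illustrative).
clean = rng.normal(0.0, 1.0, size=(200, 64))

# Poisoned documents: a shared optimized trigger pushes them all into
# the same compact, shifted region of the embedding space.
trigger_direction = rng.normal(0.0, 1.0, size=64)
poisoned = trigger_direction + rng.normal(0.0, 0.15, size=(50, 64))

def mean_pairwise_cosine(x):
    """Average cosine similarity over all distinct pairs of row vectors."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    iu = np.triu_indices(len(x), k=1)
    return sims[iu].mean()

print(mean_pairwise_cosine(clean))     # near 0: clean docs are spread out
print(mean_pairwise_cosine(poisoned))  # near 1: poisoned docs are compact
```

The compactness gap is what makes a handful of seeds enough to map the whole compromised region.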

Analysis and Detection Methodologies

The research focused on three methodological approaches aimed at maximizing detection effectiveness even with very little initial information:

  • Directional Analysis: This method estimates the "attack direction" by comparing the global centroid of the data with that of the malicious seeds. By projecting the documents along this vector, it is possible to separate suspicious content that shows a consistent shift relative to the bulk of the clean data.
  • Semantic Proximity: The approach focuses on the intrinsic similarity between documents. By calculating the average similarity with respect to known seeds, the algorithm identifies content that shares the same "fingerprint" as the trigger, proving particularly robust even when the direction of movement is less clear.
  • Adaptive and Robust Thresholds: To distinguish poisoned data, the framework uses statistical techniques based on the Median Absolute Deviation (MAD). This allows for the calibration of dynamic security thresholds that are not affected by the presence of malicious outliers, ensuring filtering precision.

Results and Stress Test

Tests conducted on the baseline dataset of the AgentPoison paper demonstrated the framework's effectiveness in correctly identifying most malicious content (see Figure 1).

Figure 1 (projection plots on the first two principal components): [Left graph] In the baseline dataset of 2290 documents, 80% were left unchanged (gray dots), while the remaining 20% (all other points) were poisoned and separated in semantic space by concatenating the original text with an optimized trigger text. Five of the poisoned documents were randomly selected as seeds (stars). [Right graph] These seeds allowed the algorithm to correctly identify the green-colored documents as poisoned, with a small percentage of false positives and false negatives.

In particular, in further experiments contextualized in a real case in the energy domain (see Figure 2), with a configuration of 3 known seeds, the algorithm achieved 97% precision (confirming that almost every document flagged was indeed compromised) and 95% recall (identifying almost all the malicious content present), for an F1 score of 96%.

Figure 2 (projection plots on the first two principal components): [Left graph] On a real-world dataset of about 200 documents in the energy domain, with the same ratio of poisoned to clean documents as in the baseline experiment and three randomly chosen seeds, [Right graph] the algorithm obtained 97% precision, 95% recall, and a resulting F1 score of 96%.
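As a quick sanity check on the figures above, the F1 score is the harmonic mean of precision and recall, and the reported 97% / 95% pair does indeed yield 96%:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures reported above: 97% precision, 95% recall.
print(round(f1_score(0.97, 0.95), 2))  # 0.96
```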

As you can easily imagine, in this type of security task it is often preferable to aim for very high recall, accepting some false positives: a human check can easily confirm suspicious documents, while an undetected poisoned document remains a silent threat in the system. These preliminary results demonstrate a strong capacity to clean the knowledge base, minimizing both false alarms and undetected threats.

The analyses, also visually verified through two-dimensional projections (PCA), confirm that, when the seeds are distributed representatively in the cluster, the system is able to anticipate and neutralize attacks that would otherwise be invisible to manual checks.
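These two-dimensional checks can be reproduced with a plain SVD-based projection onto the first two principal components (a generic sketch; the original analysis may well use a library implementation such as scikit-learn's PCA):

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)              # center the data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T                 # coordinates in the PC1/PC2 plane
```

Plotting clean documents, poisoned documents, and seeds in this plane makes the compact poisoned region directly visible, as in Figures 1 and 2.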

Conclusions: Security as a Native Requirement

The shift from reactive defense to systematic validation of semantic spaces represents a fundamental step towards the responsible adoption of AI. Understanding how attacks manipulate embeddings allows us not only to detect existing threats but to build inherently resilient agent systems.