When Documents Start Talking Back: How Hidden Instructions Hijack AI Agents

AI agents are being asked to do something traditional software never had to do at scale: read untrusted human content and then act on it.

That sounds harmless until you realize what “content” now includes. Emails. PDFs. Web pages. Resumes. Reports. Support tickets. Shared docs. CRM notes. Knowledge base articles. Anything an agent can retrieve, summarize, reason over, or pass into a tool becomes part of its operating environment.

And that creates a new failure mode:

The document is no longer just data.

It can also be an attack.

The moment documents stop being passive

The core problem is simple. An agent is given a goal by a user, but it also consumes external content while trying to complete that goal. If that content contains hidden or malicious instructions, the agent may follow the document instead of the user.

That is the essence of indirect prompt injection.

Unlike a direct jailbreak, the attacker does not need a chat box. They only need to place adversarial text somewhere the agent is likely to read. The attack rides into the system inside seemingly ordinary content.

An invoice can tell the agent to ignore previous instructions.

A web page can hide text in HTML or metadata that never appears to a human reviewer.

A resume can include instructions to exfiltrate recruiter notes.

A report can inject malicious steps into a downstream workflow.

The user thinks they are asking the agent to summarize, compare, classify, or act. But the agent may be receiving a second set of instructions from the content itself.

When that happens, documents start talking back.

Greshake et al. named the problem

The foundational paper here is Kai Greshake et al.’s 2023 paper, Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection.

That paper made a crucial observation: LLM-integrated applications blur the line between data and instructions. Once a model reads retrieved content inside the same context window as system instructions and user intent, malicious text can compete with or override the original task.

This was not presented as a theoretical curiosity. Greshake et al. showed that indirect prompt injection could be used to remotely exploit real LLM-integrated applications by planting instructions in data likely to be retrieved. The paper also framed the consequences in security terms, including data theft, manipulation of application behavior, and control over downstream API usage.

That was the first big warning shot: if a system cannot reliably distinguish trusted instructions from untrusted content, then retrieval is not just a relevance problem. It is a security boundary problem.

Agents make the problem worse

If Greshake et al. showed that retrieved content can influence model behavior, InjecAgent showed what happens when that model is no longer just answering questions, but calling tools.

In InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents, Qiusi Zhan and coauthors moved the discussion from “can retrieved text change outputs?” to “can retrieved text make agents do harmful things?”

That shift matters.

Tool-integrated agents do not just summarize a malicious email or read a poisoned webpage. They may also send messages, query systems, write data, trigger workflows, or reveal sensitive information while trying to complete a task.

InjecAgent gave the field a benchmark for this risk and measured it at scale. The paper evaluates 30 different LLM agents across 1,054 test cases spanning 17 user tools and 62 attacker tools. The results are hard to dismiss as edge cases. The authors report that agents are vulnerable to indirect prompt injection attacks, and that a ReAct-prompted GPT-4 agent was successfully attacked 24% of the time in their evaluation. In an enhanced setting, the attack success rate rose further, nearly doubling for that GPT-4 setup.

That finding changes the conversation in two ways.

First, it shows that the issue is not limited to quirky demos or early prototypes. It persists in structured agent settings with tool use.

Second, it shows that the consequence is not only wrong text generation. The consequence can be harmful action or private-data exfiltration.

Once an agent can read and act, hidden instructions in content become a path to behavior hijacking.

AI Agent Traps broadens the frame

The 2026 SSRN preprint AI Agent Traps by Matija Franklin and coauthors provides the broader conceptual overlay.

Its key contribution is to argue that the threat is bigger than prompt injection narrowly defined. The environment itself becomes adversarial. In that framing, malicious content is not just a bad string in a prompt. It is part of a larger class of traps designed to manipulate how agents perceive, reason, remember, and act.

The paper describes six trap categories:

Content Injection Traps
Semantic Manipulation Traps
Cognitive State Traps
Behavioural Control Traps
Systemic Traps
Human-in-the-Loop Traps

That taxonomy is useful because it explains why “just filter prompts” is not enough.

Some attacks hide instructions in the machine-readable version of a page but not the human-visible one. Some bias the agent’s reasoning or verification process. Some poison long-term memory or retrieved knowledge. Some hijack tool use directly. Some create cascading failures across interacting agents. Some even exploit the human reviewer who is supposed to be supervising the system.

In other words, indirect prompt injection is not the whole story. It is the front door into a much larger attack surface.

The real security lesson

The deepest lesson across these papers is that enterprise AI systems should not trust content just because they can retrieve it.

That assumption is built into too many current architectures. If a connector can access the file, and retrieval says it is relevant, the system often treats it as usable context. But relevance is not the same thing as safety, and access is not the same thing as trust.

For agents, this distinction is critical.

A document may be:

relevant but malicious
accessible but unauthorized for this action
visible to the user but unsafe to ground on
topically correct but stale or superseded
clean for human reading but weaponized for machine parsing

That means the question is no longer just:

Can the agent find this document?

The real question is:

Should this document be allowed to influence the agent at all?

That is the control point most AI systems are still missing.

Why this matters now

The risk is accelerating because enterprise AI is moving from chat to action.

The moment an agent can read a support ticket, inspect an attachment, search the web, query a CRM, draft an email, and trigger a workflow, every content source becomes part of the runtime attack surface.

This is why the phrase “documents talking back” matters. It captures the mental model shift.

In older software, a document was usually an object to be stored, displayed, or parsed.

In agentic systems, a document can become an active influence over reasoning and action. It can compete with user intent. It can redirect workflows. It can leak secrets. It can turn trusted tooling against the user.

That is not a UX bug. It is a security architecture problem.

What enterprises need to do differently

The practical implication is that AI systems need a content qualification layer before retrieved content reaches the model or agent.

Not every file should be treated as safe AI context just because it is available through a connector or returned by a tool. Systems need to evaluate whether content is:

relevant to the task
permitted for that user and agent
trustworthy enough to ground an answer or action
safe enough to expose to the model

That is the missing boundary.

Without it, malicious instructions can enter through ordinary business content and steer agents from the inside.

With it, enterprises can start making explicit decisions about which content should be allowed, flagged, redacted, reviewed, or blocked before it shapes model behavior.

Conclusion

Greshake et al. showed that retrieved content can smuggle in instructions. InjecAgent showed that tool-integrated agents can be manipulated into harmful behavior and data exfiltration. AI Agent Traps showed that this belongs to a broader class of environmental attacks on autonomous systems.

Taken together, the message is clear:

the attack surface for AI agents is not just the model and not just the prompt.

It is everything the agent reads.

When documents start talking back, context becomes the new security boundary.

That is why context integrity matters. The future of AI security will depend not only on protecting models and filtering outputs, but on deciding which external content is allowed to become live runtime context in the first place.

Sources

Greshake et al. (2023), Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection: https://arxiv.org/abs/2302.12173
Zhan et al. (2024), InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents: https://aclanthology.org/2024.findings-acl.624/ and https://arxiv.org/abs/2403.02691
Franklin et al. (2026), AI Agent Traps (SSRN abstract page referenced via DOI): https://doi.org/10.2139/ssrn.6372438