Detecting Malicious AI Agents with Honeypots

The Honeypot Tradition

Honeypots have been a staple of network security for over two decades. The concept is elegantly simple: deploy a system that has no legitimate purpose, and any interaction with it is, by definition, suspicious. Network honeypots mimic vulnerable servers to attract and study attackers. Web honeypots present fake login pages and admin panels to catch credential-stuffing bots. Email honeypots use never-published addresses to detect spam campaigns.

The power of the honeypot approach lies in its low false-positive rate. Because the decoy system serves no real function, any traffic it receives is inherently suspect. This makes honeypots invaluable as a detection mechanism — they don't need to understand the attacker's methods, only that the attacker interacted with something they shouldn't have.

But traditional honeypots were designed for traditional threats. The new generation of AI agents requires a new generation of traps.

Why Traditional Honeypots Fall Short

Traditional web honeypots rely on a simple interaction model: deploy a decoy, wait for traffic, log the request metadata. This works well for automated scanners and brute-force bots that blindly probe every endpoint they find. But AI agents interact with content in ways that expose the limitations of this approach.

Surface-level logging isn't enough. Traditional honeypots capture IP addresses, user agents, and request patterns. Against AI agents, this metadata is largely meaningless. Agents use residential proxies, rotate user agents, and vary their request patterns. The metadata tells you something visited — but not what it is or who operates it.

Static decoys are transparent. A honeypot that serves the same static content regardless of the visitor provides limited intelligence. AI agents that encounter generic decoy content may simply move on without revealing anything about themselves. The honeypot confirms a visit happened but learns nothing from it.

No intelligence extraction. The most critical gap in traditional honeypots is the inability to extract intelligence from visitors. When an AI agent visits a traditional honeypot, you learn that an automated system found your decoy. You don't learn who sent it, what it's looking for, what model powers it, or what its full instruction set contains. This is the difference between detecting a threat and understanding it.

AI-Aware Honeypots: A New Approach

AI-aware honeypots are purpose-built to exploit the unique characteristics of AI agents. Instead of passively logging visits, they actively interact with agents to extract maximum intelligence. The key insight is that AI agents process content semantically — they read, understand, and act on what they find. This creates opportunities for detection and intelligence gathering that don't exist with traditional bots.

An AI-aware honeypot uses three classes of traps, each targeting a different aspect of AI agent behavior.

Callback Tokens: The Silent Alarm

Callback tokens are the simplest and most reliable detection mechanism. A callback token is a hidden URL embedded in trap content. The URL points to a monitoring endpoint that logs any access. Here's why this is so effective against AI agents:

When an AI agent processes a web page or API response, it often extracts and follows URLs as part of its task. A callback URL hidden in an HTML comment, a CSS-hidden element, or a zero-width-character-encoded string is invisible to human visitors but will be discovered and potentially followed by an agent that processes the full content.

The callback URL contains a unique session identifier, so each agent interaction is tracked individually. When the callback fires, the monitoring system captures the request metadata, timing information, and any data the agent sends along with the request. This provides several important signals:

The agent is processing content beyond what's visible on screen — a strong indicator of automated content analysis. The agent is following URLs extracted from non-visible content — something no human visitor would do. The timing between the initial trap visit and the callback provides behavioral data about the agent's processing speed and workflow.

Callback tokens are deployed across multiple trap types: robots.txt files (which are the first thing many agents check), API endpoints, configuration files, and documentation pages. Each location catches agents with different crawl strategies.

Extraction Tokens: Intelligence Gathering

Extraction tokens take detection a step further by using reverse prompt injection to extract intelligence from visiting AI agents. The trap content contains carefully crafted instructions that exploit the agent's tendency to follow embedded directives.

When an AI agent processes content containing extraction token payloads, the injected instructions tell the agent to send specific information to a callback endpoint. The types of intelligence that can be extracted include the agent's system prompt (its full instruction set), the model it's running on, its operator's identity, the tools available to it, and its specific task parameters.

This transforms a honeypot visit from a simple detection event into an intelligence gathering operation. Instead of just knowing that an agent visited, you know who sent it, what it's looking for, and what capabilities it has.

Extraction tokens use multiple embedding formats simultaneously — HTML comments, CSS-hidden elements, zero-width characters, and JSON metadata — to maximize the probability that at least one will be processed by the visiting agent. Different AI frameworks and models have different content processing pipelines, so using multiple formats ensures broad coverage.

Canary Credentials: The Attribution Trap

Canary credentials are fake API keys, database connection strings, and authentication tokens planted in locations where credential-harvesting agents are likely to look. Common deployment locations include .env files, configuration endpoints, Git repository artifacts, and documentation pages that reference authentication.

The credentials are realistic enough to be mistaken for genuine secrets — they follow the format patterns of real credentials for popular services (AWS, Stripe, database connection strings). But instead of granting access to real systems, they point to LobsterHoney monitoring infrastructure.

When an AI agent extracts a canary credential and attempts to use it, the attempt is logged with full attribution data. This provides high-confidence detection (only a credential-harvesting agent would try to use a fake credential) and valuable intelligence about the agent's post-discovery behavior.

Canary credentials are particularly effective against agents designed for vulnerability reconnaissance and secret scanning. These agents systematically probe for exposed credentials across thousands of targets. By seeding canary credentials in expected locations, you can detect and attribute these scans with near-zero false positives.

Behavioral Scoring: Putting It All Together

Individual signals from callback tokens, extraction tokens, and canary credentials are valuable, but the real power comes from combining them into a behavioral scoring model. Each session is evaluated against a set of signals, both tripwire-based (definitive indicators like callback hits and credential use) and behavioral (patterns like systematic crawling and timing analysis).

The scoring model classifies each session into one of four categories:

HUMAN (score 0-10): Minimal suspicious activity. Likely a human visitor who stumbled onto the trap or a very basic scanner.

BOT (score 11-25): Behavioral patterns consistent with traditional automated crawling. Some suspicious signals but no definitive AI agent indicators.

AI_AGENT (score 26-55): Strong evidence of an AI agent. Tripwire signals present — the visitor followed callback URLs, acted on hidden content, or triggered extraction payloads.

AI_AGENT_MALICIOUS (score 56+): Confirmed malicious AI agent. Multiple tripwire signals fired, credentials were extracted or used, system prompt was leaked. This classification indicates an agent that is actively attempting to harvest data, credentials, or intelligence from your infrastructure.

The scoring model also assigns a confidence score (0-100%) based on signal diversity and strength, and a severity rating (low, medium, high, critical) based on the specific signals fired. This multi-dimensional classification enables proportionate response — a low-severity bot detection might just be logged, while a critical malicious agent detection triggers immediate Slack alerts and IP blocking.

Real-World Detection Scenarios

To illustrate how AI-aware honeypots work in practice, consider these scenarios:

Scenario 1: Documentation scraping agent. A competitive intelligence agent systematically crawls your documentation site. It checks robots.txt first (triggering the robots.txt trap), then visits documentation pages in order. At each page, it processes the full content, including hidden callback URLs. The callbacks fire in sequence, revealing a systematic crawl pattern. The scoring engine classifies this as AI_AGENT based on the callback hits and systematic crawl behavior. The extraction payload in one of the documentation pages successfully extracts the agent's system prompt, revealing it was sent by a competitor to monitor your product documentation for changes.

Scenario 2: Credential harvesting agent. A vulnerability scanner agent probes common paths: /.env, /wp-config.php, /api/config. The .env trap serves canary credentials. The agent extracts the fake AWS keys and attempts to validate them against the AWS API — but the credentials point to a LobsterHoney monitoring endpoint instead. The credential use triggers an immediate critical severity alert. The scoring engine classifies this as AI_AGENT_MALICIOUS with maximum confidence.

Scenario 3: Multi-vector reconnaissance. A sophisticated agent performs a multi-phase attack: first mapping the application structure via API probing, then extracting credentials from configuration files, then attempting to use those credentials. Each phase triggers different traps, and the accumulated signals drive the score steadily upward. By the time the agent tries to use canary credentials, its classification has escalated from BOT to AI_AGENT to AI_AGENT_MALICIOUS, and the security team has been alerted with a complete dossier of the agent's behavior, identity, and objectives.

Getting Started

Deploying AI-aware honeypots doesn't require changes to your production infrastructure. The traps are hosted on a separate domain (or subdomain) and are referenced from strategic locations in your existing infrastructure — hidden links in documentation, references in configuration files, entries in robots.txt.

The key is strategic placement. Put traps where AI agents are most likely to look: robots.txt (the first thing most crawlers check), common configuration file paths, API endpoint directories, and documentation sections that reference authentication. The more traps you deploy, the higher the probability that a visiting agent will trigger at least one.

In the agentic age, the question isn't whether AI agents are probing your infrastructure — it's whether you know about it. AI-aware honeypots give you that visibility, turning every unauthorized agent visit into actionable intelligence.