In our previous blog post, we introduced Red Sift’s AI Agent for lookalike classification – an intelligent system that determines whether a suspicious domain has been deliberately crafted to mimic a legitimate one or if the resemblance is merely coincidental. That post focused on the what and why of the solution: why rule-based automation alone can’t reliably judge nuanced similarities, and how an AI-driven approach can bring human-like reasoning to the problem.
In this post, we shift gears to the how – exploring the key technical challenges behind building a scalable, agentic system and the engineering decisions that enabled it to reason, adapt, and operate reliably at scale.
Problem formulation
At its core, the task can be framed as a two-way classification problem: determining whether a lookalike domain is a legitimate website or a potential impersonation. Alongside the binary decision, the system must also produce a natural-language explanation detailing the reasoning behind the verdict, making the output more interpretable and auditable. The input for each evaluation consists of a single lookalike domain and its corresponding legitimate asset, enriched with multiple contextual signals collected by Brand Trust – including domain names, screenshots, DNS and WHOIS information, and customer context, among others.
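As a rough illustration, the input/output contract can be sketched as a pair of simple records (the field names below are our own shorthand, not the actual Brand Trust schema):

```python
from dataclasses import dataclass, field

@dataclass
class LookalikeCase:
    """One evaluation input: the lookalike, its legitimate asset, and enrichment signals.
    Field names are illustrative, not the actual Brand Trust schema."""
    lookalike_domain: str                  # e.g. "examp1e-login.com"
    legitimate_asset: str                  # e.g. "example.com"
    screenshot_path: str | None = None     # rendered page capture, if available
    dns_records: dict = field(default_factory=dict)
    whois: dict = field(default_factory=dict)
    customer_context: str = ""             # brands, partners, known rebrands, etc.

@dataclass
class Verdict:
    """Binary decision plus a natural-language explanation of the reasoning."""
    is_impersonation: bool
    explanation: str
```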
Benchmark dataset
Building an AI system that reasons about intent requires more than clever modeling: it demands reliable evaluation. Our benchmark dataset contains 300 data points drawn from a variety of well-known brands across multiple sectors. This mix of brands and sectors creates a realistic foundation for evaluating how different approaches perform when faced with the nuanced and evolving nature of lookalike detection.
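The benchmark results quoted throughout this post report two figures per configuration: accuracy against the labeled verdicts, and agreement between two independent runs of the same configuration. A minimal sketch of how such metrics can be computed (the helper names are ours):

```python
def accuracy(predictions: list[bool], labels: list[bool]) -> float:
    """Share of benchmark cases where the agent's verdict matches the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def run_agreement(run_a: list[bool], run_b: list[bool]) -> float:
    """Share of cases where two independent runs reach the same verdict (consistency)."""
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)
```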
Optimization process
We followed a two-phase optimization process when developing the AI agent. The first phase was all about pushing accuracy to its limit – exploring different design directions to see how far we could take performance before introducing any practical constraints. This was the experimental stage, where the focus was not yet on efficiency or scale, but on learning what truly drives reliable decision-making in this problem space. The objective was simple: if the agent couldn’t demonstrate tangible value and outperform rule-based heuristics, there would be little point in operationalizing it further.
Once we reached a configuration that clearly delivered value to users, we transitioned into the second phase: making the system viable in production. Here, the focus shifted from maximizing accuracy to finding the right balance between speed, scalability, and cost. The goal was to preserve the agent’s decision quality while ensuring it could run across large volumes of data without exhausting resources or compromising user experience.
Phase 1: Accuracy optimization
This phase involved iterating through several architectural designs to determine what structure yielded the best accuracy and consistency.
Single-agent approach
As generative AI models grow more capable – with stronger reasoning and longer context windows – it’s tempting to provide them with all available data and powerful tools, expecting them to reason through complexity autonomously. We began with a single-agent design that consumed all signals directly. The agent was equipped with an agentic web search tool, enabling it to iteratively query and refine information before committing to a final classification.
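A simplified sketch of that loop, reusing the LookalikeCase and Verdict records from above (the llm and web_search callables are placeholders for the model and search tool, not their real interfaces):

```python
def classify_single_agent(case: LookalikeCase, llm, web_search, max_steps: int = 5) -> Verdict:
    """A single agent sees every signal at once and may call web search between reasoning steps."""
    context = [
        f"Lookalike: {case.lookalike_domain}",
        f"Asset: {case.legitimate_asset}",
        f"DNS: {case.dns_records}",
        f"WHOIS: {case.whois}",
        f"Customer context: {case.customer_context}",
    ]
    for _ in range(max_steps):
        # The model either requests another search or commits to a verdict.
        reply = llm(context)  # assumed to return a dict such as {"action": "search", "query": ...}
        if reply.get("action") == "search":
            context.append(f"Search results: {web_search(reply['query'])}")
        else:
            return Verdict(reply["impersonation"], reply["explanation"])
    # Budget exhausted without a decision: hand off for human review.
    return Verdict(True, "Search budget exhausted; flagged for manual review.")
```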
Benchmark results
- Accuracy: 70.5% (average of two runs)
- Agreement between runs: 87.3%
Observation
Performance was modest and inconsistent. Despite deterministic settings (temperature set to 0 to minimize stochastic responses), the model produced noticeably different results across runs. The input space was too large, and the reasoning process was too unconstrained. Prompt engineering proved challenging, as instructions had to cover many unrelated aspects. As a result, the analyses were sometimes incomplete or misaligned with the decision objective.
Multi-agent architecture
Next, we experimented with a multi-agent architecture, introducing a coordinator that consolidated the outputs of several specialized agents. Each agent focused on a specific threat vector, such as linguistic similarity, visual resemblance, or infrastructure overlap. The coordinator reviewed their analyses, made the final decision, and produced an explanation.
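In sketch form, the coordinator simply collects the specialists’ written analyses and makes the call (agent names and interfaces are illustrative):

```python
def classify_multi_agent(case: LookalikeCase, coordinator_llm, specialists: dict) -> Verdict:
    """`specialists` maps a threat vector ("linguistic", "visual", "infrastructure", ...)
    to a callable that returns a short written analysis of that vector."""
    analyses = {name: analyze(case) for name, analyze in specialists.items()}
    report = "\n".join(f"[{name}] {text}" for name, text in analyses.items())
    # The coordinator reviews the consolidated report, decides, and explains.
    reply = coordinator_llm(report)  # assumed to return {"impersonation": bool, "explanation": str}
    return Verdict(reply["impersonation"], reply["explanation"])
```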
Benchmark results
- Accuracy: 83.7%
- Agreement between runs: 93%
Observation
Accuracy and consistency improved, but the model remained unreliable across similar domains. For instance, two parked domains with nearly identical characteristics could yield different results – indicating that while the structure improved focus, reasoning reliability across comparable inputs was still limited.
Hybrid model: Agents + Rules
Upon reviewing the predictions and reasoning traces, we noticed that impersonation outcomes could be explained by a finite set of combinations of threat indicators. This insight led to a hybrid design: each specialized agent not only produced an analysis but also classified the domain into one of several threat levels. A deterministic rule set then translated these classifications into a final verdict (suspicious / not suspicious), while the main agent focused on generating a coherent, contextual explanation rather than making the decision itself.
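The rule layer can be as simple as mapping the specialists’ threat levels onto a verdict; the levels and thresholds below are illustrative, not the production rule set:

```python
from enum import Enum

class ThreatLevel(Enum):
    NONE = 0
    LOW = 1
    HIGH = 2

def verdict_from_threat_levels(linguistic: ThreatLevel,
                               visual: ThreatLevel,
                               infrastructure: ThreatLevel) -> bool:
    """Deterministically map specialist classifications to suspicious (True) / not suspicious (False)."""
    levels = (linguistic, visual, infrastructure)
    # A single high-severity signal is enough on its own.
    if ThreatLevel.HIGH in levels:
        return True
    # Several low-severity signals in combination are also suspicious.
    return sum(level is ThreatLevel.LOW for level in levels) >= 2
```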
Benchmark results
- Accuracy: 93.7%
- Agreement between runs: 97.3%
Observation
The hybrid setup delivered both high accuracy and consistency. Combining structured rules with contextual AI reasoning produced outcomes that were not only smarter but also more stable – providing a strong foundation to proceed to the next optimization stage.
Phase 2: Scalability and efficiency optimization
Once the agent achieved strong accuracy and consistency, the next challenge was making it viable at scale. High performance is only useful if it can be maintained efficiently across hundreds of thousands of domains per day.
Smart invocation of expensive operations
One of our agents is responsible for uncovering business relationships between a lookalike domain and the customer, such as shared ownership, partner brands, or legitimate rebrands. It uses an agentic web search tool that autonomously searches for relevant information through multiple iterations, making it both powerful and computationally expensive. To avoid invoking it unnecessarily, we added a lightweight Google presence check as a preliminary step: if the lookalike domain has no meaningful search hits, it’s reasonable to assume that a deeper, multi-iteration search would also yield no useful results. This simple optimization dramatically reduces cost while preserving decision quality.
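In pseudocode, the gate looks roughly like this (the search_hit_count helper and the threshold are illustrative, not a specific API):

```python
MIN_HITS = 3  # illustrative threshold for "meaningful" web presence

def maybe_run_relationship_agent(domain: str, search_hit_count, relationship_agent):
    """Skip the expensive multi-iteration search when a cheap presence check finds nothing."""
    if search_hit_count(domain) < MIN_HITS:
        # No meaningful web presence: a deeper agentic search is unlikely to surface a relationship.
        return None
    return relationship_agent(domain)
```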
Speed vs. cost trade-off
Given the nature of our product, reviewing lookalikes isn’t a real-time task, so we can trade some latency for efficiency. We moved from a “run everything in parallel” approach to a staged, conditional pipeline, where we build a directed acyclic graph (DAG) and execute subsequent agents only when required. For example, if content similarity is low and the domain string isn’t confusing or malicious, we can confidently classify it as safe without triggering costly exploration. The business relationship agent is reserved for the trickier boundary cases – when a lookalike closely resembles the asset and could be either an unknown legitimate property or a deliberate impersonation.
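A condensed sketch of the staged flow, reusing the ThreatLevel, Verdict, and rule helpers from earlier (stage names, gating conditions, and the relationship check are illustrative):

```python
def staged_pipeline(case: LookalikeCase, agents: dict) -> Verdict:
    """Cheap stages always run; expensive stages run only when earlier signals demand them."""
    linguistic = agents["linguistic"](case)            # domain-string confusability (cheap)
    content = agents["content_similarity"](case)       # page/content similarity (cheap)

    # Early exit: low similarity and a non-confusing string -> classify as safe immediately.
    if linguistic is ThreatLevel.NONE and content is ThreatLevel.NONE:
        return Verdict(False, "Low content similarity and non-confusing domain string.")

    visual = agents["visual"](case)
    infrastructure = agents["infrastructure"](case)

    # Boundary case: close resemblance that might still be a legitimate, related property.
    if ThreatLevel.HIGH in (content, visual) and agents["relationship"](case):
        return Verdict(False, "Resemblance explained by a verified business relationship.")

    return Verdict(verdict_from_threat_levels(linguistic, visual, infrastructure),
                   "Verdict derived from the deterministic threat-level rules.")
```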
Accuracy vs. complexity trade-off
Leveraging modularity for precision
Decomposing the evaluation into task-specific agents allows each one to be optimized independently with its own benchmark dataset and success criteria. This modularity makes it easier to fine-tune instructions, manage complexity, and diagnose weaknesses without affecting the entire system. Not all agents need the same level of sophistication: a coordinator that consolidates and explains results can be simpler and faster, while classification agents require higher reasoning precision. This separation of concerns not only improves maintainability but also ensures that improvements in one component translate cleanly into better overall system performance.
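One way to make that concrete is to give every agent its own evaluation contract; the entries below are purely illustrative:

```python
# Each specialist is benchmarked and tuned independently; the coordinator can run on a
# cheaper, faster model because it consolidates and explains rather than classifies.
AGENT_EVAL_CONFIG = {
    "linguistic":  {"benchmark": "benchmarks/linguistic.jsonl",   "min_accuracy": 0.95, "model_tier": "high-precision"},
    "visual":      {"benchmark": "benchmarks/visual.jsonl",       "min_accuracy": 0.90, "model_tier": "high-precision"},
    "coordinator": {"benchmark": "benchmarks/explanations.jsonl", "min_accuracy": 0.85, "model_tier": "fast-and-cheap"},
}
```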
Reasoning effort and model behaviour
When exploring different models in the GPT family, we found clear trade-offs between reasoning effort and consistency. Reasoning models like GPT-5 tend to explore problems more deeply and make better use of external tools such as web search. However, due to the lack of temperature control, GPT-5 is less consistent across runs compared to GPT-4.1, which, while not a reasoning model, delivers more stable outputs. Increasing the reasoning effort in GPT-5 improves accuracy but also raises cost and variability, showing that more reasoning doesn’t always translate to better practical performance.
Token economics are real
Although GPT-5 has a lower per-token price, reasoning models tend to generate far more tokens as they articulate multi-step chains of thought. This often offsets the apparent savings and can even make runs more expensive than simpler models. Attempts to limit token generation – by reducing reasoning verbosity or tightening instructions – do lower cost but typically degrade performance, as the model cuts short useful intermediate reasoning. Striking the right balance between reasoning depth and token efficiency proved essential for achieving sustainable performance at scale.
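A toy calculation illustrates the effect (the prices and token counts are hypothetical, chosen only to show the shape of the trade-off, not actual model pricing):

```python
# Hypothetical numbers, not actual model pricing.
cheap_price, cheap_tokens = 2.0e-6, 1_500          # $/token, tokens generated per case
reasoning_price, reasoning_tokens = 1.2e-6, 4_000  # lower unit price, but far more reasoning tokens

print(f"{cheap_price * cheap_tokens:.4f}")          # 0.0030 $/case
print(f"{reasoning_price * reasoning_tokens:.4f}")  # 0.0048 $/case -> the "cheaper" model costs more per case
```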
Final thoughts
Building an effective AI agent for lookalike detection isn’t just about using a powerful model – it’s about balancing accuracy, consistency, and scalability. By combining reasoning capabilities with structured rules and smart orchestration, we’ve created a system that performs well in practice, not just in theory. This work highlights how thoughtful engineering can turn advanced AI into a reliable, production-ready capability. To see how the AI Agent operates in practice – and how it can help your team reduce manual triage and act on lookalike threats with greater speed and clarity – request a demo to explore the experience.