Data Hygiene for AI Security: Stop Ingesting Everything, Start Engineering Signal

Published on
February 16, 2026

Security teams are being asked to do more with less while the volume of security telemetry keeps climbing. The promise of AI often shows up as the shortcut: ingest everything, apply analytics, and productivity will follow.

In the real world, this hasn’t yet consistently delivered the outcome leaders expect. The gap between the AI pitch and SOC reality tends to widen as data volumes increase, data quality varies, and analysts spend more time validating what systems produce instead of acting on it.

The scale of the problem is not theoretical. Microsoft reports that it processes 78 trillion security signals per day as part of its threat landscape insights, a useful reminder of how quickly the “data firehose” becomes unmanageable without strong filtering and structure. And the consequences of this noise are measurable. In the SANS 2024 SOC Survey, “too many alerts that we can’t look into / lack of correlation between alerts” appears as an explicit barrier to full SOC capability utilization.

During a recent webinar conversation on scalable productivity, Anomali leaders described why this keeps happening and what changes the trajectory. The through line was consistent: AI productivity rises and falls with the quality, consistency, and usability of the data it runs on, and with whether analytics are designed to support decisions rather than generate more output.

Why “Ingest Everything” Turns Into Noise

Security analytics has carried the same marching orders for years: bring in as much data as possible for maximum visibility. Patrick Holt, Senior Principal Product Manager leading data lake, SIEM, and content strategy, explained why that logic breaks down.

“The conventional wisdom was the more data you have, the better visibility you’re going to have,” Holt said. “The reality is as you bring in more and more and more data, you also bring more and more noise.” He added the line that every SOC has learned the hard way: “If you don’t have a clear understanding of what the data is actually going to be used for, more is just simply more. It’s not better.”

That’s not just a philosophical point, but operational math. Logging systems generate millions of events per hour. As Holt put it: “When you start looking at logging systems, you’re looking at millions of things an hour, maybe tens of millions of things an hour. How does any one person, how does any one team… have enough time to look at all of this?”

If the answer is “AI will,” the next question becomes more important: will AI be able to interpret the data correctly and reliably enough that humans do not have to re-check its work?

Data Hygiene Is the Prerequisite for AI, Not the Cleanup Step

Holt’s practitioner view goes straight at the root cause. AI does not magically normalize messy telemetry. If the inputs are inconsistent, the outputs become inconsistent. “It’s not so simple as just slapping an AI on top of this data structure,” he said.

He described the most common failure mode in plain terms: “If you’re looking at the vendor’s logs without doing anything to make sure that those logs are standardized in any meaningful way, it’s garbage in and garbage out.”

A quick example from the conversation makes this tangible. Holt walked through a Windows Security event and pointed out that the fields and structure are defined by Microsoft. That creates a familiar shape, but not a uniform one. “If I were to bring up a Unix log, it would look very different,” he said. “If I was to bring up an IAM log, let’s say that you’re an Okta customer, their data also has different names.”

From an AI perspective, that inconsistency is not a minor annoyance. It’s the core interpretability problem. Holt framed it as the system having to understand “what do all these different keys actually mean” across vendor-specific formats.

The fix is not another model, but consistent structure.

Holt referenced conforming logs into a common schema, describing how a normalized format allows data from different systems to land in “the same exact format.” In his words, it “allows the AIs to make quick decisions and give you quick, intelligent information and analysis about what it’s finding.”
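Holt’s point about conforming logs can be sketched in a few lines. The field maps below are hypothetical, not Anomali’s actual schema; real deployments typically target a published standard such as OCSF or ECS. The idea is simply that vendor-specific keys get mapped into one uniform shape before anything downstream reasons about them:

```python
# Hypothetical sketch: conform vendor-specific log fields into one common
# schema. Field names and maps are illustrative assumptions only.

# Per-source field maps: raw vendor key -> normalized key
FIELD_MAPS = {
    "windows_security": {"TargetUserName": "user", "IpAddress": "src_ip", "EventID": "event_id"},
    "okta": {"actor.alternateId": "user", "client.ipAddress": "src_ip", "eventType": "event_id"},
}

def get_path(record, dotted_key):
    """Walk a dotted path like 'actor.alternateId' through nested dicts."""
    value = record
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

def normalize(source, record):
    """Conform a raw vendor event into the common schema."""
    mapped = {norm: get_path(record, raw) for raw, norm in FIELD_MAPS[source].items()}
    mapped["source"] = source
    return mapped

win = {"TargetUserName": "alice", "IpAddress": "10.0.0.5", "EventID": 4625}
okta = {"actor": {"alternateId": "alice"}, "client": {"ipAddress": "203.0.113.9"}, "eventType": "user.session.start"}
print(normalize("windows_security", win))
print(normalize("okta", okta))
```

Once every source lands in “the same exact format,” a downstream model (or analyst) no longer has to guess what each vendor’s keys mean.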

This is the point many teams underestimate. In the webinar, Chris Vincent, Chief Commercial Officer at Anomali, summarized it as a foundational principle: “AI is going to be as good as the underlying data context that it could be given.”

Stop Treating All Data the Same: Segment by Purpose

Most teams collect data for at least two different reasons, and mixing those purposes is where analytics programs get noisy fast.

Holt called this out directly: “Not all data has actionable value. Some data is there for compliance. Some data is there to action off of.”

A practical way to operationalize this is to classify sources and event types into tiers, then apply different expectations to each tier.

Here is a lightweight model that works in real SOCs:

  • Compliance retention data: collected to meet audit, regulatory, or policy requirements
  • Detection and response data: collected because it supports decisions, triage, correlation, or investigation workflows

This segmentation helps teams avoid a common trap: treating compliance volume as if it must drive detections. Retention data can be essential without being a high-signal feed for alerting. It still has value, but its value often shows up during investigation, audit, or post-incident reconstruction.

The productivity win comes from being disciplined about what data is expected to generate alerts, what data is expected to enrich context, and what data is primarily archived unless needed.
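That discipline can be made concrete as configuration. The source names, tiers, and retention numbers below are illustrative assumptions, not a recommended policy; the point is that each source’s intent is declared once and routing follows from it:

```python
# Illustrative tiering: declare why each source is collected, then route it.
# Source names, tiers, and retention values are hypothetical examples.
SOURCE_INTENT = {
    "edr_alerts": "detection",
    "okta_auth": "detection",
    "dns_logs": "enrichment",
    "netflow_archive": "compliance",
}

ROUTES = {
    "detection": {"alerting": True, "hot_retention_days": 90},
    "enrichment": {"alerting": False, "hot_retention_days": 30},
    "compliance": {"alerting": False, "hot_retention_days": 7},  # archive-first
}

def route(source):
    # Unknown sources default to archive-only until someone declares intent.
    intent = SOURCE_INTENT.get(source, "compliance")
    return {"source": source, "intent": intent, **ROUTES[intent]}

print(route("edr_alerts"))
print(route("netflow_archive"))
```

With intent declared per source, compliance volume stops leaking into the alerting path by default.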

Normalize First, Then Enrich Context

Even after segmentation, detection and response data still has to be usable across systems. Holt described “parsing” as the tax every SIEM user pays. Vendors normalize logs into a schema, and then customers customize, override, and reshape it further. The risk is that inconsistency becomes baked into the environment.

“If you don’t have a very well documented standardized published schema that you can then justify,” Holt warned, “you end up with abnormalities.”

He also tied this directly to AI effectiveness: “If it’s changed or if it’s not standard, it can’t really do a super fantastic job. The AI is going to make assumptions based on that schema.”

Then comes the part that makes data turn into decisions: context enrichment.

Holt emphasized that “log for log’s sake doesn’t actually mean a tremendous amount,” and described how context accelerates judgment. Knowing who a person is, what peer group they belong to, what systems they normally access, and what “normal” behavior looks like changes the speed and quality of triage.

He gave the shape of the example without overreaching into surveillance narratives: “It’s not to try to be big brother,” he said, but if privileged users “start to do things that are strange behaviorally or if they start to step outside of their normal behavior, we need to know about that.”

A real-world practitioner pattern here is straightforward:

  • Normalize identity and endpoint events into consistent fields
  • Attach identity context (role, group, privilege level, typical access patterns)
  • Attach asset context (criticality, environment, sensitivity)
  • Use that context to tune detections so that alerts represent decision points, not raw anomalies

If your organization has ever tried to deploy an AI triage layer over unnormalized, unenriched logs, you already know what happens. The system produces plausible stories at high speed, but analysts still spend time verifying basic facts.

Design Analytics Backward From Workflow

Data hygiene and enrichment matter because they reduce the cognitive overhead of investigations, but they do not automatically produce productivity. The real inflection point comes when analytics are designed to support how teams actually work.

Holt described this as a workflow-first mindset: “If I understand the workflow or where I’m going, define with the end in mind, then I can actually work backwards, write all of my detections, and get some real value out of that.”

That approach turns analytics from a generator of alerts into a reducer of effort. It forces clarity on questions like:

  • What constitutes a decision that must be made by a human analyst?
  • What context is required to make that decision quickly?
  • What correlations reduce uncertainty early?
  • What can be automated safely because the inputs are consistent and trusted?

On the leadership side, George Moser, Chief Growth Officer at Anomali, made the same point from a different angle. He said productivity “scales by shrinking the amount of work that ever reaches a human analyst,” not by giving analysts more features.

This is why data hygiene belongs in the productivity conversation. If analytics create more investigative overhead, they add complexity. If analytics reduce the work that reaches humans, they earn trust and create operational slack.

Speed Matters, But Only After the Foundation Holds

During the conversation, Holt demonstrated how investigation speed changes when a platform can search across broad data quickly. He contrasted older systems, where queries might take minutes or longer, with a scenario where results arrive in seconds, even when searching across days of data.

The underlying point was not that speed is a feature. The point is that speed enables workflow continuity. If analysts can ask a simple question and get an answer immediately, they stay inside the investigation instead of waiting and context-switching.

Holt also explained why older architectures often struggle, especially when they are “lift and shift” systems that moved appliance-era assumptions into virtualized infrastructure. Cloud elasticity can help, but only if the data model and search architecture were designed to take advantage of it.

The real win is when fast search is paired with normalized, enriched data. Speed without trust can accelerate the wrong work.

What “Better Data Hygiene” Looks Like in Practice

Most teams do not need a sweeping transformation project to improve data hygiene. They need a repeatable operating model that makes the system more consistent over time.

A pragmatic sequence looks like this:

  1. Declare intent per data source: compliance, investigation, detection, or enrichment
  2. Normalize detection and enrichment sources into a defensible schema
  3. Add identity and asset context where it reduces decision time
  4. Build detections backward from workflows and handoffs
  5. Measure reduction in analyst effort, not increase in platform activity

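The last step, measuring effort rather than activity, can be as simple as tracking what actually reaches a human. The function and numbers below are a hypothetical illustration:

```python
def analyst_load(alerts_total, auto_closed, context_only):
    """Hypothetical metric: how much work actually reaches a human analyst."""
    reached_human = alerts_total - auto_closed - context_only
    return {
        "reached_human": reached_human,
        "reduction_pct": round(100 * (1 - reached_human / alerts_total), 1),
    }

# Illustrative numbers only.
print(analyst_load(10_000, 6_500, 2_000))  # {'reached_human': 1500, 'reduction_pct': 85.0}
```

A dashboard that trends `reached_human` downward tells a very different story than one that celebrates total events ingested.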
Longer investigation lifecycles are not just costly. They amplify burnout and force teams into reactive modes where every alert feels urgent because trust in prioritization is low.

AI Productivity Starts With Signal Engineering

AI can absolutely improve productivity in security operations, but it does not do it by brute-forcing messy telemetry. It does it when the organization earns the right to automate.

That “right” is earned with disciplined data hygiene: deciding which data matters, normalizing it so it can be trusted across sources, enriching it so it supports decisions, and aligning detections to workflows so work is removed rather than rearranged.  

SOC leaders have two potential productivity paths to follow:

  • If you want to reduce alert fatigue without losing visibility, start by mapping your data sources to intent, then normalize and enrich the ones that drive decisions.
  • If your AI rollout still feels like “more work,” the fastest reset is a data foundation assessment focused on schema consistency, context coverage, and workflow alignment.

Get more expert-led insight into optimizing data and analytics for scalable security productivity by checking out the full conversation.  
