All Insights
Threat Intelligencellm obfuscation8 min

When LLMs Leak: Obfuscation Failures in AI-Assisted Redaction

David Greenhill
David GreenhillTechnical Lead
·

Organizations are rushing to deploy LLMs for document redaction. The appeal is obvious: feed a document in, get a redacted version out, at a fraction of the cost and time of manual review. But LLM-based redaction has failure modes that traditional redaction methods don't — and some of them are genuinely dangerous.

The Promise and the Problem

Traditional redaction (regex patterns, NER models like Presidio) operates on pattern recognition. It finds things that look like SSNs, phone numbers, or names, and removes them. The failure modes are well-understood: missed patterns, false positives, format edge cases.

LLM-based redaction operates differently. The model "understands" the document's meaning. It can identify PII from context — recognizing that "the 45-year-old CEO of Acme Corp who graduated from Stanford in 2003" uniquely identifies a person even though no name appears. This contextual understanding is the selling point.

It's also the vulnerability.

Failure Mode 1: Contextual Reconstruction

An LLM that's smart enough to identify PII from context is also smart enough to reconstruct PII that's been removed. If a document mentions "the defendant's wife" in paragraph 2 and "Sarah" in paragraph 8, a model processing the full document may associate them — and that association can leak through the model's outputs, embeddings, or attention patterns.

This isn't theoretical. Research has demonstrated that LLMs can reconstruct redacted names from surrounding context with non-trivial accuracy, particularly when the model has seen similar patterns in training data.

Failure Mode 2: Embedding Leakage

When documents are processed through LLM pipelines for redaction, intermediate representations (embeddings) may encode the very information being redacted. If those embeddings are stored, cached, or transmitted — even after the final output is clean — the PII exists in the pipeline's state.

Vector databases indexing "redacted" documents may contain embeddings that, under the right queries, surface information about the redacted content.

Failure Mode 3: Hallucinated Redaction

LLMs hallucinate. In the redaction context, this means the model may:

  • Claim to have redacted something it didn't
  • Replace PII with plausible-looking but incorrect substitutions that still leak information
  • Introduce new information that wasn't in the original document
  • Apply inconsistent redaction (removing a name in one place but not another)

Failure Mode 4: Prompt Injection Bypass

If the document being redacted contains adversarial content, it may manipulate the LLM into skipping redaction. A carefully crafted passage could instruct the model to preserve PII that should be removed. This attack vector doesn't exist in regex or NER-based redaction.

Mitigation Approaches

TCI's position is that LLMs should not be the sole redaction layer. Our pipeline uses:

  • Deterministic first pass: Regex and NER-based redaction catches known patterns
  • LLM contextual analysis: Identifies PII that pattern matching misses
  • Independent verification: A separate model (or human) reviews the output without access to the original
  • No persistent embeddings: Intermediate representations are discarded after processing
  • Output-only evaluation: The redacted document is evaluated in isolation, not compared against the original in the same context

The Regulatory Angle

If an LLM-redacted document is later found to contain recoverable PII, the liability question becomes complex. Was the redaction "reasonable"? If the organization knew about LLM failure modes and deployed the technology without mitigation, the argument for negligence strengthens.

Organizations using LLM-based redaction need to document their methodology, including known limitations and mitigation strategies. "We used AI to redact it" is not a compliance defense.

Conclusion

LLMs are powerful tools for identifying PII. They are unreliable tools for removing it — unless embedded in a pipeline with deterministic verification layers. The convenience of end-to-end LLM redaction comes with risks that most organizations haven't evaluated.

David Greenhill

Written by

David Greenhill

Technical Lead, The Commonlight Initiative

Need help with your evidence infrastructure?

TCI builds capture pipelines, redaction workflows, and air-gapped processing systems for organizations handling sensitive data.