Field Report: Deploying an Evidence Processing Agent on Isolated Infrastructure

David Greenhill, Technical Lead

This is a field report from a deployment by The Commonlight Initiative (TCI) of an autonomous evidence processing agent on isolated infrastructure. The engagement involved processing a large corpus of documents requiring entity extraction, relationship mapping, and selective redaction, all on hardware with no internet connectivity.

The Brief

The requirement: process approximately 12,000 documents (PDFs, emails, spreadsheets, images) to extract entities, map relationships, and produce redacted versions suitable for production. Timeline: two weeks. Constraint: no document content could leave the client's premises.

Hardware Setup

We deployed on client-provided hardware:

  • Dell PowerEdge R750 with dual Intel Xeon processors
  • 256GB RAM
  • NVIDIA RTX A6000 (48GB VRAM)
  • 4TB NVMe RAID-10 array
  • No network interfaces enabled

Software was installed from verified USB media. The entire stack (OS, models, pipeline code) was pre-built and tested in our lab before physical delivery to the client site.

Model Configuration

We deployed a two-model configuration:

  • Primary analysis: Llama 3 70B at 4-bit quantization for entity extraction and relationship analysis
  • Redaction verification: Mistral 7B for independent verification of redaction completeness

Both models ran through vLLM with custom prompt templates optimized for forensic document analysis.

What Worked

Batch Processing Pipeline

Documents were queued and processed in batches of 50. Each batch produced:

  • Entity extraction results in structured JSON
  • Relationship graph updates (stored in a local SQLite database)
  • Redaction candidates with confidence scores
  • Processing logs for chain of custody
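As a rough illustration of the batch loop described above, here is a minimal sketch using only the Python standard library. The batch size of 50 and the local SQLite store come from the report; the table layout, the shape of the `results` dictionary, and the function names (`batched`, `open_store`, `record_batch`) are assumptions made for the example, not the deployed schema.

```python
import hashlib
import json
import sqlite3
from itertools import islice
from typing import Iterable, Iterator

BATCH_SIZE = 50  # batch size stated in the report


def batched(items: Iterable[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the local store; both table layouts are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relationships "
        "(source TEXT, relation TEXT, target TEXT, doc_id TEXT, confidence REAL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS custody_log "
        "(batch_no INTEGER, doc_id TEXT, result_sha256 TEXT)"
    )
    return conn


def record_batch(conn: sqlite3.Connection, batch_no: int, results: dict[str, dict]) -> None:
    """Persist one batch: relationship rows plus a hash of each document's result."""
    with conn:  # one transaction per batch, rolled back if anything raises
        for doc_id, result in results.items():
            for rel in result.get("relationships", []):
                conn.execute(
                    "INSERT INTO relationships VALUES (?, ?, ?, ?, ?)",
                    (rel["source"], rel["relation"], rel["target"], doc_id, rel["confidence"]),
                )
            # Hash the canonical JSON so the log can later show the result is unchanged.
            payload = json.dumps(result, sort_keys=True).encode("utf-8")
            conn.execute(
                "INSERT INTO custody_log VALUES (?, ?, ?)",
                (batch_no, doc_id, hashlib.sha256(payload).hexdigest()),
            )
```

Committing once per batch keeps a failed batch from leaving partial rows behind, which makes it simpler to re-run that batch without de-duplicating the graph afterwards.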

Across the run the pipeline averaged roughly 1,100 documents per day (12,247 documents in 11 days), comfortably inside the two-week timeline.

Entity Extraction Quality

The Llama 3 70B model performed well on entity extraction, particularly for:

  • Named entities (people, organizations, locations)
  • Financial entities (account numbers, transaction amounts)
  • Temporal relationships (sequences of events)

Accuracy was comparable to what we achieve with cloud models on similar document types. The model struggled more with handwritten annotations on scanned documents, but that's a limitation of the input quality rather than the model.

Redaction Consistency

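The ensemble approach combined two detection layers (deterministic rules and LLM detection) with an independent LLM verification pass. It caught several PII instances that either detection layer alone would have missed. Cross-document consistency was maintained through a shared entity registry, so the same entity was treated the same way wherever it appeared.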
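The layering can be sketched as follows. This is an illustrative outline, not the deployed code: the email regex stands in for the deterministic rules, the LLM detector and the verifier are passed in as plain callables (in a real pipeline they would call the local models), and `EntityRegistry`, `Detection`, and the placeholder format are hypothetical names chosen for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Detection:
    start: int
    end: int
    label: str
    source: str        # which layer produced it, e.g. "regex" or "llm"
    confidence: float


Detector = Callable[[str], list[Detection]]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # stand-in deterministic rule


def deterministic_detect(text: str) -> list[Detection]:
    return [Detection(m.start(), m.end(), "EMAIL", "regex", 1.0) for m in EMAIL.finditer(text)]


class EntityRegistry:
    """Maps each distinct (label, value) pair to one stable placeholder."""

    def __init__(self) -> None:
        self._placeholders: dict[tuple[str, str], str] = {}

    def placeholder(self, label: str, value: str) -> str:
        key = (label, value.strip().lower())
        if key not in self._placeholders:
            # The counter is shared across labels, so numbers are unique registry-wide.
            self._placeholders[key] = f"[{label}-{len(self._placeholders) + 1}]"
        return self._placeholders[key]


def redact(text: str, detectors: list[Detector], registry: EntityRegistry) -> tuple[str, list[Detection]]:
    """Union the detections from every layer, drop overlaps, and replace spans."""
    detections = sorted(
        (d for detect in detectors for d in detect(text)),
        key=lambda d: (d.start, -d.end),
    )
    kept: list[Detection] = []
    for d in detections:
        if kept and d.start < kept[-1].end:
            continue  # overlaps a span that is already being redacted
        kept.append(d)
    for d in reversed(kept):  # right to left so earlier offsets stay valid
        text = text[:d.start] + registry.placeholder(d.label, text[d.start:d.end]) + text[d.end:]
    return text, kept


def verify(redacted: str, verifier: Detector) -> list[Detection]:
    """Independent pass: anything still detected in the output is a missed redaction."""
    return verifier(redacted)
```

Because one registry instance is shared across the whole corpus, an email address redacted in one document receives the same placeholder in every other document, which is one simple way to get the cross-document consistency described above.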

Audit Trail

The comprehensive logging proved valuable when the client's legal team wanted to understand how specific redaction decisions were made. Every decision traced back to specific detections with confidence scores.
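One way such traceability could be structured is an append-only log with one JSON record per redaction decision, each carrying the detections that justified it. The sketch below is an assumption about shape only: the field names, the `RedactionDecision` class, and the `explain` helper are hypothetical and not taken from the deployed system.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Iterable, Iterator, TextIO


@dataclass
class RedactionDecision:
    doc_id: str
    start: int            # character offsets of the affected span
    end: int
    action: str           # e.g. "redact" or "flag_for_review"
    detections: list[dict] = field(default_factory=list)  # source, label, confidence


def append_decision(log: TextIO, decision: RedactionDecision) -> None:
    """Write one decision as a single JSON line; the log is never rewritten."""
    log.write(json.dumps(asdict(decision), sort_keys=True) + "\n")


def explain(log_lines: Iterable[str], doc_id: str, offset: int) -> Iterator[dict]:
    """Yield every logged decision that covers a character offset in a document."""
    for line in log_lines:
        record = json.loads(line)
        if record["doc_id"] == doc_id and record["start"] <= offset < record["end"]:
            yield record
```

With records like these, a question such as "why was this span on page 3 redacted?" becomes a lookup that returns the contributing detections and their confidence scores.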

What Didn't Work

Initial Setup Time

Physical media installation and verification took a full day. Each software package had to be verified against pre-computed checksums. A model quantization issue also required re-transferring a 40GB model file over USB 3.0, which took longer than expected.
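The checksum step can be sketched with the standard library alone. The report does not say which hash algorithm or manifest format was used; SHA-256 and a simple mapping of relative path to expected digest are assumptions here, and `sha256_of` and `verify_manifest` are illustrative names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so a 40GB model file never has to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest does not match."""
    failures = []
    for relative_path, expected in manifest.items():
        candidate = root / relative_path
        if not candidate.is_file() or sha256_of(candidate) != expected.lower():
            failures.append(relative_path)
    return failures
```

Running the same check in the lab before shipping and again on site after copying makes it easier to tell a bad build from a bad transfer.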

OCR Quality

Approximately 15% of the documents were scanned PDFs with mediocre OCR quality. The local OCR pipeline (Tesseract) produced lower-quality output than cloud OCR services, which in turn lowered entity extraction accuracy for those documents.

Lesson learned: Invest in better local OCR (or pre-process OCR before air-gapping if the scanned documents themselves aren't sensitive).

Memory Management

Processing large spreadsheets (some with 50,000+ rows) caused memory pressure. The pipeline needed manual intervention to split oversized files into smaller partitions before they could be processed.

Lesson learned: Add document size analysis to the preprocessing stage and automatically adjust batch parameters.
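A minimal sketch of what that preprocessing step might look like. The thresholds (`MAX_ROWS_PER_CHUNK`, `MAX_BATCH_BYTES`) are placeholder values that would need tuning against real memory measurements, and the `Profile` record and function names are hypothetical; only the 50-document batch size comes from the report.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    doc_id: str
    size_bytes: int
    row_count: int = 0  # 0 for non-tabular documents


MAX_ROWS_PER_CHUNK = 5_000           # assumed limit, not a measured value
MAX_BATCH_BYTES = 200 * 1024 * 1024  # assumed per-batch byte budget


def row_chunks(row_count: int, max_rows: int = MAX_ROWS_PER_CHUNK) -> list[range]:
    """Split a large spreadsheet into row ranges that are processed separately."""
    return [
        range(start, min(start + max_rows, row_count))
        for start in range(0, row_count, max_rows)
    ]


def plan_batches(
    profiles: list[Profile], max_docs: int = 50, max_bytes: int = MAX_BATCH_BYTES
) -> list[list[str]]:
    """Close a batch when it reaches the document cap or would exceed the byte budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for profile in profiles:
        over_budget = current_bytes + profile.size_bytes > max_bytes
        if current and (len(current) >= max_docs or over_budget):
            batches.append(current)
            current, current_bytes = [], 0
        # A single document larger than the budget still gets a batch of its own.
        current.append(profile.doc_id)
        current_bytes += profile.size_bytes
    if current:
        batches.append(current)
    return batches
```

Under these assumed limits, a 50,001-row spreadsheet would be split into eleven row ranges, and a batch would close early whenever a large file pushes it past the byte budget.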

Model Hallucinations on Financial Data

The primary model occasionally hallucinated entity relationships in financial documents, inferring connections that the source text did not support. The verification layer caught most of these, but several reached the human review queue.

Lesson learned: Add domain-specific validation rules for financial entity relationships (e.g., verify that inferred transaction relationships reference real account numbers present in the corpus).
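A validation rule of this kind could be as simple as the sketch below. It assumes account numbers can be recognized by a pattern (here 8 to 12 digits, purely as a placeholder) and that each inferred relationship carries an `accounts` list; both are assumptions for illustration, as are the function names.

```python
import re
from typing import Iterable

ACCOUNT = re.compile(r"\b\d{8,12}\b")  # placeholder account-number format


def accounts_in_corpus(texts: Iterable[str]) -> set[str]:
    """Collect every account number that literally appears in the source texts."""
    found: set[str] = set()
    for text in texts:
        found.update(ACCOUNT.findall(text))
    return found


def is_grounded(relationship: dict, known_accounts: set[str]) -> bool:
    """Accept only relationships that cite at least one account, all present in the corpus."""
    accounts = relationship.get("accounts", [])
    return bool(accounts) and all(account in known_accounts for account in accounts)


def partition(relationships: list[dict], known_accounts: set[str]) -> tuple[list[dict], list[dict]]:
    """Split inferred relationships into (accepted, sent to human review)."""
    accepted = [r for r in relationships if is_grounded(r, known_accounts)]
    rejected = [r for r in relationships if not is_grounded(r, known_accounts)]
    return accepted, rejected
```

A rule like this cannot confirm that a relationship is true, only that it refers to accounts that actually exist in the documents, so it works as a cheap filter ahead of the verification model rather than a replacement for it.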

Performance Numbers

| Metric | Result |
|--------|--------|
| Total documents processed | 12,247 |
| Processing time | 11 days |
| Entities extracted | 34,891 |
| Relationships mapped | 12,403 |
| Redactions applied | 8,672 |
| Documents flagged for human review | 847 (6.9% of documents) |
| Post-review redaction corrections | 23 (0.19% of documents) |
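The derived figures can be recomputed from the raw counts in the table. The short script below is only a sanity check on the arithmetic: it shows that the 0.19% correction rate matches corrections divided by documents processed, and also prints the rate relative to redactions applied for comparison.

```python
documents = 12_247
days = 11
entities = 34_891
redactions = 8_672
flagged = 847
corrections = 23

docs_per_day = documents / days
flagged_pct = 100 * flagged / documents
corrections_pct_of_docs = 100 * corrections / documents
corrections_pct_of_redactions = 100 * corrections / redactions
entities_per_doc = entities / documents

print(f"{docs_per_day:.0f} documents per day")                        # 1113
print(f"{flagged_pct:.1f}% of documents flagged")                     # 6.9
print(f"{corrections_pct_of_docs:.2f}% corrections per document")     # 0.19
print(f"{corrections_pct_of_redactions:.2f}% corrections per redaction")  # 0.27
print(f"{entities_per_doc:.1f} entities per document")                # 2.8
```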

Recommendations for Next Deployment

  • Pre-stage all software on a verified disk image — reduce setup time from one day to hours
  • Include higher-quality OCR (PaddleOCR or EasyOCR as alternatives to Tesseract)
  • Implement automatic document profiling — analyze size, format, and complexity before processing
  • Add financial entity validation rules to reduce hallucinated relationships
  • Bring a spare GPU — hardware failure on isolated infrastructure has no quick fix

Conclusion

Air-gapped AI agent deployment is viable for production evidence processing workloads. The capability gap between local and cloud models is real but manageable for most forensic analysis tasks. The setup overhead is significant but amortizes over the engagement duration.

TCI is refining the deployment playbook based on lessons from this and subsequent engagements.

Written by David Greenhill, Technical Lead, The Commonlight Initiative
