This is a field report from TCI's deployment of an autonomous evidence processing agent on isolated infrastructure. The engagement involved processing a large corpus of documents requiring entity extraction, relationship mapping, and selective redaction — all on hardware with no internet connectivity.
The Brief
The requirement: process approximately 12,000 documents (PDFs, emails, spreadsheets, images) to extract entities, map relationships, and produce redacted versions suitable for production. Timeline: two weeks. Constraint: no document content could leave the client's premises.
Hardware Setup
We deployed on client-provided hardware:
- Dell PowerEdge R750 with dual Intel Xeon processors
- 256GB RAM
- NVIDIA A6000 (48GB VRAM)
- 4TB NVMe RAID-10 array
- No network interfaces enabled
Software was installed from verified USB media. The entire stack (OS, models, pipeline code) was pre-built and tested in our lab before physical delivery to the client site.
Model Configuration
We deployed a two-model configuration:
- Primary analysis: Llama 3 70B at 4-bit quantization for entity extraction and relationship analysis
- Redaction verification: Mistral 7B for independent verification of redaction completeness
Both models ran through vLLM with custom prompt templates optimized for forensic document analysis.
What Worked
Batch Processing Pipeline
Documents were queued and processed in batches of 50. Each batch produced:
- Entity extraction results in structured JSON
- Relationship graph updates (stored in a local SQLite database)
- Redaction candidates with confidence scores
- Processing logs for chain of custody
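Averaged over the full run, the pipeline processed roughly 1,100 documents per day (12,247 documents in 11 days, per the totals under Performance Numbers), which kept the job within the two-week timeline.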
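The batching and graph-update steps described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline TCI ran: the table layout, the `batched`, `open_graph`, and `record_batch` names, and the shape of the per-document result dictionaries are all assumptions made for the example. Only the batch size of 50 and the use of SQLite come from the report.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator

BATCH_SIZE = 50  # batch size stated in the report


def batched(paths: Iterable[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` document paths."""
    iterator = iter(paths)
    while batch := list(islice(iterator, size)):
        yield batch


def open_graph(db_path: str = ":memory:") -> sqlite3.Connection:
    """Open the relationship store (hypothetical schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS relationships ("
        "source TEXT, relation TEXT, target TEXT, document TEXT, "
        "UNIQUE(source, relation, target, document))"
    )
    return conn


def record_batch(conn: sqlite3.Connection, results: list[dict]) -> None:
    """Store one batch of results; re-running a batch does not duplicate rows."""
    rows = [
        (source, relation, target, result["document"])
        for result in results
        for (source, relation, target) in result["relationships"]
    ]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO relationships VALUES (?, ?, ?, ?)", rows
        )
```

Keeping the document path on every row is what lets a relationship in the graph be traced back to the file that produced it. The uniqueness constraint is a design choice for the sketch so that a batch can be retried after a failure without inflating the graph.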
Entity Extraction Quality
The Llama 70B model performed well on entity extraction, particularly for:
- Named entities (people, organizations, locations)
- Financial entities (account numbers, transaction amounts)
- Temporal relationships (sequences of events)
Accuracy was comparable to what we achieve with cloud models on similar document types. The model struggled more with handwritten annotations on scanned documents, but that's a limitation of the input quality rather than the model.
Redaction Consistency
The ensemble approach (deterministic rules plus LLM detection, followed by independent LLM verification) caught several PII instances that either layer alone would have missed. An entity registry shared across the corpus kept redactions consistent from one document to the next.
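One way to picture the layering and the registry is the sketch below. It is an illustration under stated assumptions: the two regex patterns, the `Detection` fields, the placeholder format, and the `EntityRegistry` class are hypothetical, and the LLM layer is represented only by a list of detections passed in by the caller. Overlapping spans are not handled here.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only; a real rule set would need to be vetted.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass(frozen=True)
class Detection:
    start: int
    end: int
    label: str
    source: str  # "rule" or "llm"
    confidence: float


def rule_detections(text: str) -> list[Detection]:
    """Deterministic layer: every regex match is a detection."""
    return [
        Detection(m.start(), m.end(), label, "rule", 1.0)
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]


def merge(rule_hits: list[Detection], llm_hits: list[Detection]) -> list[Detection]:
    """Union of both layers, one detection per exact span (highest confidence wins)."""
    best: dict[tuple[int, int], Detection] = {}
    for hit in [*rule_hits, *llm_hits]:
        key = (hit.start, hit.end)
        if key not in best or hit.confidence > best[key].confidence:
            best[key] = hit
    return sorted(best.values(), key=lambda h: h.start)


class EntityRegistry:
    """Maps each normalized surface form to one placeholder for the whole corpus."""

    def __init__(self) -> None:
        self._placeholders: dict[tuple[str, str], str] = {}

    def placeholder(self, label: str, surface: str) -> str:
        key = (label, " ".join(surface.lower().split()))
        if key not in self._placeholders:
            self._placeholders[key] = f"[{label}-{len(self._placeholders) + 1:04d}]"
        return self._placeholders[key]


def redact(text: str, detections: list[Detection], registry: EntityRegistry) -> str:
    """Replace spans right to left so earlier offsets stay valid."""
    for hit in sorted(detections, key=lambda h: h.start, reverse=True):
        token = registry.placeholder(hit.label, text[hit.start:hit.end])
        text = text[:hit.start] + token + text[hit.end:]
    return text
```

Because the registry is shared across documents, the same person or account receives the same placeholder everywhere it appears. Re-running `rule_detections` on the redacted output is a cheap residual check, though it is not a substitute for the independent LLM verification pass the report describes.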
Audit Trail
The comprehensive logging proved valuable when the client's legal team wanted to understand how specific redaction decisions were made. Every decision traced back to specific detections with confidence scores.
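A decision record of the kind described above might look like the sketch below. The report does not say how its logs were structured, so everything here is an assumption: the field names are hypothetical, and the hash chaining is a technique added for illustration because it makes after-the-fact edits to a log detectable.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash for the first entry


def _entry_hash(previous_hash: str, decision: dict) -> str:
    payload = json.dumps(decision, sort_keys=True)
    return hashlib.sha256((previous_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: list[dict], decision: dict) -> dict:
    """Append a redaction decision and link it to the previous entry's hash."""
    previous_hash = log[-1]["entry_hash"] if log else GENESIS_HASH
    entry = {
        "decision": decision,
        "previous_hash": previous_hash,
        "entry_hash": _entry_hash(previous_hash, decision),
    }
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    previous_hash = GENESIS_HASH
    for entry in log:
        expected = _entry_hash(previous_hash, entry["decision"])
        if entry["previous_hash"] != previous_hash or entry["entry_hash"] != expected:
            return False
        previous_hash = expected
    return True
```

Each `decision` dictionary would carry the document identifier, the redacted span, and the detections (with source and confidence score) that justified it, which is the information a reviewer needs to reconstruct why a redaction was made.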
What Didn't Work
Initial Setup Time
Physical media installation and verification took a full day. Each software package had to be verified against pre-computed checksums. A model quantization issue required re-transferring a 40GB model file via USB 3.0 — which took longer than expected.
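The verification step can be sketched as below. The report says only that packages were checked against pre-computed checksums; SHA-256, the manifest shape (relative path mapped to hex digest), and the function names are assumptions for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so a multi-gigabyte model never sits in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose hash differs."""
    failures = []
    for relative_path, expected in manifest.items():
        candidate = root / relative_path
        if not candidate.is_file() or sha256_of(candidate) != expected.lower():
            failures.append(relative_path)
    return failures
```

Running the same check in the lab immediately after copying to the USB media, and again on site after copying off it, would show at which hop a file was damaged, which matters when a re-transfer costs hours.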
OCR Quality
Approximately 15% of the documents were scanned PDFs with mediocre OCR quality. The local OCR pipeline (Tesseract) produced lower-quality output than cloud OCR services, and the errors cascaded into lower entity extraction accuracy for those documents.
Lesson learned: Invest in better local OCR (or pre-process OCR before air-gapping if the scanned documents themselves aren't sensitive).
Memory Management
Processing large spreadsheets (some with 50,000+ rows) caused memory pressure. The pipeline needed manual intervention to split oversized files into smaller partitions.
Lesson learned: Add document size analysis to the preprocessing stage and automatically adjust batch parameters.
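A preprocessing step along these lines could look like the sketch below. The thresholds (5,000 rows per chunk, 200 MiB per batch) are placeholder values chosen for illustration, not figures from the engagement, and the `DocumentProfile` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DocumentProfile:
    path: str
    size_bytes: int
    row_count: int = 0  # 0 for non-tabular documents


def chunk_count(profile: DocumentProfile, max_rows_per_chunk: int = 5_000) -> int:
    """Number of pieces a spreadsheet should be split into (1 if it already fits)."""
    return max(1, -(-profile.row_count // max_rows_per_chunk))  # ceiling division


def plan_batches(
    profiles: list[DocumentProfile],
    max_docs: int = 50,
    max_bytes: int = 200 * 1024 * 1024,
) -> list[list[DocumentProfile]]:
    """Group documents so no batch exceeds the document or byte budget.

    A single file larger than `max_bytes` gets a batch to itself.
    """
    batches: list[list[DocumentProfile]] = []
    current: list[DocumentProfile] = []
    current_bytes = 0
    for profile in sorted(profiles, key=lambda p: p.size_bytes, reverse=True):
        too_many = len(current) >= max_docs
        too_big = bool(current) and current_bytes + profile.size_bytes > max_bytes
        if too_many or too_big:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(profile)
        current_bytes += profile.size_bytes
    if current:
        batches.append(current)
    return batches
```

Sorting largest first means the oversized files are isolated early, so the problem cases surface at the start of a run rather than partway through an overnight batch.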
Model Hallucinations on Financial Data
The model occasionally hallucinated entity relationships in financial documents — inferring connections that weren't supported by the source text. The verification layer caught most of these, but several made it to the human review queue.
Lesson learned: Add domain-specific validation rules for financial entity relationships (e.g., verify that inferred transaction relationships reference real account numbers present in the corpus).
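The validation rule suggested above can be sketched as follows. The account-number pattern (8 to 12 digits) and the relationship fields `from_account` and `to_account` are hypothetical; real formats would depend on the institutions involved.

```python
import re

ACCOUNT_PATTERN = re.compile(r"\b\d{8,12}\b")  # hypothetical account-number format


def accounts_in_corpus(texts: list[str]) -> set[str]:
    """Collect every account-like number that literally appears in the source text."""
    found: set[str] = set()
    for text in texts:
        found.update(ACCOUNT_PATTERN.findall(text))
    return found


def validate_relationships(
    relationships: list[dict], known_accounts: set[str]
) -> tuple[list[dict], list[dict]]:
    """Split relationships into (accepted, rejected).

    A relationship is rejected when either endpoint is not present in the corpus.
    """
    accepted, rejected = [], []
    for relationship in relationships:
        endpoints = {relationship["from_account"], relationship["to_account"]}
        (accepted if endpoints <= known_accounts else rejected).append(relationship)
    return accepted, rejected
```

This check only confirms that both account numbers exist somewhere in the corpus. It does not confirm that the source text actually links them, so rejected items can be dropped or routed to review, while accepted items still need the verification layer.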
Performance Numbers
| Metric | Result |
|--------|--------|
| Total documents processed | 12,247 |
| Processing time | 11 days |
| Entities extracted | 34,891 |
| Relationships mapped | 12,403 |
| Redactions applied | 8,672 |
| Documents flagged for human review | 847 (6.9% of documents) |
| Post-review redaction corrections | 23 (0.19% of documents) |
Recommendations for Next Deployment
- Pre-stage all software on a verified disk image to reduce setup time from one day to hours
- Include higher-quality OCR (PaddleOCR or EasyOCR as alternatives to Tesseract)
- Implement automatic document profiling: analyze size, format, and complexity before processing
- Add financial entity validation rules to reduce hallucinated relationships
- Bring a spare GPU, because hardware failure on isolated infrastructure has no quick fix
Conclusion
Air-gapped AI agent deployment is viable for production evidence processing workloads. The capability gap between local and cloud models is real but manageable for most forensic analysis tasks. The setup overhead is significant but amortizes over the engagement duration.
TCI is refining the deployment playbook based on lessons from this and subsequent engagements.
