Description:
Deep Singh, who leads emerging data technologies and AI architecture at Fujitsu North America, presents a production engineering deep dive into low-latency voice intelligence extraction for enterprise contact centers. The session opens with a stark operational reality: the average contact center call lasts 6.5 minutes, but agents spend nearly as long — 6.3 minutes — on after-call work (ACW), manually typing notes, selecting disposition codes, and summarizing what happened. That near-1:1 ratio of talk time to administrative overhead is the engineering target.
The solution is a four-stage pipeline that transforms raw multi-channel audio into structured JSON summaries in near real time; illustrative code sketches for several of these stages follow the list.

1. Audio capture: channel mapping separates agent and customer speech, and early-stage PII masking prevents credit card numbers and passwords from ever reaching LLM memory.
2. Speech-to-text: acoustic modeling plus domain-specific dictionaries (distinguishing "term life" from "turn" in insurance calls), inverse text normalization, and auto-punctuation.
3. Generative AI core: rather than dumping raw transcripts at an LLM, the team uses few-shot prompt libraries to enforce structured bullet-point output, a reasoning layer that classifies call intent against a predefined taxonomy with explanations, and a trust layer that runs hallucination checks to ensure summaries stay grounded in the transcript.
4. CRM integration: the LLM's JSON output maps directly to CRM fields via an API gateway schema mapper.
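The talk does not publish Fujitsu's masking implementation, but the stage-one idea is straightforward to sketch. Below is a minimal, regex-based redactor, assuming pattern matching on each utterance before it is logged or prompted; production trust pipelines typically layer NER-based detectors on top. Names like `mask_pii` and `PII_PATTERNS` are illustrative.

```python
import re

# Hypothetical pattern set; real deployments combine regexes with NER models.
# The key property is that redaction happens BEFORE the text reaches the LLM
# or any persistent store.
PII_PATTERNS = {
    "CARD_NUMBER": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_pii(utterance: str) -> str:
    """Replace detected PII spans with typed placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        utterance = pattern.sub(f"[{label}]", utterance)
    return utterance

print(mask_pii("My card is 4111 1111 1111 1111, expiry 09/27"))
# -> "My card is [CARD_NUMBER], expiry 09/27"
```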
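For stage three, the session describes few-shot prompt libraries that pin the model to bullet-point JSON and intent labels drawn from a fixed taxonomy. A sketch of how such a prompt might be assembled is below; the taxonomy entries, example transcript, and `build_prompt` helper are all assumptions, not Fujitsu's actual prompt library.

```python
import json

# Illustrative taxonomy; the real one is predefined per business domain.
INTENT_TAXONOMY = ["billing_dispute", "policy_change", "claim_status", "cancellation"]

# Few-shot examples teach the model the exact output shape expected downstream.
FEW_SHOT_EXAMPLES = [
    {
        "transcript": "Customer: I was charged twice this month...",
        "output": {
            "summary": [
                "Customer reports a duplicate charge on this month's bill.",
                "Agent opened a billing ticket and promised a refund in 3-5 days.",
            ],
            "intent": "billing_dispute",
            "intent_explanation": "The call centers on reversing an incorrect charge.",
        },
    },
]

def build_prompt(transcript: str) -> str:
    """Assemble a few-shot prompt that enforces structured bullet-point JSON."""
    shots = "\n\n".join(
        f"Transcript:\n{ex['transcript']}\nOutput:\n{json.dumps(ex['output'], indent=2)}"
        for ex in FEW_SHOT_EXAMPLES
    )
    return (
        "Summarize the call as JSON with keys: summary (bullet list), intent "
        f"(one of {INTENT_TAXONOMY}), and intent_explanation.\n\n"
        f"{shots}\n\nTranscript:\n{transcript}\nOutput:\n"
    )
```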
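The trust layer's hallucination check can be approximated many ways; the talk does not specify the mechanism. One simple, cheap heuristic is lexical grounding: flag any summary bullet whose content words are mostly absent from the transcript. Production trust layers may instead use entailment models or a second LLM pass; this sketch and its threshold are assumptions.

```python
import re

def grounding_score(bullet: str, transcript: str) -> float:
    """Fraction of the bullet's content words that also appear in the transcript."""
    stop = {"the", "a", "an", "and", "or", "to", "of", "in", "on", "for", "is", "was"}
    words = [w for w in re.findall(r"[a-z']+", bullet.lower()) if w not in stop]
    if not words:
        return 1.0
    source = set(re.findall(r"[a-z']+", transcript.lower()))
    return sum(w in source for w in words) / len(words)

def ungrounded_bullets(bullets: list[str], transcript: str,
                       threshold: float = 0.6) -> list[str]:
    """Return bullets below the grounding threshold, for regeneration or review."""
    return [b for b in bullets if grounding_score(b, transcript) < threshold]
```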
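Stage four is described as a schema mapper in the API gateway that translates LLM JSON into CRM fields. A minimal sketch follows, assuming a static key-to-field map and Salesforce-style custom field names; the `SCHEMA_MAP` entries and `to_crm_payload` helper are hypothetical.

```python
# Illustrative mapping from LLM output keys to CRM field names.
SCHEMA_MAP = {
    "summary": "Call_Notes__c",
    "intent": "Disposition_Code__c",
    "intent_explanation": "Disposition_Reason__c",
}

def to_crm_payload(llm_output: dict) -> dict:
    """Translate validated LLM JSON into the CRM's field schema."""
    payload = {}
    for llm_key, crm_field in SCHEMA_MAP.items():
        value = llm_output.get(llm_key)
        if isinstance(value, list):  # bullet lists become newline-joined note text
            value = "\n".join(f"- {b}" for b in value)
        payload[crm_field] = value
    return payload
```

Keeping the mapping declarative means disposition codes and note fields can be re-pointed per CRM tenant without touching the LLM prompt.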
Singh reports that the architecture targets a 50% or greater reduction in ACW, and closes with ongoing constraints, notably the STT accuracy the pipeline requires (above 90%), and roadmap items.
📺 Source: AI Engineer · Published April 08, 2026
🏷️ Format: Workflow Case Study
