Run OpenAI’s Internal PII Detection Model Locally – Privacy Filter Setup & Demo

Description:

OpenAI has open-sourced Privacy Filter, the PII detection model they built and deployed internally to sanitize data before it enters their own systems. Released under an Apache 2.0 license and hosted on Hugging Face, the model is now available for anyone to run locally — a notable departure for a lab that keeps most of its core tooling closed. In this hands-on walkthrough, Fahd Mirza installs and runs Privacy Filter on an Ubuntu machine with an NVIDIA RTX A6000, measuring just over 3GB of VRAM consumption — well within reach of consumer hardware or even a CPU with sufficient RAM.
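As a rough sketch of what that local setup looks like, the model can be loaded through the Hugging Face `transformers` token-classification pipeline. The exact repo path isn't given here, so `openai/privacy-filter` below is a placeholder, and the pipeline task is assumed from the video's description of token-level tagging; substitute the ID shown in the walkthrough.

```python
from transformers import pipeline

# Placeholder repo ID; substitute the actual Hugging Face path from the video.
MODEL_ID = "openai/privacy-filter"

# aggregation_strategy="simple" merges BIOES-tagged tokens into whole entity spans.
pii = pipeline(
    "token-classification",
    model=MODEL_ID,
    aggregation_strategy="simple",
    device=0,  # GPU index; use device=-1 (or omit) to run on CPU instead
)

text = "Reach Jane Doe at +1-555-0173 or jane.doe@example.com."
for entity in pii(text):
    print(entity["entity_group"], repr(entity["word"]), round(float(entity["score"]), 3))
```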

The video covers both high-level pipeline usage and low-level token scoring across 33 PII label classes, using a BIOES tagging scheme (Begin, Inside, Outside, End, Single) to mark entity spans at the token level. Mirza demonstrates how production teams can control detection thresholds — requiring 0.99 confidence for medical data pipelines versus 0.85 for log sanitization — rather than relying on the pipeline’s default decisions. A context-aware detection demo shows the model correctly distinguishing a personal phone number from a company hotline or a doctor’s office number, something regex tools cannot reliably do.
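A minimal sketch of that threshold control, assuming the entity dicts produced by the pipeline above (each carrying a `score` field); the profile names are illustrative, not from the video:

```python
from typing import Any

# Per-pipeline confidence thresholds, matching the values cited in the video.
THRESHOLDS = {
    "medical": 0.99,  # stricter bar for medical data pipelines
    "logs": 0.85,     # looser bar for log sanitization
}

def filter_by_confidence(
    entities: list[dict[str, Any]], profile: str
) -> list[dict[str, Any]]:
    """Keep only entity spans whose model confidence clears the profile's bar."""
    threshold = THRESHOLDS[profile]
    return [e for e in entities if float(e["score"]) >= threshold]

# Usage with the `pii` pipeline from the previous sketch:
# spans = filter_by_confidence(pii(text), profile="medical")
```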

The walkthrough concludes with a deployable redaction function that merges tokenizer-split spans and replaces them with placeholders from end to start (preserving string positions), giving viewers production-ready code for integrating PII filtering into any enterprise AI pipeline.
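The function from the video isn't reproduced verbatim here, but a sketch of the same two ideas (merge tokenizer-split spans, then substitute placeholders from end to start so character offsets stay valid) might look like this, again assuming the pipeline's entity dict format:

```python
from typing import Any

def redact(text: str, entities: list[dict[str, Any]]) -> str:
    """Replace detected PII spans with placeholders such as [PHONE_NUMBER]."""
    # 1) Sort by start offset and merge adjacent/overlapping spans, which
    #    re-joins entities the tokenizer split into multiple pieces.
    merged: list[dict[str, Any]] = []
    for e in sorted(entities, key=lambda e: e["start"]):
        if merged and e["start"] <= merged[-1]["end"]:
            merged[-1]["end"] = max(merged[-1]["end"], e["end"])
        else:
            merged.append({
                "start": e["start"],
                "end": e["end"],
                "label": e.get("entity_group") or e.get("entity", "PII"),
            })
    # 2) Substitute from end to start so earlier offsets remain valid
    #    after each replacement changes the string length.
    for span in reversed(merged):
        placeholder = f"[{span['label'].upper()}]"
        text = text[: span["start"]] + placeholder + text[span["end"] :]
    return text

# Usage with the sketches above:
# print(redact(text, filter_by_confidence(pii(text), profile="logs")))
```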


📺 Source: Fahd Mirza · Published May 02, 2026
🏷️ Format: Tutorial Demo
