Journal Article

Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial Scale

Abstract

Labeling training data is one of the most costly bottlenecks in developing machine learning-based applications. We present a first-of-its-kind study showing how existing knowledge resources from across an organization can be used as weak supervision in order to bring development time and cost down by an order of magnitude, and introduce Snorkel DryBell, a new weak supervision management system for this setting. Snorkel DryBell builds on the Snorkel framework, extending it in three critical aspects: flexible, template-based ingestion of diverse organizational knowledge, cross-feature production serving, and scalable, sampling-free execution. On three classification tasks at Google, we find that Snorkel DryBell creates classifiers of comparable quality to ones trained with tens of thousands of hand-labeled examples, converts non-servable organizational resources to servable models for an average 52% performance improvement, and executes over millions of data points in tens of minutes.

Project page

A system for rapidly creating, modeling, and managing training data, focused on accelerating the development of structured or “dark” data extraction applications for domains in which large labeled training sets are not available or easy to obtain.
Author(s)
Stephen H. Bach
Daniel Rodriguez
Yintao Liu
Chong Luo
Haidong Shao
Cassandra Xia
Souvik Sen
Alexander Ratner
Braden Hancock
Houman Alborzi
Rahul Kuchhal
Christopher Ré
Rob Malkin
Journal Name
SIGMOD ’19: Proceedings of the 2019 International Conference on Management of Data
Publication Date
June 2019
DOI
10.1145/3299869.3314036
Publisher
ACM