Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6\times 10^{11}$ tokens and based on the Common Crawl repository of web data.
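The idea of two models agreeing with each other's stale predictions can be illustrated with a small sketch. The code below is not the setup used in the experiments: it assumes two linear softmax classifiers, a loss equal to the usual cross-entropy plus a weighted cross-entropy toward the peer's predictions, and hypothetical names and values (`codistill_step`, `distill_weight`, `sync_every`) chosen only for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def codistill_step(w, x, y_onehot, stale_peer_w, lr=0.1, distill_weight=0.5):
    """One gradient step on CE(labels, p) + distill_weight * CE(peer_p, p).

    The peer's predictions come from a stale copy of its weights and are
    treated as constants (no gradient flows into the peer).
    """
    p = softmax(x @ w)
    peer_p = softmax(x @ stale_peer_w)
    # Gradient of both cross-entropy terms with respect to the logits.
    grad_logits = (p - y_onehot) + distill_weight * (p - peer_p)
    grad_w = x.T @ grad_logits / len(x)
    return w - lr * grad_w

def train(x_a, y_a, x_b, y_b, n_classes, steps=200, sync_every=20, seed=0):
    """Train two models on disjoint shards, exchanging weights only rarely."""
    rng = np.random.default_rng(seed)
    d = x_a.shape[1]
    w_a = rng.normal(scale=0.01, size=(d, n_classes))
    w_b = rng.normal(scale=0.01, size=(d, n_classes))
    stale_a, stale_b = w_a.copy(), w_b.copy()
    eye = np.eye(n_classes)
    for step in range(steps):
        if step % sync_every == 0:
            # Assumed exchange schedule: refresh the stale copies every
            # sync_every steps; in between, each model sees old peer weights.
            stale_a, stale_b = w_a.copy(), w_b.copy()
        w_a = codistill_step(w_a, x_a, eye[y_a], stale_b)
        w_b = codistill_step(w_b, x_b, eye[y_b], stale_a)
    return w_a, w_b
```

In this sketch the only communication between the two workers is the occasional weight copy, which is what lets the peer predictions be computed locally from stale weights. When a model and its stale peer have identical weights, the distillation term contributes no gradient and the step reduces to ordinary gradient descent on the labels.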