Federated learning with differential privacy, i.e. private federated learning (PFL), makes it possible to train models on private data distributed across users’ devices without harming privacy. PFL is efficient for models, such as neural networks, that have a fixed number of parameters, and thus a fixed-dimensional gradient vector. Such models include neural-net language models, but not tokenizers, the topic of this work. Training a tokenizer requires frequencies of words from an unlimited vocabulary, and existing methods for finding an unlimited vocabulary need a separate privacy budget.
A workaround is to train the tokenizer on publicly available data. However, in this paper we first show that a tokenizer trained on mismatched data results in worse model performance compared to a privacy-violating “oracle” tokenizer that accesses user data, with perplexity increasing by 20%. We also show that sub-word tokenizers are better suited to the federated context than word-level ones, since they can encode new words, though with more tokens per word. Second, we propose a novel method to obtain a tokenizer without using any additional privacy budget. During private federated learning of the language model, we sample from the model, train a new tokenizer on the sampled sequences, and update the model embeddings. We then continue private federated learning, and obtain performance within 1% of the “oracle” tokenizer. Since this process trains the tokenizer only indirectly on private data, we can use the “postprocessing guarantee” of differential privacy and thus use no additional privacy budget.
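The following is a minimal sketch of the three intermediate steps described above: sample sequences from the partially trained language model, train a fresh sub-word tokenizer on those samples, and re-initialize the embedding table for the new vocabulary before resuming private federated learning. It assumes a PyTorch language model that returns next-token logits and uses the Hugging Face `tokenizers` library for BPE training; the helper names (`sample_from_model`, `remap_embeddings`) and the averaging-based embedding re-initialization are illustrative assumptions, not necessarily the authors’ exact choices.

```python
import torch
from tokenizers import Tokenizer, models, trainers, pre_tokenizers


def sample_from_model(model, old_tok, num_samples=10_000, max_len=32):
    """Draw sequences from the LM by ancestral sampling and decode them to text.

    Because the samples come from the DP-trained model only, any function of
    them is covered by the postprocessing guarantee of differential privacy.
    """
    texts = []
    for _ in range(num_samples):
        ids = [old_tok.token_to_id("[BOS]")]
        for _ in range(max_len):
            logits = model(torch.tensor([ids]))[0, -1]  # assumed: [batch, seq, vocab] logits
            next_id = torch.multinomial(torch.softmax(logits, -1), 1).item()
            if next_id == old_tok.token_to_id("[EOS]"):
                break
            ids.append(next_id)
        texts.append(old_tok.decode(ids))
    return texts


def train_new_tokenizer(texts, vocab_size=8_000):
    """Fit a sub-word (BPE) tokenizer on the sampled text."""
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size, special_tokens=["[UNK]", "[BOS]", "[EOS]"]
    )
    tok.train_from_iterator(texts, trainer)
    return tok


def remap_embeddings(old_emb, old_tok, new_tok):
    """One plausible embedding update (an assumption): initialize each new
    token's vector as the mean of the old embeddings of its segmentation
    under the old tokenizer."""
    dim = old_emb.weight.shape[1]
    new_emb = torch.nn.Embedding(new_tok.get_vocab_size(), dim)
    with torch.no_grad():
        for token, new_id in new_tok.get_vocab().items():
            old_ids = old_tok.encode(token).ids
            if old_ids:
                new_emb.weight[new_id] = old_emb.weight[old_ids].mean(dim=0)
    return new_emb
```

In this sketch, private federated learning would then continue with the new tokenizer and the remapped embedding table, while the rest of the model parameters are kept as trained so far.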