explosion/spaCy: v2.1.0: New models, ULMFit/BERT/Elmo-like pretraining, faster tokenization, better Matcher, bug fixes & more

2019 · 0 citations · Journal Article · green Open Access

Authors

Matthew Honnibal · Gezhouba Explosive (China)

Ines Montani · Gezhouba Explosive (China)

György Orosz · LogMeIn (United Kingdom)

Søren Lind Kristiansen

Abstract

⚠️ This version of spaCy requires downloading new models. You can use the <code>spacy validate</code> command to find out which models need updating, and print update instructions. If you've been training your own models, you'll need to retrain them with the new version.

✨ New features and improvements

Tagger, Parser, NER and Text Categorizer

- NEW: Experimental ULMFit/BERT/Elmo-like pretraining (see #2931) via the new <code>spacy pretrain</code> command. This pre-trains the CNN using BERT's cloze task. A new trick we're calling Language Modelling with Approximate Outputs is used to apply the pre-training to smaller models. The pre-training outputs CNN and embedding weights that can be used in <code>spacy train</code>, using the new <code>-t2v</code> argument.
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Add simpler, GPU-friendly option to <code>TextCategorizer</code>, and allow setting <code>exclusive_classes</code> and <code>architecture</code> arguments on initialization.
- Add <code>EntityRecognizer.labels</code> property.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v7.0, which defaults to single-thread with fast <code>blis</code> kernel for matrix multiplication. Parallelisation should be performed at the task level, e.g. by running more containers.

Models & Language Data

- NEW: 2-3 times faster tokenization across all languages at the same accuracy!
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
- NEW: Alpha support for Tamil, Ukrainian and Kannada, and base language classes for Afrikaans, Bulgarian, Czech, Icelandic, Lithuanian, Latvian, Slovak, Slovenian and Albanian.
- Improve loading time of <code>French</code> by ~30%.
- Add <code>Vocab.writing_system</code> (populated via the language data) to expose settings like writing direction.

CLI

- NEW: <code>pretrain</code> command for ULMFit/BERT/Elmo-like pretraining (see #2931).
- NEW: New <code>ud-train</code> command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via <code>spacy download</code>.
- Pass additional arguments of <code>download</code> command to <code>pip</code> to customise installation.
- Improve <code>train</code> command by letting <code>GoldCorpus</code> stream data, instead of loading into memory.
- Improve <code>init-model</code> command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the <code>spacy vocab</code> command, which is now deprecated.
- Add support for multi-task objectives to <code>train</code> command.
- Add support for data-augmentation to <code>train</code> command.

Other

- NEW: Enhanced pattern API for rule-based <code>Matcher</code> (see #1971).
- NEW: <code>Doc.retokenize</code> context manager for merging and splitting tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- NEW: Allow <code>PhraseMatcher</code> to match on token attributes other than <code>ORTH</code>, e.g. <code>LOWER</code> (for case-insensitive matching) or even <code>POS</code> or <code>TAG</code>.
- NEW: Replace <code>ujson</code>, <code>msgpack</code>, <code>msgpack-numpy</code>, <code>pickle</code>, <code>cloudpickle</code> and <code>dill</code> with our own package <code>srsly</code> to centralise dependencies and allow binary wheels.
- NEW: <code>Doc.to_json()</code> method which outputs data in spaCy's training format. This will be the only place where the format is hard-coded (see #2932).
- NEW: Built-in <code>EntityRuler</code> component to make it easier to build rule-based NER and combinations of statistical and rule-based systems.
- NEW: <code>gold.spans_from_biluo_tags</code> helper that returns <code>Span</code> objects, e.g. to overwrite the <code>doc.ents</code>.
- Add warnings if <code>.similarity</code> method is called with empty vectors or without word vectors.
- Improve rule-based <code>Matcher</code> and add <code>return_matches</code> keyword argument to <code>Matcher.pipe</code> to yield <code>(doc, matches)</code> tuples instead of only <code>Doc</code> objects, and <code>as_tuples</code> to add context to the <code>Doc</code> objects.
- Make stop words via <code>Token.is_stop</code> and <code>Lexeme.is_stop</code> case-insensitive.
- Accept <code>"TEXT"</code> as an alternative to <code>"ORTH"</code> in <code>Matcher</code> patterns.
- Use <code>black</code> for auto-formatting <code>.py</code> source and optimise codebase using <code>flake8</code>. You can now run <code>flake8 spacy</code> and it should return no errors or warnings. See <code>CONTRIBUTING.md</code> for details.

🔴 Bug fixes

- Fix issue #795: Fix behaviour of <code>Token.conjuncts</code>.
- Fix issue #1487: Add <code>Doc.retokenize()</code> context manager.
- Fix issue #1537: Make <code>Span.as_doc</code> return a copy, not a view.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1585: Prevent parser from predicting unseen classes.
- Fix issue #1642: Replace <code>regex</code> with <code>re</code> and speed up tokenization.
- Fix issue #1665: Correct typos in symbol <code>Animacy_inan</code> and add <code>Animacy_nhum</code>.
- Fix issue #1748, #1798, #2756, #2934: Add simpler GPU-friendly option to <code>TextCategorizer</code>.
- Fix issue #1773: Prevent tokenizer exceptions from setting <code>POS</code> but not <code>TAG</code>.
- Fix issue #1782, #2343: Fix training on GPU.
- Fix issue #1816: Allow custom <code>Language</code> subclasses via entry points.
- Fix issue #1865: Correct licensing of <code>it_core_news_sm</code> model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add <code>relcl</code> dependency label to symbols.
- Fix issue #1963: Resize <code>Doc.tensor</code> when merging spans.
- Fix issue #1971: Update <code>Matcher</code> engine to support regex, extension attributes and rich comparison.
- Fix issue #2014: Make <code>Token.pos_</code> writeable.
- Fix issue #2091: Fix <code>displacy</code> support for RTL languages.
- Fix issue #2203, #3268: Prevent bad interaction of lemmatizer and tokenizer exceptions.
- Fix issue #2329: Correct <code>TextCategorizer</code> and <code>GoldParse</code> API docs.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2390: Support setting lexical attributes during retokenization.
- Fix issue #2396: Fix <code>Doc.get_lca_matrix</code>.
- Fix issue #2464, #3009: Fix behaviour of <code>Matcher</code>'s <code>?</code> quantifier.
- Fix issue #2482: Fix serialization when parser model is empty.
- Fix issue #2512, #2153: Fix issue with deserialization into non-empty vocab.
- Fix issue #2603: Improve handling of missing NER tags.
- Fix issue #2644: Add table explaining training metrics to docs.
- Fix issue #2648: Fix <code>KeyError</code> in <code>Vectors.most_similar</code>.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix issue #2693: Only use <code>'sentencizer'</code> as built-in sentence boundary component name.
- Fix issue #2728: Fix HTML escaping in <code>displacy</code> NER visualization and correct API docs.
- Fix issue #2740: Add ability to pass additional arguments to pipeline components.
- Fix issue #2754, #3028: Make <code>NORM</code> a <code>Token</code> attribute instead of a <code>Lexeme</code> attribute to allow setting context-specific norms in tokenizer exceptions.
- Fix issue #2769: Fix issue that'd cause segmentation fault when calling <code>EntityRecognizer.add_label</code>.
- Fix issue #2772: Fix bug in sentence starts for non-projective parses.
- Fix issue #2779: Fix handling of pre-set entities.
- Fix issue #2782: Make <code>like_num</code> work with prefixed numbers.
- Fix issue #2833: Raise better error if <code>Token</code> or <code>Span</code> are pickled.
- Fix issue #2838: Add <code>Retokenizer.split</code> method to split one token into several.
- Fix issue #2869: Make <code>doc[0].is_sent_start == True</code>.
- Fix issue #2870: Make it illegal for the entity recognizer to predict whitespace tokens as <code>B</code>, <code>L</code> or <code>U</code>.
- Fix issue #2871: Fix vectors for reserved words.
- Fix issue #2901: Fix issue with first call of <code>nlp</code> in Japanese (MeCab).
- Fix issue #2924: Make IDs of displaCy arcs more unique to avoid clashes.
- Fix issue #3012: Fix clobber of <code>Doc.is_tagged</code> in <code>Doc.from_array</code>.
- Fix issue #3027: Allow <code>Span</code> to take unicode value for <code>label</code> argument.
- Fix issue #3036: Support mutable default arguments in extension attributes.
- Fix issue #3048: Raise better errors for uninitialized pipeline components.
- Fix issue #3064: Allow single string attributes in <code>Doc.to_array</code>.
- Fix issue #3093, #3067: Set <code>vectors.name</code> correctly when exporting model via CLI.
- Fix issue #3112: Make sure entity types are added correctly on GPU.
- Fix issue #3191: Fix pickling of <code>Japanese</code>.
- Fix issue #3122: Correct docs of <code>Token.subtree</code> and <code>Span.subtree</code>.
- Fix issue #3128: Improve error handling in converters.
- Fix issue #3248: Fix <code>PhraseMatcher</code> pickling and make <code>__len__</code> consist
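
The notes above mention a <code>gold.spans_from_biluo_tags</code> helper and the <code>B</code>, <code>L</code> and <code>U</code> entity tags. As a rough illustration of the BILUO scheme those items refer to, the following is a minimal pure-Python sketch that decodes a tag sequence into token spans. It is not spaCy's implementation: the function name <code>spans_from_biluo</code>, the tuple return type and the sample sentence are assumptions made for this example, whereas the real helper is described as returning <code>Span</code> objects.

```python
from typing import List, Tuple


def spans_from_biluo(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode BILUO tags into (start, end, label) token spans, end exclusive.

    Hypothetical helper for illustration only; it does not use spaCy.
    Malformed sequences (e.g. an L- tag with no open B- tag) are skipped.
    """
    spans = []
    start = None  # token index of the currently open B- tag, if any
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            # Unit: a single-token entity
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":
            # Begin: open a multi-token entity
            start = i
        elif prefix == "L" and start is not None:
            # Last: close the open entity, using the label on the L- tag
            spans.append((start, i + 1, label))
            start = None
        elif prefix == "O":
            # Outside: discard any unfinished entity
            start = None
        # "I" (inside) tags simply continue the open entity
    return spans


tokens = ["I", "like", "New", "York", "City", "and", "London"]
tags = ["O", "O", "B-GPE", "I-GPE", "L-GPE", "O", "U-GPE"]
for start, end, label in spans_from_biluo(tags):
    print(label, " ".join(tokens[start:end]))
```

Run as a script, this prints one line per decoded entity: "GPE New York City" and "GPE London". Returning plain index tuples keeps the sketch independent of any spaCy version.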

Topics & Keywords

Simulation Techniques and Applications

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.2597447
