explosion/spaCy: v2.2.2: Multiprocessing, future APIs, Luxembourgish base support & simpler GPU install

20190 citations · Journal Article · green Open Access

Authors

Matthew Honnibal · Gezhouba Explosive (China)

Ines Montani · Gezhouba Explosive (China)

Sofie Van Landeghem · Occidental Petroleum (United States)

Abstract

✨ New features and improvements

- <strong>NEW:</strong> Support multiprocessing in <code>nlp.pipe</code> via the <code>n_process</code> argument (Python 3 only).
- Base language support for Luxembourgish.
- Add noun chunks iterator for Swedish.
- Retrained models for Greek, Norwegian Bokmål and Lithuanian that now correctly support parser-based sentence segmentation.
- Repackaged models for Greek and German with improved lookup tables via <code>spacy-lookups-data</code>.
- Add warning in <code>debug-data</code> for low sentences per doc ratio.
- Improve checks and errors related to ill-formed IOB input in <code>convert</code> and <code>debug-data</code> CLI.
- Support training dict format as JSONL.
- Make <code>EntityRuler</code> ID resolution 2× faster and support <code>"id"</code> in patterns to set <code>Token.ent_id</code>.
- Improve rendering of named entity spans in <code>displacy</code> for RTL languages.
- Update Thinc to ditch <code>thinc_gpu_ops</code> for simpler GPU install.
- Support Mish activation in <code>spacy pretrain</code>.
- Add backwards-compatible support for new <code>Language.disable_pipes</code> API, which will become the default in the future. The method can now also take a list of component names as its first argument (instead of a variable number of arguments).
<pre><code class="lang-diff">- disabled = nlp.disable_pipes("tagger", "parser")
+ disabled = nlp.disable_pipes(["tagger", "parser"])
</code></pre>
- Add backwards-compatible support for new <code>Matcher.add</code> and <code>PhraseMatcher.add</code> API, which will become the default in the future. The patterns are now the second argument and a list (instead of a variable number of arguments). The <code>on_match</code> callback becomes an optional keyword argument.
<pre><code class="lang-diff">patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]
- matcher.add("GoogleNow", None, *patterns)
+ matcher.add("GoogleNow", patterns)
- matcher.add("GoogleNow", on_match, *patterns)
+ matcher.add("GoogleNow", patterns, on_match=on_match)
</code></pre>
- Add new and improved tokenization alignment in <code>gold.align</code> behind a feature flag. The new alignment may produce backwards-incompatible results, so it won't be enabled by default before v3.0.
<pre><code class="lang-python">import spacy.gold
spacy.gold.USE_NEW_ALIGN = True
</code></pre>
- Add wheel for Python 3.8 on Linux (Windows and Mac are coming as soon as our CI providers and third-party tools are updated).

🔴 Bug fixes

- Fix issue #1303: Support multiprocessing in <code>nlp.pipe</code>.
- Fix issue #1745: Ditch <code>thinc_gpu_ops</code> for simpler GPU install.
- Fix issue #2411: Update Thinc to fix compilation on cygwin.
- Fix issue #3412: Prevent division by zero in <code>Vectors.most_similar</code>.
- Fix issue #3618: Fix memory leak for long-running parsing processes.
- Fix issue #4241: Update Greek lookups in <code>spacy-lookups-data</code>.
- Fix issue #4269: Extend unicode character block for Sinhala.
- Fix issue #4362: Improve <code>URL_PATTERN</code> and handling in tokenizer.
- Fix issue #4373: Make <code>PhraseMatcher.vocab</code> consistent with <code>Matcher.vocab</code>.
- Fix issue #4377: Clarify serialization of extension attributes.
- Fix issue #4382: Improve usage of <code>pkg_resources</code> and handling of entry points.
- Fix issue #4386: Consider <code>batch_size</code> when sorting similar vectors.
- Fix issue #4389: Fix <code>ner_jsonl2json</code> converter.
- Fix issue #4397: Ensure <code>on_match</code> callback is executed in <code>PhraseMatcher</code>.
- Fix issue #4401, #4408: Fix sentence segmentation in Greek, Norwegian and Lithuanian models.
- Fix issue #4402: Fix issue with how training data was passed through the pipeline.
- Fix issue #4406: Correct spelling in lemmatizer API docs.
- Fix issue #4418, #4438: Improve knowledge base and Wikidata parsing.
- Fix issue #4435: Fix <code>PhraseMatcher.remove</code> for overlapping patterns.
- Fix issue #4443: Fix bug in <code>Vectors.most_similar</code>.
- Fix issue #4452: Fix <code>gold.docs_to_json</code> documentation.
- Fix issue #4463: Add missing <code>cats</code> to <code>GoldParse.from_annot_tuples</code> in <code>Scorer</code>.
- Fix issue #4470: Suppress convert output if writing to <code>stdout</code>.
- Fix issue #4475: Correct mistake in docs example.
- Fix issue #4485: Update tag maps and docs for English and German.
- Fix issue #4493: Update information in spaCy Universe.
- Fix issue #4496: Improve docs of <code>PhraseMatcher.add</code> arguments.
- Fix issue #4506: Ensure <code>Vectors.most_similar</code> returns <code>1.0</code> for identical vectors.
- Fix issue #4509: Fix <code>None</code> iteration error in entity linking script.
- Fix issue #4524: Fix typo in <code>Parser</code> sample construction of <code>GoldParse</code>.
- Fix issue #4528: Fix serialization of extension attribute values in <code>DocBin</code>.
- Fix issue #4529: Ensure <code>GoldParse</code> is initialized correctly with misaligned tokens.
- Fix issue #4538: Backport memory leak fix to v2.1.x branch and release v2.1.9.

⚠️ Backwards incompatibilities

The unused attributes <code>lemma_rules</code>, <code>lemma_index</code>, <code>lemma_exc</code> and <code>lemma_lookup</code> of the <code>Language.Defaults</code> have now been removed to prevent confusion (e.g. if users add rules that then have no effect). The only place lemmatization tables are stored and can be modified at runtime is via <code>nlp.vocab.lookups</code>.
<pre><code class="lang-diff">- nlp.Defaults.lemma_lookup["spaCies"] = "spaCy"
+ lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
+ lemma_lookup["spaCies"] = "spaCy"
</code></pre>

📖 Documentation and examples

- Fix various typos and inconsistencies.
- Add more projects to the spaCy Universe.

👥 Contributors

Thanks to @tamuhey, @PeterGilles, @akornilo, @danielkingai2, @ghollah, @pberba, @gustavengstrom, @ju-sh, @kabirkhan, @ZhuoruLin, @nipunsadvilkar and @neelkamath for the pull requests and contributions.
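The new-style <code>Matcher.add</code> call from the release notes can be tried end to end. The following is a minimal sketch, assuming spaCy v2.2.2 or later is installed; the blank English pipeline, the sample sentence and the `seen` list are illustrative choices and do not come from the release notes.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, so no model download is needed.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Patterns are passed as one list in the second position (new-style API).
patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]]

seen = []  # illustrative: collects the text of every matched span


def on_match(matcher, doc, i, matches):
    # Callback signature used by the Matcher: (matcher, doc, index, matches).
    match_id, start, end = matches[i]
    seen.append(doc[start:end].text)


# on_match is now an optional keyword argument.
matcher.add("GoogleNow", patterns, on_match=on_match)

doc = nlp("I use Google Now and GoogleNow daily")
matches = matcher(doc)
print(sorted(seen))
```

Both spellings are caught because each inner list is a separate pattern registered under the same match ID.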
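The <code>n_process</code> argument of <code>nlp.pipe</code> can be exercised with a small script. This is a sketch under the assumption that spaCy v2.2.2 or later runs on Python 3; the helper name `token_counts` and the sample texts are hypothetical. The `__main__` guard is there because worker processes may re-import the script on platforms that spawn rather than fork.

```python
import spacy


def token_counts(texts, n_process=1):
    """Return the number of tokens per text, using n_process worker processes."""
    nlp = spacy.blank("en")  # tokenizer-only pipeline, no model download needed
    return [len(doc) for doc in nlp.pipe(texts, n_process=n_process)]


if __name__ == "__main__":
    sample = ["Hello world", "One two three", "spaCy supports multiprocessing"]
    # Two worker processes; results come back in input order.
    print(token_counts(sample, n_process=2))
```

For a handful of short texts the process start-up cost will likely outweigh any gain, so multiprocessing is mainly worth trying on large batches.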
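The release notes describe <code>Language.disable_pipes</code> as accepting both a variable number of names and a single list. One common way to support both call styles in plain Python is sketched below. The function `normalize_names` is a hypothetical helper written for illustration; it is not taken from spaCy's source and makes no claim about how spaCy implements the change.

```python
def normalize_names(*names):
    """Accept normalize_names("a", "b") as well as normalize_names(["a", "b"]).

    Hypothetical helper: unwraps a single list or tuple argument so that
    old-style and new-style calls yield the same list of names.
    """
    if len(names) == 1 and isinstance(names[0], (list, tuple)):
        return list(names[0])
    return list(names)


# Old style: variable number of arguments.
print(normalize_names("tagger", "parser"))
# New style: one list as the first argument.
print(normalize_names(["tagger", "parser"]))
```

Checking for exactly one list or tuple argument keeps a single string such as `"tagger"` from being split into characters.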

Topics & Keywords

Simulation Techniques and Applications · Statistical Methods and Inference

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.3524402
