explosion/spaCy: v2.3.0: Models for Chinese, Danish, Japanese, Polish and Romanian, new updated models with vectors, faster loading, small API improvements & lots of bug fixes

20200 citations · Journal Article · green Open Access

Authors

Ines Montani · Gezhouba Explosive (China)

Matthew Honnibal · Gezhouba Explosive (China)

Sofie Van Landeghem · Gezhouba Explosive (China)

Abstract

⚠️ This version of spaCy requires downloading new models. You can use the <code>spacy validate</code> command to find out which models need updating and to print update instructions. If you've been training your own models, you'll need to retrain them with the new version.

✨ New features and improvements

NEW: Pretrained model families for Chinese, Danish, Japanese, Polish and Romanian, as well as larger models with vectors for Dutch, German, French, Italian, Greek, Lithuanian, Portuguese and Spanish. 29 new models and 46 model packages in total!
NEW: 2-4× faster loading times for models with vectors and 2× smaller packages.
NEW: Alpha support for Armenian, Gujarati and Malayalam.
NEW: Lookup lemmatization for Polish.
NEW: Allow <code>Matcher</code> to match on both <code>Doc</code> and <code>Span</code> objects.
NEW: Add <code>Token.is_sent_end</code> property.
Improve language data for Danish, Dutch, French, German, Italian, Lithuanian, Norwegian, Romanian and Spanish to better match UD corpora.
Update language data for Danish, Kannada, Korean, Persian, Swedish and Urdu.
Add support for <code>pkuseg</code> alongside <code>jieba</code> for Chinese.
Switch from <code>fugashi</code> to <code>sudachipy</code> for Japanese.
Improve punctuation used in sentencizer.
Switch to new and more consistent alignment method in <code>gold.align</code>.
Reduce stored lexemes data and move non-derivable features to <code>spacy-lookups-data</code>.

🔴 Bug fixes

Fix issue #5056: Introduce support for matching <code>Span</code> objects.
Fix issue #5086: Remove <code>Vectors.from_glove</code>.
Fix issue #5131: Improve data processing in named entity linking scripts.
Fix issue #5137: Fix passing of component configuration to component.
Fix issue #5144: Fix sentence comparison in test util.
Fix issue #5166: Fix handling of <code>exclusive_classes</code> in textcat ensemble.
Fix issue #5170: Set rank for new vector in <code>Vocab.set_vector</code>.
Fix issue #5181: Prevent <code>None</code> values in gold fields.
Fix issue #5191: Fix <code>GoldParse</code> initialization when the number of tokens has changed.
Fix issue #5193: Correctly pin <code>cupy-cuda</code> extra dependencies.
Fix issue #5200: Fix minor bugs in train CLI.
Fix issue #5216: Modify <code>Vectors.resize</code> to work with <code>cupy</code>.
Fix issue #5228: Raise error for inplace resize with new vector dimension.
Fix issue #5230: Fix <code>unittest</code> warnings when saving a model.
Fix issue #5257: Use inline flags in <code>token_match</code> patterns.
Fix issue #5278, #5359: Add missing <code>__init__.py</code> files to language data tests.
Fix issue #5281: Fix comparison predicate handling for <code>!=</code>.
Fix issue #5287: Normalize <code>TokenC.sent_start</code> values for <code>Matcher</code>.
Fix issue #5292: Fix typo in option name <code>--n-save_every</code>.
Fix issue #5303: Use <code>max(uint64)</code> for OOV lexeme rank.
Fix issue #5311: Fix alignment of cards on landing page.
Fix issue #5320: Fix <code>most_similar</code> for vectors with unused rows.
Fix issue #5344: Prevent pip from installing spaCy on Python 3.4.
Fix issue #5356: Fix bug in <code>Span.similarity</code> that could trigger <code>TypeError</code>.
Fix issue #5361: Fix problems with lower and whitespace in variants.
Fix issue #5373: Improve exceptions for <code>'d</code> (would/had) in English.
Fix issue #5387: Fix logic in train CLI timing eval on CPU/GPU.
Fix issue #5393, #5458: Fix check for overlapping spans in noun chunks.
Fix issue #5429: Modify array type to accommodate <code>OOV_RANK</code>.
Fix issue #5430: Check that row is within bounds when adding vector.
Fix issue #5435: Use <code>Token.sent_start</code> for <code>Span.sent</code>.
Fix issue #5436: Fix <code>ErrorsWithCodes().__class__</code> return value.
Fix issue #5450: Disallow merging 0-length spans.

📖 Documentation and examples

Fix various typos and inconsistencies.
Add new projects to the spaCy Universe.
Move <code>bin/wiki_entity_linking</code> scripts for Wikipedia to <code>projects</code> repo.

🔥 ICYMI: We recently updated the free and interactive spaCy course to include translations for German (with German NLP examples), Spanish (with Spanish NLP examples) and Japanese, as well as videos for English and German. Translations for Chinese (with Chinese NLP examples), French (with French NLP examples) and Russian coming soon!

📦 Model packages (46)

Model | Language | Version | Vectors
<code>zh_core_web_sm</code> | Chinese | 2.3.0 | 𐄂
<code>zh_core_web_md</code> | Chinese | 2.3.0 | ✓
<code>zh_core_web_lg</code> | Chinese | 2.3.0 | ✓
<code>da_core_news_sm</code> | Danish | 2.3.0 | 𐄂
<code>da_core_news_md</code> | Danish | 2.3.0 | ✓
<code>da_core_news_lg</code> | Danish | 2.3.0 | ✓
<code>nl_core_news_sm</code> | Dutch | 2.3.0 | 𐄂
<code>nl_core_news_md</code> | Dutch | 2.3.0 | ✓
<code>nl_core_news_lg</code> | Dutch | 2.3.0 | ✓
<code>en_core_web_sm</code> | English | 2.3.0 | 𐄂
<code>en_core_web_md</code> | English | 2.3.0 | ✓
<code>en_core_web_lg</code> | English | 2.3.0 | ✓
<code>fr_core_news_sm</code> | French | 2.3.0 | 𐄂
<code>fr_core_news_md</code> | French | 2.3.0 | ✓
<code>fr_core_news_lg</code> | French | 2.3.0 | ✓
<code>de_core_news_sm</code> | German | 2.3.0 | 𐄂
<code>de_core_news_md</code> | German | 2.3.0 | ✓
<code>de_core_news_lg</code> | German | 2.3.0 | ✓
<code>el_core_news_sm</code> | Greek | 2.3.0 | 𐄂
<code>el_core_news_md</code> | Greek | 2.3.0 | ✓
<code>el_core_news_lg</code> | Greek | 2.3.0 | ✓
<code>it_core_news_sm</code> | Italian | 2.3.0 | 𐄂
<code>it_core_news_md</code> | Italian | 2.3.0 | ✓
<code>it_core_news_lg</code> | Italian | 2.3.0 | ✓
<code>ja_core_news_sm</code> | Japanese | 2.3.0 | 𐄂
<code>ja_core_news_md</code> | Japanese | 2.3.0 | ✓
<code>ja_core_news_lg</code> | Japanese | 2.3.0 | ✓
<code>lt_core_news_sm</code> | Lithuanian | 2.3.0 | 𐄂
<code>lt_core_news_md</code> | Lithuanian | 2.3.0 | ✓
<code>lt_core_news_lg</code> | Lithuanian | 2.3.0 | ✓
<code>nb_core_news_sm</code> | Norwegian Bokmål | 2.3.0 | 𐄂
<code>nb_core_news_md</code> | Norwegian Bokmål | 2.3.0 | ✓
<code>nb_core_news_lg</code> | Norwegian Bokmål | 2.3.0 | ✓
<code>pl_core_news_sm</code> | Polish | 2.3.0 | 𐄂
<code>pl_core_news_md</code> | Polish | 2.3.0 | ✓
<code>pl_core_news_lg</code> | Polish | 2.3.0 | ✓
<code>pt_core_news_sm</code> | Portuguese | 2.3.0 | 𐄂
<code>pt_core_news_md</code> | Portuguese | 2.3.0 | ✓
<code>pt_core_news_lg</code> | Portuguese | 2.3.0 | ✓
<code>ro_core_news_sm</code> | Romanian | 2.3.0 | 𐄂
<code>ro_core_news_md</code> | Romanian | 2.3.0 | ✓
<code>ro_core_news_lg</code> | Romanian | 2.3.0 | ✓
<code>es_core_news_sm</code> | Spanish | 2.3.0 | 𐄂
<code>es_core_news_md</code> | Spanish | 2.3.0 | ✓
<code>es_core_news_lg</code> | Spanish | 2.3.0 | ✓
<code>xx_ent_wiki_sm</code> | Multi-language | 2.3.0 | 𐄂

👥 Contributors

Thanks to @mabraham, @sloev, @pinealan, @pmbaumgartner, @Baciccin, @nlptechbook, @guerda, @Tiljander, @nikhilsaldanha, @tommilligan, @Jacse, @leicmi, @YohannesDatasci, @mirfan899, @koaning, @umarbutler, @chopeen, @paoloq, @thomasthiebaud, @sebastienharinck, @elben10, @laszabine, @Mlawrence95, @sabiqueqb, @punitvara, @michael-k, @louisguitton, @vondersam, @thoppe, @vishnupriyavr, @ilivans and @osori for the pull requests and contributions.

🙏 Special thanks to everyone who helped us develop and test the new models: @lixiepeng, @lingvisa and @howl-anderson (Chinese), @hvingelby (Danish), @hiroshi-matsuda-rit and @polm (Japanese), @ryszardtuora (Polish) and @avramandrei and @dumitrescustefan (Romanian).
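
The release notes above mention that <code>Matcher</code> can now match on both <code>Doc</code> and <code>Span</code> objects. A minimal sketch of what that might look like follows. It assumes spaCy v2.3 or later is installed and uses a blank English pipeline, so no model package is required. The pattern name <code>HELLO_WORLD</code>, the pattern itself and the sample sentence are hypothetical and chosen only for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline only tokenizes, so no model download is needed.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical pattern name and token pattern, for illustration only.
# This uses the list-of-patterns form of Matcher.add, which spaCy
# documents as supported from v2.2.2 onwards.
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Hello world and hello world again")
span = doc[2:6]  # the tokens "and hello world again"

doc_matches = matcher(doc)    # match over the whole Doc
span_matches = matcher(span)  # match over a Span only

# Only the number of matches is compared here; the sketch does not rely on
# whether the offsets returned for a Span are relative to the Span or the Doc.
print(len(doc_matches), len(span_matches))
```

With this input the whole <code>Doc</code> should yield two matches and the <code>Span</code> one, since the slice excludes the first "Hello world".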

Topics & Keywords

Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Computational Physics and Python Applications

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.3897194
