
explosion

Professional software vendor delivering innovative solutions on the Softono platform. Specialized in both open-source and proprietary software development.

Total Products
13

Software by explosion

spaCy
Open Source

spaCy

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a> # spaCy: Industrial-strength NLP spaCy is a library for **advanced Natural Language Processing** in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with [pretrained pipelines](https://spacy.io/models) and currently supports tokenization and training for **70+ languages**. It features state-of-the-art speed and **neural network models** for tagging, parsing, **named entity recognition**, **text classification** and more, multi-task learning with pretrained **transformers** like BERT, as well as a production-ready [**training system**](https://spacy.io/usage/training) and easy model packaging, deployment and workflow management. spaCy is commercial open-source software, released under the [MIT license](https://github.com/explosion/spaCy/blob/master/LICENSE). 💫 **Version 3.8 out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases) [![tests](https://github.com/explosion/spaCy/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spaCy/actions/workflows/tests.yml) [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases) [![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/) [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy) [![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases) [![Code style: 
black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black) <br />
[![PyPi downloads](https://static.pepy.tech/personalized-badge/spacy?period=total&units=international_system&left_color=grey&right_color=orange&left_text=pip%20downloads)](https://pypi.org/project/spacy/)
[![Conda downloads](https://img.shields.io/conda/dn/conda-forge/spacy?label=conda%20downloads)](https://anaconda.org/conda-forge/spacy)

## 📖 Documentation

| Documentation | |
| --- | --- |
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
| 📚 **[Usage Guides]** | How to use spaCy and its features. |
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
| 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
| ⏩ **[GPU Processing]** | Use spaCy with CUDA-compatible GPU processing. |
| 📦 **[Models]** | Download trained pipelines for spaCy. |
| 🦙 **[Large Language Models]** | Integrate LLMs into spaCy pipelines. |
| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
| ⚙️ **[spaCy VS Code Extension]** | Additional tooling and features for working with spaCy's config files. |
| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
| 📰 **[Blog]** | Read about current spaCy and Prodigy development, releases, talks and more from Explosion. |
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
| 🔴 **[Live Stream]** | Join Matt as he works on spaCy and chat about NLP. |
| 🛠 **[Changelog]** | Changes and version history. |
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
| 👕 **[Swag]** | Support us and our work with unique, custom-designed swag! |
| <a href="https://explosion.ai/tailored-solutions"><img src="https://github.com/explosion/spaCy/assets/13643239/36d2a42e-98c0-4599-90e1-788ef75181be" width="150" alt="Tailored Solutions"/></a> | Custom NLP consulting, implementation and strategic advice by spaCy’s core development team. Streamlined, production-ready, predictable and maintainable. Send us an email or take our 5-minute questionnaire, and we'll be in touch! **[Learn more &rarr;](https://explosion.ai/tailored-solutions)** |

[spacy 101]: https://spacy.io/usage/spacy-101
[new in v3.0]: https://spacy.io/usage/v3
[usage guides]: https://spacy.io/usage/
[api reference]: https://spacy.io/api/
[gpu processing]: https://spacy.io/usage#gpu
[models]: https://spacy.io/models
[large language models]: https://spacy.io/usage/large-language-models
[universe]: https://spacy.io/universe
[spacy vs code extension]: https://github.com/explosion/spacy-vscode
[videos]: https://www.youtube.com/c/ExplosionAI
[live stream]: https://www.youtube.com/playlist?list=PLBmcuObd5An5_iAxNYLJa_xWmNzsYce8c
[online course]: https://course.spacy.io
[blog]: https://explosion.ai
[project templates]: https://github.com/explosion/projects
[changelog]: https://spacy.io/usage#changelog
[contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
[swag]: https://explosion.ai/merch

## 💬 Where to ask questions

The spaCy project is maintained by the [spaCy team](https://explosion.ai/about).
Please understand that we won't be able to provide individual support via email. We also believe that help is much more valuable if it's shared publicly, so that more people can benefit from it.

| Type | Platforms |
| --- | --- |
| 🚨 **Bug Reports** | [GitHub Issue Tracker] |
| 🎁 **Feature Requests & Ideas** | [GitHub Discussions] · [Live Stream] |
| 👩‍💻 **Usage Questions** | [GitHub Discussions] · [Stack Overflow] |
| 🗯 **General Discussion** | [GitHub Discussions] · [Live Stream] |

[github issue tracker]: https://github.com/explosion/spaCy/issues
[github discussions]: https://github.com/explosion/spaCy/discussions
[stack overflow]: https://stackoverflow.com/questions/tagged/spacy
[live stream]: https://www.youtube.com/playlist?list=PLBmcuObd5An5_iAxNYLJa_xWmNzsYce8c

## Features

- Support for **70+ languages**
- **Trained pipelines** for different languages and tasks
- Multi-task learning with pretrained **transformers** like BERT
- Support for pretrained **word vectors** and embeddings
- State-of-the-art speed
- Production-ready **training system**
- Linguistically-motivated **tokenization**
- Components for named **entity recognition**, part-of-speech tagging, dependency parsing, sentence segmentation, **text classification**, lemmatization, morphological analysis, entity linking and more
- Easily extensible with **custom components** and attributes
- Support for custom models in **PyTorch**, **TensorFlow** and other frameworks
- Built-in **visualizers** for syntax and NER
- Easy **model packaging**, deployment and workflow management
- Robust, rigorously evaluated accuracy

📖 **For more details, see the [facts, figures and benchmarks](https://spacy.io/usage/facts-figures).**

## ⏳ Install spaCy

For detailed installation instructions, see the [documentation](https://spacy.io/usage).
- **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
- **Python version**: Python >=3.7, <3.13 (only 64 bit)
- **Package managers**: [pip] · [conda] (via `conda-forge`)

[pip]: https://pypi.org/project/spacy/
[conda]: https://anaconda.org/conda-forge/spacy

### pip

Using pip, spaCy releases are available as source packages and binary wheels. Before you install spaCy and its dependencies, make sure that your `pip`, `setuptools` and `wheel` are up to date.

```bash
pip install -U pip setuptools wheel
pip install spacy
```

To install additional data tables for lemmatization and normalization you can run `pip install spacy[lookups]` or install [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) separately. The lookups package is needed to create blank models with lemmatization data, and to lemmatize in languages that don't yet come with pretrained models and aren't powered by third-party libraries.

When using pip it is generally recommended to install packages in a virtual environment to avoid modifying system state:

```bash
python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
pip install spacy
```

### conda

You can also install spaCy from `conda` via the `conda-forge` channel. For the feedstock including the build recipe and configuration, check out [this repository](https://github.com/conda-forge/spacy-feedstock).

```bash
conda install -c conda-forge spacy
```

### Updating spaCy

Some updates to spaCy may require downloading new statistical models. If you're running spaCy v2.0 or higher, you can use the `validate` command to check if your installed models are compatible and if not, print details on how to update them:

```bash
pip install -U spacy
python -m spacy validate
```

If you've trained your own models, keep in mind that your training and runtime inputs must match. After updating spaCy, we recommend **retraining your models** with the new version.
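The `validate` command remains the authoritative compatibility check. As a rough complement, the sketch below compares version numbers by hand. It assumes the pipeline naming convention in which a package version `a.b.c` targets spaCy `a.b.x`, and that every version string has at least two dot-separated numeric parts. The helper names `major_minor`, `looks_compatible` and `installed_version` are hypothetical and not part of spaCy.

```python
from importlib import metadata


def major_minor(version: str) -> tuple:
    """Return the (major, minor) part of a version string such as '3.7.1'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def looks_compatible(pipeline_version: str, spacy_version: str) -> bool:
    """Heuristic only: assumes a pipeline a.b.c was built for spaCy a.b.x."""
    return major_minor(pipeline_version) == major_minor(spacy_version)


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    spacy_version = installed_version("spacy")
    pipeline_version = installed_version("en_core_web_sm")
    if spacy_version is None or pipeline_version is None:
        print("spaCy or the en_core_web_sm pipeline is not installed")
    elif looks_compatible(pipeline_version, spacy_version):
        print("Versions look compatible")
    else:
        print("Versions differ; run `python -m spacy validate` for details")
```

Because this only looks at the first two version components, it can miss finer-grained constraints, so treat a mismatch as a prompt to run `python -m spacy validate` rather than as a verdict.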
📖 **For details on upgrading from spaCy 2.x to spaCy 3.x, see the [migration guide](https://spacy.io/usage/v3#migrating).**

## 📦 Download model packages

Trained pipelines for spaCy can be installed as **Python packages**. This means that they're a component of your application, just like any other module. Models can be installed using spaCy's [`download`](https://spacy.io/api/cli#download) command, or manually by pointing pip to a path or URL.

| Documentation | |
| --- | --- |
| **[Available Pipelines]** | Detailed pipeline descriptions, accuracy figures and benchmarks. |
| **[Models Documentation]** | Detailed usage and installation instructions. |
| **[Training]** | How to train your own pipelines on your data. |

[available pipelines]: https://spacy.io/models
[models documentation]: https://spacy.io/usage/models
[training]: https://spacy.io/usage/training

```bash
# Download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm

# pip install .tar.gz archive or .whl from path or URL
pip install /Users/you/en_core_web_sm-3.0.0.tar.gz
pip install /Users/you/en_core_web_sm-3.0.0-py3-none-any.whl
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
```

### Loading and using models

To load a model, use [`spacy.load()`](https://spacy.io/api/top-level#spacy.load) with the model name or a path to the model data directory.

```python
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
```

You can also `import` a model directly via its full name and then call its `load()` method with no arguments.

```python
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")
```

📖 **For more info and examples, check out the [models documentation](https://spacy.io/usage/models).**

## ⚒ Compile from source

The other way to install spaCy is to clone its [GitHub repository](https://github.com/explosion/spaCy) and build it from source. That is the common way if you want to make changes to the code base. You'll need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, [pip](https://pip.pypa.io/en/latest/installing/), [virtualenv](https://virtualenv.pypa.io/en/latest/) and [git](https://git-scm.com) installed. The compiler part is the trickiest. How to do that depends on your system.

| Platform | |
| --- | --- |
| **Ubuntu** | Install system-level dependencies via `apt-get`: `sudo apt-get install build-essential python-dev git`. |
| **Mac** | Install a recent version of [Xcode](https://developer.apple.com/xcode/), including the so-called "Command Line Tools". macOS and OS X ship with Python and git preinstalled. |
| **Windows** | Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or [Visual Studio Express](https://visualstudio.microsoft.com/vs/express/) that matches the version that was used to compile your Python interpreter. |

For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage#source) and the [quickstart widget](https://spacy.io/usage#section-quickstart) to get the right commands for your platform and Python version.
```bash
git clone https://github.com/explosion/spaCy
cd spaCy

python -m venv .env
source .env/bin/activate

# make sure you are using the latest pip
python -m pip install -U pip setuptools wheel

pip install -r requirements.txt
pip install --no-build-isolation --editable .
```

To install with extras:

```bash
pip install --no-build-isolation --editable .[lookups,cuda102]
```

## 🚦 Run tests

spaCy comes with an [extensive test suite](spacy/tests). In order to run the tests, you'll usually want to clone the repository and build spaCy from source. This will also install the required development dependencies and test utilities defined in the [`requirements.txt`](requirements.txt).

Alternatively, you can run `pytest` on the tests from within the installed `spacy` package. Don't forget to also install the test utilities via spaCy's [`requirements.txt`](requirements.txt):

```bash
pip install -r requirements.txt
python -m pytest --pyargs spacy
```

ML Frameworks
33.6K Github Stars
thinc
Open Source

thinc

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a> # Thinc: A refreshing functional take on deep learning, compatible with your favorite libraries ### From the makers of [spaCy](https://spacy.io) and [Prodigy](https://prodi.gy) [Thinc](https://thinc.ai) is a **lightweight deep learning library** that offers an elegant, type-checked, functional-programming API for **composing models**, with support for layers defined in other frameworks such as **PyTorch, TensorFlow and MXNet**. You can use Thinc as an interface layer, a standalone toolkit or a flexible way to develop new models. Previous versions of Thinc have been running quietly in production in thousands of companies, via both [spaCy](https://spacy.io) and [Prodigy](https://prodi.gy). We wrote the new version to let users **compose, configure and deploy custom models** built with their favorite framework. [![tests](https://github.com/explosion/thinc/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/thinc/actions/workflows/tests.yml) [![Current Release Version](https://img.shields.io/github/v/release/explosion/thinc.svg?include_prereleases&sort=semver&style=flat-square&logo=github)](https://github.com/explosion/thinc/releases) [![PyPi Version](https://img.shields.io/pypi/v/thinc.svg?include_prereleases&sort=semver&style=flat-square&logo=pypi&logoColor=white)](https://pypi.python.org/pypi/thinc) [![conda Version](https://img.shields.io/conda/vn/conda-forge/thinc.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/thinc) [![Python wheels](https://img.shields.io/badge/wheels-%E2%9C%93-4c1.svg?longCache=true&style=flat-square&logo=python&logoColor=white)](https://github.com/explosion/wheelwright/releases) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black) [![Open demo in 
Colab][colab]][intro_to_thinc_colab]

## 🔥 Features

- **Type-check** your model definitions with custom types and [`mypy`](https://mypy.readthedocs.io/en/latest/) plugin.
- Wrap **PyTorch**, **TensorFlow** and **MXNet** models for use in your network.
- Concise **functional-programming** approach to model definition, using composition rather than inheritance.
- Optional custom infix notation via **operator overloading**.
- Integrated **config system** to describe trees of objects and hyperparameters.
- Choice of **extensible backends**.
- **[Read more &rarr;](https://thinc.ai/docs)**

## 🚀 Quickstart

Thinc runs on **Linux**, **macOS** and **Windows**. The latest releases with binary wheels are available from [pip](https://pypi.python.org/pypi/thinc). Before you install Thinc and its dependencies, make sure that your `pip`, `setuptools` and `wheel` are up to date. For the most recent releases, pip 19.3 or newer is recommended.

```bash
pip install -U pip setuptools wheel
pip install thinc
```

See the [extended installation docs](https://thinc.ai/docs/install#extended) for details on optional dependencies for different backends and GPU. You might also want to [set up static type checking](https://thinc.ai/docs/install#type-checking) to take advantage of Thinc's type system.

> ⚠️ If you have installed PyTorch and you are using Python 3.7+, uninstall the
> package `dataclasses` with `pip uninstall dataclasses`, since it may have been
> installed by PyTorch and is incompatible with Python 3.7+.

### 📓 Selected examples and notebooks

Also see the [`/examples`](examples) directory and [usage documentation](https://thinc.ai/docs) for more examples. Most examples are Jupyter notebooks – to launch them on [Google Colab](https://colab.research.google.com) (with GPU support!) click on the button next to the notebook name.
| Notebook | Description | | --------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [`intro_to_thinc`][intro_to_thinc]<br />[![Open in Colab][colab]][intro_to_thinc_colab] | Everything you need to know to get started. Composing and training a model on the MNIST data, using config files, registering custom functions and wrapping PyTorch, TensorFlow and MXNet models. | | [`transformers_tagger_bert`][transformers_tagger_bert]<br />[![Open in Colab][colab]][transformers_tagger_bert_colab] | How to use Thinc, `transformers` and PyTorch to train a part-of-speech tagger. From model definition and config to the training loop. | | [`pos_tagger_basic_cnn`][pos_tagger_basic_cnn]<br />[![Open in Colab][colab]][pos_tagger_basic_cnn_colab] | Implementing and training a basic CNN for part-of-speech tagging model without external dependencies and using different levels of Thinc's config system. | | [`parallel_training_ray`][parallel_training_ray]<br />[![Open in Colab][colab]][parallel_training_ray_colab] | How to set up synchronous and asynchronous parameter server training with Thinc and [Ray](https://ray.readthedocs.io/en/latest/). 
| **[View more &rarr;](examples)** [colab]: https://gistcdn.githack.com/ines/dcf354aa71a7665ae19871d7fd14a4e0/raw/461fc1f61a7bc5860f943cd4b6bcfabb8c8906e7/colab-badge.svg [intro_to_thinc]: examples/00_intro_to_thinc.ipynb [intro_to_thinc_colab]: https://colab.research.google.com/github/explosion/thinc/blob/master/examples/00_intro_to_thinc.ipynb [transformers_tagger_bert]: examples/02_transformers_tagger_bert.ipynb [transformers_tagger_bert_colab]: https://colab.research.google.com/github/explosion/thinc/blob/master/examples/02_transformers_tagger_bert.ipynb [pos_tagger_basic_cnn]: examples/03_pos_tagger_basic_cnn.ipynb [pos_tagger_basic_cnn_colab]: https://colab.research.google.com/github/explosion/thinc/blob/master/examples/03_pos_tagger_basic_cnn.ipynb [parallel_training_ray]: examples/04_parallel_training_ray.ipynb [parallel_training_ray_colab]: https://colab.research.google.com/github/explosion/thinc/blob/master/examples/04_parallel_training_ray.ipynb ### 📖 Documentation & usage guides | Documentation | Description | | --------------------------------------------------------------------------------- | ----------------------------------------------------- | | [Introduction](https://thinc.ai/docs) | Everything you need to know. | | [Concept & Design](https://thinc.ai/docs/concept) | Thinc's conceptual model and how it works. | | [Defining and using models](https://thinc.ai/docs/usage-models) | How to compose models and update state. | | [Configuration system](https://thinc.ai/docs/usage-config) | Thinc's config system and function registry. | | [Integrating PyTorch, TensorFlow & MXNet](https://thinc.ai/docs/usage-frameworks) | Interoperability with machine learning frameworks | | [Layers API](https://thinc.ai/docs/api-layers) | Weights layers, transforms, combinators and wrappers. | | [Type Checking](https://thinc.ai/docs/usage-type-checking) | Type-check your model definitions and more. 
| ## 🗺 What's where | Module | Description | | ----------------------------------------- | --------------------------------------------------------------------------------- | | [`thinc.api`](thinc/api.py) | **User-facing API.** All classes and functions should be imported from here. | | [`thinc.types`](thinc/types.py) | Custom [types and dataclasses](https://thinc.ai/docs/api-types). | | [`thinc.model`](thinc/model.py) | The `Model` class. All Thinc models are an instance (not a subclass) of `Model`. | | [`thinc.layers`](thinc/layers) | The layers. Each layer is implemented in its own module. | | [`thinc.shims`](thinc/shims) | Interface for external models implemented in PyTorch, TensorFlow etc. | | [`thinc.loss`](thinc/loss.py) | Functions to calculate losses. | | [`thinc.optimizers`](thinc/optimizers.py) | Functions to create optimizers. Currently supports "vanilla" SGD, Adam and RAdam. | | [`thinc.schedules`](thinc/schedules.py) | Generators for different rates, schedules, decays or series. | | [`thinc.backends`](thinc/backends) | Backends for `numpy` and `cupy`. | | [`thinc.config`](thinc/config.py) | Config parsing and validation and function registry system. | | [`thinc.util`](thinc/util.py) | Utilities and helper functions. | ## 🐍 Development notes Thinc uses [`black`](https://github.com/psf/black) for auto-formatting, [`flake8`](http://flake8.pycqa.org/en/latest/) for linting and [`mypy`](https://mypy.readthedocs.io/en/latest/) for type checking. All code includes type hints wherever possible. See the [type reference](https://thinc.ai/docs/api-types) for more details on Thinc's custom types. ### 👷‍♀️ Building Thinc from source Building Thinc from source requires the full dependencies listed in [`requirements.txt`](requirements.txt) to be installed. You'll also need a compiler to build the C extensions. 
```bash
git clone https://github.com/explosion/thinc
cd thinc
python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
pip install -r requirements.txt
pip install --no-build-isolation .
```

Alternatively, install in editable mode:

```bash
pip install -r requirements.txt
pip install --no-build-isolation --editable .
```

Or by setting `PYTHONPATH`:

```bash
export PYTHONPATH=`pwd`
pip install -r requirements.txt
python setup.py build_ext --inplace
```

### 🚦 Running tests

Thinc comes with an [extensive test suite](thinc/tests). The following should all pass and not report any warnings or errors:

```bash
python -m pytest thinc    # test suite
python -m mypy thinc      # type checks
python -m flake8 thinc    # linting
```

To view test coverage, you can run `python -m pytest thinc --cov=thinc`. We aim for 100% test coverage. This doesn't mean that we meticulously write tests for every single line – we ignore blocks that are not relevant or difficult to test and make sure that the tests execute all code paths.
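Before running the full suite after a source build, a quick import-and-compose check can be useful. The sketch below is one possible version and is not part of the Thinc repository: the function name `thinc_smoke_check`, the toy array shapes and the layer sizes are all assumptions made for illustration. It composes two layers with `chain`, lets Thinc infer the input dimension from sample data, and reports the shape of the predictions.

```python
import importlib.util


def thinc_smoke_check():
    """Compose a tiny network and return the shape of its predictions.

    Returns None when Thinc is not importable in the current environment.
    """
    if importlib.util.find_spec("thinc") is None:
        return None

    import numpy
    from thinc.api import Relu, Softmax, chain

    # Toy data (illustrative only): 4 examples, 5 features, 3 output classes.
    X = numpy.zeros((4, 5), dtype="float32")
    Y = numpy.zeros((4, 3), dtype="float32")

    # Models are built by composing layers with combinators such as `chain`.
    model = chain(Relu(nO=8), Softmax(nO=3))
    # Passing sample data lets Thinc infer the missing input dimensions.
    model.initialize(X=X, Y=Y)
    return model.predict(X).shape


if __name__ == "__main__":
    print(thinc_smoke_check())
```

If the build succeeded, the reported shape should have one row per input example and one column per output class. A `None` result means Thinc could not be found on the import path.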

AI & Machine Learning ML Frameworks
2.9K Github Stars
spacy-course
Open Source

spacy-course

# Advanced NLP with spaCy: A free online course This repo contains both an [**online course**](https://course.spacy.io), as well as its modern open-source web framework. In the course, you'll learn how to use [spaCy](https://spacy.io) to build advanced natural language understanding systems, using both rule-based and machine learning approaches. The front-end is powered by [Gatsby](http://gatsbyjs.org/), [Reveal.js](https://revealjs.com) and [Plyr](https://github.com/sampotts/plyr), and the back-end code execution uses [Binder](https://mybinder.org) 💖 It's all open-source and published under the MIT license (code and framework) and CC BY-NC (spaCy course materials). _This course is mostly intended for **self-study**. Yes, you can cheat – the solutions are all in this repo, there's no penalty for clicking "Show hints" or "Show solution", and you can mark an exercise as done when you think it's done._ [![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/10/master.svg?logo=azure-devops&style=flat-square)](https://dev.azure.com/explosion-ai/public/_build?definitionId=10) ![Netlify Status](https://api.netlify.com/api/v1/badges/2eae6a1a-d7a3-437e-a700-61e32d7d991b/deploy-status) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/explosion/spacy-course/master) ## 💬 Languages and Translations | Language | Text Examples<sup>1</sup> | Source | Authors | | -------------------------------------------- | ------------------------- | ------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | 
| **[English](https://course.spacy.io/en)** | English | [`chapters/en`](chapters/en), [`exercises/en`](exercises/en) | [@ines](https://github.com/ines) | | **[German](https://course.spacy.io/de)** | German | [`chapters/de`](chapters/de), [`exercises/de`](exercises/de) | [@ines](https://github.com/ines), [@Jette16](https://github.com/Jette16) | | **[Spanish](https://course.spacy.io/es)** | Spanish | [`chapters/es`](chapters/es), [`exercises/es`](exercises/es) | [@mariacamilagl](https://github.com/mariacamilagl), [@damian-romero](https://github.com/damian-romero) | | **[French](https://course.spacy.io/fr)** | French | [`chapters/fr`](chapters/fr), [`exercises/fr`](exercises/fr) | [@datakime](https://github.com/datakime) | | **[Japanese](https://course.spacy.io/ja)** | Japanese | [`chapters/ja`](chapters/ja), [`exercises/ja`](exercises/ja) | [@tamuhey](https://github.com/tamuhey), [@hiroshi-matsuda-rit](https://github.com/hiroshi-matsuda-rit), [@icoxfog417](https://github.com/icoxfog417), [@akirakubo](https://github.com/akirakubo), [@forest1988](https://github.com/forest1988), [@ao9mame](https://github.com/ao9mame), [@matsurih](https://github.com/matsurih), [@HiromuHota](https://github.com/HiromuHota), [@mei28](https://github.com/mei28), [@polm](https://github.com/polm) | | **[Chinese](https://course.spacy.io/zh)** | Chinese | [`chapters/zh`](chapters/zh), [`exercises/zh`](exercises/zh) | [@crownpku](https://github.com/crownpku) | | **[Portuguese](https://course.spacy.io/pt)** | English | [`chapters/pt`](chapters/pt), [`exercises/pt`](exercises/pt) | [@Cristianasp](https://github.com/Cristianasp) | If you spot a mistake, I always appreciate [pull requests](https://github.com/explosion/spacy-course/pulls)! **1.** This is the language used for the text examples and resources used in the exercises. For example, the German version of the course also uses German text examples and models. 
It's not always possible to translate all code examples, so some translations may still use and analyze English text as part of the course. ### Related resources - 📚 **Prefer notebooks?** Check out [the Jupyter notebook version](https://github.com/cristianasp/spacy) of this course, put together by [@cristianasp](https://github.com/cristianasp). ## 💁 FAQ #### Is this related to the spaCy course on DataCamp? I originally developed the content for DataCamp, but I wanted to make a free version to make it available to more people, and so you don't have to sign up for their service. As a weekend project, I ended up putting together my own little app to present the exercises and content in a fun and interactive way. #### Can I use this to build my own course? Probably, yes! If you've been looking for a DIY way to publish your materials, I hope that my little framework can be useful. Because so many people expressed interest in this, I put together some starter repos that you can fork and adapt: - 🐍 Python: [`ines/course-starter-python`](https://github.com/ines/course-starter-python) - 🇷 R: [`ines/course-starter-r`](https://github.com/ines/course-starter-r) #### Why the different licenses? The source of the app, UI components and Gatsby framework for building interactive courses is licensed as MIT, like pretty much all of my open-source software. The course materials themselves (slides and chapters), are licensed under CC BY-NC. This means that you can use them freely – you just can't make money off them. #### I want to help translate this course into my language. How can I contribute? First, thanks so much, this is really cool and valuable to the community 🙌 I've tried to set up the course structure so it's easy to add different languages: language-specific files are organized into directories in [`exercises`](exercises) and [`chapters`](chapters), and other language specific texts are available in [`locale.json`](locale.json). 
If you want to contribute, there are two different ways to get involved: 1. Start a community translation project. This is the easiest, no-strings-attached way. You can fork the repo, copy-paste the English version, change the [language code](https://www.loc.gov/standards/iso639-2/php/code_list.php), start translating and invite others to contribute (if you like). If you're looking for contributors, feel free to open an issue here or tag [@spacy_io](https://twitter.com/spacy_io) on Twitter so we can help get the word out. We're also happy to answer your questions on the issue tracker. 2. Make us an offer. We're open to commissioning translations for different languages, so if you're interested, email us at [[email protected]](mailto:[email protected]) and include your offer, estimated time schedule and a bit about you and your background (and any technical writing or translation work you've done in the past, if available). It doesn't matter where you're based, but you should be able to issue invoices as a freelancer or similar, depending on your country. #### I want to help create an audio/video tutorial for an existing translation. How can I get involved? Again, thanks, this is super cool! While the [English](https://www.youtube.com/watch?v=THduWAnG97k) and [German](https://www.youtube.com/watch?v=K1elwpgDdls) videos also include a video recording, it's not a requirement and we'd be happy to just provide an audio track alongside the slides. We'd take care of the postprocessing and video editing, so all we need is the audio recording. If you feel comfortable recording yourself reading out the slide notes in your language, email us at [[email protected]](mailto:[email protected]) and make us an offer and include a bit about you and similar work you've done in the past, if available. 
## 🎛 Usage & API ### Running the app To start the local development server, install [Gatsby](https://gatsbyjs.org) and then all other dependencies, then use `npm run dev` to start the development server. Make sure you have at least Node 10.15 installed. ```bash npm install -g gatsby-cli # Install Gatsby globally npm install # Install dependencies npm run dev # Run the development server ``` If running with docker just run `make build` and then `make gatsby-dev` ### How it works When building the site, Gatsby will look for `.py` files and make their contents available to query via GraphQL. This lets us use the raw code within the app. Under the hood, the app uses [Binder](https://mybinder.org) to serve up an image with the package dependencies, including the spaCy models. By calling into [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/), we can then execute code using the active kernel. This lets you edit the code in the browser and see the live results. Also see my [`juniper`](https://github.com/ines/juniper) repo for more details on the implementation. To validate the code when the user hits "Submit", I'm currently using a slightly hacky trick. Since the Python code is sent back to the kernel as a string, we can manipulate it and add tests – for example, exercise `exc_01_02_01.py` will be validated using `test_01_02_01.py` (if available). The user code and test are combined using a string template. At the moment, the `testTemplate` in the `meta.json` looks like this: ``` from wasabi import msg __msg__ = msg __solution__ = """${solution}""" ${solution} ${test} try: test() except AssertionError as e: __msg__.fail(e) ``` If present, `${solution}` will be replaced with the string value of the submitted user code. In this case, we're inserting it twice: once as a string so we can check whether the submission includes something, and once as the code, so we can actually run it and check the objects it creates. 
`${test}` is replaced by the contents of the test file. I'm also making [`wasabi`](https://github.com/ines/wasabi)'s printer available as `__msg__`, so we can easily print pretty messages in the tests. Finally, the `try`/`except` block checks if the test function raises an `AssertionError` and if so, displays the error message. This also hides the full error traceback (which can easily leak the correct answers).

A test file could then look like this:

```python
def test():
    assert "spacy.load" in __solution__, "Are you calling spacy.load?"
    assert nlp.meta["lang"] == "en", "Are you loading the correct model?"
    assert nlp.meta["name"] == "core_web_sm", "Are you loading the correct model?"
    assert "nlp(text)" in __solution__, "Are you processing the text correctly?"
    assert "print(doc.text)" in __solution__, "Are you printing the Doc's text?"

    __msg__.good(
        "Well done! Now that you've practiced loading models, let's look at "
        "some of their predictions."
    )
```

With this approach, it's not _always_ possible to validate the input perfectly – there are too many options and we want to avoid false positives.

#### Running automated tests

The automated tests make sure that the provided solution code is compatible with the test file that's used to validate submissions. The test suite is powered by the [`pytest`](https://docs.pytest.org/en/latest/) framework and runnable test files are generated automatically in a directory `__tests__` before the test session starts. See the [`conftest.py`](conftest.py) for implementation details.
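To make the template mechanism described above more concrete, here is a minimal Python sketch of how a solution and a test could be merged into one script and executed. It is an illustration only, not the app's actual code: Python's `string.Template` stands in for the app's own substitution, and `RecordingMsg` and `run_submission` are hypothetical names introduced here so the example runs without `wasabi` installed.

```python
from string import Template

# Same shape as the `testTemplate` shown earlier, minus the wasabi import.
# Assumption: the printer is injected by the caller instead of imported.
TEST_TEMPLATE = Template('''__solution__ = """${solution}"""
${solution}
${test}
try:
    test()
except AssertionError as e:
    __msg__.fail(e)
''')


class RecordingMsg:
    """Hypothetical stand-in for wasabi's printer that records messages."""

    def __init__(self):
        self.messages = []

    def fail(self, text):
        self.messages.append(("fail", str(text)))

    def good(self, text):
        self.messages.append(("good", str(text)))


def run_submission(solution: str, test: str) -> list:
    """Merge user code and test code, execute them, return recorded messages."""
    script = TEST_TEMPLATE.substitute(solution=solution, test=test)
    msg = RecordingMsg()
    # The merged script runs in its own namespace with `__msg__` available.
    exec(script, {"__msg__": msg})
    return msg.messages


TEST = '''
def test():
    assert "total = " in __solution__, "Are you assigning to total?"
    assert total == 3, "Is total the sum of 1 and 2?"
    __msg__.good("Well done!")
'''

print(run_submission("total = 1 + 2", TEST))
# [('good', 'Well done!')]
print(run_submission("total = 1 + 1", TEST))
# [('fail', 'Is total the sum of 1 and 2?')]
```

Note that the sketch shares a limitation with any plain string template: a submission containing `"""` would break the `__solution__` string literal, so real code would need to escape the user input first.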
```bash
# Install requirements
pip install -r binder/requirements.txt
# Run the tests (will generate the files automatically)
python -m pytest __tests__
```

If running with Docker, just run `make build` and then `make pytest`.

### Directory Structure

```yaml
├── binder
|   └── requirements.txt  # Python dependency requirements for Binder
├── chapters              # chapters, grouped by language
|   ├── en                # English chapters, one Markdown file per chapter
|   |   └── slides        # English slides, one Markdown file per presentation
|   └── ...               # other languages
├── exercises             # code files, tests and assets for exercises
|   ├── en                # English exercises, solutions, tests and data
|   └── ...               # other languages
├── public                # compiled site
├── src                   # Gatsby/React source, independent from content
├── static                # static assets like images, available in slides/chapters
├── locale.json           # translations of meta and UI text
├── meta.json             # course metadata
└── theme.sass            # UI theme colors and settings
```

### Setting up Binder

The [`requirements.txt`](binder/requirements.txt) in the repository defines the packages that are installed when building it with Binder. For this course, I'm using the source repo as the Binder repo, as it allows me to keep everything in one place. It also lets the exercises reference and load other files (e.g. JSON), which will be copied over into the Python environment. I build the binder from a branch `binder`, though, which I only update if Binder-relevant files change. Otherwise, every update to `master` would trigger an image rebuild.

You can specify the binder settings like repo, branch and kernel type in the `"juniper"` section of the `meta.json`. I'd recommend running the very first build via the interface on the [Binder website](https://mybinder.org), as this gives you a detailed build log and feedback on whether everything worked as expected. Enter your repository URL, click "launch" and wait for it to install the dependencies and build the image.
![Binder](https://user-images.githubusercontent.com/13643239/39412757-a518d416-4c21-11e8-9dad-8b4cc14737bc.png)

### File formats

#### Chapters

Chapters are placed in [`/chapters`](/chapters) and are Markdown files consisting of `<exercise>` components. They'll be turned into pages, e.g. `/chapter1`. In their frontmatter block at the top of the file, they need to specify `type: chapter`, as well as the following meta:

```yaml
---
title: The chapter title
description: The chapter description
prev: /chapter1 # exact path to previous chapter or null to not show a link
next: /chapter3 # exact path to next chapter or null to not show a link
id: 2 # unique identifier for chapter
type: chapter # important: this creates a standalone page from the chapter
---
```

#### Slides

Slides are placed in [`/slides`](/slides) and are Markdown files consisting of slide content, separated by `---`. They need to specify the following frontmatter block at the top of the file:

```yaml
---
type: slides
---
```

The **first and last slide** use a special layout and will display the headline in the center of the slide. **Speaker notes** (in this case, the script) can be added at the end of a slide, prefixed by `Notes:`. They'll then be shown on the right next to the slides. Here's an example slides file:

```markdown
---
type: slides
---

# Processing pipelines

Notes: This is a slide deck about processing pipelines.

---

# Next slide

- Some bullet points here
- And another bullet point

<img src="/image.jpg" alt="An image located in /static" />
```

### Custom Elements

When using custom elements, make sure to place a newline between the opening/closing tags and the children. Otherwise, Markdown content may not render correctly.

#### `<exercise>`

Container of a single exercise.

| Argument | Type | Description |
| ------------ | --------------- | -------------------------------------------------------------- |
| `id` | number / string | Unique exercise ID within chapter.
| | `title` | string | Exercise title. | | `type` | string | Optional type. `"slides"` makes container wider and adds icon. | | **children** | - | The contents of the exercise. | ```markdown <exercise id="1" title="Introduction to spaCy"> Content goes here... </exercise> ``` #### `<codeblock>` | Argument | Type | Description | | ------------ | --------------- | -------------------------------------------------------------------------------------------- | | `id` | number / string | Unique identifier of the code exercise. | | `source` | string | Name of the source file (without file extension). Defaults to `exc_${id}` if not set. | | `solution` | string | Name of the solution file (without file extension). Defaults to `solution_${id}` if not set. | | `test` | string | Name of the test file (without file extension). Defaults to `test_${id}` if not set. | | **children** | string | Optional hints displayed when the user clicks "Show hints". | ```markdown <codeblock id="02_03"> This is a hint! </codeblock> ``` #### `<slides>` Container to display slides interactively using Reveal.js and a Markdown file. | Argument | Type | Description | | -------- | ------ | --------------------------------------------- | | `source` | string | Name of slides file (without file extension). | ```markdown <slides source="chapter1_01_introduction-to-spacy"> </slides> ``` #### `<choice>` Container for multiple-choice question. | Argument | Type | Description | | ------------ | --------------- | -------------------------------------------------------------------------------------------- | | `id` | string / number | Optional unique ID. Can be used if more than one choice question is present in one exercise. | | **children** | nodes | Only `<opt>` components for the options. | ```markdown <choice> <opt text="Option one">You have selected option one! This is not good.</opt> <opt text="Option two" correct="true">Yay! </opt> </choice> ``` #### `<opt>` A multiple-choice option. 
| Argument | Type | Description | | ------------ | ------ | ---------------------------------------------------------------------------------------------- | | `text` | string | The option text to be displayed. Supports inline HTML. | | `correct` | string | `"true"` if the option is the correct answer. | | **children** | string | The text to be displayed if the option is selected (explaining why it's correct or incorrect). |

ML Frameworks LMS
2.4K Github Stars
spacy-models
Open Source

spacy-models

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a> # spaCy models This repository contains [releases](https://github.com/explosion/spacy-models/releases) of models for the [spaCy](https://github.com/explosion/spaCy) NLP library. For more info on how to download, install and use the models, see the [models documentation](https://spacy.io/usage/models). > **⚠️ Important note:** Because the models can be very large and consist mostly > of binary data, we can't simply provide them as files in a GitHub repository. > Instead, we've opted for adding them to > [releases](https://github.com/explosion/spacy-models/releases) as `.whl` and > `.tar.gz` files. This allows us to still maintain a public release history. ## Quickstart To install a specific model, run the following command with the model name (for example `en_core_web_sm`): ```bash python -m spacy download [model] ``` - [spaCy v3.x models directory](https://spacy.io/models) - [spaCy v3.x model comparison](https://spacy.io/usage/facts-figures#spacy-models) - [spaCy v2.x models directory](https://v2.spacy.io/models) - [spaCy v2.x model comparison](https://v2.spacy.io/usage/facts-figures#spacy-models) - [Individual release notes](https://github.com/explosion/spacy-models/releases) For the spaCy v1.x models, [see here](#spacy-v1x-releases). ## Model naming conventions In general, spaCy expects all model packages to follow the naming convention of `[lang]_[name]`. For our provided pipelines, we divide the name into three components: - **type**: Model capabilities: - `core`: a general-purpose model with tagging, parsing, lemmatization and named entity recognition - `dep`: only tagging, parsing and lemmatization - `ent`: only named entity recognition - `sent`: only sentence segmentation - **genre**: Type of text the model is trained on (e.g. 
`web` for web text, `news` for news text) - **size**: Model size indicator: - `sm`: no word vectors - `md`: reduced word vector table with 20k unique vectors for ~500k words - `lg`: large word vector table with ~500k entries For example, `en_core_web_md` is a medium-sized English model trained on written web text (blogs, news, comments), that includes a tagger, a dependency parser, a lemmatizer, a named entity recognizer and a word vector table with 20k unique vectors. ### Model versioning Additionally, the model versioning reflects both the compatibility with spaCy, as well as the model version. A model version `a.b.c` translates to: - `a`: **spaCy major version**. For example, `2` for spaCy v2.x. - `b`: **spaCy minor version**. For example, `3` for spaCy v2.3.x. - `c`: **Model version.** Different model config: e.g. from being trained on different data, with different parameters, for different numbers of iterations, with different vectors, etc. For a detailed compatibility overview, see the [`compatibility.json`](compatibility.json). This is also the source of spaCy's internal compatibility check, performed when you run the `download` command. ### Support for older versions If you're using an older version (v1.6.0 or below), you can still download and install the old models from within spaCy using `python -m spacy.en.download all` or `python -m spacy.de.download all`. The `.tar.gz` archives are also [attached to the v1.6.0 release](https://github.com/explosion/spaCy/tree/v1.6.0). To download and install the models manually, unpack the archive, drop the contained directory into `spacy/data` and load the model via `spacy.load('en')` or `spacy.load('de')`. ## Downloading models To increase transparency and make it easier to use spaCy with your own models, all data is now available as direct downloads, organised in [individual releases](https://github.com/explosion/spacy-models/releases). spaCy 1.7 also supports installing and loading models as **Python packages**. 
You can now choose how and where you want to keep the data files, and set up "shortcut links" to load models by name from within spaCy. For more info on this, see the new [models documentation](https://spacy.io/usage/models). ```bash # download best-matching version of specific model for your spaCy installation python -m spacy download en_core_web_sm # pip install .whl or .tar.gz archive from path or URL pip install /Users/you/en_core_web_sm-3.0.0.tar.gz pip install /Users/you/en_core_web_sm-3.0.0-py3-none-any.whl pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl ``` ## Loading and using models To load a model, use `spacy.load()` with the model name, a shortcut link or a path to the model data directory. ```python import spacy nlp = spacy.load("en_core_web_sm") doc = nlp(u"This is a sentence.") ``` You can also `import` a model directly via its full name and then call its `load()` method with no arguments. This should also work for older models in previous versions of spaCy. ```python import spacy import en_core_web_sm nlp = en_core_web_sm.load() doc = nlp(u"This is a sentence.") ``` ## Manual download and installation In some cases, you might prefer downloading the data manually, for example to place it into a custom directory. You can download the model via your browser from the [latest releases](https://github.com/explosion/spacy-models/releases), or configure your own download script using the URL of the archive file. The archive consists of a model directory that contains another directory with the model data. 
```yaml └── en_core_web_md-3.0.0.tar.gz # downloaded archive ├── setup.py # setup file for pip installation ├── meta.json # copy of pipeline meta └── en_core_web_md # 📦 pipeline package ├── __init__.py # init for pip installation └── en_core_web_md-3.0.0 # pipeline data ├── config.cfg # pipeline config ├── meta.json # pipeline meta └── ... # directories with component data ``` **📖 For more info and examples, check out the [models documentation](https://spacy.io/usage/models).** ## spaCy v1.x Releases | Date | Model | Version | Dep | Ent | Vec | Size | License | | | | ------------ | --------------------- | ------- | :-: | :-: | :-: | ------: | -------- | ------------------------------------- | ------------------------------------ | | `2017-06-06` | `es_core_web_md` | 1.0.0 | X | X | X | 377 MB | CC BY-SA | [![][i]][i-es_core_web_md-1.0.0] | [![][dl]][es_core_web_md-1.0.0] | | `2017-04-26` | `fr_depvec_web_lg` | 1.0.0 | X | | X | 1.33 GB | CC BY-NC | [![][i]][i-fr_depvec_web_lg-1.0.0] | [![][dl]][fr_depvec_web_lg-1.0.0] | | `2017-03-21` | `en_core_web_md` | 1.2.1 | X | X | X | 1 GB | CC BY-SA | [![][i]][i-en_core_web_md-1.2.1] | [![][dl]][en_core_web_md-1.2.1] | | `2017-03-21` | `en_depent_web_md` | 1.2.1 | X | X | | 328 MB | CC BY-SA | [![][i]][i-en_depent_web_md-1.2.1] | [![][dl]][en_depent_web_md-1.2.1] | | `2017-03-17` | `en_core_web_sm` | 1.2.0 | X | X | X | 50 MB | CC BY-SA | [![][i]][i-en_core_web_sm-1.2.0] | [![][dl]][en_core_web_sm-1.2.0] | | `2017-03-17` | `en_core_web_md` | 1.2.0 | X | X | X | 1 GB | CC BY-SA | [![][i]][i-en_core_web_md-1.2.0] | [![][dl]][en_core_web_md-1.2.0] | | `2017-03-17` | `en_depent_web_md` | 1.2.0 | X | X | | 328 MB | CC BY-SA | [![][i]][i-en_depent_web_md-1.2.0] | [![][dl]][en_depent_web_md-1.2.0] | | `2016-05-10` | `de_core_news_md` | 1.0.0 | X | X | X | 645 MB | CC BY-SA | [![][i]][i-de_core_news_md-1.0.0] | [![][dl]][de_core_news_md-1.0.0] | | `2016-03-08` | `en_vectors_glove_md` | 1.0.0 | | | X | 727 MB | CC BY-SA | 
[![][i]][i-en_vectors_glove_md-1.0.0] | [![][dl]][en_vectors_glove_md-1.0.0] | [es_core_web_md-1.0.0]: https://github.com/explosion/spacy-models/releases/download/es_core_web_md-1.0.0/es_core_web_md-1.0.0.tar.gz [fr_depvec_web_lg-1.0.0]: https://github.com/explosion/spacy-models/releases/download/fr_depvec_web_lg-1.0.0/fr_depvec_web_lg-1.0.0.tar.gz [en_core_web_md-1.2.1]: https://github.com/explosion/spacy-models/releases/download/en_core_web_md-1.2.1/en_core_web_md-1.2.1.tar.gz [en_depent_web_md-1.2.1]: https://github.com/explosion/spacy-models/releases/download/en_depent_web_md-1.2.1/en_depent_web_md-1.2.1.tar.gz [en_core_web_sm-1.2.0]: https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz [en_core_web_md-1.2.0]: https://github.com/explosion/spacy-models/releases/download/en_core_web_md-1.2.0/en_core_web_md-1.2.0.tar.gz [en_depent_web_md-1.2.0]: https://github.com/explosion/spacy-models/releases/download/en_depent_web_md-1.2.0/en_depent_web_md-1.2.0.tar.gz [de_core_news_md-1.0.0]: https://github.com/explosion/spacy-models/releases/download/de_core_news_md-1.0.0/de_core_news_md-1.0.0.tar.gz [en_vectors_glove_md-1.0.0]: https://github.com/explosion/spacy-models/releases/download/en_vectors_glove_md-1.0.0/en_vectors_glove_md-1.0.0.tar.gz [i-es_core_web_md-1.0.0]: https://github.com/explosion/spacy-models/releases/es_core_web_md-1.0.0 [i-fr_depvec_web_lg-1.0.0]: https://github.com/explosion/spacy-models/releases/fr_depvec_web_lg-1.0.0 [i-en_core_web_md-1.2.1]: https://github.com/explosion/spacy-models/releases/en_core_web_md-1.2.1 [i-en_depent_web_md-1.2.1]: https://github.com/explosion/spacy-models/releases/en_depent_web_md-1.2.1 [i-en_core_web_sm-1.2.0]: https://github.com/explosion/spacy-models/releases/en_core_web_sm-1.2.0 [i-en_core_web_md-1.2.0]: https://github.com/explosion/spacy-models/releases/en_core_web_md-1.2.0 [i-en_depent_web_md-1.2.0]: 
https://github.com/explosion/spacy-models/releases/en_depent_web_md-1.2.0 [i-de_core_news_md-1.0.0]: https://github.com/explosion/spacy-models/releases/de_core_news_md-1.0.0 [i-en_vectors_glove_md-1.0.0]: https://github.com/explosion/spacy-models/releases/en_vectors_glove_md-1.0.0 [dl]: http://i.imgur.com/gQvPgr0.png [i]: http://i.imgur.com/OpLOcKn.png ### Model naming conventions for v1.x models - **type**: Model capabilities (e.g. `core` for general-purpose model with vocabulary, syntax, entities and word vectors, or `depent` for only vocab, syntax and entities) - **genre**: Type of text the model is trained on (e.g. `web` for web text, `news` for news text) - **size**: Model size indicator (`sm`, `md` or `lg`) For example, `en_depent_web_md` is a medium-sized English model trained on written web text (blogs, news, comments), that includes vocabulary, syntax and entities. ## Issues and bug reports To report an issue with a model, please open an issue on the [spaCy issue tracker](https://github.com/explosion/spaCy). Please note that no model is perfect. Because models are statistical, their expected behaviour **will always include some errors**. However, particular errors can indicate deeper issues with the training feature extraction or optimisation code. If you come across patterns in the model's performance that seem suspicious, please do file a report.
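The `[lang]_[type]_[genre]_[size]` naming scheme and the `a.b.c` versioning described earlier in this README can be unpacked mechanically. The following is a minimal sketch and not part of spaCy: `parse_model_name` and `parse_model_version` are hypothetical helper names, and the sketch assumes a package name with exactly four underscore-separated parts, which holds for the provided pipelines named in this document but not necessarily for custom packages.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str   # language code, e.g. "en"
    type: str   # capabilities, e.g. "core", "dep", "ent", "sent"
    genre: str  # text genre, e.g. "web" or "news"
    size: str   # size indicator, e.g. "sm", "md", "lg"


class ModelVersion(NamedTuple):
    spacy_major: int  # `a`: spaCy major version
    spacy_minor: int  # `b`: spaCy minor version
    model: int        # `c`: model version


def parse_model_name(name: str) -> ModelName:
    """Split a provided pipeline name into its four components."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"Expected [lang]_[type]_[genre]_[size], got {name!r}")
    return ModelName(*parts)


def parse_model_version(version: str) -> ModelVersion:
    """Split an `a.b.c` model version into its three numeric components."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"Expected a.b.c, got {version!r}")
    return ModelVersion(*(int(part) for part in parts))


print(parse_model_name("en_core_web_md"))
# ModelName(lang='en', type='core', genre='web', size='md')
print(parse_model_version("2.3.1"))
# ModelVersion(spacy_major=2, spacy_minor=3, model=1)
```

Read this way, version `2.3.1` belongs to a model built for spaCy v2.3.x, with `1` identifying the model's own revision. For real compatibility checks, the `compatibility.json` mentioned above remains the source of truth.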

AI & Machine Learning Developer Tools ML Frameworks
1.9K Github Stars
sense2vec
Open Source

sense2vec

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>

# sense2vec: Contextually-keyed word vectors

sense2vec ([Trask et al.](https://arxiv.org/abs/1511.06388), 2015) is a nice twist on [word2vec](https://en.wikipedia.org/wiki/Word2vec) that lets you learn more interesting and detailed word vectors. This library is a simple Python implementation for loading, querying and training sense2vec models. For more details, check out [our blog post](https://explosion.ai/blog/sense2vec-reloaded). To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the [interactive demo](https://demos.explosion.ai/sense2vec).

🦆 **Version 2.0 (for spaCy v3) out now!** [Read the release notes here.](https://github.com/explosion/sense2vec/releases/)

[![tests](https://github.com/explosion/sense2vec/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/sense2vec/actions/workflows/tests.yml)
[![Current Release Version](https://img.shields.io/github/v/release/explosion/sense2vec.svg?style=flat-square&logo=github)](https://github.com/explosion/sense2vec/releases)
[![pypi Version](https://img.shields.io/pypi/v/sense2vec.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/sense2vec/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black)

## ✨ Features

![](https://user-images.githubusercontent.com/13643239/69330759-d3981600-0c53-11ea-8f64-e5c075f7ea10.jpg)

- Query **vectors for multi-word phrases** based on part-of-speech tags and entity labels.
- spaCy **pipeline component** and **extension attributes**.
- Fully **serializable** so you can easily ship your sense2vec vectors with your spaCy model packages.
- Optional **caching of nearest neighbors** for super fast "most similar" queries.
- **Train your own vectors** using a pretrained spaCy model, raw text and [GloVe](https://github.com/stanfordnlp/GloVe) or Word2Vec via [fastText](https://github.com/facebookresearch/fastText) ([details](#-training-your-own-sense2vec-vectors)). - [Prodigy](https://prodi.gy) **annotation recipes** for evaluating models, creating lists of similar multi-word phrases and converting them to match patterns, e.g. for rule-based NER or to bootstrap NER annotation ([details & examples](#-prodigy-recipes)). ## 🚀 Quickstart ### Standalone usage ```python from sense2vec import Sense2Vec s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md") query = "natural_language_processing|NOUN" assert query in s2v vector = s2v[query] freq = s2v.get_freq(query) most_similar = s2v.most_similar(query, n=3) # [('machine_learning|NOUN', 0.8986967), # ('computer_vision|NOUN', 0.8636297), # ('deep_learning|NOUN', 0.8573361)] ``` ### Usage as a spaCy pipeline component > ⚠️ Note that this example describes usage with > [spaCy v3](https://spacy.io/usage/v3). For usage with spaCy v2, download > `sense2vec==1.0.3` and check out the > [`v1.x`](https://github.com/explosion/sense2vec/tree/v1.x) branch of this > repo. ```python import spacy nlp = spacy.load("en_core_web_sm") s2v = nlp.add_pipe("sense2vec") s2v.from_disk("/path/to/s2v_reddit_2015_md") doc = nlp("A sentence about natural language processing.") assert doc[3:6].text == "natural language processing" freq = doc[3:6]._.s2v_freq vector = doc[3:6]._.s2v_vec most_similar = doc[3:6]._.s2v_most_similar(3) # [(('machine learning', 'NOUN'), 0.8986967), # (('computer vision', 'NOUN'), 0.8636297), # (('deep learning', 'NOUN'), 0.8573361)] ``` ### Interactive demos <img width="34%" src="https://user-images.githubusercontent.com/13643239/68093565-1bb6ea80-fe97-11e9-8192-e293acc290fe.png" align="right" /> To try out our pretrained vectors trained on Reddit comments, check out the [interactive sense2vec demo](https://explosion.ai/demos/sense2vec). 
This repo also includes a [Streamlit](https://streamlit.io) demo script for exploring vectors and the most similar phrases. After installing `streamlit`, you can run the script with `streamlit run` and **one or more paths to pretrained vectors** as **positional arguments** on the command line. For example: ```bash pip install streamlit streamlit run https://raw.githubusercontent.com/explosion/sense2vec/master/scripts/streamlit_sense2vec.py /path/to/vectors ``` ### Pretrained vectors To use the vectors, download the archive(s) and pass the extracted directory to `Sense2Vec.from_disk` or `Sense2VecComponent.from_disk`. The vector files are **attached to the GitHub release**. Large files have been split into multi-part downloads. | Vectors | Size | Description | 📥 Download (zipped) | | -------------------- | -----: | ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `s2v_reddit_2019_lg` | 4 GB | Reddit comments 2019 (01-07) | [part 1](https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2019_lg.tar.gz.001), [part 2](https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2019_lg.tar.gz.002), [part 3](https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2019_lg.tar.gz.003) | | `s2v_reddit_2015_md` | 573 MB | Reddit comments 2015 | [part 1](https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz) | To merge the multi-part archives, you can run the following: ```bash cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz ``` ## ⏳ Installation & Setup sense2vec releases are available on pip: ```bash pip install sense2vec ``` To use pretrained vectors, download [one of the vector 
packages](#pretrained-vectors), unpack the `.tar.gz` archive and point `from_disk` to the extracted data directory: ```python from sense2vec import Sense2Vec s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md") ``` ## 👩‍💻 Usage ### Usage with spaCy v3 The easiest way to use the library and vectors is to plug it into your spaCy pipeline. The `sense2vec` package exposes a `Sense2VecComponent`, which can be initialised with the shared vocab and added to your spaCy pipeline as a [custom pipeline component](https://spacy.io/usage/processing-pipelines#custom-components). By default, components are added to the _end of the pipeline_, which is the recommended position for this component, since it needs access to the dependency parse and, if available, named entities. ```python import spacy from sense2vec import Sense2VecComponent nlp = spacy.load("en_core_web_sm") s2v = nlp.add_pipe("sense2vec") s2v.from_disk("/path/to/s2v_reddit_2015_md") ``` The component will add several [extension attributes and methods](https://spacy.io/usage/processing-pipelines#custom-components-attributes) to spaCy's `Token` and `Span` objects that let you retrieve vectors and frequencies, as well as most similar terms. 
```python
doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)
```

For entities, the entity labels are used as the "sense" (instead of the token's part-of-speech tag):

```python
doc = nlp("A sentence about Facebook and Google.")
for ent in doc.ents:
    assert ent._.in_s2v
    most_similar = ent._.s2v_most_similar(3)
```

#### Available attributes

The following extension attributes are exposed on the `Doc` object via the `._` property:

| Name | Attribute Type | Type | Description |
| ------------- | -------------- | ---- | ----------------------------------------------------------------------------------- |
| `s2v_phrases` | property | list | All sense2vec-compatible phrases in the given `Doc` (noun phrases, named entities). |

The following attributes are available via the `._` property of `Token` and `Span` objects – for example `token._.in_s2v`:

| Name | Attribute Type | Return Type | Description |
| ------------------ | -------------- | ------------------ | ---------------------------------------------------------------------------------- |
| `in_s2v` | property | bool | Whether a key exists in the vector map. |
| `s2v_key` | property | unicode | The sense2vec key of the given object, e.g. `"duck\|NOUN"`. |
| `s2v_vec` | property | `ndarray[float32]` | The vector of the given key. |
| `s2v_freq` | property | int | The frequency of the given key. |
| `s2v_other_senses` | property | list | Available other senses, e.g. `"duck\|VERB"` for `"duck\|NOUN"`. |
| `s2v_most_similar` | method | list | Get the `n` most similar terms. Returns a list of `((word, sense), score)` tuples. |
| `s2v_similarity` | method | float | Get the similarity to another `Token` or `Span`. |

> ⚠️ **A note on span attributes:** Under the hood, entities in `doc.ents` are `Span` objects.
This is why the pipeline component also adds attributes and > methods to spans and not just tokens. However, it's not recommended to use the > sense2vec attributes on arbitrary slices of the document, since the model > likely won't have a key for the respective text. `Span` objects also don't > have a part-of-speech tag, so if no entity label is present, the "sense" > defaults to the root's part-of-speech tag. #### Adding sense2vec to a trained pipeline If you're training and packaging a spaCy pipeline and want to include a sense2vec component in it, you can load in the data via the [`[initialize]` block](https://spacy.io/usage/training#config-lifecycle) of the training config: ```ini [initialize.components] [initialize.components.sense2vec] data_path = "/path/to/s2v_reddit_2015_md" ``` ### Standalone usage You can also use the underlying `Sense2Vec` class directly and load in the vectors using the `from_disk` method. See below for the available API methods. ```python from sense2vec import Sense2Vec s2v = Sense2Vec().from_disk("/path/to/reddit_vectors-1.1.0") most_similar = s2v.most_similar("natural_language_processing|NOUN", n=10) ``` > ⚠️ **Important note:** To look up entries in the vectors table, the keys need > to follow the scheme of `phrase_text|SENSE` (note the `_` instead of spaces > and the `|` before the tag or label) – for example, `machine_learning|NOUN`. > Also note that the underlying vector table is case-sensitive. ## 🎛 API ### <kbd>class</kbd> `Sense2Vec` The standalone `Sense2Vec` object that holds the vectors, strings and frequencies. #### <kbd>method</kbd> `Sense2Vec.__init__` Initialize the `Sense2Vec` object. | Argument | Type | Description | | -------------- | --------------------------- | ---------------------------------------------------------------------------------------------------------------------- | | `shape` | tuple | The vector shape. Defaults to `(1000, 128)`. | | `strings` | `spacy.strings.StringStore` | Optional string store. 
Will be created if it doesn't exist. | | `senses` | list | Optional list of all available senses. Used in methods that generate the best sense or other senses. | | `vectors_name` | unicode | Optional name to assign to the `Vectors` table, to prevent clashes. Defaults to `"sense2vec"`. | | `overrides` | dict | Optional custom functions to use, mapped to names registered via the registry, e.g. `{"make_key": "custom_make_key"}`. | | **RETURNS** | `Sense2Vec` | The newly constructed object. | ```python s2v = Sense2Vec(shape=(300, 128), senses=["VERB", "NOUN"]) ``` #### <kbd>method</kbd> `Sense2Vec.__len__` The number of rows in the vectors table. | Argument | Type | Description | | ----------- | ---- | ---------------------------------------- | | **RETURNS** | int | The number of rows in the vectors table. | ```python s2v = Sense2Vec(shape=(300, 128)) assert len(s2v) == 300 ``` #### <kbd>method</kbd> `Sense2Vec.__contains__` Check if a key is in the vectors table. | Argument | Type | Description | | ----------- | ------------- | -------------------------------- | | `key` | unicode / int | The key to look up. | | **RETURNS** | bool | Whether the key is in the table. | ```python s2v = Sense2Vec(shape=(10, 4)) s2v.add("avocado|NOUN", numpy.asarray([4, 2, 2, 2], dtype=numpy.float32)) assert "avocado|NOUN" in s2v assert "avocado|VERB" not in s2v ``` #### <kbd>method</kbd> `Sense2Vec.__getitem__` Retrieve a vector for a given key. Returns None if the key is not in the table. | Argument | Type | Description | | ----------- | --------------- | --------------------- | | `key` | unicode / int | The key to look up. | | **RETURNS** | `numpy.ndarray` | The vector or `None`. | ```python vec = s2v["avocado|NOUN"] ``` #### <kbd>method</kbd> `Sense2Vec.__setitem__` Set a vector for a given key. Will raise an error if the key doesn't exist. To add a new entry, use `Sense2Vec.add`. 
| Argument | Type | Description | | -------- | --------------- | ------------------ | | `key` | unicode / int | The key. | | `vector` | `numpy.ndarray` | The vector to set. | ```python vec = s2v["avocado|NOUN"] s2v["avocado|NOUN"] = vec ``` #### <kbd>method</kbd> `Sense2Vec.add` Add a new vector to the table. | Argument | Type | Description | | -------- | --------------- | ------------------------------------------------------------ | | `key` | unicode / int | The key to add. | | `vector` | `numpy.ndarray` | The vector to add. | | `freq` | int | Optional frequency count. Used to find best matching senses. | ```python vec = s2v["avocado|NOUN"] s2v.add("🥑|NOUN", vec, 1234) ``` #### <kbd>method</kbd> `Sense2Vec.get_freq` Get the frequency count for a given key. | Argument | Type | Description | | ----------- | ------------- | ------------------------------------------------- | | `key` | unicode / int | The key to look up. | | `default` | - | Default value to return if no frequency is found. | | **RETURNS** | int | The frequency count. | ```python vec = s2v["avocado|NOUN"] s2v.add("🥑|NOUN", vec, 1234) assert s2v.get_freq("🥑|NOUN") == 1234 ``` #### <kbd>method</kbd> `Sense2Vec.set_freq` Set a frequency count for a given key. | Argument | Type | Description | | -------- | ------------- | ----------------------------- | | `key` | unicode / int | The key to set the count for. | | `freq` | int | The frequency count. | ```python s2v.set_freq("avocado|NOUN", 104294) ``` #### <kbd>method</kbd> `Sense2Vec.__iter__`, `Sense2Vec.items` Iterate over the entries in the vectors table. | Argument | Type | Description | | ---------- | ----- | ----------------------------------------- | | **YIELDS** | tuple | String key and vector pairs in the table. | ```python for key, vec in s2v: print(key, vec) for key, vec in s2v.items(): print(key, vec) ``` #### <kbd>method</kbd> `Sense2Vec.keys` Iterate over the keys in the table. 
| Argument | Type | Description | | ---------- | ------- | ----------------------------- | | **YIELDS** | unicode | The string keys in the table. | ```python all_keys = list(s2v.keys()) ``` #### <kbd>method</kbd> `Sense2Vec.values` Iterate over the vectors in the table. | Argument | Type | Description | | ---------- | --------------- | ------------------------- | | **YIELDS** | `numpy.ndarray` | The vectors in the table. | ```python all_vecs = list(s2v.values()) ``` #### <kbd>property</kbd> `Sense2Vec.senses` The available senses in the table, e.g. `"NOUN"` or `"VERB"` (added at initialization). | Argument | Type | Description | | ----------- | ---- | --------------------- | | **RETURNS** | list | The available senses. | ```python s2v = Sense2Vec(senses=["VERB", "NOUN"]) assert "VERB" in s2v.senses ``` #### <kbd>property</kbd> `Sense2Vec.frequencies` The frequencies of the keys in the table, in descending order. | Argument | Type | Description | | ----------- | ---- | -------------------------------------------------- | | **RETURNS** | list | The `(key, freq)` tuples by frequency, descending. | ```python most_frequent = s2v.frequencies[:10] key, score = s2v.frequencies[0] ``` #### <kbd>method</kbd> `Sense2Vec.similarity` Make a semantic similarity estimate of two keys or two sets of keys. The default estimate is cosine similarity using an average of vectors. | Argument | Type | Description | | ----------- | ------------------------ | ----------------------------------- | | `keys_a` | unicode / int / iterable | The string or integer key(s). | | `keys_b` | unicode / int / iterable | The other string or integer key(s). | | **RETURNS** | float | The similarity score. 
| ```python keys_a = ["machine_learning|NOUN", "natural_language_processing|NOUN"] keys_b = ["computer_vision|NOUN", "object_detection|NOUN"] print(s2v.similarity(keys_a, keys_b)) assert s2v.similarity("machine_learning|NOUN", "machine_learning|NOUN") == 1.0 ``` #### <kbd>method</kbd> `Sense2Vec.most_similar` Get the most similar entries in the table. If more than one key is provided, the average of the vectors is used. To make this method faster, see the [script for precomputing a cache](scripts/06_precompute_cache.py) of the nearest neighbors. | Argument | Type | Description | | ------------ | ------------------------- | ------------------------------------------------------- | | `keys` | unicode / int / iterable  | The string or integer key(s) to compare to. | | `n` | int | The number of similar keys to return. Defaults to `10`. | | `batch_size` | int | The batch size to use. Defaults to `16`. | | **RETURNS** | list | The `(key, score)` tuples of the most similar vectors. | ```python most_similar = s2v.most_similar("natural_language_processing|NOUN", n=3) # [('machine_learning|NOUN', 0.8986967), # ('computer_vision|NOUN', 0.8636297), # ('deep_learning|NOUN', 0.8573361)] ``` #### <kbd>method</kbd> `Sense2Vec.get_other_senses` Find other entries for the same word with a different sense, e.g. `"duck|VERB"` for `"duck|NOUN"`. | Argument | Type | Description | | ------------- | ------------- | ----------------------------------------------------------------- | | `key` | unicode / int | The key to check. | | `ignore_case` | bool | Check for uppercase, lowercase and titlecase. Defaults to `True`. | | **RETURNS** | list | The string keys of other entries with different senses. | ```python other_senses = s2v.get_other_senses("duck|NOUN") # ['duck|VERB', 'Duck|ORG', 'Duck|VERB', 'Duck|PERSON', 'Duck|ADJ'] ``` #### <kbd>method</kbd> `Sense2Vec.get_best_sense` Find the best-matching sense for a given word based on the available senses and frequency counts. 
Returns `None` if no match is found. | Argument | Type | Description | | ------------- | ------- | ------------------------------------------------------------------------------------------------------- | | `word` | unicode | The word to check. | | `senses` | list | Optional list of senses to limit the search to. If not set / empty, all senses in the vectors are used. | | `ignore_case` | bool | Check for uppercase, lowercase and titlecase. Defaults to `True`. | | **RETURNS** | unicode | The best-matching key or None. | ```python assert s2v.get_best_sense("duck") == "duck|NOUN" assert s2v.get_best_sense("duck", ["VERB", "ADJ"]) == "duck|VERB" ``` #### <kbd>method</kbd> `Sense2Vec.to_bytes` Serialize a `Sense2Vec` object to a bytestring. | Argument | Type | Description | | ----------- | ----- | ----------------------------------------- | | `exclude` | list | Names of serialization fields to exclude. | | **RETURNS** | bytes | The serialized `Sense2Vec` object. | ```python s2v_bytes = s2v.to_bytes() ``` #### <kbd>method</kbd> `Sense2Vec.from_bytes` Load a `Sense2Vec` object from a bytestring. | Argument | Type | Description | | ------------ | ----------- | ----------------------------------------- | | `bytes_data` | bytes | The data to load. | | `exclude` | list | Names of serialization fields to exclude. | | **RETURNS** | `Sense2Vec` | The loaded object. | ```python s2v_bytes = s2v.to_bytes() new_s2v = Sense2Vec().from_bytes(s2v_bytes) ``` #### <kbd>method</kbd> `Sense2Vec.to_disk` Serialize a `Sense2Vec` object to a directory. | Argument | Type | Description | | --------- | ---------------- | ----------------------------------------- | | `path` | unicode / `Path` | The path. | | `exclude` | list | Names of serialization fields to exclude. | ```python s2v.to_disk("/path/to/sense2vec") ``` #### <kbd>method</kbd> `Sense2Vec.from_disk` Load a `Sense2Vec` object from a directory. 
| Argument | Type | Description | | ----------- | ---------------- | ----------------------------------------- | | `path` | unicode / `Path` | The path to load from. | | `exclude` | list | Names of serialization fields to exclude. | | **RETURNS** | `Sense2Vec` | The loaded object. | ```python s2v.to_disk("/path/to/sense2vec") new_s2v = Sense2Vec().from_disk("/path/to/sense2vec") ``` --- ### <kbd>class</kbd> `Sense2VecComponent` The pipeline component to add sense2vec to spaCy pipelines. #### <kbd>method</kbd> `Sense2VecComponent.__init__` Initialize the pipeline component. | Argument | Type | Description | | --------------- | -------------------- | ---------------------------------------------------------------------------------------------------------------------- | | `vocab` | `Vocab` | The shared `Vocab`. Mostly used for the shared `StringStore`. | | `shape` | tuple | The vector shape. | | `merge_phrases` | bool | Whether to merge sense2vec phrases into one token. Defaults to `False`. | | `lemmatize` | bool | Always look up lemmas if available in the vectors, otherwise default to original word. Defaults to `False`. | | `overrides` | dict | Optional custom functions to use, mapped to names registered via the registry, e.g. `{"make_key": "custom_make_key"}`. | | **RETURNS** | `Sense2VecComponent` | The newly constructed object. | ```python s2v = Sense2VecComponent(nlp.vocab) ``` #### <kbd>classmethod</kbd> `Sense2VecComponent.from_nlp` Initialize the component from an nlp object. Mostly used as the component factory for the entry point (see setup.cfg) and to auto-register via the `@spacy.component` decorator. | Argument | Type | Description | | ----------- | -------------------- | ----------------------------- | | `nlp` | `Language` | The `nlp` object. | | `**cfg` | - | Optional config parameters. | | **RETURNS** | `Sense2VecComponent` | The newly constructed object. 
| ```python s2v = Sense2VecComponent.from_nlp(nlp) ``` #### <kbd>method</kbd> `Sense2VecComponent.__call__` Process a `Doc` object with the component. Typically only called as part of the spaCy pipeline and not directly. | Argument | Type | Description | | ----------- | ----- | ------------------------ | | `doc` | `Doc` | The document to process. | | **RETURNS** | `Doc` | The processed document. | #### <kbd>method</kbd> `Sense2VecComponent.init_component` Register the component-specific extension attributes here and only if the component is added to the pipeline and used – otherwise, tokens will still get the attributes even if the component is only created and not added. #### <kbd>method</kbd> `Sense2VecComponent.to_bytes` Serialize the component to a bytestring. Also called when the component is added to the pipeline and you run `nlp.to_bytes`. | Argument | Type | Description | | ----------- | ----- | ------------------------- | | **RETURNS** | bytes | The serialized component. | #### <kbd>method</kbd> `Sense2VecComponent.from_bytes` Load a component from a bytestring. Also called when you run `nlp.from_bytes`. | Argument | Type | Description | | ------------ | -------------------- | ------------------ | | `bytes_data` | bytes | The data to load. | | **RETURNS** | `Sense2VecComponent` | The loaded object. | #### <kbd>method</kbd> `Sense2VecComponent.to_disk` Serialize the component to a directory. Also called when the component is added to the pipeline and you run `nlp.to_disk`. | Argument | Type | Description | | -------- | ---------------- | ----------- | | `path` | unicode / `Path` | The path. | #### <kbd>method</kbd> `Sense2VecComponent.from_disk` Load a component from a directory. Also called when you run `nlp.from_disk`. | Argument | Type | Description | | ----------- | -------------------- | ---------------------- | | `path` | unicode / `Path` | The path to load from. | | **RETURNS** | `Sense2VecComponent` | The loaded object. 
| --- ### <kbd>class</kbd> `registry` Function registry (powered by [`catalogue`](https://github.com/explosion/catalogue)) to easily customize the functions used to generate keys and phrases. Allows you to decorate and name custom functions, swap them out and serialize the custom names when you save out the model. The following registry options are available: | Name | Description | | ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `registry.make_key` | Given a `word` and `sense`, return a string of the key, e.g. `"word\|sense".` | | `registry.split_key` | Given a string key, return a `(word, sense)` tuple. | | `registry.make_spacy_key` | Given a spaCy object (`Token` or `Span`) and a boolean `prefer_ents` keyword argument (whether to prefer the entity label for single tokens), return a `(word, sense)` tuple. Used in extension attributes to generate a key for tokens and spans. | | `registry.get_phrases` | Given a spaCy `Doc`, return a list of `Span` objects used for sense2vec phrases (typically noun phrases and named entities). | | `registry.merge_phrases` | Given a spaCy `Doc`, get all sense2vec phrases and merge them into single tokens.  | Each registry has a `register` method that can be used as a function decorator and takes one argument, the name of the custom function. ```python from sense2vec import registry @registry.make_key.register("custom") def custom_make_key(word, sense): return f"{word}###{sense}" @registry.split_key.register("custom") def custom_split_key(key): word, sense = key.split("###") return word, sense ``` When initializing the `Sense2Vec` object, you can now pass in a dictionary of overrides with the names of your custom registered functions. 
```python overrides = {"make_key": "custom", "split_key": "custom"} s2v = Sense2Vec(overrides=overrides) ``` This makes it easy to experiment with different strategies and to serialize them as plain strings (instead of having to pass around and/or pickle the functions themselves). ## 🚂 Training your own sense2vec vectors The [`/scripts`](/scripts) directory contains command line utilities for preprocessing text and training your own vectors. ### Requirements To train your own sense2vec vectors, you'll need the following: - A **very large** source of raw text (ideally more than you'd use for word2vec, since the senses make the vocabulary more sparse). We recommend at least 1 billion words. - A [pretrained spaCy model](https://spacy.io/models) that assigns part-of-speech tags, dependencies and named entities, and populates the `doc.noun_chunks`. If the language you need doesn't provide a built-in [syntax iterator for noun phrases](https://spacy.io/usage/adding-languages#syntax-iterators), you'll need to write your own. (The `doc.noun_chunks` and `doc.ents` are what sense2vec uses to determine what's a phrase.) - [GloVe](https://github.com/stanfordnlp/GloVe) or [fastText](https://github.com/facebookresearch/fastText) installed and built. You should be able to clone the repo and run `make` in the respective directory. ### Step-by-step process The training process is split up into several steps to allow you to resume at any given point. Processing scripts are designed to operate on single files, making it easy to parallelize the work. The scripts in this repo require either [GloVe](https://github.com/stanfordnlp/GloVe) or [fastText](https://github.com/facebookresearch/fastText), which you need to clone and `make`. For fastText, the scripts will require the path to the created binary file. 
If you're working on Windows, you can build with `cmake`, or alternatively use the `.exe` file from this **unofficial** repo with FastText binary builds for Windows: https://github.com/xiamx/fastText/releases. | | Script | Description | | ------ | -------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | **1.** | [`01_parse.py`](scripts/01_parse.py) | Use spaCy to parse the raw text and output binary collections of `Doc` objects (see [`DocBin`](https://spacy.io/api/docbin)). | | **2.** | [`02_preprocess.py`](scripts/02_preprocess.py) | Load a collection of parsed `Doc` objects produced in the previous step and output text files in the sense2vec format (one sentence per line and merged phrases with senses). | | **3.** | [`03_glove_build_counts.py`](scripts/03_glove_build_counts.py) | Use [GloVe](https://github.com/stanfordnlp/GloVe) to build the vocabulary and counts. Skip this step if you're using Word2Vec via [FastText](https://github.com/facebookresearch/fastText). | | **4.** | [`04_glove_train_vectors.py`](scripts/04_glove_train_vectors.py)<br />[`04_fasttext_train_vectors.py`](scripts/04_fasttext_train_vectors.py) | Use [GloVe](https://github.com/stanfordnlp/GloVe) or [FastText](https://github.com/facebookresearch/fastText) to train vectors. | | **5.** | [`05_export.py`](scripts/05_export.py) | Load the vectors and frequencies and output a sense2vec component that can be loaded via `Sense2Vec.from_disk`. | | **6.** | [`06_precompute_cache.py`](scripts/06_precompute_cache.py) | **Optional:** Precompute nearest-neighbor queries for every entry in the vocab to make `Sense2Vec.most_similar` faster. 
| For more detailed documentation of the scripts, check out the source or run them with `--help`. For example, `python scripts/01_parse.py --help`. ## 🍳 Prodigy recipes This package also seamlessly integrates with the [Prodigy](https://prodi.gy) annotation tool and exposes recipes for using sense2vec vectors to quickly generate lists of multi-word phrases and bootstrap NER annotations. To use a recipe, `sense2vec` needs to be installed in the same environment as Prodigy. For an example of a real-world use case, check out this [NER project](https://github.com/explosion/projects/tree/master/ner-fashion-brands) with downloadable datasets. The following recipes are available – see below for more detailed docs. | Recipe | Description | | ------------------------------------------------------------------- | -------------------------------------------------------------------- | | [`sense2vec.teach`](#recipe-sense2vecteach) | Bootstrap a terminology list using sense2vec. | | [`sense2vec.to-patterns`](#recipe-sense2vecto-patterns) | Convert phrases dataset to token-based match patterns. | | [`sense2vec.eval`](#recipe-sense2veceval) | Evaluate a sense2vec model by asking about phrase triples. | | [`sense2vec.eval-most-similar`](#recipe-sense2veceval-most-similar) | Evaluate a sense2vec model by correcting the most similar entries. | | [`sense2vec.eval-ab`](#recipe-sense2veceval-ab) | Perform an A/B evaluation of two pretrained sense2vec vector models. | ### <kbd>recipe</kbd> `sense2vec.teach` Bootstrap a terminology list using sense2vec. Prodigy will suggest similar terms based on the most similar phrases from sense2vec, and the suggestions will be adjusted as you annotate and accept similar phrases. For each seed term, the best matching sense according to the sense2vec vectors will be used. 
```bash prodigy sense2vec.teach [dataset] [vectors_path] [--seeds] [--threshold] [--n-similar] [--batch-size] [--resume] ``` | Argument | Type | Description | | -------------------- | ---------- | ----------------------------------------- | | `dataset` | positional | Dataset to save annotations to. | | `vectors_path` | positional | Path to pretrained sense2vec vectors. | | `--seeds`, `-s` | option | One or more comma-separated seed phrases. | | `--threshold`, `-t` | option | Similarity threshold. Defaults to `0.85`. | | `--n-similar`, `-n` | option | Number of similar items to get at once. | | `--batch-size`, `-b` | option | Batch size for submitting annotations. | | `--resume`, `-R` | flag | Resume from an existing phrases dataset. | #### Example ```bash prodigy sense2vec.teach tech_phrases /path/to/s2v_reddit_2015_md --seeds "natural language processing, machine learning, artificial intelligence" ``` ### <kbd>recipe</kbd> `sense2vec.to-patterns` Convert a dataset of phrases collected with `sense2vec.teach` to token-based match patterns that can be used with [spaCy's `EntityRuler`](https://spacy.io/usage/rule-based-matching#entityruler) or recipes like `ner.match`. If no output file is specified, the patterns are written to stdout. The examples are tokenized so that multi-token terms are represented correctly, e.g.: `{"label": "SHOE_BRAND", "pattern": [{ "LOWER": "new" }, { "LOWER": "balance" }]}`. ```bash prodigy sense2vec.to-patterns [dataset] [spacy_model] [label] [--output-file] [--case-sensitive] [--dry] ``` | Argument | Type | Description | | ------------------------- | ---------- | -------------------------------------------- | | `dataset` | positional | Phrase dataset to convert. | | `spacy_model` | positional | spaCy model for tokenization. | | `label` | positional | Label to apply to all patterns. | | `--output-file`, `-o` | option | Optional output file. Defaults to stdout. | | `--case-sensitive`, `-CS` | flag | Make patterns case-sensitive. 
| | `--dry`, `-D` | flag | Perform a dry run and don't output anything. | #### Example ```bash prodigy sense2vec.to-patterns tech_phrases en_core_web_sm TECHNOLOGY --output-file /path/to/patterns.jsonl ``` ### <kbd>recipe</kbd> `sense2vec.eval` Evaluate a sense2vec model by asking about phrase triples: is word A more similar to word B, or to word C? If the human mostly agrees with the model, the vectors model is good. The recipe will only ask about vectors with the same sense and supports different example selection strategies. ```bash prodigy sense2vec.eval [dataset] [vectors_path] [--strategy] [--senses] [--exclude-senses] [--n-freq] [--threshold] [--batch-size] [--eval-whole] [--eval-only] [--show-scores] ``` | Argument | Type | Description | | ------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------- | | `dataset` | positional | Dataset to save annotations to. | | `vectors_path` | positional | Path to pretrained sense2vec vectors. | | `--strategy`, `-st` | option | Example selection strategy. `most_similar` (default) or `random`. | | `--senses`, `-s` | option | Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used. | | `--exclude-senses`, `-es` | option | Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults. | | `--n-freq`, `-f` | option | Number of most frequent entries to limit to. | | `--threshold`, `-t` | option | Minimum similarity threshold to consider examples. | | `--batch-size`, `-b` | option | Batch size to use. | | `--eval-whole`, `-E` | flag | Evaluate the whole dataset instead of the current session. | | `--eval-only`, `-O` | flag | Don't annotate, only evaluate the current dataset. | | `--show-scores`, `-S` | flag | Show all scores for debugging. 
| #### Strategies | Name | Description | | -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `most_similar` | Pick a random word from a random sense and get its most similar entries of the same sense. Ask about the similarity to the last and middle entry from that selection. | | `most_least_similar` | Pick a random word from a random sense and get the least similar entry from its most similar entries, and then the last most similar entry of that. | | `random` | Pick a random sample of 3 words from the same random sense. | #### Example ```bash prodigy sense2vec.eval vectors_eval /path/to/s2v_reddit_2015_md --senses NOUN,ORG,PRODUCT --threshold 0.5 ``` ![UI preview of sense2vec.eval](https://user-images.githubusercontent.com/13643239/67994212-668cf400-fc44-11e9-8fe2-bf264ae32b0a.png) ### <kbd>recipe</kbd> `sense2vec.eval-most-similar` Evaluate a vectors model by looking at the most similar entries it returns for a random phrase and unselecting the mistakes. ```bash prodigy sense2vec.eval-most-similar [dataset] [vectors_path] [--senses] [--exclude-senses] [--n-freq] [--n-similar] [--batch-size] [--eval-whole] [--eval-only] [--show-scores] ``` | Argument | Type | Description | | ------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------- | | `dataset` | positional | Dataset to save annotations to. | | `vectors_path` | positional | Path to pretrained sense2vec vectors. | | `--senses`, `-s` | option | Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used. | | `--exclude-senses`, `-es` | option | Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults. | | `--n-freq`, `-f` | option | Number of most frequent entries to limit to. 
| | `--n-similar`, `-n` | option | Number of similar items to check. Defaults to `10`. | | `--batch-size`, `-b` | option | Batch size to use. | | `--eval-whole`, `-E` | flag | Evaluate the whole dataset instead of the current session. | | `--eval-only`, `-O` | flag | Don't annotate, only evaluate the current dataset. | | `--show-scores`, `-S` | flag | Show all scores for debugging. | ```bash prodigy sense2vec.eval-most-similar vectors_eval_sim /path/to/s2v_reddit_2015_md --senses NOUN,ORG,PRODUCT ``` ### <kbd>recipe</kbd> `sense2vec.eval-ab` Perform an A/B evaluation of two pretrained sense2vec vector models by comparing the most similar entries they return for a random phrase. The UI shows two randomized options with the most similar entries of each model and highlights the phrases that differ. At the end of the annotation session the overall stats and preferred model are shown. ```bash prodigy sense2vec.eval-ab [dataset] [vectors_path_a] [vectors_path_b] [--senses] [--exclude-senses] [--n-freq] [--n-similar] [--batch-size] [--eval-whole] [--eval-only] [--show-mapping] ``` | Argument | Type | Description | | ------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------- | | `dataset` | positional | Dataset to save annotations to. | | `vectors_path_a` | positional | Path to pretrained sense2vec vectors. | | `vectors_path_b` | positional | Path to pretrained sense2vec vectors. | | `--senses`, `-s` | option | Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used. | | `--exclude-senses`, `-es` | option | Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults. | | `--n-freq`, `-f` | option | Number of most frequent entries to limit to. | | `--n-similar`, `-n` | option | Number of similar items to check. Defaults to `10`. | | `--batch-size`, `-b` | option | Batch size to use. 
| | `--eval-whole`, `-E` | flag | Evaluate the whole dataset instead of the current session. | | `--eval-only`, `-O` | flag | Don't annotate, only evaluate the current dataset. | | `--show-mapping`, `-S` | flag | Show which models are option 1 and option 2 in the UI (for debugging). | ```bash prodigy sense2vec.eval-ab vectors_eval_sim /path/to/s2v_reddit_2015_md /path/to/s2v_reddit_2019_md --senses NOUN,ORG,PRODUCT ``` ![UI preview of sense2vec.eval-ab](https://user-images.githubusercontent.com/13643239/68088514-46d21780-fe60-11e9-9b29-fe313bb2154d.png) ## Pretrained vectors The pretrained Reddit vectors support the following "senses", either part-of-speech tags or entity labels. For more details, see spaCy's [annotation scheme overview](https://spacy.io/api/annotation). | Tag | Description | Examples | | ------- | ------------------------- | ------------------------------------ | | `ADJ` | adjective | big, old, green | | `ADP` | adposition | in, to, during | | `ADV` | adverb | very, tomorrow, down, where | | `AUX` | auxiliary  | is, has (done), will (do) | | `CONJ` | conjunction | and, or, but | | `DET` | determiner | a, an, the | | `INTJ` | interjection | psst, ouch, bravo, hello | | `NOUN` | noun | girl, cat, tree, air, beauty | | `NUM` | numeral | 1, 2017, one, seventy-seven, MMXIV | | `PART` | particle | 's, not | | `PRON` | pronoun | I, you, he, she, myself, somebody | | `PROPN` | proper noun | Mary, John, London, NATO, HBO | | `PUNCT` | punctuation | , ? ( ) | | `SCONJ` | subordinating conjunction | if, while, that | | `SYM` | symbol | \$, %, =, :), 😝 | | `VERB` | verb | run, runs, running, eat, ate, eating | | Entity Label | Description | | ------------- | ---------------------------------------------------- | | `PERSON` | People, including fictional. | | `NORP` | Nationalities or religious or political groups. | | `FACILITY` | Buildings, airports, highways, bridges, etc. | | `ORG` | Companies, agencies, institutions, etc. 
| | `GPE` | Countries, cities, states. | | `LOC` | Non-GPE locations, mountain ranges, bodies of water. | | `PRODUCT` | Objects, vehicles, foods, etc. (Not services.) | | `EVENT` | Named hurricanes, battles, wars, sports events, etc. | | `WORK_OF_ART` | Titles of books, songs, etc. | | `LANGUAGE` | Any named language. |
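The senses in the two tables above are the part after the `|` in every key of the pretrained vectors, following the `phrase_text|SENSE` scheme described under standalone usage. As a minimal sketch of that scheme, the helpers below build and split such keys. `make_key`, `split_key` and `KNOWN_SENSES` are hypothetical names written for this illustration in plain Python; they are not the functions registered in sense2vec's `registry`, and the real defaults may differ in detail.

```python
# Hypothetical helpers for illustration only; they are not the sense2vec API.

# Senses copied from the part-of-speech and entity label tables above.
POS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CONJ", "DET", "INTJ", "NOUN",
    "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB",
}
ENTITY_LABELS = {
    "PERSON", "NORP", "FACILITY", "ORG", "GPE",
    "LOC", "PRODUCT", "EVENT", "WORK_OF_ART", "LANGUAGE",
}
KNOWN_SENSES = POS_TAGS | ENTITY_LABELS


def make_key(phrase: str, sense: str) -> str:
    """Build a `phrase_text|SENSE` key: spaces become underscores, case is kept."""
    if sense not in KNOWN_SENSES:
        raise ValueError(f"Unknown sense: {sense!r}")
    return "_".join(phrase.split()) + "|" + sense


def split_key(key: str) -> tuple:
    """Split a key into (phrase, sense), using the last `|` as the separator."""
    word, sense = key.rsplit("|", 1)
    return word.replace("_", " "), sense


print(make_key("natural language processing", "NOUN"))
# natural_language_processing|NOUN
print(split_key("machine_learning|NOUN"))
# ('machine learning', 'NOUN')
```

Because the vector table is case-sensitive, the sketch deliberately leaves the casing of the phrase untouched, so `Google|ORG` and `google|ORG` remain different keys.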

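The API section describes the default `Sense2Vec.similarity` estimate as cosine similarity over an average of vectors. The following is a rough, dependency-free sketch of that idea and not the library's implementation: `mean_vector`, `cosine` and `similarity` are hypothetical helpers, and the two-dimensional toy vectors are made up purely for the example.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either one has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity(table, keys_a, keys_b):
    """Compare one key or a list of keys on each side by averaging their vectors."""
    if isinstance(keys_a, str):
        keys_a = [keys_a]
    if isinstance(keys_b, str):
        keys_b = [keys_b]
    vec_a = mean_vector([table[key] for key in keys_a])
    vec_b = mean_vector([table[key] for key in keys_b])
    return cosine(vec_a, vec_b)


# Made-up toy vectors, not values from any pretrained model.
toy_table = {
    "machine_learning|NOUN": [1.0, 0.0],
    "computer_vision|NOUN": [0.0, 1.0],
    "deep_learning|NOUN": [1.0, 1.0],
}
print(similarity(toy_table, "machine_learning|NOUN", "computer_vision|NOUN"))
print(similarity(toy_table, ["machine_learning|NOUN", "computer_vision|NOUN"], "deep_learning|NOUN"))
```

Averaging before comparing is what lets a set of keys stand in for a single concept: in the toy data, the two orthogonal keys score zero against each other, while their average points in the same direction as the third key.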
ML Frameworks
1.7K Github Stars
projects
Open Source

projects

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a> # 🪐 Project Templates [Weasel](https://github.com/explosion/weasel), previously [spaCy projects](https://spacy.io/usage/projects), lets you manage and share **end-to-end workflows** for different **use cases and domains**, and orchestrate training, packaging and serving your custom pipelines. You can start off by cloning a pre-defined project template, adjust it to fit your needs, load in your data, train a pipeline, export it as a Python package, upload your outputs to a remote storage and share your results with your team. > ⚠️ Weasel project templates require > [**Weasel**](https://github.com/explosion/weasel), which is also included by > default with spaCy v3.7+. You can install it from pip with > `pip install weasel` or conda with `conda install weasel -c conda-forge`. Make > sure to use a fresh virtual environment. > > See the [`master` branch](https://github.com/explosion/projects/tree/master) > for the previous version of this repo. [![tests](https://github.com/explosion/projects/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/projects/actions/workflows/tests.yml) [![spaCy](https://img.shields.io/static/v1?label=made%20with%20%E2%9D%A4%20and&message=spaCy&color=09a3d5&style=flat-square)](https://spacy.io) ## 🗃 Categories | Name | Description | | ------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [`pipelines`](pipelines) | Templates for training NLP pipelines with different components on different corpora. | | [`tutorials`](tutorials) | Templates that work through a specific NLP use case end-to-end. 
| | [`integrations`](integrations) | Templates showing integrations with third-party libraries and tools for managing your data and experiments, iterating on demos and prototypes and shipping your models into production. | | [`benchmarks`](benchmarks) | Templates to reproduce our benchmarks and produce quantifiable results that are easy to compare against other systems or versions of spaCy. | | [`experimental`](experimental) | Experimental workflows and other cutting-edge stuff to use at your own risk. | ## 🚀 Quickstart Projects can be used via the [`weasel`](https://github.com/explosion/weasel/blob/main/docs/cli.md) CLI, or through the [`spacy project`](https://spacy.io/api/cli#project) alias. To find out more about a command, add `--help`. For detailed instructions, see the [Weasel documentation](https://github.com/explosion/weasel/tree/main#-documentation) or [spaCy projects usage guide](https://spacy.io/usage/projects). 1. **Clone** the project template you want to use. ```bash python -m weasel clone tutorials/ner_fashion_brands ``` 2. **Install** any project requirements. ```bash cd ner_fashion_brands python -m pip install -r requirements.txt ``` 3. **Fetch assets** (data, weights) defined in the `project.yml`. ```bash python -m weasel assets ``` 4. **Run a command** defined in the `project.yml`. ```bash python -m weasel run preprocess ``` 5. **Run a workflow** of multiple steps in order. ```bash python -m weasel run all ``` 6. **Adjust** the template for **your specific use case**, load in your own data, adjust the settings and model and share the result with your team. 
## 👷‍♀️ Repository maintenance To keep the project templates and their documentation up to date, this repo contains several scripts: | Script | Description | | -------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [`update_docs.py`](.github/update_docs.py) | Update all auto-generated docs in the given root. Calls into [`spacy project document`](https://spacy.io/api/cli#project-document) and only replaces the auto-generated sections, not any custom content before or after. | | [`update_category_docs.py`](.github/update_category_docs.py) | Update the auto-generated `README.md` in the category directories listing the available project templates. | | [`update_configs.py`](.github/update_configs.py) | Update and auto-fill all `config.cfg` files included in the repo, similar to [`spacy init fill-config`](https://spacy.io/api/cli#init-fill-config). Can be used to keep the configs up to date with changes in spaCy. | | [`update_projects_jsonl.py`](.github/update_projects_jsonl.py) | Update `projects.jsonl` file in the given root. Should be used at the root level of the repo. |
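To make the difference between running a single command and running a workflow more concrete, here is a minimal sketch in plain Python. It assumes a `project.yml` that has already been parsed into a dictionary with `commands` and `workflows` keys; the script names and the `resolve` helper are hypothetical, and the sketch ignores everything else Weasel does (dependency checks, skipping unchanged steps, remote storage).

```python
# Hypothetical, trimmed-down stand-in for a parsed project.yml.
project = {
    "commands": [
        {"name": "preprocess", "script": ["python scripts/preprocess.py"]},
        {"name": "train", "script": ["python scripts/train.py"]},
        {"name": "evaluate", "script": ["python scripts/evaluate.py"]},
    ],
    "workflows": {"all": ["preprocess", "train", "evaluate"]},
}


def resolve(project, target):
    """Return the shell lines that running `target` would execute, in order.

    `target` may name a workflow (several commands) or a single command.
    """
    commands = {cmd["name"]: cmd["script"] for cmd in project["commands"]}
    workflows = project.get("workflows", {})
    if target in workflows:
        names = workflows[target]
    elif target in commands:
        names = [target]
    else:
        raise KeyError(f"No command or workflow named {target!r}")
    # Flatten the scripts of all selected commands, keeping workflow order.
    return [line for name in names for line in commands[name]]


print(resolve(project, "preprocess"))
print(resolve(project, "all"))
```

In this simplified model, `run preprocess` maps to one command's script, while `run all` expands to the scripts of every command listed in the workflow, in the order given.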

Developer Tools Documentation
1.4K Github Stars
spacy-transformers
Open Source

spacy-transformers

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a> # spacy-transformers: Use pretrained transformers like BERT, XLNet and GPT-2 in spaCy This package provides [spaCy](https://github.com/explosion/spaCy) components and architectures to use transformer models via [Hugging Face's `transformers`](https://github.com/huggingface/transformers) in spaCy. The result is convenient access to state-of-the-art transformer architectures, such as BERT, GPT-2, XLNet, etc. > **This release requires [spaCy v3](https://spacy.io/usage/v3).** For the > previous version of this library, see the > [`v0.6.x` branch](https://github.com/explosion/spacy-transformers/tree/v0.6.x). [![tests](https://github.com/explosion/spacy-transformers/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spacy-transformers/actions/workflows/tests.yml) [![PyPi](https://img.shields.io/pypi/v/spacy-transformers.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.python.org/pypi/spacy-transformers) [![GitHub](https://img.shields.io/github/release/explosion/spacy-transformers/all.svg?style=flat-square&logo=github)](https://github.com/explosion/spacy-transformers/releases) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black) ## Features - Use pretrained transformer models like **BERT**, **RoBERTa** and **XLNet** to power your spaCy pipeline. - Easy **multi-task learning**: backprop to one transformer model from several pipeline components. - Train using spaCy v3's powerful and extensible config system. - Automatic alignment of transformer output to spaCy's tokenization. - Easily customize what transformer data is saved in the `Doc` object. - Easily customize how long documents are processed. - Out-of-the-box serialization and model packaging. 
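The idea behind aligning transformer output to spaCy's tokenization can be illustrated with character offsets. The sketch below is not the package's implementation; the `align_by_offsets` helper and the wordpiece split shown are assumptions made purely for illustration.

```python
def align_by_offsets(token_spans, piece_spans):
    """Map each token to the indices of the wordpieces that overlap it.

    Both arguments are lists of (start_char, end_char) pairs over the same text.
    """
    alignment = []
    for t_start, t_end in token_spans:
        overlapping = [
            i
            for i, (p_start, p_end) in enumerate(piece_spans)
            if p_start < t_end and t_start < p_end  # character ranges overlap
        ]
        alignment.append(overlapping)
    return alignment


text = "spaCy rocks"
# Tokens "spaCy" and "rocks" as character spans.
token_spans = [(0, 5), (6, 11)]
# Hypothetical wordpieces "spa", "##C", "##y", "rocks" as character spans.
piece_spans = [(0, 3), (3, 4), (4, 5), (6, 11)]

alignment = align_by_offsets(token_spans, piece_spans)
print(alignment)  # [[0, 1, 2], [3]]
```

With a mapping like this, the vectors of several wordpieces can be pooled into one vector per token, which is what lets downstream components work at the token level.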
## 🚀 Installation Installing the package from pip will automatically install all dependencies, including PyTorch and spaCy. Make sure you install this package **before** you install the models. Also note that this package requires **Python 3.6+**, **PyTorch v1.5+** and **spaCy v3.0+**. ```bash pip install 'spacy[transformers]' ``` For GPU installation, find your CUDA version using `nvcc --version` and add the [version in brackets](https://spacy.io/usage/#gpu), e.g. `spacy[transformers,cuda92]` for CUDA9.2 or `spacy[transformers,cuda100]` for CUDA10.0. If you are having trouble installing PyTorch, follow the [instructions](https://pytorch.org/get-started/locally/) on the official website for your specific operating system and requirements. ## 📖 Documentation > ⚠️ **Important note:** This package has been extensively refactored to take > advantage of [spaCy v3.0](https://spacy.io). Previous versions that were built > for [spaCy v2.x](https://v2.spacy.io) worked considerably differently. Please > see previous tagged versions of this README for documentation on prior > versions. - 📘 [Embeddings, Transformers and Transfer Learning](https://spacy.io/usage/embeddings-transformers): How to use transformers in spaCy - 📘 [Training Pipelines and Models](https://spacy.io/usage/training): Train and update components on your own data and integrate custom models - 📘 [Layers and Model Architectures](https://spacy.io/usage/layers-architectures): Power spaCy components with custom neural networks - 📗 [`Transformer`](https://spacy.io/api/transformer): Pipeline component API reference - 📗 [Transformer architectures](https://spacy.io/api/architectures#transformers): Architectures and registered functions ## Applying pretrained text and token classification models Note that the `transformer` component from `spacy-transformers` does not support task-specific heads like token or text classification. 
A task-specific transformer model can be used as a source of features to train spaCy components like `ner` or `textcat`, but the `transformer` component does not provide access to task-specific heads for training or inference. Alternatively, if you only want to use the **predictions** from an existing Hugging Face text or token classification model, you can use the wrappers from [`spacy-huggingface-pipelines`](https://github.com/explosion/spacy-huggingface-pipelines) to incorporate task-specific transformer models into your spaCy pipelines. ## Bug reports and other issues Please use [spaCy's issue tracker](https://github.com/explosion/spaCy/issues) to report a bug, or open a new thread on the [discussion board](https://github.com/explosion/spaCy/discussions) for any other issue.

JavaScript Libraries & Components ML Frameworks
1.4K Github Stars
spacy-llm
Open Source

spacy-llm

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a> <a href="https://explosion.ai"><img src="assets/logo.png" width="125" height="125" align="left" style="margin-right:30px" /></a> <h1 align="center"> <span style="font: bold 38pt'Courier New';">spacy-llm</span> <br>Structured NLP with LLMs </h1> <br><br> [![GitHub Workflow Status](https://img.shields.io/github/actions/workflow/status/explosion/spacy-llm/test.yml?branch=main)](https://github.com/explosion/spacy-llm/actions/workflows/test.yml) [![pypi Version](https://img.shields.io/pypi/v/spacy-llm.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy-llm/) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black) This package integrates Large Language Models (LLMs) into [spaCy](https://spacy.io), featuring a modular system for **fast prototyping** and **prompting**, and turning unstructured responses into **robust outputs** for various NLP tasks, **no training data** required. 
## Feature Highlight - Serializable `llm` **component** to integrate prompts into your spaCy pipeline - **Modular functions** to define the [**task**](https://spacy.io/api/large-language-models#tasks) (prompting and parsing) and [**model**](https://spacy.io/api/large-language-models#models) - Interfaces with the APIs of - **[OpenAI](https://platform.openai.com/docs/api-reference/)** - **[Cohere](https://docs.cohere.com/reference/generate)** - **[Anthropic](https://docs.anthropic.com/claude/reference/)** - **[Google PaLM](https://ai.google/discover/palm2/)** - **[Microsoft Azure AI](https://azure.microsoft.com/en-us/solutions/ai)** - Supports open-source LLMs hosted on Hugging Face 🤗: - **[Falcon](https://huggingface.co/tiiuae)** - **[Dolly](https://huggingface.co/databricks)** - **[Llama 2](https://huggingface.co/meta-llama)** - **[OpenLLaMA](https://huggingface.co/openlm-research)** - **[StableLM](https://huggingface.co/stabilityai)** - **[Mistral](https://huggingface.co/mistralai)** - Integration with [LangChain](https://github.com/hwchase17/langchain) 🦜️🔗 - all `langchain` models and features can be used in `spacy-llm` - Tasks available out of the box: - [Named Entity Recognition](https://spacy.io/api/large-language-models#ner) - [Text classification](https://spacy.io/api/large-language-models#textcat) - [Lemmatization](https://spacy.io/api/large-language-models#lemma) - [Relationship extraction](https://spacy.io/api/large-language-models#rel) - [Sentiment analysis](https://spacy.io/api/large-language-models#sentiment) - [Span categorization](https://spacy.io/api/large-language-models#spancat) - [Summarization](https://spacy.io/api/large-language-models#summarization) - [Entity linking](https://spacy.io/api/large-language-models#nel) - [Translation](https://spacy.io/api/large-language-models#translation) - [Raw prompt execution for maximum flexibility](https://spacy.io/api/large-language-models#raw) - Soon: - Semantic role labeling - Easy implementation of 
**your own functions** via [spaCy's registry](https://spacy.io/api/top-level#registry) for custom prompting, parsing and model integrations. For an example, see [here](https://spacy.io/usage/large-language-models#example-4). - [Map-reduce approach](https://spacy.io/api/large-language-models#task-sharding) for splitting prompts too long for LLM's context window and fusing the results back together ## 🧠 Motivation Large Language Models (LLMs) feature powerful natural language understanding capabilities. With only a few (and sometimes no) examples, an LLM can be prompted to perform custom NLP tasks such as text categorization, named entity recognition, coreference resolution, information extraction and more. [spaCy](https://spacy.io) is a well-established library for building systems that need to work with language in various ways. spaCy's built-in components are generally powered by supervised learning or rule-based approaches. Supervised learning is much worse than LLM prompting for prototyping, but for many tasks it's much better for production. A transformer model that runs comfortably on a single GPU is extremely powerful, and it's likely to be a better choice for any task for which you have a well-defined output. You train the model with anything from a few hundred to a few thousand labelled examples, and it will learn to do exactly that. Efficiency, reliability and control are all better with supervised learning, and accuracy will generally be higher than LLM prompting as well. `spacy-llm` lets you have **the best of both worlds**. You can quickly initialize a pipeline with components powered by LLM prompts, and freely mix in components powered by other approaches. As your project progresses, you can look at replacing some or all of the LLM-powered components as you require. Of course, there can be components in your system for which the power of an LLM is fully justified. 
If you want a system that can synthesize information from multiple documents in subtle ways and generate a nuanced summary for you, bigger is better. However, even if your production system needs an LLM for part of the task, that doesn't mean you need an LLM for all of it. Maybe you want to use a cheap text classification model to help you find the texts to summarize, or maybe you want to add a rule-based system to sanity check the output of the summary. These before-and-after tasks are much easier with a mature and well-thought-out library, which is exactly what spaCy provides. ## ⏳ Install `spacy-llm` will be installed automatically in future spaCy versions. For now, you can run the following in the same virtual environment where you already have `spacy` [installed](https://spacy.io/usage). ```bash python -m pip install spacy-llm ``` > ⚠️ This package is still experimental, and the interface may change in breaking ways in minor version updates. ## 🐍 Quickstart Let's run some text classification using a GPT model from OpenAI. Create a new API key from openai.com or fetch an existing one, and ensure the keys are set as environment variables. For more background information, see the documentation around setting [API keys](https://spacy.io/api/large-language-models#api-keys). ### In Python code To do some quick experiments, from 0.5.0 onwards you can run: ```python import spacy nlp = spacy.blank("en") llm = nlp.add_pipe("llm_textcat") llm.add_label("INSULT") llm.add_label("COMPLIMENT") doc = nlp("You look gorgeous!") print(doc.cats) # {"COMPLIMENT": 1.0, "INSULT": 0.0} ``` By using the `llm_textcat` factory, the latest version of the built-in textcat task is used, as well as the default GPT-3-5 model from OpenAI. ### Using a config file To control the various parameters of the `llm` pipeline, we can use [spaCy's config system](https://spacy.io/api/data-formats#config).
To start, create a config file `config.cfg` containing at least the following (or see the full example [here](https://github.com/explosion/spacy-llm/tree/main/usage_examples/textcat_openai)): ```ini [nlp] lang = "en" pipeline = ["llm"] [components] [components.llm] factory = "llm" [components.llm.task] @llm_tasks = "spacy.TextCat.v3" labels = ["COMPLIMENT", "INSULT"] [components.llm.model] @llm_models = "spacy.GPT-4.v2" ``` Now run: ```python from spacy_llm.util import assemble nlp = assemble("config.cfg") doc = nlp("You look gorgeous!") print(doc.cats) # {"COMPLIMENT": 1.0, "INSULT": 0.0} ``` That's it! There are many other features, including prompt templating, more tasks and logging. For more information on how to use those, check out https://spacy.io/api/large-language-models. ## 🚀 Ongoing work In the near future, we will - Add more example tasks - Support a broader range of models - Provide more example use-cases and tutorials PRs are always welcome! ## 📝️ Reporting issues If you have questions regarding the usage of `spacy-llm`, or want to give us feedback after giving it a spin, please use the [discussion board](https://github.com/explosion/spacy-llm/discussions). Bug reports can be filed on the [spaCy issue tracker](https://github.com/explosion/spacy-llm/issues). Thank you! ## Migration guides Please refer to our [migration guide](migration_guide.md).
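To give a rough idea of what "turning unstructured responses into robust outputs" means for text classification, the sketch below maps a free-text model reply onto a `doc.cats`-style dictionary. It is a simplified illustration, not the parsing logic used by `spacy-llm`; the `parse_textcat_response` helper is hypothetical.

```python
from __future__ import annotations


def parse_textcat_response(response: str, labels: list[str]) -> dict[str, float]:
    """Map a raw LLM reply onto exclusive category scores.

    The reply is normalized (whitespace, surrounding punctuation, case) and
    compared against the known labels. Unknown replies score 0.0 everywhere.
    """
    answer = response.strip().strip(".!\"'").upper()
    return {label: 1.0 if label.upper() == answer else 0.0 for label in labels}


labels = ["COMPLIMENT", "INSULT"]
print(parse_textcat_response("Compliment.\n", labels))
print(parse_textcat_response("I am not sure", labels))
```

The point of a parsing step like this is that downstream code only ever sees a fixed set of labels with numeric scores, regardless of how the model phrased its reply.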

AI & Machine Learning ML Frameworks
1.4K Github Stars
spacy-layout
Open Source

spacy-layout

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a> # spaCy Layout: Process PDFs, Word documents and more with spaCy This plugin integrates with [Docling](https://ds4sd.github.io/docling/) to bring structured processing of **PDFs**, **Word documents** and other input formats to your [spaCy](https://spacy.io) pipeline. It outputs clean, **structured data** in a text-based format and creates spaCy's familiar [`Doc`](https://spacy.io/api/doc) objects that let you access labelled text spans like sections or headings, and tables with their data converted to a `pandas.DataFrame`. This workflow makes it easy to apply powerful **NLP techniques** to your documents, including linguistic analysis, named entity recognition, text classification and more. It's also great for implementing **chunking for RAG** pipelines. > 📖 **Blog post:** ["From PDFs to AI-ready structured data: a deep dive" ](https://explosion.ai/blog/pdfs-nlp-structured-data) – A new modular workflow for converting PDFs and similar documents to structured data, featuring `spacy-layout` and Docling. [![Test](https://github.com/explosion/spacy-layout/actions/workflows/test.yml/badge.svg)](https://github.com/explosion/spacy-layout/actions/workflows/test.yml) [![Current Release Version](https://img.shields.io/github/release/explosion/spacy-layout.svg?style=flat-square&logo=github&include_prereleases)](https://github.com/explosion/spacy-layout/releases) [![pypi Version](https://img.shields.io/pypi/v/spacy-layout.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy-layout/) [![Built with spaCy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg?style=flat-square)](https://spacy.io) ## 📝 Usage > ⚠️ This package requires **Python 3.10** or above. 
```bash pip install spacy-layout ``` After initializing the `spaCyLayout` preprocessor with an `nlp` object for tokenization, you can call it on a document path to convert it to structured data. The resulting `Doc` object includes layout spans that map into the original raw text and expose various attributes, including the content type and layout features. ```python import spacy from spacy_layout import spaCyLayout nlp = spacy.blank("en") layout = spaCyLayout(nlp) # Process a document and create a spaCy Doc object doc = layout("./starcraft.pdf") # The text-based contents of the document print(doc.text) # Document layout including pages and page sizes print(doc._.layout) # Tables in the document and their extracted data print(doc._.tables) # Markdown representation of the document print(doc._.markdown) # Layout spans for different sections for span in doc.spans["layout"]: # Document section and token and character offsets into the text print(span.text, span.start, span.end, span.start_char, span.end_char) # Section type, e.g. "text", "title", "section_header" etc. print(span.label_) # Layout features of the section, including bounding box print(span._.layout) # Closest heading to the span (accuracy depends on document structure) print(span._.heading) ``` If you need to process larger volumes of documents at scale, you can use the `spaCyLayout.pipe` method, which takes an iterable of paths or bytes instead and yields `Doc` objects: ```python paths = ["one.pdf", "two.pdf", "three.pdf", ...] for doc in layout.pipe(paths): print(doc._.layout) ``` spaCy also allows you to call the `nlp` object on an already created `Doc`, so you can easily apply a pipeline of components for [linguistic analysis](https://spacy.io/usage/linguistic-features) or [named entity recognition](https://spacy.io/usage/linguistic-features#named-entities), use [rule-based matching](https://spacy.io/usage/rule-based-matching) or anything else you can do with spaCy. 
```python # Load the transformer-based English pipeline # Installation: python -m spacy download en_core_web_trf nlp = spacy.load("en_core_web_trf") layout = spaCyLayout(nlp) doc = layout("./starcraft.pdf") # Apply the pipeline to access POS tags, dependencies, entities etc. doc = nlp(doc) ``` ### Tables and tabular data Tables are included in the layout spans with the label `"table"` and under the shortcut `Doc._.tables`. They expose a `layout` extension attribute, as well as an attribute `data`, which includes the tabular data converted to a `pandas.DataFrame`. ```python for table in doc._.tables: # Token position and bounding box print(table.start, table.end, table._.layout) # pandas.DataFrame of contents print(table._.data) ``` By default, the span text is a placeholder `TABLE`, but you can customize how a table is rendered by providing a `display_table` callback to `spaCyLayout`, which receives the `pandas.DataFrame` of the data. This allows you to include the table figures in the document text and use them later on, e.g. during information extraction with a trained named entity recognizer or text classifier. ```python import pandas as pd def display_table(df: pd.DataFrame) -> str: return f"Table with columns: {', '.join(df.columns.tolist())}" layout = spaCyLayout(nlp, display_table=display_table) ``` ### Serialization After you've processed the documents, you can [serialize](https://spacy.io/usage/saving-loading#docs) the structured `Doc` objects in spaCy's efficient binary format, so you don't have to re-run the resource-intensive conversion. ```python from spacy.tokens import DocBin docs = layout.pipe(["one.pdf", "two.pdf", "three.pdf"]) doc_bin = DocBin(docs=docs, store_user_data=True) doc_bin.to_disk("./file.spacy") ``` > ⚠️ **Note on deserializing with extension attributes:** The custom extension attributes like `Doc._.layout` are currently registered when `spaCyLayout` is initialized.
So if you're loading back `Doc` objects with layout information from a binary file, you'll need to initialize `spaCyLayout` first so the custom attributes can be repopulated. We're planning on making this more elegant in an upcoming version. > > ```diff > + layout = spaCyLayout(nlp) > doc_bin = DocBin(store_user_data=True).from_disk("./file.spacy") > docs = list(doc_bin.get_docs(nlp.vocab)) > ``` ## 🎛️ API ### Data and extension attributes ```python layout = spaCyLayout(nlp) doc = layout("./starcraft.pdf") print(doc._.layout) for span in doc.spans["layout"]: print(span.label_, span._.layout) ``` | Attribute | Type | Description | | --- | --- | --- | | `Doc._.layout` | `DocLayout` | Layout features of the document. | | `Doc._.pages` | `list[tuple[PageLayout, list[Span]]]` | Pages in the document and the spans they contain. | | `Doc._.tables` | `list[Span]` | All tables in the document. | | `Doc._.markdown` | `str` | Markdown representation of the document. | | `Doc.spans["layout"]` | `spacy.tokens.SpanGroup` | The layout spans in the document. | | `Span.label_` | `str` | The type of the extracted layout span, e.g. `"text"` or `"section_header"`. [See here](https://github.com/DS4SD/docling-core/blob/14cad33ae7f8dc011a79dd364361d2647c635466/docling_core/types/doc/labels.py) for options. | | `Span.label` | `int` | The integer ID of the span label. | | `Span.id` | `int` | Running index of layout span. | | `Span._.layout` | `SpanLayout \| None` | Layout features of a layout span. | | `Span._.heading` | `Span \| None` | Closest heading to a span, if available. | | `Span._.data` | `pandas.DataFrame \| None` | The extracted data for table spans. | ### <kbd>dataclass</kbd> PageLayout | Attribute | Type | Description | | --- | --- | --- | | `page_no` | `int` | The page number (1-indexed). | | `width` | `float` | Page width in pixels. | | `height` | `float` | Page height in pixels. 
| ### <kbd>dataclass</kbd> DocLayout | Attribute | Type | Description | | --- | --- | --- | | `pages` | `list[PageLayout]` | The pages in the document. | ### <kbd>dataclass</kbd> SpanLayout | Attribute | Type | Description | | --- | --- | --- | | `x` | `float` | Horizontal offset of the bounding box in pixels. | | `y` | `float` | Vertical offset of the bounding box in pixels. | | `width` | `float` | Width of the bounding box in pixels. | | `height` | `float` | Height of the bounding box in pixels. | | `page_no` | `int` | Number of page the span is on. | ### <kbd>class</kbd> `spaCyLayout` #### <kbd>method</kbd> `spaCyLayout.__init__` Initialize the document processor. ```python nlp = spacy.blank("en") layout = spaCyLayout(nlp) ``` | Argument | Type | Description | | --- | --- | --- | | `nlp` | `spacy.language.Language` | The initialized `nlp` object to use for tokenization. | | `separator` | `str` | Token used to separate sections in the created `Doc` object. The separator won't be part of the layout span. If `None`, no separator will be added. Defaults to `"\n\n"`. | | `attrs` | `dict[str, str]` | Override the custom spaCy attributes. Can include `"doc_layout"`, `"doc_pages"`, `"doc_tables"`, `"doc_markdown"`, `"span_layout"`, `"span_data"`, `"span_heading"` and `"span_group"`. | | `headings` | `list[str]` | Labels of headings to consider for `Span._.heading` detection. Defaults to `["section_header", "page_header", "title"]`. | | `display_table` | `Callable[[pandas.DataFrame], str] \| str` | Function to generate the text-based representation of the table in the `Doc.text` or placeholder text. Defaults to `"TABLE"`. | | `docling_options` | `dict[InputFormat, FormatOption]` | [Format options](https://ds4sd.github.io/docling/usage/#advanced-options) passed to Docling's `DocumentConverter`. | | **RETURNS** | `spaCyLayout` | The initialized object. 
| #### <kbd>method</kbd> `spaCyLayout.__call__` Process a document and create a spaCy [`Doc`](https://spacy.io/api/doc) object containing the text content and layout spans, available via `Doc.spans["layout"]` by default. ```python layout = spaCyLayout(nlp) doc = layout("./starcraft.pdf") ``` | Argument | Type | Description | | --- | --- | --- | | `source` | `str \| Path \| bytes \| DoclingDocument` | Path of document to process, bytes or already created `DoclingDocument`. | | **RETURNS** | `Doc` | The processed spaCy `Doc` object. | #### <kbd>method</kbd> `spaCyLayout.pipe` Process multiple documents and create spaCy [`Doc`](https://spacy.io/api/doc) objects. You should use this method if you're processing larger volumes of documents at scale. The behavior of `as_tuples` works like it does in spaCy's [`Language.pipe`](https://spacy.io/api/language#pipe). ```python layout = spaCyLayout(nlp) paths = ["one.pdf", "two.pdf", "three.pdf", ...] docs = layout.pipe(paths) ``` ```python sources = [("one.pdf", {"id": 1}), ("two.pdf", {"id": 2})] for doc, context in layout.pipe(sources, as_tuples=True): ... ``` | Argument | Type | Description | | --- | --- | --- | | `sources` | `Iterable[str \| Path \| bytes] \| Iterable[tuple[str \| Path \| bytes, Any]]` | Paths of documents to process or bytes, or `(source, context)` tuples if `as_tuples` is set to `True`. | | `as_tuples` | `bool` | If set to `True`, inputs should be an iterable of `(source, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. | | **YIELDS** | `Doc \| tuple[Doc, Any]` | The processed spaCy `Doc` objects or `(doc, context)` tuples if `as_tuples` is set to `True`. | ## 💡 Examples and code snippets This section includes further examples of what you can do with `spacy-layout`. If you have an example that could be a good fit, feel free to submit a [pull request](https://github.com/explosion/spacy-layout/pulls)! 
### Visualize a page and bounding boxes with matplotlib ```python import pypdfium2 as pdfium import matplotlib.pyplot as plt from matplotlib.patches import Rectangle import spacy from spacy_layout import spaCyLayout DOCUMENT_PATH = "./document.pdf" # Load and convert the PDF page to an image pdf = pdfium.PdfDocument(DOCUMENT_PATH) page_image = pdf[2].render(scale=1) # get page 3 (index 2) numpy_array = page_image.to_numpy() # Process document with spaCy nlp = spacy.blank("en") layout = spaCyLayout(nlp) doc = layout(DOCUMENT_PATH) # Get page 3 layout and sections page = doc._.pages[2] page_layout = doc._.layout.pages[2] # Create figure and axis with page dimensions fig, ax = plt.subplots(figsize=(12, 16)) # Display the PDF image ax.imshow(numpy_array) # Add rectangles for each section's bounding box for section in page[1]: # Create rectangle patch rect = Rectangle( (section._.layout.x, section._.layout.y), section._.layout.width, section._.layout.height, fill=False, color="blue", linewidth=1, alpha=0.5 ) ax.add_patch(rect) # Add text label at top of box ax.text( section._.layout.x, section._.layout.y, section.label_, fontsize=8, color="red", verticalalignment="bottom" ) ax.axis("off") # hide axes plt.show() ```
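### Group layout spans into chunks by heading

As a rough sketch of the chunking idea mentioned in the introduction, the snippet below groups consecutive layout spans under the most recent heading. To stay runnable without a PDF, it works on plain `(label, text)` pairs standing in for layout spans; the `Chunk` class and `chunk_by_heading` helper are hypothetical and not part of `spacy-layout`. The heading labels are the documented defaults of the `headings` argument.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Default heading labels listed for the `headings` argument above.
HEADING_LABELS = {"section_header", "page_header", "title"}


@dataclass
class Chunk:
    heading: Optional[str]
    texts: List[str] = field(default_factory=list)


def chunk_by_heading(spans: List[Tuple[str, str]]) -> List[Chunk]:
    """Group (label, text) pairs into chunks that start at each heading."""
    chunks: List[Chunk] = []
    current = Chunk(heading=None)
    for label, text in spans:
        if label in HEADING_LABELS:
            # Close the running chunk before starting a new one.
            if current.texts or current.heading is not None:
                chunks.append(current)
            current = Chunk(heading=text)
        else:
            current.texts.append(text)
    if current.texts or current.heading is not None:
        chunks.append(current)
    return chunks


# Stand-in data; with a processed document the pairs could be built as
# [(span.label_, span.text) for span in doc.spans["layout"]].
spans = [
    ("title", "StarCraft"),
    ("text", "A real-time strategy game."),
    ("section_header", "Plot"),
    ("text", "The Terrans arrive."),
    ("text", "The Zerg invade."),
]
for chunk in chunk_by_heading(spans):
    print(chunk.heading, "->", " ".join(chunk.texts))
```

Each chunk keeps its heading next to its body text, so the heading can be stored as metadata or prepended to the text before embedding.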

ML Frameworks Document Management
903 Github Stars
curated-transformers
Open Source

curated-transformers

🤖 A PyTorch library of curated Transformer models and their composable components

ML Frameworks
896 Github Stars
spacy-streamlit
Open Source

spacy-streamlit

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a> # spacy-streamlit: spaCy building blocks for Streamlit apps This package contains utilities for visualizing [spaCy](https://spacy.io) models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more. [![Current Release Version](https://img.shields.io/github/release/explosion/spacy-streamlit.svg?style=flat-square&logo=github&include_prereleases)](https://github.com/explosion/spacy-streamlit/releases) [![pypi Version](https://img.shields.io/pypi/v/spacy-streamlit.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy-streamlit/) <img width="50%" align="right" src="https://user-images.githubusercontent.com/13643239/85388081-f2da8700-b545-11ea-9bd4-e303d3c5763c.png"> ## 🚀 Quickstart You can install `spacy-streamlit` from pip: ```bash pip install spacy-streamlit ``` The package includes **building blocks** that call into Streamlit and set up all the required elements for you. You can either use the individual components directly and combine them with other elements in your app, or call the `visualize` function to embed the whole visualizer. Download the English model from spaCy to get started. ```bash python -m spacy download en_core_web_sm ``` Then put the following example code in a file. ```python # streamlit_app.py import spacy_streamlit models = ["en_core_web_sm", "en_core_web_md"] default_text = "Sundar Pichai is the CEO of Google." spacy_streamlit.visualize(models, default_text) ``` You can then run your app with `streamlit run streamlit_app.py`. The app should pop up in your web browser. 
😀 #### 📦 Example: [`01_out-of-the-box.py`](examples/01_out-of-the-box.py) Use the embedded visualizer with custom settings out-of-the-box. ```bash streamlit run https://raw.githubusercontent.com/explosion/spacy-streamlit/master/examples/01_out-of-the-box.py ``` #### 👑 Example: [`02_custom.py`](examples/02_custom.py) Use individual components in your existing app. ```bash streamlit run https://raw.githubusercontent.com/explosion/spacy-streamlit/master/examples/02_custom.py ``` ## 🎛 API ### Visualizer components These functions can be used in your Streamlit app. They call into `streamlit` under the hood and set up the required elements. #### <kbd>function</kbd> `visualize` Embed the full visualizer with selected components. ```python import spacy_streamlit models = ["en_core_web_sm", "/path/to/model"] default_text = "Sundar Pichai is the CEO of Google." visualizers = ["ner", "textcat"] spacy_streamlit.visualize(models, default_text, visualizers) ``` | Argument | Type | Description | | ------------------------ | -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `models` | List[str] / Dict[str, str] | Names of loadable spaCy models (paths or package names). The models become selectable via a dropdown. Can either be a list of names or the names mapped to descriptions to display in the dropdown. | | `default_text` | str | Default text to analyze on load. Defaults to `""`. | | `default_model` | Optional[str] | Optional name of default model. If not set, the first model in the list of `models` is used. | | `visualizers` | List[str] | Names of visualizers to show. Defaults to `["parser", "ner", "textcat", "similarity", "tokens"]`. | | `ner_labels` | Optional[List[str]] | NER labels to include. 
If not set, all labels present in the `"ner"` pipeline component will be used. | | `ner_attrs` | List[str] | Span attributes shown in table of named entities. See [`visualizer.py`](spacy_streamlit/visualizer.py) for defaults. | | `token_attrs` | List[str] | Token attributes to show in token visualizer. See [`visualizer.py`](spacy_streamlit/visualizer.py) for defaults. | | `similarity_texts` | Tuple[str, str] | The default texts to compare in the similarity visualizer. Defaults to `("apple", "orange")`. | | `show_json_doc` | bool | Show button to toggle JSON representation of the `Doc`. Defaults to `True`. | | `show_meta` | bool | Show button to toggle `meta.json` of the current pipeline. Defaults to `True`. | | `show_config` | bool | Show button to toggle `config.cfg` of the current pipeline. Defaults to `True`. | | `show_visualizer_select` | bool | Show sidebar dropdown to select visualizers to display (based on enabled visualizers). Defaults to `False`. | | `sidebar_title` | Optional[str] | Title shown in the sidebar. Defaults to `None`. | | `sidebar_description` | Optional[str] | Description shown in the sidebar. Accepts Markdown-formatted text. | | `show_logo` | bool | Show the spaCy logo in the sidebar. Defaults to `True`. | | `color` | Optional[str] | Experimental: Primary color to use for some of the main UI elements (`None` to disable hack). Defaults to `"#09A3D5"`. | | `get_default_text` | Callable[[Language], str] | Optional callable that takes the currently loaded `nlp` object and returns the default text. Can be used to provide language-specific default texts. If the function returns `None`, the value of `default_text` is used, if available. Defaults to `None`. | #### <kbd>function</kbd> `visualize_parser` Visualize the dependency parse and part-of-speech tags using spaCy's [`displacy` visualizer](https://spacy.io/usage/visualizers). 
```python import spacy from spacy_streamlit import visualize_parser nlp = spacy.load("en_core_web_sm") doc = nlp("This is a text") visualize_parser(doc) ``` | Argument | Type | Description | | ------------------ | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | | `doc` | `Doc` | The spaCy `Doc` object to visualize. | | _keyword-only_ | | | | `title` | Optional[str] | Title of the visualizer block. | | `key` | Optional[str] | Key used for the streamlit component for selecting labels. | | `manual` | bool | Flag signifying whether the doc argument is a Doc object or a List of Dicts containing parse information. | | `displacy_options` | Optional[Dict] | Dictionary of options to be passed to the displacy render method for generating the HTML to be rendered. See: https://spacy.io/api/top-level#options-dep | #### <kbd>function</kbd> `visualize_ner` Visualize the named entities in a `Doc` using spaCy's [`displacy` visualizer](https://spacy.io/usage/visualizers). ```python import spacy from spacy_streamlit import visualize_ner nlp = spacy.load("en_core_web_sm") doc = nlp("Sundar Pichai is the CEO of Google.") visualize_ner(doc, labels=nlp.get_pipe("ner").labels) ``` | Argument | Type | Description | | ------------------ | -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `doc` | `Doc` | The spaCy `Doc` object to visualize. | | _keyword-only_ | | | | `labels` | Sequence[str] | The labels to show in the labels dropdown. | | `attrs` | List[str] | The span attributes to show in entity table. | | `show_table` | bool | Whether to show a table of entities and their attributes. Defaults to `True`. 
| | `title` | Optional[str] | Title of the visualizer block. | | `colors` | Dict[str,str] | Dictionary of colors for the entity spans to visualize, with keys as labels and corresponding colors as the values. This argument will be deprecated soon. In the future, colors need to be passed in the `displacy_options` argument under the key `"colors"`. | | `key` | Optional[str] | Key used for the streamlit component for selecting labels. | | `manual` | bool | Flag signifying whether the doc argument is a Doc object or a List of Dicts containing entity span information. | | `displacy_options` | Optional[Dict] | Dictionary of options to be passed to the displacy render method for generating the HTML to be rendered. See https://spacy.io/api/top-level#displacy_options-ent. | #### <kbd>function</kbd> `visualize_spans` Visualize spans in a `Doc` using spaCy's [`displacy` visualizer](https://spacy.io/usage/visualizers). ```python import spacy from spacy_streamlit import visualize_spans nlp = spacy.load("en_core_web_sm") doc = nlp("Sundar Pichai is the CEO of Google.") span = doc[4:7] # CEO of Google span.label_ = "CEO" doc.spans["job_role"] = [span] visualize_spans(doc, spans_key="job_role", displacy_options={"colors": {"CEO": "#09a3d5"}}) ``` | Argument | Type | Description | | ------------------ | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | `doc` | `Doc` | The spaCy `Doc` object to visualize. | | _keyword-only_ | | | | `spans_key` | str | Which spans key to render spans from. Defaults to `"sc"`. | | `attrs` | List[str] | The attributes on the entity Span to be labeled. Attributes are displayed only when the `show_table` argument is `True`. | | `show_table` | bool | Whether to show a table of spans and their attributes. Defaults to `True`. | | `title` | Optional[str] | Title of the visualizer block.
| | `manual` | bool | Flag signifying whether the doc argument is a Doc object or a List of Dicts containing entity span information. | | `displacy_options` | Optional[Dict] | Dictionary of options to be passed to the displacy render method for generating the HTML to be rendered. See https://spacy.io/api/top-level#displacy_options-span. | #### <kbd>function</kbd> `visualize_textcat` Visualize text categories predicted by a trained text classifier. ```python import spacy from spacy_streamlit import visualize_textcat nlp = spacy.load("./my_textcat_model") doc = nlp("This is a text about a topic") visualize_textcat(doc) ``` | Argument | Type | Description | | -------------- | ------------- | ------------------------------------ | | `doc` | `Doc` | The spaCy `Doc` object to visualize. | | _keyword-only_ | | | | `title` | Optional[str] | Title of the visualizer block. | #### <kbd>function</kbd> `visualize_similarity` Visualize semantic similarity using the model's word vectors. Will show a warning if no vectors are present in the model. ```python import spacy from spacy_streamlit import visualize_similarity nlp = spacy.load("en_core_web_lg") visualize_similarity(nlp, ("pizza", "fries")) ``` | Argument | Type | Description | | --------------- | --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- | | `nlp` | `Language` | The loaded `nlp` object with vectors. | | `default_texts` | Tuple[str, str] | The default texts to compare on load. Defaults to `("apple", "orange")`. | | _keyword-only_ | | | | `threshold` | float | Threshold for what's considered "similar". If the similarity score is greater than the threshold, the result is shown as similar. Defaults to `0.5`. | | `title` | Optional[str] | Title of the visualizer block. | #### <kbd>function</kbd> `visualize_tokens` Visualize the tokens in a `Doc` and their attributes.
```python import spacy from spacy_streamlit import visualize_tokens nlp = spacy.load("en_core_web_sm") doc = nlp("This is a text") visualize_tokens(doc, attrs=["text", "pos_", "dep_", "ent_type_"]) ``` | Argument | Type | Description | | -------------- | ------------- | -------------------------------------------------------------------------------------------------------- | | `doc` | `Doc` | The spaCy `Doc` object to visualize. | | _keyword-only_ | | | | `attrs` | List[str] | The names of token attributes to use. See [`visualizer.py`](spacy_streamlit/visualizer.py) for defaults. | | `title` | Optional[str] | Title of the visualizer block. | ### Cached helpers These helpers attempt to cache loaded models and created `Doc` objects. #### <kbd>function</kbd> `process_text` Process a text with a model of a given name and create a `Doc` object. Calls into the `load_model` helper to load the model. ```python import streamlit as st from spacy_streamlit import process_text spacy_model = st.sidebar.selectbox("Model name", ["en_core_web_sm", "en_core_web_md"]) text = st.text_area("Text to analyze", "This is a text") doc = process_text(spacy_model, text) ``` | Argument | Type | Description | | ------------ | ----- | ------------------------------------------------------- | | `model_name` | str | Loadable spaCy model name. Can be path or package name. | | `text` | str | The text to process. | | **RETURNS** | `Doc` | The processed document. | #### <kbd>function</kbd> `load_model` Load a spaCy model from a path or installed package and return a loaded `nlp` object. ```python import streamlit as st from spacy_streamlit import load_model spacy_model = st.sidebar.selectbox("Model name", ["en_core_web_sm", "en_core_web_md"]) nlp = load_model(spacy_model) ``` | Argument | Type | Description | | ----------- | ---------- | ------------------------------------------------------- | | `name` | str | Loadable spaCy model name. Can be path or package name. 
| | **RETURNS** | `Language` | The loaded `nlp` object. |
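
The `threshold` argument of `visualize_similarity` can be illustrated without Streamlit. The sketch below is a minimal standard-library approximation, not the package's implementation: `cosine_similarity` and `is_similar` are hypothetical helper names, and the short lists are made-up stand-ins for real word vectors. It assumes the score is a cosine similarity (which is what spaCy's `similarity` methods compute by default) and applies the documented rule that a score greater than the threshold counts as similar.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors.

    Returns 0.0 if either vector is all zeros (a choice made for this sketch).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def is_similar(score: float, threshold: float = 0.5) -> bool:
    """Hypothetical helper: a score strictly above the threshold is 'similar'."""
    return score > threshold


# Made-up three-dimensional "vectors" for illustration only.
pizza = [1.0, 2.0, 0.0]
fries = [2.0, 4.0, 0.0]    # same direction as `pizza`
invoice = [0.0, 0.0, 3.0]  # orthogonal to `pizza`

print(is_similar(cosine_similarity(pizza, fries)))
print(is_similar(cosine_similarity(pizza, invoice)))
```

Because the comparison is strict, a score exactly equal to the threshold is not reported as similar in this sketch.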

LLM Tools & Chat UIs ML Frameworks
857 Github Stars
spacy-stanza
Open Source

spacy-stanza

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a> # spaCy + Stanza (formerly StanfordNLP) This package wraps the [Stanza](https://github.com/stanfordnlp/stanza) (formerly StanfordNLP) library, so you can use Stanford's models in a [spaCy](https://spacy.io) pipeline. The Stanford models achieved top accuracy in the CoNLL 2017 and 2018 shared tasks, which involve tokenization, part-of-speech tagging, morphological analysis, lemmatization and labeled dependency parsing in 68 languages. As of v1.0, Stanza also supports named entity recognition for selected languages. > ⚠️ Previous versions of this package were available as [`spacy-stanfordnlp`](https://pypi.python.org/pypi/spacy-stanfordnlp). [![tests](https://github.com/explosion/spacy-stanza/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spacy-stanza/actions/workflows/tests.yml) [![PyPi](https://img.shields.io/pypi/v/spacy-stanza.svg?style=flat-square)](https://pypi.python.org/pypi/spacy-stanza) [![GitHub](https://img.shields.io/github/release/explosion/spacy-stanza/all.svg?style=flat-square)](https://github.com/explosion/spacy-stanza) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square)](https://github.com/ambv/black) Using this wrapper, you'll be able to use the following annotations, computed by your pretrained `stanza` model: - Statistical tokenization (reflected in the `Doc` and its tokens) - Lemmatization (`token.lemma` and `token.lemma_`) - Part-of-speech tagging (`token.tag`, `token.tag_`, `token.pos`, `token.pos_`) - Morphological analysis (`token.morph`) - Dependency parsing (`token.dep`, `token.dep_`, `token.head`) - Named entity recognition (`doc.ents`, `token.ent_type`, `token.ent_type_`, `token.ent_iob`, `token.ent_iob_`) - Sentence segmentation (`doc.sents`) ## ⌛️ Installation As of v1.0.0 `spacy-stanza` is only compatible with **spaCy v3.x**.
To install the most recent version: ```bash pip install spacy-stanza ``` For spaCy v2, install v0.2.x and refer to the [v0.2.x usage documentation](https://github.com/explosion/spacy-stanza/tree/v0.2.x#-usage--examples): ```bash pip install "spacy-stanza<0.3.0" ``` Make sure to also [download](https://stanfordnlp.github.io/stanza/download_models.html) one of the [pre-trained Stanza models](https://stanfordnlp.github.io/stanza/models.html). ## 📖 Usage & Examples > ⚠️ **Important note:** This package has been refactored to take advantage of > [spaCy v3.0](https://spacy.io). Previous versions that were built for > [spaCy v2.x](https://v2.spacy.io) worked considerably differently. Please see > previous tagged versions of this README for documentation on prior versions. Use `spacy_stanza.load_pipeline()` to create an `nlp` object that you can use to process a text with a Stanza pipeline and create a spaCy [`Doc` object](https://spacy.io/api/doc). By default, both the spaCy pipeline and the Stanza pipeline will be initialized with the same `lang`, e.g. "en": ```python import stanza import spacy_stanza # Download the stanza model if necessary stanza.download("en") # Initialize the pipeline nlp = spacy_stanza.load_pipeline("en") doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.") for token in doc: print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_) print(doc.ents) ``` If language data for the given language is available in spaCy, the respective language class can be used as the base for the `nlp` object – for example, `English()`. This lets you use spaCy's lexical attributes like `is_stop` or `like_num`. The `nlp` object follows the same API as any other spaCy `Language` class – so you can visualize the `Doc` objects with displaCy, add custom components to the pipeline, use the rule-based matcher and do pretty much anything else you'd normally do in spaCy. 
```python # Access spaCy's lexical attributes print([token.is_stop for token in doc]) print([token.like_num for token in doc]) # Visualize dependencies from spacy import displacy displacy.serve(doc) # or displacy.render if you're in a Jupyter notebook # Process texts with nlp.pipe for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]): print(doc.text) # Combine with your own custom pipeline components from spacy import Language @Language.component("custom_component") def custom_component(doc): # Do something to the doc here print(f"Custom component called: {doc.text}") return doc nlp.add_pipe("custom_component") doc = nlp("Some text") # Serialize attributes to a numpy array np_array = doc.to_array(['ORTH', 'LEMMA', 'POS']) ``` ### Stanza Pipeline options Additional options for the Stanza [`Pipeline`](https://stanfordnlp.github.io/stanza/pipeline.html#pipeline) can be provided as keyword arguments following the `Pipeline` API: - Provide the Stanza language as `lang`. For Stanza languages without spaCy support, use "xx" for the spaCy language setting: ```python # Initialize a pipeline for Coptic nlp = spacy_stanza.load_pipeline("xx", lang="cop") ``` - Provide Stanza pipeline settings following the `Pipeline` API: ```python # Initialize a German pipeline with the `hdt` package nlp = spacy_stanza.load_pipeline("de", package="hdt") ``` - Tokenize with spaCy rather than the statistical tokenizer (only for English): ```python nlp = spacy_stanza.load_pipeline("en", processors= {"tokenize": "spacy"}) ``` - Provide any additional processor settings as additional keyword arguments: ```python # Provide pretokenized texts (whitespace tokenization) nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True) ``` The spaCy config specifies all `Pipeline` options in the `[nlp.tokenizer]` block. 
For example, the config for the last example above, a German pipeline with pretokenized texts: ```ini [nlp.tokenizer] @tokenizers = "spacy_stanza.PipelineAsTokenizer.v1" lang = "de" dir = null package = "default" logging_level = null verbose = null use_gpu = true [nlp.tokenizer.kwargs] tokenize_pretokenized = true [nlp.tokenizer.processors] ``` ### Serialization The full Stanza pipeline configuration is stored in the spaCy pipeline [config](https://spacy.io/usage/training#config), so you can save and load the pipeline just like any other `nlp` pipeline: ```python # Save to a local directory nlp.to_disk("./stanza-spacy-model") # Reload the pipeline nlp = spacy.load("./stanza-spacy-model") ``` Note that this **does not save any Stanza model data by default**. The Stanza models are very large, so for now, this package expects you to download the models separately with `stanza.download()` and have them available either in the default model directory or in the path specified under `[nlp.tokenizer.dir]` in the config. ### Adding additional spaCy pipeline components By default, the spaCy pipeline in the `nlp` object returned by `spacy_stanza.load_pipeline()` will be empty, because all `stanza` attributes are computed and set within the custom tokenizer, [`StanzaTokenizer`](spacy_stanza/tokenizer.py). But since it's a regular `nlp` object, you can add your own components to the pipeline. For example, you could add [your own custom text classification component](https://spacy.io/usage/training) with `nlp.add_pipe("textcat", source=source_nlp)`, or augment the named entities with your own rule-based patterns using the [`EntityRuler` component](https://spacy.io/usage/rule-based-matching#entityruler).
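
To make the `[nlp.tokenizer]` config block from the Stanza Pipeline options section more concrete, the sketch below reads that same text and splits it into the top-level settings, the extra keyword arguments and the processors. It is only an illustration under stated assumptions: it uses Python's standard `configparser` and `json` modules as a stand-in for spaCy's own config loader, assumes the values are JSON-like (as they appear in the block above), and `read_block` is a hypothetical helper name.

```python
import configparser
import json

# The config block from the README, copied verbatim.
CONFIG_TEXT = """
[nlp.tokenizer]
@tokenizers = "spacy_stanza.PipelineAsTokenizer.v1"
lang = "de"
dir = null
package = "default"
logging_level = null
verbose = null
use_gpu = true

[nlp.tokenizer.kwargs]
tokenize_pretokenized = true

[nlp.tokenizer.processors]
"""


def read_block(parser: configparser.ConfigParser, section: str) -> dict:
    """Hypothetical helper: decode each raw value in a section as JSON."""
    return {key: json.loads(value) for key, value in parser[section].items()}


# Disable interpolation so values are read exactly as written.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CONFIG_TEXT)

settings = read_block(parser, "nlp.tokenizer")
kwargs = read_block(parser, "nlp.tokenizer.kwargs")
processors = read_block(parser, "nlp.tokenizer.processors")

print(settings["lang"], settings["use_gpu"], kwargs, processors)
```

Read this way, `tokenize_pretokenized=True` from the earlier `load_pipeline("de", tokenize_pretokenized=True)` call ends up under `[nlp.tokenizer.kwargs]`, while named options such as `lang` and `package` sit directly in `[nlp.tokenizer]`.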

ML Frameworks
748 Github Stars
vscode-prodigy
Open Source

vscode-prodigy

# VS Code extension for the Prodigy annotation tool ✨ [![CI](https://github.com/explosion/vscode-prodigy/actions/workflows/ci.yml/badge.svg)](https://github.com/explosion/vscode-prodigy/actions/workflows/ci.yml) This repo contains a [VS Code](https://code.visualstudio.com/) extension for [Prodigy](https://prodi.gy), our scriptable annotation tool for creating training data for machine learning models. It lets you run Prodigy within a VS Code Panel, and annotate as you develop your models and applications. ![](assets/preview.gif) In order to use this extension, you'll need a license for Prodigy ([see this page](https://prodi.gy/buy) for more details). For issues and questions, please use the [Prodigy Support Forum](https://support.prodi.gy/). If you've found a bug, feel free to submit a [pull request](https://github.com/explosion/vscode-prodigy/pulls). ## ⌛️ Installation Get it from the [VS Code marketplace](https://marketplace.visualstudio.com/items?itemName=Explosion.vscode-prodigy). ## 📋 Usage Start a Prodigy session in a terminal, e.g.: ```bash prodigy ner.manual my_set blank:en example/news_headlines.jsonl --label PERSON,ORG,PRODUCT ``` In VS Code, open the Command Palette (<kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>P</kbd>) and select `Open Prodigy`. Execute it, and you will have a new Prodigy panel on the side. ## ⚙ Configuration This extension contributes the following settings: | Configuration | Description | |--------------------|--------------------------------------------------------------------------------------| | vscode-prodigy.url | Set the URL where Prodigy is being served. Useful when you're behind a reverse proxy. | ## 👩‍💻 Develop You need [Visual Studio Code](https://code.visualstudio.com/) and [TypeScript](https://www.typescriptlang.org/) for development. First, clone this repository: ```bash git clone git@github.com:explosion/vscode-prodigy.git ``` and open the `vscode-prodigy` folder in Code: ```bash cd vscode-prodigy code .
``` ### Get up and running straight away * Press <kbd>F5</kbd> to open a new window with your extension loaded. * Run your command from the command palette by pressing (<kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>P</kbd> or <kbd>Cmd</kbd>+<kbd>Shift</kbd>+<kbd>P</kbd> on Mac) and typing `Open Prodigy`. * Set breakpoints in your code inside `src/extension.ts` to debug your extension. * Find output from your extension in the debug console. ### Make changes * You can relaunch the extension from the debug toolbar after changing code in `src/extension.ts`. * You can also reload (<kbd>Ctrl</kbd>+<kbd>R</kbd> or <kbd>Cmd</kbd>+<kbd>R</kbd> on Mac) the VS Code window with your extension to load your changes. ### Run tests * Open the debug viewlet (<kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>D</kbd> or <kbd>Cmd</kbd>+<kbd>Shift</kbd>+<kbd>D</kbd> on Mac) and from the launch configuration dropdown pick `Extension Tests`. * Press <kbd>F5</kbd> to run the tests in a new window with your extension loaded. * See the output of the test result in the debug console. * Make changes to `src/test/suite/extension.test.ts` or create new test files inside the `test/suite` folder. * The provided test runner will only consider files matching the name pattern `**.test.ts`.

CMS Plugins & Extensions Code Editors & IDEs
30 Github Stars