ivan-bilan

Open Source

The-NLP-Pandect

![The-NLP-Pandect](./Resources/Images/pandect.png) <p align="center"> This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online. </p> > __Note__ > Quick legend on available resource types: > > ⭐ - open source project, usually a GitHub repository with its number of stars > > 📙 - resource you can read, usually a blog post or a paper > > 🗂️ - a collection of additional resources > > 🔱 - non-open source tool, framework or paid service > > 🎥️ - a resource you can watch > > 🎙️ - a resource you can listen to ### <p align="center"><b>Table of Contents</b></p> | 📇 Main Section | 🗃️ Sub-sections Sample | | ------------- | ------------- | | [NLP Resources](https://github.com/ivan-bilan/The-NLP-Pandect#) | [Paper Summaries](https://github.com/ivan-bilan/The-NLP-Pandect#papers-and-paper-summaries), [Conference Summaries](https://github.com/ivan-bilan/The-NLP-Pandect#conference-summaries), [NLP Datasets](https://github.com/ivan-bilan/The-NLP-Pandect#nlp-datasets) | | [NLP Podcasts](https://github.com/ivan-bilan/The-NLP-Pandect#-1) | [NLP-only Podcasts](https://github.com/ivan-bilan/The-NLP-Pandect#nlp-only-podcasts), [Podcasts with many NLP Episodes](https://github.com/ivan-bilan/The-NLP-Pandect#many-nlp-episodes) | | [NLP Newsletters](https://github.com/ivan-bilan/The-NLP-Pandect#-2) | - | | [NLP Meetups](https://github.com/ivan-bilan/The-NLP-Pandect#-3) | - | | [NLP YouTube Channels](https://github.com/ivan-bilan/The-NLP-Pandect#-4) | - | | [NLP Benchmarks](https://github.com/ivan-bilan/The-NLP-Pandect#-5) | [General NLU](https://github.com/ivan-bilan/The-NLP-Pandect#general-nlu), [Question Answering](https://github.com/ivan-bilan/The-NLP-Pandect#question-answering), [Multilingual](https://github.com/ivan-bilan/The-NLP-Pandect#multilingual-and-non-english-benchmarks) | | [Research Resources](https://github.com/ivan-bilan/The-NLP-Pandect#-6) | [Resource on Transformer 
Models](https://github.com/ivan-bilan/The-NLP-Pandect#transformer-based-architectures), [Distillation and Pruning](https://github.com/ivan-bilan/The-NLP-Pandect#distillation-pruning-and-quantization), [Automated Summarization](https://github.com/ivan-bilan/The-NLP-Pandect#automated-summarization) | | [Industry Resources](https://github.com/ivan-bilan/The-NLP-Pandect#-7) | [Best Practices for NLP Systems](https://github.com/ivan-bilan/The-NLP-Pandect#best-practices-for-nlp), [MLOps for NLP](https://github.com/ivan-bilan/The-NLP-Pandect#mlops-for-nlp) | | [Speech Recognition](https://github.com/ivan-bilan/The-NLP-Pandect#-8) | [General Resources](https://github.com/ivan-bilan/The-NLP-Pandect#general-speech-recognition), [Text to Speech](https://github.com/ivan-bilan/The-NLP-Pandect#text-to-speech), [Speech to Text](https://github.com/ivan-bilan/The-NLP-Pandect#speech-to-text), [Datasets](https://github.com/ivan-bilan/The-NLP-Pandect#datasets) | | [Topic Modeling](https://github.com/ivan-bilan/The-NLP-Pandect#-9) | [Blogs](https://github.com/ivan-bilan/The-NLP-Pandect#blogs-1), [Frameworks](https://github.com/ivan-bilan/The-NLP-Pandect#frameworks-for-topic-modeling), [Repositories and Projects](https://github.com/ivan-bilan/The-NLP-Pandect#repositories-1) | | [Keyword Extraction](https://github.com/ivan-bilan/The-NLP-Pandect#-10) | [Text Rank](https://github.com/ivan-bilan/The-NLP-Pandect#text-rank), [Rake](https://github.com/ivan-bilan/The-NLP-Pandect#rake---rapid-automatic-keyword-extraction), [Other Approaches](https://github.com/ivan-bilan/The-NLP-Pandect#other-approaches) | | [Responsible NLP](https://github.com/ivan-bilan/The-NLP-Pandect#-11) | [NLP and ML Interpretability](https://github.com/ivan-bilan/The-NLP-Pandect#nlp-and-ml-interpretability), [Ethics, Bias, and Equality in NLP](https://github.com/ivan-bilan/The-NLP-Pandect#ethics-bias-and-equality-in-nlp), [Adversarial Attacks for 
NLP](https://github.com/ivan-bilan/The-NLP-Pandect#adversarial-attacks-for-nlp) | | [NLP Frameworks](https://github.com/ivan-bilan/The-NLP-Pandect#-12) | [General Purpose](https://github.com/ivan-bilan/The-NLP-Pandect#general-purpose), [Data Augmentation](https://github.com/ivan-bilan/The-NLP-Pandect#data-augmentation), [Machine Translation](https://github.com/ivan-bilan/The-NLP-Pandect#machine-translation), [Adversarial Attacks](https://github.com/ivan-bilan/The-NLP-Pandect#adversarial-nlp-attacks--behavioral-testing), [Dialog Systems & Speech](https://github.com/ivan-bilan/The-NLP-Pandect#dialog-systems-and-speech), [Entity and String Matching](https://github.com/ivan-bilan/The-NLP-Pandect#entity-and-string-matching), [Non-English Frameworks](https://github.com/ivan-bilan/The-NLP-Pandect#non-english-oriented), [Text Annotation](https://github.com/ivan-bilan/The-NLP-Pandect#text-data-labelling) | | [Learning NLP](https://github.com/ivan-bilan/The-NLP-Pandect#-13) | [Courses](https://github.com/ivan-bilan/The-NLP-Pandect#courses), [Books](https://github.com/ivan-bilan/The-NLP-Pandect#books), [Tutorials](https://github.com/ivan-bilan/The-NLP-Pandect#tutorials) | | [NLP Communities](https://github.com/ivan-bilan/The-NLP-Pandect#-14) | - | | [Other NLP Topics](https://github.com/ivan-bilan/The-NLP-Pandect#-15) | [Tokenization](https://github.com/ivan-bilan/The-NLP-Pandect#tokenization), [Data Augmentation](https://github.com/ivan-bilan/The-NLP-Pandect#data-augmentation-and-weak-supervision), [Named Entity Recognition](https://github.com/ivan-bilan/The-NLP-Pandect#named-entity-recognition-ner), [Error Correction](https://github.com/ivan-bilan/The-NLP-Pandect#spell-correction--error-correction), [AutoML/AutoNLP](https://github.com/ivan-bilan/The-NLP-Pandect#automl--autonlp), [Text Generation](https://github.com/ivan-bilan/The-NLP-Pandect#text-generation) | ![The-NLP-Resources](./Resources/Images/pandect_resources.png) ----- > __Note__ > Section keywords: paper 
summaries, compendium, awesome list

#### Compendiums and awesome lists on the topic of NLP:

* 🗂️ [The NLP Index](https://index.quantumstat.com) - Searchable Index of NLP Papers by Quantum Stat / NLP Cypher
* ⭐ [Awesome NLP](https://github.com/keon/awesome-nlp) by [keon](https://github.com/keon) [GitHub, 18674 stars]
* ⭐ [Speech and Natural Language Processing Awesome List](https://github.com/edobashira/speech-language-processing#readme) by [edobashira](https://github.com/edobashira) [GitHub, 2224 stars]
* ⭐ [Awesome Deep Learning for Natural Language Processing (NLP)](https://github.com/brianspiering/awesome-dl4nlp) [GitHub, 1307 stars]
* ⭐ [Text Mining and Natural Language Processing Resources](https://github.com/stepthom/text_mining_resources) by [stepthom](https://github.com/stepthom) [GitHub, 598 stars]
* 🗂️ [Brainsources for #NLP enthusiasts](https://www.notion.so/634eba1a37d34e2baec1bb574a8a5482) by [Philip Vollet](https://www.linkedin.com/in/philipvollet/)
* ⭐ [Awesome AI/ML/DL - NLP Section](https://github.com/neomatrix369/awesome-ai-ml-dl/tree/master/natural-language-processing#natural-language-processing-nlp) [GitHub, 1668 stars]
* 🗂️ [NLP articles](https://devopedia.org/site-map/browse-articles/natural+language+processing) by [Devopedia](https://devopedia.org)
* ⭐ [Awesome LLM Apps](https://github.com/Shubhamsaboo/awesome-llm-apps) [GitHub, 112502 stars]

#### NLP Conferences, Paper Summaries and Paper Compendiums:

##### Papers and Paper Summaries

* ⭐ [100 Must-Read NLP Papers](https://github.com/mhagiwara/100-nlp-papers) [GitHub, 3846 stars]
* ⭐ [NLP Paper Summaries](https://github.com/dair-ai/nlp_paper_summaries) by [dair-ai](https://github.com/dair-ai) [GitHub, 1477 stars]
* ⭐ [Curated collection of papers for the NLP practitioner](https://github.com/mihail911/nlp-library) [GitHub, 1072 stars]
* ⭐ [Papers on Textual Adversarial Attack and Defense](https://github.com/thunlp/TAADpapers) [GitHub, 1574 stars]
* ⭐ [Recent Deep Learning papers in NLU and RL](https://github.com/madrugado/deep-learning-nlp-rl-papers) by Valentin Malykh [GitHub, 297 stars]
* ⭐ [A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers](https://github.com/NiuTrans/ABigSurvey) [GitHub, 2031 stars]
* ⭐ [A Paper List for Style Transfer in Text](https://github.com/fuzhenxin/Style-Transfer-in-Text) [GitHub, 1623 stars]
* 🎥 [Video recordings index for papers](https://papertalk.org/index)

##### Conference Summaries

* ⭐ [NLP top 10 conferences Compendium](https://github.com/soulbliss/NLP-conference-compendium) by [soulbliss](https://github.com/soulbliss) [GitHub, 459 stars]
* 📙 [ICLR 2020 Trends](https://gsarti.com/post/iclr2020-transformers/)
* 📙 [SpacyIRL 2019 Conference in Overview](https://www.linkedin.com/pulse/spacyirl-2019-conference-overview-ivan-bilan/)
* 📙 [Paper Digest](https://www.paperdigest.org/category/nlp/) - Conferences and Papers in Overview

#### NLP Progress and NLP Tasks:

* ⭐ [NLP Progress](https://github.com/sebastianruder/NLP-progress) by [sebastianruder](https://github.com/sebastianruder) [GitHub, 22957 stars]
* ⭐ [NLP Tasks](https://github.com/Kyubyong/nlp_tasks) by [Kyubyong](https://github.com/Kyubyong) [GitHub, 3013 stars]

#### NLP Datasets:

* ⭐ [NLP Datasets](https://github.com/niderhoff/nlp-datasets) by [niderhoff](https://github.com/niderhoff) [GitHub, 5982 stars]
* ⭐ [Datasets](https://github.com/huggingface/datasets) by Hugging Face [GitHub, 21559 stars]
* 🗂️ [Big Bad NLP Database](https://datasets.quantumstat.com)
* ⭐ [UWA Unambiguous Word Annotations](http://danlou.github.io/uwa/) - Word Sense Disambiguation Dataset
* ⭐ [MLDoc](https://github.com/facebookresearch/MLDoc) - Corpus for Multilingual Document Classification in Eight Languages [GitHub, 153 stars]

#### Word and Sentence embeddings:

* ⭐ [Awesome Embedding Models](https://github.com/Hironsan/awesome-embedding-models) by [Hironsan](https://github.com/Hironsan) [GitHub, 1840 stars]
* ⭐ [Awesome list of Sentence
Embeddings](https://github.com/Separius/awesome-sentence-embedding) by [Separius](https://github.com/Separius) [GitHub, 2289 stars] * ⭐ [Awesome BERT](https://github.com/Jiakui/awesome-bert) by [Jiakui](https://github.com/Jiakui) [GitHub, 1842 stars] * ⭐ [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/tree/master) - Retrieval and Retrieval-augmented LLMs [GitHub, 11753 stars] #### Notebooks, Scripts and Repositories * ⭐ [The Super Duper NLP Repo](https://notebooks.quantumstat.com) [Website, 2020] #### Non-English resources and Compendiums * ⭐ [NLP Resources for Bahasa Indonesian](https://github.com/louisowen6/NLP_bahasa_resources) [GitHub, 572 stars] * ⭐ [Indic NLP Catalog](https://github.com/AI4Bharat/indicnlp_catalog) [GitHub, 632 stars] * ⭐ [Pre-trained language models for Vietnamese](https://github.com/VinAIResearch/PhoBERT) [GitHub, 788 stars] * ⭐ [Natural Language Toolkit for Indic Languages (iNLTK)](https://github.com/goru001/inltk) [GitHub, 840 stars] * ⭐ [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) [GitHub, 638 stars] * ⭐ [AI4Bharat-IndicNLP Portal](https://indicnlp.ai4bharat.org) * ⭐ [ARBML](https://github.com/ARBML/ARBML) - Implementation of many Arabic NLP and ML projects [GitHub, 421 stars] * ⭐ [zemberek-nlp](https://github.com/ahmetaa/zemberek-nlp) - NLP tools for Turkish [GitHub, 1331 stars] * ⭐ [TDD AI](https://tdd.ai) - An open-source platform for all Turkish datasets, language models, and NLP tools. 
* ⭐ [KLUE](https://github.com/KLUE-benchmark/KLUE) - Korean Language Understanding Evaluation [GitHub, 595 stars] * ⭐ [Persian NLP Benchmark](https://github.com/Mofid-AI/persian-nlp-benchmark) - benchmark for evaluation and comparison of various NLP tasks in Persian language [GitHub, 76 stars] * ⭐ [nlp-greek](https://github.com/Yuliya-HV/nlp-greek) - Greek language sources [GitHub, 5 stars] * ⭐ [Awesome NLP Resources for Hungarian](https://github.com/oroszgy/awesome-hungarian-nlp) [GitHub, 278 stars] #### Pre-trained NLP models * ⭐ [List of pre-trained NLP models](https://github.com/balavenkatesh3322/NLP-pretrained-model) [GitHub, 170 stars] * ⭐ [Pretrained language models developed by Huawei Noah's Ark Lab](https://github.com/huawei-noah/Pretrained-Language-Model) [GitHub, 3160 stars] * ⭐ [Spanish Language Models and resources](https://github.com/PlanTL-GOB-ES/lm-spanish) [GitHub, 263 stars] #### NLP History ##### General * ⭐ [Modern Deep Learning Techniques Applied to Natural Language Processing](https://github.com/omarsar/nlp_overview) [GitHub, 1322 stars] ##### 2020 Year in Review * 📙 [Natural Language Processing in 2020: The Year In Review](https://www.linkedin.com/pulse/natural-language-processing-2020-year-review-ivan-bilan/) [Blog, December 2020] * 📙 [ML and NLP Research Highlights of 2020](https://www.ruder.io/research-highlights-2020/) [Blog, January 2021] ![The-NLP-Podcasts](./Resources/Images/pandect_lyra.png) ----- [🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents) #### NLP-only podcasts * 🎙️ [NLP Highlights](https://soundcloud.com/nlp-highlights) [Years: 2017 - now, Status: active] * 🎙️ [The NLP Zone](https://open.spotify.com/show/5Q4ONkGcBPHWm2gzNTYt0P) [Years: 2021 - now, Status: active] #### Many NLP episodes * 🎙️ [TWIML AI](https://twimlai.com) [Years: 2016 - now, Status: active] * 🎙️ [Practical AI](https://changelog.com/practicalai) [Years: 2018 - now, Status: active] * 🎙️ [The Data 
Exchange](https://thedataexchange.media) [Years: 2019 - now, Status: active] * 🎙️ [Gradient Dissent](https://www.wandb.com/podcast) [Years: 2020 - now, Status: active] * 🎙️ [Machine Learning Street Talk](https://open.spotify.com/show/02e6PZeIOdpmBGT9THuzwR) [Years: 2020 - now, Status: active] * 🎙️ [DataFramed](https://www.datacamp.com/podcast) - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active] #### Some NLP episodes * 🎙️ [The Super Data Science Podcast](https://www.superdatascience.com/podcast) [Years: 2016 - now, Status: active] * 🎙️ [Data Hack Radio](https://soundcloud.com/datahack-radio) [Years: 2018 - now, Status: active] * 🎙️ [AI Game Changers](https://open.spotify.com/show/7I2fEsGxLa4TdN8zN0T6XN) [Years: 2020, Status: active] * 🎙️ [The Analytics Show](https://anchor.fm/analyticsshow) [Years: 2019 - now, Status: active] ![The-NLP-Newsletter](./Resources/Images/pandect_scroll.png) ----- * 📙 [NLP News](https://www.ruder.io/) by [Sebastian Ruder](https://www.ruder.io/) * 📙 [This Week in NLP by Robert Dale](https://www.language-technology.com/twin) * 📙 [Papers with Code](https://paperswithcode.com) * 📙 [The Batch](https://www.deeplearning.ai/thebatch/) by [deeplearning.ai](https://www.deeplearning.ai/thebatch/) * 📙 [Paper Digest](https://www.paperdigest.org/2020/04/recent-papers-on-question-answering/) by [PaperDigest](https://www.paperdigest.org/daily-paper-digest/) * 📙 [NLP Cypher](https://medium.com/@quantumstat) by [QuantumStat](https://quantumstat.com) ![The-NLP-Meetups](./Resources/Images/pandect_meetups.png) ----- * 🎥 NLP Zurich [[YouTube Recordings](https://www.youtube.com/channel/UCLLX-5j9UNYassOwS0nveDQ)] * 🎥 [Hacking-Machine-Learning](https://www.meetup.com/Hacking-Machine-Learning) [[YouTube Recordings](https://www.youtube.com/channel/UCt5RvrC-_3X7FNAWhORVn7Q)] * 🎥 [NY-NLP (New York)](https://www.meetup.com/NY-NLP/) ![The-NLP-Youtube](./Resources/Images/pandect_youtube.png) ----- 
* 🎥 [Yannic Kilcher](https://www.youtube.com/channel/UCZHmQk67mSJgfCCTn7xBfew) * 🎥 [HuggingFace](https://www.youtube.com/channel/UCHlNU7kIZhRgSbhHvFoy72w) * 🎥 [Kaggle Reading Group](https://www.youtube.com/watch?v=PhTF7yJNR70&list=PLqFaTIg4myu8t5ycqvp7I07jTjol3RCl9) * 🎥 [Rasa Paper Reading](https://www.youtube.com/channel/UCJ0V6493mLvqdiVwOKWBODQ/playlists) * 🎥 [Stanford CS224N: NLP with Deep Learning](https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z) * 🎥 [NLPxing](https://www.youtube.com/channel/UCuGC1JusVvbOGa__qMtH3QA/videos) * 🎥 [ML Explained - A.I. Socratic Circles - AISC](https://www.youtube.com/channel/UCfk3pS8cCPxOgoleriIufyg) * 🎥 [Deeplearning.ai](https://www.youtube.com/channel/UCcIXc5mJsHVYTZR1maL5l9w/featured) * 🎥 [Machine Learning Street Talk](https://www.youtube.com/channel/UCMLtBahI5DMrt0NPvDSoIRQ/featured) ![The-NLP-Benchmarks](./Resources/Images/pandect_benchmark.png) ----- [🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents) ### General NLU * ⭐ [GLUE](https://gluebenchmark.com) - General Language Understanding Evaluation (GLUE) benchmark * ⭐ [SuperGLUE](https://super.gluebenchmark.com) - benchmark styled after GLUE with a new set of more difficult language understanding tasks * ⭐ [decaNLP](https://decanlp.com) - The Natural Language Decathlon (decaNLP) for studying general NLP models * ⭐ [dialoglue](https://github.com/alexa/dialoglue) - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [GitHub, 287 stars] * ⭐ [DynaBench](https://dynabench.org/) - Dynabench is a research platform for dynamic data collection and benchmarking * ⭐ [Big-Bench](https://github.com/google/BIG-bench) - collaborative benchmark for measuring and extrapolating the capabilities of language models [GitHub, 3244 stars] ### Summarization * ⭐ [WikiAsp](https://github.com/neulab/wikiasp) - WikiAsp: Multi-document aspect-based summarization Dataset * ⭐ 
[WikiLingua](https://github.com/esdurmus/Wikilingua) - A Multilingual Abstractive Summarization Dataset ### Question Answering * ⭐ [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) - Stanford Question Answering Dataset (SQuAD) * ⭐ [XQuad](https://github.com/deepmind/xquad) - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering * ⭐ [GrailQA](https://dki-lab.github.io/GrailQA/) - Strongly Generalizable Question Answering (GrailQA) * ⭐ [CSQA](https://amritasaha1812.github.io/CSQA/) - Complex Sequential Question Answering ### Multilingual and Non-English Benchmarks * 📙 [XTREME](https://arxiv.org/abs/2003.11080) - Massively Multilingual Multi-task Benchmark * ⭐ [GLUECoS](https://github.com/microsoft/GLUECoS) - A benchmark for code-switched NLP * ⭐ [IndicGLUE](https://indicnlp.ai4bharat.org/indic-glue/) - Natural Language Understanding Benchmark for Indic Languages * ⭐ [LinCE](https://ritual.uh.edu/lince/) - Linguistic Code-Switching Evaluation Benchmark * ⭐ [Russian SuperGlue](https://github.com/RussianNLP/RussianSuperGLUE) - Russian SuperGlue Benchmark ### Bio, Law, and other scientific domains * ⭐ [BLURB](https://microsoft.github.io/BLURB/) - Biomedical Language Understanding and Reasoning Benchmark * ⭐ [BLUE](https://github.com/ncbi-nlp/BLUE_Benchmark) - Biomedical Language Understanding Evaluation benchmark * ⭐ [LexGLUE](https://github.com/coastalcph/lex-glue) - A Benchmark Dataset for Legal Language Understanding in English ### Transformer Efficiency * ⭐ [Long-Range Arena](https://github.com/google-research/long-range-arena) - Long Range Arena for Benchmarking Efficient Transformers ([Pre-print](https://arxiv.org/abs/2011.04006)) [GitHub, 788 stars] ### Other * ⭐ [CodeXGLUE](https://www.microsoft.com/en-us/research/blog/codexglue-a-benchmark-dataset-and-open-challenge-for-code-intelligence/) - A benchmark dataset for code intelligence * ⭐ [CrossNER](https://github.com/zliucr/CrossNER) - CrossNER: Evaluating Cross-Domain Named 
Entity Recognition * ⭐ [MultiNLI](https://cims.nyu.edu/~sbowman/multinli/) - Multi-Genre Natural Language Inference corpus * ⭐ [iSarcasm: A Dataset of Intended Sarcasm](https://github.com/silviu-oprea/iSarcasm) - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic * ⭐ [SLTev](https://github.com/ELITR/SLTev) - tool for comprehensive evaluation of (simultaneous) spoken language translation [GitHub, 12 stars] ![The-NLP-Research](./Resources/Images/pandect_quill.png) ----- [🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents) ### General * 📙 [A Recipe for Training Neural Networks](https://karpathy.github.io/2019/04/25/recipe/) by Andrej Karpathy [Keywords: research, training, 2019] * 📙 [Recent Advances in NLP via Large Pre-Trained Language Models: A Survey](https://arxiv.org/abs/2111.01243) [Paper, November 2021] ### Embeddings #### Repositories * ⭐ [Pre-trained ELMo Representations for Many Languages](https://github.com/HIT-SCIR/ELMoForManyLangs) [GitHub, 1461 stars] * ⭐ [sense2vec](https://github.com/explosion/sense2vec) - Contextually-keyed word vectors [GitHub, 1673 stars] * ⭐ [wikipedia2vec](https://github.com/wikipedia2vec/wikipedia2vec) [GitHub, 966 stars] * ⭐ [StarSpace](https://github.com/facebookresearch/StarSpace) [GitHub, 3955 stars] * ⭐ [fastText](https://github.com/facebookresearch/fastText) [GitHub, 26531 stars] #### Blogs * 📙 [Language Models and Contextualised Word Embeddings](http://www.davidsbatista.net/blog/2018/12/06/Word_Embeddings/) by David S. 
Batista [Blog, 2018] * 📙 [An Essential Guide to Pretrained Word Embeddings for NLP Practitioners](https://www.analyticsvidhya.com/blog/2020/03/pretrained-word-embeddings-nlp/?utm_source=AVLinkedin&utm_medium=post&utm_campaign=22_may_new_article) by AnalyticsVidhya [Blog, 2020] * 📙 [Polyglot Word Embeddings Discover Language Clusters](http://blog.shriphani.com/2020/02/03/polyglot-word-embeddings-discover-language-clusters/) [Blog, 2020] * 📙 [The Illustrated Word2vec](https://jalammar.github.io/illustrated-word2vec/) by Jay Alammar [Blog, 2019] #### Cross-lingual Word and Sentence Embeddings * ⭐ [vecmap](https://github.com/artetxem/vecmap) - VecMap (cross-lingual word embedding mappings) [GitHub, 654 stars] * ⭐ [sentence-transformers](https://github.com/UKPLab/sentence-transformers) - Multilingual Sentence & Image Embeddings with BERT [GitHub, 18765 stars] #### Byte Pair Encoding * ⭐ [bpemb](https://github.com/bheinzerling/bpemb) - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1220 stars] * ⭐ [subword-nmt](https://github.com/rsennrich/subword-nmt) - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 2273 stars] * ⭐ [python-bpe](https://github.com/soaxelbrooke/python-bpe) - Byte Pair Encoding for Python [GitHub, 232 stars] ### Transformer-based Architectures #### General * 📙 [The Transformer Family](https://lilianweng.github.io/lil-log/2020/04/07/the-transformer-family.html) by Lilian Weng [Blog, 2020] * 📙 [Playing the lottery with rewards and multiple languages](https://arxiv.org/abs/1906.02768) - about the effect of random initialization [ICLR 2020 Paper] * 📙 [Attention? 
Attention!](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html) by Lilian Weng [Blog, 2018] * 📙 [the transformer … “explained”?](https://nostalgebraist.tumblr.com/post/185326092369/the-transformer-explained) [Blog, 2019] * 🎥️ [Attention is all you need; Attentional Neural Network Models](https://www.youtube.com/watch?v=rBCqOTEfxvg) by Łukasz Kaiser [Talk, 2017] * 📙 [Attention Is Off By One](https://www.evanmiller.org/attention-is-off-by-one.html?s=03) [July, 2023] * 🎥️ [Understanding and Applying Self-Attention for NLP](https://www.youtube.com/watch?v=OYygPG4d9H0) [Talk, 2018] * 📙 [The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures](https://arxiv.org/abs/2104.10640) [Paper, April 2021] * 📙 [Pre-Trained Models: Past, Present and Future](https://arxiv.org/abs/2106.07139) [Paper, June 2021] * 📙 [A Survey of Transformers](https://arxiv.org/abs/2106.04554) [Paper, June 2021] #### Transformer * 📙 [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html) by Harvard NLP [Blog, 2018] * 📙 [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/) by Jay Alammar [Blog, 2018] * 📙 [Illustrated Guide to Transformers](https://jinglescode.github.io/2020/05/27/illustrated-guide-transformer/) by Hong Jing [Blog, 2020] * 📙 [Evolution of Representations in the Transformer](https://lena-voita.github.io/posts/emnlp19_evolution.html) by Lena Voita [Blog, 2019] * 📙 [Reformer: The Efficient Transformer](https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html) [Blog, 2020] * 📙 [Longformer — The Long-Document Transformer](https://medium.com/dair-ai/longformer-what-bert-should-have-been-78f4cd595be9) by Viktor Karlsson [Blog, 2020] * 📙 [TRANSFORMERS FROM SCRATCH](http://www.peterbloem.nl/blog/transformers) [Blog, 2019] * 📙 [Transformers in Natural Language Processing — A Brief Survey](https://eigenfoo.xyz/transformers-in-nlp/) by George Ho [Blog, May 2020] * ⭐ [Lite 
Transformer](https://github.com/mit-han-lab/lite-transformer) - Lite Transformer with Long-Short Range Attention [GitHub, 611 stars] * 📙 [Transformers from Scratch](https://e2eml.school/transformers.html) [Blog, Oct 2021] #### BERT * 📙 [A Visual Guide to Using BERT for the First Time](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/) by Jay Alammar [Blog, 2019] * 📙 [The Dark Secrets of BERT](https://text-machine-lab.github.io/blog/2020/bert-secrets/) by Anna Rogers [Blog, 2020] * 📙 [Understanding searches better than ever before](https://www.blog.google/products/search/search-language-understanding-bert/) [Blog, 2019] * 📙 [Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework](https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/) [Blog, 2019] * ⭐ [SemBERT](https://github.com/cooelf/SemBERT) - Semantics-aware BERT for Language Understanding [GitHub, 288 stars] * ⭐ [BERTweet](https://github.com/VinAIResearch/BERTweet) - BERTweet: A pre-trained language model for English Tweets [GitHub, 609 stars] * ⭐ [Optimal Subarchitecture Extraction for BERT](https://github.com/alexa/bort) [GitHub, 470 stars] * ⭐ [CharacterBERT: Reconciling ELMo and BERT](https://github.com/helboukkouri/character-bert) [GitHub, 199 stars] * 📙 [When BERT Plays The Lottery, All Tickets Are Winning](https://thegradient.pub/when-bert-plays-the-lottery-all-tickets-are-winning/) [Blog, Dec 2020] * ⭐ [BERT-related Papers](https://github.com/tomohideshibata/BERT-related-papers) a list of BERT-related papers [GitHub, 2036 stars] #### Other Transformer Variants ##### T5 * 📙 [T5 Understanding Transformer-Based Self-Supervised Architectures](https://medium.com/@rojagtap/t5-text-to-text-transfer-transformer-643f89e8905e) [Blog, August 2020] * 📙 [T5: the Text-To-Text Transfer Transformer](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) [Blog, 2020] * ⭐ 
[multilingual-t5](https://github.com/google-research/multilingual-t5) - Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model [GitHub, 1294 stars] ##### BigBird * 📙 [Big Bird: Transformers for Longer Sequences](https://arxiv.org/abs/2007.14062) original paper by Google Research [Paper, July 2020] ##### Reformer / Linformer / Longformer / Performers * 🎥️ [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451) - [Paper, February 2020] [[Video](https://www.youtube.com/watch?v=xJrKIPwVwGM), October 2020] * 🎥️ [Longformer: The Long-Document Transformer](https://arxiv.org/abs/2004.05150) - [Paper, April 2020] [[Video](https://www.youtube.com/watch?v=_8KNb5iqblE), April 2020] * 🎥️ [Linformer: Self-Attention with Linear Complexity](https://arxiv.org/abs/2006.04768) - [Paper, June 2020] [[Video](https://www.youtube.com/watch?v=-_2AF9Lhweo), June 2020] * 🎥️ [Rethinking Attention with Performers](https://arxiv.org/abs/2009.14794) - [Paper, September 2020] [[Video](https://www.youtube.com/watch?v=0eTULzrOztQ), September 2020] * ⭐ [performer-pytorch](https://github.com/lucidrains/performer-pytorch) - An implementation of Performer, a linear attention-based transformer, in Pytorch [GitHub, 1176 stars] ##### Switch Transformer * 📙 [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961) original paper by Google Research [Paper, January 2021] #### GPT-family ##### General * 📙 [The Illustrated GPT-2](http://jalammar.github.io/illustrated-gpt2/) by Jay Alammar [Blog, 2019] * 📙 [The Annotated GPT-2](https://amaarora.github.io/posts/2020-02-18-annotatedGPT2.html) by Aman Arora * 📙 [OpenAI’s GPT-2: the model, the hype, and the controversy](https://medium.com/data-science/openais-gpt-2-the-model-the-hype-and-the-controversy-1109f4bfd5e8) by Ryan Lowe [Blog, 2019] * 📙 [How to generate text](https://huggingface.co/blog/how-to-generate) by Patrick von Platen [Blog, 2020] ##### GPT-3 ###### Learning 
Resources * 📙 [Zero Shot Learning for Text Classification](https://amitness.com/2020/05/zero-shot-text-classification/) by Amit Chaudhary [Blog, 2020] * 📙 [GPT-3 A Brief Summary](https://leogao.dev/2020/05/29/GPT-3-A-Brief-Summary/) by Leo Gao [Blog, 2020] * 📙 [GPT-3, a Giant Step for Deep Learning And NLP](https://anotherdatum.com/gpt-3.html) by Yoel Zeldes [Blog, June 2020] * 📙 [GPT-3 Language Model: A Technical Overview](https://lambdalabs.com/blog/demystifying-gpt-3/) by Chuan Li [Blog, June 2020] * 📙 [Is it possible for language models to achieve language understanding?](https://medium.com/@ChrisGPotts/is-it-possible-for-language-models-to-achieve-language-understanding-81df45082ee2) by Christopher Potts ###### Applications * ⭐ [Awesome GPT-3](https://github.com/elyase/awesome-gpt3) - list of all resources related to GPT-3 [GitHub, 4536 stars] * 🗂️ [GPT-3 Projects](https://airtable.com/shrndwzEx01al2jHM/tblYMAiGeDLXe35jC) - a map of all GPT-3 start-ups and commercial projects * 🗂️ [GPT-3 Demo Showcase](https://gpt3demo.com/) - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources * 🔱 [OpenAI API](https://platform.openai.com/docs/overview) - API Demo to use OpenAI GPT for commercial applications ###### Open-source Efforts * 📙 [GPT-Neo](https://www.eleuther.ai/artifacts/gpt-neo) - in-progress GPT-3 open source replication [HuggingFace Hub](https://huggingface.co/EleutherAI) * ⭐ [GPT-J](https://github.com/kingoflolz/mesh-transformer-jax/#gpt-j-6b) - A 6 billion parameter, autoregressive text generation model trained on The Pile * 📙 [Effectively using GPT-J with few-shot learning](https://nlpcloud.com/effectively-using-gpt-j-gpt-neo-gpt-3-alternatives-few-shot-learning.html) [Blog, July 2021] #### Other * 📙 [What is Two-Stream Self-Attention in XLNet](https://medium.com/data-science/what-is-two-stream-self-attention-in-xlnet-ebfe013a0cf3) by Xu LIANG [Blog, 2019] * 📙 [Visual Paper Summary: ALBERT (A Lite BERT)](https://amitness.com/2020/02/albert-visual-summary/) 
by Amit Chaudhary [Blog, 2020]
* 📙 [Turing NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/) by Microsoft
* 📙 [Multi-Label Text Classification with XLNet](https://www.kaggle.com/code/mnavaidd/xlnet-multi-class-text-classification-xlnet) by Josh Xin Jie Lee [Blog, 2019]
* ⭐ [ELECTRA](https://github.com/google-research/electra) [GitHub, 2370 stars]
* ⭐ [Performer](https://github.com/lucidrains/performer-pytorch) - implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1176 stars]

#### Distillation, Pruning and Quantization

##### Reading Material

* 📙 [Compression of Deep Learning Models for Text: A Survey](https://arxiv.org/abs/2008.05221) [Paper, April 2021]

##### Tools

* ⭐ [Bert-squeeze](https://github.com/JulesBelveze/bert-squeeze) - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 85 stars]
* ⭐ [XtremeDistil](https://github.com/microsoft/xtreme-distil-transformers) - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 157 stars]

### Automated Summarization

* 📙 [PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization](https://ai.googleblog.com/2020/06/pegasus-state-of-art-model-for.html) by Google AI [Blog, June 2020]
* ⭐ [CTRLsum](https://github.com/salesforce/ctrl-sum) - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 150 stars]
* ⭐ [XL-Sum](https://github.com/csebuetnlp/xl-sum) - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 277 stars]
* ⭐ [SummerTime](https://github.com/Yale-LILY/SummerTime) - an open-source text summarization toolkit for non-experts [GitHub, 280 stars]
* ⭐ [PRIMER](https://github.com/allenai/PRIMER) - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 157 stars]
* ⭐ [summarus](https://github.com/IlyaGusev/summarus) - Models for automatic abstractive
summarization [GitHub, 172 stars]

### Knowledge Graphs and NLP

* 📙 [Fusing Knowledge into Language Model](https://drive.google.com/file/d/1Zgijg9RPxF-tIGWU9nt9rBcryOIB4lOk/view) [Presentation, Oct 2021]

### Model Generation

* ⭐ [smolmodels](https://pypi.org/project/smolmodels) - agentic framework for building ML models from natural language

### Small LLMs

* ⭐ [smollm](https://github.com/huggingface/smollm/blob/main/text/README.md) - 3B parameter language model designed to push the boundaries of small models [GitHub, 3797 stars]

![The-NLP-Industry](./Resources/Images/pandect_industry.png)

-----

> __Note__
> Section keywords: best practices, MLOps

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### Best Practices for building NLP Projects

* 🎥 [In Search of Best Practices for NLP Projects](https://www.youtube.com/watch?v=0S9iai4Ld4I) [[Slides](https://www.dropbox.com/s/4fymdzz4yh3mlyz/NLP_Best_Practices_Bilan.pdf?dl=0), Dec. 2020]
* 🎥 [EMNLP 2020: High Performance Natural Language Processing](https://slideslive.com/38940826) by Google Research [[Recording](https://slideslive.com/38940826), Nov. 2020]
* 📙 [Practical Natural Language Processing](https://www.amazon.com/Practical-Natural-Language-Processing-Pragmatic/dp/1492054054) - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
* 📙 [How to Structure and Manage NLP Projects](https://neptune.ai/blog/how-to-structure-and-manage-nlp-projects-templates) [Blog, May 2021]
* 📙 [Applied NLP Thinking](https://explosion.ai/blog/applied-nlp-thinking) - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
* 🎥 [Introduction to NLP for Industry Use](https://www.youtube.com/watch?v=VRur3xey31s) - DataTalksClub presentation on Introduction to NLP for Industry Use [Recording, December 2021]
* 📙 [Measuring Embedding Drift](https://arize.com/blog/embedding-drift/) - Best practices for monitoring drift of NLP models [Blog, December 2022]
* 📙 [Drift in Machine Learning](https://www.fiddler.ai/blog/drift-in-machine-learning-how-to-identify-issues-before-you-have-a-problem) - How to Identify Issues Before You Have a Problem [Blog, January 2022]

### MLOps for NLP

MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.
In general, MLOps for NLP includes having the following processes in place:

- **Data Versioning** - make sure your training, annotation and other types of data are versioned and tracked
- **Experiment Tracking** - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
- **Model Registry** - make sure any neural models you train are versioned and tracked and it is easy to roll back to any of them
- **Automated Testing and Behavioral Testing** - besides regular unit and integration tests, you want to have behavioral tests that check for bias or potential adversarial attacks
- **Model Deployment and Serving** - automate model deployment, ideally also with zero-downtime deploys like Blue/Green, Canary deploys etc.
- **Data and Model Observability** - track data drift, model accuracy drift etc.

Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:

- **Feature Store** - centralized storage of all features developed for ML models that can be easily reused by any other ML project
- **Metadata Management** - storage for all information related to the usage of ML models, mainly for reproducing behavior of deployed ML models, artifact tracking etc.
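
To make a few of the processes above more concrete, here is a minimal, standard-library-only Python sketch of data versioning (a content fingerprint), experiment tracking (a run record) and a behavioral invariance test. It is an illustration under assumptions, not how any of the tools listed below work internally: `dataset_fingerprint`, `make_run_record`, `invariance_failures` and the keyword-based `toy_sentiment` model are hypothetical names made up for this example.

```python
import hashlib
import json


def dataset_fingerprint(examples):
    """Return a SHA-256 hex digest that changes whenever the data changes."""
    digest = hashlib.sha256()
    for text, label in examples:
        # Serialize each example deterministically before hashing it.
        digest.update(json.dumps([text, label], ensure_ascii=False).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def make_run_record(examples, params, metrics):
    """Bundle the data version, parameters and metrics of one experiment."""
    return {
        "data_version": dataset_fingerprint(examples),
        "params": dict(params),
        "metrics": dict(metrics),
    }


def toy_sentiment(text):
    """Hypothetical stand-in for a real model: keyword-based sentiment."""
    return "positive" if "good" in text.lower() else "negative"


def invariance_failures(model, pairs):
    """Return the text pairs that should get the same prediction but do not."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]


# Toy data and made-up parameter/metric values, for illustration only.
train = [("good movie", "positive"), ("bad movie", "negative")]
record = make_run_record(train, {"lr": 0.001}, {"accuracy": 1.0})

# Behavioral test: casing and small typos should not flip the prediction.
failures = invariance_failures(
    toy_sentiment,
    [
        ("The food was good.", "THE FOOD WAS GOOD."),
        ("Good service", "Goood service"),
    ],
)
```

In this sketch the typo pair ends up in `failures`, which is the kind of robustness gap a behavioral test is meant to surface. In a real project the record would be written to an experiment tracker and the fingerprint replaced by a dedicated data versioning tool, such as the ones listed in the sections below.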
#### MLOps Compilations & Awesome Lists * ⭐ [awesome-mlops](https://github.com/visenger/awesome-mlops) [GitHub, 13923 stars] * ⭐ [best-of-ml-python](https://github.com/ml-tooling/best-of-ml-python) [GitHub, 23609 stars] #### Running LLMs locally or self-hosted * ⭐ [vLLM](https://github.com/vllm-project/vllm) [GitHub, 81616 stars] * ⭐ [llama.cpp](https://github.com/ggml-org/llama.cpp) [GitHub, 114160 stars] * 🔱 [ollama](https://ollama.com/) [Free Local & Paid Cloud Service] #### Reading Material * 📙 [Machine Learning Operations (MLOps): Overview, Definition, and Architecture](https://arxiv.org/abs/2205.02302) [Paper, May 2022] * 📙 [Requirements and Reference Architecture for MLOps:Insights from Industry](https://www.techrxiv.org/doi/full/10.36227/techrxiv.21397413.v1) [Paper, Oct 2022] * 📙 [MLOps: What It Is, Why it Matters, and How To Implement It](https://neptune.ai/blog/mlops-what-it-is-why-it-matters-and-how-to-implement-it-from-a-data-scientist-perspective) by Neptune AI [Blog, July 2021] * 📙 [Best MLOps Tools You Need to Know as a Data Scientist](https://neptune.ai/blog/best-mlops-tools) by Neptune AI [Blog, July 2021] * 📙 [State of MLOps 2021](https://valohai.com/state-of-mlops/#introduction) by Valohai [Blog, August 2021] * 📙 [The MLOps Stack](https://valohai.com/blog/the-mlops-stack/) by Valohai [Blog, October 2020] * 📙 [The Rapid Evolution of the Canonical Stack for Machine Learning](https://medium.com/@ODSC/the-rapid-evolution-of-the-canonical-stack-for-machine-learning-21b37af9c3b5) [Blog, July 2021] * 📙 [MLOps: Comprehensive Beginner’s Guide](https://medium.com/sciforce/mlops-comprehensive-beginners-guide-c235c77f407f) [Blog, March 2021] * 📙 [What I’ve learned about MLOps from speaking with 100+ ML practitioners](https://veselinastaneva.medium.com/what-ive-learned-about-mlops-from-speaking-with-100-ml-practitioners-3025e33458ad) [Blog, May 2021] * 📙 [DataRobot Challenger 
Models](https://www.datarobot.com/blog/introducing-mlops-champion-challenger-models) - MLOps Champion/Challenger Models
* 📙 [State of MLOps Blog](https://www.stateofmlops.com/) by Dr. Ori Cohen
* 📙 [MLOps Ecosystem Overview](https://arize.com/wp-content/uploads/2021/04/Arize-AI-Ecosystem-White-Paper.pdf) [Blog, 2021]
* 📙 [Metrics vs. Inferences](https://www.fiddler.ai/blog/should-enterprises-observe-metrics-or-inferences) - Which should you observe? [Blog, February 2024]

#### Learning Material

* 🗂 [MLOps course](https://madewithml.com/#mlops) by Made With ML
* 🗂 [GitHub MLOps](https://mlops.githubapp.com) - collection of resources on how to facilitate Machine Learning Ops with GitHub
* 🗂 [ML Observability Fundamentals Course](https://arize.com/ml-observability-fundamentals/) - learn how to monitor and root-cause issues with production NLP models

#### MLOps Communities

* [The MLOps Community](https://mlops.community/) - blogs, Slack group, newsletter and more all about MLOps

#### Data Versioning

* ⭐ [DVC](https://dvc.org/) - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] [Link to GitHub](https://github.com/iterative/dvc)
* 🔱 [Weights & Biases](https://wandb.ai/site) - tools for experiment tracking and dataset versioning [Paid Service]

#### Experiment Tracking

* ⭐ [mlflow](https://mlflow.org/) - open source platform for the machine learning lifecycle [Free and Open Source] [Link to GitHub](https://github.com/mlflow/mlflow/)
* 🔱 [Weights & Biases](https://wandb.ai/site) - tools for experiment tracking and dataset versioning [Paid Service]
* 🔱 [Neptune AI](https://neptune.ai/) - experiment tracking and model registry built for research and production teams [Paid Service]
* 🔱 [Comet ML](https://www.comet.ml/site/) - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
* 🔱 [SigOpt](https://sigopt.com/) - automate training & tuning, visualize & compare runs [Paid Service]
* ⭐
[Optuna](https://github.com/optuna/optuna) - hyperparameter optimization framework [GitHub, 14280 stars] * ⭐ [Clear ML](https://clear.ml/) - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] [Link to GitHub](https://github.com/allegroai/clearml/) * ⭐ [Metaflow](https://github.com/Netflix/metaflow) - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 10111 stars] ##### Model Registry * ⭐ [DVC](https://dvc.org/) - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] [Link to GitHub](https://github.com/iterative/dvc) * ⭐ [mlflow](https://mlflow.org/) - open source platform for the machine learning lifecycle [Free and Open Source] [Link to GitHub](https://github.com/mlflow/mlflow/) * ⭐ [ModelDB](https://github.com/VertaAI/modeldb) - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1747 stars] * 🔱 [Neptune AI](https://neptune.ai/) - experiment tracking and model registry built for research and production teams [Paid Service] * 🔱 [Valohai](https://valohai.com/) - End-to-end ML pipelines [Paid Service] * 🔱 [Pachyderm](https://www.pachyderm.com/) - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier] * 🔱 [polyaxon](https://polyaxon.com/) - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service] * 🔱 [Comet ML](https://www.comet.ml/site/) - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service] #### Automated Testing and Behavioral Testing * ⭐ [CheckList](https://github.com/marcotcr/checklist) - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2050 stars] * ⭐ [TextAttack](https://github.com/QData/TextAttack) - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 
3427 stars] * ⭐ [WildNLP](https://github.com/MI2DataLab/WildNLP) - Corrupt an input text to test NLP models' robustness [GitHub, 76 stars] * ⭐ [Great Expectations](https://github.com/great-expectations/great_expectations) - Write tests for your data [GitHub, 11532 stars] * ⭐ [Deepchecks](https://github.com/deepchecks/deepchecks) - Python package for comprehensively validating your machine learning models and data [GitHub, 4017 stars] #### Model Deployability and Serving * ⭐ [mlflow](https://mlflow.org/) - open source platform for the machine learning lifecycle [Free and Open Source] [Link to GitHub](https://github.com/mlflow/mlflow/) * 🔱 [Amazon SageMaker](https://aws.amazon.com/de/sagemaker/) [Paid Service] * 🔱 [Valohai](https://valohai.com/) - End-to-end ML pipelines [Paid Service] * 🔱 [NLP Cloud](https://nlpcloud.com/) - Production-ready NLP API [Paid Service] * 🔱 [Saturn Cloud](https://saturncloud.io/) [Paid Service] * 🔱 [Comet ML](https://www.comet.ml/site/) - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service] * 🔱 [polyaxon](https://polyaxon.com/) - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service] * ⭐ [TorchServe](https://github.com/pytorch/serve) - flexible and easy to use tool for serving PyTorch models [GitHub, 4359 stars] * 🔱 [Kubeflow](https://www.kubeflow.org/) - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars] * ⭐ [KFServing](https://github.com/kubeflow/kfserving) - Serverless Inferencing on Kubernetes [GitHub, 5534 stars] * 🔱 [TFX](https://www.tensorflow.org/tfx) - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Paid Service] * 🔱 [Pachyderm](https://www.pachyderm.com/) - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier] * 🔱 [Cortex](https://www.cortex.dev/) - containers as a service on AWS [Paid Service] * 🔱 [Azure 
Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/#features) - end-to-end machine learning lifecycle [Paid Service] * ⭐ [End2End Serverless Transformers On AWS Lambda](https://github.com/bhavsarpratik/serverless-transformers-on-aws-lambda) [GitHub, 122 stars] * ⭐ [NLP-Service](https://github.com/karndeb/NLP-Service) - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars] * 🔱 [Dagster](https://dagster.io/) - data orchestrator for machine learning [Free and Open Source] * 🔱 [Verta](https://www.verta.ai/) - AI and machine learning deployment and operations [Paid Service] * ⭐ [Metaflow](https://github.com/Netflix/metaflow) - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 10111 stars] * ⭐ [flyte](https://github.com/flyteorg/flyte) - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 7056 stars] * ⭐ [MLRun](https://github.com/mlrun/mlrun) - Machine Learning automation and tracking [GitHub, 1670 stars] * 🔱 [DataRobot MLOps](https://www.datarobot.com/platform/mlops/) - DataRobot MLOps provides a center of excellence for your production AI #### Model Debugging * ⭐ [imodels](https://github.com/csinva/imodels) - package for concise, transparent, and accurate predictive modeling [GitHub, 1592 stars] * ⭐ [Cockpit](https://github.com/f-dangel/cockpit) - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 488 stars] #### Model Accuracy Prediction * ⭐ [WeightWatcher](https://github.com/CalculatedContent/WeightWatcher) - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 1751 stars] #### Data and Model Observability ##### General * ⭐ [Arize AI](https://arize.com/) - embedding drift monitoring for NLP models * ⭐ [Arize-Phoenix](https://phoenix.arize.com/) - ML observability for LLMs, vision, language, and tabular models * ⭐ 
[whylogs](https://github.com/whylabs/whylogs) - open source standard for data and ML logging [GitHub, 2819 stars] * ⭐ [Rubrix](https://github.com/recognai/rubrix) - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 4992 stars] * ⭐ [MLRun](https://github.com/mlrun/mlrun) - Machine Learning automation and tracking [GitHub, 1670 stars] * 🔱 [DataRobot MLOps](https://www.datarobot.com/platform/mlops/) - DataRobot MLOps provides a center of excellence for your production AI * 🔱 [Cortex](https://www.cortex.dev/) - containers as a service on AWS [Paid Service] ##### Model Centric * 🔱 [Algorithmia](https://algorithmia.com/) - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service] * 🔱 [Dataiku](https://www.dataiku.com/) - dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service] * ⭐ [Evidently AI](https://evidentlyai.com/) - tools to analyze and monitor machine learning models [Free and Open Source] [Link to GitHub](https://github.com/evidentlyai/evidently) * 🔱 [Fiddler](https://www.fiddler.ai/) - All-in-one ML and LLM observability. Fastest LLM Guardrails. 
[Paid Service] * 🔱 [Hydrosphere](https://hydrosphere.io/) - open-source platform for managing ML models [Paid Service] * 🔱 [Verta](https://www.verta.ai/) - AI and machine learning deployment and operations [Paid Service] * 🔱 [Domino Model Ops](https://www.dominodatalab.com/product/model-ops/) - Deploy and Manage Models to Drive Business Impact [Paid Service] ##### Data Centric * 🔱 [Datafold](https://www.datafold.com/) - data quality through diffs, profiling, and anomaly detection [Paid Service] * 🔱 [acceldata](https://www.acceldata.io/) - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service] * 🔱 [Bigeye](https://www.bigeye.com/) - monitoring and alerting to your datasets in minutes [Paid Service] * 🔱 [datakin](https://datakin.com/product/) - end-to-end, real-time data lineage solution [Paid Service] * 🔱 [Monte Carlo](https://www.montecarlodata.com/) - data integrity, drifts, schema, lineage [Paid Service] * 🔱 [SODA](https://www.soda.io/) - data monitoring, testing and validation [Paid Service] #### Feature Stores * 🔱 [Tecton](https://www.tecton.ai//) - enterprise feature store for machine learning [Paid Service] * ⭐ [FEAST](https://github.com/feast-dev/feast) - open source feature store for machine learning [Website](https://feast.dev/) [GitHub, 7063 stars] * 🔱 [Hopsworks Feature Store](https://www.hopsworks.ai/feature-store) - data management system for managing machine learning features [Paid Service] #### Metadata Management * ⭐ [ML Metadata](https://github.com/google/ml-metadata) - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 678 stars] * 🔱 [Neptune AI](https://neptune.ai/) - experiment tracking and model registry built for research and production teams [Paid Service] #### MLOps Frameworks * ⭐ [Metaflow](https://github.com/Netflix/metaflow) - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science 
projects [GitHub, 10111 stars] * ⭐ [kedro](https://github.com/quantumblacklabs/kedro) - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 10867 stars] * ⭐ [Seldon Core](https://github.com/SeldonIO/seldon-core) - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 4752 stars] * ⭐ [ZenML](https://github.com/maiot-io/zenml) - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 5429 stars] * 🔱 [Google Vertex AI](https://cloud.google.com/vertex-ai) - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service] * ⭐ [Diffgram](https://github.com/diffgram/diffgram) - Complete training data platform for machine learning delivered as a single application [GitHub, 1904 stars] * 🔱 [Continual.ai](https://www.continual.ai/) - build, deploy, and operationalize ML models easier and faster with a declarative interface on cloud data warehouses like Snowflake, BigQuery, RedShift, and Databricks. 
[Paid Service] ### Transformer-based Architectures [🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents) #### General * 📙 [Why BERT Fails in Commercial Environments](https://www.intel.com/content/www/us/en/artificial-intelligence/posts/bert-commercial-environments.html) by Intel AI [Blog, 2020] * 📙 [Fine Tuning BERT for Text Classification with FARM](https://medium.com/data-science/fine-tuning-bert-for-text-classification-with-farm-2880665065e2) by Sebastian Guggisberg [Blog, 2020] * ⭐ [Pretrain Transformers Models in PyTorch using Hugging Face Transformers](https://github.com/gmihaila/ml_things/blob/master/notebooks/pytorch/pretrain_transformers_pytorch.ipynb) [GitHub, 265 stars] * 🎥️ [Practical NLP for the Real World](https://www.infoq.com/presentations/practical-nlp/) [Presentation, 2019] * 🎥️ [From Paper to Product – How we implemented BERT](https://www.youtube.com/watch?v=VnmKDPBQjJk) by Christoph Henkelmann [Talk, 2020] ##### Multi-GPU Transformers * ⭐ [Parallelformers: An Efficient Model Parallelization Toolkit for Deployment](https://github.com/tunib-ai/parallelformers) [GitHub, 788 stars] ##### Training Transformers Effectively * ⭐ [Training BERT with Compute/Time (Academic) Budget](https://github.com/IntelLabs/academic-budget-bert) [GitHub, 315 stars] ### Embeddings as a Service * ⭐ [embedding-as-service](https://github.com/amansrivastava17/embedding-as-service) [GitHub, 210 stars] * ⭐ [Bert-as-service](https://github.com/hanxiao/bert-as-service) [GitHub, 12830 stars] ### NLP Recipes Industrial Applications: * ⭐ [NLP Recipes](https://github.com/microsoft/nlp-recipes) by [microsoft](https://github.com/microsoft) [GitHub, 6438 stars] * ⭐ [NLP with Python](https://github.com/susanli2016/NLP-with-Python) by [susanli2016](https://github.com/susanli2016) [GitHub, 2790 stars] * ⭐ [Basic Utilities for PyTorch NLP](https://github.com/PetrochukM/PyTorch-NLP) by [PetrochukM](https://github.com/PetrochukM) [GitHub, 2229 
stars] ### NLP Applications in Bio, Finance, Legal and other industries * ⭐ [Blackstone](https://github.com/ICLRandD/Blackstone) - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 689 stars] * ⭐ [Sci spaCy](https://github.com/allenai/scispacy) - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1960 stars] * ⭐ [FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks](https://github.com/psnonis/FinBERT) [GitHub, 204 stars] * ⭐ [LexNLP](https://github.com/LexPredict/lexpredict-lexnlp) - Information retrieval and extraction for real, unstructured legal text [GitHub, 781 stars] * ⭐ [NerDL and NerCRF](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/blogposts/data_prep.ipynb) - Tutorial on Named Entity Recognition for Healthcare with SparkNLP * ⭐ [Legal Text Analytics](https://github.com/Liquid-Legal-Institute/Legal-Text-Analytics) - A list of selected resources dedicated to Legal Text Analytics [GitHub, 720 stars] * ⭐ [BioIE](https://github.com/caufieldjh/awesome-bioie) - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 441 stars] ![The-NLP-Speech](./Resources/Images/pandect_speech.png) ----- > __Note__ > Section keywords: speech recognition [🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents) ### General Speech Recognition * ⭐ [wav2letter](https://github.com/facebookresearch/wav2letter) - Automatic Speech Recognition Toolkit [GitHub, 6444 stars] * ⭐ [DeepSpeech](https://github.com/mozilla/DeepSpeech) - Baidu's DeepSpeech architecture [GitHub, 26750 stars] * 📙 [Acoustic Word Embeddings](https://medium.com/@maobedkova/acoustic-word-embeddings-fc3f1a8f0519) by Maria Obedkova [Blog, 2020] * ⭐ [kaldi](https://github.com/kaldi-asr/kaldi) - Kaldi is a toolkit for speech recognition [GitHub, 15401 stars] * ⭐ [awesome-kaldi](https://github.com/YoavRamon/awesome-kaldi) - resources for using Kaldi [GitHub, 538 stars] * ⭐ 
[ESPnet](https://github.com/espnet/espnet) - End-to-End Speech Processing Toolkit [GitHub, 9850 stars] * 📙 [HuBERT](https://ai.facebook.com/blog/hubert-self-supervised-representation-learning-for-speech-recognition-generation-and-compression) - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021] ### Text to Speech / Speech Generation * ⭐ [FastSpeech](https://github.com/xcmyz/FastSpeech) - The Implementation of FastSpeech based on pytorch [GitHub, 880 stars] * ⭐ [TTS](https://github.com/coqui-ai/TTS) - a deep learning toolkit for Text-to-Speech [GitHub, 45454 stars] * 🔱 [NotebookLM](https://notebooklm.google/) - Google Gemini powered personal assistant / podcast generator ### Speech to Text * ⭐ [whisper](https://github.com/openai/whisper) - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 101153 stars] * ⭐ [vibe](https://github.com/thewh1teagle/vibe) - GUI tool to work with whisper, multilingual and cuda support included [GitHub, 6317 stars] ### Datasets * ⭐ [VoxPopuli](https://github.com/facebookresearch/voxpopuli) - large-scale multilingual speech corpus for representation learning [GitHub, 573 stars] ![The-NLP-Topics](./Resources/Images/pandect_topics.png) ----- > __Note__ > Section keywords: topic modeling [🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents) ### Blogs * 📙 [Topic Modelling with PySpark and Spark NLP](https://medium.com/trustyou-engineering/topic-modelling-with-pyspark-and-spark-nlp-a99d063f1a6e) by Maria Obedkova [Spark, Blog, 2020] * 📙 [A Unique Approach to Short Text Clustering (Algorithmic Theory)](https://medium.com/data-science/a-unique-approach-to-short-text-clustering-part-1-algorithmic-theory-4d4fad0882e1) by Brittany Bowers [Blog, 2020] ### Frameworks for Topic Modeling * ⭐ [gensim](https://github.com/RaRe-Technologies/gensim) - framework for topic modeling [GitHub, 16421 stars] * ⭐ [Spark 
NLP](https://github.com/JohnSnowLabs/spark-nlp) [GitHub, 4131 stars]

### Repositories

* ⭐ [Top2Vec](https://github.com/ddangelov/Top2Vec) [GitHub, 3107 stars]
* ⭐ [Anchored Correlation Explanation Topic Modeling](https://github.com/gregversteeg/CorEx) [GitHub, 307 stars]
* ⭐ [Topic Modeling in Embedding Spaces](https://github.com/adjidieng/ETM) [GitHub, 561 stars] [Paper](https://arxiv.org/abs/1907.04907)
* ⭐ [TopicNet](https://github.com/machine-intelligence-laboratory/TopicNet) - A high-level interface for BigARTM library [GitHub, 143 stars]
* ⭐ [BERTopic](https://github.com/MaartenGr/BERTopic) - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 7655 stars]
* ⭐ [OCTIS](https://github.com/MIND-Lab/OCTIS) - A Python package to optimize and evaluate topic models [GitHub, 802 stars]
* ⭐ [Contextualized Topic Models](https://github.com/MilaNLProc/contextualized-topic-models) [GitHub, 1269 stars]
* ⭐ [GSDMM](https://github.com/rwalk/gsdmm) - GSDMM: Short text clustering [GitHub, 359 stars]

![Keyword-Extraction](./Resources/Images/pandect_papyrus2.png)

-----

> __Note__
> Section keywords: keyword extraction

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### Text Rank

* ⭐ [PyTextRank](https://github.com/DerwenAI/pytextrank) - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 2213 stars]
* ⭐ [textrank](https://github.com/summanlp/textrank) - TextRank implementation for Python 3 [GitHub, 1269 stars]

### RAKE - Rapid Automatic Keyword Extraction

* ⭐ [rake-nltk](https://github.com/csurfer/rake-nltk) - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 1084 stars]
* ⭐ [yake](https://github.com/LIAAD/yake) - Single-document unsupervised keyword extraction [GitHub, 1861 stars]
* ⭐ [RAKE-tutorial](https://github.com/zelandiya/RAKE-tutorial) - A Python implementation of Rapid Automatic Keyword Extraction [GitHub, 372 stars]

### Other Approaches

* ⭐ [flashtext](https://github.com/vi3k6i5/flashtext) - Extract keywords from sentences or replace keywords in sentences [GitHub, 5714 stars]
* ⭐ [BERT-Keyword-Extractor](https://github.com/ibatra/BERT-Keyword-Extractor) - Deep Keyphrase Extraction using BERT [GitHub, 261 stars]
* ⭐ [keyBERT](https://github.com/MaartenGr/KeyBERT) - Minimal keyword extraction with BERT [GitHub, 4182 stars]
* ⭐ [KeyphraseVectorizers](https://github.com/TimSchopf/KeyphraseVectorizers) - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 267 stars]

### Further Reading

* 📙 [Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts](https://howard-haowen.github.io/blog.ai/keyword-extraction/spacy/textacy/ckip-transformers/jieba/textrank/rake/2021/02/16/Adding-a-custom-tokenizer-to-spaCy-and-extracting-keywords.html) by Haowen Jiang [Blog, Feb 2021]
* 📙 [How to Extract Relevant Keywords with KeyBERT](https://medium.com/data-science/how-to-extract-relevant-keywords-with-keybert-6e7b3cf889ae) [Blog, June 2021]

![Responsible-NLP](./Resources/Images/pandect_pegasus.png)

-----

> __Note__
> Section keywords: ethics, responsible NLP

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### NLP and ML Interpretability

#### NLP-centric

* 🎥️ [Explainability for Natural Language Processing - KDD'2021 Tutorial](https://www.youtube.com/watch?v=PvKOSYGclPk&t=2s) [Slides](https://www.slideshare.net/YunyaoLi/explainability-for-natural-language-processing-249992241) [Presentation, August 2021]
* ⭐ [ecco](https://github.com/jalammar/ecco) - Tools to visualize and explore NLP language models [GitHub, 2102 stars]
* ⭐ [NLP Profiler](https://github.com/neomatrix369/nlp_profiler) - A simple NLP library that allows profiling datasets with text columns [GitHub, 244 stars]
* ⭐
[transformers-interpret](https://github.com/cdpierse/transformers-interpret) - Model explainability that works seamlessly with transformers [GitHub, 1412 stars]
* ⭐ [Awesome-explainable-AI](https://github.com/wangyongjie-ntu/Awesome-explainable-AI) - collection of research materials on explainable AI/ML [GitHub, 1643 stars]
* ⭐ [LAMA](https://github.com/facebookresearch/LAMA) - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 1387 stars]

#### General

* ⭐ [Language Interpretability Tool (LIT)](https://github.com/PAIR-code/lit) [GitHub, 3654 stars]
* ⭐ [WhatLies](https://github.com/RasaHQ/whatlies) - Toolkit to help visualise what lies in word embeddings [GitHub, 480 stars]
* ⭐ [Interpret-Text](https://github.com/interpretml/interpret-text) - Interpretability techniques and visualization dashboards for NLP models [GitHub, 432 stars]
* ⭐ [InterpretML](https://github.com/interpretml/interpret) - Fit interpretable models. Explain blackbox machine learning [GitHub, 6866 stars]
* ⭐ [thermostat](https://github.com/DFKI-NLP/thermostat) - Collection of NLP model explanations and accompanying analysis tools [GitHub, 143 stars]
* ⭐ [Dodrio](https://github.com/poloclub/dodrio) - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 372 stars]
* ⭐ [imodels](https://github.com/csinva/imodels) - package for concise, transparent, and accurate predictive modeling [GitHub, 1592 stars]

### Ethics, Bias, and Equality in NLP

* 📙 [Bias in Natural Language Processing @EMNLP 2020](https://gaurav-maheshwari.medium.com/bias-in-natural-language-processing-emnlp-2020-8f1cb2806fcc#cc1a) [Blog, Nov 2020]
* 🎥️ [Machine Learning as a Software Engineering Enterprise](https://nips.cc/virtual/2020/public/invited_16166.html) - NeurIPS 2020 Keynote [Presentation, Dec 2020]
* 🗂️ [Ethics in NLP](https://aclweb.org/aclwiki/Ethics_in_NLP) - resources from ACL's Ethics in NLP track
* 🗂️ [The Institute
for Ethical AI & Machine Learning](https://ethical.institute)
* 📙 [Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models](https://arxiv.org/abs/2102.02503) [Paper, Feb 2021]
* ⭐ [Fairness-in-AI](https://github.com/dreji18/Fairness-in-AI) - this package is used to detect and mitigate biases in NLP tasks [GitHub, 101 stars]
* ⭐ [nlg-bias](https://github.com/ewsheng/nlg-bias) - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 72 stars]
* 🗂️ [bias-in-nlp](https://github.com/cisnlp/bias-in-nlp) - list of papers related to bias in NLP [GitHub, 11 stars]

### Adversarial Attacks for NLP

* 📙 [Privacy Considerations in Large Language Models](https://ai.googleblog.com/2020/12/privacy-considerations-in-large.html?m=1) [Blog, Dec 2020]
* ⭐ [DeepWordBug](https://github.com/QData/deepWordBug) - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 85 stars]
* ⭐ [Adversarial-Misspellings](https://github.com/danishpruthi/Adversarial-Misspellings) - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 64 stars]

### Hate Speech Analysis

* ⭐ [HateXplain](https://github.com/hate-alert/HateXplain) - BERT for detecting abusive language [GitHub, 240 stars]

### NLP & Security

* ⭐ [vuln2vec](https://github.com/aissa302/vuln2vec) - domain-specific Word2Vec model for cybersecurity text mining and NLP research [GitHub, 2 stars]

![The-NLP-Frameworks](./Resources/Images/pandect_frameworks.png)

-----

> __Note__
> Section keywords: frameworks

[🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents)

### General Purpose

* ⭐ [spaCy](https://github.com/explosion/spaCy) by Explosion AI [GitHub, 33624 stars]
* ⭐ [flair](https://github.com/flairNLP/flair) by Zalando [GitHub, 14377 stars]
* ⭐ [AllenNLP](https://github.com/allenai/allennlp) by AI2 [GitHub, 11897 stars]
* ⭐ [stanza](https://github.com/stanfordnlp/stanza) (formerly
Stanford NLP) [GitHub, 7806 stars] * ⭐ [spaCy stanza](https://github.com/explosion/spacy-stanza) [GitHub, 748 stars] * ⭐ [nltk](https://github.com/nltk/nltk) [GitHub, 14634 stars] * ⭐ [gensim](https://github.com/RaRe-Technologies/gensim) - framework for topic modeling [GitHub, 16421 stars] * ⭐ [pororo](https://github.com/kakaobrain/pororo) - Platform of neural models for natural language processing [GitHub, 1307 stars] * ⭐ [NLP Architect](https://github.com/NervanaSystems/nlp-architect) - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2933 stars] * ⭐ [FARM](https://github.com/deepset-ai/FARM) [GitHub, 1754 stars] * ⭐ [gobbli](https://github.com/RTIInternational/gobbli) by RTI International [GitHub, 274 stars] * ⭐ [headliner](https://github.com/as-ideas/headliner) - training and deployment of seq2seq models [GitHub, 228 stars] * ⭐ [SyferText](https://github.com/OpenMined/SyferText) - A privacy preserving NLP framework [GitHub, 198 stars] * ⭐ [DeText](https://github.com/linkedin/detext) - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1265 stars] * ⭐ [TextHero](https://github.com/jbesomi/texthero) - Text preprocessing, representation and visualization [GitHub, 2911 stars] * ⭐ [textblob](https://github.com/sloria/textblob) - TextBlob: Simplified Text Processing [GitHub, 9536 stars] * ⭐ [AdaptNLP](https://github.com/Novetta/adaptnlp) - A high level framework and library for NLP [GitHub, 407 stars] * ⭐ [textacy](https://github.com/chartbeat-labs/textacy) - NLP, before and after spaCy [GitHub, 2241 stars] * ⭐ [texar](https://github.com/asyml/texar) - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2391 stars] * ⭐ [jiant](https://github.com/nyu-mll/jiant) - jiant is an NLP toolkit [GitHub, 1675 stars] ### Data Augmentation * ⭐ [WildNLP](https://github.com/MI2DataLab/WildNLP) Text manipulation library to test NLP models [GitHub, 76 stars] * ⭐ 
[snorkel](https://github.com/snorkel-team/snorkel) Framework to generate training data [GitHub, 5970 stars] * ⭐ [NLPAug](https://github.com/makcedward/nlpaug) Data augmentation for NLP [GitHub, 4657 stars] * ⭐ [SentAugment](https://github.com/facebookresearch/SentAugment) Data augmentation by retrieving similar sentences from larger datasets [GitHub, 358 stars] * ⭐ [faker](https://github.com/joke2k/faker) - Python package that generates fake data for you [GitHub, 19253 stars] * ⭐ [textflint](https://github.com/textflint/textflint) - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 651 stars] * ⭐ [Parrot](https://github.com/PrithivirajDamodaran/Parrot_Paraphraser) - Practical and feature-rich paraphrasing framework [GitHub, 919 stars] * ⭐ [AugLy](https://github.com/facebookresearch/AugLy) - data augmentations library for audio, image, text, and video [GitHub, 5084 stars] * ⭐ [TextAugment](https://github.com/dsfsi/textaugment) - Python 3 library for augmenting text for natural language processing applications [GitHub, 438 stars] ### Adversarial NLP Attacks & Behavioral Testing * ⭐ [TextAttack](https://github.com/QData/TextAttack) - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 3427 stars] * ⭐ [CleverHans](https://github.com/tensorflow/cleverhans) - adversarial example library for constructing NLP attacks and building defenses [GitHub, 6438 stars] * ⭐ [CheckList](https://github.com/marcotcr/checklist) - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2050 stars] ### Transformer-oriented * ⭐ [transformers](https://github.com/huggingface/transformers) by HuggingFace [GitHub, 161166 stars] * ⭐ [Adapter Hub](https://github.com/Adapter-Hub/adapter-transformers) and its [documentation](https://docs.adapterhub.ml/index.html) - Adapter modules for Transformers [GitHub, 2812 stars] * ⭐ [haystack](https://github.com/deepset-ai/haystack) - Transformers at scale for question answering & neural search. 
[GitHub, 25432 stars] ### Dialogue Systems and Speech, Voice Agents * ⭐ [DeepPavlov](https://github.com/deepmipt/DeepPavlov) by MIPT [GitHub, 6986 stars] * ⭐ [ParlAI](https://github.com/facebookresearch/ParlAI) by FAIR [GitHub, 10627 stars] * ⭐ [rasa](https://github.com/RasaHQ/rasa) - Framework for Conversational Agents [GitHub, 21190 stars] * ⭐ [wav2letter](https://github.com/facebookresearch/wav2letter) - Automatic Speech Recognition Toolkit [GitHub, 6444 stars] * ⭐ [ChatterBot](https://github.com/gunthercox/ChatterBot) - conversational dialog engine for creating chatbots [GitHub, 14488 stars] * ⭐ [SpeechBrain](https://github.com/speechbrain/speechbrain) - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 11581 stars] * ⭐ [dialoguefactory](https://github.com/smartinovski/dialoguefactory/tree/main) Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars] * ⭐ [gabber](https://github.com/gabber-dev/gabber) AI applications that can see, hear, and speak using your screens, microphones [GitHub, 1103 stars] ### Word/Sentence-embeddings oriented * ⭐ [MUSE](https://github.com/facebookresearch/MUSE) A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 3246 stars] * ⭐ [vecmap](https://github.com/artetxem/vecmap) A framework to learn cross-lingual word embedding mappings [GitHub, 654 stars] * ⭐ [sentence-transformers](https://github.com/UKPLab/sentence-transformers) - Multilingual Sentence & Image Embeddings with BERT [GitHub, 18765 stars] ### Social Media Oriented * ⭐ [Ekphrasis](https://github.com/cbaziotis/ekphrasis) - text processing tool, geared towards text from social networks [GitHub, 676 stars] ### Phonetics * ⭐ [DeepPhonemizer](https://github.com/as-ideas/DeepPhonemizer) - grapheme to phoneme conversion with deep learning [GitHub, 426 stars] ### Morphology * ⭐ [LemmInflect](https://github.com/bjascob/LemmInflect) - python module for English lemmatization and inflection [GitHub, 280 stars] * ⭐ 
[Inflect](https://github.com/jaraco/inflect) - generate plurals, ordinals, indefinite articles [GitHub, 1076 stars] * ⭐ [simplemma](https://github.com/adbar/simplemma) - simple multilingual lemmatizer for Python [GitHub, 1076 stars] ### Multi-lingual tools * ⭐ [polyglot](https://github.com/aboSamoor/polyglot) - Multi-lingual NLP Framework [GitHub, 2367 stars] * ⭐ [trankit](https://github.com/nlp-uoregon/trankit) - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 795 stars] ### Distributed NLP / Multi-GPU NLP * ⭐ [Spark NLP](https://github.com/JohnSnowLabs/spark-nlp) [GitHub, 4131 stars] * ⭐ [Parallelformers: An Efficient Model Parallelization Toolkit for Deployment](https://github.com/tunib-ai/parallelformers) [GitHub, 788 stars] ### Machine Translation * ⭐ [COMET](https://github.com/Unbabel/COMET) - A Neural Framework for MT Evaluation [GitHub, 756 stars] * ⭐ [marian-nmt](https://github.com/marian-nmt/marian) - Fast Neural Machine Translation in C++ [GitHub, 1449 stars] * ⭐ [argos-translate](https://github.com/argosopentech/argos-translate) - Open source neural machine translation in Python [GitHub, 6092 stars] * ⭐ [Opus-MT](https://github.com/Helsinki-NLP/Opus-MT) - Open neural machine translation models and web services [GitHub, 819 stars] * ⭐ [dl-translate](https://github.com/xhlulu/dl-translate) - A deep learning-based translation library built on Huggingface transformers [GitHub, 497 stars] * ⭐ [CTranslate2](https://github.com/OpenNMT/CTranslate2) - CTranslate2 end-to-end machine translation [GitHub, 4507 stars] ### Entity and String Matching * ⭐ [PolyFuzz](https://github.com/MaartenGr/PolyFuzz) - Fuzzy string matching, grouping, and evaluation [GitHub, 797 stars] * ⭐ [pyahocorasick](https://github.com/WojciechMula/pyahocorasick) - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 1104 stars] * ⭐ [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy String Matching in Python [GitHub, 9259 stars] * ⭐ 
[jellyfish](https://github.com/jamesturk/jellyfish) - approximate and phonetic matching of strings [GitHub, 2215 stars] * ⭐ [textdistance](https://github.com/life4/textdistance) - Compute distance between sequences [GitHub, 3534 stars] * ⭐ [DeepMatcher](https://github.com/anhaidgroup/deepmatcher) - Python package for entity and text matching using deep learning [GitHub, 617 stars] * ⭐ [RE2](https://github.com/alibaba-edu/simple-effective-text-matching) - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 341 stars] * ⭐ [Machamp](https://github.com/megagonlabs/machamp) - Machamp: A Generalized Entity Matching Benchmark [GitHub, 21 stars] * ⭐ [bge-m3](https://huggingface.co/BAAI/bge-m3) - BGE-M3 hybrid retrieval + re-ranking [GitHub](https://github.com/FlagOpen/FlagEmbedding/tree/master/FlagEmbedding) [GitHub, 11753 stars] * 📙 [Cosine_Similarity_Explainer](https://huggingface.co/spaces/uumerrr684/Cosine_Similarity_Explainer) - Semantic Similarity Explainer with AI ### Discourse Analysis * ⭐ [ConvoKit](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit) - Cornell Conversational Analysis Toolkit [GitHub, 635 stars] ### PII scrubbing * ⭐ [scrubadub](https://github.com/LeapBeyond/scrubadub) - Clean personally identifiable information from dirty dirty text [GitHub, 425 stars] ### Hashtag Segmentation * ⭐ [hashformers](https://github.com/ruanchaves/hashformers) - automatically inserting the missing spaces between the words in a hashtag [GitHub, 77 stars] ### Books Analysis / Literary Analysis / Semantic Search * ⭐ [booknlp](https://github.com/booknlp/booknlp) - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 918 stars] * ⭐ [bookworm](https://github.com/harrisonpim/bookworm) - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 76 stars] * ⭐ [SemanticFinder](https://github.com/do-me/SemanticFinder) - frontend-only live semantic search with transformers.js [GitHub, 
327 stars] ### Non-English oriented #### Japanese * ⭐ [fugashi](https://github.com/polm/fugashi) - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 521 stars] * ⭐ [SudachiPy](https://github.com/WorksApplications/SudachiPy) - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 437 stars] * ⭐ [Konoha](https://github.com/himkt/konoha) - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 261 stars] * ⭐ [jProcessing](https://github.com/kevincobain2000/jProcessing) - Japanese Natural Language Processing Libraries [GitHub, 147 stars] * ⭐ [Ginza](https://github.com/megagonlabs/ginza) - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 854 stars] * ⭐ [kuromoji](https://github.com/atilika/kuromoji) - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 1043 stars] * ⭐ [nagisa](https://github.com/taishi-i/nagisa) - Japanese tokenizer based on recurrent neural networks [GitHub, 417 stars] * ⭐ [KyTea](https://github.com/neubig/kytea) - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 213 stars] * ⭐ [Jigg](https://github.com/mynlp/jigg) - Pipeline framework for easy natural language processing [GitHub, 77 stars] * ⭐ [Juman++](https://github.com/ku-nlp/jumanpp) - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 414 stars] * ⭐ [RakutenMA](https://github.com/rakuten-nlp/rakutenma) - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 472 stars] * ⭐ [toiro](https://github.com/taishi-i/toiro) - a comparison tool of Japanese tokenizers [GitHub, 123 stars] #### Thai * ⭐ [AttaCut](https://github.com/PyThaiNLP/attacut) - Fast and Reasonably Accurate Word Tokenizer for Thai [GitHub, 96 stars] * ⭐ [ThaiLMCut](https://github.com/meanna/ThaiLMCUT) - Word 
Tokenizer for Thai Language [GitHub, 17 stars] #### Chinese * ⭐ [Spacy-pkuseg](https://github.com/explosion/spacy-pkuseg) - The pkuseg toolkit for multi-domain Chinese word segmentation [GitHub, 70 stars] #### Ukrainian * ⭐ [recruitment-dataset](https://github.com/Stereotypes-in-LLMs/recruitment-dataset) - Recruitment Dataset Preprocessing and Recommender System (Ukrainian, English) #### Other * ⭐ [textblob-de](https://github.com/markuskiller/textblob-de) - TextBlob: Simplified Text Processing for German [GitHub, 103 stars] * ⭐ [Kashgari](https://github.com/BrikerMan/Kashgari) Transfer Learning with focus on Chinese [GitHub, 2384 stars] * ⭐ [Underthesea](https://github.com/undertheseanlp/underthesea) - Vietnamese NLP Toolkit [GitHub, 1737 stars] * ⭐ [PTT5](https://github.com/unicamp-dl/PTT5) - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 91 stars] ### Text Data Labelling & Classification * ⭐ [Small-Text](https://github.com/webis-de/small-text) - Active Learning for Text Classification in Python [GitHub, 643 stars] * ⭐ [Doccano](https://github.com/doccano/doccano) - open source annotation tool for machine learning practitioners [GitHub, 10659 stars] * ⭐ [Adala](https://github.com/HumanSignal/Adala) - Autonomous DAta (Labeling) Agent framework [GitHub, 1593 stars] * ⭐ [EDA](https://github.com/jasonwei20/eda_nlp) - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1652 stars] * 🔱 [Prodigy](https://prodi.gy/) - annotation tool powered by active learning [Paid Service] ![The-NLP-Learning](./Resources/Images/pandect_learning.png) ----- > __Note__ > Section keywords: learn NLP [🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents) #### General * 📙 [Learn NLP the practical way](https://towardsdatascience.com/learn-nlp-the-practical-way-b854ce1035c4) [Blog, Nov. 
2019] * 📙 [Learn NLP the Stanford way](https://medium.com/data-science/learn-nlp-the-stanford-way-lesson-1-3f1844265760) ([+Part 2](https://medium.com/data-science/learn-nlp-the-stanford-way-lesson-2-7447f2c12b36)) [Blog, Nov 2020] * 📙 [Choosing the right course for a Practical NLP Engineer](https://airev.us/ultimate-guide-to-natural-language-processing-courses/) * 📙 [12 Best Natural Language Processing Courses & Tutorials to Learn Online](https://blog.coursesity.com/best-natural-language-processing-courses/) * ⭐ [Treasure of Transformers](https://github.com/ashishpatel26/Treasure-of-Transformers) - Natural Language processing papers, videos, blogs, official repos along with colab Notebooks [GitHub, 1143 stars] * 🎥️ [Rasa Algorithm Whiteboard](https://www.youtube.com/playlist?list=PL75e0qA87dlG-za8eLI6t0_Pbxafk-cxb) - YouTube series by Rasa explaining various Data Science and NLP Algorithms * 🎥️ [ExplosionAI Videos](https://www.youtube.com/c/ExplosionAI/videos) - YouTube series by ExplosionAI teaching you how to use spaCy and apply it to NLP #### Courses * 🎥️ [CS25: Transformers United Stanford - Fall 2021](https://web.stanford.edu/class/cs25/) [Course, Fall 2021] * 📙 [NLP Course | For You](https://lena-voita.github.io/nlp_course.html) - Great and interactive course on NLP * 📙 [Advanced NLP with spaCy](https://course.spacy.io/en/) - how to use spaCy to build advanced natural language understanding systems * 📙 [Transformer models for NLP](https://huggingface.co/course/chapter1) by HuggingFace * 🎥️ [Stanford NLP Seminar](https://nlp.stanford.edu/seminar/) - slides from the Stanford NLP course #### Books * 📙 [Applied Natural Language Processing in the Enterprise](https://www.oreilly.com/library/view/applied-natural-language/9781492062561/) - [Book, May 2021] * 📙 [Practical Natural Language Processing](https://www.oreilly.com/library/view/practical-natural-language/9781492054047/) - [Book, June 2020] * 📙 [Dive into Deep Learning](https://d2l.ai/index.html) - An 
interactive deep learning book with code, math, and discussions * 📙 [Natural Language Processing and Computational Linguistics](https://www.amazon.de/Natural-Language-Processing-Computational-Linguistics/dp/1848218486) - Speech, Morphology and Syntax (Cognitive Science) #### Tutorials * ⭐ [nlp-tutorial](https://github.com/lyeoni/nlp-tutorial) - A list of NLP(Natural Language Processing) tutorials built on PyTorch [GitHub, 1374 stars] * ⭐ [nlp-tutorial](https://github.com/graykode/nlp-tutorial) - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 14901 stars] * ⭐ [Hands-On NLTK Tutorial](https://github.com/hb20007/hands-on-nltk-tutorial) [GitHub, 572 stars] * ⭐ [Modern Practical Natural Language Processing](https://github.com/jmugan/modern_practical_nlp) [GitHub, 266 stars] * ⭐ [Transformers-Tutorials](https://github.com/NielsRogge/Transformers-Tutorials) - demos with the Transformers library by HuggingFace [GitHub, 11633 stars] * 🗂️ [CalmCode Tutorials](https://calmcode.io/#science) - Set of Python Data Science Tutorials ![The-NLP-Communities](./Resources/Images/pandect_communities.png) ----- * [r/LanguageTechnology](https://www.reddit.com/r/LanguageTechnology/) - NLP Reddit forum ![Other-NLP-Topics](Resources/Images/pandect_papyrus_other.png) ----- [🔙 Back to the Table of Contents](https://github.com/ivan-bilan/The-NLP-Pandect#table-of-contents) #### Tokenization * ⭐ [tokenizers](https://github.com/huggingface/tokenizers) - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 10782 stars] * ⭐ [SentencePiece](https://github.com/google/sentencepiece) - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 11870 stars] * ⭐ [SoMaJo](https://github.com/tsproisl/SoMaJo) - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 152 stars] #### Data Augmentation and Weak Supervision ##### Libraries and Frameworks * ⭐ 
[WildNLP](https://github.com/MI2DataLab/WildNLP) Text manipulation library to test NLP models [GitHub, 76 stars] * ⭐ [NLPAug](https://github.com/makcedward/nlpaug) Data augmentation for NLP [GitHub, 4657 stars] * ⭐ [SentAugment](https://github.com/facebookresearch/SentAugment) Data augmentation by retrieving similar sentences from larger datasets [GitHub, 358 stars] * ⭐ [TextAttack](https://github.com/QData/TextAttack) - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 3427 stars] * ⭐ [skweak](https://github.com/NorskRegnesentral/skweak) - software toolkit for weak supervision applied to NLP tasks [GitHub, 927 stars] * ⭐ [NL-Augmenter](https://github.com/GEM-benchmark/NL-Augmenter) - Collaborative Repository of Natural Language Transformations [GitHub, 787 stars] * ⭐ [EDA](https://github.com/jasonwei20/eda_nlp) - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1652 stars] * ⭐ [snorkel](https://github.com/snorkel-team/snorkel) Framework to generate training data [GitHub, 5970 stars] * ⭐ [dialoguefactory](https://github.com/smartinovski/dialoguefactory/tree/main) Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars] ##### Reading Material and Tutorials * ⭐ [A Survey of Data Augmentation Approaches for NLP](https://arxiv.org/abs/2105.03075) [Paper, May 2021] [GitHub Link](https://github.com/styfeng/DataAug4NLP) * 📙 [A Visual Survey of Data Augmentation in NLP](https://amitness.com/2020/05/data-augmentation-for-nlp/) [Blog, 2020] * 📙 [Weak Supervision: A New Programming Paradigm for Machine Learning](http://ai.stanford.edu/blog/weak-supervision/) [Blog, March 2019] #### Named Entity Recognition (NER) * ⭐ [Datasets for Entity Recognition](https://github.com/juand-r/entity-recognition-datasets) [GitHub, 1573 stars] * ⭐ [Datasets to train supervised classifiers for Named-Entity Recognition](https://github.com/davidsbatista/NER-datasets) [GitHub, 344 stars] * 
⭐ [Bootleg](https://github.com/HazyResearch/bootleg) - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 218 stars] * ⭐ [Few-NERD](https://github.com/thunlp/Few-NERD) - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 400 stars] #### Relation Extraction * ⭐ [tacred-relation](https://github.com/yuhaozhang/tacred-relation) TACRED: position-aware attention model for relation extraction [GitHub, 359 stars] * ⭐ [tacrev](https://github.com/DFKI-NLP/tacrev) TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 70 stars] * ⭐ [tac-self-attention](https://github.com/ivan-bilan/tac-self-attention) Relation extraction with position-aware self-attention [GitHub, 66 stars] * ⭐ [Re-TACRED](https://github.com/gstoica27/Re-TACRED) Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 55 stars] #### Coreference Resolution * ⭐ [NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks](https://github.com/huggingface/neuralcoref) by HuggingFace [GitHub, 2889 stars] * ⭐ [coref](https://github.com/mandarjoshi90/coref) - BERT and SpanBERT for Coreference Resolution [GitHub, 453 stars] #### Sentiment Analysis * ⭐ [Reading list for Awesome Sentiment Analysis papers](https://github.com/declare-lab/awesome-sentiment-analysis) by [declare-lab](https://github.com/declare-lab) [GitHub, 538 stars] * ⭐ [Awesome Sentiment Analysis](https://github.com/xiamx/awesome-sentiment-analysis) by [xiamx](https://github.com/xiamx) [GitHub, 931 stars] #### Domain Adaptation * ⭐ [Neural Adaptation in Natural Language Processing - curated list](https://github.com/bplank/awesome-neural-adaptation-in-NLP) [GitHub, 266 stars] #### Low Resource NLP * ⭐ [CMU LTI Low Resource NLP Bootcamp 2020](https://github.com/neubig/lowresource-nlp-bootcamp-2020) - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 606 stars] #### Spell Correction / Error Correction * ⭐ 
[Gramformer](https://github.com/PrithivirajDamodaran/Gramformer) - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1581 stars] * ⭐ [NeuSpell](https://github.com/neuspell/neuspell) - A Neural Spelling Correction Toolkit [GitHub, 712 stars] * ⭐ [SymSpellPy](https://github.com/mammothb/symspellpy) - Python port of SymSpell [GitHub, 871 stars] * 📙 [Speller100](https://www.microsoft.com/en-us/research/blog/speller100-zero-shot-spelling-correction-at-scale-for-100-plus-languages/) by Microsoft [Blog, Feb 2021] * ⭐ [JamSpell](https://github.com/bakwc/JamSpell) - spell checking library - accurate, fast, multi-language [GitHub, 662 stars] * ⭐ [pycorrector](https://github.com/shibing624/pycorrector) - spell correction for Chinese [GitHub, 6452 stars] * ⭐ [contractions](https://github.com/kootenpv/contractions) - Fixes contractions such as `you're` to `you are` [GitHub, 318 stars] * 📙 [Fine Tuning T5 for Grammar Correction](https://sachinruk.github.io/blog/2022-11-07-t5-for-grammar-correction.html) by Sachin Abeywardana [Blog, Nov 2022] #### PDF Parsing * ⭐ [spacy-layout](https://github.com/explosion/spacy-layout) - Process PDFs, Word documents and more with spaCy [GitHub, 902 stars] * ⭐ [bentopdf](https://github.com/alam00000/bentopdf) - A Privacy First PDF Toolkit [GitHub, 13552 stars] #### Style Transfer for NLP * ⭐ [Styleformer](https://github.com/PrithivirajDamodaran/Styleformer) - Neural Language Style Transfer framework [GitHub, 494 stars] * ⭐ [StylePTB](https://github.com/lvyiwei1/StylePTB) - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 64 stars] #### Automata Theory for NLP * ⭐ [pyahocorasick](https://github.com/WojciechMula/pyahocorasick) - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 1104 stars] #### Obscene words detection * ⭐ [LDNOOBW](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) - List of Dirty, Naughty, Obscene, and 
Otherwise Bad Words [GitHub, 3376 stars] #### Reddit Analysis * ⭐ [Subreddit Analyzer](https://github.com/PhantomInsights/subreddit-analyzer) - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 499 stars] #### Skill Detection * ⭐ [SkillNER](https://github.com/AnasAito/SkillNER) - rule based NLP module to extract job skills from text [GitHub, 209 stars] #### Reinforcement Learning for NLP * ⭐ [nlp-gym](https://github.com/rajcscw/nlp-gym) - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 201 stars] #### AutoML / AutoNLP * ⭐ [AutoNLP](https://github.com/huggingface/autonlp) - Faster and easier training and deployments of SOTA NLP models [GitHub, 4574 stars] * ⭐ [TPOT](https://github.com/EpistasisLab/tpot) - Python Automated Machine Learning tool [GitHub, 10047 stars] * ⭐ [Auto-PyTorch](https://github.com/automl/Auto-PyTorch) - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 2534 stars] * ⭐ [HungaBunga](https://github.com/ypeleg/HungaBunga) - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 0 stars] * ⭐ [Optuna](https://github.com/optuna/optuna) - hyperparameter optimization framework [GitHub, 14280 stars] * ⭐ [FLAML](https://github.com/microsoft/FLAML) - fast and lightweight AutoML library [GitHub, 4360 stars] * ⭐ [Gradsflow](https://github.com/gradsflow/gradsflow) - open-source AutoML & PyTorch Model Training Library [GitHub, 307 stars] #### OCR - Optical Character Recognition * 📙 [A framework for designing document processing solutions](https://ljvmiranda921.github.io/notebook/2022/06/19/document-processing-framework/) [Blog, June 2022] #### Document AI * 📙 [Table Transformer](https://huggingface.co/docs/transformers/main/model_doc/table-transformer) + [HuggingFace Models](https://huggingface.co/models?other=table-transformer) #### Text Generation * ⭐ [keytotext](https://github.com/gagan3012/keytotext) - a model which 
will take keywords as inputs and generate sentences as outputs [GitHub, 451 stars] * 📙 [Controllable Neural Text Generation](https://lilianweng.github.io/lil-log/2021/01/02/controllable-neural-text-generation.html) [Blog, Jan 2021] * ⭐ [BARTScore](https://github.com/neulab/BARTScore) Evaluating Generated Text as Text Generation [GitHub, 369 stars] #### Title / Headlines Generation * ⭐ [TitleStylist](https://github.com/jind11/TitleStylist) Learning to Generate Headlines with Controlled Styles [GitHub, 78 stars] #### NLP research reproducibility * 📙 [A Systematic Review of Reproducibility Research in Natural Language Processing](https://arxiv.org/abs/2103.07929) [Paper, March 2021] ## License [CC0](./LICENSE) ## Attributions #### Resources * All linked resources belong to original authors #### Icons * [Akropolis](https://thenounproject.com/search/?q=ancient%20greek&i=403786) by parkjisun from the [Noun Project](https://thenounproject.com) * [Book](https://thenounproject.com/icon/304884/) of Ester by Gilad Sotil from the [Noun Project](https://thenounproject.com) * [quill](https://thenounproject.com/term/quill/17013/) by Juan Pablo Bravo from the [Noun Project](https://thenounproject.com) * [acting](https://thenounproject.com/term/acting/2369397/) by Flatart from the [Noun Project](https://thenounproject.com) * [olympic](https://thenounproject.com/term/olympic/1870751/) by supalerk laipawat from the [Noun Project](https://thenounproject.com) * [aristocracy](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3156156) by Eucalyp from the [Noun Project](https://thenounproject.com) * [Horn](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3156640) by Eucalyp from the [Noun Project](https://thenounproject.com) * [temple](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3156638) by Eucalyp from the [Noun Project](https://thenounproject.com) * 
[constellation](https://thenounproject.com/eucalyp/collection/ancient-greece-glyph/?i=3156142) by Eucalyp from the [Noun Project](https://thenounproject.com) * [ancient greek round pattern](https://thenounproject.com/term/ancient-greek-round-pattern/2048889/) by Olena Panasovska from the [Noun Project](https://thenounproject.com) * Harp by Vectors Point from the [Noun Project](https://thenounproject.com) * [Atlas](https://thenounproject.com/naripuru/collection/ancient-gods/?i=2225785) by parkjisun from the [Noun Project](https://thenounproject.com) * [Parthenon](https://thenounproject.com/eucalyp/collection/ancient-greece-line/?i=3158942) by Eucalyp from the [Noun Project](https://thenounproject.com) * [papyrus](https://thenounproject.com/iconmark/collection/greek-mythology/?i=3515982) by IconMark from the [Noun Project](https://thenounproject.com) * [papyrus](https://thenounproject.com/search/?q=papyrus&i=2239368) by Smalllike from the [Noun Project](https://thenounproject.com) * [pegasus](https://thenounproject.com/search/?q=pegasus&i=2266449) by Saeful Muslim from the [Noun Project](https://thenounproject.com) #### Fonts * [Dalek Font](https://www.dafont.com/dalek.font) ----- <h3 align="center">The Pandect Series also includes</h3> <p align="middle"> <a href="https://github.com/ivan-bilan/The-Microservices-Pandect"> <img src="https://raw.githubusercontent.com/ivan-bilan/The-Engineering-Manager-Pandect/main/Resources/Images/microservices_pandect_promo.png" width="390" /> </a>       <a href="https://github.com/ivan-bilan/The-Engineering-Manager-Pandect"> <img src="https://raw.githubusercontent.com/ivan-bilan/The-Engineering-Manager-Pandect/main/Resources/Images/em_pandect_promo.png" width="370" /> </a> </p>
