NLP-OpenSource-List
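- SnowNLP : SnowNLP is a Python library for conveniently processing Chinese text. It was inspired by TextBlob. Because most natural language processing libraries target English, the author wrote one suited to Chinese. Unlike TextBlob, it does not use NLTK: all algorithms are implemented from scratch, and it ships with some pre-trained dictionaries.
- nlp_compromise : a cool way to use natural language in JavaScript
- flair : A very simple framework for state-of-the-art Natural Language Processing (NLP)
- Chinese NLP : Shared tasks, datasets and state-of-the-art results for Chinese Natural Language Processing (NLP).
- 2019-Transformers : 🤗 Transformers: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.
- 2020-MiNLP : Xiaomi's natural language processing platform (MiNLP) provides dozens of functional modules covering lexical, syntactic, and semantic analysis, and is already widely used in the company's business.
- 2020-fastNLP : fastNLP is a lightweight natural language processing (NLP) toolkit. You can use it to complete an NLP task quickly, or to build more complex models quickly for research.
- 2022-Haystack : Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want to perform Question Answering or semantic document search, you can use the State-of-the-Art NLP models in Haystack to provide unique search experiences and allow your users to query in natural language. Haystack is built in a modular fashion so that you can combine the best technology from other open-source projects like Huggingface’s Transformers, Elasticsearch, or Milvus.
- 2022-PaddleNLP : Easy-to-use and powerful NLP library with an awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including end-to-end systems for Neural Search, Question Answering, Information Extraction and Sentiment Analysis.
- 2020-ParlAI : A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
- WantWords : In contrast to a regular (forward) dictionary, which provides definitions for query words, a reverse dictionary returns words that semantically match a query description.
- funNLP : Chinese and English sensitive-word lists, language detection, lookup of location and carrier for Chinese and international mobile/landline numbers, gender inference from names, mobile number extraction, ID card number extraction, email extraction, Chinese and Japanese personal-name databases, Chinese abbreviation database, character-decomposition dictionary, word sentiment values, stop words, reactionary-word list, violence-and-terrorism word list, Traditional/Simplified Chinese conversion, English imitation of Chinese pronunciation, Wang Feng lyrics generator, occupation-name lexicon, synonym lexicon, antonym lexicon, negation-word lexicon, car-brand lexicon, car-parts lexicon, segmentation of run-together English text, various Chinese word vectors, company-name collection, classical poetry corpus, IT lexicon, finance lexicon, idiom lexicon, place-name lexicon, historical-figure lexicon, poetry lexicon, medical lexicon, food-and-drink lexicon, legal lexicon, automobile lexicon, animal lexicon, Chinese chat corpus, Chinese rumor data, Baidu Chinese question-answering dataset, collection of sentence-similarity matching algorithms, BERT resources, tools for text generation and summarization, cocoNLP information extraction tool, regular-expression matching for Chinese domestic phone numbers, Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph, Tsinghua University artificial intelligence technology…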
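
The reverse dictionary idea described in the WantWords entry can be illustrated with a toy example. The sketch below is not how WantWords works internally; it swaps in a plain word-overlap (Jaccard) score over a hypothetical three-entry dictionary, purely to show the direction of the lookup: from a description to candidate words. The names `DEFINITIONS`, `tokenize`, and `reverse_lookup` are invented for this illustration.

```python
def tokenize(text):
    """Lowercase a string and split it into a set of whitespace-separated tokens."""
    return set(text.lower().split())


# Hypothetical mini dictionary: headword -> definition.
DEFINITIONS = {
    "bridge": "a structure built over a river or road so people can cross",
    "ladder": "a frame with steps used for climbing up or down",
    "umbrella": "a folding cover held over the head for protection from rain",
}


def reverse_lookup(description, definitions=DEFINITIONS, top_k=2):
    """Return up to top_k headwords whose definitions overlap most with the description."""
    query = tokenize(description)
    scored = []
    for word, definition in definitions.items():
        tokens = tokenize(definition)
        # Jaccard similarity: shared tokens divided by all distinct tokens.
        score = len(query & tokens) / len(query | tokens)
        scored.append((score, word))
    scored.sort(reverse=True)
    # Drop candidates that share no token with the description.
    return [word for score, word in scored[:top_k] if score > 0]


results = reverse_lookup("something you hold over your head in the rain")
print(results)
```

A real reverse dictionary would need to match meaning rather than surface tokens, which is why systems in this space typically rely on learned sentence representations instead of token overlap.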
Dialogue
- 2022-Sketch : Sketch is an AI code-writing assistant for pandas users that understands the context of your data, greatly improving the relevance of suggestions. Sketch is usable in seconds and doesn’t require adding a plugin to your IDE.
Language Representation
- 2018-BERT : BERT is a method of pre-training language representations, meaning that we train a general-purpose “language understanding” model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). A large collection of pre-trained Chinese ALBERT models.
- 2019-GPT2 : Code and models from the paper “Language Models are Unsupervised Multitask Learners”.
- 2019-GPT2 Chinese : Chinese version of GPT2 training code, using BERT or BPE tokenizer.
- 2021-gpt neo : An implementation of model parallel GPT2- and GPT3-like models, with the ability to scale up to full GPT3 sizes (and possibly more!), using the mesh-tensorflow library.
Classification
- 2016-FastText : FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices.
Syntax & Semantic Analysis
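- Snips NLU : Snips NLU (Natural Language Understanding) is a Python library that parses sentences written in natural language and extracts structured information.
- Word2Bits : Word2Bits extends the Word2Vec algorithm to output high quality quantized word vectors that take 8x-16x less storage/memory than regular word vectors.
- ansj_seg : Ansj word segmentation, a true Java implementation of ICT. Both its segmentation quality and its speed exceed those of the open-source version of ICT. Supports Chinese word segmentation, person-name recognition, part-of-speech tagging, and user-defined dictionaries.
- gensim : topic modelling for humans
- 2019-pkuseg : pkuseg is simple and easy to use, supports word segmentation for specialized domains, and effectively improves segmentation accuracy.
- 2019-Synonyms : The best Chinese synonym toolkit. Synonyms can be used for many natural language understanding tasks: text alignment, recommendation algorithms, similarity computation, semantic shift, keyword extraction, concept extraction, automatic summarization, search engines, and more.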
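
To make the storage claim in the Word2Bits entry concrete, here is a minimal sketch of vector quantization. It is an assumption-laden illustration, not the Word2Bits method: Word2Bits learns quantized vectors during training, whereas this example applies plain uniform quantization to an already-trained vector. The function names, the `[-1, 1]` value range, and the sample vector are all hypothetical. With 2 bits per dimension, a vector stored as 32-bit floats shrinks by a factor of 16.

```python
import struct


def quantize(vec, bits=2, lo=-1.0, hi=1.0):
    """Map each float to an integer code in [0, 2**bits - 1] by uniform binning."""
    levels = (1 << bits) - 1
    codes = []
    for x in vec:
        x = min(max(x, lo), hi)  # clamp into [lo, hi]
        codes.append(round((x - lo) / (hi - lo) * levels))
    return codes


def pack(codes, bits=2):
    """Pack integer codes into bytes, 8 // bits codes per byte."""
    per_byte = 8 // bits
    out = bytearray((len(codes) + per_byte - 1) // per_byte)
    for i, code in enumerate(codes):
        out[i // per_byte] |= code << ((i % per_byte) * bits)
    return bytes(out)


def unpack(data, dim, bits=2, lo=-1.0, hi=1.0):
    """Recover approximate float values from packed codes."""
    levels = (1 << bits) - 1
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    vec = []
    for i in range(dim):
        code = (data[i // per_byte] >> ((i % per_byte) * bits)) & mask
        vec.append(lo + code / levels * (hi - lo))
    return vec


# Hypothetical 8-dimensional word vector.
vector = [0.9, -0.8, 0.1, -0.3, 0.5, -1.0, 1.0, 0.0]
packed = pack(quantize(vector))
full = struct.pack(f"{len(vector)}f", *vector)  # same vector as 32-bit floats
ratio = len(full) // len(packed)
restored = unpack(packed, len(vector))
print(len(full), len(packed), ratio)
```

The trade-off is precision: with four levels spread over `[-1, 1]`, each restored value can differ from the original by up to one third.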
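Knowledge Graph
- 2018-OpenKE : An Open-Source Package for Knowledge Embedding (KE).
- Intelligent question answering system based on a medical knowledge graph : A basic knowledge-base question answering system implemented with the Python module REfO. It parses an input natural language question into a SPARQL query, sends that query to an Apache Jena Fuseki service backed by a TDB knowledge base, and returns the answer to the question.
- 2019-KnowledgeGraphData : Knowledge is power, and the knowledge graph is a product of the new era of artificial intelligence. Put simply, a knowledge graph links pieces of knowledge through relations into a network structure. An AI system can then use the graph to understand the event it represents, which may be real or fictional.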
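
The question-to-SPARQL step described in the medical question answering entry can be sketched roughly as follows. This is an illustration under stated assumptions, not that project's code: it uses the standard-library `re` module as a stand-in for REfO, and the question patterns, the `:name`, `:treats`, and `:hasSymptom` predicates, and the function name are all hypothetical. Sending the query to a Fuseki endpoint is left out.

```python
import re

# Hypothetical question patterns paired with SPARQL templates.
# Doubled braces are literal braces after str.format().
PATTERNS = [
    (
        re.compile(r"what does (?P<drug>[\w\- ]+?) treat\??$", re.IGNORECASE),
        'SELECT ?disease WHERE {{ ?d :name "{drug}" . ?d :treats ?disease . }}',
    ),
    (
        re.compile(r"what are the symptoms of (?P<disease>[\w\- ]+?)\??$", re.IGNORECASE),
        'SELECT ?symptom WHERE {{ ?x :name "{disease}" . ?x :hasSymptom ?symptom . }}',
    ),
]


def question_to_sparql(question):
    """Return a SPARQL query string for a recognized question, or None."""
    text = question.strip()
    for pattern, template in PATTERNS:
        match = pattern.match(text)
        if match:
            # Captured groups contain only word characters, hyphens, and
            # spaces, so they cannot close the quoted literal.
            return template.format(**match.groupdict())
    return None


query = question_to_sparql("What does aspirin treat?")
print(query)
```

Pattern matching of this kind only covers question shapes that were anticipated in advance, which fits the "basic" scope the entry describes.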
Speech
- 2019-Project DeepSpeech : A TensorFlow implementation of Baidu’s DeepSpeech architecture.
- 2020-TTS : TTS is a library for advanced Text-to-Speech generation. It’s built on the latest research and designed to achieve the best trade-off among ease-of-training, speed and quality. TTS comes with pretrained models and tools for measuring dataset quality, and is already used in 20+ languages for products and research projects.
- 2021-MockingBird : 🚀 AI voice cloning: clone a voice in 5 seconds to generate arbitrary speech in real time.
- 2022-Whisper : Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
- whisper.cpp : High-performance inference of OpenAI’s Whisper automatic speech recognition (ASR) model.
- 2023-faster-whisper : faster-whisper is a reimplementation of OpenAI’s Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
Dialogue System & Bot
- 2018-DeepPavlov : An open source library for building end-to-end dialog systems and training chatbots.
- Home Assistant : Home Assistant is an open-source home automation platform running on Python 3. Track and control all devices at home and automate control. Perfect to run on a Raspberry Pi.
- ChatterBot : ChatterBot is a machine learning, conversational dialog engine for creating chat bots.
- 2016-Hubot : Hubot is a framework to build chat bots, modeled after GitHub’s Campfire bot of the same name, hubot. He’s pretty cool. He’s extendable with scripts and can work on many different chat services.
- 2019-Botpress : The ultimate open-source conversational platform with built-in natural language understanding (NLU), easy-to-use graphical interface and dialog manager.
- Olivia : Your new best friend built with an artificial neural network.
- Leon : Leon is your open-source personal assistant.
- Dexter : Dexter is a voice controlled assistant, akin to Google Home and Alexa. Dexter’s your right hand person (in theory).
ASR
- Common Voice : The Common Voice project is Mozilla’s initiative to help teach machines how real people speak.
- DeepSpeech : Project DeepSpeech is an open source Speech-To-Text engine. It uses a model trained by machine learning techniques, based on Baidu’s Deep Speech research paper. Project DeepSpeech uses Google’s TensorFlow project to make the implementation easier.
- wav2letter : wav2letter is a simple and efficient end-to-end Automatic Speech Recognition (ASR) system from Facebook AI Research.
- WeNet : Production First and Production Ready End-to-End Speech Recognition Toolkit
- ASRT_SpeechRecognition : A Deep-Learning-Based Chinese Speech Recognition System.
- 2019-Real-Time Voice Cloning : SV2TTS is a three-stage deep learning framework that creates a numerical representation of a voice from a few seconds of audio and uses it to condition a text-to-speech model trained to generalize to new voices.
- 2020-TensorFlowTTS : 😝 TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supported languages include English, Korean, and Chinese)
- 2020-Silero Models : Silero Models: pre-trained STT models and benchmarks made embarrassingly simple
TTS
- 2022-TorToiSe : A multi-voice TTS system trained with an emphasis on quality