Awesome-DataSets

Awesome DataSets Collections

Collection | 数据集合

Search Index | 索引

Google Dataset Search: A search service for finding data from the sciences, governments, and some news organizations.
Re3Data: 2,000 Data Repositories and Science Europe’s Framework for Discipline-specific Research Data Management
Open Data Inception: 2600+ Open Data Portals Around the World
Tianchi Datasets (天池数据集): Multi-domain datasets for scientific research and experiments.

Repositories | 资源

Reddit Datasets: A place to share, find, and discuss Datasets.
AWS Public Datasets: AWS hosts a variety of public datasets that anyone can access for free.
awesome-public-datasets : An awesome list of high-quality open datasets in public domains (on-going).
Yelp Academic Dataset: All the data and reviews of the 250 closest businesses for 30 universities, for students and academics to explore and research.
UCI Machine Learning Repository
Data For Everyone: Here are some of our favorite open datasets created on the Figure Eight platform. They’re free for any and everyone to download.

NLP & Text DataSets | 文本数据

DataSets : Datasets and evaluation metrics for natural language processing and more. Compatible with NumPy, Pandas, PyTorch and TensorFlow.

News

20 Newsgroups: The text of 20,000 messages taken from 20 Usenet newsgroups, for text analysis, classification, etc.

Wiki

Wikimedia Dumps: The Wikimedia Foundation is requesting help to ensure that as many copies as possible are available of all Wikimedia database dumps.

Tweets

2011-tweets2011

Comments/Reviews

Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more.

Chinese | 中文文本

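2016-THUCTC: Tsinghua University news dataset.
chinese-xinhua: A Chinese Xinhua Dictionary database and API. Includes 14,032 xiehouyu (two-part allegorical sayings), 16,142 Chinese characters, 264,434 words, and 31,648 idioms (chengyu).
chinese-poetry: The most complete database of classical Chinese literature, containing 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci, by nearly 14,000 poets of the Tang and Song dynasties and 1,500 ci writers of the Song dynasty. The data comes from the internet.
2019-ChineseGLUE : Language Understanding Evaluation benchmark for Chinese: datasets, baselines, pre-trained models, corpus, and leaderboard.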

Image DataSets | 图片数据

fashion-mnist : Fashion-MNIST is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples.
facets : The facets project contains two visualizations for understanding and analyzing machine learning datasets: Facets Overview and Facets Dive.
Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173MB total.
Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480KB.
NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from images. Multiple files, over 5GB total.
One Million Songs: Audio features and metadata for a subset (10,000) of the one million popular songs dataset, for recognition/classification. 1.8GB.
Hate Speech Identification: A sampling of Twitter posts that have been judged on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66MB.
Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs that have received ratings based on aesthetics, for image analysis. 138KB; use the Flickr API to get the images.
2023-MultimodalC4 : MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
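
As a rough illustration of working with one entry above: Fashion-MNIST is distributed as gzip-compressed files in the IDX binary format also used by MNIST. The sketch below parses that format using only the Python standard library. The function names `parse_idx` and `load_idx_gz` are hypothetical helpers introduced here, not part of any listed project, and the sample data is synthetic so nothing has to be downloaded.

```python
import gzip
import struct


def parse_idx(raw: bytes):
    """Parse an unsigned-byte IDX buffer; return (dims, flat_values)."""
    # Header: two zero bytes, a type code (0x08 = unsigned byte),
    # and the number of dimensions.
    zero, dtype, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # Each dimension size is a big-endian 32-bit unsigned integer.
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    data = raw[4 + 4 * ndim:]
    expected = 1
    for d in dims:
        expected *= d
    if len(data) != expected:
        raise ValueError("payload size does not match header")
    return dims, data


def load_idx_gz(path: str):
    """Read a gzip-compressed IDX file from a local path (hypothetical helper)."""
    with gzip.open(path, "rb") as fh:
        return parse_idx(fh.read())


# Tiny synthetic example: two 2x2 "images" with pixel values 0..7.
sample = struct.pack(">HBBIII", 0, 0x08, 3, 2, 2, 2) + bytes(range(8))
dims, pixels = parse_idx(sample)
print(dims, list(pixels))
```

For a real image file, `dims` would be expected to hold the example count followed by the image height and width, and a label file would have a single dimension. Treat this as a sketch for exploring the files, not a replacement for the loaders that the dataset's own repository provides.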

Adults

NSFW Data Scrapper : Collection of scripts to aggregate image data for the purposes of training an NSFW Image Classifier.

OCR

im2latex-100k : A prebuilt dataset for OpenAI’s image-to-LaTeX task. Includes a total of ~100k formulas and images, split into train, validation, and test sets.

Voice & Media & Video

领域数据 | Domain

Social Networks | 社交网络

Yahoo Instant Messenger Friends Connectivity Graph: Connections between Yahoo users who communicate with each other using Yahoo Messenger; can be used to identify key social contacts/influencers. Add dataset to cart to access.
SNAP: Stanford Large Network Dataset Collection
MLVIS: This project is the first to combine the notion of a data repository with real-time visual analytics for interactive data mining and exploratory analysis on the web.
Network Repository: Network repository is not only the first interactive repository, but also the largest network repository with thousands of donations.

Driving Data | 驾驶数据

LBS | 地理位置

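China 5-level administrative divisions MySQL database (中国 5 级行政区域 mysql 库) : Administrative division data crawled from the official website of the National Bureau of Statistics, covering five levels: province, city, county, town, and village.
china_regions : The most complete and up-to-date JSON and SQL data for China's provinces, cities, and districts.
qqzeng-ip : The latest IP address database, with multi-language parsers and database import scripts.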

Time Series

Time Series Data Library: The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia.

Business DataSets

Financial | 金融证券

Tushare: Trading data covering stock market quotes; the corresponding DataFrame-format data can be obtained through simple API calls.

Sports | 体育

Football Strategy: Thousands of scenarios to make the best coaching decisions. 876KB total.
Horses for Course: Horse-racing data for predicting race results. 19MB total.
NBA & MLB Stats: Current and past season stats for teams and players, for fantasy sports predictions.

Medicines | 医药

National Survey on Drug Use and Health: Predict drug use based on health survey questions. 2GB total.
Prostate Cancer: Tumor and nontumor samples, used to recognize prostate cancer. 4.8MB total.
Record of Heart Sound: Recordings of normal and abnormal heartbeats, used to recognize heart murmurs, etc. 47.7MB total.

Foods | 饮食

Wine Quality: Chemical properties of red and white wines (separately) and quality, for classification. 3 files, 343KB total.
malicious-urls: Hundreds of thousands of URLs, each labeled as malicious or not.
MovieLens: A large collection of movie review data.

Governments | 政务

Frequent Itemset Mining Dataset Repository
CUHK Multimedia Laboratory
Social Computing Research at the University of Minnesota
Hyperlink Graphs
The home of the U.S. Government’s open data: Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.
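Enterprise Registration Data of Chinese Mainland: Registration information for more than ten million businesses across 31 provinces of mainland China from 1978 to 2019, including company name, registered address, unified social credit code, region, registration date, business scope, legal representative, registered capital, company type, and other details.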

Others

Fake Mail Generator

Links

Last updated on 2023-04-16
