Awesome-DataSets
Awesome DataSets Collections
Collection
Search Index
- Google Dataset Search: A search service for finding data from the sciences, governments, and some news organizations.
- Re3Data: 2,000 data repositories and Science Europe’s Framework for Discipline-specific Research Data Management.
- Open Data Inception: 2,600+ open data portals around the world.
- Tianchi Datasets (天池数据集): Datasets from many domains for scientific research and experiments.
Repositories
- Reddit Datasets: A place to share, find, and discuss datasets.
- AWS Public Datasets: AWS hosts a variety of public datasets that anyone can access for free.
- awesome-public-datasets: A list of high-quality open datasets in public domains (ongoing).
- Yelp Academic Dataset: All the data and reviews of the 250 closest businesses for 30 universities, for students and academics to explore and research.
- Data For Everyone: A selection of open datasets created on the Figure Eight platform, free for anyone to download.
NLP & Text DataSets
- DataSets: Datasets and evaluation metrics for natural language processing and more. Compatible with NumPy, Pandas, PyTorch, and TensorFlow.
News
- 20 Newsgroups: The text of 20,000 messages taken from 20 Usenet newsgroups, for text analysis, classification, etc.
Wiki
- Wikimedia Dumps: The Wikimedia Foundation is requesting help to ensure that as many copies as possible are available of all Wikimedia database dumps.
Tweets
Comments/Reviews
- Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more.
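Chinese
- 2016-THUCTC: News dataset from Tsinghua University.
- chinese-xinhua: Xinhua Dictionary database and API. Contains 14,032 xiehouyu (two-part allegorical sayings), 16,142 Chinese characters, 264,434 words, and 31,648 idioms (chengyu).
- chinese-poetry: The most complete database of classical Chinese literature, containing 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci, by nearly 14,000 poets of the Tang and Song dynasties and about 1,500 ci writers of the Song dynasty. The data was collected from the internet.
- 2019-ChineseGLUE: Language Understanding Evaluation benchmark for Chinese: datasets, baselines, pre-trained models, corpus, and leaderboard.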
Image DataSets
- fashion-mnist: Fashion-MNIST is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples.
- facets: The facets project contains two visualizations for understanding and analyzing machine learning datasets: Facets Overview and Facets Dive.
- Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173 MB total.
- Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480 KB.
- NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from images. Multiple files, over 5 GB total.
- One Million Songs: Audio features and metadata for a subset (10,000 songs) of the one-million popular songs dataset, for recognition and classification. 1.8 GB.
- Hate Speech Identification: A sample of Twitter posts judged on whether they are offensive or contain hate speech, intended as a training set for text analysis. 2.66 MB.
- Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs rated on aesthetics, for image analysis. 138 KB; use the Flickr API to get the images.
- 2023-MultimodalC4: MultimodalC4 is a multimodal extension of C4 that interleaves millions of images with text.
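For readers who want to open the Fashion-MNIST files listed above without extra dependencies: the dataset is distributed in the same IDX binary format as the original MNIST. The following is a minimal sketch of reading that format, assuming unsigned-byte data only. The helper names `parse_idx` and `make_idx` are hypothetical, and the sample buffer is synthetic, not real dataset content.

```python
import struct


def parse_idx(buf: bytes):
    """Parse an unsigned-byte IDX buffer into (dims, flat_data)."""
    # Header: two zero bytes, a type code (0x08 = unsigned byte),
    # and the number of dimensions.
    zero, dtype, ndim = struct.unpack_from(">HBB", buf, 0)
    if zero != 0 or dtype != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # Each dimension size is a 4-byte big-endian unsigned integer.
    dims = struct.unpack_from(f">{ndim}I", buf, 4)
    offset = 4 + 4 * ndim
    count = 1
    for d in dims:
        count *= d
    data = buf[offset:offset + count]
    if len(data) != count:
        raise ValueError("truncated IDX buffer")
    return dims, data


def make_idx(dims, payload):
    """Build a synthetic IDX buffer (for illustration and testing only)."""
    header = struct.pack(">HBB", 0, 0x08, len(dims))
    sizes = struct.pack(f">{len(dims)}I", *dims)
    return header + sizes + bytes(payload)


# Synthetic stand-in for an image file: 2 "images" of 2x3 pixels.
sample = make_idx((2, 2, 3), range(12))
dims, data = parse_idx(sample)
print(dims, list(data[:6]))
```

The published files are gzip-compressed, so in practice the bytes would come from something like `gzip.open(path, "rb").read()`, where `path` is a local copy of a file such as `train-images-idx3-ubyte.gz`.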
Adults
- NSFW Data Scrapper: Collection of scripts to aggregate image data for training an NSFW image classifier.
OCR
- im2latex-100k: A prebuilt dataset for OpenAI’s image-to-LaTeX task. Includes ~100k formulas and images, split into training, validation, and test sets.
Voice & Media & Video
Domain Data
Social Networks
- Yahoo Instant Messenger Friends Connectivity Graph: Connections between Yahoo users who communicate with each other via Yahoo Messenger; can be used to identify key social contacts and influencers. Add the dataset to the cart to access it.
- SNAP: Stanford Large Network Dataset Collection.
- MLVIS: This project is the first to combine the notion of a data repository with real-time visual analytics for interactive data mining and exploratory analysis on the web.
- Network Repository: Network Repository is not only the first interactive repository, but also the largest network repository, with thousands of donations.
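Driving Data
LBS | Geolocation
- China 5-level administrative divisions MySQL database (中国 5 级行政区域 mysql 库): Administrative division data scraped from the official website of the National Bureau of Statistics, covering five levels: province, city, county, town, and village.
- china_regions: The most complete and up-to-date JSON and SQL data for China's provinces, cities, and districts.
- qqzeng-ip: Up-to-date IP address database, with parsers in multiple languages and database import scripts.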
Time Series
- Time Series Data Library: The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia.
Business DataSets
Financial
- Tushare: Provides stock trading and market quote data; simple API calls return the corresponding data as DataFrames.
Sports
- Football Strategy: Thousands of scenarios for making the best coaching decisions. 876 KB total.
- Horses for Course: Horse-racing data for predicting race results. 19 MB total.
- NBA & MLB Stats: Current and past season stats for teams and players, for fantasy sports predictions.
Medicines
- National Survey on Drug Use and Health: Predict drug use based on health survey questions. 2 GB total.
- Prostate Cancer: Tumor and non-tumor samples, used to recognize prostate cancer. 4.8 MB total.
- Record of Heart Sound: Recordings of normal and abnormal heartbeats, used to recognize heart murmurs, etc. 47.7 MB total.
Foods
- Wine Quality: Chemical properties of red and white wines (separately) and their quality ratings, for classification. 3 files, 343 KB total.
- malicious-urls: Hundreds of thousands of URLs, each labeled as malicious or not.
- MovieLens: A large volume of movie review data.
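Governments
- Data.gov: The home of the U.S. Government’s open data, with data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.
- Enterprise Registration Data of Chinese Mainland: Registration records for more than ten million businesses across 31 provinces of mainland China from 1978 to 2019, including company name, registered address, unified social credit code, region, registration date, business scope, legal representative, registered capital, and company type.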
