Awesome-DataSets
Awesome DataSets Collections
Collection
Search Index
- Google Dataset Search: A search service for finding data from the sciences, governments, and some news organizations.
- Re3Data: 2,000 Data Repositories and Science Europe’s Framework for Discipline-specific Research Data Management
- Open Data Inception: 2600+ Open Data Portals Around the World
- Tianchi Datasets (天池数据集): A multi-domain collection of datasets for scientific research and experimentation.
Repositories
- Reddit Datasets: A place to share, find, and discuss datasets.
- AWS Public Datasets: AWS hosts a variety of public datasets that anyone can access for free.
- awesome-public-datasets: An awesome list of high-quality open datasets in public domains (ongoing).
- Yelp Academic Dataset: All the data and reviews of the 250 closest businesses for 30 universities, for students and academics to explore and research.
- Data For Everyone: A selection of open datasets created on the Figure Eight platform, free for anyone to download.
NLP & Text DataSets
- DataSets: Datasets and evaluation metrics for natural language processing and more. Compatible with NumPy, Pandas, PyTorch, and TensorFlow.
News
- 20 Newsgroups: The text of 20,000 messages taken from 20 Usenet newsgroups, for text analysis, classification, etc.
Wiki
- Wikimedia Dumps: The Wikimedia Foundation is requesting help to ensure that as many copies as possible are available of all Wikimedia database dumps.
Tweets
Comments/Reviews
- Amazon Reviews: Over 142 million product reviews for sentiment analysis, recommender systems, and more.
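Chinese
- 2016-THUCTC: Tsinghua University news dataset.
- chinese-xinhua: Database and API for the Chinese Xinhua Dictionary, covering 14,032 xiehouyu (two-part allegorical sayings), 16,142 Chinese characters, 264,434 words, and 31,648 idioms (chengyu).
- chinese-poetry: The most comprehensive database of classical Chinese literature, containing 55,000 Tang poems, 260,000 Song poems, and 21,000 Song ci, by nearly 14,000 poets of the Tang and Song dynasties and 1,500 ci writers of the Song period. The data was collected from the internet.
- 2019-ChineseGLUE: Language Understanding Evaluation benchmark for Chinese: datasets, baselines, pre-trained models, corpus, and leaderboard.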
Image DataSets
- fashion-mnist: Fashion-MNIST is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples.
- facets: The facets project contains two visualizations for understanding and analyzing machine learning datasets: Facets Overview and Facets Dive.
- Labeled Faces in the Wild: 13,000 named faces for facial recognition. Multiple training and test sets. 173 MB total.
- Mushroom Identification: For hypothetically classifying mushrooms as edible or poisonous based on their characteristics. 3 files, 480 KB.
- NORB 3D Object Recognition: Binocular images of 50 toy figurines for 3D object recognition from images. Multiple files, over 5 GB total.
- One Million Songs: Audio features and metadata for a subset (10,000) of a dataset of one million popular songs, for recognition/classification. 1.8 GB.
- Hate Speech Identification: A sampling of Twitter posts judged on whether they are offensive or contain hate speech, as a training set for text analysis. 2.66 MB.
- Hidden Beauty of Flickr Pictures: 15,000 Flickr photo IDs rated on aesthetics, for image analysis. 138 KB; use the Flickr API to get the images.
- 2023-MultimodalC4: MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.
Adults
- NSFW Data Scrapper: Collection of scripts to aggregate image data for the purposes of training an NSFW Image Classifier.
OCR
- im2latex-100k: A prebuilt dataset for OpenAI’s image-to-LaTeX task. Includes a total of ~100k formulas and images split into train, validation, and test sets.
Voice & Media & Video
Domain
Social Networks
- Yahoo Instant Messenger Friends Connectivity Graph: Connections between Yahoo users who communicate with each other using Yahoo Messenger; can be used to identify key social contacts/influencers. Add dataset to cart to access.
- SNAP: Stanford Large Network Dataset Collection
- MLVIS: This project is the first to combine the notion of a data repository with real-time visual analytics for interactive data mining and exploratory analysis on the web.
- Network Repository: Network Repository is not only the first interactive repository, but also the largest network repository, with thousands of donations.
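Driving Data
LBS (Geolocation)
- China 5-level administrative divisions MySQL database (中国 5 级行政区域 mysql 库): Administrative division data scraped from the official website of the National Bureau of Statistics, covering five levels: province, city, county, town, and village.
- china_regions: The most complete and up-to-date JSON and SQL data for China's provinces, cities, and districts.
- qqzeng-ip: An up-to-date IP address database, with multi-language parsers and database import scripts.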
Time Series
- Time Series Data Library: The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia.
Business DataSets
Financial
- Tushare: Provides stock trading and market quote data, returned in DataFrame format through simple interface calls.
Sports
- Football Strategy: Thousands of scenarios to make the best coaching decisions. 876 KB total.
- Horses for Course: Horse-racing data for predicting race results. 19 MB total.
- NBA & MLB Stats: Current and past season stats for teams and players, for fantasy sports predictions.
Medicines
- National Survey on Drug Use and Health: Predict drug use based on health survey questions. 2 GB total.
- Prostate Cancer: Tumor and nontumor samples, used to recognize prostate cancer. 4.8 MB total.
- Record of Heart Sound: Recordings of normal and abnormal heartbeats, used to recognize heart murmurs, etc. 47.7 MB total.
Foods
- Wine Quality: Chemical properties of red and white wines (separately) and quality, for classification. 3 files, 343 KB total.
- malicious-urls: Hundreds of thousands of URLs, each labeled as malicious or not.
- MovieLens: A large collection of movie review data.
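Governments
- The home of the U.S. Government’s open data: Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.
- Enterprise Registration Data of Chinese Mainland: Registration records for more than ten million businesses across 31 provinces of mainland China from 1978 to 2019, including company name, registered address, unified social credit code, region, registration date, business scope, legal representative, registered capital, company type, and other details.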