
[Datasets] A Roundup of Commonly Used Datasets in Artificial Intelligence

2017-02-15 14:02

Link:
https://medium.com/startup-grind/fueling-the-ai-gold-rush-7ae438505bc2#.4ogf5l3xu
Original link:
http://weibo.com/1657470871/EvlMEm0EH?ref=home&rid=7_0_202_2669536424773680536&type=comment

It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting-edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee.

Though it is not at the forefront of the AI hype train, the unsung hero of the AI revolution is data: lots and lots of labeled and annotated data, curated with the elbow grease of great research groups and companies who recognize that the democratization of data is a necessary step towards accelerating AI.

However, most products involving machine learning or AI rely heavily on proprietary datasets that are often not released, as this provides implicit defensibility.

With that said, it can be hard to sort out which public datasets are useful to look at, which are viable for a proof of concept, and which can serve as a product or feature validation step before you collect your own proprietary data.

It’s important to remember that good performance on a dataset doesn’t guarantee that a machine learning system will perform well in real product scenarios. Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or the algorithms; it’s the data collection and labeling. Standard datasets can be used for validation or as a good starting point for building a more tailored solution.

This week, a few machine learning experts and I were talking about all this. To make your life easier, we’ve collected an (opinionated) list of some open datasets that you can’t afford not to know about in the AI world.

Computer Vision

MNIST

CIFAR-10 & CIFAR-100

ImageNet

LSUN

PASCAL VOC

SVHN

MS COCO

Visual Genome

Labeled Faces in the Wild

Natural Language

Text Classification Datasets

WikiText

Question Pairs

SQuAD

CMU Q/A Dataset

Maluuba Datasets

Billion Words

Common Crawl

bAbI

The Children’s Book Test

Stanford Sentiment Treebank

20 Newsgroups

Reuters

IMDB

UCI’s Spambase

Speech

Most speech recognition datasets are proprietary, since the data holds a lot of value for the company that curates it. Most of the datasets available in the field are quite old.

2000 HUB5 English

LibriSpeech

VoxForge

TIMIT

CHiME

TED-LIUM

Recommendation and ranking systems

Netflix Challenge

MovieLens

Million Song Dataset

Last.fm

Networks and Graphs

Amazon Co-Purchasing and Amazon Reviews

Friendster Social Network Dataset

Geospatial data

OpenStreetMap

Landsat 8

NEXRAD
