10 sites to get large data sets or data corpora for free
2013-05-21 21:10
You may require GBs of data to do performance or load testing: how does your app behave when there is a large amount of data, and what is its capacity? A frequently asked question from the sales team is: "The customer has 100GB of data and wants to know whether our product can handle it. If so, how much RAM / disk storage is required?" This article has pointers to large data corpora.
How do you generate that data? The easiest way is to take some sample data and multiply it using scripts. Another option is to generate data from random values. The main disadvantage of these approaches is that the data has very little unique content, so it may not give the desired results. Below are some links to large data sets.
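The "multiply a sample" approach above can be sketched in a few lines of Python. This is a minimal sketch, not a production tool: the file paths and the copy-marker line are examples, and each copy is tagged with a counter so the records are not byte-identical.

```python
# Sketch: grow a small sample file into a large test corpus by
# repeating it. Each copy gets a marker line so records differ slightly.
def multiply_sample(sample_path, out_path, target_bytes):
    """Append numbered copies of the sample until roughly target_bytes are written."""
    with open(sample_path, "rb") as f:
        sample = f.read()
    written = 0
    copy = 0
    with open(out_path, "wb") as out:
        while written < target_bytes:
            chunk = b"--- copy %d ---\n" % copy + sample
            out.write(chunk)
            written += len(chunk)
            copy += 1
    return written
```

For a real load test you would vary names, IDs, and timestamps inside each copy as well; otherwise compression, caching, and deduplication in the system under test can make the results look better than they are.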
Wikipedia:Database. Wikipedia offers free copies of all available content to interested users. The data is available in multiple languages, and content along with images can be downloaded.
http://en.wikipedia.org/wiki/Wikipedia:Database_download
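If you script the download, the dump files follow a predictable naming convention on dumps.wikimedia.org. A small helper can build the URL for any language edition; the "latest pages-articles" naming used here is an assumption based on that convention, so verify it against the download page before relying on it.

```python
# Sketch: build the URL of the "latest pages-articles" Wikipedia dump
# for a given language edition. Naming convention assumed from
# dumps.wikimedia.org; check the download page before scripting.
def wikipedia_dump_url(lang="en"):
    wiki = "%swiki" % lang
    return ("https://dumps.wikimedia.org/%s/latest/"
            "%s-latest-pages-articles.xml.bz2" % (wiki, wiki))
```

The English dump is tens of GB compressed, so stream it to disk (for example with `urllib.request.urlretrieve`) rather than loading it into memory.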
Common Crawl builds and maintains an open crawl of the web accessible to everyone. The data is stored in an Amazon S3 bucket, and the requester may have to spend some money to access it.
https://www.commoncrawl.org/
EDRM File Formats Data Set consists of 381 files covering 200 file formats.
http://www.edrm.net/resources/data-sets/edrm-file-format-data-set
Apache Mahout is an Apache TLP project for creating scalable machine-learning algorithms. The Mahout wiki has many links to free and paid corpus data.
https://cwiki.apache.org/confluence/display/MAHOUT/Collections
The EDRM Enron Email Data Set v2 consists of Enron e-mail messages and attachments in two sets of downloadable compressed files: XML and PST.
http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2
The ClueWeb09 dataset was created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages, collected in January and February 2009. The dataset is used by several tracks of the TREC conference. http://lemurproject.org/clueweb09/
DMOZ (Open Directory Project) is the largest, most comprehensive human-edited directory of the Web. It has collections of URLs in different categories, and DMOZ is one of the main sources for internet search engines.
http://www.dmoz.org/rdf.html
theinfo.org - This is a site for large data sets and the people who love them: the scrapers and crawlers who collect them, the academics and geeks who process them, the designers and artists who visualize them. It's a place where they can exchange
tips and tricks, develop and share tools together, and begin to integrate their particular projects.
http://theinfo.org/
Project Gutenberg offers over 36,000 free ebooks to download to your PC, Kindle, Android, iOS or other portable device.
http://www.gutenberg.org/
The Million Song Dataset has data related to tracks and artists. http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
Mailing list archives. Subscribe to any mailing list and you will get dozens of emails; many groups and communities also provide an option to download the full mail archive.
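Downloaded archives usually come in mbox format, which Python's standard `mailbox` module can read directly. This sketch pulls out subject/body pairs so the messages can be fed to an application under test; the handling of multipart messages is deliberately simplistic (it takes only the first part).

```python
# Sketch: read a downloaded mbox mailing-list archive into
# (subject, body) pairs using the standard-library mailbox module.
import mailbox

def load_messages(mbox_path):
    """Return a list of (subject, body) pairs from an mbox archive."""
    out = []
    for msg in mailbox.mbox(mbox_path):
        if msg.is_multipart():
            # Simplification: take only the first MIME part's payload.
            body = msg.get_payload(0).get_payload()
        else:
            body = msg.get_payload()
        out.append((msg.get("Subject", ""), body))
    return out
```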
Sometimes we may not get the type or format of data we want. In this situation, we can download the data anyway and write a script to convert it to the desired format.
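Such a conversion script is usually short. As an illustration, here is a minimal sketch that turns a downloaded CSV corpus into JSON Lines; the column names in the test are examples, and the target format is whatever your application actually ingests.

```python
# Sketch: convert a CSV corpus into JSON Lines (one JSON object per
# line), a common ingestion format. Returns the number of records.
import csv
import json

def csv_to_jsonl(csv_path, jsonl_path):
    count = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(jsonl_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            dst.write(json.dumps(row) + "\n")
            count += 1
    return count
```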
These large data sets help you load test your app and understand its capacity and bottlenecks. However, you cannot use them to validate test results: if you build a search engine, you cannot verify that a given keyword should return a particular number of hits.
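To turn such a corpus into capacity numbers, you only need a thin harness around your app's ingestion call that counts records and measures throughput. This is a minimal sketch; `ingest` here is a hypothetical stand-in for whatever API loads data into the system under test.

```python
# Sketch: feed corpus records to the app under test and report
# throughput. `ingest` is a placeholder for the real ingestion call.
import time

def measure_ingest(records, ingest):
    """Return (record_count, records_per_second) for one ingestion run."""
    start = time.perf_counter()
    n = 0
    for rec in records:
        ingest(rec)
        n += 1
    elapsed = time.perf_counter() - start
    rate = n / elapsed if elapsed > 0 else float("inf")
    return n, rate
```

Running it repeatedly with growing corpus sizes (1GB, 10GB, 100GB) shows where throughput degrades, which is exactly the capacity question the sales team asks.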
Now every company is moving towards the cloud. People talk about big data, and the sources above offer a way to obtain such data so that an application can be well tested.