A closer look at dataset sources
by snonov
A list of some dataset sources
URLs and pointers to datasets:
- Awesome Public Datasets: https://github.com/awesomedata/awesome-public-datasets
- http://hadoopilluminated.com/hadoop_illuminated/Public_Bigdata_Sets.html
- https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016
External providers:
- Data portals (a worldwide list of datasets): http://dataportals.org/
- Another data portal (a worldwide list of datasets): https://datahub.io/
- Internet Archive: https://archive.org/about/ (for example, the Hacker News dump: https://archive.org/download/HackerNewsStoriesAndCommentsDump)
- Kaggle public datasets: https://www.kaggle.com/datasets
- AWS Open Data: https://registry.opendata.aws/
- Google BigQuery public datasets: https://cloud.google.com/bigquery/public-data/
- Pushshift, which holds some common feed datasets such as Reddit, Twitter, …: http://files.pushshift.io/
- Laboratory for Web Algorithmics: http://law.di.unimi.it/datasets.php
Owners and providers:
- Wikipedia datasets: https://en.wikipedia.org/wiki/Wikipedia:Database_download
- RATP datasets: https://data.ratp.fr/explore/?sort=modified
- SNCF datasets: https://data.sncf.com/explore/?sort=modified
Official government sources:
- Etalab: https://www.etalab.gouv.fr/ and data.gouv.fr: https://www.data.gouv.fr/fr/
- City of Paris datasets: https://opendata.paris.fr/explore/?sort=modified