Open Source Datasets

Kinetics


A large-scale, high-quality dataset of URL links to approximately 300,000 video clips covering 400 human action classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 400 video clips. Each clip is human-annotated with a single action class and lasts around 10 seconds.

View paper • Read more & access the dataset

AQuA-RAT (Algebra Question Answering with Rationales) 

A large-scale dataset consisting of approximately 100,000 algebraic word problems. The solution to each question is explained step by step in natural language. This data is used to train a program generation model that learns to generate the explanation while also generating the program that solves the question.

View paper •  View source on GitHub

dSprites - Disentanglement testing Sprites dataset

This dataset consists of 737,280 images of 2D shapes, procedurally generated from five independent ground-truth latent factors that control the shape, scale, rotation, and x and y position of a sprite. This data can be used to assess the disentanglement properties of unsupervised learning methods.

View source on GitHub
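
The image count above follows from taking every combination of the latent factor values. The sketch below checks that arithmetic, assuming the per-factor sizes listed in the dSprites repository README (3 shapes, 6 scales, 40 orientations, 32 x positions, 32 y positions). The helper names and the index ordering (last factor varying fastest) are illustrative assumptions, not part of the official release.

```python
from math import prod

# Assumed number of values per latent factor (from the dSprites README).
FACTOR_SIZES = {
    "shape": 3,
    "scale": 6,
    "orientation": 40,
    "pos_x": 32,
    "pos_y": 32,
}

# Every combination of factor values yields exactly one image.
TOTAL_IMAGES = prod(FACTOR_SIZES.values())


def factors_to_index(factors):
    """Map a dict of factor class indices to a flat image index.

    Hypothetical helper: assumes a mixed-radix layout in which the
    last factor (pos_y) varies fastest.
    """
    index = 0
    for name, size in FACTOR_SIZES.items():
        index = index * size + factors[name]
    return index


def index_to_factors(index):
    """Inverse of factors_to_index under the same layout assumption."""
    if not 0 <= index < TOTAL_IMAGES:
        raise IndexError(f"index {index} out of range")
    values = {}
    for name in reversed(list(FACTOR_SIZES)):
        index, values[name] = divmod(index, FACTOR_SIZES[name])
    return {name: values[name] for name in FACTOR_SIZES}


if __name__ == "__main__":
    print(TOTAL_IMAGES)  # 3 * 6 * 40 * 32 * 32 = 737280
    print(index_to_factors(12345))
```

Because the factors are sampled on a full grid, a flat index and a tuple of factor classes carry the same information, which is what makes the dataset convenient for disentanglement metrics that need known ground-truth factors.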

DeepMind CNN/Daily Mail Reading Comprehension Corpus

This dataset contains over 1.5 million question-and-answer pairs for a reading comprehension task based on articles from CNN and the Daily Mail. Questions, answers and context are anonymised with random entity markers, thereby forcing systems to answer questions purely based on the context provided. This dataset accompanies the 'Teaching Machines to Read and Comprehend' paper.

View paper •  View source on GitHub 
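
The anonymisation step described above can be illustrated with a small sketch. It assumes the entity mentions are already known and replaces each with a randomly assigned marker of the form `@entityN`. The function names, the sample sentence, and the seed parameter are hypothetical; the corpus's own generation scripts may work differently.

```python
import random
import re


def build_entity_mapping(entities, seed=None):
    """Assign each entity a randomly permuted marker such as '@entity2'."""
    rng = random.Random(seed)
    ids = list(range(len(entities)))
    rng.shuffle(ids)
    return {entity: f"@entity{i}" for entity, i in zip(entities, ids)}


def anonymise(text, mapping):
    """Replace every known entity mention in text with its marker."""
    if not mapping:
        return text
    # Try longer names first so a short name never clobbers part of a longer one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


if __name__ == "__main__":
    # Made-up example text, not taken from the corpus.
    mapping = build_entity_mapping(["Alice", "Bob", "Paris"], seed=0)
    print(anonymise("Alice met Bob in Paris.", mapping))
    print(anonymise("Who met Bob?", mapping))
```

Applying the same mapping to the context and the question keeps them consistent with each other, while the random permutation prevents a model from relying on prior knowledge about a named entity.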

Metacontrol for Adaptive Imagination-Based Optimization task

An artificially generated dataset for the spaceship task from 'Metacontrol for Adaptive Imagination-Based Optimization'. We generated five datasets, each containing scenes with a different number of planets (ranging from one to five). Each dataset consists of 100,000 training scenes and 1,000 testing scenes.

View paper •  View source on GitHub 

Collectible Card Game to Code 

This release contains the language-to-code datasets described in our paper 'Latent Predictor Networks for Code Generation'.

View paper •  View source on GitHub 

Unsupervised Data Generated for GeoQuery and SAIL 

This dataset contains the unsupervised data generated for the GeoQuery and SAIL semantic parsing tasks, as described in our paper 'Semantic Parsing with Semi-Supervised Sequential Autoencoders'.

View paper •  View source on GitHub