Open Source Datasets


Mathematics Dataset

A large-scale, extendable dataset that generates question-and-answer pairs across a range of question types at roughly school-level difficulty. It is designed to test the mathematical learning and algebraic reasoning skills of learning models.

View paper • View source on GitHub


StreetLearn

The StreetLearn dataset for academic research, based on Google Street View images of two cities.

View paper • Request Dataset


Boxoban levels

This repository contains levels for Boxoban, a box-pushing puzzle game inspired by Sokoban.

View paper • View source on GitHub

Abstract reasoning matrices

Progressive matrices dataset, as described in 'Measuring abstract reasoning in neural networks'.

View paper • View source on GitHub

Spatial Language Integrating Model (SLIM)

This dataset consists of virtual scenes rendered in MuJoCo, with multiple views of each scene presented in multiple modalities: images, and synthetic or natural language descriptions. Each scene consists of two or three objects placed in a square walled room. For each of the 10 camera viewpoints, we render a 3D view of the scene as seen from that viewpoint, together with a synthetically generated description of the scene.

View paper • View source on GitHub

Logical entailment

This repository contains an entailment dataset for propositional logic, and code for generating that dataset. It also contains code for parsing the dataset in Python.

View paper • View source on GitHub
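
As a rough illustration of what entailment in propositional logic means for the Logical entailment dataset above, the sketch below checks whether one formula entails another by enumerating all truth assignments. The nested-tuple formula encoding and the function names are assumptions made for this example; they are not the repository's data format or parsing code.

```python
from itertools import product


def variables(formula):
    """Collect the variable names that appear in a formula."""
    if formula[0] == "var":
        return {formula[1]}
    return set().union(*(variables(sub) for sub in formula[1:]))


def evaluate(formula, assignment):
    """Evaluate a formula under a dict mapping variable names to booleans."""
    op = formula[0]
    if op == "var":
        return assignment[formula[1]]
    if op == "not":
        return not evaluate(formula[1], assignment)
    if op == "and":
        return evaluate(formula[1], assignment) and evaluate(formula[2], assignment)
    if op == "or":
        return evaluate(formula[1], assignment) or evaluate(formula[2], assignment)
    if op == "implies":
        return (not evaluate(formula[1], assignment)) or evaluate(formula[2], assignment)
    raise ValueError(f"unknown operator: {op!r}")


def entails(premise, conclusion):
    """Return True if every assignment satisfying premise also satisfies conclusion."""
    names = sorted(variables(premise) | variables(conclusion))
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if evaluate(premise, assignment) and not evaluate(conclusion, assignment):
            return False  # found a counterexample
    return True


p, q = ("var", "p"), ("var", "q")
modus_ponens = entails(("and", p, ("implies", p, q)), q)
weakening_fails = entails(("or", p, q), p)
```

Truth-table enumeration is exponential in the number of variables, so it only suits small formulas; it is shown here because it follows the definition of entailment directly.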


Kinetics

A large-scale, high-quality dataset of URL links to approximately 300,000 video clips that cover 400 human action classes, including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands and hugging. Each action class has at least 400 video clips. Each clip is human-annotated with a single action class and lasts around 10 seconds.

View paper • View source on GitHub


NarrativeQA

This repository contains the NarrativeQA dataset. It includes the list of documents with Wikipedia summaries, links to full stories, and questions and answers.

View paper • View source on GitHub

AQuA-RAT (Algebra Question Answering with Rationales) 

A large-scale dataset consisting of approximately 100,000 algebraic word problems. The solution to each question is explained step by step in natural language. The data is used to train a program-generation model that learns to generate the explanation while also generating the program that solves the question.

View paper • View source on GitHub

dSprites - Disentanglement testing Sprites dataset

This dataset consists of 737,280 images of 2D shapes, procedurally generated from 5 independent ground-truth latent factors that control the shape, scale, rotation and position (x and y) of a sprite. This data can be used to assess the disentanglement properties of unsupervised learning methods.

View source on GitHub
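
The dSprites image count above can be checked as the product of the number of values each latent factor takes. The sketch below assumes per-factor value counts (3 shapes, 6 scales, 40 orientations, 32 x positions, 32 y positions) that are not stated on this page. The helper `latents_to_index` is hypothetical: it uses a plain row-major convention that may not match the ordering of the released file.

```python
from math import prod

# Assumed number of values per latent factor (not stated on this page).
FACTOR_SIZES = {
    "shape": 3,
    "scale": 6,
    "orientation": 40,
    "pos_x": 32,
    "pos_y": 32,
}


def total_images(factor_sizes):
    """One image per combination of factor values."""
    return prod(factor_sizes.values())


def latents_to_index(latents, factor_sizes=FACTOR_SIZES):
    """Hypothetical row-major index of one combination of factor values."""
    index = 0
    for name, size in factor_sizes.items():
        value = latents[name]
        if not 0 <= value < size:
            raise ValueError(f"{name}={value} is outside [0, {size})")
        index = index * size + value
    return index


count = total_images(FACTOR_SIZES)
first = latents_to_index({name: 0 for name in FACTOR_SIZES})
last = latents_to_index({name: size - 1 for name, size in FACTOR_SIZES.items()})
```

Under these assumptions `count` comes to 3 × 6 × 40 × 32 × 32 = 737,280, which agrees with the figure in the description.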

Metacontrol for Adaptive Imagination-Based Optimization task

An artificially generated dataset for the spaceship task from 'Metacontrol for Adaptive Imagination-Based Optimization'. We generated five datasets, each containing scenes with a different number of planets (ranging from a single planet to five planets). Each dataset consists of 100,000 training scenes and 1,000 testing scenes.

View paper • View source on GitHub

Collectible Card Game to Code 

This dataset contains the language-to-code datasets described in our paper 'Latent Predictor Networks for Code Generation'.

View paper • View source on GitHub

Unsupervised Data Generated for GeoQuery and SAIL 

This dataset contains the unsupervised data generated for the GeoQuery and SAIL semantic parsing tasks in our paper 'Semantic Parsing with Semi-Supervised Sequential Autoencoders'.

View paper • View source on GitHub