Learning to navigate in cities without a map

How did you learn to navigate the neighborhood of your childhood, to go to a friend’s house, to your school or to the grocery store? Probably without a map and simply by remembering the visual appearance of streets and turns along the way. As you gradually explored your neighborhood, you grew more confident, mastered your whereabouts and learned new and increasingly complex paths. You may have gotten briefly lost, but found your way again thanks to landmarks, or perhaps even by looking to the sun for an impromptu compass.

Navigation is an important cognitive task that enables humans and animals to traverse, without maps, over long distances in a complex world. Such long-range navigation can simultaneously support self-localisation (“I am here”) and a representation of the goal (“I am going there”).

In Learning to Navigate in Cities Without a Map, we present an interactive navigation environment that uses first-person perspective photographs from Google Street View and gamify that environment to train an AI agent. As is standard with Street View images, faces and license plates have been blurred and are unrecognisable. We build a neural network-based artificial agent that learns to navigate multiple cities using visual information (pixels from a Street View image). Note that this research is about navigation in general rather than driving; we neither used traffic information nor tried to model vehicle control.

Our agent navigates in visually diverse environments, without having access to the map of the environment.

The agent is rewarded when it reaches a target destination (specified, for instance, as a pair of latitude and longitude coordinates), like a courier tasked with an endless set of deliveries but without a map. Over time, the agent learns to cross entire cities in this way. We also demonstrate that our agent can learn the task in multiple cities, and then robustly adapt to a new city.
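A minimal sketch of such a goal-reaching reward, assuming the goal counts as reached once the agent is within a fixed radius of it. The 100 m radius, the reward value of 1.0 and the function names are illustrative assumptions, not details given in this post.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def goal_reward(agent_latlon, goal_latlon, radius_m=100.0):
    """Sparse reward: 1.0 inside the (assumed) goal radius, 0.0 everywhere else."""
    distance = haversine_m(*agent_latlon, *goal_latlon)
    return 1.0 if distance <= radius_m else 0.0


# Two points in Paris roughly 4 km apart: no reward until the agent gets close.
eiffel_tower = (48.8584, 2.2945)
notre_dame = (48.8530, 2.3499)
print(goal_reward(eiffel_tower, notre_dame))
print(goal_reward(notre_dame, notre_dame))
```

A reward this sparse is hard to learn from over city-scale distances, so a practical setup would likely need some way of easing the agent towards distant goals; that choice is outside what this sketch covers.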

Stop-motion films of the agent trained in Paris. The images are overlaid with a map of the city, showing the goal location (in red) and the agent location and field of view (in green). Note that the agent does not see the map, only the lat/lon coordinates of the goal location.

Learning navigation without building maps

We depart from traditional approaches, which rely on explicit mapping and exploration (like a cartographer who tries to localise themselves and draw a map at the same time). Our approach, in contrast, is to learn to navigate as humans used to do, without maps, GPS localisation, or other aids, using only visual observations. We build a neural network agent that takes as input images observed from the environment and predicts the next action it should take in that environment. We train it end-to-end using deep reinforcement learning, similarly to some recent work on learning to navigate in complex 3D mazes and reinforcement learning with unsupervised auxiliary tasks for playing games. Unlike those studies, which were conducted on small-scale simulated maze environments, we utilise city-scale real-world data, including complex intersections, footpaths, tunnels, and diverse topology across London, Paris, and New York City. Moreover, the approach we use supports city-specific learning and optimisation as well as general, transferable navigation behaviours.

Modular neural network architecture that can transfer to new cities

The neural network inside our agent consists of three parts: 1) a convolutional network that can process images and extract visual features, 2) a locale-specific recurrent neural network that is implicitly tasked with memorising the environment as well as learning a representation of “here” (current position of the agent) and of “there” (location of the goal) and 3) a locale-invariant recurrent network that produces the navigation policy over the agent’s actions. The locale-specific module is designed to be interchangeable and, as its name indicates, unique to each city where the agent navigates, whereas the vision module and the policy module can be locale-invariant.
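The wiring of the three parts can be sketched as follows. This is only a structural illustration under stated assumptions: the class name MultiCityAgent, the plain callables standing in for the networks, and the choice to feed the goal into the locale-specific pathway are all hypothetical, and the recurrent state that the real modules carry between steps is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class MultiCityAgent:
    """Shared vision and policy modules plus one interchangeable pathway per city."""
    vision: Callable[[Vector], Vector]   # stand-in for the convolutional network
    policy: Callable[[Vector], int]      # stand-in for the locale-invariant policy network
    locale_pathways: Dict[str, Callable[[Vector, Vector], Vector]] = field(default_factory=dict)

    def act(self, city: str, image: Vector, goal: Vector) -> int:
        features = self.vision(image)                 # 1) extract visual features
        pathway = self.locale_pathways[city]          # pick the pathway for this city
        locale_state = pathway(features, goal)        # 2) "here" and "there" for this city
        return self.policy(locale_state)              # 3) choose an action index


# Toy stand-ins, chosen only so the example runs without a deep learning library.
agent = MultiCityAgent(
    vision=lambda image: [sum(image)],
    policy=lambda state: 1 if state[0] >= 0 else 0,
)
agent.locale_pathways["paris"] = lambda features, goal: [features[0] + goal[0]]
agent.locale_pathways["london"] = lambda features, goal: [features[0] - goal[0]]

print(agent.act("paris", [1.0, 2.0], [5.0, 0.0]))
print(agent.act("london", [1.0, 2.0], [5.0, 0.0]))
```

Keeping the per-city pathways in a dictionary keyed by city is one simple way to make them interchangeable while the vision and policy modules stay shared.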

Comparison of the CityNav architecture (a), MultiCityNav architecture with a locale-specific pathway for each city (b) and illustration of the training and transfer procedure when adapting the agent to a new city (c).

Just as in the Google Street View interface, the agent can rotate in place or move forward to the next panorama, when possible. Unlike in the Google Maps and Street View interface, the agent does not see the little arrows, the local or global map, or the famous Pegman: it needs to learn to differentiate open roads from sidewalks. The target destinations may be kilometres away in the real world and require the agent to step through hundreds of panoramas to reach them.
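The two kinds of action described above can be illustrated on a toy panorama graph. Everything here is an assumption made for the example: the graph layout, the integer headings in degrees, and the 30-degree tolerance that decides whether a forward move is possible.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical panorama graph: node id -> {bearing in degrees: neighbouring node id}.
Graph = Dict[str, Dict[int, str]]


@dataclass(frozen=True)
class AgentState:
    node: str      # id of the panorama the agent is standing in
    heading: int   # degrees clockwise from north


def angular_diff(a: int, b: int) -> int:
    """Smallest absolute angle in degrees between two headings."""
    return abs((a - b + 180) % 360 - 180)


def rotate(state: AgentState, degrees: int) -> AgentState:
    """Rotate in place; the panorama does not change."""
    return AgentState(state.node, (state.heading + degrees) % 360)


def forward(state: AgentState, graph: Graph, tolerance: int = 30) -> AgentState:
    """Step to the neighbour best aligned with the heading, or stay put if none is close."""
    edges = graph[state.node]
    if not edges:
        return state
    best = min(edges, key=lambda bearing: angular_diff(state.heading, bearing))
    if angular_diff(state.heading, best) > tolerance:
        return state  # no street in this direction: the move is not possible
    return AgentState(edges[best], state.heading)


graph: Graph = {"a": {0: "b"}, "b": {180: "a", 90: "c"}, "c": {270: "b"}}
state = AgentState("a", 90)
print(forward(state, graph))               # facing east from "a": stays at "a"
print(forward(rotate(state, -90), graph))  # facing north from "a": moves to "b"
```

Because the agent never sees the edges of the graph, it can only discover which forward moves are possible from what the images look like, which is the point made about open roads and sidewalks.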

We demonstrate that our proposed method can provide a mechanism for transferring knowledge to new cities. As with humans, when our agent visits a new city, we would expect it to have to learn a new set of landmarks, but not to have to re-learn its visual representations or its behaviours (e.g., zooming forward along streets or turning at intersections). Therefore, using the MultiCity architecture, we train first on a number of cities, then we freeze both the policy network and the visual convolutional network and train only a new locale-specific pathway on a new city. This approach enables the agent to acquire new knowledge without forgetting what it has already learned, similarly to the progressive neural networks architecture.
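The freeze-then-extend procedure can be sketched with plain parameter groups. This is an illustration under assumptions rather than the training code used in the study: the Module class, the "locale/<city>" naming, the zero initialisation and the hand-written gradient step are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Module:
    """A group of weights plus a flag saying whether training may change them."""
    weights: List[float]
    trainable: bool = True


def add_city(modules: Dict[str, Module], city: str, n_weights: int) -> None:
    """Freeze everything learned so far, then add a fresh trainable pathway for `city`."""
    for module in modules.values():
        module.trainable = False
    modules[f"locale/{city}"] = Module(weights=[0.0] * n_weights)


def sgd_step(modules: Dict[str, Module], grads: Dict[str, List[float]], lr: float = 0.1) -> None:
    """One gradient step that skips frozen modules."""
    for name, module in modules.items():
        if not module.trainable:
            continue
        module.weights = [w - lr * g for w, g in zip(module.weights, grads[name])]


# Modules after training on earlier cities (weights are made-up numbers).
modules = {
    "vision": Module([1.0, 2.0]),
    "policy": Module([3.0]),
    "locale/london": Module([0.5]),
}
add_city(modules, "nyc", n_weights=2)

# Pretend every weight received a gradient of 1.0 while training in the new city.
grads = {name: [1.0] * len(m.weights) for name, m in modules.items()}
sgd_step(modules, grads)
print({name: m.weights for name, m in modules.items()})
```

Only the new pathway moves during the update, which is why earlier knowledge in the shared modules and in the older city pathways is left intact.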

Five areas of Manhattan used in this study.

Navigation is fundamental to the study and development of artificial intelligence, and trying to replicate it in artificial agents can also help scientists understand its biological underpinnings.