Lang2LTL-2: Grounding Spatiotemporal Navigation Commands Using Large Language and Vision-Language Models

IROS 2024

Brown University

We deployed our language grounding system on Spot, a quadruped mobile manipulator, to ground spatiotemporal navigation commands in both indoor and outdoor environments.

Abstract

Grounding spatiotemporal navigation commands to structured task specifications enables autonomous robots to understand a broad range of natural language commands and solve long-horizon tasks with safety guarantees. Prior work mostly focuses on grounding either spatial or temporally extended language for robots, not both.
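As an illustrative example (the command and landmark names are ours, not drawn from the paper's dataset), a spatiotemporal command such as "Visit the bookstore, but do not cross the bridge until you have been to the bank" can be grounded to the linear temporal logic (LTL) formula

  F bookstore ∧ (¬bridge U bank)

where each proposition holds when the robot is at the named landmark, F means "eventually", and U means "until".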

We propose a modular system that leverages pretrained large language and vision-language models and multimodal semantic information to ground spatiotemporal navigation commands in novel city-scale environments without retraining.
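The sketch below illustrates what such a modular pipeline might look like in Python. All function, class, and attribute names (llm.lift, vlm.score, semantic_map.landmarks, and so on) are hypothetical placeholders for this illustration, not the actual Lang2LTL-2 API.

  # Illustrative sketch of a modular spatiotemporal grounding pipeline.
  # Module interfaces are assumptions, not the actual Lang2LTL-2 code.
  from dataclasses import dataclass

  @dataclass
  class GroundedSpec:
      formula: str                 # lifted LTL formula over proposition symbols
      groundings: dict[str, str]   # proposition symbol -> landmark ID in the map

  def ground_command(command: str, semantic_map, llm, vlm) -> GroundedSpec:
      # 1. Use the pretrained LLM to extract referring expressions and lift the
      #    command to an LTL template over placeholder propositions,
      #    e.g. "F a & (!b U c)" with a = "the bookstore", etc.
      referring_exprs, lifted_formula = llm.lift(command)

      # 2. Resolve each referring expression to a landmark in the semantic map,
      #    scoring candidates with multimodal (text + image) similarity
      #    computed by the pretrained VLM.
      groundings = {}
      for symbol, expr in referring_exprs.items():
          scores = {
              landmark.id: vlm.score(expr, landmark.text_desc, landmark.image)
              for landmark in semantic_map.landmarks
          }
          groundings[symbol] = max(scores, key=scores.get)

      # 3. Return the lifted formula together with its groundings; substituting
      #    the landmark IDs for the propositions yields a specification a
      #    planner can execute.
      return GroundedSpec(formula=lifted_formula, groundings=groundings)

Because the modules exchange only text, formulas, and landmark IDs, the pretrained models can in principle be applied to a new environment's semantic map without retraining, which is the property the system relies on for novel city-scale environments.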

Our language grounding system achieves 93.53% accuracy on a dataset of 21,780 semantically diverse natural language commands from unseen environments. An ablation study validates the need for the different input modalities. We also show that a physical robot equipped with the same system, without modification, can execute 50 semantically diverse natural language commands in both indoor and outdoor environments.

Robot Demonstrations

Please see the videos of robot demonstrations and their corresponding natural language spatiotemporal navigation commands. [videos, commands]

BibTeX

@inproceedings{liu2024lang2ltl2,
  title     = {{Lang2LTL}-2: Grounding Spatiotemporal Navigation Commands Using Large Language and Vision-Language Models},
  author    = {Liu, Jason Xinyu and Shah, Ankit and Konidaris, George and Tellex, Stefanie and Paulius, David},
  booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  year      = {2024}
}