Grounding spatiotemporal navigation commands to structured task specifications enables autonomous robots to understand a broad range of natural language commands and solve long-horizon tasks with safety guarantees. Prior work has mostly focused on grounding either spatial or temporally extended language for robots.
We propose a modular system that leverages pretrained large language and vision-language models, together with multimodal semantic information, to ground spatiotemporal navigation commands in novel city-scale environments without retraining.
Our language grounding system achieves 93.53% accuracy on a dataset of 21,780 semantically diverse natural language commands from unseen environments. An ablation study validates the need for each input modality. We also show that a physical robot equipped with the same system, without modification, can execute 50 semantically diverse natural language commands in both indoor and outdoor environments.
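The modular design can be pictured as a pipeline of swappable components: a language module extracts landmark referring expressions from the command, a multimodal module grounds them to landmarks in a semantic map, and a translation module produces a temporal logic specification for a planner. The Python sketch below is illustrative only; the module boundaries, function names, and embedding-based matching are assumptions for exposition, not the paper's exact interfaces.

```python
# Minimal sketch of a modular spatiotemporal grounding pipeline.
# All module names and interfaces here are assumptions for illustration,
# not the authors' implementation.

from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class Landmark:
    name: str
    embedding: Sequence[float]   # precomputed multimodal (text/image) embedding

@dataclass
class GroundedTask:
    ltl_formula: str             # temporal task specification
    groundings: Dict[str, str]   # proposition symbol -> landmark name

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def ground_command(
    command: str,
    landmarks: List[Landmark],
    extract_referring_expressions: Callable[[str], List[str]],  # e.g. an LLM call
    embed_text: Callable[[str], Sequence[float]],               # e.g. a VLM text encoder
    translate_to_ltl: Callable[[str, Dict[str, str]], str],     # e.g. an LLM call
) -> GroundedTask:
    # 1. Extract landmark referring expressions from the command (language module).
    expressions = extract_referring_expressions(command)

    # 2. Ground each expression to the most similar landmark in the semantic map
    #    by comparing embeddings (multimodal grounding module).
    groundings: Dict[str, str] = {}
    for i, expr in enumerate(expressions):
        query = embed_text(expr)
        best = max(landmarks, key=lambda lm: cosine(query, lm.embedding))
        groundings[f"p{i}"] = best.name

    # 3. Lift the command into a temporal logic formula over the grounded
    #    propositions (translation module); the result can be handed to a planner.
    formula = translate_to_ltl(command, groundings)
    return GroundedTask(ltl_formula=formula, groundings=groundings)
```

Because each stage is injected as a callable, the pretrained language and vision-language models can be swapped or applied to a new environment's semantic map without retraining any component, which is the property the abstract emphasizes.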
@inproceedings{liu2024lang2ltl2,
title = {{Lang2LTL}-2: Grounding Spatiotemporal Navigation Commands Using Large Language and Vision-Language Models},
author = {Liu, Jason Xinyu and Shah, Ankit and Konidaris, George and Tellex, Stefanie and Paulius, David},
booktitle = {IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year = {2024}
}