VADER
Visual Affordance Detection and Error Recovery with Multi Robot Human Collaboration



  • Michael Ahn
  • Montserrat Gonzalez~Arenas
  • Matthew Bennice
  • Noah Brown
  • Christine Chan
  • Byron David
  • Anthony Francis
  • Gavin Gonzalez
  • Rainer Hessmer
  • Tomas Jackson
  • Nikhil J Joshi
  • Daniel Lam
  • Tsang-Wei Edward Lee
  • Alex Luong
  • Sharath Maddineni
  • Harsh Patel
  • Jodilyn Peralta
  • Jornell Quiambao
  • Diego Reyes
  • Rosario M Jauregui Ruano
  • Dorsa Sadigh
  • Pannag Sanketi
  • Leila Takayama
  • Pavel Vodenski
  • Fei Xia

  • Authors listed in alphabetical order (see paper appendix for contribution statement).

Abstract

Robots today can exploit the rich world knowledge of large language models to chain simple behavioral skills into long-horizon tasks. However, robots often get interrupted during long-horizon tasks due to primitive skill failures and dynamic environments. We propose VADER, a plan, execute, detect framework with seeking help as a new skill that enables robots to recover and complete long-horizon tasks with the help of humans or other robots. VADER leverages visual question answering (VQA) modules to detect visual affordances and recognize execution errors. It then generates prompts for a language model planner (LMP), which decides when to seek help from another robot or human to recover from errors in long-horizon task execution. We show the effectiveness of VADER through an experiment with a mobile manipulator asking for help from another mobile manipulator or a human to complete two long-horizon robotic tasks. Our user study with 19 participants suggests VADER is perceived to complete tasks more successfully than a control that does not ask for help, yet VADER is perceived as equally capable even though it receives human help.

Overview

We have seen powerful machine learning models enhance capabilities across robotics. An interesting case is that of language model planners, or LMPs, which use large language models to extend their generalization potential to task and motion planning. LMPs enable the execution of complex, long-horizon tasks that can be expressed as a chain of simpler, low-level skills that a robot has acquired (by way of machine learning techniques that absorb data from past experiences). While LMPs readily exploit the rich knowledge of the world embedded in these large language models, they lack grounding in the robot's environment, which is often necessary to accommodate dynamic changes to the surroundings that require replanning. The inability to replan often yields undesired execution results. For example, the robot in the following video fails to pick up the debris from the table, but continues approaching the trash bin - a behavior that would adversely impact the usefulness of such a robotic assistant in real life.

Inner Monologue-like extensions improve the reliability of execution by incorporating a variety of environment feedback into the LMP planning loop. However, these systems typically focus on failures that can be resolved by a single robot without any external help. In situations like the one depicted in the video above, the robot is unlikely to recover from the failure even after multiple retries without external human intervention, since the coke can is outside the reach that the picking skill's training assumed.

Our key insight is that by grounding in their environments, robots can detect their failures and collaborate with other robots and humans to course-correct through replanning. Today's visual question answering (VQA) systems can provide the required grounding mechanism, whereby a natural language summary of a visual observation can be generated in the context of a query. But multi-robot human collaboration is hard due to the lack of a distributed communication mechanism that enables agents to post or claim tasks and provide assistance to each other.

We introduce a general-purpose technique called VADER that grounds the agent in its environment by plugging VQA into the LMP loop. VADER uses feedback from (visual) affordance detection and invokes error detection to generate requests for help from other agents. VADER is effectively a plan, execute, detect loop.

This allows the system to dynamically detect failures and employ recovery measures, thus enabling it to complete long-horizon tasks. We also introduce a generic cloud-based communication framework, the Human Robot Fleet Orchestration System (HRFS), to facilitate this assistive collaboration, instantiated with robot agents working alongside humans.

With VADER, an agent is able to recover from failures in situations like the one above, as shown in the video below.

VADER: Algorithm

In a nutshell, VADER is a plan, execute, detect framework that employs a Visual Question Answering (VQA) module to close the planning and execution loop via state or context detection. For a long-horizon task, say "I am out of ink in my pen and need something to write", the large language model based planner (LMP) is invoked to express it in terms of low-level skills that the robot is capable of executing, as shown in the cartoon on the right. Given the current environmental state, the skill with the highest affordance, say navigate to the desk, is selected and executed. Simultaneously, a natural language description of the expected post-execution state is generated, say in front of the desk. Upon completion of the current skill execution, this expectation is checked against a fresh state estimate using the VQA module, and the (natural language) output is fed back to trigger potential re-planning in response to any deviations or errors. Deviations could include failures in skill execution, losing the capacity to execute a skill (for example, breaking a gripper), or even incorrectly picking up tasks the robot cannot complete given its skillset. In our example on the right, the robot failed to navigate to the desk due to a closed door blocking its path.

Expressed more formally, the VADER algorithm is a plan, execute, detect loop over the robot's low-level skills, with seeking help as an additional skill.
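The following is a minimal Python-style sketch of this loop. The observe, lmp_plan, vqa, execute, and request_help callables are hypothetical placeholders standing in for the robot's camera, the LMP, the VQA module, skill execution, and an HRFS help request; they are not the actual VADER implementation.

```python
# Minimal sketch of VADER's plan, execute, detect loop.
# All callables passed in are hypothetical placeholders, not the real system.

def vader_loop(task, observe, lmp_plan, vqa, execute, request_help, own_skills):
    feedback = ""  # natural-language feedback fed back to the planner
    while True:
        image = observe()  # fresh visual observation of the scene

        # The LMP proposes the next skill and a natural-language description
        # of the expected post-execution state (e.g. "in front of the desk").
        skill, expected_state = lmp_plan(task, feedback)
        if skill is None:  # planner declares the task complete
            return True

        # Affordance detection: is the proposed skill feasible right now?
        if vqa(image, f"Can the robot {skill} now?") != "yes":
            feedback = f"'{skill}' is not currently afforded."
            if skill not in own_skills:
                # Recovery by seeking help from another agent (e.g. via HRFS).
                request_help(skill)
            continue  # replan with the new feedback

        execute(skill)  # run the selected low-level skill

        # Error detection: check the expectation against a fresh observation.
        if vqa(observe(), f"Is the robot {expected_state}?") == "yes":
            feedback = f"'{skill}' succeeded."
        else:
            feedback = f"'{skill}' failed; the robot is not {expected_state}."
```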

VADER adds three key components to an LMP to enable recovery from failures: (a) detection of skill affordances and execution errors with visual question answering, (b) replanning based on the detected categories of failures, and (c) recovery based on seeking help from other agents.

An interesting aspect of failure recovery in VADER is seeking help from other agents (human or robot) with skillsets beyond those of the seeker - for example, requesting help with opening the path-blocking door in our example above. Another agent capable of opening the door may or may not be available in the vicinity. To facilitate collaboration without assuming a helper coexists in immediate proximity, we developed a cloud-based communication protocol we call the Human Robot Fleet Orchestration System (HRFS).

Human Robot Fleet Orchestration System (HRFS)

The Human Robot Fleet Orchestration System (HRFS) is a cloud-based service that offers a real-time communication platform to its participants. It is inherently elastic and scalable with respect to the number of active simultaneous participants.

At its heart, HRFS is a database with transactional guarantees that maintains the most current state of the system, along with a plug-and-play client interface that facilitates interaction with the database and provides communication scheduling between clients via the database state. The database maintains information about tasks pushed to HRFS and their associated metadata, such as the requester, task input specifications like a text instruction, task execution context like preferences of the requesting entity, any constraints to be followed during execution, task execution logging events pushed by the executor, and the timestamps associated with these events. For example, a robot may push a task like "open the door" with an executor preference of "human" and a context providing the building, position address, or identifier where the door is located.
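As an illustration, a hedged sketch of what such a task record could look like as a Python data structure is shown below; the field names and types are assumptions chosen for exposition, not the actual HRFS schema.

```python
from dataclasses import dataclass, field
from typing import Optional
import time

# Hypothetical shape of an HRFS task record; field names are illustrative
# assumptions, not the actual HRFS schema.
@dataclass
class Task:
    task_id: str
    requester: str                       # e.g. "robot_042"
    instruction: str                     # e.g. "open the door"
    executor_preference: Optional[str]   # e.g. "human"
    context: dict = field(default_factory=dict)    # e.g. {"building": "bldg123"}
    constraints: list = field(default_factory=list)
    status: str = "open"                 # open -> claimed -> done
    claimed_by: Optional[str] = None
    log: list = field(default_factory=list)         # execution logging events
    created_at: float = field(default_factory=time.time)
```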

HRFS maintains an elastic list of rooms, or constraints that identify a logical boundary - for example, a building, an area such as a building's lobby, a type of morphology, a type of task, or simply a robot identifier. A newly participating entity - human or robot alike - subscribes to HRFS, providing one or more types of constraints or tasks it is interested in serving. For example, upon coming alive, a robot may declare that it is available for executing tasks in building bldg123, or tasks that involve a wiping tool. Note that a participant can change its subscription pattern at any time while alive and connected to HRFS. For example, a robot delivering an item from the kitchen to the lobby would be subscribed to the former at the beginning of task execution (and hence receive the delivery task request), but may update its subscription to the lobby after a successful delivery. HRFS maintains a dynamic mapping from each current participant to its subscriptions.
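A minimal sketch of such a participant-to-subscription mapping and its updates is given below; the data structures and function names are assumptions for illustration, not the HRFS implementation.

```python
from collections import defaultdict

# Illustrative sketch of the dynamic participant -> subscriptions mapping;
# these structures are assumptions for exposition, not the real HRFS state.
subscriptions: dict[str, set[str]] = defaultdict(set)

def subscribe(participant: str, rooms: list[str]) -> None:
    """Add rooms/constraints a participant is willing to serve."""
    subscriptions[participant].update(rooms)

def unsubscribe(participant: str, rooms: list[str]) -> None:
    """Drop rooms the participant no longer serves."""
    subscriptions[participant].difference_update(rooms)

def eligible_participants(task_rooms: set[str]) -> list[str]:
    """Participants whose subscriptions overlap the task's rooms/constraints."""
    return [p for p, rooms in subscriptions.items() if rooms & task_rooms]

# Example: a delivery robot re-subscribes as it moves from the kitchen to the lobby.
subscribe("robot_042", ["bldg123", "room:kitchen", "skill:wipe"])
unsubscribe("robot_042", ["room:kitchen"])
subscribe("robot_042", ["room:lobby"])
```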

As new tasks become available, HRFS pushes them to all eligible participants, and any available participant can claim one. Once a task is claimed by a participant, HRFS locks it so that no other entity can claim it. Each participant, while alive on HRFS, is required to report a periodic heartbeat; if it fails to do so for an extended period, HRFS may mark the entity as absent and recover or reassign any tasks it currently holds. During execution of a claimed task, the executor can push execution logging events to report the latest execution status back to the requester. Once the task is completed, the executor is required to update HRFS, which in turn pushes all of these updates back to the requester. The two or more entities bound together during a task execution in this way are termed in-contract. Note that participants never interact with each other directly, and hence may not be aware of each others' identities unless in-contract.
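The executor-side lifecycle could be sketched as follows; the hrfs client object and its claim, heartbeat, log, and complete calls are hypothetical stand-ins for the actual client interface.

```python
import threading
import time

# Hedged sketch of the executor-side task lifecycle. The `hrfs` client and its
# claim/heartbeat/log/complete methods are assumed names, not the real interface.

def serve_task(hrfs, participant_id: str, task: dict) -> None:
    # Claim the task; HRFS locks it so no other participant can claim it.
    if not hrfs.claim(task["task_id"], participant_id):
        return  # another participant claimed it first

    # Report a periodic heartbeat so HRFS does not mark this participant
    # absent and reassign the claimed task.
    stop = threading.Event()

    def heartbeat():
        while not stop.is_set():
            hrfs.heartbeat(participant_id)
            time.sleep(5.0)

    threading.Thread(target=heartbeat, daemon=True).start()
    try:
        # Push execution logging events; HRFS relays them to the requester,
        # with whom this executor is now in-contract.
        hrfs.log(task["task_id"], "started")
        ...  # run the actual skill, e.g. open the door
        hrfs.log(task["task_id"], "finished")
        hrfs.complete(task["task_id"])
    finally:
        stop.set()
```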

The HRFS implementation used in this work was built on a custom Firebase Realtime Database instance, which readily provides much of the functionality needed for such fleet orchestration, and was inspired by this example video.
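For a rough sense of how a task might be posted and observed on a Firebase Realtime Database via the Python Admin SDK, a hedged sketch follows; the credential path, database URL, and "tasks" node layout are assumptions, not the project's actual configuration, and claim/locking logic is omitted.

```python
import firebase_admin
from firebase_admin import credentials, db

# Illustrative use of the Firebase Realtime Database Python Admin SDK;
# credential path, database URL, and node layout are assumptions.
cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred, {"databaseURL": "https://example-hrfs.firebaseio.com"})

tasks = db.reference("tasks")

# A requester (e.g. a blocked robot) posts a help request.
new_task = tasks.push({
    "instruction": "open the door",
    "requester": "robot_042",
    "executor_preference": "human",
    "context": {"building": "bldg123"},
    "status": "open",
})

# A subscribed participant listens for new or updated tasks and may then
# claim one (atomic claim logic not shown here).
def on_task_event(event):
    print(event.event_type, event.path, event.data)

listener = tasks.listen(on_task_event)
```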