State classification of objects and relations is essential for a plethora of tasks, from scene understanding to robot planning and
manipulation. Many such long-horizon tasks require accurate and varied state predictions for entities in scenes. For example, planning
for “setting up the table” requires classifying whether the cup is NextTo the plate, whether the utensils are
OnTop of the table, and whether the microwave is Open.
The goal of state classification is to precisely answer such queries about specific entities in an image, and determine whether they
satisfy particular conditions across a range of attributes and relations.
However, the combinatorial space of objects (e.g., cup, plate, microwave) and predicates (e.g., NextTo, OnTop,
Open) gives rise to an explosion of possible object-predicate combinations that is intractable to obtain corresponding
training data for. In addition, real-world systems operating in dynamic environments must generalize to queries with novel predicates,
often with only a few examples. Hence, an essential but difficult consideration for state classification models is to quickly learn to
adapt to out-of-distribution queries.