TextPSG: Panoptic Scene Graph Generation from Textual Descriptions

ICCV 2023


Chengyang Zhao1    Yikang Shen2     Zhenfang Chen2    Mingyu Ding3     Chuang Gan2,4    

1Peking University    2MIT-IBM Watson AI Lab    3UC Berkeley    4UMass Amherst   


[Figure]

Problem Overview. Different from the traditional bbox-based form of the scene graph as shown in (a), Caption-to-PSG aims to generate the mask-based panoptic scene graph. In Caption-to-PSG, the model has no access to any location priors, explicit region-entity links, or pre-defined concept sets. Consequently, the model is required to learn partitioning and grounding as illustrated in (b), as well as object semantics and relation predicates as illustrated in (c), all purely from textual descriptions.







Abstract


Panoptic Scene Graph has recently been proposed for comprehensive scene understanding. However, previous works adopt a fully-supervised learning manner, requiring large amounts of pixel-wise densely-annotated data, which is tedious and expensive to obtain. To address this limitation, we study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG). The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs. The problem is very challenging due to three constraints: 1) no location priors; 2) no explicit links between visual regions and textual entities; and 3) no pre-defined concept sets. To tackle this problem, we propose a new framework, TextPSG, consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques. The region grouper first groups image pixels into different segments, and the entity grounder then aligns visual segments with language entities based on the textual description of the segment being referred to. The grounding results can thus serve as pseudo labels, enabling the segment merger to learn segment similarity and guiding the label generator to learn object semantics and relation predicates, resulting in fine-grained structured scene understanding. Our framework is effective, significantly outperforming the baselines and achieving strong out-of-distribution robustness. We perform comprehensive ablation studies to corroborate the effectiveness of our design choices and provide an in-depth analysis to highlight future directions.



Problem


Panoptic Scene Graph Generation from Purely Textual Descriptions.

"Purely" for Three Constraints:
• No location priors.
• No explicit region-entity links.
• No pre-defined concept sets.

Two Key Challenges:
• Learning to ground entities in language onto the visual scene, i.e., developing the ability to perform partitioning and grounding purely from textual descriptions.
• Learning the object semantics and relation predicates from textual descriptions, without pre-defined fixed object and relation vocabularies.



Framework


[Figure: TextPSG framework overview]
Region Grouper: partitioning the image in a hierarchical way

Entity Grounder: grounding the textual entities onto the image segments through a fine-grained contrastive learning strategy (see the sketch below)

Segment Merger: leveraging the grounding results as explicit supervision to learn similarity matrices for inference-time merging

Label Generator: predicting object semantics and relation predicates auto-regressively; leveraging the common sense pre-learned by a pre-trained language model; using a prompt-embedding-based technique (PET) to better incorporate this common sense
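
To make the grounder-merger interplay concrete, below is a minimal PyTorch-style sketch of how region-entity alignment can yield pseudo labels that supervise the segment-similarity matrix. The tensor shapes, temperature value, and binary-cross-entropy form of the merger objective are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def ground_entities(seg_emb, ent_emb, tau=0.07):
    # seg_emb: (N_seg, D) segment embeddings from the region grouper
    # ent_emb: (N_ent, D) entity embeddings from the caption (hypothetical shapes)
    seg_emb = F.normalize(seg_emb, dim=-1)
    ent_emb = F.normalize(ent_emb, dim=-1)
    sim = ent_emb @ seg_emb.t() / tau          # cosine similarities, (N_ent, N_seg)
    assign = sim.softmax(dim=-1)               # soft grounding of each entity over segments
    return assign, assign.argmax(dim=-1)       # hard grounding used as pseudo labels

def merger_loss(seg_emb, grounding):
    # Segments grounded to the same entity should be predicted as mergeable.
    pred_sim = torch.sigmoid(seg_emb @ seg_emb.t())                 # predicted similarity matrix
    target = (grounding[:, None] == grounding[None, :]).float()     # pseudo-label agreement
    return F.binary_cross_entropy(pred_sim, target)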



Results


Quantitative Results

Our quantitative results on Caption-to-PSG are shown below. To make a fair comparison with bbox-based scene graphs generated by baselines, we evaluate our generated PSGs in both mask and bbox mode. For the latter, all masks in both prediction and ground truth are converted into bboxes (i.e., the mask area's enclosing rectangle) for evaluation, resulting in an easier setting than the former. The results show that our framework (Ours) significantly outperforms all the baselines under the same constraints on both PhrDet (14.37 vs. 3.71 N5R100) and SGDet (5.48 vs. 2.7 N5R100). Our method also shows better results compared with SGGNLS-o on all metrics and all tasks (on PhrDet, 14.37 vs. 7.93 N5R100; on SGDet, 5.48 vs. 5.02 N5R100) although SGGNLS-o utilizes location priors by leveraging a pre-trained detector. The results demonstrate that our framework is more effective for learning a good panoptic structured scene understanding.

[Table: Quantitative results on Caption-to-PSG]
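
For reference, the bbox-mode evaluation converts every mask into its enclosing rectangle; a minimal sketch of this conversion, assuming NumPy binary masks, is:

import numpy as np

def mask_to_bbox(mask):
    # mask: binary (H, W) array; returns (x_min, y_min, x_max, y_max) or None if empty
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())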


Qualitative Results

We provide typical qualitative results below to further show our framework's effectiveness. Compared with SGGNLS-o, our framework has the following advantages. First, our framework is able to provide a fine-grained semantic label for each pixel in the image, reaching a panoptic understanding, while SGGNLS-o can only provide sparse bboxes produced by the pre-trained detector. Note that categories with irregular shapes (e.g., trees) are hard to label precisely with bboxes. Second, compared with SGGNLS-o, our framework can generate more comprehensive object semantics and relation predicates, such as "dry grass field" and "land at", showing the open-vocabulary potential of our framework. More qualitative results can be found in the supplementary material.

[Figure: Qualitative comparison with SGGNLS-o]


Out-of-distribution (OOD) Robustness

We further analyze another key advantage of our framework, i.e., its robustness in OOD cases. Since SGGNLS-c and SGGNLS-o both rely on a pre-trained detector to locate objects, their performance depends heavily on whether the object semantics in the scene are covered by the detector. Based on the object semantics covered by the detector, we split the ground-truth triplets into an in-distribution (ID) set and an OOD set. For triplets within the ID set, both the subject and object semantics are covered, while for triplets in the OOD set, at least one of the two is not covered. As shown below, both SGGNLS-c and SGGNLS-o suffer a significant performance drop from the ID set to the OOD set; on the OOD set, the triplets can hardly be retrieved. In contrast, our framework, having learned to locate objects purely from textual descriptions, reaches similar performance on both sets, which demonstrates its OOD robustness for PSG generation.

[Table: Performance on the ID and OOD triplet sets]
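
The split itself is a simple filter over the ground-truth triplets; the sketch below assumes a set detector_vocab containing the object categories the pre-trained detector supports (hypothetical variable names).

def split_id_ood(triplets, detector_vocab):
    # triplets: iterable of (subject, predicate, object) category names
    id_set, ood_set = [], []
    for subj, pred, obj in triplets:
        if subj in detector_vocab and obj in detector_vocab:
            id_set.append((subj, pred, obj))     # both endpoints covered by the detector
        else:
            ood_set.append((subj, pred, obj))    # at least one endpoint is out of vocabulary
    return id_set, ood_set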


Application on Text-supervised Semantic Segmentation

As a side product, we observe that our entity grounder and segment merger can also enhance text-supervised semantic segmentation (TSSS). Based on the original GroupViT, we replace its multi-label contrastive loss with our entity grounder and segment merger, and then finetune the model on the COCO Caption dataset. As shown below, compared with GroupViT directly finetuned on the COCO Caption dataset, the explicit learning of merging in our modules boosts the model by an absolute 2.15% in mean Intersection over Union (mIoU) on COCO, which demonstrates the effectiveness of our proposed modules for better object localization.

[Table: mIoU comparison with GroupViT on COCO]
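
The gain is measured in mean Intersection over Union. As a reference, a minimal single-image mIoU sketch is shown below; in practice, per-class intersections and unions are accumulated over the whole validation set before averaging.

import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    # pred, gt: integer label maps of the same shape
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                                  # class absent from both maps
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))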



Analysis


Failure Case Analysis

We provide more qualitative results below and analyze the failure cases. We find that our framework has the following limitations which could be further improved in the future.
The strategy we use to convert semantic segmentation into instance segmentation is not entirely effective. For simplicity, in TextPSG, we treat each connected component in the semantic segmentation as an individual object instance. However, this strategy may fail when instances overlap or are occluded, resulting in either an underestimation or an overestimation of the number of instances. As shown below, this strategy successfully separates the two cows in (ii), but mistakenly divides the car behind the tree into three parts in (i).
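
A minimal sketch of this conversion, assuming a NumPy semantic label map and SciPy connected-component labeling, is given below; the occlusion failure in (i) arises exactly because one occluded instance breaks into several components.

import numpy as np
from scipy import ndimage

def semantic_to_instances(sem_seg, ignore_label=255):
    # sem_seg: (H, W) integer semantic label map
    instances = []
    for c in np.unique(sem_seg):
        if c == ignore_label:
            continue
        comps, n = ndimage.label(sem_seg == c)        # connected components of class c
        for i in range(1, n + 1):
            instances.append({"category": int(c), "mask": comps == i})
    return instances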

Our framework faces difficulty in locating small objects in the scene due to limitations in resolution and the grouping strategy for location. As shown below, in (ii) and (iv), our method can identify large objects such as large cows, birds, grass, and sea, but struggles to locate relatively small objects such as small cows in (ii) and people in (iv).

The relation prediction of our framework requires enhancement, as it is not adequately conditioned on the image. While the label generator uses both image features and predicted object semantics to determine the relation, it sometimes leans heavily on the object semantics and neglects the actual image content. As shown below, in (i), the relations between the blue car mask and the green car mask are both predicted as "in front of", which is not reasonable; "beside" would be a more appropriate prediction (in this case, the first limitation regarding the segmentation conversion also applies).

[Figure: Failure cases]


Model Diagnosis

For a clearer understanding of the efficacy of our framework, we conduct a further model diagnosis to answer the following question.
Why does our framework only achieve semantic segmentation through learning, rather than panoptic segmentation (and thus requires further segmentation conversion to obtain instance segmentation)?
Here, we use two captions of different granularity to perform region-entity alignment, with (a) describing the two sheep individually and (b) merging them in plural form. The result shows that our framework is capable of assigning distinct masks to individual instances. However, the nature of caption data, where objects of the same semantics are often merged in plural form, prevents our framework from differentiating instances; it is the weak supervision provided by the caption data that constrains our framework. We argue that an image-caption-pair dataset with finer-grained captions may enable panoptic segmentation to be learned directly, which could be a valuable future direction to explore.

[Figure: Region-entity alignment with captions of different granularity]




Citation



@InProceedings{Zhao_2023_ICCV,
    author    = {Zhao, Chengyang and Shen, Yikang and Chen, Zhenfang and Ding, Mingyu and Gan, Chuang},
    title     = {TextPSG: Panoptic Scene Graph Generation from Textual Descriptions},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2839-2850}
}



Contact


If you have any questions, please feel free to contact us:

  • Chengyang Zhao: zhaochengyang@pku.edu.cn
  • Zhenfang Chen: chenzhenfang2013@gmail.com
  • Chuang Gan: ganchuang1990@gmail.com