End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning

Chao Wang; Wei Luo; Jia-Rui Zhu; Ying-Chun Xia; Jin He; Li-Chuan Gu



Abstract	Visual grounding locates target objects or regions in an image based on a natural language expression. Most current methods extract visual features and text embeddings independently and then perform complex fusion reasoning to locate the target object mentioned in the query text. However, visual features extracted independently of the query often include many that are irrelevant to the query text or misleading, which degrades the subsequent multimodal fusion module and, in turn, target localization. This study introduces a combined network model based on the transformer architecture that achieves more accurate visual grounding by using the query text to guide visual feature generation and by performing multi-stage fusion reasoning. Specifically, the visual feature generation module uses the query text features as guidance to reduce interference from irrelevant features and to generate visual features related to the query text. The multi-stage fusion reasoning module takes the relevant visual features produced by the visual feature generation module, together with the query text embeddings, and performs multi-stage interactive reasoning to further infer the correlation between the target image and the query text, so as to accurately localize the object described by the query text. The effectiveness of the proposed model is experimentally verified on five public datasets, where it outperforms state-of-the-art methods. It improves on the previous state-of-the-art methods by 1.04%, 2.23%, 1.00% and 2.51% in top-1 accuracy on TestA and TestB of the RefCOCO and RefCOCO+ datasets, respectively.
Pages	83-95
Keywords	visual grounding, query text guidance, Swin-transformer, attention module, multi-stage reasoning
Journal	Journal of Computers (電腦學刊)
Issue	February 2024 (Vol. 35, No. 1)
DOI	10.53106/199115992024023501006
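
The abstract reports its gains as top-1 accuracy but does not spell out how a prediction is scored. The following is a minimal sketch of one way such a metric can be computed, not the authors' evaluation code. It assumes the convention common in visual grounding benchmarks that a predicted box counts as correct when its intersection over union (IoU) with the ground-truth box exceeds 0.5. The threshold, the `(x1, y1, x2, y2)` box format, and the function names `iou` and `top1_accuracy` are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates with x1 <= x2, y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def top1_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of queries whose single predicted box matches the ground truth.

    A prediction is a hit when IoU > threshold (0.5 is an assumed default).
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, t in zip(preds, targets) if iou(p, t) > threshold)
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(top1_accuracy(predictions, ground_truth))
```

In this toy example the first prediction matches its target exactly and the second overlaps its target with an IoU of 1/7, so one of the two queries counts as a hit.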
