End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning, ERICDATA Higher Education Knowledge Base
Title
End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning
Authors: Chao Wang, Wei Luo, Jia-Rui Zhu, Ying-Chun Xia, Jin He, Li-Chuan Gu
Abstract

Visual grounding locates target objects or regions in an image based on a natural language expression. Most current methods extract visual features and text embeddings independently, and then perform complex fusion reasoning to locate the target objects mentioned in the query text. However, visual features extracted independently often include many that are irrelevant to the query text or misleading; these affect the subsequent multimodal fusion module and degrade target localization. This study introduces a combined network model based on the transformer architecture that achieves more accurate visual grounding by using the query text to guide visual feature generation and by performing multi-stage fusion reasoning. Specifically, the visual feature generation module, guided by query text features, reduces interference from irrelevant features and generates visual features related to the query text. The multi-stage fusion reasoning module takes these relevant visual features and the query text embeddings, performs multi-stage interactive reasoning, and further infers the correlation between the target image and the query text, so as to accurately localize the object described by the query text. The effectiveness of the proposed model is verified experimentally on five public datasets, where it outperforms state-of-the-art methods. It improves top-1 accuracy over the previous state-of-the-art methods by 1.04%, 2.23%, 1.00%, and 2.51% on TestA and TestB of the RefCOCO and RefCOCO+ datasets, respectively.
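
The idea of letting the query text guide visual feature generation can be illustrated with a toy sketch. This is not the authors' architecture, which the abstract does not specify in detail; it only assumes a single scaled dot-product relevance score between each visual token and one pooled text embedding, followed by a softmax reweighting. The function name `guide_visual_features` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def guide_visual_features(visual_tokens, text_embedding):
    """Reweight visual tokens by their relevance to the query text.

    Assumption (not from the paper): relevance is a scaled dot product
    between each visual token and a single pooled text embedding.
    """
    dim = len(text_embedding)
    scores = [dot(v, text_embedding) / math.sqrt(dim) for v in visual_tokens]
    weights = softmax(scores)
    # Tokens that match the query keep more of their magnitude;
    # tokens unrelated to the query are suppressed.
    guided = [[w * x for x in v] for w, v in zip(weights, visual_tokens)]
    return weights, guided


# Toy example: three 2-D "patch" features and one 2-D query embedding.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_embedding = [2.0, 0.0]
weights, guided = guide_visual_features(visual_tokens, text_embedding)
```

In this toy input the first token is aligned with the query, so it receives the largest weight, while the orthogonal second token receives the smallest. A multi-stage variant would repeat a step like this several times, feeding the reweighted features back in together with the text embeddings.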

 

Pages: 083-095
Keywords: visual grounding, query text guidance, Swin-transformer, attention module, multi-stage reasoning
Journal: 電腦學刊 (Journal of Computers)
Issue: 202402 (Vol. 35, No. 1)
DOI: 10.53106/199115992024023501006