Abstract
Text-to-image person re-identification (ReID) is a common subproblem in the fields of person re-identification and image-text retrieval. Recent approaches generally follow a dual-stream network structure that extracts image features and text features separately. Because images and text do not interact deeply in this design, it is difficult for the network to learn a highly semantic feature representation. In addition, for both image data and text data, the feature extraction process is modeled in a regular way, for example by using a Transformer to extract sequence embeddings. This type of modeling disregards the inherent relationships among multimodal input embeddings. This paper proposes a more flexible approach to mining multimodal data that uniformly treats the data as graphs, so that the extraction and interaction of multimodal information are accomplished by message passing between graph nodes. First, a unified multimodal feature extraction and fusion network based on the graph convolutional network is proposed, which enables multimodal information to progress from ‘local’ to ‘global’. Second, an asymmetric multilevel alignment module, which focuses on more accurate ‘local’ information from a ‘global’ perspective, is proposed to progressively divide the multimodal information at each level. Last, a cross-modal representation matching strategy based on similarity distribution and mutual information is proposed to achieve cross-modal alignment. The proposed algorithm is simple and efficient, and test results on three public datasets (CUHK-PEDES, ICFG-PEDES and RSTPReID) show that it achieves performance at the state-of-the-art (SOTA) level.
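The idea of treating both modalities as graph nodes that exchange messages can be illustrated with a minimal sketch. The code below is not the network described in the abstract; it assumes the standard graph-convolution update H' = ReLU(D^-1/2 (A + I) D^-1/2 H W) and uses a hypothetical toy graph in which two nodes stand for image patches and two for word tokens. The function name `gcn_layer`, the edge list and the feature values are all illustrative assumptions.

```python
import math


def gcn_layer(features, edges, weight):
    """One graph-convolution step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(features)
    # Build a symmetric adjacency matrix with self-loops (A + I).
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        adj[i][j] = 1.0
        adj[j][i] = 1.0
    # Symmetric degree normalisation.
    deg = [sum(row) for row in adj]
    norm = [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Message passing: each node aggregates its neighbours' features.
    d_in = len(features[0])
    agg = [[sum(norm[i][k] * features[k][c] for k in range(n))
            for c in range(d_in)] for i in range(n)]
    # Linear projection followed by ReLU.
    d_out = len(weight[0])
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(d_in)))
             for o in range(d_out)] for i in range(n)]


# Hypothetical toy graph: nodes 0-1 are image patches, nodes 2-3 are word tokens.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two intra-modal edges plus one cross-modal edge (patch 0 <-> token 2).
edges = [(0, 1), (2, 3), (0, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]  # identity weights keep the arithmetic easy to follow

updated = gcn_layer(features, edges, identity)
```

After one step, node 3 (which started with an all-zero feature) carries information from its neighbour, node 2. Stacking such layers lets information spread further across the graph, which is one way to picture the progression from ‘local’ to ‘global’ mentioned above.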
Original language | English |
---|---|
Pages (from-to) | 1-12 |
Number of pages | 12 |
Journal | IEEE Transactions on Multimedia |
Early online date | 19 Dec 2023 |
Publication status | E-pub ahead of print - 19 Dec 2023 |
Bibliographical note
Publisher Copyright: IEEE
Keywords
- Convolutional neural networks
- Cross-modal retrieval
- Data mining
- Feature extraction
- Graph convolutional network
- Graph neural networks
- Image-text retrieval
- Person re-identification
- Person search
- Semantics
- Task analysis
- Visualization