Vision Transformer and DNN based human gaze map prediction in cross-view match task

Poster No:

1124 

Submission Type:

Abstract Submission 

Authors:

Yidong Hu1, Li Tong2, Ying Zeng2, Bin Yan2

Institutions:

1Information Engineering University, Zhengzhou, Henan, 2Information Engineering University, Zhengzhou, Henan

First Author:

Yidong Hu  
Information Engineering University
Zhengzhou, Henan

Co-Author(s):

Li Tong  
Information Engineering University
Zhengzhou, Henan
Ying Zeng  
Information Engineering University
Zhengzhou, Henan
Bin Yan  
Information Engineering University
Zhengzhou, Henan

Introduction:

Human gaze region prediction aims to simulate human selective attention mechanisms. Traditional models lack the high-level semantic information of images, leading to a semantic gap. This study proposes a hybrid model combining a Vision Transformer (ViT) with deep neural networks, adopting an encoder-decoder architecture to capture both low-level visual features and high-level semantic features of images. The model is optimized using saliency evaluation metrics and residual connections. Experiments demonstrate that the model can automatically learn human gaze features, outperforming existing techniques and achieving the best performance among the four models compared.

Methods:

Experiment
We provided professional training to 20 participants to ensure the quality of the eye-tracking data. The experiment used 4,000 randomly sampled image pairs from CVACT (Liu et al. 2019), with a 3:1 ratio of matched to unmatched pairs, and eye-tracking data were collected with the Tobii Pro Spectrum device. Preprocessing included data cleaning, organization, filtering, calibration, and visualization of fixation points.
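The abstract does not specify how the cleaned fixation points were rasterized into continuous gaze maps for training, so the following is only a minimal sketch of one common approach (accumulating fixations into a grid and smoothing with a Gaussian kernel); the function name, the sigma value, and the use of SciPy are illustrative assumptions, not the authors' pipeline.

import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_gaze_map(fixations, height, width, sigma=25.0):
    """Rasterize (x, y) fixation points into a continuous gaze map
    by Gaussian smoothing, then normalize to [0, 1]."""
    gaze = np.zeros((height, width), dtype=np.float32)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            gaze[yi, xi] += 1.0          # accumulate fixation counts
    gaze = gaussian_filter(gaze, sigma=sigma)  # spread counts into a smooth map
    if gaze.max() > 0:
        gaze /= gaze.max()               # normalize for use as ground truth
    return gaze

In practice, the smoothing width would be chosen to reflect the eye tracker's spatial accuracy and the stimulus resolution.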
The proposed model
As shown in Fig. 1, the model architecture consists of three parts: a Hybrid Encoder, a Convertor, and a Transformer Decoder. The Hybrid Encoder combines high-level semantic features extracted by VGG-16 (Huang et al. 2015) with low-level visual features extracted by Tokens-to-Token ViT (Yuan et al. 2021) from the image, where irrelevant Tokens are eliminated through fold transformation. The Convertor uses standard Transformer layers to map features into the decoding space and introduces a Saliency Token to enhance performance. The Transformer Decoder employs Reverse Tokens-to-Token ViT (Liu et al. 2021) to restore the Tokens to the original spatial size, enabling dense prediction. To compensate for information loss, residual connections are added between components (Ma et al. 2023). Finally, combining the predictions at each level with the ground truth, the loss function is defined as a linear combination of Mean Absolute Error (MAE), maximum F-measure (maxF), and maximum enhanced-alignment measure (E_ξ^max) (Qin et al. 2020; Thomas n.d.).
Supporting Image: Fig1.jpg
   ·The proposed model
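The objective above is described only as a linear combination of MAE, maxF, and E_ξ^max. The sketch below is a minimal PyTorch illustration of such a combination, not the authors' implementation: MAE is exact, while the F-measure and enhanced-alignment terms use threshold-free, differentiable surrogates; the weights w_mae, w_f, w_e and the surrogate formulations are assumptions.

import torch

# pred and gt are assumed to be tensors of shape (B, H, W) with values in [0, 1].

def mae_loss(pred, gt):
    # Mean absolute error between predicted and ground-truth gaze maps.
    return torch.mean(torch.abs(pred - gt))

def soft_fmeasure(pred, gt, beta2=0.3, eps=1e-8):
    # Threshold-free surrogate for the F-measure (beta^2 = 0.3, as is common
    # in the saliency literature); a stand-in for maxF, which is normally the
    # maximum over binarization thresholds.
    tp = (pred * gt).sum(dim=(-2, -1))
    precision = tp / (pred.sum(dim=(-2, -1)) + eps)
    recall = tp / (gt.sum(dim=(-2, -1)) + eps)
    f = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return f.mean()

def enhanced_alignment(pred, gt, eps=1e-8):
    # Enhanced-alignment (E-measure) style term: alignment of the
    # mean-removed prediction and ground truth, mapped into [0, 1].
    pred_c = pred - pred.mean(dim=(-2, -1), keepdim=True)
    gt_c = gt - gt.mean(dim=(-2, -1), keepdim=True)
    align = 2 * pred_c * gt_c / (pred_c ** 2 + gt_c ** 2 + eps)
    return (((align + 1) ** 2) / 4).mean()

def gaze_loss(pred, gt, w_mae=1.0, w_f=1.0, w_e=1.0):
    # Linear combination; terms that should be maximized enter as (1 - term),
    # so minimizing the total drives all three toward their best values.
    return (w_mae * mae_loss(pred, gt)
            + w_f * (1.0 - soft_fmeasure(pred, gt))
            + w_e * (1.0 - enhanced_alignment(pred, gt)))

At evaluation time, maxF and E_ξ^max would instead be computed by sweeping binarization thresholds, as in the cited saliency literature.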
 

Results:

Fig. 2 compares the human gaze prediction results of four models (U^2-Net (Qin et al. 2020), RGB-VST (Liu et al. 2021), Liu et al., and Ours) on MIT1003 (Fig. 2(a)-(b)), CAT2000 (Fig. 2(c)-(d)), and the self-constructed CVACT dataset (Fig. 2(e)-(h)). On CVACT, the models are able to predict human gaze regions in the cross-view matching task. Although the predicted regions are slightly larger than the actual gaze regions, they almost completely cover the core of the actual gaze. On MIT1003 and CAT2000, our model achieved the best performance among the four models, accurately capturing salient regions.
Supporting Image: Fig2.jpg
 

Conclusions:

This paper proposes a novel human gaze prediction model designed to predict gaze regions in cross-view image matching. The model integrates high-level semantic features extracted by VGG-16 with low-level visual features extracted by Tokens-to-Token ViT, effectively combining the various visual features involved in human visual cognition. Experimental results on the self-constructed cross-view matching dataset and on public datasets demonstrate that the model can automatically learn human gaze features and performs excellently in gaze map prediction.

Modeling and Analysis Methods:

Classification and Predictive Modeling 1
Methods Development

Perception, Attention and Motor Behavior:

Perception: Visual 2
Perception and Attention Other

Keywords:

Imitation
Vision
Other - Eye movement

1|2Indicates the priority used for review

Abstract Information

By submitting your proposal, you grant permission for the Organization for Human Brain Mapping (OHBM) to distribute your work in any format, including video, audio print and electronic text through OHBM OnDemand, social media channels, the OHBM website, or other electronic publications and media.

I accept

The Open Science Special Interest Group (OSSIG) is introducing a reproducibility challenge for OHBM 2025. This new initiative aims to enhance the reproducibility of scientific results and foster collaborations between labs. Teams will consist of a “source” party and a “reproducing” party, and will be evaluated on the success of their replication, the openness of the source work, and additional deliverables. Propose your OHBM abstract(s) as source work for future OHBM meetings by selecting one of the following options:

I am submitting this abstract as the outcome of the OHBM-OSSIG reproducibility challenge, having reproduced previously submitted work with the original author(s)’ agreement. I have cited the original work and acknowledged the origin team in the abstract.

Please indicate below if your study was a "resting state" or "task-activation” study.

Task-activation

Healthy subjects only or patients (note that patient studies may also involve healthy subjects):

Healthy subjects

Was this research conducted in the United States?

No

Were any human subjects research approved by the relevant Institutional Review Board or ethics panel? NOTE: Any human subjects studies without IRB approval will be automatically rejected.

Yes

Were any animal research approved by the relevant IACUC or other animal research panel? NOTE: Any animal studies without IACUC approval will be automatically rejected.

Not applicable

Please indicate which methods were used in your research:

Computational modeling
Other, Please specify  -   Deep learning

Provide references using APA citation style.

Huang X. et al. (2015). SALICON: Reducing the Semantic Gap in Saliency Prediction by Adapting Deep Neural Networks. In: 2015 IEEE International Conference on Computer Vision (ICCV). Presented at the 2015 IEEE International Conference on Computer Vision (ICCV). Santiago, Chile: IEEE. p. 262–270.
Liu L. et al. (2019). Lending Orientation to Neural Networks for Cross-View Geo-Localization. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Presented at the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Long Beach, CA, USA: IEEE. p. 5617–5626.
Liu N. et al. (2021). Visual Saliency Transformer. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Presented at the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE. p. 4702–4712.
Ma C. et al. (2023). Eye-Gaze-Guided Vision Transformer for Rectifying Shortcut Learning. IEEE Trans Med Imaging. 42:3384–3394.
Qin X. et al. (2020). U2-Net: Going Deeper with Nested U-Structure for Salient Object Detection. Pattern Recognition. 106:107404.
Thomas C. (n.d.). OpenSalicon: An Open Source Implementation of the SALICON Saliency Model. Technical Report.
Yuan L. et al. (2021). Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Presented at the 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Montreal, QC, Canada: IEEE. p. 538–547.

UNESCO Institute of Statistics and World Bank Waiver Form

I attest that I currently live, work, or study in a country on the UNESCO Institute of Statistics and World Bank List of Low and Middle Income Countries list provided.

No