Electronics Science Technology and Application



Published: 2026-01-30

Issue: Vol 12 No 4 (2025)

Section: Articles

A review of multimodal representation learning

Xiangyu Chen

College of Big Data and Software Engineering, Zhejiang Wanli University

Qinglin Wang

School of Statistics, University of International Business and Economics

Zhen Wang

College of Big Data and Software Engineering, Zhejiang Wanli University


DOI: https://doi.org/10.59429/esta.v12i4.12661


Keywords: multimodal; representation learning; deep learning; modal combination


Abstract

This paper provides an overview of multimodal representation learning, using modal combination types as the core classification framework. It systematically reviews representation learning methods, association mechanisms, and fusion strategies under different modal pairs. Unlike previous reviews organized by model architecture or task category, this paper adopts a bottom-up perspective on modal interactions, revealing how the collaborative characteristics of different modal pairs influence representation modelling and summarizing the technical evolution of cross-modal learning. In addition, it summarizes the general paradigms and practical challenges of multimodal architecture design, offering new perspectives for model optimization in complex scenarios.
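To illustrate the kind of fusion strategies the abstract refers to, the Python sketch below contrasts two common image-text designs: a joint representation built by concatenation, and a coordinated representation aligned with a CLIP-style contrastive loss. This is a minimal, hypothetical example, not code from the paper; the module names, feature dimensions, and placeholder encoder outputs are assumptions made for illustration.

# Minimal sketch (hypothetical; not from the paper): two common ways to combine
# an image embedding and a text embedding into a shared representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatFusion(nn.Module):
    """Joint representation: project each modality, concatenate, and mix with an MLP."""
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.mixer = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, img_feat, txt_feat):
        z = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.mixer(z)  # fused vector for a downstream task head

class ContrastiveAlignment(nn.Module):
    """Coordinated representation (CLIP-style): separate projections aligned by a contrastive loss."""
    def __init__(self, img_dim=2048, txt_dim=768, embed=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed)
        self.txt_proj = nn.Linear(txt_dim, embed)
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))  # log(1/0.07), as in CLIP

    def forward(self, img_feat, txt_feat):
        img = F.normalize(self.img_proj(img_feat), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()        # pairwise similarities
        labels = torch.arange(img.size(0), device=img.device)  # matched pairs on the diagonal
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Usage with random features standing in for frozen encoder outputs:
img_feat, txt_feat = torch.randn(4, 2048), torch.randn(4, 768)
fused = ConcatFusion()(img_feat, txt_feat)         # shape (4, 512)
loss = ContrastiveAlignment()(img_feat, txt_feat)  # scalar alignment loss

The first design yields a single fused vector per pair and suits prediction tasks on paired inputs, while the second keeps modality-specific encoders and suits cross-modal retrieval; which one is preferable depends on the modal pair and the task, which is the trade-off the review examines.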




