I am currently a senior researcher at Tencent Hunyuan, working on large multimodal models and physical AI foundation models.
I obtained my Ph.D. degree from the Intelligent Vision Group (IVG), Department of Automation, Tsinghua University in 2025, advised by Prof. Jiwen Lu and Jie Zhou. Before that, I received my B.Eng. degree from the Department of Electronic Engineering, Tsinghua University in 2020.
I am broadly interested in large language models and computer vision. My current research focuses on multi-modal large language models and large vision models.
2021-07: 2 papers (including 1 oral) on 3D vision and video understanding were accepted to ICCV 2021.
Publications
* indicates equal contribution
ViQ: Text-Aligned Visual Quantized Representations at Any Resolution
Xumin Yu*, Zuyan Liu*, Zhenyu Yang*, Yuhao Dong, Shengsheng Qian, Jiwen Lu, Han Hu, Yongming Rao#
European Conference on Computer Vision (ECCV), 2026
[arXiv][Code][Models]
ViQ is a framework for discrete visual representations that balances semantics and details while supporting native-resolution inputs, serving as a unified, general discrete representation for arbitrary visual inputs for both understanding and high-fidelity reconstruction.
HY-Embodied-0.5 is the first version of our embodied foundation models for real-world agents, attaining the best performance on 16 out of 22 widely used benchmarks of visual perception, spatial intelligence and embodied reasoning.
We propose a new contrastive regression (CoRe) framework to learn the relative scores by pair-wise comparison, which highlights the differences between videos and guides the models to learn the key hints for assessment.
Graph Interaction Networks for Relation Transfer in Human Activity Videos
Yansong Tang, Yi Wei, Xumin Yu, Jiwen Lu, Jie Zhou
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), 2020
[Paper]
We propose a graph interaction networks (GINs) model for transferring relation knowledge across two graphs in two different scenarios for video analysis, including a newly proposed setting for unsupervised skeleton-based action recognition across different datasets, and supervised group activity recognition with multi-modal inputs.
Learning fine-grained estimation of physiological states from coarse-grained labels by distribution restoration
Zengyi Qin, Jiansheng Chen, Zhenyu Jiang, Xumin Yu, Chunhua Hu, Yu Ma, Suhua Miao and Rongsong Zhou
Scientific Reports, 2020
[Paper][Code]
Our method allows machine learning algorithms to perform fine-grained estimation of physiological states (e.g., sleep depth) even if the training labels are coarse-grained.
Experiences
Tencent Hunyuan
Multi-Modal Model Group, Researcher
Topic: Multi-Modal
ByteDance
Intelligent Creation Group, Research Intern
Topic: Human AIGC
Honors and Awards
China National Scholarship (PhD Student), 2024
Excellent Undergraduate in Tsinghua University, 2020
The First Prize of Microsoft Imagine Cup, China Finals, 2018