Research and Medical Application of Vision Transformer and Capsule Network Models

Speaker:  Junying Chen – Guangzhou, China
Topic(s):  Artificial Intelligence, Machine Learning, Computer Vision, Natural Language Processing


Vision Transformer and capsule network models are deep learning models that have attracted considerable attention in recent years. The vision Transformer uses a self-attention mechanism to assign different weights to each part of the input data and combines it with an encoder-decoder structure to achieve strong experimental results. However, because it requires large-scale training data and an extremely long training schedule, pre-training and optimizing a vision Transformer is very challenging. The capsule network explicitly learns the relatively fixed relationships between part and whole features. This gives it an equivariance mechanism that the convolutional neural network lacks and reduces the need for data augmentation during training. Capsule network training is memory-intensive, however, which makes it difficult to train deeper capsule networks to their full potential. These problems become more prominent when processing medical image and video data, because the input samples are large. This lecture focuses on solving these problems. It introduces research results and progress on vision Transformer and capsule network models, discusses methods that speed up model training while improving task accuracy, and demonstrates the models applied to intelligent medical tasks.

About this Lecture

Number of Slides:  50
Duration:  70 minutes
Languages Available:  Chinese (Simplified), English
Last Updated: 
