pliang279/awesome-multimodal-ml: Reading list for research topics in multimodal ...

Open-source project name:

pliang279/awesome-multimodal-ml

Open-source project URL:

https://github.com/pliang279/awesome-multimodal-ml

Programming language:


Project introduction:

Reading List for Topics in Multimodal Machine Learning

By Paul Liang ([email protected]), Machine Learning Department and Language Technologies Institute, CMU, with help from members of the MultiComp Lab at LTI, CMU. If I missed any areas, papers, or datasets, please let me know!

Course content + workshops

Tutorials on Multimodal Machine Learning at CVPR 2022 and NAACL 2022

New course 11-877 Advanced Topics in Multimodal Machine Learning Spring 2022 @ CMU. It will primarily be reading and discussion-based. We plan to post discussion probes, relevant papers, and summarized discussion highlights every week on the website.

Public course content and lecture videos from 11-777 Multimodal Machine Learning, Fall 2020 @ CMU.

Research Papers

Survey Papers

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods, JAIR 2021

Experience Grounds Language, EMNLP 2020

A Survey of Reinforcement Learning Informed by Natural Language, IJCAI 2019

Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2019

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications, arXiv 2019

Deep Multimodal Representation Learning: A Survey, arXiv 2019

Guest Editorial: Image and Language Understanding, IJCV 2017

Representation Learning: A Review and New Perspectives, TPAMI 2013

A Survey of Socially Interactive Robots, 2003

Core Areas

Multimodal Representations

Balanced Multimodal Learning via On-the-fly Gradient Modulation, CVPR 2022

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast, IJCAI 2021 [code]

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, arXiv 2021

FLAVA: A Foundational Language And Vision Alignment Model, arXiv 2021

Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, arXiv 2021

MultiBench: Multiscale Benchmarks for Multimodal Representation Learning, NeurIPS 2021 [code]

Perceiver: General Perception with Iterative Attention, ICML 2021 [code]

Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021 [blog] [code]

VinVL: Revisiting Visual Representations in Vision-Language Models, arXiv 2021 [blog] [code]

12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020 [code]

Watching the World Go By: Representation Learning from Unlabeled Videos, arXiv 2020

Learning Video Representations using Contrastive Bidirectional Transformer, arXiv 2019

Visual Concept-Metaconcept Learning, NeurIPS 2019 [code]

OmniNet: A Unified Architecture for Multi-modal Multi-task Learning, arXiv 2019 [code]

Learning Representations by Maximizing Mutual Information Across Views, arXiv 2019 [code]

ViCo: Word Embeddings from Visual Co-occurrences, ICCV 2019 [code]

Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations, CVPR 2019

Multi-Task Learning of Hierarchical Vision-Language Representation, CVPR 2019

Learning Factorized Multimodal Representations, ICLR 2019 [code]

A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks, ICML 2018

Do Neural Network Cross-Modal Mappings Really Bridge Modalities?, ACL 2018

Learning Robust Visual-Semantic Embeddings, ICCV 2017

Deep Multimodal Representation Learning from Temporal Data, CVPR 2017

Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations, COLING 2016

Combining Language and Vision with a Multimodal Skip-gram Model, NAACL 2015

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, NIPS 2014

Multimodal Learning with Deep Boltzmann Machines, JMLR 2014

Learning Grounded Meaning Representations with Autoencoders, ACL 2014

DeViSE: A Deep Visual-Semantic Embedding Model, NeurIPS 2013

Multimodal Deep Learning, ICML 2011

Multimodal Fusion

Robust Contrastive Learning against Noisy Views, arXiv 2022

Cooperative Learning for Multi-view Analysis, arXiv 2022

What Makes Multi-modal Learning Better than Single (Provably), NeurIPS 2021

Efficient Multi-Modal Fusion with Diversity Analysis, ACMMM 2021

Attention Bottlenecks for Multimodal Fusion, NeurIPS 2021

Trusted Multi-View Classification, ICLR 2021 [code]

Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis, ICDM 2020

Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies, NeurIPS 2020 [code]

Deep Multimodal Fusion by Channel Exchanging, NeurIPS 2020 [code]

What Makes Training Multi-Modal Classification Networks Hard?, CVPR 2020

Dynamic Fusion for Multimodal Data, arXiv 2019

DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis, IJCAI 2019 [code]

Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling, NeurIPS 2019

XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification, IEEE TNNLS 2019 [code]

MFAS: Multimodal Fusion Architecture Search, CVPR 2019

The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision, ICLR 2019 [code]

Unifying and merging well-trained deep neural networks for inference stage, IJCAI 2018 [code]

Efficient Low-rank Multimodal Fusion with Modality-Specific Factors, ACL 2018 [code]

Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018 [code]

Tensor Fusion Network for Multimodal Sentiment Analysis, EMNLP 2017 [code]

Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework, AAAI 2015

A co-regularized approach to semi-supervised learning with multiple views, ICML 2005

Multimodal Alignment

Reconsidering Representation Alignment for Multi-view Clustering, CVPR 2021 [code]

CoMIR: Contrastive Multimodal Image Representation for Registration, NeurIPS 2020 [code]

Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019 [code]

Temporal Cycle-Consistency Learning, CVPR 2019 [code]

See, Hear, and Read: Deep Aligned Representations, arXiv 2017

On Deep Multi-View Representation Learning, ICML 2015

Unsupervised Alignment of Natural Language Instructions with Video Segments, AAAI 2014

Multimodal Alignment of Videos, MM 2014

Deep Canonical Correlation Analysis, ICML 2013 [code]

Multimodal Pretraining

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021 [code]

Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, arXiv 2021

Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 [code]

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision, EMNLP 2020 [code]

Integrating Multimodal Information in Large Pretrained Transformers, ACL 2020

VL-BERT: Pre-training of Generic Visual-Linguistic Representations, arXiv 2019 [code]

VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019 [code]

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, arXiv 2019

LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]

VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019

Multimodal Translation

Zero-Shot Text-to-Image Generation, ICML 2021 [code]

Translate-to-Recognize Networks for RGB-D Scene Recognition, CVPR 2019 [code]

Language2Pose: Natural Language Grounded Pose Forecasting, 3DV 2019 [code]

Reconstructing Faces from Voices, NeurIPS 2019 [code]

Speech2Face: Learning the Face Behind a Voice, CVPR 2019 [code]

Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities, AAAI 2019 [code]

Natural TTS Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions, ICASSP 2018 [code]

Crossmodal Retrieval

Learning with Noisy Correspondence for Cross-modal Matching, NeurIPS 2021 [code]

MURAL: Multimodal, Multitask Retrieval Across Languages, arXiv 2021

Self-Supervised Learning from Web Data for Multimodal Retrieval, arXiv 2019

Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models, CVPR 2018

Multimodal Co-learning

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, ICML 2021

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions, arXiv 2021

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision, EMNLP 2020

Foundations of Multimodal Co-learning, Information Fusion 2020

Missing or Imperfect Modalities

A Variational Information Bottleneck Approach to Multi-Omics Data Integration, AISTATS 2021 [code]

SMIL: Multimodal Learning with Severely Missing Modality, AAAI 2021

Factorized Inference in Deep Markov Models for Incomplete Multimodal Time Series, arXiv 2019

Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization, ACL 2019

Multimodal Deep Learning for Robust RGB-D Object Recognition, IROS 2015

Analysis of Multimodal Models

M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis, IEEE TVCG 2022

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, TACL 2021

Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!, EMNLP 2020

Blindfold Baselines for Embodied QA, NIPS 2018 Visually-Grounded Interaction and Language Workshop

Analyzing the Behavior of Visual Question Answering Models, EMNLP 2016

Knowledge Graphs and Knowledge Bases

MMKG: Multi-Modal Knowledge Graphs, ESWC 2019

Answering Visual-Relational Queries in Web-Extracted Knowledge Graphs, AKBC 2019

Embedding Multimodal Relational Data for Knowledge Base Completion, EMNLP 2018

A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning, *SEM 2018 [code]

Order-Embeddings of Images and Language, ICLR 2016 [code]

Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries, arXiv 2015

Interpretable Learning

Multimodal Explanations by Predicting Counterfactuality in Videos, CVPR 2019

Multimodal Explanations: Justifying Decisions and Pointing to the Evidence, CVPR 2018 [code]

Do Explanations make VQA Models more Predictable to a Human?, EMNLP 2018

Towards Transparent AI Systems: Interpreting Visual Question Answering Models, ICML Workshop on Visualization for Deep Learning 2016

Generative Learning

Generalized Multimodal ELBO, ICLR 2021 [code]

Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models, NeurIPS 2019 [code]

Few-shot Video-to-Video Synthesis, NeurIPS 2019 [code]

Multimodal Generative Models for Scalable Weakly-Supervised Learning, NeurIPS 2018 [code1] [code2]

The Multi-Entity Variational Autoencoder, NeurIPS 2017

Semi-supervised Learning

Semi-supervised Vision-language Mapping via Variational Learning, ICRA 2017

Semi-supervised Multimodal Hashing, arXiv 2017

Semi-Supervised Multimodal Deep Learning for RGB-D Object Recognition, IJCAI 2016

Multimodal Semi-supervised Learning for Image Classification, CVPR 2010

Self-supervised Learning

DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning, NeurIPS 2021 Datasets & Benchmarks Track [code]

Self-Supervised Learning by Cross-Modal Audio-Video Clustering, NeurIPS 2020 [code]

Self-Supervised MultiModal Versatile Networks, NeurIPS 2020 [code]

Labelling Unlabelled Videos from Scratch with Multi-modal Self-supervision, NeurIPS 2020 [code]

Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces, CVPR 2017

Multimodal Dynamics: Self-supervised Learning in Perceptual and Motor Systems, 2016

Language Models

Neural Language Modeling with Visual Features, arXiv 2019

Learning Multi-Modal Word Representation Grounded in Visual Context, AAAI 2018

Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes, CVPR 2016

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, ICML 2014 [code]

Adversarial Attacks

Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models, NeurIPS Workshop on Visually Grounded Interaction and Language 2018

Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning, ACL 2018 [code]

Fooling Vision and Language Models Despite Localization and Attention Mechanism, CVPR 2018

Few-Shot Learning

Language to Network: Conditional Parameter Adaptation with Natural Language Descriptions, ACL 2020

Shaping Visual Representations with Language for Few-shot Classification, ACL 2020

Zero-Shot Learning - The Good, the Bad and the Ugly, CVPR 2017

Zero-Shot Learning Through Cross-Modal Transfer, NIPS 2013

Bias and Fairness

Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models, arXiv 2021

Towards Debiasing Sentence Representations, ACL 2020 [code]

FairCVtest Demo: Understanding Bias in Multimodal Learning with a Testbed in Fair Automatic Recruitment, ICMI 2020 [code]

Model Cards for Model Reporting, FAccT 2019

Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings, NAACL 2019 [code]

Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, FAccT 2018

Datasheets for Datasets, arXiv 2018

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, NeurIPS 2016

Human in the Loop Learning

Human in the Loop Dialogue Systems, NeurIPS 2020 workshop

Human And Machine in-the-Loop Evaluation and Learning Strategies, NeurIPS 2020 workshop

Human-centric dialog training via offline reinforcement learning, EMNLP 2020 [code]

Human-In-The-Loop Machine Learning with Intelligent Multimodal Interfaces, ICML 2017 workshop

Architectures

Multimodal Transformers

Pretrained Transformers As Universal Computation Engines, AAAI 2022

Perceiver: General Perception with Iterative Attention, ICML 2021

FLAVA: A Foundational Language And Vision Alignment Model, arXiv 2021

PolyViT: Co-training Vision Transformers on Images, Videos and Audio, arXiv 2021

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, NeurIPS 2021 [code]

Parameter Efficient Multimodal Transformers for Video Representation Learning, ICLR 2021 [code]

Multimodal Memory

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, arXiv 2021

History Aware Multimodal Transformer for Vision-and-Language Navigation, NeurIPS 2021 [code]

Episodic Memory in Lifelong Language Learning, NeurIPS 2019

ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection, EMNLP 2018

Multimodal Memory Modelling for Video Captioning, CVPR 2018

Dynamic Memory Networks for Visual and Textual Question Answering, ICML 2016

Applications and Datasets

Language and Visual QA

Learning to Answer Questions in Dynamic Audio-Visual Scenarios, CVPR 2022

SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events, CVPR 2021 [code]

MultiModalQA: complex question answering over text, tables and images, ICLR 2021

ManyModalQA: Modality Disambiguation and QA over Diverse Inputs, AAAI 2020 [code]

Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA, CVPR 2020

Interactive Language Learning by Question Answering, EMNLP 2019 [code]

Fusion of Detected Objects in Text for Visual Question Answering, arXiv 2019

RUBi: Reducing Unimodal Biases in Visual Question Answering, NeurIPS 2019 [code]

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, CVPR 2019 [code]

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge, CVPR 2019 [code]

MUREL: Multimodal Relational Reasoning for Visual Question Answering, CVPR 2019 [code]

Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence, CVPR 2019 [code]

Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering, ICML 2019 [code]

Learning to Count Objects in Natural Images for Visual Question Answering, ICLR 2018 [code]

Overcoming Language Priors in Visual Question Answering with Adversarial Regularization, NeurIPS 2018

Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, NeurIPS 2018 [code]

RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes, EMNLP 2018 [code]

TVQA: Localized, Compositional Video Question Answering, EMNLP 2018 [code]

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, CVPR 2018 [code]

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering, CVPR 2018 [code]

Stacked Latent Attention for Multimodal Reasoning, CVPR 2018

Learning to Reason: End-to-End Module Networks for Visual Question Answering, ICCV 2017 [code]

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR 2017 [code] [dataset generation]

Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension, CVPR 2017 [code]

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, EMNLP 2016 [code]

MovieQA: Understanding Stories in Movies through Question-Answering, CVPR 2016 [code]

VQA: Visual Question Answering, ICCV 2015 [code]

Language Grounding in Vision

Core Challenges in Embodied Vision-Language Planning, arXiv 2021

MaRVL: Multicultural Reasoning over Vision and Language, EMNLP 2021 [code]

Grounding 'Grounding' in NLP, ACL 2021

The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes, NeurIPS 2020 [code]

What Does BERT with Vision Look At?, ACL 2020

Visual Grounding in Video for Unsupervised Word Translation, CVPR 2020 [code]

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference, CVPR 2020 [code]

Grounded Video Description,

