Papers: Difference between revisions

Latest revision as of 23:18, 5 February 2024

Important Papers

Name	Date	Source	Type	Organization	Product	Note
ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)	2012	AlexNet Paper	Computer Vision		AlexNet
Efficient Estimation of Word Representations in Vector Space (Word2Vec)	2013/01/16	arxiv:1301.3781	Natural Language Processing		Word2Vec
Playing Atari with Deep Reinforcement Learning (DQN)	2013/12/19	arxiv:1312.5602			DQN (Deep Q-Learning)
Generative Adversarial Networks (GAN)	2014/06/10	arxiv:1406.2661			GAN (Generative Adversarial Network)
Very Deep Convolutional Networks for Large-Scale Image Recognition (VGGNet)	2014/09/04	arxiv:1409.1556	Computer Vision		VGGNet
Sequence to Sequence Learning with Neural Networks (Seq2Seq)	2014/09/10	arxiv:1409.3215	Natural Language Processing		Seq2Seq
Adam: A Method for Stochastic Optimization	2014/12/22	arxiv:1412.6980			Adam
Deep Residual Learning for Image Recognition (ResNet)	2015/12/10	arxiv:1512.03385	Computer Vision		ResNet
Going Deeper with Convolutions (GoogleNet)	2014/09/17	arxiv:1409.4842	Computer Vision	Google	GoogleNet
Asynchronous Methods for Deep Reinforcement Learning (A3C)	2016/02/04	arxiv:1602.01783			A3C
WaveNet: A Generative Model for Raw Audio	2016/09/12	arxiv:1609.03499	Audio		WaveNet
Attention Is All You Need (Transformer)	2017/06/12	arxiv:1706.03762	Natural Language Processing	Google	Transformer	Influential paper
Proximal Policy Optimization Algorithms (PPO)	2017/07/20	arxiv:1707.06347			PPO (Proximal Policy Optimization)
Improving Language Understanding by Generative Pre-Training (GPT)	2018	paper source	Natural Language Processing	OpenAI	GPT (Generative Pre-Training Transformer)
Deep contextualized word representations (ELMo)	2018/02/15	arxiv:1802.05365	Natural Language Processing		ELMo
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding	2018/04/20	arxiv:1804.07461 website	Natural Language Processing		GLUE (General Language Understanding Evaluation)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding	2018/10/11	arxiv:1810.04805	Natural Language Processing	Google	BERT (Bidirectional Encoder Representations from Transformers)
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context	2019/01/09	arxiv:1901.02860 github			Transformer-XL
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero)	2019/11/19	arxiv:1911.08265			MuZero
Language Models are Few-Shot Learners (GPT-3)	2020/05/28	arxiv:2005.14165	Natural Language Processing	OpenAI	GPT-3
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)	2020/10/22	arxiv:2010.11929 GitHub	Computer Vision	Google	ViT (Vision Transformer)
Learning Transferable Visual Models From Natural Language Supervision (CLIP)	2021/02/26	arxiv:2103.00020 Blog Post	Computer Vision	OpenAI	CLIP (Contrastive Language-Image Pre-Training)
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer	2021/10/05	arxiv:2110.02178 GitHub	Computer Vision	Apple	MobileViT
LaMDA: Language Models for Dialog Applications	2022/01/20	arxiv:2201.08239 Blog Post	Natural Language Processing	Google	LaMDA (Language Models for Dialog Applications)
Block-Recurrent Transformers	2022/03/11	arxiv:2203.07852
Memorizing Transformers	2022/03/16	arxiv:2203.08913
STaR: Bootstrapping Reasoning With Reasoning	2022/03/28	arxiv:2203.14465			STaR (Self-Taught Reasoner)
GPT-4 Technical Report	2023/03/15	arxiv:2303.08774 blog post system card	Natural Language Processing Multimodal	OpenAI	GPT-4

Other Papers

Name	Date	Source	Type	Organization	Product	Note
Self-Rewarding Language Models	2024/01/18	arxiv:2401.10020	Natural Language Processing	Meta
LLM in a flash: Efficient Large Language Model Inference with Limited Memory	2023/12/12	arxiv:2312.11514 HuggingFace	Natural Language Processing	Apple
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation	2023/12/07	arxiv:2311.17117 Website Video GitHub Tutorial	Computer Vision	Alibaba	Animate Anyone
MatterGen: a generative model for inorganic materials design	2023/12/06	arxiv:2312.03687 Tweet	Materials Science	Microsoft	MatterGen
Audiobox: Generating audio from voice and natural language prompts	2023/11/30	Paper Website	Audio	Meta	Audiobox
Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text	2023/11/30	arxiv:2311.18805	Natural Language Processing	University of Tokyo	Scrambled Bench
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers	2023/11/27	arxiv:2311.15475 Website	Computer Vision		MeshGPT
Ferret: Refer and Ground Anything Anywhere at Any Granularity	2023/10/11	arxiv:2310.07704 GitHub	Multimodal Natural Language Processing	Apple	Ferret
SeamlessM4T - Massively Multilingual & Multimodal Machine Translation	2023/08/23	Paper Website Blogpost Demo GitHub	Natural Language Processing	Meta	SeamlessM4T
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control	2023/08/01	arxiv:2307.15818 Website Blogpost	Robotics	Google	RT-2
Towards Generalist Biomedical AI	2023/07/26	arxiv:2307.14334	Natural Language Processing	Google	Med-PaLM
Large Language Models Understand and Can be Enhanced by Emotional Stimuli	2023/07/14	arxiv:2307.11760	Natural Language Processing		EmotionPrompt
MusicGen: Simple and Controllable Music Generation	2023/06/08	arxiv:2306.05284 GitHub Example	Audio	Meta	MusicGen
CodeTF: One-stop Transformer Library for State-of-the-art Code LLM	2023/05/31	arxiv:2306.00029 GitHub	Natural Language Processing	Salesforce	CodeTF
Bytes Are All You Need: Transformers Operating Directly On File Bytes	2023/05/31	arxiv:2306.00238	Computer Vision	Apple
Scaling Speech Technology to 1,000+ Languages	2023/05/22	Paper Blogpost GitHub Languages covered	Natural Language Processing	Meta	Massively Multilingual Speech (MMS)
RWKV: Reinventing RNNs for the Transformer Era	2023/05/22	arxiv:2305.13048	Natural Language Processing		Receptance Weighted Key Value (RWKV)
ImageBind: One Embedding Space To Bind Them All	2023/05/09	arxiv:2305.05665 Website Demo Blog GitHub	Multimodal Computer Vision Natural Language Processing	Meta	ImageBind
Real-Time Neural Appearance Models	2023/05/05	Paper Blog		NVIDIA
Poisoning Language Models During Instruction Tuning	2023/05/01	arxiv:2305.00944	Natural Language Processing
Generative Agents: Interactive Simulacra of Human Behavior	2023/04/07	arxiv:2304.03442	Human-AI Interaction Natural Language Processing	Stanford	Generative agents
Segment Anything	2023/04/05	Paper Website Blog GitHub	Computer Vision	Meta	Segment Anything Model (SAM)
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace (Microsoft JARVIS)	2023/03/30	arxiv:2303.17580 HuggingFace Space JARVIS GitHub	Natural Language Processing Multimodal	Microsoft Hugging Face	HuggingGPT JARVIS
BloombergGPT: A Large Language Model for Finance	2023/03/30	arxiv:2303.17564 press release twitter thread	Natural Language Processing	Bloomberg	BloombergGPT
Sparks of Artificial General Intelligence: Early experiments with GPT-4	2023/03/22	arxiv:2303.12712	Natural Language Processing Multimodal	Microsoft
Reflexion: an autonomous agent with dynamic memory and self-reflection	2023/03/20	arxiv:2303.11366 blog post GitHub GitHub 2	Natural Language Processing		Reflexion
PaLM-E: An Embodied Multimodal Language Model	2023/03/06	arxiv:2303.03378 blog	Natural Language Processing Multimodal	Google	PaLM-E
Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages	2023/03/02	arxiv:2303.01037 blog	Natural Language Processing	Google	Universal Speech Model (USM)
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)	2023/02/27	arxiv:2302.14045	Natural Language Processing	Microsoft	Kosmos-1
LLaMA: Open and Efficient Foundation Language Models	2023/02/25	paper blog post github	Natural Language Processing	Meta	LLaMA
Structure and Content-Guided Video Synthesis with Diffusion Models (Gen-1)	2023/02/06	arxiv:2302.03011 blog post	Video-to-Video	Runway	Gen-1
Dreamix: Video Diffusion Models are General Video Editors	2023/02/03	arxiv:2302.01329 blog post		Google	Dreamix
FLAME: A small language model for spreadsheet formulas	2023/01/31	arxiv:2301.13779		Microsoft	FLAME
SingSong: Generating musical accompaniments from singing	2023/01/30	arxiv:2301.12662 blog post	Audio		SingSong
MusicLM: Generating Music From Text	2023/01/26	arxiv:2301.11325 blog post	Audio	Google	MusicLM
Mastering Diverse Domains through World Models (DreamerV3)	2023/01/10	arxiv:2301.04104v1 blogpost		DeepMind	DreamerV3
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)	2023/01/05	arxiv:2301.02111 Demo		Microsoft	VALL-E
Muse: Text-To-Image Generation via Masked Generative Transformers	2023/01/02	arxiv:2301.00704 blog post	Computer Vision	Google	Muse
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model	2022/11/09	arxiv:2211.05100 Blog Post	Natural Language Processing	Hugging Face	BLOOM	Open source LLM that is a competitor to GPT-3
ReAct: Synergizing Reasoning and Acting in Language Models	2022/10/06	arxiv:2210.03629 Blogpost GitHub	Natural Language Processing	Google	ReAct
AudioLM: a Language Modeling Approach to Audio Generation	2022/09/07	arxiv:2209.03143 web page blog post	Audio	Google	AudioLM
Emergent Abilities of Large Language Models	2022/06/15	arxiv:2206.07682 Blog Post	Natural Language Processing	Google	Emergent Abilities
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)	2022/05/23	arxiv:2205.11487 Blog Post	Computer Vision	Google	Imagen
A Generalist Agent (Gato)	2022/05/12	arxiv:2205.06175 Blog Post	Natural Language Processing Multimodal	DeepMind	Gato
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback	2022/04/12	arxiv:2204.05862 GitHub	Natural Language Processing	Anthropic	RLHF (Reinforcement Learning from Human Feedback)
PaLM: Scaling Language Modeling with Pathways	2022/04/05	arxiv:2204.02311 Blog Post	Natural Language Processing	Google	PaLM (Pathways Language Model)
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models	2022/01/28	arxiv:2201.11903 Blog Post	Natural Language Processing	Google	Chain of Thought Prompting (CoT prompting)
Constitutional AI: Harmlessness from AI Feedback	2022/12/15	arxiv:2212.08073	Natural Language Processing	Anthropic	Constitutional AI, Claude
Improving language models by retrieving from trillions of tokens (RETRO)	2021/12/08	arxiv:2112.04426 Blog post	Natural Language Processing	DeepMind	RETRO (Retrieval Enhanced Transformer)
InstructPix2Pix: Learning to Follow Image Editing Instructions	2022/11/17	arxiv:2211.09800 Blog Post	Computer Vision	UC Berkeley	InstructPix2Pix
REALM: Retrieval-Augmented Language Model Pre-Training	2020/02/10	arxiv:2002.08909 Blog Post	Natural Language Processing	Google	REALM (Retrieval-Augmented Language Model Pre-Training)
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)	2019/10/23	arxiv:1910.10683 blog post	Natural Language Processing	Google	T5 (Text-To-Text Transfer Transformer)
RoBERTa: A Robustly Optimized BERT Pretraining Approach	2019/07/26	arxiv:1907.11692 blog post	Natural Language Processing	Meta	RoBERTa (Robustly Optimized BERT Pretraining Approach)
Probabilistic Face Embeddings	2019/04/21	arxiv:1904.09658	Computer Vision		PFEs (Probabilistic Face Embeddings)
Language Models are Unsupervised Multitask Learners (GPT-2)	2018	paper	Natural Language Processing	OpenAI	GPT-2
Deep reinforcement learning from human preferences	2017/06/12	arxiv:1706.03741 Blog post GitHub		OpenAI	RLHF (Reinforcement Learning from Human Feedback)

@@ Line 1: / Line 1: @@
-===Important===
+{{Needs Expansion}}
-{|
+__TOC__
+==Important Papers==
+{| class="wikitable sortable"
 |-
 !Name
-!Submission<br>Date
+!Date
 !Source
+!Type
+!Organization
+!Product
 !Note
 |-
-|[[ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)]] || 2012 || [https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf AlexNet Paper] ||
+|[[ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)]] || 2012 || [https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf AlexNet Paper] || [[Computer Vision]] ||  || [[AlexNet]] ||
 |-
-|[[Very Deep Convolutional Networks for Large-Scale Image Recognition (VGGNet)]] || 2014/09/04 || [[arxiv:409.1556]] ||
+|[[Efficient Estimation of Word Representations in Vector Space (Word2Vec)]] || 2013/01/16 || [[arxiv:1301.3781]] || [[Natural Language Processing]] ||  || [[Word2Vec]] ||
 |-
-|[[Deep Residual Learning for Image Recognition (ResNet)]] || 2015/12/10 || [[arxiv:409.1556]] ||
+|[[Playing Atari with Deep Reinforcement Learning (DQN)]] || 2013/12/19 || [[arxiv:1312.5602]] ||  ||  || [[DQN]] ([[Deep Q-Learning]]) ||
 |-
-|[[Going Deeper with Convolutions (GoogleNet)]] || 2015/12/10 || [[arxiv:409.1556]] ||
+|[[Generative Adversarial Networks (GAN)]] || 2014/06/10 || [[arxiv:1406.2661]] ||  ||  || [[GAN]] ([[Generative Adversarial Network]]) ||
 |-
-|[[Attention Is All You Need (Transformer)]] || 2017/06/12 || [[arxiv:1706.03762]] || influential paper that introduced [[Transformer]]
+|[[Very Deep Convolutional Networks for Large-Scale Image Recognition (VGGNet)]] || 2014/09/04 || [[arxiv:1409.1556]] || [[Computer Vision]] ||  || [[VGGNet]] ||
 |-
-|[[Transformer-XL]] || 2019/01/09 || [[arxiv:1901.02860]] || Attentive Language Models Beyond a Fixed-Length Context
+|[[Sequence to Sequence Learning with Neural Networks (Seq2Seq)]] || 2014/09/10 || [[arxiv:1409.3215]] || [[Natural Language Processing]] ||  || [[Seq2Seq]] ||
 |-
-|[[Language Models are Few-Shot Learners]] || 2020/05/28 || [[arxiv:2005.14165]] || [[GPT]]
+|[[Adam: A Method for Stochastic Optimization]] || 2014/12/22 || [[arxiv:1412.6980]] ||  ||  || [[Adam]] ||
 |-
-|[[An Image is Worth 16x16 Words]] || 2020/10/22 || [[arxiv:2010.11929]] || Transformers for Image Recognition at Scale - [[Vision Transformer]] ([[ViT]])
+|[[Deep Residual Learning for Image Recognition (ResNet)]] || 2015/12/10 || [[arxiv:1512.03385]] || [[Computer Vision]] ||  || [[ResNet]] ||
 |-
-|[[OpenAI CLIP]] || 2021/02/26 || [[arxiv:2103.00020]]<br>[https://openai.com/blog/clip/ OpenAI Blog] || Learning Transferable Visual Models From Natural Language Supervision
+|[[Going Deeper with Convolutions (GoogleNet)]] || 2014/09/17 || [[arxiv:1409.4842]] || [[Computer Vision]] || [[Google]] || [[GoogleNet]] ||
 |-
-|[[MobileViT]] || 2021/10/05 || [[arxiv:2110.02178]] || Light-weight, General-purpose, and Mobile-friendly Vision Transformer
+|[[Asynchronous Methods for Deep Reinforcement Learning (A3C)]] || 2016/02/04 || [[arxiv:1602.01783]] ||  ||  || [[A3C]] ||
 |-
-|[[Block-Recurrent Transformers]] || 2022/03/11 || [[arxiv:2203.07852]] ||
+|[[WaveNet: A Generative Model for Raw Audio]] || 2016/09/12 || [[arxiv:1609.03499]] || [[Audio]] ||  || [[WaveNet]] ||
 |-
-|[[Memorizing Transformers]] || 2022/03/16 ||[[arxiv:2203.08913]] ||
+|[[Attention Is All You Need (Transformer)]] || 2017/06/12 || [[arxiv:1706.03762]] || [[Natural Language Processing]] || [[Google]] || [[Transformer]] || Influential paper
 |-
-|[[STaR]] || 2022/03/28 || [[arxiv:2203.14465]] || Bootstrapping Reasoning With Reasoning
+|[[Proximal Policy Optimization Algorithms (PPO)]] || 2017/07/20 || [[arxiv:1707.06347]] ||  ||  || [[PPO]] ([[Proximal Policy Optimization]]) ||
+|-
+|[[Improving Language Understanding by Generative Pre-Training (GPT)]] || 2018 || [https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf paper source] || [[Natural Language Processing]] || [[OpenAI]] || [[GPT]] ([[Generative Pre-Training Transformer]]) ||
+|-
+|[[Deep contextualized word representations (ELMo)]] || 2018/02/15 || [[arxiv:1802.05365]] || [[Natural Language Processing]] ||  || [[ELMo]] ||
+|-
+|[[GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding]] || 2018/04/20 || [[arxiv:1804.07461]]<br>[https://gluebenchmark.com/ website] || [[Natural Language Processing]] ||  || [[GLUE]] ([[General Language Understanding Evaluation]]) ||
+|-
+|[[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]] || 2018/10/11 || [[arxiv:1810.04805]] || [[Natural Language Processing]] || [[Google]] || [[BERT]] ([[Bidirectional Encoder Representations from Transformers]]) ||
+|-
+|[[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context]] || 2019/01/09 || [[arxiv:1901.02860]]<br>[https://github.com/kimiyoung/transformer-xl github] ||  ||  || [[Transformer-XL]] ||
+|-
+|[[Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero)]] || 2019/11/19 || [[arxiv:1911.08265]] ||  ||  || [[MuZero]] ||
+|-
+|[[Language Models are Few-Shot Learners (GPT-3)]] || 2020/05/28 || [[arxiv:2005.14165]] || [[Natural Language Processing]] || [[OpenAI]] || [[GPT-3]] ||
+|-
+|[[An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)]] || 2020/10/22 || [[arxiv:2010.11929]]<br>[https://github.com/google-research/vision_transformer GitHub] || [[Computer Vision]] || [[Google]] || [[ViT]] ([[Vision Transformer]]) ||
+|-
+|[[Learning Transferable Visual Models From Natural Language Supervision (CLIP)]] || 2021/02/26 || [[arxiv:2103.00020]]<br>[https://openai.com/blog/clip/ Blog Post] || [[Computer Vision]] || [[OpenAI]] || [[CLIP]] ([[Contrastive Language-Image Pre-Training]]) ||
+|-
+|[[MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer]] || 2021/10/05 || [[arxiv:2110.02178]]<br>[https://github.com/apple/ml-cvnets GitHub] || [[Computer Vision]] || [[Apple]] || [[MobileViT]] ||
+|-
+|[[LaMDA: Language Models for Dialog Applications]] || 2022/01/20 || [[arxiv:2201.08239]]<br>[https://blog.google/technology/ai/lamda/ Blog Post] || [[Natural Language Processing]] || [[Google]] || [[LaMDA]] ([[Language Models for Dialog Applications]]) ||
+|-
+|[[Block-Recurrent Transformers]] || 2022/03/11 || [[arxiv:2203.07852]] ||  ||  ||  ||
+|-
+|[[Memorizing Transformers]] || 2022/03/16 ||[[arxiv:2203.08913]] ||  ||  ||  ||
+|-
+|[[STaR: Bootstrapping Reasoning With Reasoning]] || 2022/03/28 || [[arxiv:2203.14465]] ||  ||  || [[STaR]] ([[Self-Taught Reasoner]]) ||
+|-
+|[[GPT-4 Technical Report]] || 2023/03/15 || [[arxiv:2303.08774]]<br>[https://openai.com/research/gpt-4 blog post]<br>[https://cdn.openai.com/papers/gpt-4-system-card.pdf system card] || [[Natural Language Processing]]<br>[[Multimodal]] || [[OpenAI]] || [[GPT-4]] ||
 |-
 |}
-===Others===
+==Other Papers==
-https://arxiv.org/abs/2301.13779 ([[FLAME: A small language model for spreadsheet formulas]]) - Small model specifically for spreadsheets by [[Miscrofot]]
+{| class="wikitable sortable"
+|-
+!Name
+!Date
+!Source
+!Type
+!Organization
+!Product
+!Note
+|-
+|[[Self-Rewarding Language Models]] || 2024/01/18 || [[arxiv:2401.10020]] || [[Natural Language Processing]] || [[Meta]] ||  ||
+|-
+|[[LLM in a flash: Efficient Large Language Model Inference with Limited Memory]] || 2023/12/12 || [[arxiv:2312.11514]]<br>[https://huggingface.co/papers/2312.11514 HuggingFace] || [[Natural Language Processing]] || [[Apple]] ||  ||
+|-
+|[[Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation]] || 2023/12/07 || [[arxiv:2311.17117]]<br>[https://humanaigc.github.io/animate-anyone/ Website]<br>[https://www.youtube.com/watch?v=8PCn5hLKNu4 Video]<br>[https://github.com/HumanAIGC/AnimateAnyone GitHub]<br>[https://www.alibabacloud.com/blog/quickly-set-up-a-virtual-clothes-try-on-services-with-pai_600484 Tutorial] || [[Computer Vision]] || [[Alibaba]] || [[Animate Anyone]] ||
+|-
+|[[MatterGen: a generative model for inorganic materials design]] || 2023/12/06 || [[arxiv:2312.03687]]<br>[https://twitter.com/xie_tian/status/1732798976779595968 Tweet] || [[Materials Science]] || [[Microsoft]] || [[MatterGen]] ||
+|-
+|[[Audiobox: Generating audio from voice and natural language prompts]] || 2023/11/30 || [https://ai.meta.com/research/publications/audiobox-unified-audio-generation-with-natural-language-prompts/ Paper]<br>[https://ai.meta.com/blog/audiobox-generating-audio-voice-natural-language-prompts/ Website] || [[Audio]] || [[Meta]] || [[Audiobox]] ||
+|-
+|[[Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text]] || 2023/11/30 || [[arxiv:2311.18805]] || [[Natural Language Processing]] || [[University of Tokyo]] || [[Scrambled Bench]] ||
+|-
+|[[MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers]] || 2023/11/27 || [[arxiv:2311.15475]]<br>[https://nihalsid.github.io/mesh-gpt/ Website] || [[Computer Vision]] ||  || [[MeshGPT]] ||
+|-
+|[[Ferret: Refer and Ground Anything Anywhere at Any Granularity]] || 2023/10/11 || [[arxiv:2310.07704]]<br>[https://github.com/apple/ml-ferret GitHub] || [[Multimodal]]<br>[[Natural Language Processing]] || [[Apple]] || [[Ferret]] ||
+|-
+|[[SeamlessM4T - Massively Multilingual & Multimodal Machine Translation]] || 2023/08/23 || [https://ai.meta.com/research/publications/seamless-m4t/ Paper]<br>[https://ai.meta.com/resources/models-and-libraries/seamless-communication/ Website]<br>[https://www.deepmind.com/blog/rt-2-new-model-translates-vision-and-language-into-action Blogpost]<br>[https://seamless.metademolab.com/ Demo]<br>[https://github.com/facebookresearch/seamless_communication GitHub] || [[Natural Language Processing]] || [[Meta]] || [[SeamlessM4T]] ||
+|-
+|[[RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control]] || 2023/08/01 || [[arxiv:2307.15818]]<br>[https://robotics-transformer2.github.io/ Website]<br>[https://www.deepmind.com/blog/rt-2-new-model-translates-vision-and-language-into-action Blogpost] || [[Robotics]] || [[Google]] || [[RT-2]] ||
+|-
+|[[Towards Generalist Biomedical AI]] || 2023/07/26 || [[arxiv:2307.14334]] || [[Natural Language Processing]] || [[Google]] || [[Med-PaLM]] ||
+|-
+|[[Large Language Models Understand and Can be Enhanced by Emotional Stimuli]] || 2023/07/14 || [[arxiv:2307.11760]] || [[Natural Language Processing]] ||  || [[EmotionPrompt]] ||
+|-
+|[[MusicGen: Simple and Controllable Music Generation]] || 2023/06/08 || [[arxiv:2306.05284]]<br>[https://github.com/facebookresearch/audiocraft GitHub]<br>[https://ai.honu.io/papers/musicgen/ Example] || [[Audio]] || [[Meta]] || [[MusicGen]] ||
+|-
+|[[CodeTF: One-stop Transformer Library for State-of-the-art Code LLM]] || 2023/05/31 || [[arxiv:2306.00029]]<br>[https://github.com/salesforce/CodeTF GitHub] || [[Natural Language Processing]] || [[Salesforce]] || [[CodeTF]] ||
+|-
+|[[Bytes Are All You Need: Transformers Operating Directly On File Bytes]] || 2023/05/31 || [[arxiv:2306.00238]] || [[Computer Vision]] || [[Apple]] ||  ||
+|-
+|[[Scaling Speech Technology to 1,000+ Languages]] || 2023/05/22 || [https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/ Paper]<br>[https://ai.facebook.com/blog/multilingual-model-speech-recognition/ Blogpost]<br>[https://github.com/facebookresearch/fairseq/tree/main/examples/mms GitHub]<br>[https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html Languages covered] || [[Natural Language Processing]] || [[Meta]] || [[Massively Multilingual Speech]] ([[MMS]]) ||
+|-
+|[[RWKV: Reinventing RNNs for the Transformer Era]] || 2023/05/22 || [[arxiv:2305.13048]] || [[Natural Language Processing]] ||  || [[Receptance Weighted Key Value]] ([[RWKV]]) ||
+|-
+|[[ImageBind: One Embedding Space To Bind Them All]] || 2023/05/09 || [[arxiv:2305.05665]]<br>[https://imagebind.metademolab.com/ Website]<br>[https://imagebind.metademolab.com/demo Demo]<br>[https://ai.facebook.com/blog/imagebind-six-modalities-binding-ai/ Blog]<br>[https://github.com/facebookresearch/ImageBind GitHub] || [[Multimodal]]<br>[[Computer Vision]]<br>[[Natural Language Processing]] || [[Meta]] || [[ImageBind]] ||
+|-
+|[[Real-Time Neural Appearance Models]] || 2023/05/05 || [https://research.nvidia.com/labs/rtr/neural_appearance_models/assets/nvidia_neural_materials_paper-2023-05.pdf Paper]<br>[https://research.nvidia.com/labs/rtr/neural_appearance_models/ Blog] ||  || [[NVIDIA]] ||  ||
+|-
+|[[Poisoning Language Models During Instruction Tuning]] || 2023/05/01 || [[arxiv:2305.00944]] || [[Natural Language Processing]] ||  ||  ||
+|-
+|[[Generative Agents: Interactive Simulacra of Human Behavior]] || 2023/04/07 || [[arxiv:2304.03442]] || [[Human-AI Interaction]]<br>[[Natural Language Processing]] || [[Stanford]] || [[Generative agents]] ||
+|-
+|[[Segment Anything]] || 2023/04/05 || [https://ai.facebook.com/research/publications/segment-anything/ Paper]<br>[https://segment-anything.com/ Website]<br>[https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/ Blog]<br>[https://github.com/facebookresearch/segment-anything GitHub] || [[Computer Vision]] || [[Meta]] || [[Segment Anything Model]] ([[SAM]]) ||
+|-
+|[[HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace (Microsoft JARVIS)]] || 2023/03/30 || [[arxiv:2303.17580]]<br>[https://huggingface.co/spaces/microsoft/HuggingGPT HuggingFace Space]<br>[https://github.com/microsoft/JARVIS JARVIS GitHub] || [[Natural Language Processing]]<br>[[Multimodal]] || [[Microsoft]]<br>[[Hugging Face]] || [[HuggingGPT]]<br>[[JARVIS]] ||
+|-
+|[[BloombergGPT: A Large Language Model for Finance]] || 2023/03/30 || [[arxiv:2303.17564]]<br>[https://www.bloomberg.com/company/press/bloomberggpt-50-billion-parameter-llm-tuned-finance/ press release]<br>[https://twitter.com/rasbt/status/1642880757566676992 twitter thread] || [[Natural Language Processing]] || [[Bloomberg]] || [[BloombergGPT]] ||
+|-
+|[[Sparks of Artificial General Intelligence: Early experiments with GPT-4]] || 2023/03/22 || [[arxiv:2303.12712]] || [[Natural Language Processing]]<br>[[Multimodal]] || [[Microsoft]] ||  ||
+|-
+|[[Reflexion: an autonomous agent with dynamic memory and self-reflection]] || 2023/03/20 || [[arxiv:2303.11366]]<br>[https://nanothoughts.substack.com/p/reflecting-on-reflexion blog post]<br>[https://github.com/noahshinn024/reflexion GitHub]<br>[https://github.com/GammaTauAI/reflexion-human-eval GitHub 2] || [[Natural Language Processing]] ||  || [[Reflexion]] ||
+|-
+|[[PaLM-E: An Embodied Multimodal Language Model]] || 2023/03/06 || [[arxiv:2303.03378]]<br>[https://palm-e.github.io/ blog] || [[Natural Language Processing]]<br>[[Multimodal]] || [[Google]] || [[PaLM-E]] ||
+|-
+|[[Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages]] || 2023/03/02 || [[arxiv:2303.01037]]<br>[https://sites.research.google/usm/ blog] || [[Natural Language Processing]] || [[Google]] || [[Universal Speech Model]] ([[USM]]) ||
+|-
+|[[Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)]] || 2023/02/27 ||[[arxiv:2302.14045]] || [[Natural Language Processing]] || [[Microsoft]] || [[Kosmos-1]] ||
+|-
+|[[LLaMA: Open and Efficient Foundation Language Models]] || 2023/02/25 ||[https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/ paper]<br>[https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ blog post]<br>[https://github.com/facebookresearch/llama github] || [[Natural Language Processing]] || [[Meta]] || [[LLaMA]] ||
+|-
+|[[Structure and Content-Guided Video Synthesis with Diffusion Models (Gen-1)]] || 2023/02/06 ||[[arxiv:2302.03011]]<br>[https://research.runwayml.com/gen1 blog post] || [[Video-to-Video]] || [[Runway]] || [[Gen-1]] ||
+|-
+|[[Dreamix: Video Diffusion Models are General Video Editors]] || 2023/02/03 ||[[arxiv:2302.01329]]<br>[https://dreamix-video-editing.github.io/ blog post] ||  || [[Google]] || [[Dreamix]] ||
+|-
+|[[FLAME: A small language model for spreadsheet formulas]] || 2023/01/31 ||[[arxiv:2301.13779]] ||  || [[Microsoft]] || [[FLAME]] ||
+|-
+|[[SingSong: Generating musical accompaniments from singing]] || 2023/01/30 ||[[arxiv:2301.12662]]<br>[https://storage.googleapis.com/sing-song/index.html blog post] || [[Audio]] ||  || [[SingSong]] ||
+|-
+|[[MusicLM: Generating Music From Text]] || 2023/01/26 ||[[arxiv:2301.11325]]<br>[https://google-research.github.io/seanet/musiclm/examples/ blog post] || [[Audio]] || [[Google]] || [[MusicLM]] ||
+|-
+|[[Mastering Diverse Domains through World Models (DreamerV3)]] || 2023/01/10 ||[[arxiv:2301.04104v1]]<br>[https://danijar.com/project/dreamerv3/ blogpost] ||  || [[DeepMind]] || [[DreamerV3]] ||
+|-
+|[[Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)]] || 2023/01/05 ||[[arxiv:2301.02111]]<br>[https://valle-demo.github.io/ Demo] ||  || [[Microsoft]] || [[VALL-E]] ||
+|-
+|[[Muse: Text-To-Image Generation via Masked Generative Transformers]] || 2023/01/02 ||[[arxiv:2301.00704]]<br>[https://muse-model.github.io/ blog post] || [[Computer Vision]] || [[Google]] || [[Muse]] ||
+|-
+|[[BLOOM: A 176B-Parameter Open-Access Multilingual Language Model]] || 2022/11/09 || [[arxiv:2211.05100]]<br>[https://bigscience.huggingface.co/blog/bloom Blog Post] || [[Natural Language Processing]] || [[Hugging Face]] || [[BLOOM]] || Open source [[LLM]] that is a competitor to [[GPT-3]]
+|-
+|[[ReAct: Synergizing Reasoning and Acting in Language Models]] || 2022/10/06 || [[arxiv:2210.03629]]<br>[https://ai.googleblog.com/2022/11/react-synergizing-reasoning-and-acting.html Blogpost]<br>[https://github.com/ysymyth/ReAct GitHub] || [[Natural Language Processing]] || [[Google]] || [[ReAct]] ||
+|-
+|[[AudioLM: a Language Modeling Approach to Audio Generation]] || 2022/09/07 || [[arxiv:2209.03143]]<br>[https://google-research.github.io/seanet/audiolm/examples/ web page]<br>[https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html blog post] || [[Audio]] || [[Google]] || [[AudioLM]] ||
+|-
+|[[Emergent Abilities of Large Language Models]] || 2022/06/15 || [[arxiv:2206.07682]]<br>[https://ai.googleblog.com/2022/11/characterizing-emergent-phenomena-in.html Blog Post] || [[Natural Language Processing]] || [[Google]] || [[Emergent Abilities]] ||
+|-
+|[[Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)]] || 2022/05/23 || [[arxiv:2205.11487]]<br>[https://imagen.research.google/ Blog Post] || [[Computer Vision]] || [[Google]] || [[Imagen]] ||
+|-
+|[[A Generalist Agent (Gato)]] || 2022/05/12 || [[arxiv:2205.06175]]<br>[https://www.deepmind.com/blog/a-generalist-agent Blog Post] || [[Natural Language Processing]]<br>[[Multimodal]] || [[DeepMind]] || [[Gato]] ||
+|-
+|[[Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback]] || 2022/04/12 || [[arxiv:2204.05862]]<br>[https://github.com/anthropics/hh-rlhf GitHub] || [[Natural Language Processing]] || [[Anthropic]] || [[RLHF]] ([[Reinforcement Learning from Human Feedback]]) ||
+|-
+|[[PaLM: Scaling Language Modeling with Pathways]] || 2022/04/05 || [[arxiv:2204.02311]]<br>[https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html Blog Post] || [[Natural Language Processing]] || [[Google]] || [[PaLM]] ([[Pathways Language Model]]) ||
+|-
+|[[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models]] || 2022/01/28 || [[arxiv:2201.11903]]<br>[https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html Blog Post] || [[Natural Language Processing]] || [[Google]] || [[Chain of Thought Prompting]] (CoT prompting) ||
+|-
+|[[Constitutional AI: Harmlessness from AI Feedback]] || 2022/12/15 || [[arxiv:2212.08073]] || [[Natural Language Processing]] || [[Anthropic]] || [[Constitutional AI]], [[Claude]] ||
+|-
+|[[Improving language models by retrieving from trillions of tokens (RETRO)]] || 2021/12/08 || [[arxiv:2112.04426]]<br>[https://www.deepmind.com/blog/improving-language-models-by-retrieving-from-trillions-of-tokens Blog post] || [[Natural Language Processing]] || [[DeepMind]] || [[RETRO]] ([[Retrieval Enhanced Transformer]]) ||
+|-
+|[[InstructPix2Pix: Learning to Follow Image Editing Instructions]] || 2022/11/17 || [[arxiv:2211.09800]]<br>[https://www.timothybrooks.com/instruct-pix2pix Blog Post] || [[Computer Vision]] || [[UC Berkeley]] || [[InstructPix2Pix]] ||
+|-
+|[[REALM: Retrieval-Augmented Language Model Pre-Training]] || 2020/02/10 || [[arxiv:2002.08909]]<br>[https://ai.googleblog.com/2020/08/realm-integrating-retrieval-into.html Blog Post] || [[Natural Language Processing]] || [[Google]] || [[REALM]] ([[Retrieval-Augmented Language Model Pre-Training]]) ||
+|-
+|[[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)]] || 2019/10/23 || [[arxiv:1910.10683]]<br>[https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html blog post] || [[Natural Language Processing]] || [[Google]] || [[T5]] ([[Text-To-Text Transfer Transformer]]) ||
+|-
+|[[RoBERTa: A Robustly Optimized BERT Pretraining Approach]] || 2019/07/26 || [[arxiv:1907.11692]]<br>[https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/ blog post] || [[Natural Language Processing]] || [[Meta]] || [[RoBERTa]] ([[Robustly Optimized BERT Pretraining Approach]]) ||
+|-
+|[[Probabilistic Face Embeddings]] || 2019/04/21 || [[arxiv:1904.09658]] || [[Computer Vision]] ||  || [[PFE]]s ([[Probabilistic Face Embeddings]]) ||
+|-
+|[[Language Models are Unsupervised Multitask Learners (GPT-2)]] || 2018 || [https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf paper] || [[Natural Language Processing]] || [[OpenAI]] || [[GPT-2]] ||
+|-
+|[[Deep reinforcement learning from human preferences]] || 2017/06/12 || [[arxiv:1706.03741]]<br>[https://openai.com/research/learning-from-human-preferences Blog post]<br>[https://github.com/mrahtz/learning-from-human-preferences GitHub] ||  || [[OpenAI]] || [[RLHF]] ([[Reinforcement Learning from Human Feedback]]) ||
+|-
+|}