Papers

From AI Wiki
{{Needs Expansion}}
__TOC__
==Important Papers==
{| class="wikitable sortable"
|-
!Name
!Date
!Source
!Type
!Organization
!Product
!Note
|-
|[[ImageNet Classification with Deep Convolutional Neural Networks (AlexNet)]] || 2012 || [https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf AlexNet Paper] || [[Computer Vision]] ||  || [[AlexNet]] ||
|-
|[[Efficient Estimation of Word Representations in Vector Space (Word2Vec)]] || 2013/01/16 || [[arxiv:1301.3781]] || [[Natural Language Processing]] ||  || [[Word2Vec]] ||
|-
|[[Playing Atari with Deep Reinforcement Learning (DQN)]] || 2013/12/19 || [[arxiv:1312.5602]] ||  ||  || [[DQN]] ([[Deep Q-Network]]) ||
|-
|[[Generative Adversarial Networks (GAN)]] || 2014/06/10 || [[arxiv:1406.2661]] ||  ||  || [[GAN]] ([[Generative Adversarial Network]]) ||
|-
|[[Very Deep Convolutional Networks for Large-Scale Image Recognition (VGGNet)]] || 2014/09/04 || [[arxiv:1409.1556]] || [[Computer Vision]] ||  || [[VGGNet]] ||
|-
|[[Sequence to Sequence Learning with Neural Networks (Seq2Seq)]] || 2014/09/10 || [[arxiv:1409.3215]] || [[Natural Language Processing]] ||  || [[Seq2Seq]] ||
|-
|[[Going Deeper with Convolutions (GoogleNet)]] || 2014/09/17 || [[arxiv:1409.4842]] || [[Computer Vision]] || [[Google]] || [[GoogleNet]] ||
|-
|[[Adam: A Method for Stochastic Optimization]] || 2014/12/22 || [[arxiv:1412.6980]] ||  ||  || [[Adam]] ||
|-
|[[Deep Residual Learning for Image Recognition (ResNet)]] || 2015/12/10 || [[arxiv:1512.03385]] || [[Computer Vision]] ||  || [[ResNet]] ||
|-
|[[Asynchronous Methods for Deep Reinforcement Learning (A3C)]] || 2016/02/04 || [[arxiv:1602.01783]] ||  ||  || [[A3C]] ||
|-
|[[WaveNet: A Generative Model for Raw Audio]] || 2016/09/12 || [[arxiv:1609.03499]] || [[Audio]] ||  || [[WaveNet]] ||
|-
|[[Attention Is All You Need (Transformer)]] || 2017/06/12 || [[arxiv:1706.03762]] || [[Natural Language Processing]] || [[Google]] || [[Transformer]] || Introduced the [[Transformer]] architecture
|-
|[[Proximal Policy Optimization Algorithms (PPO)]] || 2017/07/20 || [[arxiv:1707.06347]] ||  ||  || [[PPO]] ([[Proximal Policy Optimization]]) ||
|-
|[[Improving Language Understanding by Generative Pre-Training (GPT)]] || 2018 || [https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf paper source] || [[Natural Language Processing]] || [[OpenAI]] || [[GPT]] ([[Generative Pre-trained Transformer]]) ||
|-
|[[Deep contextualized word representations (ELMo)]] || 2018/02/15 || [[arxiv:1802.05365]] || [[Natural Language Processing]] ||  || [[ELMo]] ||
|-
|[[GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding]] || 2018/04/20 || [[arxiv:1804.07461]]<br>[https://gluebenchmark.com/ website] || [[Natural Language Processing]] ||  || [[GLUE]] ([[General Language Understanding Evaluation]]) ||
|-
|[[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]] || 2018/10/11 || [[arxiv:1810.04805]] || [[Natural Language Processing]] || [[Google]] || [[BERT]] ([[Bidirectional Encoder Representations from Transformers]]) ||
|-
|[[Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context]] || 2019/01/09 || [[arxiv:1901.02860]]<br>[https://github.com/kimiyoung/transformer-xl github] ||  ||  || [[Transformer-XL]] ||
|-
|[[Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (MuZero)]] || 2019/11/19 || [[arxiv:1911.08265]] ||  ||  || [[MuZero]] ||
|-
|[[Language Models are Few-Shot Learners (GPT-3)]] || 2020/05/28 || [[arxiv:2005.14165]] || [[Natural Language Processing]] || [[OpenAI]] || [[GPT-3]] ||
|-
|[[An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)]] || 2020/10/22 || [[arxiv:2010.11929]]<br>[https://github.com/google-research/vision_transformer GitHub] || [[Computer Vision]] || [[Google]] || [[ViT]] ([[Vision Transformer]]) ||
|-
|[[Learning Transferable Visual Models From Natural Language Supervision (CLIP)]] || 2021/02/26 || [[arxiv:2103.00020]]<br>[https://openai.com/blog/clip/ Blog Post] || [[Computer Vision]] || [[OpenAI]] || [[CLIP]] ([[Contrastive Language-Image Pre-Training]]) ||
|-
|[[MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer]] || 2021/10/05 || [[arxiv:2110.02178]]<br>[https://github.com/apple/ml-cvnets GitHub] || [[Computer Vision]] || [[Apple]] || [[MobileViT]] ||
|-
|[[LaMDA: Language Models for Dialog Applications]] || 2022/01/20 || [[arxiv:2201.08239]]<br>[https://blog.google/technology/ai/lamda/ Blog Post] || [[Natural Language Processing]] || [[Google]] || [[LaMDA]] ([[Language Models for Dialog Applications]]) ||
|-
|[[Block-Recurrent Transformers]] || 2022/03/11 || [[arxiv:2203.07852]] ||  ||  ||  ||
|-
|[[Memorizing Transformers]] || 2022/03/16 || [[arxiv:2203.08913]] ||  ||  ||  ||
|-
|[[STaR: Bootstrapping Reasoning With Reasoning]] || 2022/03/28 || [[arxiv:2203.14465]] ||  ||  || [[STaR]] ([[Self-Taught Reasoner]]) ||
|-
|[[GPT-4 Technical Report]] || 2023/03/15 || [[arxiv:2303.08774]]<br>[https://openai.com/research/gpt-4 blog post]<br>[https://cdn.openai.com/papers/gpt-4-system-card.pdf system card] || [[Natural Language Processing]]<br>[[Multimodal]] || [[OpenAI]] || [[GPT-4]] ||
|-
|}


==Other Papers==
{| class="wikitable sortable"
|-
!Name
!Date
!Source
!Type
!Organization
!Product
!Note
|-
|[[Self-Rewarding Language Models]] || 2024/01/18 || [[arxiv:2401.10020]] || [[Natural Language Processing]] || [[Meta]] ||  ||
|-
|[[LLM in a flash: Efficient Large Language Model Inference with Limited Memory]] || 2023/12/12 || [[arxiv:2312.11514]]<br>[https://huggingface.co/papers/2312.11514 HuggingFace] || [[Natural Language Processing]] || [[Apple]] ||  ||
|-
|[[Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation]] || 2023/12/07 || [[arxiv:2311.17117]]<br>[https://humanaigc.github.io/animate-anyone/ Website]<br>[https://www.youtube.com/watch?v=8PCn5hLKNu4 Video]<br>[https://github.com/HumanAIGC/AnimateAnyone GitHub]<br>[https://www.alibabacloud.com/blog/quickly-set-up-a-virtual-clothes-try-on-services-with-pai_600484 Tutorial] || [[Computer Vision]] || [[Alibaba]] || [[Animate Anyone]] ||
|-
|[[MatterGen: a generative model for inorganic materials design]] || 2023/12/06 || [[arxiv:2312.03687]]<br>[https://twitter.com/xie_tian/status/1732798976779595968 Tweet] || [[Materials Science]] || [[Microsoft]] || [[MatterGen]] ||
|-
|[[Audiobox: Generating audio from voice and natural language prompts]] || 2023/11/30 || [https://ai.meta.com/research/publications/audiobox-unified-audio-generation-with-natural-language-prompts/ Paper]<br>[https://ai.meta.com/blog/audiobox-generating-audio-voice-natural-language-prompts/ Website] || [[Audio]] || [[Meta]] || [[Audiobox]] ||
|-
|[[Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text]] || 2023/11/30 || [[arxiv:2311.18805]] || [[Natural Language Processing]] || [[University of Tokyo]] || [[Scrambled Bench]] ||
|-
|[[MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers]] || 2023/11/27 || [[arxiv:2311.15475]]<br>[https://nihalsid.github.io/mesh-gpt/ Website] || [[Computer Vision]] ||  || [[MeshGPT]] ||
|-
|[[Ferret: Refer and Ground Anything Anywhere at Any Granularity]] || 2023/10/11 || [[arxiv:2310.07704]]<br>[https://github.com/apple/ml-ferret GitHub] || [[Multimodal]]<br>[[Natural Language Processing]] || [[Apple]] || [[Ferret]] ||
|-
|[[SeamlessM4T - Massively Multilingual & Multimodal Machine Translation]] || 2023/08/23 || [https://ai.meta.com/research/publications/seamless-m4t/ Paper]<br>[https://ai.meta.com/resources/models-and-libraries/seamless-communication/ Website]<br>[https://seamless.metademolab.com/ Demo]<br>[https://github.com/facebookresearch/seamless_communication GitHub] || [[Natural Language Processing]] || [[Meta]] || [[SeamlessM4T]] ||
|-
|[[RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control]] || 2023/08/01 || [[arxiv:2307.15818]]<br>[https://robotics-transformer2.github.io/ Website]<br>[https://www.deepmind.com/blog/rt-2-new-model-translates-vision-and-language-into-action Blogpost] || [[Robotics]] || [[Google]] || [[RT-2]] ||
|-
|[[Towards Generalist Biomedical AI]] || 2023/07/26 || [[arxiv:2307.14334]] || [[Natural Language Processing]] || [[Google]] || [[Med-PaLM M]] ||
|-
|[[Large Language Models Understand and Can be Enhanced by Emotional Stimuli]] || 2023/07/14 || [[arxiv:2307.11760]] || [[Natural Language Processing]] ||  || [[EmotionPrompt]] ||
|-
|[[MusicGen: Simple and Controllable Music Generation]] || 2023/06/08 || [[arxiv:2306.05284]]<br>[https://github.com/facebookresearch/audiocraft GitHub]<br>[https://ai.honu.io/papers/musicgen/ Example] || [[Audio]] || [[Meta]] || [[MusicGen]] ||
|-
|[[CodeTF: One-stop Transformer Library for State-of-the-art Code LLM]] || 2023/05/31 || [[arxiv:2306.00029]]<br>[https://github.com/salesforce/CodeTF GitHub] || [[Natural Language Processing]] || [[Salesforce]] || [[CodeTF]] ||
|-
|[[Bytes Are All You Need: Transformers Operating Directly On File Bytes]] || 2023/05/31 || [[arxiv:2306.00238]] || [[Computer Vision]] || [[Apple]] ||  ||
|-
|[[Scaling Speech Technology to 1,000+ Languages]] || 2023/05/22 || [https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/ Paper]<br>[https://ai.facebook.com/blog/multilingual-model-speech-recognition/ Blogpost]<br>[https://github.com/facebookresearch/fairseq/tree/main/examples/mms GitHub]<br>[https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mms.html Languages covered] || [[Natural Language Processing]] || [[Meta]] || [[Massively Multilingual Speech]] ([[MMS]]) ||
|-
|[[RWKV: Reinventing RNNs for the Transformer Era]] || 2023/05/22 || [[arxiv:2305.13048]] || [[Natural Language Processing]] ||  || [[Receptance Weighted Key Value]] ([[RWKV]]) ||
|-
|[[ImageBind: One Embedding Space To Bind Them All]] || 2023/05/09 || [[arxiv:2305.05665]]<br>[https://imagebind.metademolab.com/ Website]<br>[https://imagebind.metademolab.com/demo Demo]<br>[https://ai.facebook.com/blog/imagebind-six-modalities-binding-ai/ Blog]<br>[https://github.com/facebookresearch/ImageBind GitHub] || [[Multimodal]]<br>[[Computer Vision]]<br>[[Natural Language Processing]] || [[Meta]] || [[ImageBind]] ||
|-
|[[Real-Time Neural Appearance Models]] || 2023/05/05 || [https://research.nvidia.com/labs/rtr/neural_appearance_models/assets/nvidia_neural_materials_paper-2023-05.pdf Paper]<br>[https://research.nvidia.com/labs/rtr/neural_appearance_models/ Blog] ||  || [[NVIDIA]] ||  ||
|-
|[[Poisoning Language Models During Instruction Tuning]] || 2023/05/01 || [[arxiv:2305.00944]] || [[Natural Language Processing]] ||  ||  ||
|-
|[[Generative Agents: Interactive Simulacra of Human Behavior]] || 2023/04/07 || [[arxiv:2304.03442]] || [[Human-AI Interaction]]<br>[[Natural Language Processing]] || [[Stanford]] || [[Generative agents]] ||
|-
|[[Segment Anything]] || 2023/04/05 || [https://ai.facebook.com/research/publications/segment-anything/ Paper]<br>[https://segment-anything.com/ Website]<br>[https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/ Blog]<br>[https://github.com/facebookresearch/segment-anything GitHub] || [[Computer Vision]] || [[Meta]] || [[Segment Anything Model]] ([[SAM]]) ||
|-
|[[HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace (Microsoft JARVIS)]] || 2023/03/30 || [[arxiv:2303.17580]]<br>[https://huggingface.co/spaces/microsoft/HuggingGPT HuggingFace Space]<br>[https://github.com/microsoft/JARVIS JARVIS GitHub] || [[Natural Language Processing]]<br>[[Multimodal]] || [[Microsoft]]<br>[[Hugging Face]] || [[HuggingGPT]]<br>[[JARVIS]] ||
|-
|[[BloombergGPT: A Large Language Model for Finance]] || 2023/03/30 || [[arxiv:2303.17564]]<br>[https://www.bloomberg.com/company/press/bloomberggpt-50-billion-parameter-llm-tuned-finance/ press release]<br>[https://twitter.com/rasbt/status/1642880757566676992 twitter thread] || [[Natural Language Processing]] || [[Bloomberg]] || [[BloombergGPT]] ||
|-
|[[Sparks of Artificial General Intelligence: Early experiments with GPT-4]] || 2023/03/22 || [[arxiv:2303.12712]] || [[Natural Language Processing]]<br>[[Multimodal]] || [[Microsoft]] ||  ||
|-
|[[Reflexion: an autonomous agent with dynamic memory and self-reflection]] || 2023/03/20 || [[arxiv:2303.11366]]<br>[https://nanothoughts.substack.com/p/reflecting-on-reflexion blog post]<br>[https://github.com/noahshinn024/reflexion GitHub]<br>[https://github.com/GammaTauAI/reflexion-human-eval GitHub 2] || [[Natural Language Processing]] ||  || [[Reflexion]] ||
|-
|[[PaLM-E: An Embodied Multimodal Language Model]] || 2023/03/06 || [[arxiv:2303.03378]]<br>[https://palm-e.github.io/ blog] || [[Natural Language Processing]]<br>[[Multimodal]] || [[Google]] || [[PaLM-E]] ||
|-
|[[Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages]] || 2023/03/02 || [[arxiv:2303.01037]]<br>[https://sites.research.google/usm/ blog] || [[Natural Language Processing]] || [[Google]] || [[Universal Speech Model]] ([[USM]]) ||
|-
|[[Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)]] || 2023/02/27 ||[[arxiv:2302.14045]] || [[Natural Language Processing]] || [[Microsoft]] || [[Kosmos-1]] ||
|-
|[[LLaMA: Open and Efficient Foundation Language Models]] || 2023/02/25 ||[https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/ paper]<br>[https://ai.facebook.com/blog/large-language-model-llama-meta-ai/ blog post]<br>[https://github.com/facebookresearch/llama github] || [[Natural Language Processing]] || [[Meta]] || [[LLaMA]] ||
|-
|[[Structure and Content-Guided Video Synthesis with Diffusion Models (Gen-1)]] || 2023/02/06 ||[[arxiv:2302.03011]]<br>[https://research.runwayml.com/gen1 blog post] || [[Video-to-Video]] || [[Runway]] || [[Gen-1]] ||
|-
|[[Dreamix: Video Diffusion Models are General Video Editors]] || 2023/02/03 ||[[arxiv:2302.01329]]<br>[https://dreamix-video-editing.github.io/ blog post] ||  || [[Google]] || [[Dreamix]] ||
|-
|[[FLAME: A small language model for spreadsheet formulas]] || 2023/01/31 ||[[arxiv:2301.13779]] ||  || [[Microsoft]] || [[FLAME]] ||
|-
|[[SingSong: Generating musical accompaniments from singing]] || 2023/01/30 ||[[arxiv:2301.12662]]<br>[https://storage.googleapis.com/sing-song/index.html blog post] || [[Audio]] ||  || [[SingSong]] ||
|-
|[[MusicLM: Generating Music From Text]] || 2023/01/26 ||[[arxiv:2301.11325]]<br>[https://google-research.github.io/seanet/musiclm/examples/ blog post] || [[Audio]] || [[Google]] || [[MusicLM]] ||
|-
|[[Mastering Diverse Domains through World Models (DreamerV3)]] || 2023/01/10 ||[[arxiv:2301.04104v1]]<br>[https://danijar.com/project/dreamerv3/ blogpost] ||  || [[DeepMind]] || [[DreamerV3]] ||
|-
|[[Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E)]] || 2023/01/05 ||[[arxiv:2301.02111]]<br>[https://valle-demo.github.io/ Demo] ||  || [[Microsoft]] || [[VALL-E]] ||
|-
|[[Muse: Text-To-Image Generation via Masked Generative Transformers]] || 2023/01/02 ||[[arxiv:2301.00704]]<br>[https://muse-model.github.io/ blog post] || [[Computer Vision]] || [[Google]] || [[Muse]] ||
|-
|[[BLOOM: A 176B-Parameter Open-Access Multilingual Language Model]] || 2022/11/09 || [[arxiv:2211.05100]]<br>[https://bigscience.huggingface.co/blog/bloom Blog Post] || [[Natural Language Processing]] || [[Hugging Face]] || [[BLOOM]] || Open-access multilingual [[LLM]]; an alternative to [[GPT-3]]
|-
|[[ReAct: Synergizing Reasoning and Acting in Language Models]] || 2022/10/06 || [[arxiv:2210.03629]]<br>[https://ai.googleblog.com/2022/11/react-synergizing-reasoning-and-acting.html Blogpost]<br>[https://github.com/ysymyth/ReAct GitHub] || [[Natural Language Processing]] || [[Google]] || [[ReAct]] ||
|-
|[[AudioLM: a Language Modeling Approach to Audio Generation]] || 2022/09/07 || [[arxiv:2209.03143]]<br>[https://google-research.github.io/seanet/audiolm/examples/ web page]<br>[https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html blog post] || [[Audio]] || [[Google]] || [[AudioLM]] ||
|-
|[[Emergent Abilities of Large Language Models]] || 2022/06/15 || [[arxiv:2206.07682]]<br>[https://ai.googleblog.com/2022/11/characterizing-emergent-phenomena-in.html Blog Post] || [[Natural Language Processing]] || [[Google]] || [[Emergent Abilities]] ||
|-
|[[Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)]] || 2022/05/23 || [[arxiv:2205.11487]]<br>[https://imagen.research.google/ Blog Post] || [[Computer Vision]] || [[Google]] || [[Imagen]] ||
|-
|[[A Generalist Agent (Gato)]] || 2022/05/12 || [[arxiv:2205.06175]]<br>[https://www.deepmind.com/blog/a-generalist-agent Blog Post] || [[Natural Language Processing]]<br>[[Multimodal]] || [[DeepMind]] || [[Gato]] ||
|-
|[[Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback]] || 2022/04/12 || [[arxiv:2204.05862]]<br>[https://github.com/anthropics/hh-rlhf GitHub] || [[Natural Language Processing]] || [[Anthropic]] || [[RLHF]] ([[Reinforcement Learning from Human Feedback]]) ||
|-
|[[PaLM: Scaling Language Modeling with Pathways]] || 2022/04/05 || [[arxiv:2204.02311]]<br>[https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html Blog Post] || [[Natural Language Processing]] || [[Google]] || [[PaLM]] ([[Pathways Language Model]]) ||
|-
|[[Chain-of-Thought Prompting Elicits Reasoning in Large Language Models]] || 2022/01/28 || [[arxiv:2201.11903]]<br>[https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html Blog Post] || [[Natural Language Processing]] || [[Google]] || [[Chain of Thought Prompting]] (CoT prompting) ||
|-
|[[Constitutional AI: Harmlessness from AI Feedback]] || 2022/12/15 || [[arxiv:2212.08073]] || [[Natural Language Processing]] || [[Anthropic]] || [[Constitutional AI]], [[Claude]] ||
|-
|[[Improving language models by retrieving from trillions of tokens (RETRO)]] || 2021/12/08 || [[arxiv:2112.04426]]<br>[https://www.deepmind.com/blog/improving-language-models-by-retrieving-from-trillions-of-tokens Blog post] || [[Natural Language Processing]] || [[DeepMind]] || [[RETRO]] ([[Retrieval Enhanced Transformer]]) ||
|-
|[[InstructPix2Pix: Learning to Follow Image Editing Instructions]] || 2022/11/17 || [[arxiv:2211.09800]]<br>[https://www.timothybrooks.com/instruct-pix2pix Blog Post] || [[Computer Vision]] || [[UC Berkeley]] || [[InstructPix2Pix]] ||
|-
|[[REALM: Retrieval-Augmented Language Model Pre-Training]] || 2020/02/10 || [[arxiv:2002.08909]]<br>[https://ai.googleblog.com/2020/08/realm-integrating-retrieval-into.html Blog Post] || [[Natural Language Processing]] || [[Google]] || [[REALM]] ([[Retrieval-Augmented Language Model Pre-Training]]) ||
|-
|[[Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)]] || 2019/10/23 || [[arxiv:1910.10683]]<br>[https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html blog post] || [[Natural Language Processing]] || [[Google]] || [[T5]] ([[Text-To-Text Transfer Transformer]]) ||
|-
|[[RoBERTa: A Robustly Optimized BERT Pretraining Approach]] || 2019/07/26 || [[arxiv:1907.11692]]<br>[https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/ blog post] || [[Natural Language Processing]] || [[Meta]] || [[RoBERTa]] ([[Robustly Optimized BERT Pretraining Approach]]) ||
|-
|[[Probabilistic Face Embeddings]] || 2019/04/21 || [[arxiv:1904.09658]] || [[Computer Vision]] ||  || [[PFE]]s ([[Probabilistic Face Embeddings]]) ||
|-
|[[Language Models are Unsupervised Multitask Learners (GPT-2)]] || 2019 || [https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf paper] || [[Natural Language Processing]] || [[OpenAI]] || [[GPT-2]] ||
|-
|[[Deep reinforcement learning from human preferences]] || 2017/06/12 || [[arxiv:1706.03741]]<br>[https://openai.com/research/learning-from-human-preferences Blog post]<br>[https://github.com/mrahtz/learning-from-human-preferences GitHub] ||  || [[OpenAI]] || [[RLHF]] ([[Reinforcement Learning from Human Feedback]]) ||
|-
|}

Latest revision as of 23:18, 5 February 2024
