A collection of papers and projects related to LLMs and corresponding data-centric methods.
Other publicly-available materials: [Slides]
If you find our survey useful, please cite the paper:
@article{LLMDATASurvey,
title={A Survey of LLM × DATA},
author={Xuanhe Zhou and Junxuan He and Wei Zhou and Haodong Chen and Zirui Tang and Haoyu Zhao and Xin Tong and Guoliang Li and Youmin Chen and Jun Zhou and Zhaojun Sun and Binyuan Hui and Shuo Wang and Conghui He and Zhiyuan Liu and Jingren Zhou and Fan Wu},
year={2025},
journal={arXiv preprint arXiv:2505.18458},
url={https://arxiv.org/abs/2505.18458}
}
@article{tangllmasanalyst,
title={LLM/Agent-as-Data-Analyst: A Survey},
author={Zirui Tang and Weizheng Wang and Zihang Zhou and Yang Jiao and Bangrui Xu and Boyu Niu and Xuanhe Zhou and Guoliang Li and Yeye He and Wei Zhou and Yitong Song and Cheng Tan and Bin Wang and Conghui He and Xiaoyang Wang and Fan Wu},
year={2025},
journal={arXiv preprint arXiv:2509.23988},
url={https://arxiv.org/abs/2509.23988}
}
The IaaS concept for LLM data (phonetically echoing Infrastructure as a Service) defines the characteristics of high-quality datasets along four key dimensions: (1) Inclusiveness ensures broad coverage across domains, tasks, sources, languages, styles, and modalities. (2) Abundance emphasizes sufficient and well-balanced data volume to support scaling, fine-tuning, and continual learning without overfitting. (3) Articulation requires clear, coherent, and instructive content with step-by-step reasoning to enhance model understanding and task performance. (4) Sanitization involves rigorous filtering to remove private, toxic, unethical, and misleading content, ensuring data safety, neutrality, and compliance.
We observe that the evolution of LLM/Agent-as-Data-Analyst techniques follows a five-dimension trajectory: (1) Data Modality (homogeneous → heterogeneous); (2) Analysis Functionality (literal → semantic); (3) Knowledge Scope (closed-world → open-world); (4) Tool Integration (tool-coupled → tool-assisted); (5) Development Autonomy (manual → fully autonomous).
- CommonCrawl: A massive web crawl dataset covering diverse languages and domains; widely used for LLM pretraining. [Source]
- The Stack: A large-scale dataset of permissively licensed source code in multiple programming languages; used for code LLMs. [HuggingFace]
- RedPajama: A replication of LLaMA’s training data recipe with open datasets; spans web, books, arXiv, and more. [Github]
- SlimPajama-627B-DC: A deduplicated and filtered subset of RedPajama (627B tokens); optimized for clean and efficient training. [HuggingFace]
- Alpaca-CoT: Instruction-following dataset enhanced with Chain-of-Thought (CoT) reasoning prompts; used for dialogue fine-tuning. [Github]
- LLaVA-Pretrain: A multimodal dataset with image-text pairs for training visual language models like LLaVA. [HuggingFace]
- Wikipedia: Structured and encyclopedic content; a foundational source for general-purpose language models. [HuggingFace]
- C4: A cleaned version of CommonCrawl data, widely used in models like T5 for high-quality web text. [HuggingFace]
- BookCorpus: Contains free fiction books; often used to teach models long-form language understanding. [HuggingFace]
- Arxiv: Scientific paper corpus from arXiv, covering physics, math, CS, and more; useful for academic language modeling. [HuggingFace]
- PubMed: Biomedical literature dataset from the PubMed database; key resource for medical domain models. [Source]
- StackExchange: Community Q&A data covering domains like programming, math, philosophy, etc.; useful for QA and dialogue tasks. [Source]
- OpenWebText2: A high-quality open-source web text dataset based on URLs commonly cited on Reddit; GPT-style training corpus. [Source]
- OpenWebMath: A corpus of mathematical web text extracted from Common Crawl; designed to improve mathematical reasoning in LLMs. [HuggingFace]
- Falcon-RefinedWeb: Filtered web data used in training Falcon models; emphasizes data quality through rigorous preprocessing. [HuggingFace]
- CCI 3.0: A large-scale multi-domain Chinese web corpus, suitable for training high-quality Chinese LLMs. [HuggingFace]
- OmniCorpus: A unified multimodal corpus of images interleaved with text, designed for general-purpose multimodal training. [Github]
- WanJuan3.0: A diverse and large-scale Chinese dataset including news, fiction, QA, and more; released by OpenDataLab. [Source]
- OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents Hugo Laurençon, Lucile Saulnier, Léo Tronchon, et al. NeurIPS 2023. [Paper]
- Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books Yukun Zhu, Ryan Kiros, Richard Zemel, et al. ICCV 2015. [Paper]
- MedicalGPT: Training Medical GPT Model Ming Xu. 2025. [Github]
- BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark Dakuan Lu, Hengkui Wu, Jiaqing Liang, et al. arXiv 2023. [Paper]
- Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM Mike Conover, Matt Hayes, Ankit Mathur, et al. 2023. [Source]
- MedicalGPT: Training Medical GPT Model [Github]
- DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services Shengbin Yue, Wei Chen, Siyuan Wang, et al. arXiv 2023. [Paper]
- MedicalGPT: Training Medical GPT Model [Github]
- UltraFeedback: Boosting Language Models with Scaled AI Feedback Ganqu Cui, Lifan Yuan, Ning Ding, et al. ICML 2024. [Paper]
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning DeepSeek-AI. arXiv 2025. [Paper]
- Kimi k1.5: Scaling Reinforcement Learning with LLMs Kimi Team. arXiv 2025. [Paper]
- DISC-LawLLM: Fine-tuning Large Language Models for Intelligent Legal Services [Paper]
- DH-RAG: A Dynamic Historical Context-Powered Retrieval-Augmented Generation Method for Multi-Turn Dialogue Feiyuan Zhang, Dezhi Zhu, James Ming, et al. arXiv 2025. [Paper]
- Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation Junde Wu, Jiayuan Zhu, Yunli Qi, et al. arXiv 2024. [Paper]
- ERAGent: Enhancing Retrieval-Augmented Language Models with Improved Accuracy, Efficiency, and Personalization Yunxiao Shi, Xing Zi, Zijing Shi, et al. arXiv 2024. [Paper]
- PersonaRAG: Enhancing Retrieval-Augmented Generation Systems with User-Centric Agents Saber Zerhoudi, Michael Granitzer. arXiv 2024. [Paper]
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI Xiang Yue, Yuansheng Ni, Kai Zhang, et al. CVPR 2024. [Paper]
- LexEval: A Comprehensive Chinese Legal Benchmark for Evaluating Large Language Models Haitao Li, You Chen, Qingyao Ai, et al. NeurIPS 2024. [Paper]
- What disease does this patient have? a large-scale open domain question answering dataset from medical exams Di Jin, Eileen Pan, Nassim Oufattole, et al. AAAI 2021. [Paper]
- Evaluating Large Language Models Trained on Code Mark Chen, Jerry Tworek, Heewoo Jun, et al. arXiv 2021. [Paper]
- STeCa: Step-level Trajectory Calibration for LLM Agent Learning Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li. arXiv 2025. [Paper]
- Large Language Model-Based Agents for Software Engineering: A Survey Junwei Liu, Kaixin Wang, Yixuan Chen, et al. arXiv 2024. [Paper]
- Advancing LLM Reasoning Generalists with Preference Trees Lifan Yuan, Ganqu Cui, Hanbin Wang, et al. arXiv 2024. [Paper]
- Tool Learning in the Wild: Empowering Language Models as Automatic Tool Agents Zhengliang Shi, Shen Gao, Lingyong Yan, et al. arXiv 2024. [Paper]
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations Ning Ding, Yulin Chen, Bokai Xu, et al. EMNLP 2023. [Paper]
Open-source or copyright-free resources
- Common Crawl: A large-scale publicly accessible web crawl dataset that provides massive raw webpages and metadata. It serves as a crucial raw data source in typical pretraining data pipelines, where it undergoes multiple processing steps such as cleaning, deduplication, and formatting to produce high-quality corpora for downstream model training. [Source]
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, et al. LREC-COLING 2024. [Paper]
- Project Gutenberg: A large collection of free eBooks from the public domain; supports training language models on long-form literary text. [Source]
- Open Library: A global catalog of books with metadata and some open-access content; useful for multilingual and knowledge-enhanced language modeling. [Source]
- GitHub: The world’s largest open-source code hosting platform; supports training models for code generation and understanding. [Source]
- GitLab: A DevOps platform for hosting both private and open-source projects; provides high-quality programming and documentation data. [Source]
- Bitbucket: A source code hosting platform by Atlassian; suitable for mining enterprise-level software development data. [Source]
- The Stack: 3 TB of permissively licensed source code Denis Kocetkov, Raymond Li, Loubna Ben Allal, et al. arXiv 2022. [Paper]
- CodeSearchNet Challenge: Evaluating the State of Semantic Code Search Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, et al. arXiv 2019. [Paper]
- An Empirical Comparison of Web Content Extraction Algorithms Janek Bevendorff, Sanket Gupta, Johannes Kiesel, Benno Stein. SIGIR 2023. [Paper]
- Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction Adrien Barbaresi. ACL 2021 Demo. [Paper]
- Fact or Fiction: Content Classification for Digital Libraries Aidan Finn, N. Kushmerick, Barry Smyth. DELOS Workshops / Conferences 2001. [Paper]
- Beautiful Soup: A Python-based library for parsing HTML and XML documents; supports extracting structured information from static web pages. [Source]
- Selenium: A browser automation tool that enables interaction with dynamic web pages; suitable for scraping JavaScript-heavy content. [Github]
- Playwright: A browser automation framework developed by Microsoft; supports multi-browser environments and is ideal for high-quality, concurrent web scraping tasks. [Source]
- Puppeteer: A Node.js library that provides a high-level API to control headless Chrome or Chromium; useful for scraping complex pages, taking screenshots, or generating PDFs. [Source]
Extract text from hand-written or non-textual data (e.g., scanned PDF documents)
Extract text using multiple components as a pipeline, including segmentation and OCR
- PaddleOCR: An open-source Optical Character Recognition (OCR) toolkit based on the PaddlePaddle deep learning framework; supports multilingual text detection and recognition, ideal for extracting text from images and document layout analysis. [Github]
- MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Junbo Niu, Zheng Liu, Zhuangcheng Gu, et al. arXiv 2025. [Paper]
- MinerU: An Open-Source Solution for Precise Document Content Extraction Bin Wang, Chao Xu, Xiaomeng Zhao, et al. arXiv 2024. [Paper]
Extract text using multi-modal LLMs as a whole, from image to text directly
- DeepSeek-OCR: Contexts Optical Compression Haoran Wei, Yaofeng Sun, Yukun Li. arXiv 2025. [Paper]
- dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang. arXiv 2025. [Paper]
- General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model Haoran Wei, Chenglong Liu, Jinyue Chen, et al. arXiv 2024. [Paper]
- Focus Anywhere for Fine-grained Multi-page Document Understanding Chenglong Liu, Haoran Wei, Jinyue Chen, et al. arXiv 2024. [Paper]
Identify and link entities for the extracted text
- UMIE: Unified Multimodal Information Extraction with Instruction Tuning Lin Sun, Kai Zhang, Qingyuan Li, Renze Lou. AAAI 2024. [Paper]
- ChatEL: Entity Linking with Chatbots Yifan Ding, Qingkai Zeng, Tim Weninger. LREC-COLING 2024. [Paper]
- WebIE: Faithful and Robust Information Extraction on the Web Chenxi Whitehouse, Clara Vania, Alham Fikri Aji, et al. ACL 2023. [Paper]
- Alignment-Augmented Consistent Translation for Multilingual Open Information Extraction Keshav Kolluru, Muqeeth Mohammed, Shubham Mittal, et al. ACL 2022. [Paper]
- Analysis of the Reasoning with Redundant Information Provided Ability of Large Language Models Wenbei Xie. arXiv 2023. [Paper]
- Scaling Laws and Interpretability of Learning from Repeated Data Danny Hernandez, Tom Brown, Tom Conerly, et al. arXiv 2022. [Paper]
Identify samples with identical patterns, such as matching hashes (MD5/SHA256) or shared substrings.
Identify samples with the same MD5/SHA256 hashes.
- MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Anas Awadalla, Le Xue, Oscar Lo, et al. NeurIPS 2024. [Paper]
- BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline Guosheng Dong, Da Pan, Yiding Sun, et al. arXiv 2024. [Paper]
- The llama 3 herd of models Llama Team, AI @ Meta. arXiv 2024. [Paper]
- The RefinedWeb dataset for falcon LLM: outperforming curated corpora with web data only Guilherme Penedo, Quentin Malartic, Daniel Hesslow, et al. NeurIPS 2023. [Paper]
- Deduplicating Training Data Makes Language Models Better Katherine Lee, Daphne Ippolito, Andrew Nystrom, et al. ACL 2022. [Paper]
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research Luca Soldaini, Rodney Kinney, Akshita Bhagia, et al. ACL 2024. [Paper]
- DataComp: In search of the next generation of multimodal datasets Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, et al. NeurIPS 2023. [Paper]
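The hash-based exact matching described above can be sketched in a few lines. This is a minimal illustration rather than the pipeline of any paper listed here: the function name `exact_dedup`, the whitespace-stripping normalization, and the toy corpus are all assumptions made for the example.

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each document, comparing SHA-256 digests."""
    seen = set()
    kept = []
    for doc in docs:
        # Assumption: strip surrounding whitespace before hashing so that a
        # trailing newline does not hide an otherwise identical sample.
        digest = hashlib.sha256(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

# Toy corpus (made up for illustration).
corpus = ["hello world", "hello world\n", "another sample", "hello world"]
deduped = exact_dedup(corpus)
```

Storing only digests keeps memory proportional to the number of unique samples rather than their total length, which is presumably why hashing is preferred over direct string comparison at scale.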
Identify samples with similar patterns. Algorithms include SimHash, MinHash and DotHash.
- LSHBloom: Memory-efficient, Extreme-scale Document Deduplication Arham Khan, Robert Underwood, Carlo Siebenschuh, et al. arXiv 2025. [Paper]
- SlimPajama-DC: Understanding Data Combinations for LLM Training Zhiqiang Shen, Tianhua Tao, Liqun Ma, et al. arXiv 2023. [Paper]
- Noise-Robust De-Duplication at Scale Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell. arXiv 2022. [Paper]
Sample-Level:
- The llama 3 herd of models [Paper]
- Deduplicating Training Data Makes Language Models Better [Paper]
- The RefinedWeb dataset for falcon LLM: outperforming curated corpora with web data only Guilherme Penedo, Quentin Malartic, Daniel Hesslow, et al. NeurIPS 2023. [Paper]
Line-Level:
- BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline [Paper]
Code Data:
- BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline [Paper]
- DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence Daya Guo, Qihao Zhu, Dejian Yang, et al. arXiv 2024. [Paper]
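As a rough illustration of the approximate matching described above, the sketch below estimates Jaccard similarity between documents with MinHash signatures. It is a simplified, assumption-laden version: word 3-gram shingles, 64 salted BLAKE2b hash functions, and no LSH banding, whereas production pipelines typically add banding to avoid comparing all pairs. All function names and the example texts are hypothetical.

```python
import hashlib

def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_perm=64):
    """Build a signature: the minimum hash value under each salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Made-up example texts: a and b differ in one word, c is unrelated.
a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river shore"
c = "large language models need clean training data at web scale"

sig_a, sig_b, sig_c = (minhash_signature(shingles(t)) for t in (a, b, c))
near_dup_score = estimated_jaccard(sig_a, sig_b)
unrelated_score = estimated_jaccard(sig_a, sig_c)
```

A pair would then be treated as near-duplicate when the estimated similarity exceeds a chosen threshold; the threshold itself is a tuning decision that differs across the works listed above.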
Identify samples with similar semantics.
- FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication Eric Slyman, Stefan Lee, Scott Cohen, Kushal Kafle. CVPR 2024. [Paper]
- SemDeDup: Data-efficient learning at web-scale through semantic deduplication Amro Abbas, Kushal Tirumala, Dániel Simig, et al. ICLR 2023. [Paper]
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari Morcos. NeurIPS 2023. [Paper]
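Semantic deduplication, as described above, compares samples in embedding space rather than by surface form. The sketch below is a simplified, hypothetical version: it assumes embeddings have already been computed by some encoder (the toy vectors here are made up), and it uses a greedy all-pairs check, whereas methods such as SemDeDup first cluster the embeddings so comparisons stay within a cluster.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_dedup(embeddings, threshold=0.95):
    """Greedily keep a sample only if it is not too similar to any kept sample."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings (hypothetical): items 0/1 and 2/3 are near-identical in meaning.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],
]
kept_indices = semantic_dedup(toy_embeddings)
```

The `threshold` value of 0.95 is an arbitrary choice for the example; lowering it removes more samples at the risk of discarding genuinely distinct content.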
Identify and down-sample samples with high frequency:
- SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training Nan He, Weichen Xiong, Hanwen Liu, et al. ACL 2024. [Paper]
Identify and up-sample samples with the removal rate after the entire filtering stage:
- DataComp: In search of the next generation of multimodal datasets Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, et al. NeurIPS 2023. [Paper]
Filter samples based on statistical metrics (e.g., cosine similarity) or model characteristics (e.g., loss, perplexity).
Perplexity Measuring:
- Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, et al. ICLR 2025. [Paper]
- Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models Dheeraj Mekala, Alex Nguyen, Jingbo Shang. ACL 2024. [Paper]
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning Ming Li, Yong Zhang, Zhitao Li, et al. NAACL 2024. [Paper]
- Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning Ming Li, Yong Zhang, Shwai He, et al. ACL 2024. [Paper]
- Improving Pretraining Data Using Perplexity Correlations Tristan Thrush, Christopher Potts, Tatsunori Hashimoto. arXiv 2024. [Paper]
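Perplexity-based filtering scores each sample with a reference model and keeps samples whose perplexity, exp(-(1/N) Σ log p(token)), falls in a desired range. The sketch below is only illustrative: it substitutes an add-one-smoothed unigram model for the small reference language model used in practice, and the reference texts, candidates, and threshold are made up.

```python
import math
from collections import Counter

def fit_unigram(reference_texts):
    """Fit an add-one-smoothed unigram model (a stand-in for a real reference LM)."""
    counts = Counter(w for t in reference_texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def perplexity(text, prob):
    """Perplexity = exp of the average negative log-likelihood per token."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(nll)

def filter_by_perplexity(texts, prob, max_ppl):
    """Keep texts whose perplexity under the reference model is at most max_ppl."""
    return [t for t in texts if perplexity(t, prob) <= max_ppl]

# Made-up data for illustration.
reference = ["the model is trained on clean text", "clean text helps the model"]
candidates = ["the model is trained on text", "zxq vbn qwe rty"]

prob = fit_unigram(reference)
kept = filter_by_perplexity(candidates, prob, max_ppl=15.0)
```

Keeping only low-perplexity samples is one possible policy; some of the works above instead select a middle or high perplexity band, so the cutoff direction should be treated as a design choice.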
Influence Assessment: Assess sample impact on LLM performance or learning process (e.g., model parameters)
- Data-efficient Fine-tuning for LLM-based Recommendation Xinyu Lin, Wenjie Wang, Yongqi Li, et al. SIGIR 2024. [Paper]
- SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning Yexiao He, Ziyao Wang, Zheyu Shen, et al. NeurIPS 2024. [Paper]
Clustering: Cluster semantically similar samples; selecting within clusters reduces redundancy, while selecting across clusters increases diversity.
- SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models Yu Yang, Siddhartha Mishra, Jeffrey Chiang, et al. NeurIPS 2024. [Paper]
- Effective Pruning of Web-Scale Datasets Based on Complexity of Concept Clusters Amro Abbas, Evgenia Rusak, Kushal Tirumala, et al. arXiv 2024. [Paper]
Filter samples based on sample quality as evaluated by an LLM.
- SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection Han Shen, Pin-Yu Chen, Payel Das, Tianyi Chen. ICLR 2025. [Paper]
- SCAR: Data Selection via Style-Consistency-Aware Response Ranking for Efficient Instruction Tuning of Large Language Models Zhuang Li, Yuncheng Hua, Thuy-Trang Vu, et al. ACL 2025. [Paper]
- QuRating: Selecting High-Quality Data for Training Language Models Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen. ICML 2024. [Paper]
- What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning Wei Liu, Weihao Zeng, Keqing He, et al. ICLR 2024. [Paper]
Filter samples using multiple types of methods.
- When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale Max Marion, Ahmet Üstün, Luiza Pozzobon, et al. arXiv 2023. [Paper]
- Instruction Mining: Instruction Data Selection for Tuning Large Language Models Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun. arXiv 2023. [Paper]
- MoDS: Model-oriented Data Selection for Instruction Tuning Qianlong Du, Chengqing Zong, Jiajun Zhang. arXiv 2023. [Paper]
Filter samples with specific information, including
- BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline [Paper]
- The llama 3 herd of models [Paper]
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [Paper]
- When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale Max Marion, Ahmet Üstün, Luiza Pozzobon, et al. arXiv 2023. [Paper]
Language Filtering (using fastText, langdetect, or CLD2)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [Paper]
- The RefinedWeb dataset for falcon LLM: outperforming curated corpora with web data only Guilherme Penedo, Quentin Malartic, Daniel Hesslow, et al. NeurIPS 2023. [Paper]
- Exploring the limits of transfer learning with a unified text-to-text transformer Colin Raffel, Noam Shazeer, Adam Roberts, et al. arXiv 2023. [Paper]
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling Leo Gao, Stella Biderman, Sid Black, et al. arXiv 2021. [Paper]
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, et al. LREC 2020. [Paper]
- BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline [Paper]
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [Paper]
- The RefinedWeb dataset for falcon LLM: outperforming curated corpora with web data only Guilherme Penedo, Quentin Malartic, Daniel Hesslow, et al. NeurIPS 2023. [Paper]
- Llama 2: Open Foundation and Fine-Tuned Chat Models GenAI, Meta. arXiv 2023. [Paper]
- DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4 Zhengliang Liu, Yue Huang, Xiaowei Yu, et al. arXiv 2023. [Paper]
- Analyzing Leakage of Personally Identifiable Information in Language Models Nils Lukas, Ahmed Salem, Robert Sim, et al. IEEE S&P 2023. [Paper]
- A Survey on Data Selection for Language Models Alon Albalak, Yanai Elazar, Sang Michael Xie, et al. arXiv 2024. [Paper]
- A Survey on Data Selection for LLM Instruction Tuning Jiahao Wang, Bolin Zhang, Qianlong Du, et al. arXiv 2024. [Paper]
Select a subset of data whose characteristics are similar to those of the target data
- Efficient Continual Pre-training for Building Domain Specific Large Language Models Yong Xie, Karan Aggarwal, Aitzaz Ahmad. Findings of ACL 2024. [Paper]
- Data Selection for Language Models via Importance Resampling Sang Michael Xie, Shibani Santurkar, Tengyu Ma, Percy Liang. NeurIPS 2023. [Paper]
- Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis Ruiyang Qin, Jun Xia, Zhenge Jia, et al. DAC 2024. [Paper]
- CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training David Brandfonbrener, Hanlin Zhang, Andreas Kirsch, et al. NeurIPS 2024. [Paper]
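One way to select data resembling a target distribution is to score each raw sample by an importance weight, the ratio of its likelihood under a target model to its likelihood under a raw-pool model. The sketch below is a heavily simplified illustration in that spirit: it assumes smoothed unigram models and takes the top-k samples deterministically, whereas importance resampling methods use richer features and sample stochastically. All names and data are hypothetical.

```python
import math
from collections import Counter

def fit_counts(texts):
    """Count word occurrences over a collection of texts."""
    return Counter(w for t in texts for w in t.lower().split())

def log_prob(text, counts, vocab_size):
    """Log-likelihood of a text under an add-one-smoothed unigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size))
        for w in text.lower().split()
    )

def select_like_target(raw_pool, target_texts, k):
    """Return the k raw samples with the highest log importance weight."""
    target_counts = fit_counts(target_texts)
    raw_counts = fit_counts(raw_pool)
    vocab_size = len(set(target_counts) | set(raw_counts))

    def log_weight(text):
        # log p_target(x) - log p_raw(x)
        return (log_prob(text, target_counts, vocab_size)
                - log_prob(text, raw_counts, vocab_size))

    return sorted(raw_pool, key=log_weight, reverse=True)[:k]

# Made-up example: the target domain is medical text.
target = ["patient diagnosis and treatment", "clinical treatment of the patient"]
raw = [
    "the patient received clinical treatment",
    "stock prices fell sharply today",
    "the football match ended in a draw",
]
selected = select_like_target(raw, target, k=1)
```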
Select a subset of data to improve model performance on target tasks
- DSDM: model-aware dataset selection with datamodels Logan Engstrom, Axel Feldmann, Aleksander Mądry. ICML 2024. [Paper]
- LESS: Selecting Influential Data for Targeted Instruction Tuning Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, et al. ICML 2024. [Paper]
- TSDS: Data Selection for Task-Specific Model Finetuning Zifan Liu, Amin Karbasi, Theodoros Rekatsinas. arXiv 2024. [Paper]
Select a subset of data with the help of an LLM
- Autonomous Data Selection with Language Models for Mathematical Texts Yifan Zhang, Yifan Luo, Yang Yuan, et al. ICLR 2024. [Paper]
- Mixtera: A Data Plane for Foundation Model Training Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, Dan Graur, Viktor Gsteiger, Ana Klimovic. arXiv 2025. [Paper]
- Scalable Data Ablation Approximations for Language Models through Modular Training and Merging Clara Na, Ian Magnusson, Ananya Harsh Jha, et al. EMNLP 2024. [Paper]
- Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, et al. COLING 2024. [Paper]
Set the mixing ratio empirically (e.g., based on complexity and diversity)
- Exploring the limits of transfer learning with a unified text-to-text transformer [Paper]
- Maximize Your Data's Potential: Enhancing LLM Accuracy with Two-Phase Pretraining Steven Feng, Shrimai Prabhumoye, Kezhi Kong, et al. arXiv 2024. [Paper]
- SlimPajama-DC: Understanding Data Combinations for LLM Training Zhiqiang Shen, Tianhua Tao, Liqun Ma, et al. arXiv 2023. [Paper]
- BiMix: Bivariate Data Mixing Law for Language Model Pretraining Ce Ge, Zhijian Ma, Daoyuan Chen, et al. arXiv 2024. [Paper]
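A common empirical heuristic for setting mixing ratios is temperature-scaled mixing, one of the strategies explored in the T5 paper listed above: each dataset's sampling rate is proportional to its (optionally capped) size raised to the power 1/T. The sketch below is a minimal illustration; the function name, parameter names, and dataset sizes are made up.

```python
def temperature_mixing(sizes, temperature=1.0, limit=None):
    """Compute sampling rates proportional to min(size, limit) ** (1 / temperature)."""
    # Optionally cap very large datasets so they cannot dominate the mixture.
    capped = {
        name: (min(n, limit) if limit is not None else n)
        for name, n in sizes.items()
    }
    scaled = {name: n ** (1.0 / temperature) for name, n in capped.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

# Hypothetical dataset sizes (number of examples).
sizes = {"web": 900, "code": 100}

proportional = temperature_mixing(sizes, temperature=1.0)  # size-proportional
flattened = temperature_mixing(sizes, temperature=2.0)     # closer to uniform
```

With `temperature=1.0` the rates simply follow dataset size; raising the temperature moves the mixture toward uniform, which up-weights small datasets.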
Predict the best mixing ratio by fitting models (e.g., regression models or scaling laws) that relate mixture proportions to model performance
- RegMix: Data Mixture as Regression for Language Model Pre-training Qian Liu, Xiaosen Zheng, Niklas Muennighoff, et al. ICLR 2025. [Paper]
- Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance Jiasheng Ye, Peiju Liu, Tianxiang Sun, et al. ICLR 2025. [Paper]
- CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models Jiawei Gu, Zacc Yang, Chuanghao Ding, et al. EMNLP 2024. [Paper]
- BiMix: Bivariate Data Mixing Law for Language Model Pretraining Ce Ge, Zhijian Ma, Daoyuan Chen, et al. arXiv 2024. [Paper]
- D-CPT Law: Domain-specific Continual Pre-Training Scaling Law for Large Language Models Haoran Que, Jiaheng Liu, Ge Zhang, et al. arXiv 2024. [Paper]
- Data Proportion Detection for Optimized Data Management for Large Language Models Hao Liang, Keshi Zhao, Yajie Yang, et al. arXiv 2024. [Paper]
Find the best mixing ratio that minimizes both training and validation loss
- ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting Rui Pan, Jipeng Zhang, Xingyuan Pan, et al. ACL 2025. [Paper]
- DoGE: Domain Reweighting with Generalization Estimation Simin Fan, Matteo Pagliardini, Martin Jaggi. ICML 2024. [Paper]
Find the best mixing ratio with Distributionally Robust Optimization (DRO)
- Task-level Distributionally Robust Optimization for Large Language Model-based Dense Retrieval Guangyuan Ma, Yongliang Ma, Xing Wu, et al. AAAI 2025. [Paper]
- DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining Sang Michael Xie, Hieu Pham, Xuanyi Dong, et al. NeurIPS 2023. [Paper]
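The DRO-style reweighting above can be illustrated with a single multiplicative-weights update: domains where a proxy model's loss exceeds a reference model's loss are up-weighted. The sketch below is loosely in the spirit of DoReMi and is not its actual implementation; the function name, step size, smoothing constant, and loss values are assumptions made for the example.

```python
import math

def update_domain_weights(weights, proxy_loss, reference_loss,
                          step_size=1.0, smoothing=1e-3):
    """Perform one exponentiated-gradient step on per-domain sampling weights."""
    # Excess loss: how much worse the proxy model is than the reference model
    # on each domain, clipped at zero.
    excess = [max(p - r, 0.0) for p, r in zip(proxy_loss, reference_loss)]
    # Up-weight domains with large excess loss.
    scaled = [w * math.exp(step_size * e) for w, e in zip(weights, excess)]
    total = sum(scaled)
    k = len(weights)
    # Renormalize and mix with a uniform distribution for stability.
    return [(1 - smoothing) * s / total + smoothing / k for s in scaled]

# Made-up per-domain losses for two domains.
weights = [0.5, 0.5]
proxy_loss = [3.0, 2.0]
reference_loss = [2.0, 2.0]

new_weights = update_domain_weights(weights, proxy_loss, reference_loss)
```

In a full training loop this update would be applied repeatedly as the proxy model trains, with the averaged weights then used as the mixing ratio for the final model.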
- How to Synthesize Text Data without Model Collapse? Xuekai Zhu, Daixuan Cheng, Hengli Li, et al. ICML 2025. [Paper]
- Differentially Private Synthetic Data via Foundation Model APIs 2: Text Chulin Xie, Zinan Lin, Arturs Backurs, et al. ICML 2024. [Paper]
- WizardLM: Empowering Large Pre-Trained Language Models to Follow Complex Instructions Can Xu, Qingfeng Sun, Kai Zheng, et al. ICLR 2024. [Paper]
- LLM See, LLM Do: Leveraging Active Inheritance to Target Non-Differentiable Objectives Luísa Shimabucoro, Sebastian Ruder, Julia Kreutzer, et al. EMNLP 2024. [Paper]
- Augmenting Math Word Problems via Iterative Question Composing Haoxiong Liu, Yifan Zhang, Yifan Luo, et al. arXiv 2024. [Paper]
Augment raw pre-training text into various formats to scale data
Rephrasing raw corpora into various styles
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling Pratyush Maini, Skyler Seto, Richard Bai, et al. ACL 2024. [Paper]
Convert raw corpora into instructions
- Instruction Pre-Training: Language Models are Supervised Multitask Learners Daixuan Cheng, Yuxian Gu, Shaohan Huang, et al. EMNLP 2024. [Paper]
- VeCLIP: Improving CLIP Training via Visual-Enriched Captions Zhengfeng Lai, Haotian Zhang, Bowen Zhang, et al. ECCV 2024. [Paper]
- Improving CLIP Training with Language Rewrites Lijie Fan, Dilip Krishnan, Phillip Isola, et al. NeurIPS 2023. [Paper]
Augment fine-tuning text for specific use cases (e.g., domain augmentation, alignment, agentic LLM, etc.)
- Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation Jiachen Zhao, Wenlong Zhao, Andrew Drozdov, et al. ACL 2024. [Paper]
- Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models Haoran Li, Qingxiu Dong, Zhengyang Tang, et al. arXiv 2024. [Paper]
- AgentInstruct: Toward Generative Teaching with Agentic Flows Arindam Mitra, Luciano Del Corro, Guoqing Zheng, et al. arXiv 2024. [Paper]
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing Zhangchen Xu, Fengqing Jiang, Luyao Niu, et al. arXiv 2024. [Paper]
- Self-Instruct: Aligning Language Models with Self-Generated Instructions Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, et al. ACL 2023. [Paper]
- Kimi K2: Open Agentic Intelligence Kimi Team. arXiv 2025. [Paper]
- PDSS: A Privacy-Preserving Framework for Step-by-Step Distillation of Large Language Models Tao Fan, Weijing Chen, Yan Kang, et al. arXiv 2025. [Paper]
Synthesize reasoning steps using, e.g., CoT.
- LIMO: Less is More for Reasoning Yixin Ye, Zhen Huang, Yang Xiao, et al. COLM 2025. [Paper]
- Distilling System 2 into System 1 Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov. arXiv 2024. [Paper]
- Distilling Reasoning Capabilities into Smaller Language Models Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan. ACL 2024. [Paper]
- Symbolic chain-of-thought distillation: Small models can also "think" step-by-step Liunian Harold Li, Jack Hessel, Youngjae Yu, et al. arXiv 2024. [Paper]
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, et al. ACL 2023. [Paper]
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling Zhenyu Hou, Xin Lv, Rui Lu, et al. arXiv 2025. [Paper]
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations Peiyi Wang, Lei Li, Zhihong Shao, et al. ACL 2024. [Paper]
- Let's Verify Step by Step Hunter Lightman, Vineet Kosaraju, Yura Burda, et al. arXiv 2023. [Paper]
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. NeurIPS 2023. [Paper]
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback Yuntao Bai, Andy Jones, Kamal Ndousse, et al. arXiv 2022. [Paper]
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters! Dacheng Li, Shiyi Cao, Tyler Griggs, et al. arXiv 2025. [Paper]
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search Maohao Shen, Guangtao Zeng, Zhenting Qi, et al. arXiv 2025. [Paper]
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data Yinya Huang, Xiaohan Lin, Zhengying Liu, et al. ICLR 2024. [Paper]
- PaD: Program-aided Distillation Can Teach Small Models Reasoning Better than Chain-of-thought Fine-tuning Xuekai Zhu, Biqing Qi, Kaiyan Zhang, et al. NAACL 2024. [Paper]
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data Shenglai Zeng, Jiankun Zhang, Pengfei He, et al. arXiv 2024. [Paper]
- NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions Jia Li, Edward Beeching, Lewis Tunstall, et al. 2024. [Paper]
- QwQ: Reflect Deeply on the Boundaries of the Unknown Qwen Team. 2024. [Source]
- Knowledge Distillation Using Frontier Open-source LLMs: Generalizability and the Role of Synthetic Data Anup Shirgaonkar, Nikhil Pandey, Nazmiye Ceren Abay, et al. arXiv 2024. [Paper]
- Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks Minki Kang, Seanie Lee, Jinheon Baek, et al. NeurIPS 2023. [Paper]
- Dialogue chain-of-thought distillation for commonsense-aware conversational agents Hyungjoo Chae, Yongho Song, Kai Tzu-iunn Ong, et al. arXiv 2023. [Paper]
- MCC-KD: Multi-CoT consistent knowledge distillation Hongzhan Chen, Siyue Wu, Xiaojun Quan, et al. arXiv 2023. [Paper]
- Large language models are reasoning teachers Namgyu Ho, Laura Schmid, Se-Young Yun. arXiv 2023. [Paper]
- Leveraging training data in few-shot prompting for numerical reasoning Zhanming Jie, Wei Lu. arXiv 2023. [Paper]
- Distilling reasoning capabilities into smaller language models Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan. arXiv 2023. [Paper]
- SCOTT: Self-consistent chain-of-thought distillation Peifeng Wang, Zhengyang Wang, Zheng Li, et al. arXiv 2023. [Paper]
- Democratizing reasoning ability: Tailored learning from large language model Zhaoyang Wang, Shaohan Huang, Yuxuan Liu, et al. arXiv 2023. [Paper]
- Explanations from large language models make small reasoners better Shiyang Li, Jianshu Chen, Yelong Shen, et al. arXiv 2022. [Paper]
- Training Verifiers to Solve Math Word Problems Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al. arXiv 2021. [Paper]
- Llama 2: Open Foundation and Fine-Tuned Chat Models [Paper]
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling [Paper]
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [Paper]
- BERT-Tiny-Chinese: A lightweight Chinese BERT pre-trained model released by CKIP Lab, with a small number of parameters; suitable for use as an encoder in pre-training data augmentation tasks to enhance efficiency for compact models. [Source]
- Case2Code: Scalable Synthetic Data for Code Generation Yunfan Shao, Linyang Li, Yichuan Ma, et al. COLING 2025. [Paper]
- Advancing Mathematical Reasoning in Language Models: The Impact of Problem-Solving Data, Data Synthesis Methods, and Training Stages Zui Chen, Tianqiao Liu, Mi Tian, et al. ICLR 2025. [Paper]
- Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks Bin Xiao, Haiping Wu, Weijian Xu, et al. CVPR 2024. [Paper]
- DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood, et al. CVPR 2024. [Paper]
- Magicoder: Empowering Code Generation with OSS-Instruct Yuxiang Wei, Zhe Wang, Jiawei Liu, et al. ICML 2024. [Paper]
- JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models Kun Zhou, Beichen Zhang, Jiapeng Wang, et al. arXiv 2024. [Paper]
- Diffusion Models and Representation Learning: A Survey Michael Fuest, Pingchuan Ma, Ming Gui, et al. arXiv 2024. [Paper]
- CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning Qingqing Cao, Mahyar Najibi, Sachin Mehta. arXiv 2024. [Paper]
- Qwen2 Technical Report An Yang, Baosong Yang, Binyuan Hui, et al. arXiv 2024. [Paper]
- TinyLlama: An Open-Source Small Language Model Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu. arXiv 2024. [Paper]
- On the Diversity of Synthetic Data and its Impact on Training Large Language Models Hao Chen, Abdul Waheed, Xiang Li, et al. arXiv 2024. [Paper]
- Towards Effective and Efficient Continual Pre-training of Large Language Models Jie Chen, Zhipeng Chen, Jiapeng Wang, et al. arXiv 2024. [Paper]
- Effective Data Augmentation With Diffusion Models Brandon Trabucco, Kyle Doherty, Max Gurinas, et al. arXiv 2023. [Paper]
- Mistral 7B Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. arXiv 2023. [Paper]
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis Dustin Podell, Zion English, Kyle Lacey, et al. arXiv 2023. [Paper]
- Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus Jesse Dodge, Maarten Sap, Ana Marasović, et al. EMNLP 2021. [Paper]
- First Steps of an Approach to the ARC Challenge based on Descriptive Grid Models and the Minimum Description Length Principle Sébastien Ferré (Univ Rennes, CNRS, IRISA). arXiv 2021. [Paper]
- TinyBERT: Distilling BERT for Natural Language Understanding Xiaoqi Jiao, Yichun Yin, Lifeng Shang, et al. Findings of EMNLP 2020. [Paper]
- HellaSwag: Can a Machine Really Finish Your Sentence? Rowan Zellers, Ari Holtzman, Yonatan Bisk, et al. ACL 2019. [Paper]
- Augmenting Math Word Problems via Iterative Question Composing [Paper]
- Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning Yiming Huang, Xiao Liu, Yeyun Gong, et al. arXiv 2024. [Paper]
- Data-Juicer: A One-Stop Data Processing System for Large Language Models Daoyuan Chen, Yilun Huang, Zhijian Ma, et al. SIGMOD 2024. [Paper]
- An Integrated Data Processing Framework for Pretraining Foundation Models Yiding Sun, Feng Wang, Yutao Zhu, et al. SIGIR 2024. [Paper]
- Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models Hyunbyung Park, Sukyung Lee, Gyoungjin Gim, et al. arXiv 2024. [Paper]
- Exploring the limits of transfer learning with a unified text-to-text transformer [Paper]
- The RefinedWeb dataset for falcon LLM: outperforming curated corpora with web data only Guilherme Penedo, Quentin Malartic, Daniel Hesslow, et al. NeurIPS 2023. [Paper]
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher Jack W. Rae, Sebastian Borgeaud, Trevor Cai, et al. arXiv 2021. [Paper]
- CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, et al. LREC 2020. [Paper]
- Bag of Tricks for Efficient Text Classification Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov. arXiv 2016. [Paper]
- Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development Daoyuan Chen, Haibin Wang, Yilun Huang, et al. ICML 2025 (Spotlight). [Paper]
- TFRecord: A binary data storage format recommended by TensorFlow, suitable for efficient storage and reading of large-scale training data. [Source]
- MindRecord: An efficient data storage format used by MindSpore, supporting multi-platform data management. [Source]
- tf.data.Dataset: An abstract interface in TensorFlow representing collections of training data, enabling flexible data manipulation. [Source]
- COCO JSON: The COCO JSON format uses structured JSON to store image metadata and the corresponding annotations; widely used in computer vision datasets. [Source]
- PyTorch-specific formats (.pt, .pth): PyTorch’s .pt and .pth formats are used to save model parameters and architecture, supporting model storage and loading. [Source]
- TensorFlow (SavedModel, .ckpt): TensorFlow’s SavedModel and checkpoint formats save complete model information, facilitating model reproduction and deployment. [Source]
- Hugging Face Transformers library: Hugging Face offers a unified model format interface to facilitate saving and usage of various pretrained models. [Source]
- Pickle (.pkl): Pickle format is used for serializing models and data, suitable for quick saving and loading. [Source]
- ONNX: An open cross-platform model format supporting model conversion and deployment across different frameworks. [Source]
- An Empirical Study of Safetensors' Usage Trends and Developers' Perceptions Beatrice Casey, Kaia Damian, Andrew Cotaj, et al. arXiv 2025. [Paper]
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [Paper]
- CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl Ilya Ilyankou, Meihui Wang, Stefano Cavazzi, et al. SIGSPATIAL 2024. [Paper]
- JuiceFS: A high-performance cloud-native distributed file system designed for efficient storage and access of large-scale data. [Github]
- 3FS: A distributed file system designed for deep learning and large-scale data processing, emphasizing high throughput and reliability. [Github]
- S3: A widely used cloud storage service offering secure, scalable, and highly available object storage solutions. [Source]
- HDFS Architecture Guide D. Borthakur et al. Hadoop Apache Project, 53(1-13):2, 2008. [Source]
- ProTrain: Efficient LLM Training via Memory-Aware Techniques Hanmei Yang, Jin Zhou, Yao Fu, et al. arXiv 2024. [Paper]
- ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, et al. SC 2021. [Paper]
- ZeRO-Offload: Democratizing Billion-Scale Model Training Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, et al. USENIX ATC 2021. [Paper]
- ZeRO: memory optimizations toward training trillion parameter models Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, et al. SC 2020. [Paper]
- vDNN: virtualized deep neural networks for scalable, memory-efficient neural network design Minsoo Rhu, Natalia Gimelshein, Jason Clemons, et al. MICRO-49 2016. [Paper]
- Survey of Hallucination in Natural Language Generation Ziwei Ji, Nayeon Lee, Rita Frieske, et al. ACM Computing Surveys 2022. [Paper]
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. NeurIPS 2020. [Paper]
- STELLA: A large-scale Chinese vector database supporting efficient vector search and semantic retrieval applications. [Source]
- Milvus: An open-source vector database focused on large-scale, high-performance similarity search and analysis. [Source]
- Weaviate: Weaviate offers a cloud-native vector search engine supporting intelligent search and knowledge graph construction for multimodal data. [Source]
- LanceDB: An efficient vector database designed for large-scale machine learning and recommendation systems. [Source]
- Mix-of-Granularity: Optimize the Chunking Granularity for Retrieval-Augmented Generation Zijie Zhong, Hanwen Liu, Xiaoya Cui, et al. COLING 2025. [Paper]
- Dense X Retrieval: What Retrieval Granularity Should We Use? Tong Chen, Hongwei Wang, Sihao Chen, et al. EMNLP 2024. [Paper]
- Scalable and Domain-General Abstractive Proposition Segmentation Mohammad Javad Hosseini, Yang Gao, Tim Baumgärtner, et al. Findings of EMNLP 2024. [Paper]
- A Hierarchical Context Augmentation Method to Improve Retrieval-Augmented LLMs on Scientific Papers Tian-Yi Che, Xian-Ling Mao, Tian Lan, et al. KDD 2024. [Paper]
- M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation Jianlyu Chen, Shitao Xiao, Peitian Zhang, et al. Findings of ACL 2024. [Paper]
- Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation Kaikai An, Fangkai Yang, Liqun Li, et al. arXiv 2024. [Paper]
- GleanVec: Accelerating Vector Search with Minimalist Nonlinear Dimensionality Reduction Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, et al. arXiv 2024. [Paper]
- The Faiss Library Matthijs Douze, Alexandr Guzhva, Chengqi Deng, et al. arXiv 2024. [Paper]
- Similarity Search in the Blink of an Eye with Compressed Indices Cecilia Aguerrebere, Ishwar Singh Bhati, Mark Hildebrand, et al. VLDB Endowment 2023. [Paper]
- LeanVec: Searching Vectors Faster by Making Them Fit Mariano Tepper, Ishwar Singh Bhati, Cecilia Aguerrebere, et al. arXiv 2023. [Paper]
- Towards General Text Embeddings with Multi-stage Contrastive Learning Zehan Li, Xin Zhang, Yanzhao Zhang, et al. arXiv 2023. [Paper]
- ArangoDB: A multi-model database that supports graph, document, and key-value data, suitable for handling complex relational queries. [Source]
- MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation Tianyu Fan, Jingyuan Wang, Xubin Ren, et al. arXiv 2025. [Paper]
- From Local to Global: A Graph RAG Approach to Query-Focused Summarization Darren Edge, Ha Trinh, Newman Cheng, et al. arXiv 2024. [Paper]
- LightRAG: Simple and Fast Retrieval-Augmented Generation Zirui Guo, Lianghao Xia, Yanhua Yu, et al. arXiv 2024. [Paper]
- Graph Databases Assessment: JanusGraph, Neo4j, and TigerGraph Jéssica Monteiro, et al. Perspectives and Trends in Education and Technology 2023. [Paper]
- Empirical Evaluation of a Cloud-Based Graph Database: the Case of Neptune Ghislain Auguste Atemezing. KGSWC 2021. [Paper]
- CacheLib: An open-source, high-performance embedded caching library developed by Meta to accelerate data access and increase system throughput. [Source]
- Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training Mark Zhao, Satadru Pan, Niket Agarwal, et al. USENIX ATC 2023. [Paper]
- Fluid: Dataset Abstraction and Elastic Acceleration for Cloud-native Deep Learning Training Jobs Rong Gu, Kai Zhang, Zhihao Xu, et al. ICDE 2022. [Paper]
- Quiver: An Informed Storage Cache for Deep Learning Abhishek Kumar, Muthian Sivathanu. USENIX FAST 2020. [Paper]
- cedar: Optimized and Unified Machine Learning Input Data Pipelines Mark Zhao, et al. Proceedings of the VLDB Endowment, Volume 18, Issue 2, 2025. [Paper]
- Pecan: cost-efficient ML data preprocessing with automatic transformation ordering and hybrid placement Dan Graur, Oto Mraz, Muyu Li, et al. USENIX ATC 2024. [Paper]
- tf.data service: A Case for Disaggregating ML Input Data Processing Andrew Audibert, Yang Chen, Dan Graur, et al. SoCC 2023. [Paper]
- Cachew: Machine Learning Input Data Processing as a Service Dan Graur, Damien Aymon, Dan Kluser, et al. USENIX ATC 2022. [Paper]
- Borg: the next generation Muhammad Tirmazi, Adam Barker, Nan Deng, et al. EuroSys 2020. [Paper]
- Optimizing RLHF Training for Large Language Models with Stage Fusion Yinmin Zhong, Zili Zhang, Bingyang Wu, et al. NSDI 2025. [Paper]
- SiloD: A Co-design of Caching and Scheduling for Deep Learning Clusters Hanyu Zhao, Zhenhua Han, Zhi Yang, et al. EuroSys 2023. [Paper]
- Optimization by Simulated Annealing S. Kirkpatrick, C. D. Gelatt, Jr., M. P. Vecchi. Science, 220(4598):671–680, 1983. [Paper]
- PaddleNLP: PaddleNLP supports checkpoint saving and resuming during training, enabling fault tolerance and recovery for long-running training tasks. [Source]
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs Ziheng Jiang, Haibin Lin, Yinmin Zhong, et al. USENIX NSDI 2024. [Paper]
- ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development Borui Wan, Mingji Han, Yiyao Sheng, et al. arXiv 2024. [Paper]
- GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints Zhuang Wang, Zhen Jia, Shuai Zheng, et al. SOSP 2023. [Paper]
- CheckFreq: Frequent, Fine-Grained DNN Checkpointing Jayashree Mohan, Amar Phanishayee, Vijay Chidambaram. USENIX FAST 2021. [Paper]
- ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, et al. SOSP 2024. [Paper]
- Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs John Thorpe, Pengzhan Zhao, Jonathan Eyolfson, et al. NSDI 2023. [Paper]
- Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates Insu Jang, Zhenning Yang, Zhen Zhang, et al. SOSP 2023. [Paper]
- Efficient Memory Management for Large Language Model Serving with PagedAttention Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al. SOSP 2023. [Paper]
- VTensor: Using Virtual Tensors to Build a Layout-oblivious AI Programming Framework Feng Yu, Jiacheng Zhao, Huimin Cui, et al. PACT 2020. [Paper]
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention Bin Gao, Zhuomin He, Puru Sharma, et al. USENIX ATC 2024. [Paper]
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation Chao Jin, Zili Zhang, Xuanlin Jiang, et al. arXiv 2024. [Paper]
- Adaptive KV-Cache Compression without Manually Setting Budget Chenxia Tang, Jianchun Liu, Hongli Xu, et al. arXiv 2025. [Paper]
- Fast State Restoration in LLM Serving with HCache Shiwei Gao, Youmin Chen, Jiwu Shu. EuroSys 2025. [Paper]
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving Yuhan Liu, Hanchen Li, Yihua Cheng, et al. SIGCOMM 2024. [Paper]
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models Akide Liu, Jing Liu, Zizheng Pan, et al. NeurIPS 2024. [Paper]
- Animating rotation with quaternion curves Ken Shoemake. ACM SIGGRAPH Computer Graphics, Volume 19, Issue 3. 1985. [Paper]
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition Lu Ye, Ze Tao, Yong Huang, et al. ACL 2024. [Paper]
- BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching Zhen Zheng, Xin Ji, Taosong Fang, et al. arXiv 2024. [Paper]
- Velocitune: A Velocity-based Dynamic Domain Reweighting Method for Continual Pre-training Zheheng Luo, Xin Zhang, Xiao Liu, et al. ACL 2025. [Paper]
- Mixtera: A Data Plane for Foundation Model Training Maximilian Böther, Xiaozhe Yao, Tolga Kerimoglu, et al. arXiv 2025. [Paper]
- How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition Guanting Dong, Hongyi Yuan, Keming Lu, et al. ACL 2024. [Paper]
- Mixture-of-Skills: Learning to Optimize Data Usage for Fine-Tuning Large Language Models Minghao Wu, Thuy-Trang Vu, Lizhen Qu, et al. EMNLP 2024. [Paper]
- Strategic Data Ordering: Enhancing Large Language Model Performance through Curriculum Learning Jisu Kim, Juhwan Lee. arXiv 2024. [Paper]
- Data Pruning via Moving-one-Sample-out Haoru Tan, Sitong Wu, Fei Du, et al. NeurIPS 2023. [Paper]
- NLU on Data Diets: Dynamic Data Subset Selection for NLP Classification Tasks Jean-michel Attendu, Jean-philippe Corbeil. SustaiNLP @ ACL 2023. [Paper]
- Efficient Online Data Mixing For Language Model Pre-Training Alon Albalak, Liangming Pan, Colin Raffel, et al. arXiv 2023. [Paper]
- BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning Mohsen Fayyaz, Ehsan Aghazadeh, Ali Modarressi, et al. ENLSP @ NeurIPS 2022. [Paper]
- Scaling Laws for Neural Language Models Jared Kaplan, Sam McCandlish, Tom Henighan, et al. arXiv 2020. [Paper]
- Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory James L. McClelland, Bruce L. McNaughton, Randall C. O’Reilly. Psychological Review 1995. [Paper]
- Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem M. McCloskey, N. J. Cohen. Psychology of Learning and Motivation 1989. [Paper]
- Cohere rerank: Cohere's rerank model reorders initial retrieval results to improve relevance to the query, making it a key component for building high-quality RAG systems. [Source]
- ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval Abdelrahman Abdallah, Jamshid Mozafari, Bhawna Piryani, et al. NAACL 2025. [Paper]
- MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation Chia-Yuan Chang, Zhimeng Jiang, Vineeth Rakesh, et al. arXiv 2025. [Paper]
- ARAGOG: Advanced RAG Output Grading Matouš Eibich, Shivay Nagpal, Alexander Fred-Ojala. arXiv 2024. [Paper]
- Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples! Yubo Ma, Yixin Cao, YongChing Hong, et al. Findings of EMNLP 2023. [Paper]
- Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model Jiaxi Cui, Munan Ning, Zongjian Li, et al. arXiv 2023. [Paper]
- RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models Ronak Pradeep, Sahel Sharifymoghaddam, Jimmy Lin. arXiv 2023. [Paper]
- Context Embeddings for Efficient Answer Generation in RAG David Rau, Shuai Wang, Hervé Déjean, et al. WSDM 2025. [Paper]
- xRAG: Extreme Context Compression for Retrieval-augmented Generation with One Token Xin Cheng, Xun Wang, Xingxing Zhang, et al. NeurIPS 2024. [Paper]
- RECOMP: Improving Retrieval-Augmented LMs with Context Compression and Selective Augmentation Fangyuan Xu, Weijia Shi, Eunsol Choi. ICLR 2024. [Paper]
- Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation Kaize Shi, Xueyao Sun, Qing Li, et al. arXiv 2024. [Paper]
- Familiarity-Aware Evidence Compression for Retrieval-Augmented Generation Dongwon Jung, Qin Liu, Tenghao Huang, et al. arXiv 2024. [Paper]
- LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression Huiqiang Jiang, Qianhui Wu, Xufang Luo, et al. ACL 2024. [Paper]
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, et al. Findings of ACL 2024. [Paper]
- Learning to Compress Prompts with Gist Tokens Jesse Mu, Xiang Lisa Li, Noah Goodman. NeurIPS 2023. [Paper]
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, et al. EMNLP 2023. [Paper]
- Adapting Language Models to Compress Contexts Alexis Chevalier, Alexander Wettig, Anirudh Ajith, et al. EMNLP 2023. [Paper]
- Fewer Truncations Improve Language Modeling Hantian Ding, Zijian Wang, Giovanni Paolini, et al. ICML 2024. [Paper]
- Bucket Pre-training is All You Need Hongtao Liu, Qiyao Peng, Qing Yang, et al. arXiv 2024. [Paper]
- Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum Hadi Pouransari, Chun-Liang Li, Jen-Hao Rick Chang, et al. NeurIPS 2024. [Paper]
- Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance Mario Michael Krell, Matej Kosec, Sergio P. Perez, et al. arXiv 2021. [Paper]
- Structured Packing in LLM Training Improves Long Context Utilization Konrad Staniszewski, Szymon Tworkowski, Sebastian Jaszczur, et al. AAAI 2025. [Paper]
- In-context Pretraining: Language Modeling Beyond Document Boundaries Weijia Shi, Sewon Min, Maria Lomeli, et al. ICLR 2024. [Paper]
- A comprehensive survey on data provenance: State-of-the-art approaches and their deployments for IoT security enforcement Md Morshed Alam, Weichao Wang. Journal of Computer Security, Volume 29, Issue 4. 2021. [Paper]
- Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature Tong Zhou, Xuandong Zhao, Xiaolin Xu, et al. NeurIPS 2024. [Paper]
- An Unforgeable Publicly Verifiable Watermark for Large Language Models Aiwei Liu, Leyi Pan, Xuming Hu, et al. ICLR 2024. [Paper]
- Undetectable Watermarks for Language Models Miranda Christ, et al. Proceedings of the 37th Annual Conference on Learning Theory (COLT 2024). [Paper]
- A Watermark for Large Language Models John Kirchenbauer, Jonas Geiping, Yuxin Wen, et al. ICML 2023. [Paper]
- Publicly-Detectable Watermarking for Language Models Jaiden Fairoze, Sanjam Garg, Somesh Jha, et al. arXiv 2023. [Paper]
- Exploring the Feasibility of Automated Data Standardization using Large Language Models for Seamless Positioning Lee, Max JL, et al. 2024 14th International Conference on Indoor Positioning and Indoor Navigation (IPIN). IEEE, 2024. [Paper]
- CleanAgent: Automating Data Standardization with LLM-based Agents Danrui Qi, Jiannan Wang. arXiv 2024. [Paper]
- AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark Lan Li, Liri Fang, Vetle I. Torvik. arXiv 2024. [Paper]
- Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes Simran Arora, Brandon Yang, Sabri Eyuboglu, et al. Proceedings of the VLDB Endowment, Volume 17, Issue 2, Pages 92-105 (2023). [Paper]
- LLMs with User-defined Prompts as Generic Data Operators for Reliable Data Processing Luyi Ma, et al. 1st IEEE International Workshop on Data Engineering and Modeling for AI (DEMAI), IEEE BigData 2023. [Paper]
- Large language models as data preprocessors Zhang, Haochen, et al. arXiv 2023. [Paper]
- ZeroED: Hybrid Zero-Shot Error Detection Through Large Language Model Reasoning Wei Ni, Kaihang Zhang, Xiaoye Miao, et al. ICDE 2025. [Paper]
- Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets Tommaso Bendinelli, Artur Dox, Christian Holz. ICLR 2025 Workshop on Foundation Models in the Wild. [Paper]
- Data Cleaning Using Large Language Models Shuo Zhang, Zezhou Huang, Eugene Wu. ICDE Workshops 2025. [Paper]
- GIDCL: A Graph-Enhanced Interpretable Data Cleaning Framework with Large Language Models Mengyi Yan, et al. Proceedings of the ACM on Management of Data, Volume 2, Issue 6, 2024. [Paper]
- Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation Juhwan Choi, Jungmin Yun, Kyohoon Jin, et al. EMNLP 2024. [Paper]
- Anomaly Detection of Tabular Data Using LLMs Aodong Li, Yunhan Zhao, Chen Qiu, et al. Anomaly Detection with Foundation Models Workshop (ICML 2024). [Paper]
- Cleaning Semi-Structured Errors in Open Data Using Large Language Models M. Mondal, J. Audiffren, L. Dolamic, et al. 2024 11th IEEE Swiss Conference on Data Science (SDS), 2024. [Paper]
- IterClean: An Iterative Data Cleaning Framework with Large Language Models Wei Ni, et al. Proceedings of the ACM Turing Award Celebration Conference - China 2024. [Paper]
- LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs Fabian Biester, Mohamed Abdelaal, Daniel Del Gaudio. arXiv 2024. [Paper]
- On LLM-Enhanced Mixed-Type Data Imputation with High-Order Message Passing Jianwei Wang, Kai Wang, Ying Zhang, et al. Proc. VLDB Endow., Vol. 18, No. 10, pp. 3421-3434 (2025). [Paper]
- A Context-Aware Approach for Enhancing Data Imputation with Pre-trained Language Models Ahatsham Hayat, Mohammad R. Hasan. COLING 2025, pp. 5668-5685 (2025). [Paper]
- Does Prompt Design Impact Quality of Data Imputation by LLMs? Srinivasan, Shreenidhi, and Lydia Manikonda. arXiv 2025. [Paper]
- RetClean: Retrieval-Based Data Cleaning Using LLMs and Data Lakes Zan Ahmad Naeem, et al. VLDB Endowment 2024. [Paper]
- Data Augmentation using LLMs: Data Perspectives, Learning Paradigms and Challenges Bosheng Ding, Chengwei Qin, Ruochen Zhao, et al. ACL Findings 2024, pp. 1679-1705 (2024). [Paper]
- A Deep Dive Into Cross-Dataset Entity Matching with Large and Small Language Models Zhang, Zeyu, et al. International Conference on Extending Database Technology (EDBT) 2025. [Paper]
- Large Language Models for Data Discovery and Integration: Challenges and Opportunities Freire, Juliana, et al. IEEE Data Eng. Bull. 49(1): 3-31 (2025). [Paper]
- Entity matching using large language models Ralph Peeters, Christian Bizer. EDBT 2025. [Paper]
- Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching Tianshu Wang, Hongyu Lin, Xiaoyang Chen, et al. COLING 2025. [Paper]
- Cost-Effective In-Context Learning for Entity Resolution: A Design Space Exploration Meihao Fan, Xiaoyue Han, Ju Fan, et al. ICDE 2024. [Paper]
- Jellyfish: A Large Language Model for Data Preprocessing Haochen Zhang, Yuyang Dong, Chuan Xiao, et al. EMNLP 2024. [Paper]
- KcMF: A Knowledge-compliant Framework for Schema and Entity Matching with Fine-tuning-free LLMs Yongqin Xu, Huan Li, Ke Chen, Lidan Shou. arXiv 2024. [Paper]
- Fine-tuning Large Language Models for Entity Matching Steiner, Aaron, Ralph Peeters, et al. arXiv 2024. [Paper]
- Towards Scalable Schema Mapping using Large Language Models Christopher Buss, Mahdis Safari, Arash Termehchy, et al. MIDAS ’25 Workshop, pp. 12-15 (2025). [Paper]
- Interactive Data Harmonization with LLM Agents: Opportunities and Challenges Aécio Santos, Eduardo H. M. Pena, Roque Lopez, Juliana Freire. NOVAS '25, Berlin, Germany (2025). [Paper]
- SCHEMORA: Schema Matching via Multi-stage Recommendation and Metadata Enrichment using Off-the-Shelf LLMs Osman Erman Gungor, Derak Paulsen, William Kang. arXiv 2025. [Paper]
- Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching Chuangtao Ma, Sriom Chakrabarti, Arijit Khan, et al. arXiv 2025. [Paper]
- Schema Matching with Large Language Models: an Experimental Study Marcel Parciak, Brecht Vandevoort, Frank Neven, et al. TaDA 2024 Workshop, collocated with VLDB 2024. [Paper]
- Magneto: Combining Small and Large Language Models for Schema Matching Yurong Liu, Eduardo Pena, Aecio Santos, et al. VLDB Endowment 2024. [Paper]
- Agent-OM: Leveraging LLM Agents for Ontology Matching Zhangcheng Qiang, et al. Proceedings of the VLDB Endowment, Volume 18, Issue 3, 2024. [Paper]
- Matchmaker: Self-Improving Large Language Model Programs for Schema Matching Nabeel Seedat, Mihaela van der Schaar. NeurIPS 2024 (GenAI for Health & Table Representation Learning Workshops). [Paper]
- TableGPT2: A Large Multimodal Model with Tabular Data Integration Aofeng Su, Aowen Wang, Chao Ye, et al. arXiv 2024. [Paper]
- ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models Benjamin Feuer, Yurong Liu, Chinmay Hegde, et al. VLDB 2024. [Paper]
- Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System Muhammad Imam Luthfi Balaka, David Alexander, Qiming Wang, et al. SIGMOD 2025. [Paper]
- Flexible Metadata Harvesting for Ecology Using Large Language Models Zehao Lu, Thijs L van der Plas, Parinaz Rashidi, et al. EcoDL 2025 Workshop. [Paper]
- AutoDDG: Automated Dataset Description Generation using Large Language Models Haoxiang Zhang, Yurong Liu, Wei-Lun (Allen) Hung, et al. arXiv 2025. [Paper]
- LEDD: Large Language Model-Empowered Data Discovery in Data Lakes Qi An, Chihua Ying, Yuqing Zhu, et al. arXiv 2025. [Paper]
- LLM-Aided Customizable Profiling of Code Data Based On Programming Language Concepts Thorat, Pankaj, et al. arXiv 2025. [Paper]
- Cocoon: Semantic Table Profiling Using Large Language Models Huang, Zezhou, et al. Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics. 2024. [Paper]
- Birdie: Natural Language-Driven Table Discovery Using Differentiable Search Index Yuxiang Guo, Zhonghao Hu, Yuren Mao, et al. VLDB 2025. [Paper]
- The Promises and Pitfalls of LLM Annotations in Dataset Labeling: a Case Study on Media Bias Detection Tomáš Horych, Christoph Mandl, Terry Ruas, et al. NAACL 2025 Findings, pp. 1370-1386 (2025). [Paper]
- Mind the Data Gap: Bridging LLMs to Enterprise Data Integration Moe Kayali, Fabian Wenz, Nesime Tatbul, et al. CIDR 2025. [Paper]
- Open-Source LLMs for Text Annotation: A Practical Guide for Model Setting and Fine-Tuning Alizadeh, Meysam, et al. Journal of Computational Social Science 8.1 (2025): 1-25. [Paper]
- Evaluating Knowledge Generation and Self-Refinement Strategies for LLM-based Column Type Annotation Keti Korini, Christian Bizer. arXiv 2025. [Paper]
- LLMs as Data Annotators: How Close Are We to Human Performance Haq, Muhammad Uzair Ul, Davide Rigoni, et al. arXiv 2025. [Paper]
- Columbo: Expanding Abbreviated Column Names for Tabular Data Using Large Language Models Ting Cai, Stephen Sheen, AnHai Doan. arXiv 2025. [Paper]
- An LLM Agent-Based Complex Semantic Table Annotation Approach Yilin Geng, Shujing Wang, Chuan Wang, et al. arXiv 2025. [Paper]
- Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation Xia, Mingxuan, et al. arXiv 2025. [Paper]
- Evaluating how LLM annotations represent diverse views on contentious topics Brown, Megan A., et al. arXiv 2025. [Paper]
- CHORUS: Foundation Models for Unified Data Discovery and Exploration Moe Kayali, et al. Proceedings of the VLDB Endowment, Volume 17, Issue 8, 2024. [Paper]
- RACOON: An LLM-based Framework for Retrieval-Augmented Column Type Annotation with a Knowledge Graph Lindsey Linxi Wei, Guorui Xiao, Magdalena Balazinska. arXiv 2024. [Paper]
- AutoLabel: Automated Textual Data Annotation Method Based on Active Learning and Large Language Model Ming, Xuran, et al. International Conference on Knowledge Science, Engineering and Management. 2024. [Paper]
- Large Language Models as Annotators: Enhancing Generalization of NLP Models at Minimal Cost Bansal, Parikshit, and Amit Sharma. arXiv 2023. [Paper]
- RubikSQL: Lifelong Learning Agentic Knowledge Base as an Industrial NL2SQL System Zui Chen, Han Li, Xinhao Zhang, et al. To be submitted to VLDB 2026 (PVLDB Vol. 19), 2025. [Paper]
- Cracking SQL Barriers: An LLM-based Dialect Translation System Wei Zhou, Yuyang Gao, Xuanhe Zhou, Guoliang Li. SIGMOD 2025. [Paper]
- OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment Xiangjin Xie, Guangwei Xu, Lingyan Zhao, Ruijie Guo. Proc. ACM Manag. Data, Vol. 3, No. 3, Article 194, pp. 1-24 (2025). [Paper]
- CrackSQL: A Hybrid SQL Dialect Translation System Powered by Large Language Models Wei Zhou, Yuyang Gao, Xuanhe Zhou, Guoliang Li. arXiv 2025 (extended from SIGMOD 2025 demo). [Paper]
- Data Interpreter: An LLM Agent for Data Science Sirui Hong, Yizhang Lin, Bang Liu, et al. ACL 2025 Findings, pp. 19796-19821 (2025). [Paper]
- Reasoning-SQL: Reinforcement Learning with SQL-Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, et al. COLM 2025. [Paper]
- An advanced AI driven database system M. Tedeschi, S. Rizwan, C. Shringi, et al. EDULEARN25 Conference Proceedings. 2025. [Paper]
- Text to Query Plans for Question Answering on Large Tables Yipeng Zhang, Chen Wang, Yuzhe Zhang, et al. arXiv 2025. [Paper]
- Automatic Metadata Extraction for Text-to-SQL Vladislav Shkapenyuk, Divesh Srivastava, Theodore Johnson, et al. arXiv 2025. [Paper]
- CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning Lei Sheng, Shuai-Shuai Xu. arXiv 2025. [Paper]
- OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale Haoyang Li, Shang Wu, Xiaokang Zhang, et al. arXiv 2025. [Paper]
- Cheaper, Better, Faster, Stronger: Robust Text-to-SQL without Chain-of-Thought or Fine-Tuning Yusuf Denizay Dönder, Derek Hommel, Andrea W Wen-Yi, et al. arXiv 2025. [Paper]
- A Preview of XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL Yingqi Gao, Yifu Liu, Xiaoxia Li, et al. arXiv 2025. [Paper]
- FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis Chao Zhang, Yuren Mao, Yijiang Fan, et al. SIGMOD 2024. [Paper]
- CodeS: Towards Building Open-source Language Models for Text-to-SQL Haoyang Li, et al. Proceedings of the ACM on Management of Data, Volume 2, Issue 3, 2024. [Paper]
- The Dawn of Natural Language to SQL: Are We Fully Ready? Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, Nan Tang. VLDB 2024. [Paper]
- Contextualized Data-Wrangling Code Generation in Computational Notebooks Junjie Huang, Daya Guo, Chenglong Wang, et al. ASE 2024. [Paper]
- PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency Zhishuai Li, Xiang Wang, Jingjing Zhao, et al. arXiv 2024. [Paper]
- CHESS: Contextual Harnessing for Efficient SQL Synthesis Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, et al. arXiv 2024. [Paper]
- DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction Mohammadreza Pourreza, Davood Rafiei. NeurIPS 2023. [Paper]
- Natural Language to Code Generation in Interactive Data Science Notebooks Pengcheng Yin, Wen-Ding Li, Kefan Xiao, et al. ACL 2023. [Paper]
- PaLM: Scaling Language Modeling with Pathways Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. JMLR 2023. [Paper]
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [Paper]
- TableGPT2: A Large Multimodal Model with Tabular Data Integration [Paper]
- TableMaster: A Recipe to Advance Table Understanding with Language Models Lang Cao. arXiv 2025. [Paper]
- RoT: Enhancing Table Reasoning with Iterative Row-Wise Traversals Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che. arXiv 2025. [Paper]
- PPT: A Process-based Preference Learning Framework for Self Improving Table Question Answering Models Wei Zhou, Mohsen Mesgar, Heike Adel, et al. arXiv 2025. [Paper]
- CABINET: Content Relevance based Noise Reduction for Table Question Answering Sohan Patnaik, Heril Changwal, Milan Aggarwal, et al. ICLR 2024. [Paper]
- ReAcTable: Enhancing ReAct for Table Question Answering Yunjia Zhang, et al. Proceedings of the VLDB Endowment, Volume 17, Issue 8, 2024. [Paper]
- Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks Peng Li, et al. Proceedings of the ACM on Management of Data, Volume 2, Issue 3, 2024. [Paper]
- TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy Weichao Zhao, Hao Feng, Qi Liu, et al. NeurIPS 2024. [Paper]
- Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding Zilong Wang, Hao Zhang, Chun-Liang Li, et al. ICLR 2024. [Paper]
- TaPERA: Enhancing Faithfulness and Interpretability in Long-Form Table QA by Content Planning and Execution-based Reasoning Yilun Zhao, Lyuhao Chen, Arman Cohan, Chen Zhao. ACL 2024. [Paper]
- Multimodal Table Understanding Mingyu Zheng, Xinwei Feng, Qingyi Si, et al. ACL 2024. [Paper]
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Financial Tabular and Textual Data Fengbin Zhu, Ziyang Liu, Fuli Feng, et al. ICAIF 2024. [Paper]
- S3HQA: A Three-Stage Approach for Multi-hop Text-Table Hybrid Question Answering Fangyu Lei, Xiang Li, Yifan Wei, et al. ACL 2023. [Paper]
- Blazegraph: A high-performance graph database that supports RDF/SPARQL queries, commonly used in semantic web and knowledge graph analysis. [Source]
- GraphDB: A triplestore with ontology reasoning and SPARQL query support, widely used for enterprise knowledge management and semantic search. [Source]
- Neo4j: Neo4j is one of the most popular graph databases, based on the property graph model, supporting complex relationship queries and visual analytics. [Github]
- A Comparison of Current Graph Database Models Renzo Angles. ICDEW 2012. [Paper]
- R3-NL2GQL: A Model Coordination and Knowledge Graph Alignment Approach for NL2GQL Yuhang Zhou, Yu He, Siyu Tian, et al. Findings of EMNLP 2024. [Paper]
- NAT-NL2GQL: A Novel Multi-Agent Framework for Translating Natural Language to Graph Query Language Yuanyuan Liang, Tingyu Xie, Gan Peng, et al. arXiv 2024. [Paper]
- Graph Learning in the Era of LLMs: A Survey from the Perspective of Data, Models, and Tasks Xunkai Li, Zhengyu Wu, Jiayi Wu, et al. arXiv 2024. [Paper]
- Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey Qizhi Pei, Lijun Wu, Kaiyuan Gao, et al. arXiv 2024. [Paper]
- FlexKBQA: A Flexible LLM-Powered Framework for Few-Shot Knowledge Base Question Answering Zhenyu Li, Sunqi Fan, Yu Gu, et al. AAAI 2024. [Paper]
- GraphGPT: Graph Instruction Tuning for Large Language Models Jiabin Tang, Yuhao Yang, Wei Wei, et al. SIGIR 2024. [Paper]
- Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models Guanming Xiong, Junwei Bao, Wen Zhao. ACL 2024. [Paper]
- InstructGraph: Boosting Large Language Models via Graph-centric Instruction Tuning and Preference Alignment Jianing Wang, Junda Wu, Yupeng Hou, et al. Findings of ACL 2024. [Paper]
- Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments Sitao Cheng, Ziyuan Zhuang, Yong Xu, et al. Findings of ACL 2024. [Paper]
- Language is All a Graph Needs Ruosong Ye, Caiqi Zhang, Runhui Wang, et al. EACL 2024. [Paper]
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model Rafael Rafailov, Archit Sharma, Eric Mitchell, et al. NeurIPS 2023. [Paper]
- UniKGQA: Unified Retrieval and Reasoning for Solving Multi-hop Question Answering Over Knowledge Graph Jinhao Jiang, Kun Zhou, Wayne Xin Zhao, et al. ICLR 2023. [Paper]
- StructGPT: A General Framework for Large Language Model to Reason over Structured Data Jinhao Jiang, Kun Zhou, Zican Dong, et al. EMNLP 2023. [Paper]
- Subgraph Retrieval Enhanced Model for Multi-hop Knowledge Base Question Answering Jing Zhang, Xiaokang Zhang, Jifan Yu, et al. ACL 2022. [Paper]
- RoBERTa: A Robustly Optimized BERT Pretraining Approach Yinhan Liu, Myle Ott, Naman Goyal, et al. arXiv 2019. [Paper]
- Inductive Representation Learning on Large Graphs William L. Hamilton, Rex Ying, Jure Leskovec. NeurIPS 2017. [Paper]
- Semi-Supervised Classification with Graph Convolutional Networks Thomas N. Kipf, Max Welling. ICLR 2017. [Paper]
- ST-Raptor: LLM-Powered Semi-Structured Table Question Answering Zirui Tang, Boyu Niu, Xuanhe Zhou, et al. SIGMOD 2026. (2025). [Paper]
- Querying Semi-Structured Data Serge Abiteboul. ICDT 1997. [Paper]
- MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for Table Reasoning Zheng Li, Yang Du, Mao Zheng, et al. COLING 2025. [Paper]
- AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries Jiayi Wang, Guoliang Li. CIDR 2025. [Paper]
- SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation Zeyao Ma, Bohan Zhang, Jing Zhang, et al. NeurIPS 2024. [Paper]
- TempTabQA: Temporal Question Answering for Semi-Structured Tables Vivek Gupta, Pranshu Kandoi, Mahek Bhavesh Vora, et al. EMNLP 2023. [Paper]
- AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries [Paper]
- Focus Anywhere for Fine-grained Multi-page Document Understanding [Paper]
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [Paper]
- General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model [Paper]
- Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing Chunwei Liu, Matthew Russo, Michael Cafarella, et al. CIDR 2025. [Paper]
- DocFormerv2: Local Features for Document Understanding Srikar Appalaraju, Peng Tang, Qi Dong, et al. AAAI 2024. [Paper]
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding Anwen Hu, Haiyang Xu, Jiabo Ye, et al. Findings of EMNLP 2024. [Paper]
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding Hao Feng, Qi Liu, Hao Liu, et al. SCIS 2024. [Paper]
- Towards Accurate and Efficient Document Analytics with Large Language Models Y. Lin, M. Hulsebos, R. Ma, et al. arXiv 2024. [Paper]
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding Kenton Lee, Mandar Joshi, Iulia Turc, et al. ICML 2023. [Paper]
- Unifying Vision, Text, and Layout for Universal Document Processing Zineng Tang, Ziyi Yang, Guoxin Wang, et al. CVPR 2023. [Paper]
- DUBLIN: Visual Document Understanding By Language-Image Network Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, et al. EMNLP Industry Track 2023. [Paper]
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. ICLR 2021. [Paper]
- The JPEG Still Picture Compression Standard Gregory K. Wallace. Communications of the ACM 1991. [Paper]
- Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks Zhongxin Liu, Zhijie Tang, Junwei Zhang, et al. ICSE 2024. [Paper]
- Large Language Model for Vulnerability Detection: Emerging Results and Future Directions Xin Zhou, Ting Zhang, David Lo. ICSE-NIER 2024. [Paper]
- Vulnerability Detection by Learning From Syntax-Based Execution Paths of Code Junwei Zhang, Zhongxin Liu, Xing Hu, et al. IEEE TSE 2023. [Paper]
- Software Vulnerability Detection with GPT and In-Context Learning Zhihong Liu, Qing Liao, Wenchao Gu, et al. DSC 2023. [Paper]
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages Zhangyin Feng, Daya Guo, Duyu Tang, et al. Findings of EMNLP 2020. [Paper]
- The Probabilistic Relevance Framework: BM25 and Beyond Stephen Robertson, et al. Foundations and Trends in Information Retrieval, Volume 3, Issue 4, 2009. [Paper]
- Repoformer: Selective Retrieval for Repository-Level Code Completion Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, et al. ICML 2024. [Paper]
- Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning Mingyang Geng, Shangwen Wang, Dezun Dong, et al. ICSE 2024. [Paper]
- Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization) Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, Earl Barr. ICSE 2024. [Paper]
- CoCoMIC: Code Completion by Jointly Modeling In-file and Cross-file Context Yangruibo Ding, Zijian Wang, Wasi Ahmad, et al. LREC-COLING 2024. [Paper]
- SCLA: Automated Smart Contract Summarization via LLMs and Semantic Augmentation Yingjie Mao, Xiaoqi Li, Wenkai Li, et al. arXiv 2024. [Paper]
- Code Structure–Guided Transformer for Source Code Summarization Shuzheng Gao, et al. ACM Transactions on Software Engineering and Methodology 2023. [Paper]
- RepoFusion: Training Code Models to Understand Your Repository Disha Shrivastava, Denis Kocetkov, Harm de Vries, et al. arXiv 2023. [Paper]
- ELMo-Tune-V2: LLM-Assisted Full-Cycle Auto-Tuning to Optimize LSM-Based Key-Value Stores Viraj Thakkar, Qi Lin, Kenanya Keandra Adriel Prasetyo, et al. arXiv 2025. [Paper]
- MLETune: Streamlining Database Knob Tuning via Multi-LLMs Experts Guided Deep Reinforcement Learning Wenlong Dong, Wei Liu, Rui Xi, et al. ICPADS 2024. [Paper]
- λ-Tune: Harnessing Large Language Models for Automated Database System Tuning Victor Giannakouris, Immanuel Trummer. SIGMOD 2025. [Paper]
- LLMIdxAdvis: Resource-Efficient Index Advisor Utilizing Large Language Model Xinxin Zhao, Haoyang Li, Jing Zhang, et al. arXiv 2025. [Paper]
- LATuner: An LLM-Enhanced Database Tuning System Based on Adaptive Surrogate Model Chongjiong Fan, Zhicheng Pan, Wenwen Sun, et al. ECML PKDD 2024. [Paper]
- Is Large Language Model Good at Database Knob Tuning? A Comprehensive Experimental Evaluation Yiyan Li, Haoyang Li, Zhao Pu, et al. arXiv 2024. [Paper]
- Automatic Database Configuration Debugging using Retrieval-Augmented Language Models Sibei Chen, Ju Fan, Bin Wu, et al. Proceedings of the ACM on Management of Data, Volume 3, Issue 1, 2025. [Paper]
- GPTuner: A Manual-Reading Database Tuning System via GPT-Guided Bayesian Optimization Jiale Lao, Yibo Wang, Yufei Li, et al. VLDB 2024. [Paper]
- E2ETune: End-to-End Knob Tuning via Fine-tuned Generative Language Model Xinmei Huang, Haoyang Li, Jing Zhang, et al. VLDB 2025. [Paper]
- DB-GPT: Large Language Model Meets Database Xuanhe Zhou, Zhaoyan Sun, Guoliang Li. Data Science and Engineering 2024. [Paper]
- HEBO: Heteroscedastic Evolutionary Bayesian Optimisation Alexander I. Cowen-Rivers, Wenlong Lyu, Zhi Wang, et al. NeurIPS 2020. [Paper]
- DB-GPT: Large Language Model Meets Database [Paper]
- R-Bot: An LLM-Based Query Rewrite System Zhaoyan Sun, Xuanhe Zhou, Guoliang Li, et al. Proc. VLDB Endow., Vol. 18, No. 12, pp. 5031-5044 (2025). [Paper]
- E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence, and Efficiency Dongjie Xu, Yue Cui, Weijie Shi, et al. arXiv 2025. [Paper]
- LLM4Hint: Leveraging Large Language Models for Hint Recommendation in Offline Query Optimization Suchen Liu, Jun Gao, Yinjun Han, et al. arXiv 2025. [Paper]
- QUITE: A Query Rewrite System Beyond Rules with LLM Agents Yuyang Song, Hanxu Yan, Jiale Lao, et al. arXiv 2025. [Paper]
- Can Large Language Models Be Query Optimizer for Relational Databases? Jie Tan, Kangfei Zhao, Rui Li, et al. arXiv 2025. [Paper]
- A Query Optimization Method Utilizing Large Language Models Zhiming Yao, Haoyang Li, Jing Zhang, et al. arXiv 2025. [Paper]
- Query Rewriting via LLMs Sriram Dharwada, Himanshu Devrani, Jayant Haritsa, et al. arXiv 2025. [Paper]
- LLM-R2: A Large Language Model Enhanced Rule-Based Rewrite System for Boosting Query Efficiency Zhaodonghui Li, Haitao Yuan, Huiming Wang, et al. VLDB 2024. [Paper]
- The Unreasonable Effectiveness of LLMs for Query Optimization Peter Akioyamen, Zixuan Yi, Ryan Marcus. ML for Systems Workshop at NeurIPS 2024. [Paper]
- Query Rewriting via Large Language Models Jie Liu, Barzan Mozafari. arXiv 2024. [Paper]
- DBG-PT: A Large Language Model Assisted Query Performance Regression Debugger Victor Giannakouris, Immanuel Trummer. Proceedings of the VLDB Endowment, Volume 17, Issue 12, 2024. [Paper]
- Query Performance Explanation through Large Language Model for HTAP Systems Haibo Xiu, Li Zhang, Tieying Zhang, et al. ICDE 2025. [Paper]
- DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs Wei Zhou, Peng Sun, Xuanhe Zhou, et al. arXiv 2025. [Paper]
- D-Bot: Database Diagnosis System using Large Language Models Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, et al. Proceedings of the VLDB Endowment, Volume 17, Issue 10. 2024. [Paper]
- LLM As DBA Xuanhe Zhou, Guoliang Li, Zhiyuan Liu. arXiv 2023. [Paper]
- D-Bot: Database Diagnosis System using Large Language Models [Paper]
- LLM As DBA [Paper]
- GaussMaster: An LLM-based Database Copilot System Wei Zhou, Ji Sun, Xuanhe Zhou, et al. arXiv 2025. [Paper]
- Panda: Performance Debugging for Databases using LLM Agents Vikramank Singh, Kapil Eknath Vaidya, Vinayshekhar Bannihatti Kumar, et al. CIDR 2024. [Paper]
- D-Bot: Database Diagnosis System using Large Language Models [Paper]
- LLM for Data Management Guoliang Li, Xuanhe Zhou, Xinyang Zhao. PVLDB 17(12). 2024. [Paper]
- LLM-Enhanced Data Management Xuanhe Zhou, Xinyang Zhao, Guoliang Li. arXiv 2024. [Paper]
- Probabilistic classification and clustering in relational data Ben Taskar, Eran Segal, Daphne Koller. IJCAI'01. 2001. [Paper]
- Multilinear tensor regression for longitudinal relational data Peter D. Hoff. Ann. Appl. Stat. 9 (3) 1169 - 1193, September 2015. [Paper]
- Outlier detection in relational data: A case study in geographical information systems Joris Maervoet, Celine Vens, Greet Vanden Berghe, et al. Expert Syst. Appl. 39, 5 (April, 2012). [Paper]
- A relational model of data for large shared data banks E. F. Codd. Communications of the ACM, Volume 13, Issue 6. 1970. [Paper]
- RubikSQL: Lifelong Learning Agentic Knowledge Base as an Industrial NL2SQL System [Paper]
- A Survey of Text-to-SQL in the Era of LLMs: Where Are We, and Where Are We Going? Xinyu Liu, Shuyu Shen, Boyan Li, et al. IEEE Transactions on Knowledge and Data Engineering, 2025. [Paper]
- Natural Language to SQL: State of the Art and Open Problems Yuyu Luo, Guoliang Li, Ju Fan, Chengliang Chai, Nan Tang. Proc. VLDB Endow., Vol. 18, No. 12, pp. 5466-5471 (2025). [Paper]
- A Survey on Employing Large Language Models for Text-to-SQL Tasks Liang Shi, Zhengju Tang, Nan Zhang, et al. ACM Comput. Surv., Vol. 58, No. 2, Article 54, pp. 1-37 (2025). [Paper]
- Bridging the Semantic Gap Between Text and Table: A Case Study on NL2SQL Lin Long, Xijun Gu, Xinjie Sun, et al. ICLR 2025. [Paper]
- Cracking SQL Barriers: An LLM-based Dialect Translation System Wei Zhou, Yuyang Gao, Xuanhe Zhou, Guoliang Li. SIGMOD 2025. [Paper]
- OpenSearch-SQL: Enhancing Text-to-SQL with Dynamic Few-shot and Consistency Alignment Xiangjin Xie, Guangwei Xu, Lingyan Zhao, Ruijie Guo. Proc. ACM Manag. Data, Vol. 3, No. 3, Article 194, pp. 1-24 (2025). [Paper]
- Data Interpreter: An LLM Agent for Data Science Sirui Hong, Yizhang Lin, Bang Liu, et al. ACL 2025 Findings, pp. 19796-19821 (2025). [Paper]
- Reasoning-SQL: Reinforcement Learning with SQL-Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, et al. COLM 2025. [Paper]
- An advanced AI driven database system M. Tedeschi, S. Rizwan, C. Shringi, et al. EDULEARN25 Conference Proceedings. 2025. [Paper]
- Text to Query Plans for Question Answering on Large Tables Yipeng Zhang, Chen Wang, Yuzhe Zhang, et al. arXiv 2025. [Paper]
- Automatic Metadata Extraction for Text-to-SQL Vladislav Shkapenyuk, Divesh Srivastava, Theodore Johnson, et al. arXiv 2025. [Paper]
- CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning Lei Sheng, Shuai-Shuai Xu. arXiv 2025. [Paper]
- OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale Haoyang Li, Shang Wu, Xiaokang Zhang, et al. arXiv 2025. [Paper]
- Cheaper, Better, Faster, Stronger: Robust Text-to-SQL without Chain-of-Thought or Fine-Tuning Yusuf Denizay Dönder, Derek Hommel, Andrea W Wen-Yi, et al. arXiv 2025. [Paper]
- A Preview of XiYan-SQL: A Multi-Generator Ensemble Framework for Text-to-SQL Yingqi Gao, Yifu Liu, Xiaoxia Li, et al. arXiv 2025. [Paper]
- FinSQL: Model-Agnostic LLMs-based Text-to-SQL Framework for Financial Analysis Chao Zhang, Yuren Mao, Yijiang Fan, et al. SIGMOD 2024. [Paper]
- CodeS: Towards Building Open-source Language Models for Text-to-SQL Haoyang Li, et al. Proceedings of the ACM on Management of Data, Volume 2, Issue 3, 2024. [Paper]
- The Dawn of Natural Language to SQL: Are We Fully Ready? Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, Nan Tang. VLDB 2024. [Paper]
- Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL Ju Fan, Zihui Gu, Songyue Zhang, et al. Proceedings of the VLDB Endowment, Volume 17, Issue 11 (2024). [Paper]
- Contextualized Data-Wrangling Code Generation in Computational Notebooks Junjie Huang, Daya Guo, Chenglong Wang, et al. ASE 2024. [Paper]
- PET-SQL: A Prompt-Enhanced Two-Round Refinement of Text-to-SQL with Cross-consistency Zhishuai Li, Xiang Wang, Jingjing Zhao, et al. arXiv 2024. [Paper]
- CHESS: Contextual Harnessing for Efficient SQL Synthesis Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, et al. arXiv 2024. [Paper]
- DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction Mohammadreza Pourreza, Davood Rafiei. NeurIPS 2023. [Paper]
- Natural Language to Code Generation in Interactive Data Science Notebooks Pengcheng Yin, Wen-Ding Li, Kefan Xiao, et al. ACL 2023. [Paper]
- PaLM: Scaling Language Modeling with Pathways Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. JMLR 2023. [Paper]
- Data Interpreter: An LLM Agent for Data Science [Paper]
- Collaboration between Intelligent Agents and Large Language Models: A Novel Approach for Enhancing Code Generation Capability Xingyuan Bai, Shaobin Huang, Chi Wei, et al. Expert Systems with Applications, 2025. [Paper]
- Contextualized Data-Wrangling Code Generation in Computational Notebooks Junjie Huang, Daya Guo, Chenglong Wang, et al. ASE '24 (2024). [Paper]
- Natural Language to Code Generation in Interactive Data Science Notebooks Pengcheng Yin, Wen-Ding Li, Kefan Xiao, et al. ACL 2023. [Paper]
- PaLM: Scaling Language Modeling with Pathways Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. Journal of Machine Learning Research 24 (2023). [Paper]
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Mike Lewis, Yinhan Liu, Naman Goyal, et al. ACL 2020 (2020). [Paper]
Multi-Step QA
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Financial Tabular and Textual Data [Paper]
- TaPERA: Enhancing Faithfulness and Interpretability in Long-Form Table QA by Content Planning and Execution-based Reasoning [Paper]
- ReAcTable: Enhancing ReAct for Table Question Answering [Paper]
- S3HQA: A Three-Stage Approach for Multi-hop Text-Table Hybrid Question Answering [Paper]
- Plugging Schema Graph into Multi-Table QA: A Human-Guided Framework for Reducing LLM Reliance Xixi Wang, Miguel Costa, Jordanka Kovaceva, et al. Findings of EMNLP 2025 (2025). [Paper]
- Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding Zilong Wang, Hao Zhang, Chun-Liang Li, et al. ICLR 2024 (2024). [Paper]
End-to-End QA
- Table-GPT: Table Fine-tuned GPT for Diverse Table Tasks [Paper]
- TableGPT2: A Large Multimodal Model with Tabular Data Integration [Paper]
- CABINET: Content Relevance based Noise Reduction for Table Question Answering [Paper]
- TableMaster: A Recipe to Advance Table Understanding with Language Models [Paper]
- Multimodal Table Understanding [Paper]
- TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy [Paper]
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [Paper]
- MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions Jian Wu, Linyi Yang, Dongyuan Li, et al. ICLR 2025 (2025). [Paper]
- Improved Baselines with Visual Instruction Tuning Haotian Liu, Chunyuan Li, Yuheng Li, et al. CVPR 2024 (2024). [Paper]
- Towards Cross-Modality Modeling for Time Series Analytics: A Survey in the LLM Era Chenxi Liu, et al. IJCAI 2025 Survey Track (2025). [Paper]
- Association between forecasting models’ precision and nonlinear patterns of daily river flow time series Farhang Rahmani, Mohammad Hadi Fattahi. Modeling Earth Systems and Environment, 2022. [Paper]
- HMCKRAutoEncoder: An Interpretable Deep Learning Framework for Time Series Analysis Jilong Wang, Rui Li, Renfa Li, et al. IEEE Transactions on Emerging Topics in Computing, 2022. [Paper]
- The Performance of LSTM and BiLSTM in Forecasting Time Series Sima Siami-Namini, Neda Tavakoli, Akbar Siami Namin. IEEE International Conference on Big Data, 2019. [Paper]
- A Comparison of ARIMA and LSTM in Forecasting Time Series Sima Siami-Namini, Neda Tavakoli, Akbar Siami Namin. ICMLA, 2018. [Paper]
- Time Series Databases and InfluxDB Syeda Noor Zehra Naqvi, Sofia Yfantidou, et al. Université libre de Bruxelles, Advanced Databases, 2017. [Paper]
TS2NL
- TimeRAG: Boosting LLM Time Series Forecasting via Retrieval-Augmented Generation Silin Yang, Dong Wang, Haoqi Zheng, Ruochun Jin. ICASSP 2025. [Paper]
- TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents Geon Lee, Wenchao Yu, Kijung Shin, et al. AAAI 2025. [Paper]
- Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop Yushan Jiang, Wenchao Yu, Geon Lee, et al. arXiv 2025. [Paper]
- From News to Forecast: Integrating Event Analysis in LLM-Based Time Series Forecasting with Reflection Xinlei Wang, Maike Feng, Jing Qiu, et al. NeurIPS 2024. [Paper]
- Dynamic Dynamic Time Warping Karl Bringmann, Nick Fischer, Ivor van der Hoog, et al. SODA 2024. [Paper]
- Can Large Language Models be Anomaly Detectors for Time Series? Sarah Alnegheimish, Linh Nguyen, Laure Berti-Equille, et al. DSAA 2024. [Paper]
- Exploring Large Language Models for Climate Forecasting Yang Wang, Hassan A. Karimi. arXiv 2024. [Paper]
- Temporal Data Meets LLM -- Explainable Financial Time Series Forecasting Xinli Yu, Zheng Chen, Yuan Ling, et al. arXiv 2023. [Paper]
Alignment
- TimeCMA: Towards LLM-Empowered Multivariate Time Series Forecasting via Cross-Modality Alignment Chenxi Liu, Qianxiong Xu, Hao Miao, et al. AAAI 2025. [Paper]
- CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning Peiyuan Liu, Hang Guo, Tao Dai, et al. AAAI 2025. [Paper]
- LLM4TS: Aligning Pre-Trained LLMs as Data-Efficient Time-Series Forecasters Ching Chang, Wei-Yao Wang, Wen-Chih Peng, Tien-Fu Chen. ACM Transactions on Intelligent Systems and Technology, Volume 16, Issue 3, Article No. 60, Pages 1 - 20 (2025). [Paper]
- Large Language Models are Few-Shot Multivariate Time Series Classifiers Yakun Chen, Zihao Li, Chao Yang, et al. Data Mining and Knowledge Discovery, Volume 39, Issue 5 (2025). [Paper]
- SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs Fengze Li, Yue Wang, Yangle Liu, et al. arXiv 2025. [Paper]
- Time-LLM: Time Series Forecasting by Reprogramming Large Language Models Ming Jin, Shiyu Wang, Lintao Ma, et al. ICLR 2024. [Paper]
- S2IP-LLM: Semantic Space Informed Prompt Learning with LLM for Time Series Forecasting Zijie Pan, Yushan Jiang, Sahil Garg, et al. ICML 2024. [Paper]
- A Comparison of Current Graph Database Models Renzo Angles. IEEE 28th International Conference on Data Engineering Workshops, 2012. [Paper]
Natural Language To Graph Analysis Query
- NAT-NL2GQL: A Novel Multi-Agent Framework for Translating Natural Language to Graph Query Language [Paper]
- R3-NL2GQL: A Model Coordination and Knowledge Graph Alignment Approach for NL2GQL [Paper]
- Graph Learning in the Era of LLMs: A Survey from the Perspective of Data, Models, and Tasks [Paper]
- Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey [Paper]
- Aligning Large Language Models to a Domain-specific Graph Database for NL2GQL Yuanyuan Liang, Keren Tan, Tingyu Xie, et al. CIKM '24, 2024. [Paper]
LLM-based Semantic Analysis
- Retrieval-Then-Reasoning
- Subgraph Retrieval Enhanced Model for Multi-hop Knowledge Base Question Answering [Paper]
- UniKGQA: Unified Retrieval and Reasoning for Solving Multi-hop Question Answering Over Knowledge Graph [Paper]
- G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering Xiaoxin He, Yijun Tian, Yifei Sun, et al. NeurIPS 2024. [Paper]
- Execution-Then-Reasoning
- Interactive-KBQA: Multi-Turn Interactions for Knowledge Base Question Answering with Large Language Models [Paper]
- FlexKBQA: A Flexible LLM-Powered Framework for Few-Shot Knowledge Base Question Answering [Paper]
- MCTS-KBQA: Monte Carlo Tree Search for Knowledge Base Question Answering Guanming Xiong, Haochen Li, Wen Zhao. arXiv 2025. [Paper]
Graph Task Based Fine-tuning Methods
- Language is All a Graph Needs [Paper]
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [Paper]
- GraphGPT: Graph Instruction Tuning for Large Language Models [Paper]
- Inductive Representation Learning on Large Graphs [Paper]
- InstructGraph: Boosting Large Language Models via Graph-Centric Instruction Tuning and Preference Alignment Jianing Wang, Junda Wu, Yupeng Hou, et al. Findings of ACL 2024. [Paper]
- GLaM: Fine-Tuning Large Language Models for Domain Knowledge Graph Alignment via Neighborhood Partitioning and Generative Subgraph Encoding Stefan Dernbach, Khushbu Agarwal, Alejandro Zuniga, et al. AAAI Symposium Series, 3(1), 82-89 (2024). [Paper]
- Semi-Supervised Learning With Graph Learning-Convolutional Networks Bo Jiang, Ziyan Zhang, Doudou Lin, Jin Tang, Bin Luo. CVPR 2019, pp. 11313-11320. [Paper]
- Agent Based Methods
- StructGPT: A General Framework for Large Language Model to Reason over Structured Data [Paper]
- Call Me When Necessary: LLMs can Efficiently and Faithfully Reason over Structured Environments [Paper]
- KBQA-o1: Agentic Knowledge Base Question Answering with Monte Carlo Tree Search Haoran Luo, Haihong E, Yikai Guo, et al. ICML 2025. [Paper]
- Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task Tao Yu, Rui Zhang, Kai Yang, et al. EMNLP 2018. [Paper]
- Compositional Semantic Parsing on Semi-Structured Tables Panupong Pasupat, Percy Liang. ACL 2015, Pages 1470–1480. [Paper]
- CodeS: Towards Building Open-source Language Models for Text-to-SQL [Paper]
- ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning Zhe Xie, Zeyan Li, Xiao He, et al. Proceedings of the VLDB Endowment, 2025. [Paper]
- Relational Data Generation with Graph Neural Networks and Latent Diffusion Models Valter Hudovernik. TRL @ NeurIPS 2024 Poster, 2024. [Paper]
- Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space Hengrui Zhang, Jiani Zhang, Balasubramaniam Srinivasan, et al. ICLR 2024. [Paper]
- ITF-GAN: Synthetic Time Series Dataset Generation and Manipulation by Interpretable Features Hendrik Klopries, Andreas Schwung. Knowledge-Based Systems, Volume 283, Issue C (2024). [Paper]
- REaLTabFormer: Generating Realistic Relational and Tabular Data using Transformers Aivin V. Solatorio, Olivier Dupriez. arXiv 2023. [Paper]
- Synthetic Data Generation of Many-to-Many Datasets via Random Graph Generation Kai Xu, Georgi Ganev, Emile Joubert, et al. ICLR 2023. [Paper]
- SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task Tao Yu, Michihiro Yasunaga, Kai Yang, et al. EMNLP 2018. [Paper]
- A Temporal Knowledge Graph Generation Dataset Supervised Distantly by Large Language Models Jun Zhu, Yan Fu, Junlin Zhou, Duanbing Chen. Scientific Data, 12:734 (2025). [Paper]
- A Framework for Large-Scale Synthetic Graph Dataset Generation Sajad Darabi, Piotr Bigaj, Dawid Majchrowski, et al. IEEE Transactions on Neural Networks and Learning Systems, Volume 36, Issue 8, Pages 14258 - 14268 (2025). [Paper]
Markup Extraction
- Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes [Paper]
- WebFormer: The Web-page Transformer for Structure Information Extraction Qifan Wang, Yi Fang, Anirudh Ravula, et al. WWW '22, 2022. [Paper]
Markup Query
- XPath Agent: An Efficient XPath Programming Agent Based on LLM for Web Crawler Yu Li, Bryce Wang, Xinyu Luan. arXiv 2025. [Paper]
- Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases through Text-to-NoSQL Translation Jinwei Lu, Yuanfeng Song, Zhiqian Qin, et al. arXiv 2025. [Paper]
Markup Understanding
- Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding Hongshen Xu, Lu Chen, Zihan Zhao, et al. WSDM '24, 2024. [Paper]
- DOM-LM: Learning Generalizable Representations for HTML Documents Xiang Deng, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Huan Sun. arXiv 2022. [Paper]
- MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. ACL 2022. [Paper]
Table Representation
- ST-Raptor: LLM-Powered Semi-Structured Table Question Answering [Paper]
- Reasoning and Retrieval for Complex Semi-structured Tables via Reinforced Relational Data Transformation Haoyu Dong, Yue Hu, Yanan Cao. SIGIR '25, Pages 1382 - 1391 (2025). [Paper]
- Can an LLM Find Its Way Around a Spreadsheet? Cho-Ting Lee, Andrew Neeser, Shengzhe Xu, et al. ICSE 2025. [Paper]
- Auto-Tables: Relationalize Tables without Using Examples Peng Li, Yeye He, Cong Yan, et al. SIGMOD Record, Volume 53, Issue 1, Pages 76 - 85 (2024). [Paper]
- TUTA: Tree-based Transformers for Generally Structured Table Pre-training Zhiruo Wang, Haoyu Dong, Ran Jia, et al. KDD '21, Pages 1780 - 1790 (2021). [Paper]
Table Prompting
- SpreadsheetLLM: Encoding Spreadsheets for Large Language Models Haoyu Dong, Jianbo Zhao, Yuzhang Tian, et al. arXiv 2024. [Paper]
- HySem: A Context Length Optimized LLM Pipeline for Unstructured Tabular Extraction Narayanan PP, Anantharaman Palacode Narayana Iyer. TRL @ NeurIPS 2024 Poster, 2024. [Paper]
Table Querying
- ST-Raptor: LLM-Powered Semi-Structured Table Question Answering [Paper]
- SpreadsheetLLM: Encoding Spreadsheets for Large Language Models [Paper]
Traditional Approaches
- DVQA: Understanding Data Visualizations via Question Answering Kushal Kafle, Brian Price, Scott Cohen, Christopher Kanan. CVPR 2018, pp. 5648-5656. [Paper]
Chart Captioning
- ChartThinker: A Contextual Chain-of-Thought Approach to Optimized Chart Summarization Mengsha Liu, Daoyuan Chen, Yaliang Li, et al. LREC-COLING 2024, Pages 3057–3074 (2024). [Paper]
- UniChart: A Universal Vision-Language Pretrained Model for Chart Comprehension and Reasoning Ahmed Masry, Parsa Kavehzadeh, Xuan Long Do, Enamul Hoque, Shafiq Joty. EMNLP 2023. [Paper]
- FigCaps-HF: A Figure-to-Caption Generative Framework and Benchmark with Human Feedback Ashish Singh, Ashutosh Singh, Prateek Agarwal, et al. arXiv 2023. [Paper]
- Chart-to-Text: Generating Natural Language Descriptions for Charts by Adapting the Transformer Model Jason Obeid, Enamul Hoque. INLG 2020, Pages 138–147 (2020). [Paper]
- An Architecture for Data-to-Text Systems Ehud Baruch Reiter. ENLG 07, 2007. [Paper]
- Describing Complex Charts in Natural Language: A Caption Generation System Vibhu O. Mittal, Giuseppe Carenini, Johanna D. Moore, Steven Roth. Computational Linguistics, 1998. [Paper]
Chart Question Answering
- EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding Muye Huang, Han Lai, Xinyu Zhang, et al. AAAI 2025. [Paper]
- Charts-of-Thought: Enhancing LLM Visualization Literacy Through Structured Data Extraction Amit Kumar Das, Mohammad Tarun, Klaus Mueller. IEEE VIS 2025. [Paper]
- ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding Zhengzhuo Xu, Bowen Qu, Yiyan Qi, et al. ICLR 2025. [Paper]
- ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild Ahmed Masry, Megh Thakkar, Aayush Bajaj, et al. COLING 2025. [Paper]
- ChartInsights: Evaluating Multimodal Large Language Models for Low-Level Chart Question Answering Yifan Wu, Lutao Yan, Leixian Shen, et al. EMNLP 2024. [Paper]
- VizAbility: Enhancing Chart Accessibility with LLM-based Conversational Interaction Joshua Gorniak, Yoon Kim, Donglai Wei, et al. UIST '24, Article No. 89, Pages 1 - 19 (2024). [Paper]
- ChartLlama: A Multimodal LLM for Chart Understanding and Generation Yucheng Han, Chi Zhang, Xin Chen, et al. arXiv 2023. [Paper]
- ChartBench: A Benchmark for Complex Visual Reasoning in Charts Zhengzhuo Xu, Sinan Du, Yiyan Qi, et al. arXiv 2023. [Paper]
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality Qinghao Ye, Haiyang Xu, Guohai Xu, et al. arXiv 2023. [Paper]
Chart-to-Code
- ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation Cheng Yang, Chufan Shi, Yaxin Liu, et al. ICLR 2025. [Paper]
- Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation Lei Chen, Xuanle Zhao, Zhixiong Zeng, et al. arXiv 2025. [Paper]
- Text2Chart31: Instruction Tuning for Chart Generation with Automatic Feedback Fatemeh Pesaran Zadeh, Juyeon Kim, Jin-Hwa Kim, Gunhee Kim. EMNLP 2024. [Paper]
- Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding Andong Deng, Zhongpai Gao, Anwesa Choudhuri, et al. CVPR 2025. [Paper]
- TempMe: Video Temporal Token Merging for Efficient Text-Video Retrieval Leqi Shen, Tianxiang Hao, Tao He, et al. ICLR 2025. [Paper]
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models Haibo Wang, Zhiyang Xu, Yu Cheng, et al. EMNLP 2025 Findings. [Paper]
- Video Token Merging for Long Video Understanding Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, Xinyu Li. NeurIPS 2024. [Paper]
- TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma. arXiv 2024. [Paper]
- From Image to Video, what do we need in multimodal LLMs? Suyuan Huang, Haoxin Zhang, Linqing Zhong, et al. arXiv 2024. [Paper]
- LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs Yunxin Li, Xinyu Chen, Baotian Hu, Min Zhang. arXiv 2024. [Paper]
- Predicting Team Well-Being through Face Video Analysis with AI Moritz Müller, Ambre Dupuis, Tobias Zeulner, et al. Applied Sciences, 14(3), 1284 (2024). [Paper]
- AI Based Multimodal Emotion and Behavior Analysis of Interviewee Aaditya Jadhav, Rushikesh Ghodake, Karthik Muralidharan, G Tarun Varma. IJSREM, May 2023. [Paper]
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Yuqian Yuan, Hang Zhang, Wentong Li, et al. CVPR 2025, pp. 18970-18980 (2025). [Paper]
- Video Summarisation with Incident and Context Information using Generative AI Ulindu De Silva, Leon Fernando, Kalinga Bandara, Rashmika Nawaratne. IECON 2024 - 50th Annual Conference of the IEEE Industrial Electronics Society, 2024. [Paper]
- Abnormal Event Detection in Videos using LSTM Convolutional Autoencoder Abdelhafid Berroukham, Khalid Housni, Mohammed Lahraichi. ISCV 2024 - International Conference on Intelligent Systems and Computer Vision, 2024. [Paper]
- Utilizing Multimodal Large Language Models for Video Analysis of Posture in Studying Collaborative Learning Ridwan Whitehead, Andy Nguyen, Sanna Järvelä. Journal of Learning Analytics, Vol. 12 No. 1 (2025). [Paper]
- Artificial Intelligence–Powered 3D Analysis of Video-Based Caregiver-Child Interactions Zhenzhen Weng, Laura Bravo-Sánchez, Zeyu Wang, et al. Science Advances, Vol. 11, Issue 8 (2025). [Paper]
- VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding Shihao Wang, Guo Chen, De-an Huang, et al. arXiv 2025. [Paper]
- DisCo: Disentangled Control for Realistic Human Dance Generation Tan Wang, Linjie Li, Kevin Lin, et al. CVPR 2024, pp. 9326-9336 (2024). [Paper]
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, et al. ICCV 2023, pp. 15954-15964 (2023). [Paper]
- Align Your Latents: High-Resolution Video Synthesis With Latent Diffusion Models Andreas Blattmann, Robin Rombach, Huan Ling, et al. CVPR 2023, pp. 22563-22575 (2023). [Paper]
- SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation Wenxuan Zhang, Xiaodong Cun, Xuan Wang, et al. CVPR 2023, pp. 8652-8661 (2023). [Paper]
- DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models Yifeng Ma, Shiwei Zhang, Jiayu Wang, et al. arXiv 2023. [Paper]
- Imagen Video: High Definition Video Generation with Diffusion Models Jonathan Ho, William Chan, Chitwan Saharia, et al. arXiv 2022. [Paper]
- Make-A-Video: Text-to-Video Generation without Text-Video Data Uriel Singer, Adam Polyak, Thomas Hayes, et al. arXiv 2022. [Paper]
- NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis Jian Liang, Chenfei Wu, Xiaowei Hu, et al. NeurIPS 2022. [Paper]
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking [Paper]
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [Paper]
- SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding Jian Chen, Ruiyi Zhang, Yufan Zhou, et al. ICLR 2025. [Paper]
- AesthetiQ: Enhancing Graphic Layout Design via Aesthetic-Aware Preference Alignment of Multi-modal Large Language Models Sohan Patnaik, Rishabh Jain, Balaji Krishnamurthy, Mausoom Sarkar. CVPR 2025, pp. 23701-23711 (2025). [Paper]
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation Manan Suri, Puneet Mathur, Franck Dernoncourt, et al. NAACL 2025, pp. 6088-6109 (2025). [Paper]
- LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation Hengyu Shi, Junhao Su, Junfeng Luo, Jialin Gao. arXiv 2025. [Paper]
- OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition Jianqiang Wan, Sibo Song, Wenwen Yu, et al. CVPR 2024, pp. 15641-15653 (2024). [Paper]
- DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding Dongsheng Wang, Natraj Raman, Mathieu Sibue, et al. ACL 2024, pp. 8529-8548 (2024). [Paper]
- VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding Ofir Abramovich, Niv Nayman, Sharon Fogel, et al. ECCV 2024, pp. 241-259 (2024). [Paper]
- Efficient End-to-End Visual Document Understanding with Rationale Distillation Wang Zhu, Alekh Agarwal, Mandar Joshi, et al. NAACL 2024. [Paper]
- PosterLlama: Bridging Design Ability of Language Model to Content-Aware Layout Generation Jaejung Seol, Seojun Kim, Jaejun Yoo. ECCV 2024, pp. 451-468 (2024). [Paper]
- SciPostLayout: A Dataset for Layout Analysis and Layout Generation of Scientific Posters Hao Wang, Shohei Tanaka, Yoshitaka Ushiku. CVPR Workshops 2024, pp. 8136-8141 (2024). [Paper]
- DLAFormer: An End-to-End Transformer For Document Layout Analysis Jiawei Wang, Kai Hu, Qiang Huo. ICDAR 2024, pp. 40-57 (2024). [Paper]
- CREPE: Coordinate-Aware End-to-End Document Parser Yamato Okamoto, Youngmin Baek, Geewook Kim, et al. ICDAR 2024, pp. 3-20 (2024). [Paper]
- Corrective Retrieval Augmented Generation Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, Zhen-Hua Ling. arXiv 2024. [Paper]
- RAFT: Adapting Language Model to Domain Specific RAG Tianjun Zhang, Shishir G. Patil, Naman Jain, et al. arXiv 2024. [Paper]
- VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction Jiahao Zhang, Ryota Yoshihashi, Shunsuke Kitada, et al. arXiv 2024. [Paper]
- MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection Niki Nezakati, Md Kaykobad Reza, Ameya Patil, et al. arXiv 2024. [Paper]
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding Jaemin Cho, Debanjan Mahata, Ozan Irsoy, et al. arXiv 2024. [Paper]
- LTSim: Layout Transportation-based Similarity Measure for Evaluating Layout Generation Mayu Otani, Naoto Inoue, Kotaro Kikuchi, et al. arXiv 2024. [Paper]
- MissModal: Increasing Robustness to Missing Modality in Multimodal Sentiment Analysis Ronghao Lin, Haifeng Hu. TACL 2023, pp. 1686-1702 (2023). [Paper]
- Automatic Generation of Scientific Papers for Data Augmentation in Document Layout Analysis Lorenzo Pisaneschi, Andrea Gemelli, Simone Marinai. Pattern Recognition Letters, 2023. [Paper]
- Unifying Layout Generation With a Decoupled Diffusion Model Mude Hui, Zhizheng Zhang, Xiaoyi Zhang, et al. CVPR 2023, pp. 1942-1951 (2023). [Paper]
- LayoutDM: Discrete Diffusion Model for Controllable Layout Generation Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, et al. CVPR 2023, pp. 10167-10176 (2023). [Paper]
- VLCDoC: Vision-Language Contrastive Pre-training Model for Cross-Modal Document Classification Souhail Bakkali, Zuheng Ming, Mickael Coustaty, et al. Pattern Recognition, Vol. 139, 109419 (2023). [Paper]
- LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding Yang Xu, Yiheng Xu, Tengchao Lv, et al. ACL-IJCNLP 2021, pp. 2579-2591 (2021). [Paper]
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding Yiheng Xu, Minghao Li, Lei Cui, et al. KDD 2020, pp. 1192-1200 (2020). [Paper]
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision Wonjae Kim, Bokyung Son, Ildoo Kim. ICML 2021. [Paper]
- DocFormer: End-to-End Transformer for Document Understanding Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, et al. ICCV 2021, pp. 993-1003 (2021). [Paper]
- FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran. ICDARW 2019. [Paper]
- Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization) [Paper]
- Repoformer: Selective Retrieval for Repository-Level Code Completion [Paper]
- Magicoder: Empowering Code Generation with OSS-Instruct [Paper]
- SCLA: Automated Smart Contract Summarization via LLMs and Semantic Augmentation [Paper]
- Large Language Model for Vulnerability Detection: Emerging Results and Future Directions [Paper]
- Vulnerability Detection by Learning From Syntax-Based Execution Paths of Code [Paper]
- Self-Instruct: Aligning Language Models with Self-Generated Instructions [Paper]
- Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning Haiming Wang, Mert Unsal, Xiaohan Lin, et al. arXiv 2025. [Paper]
- Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving Yong Lin, Shange Tang, Bohan Lyu, et al. arXiv 2025. [Paper]
- Teaching Large Language Models to Self-Debug Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou. ICLR 2024. [Paper]
- Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning Mingyang Geng, Shangwen Wang, Dezun Dong, et al. ICSE 2024, Article 39, pp. 1-13 (2024). [Paper]
- FT2Ra: A Fine-Tuning-Inspired Approach to Retrieval-Augmented Code Completion Qi Guo, Xiaohong Li, Xiaofei Xie, et al. ISSTA 2024, pp. 313-324 (2024). [Paper]
- Pre-training by Predicting Program Dependencies for Vulnerability Analysis Tasks Zhongxin Liu, Zhijie Tang, Junwei Zhang, et al. ICSE 2024, Article 151, pp. 1-13 (2024). [Paper]
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct Ziyang Luo, Can Xu, Pu Zhao, et al. ICLR 2024. [Paper]
- REPOFUSE: Repository-Level Code Completion with Fused Dual Context Ming Liang, Xiaoheng Xie, Gehao Zhang, et al. arXiv 2024. [Paper]
- DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data Huajian Xin, Daya Guo, Zhihong Shao, et al. arXiv 2024. [Paper]
- Software Vulnerability Detection with GPT and In-Context Learning Zhihong Liu, Qing Liao, Wenchao Gu, Cuiyun Gao. DSC 2023. [Paper]
- RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation Fengji Zhang, Bei Chen, Yue Zhang, et al. EMNLP 2023. [Paper]
- Syntax-Directed Variational Autoencoder for Structured Data Hanjun Dai, Yingtao Tian, Bo Dai, et al. ICLR 2018. [Paper]
- Composing Graphical Models with Neural Networks for Structured Representations and Fast Inference Matthew James Johnson, David Duvenaud, Alexander B. Wiltschko, et al. NeurIPS 2016. [Paper]
- ProtChatGPT: Towards Understanding Proteins with Hybrid Representation and Large Language Models Chao Wang, Hehe Fan, Ruijie Quan, Lina Yao, Yi Yang. SIGIR 2025, pp. 1076-1086 (2025). [Paper]
- 3UR-LLM: An End-to-End Multimodal Large Language Model for 3D Scene Understanding Haomiao Xiong, Yunzhi Zhuge, Jiawen Zhu, et al. IEEE Trans. Multimedia, Vol. 27, pp. 2899-2911 (2025). [Paper]
- Towards 3D Molecule-Text Interpretation in Language Models Sihang Li, Zhiyuan Liu, Yanchen Luo, et al. ICLR 2024. [Paper]
- 3D-LLM: Injecting the 3D World into Large Language Models Yining Hong, Haoyu Zhen, Peihao Chen, et al. NeurIPS 2023, Article 900, pp. 20482-20494 (2023). [Paper]
- ProteinChat: Towards Achieving ChatGPT-Like Functionalities on Protein 3D Structures Han Guo, Mingjia Huo, Ruiyi Zhang, Pengtao Xie. TechRxiv 2023. [Paper]
- Do Large Language Models Truly Understand Geometric Structures? Xiaofeng Wang, Yiming Wang, Wenhong Zhu, Rui Wang. ICLR 2025. [Paper]
- 3DSMILES-GPT: 3D Molecular Pocket-Based Generation with Token-Only Large Language Model Jike Wang, Hao Luo, Rui Qin, et al. Chemical Science, 2025. [Paper]
- ProtChat: An AI Multi-Agent for Automated Protein Analysis Leveraging GPT-4 and Protein Language Model Huazhen Huang, Xianguo Shi, Hongyang Lei, et al. J. Chem. Inf. Model., 2024. [Paper]
- A Multimodal Protein Representation Framework for Quantifying Transferability Across Biochemical Downstream Tasks Fan Hu, Yishen Hu, Weihong Zhang, et al. Advanced Science, 2023. [Paper]
- SMILES: A Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules David Weininger. J. Chem. Inf. Comput. Sci., Vol. 28, No. 1, pp. 31-36 (1988). [Paper]
- LLMI3D: MLLM-based 3D Perception from a Single 2D Image Fan Yang, Sicheng Zhao, Yanhao Zhang, et al. arXiv 2024. [Paper]
- Self-supervised Image-based 3D Model Retrieval Dan Song, Chu-Meng Zhang, Xiao-Qian Zhao, et al. ACM Trans. Multimedia Comput. Commun. Appl., Vol. 19, No. 2, Article 70, pp. 1-18 (2023). [Paper]
- MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers Yiwen Chen, Tong He, Di Huang, et al. ICLR 2025. [Paper]
- CraftsMan3D: High-fidelity Mesh Generation with 3D Native Diffusion and Interactive Geometry Refiner Weiyu Li, Jiarui Liu, Hongyu Yan, et al. CVPR 2025, pp. 5307-5317 (2025). [Paper]
- SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D Weiyu Li, Rui Chen, Xuelin Chen, Ping Tan. ICLR 2024. [Paper]
- RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D Lingteng Qiu, Guanying Chen, Xiaodong Gu, et al. CVPR 2024, pp. 9914-9925 (2024). [Paper]
- Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer Shuang Wu, Youtian Lin, Feihu Zhang, et al. NeurIPS 2024, Article 3873, pp. 121859-121881 (2024). [Paper]
- Hunyuan3D 1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation Xianghui Yang, Huiwen Shi, Bowen Zhang, et al. arXiv 2024. [Paper]
- LLaMA-Mesh: Unifying 3D Mesh Generation with Language Models Zhengyi Wang, Jonathan Lorraine, Yikai Wang, et al. arXiv 2024. [Paper]
- Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation Rui Chen, Yongwei Chen, Ningxin Jiao, Kui Jia. ICCV 2023, pp. 22246-22256 (2023). [Paper]
- Zero-1-to-3: Zero-shot One Image to 3D Object Ruoshi Liu, Rundi Wu, Basile Van Hoorick, et al. ICCV 2023, pp. 9298-9309 (2023). [Paper]
- Unicorn: A Unified Multi-Tasking Matching Model Ju Fan, Jianhong Tu, Guoliang Li, et al. ACM SIGMOD Record, Vol. 53, No. 1, pp. 44-53 (2024). [Paper]
- Symphony: Towards Natural Language Query Answering over Multi-modal Data Lakes Zui Chen, Zihui Gu, Lei Cao, Ju Fan, Sam Madden, Nan Tang. CIDR 2023. [Paper]
- Towards Operationalizing Heterogeneous Data Discovery Jin Wang, Yanlin Feng, Chen Shen, Sajjadur Rahman, Eser Kandogan. arXiv 2025. [Paper]
- Semantic Operators: A Declarative Model for Rich, AI-based Data Processing Liana Patel, Siddharth Jha, Melissa Pan, et al. arXiv 2024. [Paper]
- CAESURA: Language Models as Multi-Modal Query Planners Matthias Urban, Carsten Binnig. arXiv 2023. [Paper]
- An Interactive Multi-Modal Query Answering System with Retrieval-Augmented Large Language Models Mengzhao Wang, Haotian Wu, Xiangyu Ke, et al. Proc. VLDB Endow., Vol. 17, No. 12, pp. 4333-4336 (2024). [Paper]
- MUST: An Effective and Scalable Framework for Multimodal Search of Target Modality Mengzhao Wang, Xiangyu Ke, Xiaoliang Xu, et al. ICDE 2024. [Paper]
- Explainable Multi-Modal Data Exploration in Natural Language via LLM Agent Farhad Nooralahzadeh, Yi Zhang, Jonathan Furst, Kurt Stockinger. arXiv 2024. [Paper]

