Autodata: an automatic data scientist to create high-quality data

Ilia Kulikov, Chenxi Whitehouse, Tianhao Wu, Swarnadeep Saha, Eryk Helenowski, Weizhe Yuan, Olga Golovneva, Jack Lanchantin, Yoram Bachrach, Jakob Foerster, Xian Li, Han Fang, Sainbayar Sukhbaatar, Jason Weston

2026

[Blog]


Reasoning Over Mathematical Objects: On-Policy Reward Modeling and Test Time Aggregation

*Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin, Xian Li, Tianjian Li, Bo Liu, Graham Neubig, Anaelia Ovalle, Swarnadeep Saha, Sainbayar Sukhbaatar, Sean Welleck, Jason Weston, Chenxi Whitehouse, Adina Williams, Jing Xu, Ping Yu, Weizhe Yuan, Jingyu Zhang, Wenting Zhao

*Authors ordered alphabetically
arXiv 2026

[PDF] [arXiv]


Self-Improving Pretraining: Using Post-trained Models to Pretrain Better Models

Ellen Xiaoqing Tan*, Jack Lanchantin*, Shehzaad Dhuliawala, Danwei Li, Thao Nguyen, Jing Xu, Ping Yu, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Xian Li*, Olga Golovneva*

*Equal contribution

arXiv 2026

[PDF] [arXiv]


SPICE: Self-Play In Corpus Environments Improves Reasoning

Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston

joint leads

arXiv 2025

[PDF] [arXiv]


LLM Output Homogenization is Task Dependent

Shomik Jain, Jack Lanchantin, Maximilian Nickel, Karen Ullrich, Ashia Wilson, Jamelle Watson-Daniels

arXiv 2025

[PDF] [arXiv]


Jointly Reinforcing Diversity and Quality in Language Model Generations

Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, Tianlu Wang

arXiv 2025

[PDF] [arXiv]


OptimalThinkingBench: Evaluating Over and Underthinking in LLMs

Pranjal Aggarwal, Seungone Kim, Jack Lanchantin, Sean Welleck, Jason Weston, Ilia Kulikov, Swarnadeep Saha

arXiv 2025, ICLR 2026

[PDF] [arXiv]


CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Ping Yu, Jack Lanchantin, Tianlu Wang, Weizhe Yuan, Olga Golovneva, Ilia Kulikov, Sainbayar Sukhbaatar, Jason Weston, Jing Xu

arXiv 2025

[PDF] [arXiv]


NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks

Yang Li*, Youssef Emad*, Karthik Padthe*, Jack Lanchantin*, Weizhe Yuan, Thao Nguyen, Jason Weston, Shang-Wen Li, Dong Wang, Ilia Kulikov, Xian Li

*joint first authors, †joint leads

arXiv 2025

[PDF] [arXiv]

Bridging Offline and Online Reinforcement Learning for LLMs

Jack Lanchantin*, Angelica Chen*, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov

*joint first authors, †joint leads

arXiv 2025

[PDF] [arXiv]

Diverse Preference Optimization

Jack Lanchantin, Angelica Chen, Shehzaad Dhuliawala, Ping Yu, Jason Weston, Sainbayar Sukhbaatar, Ilia Kulikov

arXiv 2025

[PDF] [arXiv]

LLM Pretraining with Continuous Concepts

Jihoon Tack, Jack Lanchantin, Jane Yu, Andrew Cohen, Ilia Kulikov, Janice Lan, Shibo Hao, Yuandong Tian, Jason Weston, Xian Li

arXiv 2025, ICLR 2026

[PDF] [arXiv]

Adaptive Decoding via Latent Preference Optimization

Shehzaad Dhuliawala, Ilia Kulikov, Ping Yu, Asli Celikyilmaz, Jason Weston, Sainbayar Sukhbaatar, Jack Lanchantin

arXiv 2024

[PDF] [arXiv]

ToolVerifier: Generalization to New Tools via Self-Verification

Dheeraj Mekala, Jason Weston, Jack Lanchantin, Roberta Raileanu, Maria Lomeli, Jingbo Shang, Jane Dwivedi-Yu

EMNLP Findings 2024 - Miami, FL
[PDF] [arXiv]

Learning to Reason and Memorize with Self-Notes

Jack Lanchantin*, Shubham Toshniwal*, Jason Weston, Arthur Szlam, Sainbayar Sukhbaatar

*equal contribution

NeurIPS 2023 - New Orleans, LA
[PDF] [arXiv] [slides]

Robustness of Named-Entity Replacements for In-Context Learning

Saeed Goodarzi, Nikhil Kagita, Dennis Minn, Shufan Wang, Roberto Dessi, Shubham Toshniwal, Adina Williams, Jack Lanchantin*, Koustuv Sinha*

*equal leads

EMNLP Findings 2023 - Singapore

[PDF] [ACL]

Compositional Interfaces for Compositional Generalization

Jelena Luketina, Jack Lanchantin, Sainbayar Sukhbaatar, Arthur Szlam

CoLLAs 2024 - Pisa, Italy

[PDF] [slides]

A Data Source for Reasoning Embodied Agents

Jack Lanchantin, Sainbayar Sukhbaatar, Gabriel Synnaeve, Yuxuan Sun, Kavya Srinet, Arthur Szlam

AAAI 2023 - Washington, DC

[PDF] [arXiv] [slides] [code]

Modeling interactions with Deep Learning

Jack Lanchantin

PhD Dissertation - 2021

Committee: Vicente Ordóñez (chair), Yangfeng Ji, Clint Miller, Casey Greene, Yanjun Qi

General Multi-label Image Classification with Transformers
Jack Lanchantin, Tianlu Wang, Vicente Ordóñez Román, Yanjun Qi
Conference on Computer Vision and Pattern Recognition (CVPR) 2021 - Nashville, TN
[PDF] [arXiv] [poster] [slides] [video] [code]

Transfer Learning for Predicting Virus-Host Protein Interactions for Novel Virus Sequences
Jack Lanchantin, Tom Weingarten, Arshdeep Sekhon, Clint Miller, Yanjun Qi
ACM-BCB 2021, NeurIPS Covid-19 Symposium 2020, Machine Learning in Computational Biology (MLCB) 2020
[PDF] [bioRxiv] [slides] [video] [code]

Time and Space Complexity of Graph Convolutional Networks
Derrick Blakely, Jack Lanchantin, Yanjun Qi
Tech Report 2021
[PDF]

Graph Convolutional Networks for Epigenetic State Prediction Using Both Sequence and 3D Genome Data
Jack Lanchantin, Yanjun Qi
European Conference on Computational Biology (ECCB) 2020, Bioinformatics 2020
[PDF] [bioRxiv] [slides] [poster] [code]

Reevaluating Adversarial Examples in Natural Language
John X. Morris, Eli Lifland, Jack Lanchantin, Yangfeng Ji, Yanjun Qi
Findings of the Association for Computational Linguistics - EMNLP 2020
[PDF] [arXiv] [slides] [code]

FastSK: Fast Sequence Analysis with Gapped String Kernels
Derrick Blakely, Eamon Collins, Ritambhara Singh, Andrew Norton, Jack Lanchantin, Yanjun Qi
Bioinformatics 2020
[PDF] [Bioinformatics] [code]

Neural Message Passing for Multi-Label Classification
Jack Lanchantin, Arshdeep Sekhon, Yanjun Qi
European Conference on Machine Learning (ECML-PKDD) 2019 - Würzburg, Germany
[PDF] [arXiv] [slides] [poster] [code]

Black-Box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers
Ji Gao, Jack Lanchantin, Mary Lou Soffa, Yanjun Qi
Deep Learning and Security Workshop (DLS) 2018 - San Francisco, CA
[PDF] [arXiv] [slides] [video] [code]

Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin
Ritambhara Singh, Jack Lanchantin, Arshdeep Sekhon, Yanjun Qi
Advances in Neural Information Processing Systems (NeurIPS) 2017 - Long Beach, CA
[PDF] [arXiv] [slides] [code] [Kipoi] [poster]

Opportunities and Obstacles for Deep Learning in Biology and Medicine
Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Wei Xie, Gail L Rosen, Benjamin J Lengerich, Johnny Israeli, Jack Lanchantin, Stephen Woloszynek, Anne E Carpenter, Avanti Shrikumar, Jinbo Xu, Evan M Cofer, David J Harris, Dave DeCaprio, Yanjun Qi, Anshul Kundaje, Yifan Peng, Laura K Wiley, Marwin HS Segler, Anthony Gitter, Casey S Greene
Journal of the Royal Society Interface 2018
[PDF] [JRSI] [Nature Tech Blog]

Prototype Matching Networks for Large-Scale Multi-label Genomic Sequence Classification
Jack Lanchantin, Arshdeep Sekhon, Ritambhara Singh, Yanjun Qi
arXiv Preprint 2017
[PDF] [arXiv] [slides]

Memory Matching Networks for Genomic Sequence Classification
Jack Lanchantin, Ritambhara Singh, Yanjun Qi
International Conference on Learning Representations (ICLR) Workshop Track 2017 - Toulon, France
[PDF] [arXiv] [poster]

Deep Motif Dashboard: Visualizing and Understanding Genomic Sequences Using Deep Neural Networks
Jack Lanchantin, Ritambhara Singh, Beilun Wang, Yanjun Qi
Pacific Symposium on Biocomputing (PSB) 2017 - Kohala Coast, HI
[PDF] [arXiv] [slides] [code] [poster]

Deep Motif: Visualizing Genomic Sequence Classifications
Jack Lanchantin, Ritambhara Singh, Zeming Lin, Yanjun Qi
International Conference on Learning Representations (ICLR) Workshop Track 2016 - San Juan, PR
[PDF] [arXiv] [code] [poster]

DeepChrome: Deep Learning for Predicting Gene Expression from Histone Modifications
Ritambhara Singh, Jack Lanchantin, Gabriel Robins, Yanjun Qi
European Conference on Computational Biology (ECCB) 2016 - The Hague, Netherlands
[PDF] [arXiv] [slides] [code]

Transfer String Kernel for Cross-Context DNA-Protein Binding Prediction
Ritambhara Singh, Jack Lanchantin, Gabriel Robins, Yanjun Qi
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 2016
[PDF] [arXiv] [slides] [code]

Exploring the Naturalness of Code with Recurrent Neural Nets
Jack Lanchantin, Ji Gao
arXiv Preprint 2016
[PDF] [arXiv] [slides] [code]

MUST-CNN: A Multilayer Shift-and-Stitch Convolutional Architecture for Sequence-based Protein Structure Prediction
Zeming Lin, Jack Lanchantin, Yanjun Qi
The 30th AAAI Conference on Artificial Intelligence (AAAI) 2016 - Phoenix, AZ
[PDF] [arXiv] [slides] [code]

Scene Labeling with Convolutional Neural Nets
Zeming Lin, Jack Lanchantin
Preprint 2015
[PDF] [slides] [code]