- EEGEyeNet: a Simultaneous Electroencephalography and Eye-tracking Dataset and Benchmark for Eye Movement Prediction. Ard Kastrati, Martyna Plomecka, Damian Pascual, Lukas Wolf, Victor Gillioz, Roger Wattenhofer, Nicolas Langer. [doi]
- Multilingual Spoken Words Corpus. Mark Mazumder, Sharad Chitlangia, Colby R. Banbury, Yiping Kang, Juan Ciro, Keith Achorn, Daniel Galvez, Mark Sabini, Peter Mattson, David Kanter, Greg Diamos, Pete Warden, Josh Meyer, Vijay Janapa Reddi. [doi]
- An Extensible Benchmark Suite for Learning to Simulate Physical Systems. Karl Otness, Arvi Gjoka, Joan Bruna, Daniele Panozzo, Benjamin Peherstorfer, Teseo Schneider, Denis Zorin. [doi]
- Monash Time Series Forecasting Archive. Rakshitha Godahewa, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, Pablo Montero-Manso. [doi]
- It's COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks. Michelle Bao, Angela Zhou, Samantha Zottola, Brian Brubach, Sarah Desmarais, Aaron Horowitz, Kristian Lum, Suresh Venkatasubramanian. [doi]
- Whole Brain Vessel Graphs: A Dataset and Benchmark for Graph Learning and Neuroscience. Johannes C. Paetzold, Julian McGinnis, Suprosanna Shit, Ivan Ezhov, Paul Büschl, Chinmay Prabhakar, Anjany Sekuboyina, Mihail I. Todorov, Georgios Kaissis, Ali Ertürk, Stephan Günnemann, Bjoern H. Menze. [doi]
- Trust, but Verify: Cross-Modality Fusion for HD Map Change Detection. John Lambert, James Hays. [doi]
- NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language Evaluation. David Alfonso-Hermelo, Ahmad Rashid, Abbas Ghaddar, Philippe Langlais, Mehdi Rezagholizadeh. [doi]
- Intelligent Sight and Sound: A Chronic Cancer Facial Pain Dataset. Catherine Ordun, Alexandra N. Cha, Edward Raff, Byron Gaskin, Alex Hanson 0002, Mason Rule, Sanjay Purushotham, James L. Gulley. [doi]
- HumBugDB: A Large-scale Acoustic Mosquito Dataset. Ivan Kiskin, Marianne Sinka, Adam D. Cobb, Waqas Rafique, Lawrence Wang, Davide Zilli, Benjamin Gutteridge, Rinita Dam, Theodoros Marinos, Yunpeng Li, Dickson Msaky, Emmanuel Kaindoa, Gerard Killeen, Eva Herreros-Moya, Kathy Willis, Stephen J. Roberts. [doi]
- The Met Dataset: Instance-level Recognition for Artworks. Nikolaos-Antonios Ypsilantis, Noa Garcia, Guangxing Han, Sarah Ibrahimi, Nanne van Noord, Giorgos Tolias. [doi]
- URLB: Unsupervised Reinforcement Learning Benchmark. Michael Laskin, Denis Yarats, Hao Liu, Kimin Lee, Albert Zhan, Kevin Lu, Catherine Cang, Lerrel Pinto, Pieter Abbeel. [doi]
- The Medkit-Learn(ing) Environment: Medical Decision Modelling through Simulation. Alex J. Chan, Ioana Bica, Alihan Hüyük, Daniel Jarrett, Mihaela van der Schaar. [doi]
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation. Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen 0001, Rohit Pillai, Yu Cheng 0001, Luowei Zhou, Xin Wang, William Yang Wang, Tamara L. Berg, Mohit Bansal, Jingjing Liu 0001, Lijuan Wang, Zicheng Liu 0001. [doi]
- RobustBench: a standardized adversarial robustness benchmark. Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, Matthias Hein 0001. [doi]
- HPOBench: A Collection of Reproducible Multi-Fidelity Benchmark Problems for HPO. Katharina Eggensperger, Philipp Müller, Neeratyoy Mallik, Matthias Feurer, René Sass, Aaron Klein, Noor H. Awad, Marius Lindauer, Frank Hutter. [doi]
- Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. Georgios Papoudakis, Filippos Christianos, Lukas Schäfer 0001, Stefano V. Albrecht. [doi]
- CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks. Ruchir Puri, David S. Kung 0001, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen 0007, Mihir R. Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss. [doi]
- HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML. Sebastian Pineda-Arango, Hadi S. Jomaa, Martin Wistuba, Josif Grabocka. [doi]
- DENETHOR: The DynamicEarthNET dataset for Harmonized, inter-Operable, analysis-Ready, daily crop monitoring from space. Lukas Kondmann, Aysim Toker, Marc Rußwurm, Andrés Camero, Devis Peressuti, Grega Milcinski, Pierre-Philippe Mathieu, Nicolas Longépé, Timothy Davis, Giovanni Marchisio, Laura Leal-Taixé, Xiaoxiang Zhu. [doi]
- OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs. Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, Jure Leskovec. [doi]
- ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data. Afshin Dehghan, Gilad Baruch, Zhuoyuan Chen, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, Elad Shulman. [doi]
- CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription. Nikita Pavlichenko, Ivan Stelmakh, Dmitry Ustalov. [doi]
- FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark. Mingjie Li, Wenjia Cai, Rui Liu, Yuetian Weng, Xiaoyun Zhao, Cong Wang, Xin Chen, Zhong Liu, Caineng Pan, Mengke Li, Yingfeng Zheng, Yizhi Liu, Flora D. Salim, Karin Verspoor, Xiaodan Liang, Xiaojun Chang. [doi]
- Benchmarking Multimodal AutoML for Tabular Data with Text Fields. Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li 0003, Alexander J. Smola. [doi]
- GraphGT: Machine Learning Datasets for Graph Generation and Transformation. Yuanqi Du, Shiyu Wang, Xiaojie Guo 0002, Hengning Cao, Shujie Hu, Junji Jiang, Aishwarya Varala, Abhinav Angirekula, Liang Zhao 0002. [doi]
- Open Bandit Dataset and Pipeline: Towards Realistic and Reproducible Off-Policy Evaluation. Yuta Saito, Shunsuke Aihara, Megumi Matsutani, Yusuke Narita. [doi]
- Graph Robustness Benchmark: Benchmarking the Adversarial Robustness of Graph Machine Learning. Qinkai Zheng, Xu Zou, Yuxiao Dong, Yukuo Cen, Da Yin, Jiarong Xu, Yang Yang 0009, Jie Tang 0001. [doi]
- FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset. Hasam Khalid, Shahroz Tariq, Minha Kim, Simon S. Woo. [doi]
- Teach Me to Explain: A Review of Datasets for Explainable Natural Language Processing. Sarah Wiegreffe, Ana Marasovic. [doi]
- SODA10M: A Large-Scale 2D Self/Semi-Supervised Object Detection Dataset for Autonomous Driving. Jianhua Han, Xiwen Liang, Hang Xu, Kai Chen, Lanqing Hong, Jiageng Mao, Chaoqiang Ye, Wei Zhang 0196, Zhenguo Li, Xiaodan Liang, Chunjing Xu. [doi]
- MLPerf Tiny Benchmark. Colby R. Banbury, Vijay Janapa Reddi, Peter Torelli, Nat Jeffries, Csaba Király 0002, Jeremy Holleman, Pietro Montino, David Kanter, Pete Warden, Danilo Pau, Urmish Thakker, Antonio Torrini, Jay Cordaro, Giuseppe Di Guglielmo, Javier M. Duarte, Honson Tran, Nhan Tran, Wenxu Niu, Xuesong Xu. [doi]
- Really Doing Great at Estimating CATE? A Critical Look at ML Benchmarking Practices in Treatment Effect Estimation. Alicia Curth, David Svensson, James Weatherall, Mihaela van der Schaar. [doi]
- WildfireDB: An Open-Source Dataset Connecting Wildfire Occurrence with Relevant Determinants. Samriddhi Singla, Ayan Mukhopadhyay, Michael Wilbur, Tina Diao, Vinayak Gajjewar, Ahmed Eldawy, Mykel J. Kochenderfer, Ross D. Shachter, Abhishek Dubey. [doi]
- OmniPrint: A Configurable Printed Character Synthesizer. Haozhe Sun, Wei-Wei Tu, Isabelle Guyon. [doi]
- EventNarrative: A Large-scale Event-centric Dataset for Knowledge Graph-to-Text Generation. Anthony Colas, Ali Sadeghian, Yue Wang, Daisy Zhe Wang. [doi]
- The CPD Data Set: Personnel, Use of Force, and Complaints in the Chicago Police Department. Thibaut Horel, Lorenzo Masoero, Raj Agrawal, Daria Roithmayr, Trevor Campbell. [doi]
- The Neural MMO Platform for Massively Multiagent Research. Joseph Suarez, Yilun Du, Clare Zhu, Igor Mordatch, Phillip Isola. [doi]
- CCNLab: A Benchmarking Framework for Computational Cognitive Neuroscience. Nikhil X. Bhattasali, Momchil S. Tomov, Samuel J. Gershman. [doi]
- MIND dataset for diet planning and dietary healthcare with machine learning: Dataset creation using combinatorial optimization and controllable generation with domain experts. Changhun Lee, Soohyeok Kim, Sehwa Jeong, Chiehyeon Lim, Jayun Kim, Yeji Kim, Minyoung Jung. [doi]
- STAR: A Benchmark for Situated Reasoning in Real-World Videos. Bo Wu, Shoubin Yu, Zhenfang Chen, Josh Tenenbaum 0001, Chuang Gan. [doi]
- Q-Pain: A Question Answering Dataset to Measure Social Bias in Pain Management. Cécile Logé, Emily Ross, David Yaw Amoah Dadey, Saahil Jain, Adriel Saporta, Andrew Y. Ng, Pranav Rajpurkar. [doi]
- ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations. Tongzhou Mu, Zhan Ling, Fanbo Xiang, Derek Yang, Xuanlin Li, Stone Tao, Zhiao Huang, Zhiwei Jia, Hao Su 0001. [doi]
- An Empirical Study of Graph Contrastive Learning. Yanqiao Zhu 0001, Yichen Xu, Qiang Liu, Shu Wu. [doi]
- A sandbox for prediction and integration of DNA, RNA, and proteins in single cells. Malte Lücken, Daniel Burkhardt, Robrecht Cannoodt, Christopher Lance, Aditi Agrawal, Hananeh Aliee, Ann Chen, Louise Deconinck, Angela Detweiler, Alejandro Granados, Shelly Huynh, Laura Isacco, Yang Kim, Dominik Klein, Bony de Kumar, Sunil Kuppasani, Heiko Lickert, Aaron McGeever, Joaquin Melgarejo, Honey Mekonen, Maurizio Morri, Michaela Müller, Norma Neff, Sheryl Paul, Bastian Rieck, Kaylie Schneider, Scott Steelman, Michael Sterr, Daniel Treacy, Alexander Tong 0001, Alexandra-Chloé Villani, Guilin Wang, Jia Yan, Ce Zhang, Angela Pisco, Smita Krishnaswamy, Fabian J. Theis, Jonathan M. Bloom. [doi]
- Isaac Gym: High Performance GPU Based Physics Simulation For Robot Learning. Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, Gavriel State. [doi]
- Neural Latents Benchmark '21: Evaluating latent variable models of neural population activity. Felix Pei, Joel Ye, David M. Zoltowski, Anqi Wu, Raeed H. Chowdhury, Hansem Sohn, Joseph E. O'Doherty, Krishna V. Shenoy, Matthew T. Kaufman, Mark M. Churchland, Mehrdad Jazayeri, Lee E. Miller, Jonathan W. Pillow, Il Memming Park, Eva L. Dyer, Chethan Pandarinath. [doi]
- ReaSCAN: Compositional Reasoning in Language Grounding. Zhengxuan Wu, Elisa Kreiss, Desmond C. Ong, Christopher Potts. [doi]
- An Empirical Investigation of Representation Learning for Imitation. Cynthia Chen, Xin Chen, Sam Toyer, Cody Wild, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H. Wang, Ping Luo, Stuart Russell 0001, Pieter Abbeel, Rohin Shah. [doi]
- Dynamic Environments with Deformable Objects. Rika Antonova, Peiyang Shi, Hang Yin, Zehang Weng, Danica Kragic. [doi]
- Chest ImaGenome Dataset for Clinical Reasoning. Joy T. Wu, Nkechinyere Agu, Ismini Lourentzou, Arjun Sharma, Joseph Alexander Paguio, Jasper Seth Yao, Edward C. Dee, William Mitchell, Satyananda Kashyap, Andrea Giovannini, Leo Anthony Celi, Mehdi Moradi. [doi]
- A Dataset for Answering Time-Sensitive Questions. Wenhu Chen, Xinyi Wang, William Yang Wang. [doi]
- WikiChurches: A Fine-Grained Dataset of Architectural Styles with Real-World Challenges. Björn Barz, Joachim Denzler. [doi]
- An Information Retrieval Approach to Building Datasets for Hate Speech Detection. Md. Mustafizur Rahman, Dinesh Balakrishnan, Dhiraj Murthy, Mücahid Kutlu, Matt Lease. [doi]
- A Channel Coding Benchmark for Meta-Learning. Rui Li, Ondrej Bohdal, Rajesh K. Mishra, Hyeji Kim, Da Li 0001, Nicholas D. Lane, Timothy M. Hospedales. [doi]
- BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. Nandan Thakur, Nils Reimers 0001, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych. [doi]
- The Tufts fNIRS Mental Workload Dataset & Benchmark for Brain-Computer Interfaces that Generalize. Zhe Huang, Liang Wang, Giles Blaney, Christopher Slaughter, Devon McKeon, Ziyu Zhou, Robert J. K. Jacob, Michael C. Hughes. [doi]
- Seasons in Drift: A Long Term Thermal Imaging Dataset for Studying Concept Drift. Ivan A. Nikolov, Mark Philip Philipsen, Jinsong Liu, Jacob V. Dueholm, Anders Johansen, Kamal Nasrollahi, Thomas B. Moeslund. [doi]
- Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting. Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr 0001, James Hays. [doi]
- RELLISUR: A Real Low-Light Image Super-Resolution Dataset. Andreas Aakerberg, Kamal Nasrollahi, Thomas B. Moeslund. [doi]
- Chaos as an interpretable benchmark for forecasting and data-driven modelling. William Gilpin. [doi]
- Programming Puzzles. Tal Schuster, Ashwin Kalyan, Alex Polozov, Adam Kalai. [doi]
- STEP: Segmenting and Tracking Every Pixel. Mark Weber, Jun Xie, Maxwell D. Collins, Yukun Zhu, Paul Voigtlaender, Bo Chen, Bradley Green, Andreas Geiger 0001, Bastian Leibe, Daniel Cremers, Aljosa Osep, Laura Leal-Taixé, Maxwell D. Collins. [doi]
- A Bilingual, OpenWorld Video Text Dataset and End-to-end Video Text Spotter with Transformer. Debing Zhang, Yuanqiang Cai, Sibo Wang 0009, Jiahong Li, Zhuang Li, Yejun Tang, Hong Zhou. [doi]
- Modeling Worlds in Text. Prithviraj Ammanabrolu, Mark O. Riedl. [doi]
- Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks. Curtis G. Northcutt, Anish Athalye, Jonas Mueller. [doi]
- ImageNet-21K Pretraining for the Masses. Tal Ridnik, Emanuel Ben Baruch, Asaf Noy, Lihi Zelnik. [doi]
- Few-Shot Learning Evaluation in Natural Language Understanding. Subhabrata Mukherjee, Xiaodong Liu, Guoqing Zheng, Saghar Hosseini, Hao Cheng 0002, Ge Yang, Christopher Meek, Ahmed Hassan Awadallah, Jianfeng Gao. [doi]
- What Would Jiminy Cricket Do? Towards Agents That Behave Morally. Dan Hendrycks, Mantas Mazeika, Andy Zou, Sahil Patel, Christine Zhu, Jesus Navarro, Dawn Song, Bo Li 0026, Jacob Steinhardt. [doi]
- KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects. Zhiyuan Tang, Dong Wang 0013, Yanguang Xu, Jianwei Sun, Xiaoning Lei, Shuaijiang Zhao, Cheng Wen, Xingjun Tan, Chuandong Xie, Shuran Zhou, Rui Yan, Chenjia Lv, Yang Han, Wei Zou, Xiangang Li. [doi]
- Variance-Aware Machine Translation Test Sets. Runzhe Zhan, Xuebo Liu 0002, Derek F. Wong, Lidia S. Chao. [doi]
- The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage. Daniel Galvez, Greg Diamos, Juan Torres, Keith Achorn, Juan Felipe Cerón, Anjali Gopi, David Kanter, Max Lam, Mark Mazumder, Vijay Janapa Reddi. [doi]
- WRENCH: A Comprehensive Benchmark for Weak Supervision. Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang 0001, Mao Yang, Alexander Ratner. [doi]
- One Million Scenes for Autonomous Driving: ONCE Dataset. Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Jingheng Chen, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang 0196, Zhenguo Li, Jie Yu, Chunjing Xu, Hang Xu. [doi]
- The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions. Jennifer J. Sun, Tomomi Karigo, Dipam Chakraborty, Sharada P. Mohanty, Benjamin Wild, Quan Sun, Chen Chen, David J. Anderson, Pietro Perona, Yisong Yue, Ann Kennedy. [doi]
- RadGraph: Extracting Clinical Entities and Relations from Radiology Reports. Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Q. H. Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang 0004, Matthew P. Lungren, Andrew Y. Ng, Curtis P. Langlotz, Pranav Rajpurkar. [doi]
- DEBAGREEMENT: A comment-reply dataset for (dis)agreement detection in online debates. John Pougué-Biyong, Valentina Semenova, Alexandre Matton, Rachel Han, Aerin Kim, Renaud Lambiotte, Doyne Farmer. [doi]
- Brax - A Differentiable Physics Engine for Large Scale Rigid Body Simulation. C. Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, Olivier Bachem. [doi]
- Timers and Such: A Practical Benchmark for Spoken Language Understanding with Numbers. Loren Lugosch, Piyush Papreja, Mirco Ravanelli, Abdelwahab Heba, Titouan Parcollet. [doi]
- VFP290K: A Large-Scale Benchmark Dataset for Vision-based Fallen Person Detection. Jaeju An, Jeongho Kim, Hanbeen Lee, Jinbeom Kim, Junhyung Kang, Minha Kim, Saebyeol Shin, Minha Kim, Donghee Hong, Simon S. Woo. [doi]
- WaveFake: A Data Set to Facilitate Audio Deepfake Detection. Joel Frank, Lea Schönherr. [doi]
- Benchmarking the Combinatorial Generalizability of Complex Query Answering on Knowledge Graphs. Zihao Wang, Hang Yin, Yangqiu Song. [doi]
- HiRID-ICU-Benchmark - A Comprehensive Machine Learning Benchmark on High-resolution ICU Data. Hugo Yèche, Rita Kuznetsova, Marc Zimmermann, Matthias Hüser, Xinrui Lyu, Martin Faltys, Gunnar Rätsch. [doi]
- Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks. Andrey Malinin, Neil Band, Yarin Gal, Mark J. F. Gales, Alexander Ganshin, German Chesnokov, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Denis Roginskiy, Mariya Shmatova, Panagiotis Tigas, Boris Yangel. [doi]
- LiRo: Benchmark and leaderboard for Romanian language tasks. Stefan Daniel Dumitrescu, Petru Rebeja, Beáta Lorincz, Mihaela Gaman, Andrei-Marius Avram, Mihai Ilie, Andrei Pruteanu, Adriana Stan, Lorena Rosia, Cristina Iacobescu, Luciana Morogan, George Dima, Gabriel Marchidan, Traian Rebedea, Madalina Chitez, Dani Yogatama, Sebastian Ruder, Radu-Tudor Ionescu, Razvan Pascanu, Viorica Patraucean. [doi]
- Task Agnostic and Task Specific Self-Supervised Learning from Speech with LeBenchmark. Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia A. Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Estève, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier. [doi]
- CSFCube - A Test Collection of Computer Science Research Articles for Faceted Query by Example. Sheshera Mysore, Tim O'Gorman, Andrew McCallum, Hamed Zamani. [doi]
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou 0001, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu 0001. [doi]
- CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms. Martin Pawelczyk, Sascha Bielawski, Johannes van den Heuvel, Tobias Richter, Gjergji Kasneci. [doi]
- OpenML Benchmarking Suites. Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael Gomes Mantovani, Jan N. van Rijn, Joaquin Vanschoren. [doi]
- Contemporary Symbolic Regression Methods and their Relative Performance. William G. La Cava, Patryk Orzechowski, Bogdan Burlacu, Fabrício Olivetti de França, Marco Virgolin, Ying Jin, Michael Kommenda, Jason H. Moore. [doi]
- ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation. Chuang Gan, Jeremy Schwartz, Seth Alter, Damian Mrowca, Martin Schrimpf, James Traer, Julian De Freitas, Jonas Kubilius, Abhishek Bhandwaldar, Nick Haber, Megumi Sano, Kuno Kim, Elias Wang, Michael Lingelbach, Aidan Curtis, Kevin T. Feigelis, Daniel Bear, Dan Gutfreund, David D. Cox, Antonio Torralba 0001, James J. DiCarlo, Josh Tenenbaum 0001, Josh H. McDermott, Dan Yamins. [doi]
- SKM-TEA: A Dataset for Accelerated MRI Reconstruction with Dense Image Labels for Quantitative Clinical Evaluation. Arjun D. Desai, Andrew M. Schmidt, Elka B. Rubin, Christopher M. Sandino, Marianne Black, Valentina Mazzoli, Kathryn J. Stevens, Robert Boutin, Christopher Ré, Garry Gold, Brian A. Hargreaves, Akshay Chaudhari. [doi]
- CropHarvest: A global dataset for crop-type classification. Gabriel Tseng, Ivan Zvonkov, Catherine Nakalembe, Hannah Kerner. [doi]
- Relational Pattern Benchmarking on the Knowledge Graph Link Prediction Task. Afshin Sadeghi, Hirra Malik, Diego Collarana, Jens Lehmann 0001. [doi]
- A Toolbox for Construction and Analysis of Speech Datasets. Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg. [doi]
- MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research. Mikayel Samvelyan, Robert Kirk, Vitaly Kurin, Jack Parker-Holder, Minqi Jiang, Eric Hambro, Fabio Petroni, Heinrich Küttler, Edward Grefenstette, Tim Rocktäschel. [doi]
- Constructing a Visual Dataset to Study the Effects of Spatial Apartheid in South Africa. Raesetje Sefala, Timnit Gebru, Nyalleng Moorosi, Luzango Mfupe, Richard Klein. [doi]
- DUE: End-to-End Document Understanding Benchmark. Lukasz Borchmann, Michal Pietruszka, Tomasz Stanislawek, Dawid Jurkiewicz, Michal Turski, Karolina Szyndler, Filip Gralinski. [doi]
- RAFT: A Real-World Few-Shot Text Classification Benchmark. Neel Alex, Eli Lifland, Lewis Tunstall, Abhishek Thakur, Pegah Maham, C. Jess Riedel, Emmie Hine, Carolyn Ashurst, Paul Sedille, Alexis Carlier, Michael Noetel, Andreas Stuhlmüller. [doi]
- Artsheets for Art Datasets. Ramya Srinivasan, Emily Denton, Jordan Famularo, Negar Rostamzadeh, Fernando Diaz, Beth Coleman. [doi]
- Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI. Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X. Chang, Manolis Savva, Yili Zhao, Dhruv Batra. [doi]
- FLIP: Benchmark tasks in fitness landscape inference for proteins. Christian Dallago, Jody Mou, Kadina E. Johnston, Bruce J. Wittmann, Nicholas Bhattacharya, Samuel Goldman, Ali Madani, Kevin Yang. [doi]
- Systematic Evaluation of Causal Discovery in Visual Model Based Reinforcement Learning. Nan Rosemary Ke, Aniket Didolkar, Sarthak Mittal, Anirudh Goyal, Guillaume Lajoie, Stefan Bauer, Danilo Jimenez Rezende, Michael Mozer, Yoshua Bengio, Chris Pal. [doi]
- A realistic approach to generate masked faces applied on two novel masked face recognition data sets. Tudor Mare, Georgian-Emilian Duta, Mariana-Iuliana Georgescu, Adrian Sandru, Bogdan Alexe, Marius Popescu, Radu-Tudor Ionescu. [doi]
- AI and the Everything in the Whole Wide World Benchmark. Inioluwa Deborah Raji, Emily Denton, Emily M. Bender, Alex Hanna, Amandalynne Paullada. [doi]
- ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation. Laurynas Karazija, Iro Laina, Christian Rupprecht 0001. [doi]
- Mitigating dataset harms requires stewardship: Lessons from 1000 papers. Kenneth Peng, Arunesh Mathur, Arvind Narayanan. [doi]
- Occluded Video Instance Segmentation: Dataset and ICCV 2021 Challenge. Jiyang Qi, Yan Gao, Yao Hu, Xinggang Wang, Xiaoyu Liu, Xiang Bai, Serge J. Belongie, Alan L. Yuille, Philip H. S. Torr, Song Bai. [doi]
- Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models. Boxin Wang, Chejian Xu, Shuohang Wang, Zhe Gan, Yu Cheng 0001, Jianfeng Gao, Ahmed Hassan Awadallah, Bo Li 0026. [doi]
- MQBench: Towards Reproducible and Deployable Model Quantization Benchmark. Yuhang Li, Mingzhu Shen, Jian Ma, Yan Ren, Mingxin Zhao, Qi Zhang, Ruihao Gong, Fengwei Yu, Junjie Yan. [doi]
- What Ails One-Shot Image Segmentation: A Data Perspective. Mayur Hemani, Abhinav Patel, Tejas Shimpi, Anirudha Ramesh, Balaji Krishnamurthy. [doi]
- ATOM3D: Tasks on Molecules in Three Dimensions. Raphael J. L. Townshend, Martin Vögele, Patricia Suriana, Alexander Derry, Alexander Powers, Yianni Laloudakis, Sidhika Balachandar, Bowen Jing, Brandon M. Anderson, Stephan Eismann, Risi Kondor, Russ B. Altman, Ron O. Dror. [doi]
- Which priors matter? Benchmarking models for learning latent dynamics. Aleksandar Botev, Andrew Jaegle, Peter Wirnsberger, Daniel Hennes, Irina Higgins. [doi]
- NaturalProofs: Mathematical Theorem Proving in Natural Language. Sean Welleck, Jiacheng Liu 0010, Ronan Le Bras, Hanna Hajishirzi, Yejin Choi, KyungHyun Cho. [doi]
- B-Pref: Benchmarking Preference-Based Reinforcement Learning. Kimin Lee, Laura Smith, Anca D. Dragan, Pieter Abbeel. [doi]
- Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning. Cameron Voloshin, Hoang Minh Le 0002, Nan Jiang 0008, Yisong Yue. [doi]
- Personalized Benchmarking with the Ludwig Benchmarking Toolkit. Avanika Narayan, Piero Molino, Karan Goel, Willie Neiswanger, Christopher Ré. [doi]
- Revisiting Time Series Outlier Detection: Definitions and Benchmarks. Kwei-Herng Lai, Daochen Zha, Junjie Xu, Yue Zhao, Guanchu Wang, Xia Hu. [doi]
- Hardware Design and Accurate Simulation of Structured-Light Scanning for Benchmarking of 3D Reconstruction Algorithms. Sebastian Koch, Yurii Piadyk, Markus Worchel, Marc Alexa, Claudio Silva, Denis Zorin, Daniele Panozzo. [doi]
- Benchmarks for Corruption Invariant Person Re-identification. Minghui Chen, Zhiqiang Wang, Feng Zheng. [doi]
- Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks. Neil Band, Tim G. J. Rudner, Qixuan Feng, Angelos Filos, Zachary Nado, Mike Dusenberry, Ghassen Jerfel, Dustin Tran, Yarin Gal. [doi]
- FS-Mol: A Few-Shot Learning Dataset of Molecules. Megan Stanley, John Bronskill, Krzysztof Maziarz, Hubert Misztela, Jessica Lanini, Marwin H. S. Segler, Nadine Schneider, Marc Brockschmidt. [doi]
- TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers. Lianmin Zheng, Ruochen Liu, Junru Shao, TianQi Chen, Joseph Gonzalez 0001, Ion Stoica, Ameer Haj Ali. [doi]
- PROCAT: Product Catalogue Dataset for Implicit Clustering, Permutation Learning and Structure Prediction. Mateusz Jurewicz, Leon Derczynski. [doi]
- A Large-Scale Database for Graph Representation Learning. Scott Freitas, Yuxiao Dong, Joshua Neil, Duen Horng Chau. [doi]
- Towards a robust experimental framework and benchmark for lifelong language learning. Aman Hussain, Nithin Holla, Pushkar Mishra, Helen Yannakoudakis, Ekaterina Shutova. [doi]
- RedCaps: Web-curated image-text data created by the people, for the people. Karan Desai, Gaurav Kaul, Zubin Aysola, Justin Johnson 0001. [doi]
- SciGen: a Dataset for Reasoning-Aware Text Generation from Scientific Tables. Nafise Sadat Moosavi, Andreas Rücklé, Dan Roth, Iryna Gurevych. [doi]
- RP-Mod & RP-Crowd: Moderator- and Crowd-Annotated German News Comment Datasets. Dennis Assenmacher, Marco Niemann, Kilian Müller, Moritz Seiler, Dennis M. Riehle, Heike Trautmann. [doi]
- CommonsenseQA 2.0: Exposing the Limits of AI through Gamification. Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, Jonathan Berant. [doi]
- Datasets for Online Controlled Experiments. Chak Hin Bryan Liu, Ângelo Cardoso, Paul Couturier, Emma J. McCoy. [doi]
- PASS: An ImageNet replacement for self-supervised pretraining without humans. Yuki M. Asano, Christian Rupprecht 0001, Andrew Zisserman, Andrea Vedaldi. [doi]
- KLUE: Korean Language Understanding Evaluation. Sungjoon Park, Jihyung Moon, Sungdong Kim, Won-Ik Cho, Jiyoon Han, Jangwon Park, Chisung Song, JunSeong Kim, Youngsook Song, Tae-Hwan Oh, Joohong Lee, Juhyun Oh, Sungwon Lyu, Younghoon Jeong, Inkwon Lee, Sangwoo Seo, Dongjun Lee, Hyunwoo Kim 0002, Myeonghwa Lee, Seongbo Jang, Seungwon Do, SunKyoung Kim, Kyungtae Lim, Jongwon Lee, Kyumin Park, Jamin Shin, Seonghyun Kim, Eunjeong Lucy Park, Alice Oh, Jung-Woo Ha 0001, KyungHyun Cho. [doi]
- Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning. Thomas Liao, Rohan Taori, Deborah Raji, Ludwig Schmidt. [doi]
- Benchmarking Bias Mitigation Algorithms in Representation Learning through Fairness Metrics. Charan Reddy, Deepak Sharma, Soroush Mehri, Adriana Romero-Soriano, Samira Shabanian, Sina Honari. [doi]
- Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions. Chenyu Yi, Siyuan Yang, Haoliang Li, Yap-Peng Tan, Alex C. Kot. [doi]
- Measuring Mathematical Problem Solving With the MATH Dataset. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt. [doi]
- CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge. Yasumasa Onoe, Michael J. Q. Zhang, Eunsol Choi, Greg Durrett. [doi]
- AP-10K: A Benchmark for Animal Pose Estimation in the Wild. Hang Yu, Yufei Xu, Jing Zhang 0037, Wei Zhao, Ziyu Guan, Dacheng Tao. [doi]
- CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer. Moein Sorkhei, Yue Liu, Hossein Azizpour, Edward Azavedo, Karin Dembrower, Dimitra Ntoula, Athanasios Zouzos, Fredrik Strand, Kevin Smith 0001. [doi]
- The CLEAR Benchmark: Continual LEArning on Real-World Imagery. Zhiqiu Lin, Jia Shi, Deepak Pathak, Deva Ramanan. [doi]
- Reinforcement Learning Benchmarks for Traffic Signal Control. James Ault, Guni Sharon. [doi]
- A Unified Few-Shot Classification Benchmark to Compare Transfer and Meta Learning Approaches. Vincent Dumoulin, Neil Houlsby, Utku Evci, Xiaohua Zhai, Ross Goroshin, Sylvain Gelly, Hugo Larochelle. [doi]
- Automatic Construction of Evaluation Suites for Natural Language Generation Datasets. Simon Mille, Kaustubh D. Dhole, Saad Mahamood, Laura Perez-Beltrachini, Varun Gangal, Mihir Kale, Emiel van Miltenburg, Sebastian Gehrmann. [doi]
- Measuring Coding Challenge Competence With APPS. Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, Jacob Steinhardt. [doi]
- FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos 0001, Christos Christodoulopoulos 0001, Oana Cocarascu, Arpit Mittal. [doi]
- SegmentMeIfYouCan: A Benchmark for Anomaly Segmentation. Robin Chan, Krzysztof Lis, Svenja Uhlemeyer, Hermann Blum, Sina Honari, Roland Siegwart, Pascal Fua, Mathieu Salzmann, Matthias Rottmann. [doi]
- COVID-19 Sounds: A Large-Scale Audio Dataset for Digital Respiratory Screening. Tong Xia, Dimitris Spathis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Jing Han 0010, Apinan Hasthanasombat, Erika Bondareva, Ting Dang, Andres Floto, Pietro Cicuta, Cecilia Mascolo. [doi]
- The PAIR-R24M Dataset for Multi-animal 3D Pose Estimation. Jesse Marshall, Ugne Klibaite, Amanda Gellis, Diego Aldarondo, Bence Olveczky, Timothy W. Dunn. [doi]
- Synthetic Benchmarks for Scientific Research in Explainable Machine Learning. Yang Liu, Sujay Khandagale, Colin White, Willie Neiswanger. [doi]
- A Procedural World Generation Framework for Systematic Evaluation of Continual Learning. Timm Hess, Martin Mundt, Iuliia Pliushch, Visvanathan Ramesh. [doi]
- LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, Yanfei Zhong. [doi]
- Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research. Bernard Koch, Emily Denton, Alex Hanna, Jacob G. Foster. [doi]
- RB2: Robotic Manipulation Benchmarking with a Twist. Sudeep Dasari, Jianren Wang, Joyce Hong, Shikhar Bahl, Yixin Lin, Austin S. Wang, Abitha Thankaraj, Karanbir Chahal, Berk Çalli, Saurabh Gupta 0001, David Held, Lerrel Pinto, Deepak Pathak, Vikash Kumar, Abhinav Gupta 0001. [doi]
- Alchemy: A benchmark and analysis toolkit for meta-reinforcement learning agents. Jane Wang 0001, Michael King, Nicolas Porcel, Zeb Kurth-Nelson, Tina Zhu, Charles Deck, Peter Choy, Mary Cassin, Malcolm Reynolds, H. Francis Song, Gavin Buttimore, David P. Reichert, Neil C. Rabinowitz, Loic Matthey, Demis Hassabis, Alexander Lerchner, Matt M. Botvinick. [doi]
- SynthBio: A Case Study in Faster Curation of Text Datasets. Ann Yuan, Daphne Ippolito, Vitaly Nikolaev, Chris Callison-Burch, Andy Coenen, Sebastian Gehrmann. [doi]
- SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning. Christopher Yeh, Chenlin Meng, Sherrie Wang, Anne Driscoll, Erik Rozi, Patrick Liu, Jihyeon Janel Lee, Marshall Burke, David B. Lobell, Stefano Ermon. [doi]
- Evaluating Bayes Error Estimators on Real-World Datasets with FeeBee. Cédric Renggli, Luka Rimanic, Nora Hollenstein, Ce Zhang 0001. [doi]
- DABS: a Domain-Agnostic Benchmark for Self-Supervised Learning. Alex Tamkin, Vincent Liu, Rongfei Lu, Daniel Fein, Colin Schultz, Noah D. Goodman. [doi]
- Pl@ntNet-300K: a plant image dataset with high label ambiguity and a long-tailed distribution. Camille Garcin, Alexis Joly, Pierre Bonnet, Antoine Affouard, Jean-Christophe Lombardo, Mathias Chouet, Maximilien Servajean, Titouan Lorieul, Joseph Salmon. [doi]
- IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. Pan Lu, Liang Qiu, Jiaqi Chen, Tanglin Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, Song Chun Zhu. [doi]
- Generating Datasets of 3D Garments with Sewing Patterns. Maria Korosteleva, Sung Hee Lee. [doi]
- BiToD: A Bilingual Multi-Domain Dataset For Task-Oriented Dialogue Modeling. Zhaojiang Lin, Andrea Madotto, Genta Indra Winata, Peng Xu 0008, Feijun Jiang, Yuxiang Hu, Chen Shi, Pascale Fung. [doi]
- Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine Reading Comprehension. Shusheng Xu, Yichen Liu, Xiaoyu Yi, Siyuan Zhou, Huizi Li, Yi Wu. [doi]
- ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models. Salva Rühling Cachay, Venkatesh Ramesh, Jason N. S. Cole, Howard Barker, David Rolnick. [doi]
- Physion: Evaluating Physical Prediction from Vision in Humans and Machines. Daniel Bear, Elias Wang, Damian Mrowca, Felix J. Binder, Hsiao-Yu Tung, R. T. Pramod, Cameron Holdaway, Sirui Tao, Kevin A. Smith, Fan-Yun Sun, Fei-Fei Li 0001, Nancy Kanwisher, Josh Tenenbaum 0001, Dan Yamins, Judith E. Fan. [doi]
- Addressing "Documentation Debt" in Machine Learning: A Retrospective Datasheet for BookCorpus. Jack Bandy, Nicholas Vincent. [doi]
- Benchmark for Compositional Text-to-Image Synthesis. Dong Huk Park, Samaneh Azadi, Xihui Liu, Trevor Darrell, Anna Rohrbach. [doi]
- Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development. Kexin Huang, Tianfan Fu, Wenhao Gao, Yue Zhao 0016, Yusuf Roohani, Jure Leskovec, Connor W. Coley, Cao Xiao, Jimeng Sun, Marinka Zitnik. [doi]
- Benchmarking Data-driven Surrogate Simulators for Artificial Electromagnetic Materials. Yang Deng, Juncheng Dong, Simiao Ren, Omar Khatib, Mohammadreza Soltani, Vahid Tarokh, Willie Padilla, Jordan M. Malof. [doi]
- CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review. Dan Hendrycks, Collin Burns, Anya Chen, Spencer Ball. [doi]
- A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning. Gaoussou Youssouf Kebe, Padraig Higgins, Patrick Jenkins, Kasra Darvish, Rishabh Sachdeva, Ryan Barron, John Winder, Don Engel, Edward Raff, Francis Ferraro, Cynthia Matuszek. [doi]
- MultiBench: Multiscale Benchmarks for Multimodal Representation Learning. Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, Ruslan Salakhutdinov, Louis-Philippe Morency. [doi]