Eval4NLP (eval4nlp)
Publications
75 publications in total.
2023
Reference-Free Summarization Evaluation with Large Language Models. Abbas Akkasi, Kathleen C. Fraser, Majid Komeili. eval4nlp 2023: 193-201 [doi]
LTRC_IIITH's 2023 Submission for Prompting Large Language Models as Explainable Metrics Task. Pavan Baswani, Ananya Mukherjee, Manish Shrivastava 0001. eval4nlp 2023: 156-163 [doi]
Large Language Models As Annotators: A Preliminary Evaluation For Annotating Low-Resource Language Content. Savita Bhat, Vasudeva Varma. eval4nlp 2023: 100-107 [doi]
Summary Cycles: Exploring the Impact of Prompt Engineering on Large Language Models' Interaction with Interaction Log Information. Jeremy Block, Yu-Peng Chen, Abhilash Budharapu, Lisa Anthony, Bonnie J. Dorr. eval4nlp 2023: 85-99 [doi]
Transformers Go for the LOLs: Generating (Humourous) Titles from Scientific Abstracts End-to-End. Yanran Chen, Steffen Eger. eval4nlp 2023: 62-84 [doi]
Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems, Eval4NLP 2023, Bali, Indonesia, November 1, 2023. Daniel Deutsch, Rotem Dror, Steffen Eger, Yang Gao 0021, Christoph Leiter, Juri Opitz, Andreas Rücklé, editors. Association for Computational Linguistics, 2023. [doi]
Can a Prediction's Rank Offer a More Accurate Quantification of Bias? A Case Study Measuring Sexism in Debiased Language Models. Jad Doughman, Shady Shehata, Leen Al Qadi, Youssef Nafea, Fakhri Karray. eval4nlp 2023: 108-116 [doi]
Which is better? Exploring Prompting Strategy For LLM-based Metrics. Joonghoon Kim, Sangmin Lee, Seung-Hun Han, Saeran Park, Jiyoon Lee, Kiyoon Jeong, Pilsung Kang 0001. eval4nlp 2023: 164-183 [doi]
EduQuick: A Dataset Toward Evaluating Summarization of Informal Educational Content for Social Media. Zahra Kolagar, Sebastian Steindl, Alessandra Zarcone. eval4nlp 2023: 32-48 [doi]
Little Giants: Exploring the Potential of Small LLMs as Evaluation Metrics in Summarization in the Eval4NLP 2023 Shared Task. Neema Kotonya, Saran Krishnasamy, Joel R. Tetreault, Alejandro Jaimes. eval4nlp 2023: 202-218 [doi]
Team NLLG submission for Eval4NLP 2023 Shared Task: Retrieval-Augmented In-Context Learning for NLG Evaluation. Daniil Larionov, Vasiliy Viskov, George Kokush, Alexander Panchenko, Steffen Eger. eval4nlp 2023: 228-234 [doi]
The Eval4NLP 2023 Shared Task on Prompting Large Language Models as Explainable Metrics. Christoph Leiter, Juri Opitz, Daniel Deutsch, Yang Gao 0021, Rotem Dror, Steffen Eger. eval4nlp 2023: 117-138 [doi]
Characterised LLMs Affect its Evaluation of Summary and Translation. Yuan Lu, Yu-Ting Lin. eval4nlp 2023: 184-192 [doi]
Exploring Prompting Large Language Models as Explainable Metrics. Ghazaleh Mahmoudi. eval4nlp 2023: 219-227 [doi]
Understanding Large Language Model Based Metrics for Text Summarization. Abhishek Pradhan, Ketan Kumar Todi. eval4nlp 2023: 149-155 [doi]
Assessing Distractors in Multiple-Choice Tests. Vatsal Raina, Adian Liusie, Mark J. F. Gales. eval4nlp 2023: 12-22 [doi]
Zero-shot Probing of Pretrained Language Models for Geography Knowledge. Nitin Ramrakhiyani, Vasudeva Varma, Girish K. Palshikar, Sachin Pawar. eval4nlp 2023: 49-61 [doi]
Delving into Evaluation Metrics for Generation: A Thorough Assessment of How Metrics Generalize to Rephrasing Across Languages. Yixuan Wang, Qingyan Chen, Duygu Ataman. eval4nlp 2023: 23-31 [doi]
WRF: Weighted Rouge-F1 Metric for Entity Recognition. Lukas Weber, Krishnan Jothi Ramalingam, Matthias Beyer, Axel Zimmermann 0005. eval4nlp 2023: 1-11 [doi]
HIT-MI&T Lab's Submission to Eval4NLP 2023 Shared Task. Rui Zhang, Fuhai Song, Hui Huang, Jinghao Yuan, Muyun Yang, Tiejun Zhao. eval4nlp 2023: 139-148 [doi]
2022
Why is sentence similarity benchmark not predictive of application-oriented task performance? Kaori Abe, Sho Yokoi, Tomoyuki Kajiwara, Kentaro Inui. eval4nlp 2022: 70-87 [doi]
Assessing Neural Referential Form Selectors on a Realistic Multilingual Dataset. Guanyi Chen, Fahime Same, Kees van Deemter. eval4nlp 2022: 103-114 [doi]
GLARE: Generative Left-to-right AdversaRial Examples. Ryan Chi, Nathan Kim, Patrick Liu, Zander Lack, Ethan A. Chi. eval4nlp 2022: 44-50 [doi]
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems, Eval4NLP 2022, Online, November 20, 2022. Daniel Deutsch, Can Udomcharoenchaikit, Juri Opitz, Yang Gao 0021, Marina Fomicheva, Steffen Eger, editors. Association for Computational Linguistics, 2022. [doi]
A Comparative Analysis of Stance Detection Approaches and Datasets. Parush Gera, Tempestt J. Neal. eval4nlp 2022: 58-69 [doi]
A Japanese Corpus of Many Specialized Domains for Word Segmentation and Part-of-Speech Tagging. Shohei Higashiyama, Masao Ideuchi, Masao Utiyama, Yoshiaki Oida, Eiichiro Sumita. eval4nlp 2022: 1-10 [doi]
From COMET to COMES - Can Summary Evaluation Benefit from Translation Evaluation? Mateusz Krubiński, Pavel Pecina. eval4nlp 2022: 21-31 [doi]
Chat Translation Error Detection for Assisting Cross-lingual Communications. Yunmeng Li, Jun Suzuki, Makoto Morishita, Kaori Abe, Ryoko Tokuhisa, Ana Brassard, Kentaro Inui. eval4nlp 2022: 88-95 [doi]
Better Smatch = Better Parser? AMR evaluation is not so simple anymore. Juri Opitz, Anette Frank. eval4nlp 2022: 32-43 [doi]
Evaluating the role of non-lexical markers in GPT-2's language modeling behavior. Roberta Rocca, Alejandro de la Vega. eval4nlp 2022: 96-102 [doi]
Random Text Perturbations Work, but not Always. Zhengxiang Wang. eval4nlp 2022: 51-57 [doi]
Assessing Resource-Performance Trade-off of Natural Language Models using Data Envelopment Analysis. Shohei Zhou, Alisha Zachariah, Devin Conathan, Jeffery Kline. eval4nlp 2022: 11-20 [doi]
2021
ESTIME: Estimation of Summary-to-Text Inconsistency by Mismatched Embeddings. Oleg V. Vasilyev 0001, John Bohannon. eval4nlp 2021: 94-103 [doi]
Validating Label Consistency in NER Data Annotation. Qingkai Zeng 0001, Mengxia Yu, Wenhao Yu 0002, Tianwen Jiang, Meng Jiang 0001. eval4nlp 2021: 11-15 [doi]
MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation. Ayush Garg 0001, Sammed S. Kagi, Vivek Srivastava, Mayank Singh 0001. eval4nlp 2021: 123-132 [doi]
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, Eval4NLP 2021, Punta Cana, Dominican Republic, November 10, 2021. Yang Gao 0021, Steffen Eger, Wei Zhao 0033, Piyawat Lertvittayakumjorn, Marina Fomicheva, editors. Association for Computational Linguistics, 2021. [doi]
Statistically Significant Detection of Semantic Shifts using Contextual Word Embeddings. Yang Liu 0254, Alan Medlar, Dorota Glowacka. eval4nlp 2021: 104-113 [doi]
Error-Sensitive Evaluation for Ordinal Target Variables. David Chen, Maury Courtland, Adam Faulkner, Aysu Ezen-Can. eval4nlp 2021: 189-199 [doi]
Evaluation of Unsupervised Automatic Readability Assessors Using Rank Correlations. Yo Ehara. eval4nlp 2021: 62-72 [doi]
Explaining Errors in Machine Translation with Absolute Gradient Ensembles. Melda Eksi, Erik Gelbing, Jonathan Stieber, Chi Viet Vu. eval4nlp 2021: 238-249 [doi]
The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results. Marina Fomicheva, Piyawat Lertvittayakumjorn, Wei Zhao 0033, Steffen Eger, Yang Gao 0021. eval4nlp 2021: 165-178 [doi]
Trainable Ranking Models to Evaluate the Semantic Accuracy of Data-to-Text Neural Generator. Nicolas Garneau, Luc Lamontagne. eval4nlp 2021: 51-61 [doi]
Differential Evaluation: a Qualitative Analysis of Natural Language Processing System Behavior Based Upon Data Resistance to Processing. Lucie Gianola, Hicham El Boukkouri, Cyril Grouin, Thomas Lavergne, Patrick Paroubek, Pierre Zweigenbaum. eval4nlp 2021: 1-10 [doi]
The UMD Submission to the Explainable MT Quality Estimation Shared Task: Combining Explanation Models with Sequence Labeling. Tasnim Kabir, Marine Carpuat. eval4nlp 2021: 230-237 [doi]
How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task. Urja Khurana, Eric T. Nalisnick, Antske Fokkens. eval4nlp 2021: 16-31 [doi]
Reference-Free Word- and Sentence-Level Translation Evaluation with Token-Matching Metrics. Christoph Wolfgang Leiter. eval4nlp 2021: 157-164 [doi]
Testing Cross-Database Semantic Parsers With Canonical Utterances. Heather Lent, Semih Yavuz, Tao Yu, Tong Niu, Yingbo Zhou, Dragomir Radev, Xi Victoria Lin. eval4nlp 2021: 73-83 [doi]
Referenceless Parsing-Based Evaluation of AMR-to-English Generation. Emma Manning, Nathan Schneider 0001. eval4nlp 2021: 114-122 [doi]
Developing a Benchmark for Reducing Data Bias in Authorship Attribution. Benjamin Murauer, Günther Specht. eval4nlp 2021: 179-188 [doi]
SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation. Chester Palen-Michel, Nolan Holley, Constantine Lignos. eval4nlp 2021: 40-50 [doi]
Explainable Quality Estimation: CUNI Eval4NLP Submission. Peter Polák, Muskaan Singh, Ondrej Bojar. eval4nlp 2021: 250-255 [doi]
Error Identification for Machine Translation with Metric Embedding and Attention. Raphael Rubino, Atsushi Fujita, Benjamin Marie. eval4nlp 2021: 146-156 [doi]
HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text. Vivek Srivastava, Mayank Singh 0001. eval4nlp 2021: 200-208 [doi]
Writing Style Author Embedding Evaluation. Enzo Terreau, Antoine Gourru, Julien Velcin. eval4nlp 2021: 84-93 [doi]
StoryDB: Broad Multi-language Narrative Dataset. Alexey Tikhonov, Igor Samenko, Ivan P. Yamshchikov. eval4nlp 2021: 32-39 [doi]
IST-Unbabel 2021 Submission for the Explainable Quality Estimation Shared Task. Marcos V. Treviso, Nuno Miguel Guerreiro, Ricardo Rei, André F. T. Martins. eval4nlp 2021: 133-145 [doi]
What is SemEval evaluating? A Systematic Analysis of Evaluation Campaigns in NLP. Oskar Wysocki, Malina Florea, Dónal Landers, André Freitas. eval4nlp 2021: 209-229 [doi]
2020
Fill in the BLANC: Human-free quality estimation of document summaries. Oleg V. Vasilyev 0001, Vedant Dharnidharka, John Bohannon. eval4nlp 2020: 11-20 [doi]
Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance. Xi Chen 0071, Nan Ding 0002, Tomer Levinboim, Radu Soricut. eval4nlp 2020: 51-59 [doi]
One of these words is not like the other: a reproduction of outlier identification using non-contextual word representations. Jesper Brink Andersen, Mikkel Bak Bertelsen, Mikkel Hørby Schou, Manuel R. Ciosici, Ira Assent. eval4nlp 2020: 120-130 [doi]
On the Evaluation of Machine Translation n-best Lists. Jacob Bremerman, Huda Khayrallah, Douglas W. Oard, Matt Post. eval4nlp 2020: 60-68 [doi]
Are Some Words Worth More than Others? Shiran Dudy, Steven Bedrick. eval4nlp 2020: 131-142 [doi]
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Eval4NLP 2020, Online, November 20, 2020. Steffen Eger, Yang Gao 0021, Maxime Peyrard, Wei Zhao 0033, Eduard H. Hovy, editors. Association for Computational Linguistics, 2020. [doi]
BLEU Neighbors: A Reference-less Approach to Automatic Evaluation. Kawin Ethayarajh, Dorsa Sadigh. eval4nlp 2020: 40-50 [doi]
On Aligning OpenIE Extractions with Knowledge Bases: A Case Study. Kiril Gashteovski, Rainer Gemulla, Bhushan Kotnis, Sven Hertling, Christian Meilicke. eval4nlp 2020: 143-154 [doi]
Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation. Neslihan Iskender, Tim Polzehl, Sebastian Möller 0001. eval4nlp 2020: 164-175 [doi]
Artemis: A Novel Annotation Methodology for Indicative Single Document Summarization. Rahul Jha, Keping Bi, Yang Li, Mahdi Pakdaman, Asli Celikyilmaz, Ivan Zhiboedov, Kieran McDonald. eval4nlp 2020: 69-78 [doi]
ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT. Hwanhee Lee, Seunghyun Yoon 0002, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Kyomin Jung. eval4nlp 2020: 34-39 [doi]
Truth or Error? Towards systematic analysis of factual errors in abstractive summaries. Klaus-Michael Lux, Maya Sappelli, Martha A. Larson. eval4nlp 2020: 1-10 [doi]
Grammaticality and Language Modelling. Jingcheng Niu, Gerald Penn. eval4nlp 2020: 110-119 [doi]
A survey on Recognizing Textual Entailment as an NLP Evaluation. Adam Poliak. eval4nlp 2020: 92-109 [doi]
Item Response Theory for Efficient Human Evaluation of Chatbots. João Sedoc, Lyle H. Ungar. eval4nlp 2020: 21-33 [doi]
Evaluating Word Embeddings on Low-Resource Languages. Nathan Stringham, Mike Izbicki. eval4nlp 2020: 176-186 [doi]
ClusterDataSplit: Exploring Challenging Clustering-Based Data Splits for Model Performance Evaluation. Hanna Wecker, Annemarie Friedrich, Heike Adel. eval4nlp 2020: 155-163 [doi]
Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models. Reda Yacouby, Dustin Axman. eval4nlp 2020: 79-91 [doi]