- Truth or Error? Towards Systematic Analysis of Factual Errors in Abstractive Summaries. Klaus-Michael Lux, Maya Sappelli, Martha A. Larson. 1-10 [doi]
- Fill in the BLANC: Human-free Quality Estimation of Document Summaries. Oleg V. Vasilyev, Vedant Dharnidharka, John Bohannon. 11-20 [doi]
- Item Response Theory for Efficient Human Evaluation of Chatbots. João Sedoc, Lyle H. Ungar. 21-33 [doi]
- ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT. Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Kyomin Jung. 34-39 [doi]
- BLEU Neighbors: A Reference-less Approach to Automatic Evaluation. Kawin Ethayarajh, Dorsa Sadigh. 40-50 [doi]
- Improving Text Generation Evaluation with Batch Centering and Tempered Word Mover Distance. Xi Chen, Nan Ding, Tomer Levinboim, Radu Soricut. 51-59 [doi]
- On the Evaluation of Machine Translation n-best Lists. Jacob Bremerman, Huda Khayrallah, Douglas W. Oard, Matt Post. 60-68 [doi]
- Artemis: A Novel Annotation Methodology for Indicative Single Document Summarization. Rahul Jha, Keping Bi, Yang Li, Mahdi Pakdaman, Asli Celikyilmaz, Ivan Zhiboedov, Kieran McDonald. 69-78 [doi]
- Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models. Reda Yacouby, Dustin Axman. 79-91 [doi]
- A Survey on Recognizing Textual Entailment as an NLP Evaluation. Adam Poliak. 92-109 [doi]
- Grammaticality and Language Modelling. Jingcheng Niu, Gerald Penn. 110-119 [doi]
- One of These Words Is Not Like the Other: A Reproduction of Outlier Identification Using Non-contextual Word Representations. Jesper Brink Andersen, Mikkel Bak Bertelsen, Mikkel Hørby Schou, Manuel R. Ciosici, Ira Assent. 120-130 [doi]
- Are Some Words Worth More than Others? Shiran Dudy, Steven Bedrick. 131-142 [doi]
- On Aligning OpenIE Extractions with Knowledge Bases: A Case Study. Kiril Gashteovski, Rainer Gemulla, Bhushan Kotnis, Sven Hertling, Christian Meilicke. 143-154 [doi]
- ClusterDataSplit: Exploring Challenging Clustering-Based Data Splits for Model Performance Evaluation. Hanna Wecker, Annemarie Friedrich, Heike Adel. 155-163 [doi]
- Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation. Neslihan Iskender, Tim Polzehl, Sebastian Möller. 164-175 [doi]
- Evaluating Word Embeddings on Low-Resource Languages. Nathan Stringham, Mike Izbicki. 176-186 [doi]