Judged Set Expanded with Exact Duplicates from the Corpus for the Temporal Summarization Track 2013

Associated Publication: Gaurav Baruah, Adam Roegiest, and Mark D. Smucker, "The Effect of Expanding Relevance Judgements with Duplicates", SIGIR 2014, 4 pages [PDF] [Poster]
Test Collection: Temporal Summarization Track 2013
Corpus: KBA Stream Corpus 2013
Data File: expanded-pool.gz (300MB gzipped) [Download]

Abstract

We examine the effects of expanding a judged set of sentences with their duplicates from a corpus. Including new sentences that are exact duplicates of the previously judged sentences may allow for better estimation of performance metrics and enhance the reusability of a test collection. We perform experiments in the context of the Temporal Summarization Track at TREC 2013. We find that adding duplicate sentences to the judged set does not significantly affect relative system performance. However, we do find statistically significant changes in the performance of nearly half the systems that participated in the Track. We recommend adding exact duplicate sentences to the set of relevance judgements in order to obtain a more accurate estimate of system performance.


Brief Description

The Temporal Summarization track [1] requires that runs submit the sentence-ids of sentences that constitute a temporal summary for a topic. The submitted sentences were pooled across runs and assessed by NIST, resulting in a judged set of sentences. The runs were then evaluated for their relative performance on the Temporal Summarization task. Some runs submitted almost 2 million sentences; however, only 9,113 sentences were pooled and judged. Following Sakai et al. [3], the track's evaluation elides the unjudged sentences, scoring each run only over its judged sentences.
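As an illustration of this condensed-list style of scoring, the minimal sketch below filters a run's submitted sentence-ids down to the judged set before any metric is computed. The data structures and sentence-ids are hypothetical and are not the track's actual file formats.

    # Minimal sketch of condensed-list evaluation: unjudged sentences are
    # dropped from a run before scoring, so only judged sentences contribute.
    def condense(run_sentence_ids, judged_sentence_ids):
        # Elide sentences that were never pooled and judged.
        judged = set(judged_sentence_ids)
        return [sid for sid in run_sentence_ids if sid in judged]

    # Hypothetical example: only the two judged sentence-ids survive.
    run = ["1347000000-abc-3", "1347000500-def-7", "1347001000-ghi-1"]
    judged = ["1347000000-abc-3", "1347001000-ghi-1"]
    print(condense(run, judged))  # ['1347000000-abc-3', '1347001000-ghi-1']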

We observed that there are a large number of duplicate sentences in the KBA Stream Corpus 2013 [2]. In our SIGIR 2014 short paper [4], we addressed the question, "How would evaluation be affected if a run returned an exact duplicate of a judged sentence?" For a judged sentence, an exact duplicate is a sentence with the same content but a different sentence-id. An exact duplicate may be returned earlier in time than the judged sentence, which may result in a better evaluation score for the run.
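One plausible way to locate such exact duplicates is to hash the lightly normalized text of every judged sentence and then scan the corpus for sentences with matching hashes, as in the sketch below. The normalization, function names, and in-memory structures are assumptions for illustration; this is not the extraction code used to build the data file.

    import hashlib

    def text_key(sentence_text):
        # Collapse whitespace and hash the sentence text.
        normalized = " ".join(sentence_text.split())
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def find_duplicates(judged, corpus_stream):
        # judged: {sentence_id: text}; corpus_stream: iterable of (sentence_id, text).
        # Returns {judged sentence_id: [duplicate sentence-ids from the corpus]}.
        key_to_judged = {text_key(text): sid for sid, text in judged.items()}
        duplicates = {sid: [] for sid in judged}
        for sid, text in corpus_stream:
            match = key_to_judged.get(text_key(text))
            if match is not None and sid != match:
                duplicates[match].append(sid)
        return duplicates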

We extracted all duplicates of the judged set (pool) from the corpus and added their sentence-ids to the track's qrels. We found approximately 9 million duplicates of the 9,113 judged sentences. Although the relative performance of the submitted runs is not significantly affected (Kendall's tau = 0.899), the absolute performance (evaluation score) changes significantly (p ≤ 0.05, paired t-test) for 13 of the 28 submitted runs.
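Given per-run scores computed under both the original and the expanded pool, the two comparisons above can be reproduced roughly as in the following sketch. The scores shown are hypothetical placeholders, and scipy is used only for convenience; it is not part of the track's tseval.py.

    from scipy.stats import kendalltau, ttest_rel

    # Hypothetical mean score per run under each pool.
    original = {"runA": 0.31, "runB": 0.27, "runC": 0.24}
    expanded = {"runA": 0.33, "runB": 0.30, "runC": 0.24}

    runs = sorted(original)
    tau, _ = kendalltau([original[r] for r in runs], [expanded[r] for r in runs])
    print("Kendall's tau between system rankings:", tau)

    # Per-run significance: pair one run's per-topic scores under both pools.
    per_topic_original = [0.20, 0.40, 0.30, 0.50]  # hypothetical topic scores
    per_topic_expanded = [0.25, 0.42, 0.30, 0.55]
    t_stat, p_value = ttest_rel(per_topic_original, per_topic_expanded)
    print("Paired t-test p-value:", p_value)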


Usage

The Temporal Summarization track qrels are available from the TREC website. Note that the qrels consist of three files: pooled_updates.tsv, matches.tsv, and nuggets.tsv. The evaluation script is tseval.py.

For evaluation using the (duplicate-)expanded pool, gunzip expanded-pool.gz and use the resulting file in place of pooled_updates.tsv with tseval.py.
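For reference, a minimal Python equivalent of the gunzip step, assuming the file names above (expanded-pool.tsv is an assumed name for the decompressed output); the tseval.py invocation itself is unchanged apart from substituting the decompressed file for pooled_updates.tsv.

    import gzip
    import shutil

    # Decompress expanded-pool.gz to a TSV that can be passed to tseval.py
    # in place of pooled_updates.tsv.
    with gzip.open("expanded-pool.gz", "rb") as src:
        with open("expanded-pool.tsv", "wb") as dst:
            shutil.copyfileobj(src, dst)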

Download expanded-pool.gz (300MB gzipped)

References

  1. J. Aslam, F. Diaz, M. Ekstrand-Abueg, V. Pavlu, and T. Sakai. TREC 2013 Temporal Summarization. In TREC, 2013.
  2. KBA Stream Corpus 2013. http://trec-kba.org/kba-stream-corpus-2013.shtml.
  3. T. Sakai and N. Kando. On information retrieval metrics designed for evaluation with incomplete relevance assessments. Information Retrieval, 2008.
  4. G. Baruah, A. Roegiest, and M. D. Smucker. The Effect of Expanding Relevance Judgements with Duplicates. In SIGIR, 2014.