Associated Publication: Gaurav Baruah, Adam Roegiest, and Mark D. Smucker, "The Effect of Expanding Relevance Judgements with Duplicates", SIGIR 2014, 4 pages. [PDF] [Poster]
Test Collection: Temporal Summarization Track 2013
Corpus: KBA Stream Corpus 2013
We examine the effects of expanding a judged set of sentences with their exact duplicates from a corpus. Including new sentences that are exact duplicates of previously judged sentences may allow for better estimation of performance metrics and enhance the reusability of a test collection. We perform experiments in the context of the Temporal Summarization Track at TREC 2013. We find that adding duplicate sentences to the judged set does not significantly affect relative system performance. However, we do find statistically significant changes in the performance of nearly half the systems that participated in the Track. We recommend adding exact duplicate sentences to the set of relevance judgements in order to obtain a more accurate estimate of system performance.
The Temporal Summarization track requires that runs submit the sentence-ids of sentences that constitute a temporal summary for a topic. The sentences were pooled from the submitted runs and assessed by NIST, resulting in a judged set of sentences. The runs were then evaluated for their relative performance on the temporal summarization task. Some runs submitted almost 2 million sentences; however, only 9,113 sentences were pooled and judged. Following Sakai et al., the track's evaluation elides unjudged sentences.
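The elision step amounts to dropping unjudged sentences from a run before scoring, so that systems are neither rewarded nor penalized for returning sentences the assessors never saw. A minimal sketch of this idea (an illustration only, not the track's actual evaluation script; the names are hypothetical):

```python
def condense(run, judged_ids):
    """Return the run with unjudged sentences elided.

    run        -- list of sentence-ids in the order the system returned them
    judged_ids -- set of sentence-ids that were pooled and judged
    """
    return [sid for sid in run if sid in judged_ids]

# A run returning five sentences, of which only three were judged,
# is condensed to those three (in their original order).
run = ["s1", "s4", "s2", "s9", "s3"]
judged = {"s1", "s2", "s3"}
print(condense(run, judged))  # -> ['s1', 's2', 's3']
```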
We observed that there are a large number of duplicate sentences in the KBA Stream Corpus 2013. In our SIGIR 2014 short paper, we addressed the question, "How would evaluation be affected if a run returned an exact duplicate of a judged sentence?". For a judged sentence, an exact duplicate sentence has the same content but a different sentence-id. The exact duplicate sentence may be returned earlier in time than the judged sentence, which may result in a better evaluation score for the run.
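Exact duplicates of this kind can be found by keying every corpus sentence on a hash of its text and grouping sentence-ids that share a key. The sketch below illustrates the idea under assumed data structures (dicts mapping sentence-id to text); the function names and whitespace normalization are our assumptions, not the paper's actual pipeline:

```python
import hashlib

def sentence_key(text):
    """Hash the whitespace-normalized sentence text.
    Two sentences with the same key have identical content."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

def find_duplicates(judged, corpus):
    """Map each judged sentence-id to the corpus sentence-ids
    that are exact duplicates of it (same text, different id).

    judged, corpus -- dicts: sentence-id -> sentence text
    """
    by_key = {}
    for sid, text in corpus.items():
        by_key.setdefault(sentence_key(text), []).append(sid)
    return {
        jid: [sid for sid in by_key.get(sentence_key(text), []) if sid != jid]
        for jid, text in judged.items()
    }

judged = {"d1": "The quake struck at dawn."}
corpus = {
    "d1": "The quake struck at dawn.",
    "d9": "The quake struck at dawn.",   # exact duplicate, different id
    "d5": "Aid agencies responded.",
}
print(find_duplicates(judged, corpus))  # -> {'d1': ['d9']}
```

The duplicate ids found this way can then be appended to the qrels with the same relevance information as the judged sentence they match.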
We extracted all duplicate sentences of the judged set (pool) from the corpus and added their sentence-ids to the track's qrels. We found that there are approximately 9 million duplicates of the 9,113 judged sentences. Although the relative performance of the submitted runs is not significantly affected (Kendall's tau = 0.899), the absolute performance (evaluation score) changes significantly (p <= 0.05 for a paired t-test) for 13 of the 28 submitted runs.
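Kendall's tau here measures how consistently the expanded qrels rank the systems relative to the original qrels: it counts concordant versus discordant pairs of systems across the two score lists. A self-contained sketch with illustrative (made-up) scores, not the actual track results:

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a between two score lists over the same systems."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for five systems under the original and the
# duplicate-expanded qrels; one pair of systems swaps order.
original = [0.31, 0.27, 0.22, 0.18, 0.10]
expanded = [0.35, 0.30, 0.20, 0.21, 0.11]
print(kendall_tau(original, expanded))  # -> 0.8
```

A tau near 1 (such as the 0.899 reported above) indicates that the system ordering is largely preserved even though the absolute scores shift.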
The Temporal Summarization track qrels are available from the TREC website.
Note that the qrels consist of 3 files: pooled_updates.tsv, matches.tsv, and nuggets.tsv. The evaluation script is
For evaluation using the (duplicate-)expanded pool, gunzip expanded-pool.gz and use it in place of