Corpus of Word Importance Annotations

About the project

The Switchboard Corpus contains approximately 260 hours of recorded speech from about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from across the United States. In January 2003, the Institute for Signal and Information Processing (ISIP) released written transcripts for the entire corpus, which comprises nearly 400,000 conversational turns. The ISIP transcripts include a complete lexicon list and automatic word-alignment timing corresponding to the original audio files. In our project, a pair of annotators has assigned word-importance scores to these transcripts. As of September 2017, they have annotated over 25,000 tokens, with an overlap of approximately 3,100 tokens. We announce the release of these annotations as a set of supplementary files aligned to the ISIP transcripts. Our annotation work continues, and we aim to annotate the entire Switchboard Corpus with a larger group of annotators.

Corpus Available for Download

Word Importance Annotations (.zip file)

Release History

September 2017

Below are the files distributed in this release:
- 2005
- 2191
- 2222
- 2348
- 2450
- 2565
- 2636
- 2710
- 2886
- 3044
- 3083
- 3203
- 3301
- 3324
- 3601
- 3817
- 4010
- 4021
- 4320
- 4400
- 4531
- 4721

Linguistic and Assistive Technologies Laboratory