Validating point of cause

Finally, TIMIT includes demographic data about the speakers, permitting fine-grained study of vocal, social, and gender characteristics.

TIMIT illustrates several key features of corpus design.

In general, a text or speech corpus may be annotated at many different linguistic levels, including morphological, syntactic, and discourse levels.

A lexicon could also be phrasal, where the key field is a phrase rather than a single word.

A thesaurus also consists of record-structured data, where we look up entries via non-key fields that correspond to topics.
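The difference between key and non-key lookup can be sketched in a few lines of Python. This is a minimal illustration, not a real lexicon format: the entries and the field names (`pos`, `gloss`, `topics`) are hypothetical. The records are keyed by a word or phrase, and a second index is built to look entries up by topic, as a thesaurus would.

```python
from collections import defaultdict

# Hypothetical toy records; the field names are illustrative assumptions.
# The key field may be a single word or a whole phrase.
lexicon = {
    "kick the bucket": {"pos": "idiom", "gloss": "die", "topics": ["death"]},
    "pass away": {"pos": "verb phrase", "gloss": "die", "topics": ["death"]},
    "sprint": {"pos": "verb", "gloss": "run fast", "topics": ["motion"]},
}


def build_topic_index(records):
    """Invert the records so entries can be found via a non-key field (topic)."""
    index = defaultdict(list)
    for key, record in records.items():
        for topic in record["topics"]:
            index[topic].append(key)
    return dict(index)


topic_index = build_topic_index(lexicon)

# Key lookup (lexicon style) and non-key lookup (thesaurus style).
print(lexicon["kick the bucket"]["gloss"])
print(topic_index["death"])
```

The records are stored once; the topic index only holds keys that point back into them, so both kinds of lookup share the same data.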

Moreover, even at a given level there may be different labeling schemes or even disagreement amongst annotators, such that we want to represent multiple versions.
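One way to keep multiple versions side by side is to store each annotation layer under a (level, annotator) pair. The sketch below assumes a simple token-aligned representation; the annotator names and the tags they assign are hypothetical, chosen only to show a disagreement.

```python
# Tokens of a short utterance; each annotation layer has one label per token.
tokens = ["she", "had", "your", "dark", "suit"]

# Hypothetical layers, keyed by (linguistic level, annotator).
annotations = {
    ("pos", "annotator_a"): ["PRP", "VBD", "PRP$", "JJ", "NN"],
    ("pos", "annotator_b"): ["PRP", "VBD", "PRP$", "NN", "NN"],
}


def disagreements(tokens, layer_a, layer_b):
    """Return (index, token, label_a, label_b) wherever two layers differ."""
    return [
        (i, tok, a, b)
        for i, (tok, a, b) in enumerate(zip(tokens, layer_a, layer_b))
        if a != b
    ]


diffs = disagreements(
    tokens,
    annotations[("pos", "annotator_a")],
    annotations[("pos", "annotator_b")],
)
print(diffs)
```

Because no layer overwrites another, further levels or alternative labeling schemes can be added under new keys without touching the existing versions.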

A second property of TIMIT is its balance across multiple dimensions of variation, for coverage of dialect regions and diphones.
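Balance of this kind can be checked by counting how many items fall into each combination of the dimensions of interest. The following sketch uses invented speaker records, not actual TIMIT metadata; the speaker ids, region codes, and field layout are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical records: (speaker id, dialect region, sex).
speakers = [
    ("spk01", "dr1", "f"),
    ("spk02", "dr1", "m"),
    ("spk03", "dr2", "m"),
    ("spk04", "dr2", "m"),
]


def coverage(records):
    """Count speakers in each (region, sex) cell."""
    return Counter((region, sex) for _, region, sex in records)


def missing_cells(records, regions, sexes):
    """List the (region, sex) combinations that have no speakers at all."""
    counts = coverage(records)
    return [(r, s) for r in regions for s in sexes if counts[(r, s)] == 0]


print(coverage(speakers))
print(missing_cells(speakers, ["dr1", "dr2"], ["f", "m"]))
```

Empty or sparse cells in such a table show where a corpus design would need more data to cover a dimension evenly.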

As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.
