BS ISO 24624:2016. Language resource management – Transcription of spoken language
Most transcription conventions do not provide an exact and comprehensive definition of the unit word. Rather, they take the word definition of standard written orthography as a starting point and supplement this with rules for a selected number of special cases (e.g. abbreviations and spelling, words specific to spoken language such as interjections). A more precise definition should not, and need not, be attempted in this document: the decision of what is to be treated (i.e. marked up) as a word can be left to the individual transcription system. The definition of <w> elements in spoken language transcription can thus be viewed as analogous to the definition of a token in the Morpho-Syntactic Annotation Framework (MAF), where “the description of the orthographic, morphological, phonological and lexical structures that may define a token is not covered by [the] standard” (see ISO 24611). Henceforth, we will call the entity marked up as a <w> element a token in order to avoid confusion with (orthographic) words in a less formal sense.
Most transcription systems distinguish measured pauses and typed pauses, the latter being typically divided into a small number of types based on perceived length; these include “micro”, “short”, “medium” and “long”. Pauses can occur outside speakers’ utterances (see 5.5) and between or inside tokens attributed to a <u> element. Whether or not, and how, a pause is attributed to a speaker is a decision made by the transcription system.
All pauses should be represented as <pause> elements. For measured pauses, the length should be provided in a @dur attribute. For typed pauses, the type should be provided in a @type attribute. If neither measured length nor a typification is provided, the <pause> element can also be used without attributes. Since notation of pauses in legacy documents varies greatly, it may be advisable to keep the original notation form: a @rend attribute can be used for that purpose. As described above, pauses outside <u> elements need a @start and an @end attribute referring to the timeline. For pauses inside <u> elements, timing information can, but need not, be provided by means of preceding and/or following <anchor> elements.
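The three ways of using <pause> described above (measured, typed, and without attributes) can be illustrated with a small sketch. The fragment and the helper names describe_pause and describe_pauses are hypothetical. The attribute values (a duration written as PT0.8S, the type "short", the legacy notation "(.)") and the anchor identifiers are assumptions chosen for illustration, not examples quoted from the standard. The TEI namespace is omitted for brevity.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative fragment only: attribute values and ids are assumptions,
# and the TEI namespace is left out to keep the sketch short.
FRAGMENT = """
<u who="#SPK0" start="#T0" end="#T3">
  <w>okay</w>
  <pause dur="PT0.8S"/>
  <w>so</w>
  <pause type="short" rend="(.)"/>
  <w>well</w>
  <anchor synch="#T1"/>
  <pause/>
  <anchor synch="#T2"/>
  <w>yes</w>
</u>
"""

# Only the simple seconds form "PT<number>S" is handled in this sketch.
_DUR = re.compile(r"^PT(\d+(?:\.\d+)?)S$")


def describe_pause(pause):
    """Classify one <pause> element as measured, typed or unspecified."""
    dur = pause.get("dur")
    if dur is not None:
        match = _DUR.match(dur)
        if match is None:
            raise ValueError(f"unsupported duration format: {dur!r}")
        return ("measured", float(match.group(1)))
    if pause.get("type") is not None:
        return ("typed", pause.get("type"))
    return ("unspecified", None)


def describe_pauses(xml_text):
    """Return a classification for every <pause> in document order."""
    root = ET.fromstring(xml_text.strip())
    return [describe_pause(p) for p in root.iter("pause")]


if __name__ == "__main__":
    for kind, value in describe_pauses(FRAGMENT):
        print(kind, value)
```

In this sketch the third pause carries no attributes of its own; its timing is given only by the <anchor> elements that precede and follow it, which is one of the options the text allows for pauses inside <u> elements.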
Since the measured duration of a pause is also temporal information, contradictions may arise between the value of the @dur attribute and information encoded in timeline references, for instance, when a pause is longer than the utterance in which it is contained. Such inconsistencies cannot be detected by
