Ambiguity tag A set of two part-of-speech tags (joined by a hyphen) attached to a single lexical item,
to indicate that the CLAWS tagging program was unable to reliably distinguish between the two possible word classes.
In the BNC World Edition, the ordering of the tags is significant: it is the first of the two tags which is estimated
by the tagger to be the more likely. So, for example, the ambiguity tag AJ0-AV0 indicates that the choice between
adjective (AJ0) and adverb (AV0) is left open, although the tagger has a preference for an adjective reading.
The mirror tag, AV0-AJ0, again shows adjective-adverb ambiguity, but this time the more likely reading is the adverb.
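The ordering convention can be made concrete with a minimal sketch. The helper name split_ambiguity_tag is an assumption introduced here for illustration and is not part of CLAWS or BNCweb; it only assumes that the two tags are joined by a single hyphen, as described above.

```python
def split_ambiguity_tag(tag):
    """Return (preferred, alternative) for a hyphenated ambiguity tag.

    The first tag is the reading the tagger considers more likely.
    For a plain, unambiguous tag the alternative is None.
    (Hypothetical helper, not part of CLAWS or BNCweb.)
    """
    preferred, sep, alternative = tag.partition("-")
    return preferred, (alternative if sep else None)


print(split_ambiguity_tag("AJ0-AV0"))  # ('AJ0', 'AV0')
print(split_ambiguity_tag("AV0-AJ0"))  # ('AV0', 'AJ0')
print(split_ambiguity_tag("AJ0"))      # ('AJ0', None)
```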
Collocation The phenomenon of words/lexical items tending to co-occur in
close proximity to one another in spoken/written discourse (i.e. habitual or greater-than-chance
co-selection of words). For example, if you look up the word jubilee, you will tend to find the
following words (the collocates) nearby: silver, diamond, golden, Queen's,
line. The term 'collocation' is very broad, and allows varying degrees of collocability
(or collocational strength), which can be measured with various statistical
formulae (e.g. log-likelihood, mutual information). At one extreme of the scale, collocations
which are totally predictable are usually analysed as idioms, cliches, etc.
At the other extreme, items which co-occur significantly in statistical
terms may not be recognised as predictable collocations by native speakers. BNCweb simply
presents the raw statistical results: it is up to the user to do the evaluation and
interpretation.
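As a rough illustration of how such measures work, the sketch below computes mutual information and log-likelihood from four counts: the co-occurrence frequency of node and collocate, the frequency of each word, and the corpus size. These are the textbook formulations; the exact variants used by BNCweb (for instance, how the collocation window enters the expected frequency) may differ, and the counts in the usage lines are invented.

```python
import math


def mutual_information(o11, f_node, f_coll, n):
    """Pointwise MI: log2 of observed over expected co-occurrence frequency."""
    expected = f_node * f_coll / n
    return math.log2(o11 / expected)


def log_likelihood(o11, f_node, f_coll, n):
    """Log-likelihood ratio (G2) over the 2x2 contingency table."""
    observed = [
        o11,                        # node with collocate
        f_node - o11,               # node without collocate
        f_coll - o11,               # collocate without node
        n - f_node - f_coll + o11,  # neither word
    ]
    rows = [f_node, n - f_node]
    cols = [f_coll, n - f_coll]
    expected = [
        rows[0] * cols[0] / n,
        rows[0] * cols[1] / n,
        rows[1] * cols[0] / n,
        rows[1] * cols[1] / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Invented counts: 20 co-occurrences, node 100 times, collocate 1,000 times,
# in a corpus of 1,000,000 words.
print(round(mutual_information(20, 100, 1000, 1_000_000), 2))
print(round(log_likelihood(20, 100, 1000, 1_000_000), 2))
```

Both scores are zero when the observed co-occurrence frequency equals the expected one, and grow as the pair co-occurs more often than chance predicts.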
Lemma (pl. lemmas or lemmata) An abstract lexical category (usually represented by all-capitals, e.g. BLOW)
consisting of a lexeme base plus its inflected forms (regular, irregular & suppletive)
which share the same part of speech. For example, the verbal lemma BLOW contains the word
forms blow, blows, blew, blown and blowing, while the lemma GO
encompasses go, goes, went, gone, going. Lemmas
for nouns (or 'substantives') group together singular and plural forms (e.g. wolf/wolves);
adjectival lemmas group together positive, comparative and superlative forms (e.g.
happy, happier, happiest; good, better, best);
pronominal lemmas group together different cases of the same pronoun (e.g. I, me, my, mine).
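The grouping described above can be sketched as a small lookup table. The table below is a toy built only from the examples in this entry; the part-of-speech labels and the function name lemma_of are assumptions made for illustration and do not reflect how the BNC stores lemma information.

```python
# Toy lemma table: (headword, part of speech) -> word forms.
LEMMAS = {
    ("BLOW", "verb"): ["blow", "blows", "blew", "blown", "blowing"],
    ("GO", "verb"): ["go", "goes", "went", "gone", "going"],
    ("WOLF", "noun"): ["wolf", "wolves"],
    ("GOOD", "adjective"): ["good", "better", "best"],
}

# Invert the table so that a word form plus its part of speech finds the lemma.
FORM_TO_LEMMA = {
    (form, pos): headword
    for (headword, pos), forms in LEMMAS.items()
    for form in forms
}


def lemma_of(form, pos):
    """Return the lemma headword for a word form, or None if it is not listed."""
    return FORM_TO_LEMMA.get((form.lower(), pos))


print(lemma_of("went", "verb"))    # GO
print(lemma_of("wolves", "noun"))  # WOLF
print(lemma_of("blow", "noun"))    # None
```

The last lookup returns None because the toy table lists only the verbal lemma BLOW: forms belong to the same lemma only if they share a part of speech.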
<s>-unit This may be thought of as a 'sentence' unit, although it is also used
to delimit some headings in written texts, and in the spoken texts to delimit stretches of discourse
which the transcribers have identified as sentence-like in form (e.g. because they are bounded by pause)
or function (e.g. a one-word utterance such as Yeah). Fragmentary chunks are therefore sometimes
marked up as <s>-units in the BNC. All <s>-units are numbered, and the number is reported
for each BNC query result hit. In the written texts, <s>-units
are contained within 'paragraphs' (<p>-units), while in spoken texts they are contained within
'utterances' (<u>-units). To illustrate, in the excerpt below, there are two 'sentences'
(<s>-units numbered 6 and 7) within one utterance (<u>-unit) by the speaker (who is
referenced by the identity tag 'PS6SH'):
<u who=PS6SH>
<s n="6">Wait.
<s n="7"> Wait Lis, I ain't got the things on.
</u>
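A minimal sketch of how the numbered <s>-units could be pulled out of such an excerpt is given below. It uses regular expressions rather than an XML parser because the excerpt omits the closing tags for <s>; the function name s_units is an assumption, and the pattern is written for this excerpt only, not for the full BNC markup.

```python
import re

EXCERPT = """<u who=PS6SH>
<s n="6">Wait.
<s n="7"> Wait Lis, I ain't got the things on.
</u>"""


def s_units(utterance):
    """Return the speaker ID and a list of (number, text) pairs for one <u>-unit."""
    speaker = re.search(r"<u who=(\w+)>", utterance).group(1)
    # Each <s>-unit runs until the next <s> tag or the end of the utterance.
    units = re.findall(r'<s n="(\d+)">(.*?)(?=<s |</u>)', utterance, flags=re.DOTALL)
    return speaker, [(int(number), text.strip()) for number, text in units]


speaker, units = s_units(EXCERPT)
print(speaker)  # PS6SH
for number, text in units:
    print(number, text)
```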
BNCweb translates the
SGML Abbreviation for Standard Generalized Markup Language.
SGML is an international standard which governs the way the structural units and fonts of a document are
described so that they can be shared among different computer platforms and programs without any loss of
information (compare this with the different internal file formats used by different word processors under
various operating systems such as Windows, Mac OS and Linux). The language used by web browsers, HTML (or,
increasingly, XML), is an example of a text representation language based on SGML. A simple example of how
SGML works is the way paragraphs, section headings or utterances by individual speakers are marked up using
angled brackets to mark the beginning and end of each structural unit (e.g. using <p> and </p>).
Documents marked up in SGML are meant to be read by computer programs in the first instance: they interpret
the markup before displaying the document in a format which humans find easy to read, and users can choose which
parts of the documents to display or search (e.g. only certain headings, or only the utterances spoken by
certain speakers). The raw BNC texts are marked up using SGML, and special tools are required for processing
these texts. SARA is the corpus tool provided with the BNC, and BNCweb is a user-friendly
program ('interface') which mediates between the user and the complexity of the SARA internal commands for
interrogating marked-up texts. A more generic and technical definition of SGML may be found
here.
Spoken context-governed texts (task-oriented speech) The part of the BNC which contains spoken texts transcribed from recordings made in
specially selected spoken contexts (mostly semi-formal/prepared or situated within an institutional context, thus
contrasting with the Spoken demographic material). There are four broadly defined contexts/domains of
spoken discourse: educational (e.g. classroom lessons, lectures, home tutorials), business
(e.g. committee meetings, job interviews), public/institutional (e.g. council meetings, courtroom discourse,
medical consultations, sermons) and leisure (e.g. TV/radio broadcasts, oral history narratives).
The examples given in brackets closely mirror the genre labels available in
the BNC World Edition, which can be used for creating subcorpora.
Text header This refers to the information included at the top of each BNC file which gives
descriptive and classificatory information about the text while not actually being part
of it. It is therefore metatextual information in the sense that it gives information about
the origin/provenance of the text (e.g. bibliographic information for published texts, or information
about the setting & participants for spoken texts), how the text was collected or recorded,
how the words were entered or transcribed, etc. The File and speaker information display
feature can be used to view the metatextual information contained in the header of each text.
The header information in BNC files is very detailed and highly structured. It is set off from
the text proper by the use of the SGML element
<teiHeader></teiHeader> (TEI stands for the Text Encoding Initiative, a body of standards
for electronic texts). As an example, the header of file KSW can be found here
in full.
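To show what "set off from the text proper" means in practice, the sketch below separates a header from the text that follows it. The sample document is invented and far simpler than a real BNC header, the title element inside it is a placeholder, and split_header is a hypothetical helper, not a BNCweb function.

```python
import re

# Invented miniature document; real BNC headers are much more detailed.
SAMPLE = """<teiHeader>
<title>Invented sample title</title>
</teiHeader>
<u who=PS6SH>
<s n="6">Wait.
</u>"""


def split_header(document):
    """Return (header content, remaining text); header is None if there is none."""
    match = re.search(r"<teiHeader\b[^>]*>(.*?)</teiHeader>", document, flags=re.DOTALL)
    if match is None:
        return None, document
    body = document[:match.start()] + document[match.end():]
    return match.group(1).strip(), body.strip()


header, body = split_header(SAMPLE)
print(header)
print(body)
```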
<w>-unit In most cases, <w>-units correspond to orthographic words. The exceptions are
contracted forms, fused forms and multiword units (e.g. "she's",
"gonna" and "in front of" respectively). Further information can be found in the
BNC World Edition POS-tagging manual.
The difference between orthographic words and <w>-units also explains why the "100-million word
corpus" BNC contains only 97,626,093 <w>-units: there are in fact slightly more than 100 million words in the BNC (100,467,090 to be
exact), but that figure counts orthographic words rather than <w>-units.
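The size of the gap between the two counts follows directly from the figures quoted above; the short calculation below uses only those two numbers.

```python
ORTHOGRAPHIC_WORDS = 100_467_090  # orthographic words in the BNC
W_UNITS = 97_626_093              # <w>-units in the BNC

difference = ORTHOGRAPHIC_WORDS - W_UNITS
share = difference / ORTHOGRAPHIC_WORDS * 100

print(f"{difference:,}")  # 2,840,997
print(f"{share:.1f}%")    # 2.8%
```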