Ambiguity tag     A set of two part-of-speech tags (joined by a hyphen) attached to a single lexical item, 
						to indicate that the CLAWS tagging program was unable to reliably distinguish between the two possible word classes. 
						In the BNC World Edition, the ordering of the tags is significant: it is the first of the two tags which is estimated 
						by the tagger to be the more likely. So, for example, the ambiguity tag AJ0-AV0 indicates that the choice between 
						adjective (AJ0) and adverb (AV0) is left open, although the tagger has a preference for an adjective reading. 
						The mirror tag, AV0-AJ0, again shows adjective-adverb ambiguity, but this time the more likely reading is the adverb.
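
As a rough illustration of how the ordering can be used when processing tagged output, the following minimal Python sketch splits a tag into its preferred and alternative readings. The names `PosTag`, `split_tag` and `matches` are hypothetical helpers, not part of CLAWS or BNCweb, and the sketch assumes that unambiguous tags never contain a hyphen themselves.

```python
from typing import NamedTuple, Optional


class PosTag(NamedTuple):
    preferred: str              # the reading the tagger considers more likely
    alternative: Optional[str]  # the second reading, or None if unambiguous


def split_tag(tag: str) -> PosTag:
    """Split a tag such as 'AJ0-AV0' into preferred and alternative readings."""
    first, sep, second = tag.partition("-")
    return PosTag(first, second if sep else None)


def matches(tag: str, wanted: str) -> bool:
    """Return True if `wanted` is either of the readings recorded in `tag`."""
    parsed = split_tag(tag)
    return wanted in (parsed.preferred, parsed.alternative)
```

With these helpers, a search for adjectives that should also catch ambiguous cases can test `matches(tag, "AJ0")`, while a stricter search can compare only against `split_tag(tag).preferred`.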
						
						
						Collocation     The phenomenon of words/lexical items tending to co-occur in 
						close proximity to one another in spoken/written discourse (i.e. habitual or greater-than-chance 
						co-selection of words). For example, if you look up the word jubilee, you will tend to find the 
						following words (the collocates) nearby: silver, diamond, golden, Queen's, 
						line. The term 'collocation' is very broad, and allows varying degrees of collocability 
						(or collocational strength), which can be measured using several statistical 
						formulae (e.g. log-likelihood, mutual information). At one extreme of the scale, collocations 
						which are totally predictable are usually analysed as idioms, cliches, etc. 
						At the other extreme, items which co-occur significantly in statistical
						terms may not be recognised as predictable collocations by native speakers. BNCweb simply 
						presents the raw statistical results: it is up to the user to do the evaluation and 
						interpretation.
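
To give an idea of what such formulae compute, here is a minimal Python sketch of two common textbook measures: mutual information in its simple pointwise form, and the log-likelihood statistic over a 2x2 contingency table. These are not necessarily the exact variants BNCweb implements (for instance, no adjustment for window size is made), and the function and parameter names are assumptions made for illustration.

```python
import math


def mutual_information(o11: int, f_node: int, f_coll: int, n: int) -> float:
    """log2 of observed over expected co-occurrence frequency.

    o11    -- observed co-occurrences of node and collocate
    f_node -- total frequency of the node word
    f_coll -- total frequency of the collocate
    n      -- total number of tokens in the corpus
    """
    expected = f_node * f_coll / n
    return math.log2(o11 / expected)


def log_likelihood(o11: int, f_node: int, f_coll: int, n: int) -> float:
    """Log-likelihood (G2) statistic over the 2x2 contingency table."""
    o12 = f_node - o11               # node without collocate
    o21 = f_coll - o11               # collocate without node
    o22 = n - f_node - f_coll + o11  # neither
    observed = [o11, o12, o21, o22]
    row_totals = [f_node, f_node, n - f_node, n - f_node]
    col_totals = [f_coll, n - f_coll, f_coll, n - f_coll]
    g2 = 0.0
    for o, r, c in zip(observed, row_totals, col_totals):
        expected = r * c / n
        if o > 0:  # a zero cell contributes nothing to the sum
            g2 += o * math.log(o / expected)
    return 2 * g2
```

When the observed frequency equals the expected one, both measures return zero; the more often the two words co-occur beyond chance, the higher both values become.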
						
						Lemma (pl. lemmas or lemmata)     An abstract lexical category (usually represented by all-capitals, e.g. BLOW)
						consisting of a lexeme base plus its inflected forms (regular, irregular & suppletive) 
						which share the same part of speech. For example, the verbal lemma BLOW contains the word 
						forms blow, blows, blew, blown and blowing, while the lemma GO 
						encompasses go, goes, went, gone, going. Lemmas 
						for nouns (or 'substantives') group together singular and plural forms (e.g. wolf/wolves); 
						adjectival lemmas group together positive, comparative and superlative forms (e.g. 
						happy, happier, happiest; good, better, best);
						pronominal lemmas group together different cases of the same pronoun (e.g. I, me, my, mine).
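
The grouping can be pictured as a lookup table from word forms to lemmas. The Python sketch below is a hand-built toy table using the examples above, not the lemmatisation scheme actually used for the BNC; the names `LEMMA_FORMS`, `FORM_TO_LEMMA` and `lemma_of` are hypothetical.

```python
# Each lemma is identified by its label plus a part of speech,
# because a lemma only groups forms that share the same word class.
LEMMA_FORMS = {
    ("BLOW", "verb"): ["blow", "blows", "blew", "blown", "blowing"],
    ("GO", "verb"): ["go", "goes", "went", "gone", "going"],
    ("WOLF", "noun"): ["wolf", "wolves"],
    ("GOOD", "adjective"): ["good", "better", "best"],
}

# Invert the table so that a (word form, part of speech) pair finds its lemma.
FORM_TO_LEMMA = {
    (form, pos): lemma
    for (lemma, pos), forms in LEMMA_FORMS.items()
    for form in forms
}


def lemma_of(form: str, pos: str):
    """Return the lemma label for a word form, or None if it is not listed."""
    return FORM_TO_LEMMA.get((form.lower(), pos))
```

Because the part of speech is part of the key, `lemma_of("blow", "noun")` finds nothing in this toy table: the noun blow would belong to a separate lemma from the verb BLOW.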
						
						
						<s>-unit     This may be thought of as a 'sentence' unit, although it is also used 
						to delimit some headings in written texts, and in the spoken texts to delimit stretches of discourse 
						which the transcribers have identified as sentence-like in form (e.g. because they are bounded by pauses) 
						or function (e.g. a one-word utterance such as Yeah). Fragmentary chunks are therefore sometimes 
						marked up as <s>-units in the BNC. All <s>-units are numbered, and the number is reported 
						for each BNC query result hit. In the written texts, <s>-units 
						are contained within 'paragraphs' (<p>-units), while in spoken texts they are contained within 
						'utterances' (<u>-units). To illustrate, in the excerpt below, there are two 'sentences' 
						(<s>-units numbered 6 and 7) within one utterance (<u>-unit) by the speaker (who is 
						referenced by the identity tag 'PS6SH'):
						
						
						<u who=PS6SH>
						  <s n="6">Wait.
						  <s n="7"> Wait Lis, I ain't got the things on.
						</u>
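
To show how this nesting can be read by a program, the following minimal Python sketch pulls the speaker and the numbered <s>-units out of the excerpt above. It relies on regular expressions that only suit this small excerpt (the real corpus files carry more markup and call for a proper SGML-aware tool); the names `U_UNIT`, `S_UNIT` and `s_units` are hypothetical.

```python
import re

EXCERPT = """<u who=PS6SH>
  <s n="6">Wait.
  <s n="7"> Wait Lis, I ain't got the things on.
</u>"""

# An utterance: speaker id (quoted or unquoted) and everything up to </u>.
U_UNIT = re.compile(r'<u who="?(\w+)"?>(.*?)</u>', re.DOTALL)
# An <s>-unit has no closing tag here, so it runs until the next <s> or the end.
S_UNIT = re.compile(r'<s n="(\d+)">\s*(.*?)\s*(?=<s |$)', re.DOTALL)


def s_units(sgml: str):
    """Return (speaker, s-unit number, text) triples found in the markup."""
    results = []
    for speaker, body in U_UNIT.findall(sgml):
        for number, text in S_UNIT.findall(body):
            results.append((speaker, int(number), text))
    return results
```

Calling `s_units(EXCERPT)` yields one triple per <s>-unit, each carrying the speaker of the enclosing <u>-unit.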
						
						
						BNCweb translates the 
						SGML     Abbreviation for Standard Generalized Markup Language.
						SGML is an international standard which governs the way the structural units and fonts of a document are 
						described so that they can be shared among different computer platforms and programs without any loss of 
						information (compare this with the different internal file formats used by different word processors under 
						various operating systems such as Windows, Mac OS and Linux). The language used by web browsers, HTML (or, 
						increasingly, XML), is an example of a text representation language based on SGML. A simple example of how 
						SGML works is the way paragraphs, section headings or utterances by individual speakers are marked up using 
						angled brackets to mark the beginning and end of each structural unit (e.g. using <p> and </p>). 
						Documents marked up in SGML are meant to be read by computer programs in the first instance: they interpret 
						the markup before displaying the document in a format which humans find easy to read, and users can choose which 
						parts of the documents to display or search (e.g. only certain headings, or only the utterances spoken by 
						certain speakers). The raw BNC texts are marked up using SGML, and special tools are required for processing 
						these texts. SARA is the corpus tool provided with the BNC, and BNCweb is a user-friendly 
						program ('interface') which mediates between the user and the complexity of the SARA internal commands for 
						interrogating marked-up texts. A more generic and technical definition of SGML may be found 
						here.
						
						
						Spoken context-governed texts (task-oriented speech)     The part of the BNC which contains spoken texts transcribed from recordings made in 
						specially selected spoken contexts (mostly semi-formal/prepared or situated within an institutional context, thus 
						contrasting with the Spoken demographic material). There are four broadly defined contexts/domains of 
						spoken discourse: educational (e.g. classroom lessons, lectures, home tutorials), business 
						(e.g. committee meetings, job interviews), public/institutional (e.g. council meetings, courtroom discourse, 
						medical consultations, sermons) and leisure (e.g. TV/radio broadcasts, oral history narratives). 
						The examples given in brackets closely mirror the genre labels available in 
						the BNC World Edition, which can be used for creating subcorpora.
						
						Text header     This refers to the information included at the top of each BNC file which gives 
						descriptive and classificatory information about the text while not actually being part
						of it. It is therefore metatextual information in the sense that it gives information about 
						the origin/provenance of the text (e.g. bibliographic information for published texts, or information 
						about the setting & participants for spoken texts), how the text was collected or recorded, 
						how the words were entered or transcribed, etc. The File and speaker information display
						feature can be used to view the metatextual information contained in the header of each text.
						The header information in BNC files is very detailed and highly structured. It is set off from 
						the text proper by the use of the SGML element 
						<teiHeader></teiHeader> (TEI stands for the Text Encoding Initiative, a body of standards 
						for electronic texts). As an example, the header of file KSW can be found here
						in full. 
						
						 
						
						<w>-unit     In most cases, <w>-units correspond to orthographic words, except for 
						contracted forms, fused forms and multiword units such as "she's", 
						"gonna", and "in front of". Further information can be found in the 
						BNC World Edition POS-tagging manual.
						The difference between orthographic words and <w>-units also explains why the "100-million word 
						corpus" contains only 97,626,093 <w>-units: the BNC does in fact contain slightly more than 100 million words 
						(100,467,090 to be exact), but only when orthographic words rather than <w>-units are counted.
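
The arithmetic behind the two totals can be sketched in a few lines of Python. The corpus figures are the ones quoted above; the segmentation of the three example items follows the general description of contracted, fused and multiword forms and is an assumption made for illustration (the POS-tagging manual is the authority on the actual segmentation). All names are hypothetical.

```python
ORTHOGRAPHIC_WORDS = 100_467_090  # count of orthographic words
W_UNITS = 97_626_093              # count of <w>-units

# Net difference between the two ways of counting.
difference = ORTHOGRAPHIC_WORDS - W_UNITS
share = difference / ORTHOGRAPHIC_WORDS

# Assumed segmentation of the example items into <w>-units.
SEGMENTATION = {
    "she's": ["she", "'s"],           # contracted form: 1 word, 2 units
    "gonna": ["gon", "na"],           # fused form: 1 word, 2 units
    "in front of": ["in front of"],   # multiword unit: 3 words, 1 unit
}


def count_both(item: str):
    """Return (orthographic words, <w>-units) for a listed example item."""
    return len(item.split()), len(SEGMENTATION[item])


if __name__ == "__main__":
    print(f"difference: {difference:,} ({share:.2%} of orthographic words)")
    for item in SEGMENTATION:
        print(item, count_both(item))
```

Contracted and fused forms push the <w>-unit count up while multiword units push it down, so the overall shortfall in <w>-units is a net effect, with multiword units outweighing the other two types.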