BNCweb manual: Glossary

 Last updated: 8.5.2002

Glossary

In the manual pages, clicking on an underlined item in red will take the user to the relevant entry on this glossary page.

Ambiguity tag A set of two part-of-speech tags (joined by a hyphen) attached to a single lexical item, to indicate that the CLAWS tagging program was unable to reliably distinguish between the two possible word classes. In the BNC World Edition, the ordering of the tags is significant: it is the first of the two tags which is estimated by the tagger to be the more likely. So, for example, the ambiguity tag AJ0-AV0 indicates that the choice between adjective (AJ0) and adverb (AV0) is left open, although the tagger has a preference for an adjective reading. The mirror tag, AV0-AJ0, again shows adjective-adverb ambiguity, but this time the more likely reading is the adverb.
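As a rough illustration of how the ordering of an ambiguity tag can be read, the following Python sketch splits a tag at the hyphen. The function name `split_ambiguity_tag` is made up for this example and is not part of BNCweb or CLAWS.

```python
def split_ambiguity_tag(tag):
    """Split a POS-tag into (more_likely, less_likely).

    For a plain tag such as 'NN1' the second element is None.
    The helper name is hypothetical, chosen only for this sketch.
    """
    first, sep, second = tag.partition("-")
    return (first, second if sep else None)


print(split_ambiguity_tag("AJ0-AV0"))  # adjective preferred over adverb
print(split_ambiguity_tag("AV0-AJ0"))  # adverb preferred over adjective
print(split_ambiguity_tag("NN1"))      # no ambiguity
```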

Collocation The phenomenon of words/lexical items tending to co-occur in close proximity to one another in spoken/written discourse (i.e. habitual or greater-than-chance co-selection of words). For example, if you look up the word jubilee, you will tend to find the following words (the collocates) nearby: silver, diamond, golden, Queen's, line. The term 'collocation' is very broad, and allows varying degrees of collocability (or collocational strength), which can be measured with various statistical formulae (e.g. log-likelihood, mutual information). At one extreme of the scale, collocations which are totally predictable are usually analysed as idioms, cliches, etc. At the other extreme, items which co-occur significantly in statistical terms may not be recognised as predictable collocations by native speakers. BNCweb simply presents the raw statistical results: it is up to the user to do the evaluation and interpretation.
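To give an idea of what a collocational strength score expresses, here is a minimal Python sketch of one common textbook formulation of mutual information: the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected by chance. This is an assumption made for illustration. The exact formula used by BNCweb (for instance, any adjustment for window size) is not given on this page and may differ. The counts are invented and are not real BNC figures.

```python
import math


def mutual_information(observed, node_freq, collocate_freq, corpus_size):
    """Return log2(observed / expected) for a node-collocate pair.

    'expected' is the co-occurrence frequency predicted if the two
    words were distributed independently of each other.
    """
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(observed / expected)


# Invented toy counts: the pair co-occurs 8 times, although chance
# alone would predict only 1 co-occurrence.
score = mutual_information(observed=8, node_freq=10,
                           collocate_freq=100, corpus_size=1000)
print(score)
```

A score well above zero suggests that the two words co-occur more often than chance would predict. Whether the pair is a meaningful collocation remains a matter of interpretation.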

CQL Abbreviation for Corpus Query Language. The special internal 'syntax' (or 'command language') used by the SARA server to process all queries made to the corpus. You can use CQL to make advanced queries, especially those which specify particular parts of the corpus to restrict your queries to. CQL queries can make direct reference to the SGML elements contained within the text headers of BNC files, and are often not easy to construct or read off the screen. Such complex CQL queries are usually generated for you (behind the scenes) by BNCweb, but you can also type them into a query box directly.

KWIC Abbreviation for Key Word In Context. This describes a common way of displaying concordance results, with the word or query expression you searched for displayed in the centre of the page, surrounded by accompanying text. The alternative way of viewing concordance results is the sentence view, which displays your query expression within the context of the sentence (or <s>-unit) in which it is used. You can change the default display view in the User settings.
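The KWIC layout can be sketched in a few lines of Python. This is only an illustration of the display principle and not how BNCweb produces its concordances. The function name `kwic_lines` and the sample sentence are invented for the example.

```python
def kwic_lines(tokens, query, span=3):
    """Return one (left, node, right) triple per hit of `query`."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == query.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, token, right))
    return lines


# Invented sample text, split into words on whitespace.
text = "the silver jubilee was followed by the golden jubilee celebrations".split()

for left, node, right in kwic_lines(text, "jubilee"):
    # Right-align the left context so that the hits line up in a column.
    print(f"{left:>20}  {node}  {right}")
```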

Lemma (pl. lemmas or lemmata) An abstract lexical category (usually represented by all-capitals, e.g. BLOW) consisting of a lexeme base plus its inflected forms (regular, irregular & suppletive) which share the same part of speech. For example, the verbal lemma BLOW contains the word forms blow, blows, blew, blown and blowing, while the lemma GO encompasses go, goes, went, gone, going. Lemmas for nouns (or 'substantives') group together singular and plural forms (e.g. wolf/wolves); adjectival lemmas group together positive, comparative and superlative forms (e.g. happy, happier, happiest; good, better, best); pronominal lemmas group together different cases of the same pronoun (e.g. I, me, my, mine).

Metatextual categories The BNC includes a whole range of information about the individual texts. A few examples are given below:

For written texts:

date of publication

text sample (e.g. whole text vs. only part of a text)

age and sex of author

etc.

For spoken texts:

interaction type (monologue or dialogue)

age, sex or social class of respondent

age, sex or social class of speaker

etc.

The metatextual information is encoded in the text header of each BNC text. SARA can make use of this information by restricting searches to texts conforming to a certain selection of metatextual information.

Multiword unit A group of two or more orthographic words which the part-of-speech tagger CLAWS treats as a single grammatical unit (a <w>-unit). Multiword units include:

foreign expressions naturalised into English (e.g. a priori, ad hoc, bon mots, film noir, tour de force);

multi-word adverbs (e.g. of course and all of a sudden are each assigned a single POS-tag for 'adverb', AV0);

multi-word prepositions (e.g. rather than and along with are each tagged PRP for 'preposition');

some idiomatic noun constructions (e.g. check outs [NN2], clamp down [NN1], follow up [NN1], grown up [NN1], hotch potch [NN1], know how [NN1], nitty gritty [NN1], per cent [NN0]);

expressions which are sometimes (wrongly) spelt as two orthographic words instead of one (e.g. well being [instead of wellbeing] and some one [instead of someone]).

A full list of multiwords may be found here.

Node The position occupied by the query expression in the concordance window. In a KWIC-format concordance, this will be in the centre of each concordance line. Sorting, collocations, etc. are carried out on positions relative to the node - left or right of it, or even the node position itself.

POS-tag Part-of-speech or grammatical word class tag, as assigned by CLAWS, an automatic tagging program developed at Lancaster University. POS-tags are labels attached to each <w>-unit in the corpus, indicating its grammatical class. There is generally one POS-label for each orthographic word, except for the following cases:

ambiguity (where the CLAWS program was unable to accurately distinguish between two possible word classes)

contracted forms and fused forms (where more than one POS-tag has been applied)

multiwords (where two or more orthographic words share only one POS-tag).

A brief overview of contracted forms, fused forms and multiword units is given in the Standard query page. More detailed explanation may be found in the BNC World Edition POS-tagging manual.

Query expression The expression you look for. It can be a word, a phrase, or a string of words.

Query result The output from running a query in BNCweb; this is usually a concordance, but can also simply be a report of the 'number of hits' (if the option 'count hits' is chosen for 'Number of hits per page' on the first screen of any standard/lemma query).

Respondent Respondents are people selected on the basis of their demographic profile who agreed to record all of their conversations over a two- to seven-day period. A selection of these recordings makes up the spoken demographic part of the spoken texts. Query searches can be restricted to texts recorded by specific kinds of respondents (e.g. age between 25 and 34). Please note that this does not have the same effect as restricting a query to utterances produced by certain kinds of speakers.

<s>-unit This may be thought of as a 'sentence' unit, although it is also used to delimit some headings in written texts, and in the spoken texts to delimit stretches of discourse which the transcribers have identified as sentence-like in form (e.g. because they are bounded by pause) or function (e.g. a one-word utterance such as Yeah). Fragmentary chunks are therefore sometimes marked up as <s>-units in the BNC. All <s>-units are numbered, and the number is reported for each BNC query result hit. In the written texts, <s>-units are contained within 'paragraphs' (<p>-units), while in spoken texts they are contained within 'utterances' (<u>-units). To illustrate, in the excerpt below, there are two 'sentences' (<s>-units numbered 6 and 7) within one utterance (<u>-unit) by the speaker (who is referenced by the identity tag 'PS6SH'):

 <u who="PS6SH">
 <s n="6">Wait.
 <s n="7"> Wait Lis, I ain't got the things on.
 </u>

BNCweb translates the SGML format to a format that can be interpreted and displayed by a normal web browser.

Note: If you quote a sentence or a whole passage from the BNC, you should give the text ID (e.g. KSW) and <s>-unit number(s) as a full reference.
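For readers who handle raw markup themselves, the following Python sketch shows one way of picking out <s>-unit numbers and building a reference from them. It is an illustration under stated assumptions: the regular expression only handles start tags written as in the excerpt above, the helper names are invented, and the layout of the reference string is merely one possibility. KSW is used only because it is the example text ID given in the note, and no claim is made that the excerpt comes from that file.

```python
import re

# Matches an <s> start tag with its number, followed by text up to the next tag.
S_UNIT = re.compile(r'<s n="(\d+)">([^<]*)')


def s_units(markup):
    """Return (number, text) pairs for each <s>-unit found in `markup`."""
    return [(int(n), text.strip()) for n, text in S_UNIT.findall(markup)]


def reference(text_id, numbers):
    """Build a reference string; the layout is an assumption of this sketch."""
    if len(numbers) == 1:
        return f"{text_id} {numbers[0]}"
    return f"{text_id} {numbers[0]}-{numbers[-1]}"


excerpt = '''<s n="6">Wait.
<s n="7"> Wait Lis, I ain't got the things on.'''

units = s_units(excerpt)
print(units)
print(reference("KSW", [n for n, _ in units]))  # text ID used for illustration only
```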

SARA Acronym for SGML Aware Retrieval Application, the corpus tool freely available for all BNC licensees. There are two parts to SARA: the SARA server (a UNIX tool) and the Windows SARA client. BNCweb relies on the SARA server for much of its basic functionality and it replaces the Windows SARA client as a tool to access the BNC.

SGML Abbreviation for Standard Generalized Markup Language. SGML is an international standard which governs the way the structural units and fonts of a document are described so that they can be shared among different computer platforms and programs without any loss of information (compare this with the different internal file formats used by different word processors under various operating systems such as Windows, Mac OS and Linux). The language used by web browsers, HTML (or, increasingly, XML), is an example of a text representation language based on SGML. A simple example of how SGML works is the way paragraphs, section headings or utterances by individual speakers are marked up using angled brackets to mark the beginning and end of each structural unit (e.g. using <p> and </p>). Documents marked up in SGML are meant to be read by computer programs in the first instance: they interpret the markup before displaying the document in a format which humans find easy to read, and users can choose which parts of the documents to display or search (e.g. only certain headings, or only the utterances spoken by certain speakers). The raw BNC texts are marked up using SGML, and special tools are required for processing these texts. SARA is the corpus tool provided with the BNC, and BNCweb is a user-friendly program ('interface') which mediates between the user and the complexity of the SARA internal commands for interrogating marked-up texts. A more generic and technical definition of SGML may be found here.

Spoken context-governed texts (task-oriented speech) The part of the BNC which contains spoken texts transcribed from recordings made in specially selected spoken contexts (mostly semi-formal/prepared or situated within an institutional context, thus contrasting with the Spoken demographic material). There are four broadly defined contexts/domains of spoken discourse: educational (e.g. classroom lessons, lectures, home tutorials), business (e.g. committee meetings, job interviews), public/institutional (e.g. council meetings, courtroom discourse, medical consultations, sermons) and leisure (e.g. TV/radio broadcasts, oral history narratives). The examples given in brackets closely mirror the genre labels available in the BNC World Edition, which can be used for creating subcorpora.

Spoken demographic texts (conversational speech) The part of the BNC containing spoken texts transcribed from recordings made by the 124 respondents who were selected to accurately represent the demographic make-up of the UK population. For the most part, these texts consist of casual, unplanned conversations (this contrasts with the Spoken context-governed material, which is mostly less spontaneous speech).

String Any sequence of characters (letters, numbers or punctuation marks). Note that strings containing accented letters (those with diacritics) are treated specially by BNCweb.

Tag sequence In BNCweb, this refers to a sequence of 'slots' or positions in relation to a query expression (the node, which is what you initially generate). The slots or positions to the left or right of the node may be specified as POS-tags (or broad POS classes, such as 'any adverb') or a combination of a lexical item (or pattern) + a POS-tag (or broad POS-class).

Text header This refers to the information included at the top of each BNC file which gives descriptive and classificatory information about the text while not actually being part of it. It is therefore metatextual information in the sense that it gives information about the origin/provenance of the text (e.g. bibliographic information for published texts, or information about the setting & participants for spoken texts), how the text was collected or recorded, how the words were entered or transcribed, etc. The File and speaker information display feature can be used to view the metatextual information contained in the header of each text. The header information in BNC files is very detailed and highly structured. It is set off from the text proper by the use of the SGML element <teiHeader></teiHeader> (TEI stands for the Text Encoding Initiative, a body of standards for electronic texts). As an example, the header of file KSW can be found here in full.

<w>-unit In most cases, <w>-units correspond to orthographic words. The exceptions are contracted forms (e.g. "she's"), fused forms (e.g. "gonna") and multiword units (e.g. "in front of"). Further information can be found in the BNC World Edition POS-tagging manual. The difference between orthographic word and <w>-unit also explains why the "100-million word corpus" BNC only contains 97,626,093 <w>-units: there are in fact slightly more than 100 million words in the BNC (100,467,090 to be exact), but these are orthographic words rather than <w>-units.
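The following Python sketch illustrates why the two counts can differ. The segmentation of the toy sentence is an assumption made for this example (a contracted form and a fused form each split into two units, and the multiword units "of course" and "in front of" each kept as one unit). It is not taken from an actual BNC file.

```python
# Each entry is one <w>-unit. The flag marks a unit that is written without
# a preceding space, i.e. it belongs to the same orthographic word as the
# unit before it. The segmentation below is illustrative only.
w_units = [
    ("of course", False),    # multiword unit: 2 words, 1 unit
    ("she", False),
    ("'s", True),            # contracted form: 1 word, 2 units
    ("gon", False),
    ("na", True),            # fused form: 1 word, 2 units
    ("stand", False),
    ("in front of", False),  # multiword unit: 3 words, 1 unit
    ("it", False),
]


def count_orthographic_words(units):
    """Rebuild the running text and count whitespace-separated words."""
    text = ""
    for form, joined in units:
        text += form if joined or not text else " " + form
    return len(text.split())


print(len(w_units))                       # number of <w>-units
print(count_orthographic_words(w_units))  # number of orthographic words
```

In this toy sentence there are 8 <w>-units but 9 orthographic words: the multiword units remove more units than the contracted and fused forms add.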

Window A range or span of <w>-units to the left (e.g. -5) or right (e.g. +5) of a node. This is usually user-specifiable for particular purposes (e.g. to calculate collocational strength or to specify a tag sequence search).
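The idea of a window around the node can be sketched in Python as follows. Plain strings stand in for <w>-units here, and the function name `collocates` and the sample sentence are invented for this example. It shows the principle only, not BNCweb's own implementation.

```python
from collections import Counter


def collocates(tokens, node, left=-5, right=5):
    """Count the tokens found within the window around each hit of `node`.

    `left` and `right` are positions relative to the node, e.g. -5 and +5.
    The node position itself is not counted.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i + left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Invented sample text, split into words on whitespace.
tokens = "the silver jubilee was followed by the golden jubilee celebrations".split()

# A narrow window of one position to either side of the node.
print(collocates(tokens, "jubilee", left=-1, right=1))
```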