In the manual pages, clicking on a red underlined item will take you to the relevant entry on this
glossary page.
Ambiguity tag A set of two part-of-speech tags (joined by a hyphen) attached to a single lexical item,
to indicate that the CLAWS tagging program was unable to reliably distinguish between the two possible word classes.
In the BNC World Edition, the ordering of the tags is significant: it is the first of the two tags which is estimated
by the tagger to be the more likely. So, for example, the ambiguity tag AJ0-AV0 indicates that the choice between
adjective (AJ0) and adverb (AV0) is left open, although the tagger has a preference for an adjective reading.
The mirror tag, AV0-AJ0, again shows adjective-adverb ambiguity, but this time the more likely reading is the adverb.
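To make the ordering concrete, here is a minimal Python sketch that splits an ambiguity tag into its more likely and less likely readings. The function name is hypothetical and is not part of BNCweb or CLAWS; it only assumes the hyphen-joined format described above.

```python
def split_ambiguity_tag(tag):
    """Split a POS-tag into (preferred, alternative).

    For an ambiguity tag such as 'AJ0-AV0', the first part is the reading
    the tagger considers more likely. A plain tag has no alternative.
    """
    preferred, hyphen, alternative = tag.partition("-")
    return preferred, (alternative if hyphen else None)


print(split_ambiguity_tag("AJ0-AV0"))  # adjective preferred over adverb
print(split_ambiguity_tag("AV0-AJ0"))  # adverb preferred over adjective
print(split_ambiguity_tag("NN1"))      # unambiguous tag
```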
Collocation The phenomenon of words/lexical items tending to co-occur in
close proximity to one another in spoken/written discourse (i.e. habitual or greater-than-chance
co-selection of words). For example, if you look up the word jubilee, you will tend to find the
following words (the collocates) nearby: silver, diamond, golden, Queen's,
line. The term 'collocation' is very broad, and allows varying degrees of collocability
(or collocational strength), which is measured by several statistical
formulae (e.g. log-likelihood, mutual information). At one extreme of the scale, collocations
which are totally predictable are usually analysed as idioms, cliches, etc.
At the other extreme, items which co-occur significantly in statistical
terms may not be recognised as predictable collocations by native speakers. BNCweb simply
presents the raw statistical results: it is up to the user to do the evaluation and
interpretation.
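As a rough illustration of how such a measure works, the Python sketch below computes a mutual information score as the base-2 logarithm of observed over expected co-occurrence frequency. This is one common textbook formulation and not necessarily the exact formula BNCweb applies; the function name, the span parameter and all frequencies are invented for the example.

```python
import math


def mutual_information(observed, node_freq, collocate_freq, corpus_size, span=1):
    """Return log2(observed / expected) for a node-collocate pair.

    expected is the co-occurrence frequency we would see by chance:
    node_freq * collocate_freq * span / corpus_size (an assumed formulation).
    """
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(observed / expected)


# Invented counts: the pair co-occurs 32 times, where chance predicts 2.
score = mutual_information(observed=32, node_freq=1000,
                           collocate_freq=2000, corpus_size=1_000_000)
print(score)  # 4.0
```

A score well above zero means the pair co-occurs more often than chance would predict; whether that makes it a meaningful collocation is still for you to judge.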
CQL Abbreviation for Corpus Query Language. The special internal 'syntax' (or 'command
language') used by the SARA server to process all queries
made to the corpus. You can use CQL to make advanced queries,
especially those which specify particular parts of the corpus to restrict your queries to.
CQL queries can make direct reference to the SGML elements contained within the text
headers of BNC files, and are often not easy to construct or read off the screen.
Such complex CQL queries are usually generated for you (behind the scenes) by BNCweb, but you can also
type them into a query box directly.
KWIC Abbreviation for Key Word In Context. This describes a common way
of displaying concordance results, with the word or query expression
you searched for displayed in the centre of the page, surrounded by accompanying text.
The alternative way of viewing concordance results is the sentence view,
which displays your query expression within the context of the sentence
(or <s>-unit) in which it is used. You can change the default
display view in the User settings.
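The KWIC layout can be sketched in a few lines of Python. This is only an illustration of the display principle, not BNCweb's own code: the function name is hypothetical, and splitting on whitespace is a simplification of real corpus tokenisation.

```python
def kwic(tokens, node, span=3, width=30):
    """Return one display line per occurrence of node, with the node aligned.

    The left context is right-justified to `width` characters so that every
    occurrence of the node starts in the same column.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left:>{width}}  {token}  {right}")
    return lines


text = "the silver jubilee of the Queen was a golden jubilee year"
for line in kwic(text.split(), "jubilee", span=2, width=12):
    print(line)
```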
Lemma (pl. lemmas or lemmata) An abstract lexical category (usually represented by all-capitals, e.g. BLOW)
consisting of a lexeme base plus its inflected forms (regular, irregular & suppletive)
which share the same part of speech. For example, the verbal lemma BLOW contains the word
forms blow, blows, blew, blown and blowing, while the lemma GO
encompasses go, goes, went, gone, going. Lemmas
for nouns (or 'substantives') group together singular and plural forms (e.g. wolf/wolves);
adjectival lemmas group together positive, comparative and superlative forms (e.g.
happy, happier, happiest; good, better, best);
pronominal lemmas group together different cases of the same pronoun (e.g. I, me, my, mine).
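One way to picture a lemma is as a lookup from a headword and part of speech to its set of word forms. The Python sketch below uses only the examples given above; the table, the part-of-speech labels and the function name are illustrative assumptions and do not reflect how BNCweb stores lemma information.

```python
# Each key is (headword, part of speech); each value is the set of word forms.
LEMMAS = {
    ("BLOW", "verb"): {"blow", "blows", "blew", "blown", "blowing"},
    ("GO", "verb"): {"go", "goes", "went", "gone", "going"},
    ("WOLF", "noun"): {"wolf", "wolves"},
    ("GOOD", "adjective"): {"good", "better", "best"},
}


def lemmas_for(word_form):
    """Return every (headword, part of speech) whose forms include word_form."""
    form = word_form.lower()
    return sorted(key for key, forms in LEMMAS.items() if form in forms)


print(lemmas_for("went"))    # suppletive form of GO
print(lemmas_for("Wolves"))  # irregular plural of WOLF
```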
Metatextual categories The BNC includes a whole range of information about
the individual texts. A few examples are given below:
- For written texts:
  - date of publication
  - text sample (e.g. whole text vs. only part of a text)
  - age and sex of author
  - etc.
- For spoken texts:
  - interaction type (monologue or dialogue)
  - age, sex or social class of respondent
  - age, sex or social class of speaker
  - etc.
The metatextual information is encoded in the text header of each BNC text. SARA can make use of this
information by restricting searches to texts conforming to a certain selection of metatextual information.
Multiword unit A group of two or more orthographic words which the part-of-speech tagger CLAWS treats as a
single grammatical unit (a <w>-unit). Multiword units include:
- foreign expressions naturalised into English (e.g. a priori, ad hoc, bon mots, film noir, tour de force);
- multi-word adverbs (e.g. of course and all of a sudden are each assigned a single POS-tag for 'adverb', AV0);
- multi-word prepositions (e.g. rather than and along with are each tagged PRP for 'preposition');
- some idiomatic noun constructions (e.g. check outs [NN2], clamp down [NN1], follow up [NN1], grown up [NN1],
hotch potch [NN1], know how [NN1], nitty gritty [NN1], per cent [NN0]);
- expressions which are sometimes (wrongly) spelt as two orthographic words instead of one (e.g. well being
[instead of wellbeing] and some one [instead of someone]).
A full list of multiwords may be found here.
Node The position occupied by the query expression in the
concordance window. In a KWIC-format concordance, this will be in the centre
of each concordance line. Sorting, collocations, etc. are carried out on
positions relative to the node - left or right of it, or even the node
position itself.
POS-tag Part-of-speech or grammatical word class tag, as assigned by CLAWS, an automatic tagging program
developed at Lancaster University. POS-tags are labels attached to each <w>-unit in the corpus, indicating
its grammatical class. There is generally one POS-label for each orthographic word, except for the
following cases:
- ambiguity (where the CLAWS program was unable to accurately distinguish between
two possible word classes)
- contracted forms and fused forms (where more than one POS-tag has been applied)
- multiwords (where two or more orthographic words share only one POS-tag).
A brief overview of contracted forms, fused forms and multiword units is given in the
Standard query page.
More detailed explanation may be found in the
BNC World Edition POS-tagging manual.
Query expression The expression you look for.
It can be a word, a phrase, or string of words.
Query result The output from running a query in BNCweb;
this is usually a concordance, but can also simply be a report of the
'number of hits' (if the option 'count hits' is chosen for 'Number of hits per page'
on the first screen of any standard/lemma query).
Respondent Respondents are people selected on the basis of their demographic profile who agreed to
record all of their conversations over a two to seven day period. A selection of these recordings makes
up the spoken demographic part of the spoken texts. Query searches can be restricted to texts
recorded by specific kinds of respondents (e.g. age between 25 and 34). Please note that this does not
have the same effect as restricting a query to utterances produced by certain kinds of speakers.
<s>-unit This may be thought of as a 'sentence' unit, although it is also used
to delimit some headings in written texts, and in the spoken texts to delimit stretches of discourse
which the transcribers have identified as sentence-like in form (e.g. because they are bounded by pause)
or function (e.g. a one-word utterance such as Yeah). Fragmentary chunks are therefore sometimes
marked up as <s>-units in the BNC. All <s>-units are numbered, and the number is reported
for each BNC query result hit. In the written texts, <s>-units
are contained within 'paragraphs' (<p>-units), while in spoken texts they are contained within
'utterances' (<u>-units). To illustrate, in the excerpt below, there are two 'sentences'
(<s>-units numbered 6 and 7) within one utterance (<u>-unit) by the speaker (who is
referenced by the identity tag 'PS6SH'):
<u who=PS6SH>
<s n="6">Wait.
<s n="7"> Wait Lis, I ain't got the things on.
</u>
BNCweb translates the SGML format to a format that can be interpreted and displayed by a normal web browser.
Note: If you quote a sentence or a whole passage from the BNC, you should give the text ID (e.g. KSW) and
<s>-unit number(s) as a full reference.
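As an illustration of how the <s>-unit numbers can be read from the markup, the Python sketch below pulls the number and text of each <s>-unit out of the excerpt above. It is a simplified assumption about the format (each <s> start tag carries an n attribute and the text runs up to the next tag), and the function name is hypothetical; real BNC files need a proper SGML-aware tool such as SARA.

```python
import re

EXCERPT = """<u who=PS6SH>
<s n="6">Wait.
<s n="7"> Wait Lis, I ain't got the things on.
</u>"""


def s_units(sgml):
    """Return (number, text) pairs, one for each <s n="..."> start tag."""
    # Capture the n attribute, then all text up to the next tag.
    pattern = re.compile(r'<s n="(\d+)">([^<]*)')
    return [(int(number), text.strip()) for number, text in pattern.findall(sgml)]


for number, text in s_units(EXCERPT):
    print(number, text)
```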
SARA Acronym for SGML Aware Retrieval Application, the corpus tool freely
available for all BNC licensees. There are two parts to SARA: the SARA server (a UNIX tool)
and the Windows SARA client. BNCweb relies on the SARA server for much of its basic
functionality and it replaces the Windows SARA client as a tool to access the BNC.
SGML Abbreviation for Standard Generalized Markup Language.
SGML is an international standard which governs the way the structural units and fonts of a document are
described so that they can be shared among different computer platforms and programs without any loss of
information (compare this with the different internal file formats used by different word processors under
various operating systems such as Windows, Mac OS and Linux). The language used by web browsers, HTML (or,
increasingly, XML), is an example of a text representation language based on SGML. A simple example of how
SGML works is the way paragraphs, section headings or utterances by individual speakers are marked up using
angled brackets to mark the beginning and end of each structural unit (e.g. using <p> and </p>).
Documents marked up in SGML are meant to be read by computer programs in the first instance: they interpret
the markup before displaying the document in a format which humans find easy to read, and users can choose which
parts of the documents to display or search (e.g. only certain headings, or only the utterances spoken by
certain speakers). The raw BNC texts are marked up using SGML, and special tools are required for processing
these texts. SARA is the corpus tool provided with the BNC, and BNCweb is a user-friendly
program ('interface') which mediates between the user and the complexity of the SARA internal commands for
interrogating marked-up texts. A more generic and technical definition of SGML may be found
here.
Spoken context-governed texts (task-oriented speech) The part of the BNC which contains spoken texts transcribed from recordings made in
specially selected spoken contexts (mostly semi-formal/prepared or situated within an institutional context, thus
contrasting with the Spoken demographic material). There are four broadly defined contexts/domains of
spoken discourse: educational (e.g. classroom lessons, lectures, home tutorials), business
(e.g. committee meetings, job interviews), public/institutional (e.g. council meetings, courtroom discourse,
medical consultations, sermons) and leisure (e.g. TV/radio broadcasts, oral history narratives).
The examples given in brackets closely mirror the genre labels available in
the BNC World Edition, which can be used for creating subcorpora.
Spoken demographic texts (conversational speech) The part of the BNC containing spoken texts transcribed from recordings
made by the 124 respondents who were selected to accurately represent the
demographic make-up of the UK population. For the most part, these texts consist of casual, unplanned
conversations (this contrasts with the Spoken context-governed material, which is mostly less
spontaneous speech).
String Any sequence of characters (letters, numbers or punctuation marks). Note that strings
containing accented letters (those with diacritics) are
treated specially by BNCweb.
Tag sequence In BNCweb, this refers to a sequence of 'slots' or positions in relation to a query expression
(the node, which is what you initially generate). The slots or positions to the left or right of the node may be specified as
POS-tags (or broad POS classes, such as 'any adverb') or a combination of a lexical item (or pattern) + a POS-tag
(or broad POS-class).
Text header This refers to the information included at the top of each BNC file which gives
descriptive and classificatory information about the text while not actually being part
of it. It is therefore metatextual information in the sense that it gives information about
the origin/provenance of the text (e.g. bibliographic information for published texts, or information
about the setting & participants for spoken texts), how the text was collected or recorded,
how the words were entered or transcribed, etc. The File and speaker information display
feature can be used to view the metatextual information contained in the header of each text.
The header information in BNC files is very detailed and highly structured. It is set off from
the text proper by the use of the SGML element
<teiHeader></teiHeader> (TEI stands for the Text Encoding Initiative, a body of standards
for electronic texts). As an example, the header of file KSW can be found here
in full.
<w>-unit In most cases, <w>-units correspond to orthographic words, except for
contracted forms, fused forms and multiword units such as "she's",
"gonna", and "in front of". Further information can be found in the
BNC World Edition POS-tagging manual.
The difference between orthographic words and <w>-units also explains why the "100-million word
corpus" BNC contains only 97,626,093 <w>-units: there are in fact slightly more than 100 million words in the BNC (100,467,090 to be
exact), but these are orthographic words rather than <w>-units.
Window A range or span of <w>-units to the left (e.g. -5) or right (e.g. +5) of a
node. This is usually user-specifiable for particular purposes (e.g. to
calculate collocational strength or to specify a tag sequence search).
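The idea of positions relative to the node can be sketched in Python as follows. The function name and the sample data are invented for illustration; note that the multiword unit "of course" counts as a single <w>-unit, so it occupies one position.

```python
def window(units, node_index, left=5, right=5):
    """Map relative positions (-left to +right, node excluded) to <w>-units.

    Positions that would fall outside the text are simply left out.
    """
    result = {}
    for offset in range(-left, right + 1):
        position = node_index + offset
        if offset != 0 and 0 <= position < len(units):
            result[offset] = units[position]
    return result


# "of course" is one <w>-unit, so it fills a single slot at position -3.
units = ["of course", "the", "silver", "jubilee", "was", "celebrated"]
print(window(units, node_index=3, left=3, right=2))
```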