BNCweb manual: Standard query

Standard query

A Standard Query is used to search for words or phrases in the whole corpus or in the full spoken or written components. Standard queries range in complexity from single word searches to more advanced options concerning what to look for and in which parts of the corpus to do the searching:

Performing a standard query
Restricting a query using metatextual categories (Written Texts/Spoken Texts)
Finding part-of-speech tags
Querying contracted forms and multiword units
Specifying accented and other special characters
Advanced queries:
- CQL searches
- Regular expression queries
Further notes

Performing a standard query

Type your query (the word or sequence of words you wish to find) into the search box. For sequences of words, just use a space between each word as shown in the screenshot below. For more advanced search patterns, see the sections on Regular expressions and CQL queries.

The Number of hits per page box is used to select the number of hits that are displayed on each page of the concordance. Choosing a larger number slows down the display process. Select 'Count hits' if you don't need to see the hits but only want to know how many there are (including relative frequency).

Press 'Start query' to display the query result.

Restricting a query using metatextual categories

You can choose to restrict the query to the whole spoken or written component of the corpus by opening the drop-down menu for Restriction and changing the selection from 'None (Raw Input)' to 'Spoken texts' or 'Written texts' respectively.

If you need more control over text category restrictions, choose either 'Written texts' or 'Spoken texts' in the left-hand menu bar under Standard query. Here you can select such metatextual categories as the date of publication of the text, the age, sex and other characteristics of the author (if you selected 'Written texts'), the age, sex and so on of the speaker (if you are in 'Spoken texts') or the respondent.

Checked boxes within a category are linked with Boolean or, e.g. if you select both "Applied science" and "Social science" in the "Domain" category, you will find instances of your query which occur either in "Applied science" or in "Social science".
Checked boxes across different categories are linked with Boolean and, e.g. if you select "Social science" in the "Text type" category and "35-44" in the "Age of Author" category, you will find instances occurring in texts whose author is in the "35-44" category and which are of the type "Social science".
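The way checked boxes combine can be sketched in a few lines of Python. This is only an illustration of the or/and logic described above, not BNCweb's actual implementation; the text IDs (T1 to T3) and their metadata are invented for the example.

```python
def text_matches(text_meta, selections):
    """Return True if a text satisfies every category that has checked boxes."""
    return all(
        text_meta.get(category) in checked   # Boolean OR within one category
        for category, checked in selections.items()
        if checked                           # a category with no checked boxes imposes no restriction
    )                                        # all(...) gives Boolean AND across categories


# Hypothetical selection: two boxes in one category, one box in another.
selections = {
    "Domain": {"Applied science", "Social science"},
    "Age of Author": {"35-44"},
}

# Hypothetical texts; IDs and metadata values are made up for this sketch.
texts = [
    {"id": "T1", "Domain": "Social science", "Age of Author": "35-44"},
    {"id": "T2", "Domain": "Applied science", "Age of Author": "60+"},
    {"id": "T3", "Domain": "Imaginative", "Age of Author": "35-44"},
]

matching_ids = [t["id"] for t in texts if text_matches(t, selections)]
print(matching_ids)  # ['T1']
```

Only T1 is kept: T2 fails the "Age of Author" restriction and T3 fails the "Domain" restriction.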

The following screenshot shows how to search for the word lovely in books belonging to the text domain 'Social science' (not all available categories are shown below).

In the 'Written texts' or 'Spoken texts' pages, you can leave the query string box empty and only select metatextual categories. The query result will then be a list of text files corresponding to the selected criteria. This list can be used to create a subcorpus. See Create/edit subcorpora for further details.

Please note: In the 'Spoken texts' query page, there is a difference between categories which apply to Respondents (i.e. the person who recorded the conversation) and those which apply to Speakers. Selecting a Respondent category such as "Age of Respondent: 35-44" will not restrict your query string to utterances produced by respondents within this age range, but rather match all instances of your query string in texts recorded by the corresponding respondents. You may therefore retrieve instances which are spoken by people belonging to a radically different age band.

Note: Due to technicalities to do with the way genre categories are encoded in the BNC headers, it is not possible to restrict searches to specific BNC genres at the outset. Instead, you need to define a subcorpus first, which can then be chosen as a post-query filter (i.e. after an initial search on the whole BNC). You may also find this subcorpus feature useful if you wish to regularly or repeatedly run queries on specific subsets of the BNC (without having to re-specify your metatextual restrictions).

Finding part-of-speech tags

To find a word together with its part-of-speech tag, type the word first, followed by '=', and the tag. E.g.

can=VM0 finds can tagged as a modal auxiliary

can=NN1 finds can tagged as a singular common noun

(Note that the last character is a zero, not the letter O!)

A good way of finding out the different part-of-speech tags for a word is to use the Word lookup function (where you can enter a word and list all POS tags with which it occurs in the corpus).

For a detailed description of the tagset, as well as technical information about the tagging of the BNC World Edition, see Leech and Smith.

N.B. It is not possible to do an initial search for part-of-speech categories on their own - only word-tag combinations can be entered in the search box.

Querying contracted forms and multiword units

Queries containing contracted forms (such as doesn't, you've, he'll, she'd've) or fused forms (e.g. wanna, dunno, gimme) have to be treated with special care. In the BNC contracted forms have been separated into their component parts. To find them in BNCweb, enter them as separate words in the query string box, with the enclitic part enclosed in double quotes, e.g.

does "n't"

he "'ll"

she "'d" "'ve"

Note that each part of a contracted form also has its own part-of-speech tag. This is useful for distinguishing ambiguous forms, like 's and 'd:

she ("'s"=VBZ) (= she is)

she ("'s"=VHZ) (= she has)

Fused forms are broken into component parts, in a way that is rather more arbitrary, as the table below shows. Double quotes are not required when looking for these items. A more complete list of contracted forms and associated tags in BNC2 is provided by Lancaster University.

What you are looking for		Query strings to enter
Contracted form	Full form	First word	Second word*	Third word
they've	they have	`they`	`"'ve"`
he'll	he will	`he`	`"'ll"`
she's	she has	`she`	`("'s"=VHZ)`
she's	she is	`she`	`("'s"=VBZ)`
doesn't	does not	`does`	`"n't"`
won't	will not	`wo`	`"n't"`
ain't	?	`ai`	`"n't"`
gimme	give me	`gim`	`me`
gonna	going to	`gon`	`na`
innit	isn't it	`in`	`n`	`it`

*If you want to add a POS tag to an item enclosed in quotation marks, you must put parentheses around it.
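As a rough sketch of the splitting rule for contracted forms, the following Python helper turns a contracted form into a query string of the kind shown in the table. It is illustrative only: the function name `contraction_query` and the `ENCLITICS` list are assumptions made for this example, the list covers only the enclitics mentioned in this section, and fused forms such as gimme or gonna are not handled because their splitting is arbitrary.

```python
# Enclitics taken from the examples in this section (assumed list, not exhaustive).
ENCLITICS = ("n't", "'ll", "'ve", "'d", "'s")


def contraction_query(form, tags=None):
    """Build a BNCweb-style query string for a contracted form (hypothetical helper).

    tags optionally maps an enclitic to a POS tag, e.g. {"'s": "VHZ"}.
    """
    tags = tags or {}
    enclitics_found = []
    rest = form
    stripped = True
    while stripped:                      # strip enclitics from the right, one at a time
        stripped = False
        for enc in ENCLITICS:
            if rest.endswith(enc) and len(rest) > len(enc):
                enclitics_found.insert(0, enc)
                rest = rest[: -len(enc)]
                stripped = True
                break

    pieces = [rest]
    for enc in enclitics_found:
        quoted = f'"{enc}"'              # enclitic parts go in double quotes
        if enc in tags:
            quoted = f"({quoted}={tags[enc]})"   # quoted item plus tag needs parentheses
        pieces.append(quoted)
    return " ".join(pieces)


print(contraction_query("doesn't"))                 # does "n't"
print(contraction_query("she'd've"))                # she "'d" "'ve"
print(contraction_query("she's", {"'s": "VHZ"}))    # she ("'s"=VHZ)
```

The helper expects a plain ASCII apostrophe (') in its input; typographic apostrophes would need to be normalised first.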

The reverse situation is found for so-called Multiword units: several orthographic words are given a single POS tag and they are consequently treated as a single unit. Further information on the principles of tokenization can be found in the Automatic POS-tagging manual and a complete list of multiword units is found here. Some examples are:

What you are looking for	Query strings to enter
Multiword unit	Query string	POS-tagged
a bit	`"a bit"`	`"a bit"=AV0`
even though	`"even though"`	`"even though"=CJS`
ad hoc	`"ad hoc"`	`"ad hoc"=AJ0 "ad hoc"=AV0`
hocus pocus	`"hocus pocus"`	`"hocus pocus"=NN1`
in front of	`"in front of"`	`"in front of"=PRP`
lo and behold	`"lo and behold"`	`"lo and behold"=ITJ`
no matter what	`"no matter what"`	`"no matter what"=DTQ`
no one	`"no one"`	`"no one"=PNI`

Note that in some contexts a word-sequence may function as a multiword unit, while in others it does not. Cf.

"which they would not have done <w PRP>but for the presence of the police"

with

"<w CJC>But <w PRP>for years now darkness has been growing"

Quotation marks are not required when looking for the latter.

Note: A search for front will also retrieve all instances of front which are part of the multiword unit in front of. This was not the case for the version of the SARA server distributed with the BNC1 release.

Specifying accented and other special characters

A number of accented and other "special" characters occur in the BNC. While they are usually displayed correctly in the browser windows, it is important to remember that if any character with diacritics (such as é, ö, Ü, â etc.) or other special character is part of your query string, the word containing it has to be enclosed in double quotation marks. Thus to look for occurrences of Zürich in the corpus, remember to enter

"Zürich"

and not

Zürich

in the query box. If you forget to put in the quotation marks, the first special character is likely to terminate the query string, causing undesired results.
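If query strings are assembled by a script before being pasted into the query box, the quoting rule can be applied automatically. The sketch below assumes that "special" means any non-ASCII character, which is a simplification of the rule above; the function names are hypothetical.

```python
def quote_if_special(word):
    """Enclose a word in double quotes if it contains a non-ASCII character.

    Assumption: every non-ASCII character counts as "special" here.
    """
    return word if word.isascii() else f'"{word}"'


def build_query(words):
    """Join words with single spaces, quoting those that need it."""
    return " ".join(quote_if_special(w) for w in words)


print(quote_if_special("Zürich"))       # "Zürich"
print(build_query(["in", "Zürich"]))    # in "Zürich"
```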

Advanced queries

BNCweb allows more complex queries than those already discussed. It can, for example, find partly-specified word patterns, sequences with optional characters or words, and words occurring in proximity with one another:

CQL searches

CQL is short for Corpus Query Language, SARA's own internal command language. Any query conforming to CQL syntax can be entered in the search box. This can be used to refine a query in various ways:

Using CQL to find words in particular contexts

Query string	Desired effect
`(dog|cat)`	finds either the word `dog` or the word `cat`
`$Dog`	finds the word `Dog` (but not `dog`)
`@dog`	finds the word `dog` in headers as well as in texts
`(cat)_(dog)`	finds three-word phrases of which the first word is `cat` and the last is `dog`
`cat*dog`	finds occurrences of `cat` followed anywhere within the same document by `dog`
`cat#dog`	finds occurrences of `cat` followed or preceded by `dog` anywhere within the same document
`cat*dog/10`	finds occurrences of `cat` followed by `dog` within ten words
`cat*dog/<s>`	finds occurrences of `cat` followed by `dog` within a single s element
`cat*dog/<u>`	finds occurrences of `cat` followed by `dog` within a single u element

Note: CQL syntax currently does not allow several scopes to be combined, as for example in "(cat#dog/10)/<s>", which is an attempt to find occurrences of cat followed or preceded by dog within ten words and within the same <s>-unit.

Using CQL to search a restricted set of texts

Using a special query prefix, you can instruct the program to search one or more texts whose three-character file ID (e.g. G42) you already know. The following example tells BNCweb to look for instances of the string want in the BNC documents G42 and AKX only:

<bncDoc id="G42" id="AKX">#(want)

There is a limit for the length of this type of query string (about 10 text IDs). As a consequence, it is not possible to use this query syntax for searching hundreds of texts at the same time. Use the Subcorpus feature for this instead.
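When a list of text IDs is already available in a script, the prefix can be generated rather than typed by hand. The following Python sketch builds the query string in the format shown above; the function name is hypothetical, and the cut-off of 10 IDs is an assumption based on the approximate limit mentioned in this section, not an exact value.

```python
MAX_IDS = 10  # assumption: the manual only states a limit of "about 10 text IDs"


def restrict_to_texts(word, text_ids, max_ids=MAX_IDS):
    """Build a query for `word` restricted to the given BNC text IDs (hypothetical helper)."""
    if not text_ids:
        raise ValueError("at least one text ID is required")
    if len(text_ids) > max_ids:
        raise ValueError("too many text IDs; use the Subcorpus feature instead")
    attributes = " ".join(f'id="{text_id}"' for text_id in text_ids)
    return f"<bncDoc {attributes}>#({word})"


print(restrict_to_texts("want", ["G42", "AKX"]))
# <bncDoc id="G42" id="AKX">#(want)
```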

Users more familiar with CQL and the BNC text category codes may also prefer to express other types of queries in the compact notation of CQL:

<catRef target=wrimed1>#<catRef target=wridom4>#(lovely)

This will produce the same result as the query shown in the screenshot above (under Restricting a query using metatextual categories), i.e. it will restrict searches of lovely to books within the social science domain.

Regular expression queries

Queries in BNCweb can be made in the form of regular expressions (or patterns). Some examples are:

{critici[sz]e} finds criticise and criticize

{in(ter)?dependent} finds independent and interdependent

{spr[^eo]ngs?i?n?g?} finds spring, springs, sprang, sprung, and springing.

There are countless introductions to regular expressions on the internet. Although not everything will work in BNCweb, the basics will certainly be supported.

PLEASE NOTE that regular expression queries must be put within curly brackets. There is a certain overlap between CQL-syntax (see above) and regular expression syntax (e.g. the use of the vertical bar (|) for indicating alternation). As a consequence, some regular expressions may produce the required result without having been placed within curly brackets. In most cases, however, you are more likely to see a syntax error message...
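The three example patterns can be tried out locally before running them on the server. The sketch below uses Python's `re` module, which is not the engine used by BNCweb, so it only checks the basic syntax that the two are assumed to share; the helper `matches` is hypothetical and simply strips the curly brackets before testing a whole word.

```python
import re


def matches(bncweb_pattern, word):
    """Test a whole word against a pattern written in BNCweb's curly-bracket form."""
    if not (bncweb_pattern.startswith("{") and bncweb_pattern.endswith("}")):
        raise ValueError("regular expression queries must be put within curly brackets")
    return re.fullmatch(bncweb_pattern[1:-1], word) is not None


examples = {
    "{critici[sz]e}": ["criticise", "criticize"],
    "{in(ter)?dependent}": ["independent", "interdependent"],
    "{spr[^eo]ngs?i?n?g?}": ["spring", "springs", "sprang", "sprung", "springing"],
}

for pattern, words in examples.items():
    for word in words:
        print(pattern, word, matches(pattern, word))  # prints True for every pair
```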

Introduction

Regular Expressions are useful for searching text. They can be used in the same way as the simple Find command in a word processor.

1) Example: hit, hat, hot, hut, shut, etc.

Simple case: each character stands for itself. Easy but tedious!

Unlike the Find... command, Regular Expressions allow a search for variable patterns.

2) Example: h.t

Here the full-stop does not stand for itself. The full-stop indicates to the computer that any character may be encountered at that position. This means that the pattern will find hit, hat, hot, hut, etc. The full-stop in the pattern is an atomic regular expression; i.e. it stands for one (1!) character, which in this case may be any character.

Atomic Regular Expressions

An atomic regular expression stands for exactly one character. The full-stop in the example above is an atomic regular expression. Thus the regular expression h.t does not match words like heat: the pattern only describes text in which the letter h is followed by any one character and then the letter t, whereas heat has two characters between h and t.

Character Sets

Often it doesn't make sense to allow all characters. We might for example be interested in vowels only. It is possible to require character sets.

3) Example: h[aeiou]t

Character sets are written as a list of characters in square brackets. We might also encounter cases in which we would like to express our wish to allow any character except one. It would be very tedious to enumerate all other characters.

4) Example: h[^ae]t

The caret (^) negates the character set. Thus [^ae] will match all characters except a or e.

The minus sign (-) may be used to indicate a range of consecutive ASCII characters; for example, [0-9] is equivalent to [0123456789].
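The difference between the full-stop and the various character sets can be seen by testing a few strings. This sketch uses Python's `re` module as a stand-in (an assumption; BNCweb uses its own engine), and the word list is made up for the example.

```python
import re

# Made-up test strings; "h9t" is included only to show what the digit range matches.
words = ["hit", "hat", "hot", "hut", "het", "heat", "h9t"]


def hits(pattern):
    """Return the test strings that match the pattern as a whole."""
    return [w for w in words if re.fullmatch(pattern, w)]


print(hits("h.t"))        # ['hit', 'hat', 'hot', 'hut', 'het', 'h9t']
print(hits("h[aeiou]t"))  # ['hit', 'hat', 'hot', 'hut', 'het']
print(hits("h[^ae]t"))    # ['hit', 'hot', 'hut', 'h9t']
print(hits("h[0-9]t"))    # ['h9t']
```

In none of the four cases is heat matched, because each pattern allows exactly one character between h and t.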

Quantification

We have noted above that the word heat would not be matched. To include such cases it is possible to specify the number of times an expression is allowed to occur.

5) Example: h[aeiou]+t

The plus-sign means that the expression to its left may occur 1 or more times. It will therefore also match heat, hoooooooooot but not ht.

Quantifier	Explanation
*	Match 0 or more times
+	Match 1 or more times
?	Match 1 or 0 times
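The effect of the three quantifiers on the same character set can be compared side by side. As before, Python's `re` module is used as a stand-in for illustration, and the candidate strings are invented.

```python
import re

candidates = ["ht", "hat", "heat", "hoooot"]


def hits(pattern):
    """Return the candidates that match the pattern as a whole."""
    return [c for c in candidates if re.fullmatch(pattern, c)]


print(hits("h[aeiou]*t"))  # ['ht', 'hat', 'heat', 'hoooot']  (0 or more vowels)
print(hits("h[aeiou]+t"))  # ['hat', 'heat', 'hoooot']        (1 or more vowels)
print(hits("h[aeiou]?t"))  # ['ht', 'hat']                    (0 or 1 vowel)
```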

Complex Regular Expressions

Brackets can be used to group several atomic regular expressions into one regular expression. This is especially useful for quantification.

6) Example: (green|yellow|mauve|red)

This is the way to express alternation. It finds either green or yellow or ....

Literal vs. Meta Meaning of Characters

Alphanumeric characters (i.e. letters and digits) always stand for themselves. To use them as expressions meaning something else they must be preceded by a backslash (\). Other characters may or may not stand for themselves. It is therefore safer to precede them with a backslash to avoid undesired results.
The following characters must be preceded by a backslash to stand for themselves:

+ ? . * ^ $ ( ) [ ] { } | \
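Escaping can be automated when a search term comes from elsewhere and should be matched literally. The helper below is a minimal sketch that puts a backslash in front of exactly the characters listed above; its name is hypothetical, and Python's `re` module is again used only to demonstrate the effect.

```python
import re

# The characters listed above that need a backslash to stand for themselves.
META_CHARACTERS = set("+?.*^$()[]{}|\\")


def escape_literal(text):
    """Put a backslash in front of every listed meta character (hypothetical helper)."""
    return "".join("\\" + ch if ch in META_CHARACTERS else ch for ch in text)


print(escape_literal("3.5"))                              # 3\.5
print(re.fullmatch("3.5", "3x5") is not None)             # True: the bare full-stop matches any character
print(re.fullmatch(escape_literal("3.5"), "3x5") is not None)  # False: the escaped full-stop is literal
```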

Notes

Some queries take a very long time to be displayed (e.g. longer than 10 minutes). If this is the case, do not start the same query for a second time as this will only slow down your own query and that of other users on the same server. Also: once you have started a query, clicking the 'Stop' button of your browser will not stop the query from being executed on the server - the results are simply not displayed by your browser.

Regarding searches for POS-categories, note that it is not possible to do a concordance on POS-tags at the outset, independently of a lexical item. You cannot for instance retrieve a concordance of all the adjectives, or all the singular nouns, in the corpus. You can however get a list of all the words that match a particular POS-tag, using the Frequency lists function. The grammatical tagging of the corpus is not 100% error-free. For information on the reliability of the tagging of different POS-categories, see Leech and Smith (2000).
