BNCweb manual: Distribution

Distribution

The Distribution feature offers easy access to descriptive statistics concerning the distribution of your query result over the various metatextual categories encoded in the BNC. It saves you from having to perform several (or even dozens of) individual queries for a 'manual' compilation of the required statistics.

Performing a distribution analysis

In the BNC Query Result window, select Distribution from the drop-down menu and press 'Go!'. The following screenshot shows the results window after running a query for the word lovely:

After the calculation has finished, a page showing the distribution of your query result over several major categories is displayed. Apart from the number of hits found in the individual categories, the distribution feature also reports the total number of words contained in each category and the corresponding relative frequency of your query result. Relative frequency information makes it possible to compare data for different categories directly. This is shown in the following screenshot: lovely is found more often in the written component (3,641 hits vs. only 2,397 hits in the spoken component), but the relative frequencies reveal that lovely is in fact more than five times as frequent in the spoken data (232 vs. 42 instances per million words).
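The relative frequency described above is simply the number of hits scaled to a common base of one million words. The following minimal sketch illustrates the arithmetic; the function name per_million is hypothetical, and the category sizes are rounded, assumed values rather than the exact word counts BNCweb reports, so the printed figures will differ slightly from those in the screenshot.

```python
def per_million(hits: int, total_words: int) -> float:
    """Return a relative frequency in instances per million words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return hits * 1_000_000 / total_words


# Hit counts for "lovely" are taken from the text above.
HITS = {"spoken": 2_397, "written": 3_641}
# Assumption: rounded category sizes, for illustration only.
ASSUMED_WORDS = {"spoken": 10_000_000, "written": 88_000_000}

relative = {cat: per_million(HITS[cat], ASSUMED_WORDS[cat]) for cat in HITS}
ratio = relative["spoken"] / relative["written"]

for cat, value in relative.items():
    print(f"{cat}: {value:.1f} instances per million words")
print(f"spoken/written ratio: {ratio:.1f}")
```

Because both values are expressed on the same base, their ratio can be read directly, even though the raw hit counts point in the opposite direction.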

Clicking on the number of hits in the third column will display all the sentences contained in the relevant category.

The upper drop-down menu at the top of the page allows you to choose from a list of other categories encoded in the header of the BNC texts. For example, the category "Type of author" ('Sole', 'Corporate' or 'Multiple') is not shown on the default general information page. Select the desired category and press 'Show distribution'.

The lower drop-down menu can be used to display a crosstabulation of two different categories. This makes it possible, for example, to investigate the influence of the variables 'age' and 'sex' on the use of lovely in spoken English. As the following screenshot shows, men in general use lovely less often than women (166 vs. 495 instances per million words, or pmw). Men between the ages of 24 and 35 use it most frequently, but the differences between the male age groups are small (between 144 and 189 instances pmw). For the female speakers, however, there is a clear connection between age and the use of lovely: women aged 65 and over show a frequency of 779 instances pmw. This is more than three times as high as the frequency for the youngest age group (250 instances pmw). Without crosstabulation of the data, such a correspondence might have gone unnoticed.
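Conceptually, a crosstabulation computes one relative frequency per combination of two categories. The sketch below shows the idea in plain Python; it is not how BNCweb itself is implemented, the function name crosstab_pmw is hypothetical, and every number in RECORDS is invented for illustration and is not a BNC figure.

```python
from collections import defaultdict

# Hypothetical records: (sex, age group, hits, words spoken by that group).
# All numbers are invented for illustration and are not BNC figures.
RECORDS = [
    ("male", "younger", 30, 200_000),
    ("male", "older", 45, 300_000),
    ("female", "younger", 50, 200_000),
    ("female", "older", 240, 300_000),
]


def crosstab_pmw(records):
    """Build a nested dict {sex: {age: instances per million words}}."""
    hits = defaultdict(lambda: defaultdict(int))
    words = defaultdict(lambda: defaultdict(int))
    for sex, age, n_hits, n_words in records:
        hits[sex][age] += n_hits
        words[sex][age] += n_words
    return {
        sex: {age: hits[sex][age] * 1_000_000 / words[sex][age] for age in ages}
        for sex, ages in hits.items()
    }


table = crosstab_pmw(RECORDS)
for sex, row in table.items():
    cells = ", ".join(f"{age}: {pmw:.0f} pmw" for age, pmw in row.items())
    print(f"{sex}: {cells}")
```

In this toy data the age effect appears only in the female row, which is the kind of pattern that a single-category distribution would blur.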

File frequency information

Relative frequencies are useful for comparing different categories, but they can also be misleading, because some words are highly genre-specific (and some specialized vocabulary may in fact be restricted to a single text). For example, the word homoeopathic has 216 occurrences in the whole BNC, which corresponds to 2.21 instances per million words. But how evenly is it distributed across texts? This question can be answered by selecting File frequency information in the upper drop-down menu. The result is shown below:

Over 75% of all occurrences of homoeopathic are found in a single text. As the File information page reveals, this is an extract from a text entitled "Homeopathy for everyone".
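The check behind this observation is to divide the hits in the text with the most occurrences by the total number of hits. A minimal sketch follows, assuming the per-text hit counts have already been collected; the function name top_text_share is hypothetical, and the text IDs and counts are invented placeholders that do not reproduce the homoeopathic result.

```python
from collections import Counter

# Hypothetical per-text hit counts; the text IDs and numbers are invented
# placeholders and do not reproduce the BNC result described above.
hits_per_text = Counter({"TEXT_A": 80, "TEXT_B": 10, "TEXT_C": 6, "TEXT_D": 4})


def top_text_share(counts: Counter):
    """Return (text_id, share) for the text holding the most hits."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no hits to analyse")
    text_id, n_hits = counts.most_common(1)[0]
    return text_id, n_hits / total


text_id, share = top_text_share(hits_per_text)
print(f"{text_id} holds {share:.0%} of all hits in {len(hits_per_text)} texts")
```

A share close to 1 signals that an overall relative frequency mostly reflects one text rather than general usage.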

Notes

A word of warning: distribution data is often fascinating, both because it can confirm our intuitions about language use (e.g. old ladies being prone to say lovely) and because it sometimes produces unexpected results that run counter to our perceptions of language use (e.g. the fact that the word car is said more frequently by women than by men in the spoken component of the BNC). But there is a danger in relying too much on such data: it is nothing but raw data. It is the task of the corpus linguist to describe such data, but a meaningful analysis must necessarily go beyond listing relative frequencies. For example, lovely may have different pragmatic functions in different contexts. Consider the following sentence, which contains one of the 40 instances of lovely uttered by young girls:
KP3:2842 Oh, <name> oh, he's lovely, and <pause> disgusting. <unclear>.
Is this kind of ironic use also found in the utterances of women aged 65+? Information like that obviously cannot be found in frequency tabulations. Corpus linguistics almost inevitably requires 'manual' work in order to arrive at meaningful interpretations. BNCweb reduces some of the necessary manual work, but it cannot replace it.