English dictionaries in the age of the internet (II)

Still, the dream lingered. What if one could get to 100% – lassoing the whole of English, from the beginning of written time to the present day? Numerous revisions of, or rivals to, Johnson were proposed, though few were actually created. After a Connecticut schoolteacher named Noah Webster published his American Dictionary of the English Language in 1828 (70,000 entries), British pride was once again at stake.

In November 1857, the members of the London Philological Society convened to hear a paper by Richard Chenevix Trench, the dean of Westminster, entitled “On Some Deficiencies in our English Dictionaries”. It was a bombshell: Trench argued that British word banks were so unreliable that the slate needed to be wiped clean. In their place, he outlined “a true idea of a Dictionary”. This Platonic resource should be compiled on scholarly historical lines, mining deep into the caverns of the language for ancient etymology. It should describe rather than prescribe, casting an impartial eye on everything from Anglo-Saxon monosyllables to the latest technical jargon (though Trench drew the line at regional dialect). Most of all, it should be comprehensive, honouring what Trench called – glancing jealously at Germany, where the brothers Grimm had recently started work on a Deutsches Wörterbuch – “our native tongue”.

The quest to capture the language in its entirety may have been centuries old, but, like a great railway line or bridge, this new dictionary would be thoroughly Victorian: scientific, audacious, epic and hugely expensive. Building it was a patriotic duty, Trench insisted: “A dictionary is a historical monument, the history of a nation”.

For the first two decades, the New English Dictionary, as it was called, looked as if it would go the way of so many previous projects. The first editor died a year in, leaving chaos in his wake. The second had more energy for young women, socialism, folksong and cycling than for lexicography. Only after it was taken over by Oxford University Press, who in 1879 were persuaded to appoint a little-known Scottish schoolteacher and philologist called James Murray as chief editor, did things begin to move.

The first part, A to Ant, was published in 1884, and instalments emerged at regular intervals for the next 40-odd years. Although Murray died in 1915 – somewhere between “Turndun” and “Tzirid” – the machine churned on. In 1928, the finished dictionary was at last published: some 414,800 headwords and phrases in 10 volumes, each with a definition and etymology, supported by 1.8m quotations tracking usage over time.

It was one of the largest books ever made, in any language: had you laid the metal type used end to end, it would have stretched from London to Manchester. Sixty years late it may have been, but the publisher made the most of the achievement, trumpeting that “the Oxford Dictionary is the supreme authority, and without a rival”.

Yet if you knew where to look, its flaws were only too obvious. By the time it was published in 1928, this Victorian leviathan was already hopelessly out of date. The A-C entries were compiled nearly 50 years earlier; others relied on scholarship that had long been surpassed, especially in technology and science. In-house, it was admitted that the second half of the alphabet (M-Z) was stronger than the first (A-L); the letter E was regarded as especially weak. Among other eccentricities, Murray had taken against “marzipan”, preferring to spell it “marchpane”, and decreed that the adjective “African” should not be included, on the basis that it was not really a word. “American”, however, was, for reasons that reveal much about the dictionary’s lofty Anglocentric worldview.

The only solution was to patch it up. The first Supplement to the OED came out in 1933, compiling new words that editors had noted in the interim, as well as original omissions. Supplements to that Supplement were begun in 1957, eventually appearing in four instalments between 1972 and 1986 – some 69,300 extra items in all. Yet it was a losing battle, or a specialised form of Zeno’s paradox: the closer that OED lexicographers got to the finish line, the more distant that finish line seemed to be.

At the same time, the ground beneath their feet was beginning to give way. By the late 1960s, a computer-led approach known as “corpus linguistics” was forcing lexicographers to re-examine their deepest assumptions about the way language operates. Instead of making dictionaries the old-fashioned way – working from pre-existing lists of words/definitions, and searching for evidence that a word means what you think it does – corpus linguistics turns the process on its head: you use digital technology to hoover up language as real people write and speak it, and make dictionaries from that. The first modern corpus, the Brown Corpus of Standard American English, was compiled in 1964 and included 1m words, sampled from 500 texts including romance novels, religious tracts and books of “popular lore” – contemporary, everyday sources that dictionary-makers had barely consulted, and which it had never been possible to examine en masse. The general-language corpora that provide raw material for today’s dictionaries contain tens of billions of words, a database beyond the wildest imaginings of lexicographers even a generation ago.

There are no limits to the corpora that can be constructed: at a corpus linguistics conference in Birmingham last year, I watched researchers eavesdrop on college-age Twitter users (emojis have long since made “laughter forms” such as LOL and ROFL redundant, apparently) and comb through English judges’ sentencing remarks for evidence of gender bias (all too present).

For lexicographers, what’s really thrilling about corpus linguistics is the way it lets you spy on language in the wild. Collating the phrases in which a word occurs enables you to unravel different shades of meaning. Observing how a word is “misused” hints that its centre of gravity might be shifting. Comparing representative corpora lets you see, for example, how often Trump supporters deploy a noun such as “liberty”, and how differently the word is used in the Black Lives Matter movement. “It’s completely changed what we do,” the lexicographer Michael Rundell told me. “It’s very bottom-up. You have to rethink almost everything.”
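The two operations described here, collating the phrases in which a word occurs and comparing how often it turns up in different corpora, can be sketched in a few lines of Python. What follows is a minimal illustration, not the tooling any dictionary publisher actually uses: the function names, the crude tokeniser and the two one-sentence sample "corpora" are all invented for the purpose, and real corpora run to billions of words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens (a crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def per_million(tokens, keyword):
    """Occurrences of keyword per million tokens, so corpora of different sizes can be compared."""
    if not tokens:
        return 0.0
    return Counter(tokens)[keyword] * 1_000_000 / len(tokens)


# Two made-up snippets standing in for representative corpora.
corpus_a = tokenize("Liberty for all. We defend liberty and property.")
corpus_b = tokenize("No justice, no peace. Liberty means nothing without justice.")

for left, word, right in concordance(corpus_a, "liberty", width=2):
    print(f"{left:>12} [{word}] {right}")

print(per_million(corpus_a, "liberty"), per_million(corpus_b, "liberty"))
```

Lining up a word with its immediate neighbours in this way is usually called a keyword-in-context concordance; normalising counts per million words is what makes a small corpus comparable with a large one.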

But while other dictionary publishers leapt on corpus linguistics, OED editors stuck to what they knew, resisting computerisation and relying on quotation slips and researchers in university libraries. In the 1970s and 80s there was little thought of overhauling this grandest of historical dictionaries, let alone keeping it up to date: it was as much as anyone could do to plug the original holes. When the OED’s second edition was published in March 1989 – 20 volumes, containing 291,500 entries and 2.4m quotations – there were complaints that this wasn’t really a new edition at all, just a nicely typeset amalgam of the old ones. The entry for “computer” defined it as “a calculating-machine; esp an automatic electronic device for performing mathematical or logical operations”. It was illustrated by a quotation from an 1897 journal.

By astonishing coincidence, another earthquake, far bigger, struck the very same month that OED2 appeared in print: a proposal by an English computer scientist named Tim Berners-Lee for “a large hypertext database with typed links”. The world wide web, as it came to be called (OED dates the phrase to 1990), offered a shining path to the lexicographical future. Databases could be shared, and connected to one another; whole libraries of books could be scanned and their contents made searchable. The sum of human text was starting to become available to anyone with a computer and a modem.

The possibilities were dizzying. In a 1989 article in the New Yorker, an OUP executive said, with a shiver of excitement, that if the dictionary could incorporate corpus linguistics resources properly, something special could be achieved: “a Platonic concept – the ideal database”. It was the same ideal laid out by Richard Chenevix Trench 132 years before: the English language over a thousand or more years, every single word of it, brought to light.

The fact that so much text is now available online has been the most cataclysmic change. Words that would previously have been spoken are now typed on social media. Lexicographers of slang have long dreamed of being able to track variant forms “down to the level, say, of an individual London tower block”, says the slang expert and OED consultant Jonathon Green; now, via Facebook or Instagram, this might actually be possible. Lexicographers can be present almost at the moment of word-birth: where previously a coinage such as “mansplain” would have had to find its way into a durable printed record, which a researcher could use as evidence of its existence, it is now available near-instantly to anyone.

Anyone, and anywhere – when the OED was first dreamed up in the 1850s, English was a language of the British Isles, parts of North America, and a scattering of colonies. These days, nearly a quarter of the world’s population, 1.5bn people, speak some English, mostly as a second language – except, of course, that it isn’t one language. There are myriad regional variants, from the patois spoken in the West Indies and Pidgin forms of West Africa to a brood of compound offspring – Wenglish (Welsh English), Indlish or Hinglish (Indian/Hindi English), and the “Chinglish” of Hong Kong and Macau. All of these Englishes are more visible now than ever, each cross-fertilising others at greater and greater speed.

“The circle of the English language has a well-defined centre but no discernible circumference,” James Murray once wrote, but modern lexicographers beg to differ. Instead of one centre, there are many intersecting subgroups, each using a variety of Englishes, inflected by geographical background or heritage, values, other languages, and an almost incalculable number of variables. And the circumference is expanding faster than ever. If OED lexicographers are right that around 7,000 new English words surface annually – a mixture of brand-new coinages and words the dictionary has missed – then in the time you’ve been reading this, perhaps two more words have come into being.
