Content and Language Integrated Learning

What is Lexical Dark Matter?

May 13, 2018



With the help of a new online tool developed by Google, scientists at Harvard University say they have found a way to identify cultural trends over the past 200 years using a database of 5 million digitized texts, containing about 4% of all books ever printed.

A recent article (free to read if you register) quoted scientists at Harvard University as saying:

“We estimated that 52% of the English lexicon – the majority of words used in English books – consists of lexical ‘dark matter’ undocumented in standard references.”

In other words, they found more words in use than appear in any dictionary.

This gap between dictionaries and the lexicon reflects a trade-off between high-frequency and low-frequency words. Many words are obscure and rarely used, and lexicographers fittingly set reasonably high standards for which words are included in dictionaries. A dictionary must be comprehensive enough to be a useful reference but concise enough to be usable. As a result, many infrequent words in the lexiverse are omitted and have no dictionary definition.
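To make the idea concrete, here is a minimal sketch of how one might measure the share of "dark matter" in a body of text. It is not the researchers' method: the function name `dark_matter_share`, the `min_count` cut-off, the sample sentence, the toy dictionary, and the invented word "blorft" are all assumptions made up for illustration.

```python
import re
from collections import Counter


def dark_matter_share(text, dictionary, min_count=1):
    """Return the fraction of distinct words in `text` that are missing from `dictionary`.

    `min_count` is a hypothetical cut-off: a word only counts as part of
    the lexicon if it is used at least that many times.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    lexicon = {word for word, count in counts.items() if count >= min_count}
    if not lexicon:
        return 0.0
    undocumented = lexicon - set(dictionary)
    return len(undocumented) / len(lexicon)


# Toy data, invented for this example.
sample_text = "the cat sat on the mat while the blorft slept"
toy_dictionary = {"the", "cat", "sat", "on", "mat", "while", "slept"}

share = dark_matter_share(sample_text, toy_dictionary)
print(f"{share:.1%} of the words used are not in the dictionary")
```

Raising `min_count` mimics the lexicographer's trade-off described above: the stricter the cut-off, the fewer rare words are treated as part of the lexicon at all.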

Writers are constantly adding to the lexical dark matter of the linguistic universe, either by writing about things so new that the terms used to discuss them are still relatively unknown, or simply through slang or made-up jargon. Nearly half of these new words are not included in any dictionary; they are dubbed lexical “dark matter” and will remain dormant unless they enter common parlance in the future. The authors found, for example, that dictionaries are unable to keep up with advances in the language, often failing to add a new word or delaying its inclusion until it is already on the decline. Anybody can coin a neologism or create an evocative sense of a word, but it takes more than use by a couple of dozen academics or geeks to get it into “The Dictionary”.


Analysis of this corpus will enable researchers to investigate cultural trends quantitatively, which they say can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology.

“Our results suggest that culturomic tools will aid lexicographers in at least two ways: (i) finding low-frequency words that they do not list; and (ii) providing accurate estimates of current frequency trends to reduce the lag between changes in the lexicon and changes in the dictionary.”

Below are just some of the questions that can now be answered using this tool, which looks at words through time (measured by mentions in books).

Which words have been censored in history?

When did the use of certain words become popular?

Do words decline or increase within popular culture?

How many words in the English language are not listed in dictionaries?

How do certain words influence the collective memory?

How does this affect or constrict the collective thought process?

To keep up with changes in the dictionary, check out the new words and the new definitions for existing words that have just been added.

