Automated Methods for Identifying Causation in Verbs and the Large-Scale Structure of the Lexicon

Philip Wolff, Emory University

A common strategy for determining whether a word or phrase expresses the concept of CAUSE is to consult intuitions about the meaning of the verb cause. Recent developments in computational linguistics offer a new approach to this problem that could help clarify which verbs do and do not encode this conceptual component. The meanings of words and sentences can be extracted, in part, by using word-embedding methods such as Word2vec (Mikolov et al., 2013). Such techniques make it possible to view the large-scale structure of the lexicon (N = 6,159 verbs) and to automatically identify major classes of verbs in a manner that accords quite well with human judgments (N = 200). However, such techniques cannot, on their own, identify components of meaning such as CAUSE, because such components, as determined by human judgments (N = 800), are cross-cutting: they appear in the meanings of verbs from many verb categories across the lexicon. Identifying cross-cutting components of meaning requires a procedure for automatically generating word definitions. This can be accomplished using a stochastic gradient descent optimization technique that finds the set of word embeddings (e.g., send, receive, have) that, when properly weighted, produce a vector that is maximally similar to a target verb vector (e.g., give). According to human judges (N = 200), 838 of the 3,000 most common verbs in English contain the component CAUSE. The optimization procedure was able to identify 80% of these verbs without any reference to human judgments. Importantly, the same procedure can be applied to identify the presence of causation in phrases or sentences, as well as the presence of other conceptual components, such as CONTACT, MANNER, and CHANGE. The results from these analyses offer an initial glimpse into how the semantic contents of words and phrases might be derived from the statistical properties of text.
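
The weighted-definition idea described above can be illustrated with a minimal sketch. It is not the procedure used in this work: it assumes cosine similarity as the similarity measure, uses plain full-batch gradient ascent rather than stochastic gradient descent, and runs on synthetic random vectors instead of real Word2vec embeddings. The function names (cosine, fit_definition), the step size, the distractor words (run, sleep), and the "true" weights used to build the toy target are all hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def fit_definition(target, candidates, steps=2000, lr=0.05):
    """Find weights w such that w @ candidates is maximally
    cosine-similar to target, by full-batch gradient ascent.
    (Assumption: a simplified stand-in for the stochastic
    gradient descent procedure mentioned in the text.)"""
    n = candidates.shape[0]
    w = np.full(n, 1.0 / n)                  # start from equal weights
    t = target / np.linalg.norm(target)      # unit-length target
    for _ in range(steps):
        v = w @ candidates                   # current weighted combination
        norm_v = np.linalg.norm(v) + 1e-12
        cos = (v @ t) / norm_v
        # Gradient of cos(v, t) with respect to v, then chain rule to w.
        grad_v = (t - cos * v / norm_v) / norm_v
        w = w + lr * (candidates @ grad_v)
    return w

# Toy data: random vectors standing in for real embeddings (assumption).
rng = np.random.default_rng(0)
words = ["send", "receive", "have", "run", "sleep"]  # last two are hypothetical distractors
candidates = rng.normal(size=(len(words), 50))

# Hypothetical "give" vector built from the first three words only.
true_weights = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
target = true_weights @ candidates

weights = fit_definition(target, candidates)
ranked = sorted(zip(words, weights), key=lambda p: -abs(p[1]))
for word, weight in ranked:
    print(f"{word:8s} {weight:+.3f}")
print("similarity to target:", round(cosine(weights @ candidates, target), 4))
```

Because cosine similarity ignores vector length, only the relative sizes of the recovered weights are meaningful; in this toy setting the words that were used to build the target should receive the largest weights, and the distractors should receive weights near zero.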