Contextual compositionality detection with external knowledge bases and word embeddings

When the meaning of a phrase cannot be inferred from the individual meanings of its words (e.g., hot dog ), that phrase is said to be non-compositional . Automatic compositionality detection in multi-word phrases is critical in any application of semantic processing, such as search engines [9]; failing to detect non-compositional phrases can hurt system effectiveness notably. Existing research treats phrases as either compositional or non-compositional in a deterministic manner. In this paper, we operationalize the view-point that compositionality is contextual rather than deterministic, i.e., that whether a phrase is compositional or non-compositional depends on its context. For example, the phrase “green card” is compositional when referring to a green colored card, whereas it is non-compositional when meaning permanent residence authorization. We address the challenge of detecting this type of contextual compositionality as follows: given a multi-word phrase, we enrich the word embedding representing its semantics with evidence about its global context (terms it often collocates with) as well as its local context (narratives where that phrase is used, which we call usage scenarios ). We further extend this representation with information extracted from external knowledge bases. The resulting representation incorporates both localized context and more general usage of the phrase and allows to detect its compositionality in a non-deterministic and contextual way. Empirical evaluation of our model on a dataset of phrase compositionality 1 , manually collected by crowdsourcing contextual compositionality assessments, shows that our model outperforms state-of-the-art baselines notably on detecting phrase compositionality.


INTRODUCTION
Automatic compositionality detection refers to the automatic assessment of the extent to which the meaning of a multi-word phrase is decomposable into the meanings of its constituents words and their combination. For example, while brown dog is a fully compositional phrase meaning a dog of brown color, hot dog is a non-compositional phrase denoting a type of food. Compositionality plays a vital role in word embeddings because a non-decomposable phrase should, in principle, be treated as a single word instead of a bag of word (BOW) in word embedding approaches.
A typical line of research in automatic compositionality detection is to "perturb" the input phrase by replacing one of its constituent words at a time with its synonym, and then to measure the semantic distance between the original phrase and the perturbed phrase set [8]. The larger this distance, the less compositional the original phrase. For instance, hot dog would be perturbed to warm dog and hot canine. The semantic distance between the original phrase and its two perturbations is high, indicating that they denote different concepts; hence hot dog is non-compositional. However, the phrase brown dog would be perturbed to hazel dog and brown canine, which have a shorter semantic distance to brown dog, indicating that it is compositional.
In this paper, we posit that the compositionality of a phrase is not dichotomous or deterministic, but instead varies across scenarios. For instance, heavy metal could refer to a dense metal that is toxic, which is compositional, but it could also be non-compositional when it refers to a genre of music. Previous work acknowledges this property of compositionality theoretically [8], but no operational models implementing this have been presented to this day.
Given a multi-word phrase as input, we reason that the phrase is used in some narrative, e.g., a query, sentence, snippet, document, etc. We refer to this narrative as usage scenario of the phrase. We combine evidence extracted from this usage scenario of the phrase with the global context (frequently co-occurring terms) of the phrase and use this to enrich the word embedding representation of the phrase. We linearly combine the weights of the tokens that are obtained from the usage scenario and the global context. We further extend this representation with information extracted from external knowledge bases. We evaluate our model on a large dataset of phrases which are labeled as per five degrees of compositionality under various usage scenarios. We find that our model outperforms state-of-theart baselines notably on identifying phrase compositionality. Our contributions are as follows: • A novel model that detects phrase compositionality under different contexts and that outperforms the state of the art performance in the area. • A benchmarking dataset of contextualized compositionality detection, that we make publicly available to the community.

RELATED WORK ON AUTOMATIC COMPOSITIONALITY DETECTION
Compositionality detection mainly focuses on the semantic distance or similarity calculation between a given phrase and its component words or its perturbations under a corpus or dictionary. Earlier approaches mostly estimate the similarity between the original phrase and its component words. For example, Baldwin et al. [1], and Katz and Giesbrecht [6] employ Latent Semantic Analysis (LSA) to calculate the semantic similarity (and hence to measure compositionality). Venkatapathy and Joshi [16] extended this by adding collocation features, e.g., phrase frequency, point-wise mutual information, extracted from the British National Corpus. More recent work estimates the similarity between a phrase and perturbed versions of that phrase where the words are replaced, one at a time, by their synonyms. For instance, Kiela and Clark [7] compute the semantic distance between a phrase and its perturbation, using cosine similarity, which measures a phrase weight by pointwise-multiplication vectors of its terms. Lioma et al. [10] calculate the semantic distance with Kullback-Leibler divergence based on a language model; and, in subsequent work, Lioma et al. [8] represent the original phrase and its perturbations as ranked lists, and measure their correlation or distance.
A promising line of work uses word embeddings and deep artificial neural networks for compositionality detection. Salehi [15] employs the word-based skip-gram model for learning non-compositional phrases, treating phrases as individual tokens with vectorial composition functions. Hashimoto and Tsuruoka [5] adopt syntactic features including word index, frequency and PMI of a phrase and its components words to learn the embeddings. Yazdani et al. [17] utilize a polynomial projection function and deep artificial neural networks to learn the semantic composition and detect non-compositional phrases like those that stand out as outliers, assuming that the majority are compositional.
Closer to our work, Salehi el. [14] use Wiktionary and utilize the definition, synonyms, and translations of Wiktionary to detect noncompositional components. Specifically, they analyze the lexical overlap between the definition of a phrase and its component words to measure compositionality. They assume that multi-word phrases are included in Wiktionary, while there is no guarantee for perfect coverage of the dictionary. Unlike this approach, we use Wiktionary together with DBPedia as a structured knowledge base to represent the contextual semantics of phrases.
To our knowledge, no prior work has operationalized the compositionality of a phrase as contextual.

OUR CONTEXTUAL REPRESENTATION MODEL FOR COMPOSITIONALITY DETECTION 3.1 Problem Formulation
Given an input phrase p and its accompanying usage scenario s, the aim is to compute the compositionality score Score (p) of phrase p with respect to usage scenario s. We follow the substitution-based line of work [7], which (a) generates perturbations of the input phrase p by substituting one word at a time with its synonym, (b) builds a semantic representation (a vector of its co-occurring terms) separately for the input phrase p and each perturbed phrase, and (c) uses the distance between the vectors of the input phrase and its perturbations to approximate the compositionality of the input phrase: the higher the distance, the less compositional the input phrase. This substitution-based line of work does not accommodate the usage scenario of the input phrase or its perturbations. The vectors of co-occurring terms are computed on one corpus, and hence these vectors represent the global distributional semantics of the input phrase and its perturbations. We extend this line of work by incorporating the local usage scenario of the phrase and its perturbations. We furthermore enrich these representations using external knowledge bases KBs. We describe this next. Figure 1 shows how the phrase and scenario are fed into the external corpus and knowledge base in a sequential manner in the architecture, which we refer to as contextual representation model (CRM).

Building Global and Local Phrase Context
Global Phrase Context. In Natural Language Processing (NLP), the distributional semantics of an input word are computed by fixing a natural number n and, for each occurrence of a word in some corpus, finding the n words occurring immediately before, and n words occurring immediately after each occurrence of the input word (called context window). If there is a total of N context windows for a word, its distributional semantics in vector form can be calculated by using all these N windows. Because this is a global representation of the word's distributional semantics across the whole corpus, the vector is called a globalized vector. A general word embedding (e.g., word2vec) is comparable to such global context. Concretized representations of this globalized vector can be calculated with, e.g. ranked lists or word embeddings, as described in section 4.1.
Local Phrase Context. We aim to incorporate a representation of the local usage scenario of the input phrase (local phrase context) into the above described global phrase context. This representation will not be in the vector directly, because usage scenarios are typically extremely short (in terms of words), which may strongly bias the contextual representation of the phrase. Therefore, we rank all the global context windows of the phrase according to their similarity to the usage scenario of the phrase, and we select the top K most similar context windows to the usage scenario. These top K context windows are used to build the local usage scenario context representation of the phrase. Then, we linearly combine the global representation of the phrase (i.e., by taking all N context windows) and the local usage scenario representation of the phrase (i.e., the top K context windows) to acquire the localized phrase context. The ranking score is the similarity between the usage scenario s and a window W i , i.e. sim i = similarity(W i , s) ∈ [0, 1]. The details of how the similarity score is computed are introduced in Section 4.
In the above, the value of K is determined by the length of the usage scenario as follows: where N is the total number of context windows that contain phrase p in the corpus; length(s) is the number of words in usage scenario s excluding the original phrase; and M is a threshold. We explain these next. We posit that 2 length(s ) indicates the degree of shrinking: the longer the usage scenario is, the smaller number of windows will be shrunk. The reason behind this is that the longer the usage scenario is, the more semantics it contains and subsequently fewer specific windows we are supposed to be capable of locating on. For instance, if the usage scenario is empty with length(s) = 0, then it returns the entire N windows of p with no shrinking performed; if the usage scenario has three words with 2 l enдt h (s ) = 8, it only collects the top 1 8 of all the windows (K = N 8 ). Note that K depends on the usage scenario length, and is not fixed as a threshold of the similarity values. Since the similarity values may vary drastically in between [0, 1] for different usage scenarios, we argue that our method is more robust to such variations. Furthermore, an empirical threshold of M is introduced to guarantee at least M windows will be selected anyhow. To this end, the localized usage scenario context of a phrase is given by: where a is a weight parameter between 0-1 indicating the weight of global vector, and the remaining of 1 − a corresponds to the contribution of the localized vector; R denotes the semantic representation for W . In this paper, we represent a phrase as a ranked list of words and word embedding, so the same symbol R is adopted to denote both. The approach to calculate R is described in section 4.1.

Enriching Global and Local Phrase Context with Knowledge Bases
We enrich the global and local context representations of the input phrase with information extracted from external knowledge bases. We describe this next. We reason that the corpus used to extract the global and local contexts has good coverage of various but not all possible usage scenarios of the phrase. A knowledge base is expected to contain more comprehensive, declarative information about the phrase, e.g., entities and phrase senses with categorized information. We, therefore, enrich the global and local phrase contexts extracted from the corpus with phrase information extracted from external knowledge bases.
Given an input phrase, we collect all candidate senses and entities (uniformly referred to as candidates in this paper) by searching the following properties (and associated values) from the knowledge bases: the properties dbpedia:redirects, dbpedia:disambiguation and their propagation relation with dbpedia:name and rdfs:label. The associated resources in these retrieved triples result in a set of candidates. Then, for each candidate, the values of rdfs:label, dbpedia:abstract and rdf:type are concatenated as the context for that candidate, excluding the title (which is mostly the phrase name). We also use the interface 2 to retrieve senses from Wiktionary and merge them into the same candidate set for that given input.
Most phrases only contain a limited number of candidates, and different candidates of the same phrase can have entirely different meanings or be distinct entities. We hence investigate a sequential way to incorporate the knowledge base into the phrase contextual representation, as follows. First, the phrase is fed into the knowledge base to find all candidate articles. Then, the candidates are ranked according to how similar they are to the localized phrase context. Those with similarity values above a certain threshold are identified as the matched candidates, denoted as {D i , sim i } n i=1 , where D i and sim i refer to the i t h matched candidate with similarity value sim i ∈ [0, 1]. A linear combination of the localized phrase context and the candidate articles is then conducted to compute the adjusted phrase context as follows: where is the normalized similarity score for the i th candidate article D i while R (D i ) denotes the semantic representation for D i . Since KB contains well-defined knowledge of words, we use a weighted sum of the matched candidates, instead of a simple average of matched contexts in the text corpus.
The knowledge base we employ consists of DBpedia, Wiktionary, and Wordnet. DBpedia is constructed by extracting structured information from Wikipedia. The English version of the DBpedia contains 4.58 million entries, of which 4.22 million are classified and managed under one consistent ontology. Wiktionary is a multilingual, web-based, freely available dictionary, thesaurus and phrase book, designed as the lexical companion to Wikipedia. Volunteers collaboratively construct Wiktionary, so there are no specialized qualifications necessary.

Non-linear combination
This section represents an approach of non-linear combination as a companion to the linear combination approach introduced in section 3.2 and 3.3. In addition to the weight parameter oriented design of the linear combination, we also employ a non-linear sigmoid function in RNNs (recurrent neural networks), which resolves the arrangement of the combining order for context inputs. In other words, RNNs take into consideration the feedback from the previous context vector back and forth, leading to numerous applications [3,4]. Specifically, we train a neural network model using the Keras library to identify the compositionality label of each phrase. We encode the semantics adopting pre-trained word embeddings -word2vec [11] as word representations, a recurrent neural network with LSTM cells as the model, and cross-entropy as the loss function. As an optimizer, we utilize Adam optimizer for training the model.
In a realistic scenario (also represented in our dataset) there are fewer non-compositional than compositional phrases. This situation resembles the class imbalance issue which happens when one class (or label) is represented by most of the examples while the other one is represented just by a few. Therefore, we adopt re-sampling strategies to tackle this problem.

Compositionality Detection
Here we introduce our proposed method for compositionality detection with Algorithm 1. Given a phrase p of length l and its usage scenario s in a large corpus Corp, we compute its compositionality score through the following steps: (1) Obtain localized phrase context through Eq. 1 and 2. The usage scenario of a phrase is the critical information, and Update perturbed phrase set S (p) ← S (p) ∪p 6: end for 7: for phrase p ′ ∈ {p ∪ S (p)} do

IMPLEMENTATION 4.1 Semantic Representation
The semantic representation of a context (a phrase, a context window or a candidate content), i.e., R (·) is concretized as either a ranked list or a word embedding. For the ranked list model, we calculate the TF-IDF as weight for all the tokens, rank them according to the weight, resulting in a ranked list of those tokens as the localized contextual representation. For the word embedding model, we use existing pre-trained word vectors -Glove [12], and represent the vector with the average of all tokens. The corpus we employ in our experiment is ClueWeb12-B13, a subset of some 50 million pages of ClueWeb12-Full dataset 3 . These two contextual representations lead to two different compositionality scores for the same model. We apply suffixes "-word embedding" or "-ranked list" in order to distinguish the way the contextual representation is computed, resulting in two distinct models, namely CRM word embedding and CRM ranked list.

Similarity Measure
In this study, we are faced with the problem of computing the similarity value between two context vectors. Here, we consider two types of similarity measures to achieve this purpose: cosine similarity and Pearson correlation coefficient.
One of the most commonly used similarity measures, cosine similarity, computes the cosine value of the angle between the two vectors of the same length. For two vectors ⃗ a = [a 1 , a 2 , .., a n ] and ⃗ b = [b 1 , b 2 , .., b n ], their cosine similarity cossim(a,b) is given below: The Pearson correlation coefficient computes the degree of correlation between two variables, each having a set of observed values. Suppose two variables X and Y are associated with two set of values {X 1 , X 2 , ..., X n } and {Y 1 , Y 2 , ..., Y n } respectively. The Pearson correlation coefficient r can, therefore, be computed as follows: where X and Y denote the average of X and Y respectively.

Perturbation
Here, we introduce the process to obtain the perturbations of a phrase p with length l. First, we get the synonyms for each word in the phrase. Then, we construct the whole perturbation set, which contains all phrases composed of l − 1 words in p and a synonym of the remaining word. Suppose the i t h word has n i synonyms, then the perturbation set contains l i=1 n i perturbed phrases. We then prune the perturbation set by filtering out the rare perturbed phrases in the text corpus. Basically, we compute the occurrence frequency of all perturbed phrases and pick the perturbed phrases with top K frequency values. In our study, we set K to be 7, which

Parameter Settings
Contextual Windows setting: We set window = 20, which means it scans the previous 20 and subsequent 20 words of that phrase, with a sum of 40 words for each window. In Equation 1, we set M = 10, and the base 2 can also be parameter-free which can be changed into 2,3,4,etc., to increase the localization level.
Knowledge base threshold: As for the threshold of KB candidates, we set the threshold of similarity value between localized context and KB candidates as 0.5 to filter out those candidates with similarity less than 0.5.

Ranked list length:
In line with the work [8], we set a maximum length of the ranked list as 1000, which means that we rank the tokens according to their TFIDF weight, and the tokens after the position of 1000 would be pruned. Training and testing: As we are working with imbalanced data, we use a random oversampling strategy. We split our data in a stratified fashion into 65% for training, 15% for validation, and 20% for testing. The re-sampling is be done after splitting the data into training and test, and only on the training data, i.e., none of the information in the test data is being used to create synthetic observations.

EXPERIMENT AND VALIDATION
In this section, we evaluate the effectiveness of the model presented in Section 3. Section 5.1 introduces the dataset and Section 5.2 presents the results achieved by our model.

Crowdsourcing data
We employ a dataset that consists of 1042 phrases that are nounnoun 2-term phrases [2]. In this dataset, each phrase was assessed four times using a binary scale (compositional or non-compositional). However, these phrases are assessed with a deterministic label, meaning that no scenario or context was given, and the degree of compositionality may not always be binary [13]. Therefore, we extend the dataset into a new version where each phrase is enriched with one or two scenarios if possible, by taking advantage of a  Figure Eight 4 , and we use a graded level of compositionality. In Table 2 we summarize the dataset statistics.
We divided the assessment into two stages: for the first stage, the trustful assessors, with level 3 (highest in Figure Eight), are required to understand the various meanings of a phrase, and, if possible, create two scenarios for the same phrase. From these two scenarios, one should be compositional or as compositional as possible, and the other non-compositional or as non-compositional as possible. If the phrase can only be compositional or only be noncompositional, then they create one scenario for it. For the second stage, the assessors are required to assess the compositionality of phrases within different scenarios with one of the five graded labels: compositional, mostly-compositional, ambiguous to judge, mostly non-compositional, and non-compositional. Note that, for the first stage, the two scenarios of a phrase are not necessarily of two extreme polarities. x-coordinate is α for controlling localized context and λ stands for y-coordinate, controlling the KB combining weight. Deeper blue represents higher performance whereas red indicates the opposite.

Performance and Validation
Two linear combination parameters influence the performance of our model: the combination weight α (in Eq. 2) between the vectors of a phrase and its scenario, resulting in a localized phrase context, and λ (in Eq. 3) between the localized context and knowledge base. The impacts of these two parameters on the final performance are visualized in Figure 2 and 3, corresponding to the word embeddingbased and ranked list-based contextual representation respectively. α and λ denote the x and y coordinates. The colors indicate the performance, which is the correlation between the ground truth labels and the predicted labels of our models ranging from -1 to 1. The performance values are colored ranging from red to blue, representing the lowest performance to the highest.
As shown in Figure 2, the performance is negatively correlated with α while positively correlated with λ. This indicates that reducing the relative importance of localized context (right direction on x-coordinate) while enhancing the influence of knowledge base (bottom direction on y-coordinate) can improve the performance for word embedding based contextual representation. In contrast, as shown in Figure 3, if we ignore the 0 column, α is negatively correlated with the performance while λ does not have an apparent influence on the performance. This indicates that attaching higher importance to the localized context (left direction on x-coordinate) can improve the performance for the ranked list based contextual representation, while the adoption of knowledge base does not have an apparent influence to the overall performance. The first 0 column, which is shown more like an outlier, indicates that the existence of a vector of the original phrase is necessary. In other words, localized context would have relatively poor performance.
As summarized in Table 1, the performance improved from 0.147 to 0.375 for CRMs based on word embeddings (the best); from 0.131 to 0.209 for CRMs based on ranked list; and 0.176 to 0.324 for CRM based on RNN. For the word embedding based contextual representation model, relying more on the knowledge base while keeping the scenario to limited importance will lead to a high-performed model; for the ranked list based contextual representation model, on the other hand, adequately high adoption of localized context can lead to improved performance. The reason behind this can be that the knowledge base contains relatively trimmed but wellcategorized information, therefore, the word embedding model can take full use of this text as informative vectors. In contrast, ranked lists, depending on tokens, work better on a large-scale corpus where they induce a large number of context windows. However, the knowledge base contains a limited number of tokens that may have little contribution to the final representation. Even though we can tune the weight of tokens from a knowledge base, it still can have limited influence in comparison to the long ranked list, which can be as long as 1000 tokens in our experiment.
For the non-linear combination where we employed the sigmoid function in RNNs, the CRM based on RNN still beats the original RNN. However, the performance is still lower than the unsupervised approaches.

CONCLUSION
We developed a novel method for compositionality detection where the compositionality of a phrase is contextual rather than static. x-coordinate is α for controlling localized context and y-coordinate stands for λ, controlling the KB combining weight. Deeper blue represents higher performance whereas red indicates the opposite.
Instead of considering an isolated phrase as input, we assume a phrase and its usage scenario (e.g., a query, snippet, sentence, etc.) as input, and we model a joint semantic representation of these by combining distributional semantics extracted from a corpus and additional evidence extracted from an external structured knowledge base.
Our resulting model uses word embeddings to detect compositionality, more accurately than the related state of the art. Our experiments show that for word embeddings, the usage of knowledge bases can lead to notable performance improvements.
In the future, we plan to evaluate our model on further datasets and compositionality detection scenario, e.g., Verbal Phraseological Units (VPUs).