Multi-word expression (MWE) detection is a challenging Natural Language Processing (NLP) task that consists in finding isolated tokens or sequences of tokens that form high-level structures (i.e. multi-word expressions). Naively, this can be regarded as a sequence labeling task, which in fact influenced legacy MWE methods to use this strategy. In fact, this is not far fetched, since this approach has yielded highly accurate results so far. However, we have to point a couple of task-specific details:
Our Named Entity Recognition (NER) system employs Graph-Based-Decoding (GBD) over a hybrid network architecture composed of bidirectional LSTMs for word-level encoding, with a MultiLayerPerceptron for detecting links inside entities and a unidirectional LSTM for label estimation:
We are currently working on integrating the NER branch into master. Until then, you can check out the source for the GDB-NER system at this repository, see the results obtained on the PARSEME corpus here. If interested, we also encourage you to read our paper draft
Our graph-based encoding/decoding strategy is able to cope with overlapping high-level entities and large spans between tokens, by forcing the model to focus on the dependencies between tokens that belong to the same expression. We start off with a sequence of words as input, and convert it into a sequence of fixed-sized vectors (by combining lexical and morphological attributes for every word). After this, we use two stacked Bidirectional LSTM (BDLSTM) layers, obtaining a list of vector embeddings for each word inside the original sequence. Next, we take each vector from the internal representation and we split it using three parallel $tanh$ layers into 3 separate "projections".
As such, for each word \(w_i\), we obtain three vectors which we will refer to as \(v_i^0\), \(v_i^1\) and \(v_i^2\). Finally, for every pair of words \(w_i\) and \(w_j\) with \(i\neq j\) we use single unit sigmoid activation, outputting the probability of the two words being part of the same high-level expression or not. The input for the sigmoid unit is obtained by concatenating the first vector from the split operation for word \(w_i\) (i.e. \(v_i^0\)) with the second vector of the split operation for word \(w_j\) (i.e. \(v_j^1\)). We store the resulting values in an adjacency matrix \(a\).
Finally, we use a graph-based algorithm to decode NERs and assign the label using an unidirectional LSTM over the third projection (\(v_i^2\)) of the words contained in each expression (in natural order). Here we have to handle three cases: