NLP-Cube

Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging, Dependency Parsing and Named Entity Recognition for more than 50 languages

Setup:

Before running the server, you need the model weights. You can obtain them in two ways; training your own models is shown below.

Installing DyNet:

NLP-Cube is built on the DyNet neural network library, so build and install it first:
            
            # Cython is required to build the DyNet Python bindings
            pip install cython
            mkdir dynet-base
            cd dynet-base

            # fetch DyNet and the Eigen library it depends on
            git clone https://github.com/clab/dynet.git
            hg clone https://bitbucket.org/eigen/eigen -r 2355b22  # -r NUM specifies a known working revision

            # configure and compile DyNet
            cd dynet
            mkdir build
            cd build
            cmake .. -DEIGEN3_INCLUDE_DIR=/path/to/eigen -DMKL_ROOT=/opt/intel/mkl -DPYTHON=`which python2`

            make -j 2 # replace 2 with the number of available cores
            make install

            # build and install the Python bindings
            cd python
            python2 ../../setup.py build --build-dir=.. --skip-build install
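
If the build succeeded, you can confirm that the bindings import and run with a short script. This is only a sanity check against the public DyNet API, not part of NLP-Cube:

            # sanity_check.py -- verify that the DyNet Python bindings work
            import dynet as dy

            dy.renew_cg()                        # start a fresh computation graph
            model = dy.ParameterCollection()
            W = model.add_parameters((2, 3))     # a 2x3 weight matrix
            x = dy.inputVector([1.0, 2.0, 3.0])  # a length-3 input vector
            y = dy.parameter(W) * x              # matrix-vector product
            print(y.value())                     # forward pass; prints two floats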
            
            

Training the lemmatizer (example):

Use the following command to train your lemmatizer:

            
            python2 cube/main.py --train=lemmatizer \
                --train-file=corpus/ud_treebanks/UD_Romanian/ro-ud-train.conllu \
                --dev-file=corpus/ud_treebanks/UD_Romanian/ro-ud-dev.conllu \
                --embeddings=corpus/wiki.ro.vec \
                --store=corpus/trained_models/ro/lemma/lemma \
                --test-file=corpus/ud_test/gold/conll17-ud-test-2017-05-09/ro.conllu \
                --batch-size=1000
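
The --train-file, --dev-file and --test-file arguments expect Universal Dependencies treebanks in CoNLL-U format: one token per line, ten tab-separated columns, and a blank line between sentences. For reference, a hand-written example sentence (not a line from the actual treebank) looks like this:

            # ID  FORM    LEMMA   UPOS   XPOS  FEATS  HEAD  DEPREL  DEPS  MISC
            1     Acesta  acesta  PRON   _     _      4     nsubj   _     _
            2     este    fi      AUX    _     _      4     cop     _     _
            3     un      un      DET    _     _      4     det     _     _
            4     test    test    NOUN   _     _      0     root    _     _
            5     .       .       PUNCT  _     _      4     punct   _     _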
            
            

Running the server:

Use the following command to run the server locally:

            
            python2 cube/main.py --start-server \
                --model-tokenization=corpus/trained_models/ro/tokenizer \
                --model-parsing=corpus/trained_models/ro/parser \
                --model-lemmatization=corpus/trained_models/ro/lemma \
                --embeddings=corpus/wiki.ro.vec \
                --server-port=8080
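
Once the server is up, you can query it over HTTP. The sketch below uses the Python requests library; the /nlp route and the "q" parameter are illustrative assumptions, not the documented API, so check cube/main.py for the actual endpoint:

            # query_server.py -- hypothetical client sketch: the route and
            # parameter name below are assumptions, not NLP-Cube's documented API
            import requests

            resp = requests.get(
                "http://localhost:8080/nlp",           # assumed route
                params={"q": "Acesta este un test."},  # assumed query parameter
            )
            print(resp.text)  # expected output: CoNLL-U annotations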