Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging, Dependency Parsing and Named Entity Recognition for more than 50 languages
We are currently focusing on training our models and participating in this year's UD Shared Task on Multilingual Parsing.
UD covers more than 50 languages, and the data comes in all shapes and sizes: (i) large versus small tagsets; (ii) languages that require only suffix analysis to resolve morphological attributes versus languages in which those attributes depend on inter-character patterns; and (iii) various writing systems, such as alphabetic, abjad and syllable-based logographic scripts.
Needless to say, there is much work to be done, but we are doing our best to provide a best-fit solution for all the training corpora. Once the official evaluation is finished, we will publish the system description and results, and make our models available to everyone interested in using NLP-Cube.