A European company wanted to expand its existing NLU (Natural Language Understanding) solution, which covered English, Spanish, and French, with support for other European languages.
The Envion Software experts were approached to develop an extension of the client’s Named-Entity Recognition module, which was an essential part of the entire text processing solution.
Over six months, the Envion Software NLP team implemented a NER solution that extracts Person, Location, and Organization named entities from general-domain articles in nine European languages (German, Portuguese, Italian, Polish, Danish, Dutch, Finnish, Norwegian, and Swedish). To achieve this goal, our NLP experts built a Java NER application using the GATE framework for natural language processing.
Given the limited number of training examples available from the client and the peculiarities of the texts, a rule-based approach was selected as the most reasonable choice. As a result, a hierarchical entity extraction grammar was developed, relying on the Java Annotation Patterns Engine (JAPE).
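To illustrate the idea behind a rule-based extractor, here is a minimal sketch in plain Java. It is not JAPE and not the grammar built in the project: the class name `RuleSketch`, the tiny `PERSON_TITLES` list, and the single rule (a title word followed by a run of capitalized tokens yields a Person candidate) are all hypothetical and chosen only for demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class RuleSketch {
    // Hypothetical trigger words; a real gazetteer would be far larger.
    static final Set<String> PERSON_TITLES = Set.of("Dr.", "Prof.", "Herr", "Frau");

    static boolean isCapitalized(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    /** Returns Person candidates: runs of capitalized tokens that follow a title word. */
    static List<String> findPersons(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> persons = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!PERSON_TITLES.contains(tokens[i])) {
                continue;
            }
            StringBuilder name = new StringBuilder();
            int j = i + 1;
            while (j < tokens.length && isCapitalized(tokens[j])) {
                String raw = tokens[j];
                j++;
                if (PERSON_TITLES.contains(raw)) {
                    continue; // skip stacked titles such as "Prof. Dr."
                }
                String clean = raw.replaceAll("[.,;:!?]+$", "");
                if (name.length() > 0) {
                    name.append(' ');
                }
                name.append(clean);
                if (!clean.equals(raw)) {
                    break; // trailing punctuation ends the name
                }
            }
            if (name.length() > 0) {
                persons.add(name.toString());
            }
            i = j - 1; // resume scanning after the consumed tokens
        }
        return persons;
    }
}
```

A hierarchical grammar would layer many such rules, with later rules consuming the annotations produced by earlier ones; this sketch shows only a single layer.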
Moreover, to ensure a sufficiently representative set of gazetteers (i.e., lists of named entities and of contextual words and phrases used as features for entity identification), our experts queried the multilingual ontology resource DBpedia using the SPARQL query language. One of the benefits of the resulting NER application was the ease of extending it to new languages.
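As a rough sketch of how a gazetteer query of this kind might look, the Java helper below builds a SPARQL query string that selects labels of DBpedia entities of a given ontology class in a given language. The class `GazetteerQuery`, the method `buildLabelQuery`, and the query shape are assumptions made for illustration; the queries actually used in the project are not described in the source. The code only constructs the query text and does not contact any endpoint.

```java
public class GazetteerQuery {
    /**
     * Builds a SPARQL query for labels of DBpedia entities.
     *
     * @param ontologyClass DBpedia ontology class name, e.g. "Person", "Place", "Organisation"
     * @param languageCode  language tag of the labels to keep, e.g. "de" or "sv"
     * @param limit         maximum number of labels to return
     */
    static String buildLabelQuery(String ontologyClass, String languageCode, int limit) {
        return String.join("\n",
            "PREFIX dbo: <http://dbpedia.org/ontology/>",
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
            "SELECT DISTINCT ?label WHERE {",
            "  ?entity a dbo:" + ontologyClass + " ;",
            "          rdfs:label ?label .",
            "  FILTER (lang(?label) = \"" + languageCode + "\")",
            "}",
            "LIMIT " + limit);
    }

    public static void main(String[] args) {
        // Example: German labels of organizations (DBpedia spells the class "Organisation").
        System.out.println(buildLabelQuery("Organisation", "de", 1000));
    }
}
```

Because the language tag is a parameter, the same query template can be reused for each new language, which is one plausible reason a DBpedia-backed gazetteer is easy to extend.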
This approach achieved over 90% precision and over 70% recall for the extraction of Organization, Person, and Location entities without significantly slowing down the initial NLU solution.
In addition, because highly accurate automatic grammatical analysis of texts was one of the client’s critical requirements, the Envion experts fine-tuned several open-source libraries for part-of-speech tagging and lemmatization (TreeTagger, Hunspeller, OpenNLP, and SyntaxNet) and achieved over 95% accuracy for lemmatization and over 98% for part-of-speech tagging.
The quality of the NER solution was verified through numerous precision and recall assessments per entity type, as well as accuracy assessments for part-of-speech tagging and lemmatization. For this purpose, gold-standard corpora were annotated by the Envion Software experts.
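For readers unfamiliar with these metrics, the following minimal Java sketch shows how precision and recall can be computed by comparing predicted entities against a gold-standard annotation. The class `NerEvaluation` and the string encoding of an entity (`"start-end:TYPE"`) are hypothetical; exact-match scoring is assumed here, and the source does not say which matching criterion the project used.

```java
import java.util.HashSet;
import java.util.Set;

public class NerEvaluation {
    /** Number of predicted entities that also appear in the gold standard. */
    static int truePositives(Set<String> gold, Set<String> predicted) {
        Set<String> hits = new HashSet<>(predicted);
        hits.retainAll(gold);
        return hits.size();
    }

    /** Share of predicted entities that are correct. */
    static double precision(Set<String> gold, Set<String> predicted) {
        if (predicted.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(gold, predicted) / predicted.size();
    }

    /** Share of gold-standard entities that were found. */
    static double recall(Set<String> gold, Set<String> predicted) {
        if (gold.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(gold, predicted) / gold.size();
    }

    public static void main(String[] args) {
        // Made-up annotations, encoded as "start-end:TYPE".
        Set<String> gold = Set.of("0-13:PER", "17-23:LOC", "30-37:ORG", "40-46:LOC");
        Set<String> predicted = Set.of("0-13:PER", "17-23:LOC", "50-55:ORG");
        System.out.println("precision = " + precision(gold, predicted));
        System.out.println("recall    = " + recall(gold, predicted));
    }
}
```

In the made-up example, two of the three predictions are correct (precision 2/3) and two of the four gold entities are found (recall 1/2). Per-type figures follow by filtering both sets to one entity type before calling the same methods.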
Upon successful completion of the project, with the multilingual NER module in place, the client’s solution was able to attribute and categorize journal articles in multiple languages and make them available to readers across the globe.