Multilingual Named-Entity Recognition for Journal Article Summarization

Country: USA
Industry: NLP
Type: Web Application
Duration: 1 Year

Challenge

A European company wanted to extend its existing NLU (Natural Language Understanding) solution, which already handled English, Spanish, and French, to other European languages in order to:

summarize texts
categorize journal articles
make the articles available to its readers worldwide

The Envion Software experts were approached to develop an extension of the client’s Named-Entity Recognition module, which was an essential part of the entire text processing solution.

Solution

Within six months, the Envion Software NLP team implemented a NER solution that extracts Person, Location, and Organization named entities from general-domain articles in nine European languages (German, Portuguese, Italian, Polish, Danish, Dutch, Finnish, Norwegian, and Swedish). To achieve this, our NLP experts built a Java NER application on the GATE framework for natural language processing.

Given the limited number of training examples available from the client and the peculiarities of the texts, a rule-based approach was selected as the most reasonable choice. As a result, a hierarchical entity extraction grammar was developed, relying on the Java Annotation Patterns Engine (JAPE).
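To illustrate the idea of a hierarchical, rule-based grammar, here is a minimal sketch in plain Java rather than JAPE. It is not the project's actual grammar: the class name, the tiny German gazetteers, the tag names, and the two rules are all assumptions made for illustration. A first phase tags tokens by gazetteer lookup, and a second phase applies patterns over those tags, which mirrors how later phases of a cascade consume annotations produced by earlier ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Illustrative two-phase rule cascade; hypothetical, not the project's JAPE grammar. */
class RuleCascadeSketch {
    // Hypothetical gazetteers: person titles and company suffixes (stored without dots).
    static final Set<String> TITLES = new HashSet<>(Arrays.asList("Herr", "Frau", "Dr", "Prof"));
    static final Set<String> ORG_SUFFIXES = new HashSet<>(Arrays.asList("GmbH", "AG", "KG"));
    // An uppercase letter followed by letters or hyphens; \p{L} also covers umlauts.
    static final Pattern CAPITALIZED = Pattern.compile("\\p{Lu}[\\p{L}-]*");

    /** Strips trailing punctuation from a whitespace-delimited token. */
    static String clean(String token) {
        return token.replaceAll("[.,;:!?]+$", "");
    }

    /** Phase 1: tag every token via gazetteer lookup, falling back to its shape. */
    static String[] lookup(String[] tokens) {
        String[] tags = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            String t = clean(tokens[i]);
            if (TITLES.contains(t)) tags[i] = "TITLE";
            else if (ORG_SUFFIXES.contains(t)) tags[i] = "ORG_SUFFIX";
            else if (CAPITALIZED.matcher(t).matches()) tags[i] = "CAP";
            else tags[i] = "OTHER";
        }
        return tags;
    }

    /** Phase 2: pattern rules over the tags produced by phase 1. */
    static List<String> extract(String text) {
        String[] tokens = text.trim().split("\\s+");
        String[] tags = lookup(tokens);
        List<String> entities = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].equals("TITLE")) {
                // Rule: TITLE CAP+ -> Person (the span ends at punctuation).
                StringBuilder name = new StringBuilder();
                int j = i + 1;
                while (j < tokens.length && tags[j].equals("CAP")) {
                    if (name.length() > 0) name.append(' ');
                    name.append(clean(tokens[j]));
                    boolean endsWithPunctuation = !clean(tokens[j]).equals(tokens[j]);
                    j++;
                    if (endsWithPunctuation) break;
                }
                if (name.length() > 0) entities.add("Person: " + name);
            } else if (tags[i].equals("ORG_SUFFIX")) {
                // Rule: CAP+ ORG_SUFFIX -> Organization (walk left over capitalized tokens).
                int start = i;
                while (start > 0 && tags[start - 1].equals("CAP")
                        && clean(tokens[start - 1]).equals(tokens[start - 1])) {
                    start--;
                }
                if (start < i) {
                    String head = String.join(" ", Arrays.copyOfRange(tokens, start, i));
                    entities.add("Organization: " + head + " " + clean(tokens[i]));
                }
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        // Prints [Person: Anna Schmidt, Organization: Beispiel AG]
        System.out.println(extract("Frau Anna Schmidt wechselt zur Beispiel AG."));
    }
}
```

Because the language-specific knowledge lives in the gazetteers and a small set of rules, supporting another language in a design like this mostly means supplying new lists and adjusting the patterns.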

Moreover, to ensure a sufficiently representative set of gazetteers (i.e., lists of named entities, contextual words, and phrases used as features for entity identification), our experts queried the multilingual ontology resource DBpedia using the SPARQL query language. One benefit of the resulting NER application was how easily it could be extended to new languages.
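A gazetteer query of this kind might look roughly like the sketch below. The queries actually used in the project are not documented here, so the class and method names, the choice of the `dbo:Organisation` class, the language filter, and the result limit are all assumptions. The code only builds the SPARQL text and a request URL for the public DBpedia endpoint; it does not send the request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that builds a DBpedia label query for one entity class and language. */
class GazetteerQuerySketch {
    // Public DBpedia SPARQL endpoint.
    static final String ENDPOINT = "https://dbpedia.org/sparql";

    /** Selects distinct labels in the given language for instances of a DBpedia ontology class. */
    static String buildQuery(String dboClass, String lang, int limit) {
        return "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
             + "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
             + "SELECT DISTINCT ?label WHERE {\n"
             + "  ?entity a dbo:" + dboClass + " ;\n"
             + "          rdfs:label ?label .\n"
             + "  FILTER (lang(?label) = \"" + lang + "\")\n"
             + "}\n"
             + "LIMIT " + limit;
    }

    /** URL-encodes the query into the standard "query" parameter of the SPARQL protocol. */
    static String buildRequestUrl(String query) {
        try {
            return ENDPOINT + "?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String query = buildQuery("Organisation", "de", 1000);
        System.out.println(query);
        System.out.println(buildRequestUrl(query));
    }
}
```

Swapping the class for `dbo:Person` or `dbo:Place` and the language tag for another code (for example `pl` or `sv`) would yield the corresponding lists for other entity types and languages.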

This approach achieved over 90% precision and over 70% recall for Organization, Person, and Location entity extraction without significantly slowing down the client's existing NLU solution.

In addition, because highly accurate automatic grammatical analysis of texts was a critical requirement for the client, the Envion experts fine-tuned several open-source libraries for part-of-speech tagging and lemmatization (Treetagger, Hunspeller, OpenNLP, and SyntaxNet), achieving over 95% accuracy for lemmatization and over 98% for part-of-speech tagging.

The quality of the NER solution was verified through repeated precision and recall assessments for each entity type, and likewise for part-of-speech tagging and lemmatization. For this purpose, the Envion Software experts annotated a gold-standard corpus.
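For reference, precision and recall against a gold-standard annotation set can be computed as in the sketch below. It assumes exact-match scoring and a hypothetical "TYPE:start-end" string encoding for annotations; the project's actual matching criteria and tooling are not described in this case study.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Exact-match precision and recall over annotation sets; the encoding is hypothetical. */
class NerEvaluationSketch {
    /** Keeps only annotations of one entity type, e.g. "ORG". */
    static Set<String> ofType(Set<String> annotations, String type) {
        Set<String> result = new HashSet<>();
        for (String a : annotations) {
            if (a.startsWith(type + ":")) result.add(a);
        }
        return result;
    }

    /** Number of predicted annotations that also appear in the gold standard. */
    static int truePositives(Set<String> gold, Set<String> predicted) {
        Set<String> hits = new HashSet<>(predicted);
        hits.retainAll(gold);
        return hits.size();
    }

    /** Share of predicted annotations that are correct. */
    static double precision(Set<String> gold, Set<String> predicted) {
        return predicted.isEmpty() ? 0.0 : (double) truePositives(gold, predicted) / predicted.size();
    }

    /** Share of gold annotations that were found. */
    static double recall(Set<String> gold, Set<String> predicted) {
        return gold.isEmpty() ? 0.0 : (double) truePositives(gold, predicted) / gold.size();
    }

    public static void main(String[] args) {
        Set<String> gold = new HashSet<>(Arrays.asList(
                "PER:0-12", "ORG:20-31", "LOC:40-46", "ORG:60-68", "PER:75-83", "LOC:90-97"));
        Set<String> predicted = new HashSet<>(Arrays.asList(
                "PER:0-12", "ORG:20-31", "LOC:40-46", "ORG:100-108"));
        // Overall: 3 of 4 predictions are correct, 3 of 6 gold entities are found.
        System.out.println(precision(gold, predicted)); // 0.75
        System.out.println(recall(gold, predicted));    // 0.5
        // Per entity type: 1 of 2 ORG predictions is correct, 1 of 2 gold ORG entities is found.
        System.out.println(precision(ofType(gold, "ORG"), ofType(predicted, "ORG"))); // 0.5
        System.out.println(recall(ofType(gold, "ORG"), ofType(predicted, "ORG")));    // 0.5
    }
}
```

A rule-based system tuned for precision typically shows this kind of gap: most of what it extracts is correct, while some entities outside the rules and gazetteers are missed.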

Technical details

Java 8
OpenNLP
SyntaxNet
SPARQL
Hunspeller
GATE
JAPE
Treetagger

Result

Upon successful completion of the project, with the multilingual NER module in place, the client's solution was able to attribute and categorize journal articles in multiple languages and make them available to readers across the globe.
