Processing multilingual information has become increasingly relevant in today’s digital world. Pangea’s Language Detector identifies the language and character encoding of incoming documents. It supports more than 84 languages, covering the major Western and Eastern European, Semitic and Central Asian languages as well as Turkish, Japanese and Chinese, among others.
Pangea Language Detector can be used:
As a pre-processing step before machine translation;
To pre-filter text and improve the quality of input data when training algorithms (most natural language processing algorithms are trained on monolingual texts, so mixing in other languages can decrease the performance of document management systems);
To organize data (speech-to-text output, documents, etc.) before other processes;
To mine bilingual texts for machine translation from online resources;
To retrieve, group and understand relevant information (users’ texts, emails, etc.) in a multilingual environment.
Pangea Language Detector accurately determines not only the language of the whole document, but also the language of each snippet, paragraph or fragment.
Our Language Detector combines statistical and neural technologies to achieve the highest recognition accuracy. Our proprietary language detection algorithm is based on a robust mathematical vector space model. We build a multidimensional vector space by scanning document contents and use N-grams to calculate frequencies. The algorithm analyzes the positions of the relevant vectors in that space to determine their similarity. Finally, the combined results are corrected using special linguistic rules developed by our language team.
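To illustrate the general idea of N-gram frequency vectors compared by similarity, here is a minimal Python sketch. It is not Pangea's proprietary algorithm: the character trigrams, the cosine similarity measure, the three tiny sample texts and the function names (ngram_profile, cosine, detect, detect_paragraphs) are all assumptions chosen for illustration, and the neural component and linguistic correction rules are left out entirely.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (illustrative choice: trigrams)."""
    padded = f" {text.lower()} "  # pad so word boundaries form n-grams too
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram frequency vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy reference texts; a real system would use large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is sleeping in the house",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def detect(text: str) -> str:
    """Return the language whose profile is closest to the text's vector."""
    query = ngram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))


def detect_paragraphs(document: str) -> list[str]:
    """Detect a language for each blank-line-separated paragraph."""
    return [detect(p) for p in document.split("\n\n") if p.strip()]


if __name__ == "__main__":
    print(detect("the dog is in the house"))
    print(detect_paragraphs("the dog is in the house\n\nel gato duerme en la casa"))
```

With reference texts this small the sketch is only reliable on inputs that resemble the samples. The point is the mechanism: each text becomes a vector of N-gram counts, and the nearest reference vector decides the language, which also works fragment by fragment.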
For evaluation purposes, we have created a demo page that detects the most popular languages, achieving language identification accuracy of 95% to 99% (typical competitors’ results: 86–96%). The average processing speed was over 8000 KB/s.