FAQs
Machine Translation is a hot topic. Gone are the times when translation companies could proclaim "machines will never match human quality". Market pressures and, above all, the need to shorten time-to-market for texts have turned the limelight on translation automation. Several developments have even made it to the press, and the advent of free, general-domain, plain-text engines on the web such as Google Translate has put translation and fast language transfer high on the agenda for international businesses. President Obama's call in October 2009 for the advancement and improvement of machine translation to help the world communicate and understand each other better only gave the subject more momentum (reports in the New York Times and The Economist, March 2010).
By answering the 17 key questions below (or any others that you may have), we hope to provide enough information from experience, plus a few practical approaches, on how to convert this new challenge into an innovative, competitive strategy.
If you have wondered or asked anyone any of the following questions, this will be a key section for you.
Q17 – Is your machine translation good with the Czech language?
Slavic languages have many cases (word inflections). This made statistical machine translation work quite badly, as the probability of any given word string occurring in the training data was quite low. These types of languages are also called "morphologically rich languages" because of the number of word forms and combinations that are possible.
Neural networks changed the approach completely. A neural network works both below and above the word level to capture how each word is formed and how it relates to the words next to it. This means that neural machine translation understands the relations between the different words in a sentence far better. By taking into account the dependencies between words learned from the training data, neural network-based machine translation produces output with a near-human flow, often described as human-quality machine translation.
One of our clients asked us:
I thought that PangeaMT provided only generic engines and that we could customize these engines with our own TMs to create in-domain-specific "mirrors" (using the "OnlineTraining" module). I also know that our language combinations (EN <-> CS and DE <-> CS, both ways) are not well supported by other MT providers (Czech is a really complicated language for MT solutions). So I had to ask if PangeaMT provides these two combinations as well.
Well, indeed, you can customize your engine with our online tool using your own server. This provides a lot of freedom and independence when setting up a machine translation environment for a translation agency. As language consultants, linguists tackle texts and documents of very different natures, with conflicting terminology. Mixing everything in a single engine would be detrimental to performance and accuracy.
Take the following Czech-English TMX file.
A neural network will find the relations and similarities between the sentences, down to the syllable and letter level if necessary (this is a very useful feature in neural training called BPE, or byte-pair encoding). It is also responsible for neural machine translation's success and higher acceptability compared to the previous n-gram-based Statistical Machine Translation, which is still successful with short sentences because of its higher "memorizing" capabilities, as explained in our first publications on neural machine translation development back in 2017. Our findings at the time showed that a short sentence of fewer than 9-10 words could probably be translated more accurately with a statistical system than with a neural system. As systems have improved over the years, the gap between the two has narrowed. However, it is true that when ecommerce websites only need to translate a couple of words, and those words have been part of the training data, a statistical system will recall them more quickly and efficiently. A neural system, on the other hand, will reconstruct the sentence with more human fluency.
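To make the idea of subword segmentation a little more tangible, here is a minimal sketch using the open-source SentencePiece library. This is an illustration only, not a description of PangeaMT's internal tooling; the file names and vocabulary size are placeholders.

```python
# Minimal sketch: train a BPE (subword) model on a monolingual corpus with the
# open-source SentencePiece library, then segment a Czech word into pieces.
import sentencepiece as spm

# Train a BPE model on a plain-text corpus (one sentence per line).
# "czech_corpus.txt" and the vocabulary size are illustrative placeholders.
spm.SentencePieceTrainer.train(
    input="czech_corpus.txt",
    model_prefix="cs_bpe",
    vocab_size=16000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="cs_bpe.model")

# A morphologically rich word is split into smaller, reusable pieces, so the
# engine can handle inflections it has rarely (or never) seen as whole words.
print(sp.encode("nejneobhospodařovávatelnějšími", out_type=str))
```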
Therefore, if you ask whether our machine translation is good with the Czech language, the answer is YES! We have the team, technology and data to make your MT engine run smoothly and produce high-quality translations, millions of words at a time!
Q16 – How about data cleaning? What is your approach?
Anybody who has been in the translation industry long enough has come across some kind of "bad" TM. This can come in many shapes, from a simply bad translation to one that is terminologically inaccurate, etc. Fortunately for our users, this kind of data cleaning has become part of PangeaMT's standard cleaning procedure.
Some of the basic cleaning cycles are described below. They take into account procedures which have been automated for system owners, so they can rest assured that
- their initial training data is clean before engine training in order to achieve the best results possible
- any future post-edited material also goes through a virtuous cleaning cycle in order to catch any noise that might be introduced into the system and thus affect re-trainings.
This is not a comprehensive list of all cleaning steps. Nevertheless, it will give users an idea of what kind of material will be extracted for human approval before re-entering the training cycle. All segments detected as "suspect" are stripped out of the training set for human approval/revision/editing in TMX format and then re-entered into the system.
- Segments with significant difference in length between the source and target
Generally, we consider a segment "suspect" when source and target differ in length by more than 50%, but this threshold can be varied according to your particular needs (Czech, for example, is usually shorter than English, and French being 25% or 30% longer than English is not in itself an indication that anything is wrong).
- Segments where the source or target contains typographical symbols missing in the other, such as [ ], *, + or =.
- Segments where source and target are identical.
- “Empty segments”, i.e. segments with source but no target.
- Segments containing particular names or expressions which are part of the client’s preferred terminology.
All these are candidates for human revision.
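To make the checklist above concrete, here is a minimal sketch (standard-library Python only) of how such checks might be scripted. The thresholds and symbol set are illustrative assumptions, the client-terminology check is omitted for brevity, and PangeaMT's actual cleaning routines are considerably more involved.

```python
# Minimal sketch of the "suspect segment" checks described above.
import xml.etree.ElementTree as ET

SYMBOLS = set("[]*+=")
MAX_LENGTH_RATIO = 1.5  # flag segments where one side is >50% longer than the other

def is_suspect(source: str, target: str) -> bool:
    if not target.strip():                                   # empty target
        return True
    if source.strip() == target.strip():                     # source and target identical
        return True
    longer = max(len(source), len(target))
    shorter = min(len(source), len(target))
    if shorter and longer / shorter > MAX_LENGTH_RATIO:      # length mismatch
        return True
    if (SYMBOLS & set(source)) != (SYMBOLS & set(target)):   # symbol mismatch
        return True
    return False

def split_tmx(path: str):
    """Split a TMX file into clean and suspect (source, target) pairs."""
    clean, suspect = [], []
    for tu in ET.parse(path).iter("tu"):
        segs = [seg.text or "" for seg in tu.iter("seg")]
        if len(segs) < 2:
            continue
        pair = (segs[0], segs[1])
        (suspect if is_suspect(*pair) else clean).append(pair)
    return clean, suspect
```

The suspect pairs would then be exported back to TMX for human revision before any re-training, as described above.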
This is one of the things that sets PangeaMT apart from other offerings: we will train you and provide you with the tools so you become your own master in future re-trainings.
Clean data is the route to quality input and thus improved engine performance. The old translation saying applies: garbage in, garbage out. Thanks to our cleaning routines, you can rest assured that you will own a system which will strip out any "dubious" material for your consideration. And even after installation, please remember you have a full year of free support. Any odd results you see or experience, any patterns you would like to apply or correct, we are here to help. This is not a black-box system or a company selling words or engines. Our model is "user empowerment", i.e. technology transfer.
Q15 – In what way are you different from Google Translate?
While Google focuses on making as much general information available as possible and has vast resources to gather trillions of words of data, PangeaMT's approach is to build a custom application for your particular needs with your preferred terminology, expressions and word usage, i.e. a machine translation application that translates the way you want. Training data is typically provided by you and enhanced by PangeaMT. Additional language data may be added so there are sufficient lexical resources in the engine. A Language Model may be built specifically for you or adapted for your purposes. Furthermore, PangeaMT's system is designed to fit into and aid current TM-based systems by translating TMX or XLIFF formats, something that Google Translate cannot do (it only translates plain text).
By translating files and not plain text, PangeaMT plugs directly and easily into any localization or Knowledge Base workflow. TMX or XLIFF files can easily be post-edited using most (if not all) of today's CAT tools as editing tools.
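As an illustration of this file-based workflow (and not a description of PangeaMT's actual code), the sketch below walks through a TMX file and fills each empty target segment with MT output from a placeholder translate() function.

```python
# Illustrative sketch of a file-based MT workflow: read a TMX, machine-translate
# each untranslated source segment, and write the file back so any CAT tool can
# post-edit it. translate() is a placeholder for whichever MT back end is used.
import xml.etree.ElementTree as ET

def translate(text: str, source_lang: str, target_lang: str) -> str:
    raise NotImplementedError("call your MT engine here")

def machine_translate_tmx(in_path: str, out_path: str, src="en", tgt="cs"):
    tree = ET.parse(in_path)
    for tu in tree.iter("tu"):
        tuvs = list(tu.iter("tuv"))
        if len(tuvs) != 2:
            continue
        src_seg, tgt_seg = tuvs[0].find("seg"), tuvs[1].find("seg")
        if src_seg is not None and tgt_seg is not None and not (tgt_seg.text or "").strip():
            tgt_seg.text = translate(src_seg.text or "", src, tgt)
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
```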
In short, PangeaMT’s developments fit in current translation environments and automate current processes, whereas Google Translate is an informative engine.
Q14 – I deal with texts that are full of in-lines and tags. Most SMT systems only offer plain text and it takes a long time to copy and paste the in-lines/tags back in place. Have you done anything to solve this problem?
If the engine has to process very heavy in-line text, translation quality may suffer. In that scenario, some clients may choose to have in-line instances identified and presented in a given segment position (e.g. beginning or end), and have their post-editors move the tag back to the right place. Or they may opt to have the engine produce in-lines where they should be.
Experience shows that this is a fair approach. To our knowledge, our in-line parser constitutes an innovation well above the current level of maturity of well-known SMT systems. Moreover, PangeaMT solutions are the only ones providing you with a choice of content formats (TXT / TMX / XLIFF), as our mission is two-fold: to follow open standards and to democratize Machine Translation as much as possible.
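For readers who want to picture the first option above (tags presented at a fixed position for the post-editor to move back), here is a minimal, hedged sketch. The regular expression and tag names are illustrative only, not our actual in-line parser.

```python
# Minimal sketch of one tag-handling option: strip in-line tags before sending a
# segment to MT, then re-attach them at the end of the target so a post-editor
# can move them back into place.
import re

INLINE_TAG = re.compile(r"</?(?:bpt|ept|ph|it|g|x)\b[^>]*/?>")

def strip_inline_tags(segment: str):
    tags = INLINE_TAG.findall(segment)
    plain = INLINE_TAG.sub("", segment)
    return plain, tags

def reattach_tags(mt_output: str, tags) -> str:
    # Present the tags at a fixed position (here, the end of the segment).
    return mt_output + "".join(tags)

plain, tags = strip_inline_tags('Press <g id="1">Start</g> to <ph id="2"/> continue.')
# plain -> 'Press Start to  continue.'
# tags  -> ['<g id="1">', '</g>', '<ph id="2"/>']
```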
Check out our online demo, where you can test our TMX generator and some abridged versions of our domain engines in a few language directions!
Q13 – What do you mean your system is built in open standards? What is the difference with other models?
With open standards, there are no expensive lock-ins and no expensive upgrades or updates. There will be a need to update your system with your post-edited material, yes, but that is the system's learning curve. An engine pays for itself in saved translation fees within a year. An update with post-edited material is a fraction of that cost.
Once your development reaches maturity, there will be little need for maintenance – unless you are a heavy-duty corporate user with very specific requirements. You can then concentrate on producing more and more translated material, or use the experience to create more customized engines.
Q12 – Are there any good (better, free) post-editing tools you can recommend?
You can use freeware tools such as Xbench, which will aid the post-editing of TMX files and even check consistency between segments before final proof.
Q11 – What about consistency? How do you ensure my company’s terminology prevails statistically over other options?
As part of our consultancy services, PangeaMT can add more muscle to your initial set of data so that a large linguistic corpus goes into the training (we most probably have quite a bit of data to build a Language Model or to adapt one of our Language Models to your style). All the data we add will be relevant to your subject field, and the engines will be tested with and without it so you can check the effect of more data on your development. (You can find an abridged version of what a test can look like in our October 2009 news. This was part of a free test for several organizations.)
Generally speaking, it is assumed that the more data the better. There has been some controversy as to whether smaller, cleaner sets of data provide higher accuracy. This will depend largely on your application and on whether your system requires "world awareness" or you are running an engine for a very specific domain. 2M words of civil engineering data will probably have little impact if you are building a system for a software company fighting viruses, or a medical engine fighting a very different kind of virus. It is a common mistake to keep adding data in the belief that it will be useful at some point; our studies conclude that if that data is not likely to be needed or recalled, it is better to leave it as part of your Language Model.
In short, there is no way to guarantee that statistics will work one way or another (that is precisely the point of statistics: they analyze the chances of something happening). If the system is too wide, pre- and post-processing systems can be built (in a kind of hybridization) to "fix" or "force" certain expressions. There are other ways of working towards higher chances, such as the combined-engine method or combined hypotheses (i.e. combining parts of likely outputs with high certainty to remake sentences which the engine then reprocesses). So far, we have heard good reports from post-editors using the same terminology tools as with CAT tools to check terminology consistency.
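A simple way to picture the pre-/post-processing idea is the placeholder technique sketched below. The glossary entry and token format are purely illustrative, not PangeaMT's implementation.

```python
# Minimal sketch of "forcing" preferred terminology around an MT engine:
# protect glossary terms with placeholders before translation, then restore
# the approved target-language terms afterwards.
GLOSSARY = {"control valve": "regulační ventil"}   # hypothetical EN -> CS preferred term

def preprocess(source: str):
    protected = {}
    for i, term in enumerate(GLOSSARY):
        token = f"__TERM{i}__"
        if term in source:
            source = source.replace(term, token)
            protected[token] = GLOSSARY[term]
    return source, protected

def postprocess(mt_output: str, protected: dict) -> str:
    for token, target_term in protected.items():
        mt_output = mt_output.replace(token, target_term)
    return mt_output

masked, protected = preprocess("Open the control valve slowly.")
# The MT engine translates the masked sentence; the placeholder passes through,
# and postprocess(mt_output, protected) then forces the approved Czech term.
```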
Q10 – Can you build any combination (for example Chinese or Japanese into Spanish or Russian)? What are the challenges?
This is the greatest advantage of statistical systems. All you need is data, not linguistic knowledge of how language A relates to language B. If you are building "rules" between Japanese or Chinese and any European language, you are facing a tough task. Transfer rules become more and more remote between unrelated languages. But with a statistical system, your engine analyzes the chances of a word or series of words occurring when certain expressions occur in the other language. SMT systems also work very well with similar or "related" languages, as little reordering is needed. When we are dealing with very remote languages, peripheral processes, pre-processing and post-processing become very important, as does word reordering (i.e. making the sentence flow). How the Language Model is built is also important, but the key is really a good set of pre-processing and post-processing steps.
The answer is thus yes: any language combination can be built, and much faster and more efficiently than with rule-based systems.
Q9 – If I use MT, does that mean I cannot use my TM-based systems any more? Can you integrate MT with my TM-based software?
PangeaMT offers several service models, all of which work alongside your TM-based software:
– A full MT+PE service, mostly for corporate users looking for a package solution. We develop the MT system with your data, take charge of the development and training of the engine, plus the post-editing of the output. The engine can be hosted at either site and it produces plain text. Since 2009 we have offered a seamless TMX workflow, XLIFF compatibility since 2010, and since 2011 TTX integration with % match recognition, so you can leverage text from your existing TMs using your CAT tool and then ask the engine to do the hard work.
– SaaS services: we develop a "theme" engine in the domain and language you require with your data, and you use it on a "pay-as-you-go" basis, buying raw MT output that you then post-edit internally in TMX, XLIFF or TTX format. The engine is hosted internally at PangeaMT.
However, the most popular implementation is our customization of an engine that is hosted internally on the client's server. Again, we develop and train an engine that will fit your domain and expressions, using your TM data and related data to build it. This is installed on your server, together with a set of peripheral modules (tag parser, intranet web interface, data transfer scripts, Language Model, control panel, etc.). You can then use it for translation as many times as you like within your organization; the only limitation is the number of servers the engine is installed on. There is a period of engine adjustment and fitting into your system and, of course, re-training is highly recommended once you have gathered a certain amount of post-edited material.
Your existing TM software (or any new one you may acquire) can become your post-editing environment. There is no need to start a long learning curve with your existing linguists and suppliers. As PangeaMT works with a TMX workflow, you simply export the segments that you need to translate (typically matches below 70% or 75%), get the TMX translated, and update your project TM applying a penalty to the machine-translated segments, so your TM software will stop every time it finds a segment that has been machine-translated. Alternatively, just leverage your existing translations from your TM in a CAT tool using the open standard XLIFF or proprietary TTX and send the batch of files to your PangeaMT engine for translation.
It could not be easier, and the system therefore interfaces easily with your existing TM environment. The advantage is that you need never update the CAT software again: your system is MT-driven and will improve with the data you now generate. Furthermore, the system offers the advantage of leveraging high-percentage matches from your TM (which would make no sense to send to MT, as a human can quickly spot the difference) combined with the power of a domain-specific statistical engine.
One alternative (depending on the CAT software you use) is to build an API to interface with your translation software segment by segment whenever the TM match does not reach a given threshold.
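That segment-level integration could look roughly like the sketch below; tm and mt_engine are hypothetical objects standing in for your TM software and the MT engine (not a real PangeaMT API), and the threshold mirrors the 70-75% figure mentioned above.

```python
# Illustrative sketch: query the TM first and fall back to the MT engine only
# when the best fuzzy match is below a threshold.
MATCH_THRESHOLD = 75  # per cent; typical cut-off mentioned above is 70-75%

def translate_segment(source: str, tm, mt_engine) -> str:
    match = tm.best_match(source)            # hypothetical TM lookup
    if match is not None and match.score >= MATCH_THRESHOLD:
        return match.target                  # leverage the TM, no MT needed
    return mt_engine.translate(source)       # below threshold: call the MT engine
```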
Your savings in translation are immediate. You can then turn over more content and more text and reach more clients.
Q8 – What about “translator resistance” to become a post-editor?
Every "new" technology (or technique) always faces resistance. There is nothing we love more than security and certainty. In the translation world, this means the relatively long (?) cycle of learning CAT tools. We do not mean only the omnipresent tools which are marketed so well, but also the lesser-known tools that can do the job pretty well. Some have made a conscious effort to offer plug-ins to MT (Swordfish, from maxprograms.com) and, like PangeaMT, are designed on open standards with a "no lock-in" mentality. Now you are telling your translators to "correct" machine output, and at a lower fee. Back to the 90's…
Indeed, there may be some resistance from long-standing translators. Recent graduates, on the other hand, are trained in translation theory linked to computer-assisted tools.
However, now that end users can, in certain contexts, play with already built systems, even if not fully customized to their domain, the post-editing stage may become a selection criterion. Before full deployment, corporations, organizations, industries and LSPs usually run quantifiable evaluation pilots to become accustomed to post-editing tasks, to identify recurrent changes for automatic solutions, and to base expectations about quality and pricing on objective data. This means that future post-editors, whether they are current translators or new recruits, need to be involved at some stage prior to deployment.
Post-editing is still a nascent profession and experimentation with MT systems is required to gain a set of skills for each language. For example, you may be running an engine which lacks general "world" vocabulary, or very common words. This may be annoying in large-scale systems, and we run statistical dictionary modules to add words which were not in your training corpus. Nevertheless, post-editors in localization or documentation environments may think it is better to leave unknown terms in the source language so they can run "search & replace" and post-edit quickly. Thus, do expect the same resistance any new technology encounters, but explain its benefits. Human translation alone cannot resolve the issues of speed and cost in the digital content era. There are simply not enough qualified translators and, even if there were, the logistics and costs of translating 50,000 words in a day or two would drive project managers crazy. These pressures may also explain the high human "turnover" in the language industry. The truth is that, with the advent of online translation services, desktop services and MT server engines, machines already translate more words than humans…
Q7 – What is the ROI on an MT engine?
The following table shows the cost of translating 750k new words per year with a CAT tool at 11c per word. Two biannual software upgrades have been factored in.
SMT = cost of customized training (year 1), two yearly updates, and 750k new words post-edited at 60% of the translation rate. "Protection Plan" from year 2.*
* Includes in-line parser
Year | CAT translation costs, 750k new words p.a. | SMT + training + upgrade + PE, 750k words
Year 1 | 82,500 | 43,912.5
Year 2 (soft upgrade) | 85,500 | 22,207.5
Year 3 | 82,500 | 21,352.5
Year 4 (soft upgrade) | 85,500 | 21,352.5
Year 5 | 82,500 | 21,352.5
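For readers who want to run their own numbers, here is a back-of-the-envelope sketch. The training and update figures are illustrative placeholders, and the script does not attempt to reproduce the exact breakdown behind the table above (which also folds in TM leverage and the Protection Plan).

```python
# Back-of-the-envelope comparison of cumulative CAT vs MT + post-editing cost.
# All figures are illustrative assumptions, not PangeaMT's price list.
WORDS_PER_YEAR = 750_000
HUMAN_RATE = 0.11          # per word, full translation with a CAT tool
PE_RATE = 0.11 * 0.60      # assumed post-editing rate (60% of the full rate)
TRAINING_COST = 15_000     # hypothetical one-off engine customization, year 1
YEARLY_UPDATE = 2_000      # hypothetical re-training with post-edited material

def cumulative_costs(years: int):
    cat = mt_pe = 0.0
    for year in range(1, years + 1):
        cat += WORDS_PER_YEAR * HUMAN_RATE
        mt_pe += WORDS_PER_YEAR * PE_RATE + (TRAINING_COST if year == 1 else YEARLY_UPDATE)
        yield year, round(cat), round(mt_pe)

for year, cat, mt in cumulative_costs(5):
    print(f"Year {year}: CAT {cat:,} vs MT+PE {mt:,}")
```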
Q6 – What do you mean by re-training? Do engines need to be updated all the time, like TMs? How much does it cost?
PangeaMT will use your post-edited material to refine a language model for your particular case (i.e. an engine that speaks pharma like a bilingual EN/FR linguist, or an engine that speaks like a bilingual German engineer, etc.). Depending on the particular field and the size of your bilingual data, more content may be required or may need to be generated. Thus, the first engine, good as it will be, is at what we call "Stage 1" (really, we call it its teenage stage). Once you provide us with more information (typically a TMX file with previous translations or post-edited content), we re-train the engine with more material of the kind it is intended to translate. This means that the engine gives more and more preference to certain expressions and word combinations.
PangeaMT reached 1.2Bn aligned sentences for machine learning in 2018 and 4.5Bn in 2019. Gathering huge resources for machine learning helps us create near-human-quality machine translation engines with little client text input.
In-domain material is usually added at the beginning and at the end of the neural engine training cycle. This ensures that the algorithm picks up the nuances and characteristics of the domain, language and field it will translate. This is particularly true for material added at the end of the training cycle (the last epochs), which is highly prioritized and thus serves as a "domain and style filter".
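Conceptually, the scheduling described above might be sketched as follows. This is framework-agnostic illustration code; the oversampling factor and two-phase split are assumptions for the example, not our exact training recipe.

```python
# Sketch of the scheduling idea: the bulk of training mixes generic and in-domain
# data, while the final fine-tuning pass uses (oversampled) in-domain data only,
# so the last epochs act as a domain and style filter.
def build_training_schedule(generic_corpus, in_domain_corpus, oversample=3):
    main_phase = list(generic_corpus) + list(in_domain_corpus)
    fine_tune_phase = list(in_domain_corpus) * oversample
    return [
        {"phase": "main", "data": main_phase},           # broad lexical coverage
        {"phase": "fine-tune", "data": fine_tune_phase},  # last epochs: domain & style
    ]
```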
Q5 – Are savings in translation immediate?
Yes, your translation costs (and, more importantly, time-to-market) will be reduced significantly. You will notice this from the first week. Nevertheless, do not forget that engines improve over time and that a few re-training exercises (at least one a year) are highly recommended. Post-edited material is a very good candidate for engine re-training since it reflects your day-to-day needs.
Q4 – How much does post-editing cost?
Q3 – Will MT kill the need for a human translator?
The digital era has also transformed the role of the translator and, for quite some time, translators have had to deal with formatting problems inside CAT tools that they were not trained for. The XLIFF and DITA standards are one way of helping translators do what they do best (language transfer) rather than fight with tags and colors inside Computer-Assisted Translation tools. In this sense, MT is a massive productivity tool.
Machine Translation is going to be one of the best aids a translator can have. It improves the speed at which a translator works (by not having to "think out" translations and word connections that have been translated thousands of times before). Even if it only saved typing time, that alone would be an improvement. If you are dealing with a particular domain (say, mechanical engineering), it will help the translator become more familiar with the terminology and concentrate on the added-value tasks that only humans can do.
Curiously, the truth is that machines already translate more words per day than humans (i.e. people click the "Translate" button on a website or in a desktop or server translation program like BabelFish or Google Translate to get a general, gisting translation); there are some 300,000 registered translators in the world, with an average output of around 2,200-2,800 words per day.
Q2 – Why Statistical MT and not Rule-Based MT? What are the advantages and disadvantages?
- SMT only needs to learn from a parallel corpus to generate a translation engine. In contrast, RBMT needs a great deal of knowledge external to the corpus that only linguistic experts can generate, e.g. the surface categorization, syntax and semantics of all the words of one language, in addition to the transfer rules between languages. These latter rules are entirely dependent on the language pair involved and are generally not as well studied as the characterization of each separate language. Defining general transfer rules is not easy, so multiple rules for individual cases need to be defined, especially between languages with very different structures and/or when the source language has greater flexibility in how structural objects are arranged in a sentence.
- An SMT system can be developed rapidly if the appropriate corpus is available, making it more profitable. An RBMT system, in turn, entails great development and customization costs before it reaches the desired quality threshold. Packaged RBMT systems have already been developed by the time the user purchases them: most users approach MT by purchasing "out of the box" or "server ready" programs. The program works and will work in a certain way, but it is extremely difficult to reprogram its models and equivalences. Above all, RBMT deployment is generally a much longer process involving more human resources. This is a key issue when companies calculate full implementation cost.
- SMT can be retrained automatically to adapt to situations not seen before (hitherto unknown words, new expressions that are translated differently from the way they were previously translated, etc.). RBMT is "re-trained" by adding new rules and vocabulary, among other things, which in turn means more time and increased handling by "expert humans".
- SMT generates more fluent translations (fluency), although pure statistical systems may offer less consistency and less predictable results if the training corpus is too wide for the purpose. RBMT, on the other hand, may not find the surface/syntactic information suitable for analyzing the source language, or may not know a word at all, which will prevent it from finding an appropriate rule.
- While statistical machine translation works well for translations in a specific domain, with the engine trained on a bilingual corpus in that domain, RBMT may work better for more general domains.
- There is a clear need for powerful computing hardware in SMT to train the models. Billions of calculations take place during engine training, and the hardware and computing knowledge required is highly specialized. However, training time can be reduced nowadays thanks to the wider availability of more powerful computers. RBMT requires a longer deployment and compilation time by experts, so that, in principle, building costs are also higher.
- SMT generates statistical patterns automatically, including good learning of exceptions to rules. The transfer rules governing RBMT systems can certainly be seen as special cases of statistical patterns; nevertheless, they generalize too much and cannot handle exceptions.
- Finally, SMT systems can be upgraded with syntactic, and even semantic, information, like RBMT. In this case, the statistical patterns that an SMT system would learn can be seen as a more general type of transfer rule, although currently the inclusion of such information does not provide significant improvements.
- An SMT engine can generate improved translations when retrained or adapted again. In contrast, RBMT generates very similar translations between different versions.
Q1 – How many words do I need to build a good engine?
Typical PangeaMT developments within domains (software, electronics, automotive, engineering, tourism) have started at 5M words. There are several ways to increase the number of words by gathering reliable parallel texts, and PangeaMT offers consultancy and guidance so you can start an engine with as many words as possible. We call an engine with 15M or 20M words "mature" within a domain, because it is likely to have most of the terminology, vocabulary and expressions required for that language domain. Do not despair if you do not have that much data. The important thing is to get the engine started. You can add post-edited material and other material you gather with experience in later re-trainings.
There has been much argument about the "unreasonable effectiveness of massive amounts of data" versus "smaller amounts of well-selected data". Many people considering their first MT development are unsure whether to put in as much text as possible (massive amounts of data) or to select the most accurate bilingual texts possible, even if that means dealing with smaller sets of data. Our experience points in several directions:
a) if you are trying to build a generalist type of engine, capable of translating the unexpected (from news articles to economics papers and literature), gather as much data as possible. You are trying to build a system that caters for sunny days and rainy days. No amount will ever be enough. Sooner or later, you will need to build some kind of syntactic aids into it.
b) if you are trying to build an engine that will fit your particular language field and needs (or even if you want an engine that understands your products and services, plus some financial and legal language), you do not need trillions of words of literature. In this case, gathering as much data as possible from your organization (or similar sources) is more reasonable and worth the effort.
Either way, do not underestimate the effort and teamwork required during the data-gathering stages. This is essential for the good training (and thus, the results) of the engine. It will be the beginning of the change in your adoption of MT technologies and a good chance to involve stakeholders in the process.