FAQs

Common Questions (and Fears) or everything you wanted to know about MT but were afraid to ask
Implementations, free training programs and several domain-specific customized engines developed for our own use/clients have made us realize that there are several key questions, fears and misconceptions when companies, other LSPs, and even translators approach the use of MT.

Machine Translation is a hot topic. Gone are the times when translation companies could proclaim “machines will never match human quality”. Market pressures and, above all, the need to shorten time-to-market have turned the limelight on translation automation. Several developments have even made it to the press, and the advent of free, general-purpose, plain-text engines on the web, like Google Translate, has put translation and fast language transfer high on the agenda for international businesses. President Obama’s call in October 2009 for the advancement and improvement of machine translation to help the world communicate and understand each other better only made the subject gather more momentum (reports in the New York Times and The Economist, March 2010).

Yes, some translation technologies have been around for over 50 years, but how much better are the newer technologies? How can MT be implemented successfully and integrated in a real-life production environment? What is the expected productivity increase and resulting cost saving? How will translators and staff react to the MT output? How can you manage it? As happens with any innovation, there are fears and uncertainties … until a few success stories guide the way.

By answering the 17 key questions (or any others that you may have), we hope to provide enough information from experience, and a few practical approaches, on how to convert this new challenge into an innovative, competitive strategy.

If you have wondered or asked anyone any of the following questions, this will be a key section for you.

Q17 – Is your machine translation good with Czech language?

This is a typical question from some of our Slavic-speaking clients: Is your machine translation good with Czech language? Is your machine translation good with Russian? Is your machine translation good with Croatian?

Slavic languages have many cases (word inflections). This made statistical machine translation work quite badly, because the probability of any given string occurring in the training data was quite low. These types of languages are also called “morphologically rich languages” because of the number of combinations that are possible.

Neural networks changed the approach completely. A neural network works both below and above the word level to capture how each word is formed and how it relates to the words around it. This means that neural network-based machine translation models the relations between the different words in a sentence far better. By taking into account the dependencies between words learned from the training data, neural network-based machine translation produces output with a near-human flow, often described as human-quality machine translation.

One of our clients asked us:

I thought that PangeaMT provide only generic engines and we can customize this engines with our own TMs and create in-domain specific „mirrors“ (with using “OnlineTraining” module). And I know that our language combinations (EN <-> CS and DE <-> CS, both ways) are not enough supported by other MT providers (Czech language is really complicate for MT solutions). So I had to ask If PangeaMT provide this two combination as well.

Well, indeed, you can customize your engine with our online tool using your own server. This provides a lot of freedom and independence when setting up a machine translation environment for a translation agency. Agencies and their linguists tackle texts and documents of very different kinds, often with conflicting terminology. Mixing everything in a single engine would be detrimental to performance and accuracy.

Take a Czech-English TMX file, for example.

Translators are very familiar with this format. It is the plain-text, XML-based exchange version of a Translation Memory. Every time a translator saves a segment, he or she stores the target-language equivalent of a source sentence. This is wonderful for machine learning, because translators create parallel data as they work. It is the basis of many developments at PangeaMT.
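To make the idea of parallel data concrete, here is a minimal sketch of how sentence pairs could be read out of a TMX file with Python's standard library. The sample content, the function name read_pairs and the language codes are invented for illustration; this is not PangeaMT's own tooling, and real TMX exports usually carry more attributes and in-line markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-segment English-Czech TMX, invented for illustration.
TMX_SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="cs"><seg>Dobré ráno.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="cs"><seg>Děkuji.</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_pairs(tmx_text, src="en", tgt="cs"):
    """Return (source, target) sentence pairs found in a TMX document."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    pairs = []
    for tu in root.iter("tu"):            # one translation unit per saved segment
        segs = {}
        for tuv in tu.findall("tuv"):     # one variant per language
            lang = (tuv.get(XML_LANG) or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside in-line elements
                segs[lang] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


for source, target in read_pairs(TMX_SAMPLE):
    print(source, "=>", target)
```

Each translation unit yields one aligned pair, which is exactly the kind of bitext an engine is trained on.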

A neural network will find the relations and similarities between the sentences, down to the syllable and letter level if necessary (thanks to a very useful technique in neural training called byte-pair encoding, or BPE). This is also responsible for neural machine translation’s success and higher acceptability compared with the previous n-gram-based “Statistical Machine Translation”, which is still successful with short sentences because of its higher “memorizing” capabilities, as explained in our first publications on neural machine translation back in 2017. Our findings at the time showed that a short sentence of fewer than 9-10 words could probably be translated more accurately with a statistical system than with a neural system. As systems have improved over the years, the gap between the two has narrowed. However, it is true that when ecommerce websites only need to translate a couple of words, and those words were part of the training data, a statistical system will recall them more quickly and efficiently. A neural system, however, will reconstruct the sentence with more human-like fluency.
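The subword idea behind BPE can be sketched in a few lines. The toy example below is an illustration only (the word list, function names and number of merges are assumptions, and production systems use dedicated subword libraries): it learns the most frequent symbol pairs from a handful of inflected forms of the Czech noun "hrad" and then splits an unseen form into a known stem plus smaller pieces.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken alphabetically so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical training words: inflected forms of the Czech noun "hrad".
forms = ["hrad", "hradu", "hradem", "hrady", "hradů"]
merges = learn_bpe(forms, num_merges=3)
print(segment("hradech", merges))  # an unseen inflection of the same stem
```

Because the stem is learned as a single unit, an inflection that never appeared in training still shares most of its representation with the forms that did, which is why subword models cope better with morphologically rich languages.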

Therefore, if you ask whether our machine translation is good with the Czech language, the answer is YES! We have the team, technology and data to make your MT engine run smoothly and produce high-quality translation for millions of words!

Q16 – How about data cleaning? What is your approach?

Companies should not underestimate (and often only begin to understand) the effort required in data cleaning when they begin to export bilingual (parallel) data for machine learning. Due to CAT limitations and features, noise can enter a sentence in the shape of unwanted code, but the concept of data cleaning goes beyond removing in-lines, as explained in Q14. Some typical examples of necessary data cleaning were presented at Japan Translation Federation 2011 as part of our Japanese Syntax-Based Machine Translation Hybrid.

Anybody who has been in the translation industry long enough has come across some kind of “bad” TM. This could come in many shapes, from simply being a bad translation to being terminologically inaccurate, etc. Fortunately for our users, this kind of data cleaning has become part of PangeaMT’s standard cleaning procedure.

Some of the basic cleaning cycles are described below. They include procedures which have been automated for system owners so they can rest assured that

  • their initial training data is clean before engine training in order to achieve the best results possible
  • any future post-edited material also goes through a virtuous cleaning cycle in order to check any noise that may be introduced in the system and thus affect re-trainings.

PangeaMT first needs to ensure that the initial training set from the client has passed all cleaning checks before training. This results in a clean bitext (parallel corpus) and aids machine learning. The data then enters the engine training cycle together with PangeaMT’s own processes, from language-specific rules to syntax or POS tagging.

This is not a comprehensive list of all cleaning steps. Nevertheless, it will allow users to realise what kind of material will be extracted for human approval before re-entering the training cycle. All segments detected as “suspect” will be stripped out of the training set, exported in TMX format for human approval, revision or editing, and then re-entered in the system.

  1. Segments with significant difference in length between the source and target
    Generally, we consider a segment “suspect” when source and target differ in length by more than 50%, but this threshold can be varied according to your particular needs (Czech, for example, is usually shorter than English, and French being 25% or 30% longer than English is not in itself an indication that anything is wrong).
  2. Segments where source or target contains typographical symbols missing in the other, such as [ ], *, + =.
  3. Segments where source and target are identical.
  4. “Empty segments”, i.e. segments with source but no target.
  5. Segments containing particular names or expressions which are part of the client’s preferred terminology.

All these are candidates for human revision.
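The checks listed above lend themselves to a simple filter. The sketch below is a rough illustration rather than PangeaMT's cleaning routine: the function names, the symbol list and the reading of the 50% rule (the longer side exceeding the shorter by more than half) are all assumptions, and the threshold is exposed as a parameter precisely because it varies by language pair.

```python
SYMBOLS = "[]*+="  # typographical symbols that should appear on both sides


def suspect_reasons(source, target, max_ratio=1.5, client_terms=()):
    """Return the reasons (possibly none) for sending a segment to human review."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return ["empty segment"]
    reasons = []
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    if longer / shorter > max_ratio:          # assumed reading of the 50% rule
        reasons.append("length difference")
    if any((ch in src) != (ch in tgt) for ch in SYMBOLS):
        reasons.append("symbol mismatch")
    if src == tgt:
        reasons.append("identical source and target")
    if any(t.lower() in src.lower() or t.lower() in tgt.lower() for t in client_terms):
        reasons.append("client terminology")
    return reasons


def split_corpus(pairs, **options):
    """Separate clean pairs from suspects that need human approval."""
    clean, suspect = [], []
    for src, tgt in pairs:
        reasons = suspect_reasons(src, tgt, **options)
        if reasons:
            suspect.append((src, tgt, reasons))
        else:
            clean.append((src, tgt))
    return clean, suspect


# Hypothetical English-Czech segments, invented for illustration.
corpus = [
    ("Good morning.", "Dobré ráno."),
    ("Hello", ""),
    ("Press [OK] to continue.", "Stiskněte OK."),
    ("PangeaMT", "PangeaMT"),
]
clean, suspect = split_corpus(corpus)
for src, tgt, reasons in suspect:
    print(repr(src), repr(tgt), reasons)
```

Only the clean pairs would go straight into training; the suspects, with their reasons attached, are what a reviewer would receive for approval or correction.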

This is one of the things that sets PangeaMT apart from other offerings: we will train you and provide you with the tools so you become your own master in future re-trainings.
Clean data is the route to quality input and thus improved engine performance. The old translation saying applies: garbage in, garbage out. Thanks to our cleaning routines, you can rest assured that you will own a system which will strip out any “dubious” material for your consideration. But even after installation, please remember you have a full year of free support. Whatever odd results you see or experience, whatever patterns you would like to apply or correct, we are here to help. This is not a black-box system, nor are we a company selling words or engines. Our model is “user empowerment”, i.e. technology transfer.

Q15 – In what way are you different from Google Translate?

Greatly. As part of its mission to organize the world’s information, Google takes on translation as an informative offering, very cleverly done, but also generalist. Its translation application is actually state-of-the-art, but it attempts to be a portal that can handle translation requests for every topic. Just as we found during our initial efforts with MT, Google decided to ditch rule-based approaches to machine translation and embraced statistical methodologies. This is not so surprising, as scientists at both organizations are strong SMT advocates and there has been a degree of academic collaboration between PangeaMT’s core R&D team and some of Google’s lead researchers.

While Google focuses on making as much general information available as possible and has wide resources to gather trillions of words of data, PangeaMT’s approach is to build a custom application for your particular needs with your preferred terminology, expressions and word usage, i.e. a machine translation application that translates as you wish. Training data is typically provided by you and enhanced by PangeaMT. Additional language data may be added so there are sufficient lexical resources in the engine. A Language Model may be specifically built for you or adapted for your purposes. Furthermore, PangeaMT’s system is designed to fit and aid current TM-based systems by translating TMX or XLIFF formats, something that Google Translate cannot do (it only translates plain text).

By translating files and not plain text, PangeaMT plugs in directly and easily into any localization or Knowledge Base workflow. TMX or XLIFF files can be easily post-edited using most (if not all) of yesterday’s CAT tools as editing tools.

In short, PangeaMT’s developments fit in current translation environments and automate current processes, whereas Google Translate is an informative engine.

Q14 – I deal with texts that are full of in-lines and tags. Most SMT systems only offer plain text and it takes a long time to copy and paste the in-lines/tags back in place. Have you done anything to solve this problem?

Yes, that’s right. Statistical machine translation systems usually produce plain text output because this is also the format they can process. However, we are keen to see PangeaMT solutions in use and adapted to the most demanding language industry requirements. This is why we focused our effort on developing SMT engines capable of handling the in-line coding typical of other content formats used in localization production environments. Thanks to an innovative inline parser, PangeaMT can identify in-lines without attempting to translate them. An in-line placeholder is first inserted, and then replaced by the in-line itself before the output.
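The placeholder mechanism described above can be illustrated with a small sketch. It is not PangeaMT's parser: the tag pattern, the placeholder format "__TAGn__" and the function names are assumptions made for the example, and the translation step is simulated with a simple string replacement.

```python
import re

TAG_RE = re.compile(r"<[^<>]+>")            # assumed shape of an in-line tag
PLACEHOLDER_RE = re.compile(r"__TAG(\d+)__")


def mask_inlines(segment):
    """Swap each in-line tag for a numbered placeholder the engine leaves alone."""
    tags = []

    def repl(match):
        tags.append(match.group(0))
        return f"__TAG{len(tags) - 1}__"

    return TAG_RE.sub(repl, segment), tags


def restore_inlines(text, tags):
    """Put the original tags back wherever their placeholders ended up."""
    return PLACEHOLDER_RE.sub(lambda m: tags[int(m.group(1))], text)


source = "Click <b>Save</b> to keep your changes."
masked, tags = mask_inlines(source)
print(masked)

# Stand-in for the engine: a real system would translate `masked` here.
translated = masked.replace("Save", "Uložit")
print(restore_inlines(translated, tags))
```

Because placeholders are numbered, the tags return to the right spot even if the engine reorders the words around them.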

If the engine has to process very heavy in-line text, translation quality may suffer. In that scenario, some clients may choose to have in-line instances identified and presented in a given segment position (e.g. beginning or end), and have their post-editors move the tag back to the right place. Or they may opt to have the engine produce in-lines where they should be.

Experience shows that this is a fair measure. To our knowledge, our in-line parser constitutes an innovation well above the current level of maturity of well-known SMT systems. Moreover, PangeaMT solutions are the only ones providing you with a choice of content formats (txt / TMX / XLIFF), as our mission is two-fold: to follow open standards and to democratize Machine Translation as much as possible.

Check out our online demo, where you can test our TMX generator and some abridged versions of our domain engines in a few language directions!

Q13 – What do you mean your system is built on open standards? What is the difference with other models?

It means we support industry-wide standards that are not the property of any one company. We want to bring democracy to translation, and particularly, the MT world. Both have been dominated by technology proprietors with an acute eye for business, but facts plainly tell us there was little interest in the advancement of the industry.

With open standards, there are no expensive lock-ins, no expensive upgrades and updates. There will be a need to update your system with your post-edited material, yes, but this is the learning curve of the system. An engine pays for itself in saved translation fees within one year. An update with post-edited material is a fraction of that cost.

Once your development reaches maturity, there will be little need for maintenance, unless you are a heavy-duty corporate user with very specific requirements. You can then concentrate on producing more and more translated material, or use the experience to create more customized engines.

Q12 – Are there any good (better, free) post-editing tools you can recommend?

Yes. Any CAT tool can become an excellent post-editing environment. When you follow an open standards TMX workflow, you will be able to leverage matches from your TM while profiting from large chunks of translated text.

You can also use freeware tools such as XBench, which will aid post-editing TMX files and even check consistency between segments before final proof.

Q11 – What about consistency? How do you ensure my company’s terminology prevails statistically over other options?

Ideally, your customized engine(s) should only contain your own data to ensure no noisy material perturbs your writing or company style. In reality, few organizations have that much data available. Data gathering and consultancy on how to obtain more relevant data has become a favorite sport among SMT developers.

As part of our consultancy services, PangeaMT can add more muscle to your initial set of data so that a large linguistic corpus goes into the training (we most probably have enough data to build a Language Model or to make any of our Language Models more like your style). All the data we add will be relevant to your subject field and the engines will be tested with and without it so you can check the effect of more data on your development. (You can find an abridged version of what a test can look like in our October 2009 news. This was part of a free test for several organizations.)

Generally speaking, it is assumed that the more data the better. There has been some controversy as to whether smaller and cleaner sets of data provide higher accuracy. This will depend largely on your application and on whether “world awareness” is required by your system or you are running an engine for a very specific domain. 2M words of civil engineering data will probably have little impact if you are building a system for a software company fighting viruses, or a medical engine fighting a very different kind of virus. It is a common mistake to keep adding data in the belief that it will be useful at some point, but our studies conclude that if that data is not likely to be needed or recalled, it is better to leave it as part of your Language Model.

In short, there is no way to ensure that statistics will work one way or another (that is precisely the point of statistics: they analyze the chances of something happening). If the system is too wide, pre- and post-processing systems can be built (in a kind of hybridization) to “fix” or “force” certain expressions. There are other ways of working towards higher chances, such as the combined engine method or the combined hypothesis (i.e. combining parts of likely outputs with a high certainty to remake sentences which the engine reprocesses). So far, we have heard of good experiences from post-editors using the same terminology tools as with CAT tools to check terminology consistency.

Q10 – Can you build any combination (for example Chinese or Japanese into Spanish or Russian)? What are the challenges?

This is the greatest advantage of statistical systems. All you need is data, not linguistic knowledge of how language A relates to language B. If you are building “rules” between Japanese or Chinese and any European language, you are facing a tough task. Transfer rules become more and more remote between unrelated languages. But with a statistical system, your engine analyzes the chances of a word or series of words occurring when other expressions occur in the other language. SMT systems also work very well with similar or “related” languages, as little reordering is needed. When we are dealing with very remote languages, peripheral processes (pre-processing and post-processing) become very important, as does word reordering (i.e. making the sentence flow). How the Language Model is built is also important, but the key is really a good set of pre-processing and post-processing steps.

The answer is thus yes: any language combination can be built, and much faster and more efficiently than with rule-based systems.

Q9 – If I use MT, does that mean I cannot use my TM-based systems any more? Can you integrate MT with my TM-based software?

There are several ways in which you can use an SMT development within your organization. One of our latest developments was introduced at Localization World Barcelona 2012. This new version of PangeaMT features self-training (so you do not have to come back to us for updates), automated engine creation, glossary and many other features. PangeaMT offers:

– A full MT+PE service, mostly for corporate users looking for a package solution. We develop the MT system with your data and are in charge of the development and training of the engine, plus the post-editing of the output. The engine can be hosted at either site and it produces plain text. We have offered a seamless TMX workflow since 2009, XLIFF compatibility since 2010 and, since 2011, TTX integration with % Match recognition, so you can leverage text from your existing TMs using your CAT tool and then ask the engine to do the hard work.

– SaaS services: we develop a “theme” engine in the domain and language you require with your data and you use it in a “pay-as-you-go” service, buying raw MT output that you then post-edit internally in TMX, XLIFF or TTX format. The engine is hosted internally at PangeaMT.

However, the most popular implementation is our customization of an engine that is hosted internally on the client’s server. Again, we develop and train an engine that will fit your domain and expressions and use your TM data and related data to build it. This is installed on your server, together with a set of peripheral modules (tag parser, intranet web interface, data transfer scripts, Language Model, control panel, etc.). You can then use it for translation as many times as you like within your organization; the only limitation is the number of servers on which the engine is installed. There is a period of engine adjustment and fitting into your system and, of course, re-training is highly recommended after you gather a certain amount of post-edited material.

Your existing TM software (or any new one you may acquire) can become your post-editing environment. There is no need to start a long learning curve with your existing linguists and suppliers. As PangeaMT works with a TMX workflow, you simply need to export those segments that you need to translate (typically matches below 70% or 75%), get the TMX translated, and update your project TM applying a penalty to the “MT!” translator ID, so that your TM software will stop every time it finds a segment that has been machine-translated. Alternatively, just leverage your existing translations from your TM in a CAT tool using open standard XLIFF or proprietary TTX and send the batch of files to your PangeaMT engine for translation.

It could not be easier, and the system interfaces smoothly with your existing TM environment. The advantage is that you need not update the CAT software ever again: your system is MT driven and will improve with the data you now generate. Furthermore, the system combines the advantages of leveraging high-percentage matches from your TM (which it would make no sense to send to MT, as a human can quickly spot the difference) with the power of a domain-specific statistical engine.

One alternative (depending on the CAT software you use) is to build an API to interface with your translation software segment after segment if the TM match does not reach a threshold.
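The threshold logic behind such an integration can be sketched as follows. This is only an illustration under stated assumptions: the similarity score comes from Python's difflib rather than from any CAT tool's own fuzzy-match algorithm, the 75% threshold is taken from the range mentioned above, and the function names and sample TM are hypothetical.

```python
from difflib import SequenceMatcher


def best_tm_match(segment, tm):
    """Return the closest TM source sentence and its similarity (0.0 to 1.0)."""
    best_src, best_score = None, 0.0
    for src in tm:
        score = SequenceMatcher(None, segment.lower(), src.lower()).ratio()
        if score > best_score:
            best_src, best_score = src, score
    return best_src, best_score


def route(segment, tm, threshold=0.75):
    """Reuse the TM translation above the threshold, otherwise send to MT."""
    src, score = best_tm_match(segment, tm)
    if src is not None and score >= threshold:
        return ("TM", tm[src])
    return ("MT", None)   # the caller would now request a machine translation


# Hypothetical one-entry English-Czech translation memory.
tm = {"Save the file.": "Uložte soubor."}

print(route("Save the file.", tm))
print(route("Restart the engine after the update completes.", tm))
```

High matches stay with the TM, where a human can spot the small difference quickly, and everything below the threshold goes to the engine.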

Your savings in translation are immediate. You can then turn over more content and more text and reach more clients.

Q8 – What about “translator resistance” to becoming a post-editor?

If you remember translators’ resistance to using CAT tools in the late 90’s (I do, I freelanced in the UK in those days), you will get an idea of how post-editing may be viewed in 2010 and onwards.

Every “new” technology (or technique) always faces resistance. There is nothing we love more than security and certainty. In the translation world, this means the relatively long (?) cycle of learning CAT tools. We do not mean only the omnipresent tools which are marketed so well, but also the lesser-known tools that can do the job pretty well. Some have made a conscious effort to offer plug-ins to MT (Swordfish, from maxprograms.com) and, like PangeaMT, are designed on open standards with a “no lock-in” mentality. Now you are telling your translators to “correct” machine output, and at a lower fee. Back to the 90’s…

Indeed there may be some resistance from long-standing translators. Recent graduates are still trained in translation theory linked to computer-assisted tools.

However, now that end users can in certain contexts play with already built systems, even if not fully customized to their domain, the post-editing stage may become a selection criterion. Before full deployment, corporations, organizations, industries and LSPs usually run quantifiable evaluation pilots to become accustomed to post-editing tasks, identify recurrent changes for automatic solutions, and base expectations about quality and pricing on objective data. This means that future post-editors, whether they are current translators or new recruits, need to be involved at some stage prior to deployment.

Post-editing is still a nascent profession and experimentation with MT systems is required to gain a set of skills relative to each language. For example, you may be running an engine which lacks general “world” vocabulary or very common words. This may be annoying in large-scale systems, and we run statistical dictionary modules to add words which were not in your training corpus. Nevertheless, post-editors in localization or documentation environments may think it is better to leave unknown terms in the source language so they can run “search & replace” and post-edit quickly. Thus, do expect the same resistance any new technology encounters, but explain its benefits. Human translation cannot resolve the issues of speed and cost in the digital content era. There are simply not enough qualified translators and, even if there were, the logistics and costs of translating 50,000 words in a day or two would drive project managers crazy. These pressures may also explain the high staff turnover in the language industry. The truth is that with the advent of online translation services, desktop services and MT server engines, machines already translate more words than humans…

Q7 – What is the ROI on an MT engine?

Engines typically pay for themselves within the first year of operation. PangeaMT’s mission is to bring democracy to the machine translation world and make the technology affordable and usable for as many users as possible. The cost of an engine has become extremely affordable. Thus, early adopters are benefiting more, as their systems can reach maturity levels faster. This, in turn, means savings and the possibility to automate processes in more languages and domains.

The following table shows the cost of translating 750k new words per year with a CAT tool at 11c per word. Two biennial software upgrades (in years 2 and 4) have been included.

SMT = cost of customized training (year 1), 2 yearly updates and 750k new words at 60% of translation rate. “Protection Plan” from year 2.*

* Includes in-line parser

Year                     CAT translation costs        SMT + training + upgrade + PE
                         (750k new words p.a.)        (750k new words p.a.)
Year 1                   82,500                       43,912.5
Year 2 (soft upgrade)    85,500                       22,207.5
Year 3                   82,500                       21,352.5
Year 4 (soft upgrade)    85,500                       21,352.5
Year 5                   82,500                       21,352.5
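As a quick check on what these figures add up to, the short sketch below totals both columns over the five years. It uses only the yearly amounts from the table (with the Year 4 figures read as 85,500 and 21,352.5); the variable names are illustrative and no currency is assumed.

```python
# Yearly costs copied from the table above, years 1 to 5.
cat_costs = [82_500.0, 85_500.0, 82_500.0, 85_500.0, 82_500.0]
smt_costs = [43_912.5, 22_207.5, 21_352.5, 21_352.5, 21_352.5]

cat_total = sum(cat_costs)
smt_total = sum(smt_costs)
saving = cat_total - smt_total

# Running difference, to see when the MT route becomes cheaper.
cumulative = []
running = 0.0
for year, (cat, smt) in enumerate(zip(cat_costs, smt_costs), start=1):
    running += cat - smt
    cumulative.append((year, running))

print(f"CAT, 5 years: {cat_total:,.1f}")    # 418,500.0
print(f"SMT, 5 years: {smt_total:,.1f}")    # 130,177.5
print(f"Saving:       {saving:,.1f}")       # 288,322.5
print(f"Saving share: {saving / cat_total:.1%}")
for year, value in cumulative:
    print(f"Cumulative saving after year {year}: {value:,.1f}")
```

On these numbers the MT route is already cheaper in year 1, and the gap widens every year after that.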

Q6 – What do you mean by re-training? Do engines need to be updated all the time, like TMs? How much does it cost?

Your engines will be built with material that you need to provide to PangeaMT for the training. Otherwise, we can use generic material that we have in most language combinations. As of September 2019, we have 4.5Bn aligned sentences in over 80 languages, which is over 3Bn more sentences for machine learning than in 2018, as reported in Slator.

PangeaMT will use this material to refine a language model for your particular case (i.e. an engine that speaks pharma like a bilingual EN/FR speaker, or an engine that speaks like a bilingual German engineer, etc.). Depending on the particular field and the size of your bilingual data, more content may be required or may need to be generated. Thus, the first engine, good as it will be, is at what we call “Stage 1” (really, we call it teenage). Once you provide us with more information (typically a TMX file with previous translations or post-edited content), we re-train the engine with more material of the kind it is intended to translate. This means that the engine gives more and more preference to certain expressions and word combinations.

PangeaMT reached 1.2Bn aligned sentences for machine learning in 2018 and 4.5Bn in 2019. Gathering huge resources for machine learning helps it create near-human quality machine translation engines with little client text input.

In-domain material is usually added at the beginning and at the end of the neural engine training cycle. This ensures that the algorithm picks up the nuances and characteristics of the domain, language and field it will translate. This is particularly true when material is added at the end of the training cycle (the last epoch), where it is highly prioritized and thus serves as a “domain and style filter”.

Q5 – Are savings in translation immediate?

Our engines pass several tests (including post-editing trials) before delivery to ensure your investment is worth the money and the time from the first day. Implementation is smooth and can be virtualized, installed in a server in your organization, work on an intranet or customized to your specific needs.

Yes, your costs in translation (and, more importantly, time-to-market) will be reduced significantly. You will be able to notice that from the first week. Nevertheless, do not forget that engines will improve over time and that a few re-training exercises (at least one a year) are highly recommended. Post-edited material is a very good candidate for engine retraining since it reflects your day-to-day needs.

Q4 – How much does post-editing cost?

Market trends point to a rate of 60% of the full translation fee for post-editing good MT output… but this should be considered a guideline rather than a fixed rule. There are many cases on both sides of that figure. We cannot tell what the best post-editing fee is in every circumstance and domain. Nevertheless, LSPs and content writers are taking that figure as a reference and working on production improvement figures. We know post-editing is also being paid by the KB, by the segment or by the hour.

Q3 – Will MT kill the need for a human translator?

Absolutely not. Computer-Assisted Translation (CAT) tools did not kill the need for human translators; in fact, they created and made the market grow, as translation was made more affordable. Most of us with some years in the industry still remember the initial resistance by some established linguists to adopt the early TM-based tools. Many considered them a gimmick, a trick to pay translators less, when the truth was that translators were being paid a lot even for repetitions as there was no way to count them… but manually. Good old 90’s….

The digital era has also transformed the role of the translator and, for quite a long time, translators have had to deal in CAT tools with formatting problems they were not trained for. XLIFF and DITA standards are one way of helping translators do what they do best (language transfer) rather than fight with tags and colors inside Computer-Assisted Translation tools. In this sense, MT is a massive productivity tool.

Machine Translation is going to be one of the best aids a translator can have. It improves the speed at which a translator works (by not having to “think” translations and word connections that have been translated thousands of times before). Even if it only saved time on typing, that alone would be an improvement. If you are dealing with a particular domain (mechanical engineering, for example), it will help the translator become more familiar with the terminology and concentrate on the added-value tasks that only humans can do.

Curiously, the truth is that machines already translate more words per day (i.e. people click the “Translate” button on a web page or in a desktop or server translation program like BabelFish or Google Translate to get a general, gisting translation) than humans (there are some 300,000 registered translators in the world, with an average output of around 2,200-2,800 words per day).

Q2 – Why Statistical MT and not Rule-Based MT? What are the advantages and disadvantages?

Any experienced MT user (or at least any reader or post-editor of MT) will tell you that Statistical MT flows much better than the traditional rule-based systems (RB). Anyone who has studied or implemented SMT will tell you implementation and development times are much shorter (and thus so is the time to ROI). RB is usually bought as a cheaper package once a company has done all the programming of rules and built in the syntax. The package is closed and customization (or hybridizing) is a longer process. Statistical MT can improve by coupling reordering and decoding, and by applying many other mathematical and statistical formulas which estimate how likely it is that a word (or a series or combination of words) occurs together with other words. Read below if you need a comprehensive listing.

  • SMT only needs to learn parallel corpus to generate a translation engine. In contrast, RBMT needs a great deal of knowledge external to the corpus that only linguistic experts can generate, e.g. superficial categorization, syntax and semantics of all the words of one language in addition to the transfer rules between languages. These latter rules are entirely dependent on language pair involved and are not generally as studied as the characterization of each separate language. Defining general transfer rules is not easy, and so multiple rules according to individual cases need to be defined, especially between languages with very different structures, and / or when the source language has greater flexibility for the management of structural objects in a sentence.
  • An SMT system is developed rapidly if the appropriate corpus is available, making it more profitable. An RBMT system, in turn, requires great development and customization costs until it reaches the desired quality threshold. Packaged RBMT systems have already been developed by the time the user purchases them: most users approach MT by purchasing “out of the box” or “server ready” programs. The program works and will work in a certain way, but it is extremely difficult to reprogram models and equivalences. Above all, RBMT deployment is generally a much longer process involving more human resources. This is one key issue when companies calculate the full implementation cost.
  • SMT can be retrained automatically to handle situations not seen before (hitherto unknown words, new expressions that are translated differently from the way they were previously translated, etc.). RBMT is ‘re-trained’ by adding new rules and vocabulary, among other things, which in turn means more time / increased handling by “expert humans”.
  • SMT generates more fluent translations, although pure statistical systems may offer less consistency and less predictable results if the training corpus is too broad for the purpose. RBMT, however, may lack the surface / syntactic information or the words needed to analyze the source language, or may simply not know a word. This prevents it from finding an appropriate rule.
  • While statistical machine translation works well for translations in a specific domain, when the engine is trained on a bilingual corpus from that domain, RBMT may work better for more general domains.
  • SMT clearly needs powerful computing hardware to train the models. Billions of calculations take place during the training of the engine, and the hardware and computing knowledge required is highly specialized. However, training time can be reduced nowadays thanks to the wider availability of more powerful computers. RBMT requires a longer deployment and compilation time by experts so that, in principle, building costs are also higher.
  • SMT generates statistical patterns automatically, and learns exceptions to rules well. The transfer rules of RBMT systems can certainly be seen as special cases of statistical patterns. Nevertheless, they generalize too much and cannot handle exceptions.
  • Finally, SMT systems can be upgraded with syntactic information, and even semantics, as RBMT systems are. In this case, the statistical patterns that an SMT system learns can be seen as a more general type of transfer rule, although the inclusion of such information in current systems does not provide significant improvements.
  • An SMT engine can generate improved translations if it is retrained or adapted again. In contrast, an RBMT system generates very similar translations from one version to the next.
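To make the idea of “learning from a parallel corpus” more concrete, here is a minimal sketch of IBM Model 1, a classic textbook word-alignment model. It is not the method used in any particular production engine (real SMT systems add phrase tables, reordering and language models on top), and the three sentence pairs are a toy example invented for illustration. The point is only that word correspondences emerge from co-occurrence statistics, with no hand-written rules.

```python
# Toy parallel corpus (hypothetical example data, English -> German).
corpus = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]


def train_ibm_model1(corpus, iterations=20):
    """Estimate t[(f, e)] = P(target word f | source word e) with EM."""
    tgt_vocab = {f for _, tgt in corpus for f in tgt}
    # Start from a uniform distribution over co-occurring word pairs.
    t = {}
    for src, tgt in corpus:
        for e in src:
            for f in tgt:
                t[(f, e)] = 1.0 / len(tgt_vocab)

    for _ in range(iterations):
        count = {}   # expected count of (f, e) alignments
        total = {}   # expected count of alignments per source word e
        # E-step: distribute each target word over the source words.
        for src, tgt in corpus:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] = count.get((f, e), 0.0) + frac
                    total[e] = total.get(e, 0.0) + frac
        # M-step: re-estimate the translation probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


table = train_ibm_model1(corpus)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(table, word))
```

On this toy corpus the model pairs “the” with “das”, “house” with “Haus”, “book” with “Buch” and “a” with “ein”. Adding new sentence pairs and rerunning the training is all that “retraining” means at this level, which is the contrast with rule writing drawn in the list above.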

Q1 – How many words do I need to build a good engine?

Most people will tell you that 2 million words is the bare minimum you can provide for a “bare bones” engine and some kind of automation within a domain, but do not expect great results if you are dealing with texts that may include a lot of new, unexpected words, as in economics or journalism. If you are dealing with a highly controlled language and you have little variation on your theme (technical manuals, set documentation packages, etc.), try to pump in as much text as you can.

Typical PangeaMT developments within domains (software, electronics, automotive, engineering, tourism) have started at 5M words. There are several ways to increase the number of words by gathering reliable parallel texts, and PangeaMT offers consultancy and guidance so you can start an engine with as many words as possible. We call an engine with 15M or 20M words “mature” within a domain, because it is likely to have most of the terminology, vocabulary and expressions required for that language domain. Do not despair if you do not have so much data. The important thing is to get the engine started. You can add post-edited material and other materials that you gather with experience in later re-trainings.
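As a rough self-check, the word counts mentioned above (2M, 5M and 15M) can be turned into a small helper that tells you where your own corpus stands. This is only a sketch: the function names and stage labels are hypothetical, counting whitespace-separated words on the source side is an assumption, and the thresholds are rules of thumb rather than hard limits.

```python
# Rule-of-thumb thresholds taken from the text above (in words).
BARE_MINIMUM = 2_000_000     # "bare bones" engine
TYPICAL_START = 5_000_000    # typical starting size for a domain engine
MATURE = 15_000_000          # lower bound of the "mature" range (15M-20M)


def count_source_words(segment_pairs):
    """Count whitespace-separated words on the source side of (source, target) pairs."""
    return sum(len(source.split()) for source, _ in segment_pairs)


def corpus_stage(word_count):
    """Map a word count to an illustrative stage label (labels are hypothetical)."""
    if word_count < BARE_MINIMUM:
        return "below bare minimum"
    if word_count < TYPICAL_START:
        return "bare bones"
    if word_count < MATURE:
        return "typical domain start"
    return "mature"


# Tiny hypothetical sample of a translation memory.
sample = [("Press the power button", "Pulse el botón de encendido"),
          ("Remove the battery", "Retire la batería")]
words = count_source_words(sample)
print(words, "source words:", corpus_stage(words))
```

A corpus that lands in the lower stages is still worth starting with, since post-edited material can be added in later re-trainings.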

There has been much argument about the “unreasonable effectiveness of massive amounts of data” versus “smaller amounts of well-selected data”. Many people considering their first MT development are unsure whether to put in as much text as possible (massive amounts of data) or to select the most accurate bilingual texts possible, even if that means dealing with smaller sets of data. Our experience points in several directions:

a) if you are trying to build a generalist type of engine, capable of translating the unexpected (from news articles to economics papers and literature), gather as much data as possible. You are trying to build a system to cater for sunny days and rainy days. No number will ever be enough. Sooner or later, you will need to build some kind of syntactical aids into it.

b) if you are trying to build an engine that will fit your particular language field and needs (or even if you want an engine that understands your products and services, but also some kind of financial information and legalities), you do not need trillions of words of literature. In this case, gathering as much data as possible from your organization (or similar ones) seems more reasonable and worth the effort.

Either way, do not underestimate the effort and teamwork required during the data-gathering stages. This is essential for training the engine well (and thus for its results). It will be the beginning of the change in your adoption of MT technologies and a good chance to involve stakeholders in the process.