Glossaries in Machine Translation

In Machine Translation we use the term “glossary” differently from its traditional meaning of “an alphabetically ordered list of words with explanations”. For us it is an association between phrases in different languages, used to make sure that certain combinations of words are translated in a particular way. Usually there are two languages, with an implicit direction between them, but the mechanism is simple enough that it works equally well in the reverse direction and with more than two languages. The distinction between source and target is therefore a bit artificial in this context, but we will keep it for simplicity in what follows.

The target language is unimportant for what I want to say, so to illustrate the issues involved I’ll use the following example: “press the start key” -> “X Y Z”. This means that these particular four English words should be translated as “X Y Z”, where X, Y and Z are not necessarily words; they could also be ideograms or any other notation used in the target language.

Let’s see what happens when we want to translate a sentence like “When you are ready to begin, please press the start key and wait for the blue light to turn on”.

We could proceed in two ways:

  1. Make the replacement in the source, before sending it for translation. The sentence we send to the engine would be “When you are ready to begin, please X Y Z and wait for the blue light to turn on”. Hopefully the foreign words or symbols are different enough from English that they won’t confuse the engine. Even so, there are potential problems lurking in this solution: anything the engine has learned from fragments that span the replaced phrase, like “start key and wait”, is lost.
  2. Find out how the machine would translate “press the start key” on its own, and make the replacement in the target document once it has been translated. Say that the phrase on its own is translated as “A B C”. Then we expect to find “A B C” as a fragment of the translation of the whole sentence, in which case it is straightforward to replace it with the content indicated in the glossary. However, it is possible, even likely, that when the phrase is translated as part of a larger sentence the extra context changes the translation, and “A B C” doesn’t appear (with the extra context it might become, say, “A B D C”).
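The first approach can be sketched in a few lines of Python. This is only an illustration under assumptions: the glossary is represented as a plain dictionary from source phrase to target phrase, the function name `substitute_in_source` is made up for this example, and matching is exact and case-sensitive, which a real system would probably need to relax.

```python
import re


def substitute_in_source(sentence: str, glossary: dict[str, str]) -> str:
    """Approach 1: replace glossary phrases in the source before translation.

    `glossary` maps a source phrase to its required target phrase
    (a hypothetical representation chosen for this sketch).
    """
    if not glossary:
        return sentence
    # Try longer phrases first so a long entry is not split by a shorter one.
    phrases = sorted(glossary, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in phrases))
    # A single pass avoids re-matching text that was already substituted.
    return pattern.sub(lambda m: glossary[m.group(0)], sentence)


glossary = {"press the start key": "X Y Z"}
sentence = (
    "When you are ready to begin, please press the start key "
    "and wait for the blue light to turn on"
)
print(substitute_in_source(sentence, glossary))
```

The printed sentence is the mixed-language text that would be sent to the engine, which is exactly where the loss of context described in the first approach comes from.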

So we see that a task as apparently simple as using a glossary becomes problematic in the context of Machine Translation.

In my opinion, the best way to deal with this issue is to apply the second approach first, and then flag for review the instances in which the translated sentence doesn’t contain the expected fragment.
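Here is a minimal Python sketch of this replace-then-flag strategy. Everything in it is an assumption made for illustration: `translate` stands for whatever engine is in use, `toy_engine` is a canned stand-in with made-up outputs rather than a real translator, and the glossary is a plain dictionary from source phrase to required target phrase.

```python
from typing import Callable


def apply_glossary_in_target(
    sentence: str,
    glossary: dict[str, str],
    translate: Callable[[str], str],
) -> tuple[str, list[str]]:
    """Translate first, then enforce the glossary in the target text.

    Returns the (possibly corrected) translation and the list of source
    phrases whose standalone translation was not found, for manual review.
    """
    translation = translate(sentence)
    flagged = []
    for source_phrase, required in glossary.items():
        if source_phrase not in sentence:
            continue  # this glossary entry does not apply to the sentence
        standalone = translate(source_phrase)  # e.g. "A B C"
        if standalone in translation:
            translation = translation.replace(standalone, required)
        else:
            # The extra context changed the wording (e.g. "A B D C").
            flagged.append(source_phrase)
    return translation, flagged


def toy_engine(text: str) -> str:
    """Hypothetical engine with canned outputs, for demonstration only."""
    canned = {
        "press the start key": "A B C",
        "press the start key twice": "A B C T",
        "please press the start key now": "P A B D C N",
    }
    return canned[text]


glossary = {"press the start key": "X Y Z"}
print(apply_glossary_in_target("press the start key twice", glossary, toy_engine))
print(apply_glossary_in_target("please press the start key now", glossary, toy_engine))
```

In the first call the standalone translation appears inside the full translation and is replaced. In the second the context has altered it, so the sentence is returned unchanged and the phrase is reported for review. Note that this costs one extra engine call per applicable glossary entry, which a real system might cache.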

This is one of the challenges that we strive to solve in this new field of Machine Translation. Get in touch with us for a free quote if you are interested in making use of this new technology.