
Magical mystery tags
XML has become a widely accepted standard for structuring and exchanging data. It combines power and flexibility, two qualities that usually compete with each other, but in XML have achieved a well-balanced equilibrium.
The format itself is deceptively simple: well-nested tags with no predefined meaning and with optional attributes, which can themselves have any name. A typical tag (with its content) looks like this:
<arbitraryName arbitraryAttribute="hello">Some text</arbitraryName>
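As a minimal sketch of how a parser sees such a tag, the snippet below reads the example with Python's standard `xml.etree.ElementTree` module. The choice of Python and of this parser is an assumption made for illustration; any XML parser would expose the same three pieces.

```python
import xml.etree.ElementTree as ET

# The example tag from the text, with straight quotes around the attribute value.
snippet = '<arbitraryName arbitraryAttribute="hello">Some text</arbitraryName>'
element = ET.fromstring(snippet)

print(element.tag)     # the tag name chosen by the document author
print(element.attrib)  # a dict mapping attribute names to their values
print(element.text)    # the text between the opening and closing tags
```

The parser attaches no meaning to the names; it only reports them.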
When the data being manipulated is itself formatted as tags between angle brackets (for example HTML, XHTML or XML itself), we have a problem: how do we distinguish between XML structure and data? To get around this, angle brackets that belong to the data are replaced with their entity references: < becomes &lt; and > becomes &gt; (abbreviations of “less than” and “greater than”, which makes them easy to remember). These two entities are predefined in XML, so documents that contain them remain well-formed.
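The conversion in both directions can be sketched with the `escape` and `unescape` helpers from Python's standard `xml.sax.saxutils` module. The sample string is made up for illustration. Note that `escape` also converts the ampersand into &amp;, since a bare ampersand would otherwise be read as the start of an entity.

```python
from xml.sax.saxutils import escape, unescape

# Hypothetical HTML fragment that has to travel inside an XML field.
markup = "<b>Hello</b> & welcome"

encoded = escape(markup)     # angle brackets and ampersand become entities
decoded = unescape(encoded)  # entities are turned back into characters

print(encoded)
print(decoded)
```

The round trip returns the original string unchanged, which is what makes it safe to decode before translation and encode again afterwards.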
When we want to translate these files with a CAT tool, it is convenient to convert the entities back into angle brackets first. The tags are then interpreted as tags instead of being displayed as part of the text: we get better segmentation and better context matches, and the tags do not show up in the sentences to be translated. In the finished translation, the angle brackets need to be converted into entities again to maintain the original structure.
Before importing the translated segments, say into a CMS, they need to be processed once again to convert the entities into the corresponding characters, so that the browser interprets the tags correctly. In general, the number and names of the encoded tags in the original document coincide with those in the translated document. However, we have had to deal with a situation in which tags were introduced in fields where no tags were present originally: “XIX century” is translated into French as “XIXe siècle”, and in HTML the superscript is marked with the “sup” tag: “XIX<sup>e</sup> siècle”. The client’s CMS was not prepared to deal with entities in these fields, so they showed up in the displayed text. It was therefore necessary to process the XML files further, removing occurrences of the “sup” tag in fields that did not support entity conversion.
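A clean-up step of this kind could look roughly like the sketch below. It is not the script used in the project: the element names (`item`, `title`, `body`), the set of fields that cannot handle encoded tags, and the function name are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical: names of the fields whose content must stay free of tags.
PLAIN_TEXT_FIELDS = {"title"}

# Matches the opening or closing "sup" tag once the parser has decoded the entities.
SUP_TAG = re.compile(r"</?sup\s*>", re.IGNORECASE)

def strip_sup_tags(xml_text: str) -> str:
    """Remove <sup> and </sup> from the text of the plain-text fields only."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.tag in PLAIN_TEXT_FIELDS and element.text:
            element.text = SUP_TAG.sub("", element.text)
    # Serialising escapes any remaining angle brackets as entities again.
    return ET.tostring(root, encoding="unicode")

translated = (
    "<item>"
    "<title>XIX&lt;sup&gt;e&lt;/sup&gt; siècle</title>"
    "<body>XIX&lt;sup&gt;e&lt;/sup&gt; siècle</body>"
    "</item>"
)
cleaned = strip_sup_tags(translated)
print(cleaned)
```

Only the `title` field loses its tags; the `body` field keeps the encoded “sup” tag, so fields that do support entity conversion are left untouched.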
This anecdote serves to illustrate that with great power and flexibility, there also comes the possibility of pitfalls. It is not enough to validate the XML files; it is also necessary to look at the final product and adjust the fine details accordingly in translation projects involving XML.
Very interesting and useful article.
Another approach is to use an XML CDATA section to store characters that are otherwise not allowed in the content of an XML document. This way there is no need to use entities and convert them back. It is widely used to contain entire HTML elements and even HTML documents. XML parsers do not check the content of CDATA for markup characters, so no error is thrown.
<![CDATA[anything is allowed here: &]]>
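As a rough illustration, assuming Python's standard `xml.etree.ElementTree` parser and a made-up `field` element, a CDATA section and the entity-encoded form give the application the same text:

```python
import xml.etree.ElementTree as ET

# Hypothetical field, written once with CDATA and once with entities.
with_cdata = "<field><![CDATA[XIX<sup>e</sup> siècle & more]]></field>"
with_entities = "<field>XIX&lt;sup&gt;e&lt;/sup&gt; siècle &amp; more</field>"

text_from_cdata = ET.fromstring(with_cdata).text
text_from_entities = ET.fromstring(with_entities).text

print(text_from_cdata)
print(text_from_cdata == text_from_entities)
```

The difference is only in how the file is written, not in what the parser hands over.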
Hi Bill
You are right that using CDATA solves the problem. But in my experience this approach is not used very much. It might become more widely used in the future, but as it still requires pre- and post-processing of the data, it’s not clear that there is a real saving of effort.
Cheers.
P.