Faster translations with XLIFF

Translating website content just got easier. Magnolia CMS now supports XLIFF as a content export format. I sat down with Language Technologist Twan Sevriens to test the feature and learn how it makes a translation workflow easier.

Translating a multilanguage website

When you translate a website into multiple languages, you typically have two options:
  1. Translate the content within the CMS.
  2. Export the content in some file format, translate it outside the system, and import the results back.
Although the first option sounds promising, it has a drawback: most content management systems are not built to optimize the translation workflow.

Translators prefer to use CAT (computer-aided translation) tools that support and facilitate the translation process. CAT tools give direct access to helper functions and present the translator with just the text they need to translate, hiding details such as formatting.

Looking for speed and consistency

Twan explained that optimizing the content exchange makes the translation of a multilanguage website more efficient. The benefit of CAT tools and translation memories really kicks in when content must be routinely translated into several languages.

CAT tools support and facilitate the translation process by providing functions that are not available in the authoring system, such as spell checkers, grammar checkers, terminology managers, dictionaries and translation memories. Translation memories (TMs) are databases of text segments in the source language and their translations in one or more target languages. If you have already translated a piece of text before, the TM tool will recognize it and suggest a translation.
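To make the translation memory idea concrete, here is a minimal sketch in Python. It is not how any particular CAT tool works: the TranslationMemory class, the 0.75 threshold and the German sentence are all made up for illustration, and the similarity score comes from the standard library's difflib rather than a real TM matching algorithm.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy in-memory TM: source segment -> {target language: translation}."""

    def __init__(self):
        self._segments = {}

    def add(self, source, lang, translation):
        self._segments.setdefault(source, {})[lang] = translation

    def suggest(self, text, lang, threshold=0.75):
        """Return (translation, score) for the closest stored segment, or None."""
        best = None
        for source, translations in self._segments.items():
            if lang not in translations:
                continue
            # ratio() is 1.0 for identical strings and drops as they diverge
            score = SequenceMatcher(None, text.lower(), source.lower()).ratio()
            if score >= threshold and (best is None or score > best[1]):
                best = (translations[lang], score)
        return best


tm = TranslationMemory()
# Illustrative segment pair, not taken from a real project
tm.add("Look here for new arrivals!", "de", "Hier finden Sie unsere Neuheiten!")

exact = tm.suggest("Look here for new arrivals!", "de")      # identical segment
fuzzy = tm.suggest("Look here for the new arrivals!", "de")  # near match
miss = tm.suggest("Contact our sales team", "de")            # nothing similar stored
```

An identical segment comes back with a score of 1.0, a slightly reworded one comes back as a lower-scoring "fuzzy" suggestion, and unrelated text yields no suggestion at all.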

What is XLIFF?

XLIFF is a gateway to such benefits. It is an XML file format designed by a group of software and localization service providers. It is intended as a single interchange format that most tools understand.

It works like this. First you configure target languages in Magnolia CMS, then export content to an XLIFF file. When the CMS and the CAT tools interoperate using the standard format, no manual conversion is needed. Authors and translators can focus on what they do best: writing persuasive content and making sure the language stays true to the original in the translation.

Twan, who develops automation for translation and localization tools for a living, is interested in removing any friction between authoring systems and translation tools. He was keen to see how an enterprise content management system uses XLIFF, so we jumped right in.

Clever attributes and room for improvement

We logged in to Magnolia CMS 4.5 and configured two target languages, German and French. Then we exported some pages into XLIFF. The export created a file for each target language: de.xlf and fr.xlf. The files contained the English source text and placeholders for a translation.

Here is a snippet from the German de.xlf file:
<trans-unit id="d30fb1e8-0f78-4a97-9986-f2017fcaa2d2:abstract">
   <source 
      date="2012.02.17 10:55:29 427" 
      link="http://www.demo-project.com/about/subsection-articles.html" 
      title="Section Intro: Abstract" 
      xml:lang="en">This is the abstract for the section "Articles". 
      It is a brief résumé on the content of this section.</source>
   <target xml:lang="de" />
</trans-unit>

Each component on the page (headings, teasers, links etc.) is stored in a trans-unit element. Within each trans-unit you have a source element for the source text and a target element for the translation.
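If you want to poke at an export yourself, the trans-unit structure described above is easy to walk with Python's standard library. The sketch below is only an illustration: SAMPLE is a trimmed-down fragment modelled on the snippet, the <body> wrapper and the list_units helper are my own, and a complete XLIFF file would also carry the XLIFF namespace, which is left out here for brevity.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the predefined XML namespace
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Hypothetical, shortened fragment based on the de.xlf snippet
SAMPLE = """<body>
  <trans-unit id="d30fb1e8-0f78-4a97-9986-f2017fcaa2d2:abstract">
    <source title="Section Intro: Abstract" xml:lang="en">This is the abstract for the section "Articles".</source>
    <target xml:lang="de" />
  </trans-unit>
</body>"""


def list_units(xml_text):
    """Collect id, source text, target language and status of each trans-unit."""
    root = ET.fromstring(xml_text)
    units = []
    for unit in root.iter("trans-unit"):
        source = unit.find("source")
        target = unit.find("target")
        target_text = target.text if target is not None else None
        units.append({
            "id": unit.get("id"),
            "source": (source.text or "").strip(),
            "target_lang": target.get(XML_NS + "lang") if target is not None else None,
            # An empty <target> means the unit still needs a translation
            "translated": bool((target_text or "").strip()),
        })
    return units


units = list_units(SAMPLE)
for u in units:
    print(u["id"], u["target_lang"], "done" if u["translated"] else "to do")
```

A check like this is handy for counting how many units are still untranslated before a file goes back into the CMS.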

We immediately spotted a couple of clever attributes:
  • link is the URL where the content comes from. This is potentially very useful. The translator can click the link to see the content in page context. We opened the XLIFF file in two different CAT tools but neither displayed the link right away since it is not an XLIFF native attribute. As an enhancement, Twan suggested adding it as a <note> element instead or customizing the XML namespace so that the CAT tool recognizes our custom attribute. Great tip.
  • title is the component name such as "Subheading" or "Abstract". It helps the translator understand what role the text plays. Suppose that all teaser headings must have a call to action such as "Look here for new arrivals!". Knowing that the text is a teaser, a translator can choose a strong action verb. Again, the CAT tool did not display this info out of the box since it is not a native XLIFF attribute. However, a similar native attribute restype (resource type) exists. It provides pre-defined values such as heading, linklabel and menu and allows custom values. Another enhancement opportunity!
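Both suggestions could be prototyped as a small post-processing step on the exported file. The following Python sketch is my own illustration, not part of the Magnolia module: add_context and slugify are hypothetical helpers, and the mapping from the title to a restype value is an assumption. It copies the link into a <note> element and derives a custom restype, which XLIFF 1.2 expects to start with "x-".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical unit modelled on the exported snippet
UNIT = """<trans-unit id="d30fb1e8-0f78-4a97-9986-f2017fcaa2d2:abstract">
  <source link="http://www.demo-project.com/about/subsection-articles.html"
          title="Section Intro: Abstract" xml:lang="en">This is the abstract.</source>
  <target xml:lang="de" />
</trans-unit>"""


def slugify(text):
    """Lowercase and replace runs of non-alphanumerics with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def add_context(unit):
    """Expose the custom link/title attributes through native XLIFF constructs."""
    source = unit.find("source")
    link = source.get("link")
    title = source.get("title")
    if link:
        # <note> is a native child of <trans-unit> and follows <target>
        note = ET.SubElement(unit, "note")
        note.text = "Page context: " + link
    if title:
        # Custom restype values carry an "x-" prefix (assumed naming scheme)
        unit.set("restype", "x-" + slugify(title))
    return unit


unit = add_context(ET.fromstring(UNIT))
print(ET.tostring(unit, encoding="unicode"))
```

Whether a given CAT tool then shows the note or the resource type to the translator is something you would still need to verify in that tool.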

Interpreting the spec

I also had a chat with Robert Siska, the Magnolia developer who wrote the XLIFF feature. He said the XLIFF specification is generally well written but doesn't always match how content is managed in a CMS.

For example, the specification requires an attribute named original which must specify the name of the file where the content comes from. A CMS export does not necessarily come from a single file. You could export an entire branch which consists of many pages. The path to the top-level content node might be a more useful value than a page name. Also, a channel-agnostic CMS like Magnolia may publish the same content in HTML, RSS or PDF format. Setting the datatype attribute to html would only tell part of the story.
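To show where these attributes sit, here is a rough Python sketch of an XLIFF 1.2 skeleton. It does not reproduce what the Magnolia export writes: build_skeleton is a hypothetical helper, and the content path /demo-project/about is an assumed example of using a node path instead of a file name for original.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def build_skeleton(content_path, source_lang, target_lang, datatype="html"):
    """Build an empty <xliff>/<file>/<body> tree for one target language."""
    # Serialize the XLIFF namespace as the default namespace
    ET.register_namespace("", XLIFF_NS)
    xliff = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(xliff, f"{{{XLIFF_NS}}}file", {
        # Path of the exported top-level content node rather than a file name
        "original": content_path,
        "source-language": source_lang,
        "target-language": target_lang,
        # Only part of the story for content that is also published as RSS or PDF
        "datatype": datatype,
    })
    ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    return xliff


doc = build_skeleton("/demo-project/about", "en", "de")
print(ET.tostring(doc, encoding="unicode"))
```

The trans-unit elements for every page in the exported branch would then go inside the single <body> element.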

Need for better comment support

Twan told me that commenting is an area where XLIFF makes headway. Adding comments in the XML and passing them between authors and translators is necessary for clear communication.

Users have high expectations in this area. When they see smart commenting and annotations in consumer software like Google Docs, they expect similar functionality in CAT tools. Adding comments to XLIFF is possible today with the note element. The problem is that CAT tools must also support the element; otherwise it is of no use to translators.

Protecting non-translatable content

Another tricky subject is dealing with HTML elements and special markup. We opened a Text & Image component in the CAT tool.

This content obviously came from a rich-text editor. You can see the HTML mixed in: a <p> element, a <strong> element and an &amp; character entity that represents an ampersand. How do you handle such markup in XLIFF?

Twan explained that you typically want to protect such custom elements so they don't get deleted or changed by accident. XLIFF supports protecting inline elements by enclosing them in specifically defined tags or by replacing them with placeholder elements. In fact, when configuring CAT tools for "normal" XML files, most of the work involves figuring out what is translatable and what isn't, and protecting the non-translatable content. With XLIFF this information is stored automatically by using the correct elements during the export process.
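As a rough idea of what such protection can look like, the Python sketch below wraps inline HTML tags in the XLIFF 1.2 elements <bpt> (begin paired tag), <ept> (end paired tag) and <ph> (placeholder). It is a simplification under several assumptions: protect_inline is a hypothetical helper, the input is well-balanced HTML, tags are found with a plain regular expression, and entities in the text are decoded so the translator sees "&" rather than "&amp;". I do not know how the Magnolia export handles this.

```python
import html
import re
import xml.etree.ElementTree as ET

TAG = re.compile(r"(<[^>]+>)")


def protect_inline(html_fragment):
    """Return a <source> element whose HTML tags are wrapped in bpt/ept/ph."""
    source = ET.Element("source")
    last = None     # most recently added inline element
    open_ids = []   # ids of <bpt> elements still waiting for their <ept>
    next_id = 1
    for token in TAG.split(html_fragment):
        if not token:
            continue
        if not TAG.fullmatch(token):
            # Translatable text: attach to the parent or to the previous tag
            text = html.unescape(token)
            if last is None:
                source.text = (source.text or "") + text
            else:
                last.tail = (last.tail or "") + text
            continue
        if token.startswith("</"):
            # Closing tag reuses the id of the matching opening tag
            last = ET.SubElement(source, "ept", id=str(open_ids.pop()))
        elif token.endswith("/>"):
            last = ET.SubElement(source, "ph", id=str(next_id))
            next_id += 1
        else:
            last = ET.SubElement(source, "bpt", id=str(next_id))
            open_ids.append(next_id)
            next_id += 1
        last.text = token  # original markup, kept verbatim for re-import
    return source


source = protect_inline("<p>Text &amp; <strong>Image</strong></p>")
print(ET.tostring(source, encoding="unicode"))
```

A CAT tool that understands these elements can show the tags as locked markers, so the translator can move them around but cannot break them. On import, the decoded text would need to be escaped again before it goes back into the rich-text field.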

XLIFF adoption in content management systems

What surprised me is that Twan has not come across many CMSs that support XLIFF. On the contrary, many content management systems are custom-built and export custom XML that is sometimes not even well-formed. When content comes in a format less suitable for translation, more effort is needed to pre-process the files. Training translators and post-processing the translated files may also be necessary. This hinders the translation process and eventually ends up costing the content owner more.

Many CAT tools have native XLIFF support. They even use XLIFF as an intermediate conversion format when the source text is in another format. We saw this in action when we opened an Excel file. The tool converted the Excel file into its own custom flavor of XLIFF. CAT tool developers of course depend on standards compliance more than the eventual translator does. Once the source text is in a CAT tool, the conversion has already happened, so the translator does not need to care about the format.

Try it yourself

Download Magnolia CMS and try the Content Translation Support module. You can find it in Tools > Content Translation. See the module documentation for help.
