THE PERILS OF CSV

WARNING: Highly technical content. Hit the back button if you're faint of heart. ;-)

Some translators shy away from the technicalities of their job. But if you work with CAT tools, then by definition you are prepared to work with many different file formats, and there is no getting away from the technical side of this job.

Here's my experience with a certain CSV. It was clearly exported from a content management system. Only some columns were intended for translation, and one of these columns contained very lengthy strings of text with HTML and CSS tags.

What is the first thing that comes to mind when you want to view a CSV file and prepare it for translation in a CAT tool such as SDL Studio? Excel, of course. But then it turns out that the wizard for importing a CSV file is far from intuitive. You have to set the content type for each column one by one. For comparison, in LibreOffice Calc you can select all columns in the wizard and change the content type in one go.

That is one hurdle behind us. Now the Excel file can be imported into Studio, an analysis can be run, and a quote sent to the client. A few days later the quote is accepted, and some days after that you get to work.

But then, a surprise. You find that the text and tags from the longest column are often truncated. After some digging you find out that even though the latest Excel can hold vast amounts of data in a cell, it will most likely truncate the text at 1024 characters or so.
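A check like the one below, run before quoting, would have flagged the risk. It is only a minimal sketch: the function name is made up for illustration, the 1024 limit is simply the approximate figure mentioned above, and it assumes the CSV has a header row and uses commas as delimiters.

```python
import csv
import io


def find_long_cells(csv_text, limit=1024):
    """Return (record_number, column_name, length) for each cell longer than limit.

    Assumption: the first record is a header row. The limit of 1024 is the
    rough truncation point observed in the post, not a documented constant.
    """
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    hits = []
    # Record 1 is the header, so data records start at 2.
    for record_number, row in enumerate(reader, start=2):
        for name, value in row.items():
            # Skip missing values and overflow fields (which are lists).
            if isinstance(value, str) and len(value) > limit:
                hits.append((record_number, name, len(value)))
    return hits
```

For a real file you would read the text first, for example with `open(path, encoding="utf-8", newline="").read()`, and pass it in. Any hit means the word count from a spreadsheet-based analysis should be double-checked.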

So just as you're about to start working on the job, you find out that your quote was too low: there is more text to translate than you thought, and the client won't accept a new quote.

But you need a workaround. Your first idea is to copy each column to a separate text file and process these files in Studio, because you know how to handle tags in text files there. Wrong again, and on two counts. First, Studio will assume that these files are encoded in the default code page for your locale; for Poland that is windows-1250. Converting the CSV file to windows-1250 is not a good idea, because some technical characters, such as Ø, will be lost. Second, the segmentation of a text file with tags is horrible: for example, that one lengthy cell from the CSV will be crammed into a single segment in Studio anyway.
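To see in advance which characters a code page would lose, you can try encoding them one by one. This is a small sketch with a hypothetical helper name; `cp1250` is Python's name for windows-1250.

```python
def unencodable_chars(text, encoding="cp1250"):
    """Return the sorted list of distinct characters that the encoding cannot represent."""
    lost = set()
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # This character has no slot in the target code page.
            lost.add(ch)
    return sorted(lost)


print(unencodable_chars("Łódź, Ø 10 mm"))
```

Polish letters survive the conversion, but Ø does not, which is exactly the kind of loss described above.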

So, in the end, the columns without tags were imported as ODS sheets (that is the LibreOffice Calc format), and the columns with tags were converted into ‘fake’ HTML files (with HEAD, META encoding [utf-8, by the way], and BODY tags) and imported in that format. Only then was the segmentation of the ‘tag soup’ content bearable.
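The ‘fake’ HTML wrapping can be scripted. What follows is a rough sketch, not the exact procedure I used: the function name, the `<div id="row-N">` wrapper around each cell, and the `<meta charset>` form of the encoding declaration are all assumptions made for illustration, and it expects a CSV with a header row.

```python
import csv
import html
import io


def column_to_html(csv_text, column, title="export"):
    """Wrap every cell of one CSV column in a minimal UTF-8 HTML document.

    Cell content is inserted as-is, because it already contains HTML tags.
    The div/id wrapper is a hypothetical convention to keep cells apart.
    """
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        "<head>",
        '<meta charset="utf-8">',
        f"<title>{html.escape(title)}</title>",
        "</head>",
        "<body>",
    ]
    for n, row in enumerate(reader, start=1):
        cell = row.get(column) or ""  # empty or missing cells become empty divs
        parts.append(f'<div id="row-{n}">{cell}</div>')
    parts += ["</body>", "</html>"]
    return "\n".join(parts) + "\n"
```

Save the result with `open(path, "w", encoding="utf-8")` so that the declared encoding matches the actual one. Keeping one block per cell also makes it easier to put the translated text back into the right row afterwards.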

And only then was I able to start my main line of work, that is, translation.

Phew. Lesson learned.


