Archive for the 'OSTD' Category

Announcing the Open Source Translation Database

Thursday, March 8th, 2012

Translating software is hard. I know from starting two open source projects (ISO Master and Asunder) how challenging it is to learn Gettext, find volunteers to do the translations, and then encourage and enable them to translate my software.

The work was worth it for me: I now have almost 70 full translations of my software in 40 languages. But I’d like to make the process of getting your first translation easier, and generally help more software maintainers get more translations with less effort.

The OSTD ( http://littlesvr.ca/ostd/ ) is an automatic translation system: it takes your .POT file and populates it with translations based on strings in other open source software, generating .PO files. Because you can see which software each string comes from, this should be much more accurate than general-purpose automatic translation systems such as Google Translate.
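To make the idea concrete, here is a minimal sketch of that lookup in Python (the real OSTD is written in PHP and backed by MySQL). The `KNOWN` table, the function names and the sample strings are all made up for illustration, and the parser only handles single-line `msgid` entries without escaped quotes:

```python
# Hypothetical in-memory stand-in for the OSTD database:
# (language code, English string) -> list of (translation, source project).
KNOWN = {
    ("fr", "Open"): [("Ouvrir", "ISO Master")],
    ("fr", "Quit"): [("Quitter", "Asunder")],
}

def extract_msgids(pot_text):
    """Return the single-line msgid strings found in a .POT file, in order."""
    msgids = []
    for line in pot_text.splitlines():
        line = line.strip()
        if line.startswith('msgid "') and line.endswith('"'):
            value = line[len('msgid "'):-1]
            if value:  # skip the empty msgid of the header entry
                msgids.append(value)
    return msgids

def populate(pot_text, lang):
    """Build PO-style text for `lang`, leaving unknown strings untranslated."""
    out = []
    for msgid in extract_msgids(pot_text):
        candidates = KNOWN.get((lang, msgid), [])
        if candidates:
            translation, project = candidates[0]
            out.append(f"# translation taken from {project}")
        else:
            translation = ""
        out.append(f'msgid "{msgid}"')
        out.append(f'msgstr "{translation}"')
        out.append("")
    return "\n".join(out)

POT = 'msgid ""\nmsgstr ""\n\nmsgid "Open"\nmsgstr ""\n\nmsgid "Eject"\nmsgstr ""\n'
print(populate(POT, "fr"))
```

Strings with no known translation come back with an empty `msgstr`, so a human translator can still fill them in later.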

I just started the project, so there is a lot of polish still coming, as well as some significant features such as updating existing .PO files and a web service interface for other software to use. But it can be useful as it is already. Please try it out!

Any feature requests and bug reports are welcome. My goal is to make it as useful as possible to as many people as possible. I’m doing this part time, but I’m excited about the project and will do my best to improve it as quickly as possible.

Size matters

Thursday, March 8th, 2012

I was going to show the OSTD to Chris Tyler and, because demos never work, I tried it earlier that day from the Seneca network.

Turns out the OSTD is already a victim of its own success. When translating the ISO Master POT file I get almost 6000 translated strings in 153 languages. If you do a bit of math, that’s a lot of text: 2.8MB in fact.

Well the problem is that 2.8MB of stuff needs to be downloaded from the server, and it can take quite a while.

Luckily Chris immediately suggested that I enable gzip compression on Apache. I thought that it was enabled by default, but I was wrong. Here’s what I was missing:

<IfModule mod_deflate.c>
        AddOutputFilterByType DEFLATE text/plain
        AddOutputFilterByType DEFLATE text/html
        AddOutputFilterByType DEFLATE text/xml
        AddOutputFilterByType DEFLATE text/css
        AddOutputFilterByType DEFLATE application/xml
        AddOutputFilterByType DEFLATE application/xhtml+xml
        AddOutputFilterByType DEFLATE application/rss+xml
        AddOutputFilterByType DEFLATE application/javascript
        AddOutputFilterByType DEFLATE application/x-javascript

        DeflateCompressionLevel 9

        DeflateFilterNote Input instream
        DeflateFilterNote Output outstream
        DeflateFilterNote Ratio ratio

        LogFormat '"%r" %{outstream}n/%{instream}n (%{ratio}n%%)' deflate
        CustomLog       logs/deflate_log deflate
</IfModule>

Seems to work great, thanks again Chris :)
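For a rough feel of why this helps so much, here is a small Python sketch that compresses a made-up, repetitive payload with zlib (the library mod_deflate is built on). The payload shape is an assumption for illustration, not the actual OSTD response format:

```python
import zlib

# Made-up payload: the same msgid repeated for many language codes,
# roughly mimicking one string coming back in many translations.
entries = [
    f'{{"lang": "l{i:03d}", "msgid": "Open file", "msgstr": "translation {i}"}}'
    for i in range(153)
]
raw = "\n".join(entries).encode("utf-8")

# Level 9 matches the DeflateCompressionLevel used in the config above.
compressed = zlib.compress(raw, 9)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
print(f"ratio: {len(compressed) / len(raw):.0%}")
```

Real translation data is less uniform than this, so the actual ratio is best read from the deflate_log configured above.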

Debian import complete

Tuesday, March 6th, 2012

Finally, a couple of days ago, the import of all the translated strings from most of the software in Debian into OSTD was completed.

Now there is a grand total of 11236263 translated strings!

It took 1059647 seconds, which is just over 12 days. That’s 0.094 seconds per translation. I’m sure it could be sped up a lot in the future if I had a real need to do that.
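The arithmetic behind those two figures, as a quick Python check using only the numbers quoted above:

```python
total_seconds = 1059647   # import duration from the post
total_strings = 11236263  # translated strings imported

days = total_seconds / 86400              # 86400 seconds in a day
per_string = total_seconds / total_strings

print(f"{days:.2f} days")                 # about 12.26
print(f"{per_string:.3f} s per string")   # about 0.094
```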

Neither my PHP import script nor MySQL crashed during the process, which is pretty cool. It also looks like I had enough memory for all this stuff; I don’t think MySQL was swapping a significant amount of data at any time.

I will probably have to do another pass through the po files from Debian to establish an N-to-N (many-to-many) relationship between the translated strings and the software they are used in, but what I have already is a great start.

Soon I’ll finish up some little things and will be announcing the project to the world.

6 million translated strings and counting

Saturday, February 25th, 2012

Since the 19th of this month (that’s 6 days ago; I don’t know where all that time has gone, oh yeah, tests) I’ve been importing the translated strings from Debian.

Right now I’ve done over 6 million (6036472) and I’ve only got to the end of the projects beginning with the letter “g”. Using some simple (i.e. inaccurate) math, it will take me another 7 days to finish importing every file whose language code I could guess and which I can parse.

The process is driven by a dedicated PHP script I wrote. PHP because the rest of my code is PHP, and I wasn’t going to rewrite things in bash :) Turns out that it works pretty well. At first I thought I was going to run out of memory: the script quickly ate up 5% of RAM, but over the next few days it went back down and now sits at a comfortable 1.8%.

I started it manually (not through Apache; actually Apache isn’t allowed to read that script at all) in a screen session. The first import attempt ran in a plain terminal, which is one of the reasons I had to stop it.

The other reason I had to restart the import was my MySQL configuration. Since I’m not a database guy, my MySQL had always been using the minimum amount of resources: the defaults from my-small.cnf in Slackware. I replaced that with my-huge.cnf, which had a very nice effect: no more swapping!

In the first attempt, after about a day, MySQL was using 120% of my CPU (dual-core). Now, even after 6 days and 6 million strings inserted, it’s using on average 15% of CPU and 25% of RAM. Everything else on the server (Apache, Sendmail, Imapd, etc) seems to be completely unaffected by the very heavy process.

One sucky thing about migrating from my-small.cnf to my-huge.cnf was that the InnoDB backends are incompatible. So I had to:

  • figure this out,
  • reconfigure the server using the old settings,
  • dump the OSTD database into plain text SQL,
  • delete the backend,
  • reconfigure MySQL with the new settings, and
  • import the old SQL from the plain text

Luckily OSTD was the only MySQL user that was using the InnoDB backend, so none of my blogs were affected. Though it all worked out fine in the end, I’m quite surprised that there is no automagic way to “upgrade” the InnoDB backend. It’s bizarre to me that in this day and age of the cloud and enterprise scalability my storage backend would be tied to MySQL memory settings.

I’ve started to clean up the site in preparation for the completion of the import, when I’ll be announcing its release. Still not sure if I’m going to register a domain for it or not, but probably not at first.

Language codes, part 2

Friday, February 17th, 2012

Most of the po files in the Debian tarball follow the naming convention packagename_version_languagecode.po

So for all of those I could figure out the language code using a regular expression (or three) on the filename. Armed with that and the exceptions I mentioned in the last post on this topic I was able to get to this point in my importer:

ostd$ ./manualpoupload.php unstable/
Examining tree... done (19.11 seconds)
85238 files found. Of those:
 - 68412 had a guessable language code
 - 16826 cannot be used because the language code could not be guessed
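The filename-based guess can be sketched roughly like this in Python (the real importer is PHP, and its actual expressions are not shown in this post). The pattern below is an assumption about what a language code looks like, and the sample filenames are invented:

```python
import re

# Assumed shape: packagename_version_languagecode.po, where the code is
# two or three lowercase letters, optionally followed by _COUNTRY and/or
# an @modifier (for example "pt_BR" or "sr@latin").
PO_NAME = re.compile(
    r"^(?P<package>.+)_(?P<version>[^_]+)_"
    r"(?P<lang>[a-z]{2,3}(?:_[A-Z]{2})?(?:@[A-Za-z]+)?)\.po$"
)

def guess_language(filename):
    """Return the language code embedded in a PO filename, or None."""
    match = PO_NAME.match(filename)
    return match.group("lang") if match else None

for name in ["asunder_2.1-1_fr.po", "isomaster_1.3.9-1_pt_BR.po", "README.po"]:
    print(name, "->", guess_language(name))
```

Anything that returns None here would land in the "could not be guessed" pile.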

Hopefully soon I can run my PO parser against all those 68 thousand files and successfully read all the translated strings from them. It seems likely that they will not all work but I’m optimistic.

For the remaining 17 thousand files that I cannot use right now, I will have to come up with a different strategy. Probably after the successful import above I will put the bad ones into a different tree and work on them separately. There are a few strategies I can try then:

  • Write different regexes for every piece of software. This is probably not realistic and would drive me crazy.
  • Try to find some patterns in the bad filenames that can lower the 16K to something much smaller. This idea seems to have some potential.
  • Attempt to find some metadata inside the po files, not just guess based on the filename. My experience with gettext suggests this is unlikely to succeed.
  • Or maybe wait till I get a better idea.
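For the metadata idea, one thing that could be checked is the `Language:` field that newer gettext tools write into the PO header entry. Whether the Debian files actually carry it is not established here, so treat this Python sketch as an experiment rather than a solution; the sample text is invented:

```python
import re

# Matches a header line such as:  "Language: pt_BR\n"
# (the \n is the literal two-character escape found in PO files).
LANGUAGE_FIELD = re.compile(r'^"Language:\s*([A-Za-z_@]+)\\n"\s*$')

def language_from_header(po_text):
    """Return the Language header value of a PO file, or None if absent."""
    seen_first_msgid = False
    for line in po_text.splitlines():
        line = line.strip()
        if line.startswith("msgid "):
            if seen_first_msgid:
                break  # reached the second entry; the header is over
            seen_first_msgid = True
            continue
        match = LANGUAGE_FIELD.match(line)
        if match:
            return match.group(1)
    return None

SAMPLE = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Project-Id-Version: asunder\\n"\n'
    '"Language: pt_BR\\n"\n'
    '\n'
    'msgid "Open"\n'
    'msgstr "Abrir"\n'
)
print(language_from_header(SAMPLE))
```

Files with an empty or missing field simply yield None and stay in the unknown pile.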

Anyway – 68 thousand files seems like a good start and it would definitely be enough to launch with, so maybe this problem will take a lower priority once I confirm I can parse all the files I guessed the language for. I look forward to finding out how many translated strings are in those files, how long it will take me to parse them and insert them into SQL, and how long a query on the resulting enormous table will take.

 

Language codes, part 1

Thursday, February 16th, 2012

While analyzing the files I got from Debian I ran into a lot of language codes that weren’t in my database already.

It was an interesting exercise, which involved learning about the existence of languages such as Javanese and about countries that I had already forgotten about.

The problem is that some of the language codes are redundant, including the country code even though the language is the default for that particular country. For example el_GR means Greek from Greece, no kidding.

I don’t have el_GR in my database and see no point in adding it. So for Debian translation files that are identified as el_GR I have a hardcoded if(el_GR)replacewith(el). I’ve got about 72 such replacements that I had to figure out one by one.

A smaller set of language/country combo codes I did add to the database, such as English from Canada, South Africa, Ireland (not kidding); Catalan from Italy and Andorra; Arabic from Oman and Egypt; French from Luxembourg.
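In Python terms (the real code is PHP), the two cases might be sketched as a replacement table plus a set of codes kept as they are. Only `el_GR` comes from this post; the spellings in `KEPT` are my assumption of how the combinations above would be written:

```python
# Illustrative entry only; the real list has about 72 replacements.
REDUNDANT = {"el_GR": "el"}

# Country-specific codes kept as distinct database entries
# (assumed spellings of some of the combinations listed above).
KEPT = {"en_CA", "en_ZA", "en_IE", "fr_LU"}

def normalise(code):
    """Map a code found in a PO filename to the code used in the database."""
    if code in KEPT:
        return code
    return REDUNDANT.get(code, code)

print(normalise("el_GR"), normalise("en_CA"), normalise("fr"))
```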

I just wanted to make a note of this, because it took me a hell of a long time to look through the list of unknown codes, figure out what they stand for, and whether they deserve a country specific version or not.

There will be a lot more work needed to clean up the list of PO files from Debian, so this post was just part 1 of hopefully not too long a series.

Lots of translations

Thursday, February 9th, 2012

Christian Perrier from the debian-i18n list has done me a huge favour. He created a tarball with every translation in every language for every piece of software in Debian!

You may imagine it’s huge as did I, but I was shocked at just how big it is. Almost 2 GB of gzip-compressed PO files from the testing and unstable branches!

I wrote a little script to decompress all the gzipped po files in the extracted tarball:

find . -type f -name '*.gz' -exec gunzip -v {} +

I’ve no idea how long it’s going to take to run :)

After that I’ll have to write a special PHP script to parse all the po files and add the translations to the database. There are going to be some challenges with that:

  1. It’s going to be very hard to notice if an error happened during parsing or insertion.
  2. It’s probably going to take a very long time on current hardware.
  3. I might actually run out of disk space, since my MySQL databases are in /var and that’s on the root partition and it’s quite small.
  4. If my schema design isn’t great – I might have to scrap it all and go through the exercise again. This is, sadly, quite likely.

All solvable problems, and I’m happy that I already got to the point where I have to seriously worry about scalability.

Thanks Christian!

Oh, you Git!

Wednesday, January 18th, 2012

Here’s an OSTD feature I was really excited about and looking forward to implementing: allow the user to put in some version control info for a project so that I can run a nightly cron job and pull any new translations that have been added to that project. It would have been a great feature, and it would have provided a self-improvement mechanism for the OSTD.

And then I went and read about the capabilities of Git and Mercurial. Did you know that, unlike that archaic version control system SVN, the new and fancy distributed version control systems do not allow partial clones of repositories?

In other words, I cannot pull from a public repository only the po folder with the po files. I have to clone the entire repository. There’s an option to only pull the latest revision, but even that can be huge.

I cannot possibly afford the disk space, bandwidth, or time necessary to clone and do updates on thousands of full repositories which I was hoping to have linked into the OSTD. It would have worked perfectly with SVN, but not with the new systems.

So now what? My feature has been erased from the design board by the growing popularity of Git. I’m going to have to find a different solution. Perhaps I can arrange something with Debian, who have some kind of mechanism for indexing po files inside their packages.

Git.

Good query bad query

Tuesday, January 17th, 2012

I’ve started to run into serious performance issues with my SQL queries. I mentioned my concerns earlier, but now (still long before production time) I’m already experiencing clearly unacceptable performance.

I’ve added a couple of thousand translated strings to the database, and uploaded another PO file for the record. That would run the following query for each line in the PO file:

"SELECT Translation.TranslatedString FROM Translation,Language WHERE " .
"Translation.LanguageID = Language.LanguageID AND Language.LanguageCode = '%s' " .
"AND Translation.EnglishString = '%s'"

This I think is called a join, and my boring data structures and algorithms experience says it’s an n^2 algorithm. But I figured a couple of things:

  1. This has got to be almost the most basic kind of SQL query you can write. Select from table1 where the foreign key in table1 is the same as the primary key in table2. Database 101? Why is MySQL not smart enough to cache the result of the subquery, and only repeat it if table2 is changed in the meantime? Was that really such a hard optimisation to make, or did they leave it slow on purpose to encourage better query design?
  2. My training as a real software developer tells me to avoid race conditions. Despite the extremely low likelihood that the LanguageID for French is going to change during the runtime of this query, I should never assume, and should rely on the DBMS to make that association.

I figured wrong. Once I had inserted enough records into Translation, the query above would take ages: my page would take 5 minutes to load, during which time MySQL would use 97% of the CPU. Unacceptable.

I had to get over my reluctance to cache a key outside the DBMS and run a separate query to obtain the LanguageID from the Language table, and ended up with this query in my loop:

"SELECT Translation.TranslatedString FROM Translation WHERE " .
"Translation.LanguageID = '%s' AND Translation.EnglishString = '%s'"

Works much faster. So much faster that I’ve almost stopped thinking about extra hardware for MySQL. Shame that I had to do this.
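The two-step lookup can be sketched end to end in Python, with an in-memory SQLite database standing in for MySQL. The table and column names follow the queries above; the index, the sample rows and the function names are assumptions for illustration (the post does not say which indexes exist), and placeholders are used instead of '%s' interpolation:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
conn.executescript("""
    CREATE TABLE Language (
        LanguageID   INTEGER PRIMARY KEY,
        LanguageCode TEXT UNIQUE
    );
    CREATE TABLE Translation (
        LanguageID       INTEGER REFERENCES Language(LanguageID),
        EnglishString    TEXT,
        TranslatedString TEXT
    );
    -- Assumed index; not taken from the post.
    CREATE INDEX idx_translation_lookup
        ON Translation (LanguageID, EnglishString);
""")
conn.execute("INSERT INTO Language VALUES (1, 'fr')")
conn.executemany(
    "INSERT INTO Translation VALUES (?, ?, ?)",
    [(1, "Open", "Ouvrir"), (1, "Quit", "Quitter")],
)

def language_id(code):
    """One query per uploaded file: resolve the language code to its ID."""
    row = conn.execute(
        "SELECT LanguageID FROM Language WHERE LanguageCode = ?", (code,)
    ).fetchone()
    return row[0] if row else None

def translations(lang_id, english):
    """One query per PO entry, touching only the Translation table."""
    rows = conn.execute(
        "SELECT TranslatedString FROM Translation "
        "WHERE LanguageID = ? AND EnglishString = ?",
        (lang_id, english),
    ).fetchall()
    return [r[0] for r in rows]

fr = language_id("fr")
print(translations(fr, "Open"))
```

The language ID is fetched once and reused for every string in the file, which is the caching step described above.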

 

HTML tags inside translated strings

Tuesday, January 17th, 2012

Here’s something not many people working with PO files have run into. What happens when your English or translated string contains a <b> tag and you try to display that string on a webpage? Luckily I have one of those (in Asunder, where it’s actually a GTK formatting tag, not an HTML tag), so I ran into this problem already.

Piece of cake to fix: PHP has a function called htmlspecialchars() which will escape the special-meaning characters such as “<” in strings. I can use that function before sending my strings over as HTML.

Now what happens if my PHP sends the browser not HTML but JSON, and I construct the page using the data in that JSON? Any guesses? Nothing happens. Because I create a DOM text node and put the string in there, it just shows up with the “<” and the ‘"’ and all the other special chars.
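The difference between the two paths can be shown in a few lines of Python, where `html.escape` plays the role that htmlspecialchars() plays in PHP. The sample string is made up:

```python
import html
import json

s = 'Ripping <b>%d</b> of "%s"'  # made-up string containing a tag and quotes

# HTML path: special characters must be escaped before the browser sees them.
as_html = html.escape(s)

# JSON path: the string travels as data; JSON only escapes the quotes
# inside its own string literal and leaves "<" and ">" alone.
as_json = json.dumps({"msgid": s})

print(as_html)
print(as_json)
print(json.loads(as_json)["msgid"] == s)
```

On the JSON path it is the text node on the client side that keeps the tag from being interpreted, not any escaping on the server.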

Cool.