Archive for the 'OSTD' Category

Language codes, part 2

Friday, February 17th, 2012

By Andrew Smith

Most of the po files in the Debian tarball follow the naming convention packagename_version_languagecode.po

So for all of those I could figure out the language code using a regular expression (or three) on the filename. Armed with that and the exceptions I mentioned in the last post on this topic I was able to get to this point in my importer:

ostd$ ./manualpoupload.php unstable/
Examining tree... done (19.11 seconds)
85238 files found. Of those:
 - 68412 had a guessable language code
 - 16826 cannot be used because the language code could not be guessed

Hopefully soon I can run my PO parser against all those 68 thousand files and successfully read all the translated strings from them. It seems likely that they will not all work but I’m optimistic.

For the remaining 17 thousand or so files that I cannot use right now I will have to come up with a different strategy. Probably after the successful import above I will put the bad ones into a different tree and work on them separately. There are a few strategies I can try then:

  • Write different regexes for every piece of software. This is probably not realistic and would drive me crazy.
  • Try to find some patterns in the bad filenames that can lower the 16K to something much smaller. This idea seems to have some potential.
  • Attempt to find some metadata inside the po files, not just guess based on the filename. My experience with gettext suggests this is unlikely to succeed.
  • Or maybe wait till I get a better idea.
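
For the metadata idea in the list above, a minimal sketch of what such a lookup could look like, assuming the file starts with a gettext header entry that carries a "Language:" field. Newer gettext versions can write that field, but whether the Debian files actually have it is not known here. The function name languageFromPoHeader is made up for the example.

```javascript
// Looks for a header line such as "Language: pt_BR\n" in the first entry of a
// PO file (everything before the first blank line). Returns null if absent or empty.
function languageFromPoHeader(poText) {
  const header = poText.split(/\n\s*\n/, 1)[0];
  // In the file the line ends with a literal backslash followed by "n".
  const match = /^"Language:[ \t]*([A-Za-z_@.-]+)[ \t]*\\n"/m.exec(header);
  return match ? match[1] : null;
}
```

If this returns null the importer would still have to fall back to the filename guess.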

Anyway – 68 thousand files seems like a good start and it would definitely be enough to launch with, so maybe this problem will take a lower priority once I confirm I can parse all the files I guessed the language for. I look forward to finding out how many translated strings are in those files, how long it will take me to parse them and insert them into SQL, and how long a query on the resulting enormous table will take.

 

Language codes, part 1

Thursday, February 16th, 2012

By Andrew Smith

While analysing the files I got from Debian I ran into a lot of language codes that weren’t in my database already.

It was an interesting exercise, which taught me about the existence of languages such as Javanese and reminded me of countries that I had already forgotten about.

The problem is that some of the language codes are redundant, including the country code even though the language is the default for that particular country. For example el_GR means Greek from Greece, no kidding.

I don’t have el_GR in my database and see no point in adding it. So for Debian translation files that are identified as el_GR I have a hardcoded if(el_GR)replacewith(el). I’ve got about 72 such replacements that I had to figure out one by one.

A smaller set of language/country combo codes I did add to the database, such as English from Canada, South Africa, Ireland (not kidding); Catalan from Italy and Andorra, Arabic from Oman and Egypt, French from Luxembourg.
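
The replacement step above could be sketched roughly like this in JavaScript (the real importer is PHP with hardcoded ifs). Only el_GR to el comes from this post. The other two replacements are hypothetical, and the spellings of the kept codes are my assumption of the usual language_COUNTRY form.

```javascript
// Redundant codes collapse to the bare language code.
const REDUNDANT_CODES = new Map([
  ['el_GR', 'el'], // from the post
  ['fr_FR', 'fr'], // hypothetical
  ['de_DE', 'de'], // hypothetical
]);

// Country-specific codes that stay as they are (assumed spellings).
const KEPT_CODES = new Set([
  'en_CA', 'en_ZA', 'en_IE', 'ca_IT', 'ca_AD', 'ar_OM', 'ar_EG', 'fr_LU',
]);

// Returns the code to store: kept codes and unknown codes pass through unchanged.
function normaliseLanguageCode(code) {
  if (KEPT_CODES.has(code)) return code;
  return REDUNDANT_CODES.has(code) ? REDUNDANT_CODES.get(code) : code;
}
```

A lookup table keeps the 72 or so replacements in one place instead of spreading them over a chain of ifs.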

I just wanted to make a note of this, because it took me a hell of a long time to look through the list of unknown codes, figure out what they stand for, and whether they deserve a country specific version or not.

There will be a lot more work needed to clean up the list of PO files from Debian, so this post was just part 1 of hopefully not too long a series.

Lots of translations

Thursday, February 9th, 2012

By Andrew Smith

Christian Perrier from the debian-i18n list has done me a huge favour. He created a tarball with every translation in every language for every piece of software in Debian!

You may imagine it’s huge, as did I, but I was shocked at just how big it is. Almost 2 GB of gzip-compressed PO files from the testing and unstable branches!

I wrote a little script to decompress all the gzipped po files in the extracted tarball:

find . -type f -name '*.gz' | while IFS= read -r A; do gunzip -v "$A"; done

I’ve no idea how long it’s going to take to run :)

After that I’ll have to write a special PHP script to parse all the po files and add the translations to the database. There are going to be some challenges with that:

  1. It’s going to be very hard to notice if an error happened during parsing or insertion.
  2. It’s probably going to take a very long time on current hardware.
  3. I might actually run out of disk space, since my MySQL databases are in /var and that’s on the root partition and it’s quite small.
  4. If my schema design isn’t great – I might have to scrap it all and go through the exercise again. This is, sadly, quite likely.

All solvable problems, and I’m happy that I already got to the point where I have to seriously worry about scalability.

Thanks Christian!

Oh, you Git!

Wednesday, January 18th, 2012

By Andrew Smith

Here’s an OSTD feature I was really excited about and looking forward to implementing: allow the user to put in some version control info for a project so that I can run a nightly cron job and pull any new translations that have been added to that project. It would have been a great feature, and it would have provided a self-improvement mechanism for the OSTD.

And then I went and read about the capabilities of Git and Mercurial. Did you know that, unlike that archaic version control system SVN, the new and fancy distributed version control systems do not allow partial clones of repositories?

In other words, I cannot pull from a public repository only the po folder with the po files. I have to clone the entire repository. There’s an option to only pull the latest revision, but even that can be huge.

I cannot possibly afford the disk space, bandwidth, or time necessary to clone and do updates on thousands of full repositories which I was hoping to have linked into the OSTD. It would have worked perfectly with SVN, but not with the new systems.

So now what? My feature has been erased from the design board by the growing popularity of Git. I’m going to have to find a different solution. Perhaps I can arrange something with Debian, who have some kind of mechanism for indexing po files inside their packages.

Git.

Good query bad query

Tuesday, January 17th, 2012

By Andrew Smith

I’ve started to run into serious performance issues with my SQL queries. I mentioned my concerns earlier, but now (still long before production time) I’m already experiencing clearly unacceptable performance.

I’ve added a couple of thousand translated strings to the database, and uploaded another PO file for the record. That would run the following query for each line in the PO file:

"SELECT Translation.TranslatedString FROM Translation,Language WHERE " .
"Translation.LanguageID = Language.LanguageID AND Language.LanguageCode = '%s' " .
"AND Translation.EnglishString = '%s'"

This I think is called a join, and my boring data structures and algorithms experience says it’s an n^2 algorithm. But I figured a couple of things:

  1. This has got to be almost the most basic kind of SQL query you can write. Select from table1 where the foreign key in table1 is the same as the primary key in table2. Database 101? Why is MySQL not smart enough to cache the result of the subquery, and only repeat it if table2 is changed in the meantime? Was that really such a hard optimisation to make, or did they leave it slow on purpose to encourage better query design?
  2. My training as a real software developer tells me to avoid race conditions. Despite the extremely low likelihood that the LanguageID for French is going to change during the runtime of this query, I should never assume, and should rely on the DBMS to make that association.

I figured wrong. Once I had inserted enough records into Translation the query above would take ages: my page would take 5 minutes to load, during which time MySQL would use 97% of the CPU. Unacceptable.

I had to get over my reluctance to cache a key outside the DBMS and run a separate query to obtain the LanguageID, and ended up with this query in my loop:

"SELECT Translation.TranslatedString FROM Translation WHERE " .
"Translation.LanguageID = '%s' AND Translation.EnglishString = '%s'"

Works much faster. So much faster that I’ve almost stopped thinking about extra hardware for MySQL. Shame that I had to do this.
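
The shape of that change, sketched in JavaScript rather than the PHP and MySQL the site actually uses: look the language up once before the loop, then run only the single-table lookup per string. The fake database object and every name in it are hypothetical. It just counts calls and answers from in-memory arrays so the sketch can run on its own.

```javascript
// Hypothetical stand-in for a database handle.
function makeFakeDb(languages, translations) {
  return {
    queryCount: 0,
    // Stands in for: SELECT LanguageID FROM Language WHERE LanguageCode = ?
    languageId(code) {
      this.queryCount++;
      const row = languages.find((l) => l.code === code);
      return row ? row.id : null;
    },
    // Stands in for the single-table query on Translation shown above.
    translation(languageId, english) {
      this.queryCount++;
      const row = translations.find(
        (t) => t.languageId === languageId && t.english === english
      );
      return row ? row.translated : null;
    },
  };
}

// One language lookup outside the loop, one translation lookup per string.
function translateAll(db, languageCode, englishStrings) {
  const result = new Map();
  const languageId = db.languageId(languageCode);
  if (languageId === null) return result;
  for (const english of englishStrings) {
    result.set(english, db.translation(languageId, english));
  }
  return result;
}
```

For n strings this issues n + 1 queries, none of which touches more than one table. Whether an index on Translation would have rescued the original join is not something I measured.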

 

HTML tags inside translated strings

Tuesday, January 17th, 2012

By Andrew Smith

Here’s something not many people working with PO files have run into. What happens when your English/translated string contains a <b> tag and you try to display that string on a webpage? Luckily I have one of those (in Asunder, where it’s actually a GTK formatting tag, not an HTML tag), so I ran into this problem already.

Piece of cake to fix: PHP has a function called htmlspecialchars() which will escape the special-meaning characters such as “<” in strings. I can use that function before sending my strings over as HTML.

Now what happens if my PHP sends the browser not HTML but JSON, and I construct the page using the data in that JSON? Any guesses? Nothing happens. Because I create a DOM text node and put the string into there – it just shows up with the “<” and the ‘"’ and all the other special chars.

Cool.
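
To make the two cases concrete, a small sketch. The escapeHtml function is my own rough approximation of what htmlspecialchars() does for the four most common characters, not a port of it, and showString is a hypothetical helper that only makes sense in a browser.

```javascript
// Rough equivalent of escaping before the string is sent as part of HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;') // must come first so later entities are not re-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// The JSON route: a text node displays the characters as-is, so no escaping is needed.
function showString(container, text) {
  container.appendChild(document.createTextNode(text));
}

console.log(escapeHtml('<b>Rip</b> & "encode"'));
```

Escaping is only needed when the string passes through an HTML parser. A text node never does.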

Plural forms, again

Tuesday, January 17th, 2012

By Andrew Smith

Back in 2009 I wrote a post complaining about the needless complexities plural forms introduce to the i18n process. Now I ran into them again.

Working on the OSTD I have to make sure I work with all kinds of PO files, and that has to include PO files with plural forms. The format of PO files is not a standard, mostly because there’s only one implementation for fully handling them (GNU gettext). Heck, it’s not even a spec. I was barely able to piece together an understanding of possible variations of the format using the gettext manual and looking at existing real PO files. That and the fact that the format is kind of loose and everything else I mentioned in the previous post made me hate plural forms even more than I did before.

My implementations of the PO file parser and writer do not handle plural forms. I am hoping that I’m ignoring them properly when reading, and that it’s ok that they’re missing when writing. And I have no plans to support them in the future. I will use what little influence I have to convince app maintainers to not use them, or to get rid of them.
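
What "ignoring them properly" could mean, as a minimal JavaScript sketch (the OSTD parser is PHP and is not shown here). It assumes the common layout where entries are separated by blank lines, drops any entry that has msgid_plural or msgstr[N] lines, and leaves escape sequences such as \n untouched. The function name is made up.

```javascript
// Returns [{msgid, msgstr}, ...] for singular entries only.
function parsePoIgnoringPlurals(poText) {
  const entries = [];
  let current = null; // entry being built
  let field = null;   // field that continuation lines belong to

  const strip = (quoted) => quoted.trim().slice(1, -1); // drop surrounding quotes
  const flush = () => {
    // Skip plural entries and the header entry (empty msgid).
    if (current && !current.plural && current.msgid !== '') {
      entries.push({ msgid: current.msgid, msgstr: current.msgstr });
    }
    current = null;
    field = null;
  };

  for (const rawLine of poText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') {
      flush();
    } else if (line.startsWith('#')) {
      continue; // comments and obsolete entries
    } else if (line.startsWith('msgid_plural ') || line.startsWith('msgstr[')) {
      if (current) current.plural = true;
      field = null; // ignore continuation lines of plural fields
    } else if (line.startsWith('msgid ')) {
      flush();
      current = { msgid: strip(line.slice(6)), msgstr: '', plural: false };
      field = 'msgid';
    } else if (line.startsWith('msgstr ')) {
      if (current) {
        current.msgstr = strip(line.slice(7));
        field = 'msgstr';
      }
    } else if (line.startsWith('"') && current && field) {
      current[field] += strip(line); // multi-line string continuation
    }
  }
  flush();
  return entries;
}
```

Anything it does not recognise, such as msgctxt lines, is skipped rather than reported.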

 

Scoping in JavaScript

Tuesday, January 17th, 2012

By Andrew Smith

There are a number of articles and blog posts out there trying to explain scope and closures, but overall I’d say that a majority of them aren’t crystal-clear.

No sh**. I’ve been trying to figure out how scope works in JavaScript on-and-off for the last several years. I was never a committed JS programmer though so it wasn’t important enough for me to make sure I learn it once and for all.

Why is it that something as simple as scope is so complicated? That was a rhetorical question, I don’t really want to hear excuses. Last week I was working on some moderately advanced JS code and sure enough I spent hours finding impossible bugs caused by assignments that overwrote values in the wrong variables.

I have no choice but to suck it up, but this blog is a partial registry of my complaints, and this one was definitely worth mentioning.
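
For the record, one common way this kind of overwrite happens. This is an illustration made up for this note, not the code I was actually debugging: an assignment inside a function with no declaration of its own writes to the variable of the same name in the enclosing scope.

```javascript
var count = 100; // outer variable, e.g. a total kept elsewhere on the page

function countWordsBuggy(text) {
  // No declaration here, so this assignment writes to the OUTER count.
  count = text.split(/\s+/).length;
  return count;
}

function countWordsFixed(text) {
  // Declaring the variable makes it local to this function.
  var count = text.split(/\s+/).length;
  return count;
}

countWordsFixed('one two three');
console.log(count); // outer value untouched
countWordsBuggy('one two three');
console.log(count); // outer value overwritten
```

Both functions return the same thing, which is exactly why the bug is so hard to spot from the call site.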

 

Modifying JSON using a form

Tuesday, December 27th, 2011

By Andrew Smith

If you read the slightly older post, look at its screenshot and do some thinking, you might, like me, wonder this: given a bunch of JSON with multiple selections which can be modified in JavaScript using a form... wait, modified using a form?

One of the nice things about JSON is you can just do myjson[x].whatever[whatever][12] = "ABC" and it works. But if all you have is a <select> element and an onClick handler – that’s not so straightforward.

You can store the string "[x].whatever[whatever][12]" in the value field of the <option> but sadly you cannot just do myjson"[x].whatever[whatever][12]" = "DEF", that’s a syntax error.

I had to wonder and look for a while. I even found something called JsonPath, which I got really excited about until I realised it’s only for reading (what exactly is the point of it then?). Today I found the solution: eval!

So continuing the lame example above I would simply do this: eval('myjson[' + x + '].' + whatever + '[' + whatever + '][' + 12 + '] = "' + "DEF" + '"') Roughly speaking, I haven’t tested this line. But you get the point?

Now that works but luckily because I’m testing with a real pot file I found a bug – all the ‘\n’ literals in the strings are replaced with newlines by eval, which is not what I want. I found a solution that not only fixes that but I hope will prevent my evals from being hacked: I escape all the backslashes and the single quotes in the string before giving it to eval:

eval(jsPath + " = '" + text.toString().replace(/\\/g, '\\\\').replace(/'/g, "\\'") + "';");

Yey!
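
An eval-free variant is also possible. This is not what the code above does, just a sketch of an alternative: store the path in the <option> value as a JSON array instead of a JavaScript expression, then walk it. The function name setByPath and the sample data are made up.

```javascript
// Sets root[path[0]][path[1]]...[path[n]] = value without eval.
function setByPath(root, path, value) {
  if (path.length === 0) throw new Error('path must not be empty');
  let node = root;
  // Walk down to the parent of the final key.
  for (let i = 0; i < path.length - 1; i++) {
    node = node[path[i]];
    if (node === null || typeof node !== 'object') {
      throw new Error('no object at path segment ' + i);
    }
  }
  node[path[path.length - 1]] = value;
}

// Hypothetical data; the option value would hold the string '[0,"whatever","a",12]'.
const myjson = [{ whatever: { a: { 12: 'ABC' } } }];
setByPath(myjson, JSON.parse('[0,"whatever","a",12]'), 'DEF\\n');
console.log(myjson[0].whatever.a[12]);
```

Because the text is never parsed as code, the backslashes and quotes in it need no escaping at all.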

Scary json_encode()

Tuesday, December 27th, 2011

By Andrew Smith

PHP has this really neat function, json_encode(). It can take an object of whatever type, including my own class with child arrays/classes, and make a valid JSON string out of it. I was going to write this function myself but I found PHP already has it.

There’s one concern I have about it – it takes the entire object tree and makes JSON out of it, and all my member variable names end up as keys in the JSON. So without much difficulty you can look at the JSON and see exactly how I structure the data on the server in the PHP.

I’m not sure I really care but this makes me uneasy. It just doesn’t feel right. Maybe it will be ok.
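
If it ever does start to matter, the usual way out is to decide explicitly which fields go into the JSON. Here is the idea sketched in JavaScript, where JSON.stringify() calls an object's toJSON() method if it has one. As far as I know newer PHP versions offer a similar hook, the JsonSerializable interface, but I have not tried it here. The class and its field names are hypothetical.

```javascript
class TranslationRecord {
  constructor(english, translated, internalRowId) {
    this.english = english;
    this.translated = translated;
    this.internalRowId = internalRowId; // server-side detail, not meant for clients
  }

  // JSON.stringify uses the return value of toJSON instead of the object itself.
  toJSON() {
    return { english: this.english, translated: this.translated };
  }
}

console.log(JSON.stringify(new TranslationRecord('Hello', 'Bonjour', 42)));
```

The client then sees only the two keys chosen in toJSON(), whatever the class looks like internally.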