Most of the po files in the Debian tarball follow the naming convention packagename_version_languagecode.po

So for all of those I could figure out the language code using a regular expression (or three) on the filename. Armed with that and the exceptions I mentioned in the last post on this topic, I was able to get to this point in my importer:
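The guessing boils down to something like the sketch below. This is not the exact set of regexes the importer uses, and the handling of odd package names and the exception list is left out; the example filenames are made up for illustration.

<?php
// Sketch: pull the language code out of a filename shaped like
// packagename_version_languagecode.po, e.g. "apt_0.7.20_pt_BR.po".
function guessLanguageCode(string $filename): ?string
{
    if (preg_match('/^[a-z0-9.+-]+_[^_]+_([a-z]{2,3}(?:_[A-Z]{2})?)\.po$/',
                   basename($filename), $m)) {
        return $m[1];
    }
    return null; // could not guess; this file goes into the 16K pile
}

var_dump(guessLanguageCode('apt_0.7.20_pt_BR.po')); // string(5) "pt_BR"
var_dump(guessLanguageCode('strange-name.po'));     // NULL
?>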

ostd$ ./manualpoupload.php unstable/
Examining tree... done (19.11 seconds)
85238 files found. Of those:
 - 68412 had a guessable language code
 - 16826 cannot be used because the language code could not be guessed

Hopefully soon I can run my PO parser against all those 68 thousand files and successfully read all the translated strings from them. It seems likely that not every file will parse cleanly, but I'm optimistic.
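Reading the msgid/msgstr pairs out of a file boils down to roughly the following. This is a simplified sketch, not the code my importer actually uses; it skips plural forms, fuzzy entries and other details a real parser has to cope with.

<?php
// Sketch of the core loop: read msgid/msgstr pairs, unescape them and
// keep only entries that actually have a translation.
function readPoEntries(string $path): array
{
    $entries = [];
    $msgid = null;
    $msgstr = null;
    $current = null; // which field continuation lines belong to

    $lines = file($path, FILE_IGNORE_NEW_LINES) ?: [];
    foreach ($lines as $line) {
        if (preg_match('/^msgid\s+"(.*)"$/', $line, $m)) {
            // a new entry starts; keep the previous one if it was translated
            if ($msgid !== null && $msgid !== '' && $msgstr !== null && $msgstr !== '') {
                $entries[$msgid] = $msgstr;
            }
            $msgid = stripcslashes($m[1]);
            $msgstr = null;
            $current = 'msgid';
        } elseif (preg_match('/^msgstr\s+"(.*)"$/', $line, $m)) {
            $msgstr = stripcslashes($m[1]);
            $current = 'msgstr';
        } elseif ($current !== null && preg_match('/^"(.*)"$/', $line, $m)) {
            // continuation line of a multi-line msgid or msgstr
            if ($current === 'msgid') {
                $msgid .= stripcslashes($m[1]);
            } else {
                $msgstr .= stripcslashes($m[1]);
            }
        } else {
            $current = null; // comments, blank lines, msgid_plural etc. are skipped
        }
    }
    // flush the last entry; untranslated entries and the header (empty msgid) are dropped
    if ($msgid !== null && $msgid !== '' && $msgstr !== null && $msgstr !== '') {
        $entries[$msgid] = $msgstr;
    }
    return $entries;
}
?>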

For the remaining 17 thousand files that I cannot use right now, I will have to come up with a different strategy. Probably after the successful import above I will put the bad ones into a separate tree and work on them on their own. There are a few strategies I could try then:

  • Write a different regex for each piece of software. This is probably not realistic and would drive me crazy.
  • Try to find some patterns in the bad filenames that can lower the 16K to something much smaller. This idea seems to have some potential.
  • Attempt to find some metadata inside the po files instead of just guessing from the filename (sketched below, after this list). My experience with gettext suggests this is unlikely to succeed.
  • Or maybe wait till I get a better idea.
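To make the metadata idea concrete, it would amount to something like this. Some PO files do carry a "Language:" field in the header entry (the one with an empty msgid); the open question is how many of these 16K leftovers actually have it, which is exactly the part I'm doubtful about. This is only a sketch.

<?php
// Sketch: look for a "Language:" line in the PO header entry.
function languageFromPoHeader(string $path): ?string
{
    // the header normally sits in the first few KB of the file
    $head = file_get_contents($path, false, null, 0, 4096);
    if ($head !== false && preg_match('/^"Language:\s*([A-Za-z_@.-]+)/m', $head, $m)) {
        return $m[1];
    }
    return null; // no usable metadata found
}
?>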

Anyway, 68 thousand files seems like a good start and would definitely be enough to launch with, so maybe this problem can take a lower priority once I confirm I can parse all the files whose language I guessed. I look forward to finding out how many translated strings are in those files, how long it will take to parse them and insert them into SQL, and how long a query on the resulting enormous table will take.
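The insert step will probably look something like the sketch below. The table and column names are placeholders rather than a final schema, the connection details are made up, and I'm only showing MySQL via PDO as an example; the point is mostly that a prepared statement plus one transaction per file should keep 68 thousand files from taking forever.

<?php
// Placeholder schema: translations(package, language, msgid, msgstr).
$pdo = new PDO('mysql:host=localhost;dbname=ostd', 'user', 'password');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$insert = $pdo->prepare(
    'INSERT INTO translations (package, language, msgid, msgstr) VALUES (?, ?, ?, ?)'
);

// One transaction per file keeps inserts fast without holding a
// gigantic transaction open for the whole import.
function importFile(PDO $pdo, PDOStatement $insert, string $package, string $lang, array $entries): void
{
    $pdo->beginTransaction();
    foreach ($entries as $msgid => $msgstr) {
        $insert->execute([$package, $lang, $msgid, $msgstr]);
    }
    $pdo->commit();
}
?>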