{"id":369,"date":"2012-02-17T01:06:21","date_gmt":"2012-02-17T06:06:21","guid":{"rendered":"http:\/\/littlesvr.ca\/grumble\/?p=369"},"modified":"2012-12-05T00:47:13","modified_gmt":"2012-12-05T05:47:13","slug":"language-codes-part-2","status":"publish","type":"post","link":"http:\/\/littlesvr.ca\/grumble\/2012\/02\/17\/language-codes-part-2\/","title":{"rendered":"Language codes, part 2"},"content":{"rendered":"<p>Most of the po files in the Debian tarball follow the naming convention packagename_version_languagecode.po<\/p>\n<p>So for all of those I could figure out the language code using a regular expression (or three) on the filename. Armed with that and the exceptions I mentioned in the <a href=\"http:\/\/littlesvr.ca\/grumble\/2012\/02\/16\/language-codes-part-1\/\">last post<\/a> on this topic I was able to get to this point in my importer:<\/p>\n<blockquote>\n<pre>ostd$ .\/manualpoupload.php unstable\/\r\nExamining tree... done (19.11 seconds)\r\n85238 files found. Of those:\r\n - 68412 had a guessable language code\r\n - 16826 cannot be used because the language code could not be guessed<\/pre>\n<\/blockquote>\n<p>Hopefully soon I can run my PO parser against all those 68 thousand files and successfully read all the translated strings from them. They probably won&#8217;t all parse cleanly, but I&#8217;m optimistic.<\/p>\n<p>For the remaining 17 thousand files, which I cannot use right now, I will have to come up with a different strategy. Once the import above succeeds, I will probably move the bad ones into a separate tree and work on them there. There are a few approaches I can try then:<\/p>\n<ul>\n<li>Write different regexes for every piece of software. This is probably not realistic and would drive me crazy.<\/li>\n<li>Try to find some patterns in the bad filenames that can reduce the 16,826 to something much smaller. This idea seems to have some potential.<\/li>\n<li>Attempt to find some metadata inside the po files, not just guess based on the filename. 
My experience with gettext suggests this is unlikely to succeed.<\/li>\n<li>Or wait until I get a better idea.<\/li>\n<\/ul>\n<p>Anyway, 68 thousand files is a good start and would definitely be enough to launch with, so this problem will probably drop in priority once I confirm I can parse all the files whose language I guessed. I look forward to finding out how many translated strings are in those files, how long it will take me to parse them and insert them into SQL, and how long a query on the resulting enormous table will take.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Most of the po files in the Debian tarball follow the naming convention packagename_version_languagecode.po So for all of those I could figure out the language code using a regular expression (or three) on the filename. Armed with that and the exceptions I mentioned in the last post on this topic I was able to get &hellip; <\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,4],"tags":[],"class_list":{"0":"entry","1":"post","2":"publish","3":"author-andrew","4":"post-369","6":"format-standard","7":"category-ostd","8":"category-safeforseneca"},"_links":{"self":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/369","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/comments?post=369"}],"version-history":[{"count":2,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/369\/revisions"}],"predecessor-version":[{"id":570,"href":"http:\/\/littlesvr.ca\/grumble\
/wp-json\/wp\/v2\/posts\/369\/revisions\/570"}],"wp:attachment":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/media?parent=369"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/categories?post=369"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/tags?post=369"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}