Archive for the 'Safe For Seneca' Category

6 million translated strings and counting

Saturday, February 25th, 2012

Since the 19th of this month (that’s 6 days ago, I don’t know where all that time has gone, oh yeah, tests) I’ve been importing the translated strings from Debian.

Right now I’ve done over 6 million (6036472) and I’ve only got to the end of the projects beginning with the letter “g”. Using some simple (i.e. inaccurate) math – it will take me another 7 days to finish importing every file whose language code I could guess and whose contents I can parse.

The process is driven by a dedicated PHP script I wrote. PHP because the rest of my code is PHP, and I wasn’t going to rewrite things in bash :) Turns out that it works pretty well. At first I thought I was going to run out of memory: the script quickly ate up 5% of RAM, but over the next few days it went back down and now sits at a comfortable 1.8%.

I started it manually (not through Apache; in fact Apache isn’t allowed to read that script at all) in a screen session. That is one of the reasons I had to stop the first import attempt, which was running in a plain terminal.

The other reason I had to restart the import was my MySQL configuration. Given that I’m not a database guy my MySQL was always using the minimum amount of resources, the defaults from my-small.cnf in Slackware. I’ve replaced that with my-huge.cnf and that had a very nice effect: no more swapping!

In the first attempt, after about a day MySQL was using 120% of my CPU (dual-core). Now even after 6 days and 6 million strings inserted it’s using on average 15% of CPU and 25% of RAM. Everything else on the server (Apache, Sendmail, Imapd, etc) seems to be completely unaffected by the very heavy process.

One sucky thing about migrating from my-small.cnf to my-huge.cnf was that the InnoDB backends are incompatible. So I had to:

  • figure this out,
  • reconfigure the server using the old settings,
  • dump the OSTD database into plain text SQL,
  • delete the backend,
  • reconfigure MySQL with the new settings, and
  • import the old SQL from the plain text

Luckily OSTD was the only MySQL user that was using the InnoDB backend, so none of my blogs were affected. Though it all worked out fine in the end, I’m quite surprised that there is no automagic way to “upgrade” the InnoDB backend. It’s bizarre to me that in this day and age of the cloud and enterprise scalability my storage backend would be tied to MySQL memory settings.

I’ve started to clean up the site in preparation for the completion of the import, when I’ll be announcing its release. Still not sure if I’m going to register a domain for it or not, but probably not at first.

Language codes, part 2

Friday, February 17th, 2012

Most of the po files in the Debian tarball follow the naming convention packagename_version_languagecode.po

So for all of those I could figure out the language code using a regular expression (or three) on the filename. Armed with that and the exceptions I mentioned in the last post on this topic I was able to get to this point in my importer:

ostd$ ./manualpoupload.php unstable/
Examining tree... done (19.11 seconds)
85238 files found. Of those:
 - 68412 had a guessable language code
 - 16826 cannot be used because the language code could not be guessed
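To illustrate the filename-based guess, here is a minimal Python sketch. The importer itself is PHP and isn't shown in this post, so the single pattern, the function name and the sample filenames below are all assumptions made for the example; the real thing needed several regexes plus the exceptions from part 1.

```python
import re
from typing import Optional

# Convention: packagename_version_languagecode.po
# Assumption: neither the package name nor the version contains an
# underscore, so the language code is whatever follows the second "_".
PO_NAME = re.compile(
    r"^(?P<package>[^_]+)_(?P<version>[^_]+)_"
    r"(?P<lang>[a-z]{2,3}(?:_[A-Z]{2})?(?:@[A-Za-z]+)?)\.po$"
)


def guess_language_code(filename: str) -> Optional[str]:
    """Return the language code embedded in the filename, or None."""
    match = PO_NAME.match(filename)
    return match.group("lang") if match else None


# Hypothetical filenames, not taken from the Debian tarball.
for name in ("hello_2.7-1_fr.po", "hello_2.7-1_pt_BR.po", "messages.po"):
    print(name, "->", guess_language_code(name))
```

A file such as messages.po carries no code in its name at all, which is the kind of case that ends up in the "could not be guessed" pile.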

Hopefully soon I can run my PO parser against all those 68 thousand files and successfully read all the translated strings from them. It seems likely that they will not all work but I’m optimistic.

For the remaining 17 thousand files, which I cannot use right now, I will have to come up with a different strategy. Probably after the successful import above I will put the bad ones into a different tree and work on them separately. There are a few strategies I can try then:

  • Write different regexes for every piece of software. This is probably not realistic and would drive me crazy.
  • Try to find some patterns in the bad filenames that can lower the 16K to something much smaller. This idea seems to have some potential.
  • Attempt to find some metadata inside the po files, not just guess based on the filename. My experience with gettext suggests this is unlikely to succeed.
  • Or maybe wait till I get a better idea.
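On the metadata idea: a PO file's header entry can carry a Language field, but as far as I know that field only became a standard part of the header with gettext 0.18, so older files may well lack it or leave it blank. A minimal sketch of the check, using a made-up header (the function name and sample text are assumptions for illustration):

```python
import re
from typing import Optional

# Header lines are quoted strings ending in a literal backslash-n,
# e.g. "Language: de\n" (with the quotes).
LANGUAGE_FIELD = re.compile(r'^"Language:\s*([^\\"\s]+)\s*\\n"', re.MULTILINE)


def language_from_header(po_text: str) -> Optional[str]:
    """Return the value of the Language header field, or None if absent/empty."""
    match = LANGUAGE_FIELD.search(po_text)
    return match.group(1) if match else None


# Made-up header, not taken from any real package.
sample = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Project-Id-Version: demo 1.0\\n"\n'
    '"Language: de\\n"\n'
)
print(language_from_header(sample))
```

Running something like this over the unusable files would at least show how many of them have the field filled in, before investing in the other strategies.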

Anyway – 68 thousand files seems like a good start and it would definitely be enough to launch with, so maybe this problem will take a lower priority once I confirm I can parse all the files I guessed the language for. I look forward to finding out how many translated strings are in those files, how long it will take me to parse them and insert them into SQL, and how long a query on the resulting enormous table will take.

 

Homebrewed live server migration

Friday, February 17th, 2012

I mentioned that I’d talk about the software migration from the old littlesvr.ca hardware to the new machine. The neat thing is – I accomplished it with less than a minute of downtime while preserving all my data/metadata.

Here’s the long story (shorter version at the bottom):

  1. First step was to install the OS on the new hardware. This was a full install (just like the old one) of the newest available Slackware version.
  2. At this point I had two servers running, on different internal IPs, both claiming to be littlesvr.ca but only one (the old one) being accessible from the internet.
  3. Then I had to remember/relearn how to use rsync (-avx).
  4. My first sync was from the entire root of the old server into a directory on the home partition on the new one. I’ve used this tree to set up the services the way I wanted them. Most services I reconfigured manually rather than using the old config files – partly because I was expecting the newer versions to have different options (which was sort of true with Apache) and partly because I wanted to make sure I’ve done it right the first time (mostly I have).
  5. Don’t underestimate the step above, that was a lot of work. Things I had completely forgotten about, such as my aliases.db file and the stunnel config, had to be accounted for.
  6. Originally I was going to keep all the keys from the old server, but instead I’ve decided to consolidate the keys and now I have one set for most of the services I use. Yeah, yeah, whatever.
  7. I also needed to migrate my MySQL databases (of which I have a few). It turned out that just copying /var/lib/mysql isn’t enough, so I had to make an SQL dump of the old database and restore it on the new server. That approach had these problems:
    • The old database wasn’t as secure as I’d have liked: it still had the test db in it, and though I thought I had gone through the users thoroughly, I wasn’t sure enough.
    • The dump included the “mysql” database, which had some tables that changed slightly in the newer version. So mysql refused to work properly.
  8. So even after doing a dump, and transferring the dump over to the new server, and importing it, I still needed to run a couple of commands to secure the databases and make MySQL happy.
  9. The second rsync was more complex. Here I had to sync my home directories (lots of static/dynamic data), the SQL, and /var/spool/mail.
  10. And now the magic:

This is the short version, and the interesting part. Here’s what I did:

  1. Opened up my router web configuration page in the browser, navigated to the port forwarding page, and changed all the IPs for forwarded services from the old server’s LAN address to the new one’s. But didn’t save the changes yet.
  2. Stopped all the relevant services on the new server (simple script).
  3. Ran the second rsync again; this completed much quicker than the first time because most of the data was unchanged.
  4. Restarted all the relevant services.
  5. Pressed save on my router config.
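Steps 2 to 4 above can be sketched as an ordered list of commands (steps 1 and 5 happen in the router's web page, so they stay manual). This is only an illustration in Python: it builds and prints the commands without running anything, the hostname is hypothetical, the rc script names are assumed Slackware conventions, and the database sync is left out.

```python
from typing import List


def cutover_commands(
    old_host: str, services: List[str], paths: List[str]
) -> List[List[str]]:
    """Build, in order, the commands for steps 2-4. Nothing is executed."""
    commands: List[List[str]] = []
    # Step 2: stop services on the new server so no files are in use.
    for service in services:
        commands.append([f"/etc/rc.d/rc.{service}", "stop"])
    # Step 3: repeat the sync; rsync only copies what changed since last time.
    for path in paths:
        commands.append(["rsync", "-avx", f"{old_host}:{path}/", f"{path}/"])
    # Step 4: start the services again, now on up-to-date data.
    for service in services:
        commands.append([f"/etc/rc.d/rc.{service}", "start"])
    return commands


plan = cutover_commands(
    old_host="oldsvr",                    # hypothetical LAN hostname
    services=["httpd", "sendmail"],       # assumed rc script names
    paths=["/home", "/var/spool/mail"],   # directories named in the post
)
for argv in plan:
    print(" ".join(argv))
```

The ordering is the whole point: the final sync runs while nothing on the new machine is writing, and the router change comes only after the services are back up.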

The trick worked so well I amazed myself :) I happened to be tailing my Apache logs on both the old and new server while doing the final steps of the migration, and the second I saved my new port forwarding settings I saw the logs stop on the old server and start on the new one. It was an awesome feeling.

I’m sure I must have said in the past that rsync is a pain in the ass. I don’t necessarily take it back – but I will say I appreciated having such a powerful tool that day.

How [not] to make a book from your blog

Friday, February 17th, 2012

A couple of things happened recently which got me reading again:

  1. A book arrived at the library that I asked for about 6 months ago. “Nothing to Hide: The False Tradeoff between Privacy and Security”, by Daniel Solove. I heard of it in an interview Moira Gunn had with the author.
  2. I started reading Garth Turner’s Greater Fool blog, and got all of his books they had at the library; this post is about “Greater Fool: The troubled future of real estate”.

Both of these books are heavily based on their respective authors’ prior work: mostly essays by Solove and blog posts by Turner.

Despite the fascinating topic and great ideas and decent essays – Solove’s book is simply awful. I got suspicious at the introduction, where he said “you can read the chapters in any order”, and a few chapters in I realised this is not a book, it’s simply a collection of unrelated essays. Despite the author’s claim that he rewrote a lot of the stuff I saw no evidence of cohesion in either narrative or logic.

Turner on the other hand did a great job of making practically an entire book out of blog posts. The content is the same but it’s been rewritten and carefully arranged into chapters, with select quotes from blog posts that bring a perceivable timeline to the story. Having been reading his works for a while I am comfortable claiming that he’s using the hammer-the-message technique, but whether that’s true or not, good or bad, I actually finished his book because it was so much better put together than Solove’s.

Perhaps academics aren’t as skilled as former (journalist+politician)s at putting books together, or perhaps Turner is better at it than Solove. I don’t know. I do recommend that if you’re considering making a book out of your blog – read these two and see the difference between a good one and a bad one.

Language codes, part 1

Thursday, February 16th, 2012

While analyzing the files I got from Debian I ran into a lot of language codes that weren’t in my database already.

It was an interesting exercise, involving me learning about the existence of languages such as Javanese and countries that I had already forgotten about.

The problem is that some of the language codes are redundant, including the country code even though the language is the default for that particular country. For example el_GR means Greek from Greece, no kidding.

I don’t have el_GR in my database and see no point in adding it. So for Debian translation files that are identified as el_GR I have a hardcoded if(el_GR)replacewith(el). I’ve got about 72 such replacements that I had to figure out one by one.

A smaller set of language/country combo codes I did add to the database, such as English from Canada, South Africa, Ireland (not kidding); Catalan from Italy and Andorra, Arabic from Oman and Egypt, French from Luxembourg.
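The two lists amount to a small lookup table. Here is a minimal Python sketch of the idea (the real importer is PHP and hardcodes the replacements). Only el_GR is named in this post, so the other ~72 redundant codes are deliberately not filled in; the kept codes are the standard language_COUNTRY spellings of the combinations listed above.

```python
from typing import Dict, Set

# Redundant codes collapsed to the bare language code.
# Only el_GR is named in the post; the remaining entries are not invented here.
REDUNDANT: Dict[str, str] = {
    "el_GR": "el",
}

# Country-specific codes that were added to the database as-is.
KEPT: Set[str] = {
    "en_CA", "en_ZA", "en_IE",  # English: Canada, South Africa, Ireland
    "ca_IT", "ca_AD",           # Catalan: Italy, Andorra
    "ar_OM", "ar_EG",           # Arabic: Oman, Egypt
    "fr_LU",                    # French: Luxembourg
}


def normalize(code: str) -> str:
    """Collapse a redundant code; leave everything else untouched."""
    return REDUNDANT.get(code, code)


for code in ("el_GR", "en_CA", "fr"):
    print(code, "->", normalize(code))
```

Keeping the table as data rather than a chain of ifs makes it easier to review the list of decisions later.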

I just wanted to make a note of this, because it took me a hell of a long time to look through the list of unknown codes, figure out what they stand for, and whether they deserve a country specific version or not.

There will be a lot more work needed to clean up the list of PO files from Debian, so this post was just part 1 of hopefully not too long a series.

New littlesvr.ca

Friday, February 10th, 2012

For the longest time I’ve been using a tiny piece of hardware to run littlesvr. I loved it. Low power consumption, very unlikely to break due to a hardware failure, and fast enough for anything I wanted to do on it.

But now I want to run MySQL and insert gigabytes worth of rows into a database. So I figured now is a good time to upgrade it, before I start running out of RAM (again) and the 600MHz Geode becomes a bottleneck.

The new box is an Acer Aspire Revo. Physically it’s about twice the volume of the old Koolu box but it’s still quite small. It has 2GB of much faster RAM, a dual core 1.6GHz Atom CPU, and a 100GB spinning SATA drive.

Overall it’s about twice as good as the old one. It does have a fan, but only one. Hopefully it’s not going to fail in the next 2-3 years.

Here it is:

And here it is in my “server room”, sitting on top of the old littlesvr, both currently plugged into a UPS:

No, I’m not running Windows. It’s still Slackware, but the newest version (13.37) with all the updates installed. As luck would have it there was a glibc update released just as I was going through the migration, what perfect timing!

In another post I’ll talk about how I did the migration from a data, services, and networking point of view.

Year 2038 problem, in 2012

Thursday, February 9th, 2012

I’m generating a new certificate for myself, and I still remember the frustration I ran into a long time ago when my certificate expired needlessly and completely unexpectedly.

So this time I figured what the hell, I’ll set it to expire in 100 years. I thought that was the end of it, until I dumped my cert and observed the following:

        Validity
            Not Before: Feb 10 04:16:53 2012 GMT
            Not After : Jan  4 21:48:37 1976 GMT

Come on, you have got to be kidding me? I’m going to guess this is the Year 2038 problem. But in security software (openssl) that was built last year? I sure hope they’ll be working on ridding themselves of this problem before we get close to that date. You think it’s far, but remember how much was spent on Y2K?
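The bogus date is at least consistent with a 32-bit wraparound. The command used isn't shown in this post, so the following is a sketch under one assumption: that the 100 years were given as 36525 days (100 × 365.25). Adding that many seconds to the Not Before time and keeping only the low 32 bits of the result gives:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Not Before value from the certificate dump above.
not_before = datetime(2012, 2, 10, 4, 16, 53, tzinfo=timezone.utc)
validity_days = 36525  # assumption: 100 years at 365.25 days each

# Seconds since the epoch for the intended expiry date.
intended = int((not_before - EPOCH).total_seconds()) + validity_days * 86400

# Keep only the low 32 bits, as an overflowing 32-bit integer would.
wrapped = intended % 2**32

print("intended expiry:", EPOCH + timedelta(seconds=intended))
print("after wrapping: ", EPOCH + timedelta(seconds=wrapped))
# The wrapped value is 1976-01-04 21:48:37 UTC, the Not After date shown above.
```

That doesn't prove where the overflow happens, only that a 32-bit seconds counter somewhere in the chain would produce exactly this date.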

Lots of translations

Thursday, February 9th, 2012

Christian Perrier from the debian-i18n list has done me a huge favour. He created a tarball with every translation in every language for every piece of software in Debian!

You may imagine it’s huge, as did I, but I was shocked at just how big it is. Almost 2 GB of gzip-compressed PO files from the testing and unstable branches!

I wrote a little script to decompress all the PO files in the extracted tarball:

find . -type f | while IFS= read -r A; do gunzip -v "$A" ; done

I’ve no idea how long it’s going to take to run :)

After that I’ll have to write a special PHP script to parse all the PO files and add the translations to the database. There are going to be some challenges with that:

  1. It’s going to be very hard to notice if an error happened during parsing or insertion.
  2. It’s probably going to take a very long time on current hardware.
  3. I might actually run out of disk space, since my MySQL databases are in /var and that’s on the root partition and it’s quite small.
  4. If my schema design isn’t great – I might have to scrap it all and go through the exercise again. This is, sadly, quite likely.

All solvable problems, and I’m happy that I already got to the point where I have to seriously worry about scalability.

Thanks Christian!

Beat my uptime: 3 years!

Wednesday, January 18th, 2012

andrew@littlesvr:~$ uptime
13:26:34 up 1104 days, 10:14,  4 users,  load average: 0.61, 0.31, 0.21

I guess 1100 days doesn’t sound like a lot but 3 years does :)

It’s still the same machine I mentioned a year ago. It’s been running and running and running, and not crashing!

Theoretically 3 years is not a lot, but in the real world such uptime requires a combination of luck, great software, and good hardware. A power failure will kill it. A kernel update will kill it. A kernel that can’t handle the load will kill it. An admin who doesn’t know how to upgrade or restart services without a reboot will kill it. A bad fan will kill it. And mine is still up :)

Even cdot, which is running Slackware 9 (littlesvr is running 12.2), has only been up for 208 days today, though I’m guessing it’s sitting in a real server room and has real admins taking care of it.

Yesterday (this was after the 3 year mark) I thought my server was finally about to die. When I realised what was going on there were 115 httpd processes running, I had 25MB of physical memory and 36MB of swap space left. Sendmail fell over, refusing to work with a load average over 15 (it got to 44). Sshd stopped accepting connections. imapd could barely serve requests. Interestingly Apache still worked.

I tailed the Apache logs (this took me half an hour; working entirely off swap is very slow) and saw nothing unusual. I have no idea what got into it or why the suicidal behaviour. In the logs there were only the typical 2-3 requests and 2-3 errors per minute. I tried to run apachectl status but that was taking too long. So I did the obvious: apachectl stop. After 5-10 minutes the hard drive light stopped blinking, and littlesvr breathed a sigh of relief.

As for me – I’m not really sure that I cared if it died. 3 years is a lot, I’m starting to get itchy to upgrade software (though it works perfectly fine) and to upgrade the hardware (512MB of RAM is too little). Even during normal operations I’m using almost all my RAM, and hopefully the load will be heavier in the future.

Maybe when the next Slackware comes out I’ll decide whether today’s 13.37 is a stable enough version for the long term.

Oh, you Git!

Wednesday, January 18th, 2012

Here’s an OSTD feature I was really excited about and looking forward to implementing: allow the user to put in some version control info for a project so that I can run a nightly cron job and pull any new translations that have been added to that project. It would have been a great feature, and it would have provided a self-improvement mechanism for the OSTD.

And then I went and read about the capabilities of Git and Mercurial. Did you know that unlike that archaic version control system SVN the new and fancy distributed version control systems do not allow partial clones of repositories?

In other words, I cannot pull from a public repository only the po folder with the po files. I have to clone the entire repository. There’s an option to only pull the latest revision, but even that can be huge.

I cannot possibly afford the disk space, bandwidth, or time necessary to clone and do updates on thousands of full repositories which I was hoping to have linked into the OSTD. It would have worked perfectly with SVN, but not with the new systems.

So now what? My feature has been erased from the design board by the growing popularity of Git. I’m going to have to find a different solution. Perhaps I can arrange something with Debian, who have some kind of mechanism for indexing po files inside their packages.

Git.