The day before Christmas I submitted my dissertation! It was due on the 28th, but I wasn't going to do any more work, and I didn't want to take a chance on who-knows-what system going down.

It’s been a very long and very hard road. There have been more challenges than I can count or remember. I only made it through without despairing because I had the support of really good people. I will list them here, and I really hope I won’t forget anyone.

My sponsor, Prof. Greg Wilson, who was at the University of Toronto when I started; I'm not sure what his official position is today. He's full of ideas, all of them valuable from both technical and academic points of view. It was not at all his job to help me, but after agreeing to be my sponsor he treated me the same as one of his graduate students. He provided not only advice and feedback but also access to resources and people. I expect I was just as much a pain in the ass as I usually am, and the fact that he was always understanding tells me that either he's very good at controlling his emotions or he's just a naturally nice guy :)

Quite a few people from Seneca helped out too. The Chair of Computer Studies, Evan Weaver, and Prof. Chris Szalwinski allowed me to use teaching materials from Seneca courses for my experiments. Chris also helped recruit participants. Dawn Mercer from ORI helped me out of a real bind when my tablet broke, lending me another one on more than one occasion so I could continue my dissertation without spending another few hundred dollars. And not least, professors John Selmys, David Humphrey, and Peter McIntyre, who together with Chris and Evan inspired me to see potential in this career and to achieve my own full potential during and after my undergraduate studies.

Dr. Khai Truong from the University of Toronto lent me a tablet during the preliminary stages of the dissertation, when I was still formulating the problem I would be solving. I had never used a tablet before, and it was really good to have one to play with at that point, to see what the thing might be capable of.

Dr. Sam Kamin from the University of Illinois at Urbana-Champaign was very receptive to my goal of modifying SLICE for my own research, and provided much guidance in customizing that piece of software.

Finally, two of my peers from the University of Toronto, Mike Conley and Zuzel Vera Pacheco, were working on their Master's degrees under Greg's supervision at the same time as me. Both were always available and willing to help whenever I needed it, and I've enjoyed helping them whenever I could.

Now I have to wait until May (yeah, no kidding) to find out whether I got honours or not. I think it will be prudent to wait until that's done before I post a demo of my prototype. Don't take this as suspense; it's really not that fancy a piece of software, but it might be interesting to some people, so I will post it.

I’m done, and am looking for something new to do in my spare time. Thanks again, everyone!

After my tablet broke, I decided I couldn't afford the time or the money to get another one. Someone suggested that I buy one from a store and return it within the 10 days or whatever return period they allow for laptops.

While I was considering that, I wrote an email to Dawn Mercer, a research consultant at the Seneca Office of Research and Innovation. I got my bachelor's degree at Seneca and worked on more than one project in her office. She had a tablet and was willing to lend it to me for my research! Thanks Dawn, you saved me a lot of trouble.

The EliteBook 2730p is a beast compared to my little TC1100. It’s not just a good laptop today, it’s a great laptop, and a tablet on top of that.

It's heavier and bulkier, but the screen is considerably bigger and it runs much faster (which makes Slice much smoother). The keyboard is also much closer to full size, rather than miniaturized as on the TC1100. That's important because one of the earlier participants refused to use the TC1100 keyboard and I had to plug in a USB keyboard.

The most important part was the accuracy of the stylus. On most of the screen it was really good, but within 5–10 cm north of the hinge it was terrible. Calibration was useless; I couldn't even touch the bottom crosshairs.

It turns out this is a common problem with the model. I even found a claim that someone returned two tablets under warranty and the third still had the same problem.

During my search I found some magic command-line parameters I could pass to the Windows calibration software to increase the number of crosshairs from 16 to 64. That helped make the tablet usable for my experiment, confining the problem to under 1 cm north of the hinge, so it only really affected the Slice scrollbar.
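For posterity, and reconstructing from memory rather than my notes: the tool in question was Windows' built-in tabcal.exe, and the trick was an invocation along these lines, where the grid parameters take comma-separated pixel coordinates (the values below are illustrative, not the exact ones I used):

```
tabcal lincal novalidate XGridPts=120,360,600,840,1080 YGridPts=100,300,500,700
```

Listing more coordinates produces more crosshairs, which is what let me pull the dead zone in so close to the hinge.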

The rest of my experiments were run on the new tablet.

During one of the first experiments I turned on the tablet and the screen came up looking broken. I attributed that to a bad connection somewhere, jiggled things around, rebooted, and it worked fine afterwards.

Then, during another experiment, the tablet failed to boot properly at all. I even opened it up with the participant's screwdriver and looked at the insides, but found nothing. This was it:

It seems everything still works except the display hardware, which shows those rectangles everywhere. I took the whole thing apart (a challenge, because it had special HP pentagon screws) and removed and reattached all the ribbons many times, but nothing helped.

Sometimes it would come up OK, but only once or twice, which gave me hope the problem was intermittent. I was obviously wrong though:

That Windows refused to switch to anything higher than 640×480 told me the problem was probably in the video card, not in the connection to the display. I really didn't want to believe this. It took much time and energy and a sizable investment to procure this device, and I was not willing to duplicate all that.

A solution needed to be found, and fast… see next post.

Screen recorder saved me

The software I used for the tablet reviews (Slice) is capable of saving the result in an XML file, including every single stroke in every file. I wrote about this previously.
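To give a flavour of what tallying that XML might look like: the element and attribute names below are hypothetical (I'm not reproducing Slice's real schema here), but the idea of walking the reviewed files and counting annotation strokes is the same.

```python
import xml.etree.ElementTree as ET

# Hypothetical Slice-style export: <file> elements containing <stroke> elements.
SAMPLE = """
<review>
  <file name="Account.java">
    <stroke page="1" points="10,20 11,22 13,25"/>
    <stroke page="1" points="40,60 41,61"/>
  </file>
  <file name="Main.java">
    <stroke page="2" points="5,5 6,6"/>
  </file>
</review>
"""

def strokes_per_file(xml_text):
    """Count the annotation strokes recorded for each reviewed file."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): len(f.findall("stroke")) for f in root.findall("file")}

print(strokes_per_file(SAMPLE))  # {'Account.java': 2, 'Main.java': 1}
```

When the export is intact, a walk like this is all the tallying takes; the trouble described next is what happens when it isn't.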

It turns out that this only works about half the time, the other times the saved version looks like this:

I saw this for the first time after I was done with all the experiments and started tallying the results. It frightened me a little – potentially half of my experiments would have to be scrapped if I couldn't access this data.

Luckily I used another program, BB FlashBack, to make Flash recordings of the entire screen for each review session. Playing them back at 4x speed allowed me to record comments as accurately as if they were on paper, and the recordings also reminded me of things I would otherwise have forgotten.

I will give them some free promotion here – this is a most excellent screen recorder. I tried a few and this is the only one that will:

  • Run smoothly on an older computer
  • Record with a very decent framerate
  • Make reasonably small flash files

I was so happy that I had these recordings that I even wanted to buy the full version, but at $90 it's a bit too expensive for me.

Just as a matter of curiosity I will record here a couple of thoughts that came to my mind as I was analysing the results. This post will be a bit of a rant; please skip it if you don't have the patience for grumbles.

It's amazing how different reviewers are from one another. I don't mean the level of expertise or knowledge or intelligence, just plain differences.

Some felt really passionate about using short variable names, others hated the idea. Some accepted having getter functions, some called it a performance drain. But the most interesting differences in opinion are in those involving stuff I classified under “design” – something I am now tempted to call “rationalisation”.

The more experienced people made comments that were obviously heavily inspired by their experience. Don't do it that way, do it this other way. Why? Don't ask why, I know what I'm talking about. Is their way really better? It's hard to criticise the more senior people. These are the most realistic code reviews; it's what happens in real companies, and it's what I get at work these days.

They're not about code errors, and while one may argue it's good for maintainability to have uniform use of patterns and other major code structure, I'm not convinced that the review is a good time to encourage it. Pre-commit reviews may force the author to make (potentially significant) changes, but I'm not sure the time spent on doing this actually pays off in the end.

Who does the best reviews? That's another very interesting thing I found after tallying the reviews' contents: it wasn't the most experienced people who found the most errors; it seemed completely random. I think I've heard this somewhere before – probably in relation to Mike's marking research.

What factors did I fail to account for, such that I can't answer that question? A few come to mind. Talent is certainly a strong factor, and I'm putting my bet on personality too. If someone is fundamentally interested only in errors, they will put real effort into finding errors and not waste brain cycles on anything else. In my opinion, the count of real errors found (errors that would cause the software to not function as intended) is the ultimate measure of the quality of a review.
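To make that measure concrete, here is a toy version of the tally in Python. The reviewer data is entirely made up for illustration (these are not my actual results); the rank-correlation arithmetic is standard Spearman for tie-free data.

```python
# Made-up tally: real errors found per reviewer vs. years of experience.
reviews = [
    {"reviewer": "A", "experience_years": 12, "real_errors_found": 3},
    {"reviewer": "B", "experience_years": 1,  "real_errors_found": 7},
    {"reviewer": "C", "experience_years": 6,  "real_errors_found": 2},
    {"reviewer": "D", "experience_years": 3,  "real_errors_found": 6},
]

def rank(values):
    """Rank values ascending, 1-based (assumes no ties, true of this toy data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

xs = [r["experience_years"] for r in reviews]
ys = [r["real_errors_found"] for r in reviews]
print(spearman(xs, ys))  # negative here: more experience did not mean more errors found
```

With real data the interesting part is precisely that the correlation came out near-random rather than strongly positive.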

This brought up so many interesting questions! It makes me want to think of continuing research in this field to find answers, but I have a deadline for now and can’t be sidetracked. Sadly I don’t have time to learn what a great review contains under what circumstances and what kinds of people are most likely to make great reviewers.

Test experiment run

P.S. This was written on the 18th of July, before I started the experiments. I decided not to publish it at that time but to wait until the experiments were complete, which is now.

Last Wednesday two of my peers, Mike and Zuzel, kindly helped me by reviewing some code using my new tablet, a customised Slice, and a Java assignment I wrote in the second year of my undergraduate degree. The point of doing this was to find problems with the experiment and to get other insights that might help me do it right with actual participants (because of their close involvement with my project they cannot count as participants in the experiment). It turned out to be very valuable, and I learned quite a bit – thanks guys!

Here I will relate things that I feel should not contaminate participants if they happen to stumble across this post.

One of the first things that became obvious was that the stylus needs to be calibrated for each participant. I watched Mike calibrate it and he did the same as I would have, but Zuzel's settings were quite different. This will not be a problem: I'll just put the control panel icon on the desktop and ask each participant to run it; the entire process only takes 6 clicks.

I failed to time either of them. As I thought about it, I realised this is a much more complicated problem than I expected. Both Mike and Zuzel stopped during their review, either to ask something or to join a conversation. Do those pauses count as part of the 20 minutes I'm planning, or not? Should I let them at it for as long as they are willing? And should I introduce some breaks/distractions so the participants don't get bored?

Which reminded me – I had not given much thought to something Greg mentioned a while ago: what order do the participants do the reviews in? Always the same (such as tablet, then form, then paper), or a random order for each person? Given the expected small number of participants, random ordering would add a variable I'd still have to analyse, and I would rather not have the extra variable. So I'll probably have everyone do them in the same order; I just have to decide on it.
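For what it's worth, had I gone the counterbalancing route instead, a simple rotation (a Latin square for three methods) would have spread each method evenly across the three positions as participants arrived. A quick sketch, with the method names taken from above:

```python
# Illustrative counterbalancing sketch (not what I ended up doing):
# rotate the method order by one position per participant, so over any
# three consecutive participants each method appears once in each slot.
METHODS = ["tablet", "form", "paper"]

def order_for(participant_index):
    """Return the review-method order for the nth participant."""
    k = participant_index % len(METHODS)
    return METHODS[k:] + METHODS[:k]

for i in range(3):
    print(i, order_for(i))
```

With only a handful of participants even this doesn't remove order effects from any single person's data, which is part of why a fixed order seemed simpler.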

A keyboard was not available, even though my project plan includes adding keyboard input to Slice. I asked both Mike and Zuzel whether they missed it, and whether they would have used it if it had been available. Mike thought so, but wasn't sure how its use would balance with the stylus; Zuzel was sure she wouldn't have.

Neither used the line numbers, which is good I guess (I can just get rid of them), and not unexpected given the simplicity of the code.

It's become apparent that a problem with the software I'd been aware of is a major one. The eraser can erase not only the stuff penned in, but also the line numbers and the code itself. Zuzel had this happen 3 times, Greg had it happen the one time he used the software, and it happens to me periodically. That means something like 75% of the participants will run into this problem, which has the following effect:

I can fix it post-mortem by overlaying the comments over the code, but such an occurrence pretty much invalidates the data gathered from the participant it’s happened to. It definitely needs to be fixed.

I also thought the code was too complicated. It's not really complicated code, but when you have 20 minutes to become familiar enough with it and review it, anything more than a single function is too complicated. I've decided to leave it as it is, though; the review seems more realistic this way.

Resuming posts to the blog

I avoided posting anything here since I started the experiments, to avoid any chance of contamination. At first I thought there was enough in this blog already to contaminate people, but I took Greg's advice. After all, better safe than sorry.

Actually, with a couple of participants I had the feeling that they were quoting me when describing their overall experience, but I later decided that was just a coincidence. I would myself have a hard time finding particular opinions in the 29 posts published to date, and students hardly have the time to read all of my blog.

Anyway, this one is just for the sake of the title: I'm blogging again about my experiences until I'm done with this degree, some time in December this year.

Call for participants!

25 September 2010: The experiments have been completed. Thanks to everyone for participating! I will now resume blogging, as soon as I recover my energy :)

I am looking for participants for a study in my Master’s project. The goal of the project is to determine what advantages and disadvantages three types of code review tools have.

As a participant you would spend about an hour doing 3 code reviews (20 minutes each):

  • One on paper, just like marking in the good old days.
  • One using ReviewBoard, a popular code review tool.
  • One using a Tablet PC, with custom-built software.

No code review or marking experience is required; all you need is familiarity with either C or Java.

An immediate benefit for you is a $10 gift card for Tim Hortons. Another is that you get to play with a Tablet PC.

Longer-term benefits include some experience with code review (you bet employers care about that) and participation in a post-graduate Computer Studies / Information Technology study (you never know when you'll decide to continue your studies).

You need to be in the Greater Toronto Area (loosely defined as within 50 km of downtown Toronto).

Please get in touch!

I’ll be posting these two fliers around UofT and Seneca: Seneca Recruitment Flier and UofT Recruitment Flier.

Inexperienced reviewers

Again I should say a big thanks to Zuzel Vera Pacheco, Mike Conley, and Alecia Fowler for helping me iron out potential problems with my experiment.

A final lesson I learned this week from the test runs with my peers is that I should have considered that inexperienced reviewers won't actually know how to do a code review, so I have to account for that.

I think the best way to deal with that problem is to have a checklist of the types of comments the participants could give, including:

  • Poor or very good coding style in any way (whitespace, variable/function/class names, etc.)
  • Functional mistakes (doesn’t do what it should as far as the reviewer can tell)
  • Poor or very good design in any way (classes, class members, data structures, etc.)
  • Poor or very good comments
  • Poor or very good error handling
  • Poor or very good performance

Perhaps I will even require that the participants only give feedback from these categories, though I don’t like that idea because I don’t want to exclude something really important by mistake.

Code to review

For my study I need 3 code samples in Java of equal complexity and with the same number of mistakes, plus 3 more in C/C++.

I dug through my entire undergraduate homework tree (which I kept for sentimental value) and only found one assignment that was appropriate – an introductory Java assignment.

I forgot about this problem for some weeks, but now that I’m getting close to running the experiments I need to get it ready. My search for other people’s assignments failed, so I decided to write my own.

Immediately it became obvious that I hadn't seriously considered this problem. Each participant gets 3 pieces of code to review, one for each review method, and I hadn't thought the following through:

  1. Should they be the same code? Obviously not: the second and third reviews would be tainted by the participant's comments on the first.
  2. Should it be completely different code? The problem above goes away but then this becomes a huge variable, since a student may be well familiar with one concept (e.g. string manipulation) but completely unfamiliar with another (e.g. recursion).
  3. Should it be code that uses similar techniques, but is solving a different problem?

I picked (3). I don't think it entirely solves the problem in (1), since the same feedback may apply to the same kinds of mistakes in all 3 sets of code, but that problem would exist regardless. Actually, I know someone who may have an answer to that before I get to the analysis stage of my dissertation; that will be nice to mention.

Having made this decision, I dug up another introductory Java assignment specification and implemented it. The results were astounding. One would think that after several years of work experience it would be easy, but if I were still an undergrad and submitted this, I would have received a failing grade. The specification is completely confusing and the requirements make no sense. Not having a prof to ask what he meant made it nearly impossible to implement properly.

Which, I later decided, is OK. My goal was to make sure that the second set of code is of the same quality as the first, and adherence to the assignment specification is irrelevant. I made sure the code compiles and did my best to make sure that it fundamentally makes sense.

Then I introduced some errors of different types, in both assignments. Hopefully they are roughly the same now.

This exercise took 4 solid hours of hard work, and I have to do 4 more (1 Java and 3 C/C++). I'll try to find time somewhere.