Our school moved to PeopleSoft for... I’m not going there... but that’s where everyone’s timetables are now. I thought maybe this big fancy company would have an API to let me get at the data, but no: it’s basically impossible to access the API directly.

So I was left with screen scraping, which I’d always wanted to try anyway, so why not. I went to the page I wanted to examine, opened up Firebug, and drilled down to the table elements I was interested in: body>div>iframe>html>body>form>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div…

Er, wtf? I seemed to be stuck in some infinite-loop bug in Firebug. Surely they don’t have that many tables inside each other? Then I discovered the “Click an element” button and found that there really are lots and lots of tables inside tables on this simple page:

[Image: the Faculty Centre page]

This is with the text at its minimum size; you can see by the scrollbars what I’m talking about:

[Image: Firebug showing the PeopleSoft HTML]

But after a while I managed to figure it out. I had to learn some XPath to find the cells I was interested in based on their IDs, but I couldn’t use XPath for everything – I tried but it ate all my RAM and was still working through the swap partition when I killed it in the morning.
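Where XPath is too slow, the fallback is walking the DOM by hand from a node you have already found. Here is a minimal sketch of two helpers for that. The names `ancestor` and `nextElement` are made up for illustration and are not part of the script; they only rely on the standard `parentNode`, `nextSibling` and `nodeType` properties.

```javascript
// Hypothetical helpers for hand-walking deeply nested tables.

// Climb `levels` ancestors up from `node`.
// Returns null if the tree is shallower than that.
function ancestor(node, levels) {
  var current = node;
  for (var i = 0; i < levels && current; i++) {
    current = current.parentNode;
  }
  return current || null;
}

// Return the next sibling that is an element (nodeType 1), skipping the
// whitespace text nodes that sit between tags. Returns null if there is none.
function nextElement(node) {
  var sibling = node ? node.nextSibling : null;
  while (sibling && sibling.nodeType !== 1) {
    sibling = sibling.nextSibling;
  }
  return sibling || null;
}
```

Something like `nextElement(ancestor(cell, 11))` could then replace a long chain of `.parentNode` and `.nextSibling` calls, assuming the siblings being skipped are whitespace text nodes. That is a guess about the markup, so check it in Firebug first. Browsers that support `nextElementSibling` can use that property instead of `nextElement`.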

Here’s the script in case you’re in the same boat. It prints the timetable data in the console. For myself, I intend to make some JSON out of it for import into Everyone’s Timetable.

// Firebug script to scrape timetable data from a PeopleSoft-backed website.
// Run it when you're on the page that shows the timetable. You get to that page
// like so:
// Faculty Center
//  Click the Search tab
//    Expand Additional Search Criteria
//      Set "Instructor Last Name" to the one you're looking for
//        Start Firebug, go to Console, paste in this script and run it
//
// Author: Andrew Smith http://littlesvr.ca

var frameDocument = document.getElementById('ptifrmtgtframe').contentWindow.document;

// DERIVED_CLSRCH_DESCR200$0, $1, etc. have the course title
var courseTitles = frameDocument.
  evaluate("//div[contains(@id,'DERIVED_CLSRCH_DESCR200')]", 
  frameDocument.documentElement, null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);

// For each course
for (var i = 0; i < courseTitles.snapshotLength; i++) {
  var courseTitle = courseTitles.snapshotItem(i);
  
  console.log(courseTitle.textContent);
  
  // Find the next tr, which has the timetable data for this course
  var timetableTableParentRow = courseTitle
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .nextSibling
    .nextSibling;
  
  // There's some fucked up empty row after the first course title only
  if (i == 0)
  {
    timetableTableParentRow = timetableTableParentRow
      .nextSibling
      .nextSibling;
  }
  
  // Now go down to the table in this tr, it's the only thing that has 
  // an id so I can use xpath to find its children (timetable rows).
  var timetableTableId = timetableTableParentRow
    .firstChild
    .nextSibling
    .nextSibling
    .nextSibling
    .firstChild
    .nextSibling
    .id;
  
  // MTG_DAYTIME$0, $1, etc. have the day and time range in this format:
  // Mo 1:30PM - 3:15PM
  var times = frameDocument.
    evaluate("//div[@id='" + timetableTableId +"']//div[contains(@id,'MTG_DAYTIME')]", 
    frameDocument.documentElement, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var timesArray = [];
  for (var j = 0; j < times.snapshotLength; j++) {
    timesArray[j] = times.snapshotItem(j).textContent;
  }
  
  // MTG_ROOM$0, $1, etc. have the room number in this format:
  // S@Y SEQ Bldg S3028
  var rooms = frameDocument.
    evaluate("//div[@id='" + timetableTableId +"']//div[contains(@id,'MTG_ROOM')]", 
    frameDocument.documentElement, null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var roomsArray = [];
  for (var j = 0; j < rooms.snapshotLength; j++) {
    roomsArray[j] = rooms.snapshotItem(j).textContent;
  }
  
  // MTG_INSTR$0, $1, etc. have the instructor names but I think I'll
  // ignore them. For shared courses it won't hurt too much I hope.
  
  // Dump all the timetable data into the console, will do something with it later.
  for (var j = 0; j < times.snapshotLength; j++) {
    console.log(timesArray[j] + ' ' + roomsArray[j]);
  }
}
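
As for turning the output into JSON, here is a rough sketch of one way to do it. It assumes the day/time strings always look like the `Mo 1:30PM - 3:15PM` example in the comments above. The function names and the field names (`days`, `start`, `end`, `room`, `raw`) are made up for illustration; the format Everyone’s Timetable actually imports may well be different.

```javascript
// Hypothetical helper: turn one "Mo 1:30PM - 3:15PM" string plus its room
// into a plain object. Field names are invented for this sketch.
function parseMeeting(dayTime, room) {
  var pattern = /^\s*(\S+)\s+(\d{1,2}:\d{2}[AP]M)\s*-\s*(\d{1,2}:\d{2}[AP]M)\s*$/;
  var match = pattern.exec(dayTime);
  if (!match) {
    // Anything that doesn't fit the expected format is kept as raw text
    // rather than guessed at.
    return { raw: dayTime, room: room };
  }
  return { days: match[1], start: match[2], end: match[3], room: room };
}

// Pair up the times and rooms collected for one course and serialize them.
function meetingsToJson(timesArray, roomsArray) {
  var meetings = [];
  for (var i = 0; i < timesArray.length; i++) {
    meetings.push(parseMeeting(timesArray[i], roomsArray[i]));
  }
  return JSON.stringify(meetings, null, 2);
}
```

Inside the per-course loop, `console.log(meetingsToJson(timesArray, roomsArray))` would print the JSON instead of the plain concatenated strings.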