Our school moved to PeopleSoft for... I’m not going there... but that’s where everyone’s timetables are now. I thought maybe this big fancy company would have an API to let me access the data, but no: it’s basically impossible to access the API directly.
So I was left with screen scraping, which I’d always wanted to try, so why not. Go to the page I want to examine, open up Firebug, and drill down to the table elements I’m interested in: body>div>iframe>html>body>form>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div>table>tbody>tr>td>div…
Er, wtf? I seemed to be stuck in some infinite-loop bug in Firebug. Surely they don’t have that many tables nested inside each other? Then I discovered the “Click an element” button and found that there really are lots and lots of tables inside tables on this simple page:
This is with the text at its minimum size; you can see from the scrollbars what I’m talking about:
But after a while I managed to figure it out. I had to learn some XPath to find the cells I was interested in based on their IDs, but I couldn’t use XPath for everything – I tried but it ate all my RAM and was still working through the swap partition when I killed it in the morning.
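The ID-based lookup described above boils down to a single `evaluate` call on the document. Here is a minimal sketch, assuming the cells are `div` elements whose ids contain a known fragment. `findDivsByIdFragment` and its parameters are hypothetical names for illustration; they are not part of PeopleSoft or Firebug. Starting the path with `.` restricts the search to the subtree under the context node. That may be cheaper than a document-wide `//` search, but nothing was measured here.

```javascript
// Hypothetical helper: collect every div whose id contains `idFragment`,
// searching only below `contextNode`. Assumes idFragment contains no
// single quotes, since it is pasted straight into the XPath string.
function findDivsByIdFragment(doc, contextNode, idFragment) {
  // The leading "." makes the path relative to contextNode instead of
  // the whole document.
  var xpath = ".//div[contains(@id,'" + idFragment + "')]";
  var snapshot = doc.evaluate(xpath, contextNode, null,
      XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);

  // Copy the snapshot into a plain array so it is easier to loop over.
  var nodes = [];
  for (var i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}
```

In a browser console it would be called with the iframe's document, a node inside it, and an id fragment such as `'MTG_DAYTIME'`.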
Here’s the script in case you’re in the same boat. It prints the timetable data to the console. For my own use, I intend to turn it into JSON for import into Everyone’s Timetable.
// Firebug script to scrape timetable data from a PeopleSoft-backed website.
// Run it when you're on the page that shows the timetable. You get to that page
// like so:
//   Faculty Center
//   Click the Search tab
//   Expand Additional Search Criteria
//   Set "Instructor Last Name" to the one you're looking for
//   Start Firebug, go to Console, paste in this script and run it
//
// Author: Andrew Smith http://littlesvr.ca

var frameDocument = document.getElementById('ptifrmtgtframe').contentWindow.document;

// DERIVED_CLSRCH_DESCR200$0, $1, etc. have the course title
var courseTitles = frameDocument.
  evaluate("//div[contains(@id,'DERIVED_CLSRCH_DESCR200')]",
           frameDocument.documentElement, null,
           XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);

// For each course
for (var i = 0; i < courseTitles.snapshotLength; i++) {
  var courseTitle = courseTitles.snapshotItem(i);
  console.log(courseTitle.textContent);

  // Find the next tr which has the timetable data for this course
  var timetableTableParentRow = courseTitle
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .parentNode
    .nextSibling
    .nextSibling;

  // There's some fucked up empty row after the first course title only
  if (i == 0) {
    timetableTableParentRow = timetableTableParentRow
      .nextSibling
      .nextSibling;
  }

  // Now go down to the table in this tr, it's the only thing that has
  // an id so I can use xpath to find its children (timetable rows).
  var timetableTableId = timetableTableParentRow
    .firstChild
    .nextSibling
    .nextSibling
    .nextSibling
    .firstChild
    .nextSibling
    .id;

  // MTG_DAYTIME$0, $1, etc. have the day and time range in this format:
  // Mo 1:30PM - 3:15PM
  var times = frameDocument.
    evaluate("//div[@id='" + timetableTableId + "']//div[contains(@id,'MTG_DAYTIME')]",
             frameDocument.documentElement, null,
             XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var timesArray = [];
  for (var j = 0; j < times.snapshotLength; j++) {
    timesArray[j] = times.snapshotItem(j).textContent;
  }

  // MTG_ROOM$0, $1, etc. have the room number in this format:
  // S@Y SEQ Bldg S3028
  var rooms = frameDocument.
    evaluate("//div[@id='" + timetableTableId + "']//div[contains(@id,'MTG_ROOM')]",
             frameDocument.documentElement, null,
             XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
  var roomsArray = [];
  for (var j = 0; j < rooms.snapshotLength; j++) {
    roomsArray[j] = rooms.snapshotItem(j).textContent;
  }

  // MTG_INSTR$0, $1, etc. have the instructor names but I think I'll
  // ignore them. For shared courses it won't hurt too much I hope.

  // Dump all the timetable data into the console, will do something with it later.
  for (var j = 0; j < times.snapshotLength; j++) {
    console.log(timesArray[j] + roomsArray[j]);
  }
}
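
The JSON step mentioned earlier could look something like the sketch below. It assumes the time strings always follow the `Mo 1:30PM - 3:15PM` pattern noted in the script's comments, and that anything else (for example a placeholder like `TBA`) should simply be kept as raw text. `parseDayTime`, `buildCourseRecord`, and the field names are all hypothetical; the format Everyone’s Timetable actually imports is not described here, so treat the shape of the output as a placeholder.

```javascript
// Hypothetical helper: turn one "Mo 1:30PM - 3:15PM" string into an object.
// Returns null when the text does not match that pattern.
function parseDayTime(text) {
  var pattern = /^(\S+)\s+(\d{1,2}:\d{2}[AP]M)\s*-\s*(\d{1,2}:\d{2}[AP]M)$/;
  var match = pattern.exec(text.trim());
  if (!match) {
    return null;
  }
  return { day: match[1], start: match[2], end: match[3] };
}

// Hypothetical helper: pair up the times and rooms scraped for one course.
// The field names are made up for illustration.
function buildCourseRecord(title, timesArray, roomsArray) {
  var meetings = [];
  for (var j = 0; j < timesArray.length; j++) {
    meetings.push({
      when: parseDayTime(timesArray[j]),   // null if the format is unexpected
      raw: timesArray[j],                  // keep the original text either way
      room: (roomsArray[j] || '').trim()   // empty string if no room was scraped
    });
  }
  return { title: title.trim(), meetings: meetings };
}

// Example with a made-up course title; the time and room strings follow
// the formats noted in the script's comments.
var record = buildCourseRecord('Example Course',
    ['Mo 1:30PM - 3:15PM'], ['S@Y SEQ Bldg S3028']);
console.log(JSON.stringify(record, null, 2));
```

Inside the scraping loop, `buildCourseRecord(courseTitle.textContent, timesArray, roomsArray)` could replace the final `console.log` loop, with the records collected into an array and serialized once at the end.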