{"id":1056,"date":"2015-05-03T21:39:29","date_gmt":"2015-05-04T02:39:29","guid":{"rendered":"http:\/\/littlesvr.ca\/grumble\/?p=1056"},"modified":"2015-05-03T21:39:29","modified_gmt":"2015-05-04T02:39:29","slug":"screen-scraping-timetable-data-from-a-peoplesoft-faculty-center","status":"publish","type":"post","link":"http:\/\/littlesvr.ca\/grumble\/2015\/05\/03\/screen-scraping-timetable-data-from-a-peoplesoft-faculty-center\/","title":{"rendered":"Screen scraping timetable data from a PeopleSoft Faculty Center"},"content":{"rendered":"<p>Our school moved to PeopleSoft for.. I&#8217;m not going there.. but that&#8217;s where everyone&#8217;s timetables are now. I thought maybe this big fancy company has an API to let me access the data but no, it&#8217;s basically impossible to access the API directly.<\/p>\n<p>So I was left with screen scraping, which I always wanted to try, why not. Go to the page I want to examine, open up Firebug, and drill down to the table elements I&#8217;m interested in: body&gt;div&gt;iframe&gt;html&gt;body&gt;form&gt;div&gt;table&gt;tbody&gt;tr&gt;td&gt;div&gt;table&gt;tbody&gt;tr&gt;td&gt;div&gt;table&gt;tbody&gt;tr&gt;td&gt;div&gt;table&gt;tbody&gt;tr&gt;td&gt;div&gt;table&gt;tbody&gt;tr&gt;td&gt;div&#8230;<\/p>\n<p>Er, wtf? I seemed to be going in some Firebug bug infinite loop. Surely they don&#8217;t have that many tables inside each other? 
Then I discovered the &#8220;Click an element&#8221; button and found that there are lots and lots of tables inside tables on this simple page:<\/p>\n<p><a href=\"https:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/05\/faculty-centre.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-1058\" src=\"https:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/05\/faculty-centre-300x229.png\" alt=\"faculty centre\" width=\"300\" height=\"229\" srcset=\"http:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/05\/faculty-centre-300x229.png 300w, http:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/05\/faculty-centre.png 711w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>This is with the text at its minimum size, you can see by the scrollbars what I&#8217;m talking about:<\/p>\n<p><a href=\"https:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/05\/Firebug-peoplesoft-html.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-medium wp-image-1059\" src=\"https:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/05\/Firebug-peoplesoft-html-300x184.png\" alt=\"Firebug peoplesoft html\" width=\"300\" height=\"184\" srcset=\"http:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/05\/Firebug-peoplesoft-html-300x184.png 300w, http:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/05\/Firebug-peoplesoft-html-1024x629.png 1024w, http:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/05\/Firebug-peoplesoft-html.png 1209w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>But after a while I managed to figure it out. I had to learn some XPath to find the cells I was interested in based on their IDs, but I couldn&#8217;t use XPath for everything &#8211; I tried but it ate all my RAM and was still working through the swap partition when I killed it in the morning.<\/p>\n<p>Here&#8217;s the script in case you&#8217;re in the same boat. 
It prints the timetable data to the console. For myself I intend to make some JSON out of it for import into <a href=\"http:\/\/littlesvr.ca\/et\/\">Everyone&#8217;s Timetable<\/a>.<\/p>\n<pre>\/\/ Firebug script to scrape timetable data from a PeopleSoft-backed website.\r\n\/\/ Run it when you're on the page that shows the timetable. You get to that page\r\n\/\/ like so:\r\n\/\/ Faculty Center\r\n\/\/\u00a0 Click the Search tab\r\n\/\/\u00a0\u00a0\u00a0 Expand Additional Search Criteria\r\n\/\/\u00a0\u00a0\u00a0\u00a0\u00a0 Set \"Instructor Last Name\" to the one you're looking for\r\n\/\/\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Start Firebug, go to Console, paste in this script and run it\r\n\/\/\r\n\/\/ Author: Andrew Smith http:\/\/littlesvr.ca\r\n\r\nvar frameDocument = document.getElementById('ptifrmtgtframe').contentWindow.document;\r\n\r\n\/\/ DERIVED_CLSRCH_DESCR200$0, $1, etc. have the course title\r\nvar courseTitles = frameDocument.\r\n\u00a0 evaluate(\"\/\/div[contains(@id,'DERIVED_CLSRCH_DESCR200')]\", \r\n\u00a0 frameDocument.documentElement, null,\r\n\u00a0 XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);\r\n\r\n\/\/ For each course\r\nfor (var i = 0; i &lt; courseTitles.snapshotLength; i++) {\r\n\u00a0 var courseTitle = courseTitles.snapshotItem(i);\r\n\u00a0 \r\n\u00a0 console.log(courseTitle.textContent);\r\n\u00a0 \r\n\u00a0 \/\/ Find the next tr, which has the timetable data for this course\r\n\u00a0 var timetableTableParentRow = courseTitle\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .parentNode\r\n\u00a0\u00a0\u00a0 .nextSibling\r\n\u00a0\u00a0\u00a0 .nextSibling;\r\n\u00a0 \r\n\u00a0 \/\/ There's some fucked up empty row after 
the first course title only\r\n\u00a0 if (i == 0)\r\n\u00a0 {\r\n\u00a0\u00a0\u00a0 timetableTableParentRow = timetableTableParentRow\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 .nextSibling\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 .nextSibling;\r\n\u00a0 }\r\n\u00a0 \r\n\u00a0 \/\/ Now go down to the table in this tr, it's the only thing that has \r\n\u00a0 \/\/ an id so I can use xpath to find its children (timetable rows).\r\n\u00a0 var timetableTableId = timetableTableParentRow\r\n\u00a0\u00a0\u00a0 .firstChild\r\n\u00a0\u00a0\u00a0 .nextSibling\r\n\u00a0\u00a0\u00a0 .nextSibling\r\n\u00a0\u00a0\u00a0 .nextSibling\r\n\u00a0\u00a0\u00a0 .firstChild\r\n\u00a0\u00a0\u00a0 .nextSibling\r\n\u00a0\u00a0\u00a0 .id;\r\n\u00a0 \r\n\u00a0 \/\/ MTG_DAYTIME$0, $1, etc. have the day and time range in this format:\r\n\u00a0 \/\/ Mo 1:30PM - 3:15PM\r\n\u00a0 var times = frameDocument.\r\n\u00a0\u00a0\u00a0 evaluate(\"\/\/div[@id='\" + timetableTableId +\"']\/\/div[contains(@id,'MTG_DAYTIME')]\", \r\n\u00a0\u00a0\u00a0 frameDocument.documentElement, null,\r\n\u00a0\u00a0\u00a0 XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);\r\n\u00a0 var timesArray = new Array();\r\n\u00a0 for (var j = 0; j &lt; times.snapshotLength; j++) {\r\n\u00a0\u00a0\u00a0 timesArray[j] = times.snapshotItem(j).textContent;\r\n\u00a0 }\r\n\u00a0 \r\n\u00a0 \/\/ MTG_ROOM$0, $1, etc. have the room number in this format:\r\n\u00a0 \/\/ S@Y SEQ Bldg S3028\r\n\u00a0 var rooms = frameDocument.\r\n\u00a0\u00a0\u00a0 evaluate(\"\/\/div[@id='\" + timetableTableId +\"']\/\/div[contains(@id,'MTG_ROOM')]\", \r\n\u00a0\u00a0\u00a0 frameDocument.documentElement, null,\r\n\u00a0\u00a0\u00a0 XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);\r\n\u00a0 var roomsArray = new Array();\r\n\u00a0 for (var j = 0; j &lt; rooms.snapshotLength; j++) {\r\n\u00a0\u00a0\u00a0 roomsArray[j] = rooms.snapshotItem(j).textContent;\r\n\u00a0 }\r\n\u00a0 \r\n\u00a0 \/\/ MTG_INSTR$0, $1, etc. have the instructor names but I think I'll\r\n\u00a0 \/\/ ignore them. 
For shared courses it won't hurt too much I hope.\r\n\u00a0 \r\n\u00a0 \/\/ Dump all the timetable data into the console, will do something with it later.\r\n\u00a0 for (var j = 0; j &lt; times.snapshotLength; j++) {\r\n\u00a0\u00a0\u00a0 console.log(timesArray[j] + roomsArray[j]);\r\n\u00a0 }\r\n}<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Our school moved to PeopleSoft for.. I&#8217;m not going there.. but that&#8217;s where everyone&#8217;s timetables are now. I thought maybe this big fancy company has an API to let me access the data but no, it&#8217;s basically impossible to access the API directly. So I was left with screen scraping, which I always wanted to &hellip; <\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,4],"tags":[],"class_list":{"0":"entry","1":"post","2":"publish","3":"author-andrew","4":"post-1056","6":"format-standard","7":"category-everyones-timetable","8":"category-safeforseneca"},"_links":{"self":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/1056","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/comments?post=1056"}],"version-history":[{"count":4,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/1056\/revisions"}],"predecessor-version":[{"id":1062,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/1056\/revisions\/1062"}],"wp:attachment":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/media?parent=1056"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-jso
n\/wp\/v2\/categories?post=1056"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/tags?post=1056"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}