{"id":242,"date":"2011-12-09T17:23:01","date_gmt":"2011-12-09T22:23:01","guid":{"rendered":"http:\/\/littlesvr.ca\/grumble\/?p=242"},"modified":"2012-12-05T00:52:45","modified_gmt":"2012-12-05T05:52:45","slug":"scraping-data-from-a-reliable-source","status":"publish","type":"post","link":"http:\/\/littlesvr.ca\/grumble\/2011\/12\/09\/scraping-data-from-a-reliable-source\/","title":{"rendered":"Scraping data from a reliable source"},"content":{"rendered":"<p>One of the things I will need in my database is a table with all the language codes used in Linux locales. Things like en, fr, es, etc. There are lots, but where do I get a reliable list?<\/p>\n<p>I&#8217;ve done some searching and found the <a href=\"http:\/\/www.iana.org\/assignments\/language-subtag-registry\">IANA language subtag repository<\/a>. It&#8217;s a 45000 line text file with contents in this format:<\/p>\n<blockquote>\n<pre>%%\r\nType: language\r\nSubtag: ab\r\nDescription: Abkhazian\r\nAdded: 2005-10-16\r\nSuppress-Script: Cyrl<\/pre>\n<\/blockquote>\n<p>Of all those records only 1155 lines are 2-letter codes, which is what I was interested in. How do I get the language code and english name from there into a database? Piece of cake if you know some basic shell scripting:<\/p>\n<blockquote>\n<pre>#!\/bin\/bash\r\n\r\ncat languagelist.txt | while read LINE; \r\ndo \r\n  if echo $LINE | grep Subtag &gt; \/dev\/null; \r\n  then \r\n    echo -n \"`echo $LINE | cut -f 2 -d' '` \"; \r\n    HAVECODE=1\r\n  elif echo $LINE | grep Description &gt; \/dev\/null; \r\n  then \r\n    if [ $HAVECODE -eq 1 ]\r\n    then\r\n      echo `echo $LINE | cut -f 2 -d' '`; \r\n    fi\r\n    HAVECODE=0\r\n  fi;\r\ndone<\/pre>\n<\/blockquote>\n<p>And insert it all into the database:<\/p>\n<blockquote>\n<pre>#!\/bin\/bash\r\n\r\n.\/parselanguagelist.sh | while read LINE;\r\ndo\r\n  CODE=`echo $LINE | cut -f 1 -d ' '`\r\n  NAME=`echo $LINE | cut -f 2 -d ' '`\r\n  mysql -u user -ppassword -e \"INSERT INTO Language (LanguageCode,LanguageEnglishName) VALUES('$CODE','$NAME');\" ostd\r\n  if [ $? -eq 0 ]\r\n  then\r\n    echo \"Inserted $CODE ($NAME)\"\r\n  else\r\n    echo \"Failed to insert $CODE ($NAME)\"\r\n  fi\r\ndone<\/pre>\n<\/blockquote>\n<p>Done, 190 records. And next time I want to update the list (who knows, it might happen) I&#8217;ll just need to get a new list and use the MySql feature that will let me either create or update a row depending on whether it already exists.<\/p>\n<p>I think it would have taken me quite a while to generate this list of sql commands by hand :)<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the things I will need in my database is a table with all the language codes used in Linux locales. Things like en, fr, es, etc. There are lots, but where do I get a reliable list? I&#8217;ve done some searching and found the IANA language subtag repository. It&#8217;s a 45000 line text &hellip; <\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,4],"tags":[],"class_list":{"0":"entry","1":"post","2":"publish","3":"author-andrew","4":"post-242","6":"format-standard","7":"category-ostd","8":"category-safeforseneca"},"_links":{"self":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/242","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/comments?post=242"}],"version-history":[{"count":4,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/242\/revisions"}],"predecessor-version":[{"id":254,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/242\/revisions\/254"}],"wp:attachment":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/media?parent=242"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/categories?post=242"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/tags?post=242"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}