{"id":1104,"date":"2015-11-14T23:15:43","date_gmt":"2015-11-15T04:15:43","guid":{"rendered":"http:\/\/littlesvr.ca\/grumble\/?p=1104"},"modified":"2015-11-14T23:32:15","modified_gmt":"2015-11-15T04:32:15","slug":"cc-registry-architecture","status":"publish","type":"post","link":"http:\/\/littlesvr.ca\/grumble\/2015\/11\/14\/cc-registry-architecture\/","title":{"rendered":"CC Registry &#8211; Architecture"},"content":{"rendered":"<p>In this series:<\/p>\n<ul>\n<li><a href=\"http:\/\/littlesvr.ca\/grumble\/2015\/11\/14\/cc-registry-what-its-all-about\/\">CC Registry &#8211; What it&#8217;s all about<\/a><\/li>\n<li>CC Registry &#8211; Architecture (this post)<\/li>\n<li><a href=\"http:\/\/littlesvr.ca\/grumble\/2015\/11\/14\/cc-registry-next-steps\/\">CC Registry &#8211; Next steps<\/a><\/li>\n<\/ul>\n<p>I&#8217;ll now describe the components making <a href=\"http:\/\/littlesvr.ca\/grumble\/2015\/11\/14\/cc-registry-what-its-all-about\/\">the registry<\/a> possible. All the source code for this is open source and available <a href=\"https:\/\/github.com\/CreativeCommons-Seneca\/registry\">from GitHub<\/a>.<\/p>\n<p><a href=\"http:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/11\/CC_Registry_subsystems_overview.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1108\" src=\"http:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/11\/CC_Registry_subsystems_overview.png\" alt=\"CC_Registry_subsystems_overview\" width=\"913\" height=\"298\" srcset=\"http:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/11\/CC_Registry_subsystems_overview.png 913w, http:\/\/littlesvr.ca\/grumble\/wp-content\/uploads\/2015\/11\/CC_Registry_subsystems_overview-300x98.png 300w\" sizes=\"auto, (max-width: 913px) 100vw, 913px\" \/><\/a><\/p>\n<h1>Data Source<\/h1>\n<p>When this goes into production &#8211; the images need to be hashed at the source. 
The bandwidth necessary to transfer all newly created images to a central server for registration would cost too much: literally millions of new images are uploaded every day.<\/p>\n<p>To accomplish that, one or more partnerships need to be set up, either directly with hosts of new content (Wikimedia Commons, Flickr, etc.) or with an intermediary such as the Internet Archive.<\/p>\n<h1>Hashes<\/h1>\n<p>We&#8217;ve run quite a bit of testing on different perceptual hashing mechanisms:<\/p>\n<ul>\n<li><a href=\"http:\/\/littlesvr.ca\/grumble\/2015\/04\/27\/perceptual-hash-comparison-phash-vs-blockhash-false-positives\/\" rel=\"bookmark\">Perceptual hash comparison: pHash vs Blockhash: false positives<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/03\/12\/perceptual-hash\/\" rel=\"bookmark\">Perceptual Hash for\u00a0image<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/05\/11\/making-images-from-alphabet-characters\/\" rel=\"bookmark\">Making images from alphabet\u00a0characters<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/05\/12\/making-sample-images-using-convert-utility\/\" rel=\"bookmark\">Making sample images using \u2018convert\u2019 utility<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/05\/13\/perceptual-hash-testing-for-text-image\/\" rel=\"bookmark\">Perceptual hash test for text\u00a0image<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/05\/19\/research-of-opensift\/\" rel=\"bookmark\">Research about OpenSIFT<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/05\/19\/research-about-opensift-2\/\" rel=\"bookmark\">Research about OpenSIFT\u00a02<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/05\/26\/content-based-image-retreval-concept-and-projects\/\" rel=\"bookmark\">Content Based Image Retreval : Concept and\u00a0Projects<\/a><\/li>\n<li><a 
href=\"https:\/\/hosunghwang.wordpress.com\/2015\/05\/28\/performance-test-of-phash\/\" rel=\"bookmark\">Performance test of\u00a0pHash<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/06\/01\/mh-image-hash-in-phash\/\" rel=\"bookmark\">MH Image Hash in\u00a0pHash<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/06\/02\/mh-image-hash-in-phash-2-test-result\/\" rel=\"bookmark\">MH Image Hash in pHash 2 : test\u00a0result<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/06\/02\/performance-test-of-phash-mh-image-hash\/\" rel=\"bookmark\">Performance Test of pHash : MH Image\u00a0Hash<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/06\/10\/mvp-tree-for-similarity-search\/\" rel=\"bookmark\">MVP Tree for similarity\u00a0search<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/06\/10\/mvp-tree-with-mh-hash-for-image-search\/\" rel=\"bookmark\">MVP Tree with MH Hash for Image\u00a0Search<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/06\/11\/mh-hash-mvp-tree-indexersearcher-for-mysqlphp\/\" rel=\"bookmark\">MH Hash, MVP-Tree indexer\/searcher for\u00a0MySQL\/PHP<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/06\/15\/pastec-test-result\/\" rel=\"bookmark\">Pastec test method and\u00a0result<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/06\/18\/pastec-analysis\/\" rel=\"bookmark\">Pastec analysis<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/06\/24\/pastec-test-for-real-image-datas\/\" rel=\"bookmark\">Pastec Test for real image\u00a0data<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/06\/28\/pastec-test-for-performance\/\" rel=\"bookmark\">Pastec Test for\u00a0Performance<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/08\/22\/dct-hash-matching-quality-for-resized-images\/\" rel=\"bookmark\">DCT Hash matching quality for 
resized\u00a0images<\/a><\/li>\n<li><a href=\"https:\/\/hosunghwang.wordpress.com\/2015\/08\/22\/dct-hash-matching-quality-for-resized-images-2\/\" rel=\"bookmark\">DCT HASH MATCHING QUALITY FOR RESIZED IMAGES\u00a02<\/a><\/li>\n<\/ul>\n<p>The result was obvious &#8211; the DCT algorithm from pHash won. It creates a 64-bit hash for an image of any original size and complexity. I was worried about too many false positives, but the results were very good, with a false positive rate under 1%.<\/p>\n<p>It&#8217;s not the fastest algorithm, but it only takes a couple of seconds to compute a hash, and because the hash is so small it can be stored in RAM for fast searches.<\/p>\n<h1>Database<\/h1>\n<p>The data is persisted in an off-the-shelf MySQL database. I was worried that querying a hundred million records would tax the system too much, but a simple select by primary key takes no time at all, so we didn&#8217;t investigate other options.<\/p>\n<p>The persistent database stores all the metadata, but not the thumbnails. Thumbnails take too much space, so each one is stored as a regular file on an XFS filesystem on a rotating drive. The path to the thumbnail is stored in the database.<\/p>\n<h1>Hash Matcher<\/h1>\n<p>This part needs speed. For each query we need to run an operation against every record in the database. Using any existing technology for that would have been too slow, so we developed our own.<\/p>\n<p>The hash matcher loads pairs of (DBprimaryKey, hash) into RAM as a singly-linked list. That&#8217;s 24 bytes per record. I made sure that a server with 32GB of RAM would have enough memory for 100 million records, and it does: at 24 bytes each that&#8217;s about 2.4GB, so there&#8217;s way more than enough.<\/p>\n<p>The matching operation is performed using the single-instruction <span class=\"st\"><em>POPCNT<\/em><\/span> operation via GCC&#8217;s <b>__builtin_popcount<\/b>(). 
One of my biggest worries was the time it would take per query, but it&#8217;s very fast: under one second against a hundred million records on a quad-core Xeon E5-1603 (an inexpensive workstation CPU).<\/p>\n<h1>Internal API<\/h1>\n<p>The hash matcher is not externally accessible; it&#8217;s used on localhost via a Unix domain socket. See <a href=\"https:\/\/github.com\/CreativeCommons-Seneca\/registry\/blob\/master\/daemon\/readme.md\">daemon\/readme.md<\/a> and <a href=\"https:\/\/github.com\/CreativeCommons-Seneca\/registry\/blob\/master\/daemon\/interface.md\">daemon\/interface.md<\/a> for usage info.<\/p>\n<h1>External API<\/h1>\n<p>The service is accessible to the world via a web API. This is implemented in PHP and is documented <a href=\"https:\/\/github.com\/CreativeCommons-Seneca\/registry\/tree\/master\/api\">here<\/a>.<\/p>\n<h1>Client Software<\/h1>\n<p>We built a <a href=\"http:\/\/cc-registry.littlesvr.ca\/ui\/\">demo page<\/a> (that&#8217;s also written in PHP but is separate from the API) to show that this system works. It has over a million images (that&#8217;s what we had the bandwidth to download) from Wikimedia and Flickr.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this series: CC Registry &#8211; What it&#8217;s all about CC Registry &#8211; Architecture (this post) CC Registry &#8211; Next steps I&#8217;ll now describe the components making the registry possible. All the source code for this is open source and available from GitHub. 
Data Source When this goes into production &#8211; the images need to &hellip; <\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8],"tags":[],"class_list":{"0":"entry","1":"post","2":"publish","3":"author-andrew","4":"post-1104","6":"format-standard","7":"category-creative-commons"},"_links":{"self":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/1104","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/comments?post=1104"}],"version-history":[{"count":11,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/1104\/revisions"}],"predecessor-version":[{"id":1125,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/posts\/1104\/revisions\/1125"}],"wp:attachment":[{"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/media?parent=1104"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/categories?post=1104"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/littlesvr.ca\/grumble\/wp-json\/wp\/v2\/tags?post=1104"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}