2007-04-05 21:36

Starting from Joe Gregorio's BitWorking post on 'megadata' I'm wondering what happens if you work this thing from the other end - rather than trying to make a database that scales well over distributed compute nodes, instead focus on optimising each node whilst trying not to compromise external scalability. Getting parallelism working, whether via network or just concurrent threads in the same machine, is harder than straight-ahead computing in a single node. Since my background is in engineering computing and simulation, I've little idea why it takes 15 minutes to load 10,000,000 triples into an RDF store, whereas reading the file itself takes seconds. So I'm trying to find the cost points for nodes, rather than looking at distribution schemes here. Unlike CPU hardware, I'm not sure that nodes have reached a point where they need parallelism for performance.

For this, I'm taking a node as a virtual processor and half a GiB RAM, mainly because my laptop has a single non-hyperthreaded core and 640 MiB RAM. At that size, a node doesn't have to handle more than a billion tuples, and I'm assuming a little bit that some of the strategies for distributing queries over nodes will still apply to such nodes. Limiting a node to a billion tuples also makes coding easier - you can use 32 bit identifiers, as long as they are (reasonably) local to the node. By 'reasonably local', I'd expect some number of IDs to be common to all nodes to represent the most frequently occurring term values.

Most of the RDF stores seem to be written in Java, and what I've seen of the implementations do a lot of pointer chasing, so this is a bit of fun to see what we can do with SSE2 to make an RDF-like query engine without pointer chasing. If each tuple in the store is a quad vector of 32 bit ints, we can do quite fast searches using SSE2. As John Sowa predicted, many triple-stores have found that triples aren't enough, and are quad stores now. Though maybe a hexa-store is best for human-oriented data, as most of the relations we come up with have arity less than seven.

Pipeline micro benchmark:

Since the SSE comparison ops return -1, we can bitmask the matches rather than using branches, which sometimes helps on pipelined CPU architectures. This appears to be quite nice for no branches and locality, and scales until you hit virtual memory:

// matches the second value in quad
__m128i* match2 (int value, __m128i* data, __m128i* result, size_t size) {
    __m128i mask(_mm_set_epi32(0, value, 0, 0));
    for (size_t i(0); i < size; ++i) {
        __m128i in(_mm_loadu_si128(data+i));
        __m128i cmp(_mm_shuffle_epi32(_mm_cmpeq_epi32(in, mask), SPREAD_EPI32(1)));
        _mm_storeu_si128(result, _mm_and_si128(in, cmp));
        result += _mm_extract_epi16(cmp, 0) & 1;
    }
    return result;
}

If the second value in a quad matches the test value, the result of cmpeq is -1 in that lane and 0 in the others, otherwise it's all zeros; this is spread over the quad to be either all -1s or all 0s. This is and'ed with the original, and written to the result. It's also and'ed with the increment for the result pointer, so either an all-zero quad is written and result is not advanced, or the matching quad is written and result is incremented.

(Actually, it's better not to use SIMD in this case, see match_bench_solo no_sse2 below. I don't seem to be able to get advantages from SIMD in this code very often here.)
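(SPREAD_EPI32 isn't shown above; a guess at what it's assumed to do - build the _mm_shuffle_epi32 immediate that broadcasts 32 bit lane n to all four lanes, the same as _MM_SHUFFLE(n, n, n, n):)

// assumed helper: shuffle immediate that copies lane n to all four lanes,
// e.g. SPREAD_EPI32(1) == 0x55, so _mm_shuffle_epi32(x, SPREAD_EPI32(1))
// fills the register with lane 1 of x
#define SPREAD_EPI32(n) (((n) << 6) | ((n) << 4) | ((n) << 2) | (n))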
$ time match_bench 10 16
number of quads: 10485760
memory required: 320 MiB
... 163840 results ...
real 0m8.945s  user 0m7.460s  sys 0m0.640s

So 16 runs of a query which tests whether the 2nd term in the quad matches one of four values, run over a 10 mega-triple set, takes ~9 seconds.

To match multiple values, the term being queried can be spread over the whole quad, then compared with a mask holding 4 of the values, then the comparison results spread and or'ed with each other.

$ time match_bench_x4 10 16
number of quads: 10485760
memory required: 320 MiB
... 163840 results ...
real 0m4.045s  user 0m3.040s  sys 0m0.624s

So this is the best case for the matching - matching one of four values against the 2nd term in the quad takes ~4 seconds instead of 9 with the spread and merge approach. Given some of this is overhead, it's still pretty good for non-indexed queries. It's irrelevant which term is matched, I just happened to use the 2nd term for the micro benchmark. Non-indexed queries matter because if you can do them fast, you don't have to do so much work to make adding tuples to the store scale.

--

Importing data:

I'm working from this dataset: which is an SQL dump, so to get warmed up there needs to be a conversion of that file to triples (in rather verbose Turtle). Which also needs a reasonable file reader/delimited string parser. So this was my first cut of the converter, which used read(fd, buf, len) each time it needed to read:

performance of sql2triples base:
$ time sql2triples triples.sql 15 > triples.triples.15
real 0m19.809s  user 0m1.176s  sys 0m16.633s
$ time sql2triples triples.sql 150 > triples.triples.150
real 2m44.558s  user 0m10.125s  sys 2m30.293s

--

Commented out the output of triples to see what the output formatting costs:
$ time sql2triples triples.sql 150
real 2m31.191s  user 0m8.617s  sys 2m15.068s

--

Block read benchmark for comparison:
$ time readbench triples.sql 150
real 0m0.505s  user 0m0.000s  sys 0m0.236s

--

So it's doing something wrong - 150s of its 165s runtime is doing stuff other than efficient IO; reading SQL shouldn't take that long. First optimisation: turn the reader into block mode internally, so it reads 16K blocks instead of reading from the file each time its read() method is called:

$ time sql2triples triples.sql 15 > triples.triples.15
real 0m1.596s  user 0m0.628s  sys 0m0.828s
$ time sql2triples triples.sql 150 > triples.triples.150
real 0m8.690s  user 0m4.080s  sys 0m4.252s

--

filereadertest uses the reader to read 1K blocks and writes them to std::cout. So a direct copy of the input takes 6.5 seconds:
$ time filereadertest triples.sql 150 > triples.sql.150
real 0m6.515s  user 0m0.772s  sys 0m1.604s

so parsing and reformatting 150 MiB of data costs 2.2 seconds. That's probably good enough for now; what will matter next is the unique hashing code for converting the triples into binary.

--

The hash table is based on a simple linear probe 2^N map. I'm assuming throughout that the design goal is a billion triples - basically so I can get away with 32 bit ids, as they mean you can fit a quad into an SSE register. More than that, and you'd probably want a distributed engine anyway. You could do triples of 42 bit ids and complicate things, but you lose the fast swizzles and comparisons. You could use 64 bit, which reduces throughput and the number of tuples that can fit into physical memory by a factor of two. For this demonstration, a billion strings seems enough.
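(A minimal sketch of the kind of linear probe 2^N string-to-id map meant here; the names, the hash and the lack of rehashing are my assumptions, not the actual implementation:)

#include <cstdint>
#include <string>
#include <vector>

// linear probe map from string to 32 bit id; 2^N slots, 0 means empty.
// no rehash shown - the table is assumed to stay sparse (see the loading
// experiments further down).
class string_id_map {
    std::vector<uint32_t> _ids;        // slot -> id + 1, 0 = empty
    std::vector<std::string> _values;  // id -> string value
    size_t _mask;
public:
    explicit string_id_map (size_t log2_size)
        : _ids(size_t(1) << log2_size), _mask((size_t(1) << log2_size) - 1) {}
    uint32_t get_id (const std::string& value) {
        size_t slot(hash(value) & _mask);
        while (_ids[slot]) {
            if (_values[_ids[slot] - 1] == value) return _ids[slot] - 1;
            slot = (slot + 1) & _mask;   // linear probe to the next slot
        }
        _values.push_back(value);
        _ids[slot] = uint32_t(_values.size());
        return uint32_t(_values.size() - 1);
    }
    static size_t hash (const std::string& value) {
        size_t h(5381);
        for (size_t i(0); i < value.size(); ++i) h = h * 33 + (unsigned char)value[i];
        return h;
    }
};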
Ok, so we read in the triple, hash its spo (subject-predicate-object) as strings, linear probe the table to get a unique id, and output the hash table (hash to string table offset), a string table (offset to value), and an image of the quad-store - each quad is a 128 bit tuple of the unique 32 bit ids of its four strings. Getting the id of a value, and merging quad-stores, requires pointer-chasing. Querying reduces to a set of OR matches, done using a simple loop over the binary tuple store; each yields a smaller list of candidates, which are then joined by matching against the shortest list.

Since the stores can be read directly from binary files, it should take half a second to load a 10 Mi tuple store (160 MiB), and about the same to run the initial query on it to find candidate matches for each term. That could be reduced with indexing, but sorted indexing costs on insertions. You could use 2^4N buckets for dividing the space up - for N=2 you'd be putting into bucket 00.01.10.01 each quad whose terms' 2 least significant bits match the bucket id, which reduces the search space for each query by 2^N. I'm not sure if having many stores would help or not - with 256 buckets you'd have to read 64 files to run the simplest query, with 4096, 512. However, with bucket semi-indexing you don't have insertion costs.

Anyway, the test dataset is 10 Mi triples, so it's the same size as the micro-benchmark queries above, so a query such as p = a or p = b or p = c should take less than half a second. I've no idea whether that's good compared to SQL databases, but it's better than the SPARQL ones given as examples, though worse than Katie's from http://xtech06.usefulinc.com/schedule/paper/18 :

Size of store (millions of triples):     2      10      77      99     152
DCIDENT (ms):                           17      15      25      32      23
SPARQL (ms):                           385     987    1048    1128    1465
SPARQL naive (ms):                    1404   21237  123547  161112  212844

$ time match_bench_solo 10 32
real 0m5.415s  user 0m4.136s  sys 0m0.564s
$ time match_bench_solo 10 64
real 0m9.619s  user 0m8.005s  sys 0m0.612s
(9619-5415)/32 = 131
$ time match_bench_solo 10 32 no_sse2
real 0m4.393s  user 0m3.276s  sys 0m0.540s
$ time match_bench_solo 10 64 no_sse2
real 0m7.815s  user 0m6.164s  sys 0m0.648s
(7815-4393)/32 = 107

An ident match using match_bench takes ~100ms for 10 Mi triples, which isn't bad for a non-indexed store. You'd need to partition at least 8 ways on each term, giving 2^(4*3) = 4096 buckets, to match the above id lookup for 10 Mi tuples. Scaling would be linear. So indexing at least on ID is required to be competitive, and the bucket approach for quads probably doesn't scale much beyond that.

Actually, trying again, it does seem that the sse2 versions of the match are slower than using 32 bit ints directly:

$ time match_bench_x4 10 16
real 0m3.496s  user 0m2.524s  sys 0m0.564s
$ time match_bench_x4 10 16 no_sse
real 0m3.149s  user 0m2.196s  sys 0m0.604s

Apparently, removing branching from the sse2 code doesn't help in this case, and the non-sse2 code is faster still.

// matches the second value in quad without using sse2
__m128i* match2_x4_no_sse (int value0, int value1, int value2, int value3,
    __m128i* data, __m128i* result, size_t size) {
    int* idata((int*)data);
    for (size_t i(0), imax(size*4); i < imax; i += 4) {
        int value(idata[i + 2]);
        if ((value == value0) || (value == value1) || (value == value2) || (value == value3)) {
            _mm_storeu_si128(result, _mm_loadu_si128(data+(i/4)));
            ++result;
        }
    }
    return result;
}

Changing the _mm_storeu_si128 to four integer memory ops doesn't change performance measurably.
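(For comparison, the spread-and-merge SSE2 variant described above might look something like this sketch - it folds the four comparison lanes with _mm_movemask_epi8 rather than the spread-and-or wording, and uses a branch rather than the branchless store; names are assumptions and this is not the benchmarked code:)

// matches the second value in a quad against any of four values using SSE2
__m128i* match2_x4_sse (int value0, int value1, int value2, int value3,
    __m128i* data, __m128i* result, size_t size) {
    __m128i values(_mm_set_epi32(value3, value2, value1, value0));
    for (size_t i(0); i < size; ++i) {
        __m128i in(_mm_loadu_si128(data + i));
        // broadcast the term being tested to all four lanes
        __m128i term(_mm_shuffle_epi32(in, _MM_SHUFFLE(2, 2, 2, 2)));
        // compare against the four candidate values at once
        __m128i cmp(_mm_cmpeq_epi32(term, values));
        // any equal lane gives a non-zero byte mask
        if (_mm_movemask_epi8(cmp)) {
            _mm_storeu_si128(result, in);
            ++result;
        }
    }
    return result;
}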
Changing to using aligned arrays helps the sse2 version a little, but it's still slower. The question of whether SSE2 helps for multiple term matches needs investigating.

$ time match_bench_x4 10 100 no_sse
real 0m13.378s  user 0m11.349s  sys 0m0.636s
$ time match_bench_x4 10 200 no_sse
real 0m24.063s  user 0m21.013s  sys 0m0.704s
(21013-11349)/100 = 96 ms/query

If it takes ~1s to pass once over a 100 Mi tuple data set, it should take tens of seconds to create an index for a given predicate, or a few minutes for a giga-store. Leave that for now; it's good enough. I may be tempted to code up a few variants and let the query engine choose based on trends, but that's a while off.

--

Ok, where are we now. We've tried a few implementations for non-indexed scans of the data set in a compact binary form. I don't think it's entirely coincidental that the gzip of a 10 Mi triple dump is a 160 MiB file, and 10 Mi quads saved as binary would be a 180 MiB file. A triple is about yay-much information content (waves hands). We've shown it's not expensive to load data from disk. That extends to queries, so as long as we assume we can serial scan as the first pass of any query, that's reasonable performance - it's well within the ballpark. I'm assuming that the results of the first scan are kept in memory, and subsequent scans or joins are done faster as they're operating on fewer items. We've also got reasonable performance for a parser that reads the sql dump and outputs triples. So now let's address getting these triples into the right form to operate on, and go back to the hashing. In a full system, you'd need to be able to combine quad-stores; I'm leaving that alone for now, and just hashing the 10 Mi triple dataset I'm working with.

First cut of string map:
- hash function for string, returns length and hash value
- hash map based on linear increment of the hash value until an unused slot is found.
Initially loading from sql rather than triples, since I've already got the code to do that.

Moving the code to do unicode escapes into a utils.cpp function hits performance in sql2triples; instead of the form:
std::string turtle_encode (const std::string& value);
use the form:
void write_turtle_encoded (std::ostream& buf, const std::string& value);
otherwise there's a factor of two slowdown:

using turtle_encode:
$ time sql2triples triples.sql 150 > triples.triples.150
real 0m18.661s  user 0m5.824s  sys 0m4.544s
using write_turtle_encoded:
$ time sql2triples triples.sql 150 > triples.triples.150
real 0m9.084s  user 0m4.032s  sys 0m4.136s

I am surprised that std::ostringstream is that bad a hit, but there you go.

Continuing to pull the sql reading code out, using a triple-target interface so the sql_reader can push data to either a triple-writer or a triple-store.

Using virtual method calls in sql_reader.write_to(triple_target& target):
$ time sql2triples triples.sql 150 > triples.triples.150
real 0m9.747s  user 0m4.392s  sys 0m4.268s
Using a template in sql_reader.write_to(T& target):
$ time sql2triples triples.sql 150 > triples.triples.150
real 0m8.297s  user 0m4.052s  sys 0m3.920s

Created a dummy triple-target for sql import:
$ time sqlimport triples.sql 150
#read 150 MiB.
real 0m2.417s  user 0m2.104s  sys 0m0.128s
$ time sqlimport triples.sql
#read 1716 MiB.
real 1m14.737s  user 0m28.818s  sys 0m2.704s
These seem to use around 50% cpu, so are possibly io bound.
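(A minimal sketch of the two write_to dispatch forms being compared above; the reader internals are stubbed out and the names are assumptions, not the actual interfaces:)

struct triple_target {
    virtual void triple (const char* s, const char* p, const char* o) = 0;
    virtual ~triple_target () {}
};

struct sql_reader {
    // stubbed-out parsing - the real reader pulls rows through the block reader
    bool read_row () { return false; }
    const char* subject () { return ""; }
    const char* predicate () { return ""; }
    const char* object () { return ""; }

    // virtual-dispatch form: one indirect call per triple
    void write_to (triple_target& target) {
        while (read_row()) target.triple(subject(), predicate(), object());
    }
    // template form: the target's triple() can be inlined into the loop
    template <typename T> void write_to (T& target) {
        while (read_row()) target.triple(subject(), predicate(), object());
    }
};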
Ok, first cut of string_hashing:
$ time sqlimport triples.sql 15 > triples.int.15
real 1m43.726s  user 1m37.034s  sys 0m1.320s
Not very good.

Using a comparison of the hash before using std::string operator== :
$ time sqlimport triples.sql 15 > triples.int.15
real 0m26.890s  user 0m24.946s  sys 0m0.916s
So there's quite an overhead on the std::string again. Stopped the 150 MiB test after 5 mins.

This is writing out the triples as decimal, rather than binary, at the moment:
...
1734213532 => "34f15571c95b532edbfe9d7da0c10f3cb4e23493"
3074817951 => "76f0e56a8a818aea4c12a4d626260a239f6df066"
1939079077 => "650eafd6100f3ab69e410a084d481b6ca0cbc8a3"
361627558 => "e020c9:1048124e9e5:-432a[@7057]"
183631784 => "I am not a college professor with 2 kids, I'm a writer, mother, and grandmother, who has one goofy german shepard dog, a fluffy kitty (she is NOT fat!) and a world full of friends I couldn't do without."
2231369665 => "just a 30-something, trying to figure it all out"
2665086921 => "http://www.amateurgourmet.com/"
87424997 => "http://leegoldberg.typepad.com/"
3283353580 => "5448634688e9f2b4d7381aede77f63c7722886d6"

Obviously, the triples have locally repeating strings, eg 3723984522 in the first three. Adding a cache of the 8 most recently used items into the string table yields:
$ time sqlimport triples.sql 15 > triples.int.15
real 0m27.240s  user 0m25.538s  sys 0m0.976s
That didn't help at all; moving the check of the recent cache to after the first hash table lookup attempt has missed means it's not being used.
$ time sqlimport triples.sql 15 > triples.int.15
real 0m27.049s  user 0m25.222s  sys 0m1.128s
$ time sqlimport triples.sql 30 > triples.int.x
real 2m14.507s  user 2m8.232s  sys 0m2.072s

That's rather too non-linear, approx O(N^2). That suggests the hash table isn't behaving as O(1), which is usually related to loading. Changing the loading on the hash table from 75% to 25% gives:
$ time sqlimport triples.sql 15 > triples.int.15
real 0m1.915s  user 0m0.840s  sys 0m0.944s
$ time sqlimport triples.sql 150 > triples.int.150
real 0m15.992s  user 0m9.289s  sys 0m5.408s

The reference for performance was 150 MiB in 15 minutes; we're not quite 60 times faster at this point.

$ time sqlimport triples.sql 1500 > triples.int.1500
#read 501 MiB.
real 2m21.728s  user 1m53.039s  sys 0m12.329s
Stopped it after 501 MiB; it does seem to be getting slower. Added more metrics, using clock() to time the loading:

#read 15 MiB (8.62069 MiB/s). 106163 triples (61013.2 triples/s).
#read 30 MiB (12.2951 MiB/s). 189610 triples (77709.0 triples/s).
#read 45 MiB (14.1509 MiB/s). 278842 triples (87686.2 triples/s).
#read 60 MiB (15.0754 MiB/s). 387876 triples (97456.3 triples/s).
#read 90 MiB (16.0428 MiB/s). 580790 triples (103528.0 triples/s).
#read 150 MiB (12.5313 MiB/s). 942939 triples (78775.2 triples/s).
#read 300 MiB (7.25514 MiB/s). 1781925 triples (43093.7 triples/s).
#read 450 MiB (4.40055 MiB/s). 2781531 triples (27200.6 triples/s).
#read 600 MiB (3.11737 MiB/s). 3807691 triples (19783.3 triples/s).

Definitely there's a slow-down there; memory use isn't high enough to be hitting virtual memory. Making the hash table even sparser (12.5%) increases performance significantly:
#read 600 MiB (17.4419 MiB/s). 3807691 triples (110689 triples/s).
#read 675 MiB (17.215 MiB/s). 4320159 triples (110180 triples/s).
but it falls off afterwards:
#read 700 MiB (16.0293 MiB/s). 4495263 triples (102937 triples/s).
#read 800 MiB (13.6495 MiB/s). 5195900 triples (88652.1 triples/s).
#read 900 MiB (12.2332 MiB/s). 5890576 triples (80067.6 triples/s).
#read 1000 MiB (11.3392 MiB/s). 6581952 triples (74633.8 triples/s).
#read 1100 MiB (10.6962 MiB/s). 7239014 triples (70391 triples/s).
#read 1190 MiB (10.2295 MiB/s). 7827255 triples (67284.9 triples/s).

So looking again at the hash, and adding a bit-swizzle to it. This improves speed, but still has a fall-off around 700 MiB:
#read 500 MiB (17.9469 MiB/s). 3122718 triples (112086 triples/s).
#read 600 MiB (17.9319 MiB/s). 3807691 triples (113798 triples/s).
#read 700 MiB (17.6545 MiB/s). 4495263 triples (113374 triples/s).
#read 800 MiB (15.2149 MiB/s). 5195900 triples (98818.9 triples/s).
#read 900 MiB (13.5074 MiB/s). 5890576 triples (88407.3 triples/s).
#read 1000 MiB (12.3533 MiB/s). 6581952 triples (81308.9 triples/s).
#read 1100 MiB (11.7521 MiB/s). 7239014 triples (77339.9 triples/s).
#read 1200 MiB (11.1669 MiB/s). 7894748 triples (73466.9 triples/s).

Changing to a higher loading, to see if the swizzle gives a decent spread of the hashes.

50% loading:
#read 30 MiB (1.13507 MiB/s). 189610 triples (7174.04 triples/s).
uses 100% cpu

25% loading:
#read 100 MiB (17.1821 MiB/s). 640108 triples (109984 triples/s).
but uses 100% cpu and falls off after 200:
#read 200 MiB (14.0746 MiB/s). 1223296 triples (86087 triples/s).
#read 300 MiB (11.6414 MiB/s). 1781925 triples (69147.3 triples/s).
#read 400 MiB (8.91663 MiB/s). 2444720 triples (54496.7 triples/s).
#read 500 MiB (6.94444 MiB/s). 3122718 triples (43371.1 triples/s).

12.5% loading is above; at 6.75% loading, although the process clock still reads high throughputs, cpu usage plummets and real time increases past 700 MiB as it starts to use virtual memory:
#read 730 MiB (15.0889 MiB/s). 4704741 triples (97245.6 triples/s).
real 4m5.772s  user 0m26.958s  sys 0m22.037s

So we keep to 12.5% loading, and look into streaming the strings to disk rather than keeping them in main memory, or compressing them, or both.

#read 675 MiB (18.9873 MiB/s). 4320159 triples (121523 triples/s). strings: 2097153
unique strings/tuple = 2097153/4320159 ~= 1/2
I was expecting fewer. Hmm. So the options are:
. code my own disk cache,
. compress using a dumb scheme, such as lzw,
. compress using a trie,
. compress by searching for RDF namespaces

Found some lzw code on the web, but it's very MFC contaminated. Decontaminated it, but introduced some bugs. It's now 2007-04-07 07:00, and 'Today' is starting, which is well past bed-time.

...

It's now 2007-04-07 13:50, and I got up in time for 'From Our Own Correspondent', had a shower, listened to 'The Now Show' and had some lunch and a coffee, then read some blogs. I may be insane but at least I'm hygienic.

A thought - since we have quads not triples, we can do named graphs or concept-graph style containment.

If the compressed file with the 10 Mi triples is 160 MiB, then all the strings should fit into my laptop's 512 MiB memory. There's not a lot of repetition in the individual strings, so the compression needs to be across strings. There are also several copying operations when loading - const char* sql_reader::read_quoted_string () copies from the buffer it uses to read from the file to the buffer it returns. That buffer is then copied when it's passed to a std::string constructor, which is then copied when it's added to the table. Also the length of the quoted string is lost, and so it has to be scanned again. It's looking like I need a different string implementation, which will intern the strings directly from the buffer.

--

2007-04-07 16:11 Got the lzw compression to work with strings; now changing to operate on char* buffers.
The test case works with strings, so wrote a simple 'read in a block and compress it, repeat' benchmark:
$ time lzw_bench triples.sql 15
read: 15 MiB, compressed to: 2 MiB.
real 0m9.763s  user 0m9.161s  sys 0m0.060s
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 18 MiB.
real 2m15.826s  user 2m9.788s  sys 0m1.164s
$ time readbench triples.sql 150
real 0m5.961s  user 0m0.008s  sys 0m0.284s

So it has a good enough compression ratio - our string table should fit in ~300 MiB - but it's very slow.

Replacing the lzw::dictionary::element struct with a bit pattern:
$ time lzw_bench triples.sql 15
real 0m8.592s  user 0m7.588s  sys 0m0.104s
$ time lzw_bench triples.sql 150
real 1m52.598s  user 1m47.675s  sys 0m0.792s

Replacing the recursive write_bytes method with an iterative write-forward-and-reverse one - maybe a little; it doesn't affect the compression though.

Changing the lookup from:
int dictionary::get_entry (int prefix, char letter) {
    return _lookup[(prefix << 8) | letter] - 1;
}
to make use of the fact that all keys less than 128 map to themselves:
int dictionary::get_entry (int prefix, char letter) {
    unsigned int key((prefix << 8) | letter);
    return (key < 128) ? key : (_lookup[key] - 1);
}
$ time lzw_bench triples.sql 15
real 0m7.853s  user 0m7.372s  sys 0m0.088s

Still only seem to be chipping at the edges though. Turning the compression off, by writing the bytes passed into compress() straight out using compress_data(), gives fast throughput:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 150 MiB.
real 0m7.344s  user 0m2.184s  sys 0m0.164s

So it's this part that's slow:
// check if the prefix and current byte exist in the dictionary
int result(_dictionary.get_entry(prefix, current));
// either add to the dictionary, or change the prefix to the one in the dictionary
if (result == not_in_dictionary) {
    calculate_bit_size(_dictionary.add_entry(prefix, current));
    compress_data(destination, written, limit, prefix);
    prefix = source[index];
} else {
    prefix = result;
}

In the absence of a profiler, attack it by binary-search commenting-out of lines.

With int result(_dictionary.get_entry(prefix, current)); uncommented, and the rest commented out:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 150 MiB.
real 0m16.293s  user 0m12.013s  sys 0m0.132s
With everything working except _dictionary.add_entry(prefix, current):
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 150 MiB.
real 0m20.237s  user 0m16.901s  sys 0m0.136s
From above, the complete code takes:
$ time lzw_bench triples.sql 150
real 1m52.598s  user 1m47.675s  sys 0m0.792s

So we have these approximate costs:
_dictionary.get_entry(prefix, current)   16.293s - 7.344s  =>  9s
calculate_bit_size()                     20.237s - 16.293s =>  4s
_dictionary.add_entry                    112.598s - 20.237s => 90s

Obviously, the main cost is in the add_entry function:
int dictionary::add_entry (int prefix, char letter) {
    unsigned int elem(element(prefix, letter));
    _elements.push_back(elem);
    _lookup[elem] = _elements.size();
    return _elements.size() - 1;
}

Commenting out the push_back and the map operator[] calls gives 21s runtime.
Commenting out just the push_back call gives 14s runtime.
Commenting out just the map operator[] call gives 14s runtime.
So there's some feedback between the two. Removing the map and using a linear probe of the _elements vector kills performance:
$ time lzw_bench triples.sql 15
read: 3 MiB, compressed to: 0 MiB.
real 1m11.540s
Alternatives are a hash table or a trie.
...
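(Of those alternatives, a flat open-addressed hash table keyed on the packed (prefix, letter) value might look like this sketch; the sizes, names and multiplicative hash are assumptions:)

#include <cstdint>
#include <vector>

// open-addressed lookup for lzw: key is (prefix << 8) | letter,
// value is the code assigned to that (prefix, letter) pair.
class lzw_lookup {
    std::vector<uint64_t> _slots;  // packed key:32 | (code + 1):32, 0 = empty
    size_t _mask;
public:
    explicit lzw_lookup (size_t log2_size)
        : _slots(size_t(1) << log2_size), _mask((size_t(1) << log2_size) - 1) {}
    int get (uint32_t key) const {
        size_t slot(hash(key) & _mask);
        while (_slots[slot]) {
            if (uint32_t(_slots[slot] >> 32) == key)
                return int(_slots[slot] & 0xffffffff) - 1;
            slot = (slot + 1) & _mask;
        }
        return -1;  // not_in_dictionary
    }
    void add (uint32_t key, int code) {
        size_t slot(hash(key) & _mask);
        while (_slots[slot]) slot = (slot + 1) & _mask;
        _slots[slot] = (uint64_t(key) << 32) | uint32_t(code + 1);
    }
    static size_t hash (uint32_t key) {
        key *= 2654435761u;          // Knuth multiplicative mix
        return key ^ (key >> 16);
    }
};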
2007-04-07 18:51 After having tea, it occurred to me that it may be that some of the bit sizes are less efficient, and that's what's hitting it. Changing compression::calculate_bit_size to always use 13 bits had no effect on performance, but did uncover a buffer overflow bug where destination[written] = 0; was called when written == limit. Normally in the tests the destination limit is not reached, as the compressed data is no larger than the original.

Changing get_entry to return -1 for all keys > 128 gives:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 150 MiB.
real 0m13.413s  user 0m7.368s  sys 0m0.744s

So what costs is when get_entry returns an index for a key with a prefix:

performing the lookup and returning -1:
return (key < 128) ? key : (_lookup[key], - 1);
read: 150 MiB, compressed to: 150 MiB.
real 0m14.667s  user 0m9.173s  sys 0m0.692s

performing the lookup and returning its value - 1:
return (key < 128) ? key : (_lookup[key] - 1);
gives
read: 150 MiB, compressed to: 18 MiB.
real 1m50.886s  user 1m45.375s  sys 0m0.788s

Using the proper lookup, and replacing
prefix = result;
with
compress_data(destination, written, limit, prefix);
prefix = source[index];
in compress gives fast throughput again - there's something going on that makes prefixes slow the system down:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 150 MiB.
real 0m7.867s  user 0m3.864s  sys 0m0.184s

Performing the next lookup on the prefix increases the cost:
$ time lzw_bench triples.sql 150
real 0m27.809s  user 0m22.641s  sys 0m0.160s
and doing all the lookups without writing costs much the same as with the output:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 0 MiB.
real 1m56.095s  user 1m46.143s  sys 0m0.584s

First cut at using a trie: it runs out of memory and starts thrashing after 100 MiB, but is very fast before that:
read: 83 MiB, compressed to: 10 MiB.
real 0m17.360s  user 0m12.861s  sys 0m0.656s
Reducing the default size of a trie node to 4 children keeps it just under the memory limit for 150 MiB:
read: 150 MiB, compressed to: 14 MiB.
real 0m27.394s  user 0m21.805s  sys 0m0.656s

That's fast enough, but we're using all our memory for the trie now. So there are two ways to go from here - reduce the size of the trie, and/or limit the size of the dictionary. Adding a max_size parameter to the dictionary, also passed in to the benchmark, gives a range of speed/compression trade-offs:
$ time lzw_bench triples.sql 150 1024
read: 150 MiB, compressed to: 67 MiB.
real 0m11.500s  user 0m7.444s  sys 0m0.148s
$ time lzw_bench triples.sql 150 10240
read: 150 MiB, compressed to: 48 MiB.
real 0m10.626s  user 0m6.444s  sys 0m0.144s
$ time lzw_bench triples.sql 150 102400
read: 150 MiB, compressed to: 38 MiB.
real 0m8.533s  user 0m6.756s  sys 0m0.108s
$ time lzw_bench triples.sql 150 1024000
read: 150 MiB, compressed to: 24 MiB.
real 0m15.300s  user 0m10.981s  sys 0m0.244s
$ time lzw_bench triples.sql 150 10240000
read: 150 MiB, compressed to: 14 MiB.
real 0m25.376s  user 0m21.685s  sys 0m0.656s

So somewhere between 1 million and 10 million entries there's a 1:10 compression ratio, and a spot where we should be able to get the trie and the compressed data all into memory at once.

---

2007-04-07 20:44 Performance budgeting:
From above, importing the triples (before it starts thrashing) gives:
$ time sqlimport triples.sql 150 > triples.int.150
real 0m15.992s  user 0m9.289s  sys 0m5.408s
So the cost of the compression is about the same as the parsing.
That's a reasonable starting point - a system which is pants in all respects has no bottlenecks ;->.

---

Adding some metrics to see what the trie's up to - number of leaf nodes:

total nodes: 1024
leaf nodes: 442
nodes with 1 children: 371
nodes with 2 children: 129
nodes with 3 children: 61
nodes with 4 children: 21
nodes with 128 children: 1

total nodes: 10240
leaf nodes: 2923
nodes with 1 children: 5627
nodes with 2 children: 910
nodes with 3 children: 455
nodes with 4 children: 325
nodes with 128 children: 1

total nodes: 102400
leaf nodes: 31095
nodes with 1 children: 53379
nodes with 2 children: 9112
nodes with 3 children: 4587
nodes with 4 children: 4227
nodes with 128 children: 1

total nodes: 1024000
leaf nodes: 306840
nodes with 1 children: 538649
nodes with 2 children: 91969
nodes with 3 children: 44883
nodes with 4 children: 41659
nodes with 128 children: 1

total nodes: 5618083
leaf nodes: 1661616
nodes with 1 children: 2995538
nodes with 2 children: 491091
nodes with 3 children: 239117
nodes with 4 children: 230721
nodes with 128 children: 1

Most of the nodes only have one child, and a significant number are leaves. Only the root has more than four children. The current default allocation of 4 children can be changed to zero, one or two.

With trie_nodes defaulting to space for zero children it crashes. There was a bug which was causing the trie_nodes never to reallocate:
trie_node* get (char value) {
    return _data[value&_mask];
}
so a node always appeared to have a child, hence no nodes with more than four children. Changed to test whether the value matches, rather than just the hash:
trie_node* get (char value) {
    trie_node* node(_data[value&_mask]);
    return (node && (node->_value == value)) ? node : null;
}
And we get a much more sensible distribution of nodes (I should have noticed a smell in the above results):

$ time lzw_bench triples.sql 150 10485760
...
read: 150 MiB, compressed to: 18 MiB.
total nodes: 7191955
leaf nodes: 2734838
nodes with 1 children: 3587938
nodes with 2 children: 406975
nodes with 3 children: 166758
nodes with 4 children: 87294
nodes with 5 children: 52989
nodes with 6 children: 34981
nodes with 7 children: 25054
nodes with 8 children: 19219
nodes with 9 children: 16143
nodes with 10 children: 11744
nodes with 11 children: 7720
nodes with 12 children: 5879
nodes with 13 children: 4823
nodes with 14 children: 4160
nodes with 15 children: 3742
nodes with 16 children: 4006
nodes with 17 children: 2907
nodes with 18 children: 2680
nodes with 19 children: 1914
nodes with 20 children: 1318
nodes with 21 children: 973
...
nodes with 39 children: 114
nodes with 40 children: 93
...
nodes with 73 children: 10
nodes with 74 children: 5
...
nodes with 255 children: 1
real 0m50.855s  user 0m28.898s  sys 0m1.036s

Now we can look at more meaningful compression ratios:
$ time lzw_bench triples.sql 150 262144
read: 150 MiB, compressed to: 66 MiB.
real 0m15.615s  user 0m10.325s  sys 0m0.196s
$ time lzw_bench triples.sql 150 1048576
read: 150 MiB, compressed to: 55 MiB.
real 0m16.737s  user 0m12.669s  sys 0m0.256s
$ time lzw_bench triples.sql 150 4194304
read: 150 MiB, compressed to: 27 MiB.
real 0m24.691s  user 0m19.401s  sys 0m0.524s
$ time lzw_bench triples.sql 150 67108864
read: 150 MiB, compressed to: 18 MiB.
real 0m34.298s  user 0m26.902s  sys 0m1.052s

Though the next phase really should be testing whether we can decompress our compressed strings (lzw_test only tests one short string).
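(A sketch of the kind of round-trip check lzw_test needs - compress a block, decompress it, and compare with the original; the codec function signature here is an assumption standing in for the real compression class:)

#include <cstring>
#include <vector>

// 'compress' and 'decompress' stand in for the real lzw implementation;
// each returns the number of bytes written to dest.
typedef size_t (*codec_fn) (char* dest, size_t limit, const char* src, size_t size);

bool round_trip (codec_fn compress, codec_fn decompress, const char* data, size_t size) {
    std::vector<char> packed(size * 2 + 16);   // lzw can expand in the worst case
    std::vector<char> restored(size);
    size_t c(compress(&packed[0], packed.size(), data, size));
    size_t d(decompress(&restored[0], restored.size(), &packed[0], c));
    return d == size && std::memcmp(data, &restored[0], size) == 0;
}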
However, let's see if we're anywhere near close to getting the string table in memory:
$ time lzw_bench triples.sql 1750 1048576
read: 1715 MiB, compressed to: 614 MiB.
real 3m35.814s  user 2m19.977s  sys 0m2.736s
1048576 nodes uses 65 MiB when running + 614 MiB data; still just too big.
$ time lzw_bench triples.sql 1750 4194304
read: 1715 MiB, compressed to: 513 MiB.
real 3m53.596s  user 2m49.559s  sys 0m3.048s
4194304 nodes uses 275 MiB when running + data; still too big. Costs 210 MiB to compress by an extra 100 MiB, so try scaling the other way:
$ time lzw_bench triples.sql 1750 262144
read: 1715 MiB, compressed to: 754 MiB.
real 2m55.980s  user 1m51.615s  sys 0m2.456s
262144 nodes uses 18 MiB when running, and 754 MiB data.

There are two ways from here - use unions in the nodes to reduce the footprint of each, and/or work out how to unload chunks of the string table. I can see a lot of configuration parameters happening.

...

So we have a budget as follows: around 512 MiB to work with. A 10 mega-triple store takes 160 MiB uncompressed. Need at least two images to work with, so that's 320 MiB. There's not a lot of point using more than 32 MiB for the scaffolding for the compression. That leaves 160 MiB for a string table. So only 1/5 of the strings corresponding to the triples can be loaded at a go. That's a problem for generating queries, so we need a stable means of getting a unique id from a string, so we can look up the right page of the table on disk and compare. Since we have to use the disk, is it worth compressing the table in memory? Well, it means we can have 2 or 3 times as many live strings in the MRU cache, and I'm not sure compressing costs much once you're hashing and comparing - comparing two strings costs less if both are compressed using the same dictionary. And I have spent a day getting the compression to work. Ok, let's create a dictionary from the triples.

2007-04-07 23:05 ... brief rest reading blogs.
...
2007-04-08 00:12 went to bed
...
2007-04-08 01:48 booted up again

Thinking about how to get the most from a 32 MiB trie. From the bottom, allocate trie nodes of fixed size. From the top, allocate arrays of children. Each trie_node is 64 bits:
- 32 bits: the prefix and value, packed 24:8.
- 32 bits: a 1:23:8 packing - either flag = 0, with the offset and value of a single child, or flag = 1, with the offset of the children array from the top of the store and the size of the child array.
Each element in a child array is a 24:8 packing of the offset of the child and the value of the child. Could use operator new, but since everything is bit-twiddled it's not much of an object.

...

Got a version similar to the above working:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 34 MiB.
real 0m11.281s  user 0m7.688s  sys 0m0.224s
Seems to be a bit faster.
$ time lzw_bench triples.sql 1750
read: 1715 MiB, compressed to: 530 MiB.
real 2m38.805s  user 1m22.865s  sys 0m2.372s
That stayed on 32.1 MiB usage when running, which was as designed, with similar compression to the 4194304 node case but with less memory, and faster than the 262144 node case with better compression.

2007-04-08 05:41 - really time for bed now.
...
2007-04-08 12:38 Reinstating metrics for the new form of trie:

$ time lzw_bench triples.sql 150 metrics
read: 150 MiB, compressed to: 34 MiB.
total nodes: 1644364
leaf nodes: 630124 (38%)
nodes with 1 children: 820836 (or fewer: 1450960, 88%)
nodes with 2 children: 90637 (or fewer: 1541597, 93%)
nodes with 3 children: 35347 (or fewer: 1576944, 95%)
nodes with 4 children: 18106 (or fewer: 1595050, 97%)
...
nodes with storage for 0 children: 630044 (or fewer: 630044, 38%)
nodes with storage for 1 children: 820836 (or fewer: 1450880, 88%)
nodes with storage for 2 children: 49620 (or fewer: 1500500, 91%)
nodes with storage for 4 children: 42790 (or fewer: 1543290, 93%)
nodes with storage for 8 children: 37086 (or fewer: 1580376, 96%)
nodes with storage for 16 children: 26245 (or fewer: 1606621, 97%)
nodes with storage for 32 children: 28332 (or fewer: 1634953, 99%)
nodes with storage for 64 children: 5704 (or fewer: 1640657, 99%)
nodes with storage for 128 children: 3627 (or fewer: 1644284, 99%)
table is full
storage for nodes: 39%
storage for child lists: 60%
total size of child lists: 2743424
non-empty slots in lists: 823395
list efficiency: 30%
list fragmentation: 53%

List fragmentation is the difference between heap usage and the total length of the lists. List efficiency is the ratio of non-empty elements in the lists to the total length of the lists. To reduce fragmentation, a copying collector could be used, or the block size increased. To increase efficiency, small lists could use a linear probe.

88% of the nodes have zero or one child, so 60% of the space is being used to store the child lists of 12% of the nodes. Having space on a default node for 3 children means 88% of the nodes would use more space, and 7% would use less (I think), so that's not a good trade-off.

2007-04-08 13:40 - going to my sister's for the afternoon.
...
2007-04-09 21:28 - back from my sister's, having fixed her sofa.

Taking a quick look at http://www.gzip.org/algorithm.txt to see how gzip works, since gz gives 10:1 compression whereas I'm only getting 3:1 - it still needs a trie, but it Huffman encodes the output. GZ isn't compressing the whole file - it works on 32K chunks. It also occurred to me that creating a new node on the 2nd hit rather than the first may improve things - the node id isn't used ...

Changed to use
_dictionary.add_entry(prefix, current);
calculate_bit_size(_dictionary.max_code());
instead of
calculate_bit_size(_dictionary.add_entry(prefix, current));
when compressing. This uncovered a bug: since _dictionary.add_entry returned -1 if there's no more room, the bit size ended up being 8 some of the time, instead of 22 or so. Fixing that has no effect on the compression ratio though:
read: 150 MiB, compressed to: 34 MiB.

Changing to jump from one child to eight, with a linear scan rather than masking the least bits, gives more nodes:
$ time lzw_bench triples.sql 150 true
...
read: 150 MiB, compressed to: 41 MiB.
total nodes: 2153348
leaf nodes: 1118071 (51%)
...
real 0m11.867s  user 0m7.796s  sys 0m0.172s
...

Added larger tests to lzw_test. The decoder has drifted a little, with the larger tests showing up what happens when the number of bits to encode the prefix goes up, so tidied that up.

2007-04-10 01:27 - bed time.

The decoder fails when the compression has written the id of the last added code in the dictionary. This should not get written until after the next code is written, as it's not added to the decoder's dictionary until the first character following the previous code is known. Added extra logic to the compression to handle that case.

Added a feature to allow the dictionary size to be changed again. This exposed a bug - it was only stopping adding nodes once 2 * _node_count was greater than _heap_top, which was supposed to be the index of the top of the heap of child lists.
Unfortunately, _heap_top was actually the last index find_free_heap returned, which wasn't necessarily the lowest index on the heap, so the heap lists could get corrupted. The add_entry code was changed to check whether the value at the index was the value free slots were initialised to: -1 - which cannot be a valid prefix and letter combination, as prefixes are limited to 23 bit unsigned integers. Doubled the limit of the heap size by rolling the data index right one, allowing half the heap for child lists if the heap is 64 MiB.

With that bug fixed, and lzw_test passing, we get the following effect of altering the heap size:
$ time lzw_bench triples.sql 150 65536
read: 150 MiB, compressed to: 29 MiB.
real 0m13.522s  user 0m9.153s  sys 0m0.312s
$ time lzw_bench triples.sql 150 32768
read: 150 MiB, compressed to: 35 MiB.
real 0m13.351s  user 0m8.333s  sys 0m0.252s
$ time lzw_bench triples.sql 150 16384
read: 150 MiB, compressed to: 54 MiB.
real 0m12.098s  user 0m8.101s  sys 0m0.224s
$ time lzw_bench triples.sql 150 8192
read: 150 MiB, compressed to: 61 MiB.
real 0m12.165s  user 0m8.185s  sys 0m0.188s
$ time lzw_bench triples.sql 150 4096
read: 150 MiB, compressed to: 65 MiB.
real 0m11.686s  user 0m7.660s  sys 0m0.188s
$ time lzw_bench triples.sql 150 1024
read: 150 MiB, compressed to: 150 MiB.
real 0m11.016s  user 0m9.141s  sys 0m0.104s

It's still an order of magnitude or two slower than just reading the data:
$ time readbench triples.sql 150
read 152 MiB.
real 0m0.470s  user 0m0.012s  sys 0m0.216s
$ time lzw_bench triples.sql 1750
read: 1715 MiB, compressed to: 559 MiB.
real 2m45.079s  user 1m34.734s  sys 0m2.312s
13.5 * 1715 / 150 = 154 = 2m34s, so approximately linear.

2007-04-10 23:11 ...

Looking into how well it compresses our triples, which should be more regular than the pure SQL: triple_compression_bench ties a sql_reader to a triple_compressor triple_target:
$ time triple_compression_bench triples.sql 1750
#read 1715 MiB (20.8435 MiB/s). 10423960 triples (126689 triples/s).
#read 10429951 triples.
#read 1716 MiB.
#compressed 462 MiB.
real 2m3.149s  user 1m20.269s  sys 0m2.184s

That means we're using an average of 46 bytes to store the strings associated with a triple, or 15.5 bytes per term. Given the block read takes more than half that real time just for reading, it's not too bad:
$ time readbench triples.sql 1750
real 1m17.170s  user 0m0.024s  sys 0m3.580s

This is much slower than linear compared with readbench 150, because the file is larger than the disk cache, which readbench gets a bonus from for 150 MiB reads:
$ readbench triples.sql 1500 > /dev/null # pushes first part of triples.sql out of cache
$ time readbench triples.sql 160 > /dev/null
real 0m3.355s
$ time readbench triples.sql 160 > /dev/null
real 0m0.329s
$ time readbench triples.sql 160 > /dev/null
real 0m0.238s
$ time readbench triples.sql 320 > /dev/null
real 0m9.515s
$ time readbench triples.sql 320 > /dev/null
real 0m0.690s
$ time readbench triples.sql 320 > /dev/null
real 0m0.763s
$ time readbench triples.sql 640 > /dev/null
real 0m14.261s
$ time readbench triples.sql 640 > /dev/null
real 0m16.484s
$ time readbench triples.sql 640 > /dev/null
real 0m22.929s

I doubt the built-in cache on my HDD is 320 MiB, so I suspect Linux is using any free memory as cache, and the 320 MiB threshold won't be guaranteed once we're using the memory for other stuff.
So it's fast enough, but I'd like better compression - the goal is for the most common URIs to compress to 30 bits, so that one bit of the id indicates URI, one bit indicates compressed, and the remainder holds the string itself, which will then be unique and the same for all strings of that value, giving us the id for free. The current LZW compression implementation doesn't do that - on average it doesn't compress enough, and it can't take advantage of us already having built the compression dictionary during the compression.

Added metrics for the number of strings compressed to 30 bits or fewer, and the compressed size of the remainder:
$ time triple_compression_bench triples.sql 165
#read 160 MiB (22.5035 MiB/s). 998121 triples (140383 triples/s).
#read 160 MiB.
#compressed 25 MiB.
# 1510959 strings compressed to below 30 bits
#the remainder compressed into 22 MiB.
real 0m10.825s  user 0m7.000s  sys 0m0.224s

So for a million triples/(quads including graph id) we have about half the strings compressed into IDs and about half stored in 22 MiB, plus 32 MiB for the compression dictionary and 16 MiB for the quads = 68 MiB. To generate an index should take 10s; to load a compressed store, ~0.2s if it's in the disk cache.

$ time triple_compression_bench triples.sql 1750
#read 1715 MiB (20.4556 MiB/s). 10423960 triples (124332 triples/s).
#read 1716 MiB.
#compressed 462 MiB.
# 11490080 strings compressed to below 30 bits
#the remainder compressed into 433 MiB.
real 1m53.261s  user 1m21.929s  sys 0m2.096s

With the larger dataset, only a third of the strings meet the compression criterion.

Looking at the effect of limiting the compression dictionary to 15 bits (32 KiB), in the hope that we gain more 2-code URIs than we lose by compressing less in general:
$ time triple_compression_bench triples.sql 1750 32
#read 1715 MiB (21.5371 MiB/s). 10423960 triples (130905 triples/s).
#read 1716 MiB.
#compressed 1036 MiB.
# 3052834 strings compressed to below 30 bits
#the remainder compressed into 1031 MiB.
real 2m1.158s  user 1m17.677s  sys 0m2.020s
That didn't work. Trying with the largest possible table for the implementation - 64 MiB:
$ time triple_compression_bench triples.sql 1750 65536
#read 1715 MiB (19.993 MiB/s). 10423960 triples (121520 triples/s).
#read 1716 MiB.
#compressed 402 MiB.
# 14177540 strings compressed to below 30 bits
#the remainder compressed into 365 MiB.
real 2m3.871s  user 1m23.633s  sys 0m2.420s
So we've spent 32 MiB and gained 433-365 = 68 MiB.
...
Changing the threshold to 62 bits doesn't capture very many more:
# 16369289 strings compressed to below 62 bits
#the remainder compressed into 353 MiB.
...
Finding the average length of the extra-size strings:
10423960 triples => 31271880 strings, 31271880 - 14177540 = 17094340 long strings
365 * 1024 * 1024 / 17094340 = 22.4 average length.
...
32 MiB for the dictionary, 160 MiB for 10 million quads, 433 MiB for longer strings = 625 MiB
64 MiB for the dictionary, 160 MiB for 10 million quads, 365 MiB for longer strings = 589 MiB

So we're in the ballpark for building an in-memory 10 mega-triple store that imports from a text format in 2 minutes. A giga-store would take 3 1/2 hours to import, assuming it's linear.
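(Going back to the id layout at the top of this entry, a sketch of the tag-bit packing being aimed at; the exact bit positions are my assumption - one URI bit, one 'compressed' bit, 30 bits of payload:)

#include <cstdint>

// 32 bit term id: top bit set if the term is a URI, next bit set if the
// remaining 30 bits are the compressed string itself (so equal strings get
// equal ids for free); otherwise the low 30 bits index the string table.
const uint32_t uri_bit        = 0x80000000u;
const uint32_t compressed_bit = 0x40000000u;
const uint32_t payload_mask   = 0x3fffffffu;

inline uint32_t make_compressed_id (bool is_uri, uint32_t compressed_bits) {
    return (is_uri ? uri_bit : 0) | compressed_bit | (compressed_bits & payload_mask);
}
inline bool is_compressed (uint32_t id) { return (id & compressed_bit) != 0; }
inline uint32_t payload (uint32_t id)   { return id & payload_mask; }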
Finding string collisions requires hashing each long string twice. If a third of the strings have ids from compression so don't need hashing, then on average there needs to be hashing of two 22.5 byte strings per triple. Using 30 bit hashes, collisions can be detected using 128 MiB, or double-hash collisions in 256 MiB, then a second scan of all strings with colliding hashes can be made. The problem is that that's only efficient where collisions are rare. A 20-bit hash table would take 4 MiB empty, plus 16 bytes per entry (id:32, next:32, file offset:64), around 280 MiB.

One question is whether we can make streaming a sensible option for larger queries - scanning a database using a sustained read of 1280 MiB/minute is 80 million quads/minute, or say 1 mega-quad/s, so the top performance to query a full giga-store serially would be 15 mins for a query. Which is quite a long time. For that scale you'd need indexing, or at least partitioning, of the store. Not that you'd want to have it all in the one 64 gigabyte file anyway, but you'd need to partition into N^4 parts to get a factor of N improvement when scanning on just one value in the quad. For example, partition into 4096 mega-bases, based on three bits of the id in each quad, then create four indices of each mega-base, one per term in the quad. Here's a question - is it faster to have null entries, so each mega-base is a power of two in size and distributed as closely as possible to having the indexed term be an index into the file? nulls help. Indexing and partitioning aren't hard to scale, as long as ids can be shared. What matters is ensuring equal values have the same id, and non-equal values have different ones.

...

Created a simple hash table for strings. Discovered that we can't decompress our compressed strings - the compression was missing off a trailing partial byte. Now with elimination of duplicate strings:

# 64 MiB dictionary
$ time triple_compression_bench triples.sql 1760 65536
#read 1716 MiB.
#compressed 401 MiB.
# 14148062 strings compressed into 30 bits
#the remainder, compressed into 141 MiB.
real 2m12.344s  user 1m27.329s  sys 0m2.868s
# 32 MiB dictionary
$ time triple_compression_bench triples.sql 1760
#read 1716 MiB.
#compressed 478 MiB.
# 11500011 strings compressed into 30 bits
#the remainder, compressed into 153 MiB.
real 2m4.132s  user 1m26.109s  sys 0m2.720s
# 16 MiB dictionary
$ time triple_compression_bench triples.sql 1760 16384
#read 1716 MiB.
#compressed 496 MiB.
# 11766306 strings compressed into 30 bits
#the remainder, compressed into 156 MiB.
real 2m6.607s  user 1m23.161s  sys 0m2.744s

Some points:
- at the moment the storage for the long strings is fixed at 360 MiB. This should be dynamic up to a limit, otherwise we can't use it for anything else.
- the default size of 32 MiB seems good. We may save a little with 16 MiB.
- need to count occurrences next, to see if it's worth Huffman coding them.
- at the end of a run, the system has created a list of around 10 million triples in memory, where each term in the triple is an id, from which the original triples can be reconstructed, using 160 MiB (triples), 32 MiB (dictionary) and 156 MiB (long strings) = 348 MiB. Which leaves enough for a second 160 MiB result table for creating indices.
- I've no idea how this will scale between stores, or even if it has to.
...
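(For reference, a sketch of the chained layout the 20-bit hash table numbers above assume - 2^20 bucket heads, 4 MiB empty, plus a 16 byte entry per string; the names are assumptions:)

#include <cstdint>
#include <vector>

// 20 bit hash table for long strings: bucket heads plus one 16 byte entry
// per string (id:32, next:32, file offset:64).
struct string_entry {
    uint32_t id;      // term id
    uint32_t next;    // index of the next entry in this bucket, 0 = end
    uint64_t offset;  // offset of the string in the on-disk string file
};

struct string_table {
    std::vector<uint32_t> heads;        // 1 << 20 bucket heads, 0 = empty
    std::vector<string_entry> entries;  // entries[0] unused so 0 can mean 'none'
    string_table () : heads(1u << 20), entries(1) {}
    void add (uint32_t hash, uint32_t id, uint64_t offset) {
        uint32_t bucket(hash & ((1u << 20) - 1));
        string_entry e = { id, heads[bucket], offset };
        entries.push_back(e);
        heads[bucket] = uint32_t(entries.size() - 1);
    }
};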
Looking at in-place sort - merge sort claims to be, but I can't get it to work right; it seems to want to recurse over the swapped values:

[|0 1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ]
[ 0|1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ]
[ 0 0|2 3 4 5 6 7 1,1 2 3 4 5 6 7 ]
[ 0 0 1|3 4 5 6 7 1 2,2 3 4 5 6 7 ]
[ 0 0 1 2|4 5 6 7 1 2 3,3 4 5 6 7 ]
[ 0 0 1 2 3|5 6 7 1 2 3 4,4 5 6 7 ]
[ 0 0 1 2 3 4|6 7 1 2 3 4 5,5 6 7 ]
[ 0 0 1 2 3 4 5|7 1 2 3 4 5 5,6 7 ]
[ 0 0 1 2 3 4 5 6,1 2 3 4 5 5 7,7 ]

...
2007-04-11 06:08 bed time.
...
2007-04-11 19:19 back to sort.

If we have a swap buffer of half the size of the sorted set, then we're ok. If the buffer's too small, then there's another N/2 swaps.

[|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7 ] [$_ _ _ _ ]
[ _|1 2 3 4 5 6 7|0 1 2 3 4 5 6 7 ] [ 0$_ _ _ ]
[ _|1 2 3 4 5 6 7 _|1 2 3 4 5 6 7 ] [ 0 0$_ _ ]
[ _|1 2 3 4 5 6 7 _ _|2 3 4 5 6 7 ] [ 0 0 1$_ ]
[ 0 1|2 3 4 5 6 7 _ _|2 3 4 5 6 7 ] [ 0 0 1 1$]
[ 0 0 1 1$_ _ 6 7 _ _|2 3 4 5 6 7 ] [|2 3 4 5 ]
[ 0 0 1 1 2$_ 6 7 _ _|2 3 4 5 6 7 ] [|2 3 4 5 ]
[ 0 0 1 1 2 2$6 7 _ _ _|3 4 5 6 7 ] [:_|3 4 5 ]
[ 0 0 1 1 2 2 3$7 _ _ _|3 4 5 6 7 ] [ 6:_|4 5 ]
[ 0 0 1 1 2 2 3 3$_ _ _ _|4 5 6 7 ] [ 6 7|4 5 ]

[|0 1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ] ? ?
[ 0|1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ] ? ?
[ 0 0|2 3 4 5 6 7,1 1 2 3 4 5 6 7 ] ? ?
[ 0 0 1|3 4 5 6 7,2 1 2 3 4 5 6 7 ] ? ?
[ 0 0 1 1|4 5 6 7,2 3 2 3 4 5 6 7 ] ? ?
[ 0 0 1 1 2|5 6 7,2 3 4 3 4 5 6 7 ] ? ?
[ 0 0 1 1 2 2|6 7,5 3 4 3 4 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3|7,5 6 4 3 4 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3|5 6 4 7 4 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4|6 4 7 5 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4|6 7 5 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4 5|7 6 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4 5 5|6 7 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4 5 5 6|7 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4 5 5 6 6|7 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 ]

OK, I can see where I was going wrong - I was just tracking left and right, not left, right and insertion point.

Initially, while data[l] <= data[r], ++l, ++i.
if (data[l] > data[r]), swap(i, r), ++r, ++i
else swap(i, l), ++l, ++i.
until r = limit-1 and l = r-1

Try that algorithm:

Initially: [|0 1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ] i=0, l=0, r=8
? ? data[l] <= data[r] => ++l
[ 0|1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ] i=1, l=1, r=8
? ? data[l] > data[r] => swap(i,r), l := r, ++r
Second phase, i < mid,
[ 0 0|2 3 4 5 6 7,1 1 2 3 4 5 6 7 ] i=2, l=8, r=9
? ? data[l] <= data[r] => swap(i,l);
[ 0 0 1|3 4 5 6 7,2 1 2 3 4 5 6 7 ] i=3, l=8, r=9
? ? data[l] > data[r] => swap(i,r), ++r;
[ 0 0 1 1|4 5 6 7,2 3>2 3 4 5 6 7 ] i=4, l=8, r=10
? ? data[l] <= data[r] => swap(i,l), ++l;
[ 0 0 1 1 2|5 6 7,4>3 2 3 4 5 6 7 ] i=5, l=9, r=10
? ? data[l] > data[r] => swap(i,r), ++r;
[ 0 0 1 1 2 2|6 7,4 3 5>3 4 5 6 7 ] i=6, l=9, r=11
? ? data[l] <= data[r] => swap(i,l); ++l;
[ 0 0 1 1 2 2 3|7,4 6>5 3 4 5 6 7 ] i=7, l=10, r=11
? ? data[l] > data[r] => swap(i,r); ++r;
[ 0 0 1 1 2 2 3 3|4 6>5 7>4 5 6 7 ] i=8, l=10, r=12
New phase: i = mid, so data[i+1] is not guaranteed to be greater than data[i]:
[ 0 0 1 1 2 2 3 3|4 6>5 7>4 5 6 7 ] i=8, l=10, r=12
? ? data[l] > data[r] => swap(i,r); ++r;

After a swap, increment l or r only if the swapped value is greater than the next value. The swapped values are always increasing. Once i > mid, need to check data[i] too. If l <= i, l = i+1; if r <= l, r = l+1. I'm assuming that doing it in-place is worth the extra comparisons.

...
2007-04-11 22:05 merge_sort_test sorts all selections from S8 successfully. Creating merge_sort_bench for sorting 160 MiB of quads. This shows up a bug in the merging: sorting 1024 quads.
[|0142 0482 0652 2347 3176 3346 3795 3969>0259 0270 0448 0525 0796 0823 2215 3797 ] i=0 l=0 r=8
^^^^ ^^^^
[ 0142|0482 0652 2347 3176 3346 3795 3969>0259 0270 0448 0525 0796 0823 2215 3797 ] i=1 l=0 r=8
^^^^ ^^^^
[ 0142 0259|0652 2347 3176 3346 3795 3969>0482>0270 0448 0525 0796 0823 2215 3797 ] i=2 l=8 r=9
^^^^ ^^^^
[ 0142 0259 0270|2347 3176 3346 3795 3969>0482 0652>0448 0525 0796 0823 2215 3797 ] i=3 l=8 r=10
^^^^ ^^^^
[ 0142 0259 0270 0448|3176 3346 3795 3969>0482 0652 2347>0525 0796 0823 2215 3797 ] i=4 l=8 r=11
^^^^ ^^^^
[ 0142 0259 0270 0448 0482|3346 3795 3969>3176>0652 2347>0525 0796 0823 2215 3797 ] i=5 l=9 r=11
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525|3795 3969>3176>0652 2347 3346>0796 0823 2215 3797 ] i=6 l=9 r=12
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652|3969>3176 3795>2347 3346>0796 0823 2215 3797 ] i=7 l=10 r=12
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796|3176 3795>2347 3346 3969>0823 2215 3797 ] i=8 l=10 r=13
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823|3795>2347 3346 3969>3176>2215 3797 ] i=9 l=10 r=14
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215|2347 3346 3969>3176 3795 3797 ] i=10 l=10 r=14
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215 2347|3346 3969>3176 3795 3797 ] i=11 l=11 r=14
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215 2347 3346|3969>3176 3795 3797 ] i=12 l=12 r=14
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215 2347 3346>3176|3969>3795 3797 ] i=13 l=13 r=14
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215 2347 3346>3176 3795|3969>3797 ] i=14 l=14 r=15
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215 2347 3346>3176 3795|3797 3969 ] i=14 l=15 r=16
^^^^ ^^^^

The 3176 value which gets swapped to mid and then again to 13 isn't in increasing order with 3969, so it invalidates one of the assumptions of the above merging algorithm. When i = mid, we've got three lists to merge: 3176 3795; 2347 3346 3969; 0823 2215 3797 - whereas the above merges two. The code for a source/target merge would be much simpler.

...

Got a to-and-fro merge sort which works on a set of pseudo-random quads:
$ time merge_sort_bench 10240
10485760 quads. 0 1 2 3 4 5 6 7
sorting: sorted quads.
checking: done. 0 1 2 3 4 5 6 7
real 0m11.000s  user 0m9.813s  sys 0m0.556s
$ time merge_sort_bench 1024
1048576 quads. 0 1 2 3 4 5 6 7
sorting: sorted quads.
checking: done. 0 1 2 3 4 5 6 7
real 0m1.388s  user 0m1.208s  sys 0m0.072s

Adding output of the quads to disk doesn't affect the run time. Reading the quads in again from cold disk takes 75% as long as creating them:
$ time readbench sorted.quads
read 160 MiB.
real 0m7.688s  user 0m0.000s  sys 0m0.324s

So can we tune the sort enough to make it not worth creating indices? Except that we can binary search on file without loading it all.

2007-04-12 00:46 bed time. sensible bed time for once!
...
2007-04-12 01:40 maybe not.

Changing quad_sort to use aligned arrays:
$ time merge_sort_bench 10240
real 0m10.577s  user 0m9.601s  sys 0m0.480s
Not significant. Simplifying the operations that access the term in question, data[start+(l<<2|term)], to initialise pointers to data + ...
and then increment them in the loops shaves a half second or so:
$ time merge_sort_bench 10240
real 0m9.983s  user 0m9.005s  sys 0m0.500s
Using SSE2 operations to copy:
$ time merge_sort_bench 10240
real 0m9.621s  user 0m8.693s  sys 0m0.452s
Using SSE2 operations to compare:
$ time merge_sort_bench 10240
real 0m9.950s  user 0m8.993s  sys 0m0.436s
Using SSE2 operations to compare and bit ops to avoid the branch:
$ time merge_sort_bench 10240
real 0m10.085s  user 0m9.061s  sys 0m0.544s
For this, it seems it's best just to use the SSE2 for copying:
$ time merge_sort_bench 10240
real 0m9.815s  user 0m8.857s  sys 0m0.464s

I can't think of an easy way of parallelising the sort to SIMD, so I'll leave it for now. 9.8s at 2.2GHz for 24 passes over 10 Mi quads = 85 clocks per item visited:
(9.8*2.2e9)/(24*10*1024*1024) = 85.6717428
At the minimum of 1 visit per item, that's 40ns per item, or 408ms for an index of a 10 million triple store.

2007-04-12 03:28 ...

There's been a follow-up post by Joe Gregorio, which points to mulgara:
http://mulgara.org/confluence/display/dev/%28Im%29Perfect+Indexes

Thinking about indexing next. It's noted there that to generate indices where all quads matching two terms are contiguous, you need an index for each selection of two terms from four, ie six indices. S, P, O and G denote the four terms in the quad - subject, predicate, object and graph. For one or three term queries, four indices suffice - for one term, one index for each of SPOG; for three terms, one index sorted on the three terms in any order. For a query of four terms, any index will have the matches contiguous. So for quads, six indices will give you an O(log(N)) match on any single part of a query, as long as each of SPOG appears first and last in at least one of the orderings. mulgara's indices don't have that property, so can't be used for three-term queries. On the other hand, they do have an SPOG index, which is the most likely variant of three terms. So a selection of:

SPOG -> S, SP, SPO matches contiguous, SP matches ordered by O over G.
POSG -> P, PO matches contiguous, PO matches ordered by S over G.
OSPG -> O, SO matches contiguous, SO matches ordered by P over G.
GSPO -> G, SG, SPG matches contiguous
GPOS -> PG, POG matches contiguous
GOSP -> OG, SOG matches contiguous

might be used.

Based on the assumptions I've made about what a node in a distributed array needs to provide, ie that 32 bit IDs are enough, and how well the sample quads compress, these aren't really indices, as a reference into a file takes as much space as the data itself. I'd expect that most of the time you'd want the results around a single subject, particularly if you have lots of blank nodes, ie you'd be using one ordering to get ?PO_, then use SP?_ (? desired term, _ don't care, SPOG specified value).

Taking the example query from http://xtech06.usefulinc.com/schedule/paper/18 :

SELECT ?pub ?title ?issue ?article WHERE {
    ?title rdf:type struct:Title .
    ?title dc:identifier .
    ?issue prism:isPartOf ?title .
    ?issue prism:volume "20" .
    ?issue prism:number "4" .
    ?article prism:isPartOf ?issue .
    ?article prism:startingPage "539" .
}

An optimiser may first look at the POGS and OGSP orderings to determine the spans of the predicates and literals, to decide which patterns to match first. Either you can build a topological sort and start with finding ?title candidates, then ?issue candidates, then ?article, or you can merge the results of all matches with one ungrounded term. The thing about this index approach is that this pattern:
?title rdf:type struct:Title .
would use the POSG index, and the result would be far from the result for ?title dc:identifier . This lack of locality, I think, is going to be the pain point for this kind of normalised store. Whether it's an index with pointers into a larger store, or an ordering of a compressed form of the quads, you've got a cache (disk or memory) miss, which is why I'm trying to implement this without chasing pointers.

Wrote blog as it's Thursday.

2007-04-13 00:25 ...

Looked a bit into Linux's off64_t, large file support and B-trees. Putting a call to posix_fadvise before the read in file_reader doesn't help; we're reading sequentially in blocks anyway.

2007-04-13 01:30 ...

2007-04-14 16:02

Thinking about how to improve raw indexing, assuming 2N working storage. First thought: use radix 4 sorting, growing lists of sorted values and indices into the original data from the top and bottom of each half of the temporary storage space. This has the disadvantage of reversing the previous order of half the data.

For indexing b bit values, a radix sort with radix r = 2^a visits each item b/a times. With N temporary storage, you can radix sort N/r items into r simple buckets in the temporary storage. This may incur an extra visit to copy back, if the next pass can't be done on the buckets. It then takes a-1 merge sorts to combine these b/a sub-sequences. So you have a-1 + ceil(32/a) visits to sort 32 bit data (or a-1 + 2.ceil(32/a) with copying):

a                  :  1  2  3  4  5  6  7  8  9 10 16 32
a-1 + ceil(32/a)   : 32 17 13 11 11 11 11 11 12 13 17 32
a-1 + 2.ceil(32/a) : 64 33 24 19 18 17 16 15 16 17 19 33

So I'd expect, where log2(N) > 15, a hybrid radix+merge to be better than a straight merge.

...

2007-04-14 22:27

Further thought: making one pass to count the number of items with each radix digit, then making the subsequent passes use those counts to offset the address into the working memory, means you can do a full sort of the data in 1 + ceil(B/a) passes, using 4*2^a additional bytes of temporary data:

a              :  1  2  4   6  8  11    16  32
1 + ceil(32/a) : 33 17  9   7  5   4     3   2
4*2^a          :  8 16 64 256 1K  8K  256K 16G

Given:

$ cat /proc/cpuinfo
model name : Mobile AMD Athlon(tm) 64 Processor 3400+
cpu MHz    : 2200.000
cache size : 1024 KB

it seems the temp storage will fit into cache for up to radix 2^16, which allows us to sort in two passes. So for 10 million quads, radix sort wins over merge. '-no-sort' gives the overhead in the test program.

$ time quad_sort_bench 10240 -radix
real 0m4.127s user 0m3.256s sys 0m0.608s

$ time quad_sort_bench 10240 -merge
real 0m10.112s user 0m9.081s sys 0m0.512s

$ time quad_sort_bench 10240 -no-sort
real 0m1.476s user 0m1.056s sys 0m0.340s

Switching to use SSE2 in the radix implementation - load the data, extract a 16 bit value from it, do the index stuff, and store it - improves speed and, unusually for SSE2, makes the code in the loops shorter (though it does have to be repeated 4 times, as the extraction has to use immediates):

$ time quad_sort_bench 10240 -radix
real 0m3.674s user 0m2.900s sys 0m0.556s

Now, at the granularity of bash's time command, radix is not slower for datasets down to 1K. 1.85 s to sort 10 million items at 2.2 GHz is 388 clocks/item; as it visits each item 5 times, that's about 80 cycles per visit. That's still way higher than the number of instructions.
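Going back to the six orderings above: the claim that a two-term pattern becomes one contiguous, binary-searchable span is perhaps easiest to see in code. A minimal sketch, assuming quads held as four 32 bit ids and an array already copied and sorted in POSG order; the names (quad, posg_prefix_less, match_po) are mine, not the store's:

// A minimal sketch, not the store's code: find the contiguous span of quads
// matching a given predicate and object in an array sorted in P,O,S,G order.
#include <algorithm>
#include <vector>
#include <utility>
#include <stdint.h>

struct quad { uint32_t s, p, o, g; };

// Compares on the (P, O) prefix of the POSG sort order only, so a
// POSG-sorted array is partitioned with respect to it.
struct posg_prefix_less
{
    bool operator()(const quad& a, const quad& b) const
    {
        if (a.p != b.p) return a.p < b.p;
        return a.o < b.o;
    }
};

typedef std::vector<quad>::const_iterator quad_iter;

// All quads with predicate p and object o lie in [result.first, result.second).
std::pair<quad_iter, quad_iter>
match_po(const std::vector<quad>& posg_index, uint32_t p, uint32_t o)
{
    quad key = { 0, p, o, 0 };
    return std::equal_range(posg_index.begin(), posg_index.end(),
                            key, posg_prefix_less());
}

std::equal_range only needs the array to be partitioned with respect to the comparator, which a POSG sort guarantees for a (P, O) prefix comparison, so the span comes back in O(log N) without touching anything outside it.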
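As a concrete version of the count-then-scatter scheme and the 16 bit SSE2 extraction described above, here is a minimal sketch of a single radix-256 pass, keyed (arbitrarily, for illustration) on the low byte of the second term. It isn't the quad_sort_bench code; the alignment assumptions and names are mine:

// A minimal sketch of one count-then-scatter radix-256 pass over quads,
// keyed on the low byte of the second term; not the quad_sort_bench code.
// Assumes src and dst are 16-byte aligned arrays of count quads
// (4 x 32 bit ids each).
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

void radix256_pass(const uint32_t* src, uint32_t* dst, size_t count)
{
    size_t bucket[256] = { 0 };

    // Pass 1: histogram the radix digit (low byte of term 1).
    for (size_t i = 0; i < count; ++i)
        ++bucket[src[i*4 + 1] & 0xff];

    // Exclusive prefix sum turns the counts into starting offsets.
    size_t offset = 0;
    for (int b = 0; b < 256; ++b) {
        size_t c = bucket[b];
        bucket[b] = offset;
        offset += c;
    }

    // Pass 2: copy each 16 byte quad to its slot with SSE2. Word 2 of the
    // loaded vector is the low half of term 1; pextrw needs an immediate
    // index, which is why this has to be repeated per term in real code.
    for (size_t i = 0; i < count; ++i) {
        __m128i q = _mm_load_si128(
            reinterpret_cast<const __m128i*>(src + i*4));
        int digit = _mm_extract_epi16(q, 2) & 0xff;
        _mm_store_si128(
            reinterpret_cast<__m128i*>(dst + bucket[digit]*4), q);
        ++bucket[digit];
    }
}

Sorting on a full 32 bit term needs four scatter passes like this, one per byte, plus a single counting pass that can histogram all four bytes at once - which lines up with the five visits per item counted above.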
Using radix 256, we appear to get better throughput:

$ time quad_sort_bench 10240 -radix256
real 0m2.949s user 0m2.292s sys 0m0.444s

Then saving the result, off a cold cache:

$ time quad_sort_bench 10240 -radix256 -save
real 0m10.049s user 0m3.028s sys 0m1.120s

So the sort is quicker than the disk write of the index, but I'm not sure whether the sort is memory bandwidth constrained. Adding a micro-benchmark which copies the quads 10 times:

$ time quad_sort_bench 10240 -move-only
moving quads 10 times :
real 0m4.194s user 0m3.508s sys 0m0.476s

So access and update take around 200 ms for 10 million quads ((3.5s - 1.5s)/10), or 42 cycles per quad. I'm not sure what that all means, but the radix 256 seems to be much more efficient.

2007-04-15 00:45 ...

2007-04-15 13:13

Comparing move to scatter:

$ time quad_sort_bench 10240 -scatter-only
10485760 quads.
scattering quads 10 times :
real 0m15.706s user 0m14.949s sys 0m0.648s

$ time quad_sort_bench 10240 -move-only
10485760 quads.
moving quads 10 times :
real 0m4.013s user 0m3.512s sys 0m0.392s

$ time quad_sort_bench 10240 -no-sort
real 0m0.627s user 0m0.340s sys 0m0.248s

((14.949+0.648)-(0.340+0.248))/10 = 1.5s
((3.512+0.392)-(0.340+0.248))/10 = 0.33s

So scattering takes much longer than the consecutive move - 1,500 ms vs 330 ms. I don't know why -no-sort is reporting a lower time today. If scattering is expensive (which is not surprising), note that radix 256 scatters writes across only 256 destinations, whereas radix 65536 scatters across many more, so is much more scattered.

Created a hybrid sort which uses radix 256 to sort blocks, then merges them. Varying the block size:

$ time quad_sort_bench 10240 -hybrid 256
real 0m7.889s user 0m7.300s sys 0m0.524s

$ time quad_sort_bench 10240 -hybrid 1024
real 0m6.170s user 0m5.524s sys 0m0.556s

$ time quad_sort_bench 10240 -hybrid 4096
real 0m6.484s user 0m5.788s sys 0m0.604s

$ time quad_sort_bench 10240 -hybrid 16384
real 0m4.918s user 0m4.396s sys 0m0.448s

$ time quad_sort_bench 10240 -hybrid 65536
real 0m5.181s user 0m4.680s sys 0m0.360s

$ time quad_sort_bench 10240 -hybrid 262144
real 0m4.557s user 0m4.044s sys 0m0.428s

$ time quad_sort_bench 10240 -hybrid 1048576
real 0m4.832s user 0m4.264s sys 0m0.492s

$ time quad_sort_bench 10240 -hybrid 2097152
real 0m4.564s user 0m3.968s sys 0m0.528s

$ time quad_sort_bench 10240 -hybrid 4194304
real 0m3.920s user 0m3.296s sys 0m0.560s

$ time quad_sort_bench 10240 -hybrid 10485760
real 0m3.497s user 0m2.868s sys 0m0.548s

So that way of trying to get locality into the sort doesn't give an improvement over the plain radix case.

2007-04-15 14:01 ...

2007-04-15 23:05

Attempting a radix16 sort:

$ time quad_sort_bench 10240 -radix16
real 0m4.087s user 0m3.408s sys 0m0.452s

More visits, not better.

Tidied the code to use template functions parameterised on the term, which is cleaner than a set of cases; the term has to be a template parameter because the extraction needs an immediate. Using _mm_stream_si128 rather than _mm_store_si128 to output the results of the radix256 sort had no measurable effect. I could probably use SIMD over the four byteN_index_total arrays in radix256, but it would introduce an extra operation into the bigger loop, so it's not worth trying.

Took a look at the generated assembly; I can't quite read it well enough -

 180: 66 0f 7f 04 02          movdqa %xmm0,(%edx,%eax,1)

Don't know what that 3-argument addressing mode means.
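For reference, a rough sketch of the shape of the two loops the -move-only and -scatter-only runs above are comparing; the index mixing here is a placeholder of my own, not the benchmark's actual mix - the point is only that one loop writes sequentially while the other writes all over the destination:

// Rough shapes of the two access patterns being timed; the mix below is a
// placeholder, not the benchmark's. Assumes 16-byte aligned buffers of
// 'count' quads.
#include <emmintrin.h>
#include <stddef.h>

// Sequential copy: reads and writes walk the buffers in address order.
void move_quads(const __m128i* src, __m128i* dst, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        _mm_store_si128(dst + i, _mm_load_si128(src + i));
}

// Scattered copy: each write lands at a mixed-up position, much as the
// scatter phase of a radix sort does.
void scatter_quads(const __m128i* src, __m128i* dst, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        size_t j = (i * 2654435761u) % count;  // placeholder mix, just to
                                               // spread the writes around
        _mm_store_si128(dst + j, _mm_load_si128(src + i));
    }
}

The scattered writes defeat the cache and write combining, which is presumably where the extra second and a half per pass goes.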
$ time quad_sort_bench 10240 -radix256
real 0m2.940s user 0m2.292s sys 0m0.444s

$ time quad_sort_bench 10240 -no-sort
real 0m1.476s user 0m0.808s sys 0m0.292s

(2.292+0.444)-(0.808+0.292) = 1.63 seconds
(1.63 * 2.2e9)/(5*10e6) = 72 clocks/visit

Still quite high. This is the high half-word buffer -> data loop of the radix 256 sort:

 1d3: 8b 95 f0 ef ff ff       mov    0xffffeff0(%ebp),%edx
 1d9: 39 55 0c                cmp    %edx,0xc(%ebp)
 1dc: 73 36                   jae    214 (unsigned int*, unsigned int*, unsigned int)+0x214>
 1de: 8b 5d 0c                mov    0xc(%ebp),%ebx
 1e1: 66 0f 6f 03             movdqa (%ebx),%xmm0
 1e5: 66 0f c5 c0 01          pextrw $0x1,%xmm0,%eax
 1ea: 0f b6 c4                movzbl %ah,%eax
 1ed: 8b 94 85 f4 ef ff ff    mov    0xffffeff4(%ebp,%eax,4),%edx
 1f4: 8d 4a 01                lea    0x1(%edx),%ecx
 1f7: 89 8c 85 f4 ef ff ff    mov    %ecx,0xffffeff4(%ebp,%eax,4)
 1fe: c1 e2 04                shl    $0x4,%edx
 201: 8b 45 08                mov    0x8(%ebp),%eax
 204: 66 0f 7f 04 02          movdqa %xmm0,(%edx,%eax,1)
 209: 83 c3 10                add    $0x10,%ebx
 20c: 39 9d f0 ef ff ff       cmp    %ebx,0xffffeff0(%ebp)
 212: 77 cd                   ja     1e1 (unsigned int*, unsigned int*, unsigned int)+0x1e1>

16 instructions ~ 4.5 clocks per instruction. I know the SSE2 load and store (movdqa) have a latency of 4 on Intel, but AFAIK the others should be single cycle.

...

2007-04-18 20:03

Looking at the AMD optimisation manual, the latency should only be 2 on the stores. The complex addressing modes may allow the use of a partly interleaved offset array - a scale of 8 gives a stride of 2 ints. Leaving that for now: since we want to create indices on disk, we should think about locality - a sorted index of 1/100th of the data may help when initially importing, but we still need the b-tree implementation to work well.

Thinking about indices again, we may want SPOG for blank nodes, but PSOG for transitive closures. Assuming the disk block size is a factor of 4K, there are 256 quads in a 4K block. It would make sense to have quads with the same AB contiguous in a block in an ABCD index, which is not the case in the b-tree at http://cis.stvincent.edu/html/tutorials/swd/btree/btree.html . A billion quads is 16 GiB, or 4 million blocks = 2^22. Sorting 256 quads takes little time, so the blocks can always be sorted. Maybe a hashmap would be better. A 16 byte hash should do for 10 million quads; need to scan for collisions.

Ok, the next micro-benchmark is to make a hash of each pair of terms in the quad, and test for collisions. But first, I want to check that the copy code isn't optimising away the mix, so that comparing it to the scatter is valid:

$ time quad_sort_bench 10240 -scatter-only
real 0m15.997s user 0m14.589s sys 0m0.508s

$ time quad_sort_bench 10240 -move-only
real 0m4.180s user 0m3.368s sys 0m0.484s

The mix being used to scatter the data is not being optimised away; the cost is in the scattering. So, collisions:

2007-04-18 22:49 ...

2007-04-19 22:51

Reading http://www.mark.masmcode.com/ . Can we interleave reads/writes in the radix sort? We're only using the single xmm0 register at the moment.

Using 1 xmm register in the loops:

$ time quad_sort_bench 10240 -radix256
real 0m2.958s user 0m2.340s sys 0m0.424s

Using 2 xmm registers in the loops:

$ time quad_sort_bench 10240 -radix256
real 0m2.933s user 0m2.256s sys 0m0.464s

Using 4 xmm registers in the loops:

$ time quad_sort_bench 10240 -radix256
real 0m2.889s user 0m2.256s sys 0m0.428s

Breaking up the loop's internals further to decrease dependencies doesn't seem to have a measurable effect.

2007-04-19 23:31 ...

2007-04-20 22:39

Downloaded uba for generating benchmark datasets.
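On the pair-of-terms hash mentioned above (2007-04-18), a hypothetical sketch of the kind of collision scan that micro-benchmark would do; the mixing function and names are illustrative only, not taken from the store:

// A hypothetical sketch of hashing each (term, term) pair in a quad and
// counting collisions - distinct pairs landing on the same hash value.
// The mix constant and names are illustrative, not the store's; std::map
// makes this a simple (slow) exact check, not a tuned benchmark.
#include <stdint.h>
#include <stddef.h>
#include <map>
#include <utility>

inline uint32_t hash_pair(uint32_t a, uint32_t b)
{
    uint64_t x = (static_cast<uint64_t>(a) << 32) | b;
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;  // arbitrary 64 bit mixing constant
    x ^= x >> 33;
    return static_cast<uint32_t>(x);
}

// term_a, term_b select which two of the four terms to pair (0..3).
size_t count_collisions(const uint32_t* quads, size_t count,
                        int term_a, int term_b)
{
    typedef std::pair<uint32_t, uint32_t> term_pair;
    std::map<uint32_t, term_pair> seen;
    size_t collisions = 0;
    for (size_t i = 0; i < count; ++i) {
        term_pair p(quads[i*4 + term_a], quads[i*4 + term_b]);
        uint32_t h = hash_pair(p.first, p.second);
        std::map<uint32_t, term_pair>::iterator it = seen.find(h);
        if (it == seen.end())
            seen.insert(std::make_pair(h, p));
        else if (it->second != p)
            ++collisions;
    }
    return collisions;
}

With 32 bit hash values and 10 million pairs, some birthday collisions are expected (roughly n^2/2^33 of them), which is what the scan is there to count.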