2007-04-05 21:36

Starting from Joe Gregorio's BitWorking post on 'megadata' I'm wondering what happens if you work this thing from the other end - rather than trying to make a database that scales well over distributed compute nodes, instead focus on optimising each node whilst trying not to compromise external scalability. Getting parallelism working, whether via network or just concurrent threads in the same machine, is harder than straight-ahead computing in a single node. Since my background is in engineering computing and simulation, I've little idea why it takes 15 minutes to load 10,000,000 triples into an RDF store, whereas reading the file itself takes seconds. So I'm trying to find the cost points for nodes, rather than looking at distribution schemes here. Unlike CPU hardware, I'm not sure that nodes have reached a point where they need parallelism for performance.

For this, I'm taking a node as a virtual processor and half a GiB RAM, mainly because my laptop has a single non-hyperthreaded core and 640 MiB RAM. At that size, a node doesn't have to handle more than a billion tuples, and I'm assuming a little bit that some of the strategies for distributing queries over nodes will still apply to such nodes. Limiting a node to a billion tuples also makes coding easier - you can use 32 bit identifiers, as long as they are (reasonably) local to the node. By 'reasonably local', I'd expect some number of IDs to be common to all nodes to represent the most frequently occurring term values.

Most of the RDF stores seem to be written in Java, and what I've seen of the implementations do a lot of pointer chasing, so this is a bit of fun to see what we can do with SSE2 to make an RDF-like query engine without pointer chasing. If each tuple in the store is a quad vector of 32 bit ints, we can do quite fast searches using SSE2. As John Sowa predicted, many triple-stores have found that triples aren't enough, and are quad stores now. Though maybe a hexa-store is best for human-oriented data, as most of the relations we come up with have arity less than seven.

Pipeline micro benchmark:

Since the SSE comparison ops return -1, we can bitmask the matches rather than using branches, which sometimes helps on pipelined CPU architectures. This appears to be quite nice for no branches and locality, and scales until you hit virtual memory:

// matches the second value in quad
__m128i* match2 (int value, __m128i* data, __m128i* result, size_t size) {
    __m128i mask(_mm_set_epi32(0, value, 0, 0));
    for (size_t i(0); i < size; ++i) {
        __m128i in(_mm_loadu_si128(data+i));
        __m128i cmp(_mm_shuffle_epi32(_mm_cmpeq_epi32(in, mask), SPREAD_EPI32(1)));
        _mm_storeu_si128(result, _mm_and_si128(in, cmp));
        result += _mm_extract_epi16(cmp, 0) & 1;
    }
    return result;
}

If the second value in a quad matches the test value, the result of cmpeq is -1 in that lane and 0 in the others, otherwise it's all zeros; this is spread over the quad to be either all -1s or all 0s. This is and'ed with the original, and written to the result. It's also and'ed with the increment for the result pointer, so either an all-zero quad is written and result is not advanced, or the matching quad is written and result is incremented.

(Actually, it's better not to use SIMD in this case, see match_bench_solo no_sse2 below. I don't seem to be able to get advantages from SIMD in this code very often here.)
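(SPREAD_EPI32 isn't shown above; a guess at what it's assumed to do - build the _mm_shuffle_epi32 immediate that broadcasts 32 bit lane n to all four lanes, the same as _MM_SHUFFLE(n, n, n, n):)

// assumed helper: shuffle immediate that copies lane n to all four lanes,
// e.g. SPREAD_EPI32(1) == 0x55, so _mm_shuffle_epi32(x, SPREAD_EPI32(1))
// fills the register with lane 1 of x
#define SPREAD_EPI32(n) (((n) << 6) | ((n) << 4) | ((n) << 2) | (n))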
$ time match_bench 10 16
number of quads: 10485760
memory required: 320 MiB
... 163840 results ...
real 0m8.945s  user 0m7.460s  sys 0m0.640s

So 16 runs of a query which tests whether the 2nd term in the quad matches one of four values, run over a 10 mega-triple set, takes ~9 seconds.

To match multiple values, the term being queried can be spread over the whole quad, then compared with a mask holding 4 of the values, then the comparison results spread and or'ed with each other.

$ time match_bench_x4 10 16
number of quads: 10485760
memory required: 320 MiB
... 163840 results ...
real 0m4.045s  user 0m3.040s  sys 0m0.624s

So this is the best case for the matching - matching one of four values against the 2nd term in the quad takes ~4 seconds instead of 9 with the spread and merge approach. Given some of this is overhead, it's still pretty good for non-indexed queries. It's irrelevant which term is matched, I just happened to use the 2nd term for the micro benchmark. Non-indexed queries matter because if you can do them fast, you don't have to do so much work to make adding tuples to the store scale.

--

Importing data:

I'm working from this dataset: which is an SQL dump, so to get warmed up there needs to be a conversion of that file to triples (in rather verbose Turtle). Which also needs a reasonable file reader/delimited string parser. So this was my first cut of the converter, which used read(fd, buf, len) each time it needed to read:

performance of sql2triples base:
$ time sql2triples triples.sql 15 > triples.triples.15
real 0m19.809s  user 0m1.176s  sys 0m16.633s
$ time sql2triples triples.sql 150 > triples.triples.150
real 2m44.558s  user 0m10.125s  sys 2m30.293s

--

Commented out the output of triples to see what the output formatting costs:
$ time sql2triples triples.sql 150
real 2m31.191s  user 0m8.617s  sys 2m15.068s

--

Block read benchmark for comparison:
$ time readbench triples.sql 150
real 0m0.505s  user 0m0.000s  sys 0m0.236s

--

So it's doing something wrong - 150s of its 165s runtime is doing stuff other than efficient IO; reading SQL shouldn't take that long. First optimisation: turn the reader into block mode internally, so it reads 16K blocks instead of reading from the file each time its read() method is called:

$ time sql2triples triples.sql 15 > triples.triples.15
real 0m1.596s  user 0m0.628s  sys 0m0.828s
$ time sql2triples triples.sql 150 > triples.triples.150
real 0m8.690s  user 0m4.080s  sys 0m4.252s

--

filereadertest uses the reader to read 1K blocks and writes them to std::cout. So a direct copy of the input takes 6.5 seconds:
$ time filereadertest triples.sql 150 > triples.sql.150
real 0m6.515s  user 0m0.772s  sys 0m1.604s

so parsing and reformatting 150 MiB of data costs 2.2 seconds. That's probably good enough for now; what will matter next is the unique hashing code for converting the triples into binary.

--

The hash table is based on a simple linear probe 2^N map. I'm assuming throughout that the design goal is a billion triples - basically so I can get away with 32 bit ids, as they mean you can fit a quad into an SSE register. More than that, and you'd probably want a distributed engine anyway. You could do triples of 42 bit ids and complicate things, but you lose the fast swizzles and comparisons. You could use 64 bit, which reduces throughput and the number of tuples that can fit into physical memory by a factor of two. For this demonstration, a billion strings seems enough.
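(A minimal sketch of the kind of linear probe 2^N string-to-id map meant here; the names, the hash and the lack of rehashing are my assumptions, not the actual implementation:)

#include <cstdint>
#include <string>
#include <vector>

// linear probe map from string to 32 bit id; 2^N slots, 0 means empty.
// no rehash shown - the table is assumed to stay sparse (see the loading
// experiments further down).
class string_id_map {
    std::vector<uint32_t> _ids;        // slot -> id + 1, 0 = empty
    std::vector<std::string> _values;  // id -> string value
    size_t _mask;
public:
    explicit string_id_map (size_t log2_size)
        : _ids(size_t(1) << log2_size), _mask((size_t(1) << log2_size) - 1) {}
    uint32_t get_id (const std::string& value) {
        size_t slot(hash(value) & _mask);
        while (_ids[slot]) {
            if (_values[_ids[slot] - 1] == value) return _ids[slot] - 1;
            slot = (slot + 1) & _mask;   // linear probe to the next slot
        }
        _values.push_back(value);
        _ids[slot] = uint32_t(_values.size());
        return uint32_t(_values.size() - 1);
    }
    static size_t hash (const std::string& value) {
        size_t h(5381);
        for (size_t i(0); i < value.size(); ++i) h = h * 33 + (unsigned char)value[i];
        return h;
    }
};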
Ok, so we read in the triple, hash its spo (subject-predicate-object) as strings, linear probe the table to get a unique id, and output the hash table (hash to string table offset), a string table (offset to value), and an image of the quad-store - each quad is a 128 bit tuple of the unique 32 bit ids of its four strings. Getting the id of a value, and merging quad-stores, requires pointer-chasing. Querying reduces to a set of OR matches, done using a simple loop over the binary tuple store; each yields a smaller list of candidates, which are then joined by matching against the shortest list.

Since the stores can be read directly from binary files, it should take half a second to load a 10 Mi tuple store (160 MiB), and about the same to run the initial query on it to find candidate matches for each term. That could be reduced with indexing, but sorted indexing costs on insertions. You could use 2^4N buckets for dividing the space up - for N=2 you'd be putting into bucket 00.01.10.01 each quad whose terms' 2 least significant bits match the bucket id, which reduces the search space for each query by 2^N. I'm not sure if having many stores would help or not - with 256 buckets you'd have to read 64 files to run the simplest query, with 4096, 512. However, with bucket semi-indexing you don't have insertion costs.

Anyway, the test dataset is 10 Mi triples, so it's the same size as the micro-benchmark queries above, so a query such as p = a or p = b or p = c should take less than half a second. I've no idea whether that's good compared to SQL databases, but it's better than the SPARQL ones given as examples, though worse than Katie's from http://xtech06.usefulinc.com/schedule/paper/18 :

Size of store (millions of triples):     2      10      77      99     152
DCIDENT (ms):                           17      15      25      32      23
SPARQL (ms):                           385     987    1048    1128    1465
SPARQL naive (ms):                    1404   21237  123547  161112  212844

$ time match_bench_solo 10 32
real 0m5.415s  user 0m4.136s  sys 0m0.564s
$ time match_bench_solo 10 64
real 0m9.619s  user 0m8.005s  sys 0m0.612s
(9619-5415)/32 = 131
$ time match_bench_solo 10 32 no_sse2
real 0m4.393s  user 0m3.276s  sys 0m0.540s
$ time match_bench_solo 10 64 no_sse2
real 0m7.815s  user 0m6.164s  sys 0m0.648s
(7815-4393)/32 = 107

An ident match using match_bench takes ~100ms for 10 Mi triples, which isn't bad for a non-indexed store. You'd need to partition at least 8 ways on each term, giving 2^(4*3) = 4096 buckets, to match the above id lookup for 10 Mi tuples. Scaling would be linear. So indexing at least on ID is required to be competitive, and the bucket approach for quads probably doesn't scale much beyond that.

Actually, trying again, it does seem that the sse2 versions of the match are slower than using 32 bit ints directly:

$ time match_bench_x4 10 16
real 0m3.496s  user 0m2.524s  sys 0m0.564s
$ time match_bench_x4 10 16 no_sse
real 0m3.149s  user 0m2.196s  sys 0m0.604s

Apparently, removing branching from the sse2 code doesn't help in this case, and the non-sse2 code is faster still.

// matches the second value in quad without using sse2
__m128i* match2_x4_no_sse (int value0, int value1, int value2, int value3,
    __m128i* data, __m128i* result, size_t size) {
    int* idata((int*)data);
    for (size_t i(0), imax(size*4); i < imax; i += 4) {
        int value(idata[i + 2]);
        if ((value == value0) || (value == value1) || (value == value2) || (value == value3)) {
            _mm_storeu_si128(result, _mm_loadu_si128(data+(i/4)));
            ++result;
        }
    }
    return result;
}

Changing the _mm_storeu_si128 to four integer memory ops doesn't change performance measurably.
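(For comparison, the spread-and-merge SSE2 variant described above might look something like this sketch - it folds the four comparison lanes with _mm_movemask_epi8 rather than the spread-and-or wording, and uses a branch rather than the branchless store; names are assumptions and this is not the benchmarked code:)

// matches the second value in a quad against any of four values using SSE2
__m128i* match2_x4_sse (int value0, int value1, int value2, int value3,
    __m128i* data, __m128i* result, size_t size) {
    __m128i values(_mm_set_epi32(value3, value2, value1, value0));
    for (size_t i(0); i < size; ++i) {
        __m128i in(_mm_loadu_si128(data + i));
        // broadcast the term being tested to all four lanes
        __m128i term(_mm_shuffle_epi32(in, _MM_SHUFFLE(2, 2, 2, 2)));
        // compare against the four candidate values at once
        __m128i cmp(_mm_cmpeq_epi32(term, values));
        // any equal lane gives a non-zero byte mask
        if (_mm_movemask_epi8(cmp)) {
            _mm_storeu_si128(result, in);
            ++result;
        }
    }
    return result;
}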
Changing to using aligned arrays helps the sse2 version a little, but it's still slower. The question of whether SSE2 helps for multiple term matches needs investigating.

$ time match_bench_x4 10 100 no_sse
real 0m13.378s  user 0m11.349s  sys 0m0.636s
$ time match_bench_x4 10 200 no_sse
real 0m24.063s  user 0m21.013s  sys 0m0.704s
(21013-11349)/100 = 96 ms/query

If it takes ~1s to pass once over a 100 Mi tuple data set, it should take tens of seconds to create an index for a given predicate, or a few minutes for a giga-store. Leave that for now; it's good enough. I may be tempted to code up a few variants and let the query engine choose based on trends, but that's a while off.

--

Ok, where are we now. We've tried a few implementations for non-indexed scans of the data set in a compact binary form. I don't think it's entirely coincidental that the gzip of a 10 Mi triple dump is a 160 MiB file, and 10 Mi quads saved as binary would be a 180 MiB file. A triple is about yay-much information content (waves hands). We've shown it's not expensive to load data from disk. That extends to queries, so as long as we assume we can serial scan as the first pass of any query, that's reasonable performance - it's well within the ballpark. I'm assuming that the results of the first scan are kept in memory, and subsequent scans or joins are done faster as they're operating on fewer items. We've also got reasonable performance for a parser that reads the sql dump and outputs triples. So now let's address getting these triples into the right form to operate on, and go back to the hashing. In a full system, you'd need to be able to combine quad-stores; I'm leaving that alone for now, and just hashing the 10 Mi triple dataset I'm working with.

First cut of string map:
- hash function for string, returns length and hash value
- hash map based on linear increment of the hash value until an unused slot is found.
Initially loading from sql rather than triples, since I've already got the code to do that.

Moving the code to do unicode escapes into a utils.cpp function hits performance in sql2triples; instead of the form:
std::string turtle_encode (const std::string& value);
use the form:
void write_turtle_encoded (std::ostream& buf, const std::string& value);
otherwise there's a factor of two slowdown:

using turtle_encode:
$ time sql2triples triples.sql 150 > triples.triples.150
real 0m18.661s  user 0m5.824s  sys 0m4.544s
using write_turtle_encoded:
$ time sql2triples triples.sql 150 > triples.triples.150
real 0m9.084s  user 0m4.032s  sys 0m4.136s

I am surprised that std::ostringstream is that bad a hit, but there you go.

Continuing to pull the sql reading code out, using a triple-target interface so the sql_reader can push data to either a triple-writer or a triple-store.

Using virtual method calls in sql_reader.write_to(triple_target& target):
$ time sql2triples triples.sql 150 > triples.triples.150
real 0m9.747s  user 0m4.392s  sys 0m4.268s
Using a template in sql_reader.write_to(T& target):
$ time sql2triples triples.sql 150 > triples.triples.150
real 0m8.297s  user 0m4.052s  sys 0m3.920s

Created a dummy triple-target for sql import:
$ time sqlimport triples.sql 150
#read 150 MiB.
real 0m2.417s  user 0m2.104s  sys 0m0.128s
$ time sqlimport triples.sql
#read 1716 MiB.
real 1m14.737s  user 0m28.818s  sys 0m2.704s
These seem to use around 50% cpu, so are possibly io bound.
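(A minimal sketch of the two write_to dispatch forms being compared above; the reader internals are stubbed out and the names are assumptions, not the actual interfaces:)

struct triple_target {
    virtual void triple (const char* s, const char* p, const char* o) = 0;
    virtual ~triple_target () {}
};

struct sql_reader {
    // stubbed-out parsing - the real reader pulls rows through the block reader
    bool read_row () { return false; }
    const char* subject () { return ""; }
    const char* predicate () { return ""; }
    const char* object () { return ""; }

    // virtual-dispatch form: one indirect call per triple
    void write_to (triple_target& target) {
        while (read_row()) target.triple(subject(), predicate(), object());
    }
    // template form: the target's triple() can be inlined into the loop
    template <typename T> void write_to (T& target) {
        while (read_row()) target.triple(subject(), predicate(), object());
    }
};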
Ok, first cut of string_hashing:
$ time sqlimport triples.sql 15 > triples.int.15
real 1m43.726s  user 1m37.034s  sys 0m1.320s
Not very good.

Using a comparison of the hash before using std::string operator== :
$ time sqlimport triples.sql 15 > triples.int.15
real 0m26.890s  user 0m24.946s  sys 0m0.916s
So there's quite an overhead on the std::string again. Stopped the 150 MiB test after 5 mins.

This is writing out the triples as decimal, rather than binary, at the moment:
...
1734213532 => "34f15571c95b532edbfe9d7da0c10f3cb4e23493"
3074817951 => "76f0e56a8a818aea4c12a4d626260a239f6df066"
1939079077 => "650eafd6100f3ab69e410a084d481b6ca0cbc8a3"
361627558 => "e020c9:1048124e9e5:-432a[@7057]"
183631784 => "I am not a college professor with 2 kids, I'm a writer, mother, and grandmother, who has one goofy german shepard dog, a fluffy kitty (she is NOT fat!) and a world full of friends I couldn't do without."
2231369665 => "just a 30-something, trying to figure it all out"
2665086921 => "http://www.amateurgourmet.com/"
87424997 => "http://leegoldberg.typepad.com/"
3283353580 => "5448634688e9f2b4d7381aede77f63c7722886d6"

Obviously, the triples have locally repeating strings, eg 3723984522 in the first three. Adding a cache of the 8 most recently used items into the string table yields:
$ time sqlimport triples.sql 15 > triples.int.15
real 0m27.240s  user 0m25.538s  sys 0m0.976s
That didn't help at all; moving the check of the recent cache to after the first hash table lookup attempt has missed means it's not being used.
$ time sqlimport triples.sql 15 > triples.int.15
real 0m27.049s  user 0m25.222s  sys 0m1.128s
$ time sqlimport triples.sql 30 > triples.int.x
real 2m14.507s  user 2m8.232s  sys 0m2.072s

That's rather too non-linear, approx O(N^2). That suggests the hash table isn't behaving as O(1), which is usually related to loading. Changing the loading on the hash table from 75% to 25% gives:
$ time sqlimport triples.sql 15 > triples.int.15
real 0m1.915s  user 0m0.840s  sys 0m0.944s
$ time sqlimport triples.sql 150 > triples.int.150
real 0m15.992s  user 0m9.289s  sys 0m5.408s

The reference for performance was 150 MiB in 15 minutes; we're not quite 60 times faster at this point.

$ time sqlimport triples.sql 1500 > triples.int.1500
#read 501 MiB.
real 2m21.728s  user 1m53.039s  sys 0m12.329s
Stopped it after 501 MiB; it does seem to be getting slower. Added more metrics, using clock() to time the loading:

#read 15 MiB (8.62069 MiB/s). 106163 triples (61013.2 triples/s).
#read 30 MiB (12.2951 MiB/s). 189610 triples (77709.0 triples/s).
#read 45 MiB (14.1509 MiB/s). 278842 triples (87686.2 triples/s).
#read 60 MiB (15.0754 MiB/s). 387876 triples (97456.3 triples/s).
#read 90 MiB (16.0428 MiB/s). 580790 triples (103528.0 triples/s).
#read 150 MiB (12.5313 MiB/s). 942939 triples (78775.2 triples/s).
#read 300 MiB (7.25514 MiB/s). 1781925 triples (43093.7 triples/s).
#read 450 MiB (4.40055 MiB/s). 2781531 triples (27200.6 triples/s).
#read 600 MiB (3.11737 MiB/s). 3807691 triples (19783.3 triples/s).

Definitely there's a slow-down there; memory use isn't high enough to be hitting virtual memory. Making the hash table even sparser (12.5%) increases performance significantly:
#read 600 MiB (17.4419 MiB/s). 3807691 triples (110689 triples/s).
#read 675 MiB (17.215 MiB/s). 4320159 triples (110180 triples/s).
but it falls off afterwards:
#read 700 MiB (16.0293 MiB/s). 4495263 triples (102937 triples/s).
#read 800 MiB (13.6495 MiB/s). 5195900 triples (88652.1 triples/s).
#read 900 MiB (12.2332 MiB/s). 5890576 triples (80067.6 triples/s).
#read 1000 MiB (11.3392 MiB/s). 6581952 triples (74633.8 triples/s).
#read 1100 MiB (10.6962 MiB/s). 7239014 triples (70391 triples/s).
#read 1190 MiB (10.2295 MiB/s). 7827255 triples (67284.9 triples/s).

So looking again at the hash, and adding a bit-swizzle to it. This improves speed, but still has a fall-off around 700 MiB:
#read 500 MiB (17.9469 MiB/s). 3122718 triples (112086 triples/s).
#read 600 MiB (17.9319 MiB/s). 3807691 triples (113798 triples/s).
#read 700 MiB (17.6545 MiB/s). 4495263 triples (113374 triples/s).
#read 800 MiB (15.2149 MiB/s). 5195900 triples (98818.9 triples/s).
#read 900 MiB (13.5074 MiB/s). 5890576 triples (88407.3 triples/s).
#read 1000 MiB (12.3533 MiB/s). 6581952 triples (81308.9 triples/s).
#read 1100 MiB (11.7521 MiB/s). 7239014 triples (77339.9 triples/s).
#read 1200 MiB (11.1669 MiB/s). 7894748 triples (73466.9 triples/s).

Changing to a higher loading, to see if the swizzle gives a decent spread of the hashes.

50% loading:
#read 30 MiB (1.13507 MiB/s). 189610 triples (7174.04 triples/s).
uses 100% cpu

25% loading:
#read 100 MiB (17.1821 MiB/s). 640108 triples (109984 triples/s).
but uses 100% cpu and falls off after 200:
#read 200 MiB (14.0746 MiB/s). 1223296 triples (86087 triples/s).
#read 300 MiB (11.6414 MiB/s). 1781925 triples (69147.3 triples/s).
#read 400 MiB (8.91663 MiB/s). 2444720 triples (54496.7 triples/s).
#read 500 MiB (6.94444 MiB/s). 3122718 triples (43371.1 triples/s).

12.5% loading is above; at 6.75% loading, although the process clock still reads high throughputs, cpu usage plummets and real time increases past 700 MiB as it starts to use virtual memory:
#read 730 MiB (15.0889 MiB/s). 4704741 triples (97245.6 triples/s).
real 4m5.772s  user 0m26.958s  sys 0m22.037s

So we keep to 12.5% loading, and look into streaming the strings to disk rather than keeping them in main memory, or compressing them, or both.

#read 675 MiB (18.9873 MiB/s). 4320159 triples (121523 triples/s). strings: 2097153
unique strings/tuple = 2097153/4320159 ~= 1/2
I was expecting fewer. Hmm. So the options are:
. code my own disk cache,
. compress using a dumb scheme, such as lzw,
. compress using a trie,
. compress by searching for RDF namespaces

Found some lzw code on the web, but it's very MFC contaminated. Decontaminated it, but introduced some bugs. It's now 2007-04-07 07:00, and 'Today' is starting, which is well past bed-time.

...

It's now 2007-04-07 13:50, and I got up in time for 'From Our Own Correspondent', had a shower, listened to 'The Now Show' and had some lunch and a coffee, then read some blogs. I may be insane but at least I'm hygienic.

A thought - since we have quads not triples, we can do named graphs or concept-graph style containment.

If the compressed file with the 10 Mi triples is 160 MiB, then all the strings should fit into my laptop's 512 MiB memory. There's not a lot of repetition in the individual strings, so the compression needs to be across strings. There are also several copying operations when loading - const char* sql_reader::read_quoted_string () copies from the buffer it uses to read from the file to the buffer it returns. That buffer is then copied when it's passed to a std::string constructor, which is then copied when it's added to the table. Also the length of the quoted string is lost, and so it has to be scanned again. It's looking like I need a different string implementation, which will intern the strings directly from the buffer.

--

2007-04-07 16:11 Got the lzw compression to work with strings; now changing to operate on char* buffers.
The test case works with strings, so wrote a simple 'read in a block and compress it, repeat' benchmark:
$ time lzw_bench triples.sql 15
read: 15 MiB, compressed to: 2 MiB.
real 0m9.763s  user 0m9.161s  sys 0m0.060s
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 18 MiB.
real 2m15.826s  user 2m9.788s  sys 0m1.164s
$ time readbench triples.sql 150
real 0m5.961s  user 0m0.008s  sys 0m0.284s

So it has a good enough compression ratio - our string table should fit in ~300 MiB - but it's very slow.

Replacing the lzw::dictionary::element struct with a bit pattern:
$ time lzw_bench triples.sql 15
real 0m8.592s  user 0m7.588s  sys 0m0.104s
$ time lzw_bench triples.sql 150
real 1m52.598s  user 1m47.675s  sys 0m0.792s

Replacing the recursive write_bytes method with an iterative write-forward-and-reverse one - maybe a little; it doesn't affect the compression though.

Changing the lookup from:
int dictionary::get_entry (int prefix, char letter) {
    return _lookup[(prefix << 8) | letter] - 1;
}
to make use of the fact that all keys less than 128 map to themselves:
int dictionary::get_entry (int prefix, char letter) {
    unsigned int key((prefix << 8) | letter);
    return (key < 128) ? key : (_lookup[key] - 1);
}
$ time lzw_bench triples.sql 15
real 0m7.853s  user 0m7.372s  sys 0m0.088s

Still only seem to be chipping at the edges though. Turning the compression off, by writing the bytes passed into compress() straight out using compress_data(), gives fast throughput:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 150 MiB.
real 0m7.344s  user 0m2.184s  sys 0m0.164s

So it's this part that's slow:
// check if the prefix and current byte exist in the dictionary
int result(_dictionary.get_entry(prefix, current));
// either add to the dictionary, or change the prefix to the one in the dictionary
if (result == not_in_dictionary) {
    calculate_bit_size(_dictionary.add_entry(prefix, current));
    compress_data(destination, written, limit, prefix);
    prefix = source[index];
} else {
    prefix = result;
}

In the absence of a profiler, attack it by binary-search commenting-out of lines.

With int result(_dictionary.get_entry(prefix, current)); uncommented, and the rest commented out:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 150 MiB.
real 0m16.293s  user 0m12.013s  sys 0m0.132s
With everything working except _dictionary.add_entry(prefix, current):
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 150 MiB.
real 0m20.237s  user 0m16.901s  sys 0m0.136s
From above, the complete code takes:
$ time lzw_bench triples.sql 150
real 1m52.598s  user 1m47.675s  sys 0m0.792s

So we have these approximate costs:
_dictionary.get_entry(prefix, current)   16.293s - 7.344s  =>  9s
calculate_bit_size()                     20.237s - 16.293s =>  4s
_dictionary.add_entry                    112.598s - 20.237s => 90s

Obviously, the main cost is in the add_entry function:
int dictionary::add_entry (int prefix, char letter) {
    unsigned int elem(element(prefix, letter));
    _elements.push_back(elem);
    _lookup[elem] = _elements.size();
    return _elements.size() - 1;
}

Commenting out the push_back and the map operator[] calls gives 21s runtime.
Commenting out just the push_back call gives 14s runtime.
Commenting out just the map operator[] call gives 14s runtime.
So there's some feedback between the two. Removing the map and using a linear probe of the _elements vector kills performance:
$ time lzw_bench triples.sql 15
read: 3 MiB, compressed to: 0 MiB.
real 1m11.540s
Alternatives are a hash table or a trie.
...
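(Of those alternatives, a flat open-addressed hash table keyed on the packed (prefix, letter) value might look like this sketch; the sizes, names and multiplicative hash are assumptions:)

#include <cstdint>
#include <vector>

// open-addressed lookup for lzw: key is (prefix << 8) | letter,
// value is the code assigned to that (prefix, letter) pair.
class lzw_lookup {
    std::vector<uint64_t> _slots;  // packed key:32 | (code + 1):32, 0 = empty
    size_t _mask;
public:
    explicit lzw_lookup (size_t log2_size)
        : _slots(size_t(1) << log2_size), _mask((size_t(1) << log2_size) - 1) {}
    int get (uint32_t key) const {
        size_t slot(hash(key) & _mask);
        while (_slots[slot]) {
            if (uint32_t(_slots[slot] >> 32) == key)
                return int(_slots[slot] & 0xffffffff) - 1;
            slot = (slot + 1) & _mask;
        }
        return -1;  // not_in_dictionary
    }
    void add (uint32_t key, int code) {
        size_t slot(hash(key) & _mask);
        while (_slots[slot]) slot = (slot + 1) & _mask;
        _slots[slot] = (uint64_t(key) << 32) | uint32_t(code + 1);
    }
    static size_t hash (uint32_t key) {
        key *= 2654435761u;          // Knuth multiplicative mix
        return key ^ (key >> 16);
    }
};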
2007-04-07 18:51 After having tea, it occurred to me that it may be that some of the bit sizes are less efficient, and that's what's hitting it. Changing compression::calculate_bit_size to always use 13 bits had no effect on performance, but did uncover a buffer overflow bug where destination[written] = 0; was called when written == limit. Normally in the tests the destination limit is not reached, as the compressed data is no larger than the original.

Changing get_entry to return -1 for all keys > 128 gives:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 150 MiB.
real 0m13.413s  user 0m7.368s  sys 0m0.744s

So what costs is when get_entry returns an index for a key with a prefix:

performing the lookup and returning -1:
return (key < 128) ? key : (_lookup[key], - 1);
read: 150 MiB, compressed to: 150 MiB.
real 0m14.667s  user 0m9.173s  sys 0m0.692s

performing the lookup and returning its value - 1:
return (key < 128) ? key : (_lookup[key] - 1);
gives
read: 150 MiB, compressed to: 18 MiB.
real 1m50.886s  user 1m45.375s  sys 0m0.788s

Using the proper lookup, and replacing
prefix = result;
with
compress_data(destination, written, limit, prefix);
prefix = source[index];
in compress gives fast throughput again - there's something going on that makes prefixes slow the system down:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 150 MiB.
real 0m7.867s  user 0m3.864s  sys 0m0.184s

Performing the next lookup on the prefix increases the cost:
$ time lzw_bench triples.sql 150
real 0m27.809s  user 0m22.641s  sys 0m0.160s
and doing all the lookups without writing costs much the same as with the output:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 0 MiB.
real 1m56.095s  user 1m46.143s  sys 0m0.584s

First cut at using a trie: it runs out of memory and starts thrashing after 100 MiB, but is very fast before that:
read: 83 MiB, compressed to: 10 MiB.
real 0m17.360s  user 0m12.861s  sys 0m0.656s
Reducing the default size of a trie node to 4 children keeps it just under the memory limit for 150 MiB:
read: 150 MiB, compressed to: 14 MiB.
real 0m27.394s  user 0m21.805s  sys 0m0.656s

That's fast enough, but we're using all our memory for the trie now. So there are two ways to go from here - reduce the size of the trie, and/or limit the size of the dictionary. Adding a max_size parameter to the dictionary, also passed in to the benchmark, gives a range of speed/compression trade-offs:
$ time lzw_bench triples.sql 150 1024
read: 150 MiB, compressed to: 67 MiB.
real 0m11.500s  user 0m7.444s  sys 0m0.148s
$ time lzw_bench triples.sql 150 10240
read: 150 MiB, compressed to: 48 MiB.
real 0m10.626s  user 0m6.444s  sys 0m0.144s
$ time lzw_bench triples.sql 150 102400
read: 150 MiB, compressed to: 38 MiB.
real 0m8.533s  user 0m6.756s  sys 0m0.108s
$ time lzw_bench triples.sql 150 1024000
read: 150 MiB, compressed to: 24 MiB.
real 0m15.300s  user 0m10.981s  sys 0m0.244s
$ time lzw_bench triples.sql 150 10240000
read: 150 MiB, compressed to: 14 MiB.
real 0m25.376s  user 0m21.685s  sys 0m0.656s

So somewhere between 1 million and 10 million entries there's a 1:10 compression ratio, and a spot where we should be able to get the trie and the compressed data all into memory at once.

---

2007-04-07 20:44 Performance budgeting:
From above, importing the triples (before it starts thrashing) gives:
$ time sqlimport triples.sql 150 > triples.int.150
real 0m15.992s  user 0m9.289s  sys 0m5.408s
So the cost of the compression is about the same as the parsing.
That's a reasonable starting point - a system which is pants in all respects has no bottlenecks ;->.

---

Adding some metrics to see what the trie's up to - number of leaf nodes:

total nodes: 1024
leaf nodes: 442
nodes with 1 children: 371
nodes with 2 children: 129
nodes with 3 children: 61
nodes with 4 children: 21
nodes with 128 children: 1

total nodes: 10240
leaf nodes: 2923
nodes with 1 children: 5627
nodes with 2 children: 910
nodes with 3 children: 455
nodes with 4 children: 325
nodes with 128 children: 1

total nodes: 102400
leaf nodes: 31095
nodes with 1 children: 53379
nodes with 2 children: 9112
nodes with 3 children: 4587
nodes with 4 children: 4227
nodes with 128 children: 1

total nodes: 1024000
leaf nodes: 306840
nodes with 1 children: 538649
nodes with 2 children: 91969
nodes with 3 children: 44883
nodes with 4 children: 41659
nodes with 128 children: 1

total nodes: 5618083
leaf nodes: 1661616
nodes with 1 children: 2995538
nodes with 2 children: 491091
nodes with 3 children: 239117
nodes with 4 children: 230721
nodes with 128 children: 1

Most of the nodes only have one child, and a significant number are leaves. Only the root has more than four children. The current default allocation of 4 children can be changed to zero, one or two.

With trie_nodes defaulting to space for zero children it crashes. There was a bug which was causing the trie_nodes never to reallocate:
trie_node* get (char value) {
    return _data[value&_mask];
}
so a node always appeared to have a child, hence no nodes with more than four children. Changed to test whether the value matches, rather than just the hash:
trie_node* get (char value) {
    trie_node* node(_data[value&_mask]);
    return (node && (node->_value == value)) ? node : null;
}
And we get a much more sensible distribution of nodes (I should have noticed a smell in the above results):

$ time lzw_bench triples.sql 150 10485760
...
read: 150 MiB, compressed to: 18 MiB.
total nodes: 7191955
leaf nodes: 2734838
nodes with 1 children: 3587938
nodes with 2 children: 406975
nodes with 3 children: 166758
nodes with 4 children: 87294
nodes with 5 children: 52989
nodes with 6 children: 34981
nodes with 7 children: 25054
nodes with 8 children: 19219
nodes with 9 children: 16143
nodes with 10 children: 11744
nodes with 11 children: 7720
nodes with 12 children: 5879
nodes with 13 children: 4823
nodes with 14 children: 4160
nodes with 15 children: 3742
nodes with 16 children: 4006
nodes with 17 children: 2907
nodes with 18 children: 2680
nodes with 19 children: 1914
nodes with 20 children: 1318
nodes with 21 children: 973
...
nodes with 39 children: 114
nodes with 40 children: 93
...
nodes with 73 children: 10
nodes with 74 children: 5
...
nodes with 255 children: 1
real 0m50.855s  user 0m28.898s  sys 0m1.036s

Now we can look at more meaningful compression ratios:
$ time lzw_bench triples.sql 150 262144
read: 150 MiB, compressed to: 66 MiB.
real 0m15.615s  user 0m10.325s  sys 0m0.196s
$ time lzw_bench triples.sql 150 1048576
read: 150 MiB, compressed to: 55 MiB.
real 0m16.737s  user 0m12.669s  sys 0m0.256s
$ time lzw_bench triples.sql 150 4194304
read: 150 MiB, compressed to: 27 MiB.
real 0m24.691s  user 0m19.401s  sys 0m0.524s
$ time lzw_bench triples.sql 150 67108864
read: 150 MiB, compressed to: 18 MiB.
real 0m34.298s  user 0m26.902s  sys 0m1.052s

Though the next phase really should be testing whether we can decompress our compressed strings (lzw_test only tests one short string).
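(A sketch of the kind of round-trip check lzw_test needs - compress a block, decompress it, and compare with the original; the codec function signature here is an assumption standing in for the real compression class:)

#include <cstring>
#include <vector>

// 'compress' and 'decompress' stand in for the real lzw implementation;
// each returns the number of bytes written to dest.
typedef size_t (*codec_fn) (char* dest, size_t limit, const char* src, size_t size);

bool round_trip (codec_fn compress, codec_fn decompress, const char* data, size_t size) {
    std::vector<char> packed(size * 2 + 16);   // lzw can expand in the worst case
    std::vector<char> restored(size);
    size_t c(compress(&packed[0], packed.size(), data, size));
    size_t d(decompress(&restored[0], restored.size(), &packed[0], c));
    return d == size && std::memcmp(data, &restored[0], size) == 0;
}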
However, let's see if we're anywhere near close to getting the string table in memory:
$ time lzw_bench triples.sql 1750 1048576
read: 1715 MiB, compressed to: 614 MiB.
real 3m35.814s  user 2m19.977s  sys 0m2.736s
1048576 nodes uses 65 MiB when running + 614 MiB data; still just too big.
$ time lzw_bench triples.sql 1750 4194304
read: 1715 MiB, compressed to: 513 MiB.
real 3m53.596s  user 2m49.559s  sys 0m3.048s
4194304 nodes uses 275 MiB when running + data; still too big. Costs 210 MiB to compress by an extra 100 MiB, so try scaling the other way:
$ time lzw_bench triples.sql 1750 262144
read: 1715 MiB, compressed to: 754 MiB.
real 2m55.980s  user 1m51.615s  sys 0m2.456s
262144 nodes uses 18 MiB when running, and 754 MiB data.

There are two ways from here - use unions in the nodes to reduce the footprint of each, and/or work out how to unload chunks of the string table. I can see a lot of configuration parameters happening.

...

So we have a budget as follows: around 512 MiB to work with. A 10 mega-triple store takes 160 MiB uncompressed. Need at least two images to work with, so that's 320 MiB. There's not a lot of point using more than 32 MiB for the scaffolding for the compression. That leaves 160 MiB for a string table. So only 1/5 of the strings corresponding to the triples can be loaded at a go. That's a problem for generating queries, so we need a stable means of getting a unique id from a string, so we can look up the right page of the table on disk and compare. Since we have to use the disk, is it worth compressing the table in memory? Well, it means we can have 2 or 3 times as many live strings in the MRU cache, and I'm not sure compressing costs much once you're hashing and comparing - comparing two strings costs less if both are compressed using the same dictionary. And I have spent a day getting the compression to work. Ok, let's create a dictionary from the triples.

2007-04-07 23:05 ... brief rest reading blogs.
...
2007-04-08 00:12 went to bed
...
2007-04-08 01:48 booted up again

Thinking about how to get the most from a 32 MiB trie. From the bottom, allocate trie nodes of fixed size. From the top, allocate arrays of children. Each trie_node is 64 bits:
- 32 bits: the prefix and value, packed 24:8.
- 32 bits: a 1:23:8 packing - either flag = 0, with the offset and value of a single child, or flag = 1, with the offset of the children array from the top of the store and the size of the child array.
Each element in a child array is a 24:8 packing of the offset of the child and the value of the child. Could use operator new, but since everything is bit-twiddled it's not much of an object.

...

Got a version similar to the above working:
$ time lzw_bench triples.sql 150
read: 150 MiB, compressed to: 34 MiB.
real 0m11.281s  user 0m7.688s  sys 0m0.224s
Seems to be a bit faster.
$ time lzw_bench triples.sql 1750
read: 1715 MiB, compressed to: 530 MiB.
real 2m38.805s  user 1m22.865s  sys 0m2.372s
That stayed on 32.1 MiB usage when running, which was as designed, with similar compression to the 4194304 node case but with less memory, and faster than the 262144 node case with better compression.

2007-04-08 05:41 - really time for bed now.
...
2007-04-08 12:38 Reinstating metrics for the new form of trie:

$ time lzw_bench triples.sql 150 metrics
read: 150 MiB, compressed to: 34 MiB.
total nodes: 1644364
leaf nodes: 630124 (38%)
nodes with 1 children: 820836 (or fewer: 1450960, 88%)
nodes with 2 children: 90637 (or fewer: 1541597, 93%)
nodes with 3 children: 35347 (or fewer: 1576944, 95%)
nodes with 4 children: 18106 (or fewer: 1595050, 97%)
...
nodes with storage for 0 children: 630044 (or fewer: 630044, 38%)
nodes with storage for 1 children: 820836 (or fewer: 1450880, 88%)
nodes with storage for 2 children: 49620 (or fewer: 1500500, 91%)
nodes with storage for 4 children: 42790 (or fewer: 1543290, 93%)
nodes with storage for 8 children: 37086 (or fewer: 1580376, 96%)
nodes with storage for 16 children: 26245 (or fewer: 1606621, 97%)
nodes with storage for 32 children: 28332 (or fewer: 1634953, 99%)
nodes with storage for 64 children: 5704 (or fewer: 1640657, 99%)
nodes with storage for 128 children: 3627 (or fewer: 1644284, 99%)
table is full
storage for nodes: 39%
storage for child lists: 60%
total size of child lists: 2743424
non-empty slots in lists: 823395
list efficiency: 30%
list fragmentation: 53%

List fragmentation is the difference between heap usage and the total length of the lists. List efficiency is the ratio of non-empty elements in the lists to the total length of the lists. To reduce fragmentation, a copying collector could be used, or the block size increased. To increase efficiency, small lists could use a linear probe.

88% of the nodes have zero or one child, so 60% of the space is being used to store the child lists of 12% of the nodes. Having space on a default node for 3 children means 88% of the nodes would use more space, and 7% would use less (I think), so that's not a good trade-off.

2007-04-08 13:40 - going to my sister's for the afternoon.
...
2007-04-09 21:28 - back from my sister's, having fixed her sofa.

Taking a quick look at http://www.gzip.org/algorithm.txt to see how gzip works, since gz gives 10:1 compression whereas I'm only getting 3:1 - it still needs a trie, but it Huffman encodes the output. GZ isn't compressing the whole file - it works on 32K chunks. It also occurred to me that creating a new node on the 2nd hit rather than the first may improve things - the node id isn't used ...

Changed to use
_dictionary.add_entry(prefix, current);
calculate_bit_size(_dictionary.max_code());
instead of
calculate_bit_size(_dictionary.add_entry(prefix, current));
when compressing. This uncovered a bug: since _dictionary.add_entry returned -1 if there's no more room, the bit size ended up being 8 some of the time, instead of 22 or so. Fixing that has no effect on the compression ratio though:
read: 150 MiB, compressed to: 34 MiB.

Changing to jump from one child to eight, with a linear scan rather than masking the least bits, gives more nodes:
$ time lzw_bench triples.sql 150 true
...
read: 150 MiB, compressed to: 41 MiB.
total nodes: 2153348
leaf nodes: 1118071 (51%)
...
real 0m11.867s  user 0m7.796s  sys 0m0.172s
...

Added larger tests to lzw_test. The decoder has drifted a little, with the larger tests showing up what happens when the number of bits to encode the prefix goes up, so tidied that up.

2007-04-10 01:27 - bed time.

The decoder fails when the compression has written the id of the last added code in the dictionary. This should not get written until after the next code is written, as it's not added to the decoder's dictionary until the first character following the previous code is known. Added extra logic to the compression to handle that case.

Added a feature to allow the dictionary size to be changed again. This exposed a bug - it was only stopping adding nodes once 2 * _node_count was greater than _heap_top, which was supposed to be the index of the top of the heap of child lists.
Unfortunately, _heap_top was actually the last index find_free_heap returned, which wasn't necessarily the lowest index on the heap, so the heap lists could get corrupted. The add_entry code was changed to check whether the value at the index was the value free slots were initialised to: -1 - which cannot be a valid prefix and letter combination, as prefixes are limited to 23 bit unsigned integers. Doubled the limit of the heap size by rolling the data index right one, allowing half the heap for child lists if the heap is 64 MiB.

With that bug fixed, and lzw_test passing, we get the following effect of altering the heap size:
$ time lzw_bench triples.sql 150 65536
read: 150 MiB, compressed to: 29 MiB.
real 0m13.522s  user 0m9.153s  sys 0m0.312s
$ time lzw_bench triples.sql 150 32768
read: 150 MiB, compressed to: 35 MiB.
real 0m13.351s  user 0m8.333s  sys 0m0.252s
$ time lzw_bench triples.sql 150 16384
read: 150 MiB, compressed to: 54 MiB.
real 0m12.098s  user 0m8.101s  sys 0m0.224s
$ time lzw_bench triples.sql 150 8192
read: 150 MiB, compressed to: 61 MiB.
real 0m12.165s  user 0m8.185s  sys 0m0.188s
$ time lzw_bench triples.sql 150 4096
read: 150 MiB, compressed to: 65 MiB.
real 0m11.686s  user 0m7.660s  sys 0m0.188s
$ time lzw_bench triples.sql 150 1024
read: 150 MiB, compressed to: 150 MiB.
real 0m11.016s  user 0m9.141s  sys 0m0.104s

It's still an order of magnitude or two slower than just reading the data:
$ time readbench triples.sql 150
read 152 MiB.
real 0m0.470s  user 0m0.012s  sys 0m0.216s
$ time lzw_bench triples.sql 1750
read: 1715 MiB, compressed to: 559 MiB.
real 2m45.079s  user 1m34.734s  sys 0m2.312s
13.5 * 1715 / 150 = 154 = 2m34s, so approximately linear.

2007-04-10 23:11 ...

Looking into how well it compresses our triples, which should be more regular than the pure SQL: triple_compression_bench ties a sql_reader to a triple_compressor triple_target:
$ time triple_compression_bench triples.sql 1750
#read 1715 MiB (20.8435 MiB/s). 10423960 triples (126689 triples/s).
#read 10429951 triples.
#read 1716 MiB.
#compressed 462 MiB.
real 2m3.149s  user 1m20.269s  sys 0m2.184s

That means we're using an average of 46 bytes to store the strings associated with a triple, or 15.5 bytes per term. Given the block read takes more than half that real time just for reading, it's not too bad:
$ time readbench triples.sql 1750
real 1m17.170s  user 0m0.024s  sys 0m3.580s

This is much slower than linear compared with readbench 150, because the file is larger than the disk cache, which readbench gets a bonus from for 150 MiB reads:
$ readbench triples.sql 1500 > /dev/null # pushes first part of triples.sql out of cache
$ time readbench triples.sql 160 > /dev/null
real 0m3.355s
$ time readbench triples.sql 160 > /dev/null
real 0m0.329s
$ time readbench triples.sql 160 > /dev/null
real 0m0.238s
$ time readbench triples.sql 320 > /dev/null
real 0m9.515s
$ time readbench triples.sql 320 > /dev/null
real 0m0.690s
$ time readbench triples.sql 320 > /dev/null
real 0m0.763s
$ time readbench triples.sql 640 > /dev/null
real 0m14.261s
$ time readbench triples.sql 640 > /dev/null
real 0m16.484s
$ time readbench triples.sql 640 > /dev/null
real 0m22.929s

I doubt the built-in cache on my HDD is 320 MiB, so I suspect Linux is using any free memory as cache, and the 320 MiB threshold won't be guaranteed once we're using the memory for other stuff.
So it's fast enough, but I'd like better compression - the goal is for the most common URIs to compress to 30 bits, so that one bit of the id indicates URI, one bit indicates compressed, and the remainder holds the string itself, which will then be unique and the same for all strings of that value, giving us the id for free. The current LZW compression implementation doesn't do that - on average it doesn't compress enough, and it can't take advantage of us already having built the compression dictionary during the compression.

Added metrics for the number of strings compressed to 30 bits or fewer, and the compressed size of the remainder:
$ time triple_compression_bench triples.sql 165
#read 160 MiB (22.5035 MiB/s). 998121 triples (140383 triples/s).
#read 160 MiB.
#compressed 25 MiB.
# 1510959 strings compressed to below 30 bits
#the remainder compressed into 22 MiB.
real 0m10.825s  user 0m7.000s  sys 0m0.224s

So for a million triples/(quads including graph id) we have about half the strings compressed into IDs and about half stored in 22 MiB, plus 32 MiB for the compression dictionary and 16 MiB for the quads = 68 MiB. To generate an index should take 10s; to load a compressed store, ~0.2s if it's in the disk cache.

$ time triple_compression_bench triples.sql 1750
#read 1715 MiB (20.4556 MiB/s). 10423960 triples (124332 triples/s).
#read 1716 MiB.
#compressed 462 MiB.
# 11490080 strings compressed to below 30 bits
#the remainder compressed into 433 MiB.
real 1m53.261s  user 1m21.929s  sys 0m2.096s

With the larger dataset, only a third of the strings meet the compression criterion.

Looking at the effect of limiting the compression dictionary to 15 bits (32 KiB), in the hope that we gain more 2-code URIs than we lose by compressing less in general:
$ time triple_compression_bench triples.sql 1750 32
#read 1715 MiB (21.5371 MiB/s). 10423960 triples (130905 triples/s).
#read 1716 MiB.
#compressed 1036 MiB.
# 3052834 strings compressed to below 30 bits
#the remainder compressed into 1031 MiB.
real 2m1.158s  user 1m17.677s  sys 0m2.020s
That didn't work. Trying with the largest possible table for the implementation - 64 MiB:
$ time triple_compression_bench triples.sql 1750 65536
#read 1715 MiB (19.993 MiB/s). 10423960 triples (121520 triples/s).
#read 1716 MiB.
#compressed 402 MiB.
# 14177540 strings compressed to below 30 bits
#the remainder compressed into 365 MiB.
real 2m3.871s  user 1m23.633s  sys 0m2.420s
So we've spent 32 MiB and gained 433-365 = 68 MiB.
...
Changing the threshold to 62 bits doesn't capture very many more:
# 16369289 strings compressed to below 62 bits
#the remainder compressed into 353 MiB.
...
Finding the average length of the extra-size strings:
10423960 triples => 31271880 strings, 31271880 - 14177540 = 17094340 long strings
365 * 1024 * 1024 / 17094340 = 22.4 average length.
...
32 MiB for the dictionary, 160 MiB for 10 million quads, 433 MiB for longer strings = 625 MiB
64 MiB for the dictionary, 160 MiB for 10 million quads, 365 MiB for longer strings = 589 MiB

So we're in the ballpark for building an in-memory 10 mega-triple store that imports from a text format in 2 minutes. A giga-store would take 3 1/2 hours to import, assuming it's linear.
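(Going back to the id layout at the top of this entry, a sketch of the tag-bit packing being aimed at; the exact bit positions are my assumption - one URI bit, one 'compressed' bit, 30 bits of payload:)

#include <cstdint>

// 32 bit term id: top bit set if the term is a URI, next bit set if the
// remaining 30 bits are the compressed string itself (so equal strings get
// equal ids for free); otherwise the low 30 bits index the string table.
const uint32_t uri_bit        = 0x80000000u;
const uint32_t compressed_bit = 0x40000000u;
const uint32_t payload_mask   = 0x3fffffffu;

inline uint32_t make_compressed_id (bool is_uri, uint32_t compressed_bits) {
    return (is_uri ? uri_bit : 0) | compressed_bit | (compressed_bits & payload_mask);
}
inline bool is_compressed (uint32_t id) { return (id & compressed_bit) != 0; }
inline uint32_t payload (uint32_t id)   { return id & payload_mask; }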
Finding string collisions requires hashing each long string twice. If a third of the strings have ids from compression so don't need hashing, then on average there needs to be hashing of two 22.5 byte strings per triple. Using 30 bit hashes, collisions can be detected using 128 MiB, or double-hash collisions in 256 MiB, then a second scan of all strings with colliding hashes can be made. The problem is that that's only efficient where collisions are rare. A 20-bit hash table would take 4 MiB empty, plus 16 bytes per entry (id:32, next:32, file offset:64), around 280 MiB.

One question is whether we can make streaming a sensible option for larger queries - scanning a database using a sustained read of 1280 MiB/minute is 80 million quads/minute, or say 1 mega-quad/s, so the top performance to query a full giga-store serially would be 15 mins for a query. Which is quite a long time. For that scale you'd need indexing, or at least partitioning, of the store. Not that you'd want to have it all in the one 64 gigabyte file anyway, but you'd need to partition into N^4 parts to get a factor of N improvement when scanning on just one value in the quad. For example, partition into 4096 mega-bases, based on three bits of the id in each quad, then create four indices of each mega-base, one per term in the quad. Here's a question - is it faster to have null entries, so each mega-base is a power of two in size and distributed as closely as possible to having the indexed term be an index into the file? nulls help. Indexing and partitioning aren't hard to scale, as long as ids can be shared. What matters is ensuring equal values have the same id, and non-equal values have different ones.

...

Created a simple hash table for strings. Discovered that we can't decompress our compressed strings - the compression was missing off a trailing partial byte. Now with elimination of duplicate strings:

# 64 MiB dictionary
$ time triple_compression_bench triples.sql 1760 65536
#read 1716 MiB.
#compressed 401 MiB.
# 14148062 strings compressed into 30 bits
#the remainder, compressed into 141 MiB.
real 2m12.344s  user 1m27.329s  sys 0m2.868s
# 32 MiB dictionary
$ time triple_compression_bench triples.sql 1760
#read 1716 MiB.
#compressed 478 MiB.
# 11500011 strings compressed into 30 bits
#the remainder, compressed into 153 MiB.
real 2m4.132s  user 1m26.109s  sys 0m2.720s
# 16 MiB dictionary
$ time triple_compression_bench triples.sql 1760 16384
#read 1716 MiB.
#compressed 496 MiB.
# 11766306 strings compressed into 30 bits
#the remainder, compressed into 156 MiB.
real 2m6.607s  user 1m23.161s  sys 0m2.744s

Some points:
- at the moment the storage for the long strings is fixed at 360 MiB. This should be dynamic up to a limit, otherwise we can't use it for anything else.
- the default size of 32 MiB seems good. We may save a little with 16 MiB.
- need to count occurrences next, to see if it's worth Huffman coding them.
- at the end of a run, the system has created a list of around 10 million triples in memory, where each term in the triple is an id, from which the original triples can be reconstructed, using 160 MiB (triples), 32 MiB (dictionary) and 156 MiB (long strings) = 348 MiB. Which leaves enough for a second 160 MiB result table for creating indices.
- I've no idea how this will scale between stores, or even if it has to.
...
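(For reference, a sketch of the chained layout the 20-bit hash table numbers above assume - 2^20 bucket heads, 4 MiB empty, plus a 16 byte entry per string; the names are assumptions:)

#include <cstdint>
#include <vector>

// 20 bit hash table for long strings: bucket heads plus one 16 byte entry
// per string (id:32, next:32, file offset:64).
struct string_entry {
    uint32_t id;      // term id
    uint32_t next;    // index of the next entry in this bucket, 0 = end
    uint64_t offset;  // offset of the string in the on-disk string file
};

struct string_table {
    std::vector<uint32_t> heads;        // 1 << 20 bucket heads, 0 = empty
    std::vector<string_entry> entries;  // entries[0] unused so 0 can mean 'none'
    string_table () : heads(1u << 20), entries(1) {}
    void add (uint32_t hash, uint32_t id, uint64_t offset) {
        uint32_t bucket(hash & ((1u << 20) - 1));
        string_entry e = { id, heads[bucket], offset };
        entries.push_back(e);
        heads[bucket] = uint32_t(entries.size() - 1);
    }
};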
Looking at in-place sort - merge sort claims to be, but I can't get it to work right; it seems to want to recurse over the swapped values:

[|0 1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ]
[ 0|1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ]
[ 0 0|2 3 4 5 6 7 1,1 2 3 4 5 6 7 ]
[ 0 0 1|3 4 5 6 7 1 2,2 3 4 5 6 7 ]
[ 0 0 1 2|4 5 6 7 1 2 3,3 4 5 6 7 ]
[ 0 0 1 2 3|5 6 7 1 2 3 4,4 5 6 7 ]
[ 0 0 1 2 3 4|6 7 1 2 3 4 5,5 6 7 ]
[ 0 0 1 2 3 4 5|7 1 2 3 4 5 5,6 7 ]
[ 0 0 1 2 3 4 5 6,1 2 3 4 5 5 7,7 ]

...
2007-04-11 06:08 bed time.
...
2007-04-11 19:19 back to sort.

If we have a swap buffer of half the size of the sorted set, then we're ok. If the buffer's too small, then there's another N/2 swaps.

[|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7 ] [$_ _ _ _ ]
[ _|1 2 3 4 5 6 7|0 1 2 3 4 5 6 7 ] [ 0$_ _ _ ]
[ _|1 2 3 4 5 6 7 _|1 2 3 4 5 6 7 ] [ 0 0$_ _ ]
[ _|1 2 3 4 5 6 7 _ _|2 3 4 5 6 7 ] [ 0 0 1$_ ]
[ 0 1|2 3 4 5 6 7 _ _|2 3 4 5 6 7 ] [ 0 0 1 1$]
[ 0 0 1 1$_ _ 6 7 _ _|2 3 4 5 6 7 ] [|2 3 4 5 ]
[ 0 0 1 1 2$_ 6 7 _ _|2 3 4 5 6 7 ] [|2 3 4 5 ]
[ 0 0 1 1 2 2$6 7 _ _ _|3 4 5 6 7 ] [:_|3 4 5 ]
[ 0 0 1 1 2 2 3$7 _ _ _|3 4 5 6 7 ] [ 6:_|4 5 ]
[ 0 0 1 1 2 2 3 3$_ _ _ _|4 5 6 7 ] [ 6 7|4 5 ]

[|0 1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ] ? ?
[ 0|1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ] ? ?
[ 0 0|2 3 4 5 6 7,1 1 2 3 4 5 6 7 ] ? ?
[ 0 0 1|3 4 5 6 7,2 1 2 3 4 5 6 7 ] ? ?
[ 0 0 1 1|4 5 6 7,2 3 2 3 4 5 6 7 ] ? ?
[ 0 0 1 1 2|5 6 7,2 3 4 3 4 5 6 7 ] ? ?
[ 0 0 1 1 2 2|6 7,5 3 4 3 4 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3|7,5 6 4 3 4 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3|5 6 4 7 4 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4|6 4 7 5 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4|6 7 5 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4 5|7 6 5 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4 5 5|6 7 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4 5 5 6|7 6 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4 5 5 6 6|7 7 ] ? ?
[ 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 ]

OK, I can see where I was going wrong - I was just tracking left and right, not left, right and insertion point.

Initially, while data[l] <= data[r], ++l, ++i.
if (data[l] > data[r]), swap(i, r), ++r, ++i
else swap(i, l), ++l, ++i.
until r = limit-1 and l = r-1

Try that algorithm:

Initially: [|0 1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ] i=0, l=0, r=8
? ? data[l] <= data[r] => ++l
[ 0|1 2 3 4 5 6 7,0 1 2 3 4 5 6 7 ] i=1, l=1, r=8
? ? data[l] > data[r] => swap(i,r), l := r, ++r
Second phase, i < mid,
[ 0 0|2 3 4 5 6 7,1 1 2 3 4 5 6 7 ] i=2, l=8, r=9
? ? data[l] <= data[r] => swap(i,l);
[ 0 0 1|3 4 5 6 7,2 1 2 3 4 5 6 7 ] i=3, l=8, r=9
? ? data[l] > data[r] => swap(i,r), ++r;
[ 0 0 1 1|4 5 6 7,2 3>2 3 4 5 6 7 ] i=4, l=8, r=10
? ? data[l] <= data[r] => swap(i,l), ++l;
[ 0 0 1 1 2|5 6 7,4>3 2 3 4 5 6 7 ] i=5, l=9, r=10
? ? data[l] > data[r] => swap(i,r), ++r;
[ 0 0 1 1 2 2|6 7,4 3 5>3 4 5 6 7 ] i=6, l=9, r=11
? ? data[l] <= data[r] => swap(i,l); ++l;
[ 0 0 1 1 2 2 3|7,4 6>5 3 4 5 6 7 ] i=7, l=10, r=11
? ? data[l] > data[r] => swap(i,r); ++r;
[ 0 0 1 1 2 2 3 3|4 6>5 7>4 5 6 7 ] i=8, l=10, r=12
New phase: i = mid, so data[i+1] is not guaranteed to be greater than data[i]:
[ 0 0 1 1 2 2 3 3|4 6>5 7>4 5 6 7 ] i=8, l=10, r=12
? ? data[l] > data[r] => swap(i,r); ++r;

After a swap, increment l or r only if the swapped value is greater than the next value. The swapped values are always increasing. Once i > mid, need to check data[i] too. If l <= i, l = i+1; if r <= l, r = l+1. I'm assuming that doing it in-place is worth the extra comparisons.

...
2007-04-11 22:05 merge_sort_test sorts all selections from S8 successfully. Creating merge_sort_bench for sorting 160 MiB of quads. This shows up a bug in the merging: sorting 1024 quads.
[|0142 0482 0652 2347 3176 3346 3795 3969>0259 0270 0448 0525 0796 0823 2215 3797 ] i=0 l=0 r=8
^^^^ ^^^^
[ 0142|0482 0652 2347 3176 3346 3795 3969>0259 0270 0448 0525 0796 0823 2215 3797 ] i=1 l=0 r=8
^^^^ ^^^^
[ 0142 0259|0652 2347 3176 3346 3795 3969>0482>0270 0448 0525 0796 0823 2215 3797 ] i=2 l=8 r=9
^^^^ ^^^^
[ 0142 0259 0270|2347 3176 3346 3795 3969>0482 0652>0448 0525 0796 0823 2215 3797 ] i=3 l=8 r=10
^^^^ ^^^^
[ 0142 0259 0270 0448|3176 3346 3795 3969>0482 0652 2347>0525 0796 0823 2215 3797 ] i=4 l=8 r=11
^^^^ ^^^^
[ 0142 0259 0270 0448 0482|3346 3795 3969>3176>0652 2347>0525 0796 0823 2215 3797 ] i=5 l=9 r=11
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525|3795 3969>3176>0652 2347 3346>0796 0823 2215 3797 ] i=6 l=9 r=12
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652|3969>3176 3795>2347 3346>0796 0823 2215 3797 ] i=7 l=10 r=12
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796|3176 3795>2347 3346 3969>0823 2215 3797 ] i=8 l=10 r=13
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823|3795>2347 3346 3969>3176>2215 3797 ] i=9 l=10 r=14
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215|2347 3346 3969>3176 3795 3797 ] i=10 l=10 r=14
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215 2347|3346 3969>3176 3795 3797 ] i=11 l=11 r=14
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215 2347 3346|3969>3176 3795 3797 ] i=12 l=12 r=14
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215 2347 3346>3176|3969>3795 3797 ] i=13 l=13 r=14
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215 2347 3346>3176 3795|3969>3797 ] i=14 l=14 r=15
^^^^ ^^^^
[ 0142 0259 0270 0448 0482 0525 0652 0796 0823 2215 2347 3346>3176 3795|3797 3969 ] i=14 l=15 r=16
^^^^ ^^^^

The 3176 value which gets swapped to mid and then again to 13 isn't in increasing order with 3969, so it invalidates one of the assumptions of the above merging algorithm. When i = mid, we've got three lists to merge: 3176 3795; 2347 3346 3969; 0823 2215 3797 - whereas the above merges two. The code for a source/target merge would be much simpler.

...

Got a to-and-fro merge sort which works on a set of pseudo-random quads:
$ time merge_sort_bench 10240
10485760 quads. 0 1 2 3 4 5 6 7
sorting: sorted quads.
checking: done. 0 1 2 3 4 5 6 7
real 0m11.000s  user 0m9.813s  sys 0m0.556s
$ time merge_sort_bench 1024
1048576 quads. 0 1 2 3 4 5 6 7
sorting: sorted quads.
checking: done. 0 1 2 3 4 5 6 7
real 0m1.388s  user 0m1.208s  sys 0m0.072s

Adding output of the quads to disk doesn't affect the run time. Reading the quads in again from cold disk takes 75% as long as creating them:
$ time readbench sorted.quads
read 160 MiB.
real 0m7.688s  user 0m0.000s  sys 0m0.324s

So can we tune the sort enough to make it not worth creating indices? Except that we can binary search on file without loading it all.

2007-04-12 00:46 bed time. sensible bed time for once!
...
2007-04-12 01:40 maybe not.

Changing quad_sort to use aligned arrays:
$ time merge_sort_bench 10240
real 0m10.577s  user 0m9.601s  sys 0m0.480s
Not significant. Simplifying the operations that access the term in question, data[start+(l<<2|term)], to initialise pointers to data + ...
and then increment them in the loops shaves a half second or so:
$ time merge_sort_bench 10240
real 0m9.983s  user 0m9.005s  sys 0m0.500s
Using SSE2 operations to copy:
$ time merge_sort_bench 10240
real 0m9.621s  user 0m8.693s  sys 0m0.452s
Using SSE2 operations to compare:
$ time merge_sort_bench 10240
real 0m9.950s  user 0m8.993s  sys 0m0.436s
Using SSE2 operations to compare and bit ops to avoid the branch:
$ time merge_sort_bench 10240
real 0m10.085s  user 0m9.061s  sys 0m0.544s
For this, it seems it's best just to use the SSE2 for copying:
$ time merge_sort_bench 10240
real 0m9.815s  user 0m8.857s  sys 0m0.464s

I can't think of an easy way of parallelising the sort to SIMD, so I'll leave it for now. 9.8s at 2.2GHz for 24 passes over 10 Mi quads = 85 clocks per item visited:
(9.8*2.2e9)/(24*10*1024*1024) = 85.6717428
At the minimum of 1 visit per item, that's 40ns per item, or 408ms for an index of a 10 million triple store.

2007-04-12 03:28 ...

There's been a follow-up post by Joe Gregorio, which points to mulgara:
http://mulgara.org/confluence/display/dev/%28Im%29Perfect+Indexes

Thinking about indexing next. It's noted there that to generate indices where all quads matching two terms are contiguous, you need an index for each selection of two terms from four, ie six indices. S, P, O and G denote the four terms in the quad - subject, predicate, object and graph. For one or three term queries, four indices suffice - for one term, one index for each of SPOG; for three terms, one index sorted on the three terms in any order. For a query of four terms, any index will have the matches contiguous. So for quads, six indices will give you an O(log(N)) match on any single part of a query, as long as each of SPOG appears first and last in at least one of the orderings. mulgara's indices don't have that property, so can't be used for three-term queries. On the other hand, they do have an SPOG index, which is the most likely variant of three terms. So a selection of:

SPOG -> S, SP, SPO matches contiguous, SP matches ordered by O over G.
POSG -> P, PO matches contiguous, PO matches ordered by S over G.
OSPG -> O, SO matches contiguous, SO matches ordered by P over G.
GSPO -> G, SG, SPG matches contiguous
GPOS -> PG, POG matches contiguous
GOSP -> OG, SOG matches contiguous

might be used.

Based on the assumptions I've made about what a node in a distributed array needs to provide, ie that 32 bit IDs are enough, and how well the sample quads compress, these aren't really indices, as a reference into a file takes as much space as the data itself. I'd expect that most of the time you'd want the results around a single subject, particularly if you have lots of blank nodes, ie you'd be using one ordering to get ?PO_, then use SP?_ (? desired term, _ don't care, SPOG specified value).

Taking the example query from http://xtech06.usefulinc.com/schedule/paper/18 :

SELECT ?pub ?title ?issue ?article WHERE {
    ?title rdf:type struct:Title .
    ?title dc:identifier .
    ?issue prism:isPartOf ?title .
    ?issue prism:volume "20" .
    ?issue prism:number "4" .
    ?article prism:isPartOf ?issue .
    ?article prism:startingPage "539" .
}

An optimiser may first look at the POGS and OGSP orderings to determine the spans of the predicates and literals, to decide which patterns to match first. Either you can build a topological sort and start with finding ?title candidates, then ?issue candidates, then ?article, or you can merge the results of all matches with one ungrounded term. The thing about this index approach is that this pattern:
?title rdf:type struct:Title .
would use the POSG index, and the result would be far from the result for ?title dc:identifier . This lack of locality, I think, is going to be the pain point for this kind of normalised store. Whether it's an index with pointers into a larger store, or an ordering of a compressed form of the quads, you've got a cache (disk or memory) miss, which is why I'm trying to implement this without chasing pointers.

Wrote blog as it's Thursday.

2007-04-13 00:25 ...

Looked a bit into Linux's off64_t, large file support and B-trees. Putting a call to posix_fadvise before the read in file_reader doesn't help; we're reading sequentially in blocks anyway.

2007-04-13 01:30 ...

2007-04-14 16:02

Thinking about how to improve raw indexing, assuming 2N working storage. First thought: use radix 4 sorting, growing lists of sorted values and indices into the original data from the top and bottom of each half of the temporary storage space. This has the disadvantage of reversing the previous order of half the data.

For indexing b bit values, a radix sort with radix r = 2^a visits each item b/a times. With N temporary storage, you can radix sort N/r items into r simple buckets in the temporary storage. This may incur an extra visit to copy back, if the next pass can't be done on the buckets. It then takes a-1 merge sorts to combine these b/a sub-sequences. So you have a-1 + ceil(32/a) visits to sort 32 bit data (or a-1 + 2.ceil(32/a) with copying):

a                  :  1  2  3  4  5  6  7  8  9 10 16 32
a-1 + ceil(32/a)   : 32 17 13 11 11 11 11 11 12 13 17 32
a-1 + 2.ceil(32/a) : 64 33 24 19 18 17 16 15 16 17 19 33

So I'd expect, where log2(N) > 15, a hybrid radix+merge to be better than a straight merge.

...

2007-04-14 22:27

Further thought: making one pass to count the number of items with each radix digit, then making the subsequent passes use those counts to offset the address into the working memory, means you can do a full sort of the data in 1 + ceil(B/a) passes, using 4*2^a additional bytes of temporary data:

a              :  1  2  4   6  8  11    16  32
1 + ceil(32/a) : 33 17  9   7  5   4     3   2
4*2^a          :  8 16 64 256 1K  8K  256K 16G

Given:

$ cat /proc/cpuinfo
model name : Mobile AMD Athlon(tm) 64 Processor 3400+
cpu MHz    : 2200.000
cache size : 1024 KB

it seems the temp storage will fit into cache for up to radix 2^16, which allows us to sort in two passes. So for 10 million quads, radix sort wins over merge. '-no-sort' gives the overhead in the test program.

$ time quad_sort_bench 10240 -radix
real 0m4.127s user 0m3.256s sys 0m0.608s

$ time quad_sort_bench 10240 -merge
real 0m10.112s user 0m9.081s sys 0m0.512s

$ time quad_sort_bench 10240 -no-sort
real 0m1.476s user 0m1.056s sys 0m0.340s

Switching to use SSE2 in the radix implementation - load the data, extract a 16 bit value from it, do the index stuff, and store it - improves speed and, unusually for SSE2, makes the code in the loops shorter (though it does have to be repeated 4 times, as the extraction has to use immediates):

$ time quad_sort_bench 10240 -radix
real 0m3.674s user 0m2.900s sys 0m0.556s

Now, at the granularity of bash's time command, radix is not slower for datasets down to 1K. 1.85 s to sort 10 million items at 2.2 GHz is 388 clocks/item; as it visits each item 5 times, that's about 80 cycles per visit. That's still way higher than the number of instructions.
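Going back to the six orderings above: the claim that a two-term pattern becomes one contiguous, binary-searchable span is perhaps easiest to see in code. A minimal sketch, assuming quads held as four 32 bit ids and an array already copied and sorted in POSG order; the names (quad, posg_prefix_less, match_po) are mine, not the store's:

// A minimal sketch, not the store's code: find the contiguous span of quads
// matching a given predicate and object in an array sorted in P,O,S,G order.
#include <algorithm>
#include <vector>
#include <utility>
#include <stdint.h>

struct quad { uint32_t s, p, o, g; };

// Compares on the (P, O) prefix of the POSG sort order only, so a
// POSG-sorted array is partitioned with respect to it.
struct posg_prefix_less
{
    bool operator()(const quad& a, const quad& b) const
    {
        if (a.p != b.p) return a.p < b.p;
        return a.o < b.o;
    }
};

typedef std::vector<quad>::const_iterator quad_iter;

// All quads with predicate p and object o lie in [result.first, result.second).
std::pair<quad_iter, quad_iter>
match_po(const std::vector<quad>& posg_index, uint32_t p, uint32_t o)
{
    quad key = { 0, p, o, 0 };
    return std::equal_range(posg_index.begin(), posg_index.end(),
                            key, posg_prefix_less());
}

std::equal_range only needs the array to be partitioned with respect to the comparator, which a POSG sort guarantees for a (P, O) prefix comparison, so the span comes back in O(log N) without touching anything outside it.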
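As a concrete version of the count-then-scatter scheme and the 16 bit SSE2 extraction described above, here is a minimal sketch of a single radix-256 pass, keyed (arbitrarily, for illustration) on the low byte of the second term. It isn't the quad_sort_bench code; the alignment assumptions and names are mine:

// A minimal sketch of one count-then-scatter radix-256 pass over quads,
// keyed on the low byte of the second term; not the quad_sort_bench code.
// Assumes src and dst are 16-byte aligned arrays of count quads
// (4 x 32 bit ids each).
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

void radix256_pass(const uint32_t* src, uint32_t* dst, size_t count)
{
    size_t bucket[256] = { 0 };

    // Pass 1: histogram the radix digit (low byte of term 1).
    for (size_t i = 0; i < count; ++i)
        ++bucket[src[i*4 + 1] & 0xff];

    // Exclusive prefix sum turns the counts into starting offsets.
    size_t offset = 0;
    for (int b = 0; b < 256; ++b) {
        size_t c = bucket[b];
        bucket[b] = offset;
        offset += c;
    }

    // Pass 2: copy each 16 byte quad to its slot with SSE2. Word 2 of the
    // loaded vector is the low half of term 1; pextrw needs an immediate
    // index, which is why this has to be repeated per term in real code.
    for (size_t i = 0; i < count; ++i) {
        __m128i q = _mm_load_si128(
            reinterpret_cast<const __m128i*>(src + i*4));
        int digit = _mm_extract_epi16(q, 2) & 0xff;
        _mm_store_si128(
            reinterpret_cast<__m128i*>(dst + bucket[digit]*4), q);
        ++bucket[digit];
    }
}

Sorting on a full 32 bit term needs four scatter passes like this, one per byte, plus a single counting pass that can histogram all four bytes at once - which lines up with the five visits per item counted above.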
Using radix 256, we appear to get better throughput:

$ time quad_sort_bench 10240 -radix256
real 0m2.949s user 0m2.292s sys 0m0.444s

Then saving the result, off a cold cache:

$ time quad_sort_bench 10240 -radix256 -save
real 0m10.049s user 0m3.028s sys 0m1.120s

So the sort is quicker than the disk write of the index, but I'm not sure whether the sort is memory bandwidth constrained. Adding a micro-benchmark which copies the quads 10 times:

$ time quad_sort_bench 10240 -move-only
moving quads 10 times :
real 0m4.194s user 0m3.508s sys 0m0.476s

So access and update take around 200 ms for 10 million quads ((3.5s - 1.5s)/10), or 42 cycles per quad. I'm not sure what that all means, but the radix 256 seems to be much more efficient.

2007-04-15 00:45 ...

2007-04-15 13:13

Comparing move to scatter:

$ time quad_sort_bench 10240 -scatter-only
10485760 quads.
scattering quads 10 times :
real 0m15.706s user 0m14.949s sys 0m0.648s

$ time quad_sort_bench 10240 -move-only
10485760 quads.
moving quads 10 times :
real 0m4.013s user 0m3.512s sys 0m0.392s

$ time quad_sort_bench 10240 -no-sort
real 0m0.627s user 0m0.340s sys 0m0.248s

((14.949+0.648)-(0.340+0.248))/10 = 1.5s
((3.512+0.392)-(0.340+0.248))/10 = 0.33s

So scattering takes much longer than the consecutive move - 1,500 ms vs 330 ms. I don't know why -no-sort is reporting a lower time today. If scattering is expensive (which is not surprising), note that radix 256 scatters writes across only 256 destinations, whereas radix 65536 scatters across many more, so is much more scattered.

Created a hybrid sort which uses radix 256 to sort blocks, then merges them. Varying the block size:

$ time quad_sort_bench 10240 -hybrid 256
real 0m7.889s user 0m7.300s sys 0m0.524s

$ time quad_sort_bench 10240 -hybrid 1024
real 0m6.170s user 0m5.524s sys 0m0.556s

$ time quad_sort_bench 10240 -hybrid 4096
real 0m6.484s user 0m5.788s sys 0m0.604s

$ time quad_sort_bench 10240 -hybrid 16384
real 0m4.918s user 0m4.396s sys 0m0.448s

$ time quad_sort_bench 10240 -hybrid 65536
real 0m5.181s user 0m4.680s sys 0m0.360s

$ time quad_sort_bench 10240 -hybrid 262144
real 0m4.557s user 0m4.044s sys 0m0.428s

$ time quad_sort_bench 10240 -hybrid 1048576
real 0m4.832s user 0m4.264s sys 0m0.492s

$ time quad_sort_bench 10240 -hybrid 2097152
real 0m4.564s user 0m3.968s sys 0m0.528s

$ time quad_sort_bench 10240 -hybrid 4194304
real 0m3.920s user 0m3.296s sys 0m0.560s

$ time quad_sort_bench 10240 -hybrid 10485760
real 0m3.497s user 0m2.868s sys 0m0.548s

So that way of trying to get locality into the sort doesn't give an improvement over the plain radix case.

2007-04-15 14:01 ...

2007-04-15 23:05

Attempting a radix16 sort:

$ time quad_sort_bench 10240 -radix16
real 0m4.087s user 0m3.408s sys 0m0.452s

More visits, not better.

Tidied the code to use template functions parameterised on the term, which is cleaner than a set of cases; the term has to be a template parameter because the extraction needs an immediate. Using _mm_stream_si128 rather than _mm_store_si128 to output the results of the radix256 sort had no measurable effect. I could probably use SIMD over the four byteN_index_total arrays in radix256, but it would introduce an extra operation into the bigger loop, so it's not worth trying.

Took a look at the generated assembly; I can't quite read it well enough -

 180: 66 0f 7f 04 02          movdqa %xmm0,(%edx,%eax,1)

Don't know what that 3-argument addressing mode means.
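For reference, a rough sketch of the shape of the two loops the -move-only and -scatter-only runs above are comparing; the index mixing here is a placeholder of my own, not the benchmark's actual mix - the point is only that one loop writes sequentially while the other writes all over the destination:

// Rough shapes of the two access patterns being timed; the mix below is a
// placeholder, not the benchmark's. Assumes 16-byte aligned buffers of
// 'count' quads.
#include <emmintrin.h>
#include <stddef.h>

// Sequential copy: reads and writes walk the buffers in address order.
void move_quads(const __m128i* src, __m128i* dst, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        _mm_store_si128(dst + i, _mm_load_si128(src + i));
}

// Scattered copy: each write lands at a mixed-up position, much as the
// scatter phase of a radix sort does.
void scatter_quads(const __m128i* src, __m128i* dst, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        size_t j = (i * 2654435761u) % count;  // placeholder mix, just to
                                               // spread the writes around
        _mm_store_si128(dst + j, _mm_load_si128(src + i));
    }
}

The scattered writes defeat the cache and write combining, which is presumably where the extra second and a half per pass goes.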
$ time quad_sort_bench 10240 -radix256
real 0m2.940s user 0m2.292s sys 0m0.444s

$ time quad_sort_bench 10240 -no-sort
real 0m1.476s user 0m0.808s sys 0m0.292s

(2.292+0.444)-(0.808+0.292) = 1.63 seconds
(1.63 * 2.2e9)/(5*10e6) = 72 clocks/visit

Still quite high. This is the high half-word buffer -> data loop of the radix 256 sort:

 1d3: 8b 95 f0 ef ff ff       mov    0xffffeff0(%ebp),%edx
 1d9: 39 55 0c                cmp    %edx,0xc(%ebp)
 1dc: 73 36                   jae    214 (unsigned int*, unsigned int*, unsigned int)+0x214>
 1de: 8b 5d 0c                mov    0xc(%ebp),%ebx
 1e1: 66 0f 6f 03             movdqa (%ebx),%xmm0
 1e5: 66 0f c5 c0 01          pextrw $0x1,%xmm0,%eax
 1ea: 0f b6 c4                movzbl %ah,%eax
 1ed: 8b 94 85 f4 ef ff ff    mov    0xffffeff4(%ebp,%eax,4),%edx
 1f4: 8d 4a 01                lea    0x1(%edx),%ecx
 1f7: 89 8c 85 f4 ef ff ff    mov    %ecx,0xffffeff4(%ebp,%eax,4)
 1fe: c1 e2 04                shl    $0x4,%edx
 201: 8b 45 08                mov    0x8(%ebp),%eax
 204: 66 0f 7f 04 02          movdqa %xmm0,(%edx,%eax,1)
 209: 83 c3 10                add    $0x10,%ebx
 20c: 39 9d f0 ef ff ff       cmp    %ebx,0xffffeff0(%ebp)
 212: 77 cd                   ja     1e1 (unsigned int*, unsigned int*, unsigned int)+0x1e1>

16 instructions ~ 4.5 clocks per instruction. I know the SSE2 load and store (movdqa) have a latency of 4 on Intel, but AFAIK the others should be single cycle.

...

2007-04-18 20:03

Looking at the AMD optimisation manual, the latency should only be 2 on the stores. The complex addressing modes may allow the use of a partly interleaved offset array - a scale of 8 gives a stride of 2 ints. Leaving that for now: since we want to create indices on disk, we should think about locality - a sorted index of 1/100th of the data may help when initially importing, but we still need the b-tree implementation to work well.

Thinking about indices again, we may want SPOG for blank nodes, but PSOG for transitive closures. Assuming the disk block size is a factor of 4K, there are 256 quads in a 4K block. It would make sense to have quads with the same AB contiguous in a block in an ABCD index, which is not the case in the b-tree at http://cis.stvincent.edu/html/tutorials/swd/btree/btree.html . A billion quads is 16 GiB, or 4 million blocks = 2^22. Sorting 256 quads takes little time, so the blocks can always be sorted. Maybe a hashmap would be better. A 16 byte hash should do for 10 million quads; need to scan for collisions.

Ok, the next micro-benchmark is to make a hash of each pair of terms in the quad, and test for collisions. But first, I want to check that the copy code isn't optimising away the mix, so that comparing it to the scatter is valid:

$ time quad_sort_bench 10240 -scatter-only
real 0m15.997s user 0m14.589s sys 0m0.508s

$ time quad_sort_bench 10240 -move-only
real 0m4.180s user 0m3.368s sys 0m0.484s

The mix being used to scatter the data is not being optimised away; the cost is in the scattering. So, collisions:

2007-04-18 22:49 ...

2007-04-19 22:51

Reading http://www.mark.masmcode.com/ . Can we interleave reads/writes in the radix sort? We're only using the single xmm0 register at the moment.

Using 1 xmm register in the loops:

$ time quad_sort_bench 10240 -radix256
real 0m2.958s user 0m2.340s sys 0m0.424s

Using 2 xmm registers in the loops:

$ time quad_sort_bench 10240 -radix256
real 0m2.933s user 0m2.256s sys 0m0.464s

Using 4 xmm registers in the loops:

$ time quad_sort_bench 10240 -radix256
real 0m2.889s user 0m2.256s sys 0m0.428s

Breaking up the loop's internals further to decrease dependencies doesn't seem to have a measurable effect.

2007-04-19 23:31 ...

2007-04-20 22:39

Downloaded uba for generating benchmark datasets.
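On the pair-of-terms hash mentioned above (2007-04-18), a hypothetical sketch of the kind of collision scan that micro-benchmark would do; the mixing function and names are illustrative only, not taken from the store:

// A hypothetical sketch of hashing each (term, term) pair in a quad and
// counting collisions - distinct pairs landing on the same hash value.
// The mix constant and names are illustrative, not the store's; std::map
// makes this a simple (slow) exact check, not a tuned benchmark.
#include <stdint.h>
#include <stddef.h>
#include <map>
#include <utility>

inline uint32_t hash_pair(uint32_t a, uint32_t b)
{
    uint64_t x = (static_cast<uint64_t>(a) << 32) | b;
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;  // arbitrary 64 bit mixing constant
    x ^= x >> 33;
    return static_cast<uint32_t>(x);
}

// term_a, term_b select which two of the four terms to pair (0..3).
size_t count_collisions(const uint32_t* quads, size_t count,
                        int term_a, int term_b)
{
    typedef std::pair<uint32_t, uint32_t> term_pair;
    std::map<uint32_t, term_pair> seen;
    size_t collisions = 0;
    for (size_t i = 0; i < count; ++i) {
        term_pair p(quads[i*4 + term_a], quads[i*4 + term_b]);
        uint32_t h = hash_pair(p.first, p.second);
        std::map<uint32_t, term_pair>::iterator it = seen.find(h);
        if (it == seen.end())
            seen.insert(std::make_pair(h, p));
        else if (it->second != p)
            ++collisions;
    }
    return collisions;
}

With 32 bit hash values and 10 million pairs, some birthday collisions are expected (roughly n^2/2^33 of them), which is what the scan is there to count.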