2007-04-20

More rdf bits

I've spent a little more time on quad-store, and found some more realistic performance figures from Franz Inc and the ESW wiki.

There's now a reasonably fast in-memory sorting (using radix 256), around 2.2s/10 million triples, which should help indexing small stores. I also have had a look at MPI for linking nodes together.

I think the premise may well hold - as the limiting factor for growth is IO and memory rates, not CPU, so trade CPU for compression and limit the number of values a store can hold, so making the store's representation smaller, reducing IO. Structure for locality and divide the store into nodes when the limits are exceeded. I still haven't found a large enough dataset to push the 230 values it can hold. And 230 values is up to 290 triples, as each triple has 3 values. 2 bits of the 32 are reserved to indicate whether it's a compressed URI, or a hash table URI, a blank or an literal.

OpenLink Virtuoso also seem to be using something similar in their 'bitmap indexes', though limiting to 32 bit values gives you a set of sorted b-trees of blocks of quads, rather than indices - in a pure index you have to look up the value, so incur an indirection.

I'm not yet bored with playing with this, as I am learning quite a bit about how bandwidth limited systems behave, whereas most of my previous work has been cpu limited numerical stuff, distributed services, or km and visualisation apps.


Pete

Labels: , , ,

2007-04-12

MegaData-QuadStore micro benchmarks

2007-04-05 21:36
Starting from Joe Gregorio's BitWorking post on 'megadata', I'm wondering what happens if you work this thing from the other end - rather than trying to make a database that scales well over distributed compute nodes, instead focus on optimising each node whilst trying not to compromise external scalability.

Getting parallelism working, whether via network or just concurrent threads in the same machine, is harder than straight-ahead computing in a single node.

Since my background is in engineering computing and simulation (with a little KM and recreational prolog), I've little idea why it takes 15 minutes to load 10,000,000 triples into a RDF store , whereas reading a 150 MiB file itself takes seconds.

So I'm trying to find the cost points for nodes, rather than looking at distribution schemes here.

For this, I'm taking a node as a virtual processor and half a GiB RAM, partly because my laptop has a single non-hyperthreaded core and 640 MiB RAM, so is a good model for a single node in a server array. A quad core hyperthreaded server with RAID should be eight nodes -eight times my laptop in Mips, disk speed and RAM.

At that size, a node doesn't have to handle more than a billion tuples, and I'm assuming a little bit that some of the strategies for distributing queries over nodes will still apply to such nodes. Limiting a node to a billion tuples also makes coding easier - you can use 32 bit identifiers, as long as they are reasonably local to the node. By 'reasonably local', I'd expect some number of IDs to be common to all nodes to represent the most frequently occurring term values, but distributed queries not to have to rely on having unique IDs that span between nodes.

Most of the RDF stores seem to be written in Java, and what I've seen of the implementations do a lot of pointer chasing, so this is a bit of fun to see what we can do with SSE2 to make a RDF-like query engine without pointer chasing.

If each tuple in the store is a quad vector of 32 bit ints, we can do quite fast searches using SSE2. As John Sowa predicted, many triple-stores have found that triples aren't enough, and are quad stores now. Though maybe a hexa-store is best for human-oriented data, as most of the relations we come up with have arity less than seven.

The log of my first week's hacking is available here.

Importing from the 1716 MiB SQL dump mentioned in the BitWorking post's comments, using a compression schema and a string table to generate a 160 MiB table of around 10 million triples and keep the full form of each quad within the memory limits takes around 2 minutes, around twice disk throughput speed.

Getting that far took most of Good Friday and Low Saturday, then I went to my sisters for the rest of Easter.

Coming back and coding in the evenings this week, I was playing with merge sort on the quads. Sorting one term over a 10 mega-quad table takes around 9 seconds, which is around 24 passes with merge sort, so visiting 28 million quads a second at ~85 clock cycles per quad (clock speed is 2.4GHz). That does seem a little high, as there's only around 20 instructions in the lowest level loop, though there is a branch. Using SSE to eliminate the branch costs more than the simple C variant.

Thinking about indexing next, it's noted here that
to generate indices where all quads matching two terms are contiguous, you need an index for each selection of two terms from four, ie six indices.
Let P, O and G denote the four terms in the quad - subject, predicate, object and graph.
For one or three term queries, four indices suffices - for one term, one index for each of SPOG, for three terms, one index sorted on the three terms in any order.
For a query of four terms, any index will have the matches contiguous.
So for quads, six indices will give you an O(log(N)) match on any single part of a query, as long as SPOG appear first and last in at least one of the indices.

So a selection such as these might be used:

SPOG -> S, SP, SPO matches contiguous, SP matches ordered by O over G.
POSG -> P, PO, matches contiguous, PO matches ordered by S over G.
OSPG -> O, SO, matches contiguous, SO matches ordered by S over G.
GSPO -> G, SG, SPG matches contiguous
GPOS -> PG, POG matches contiguous
GOSP -> OG, SOG matches contiguous


Based on the assumptions I've made about what a node in a distributed array needs to provide, ie 32 bit IDs is enough, and how well the sample quads compress, these don't have to be indices - a reference into a file takes as much space as the data itself - instead they are the IDs. I'm using these for direct matches, not comparisons such as SPARQL filter provides; you'd filter a potential, or create a union of a small range (such as 2 < x <= 5 translates to x=3|x=4|x=5 for integers).

I'd expect that most of the time you'd want the results around a single subject, particularly if you have lots of blank nodes, ie you'd be using one ordering to get ?PO_, then use SP?_ (? desired term, _ don't care, SPOG specified value).

Taking the example query from http://xtech06.usefulinc.com/schedule/paper/18 :

SELECT ?pub ?title ?issue ?article
WHERE {
?title rdf:type struct:Title .
?title dc:identifier .
?issue prism:isPartOf ?title .
?issue prism:volume "20" .
?issue prism:number "4" .
?article prism:isPartOf ?issue .
?article prism:startingPage "539" .
}

An optimiser may first look at the POGS and OGSP orderings to determine the spans of the predicates and literals to determine which patterns to match first.

Either you can build a topological sort and start with finding ?title candidates, then ?issue candidates, then ?article, or you can merge the results of all matches with one ungrounded term.

The thing about this index approach is that this pattern:
  ?title    rdf:type        struct:Title .

would use the POGS index, and the result would be far from the result for
  ?title    dc:identifier    .


This lack of locality, I think, is going to be a pain point for this kind of normalised store.

Whether it's an index with pointers into a larger store, or an ordering of a compressed form of the quads, you've got a cache (disk or memory) miss, which is why I'm trying to implement this without chasing pointers.

But so far I'm happy that it's possible make a midi-store node that's good for a billion quads, and then distribute a mega-store between them using a higher level protocol than that which is optimised for the node to use internally.


TME

Labels: , ,