Thinking more about compression

I'm using compression on the URIs and literals in the quad-store to trade CPU for disk/memory bandwidth. The current algorithm is LZW, with a 24bit index look up table. Yesterday's benchmarks indicate it's not doing massively good compression - ; on the other hand it's small and I can see how to preform regular expression searches on the compressed data. 24bit index tables take 64MiB and give around 1:5 compression - reduces to 23%. You can't reset the table like you do in a streaming compressor, since you want the mapping of compressed bits . Bigger tables take more room both - more entries, and the entries become larger as the number of bits to reference another entry increases. If you had 25bit indexes, you'd need 5 extra bytes per entry, so take 154MiB rather than 128MiB and possibly suffer due to cache line mis-alignment.

Since each value in the quad is reduced to a 32 bit identifier in the store, and at least one bit is needed to indicate whether it's fully described by the identifier (uncompressing the identifier's bit-pattern gives the URL), or needs a look-up to longer bit pattern. Obviously, the more URLs which can be squished into 32bit patterns the faster the store will be written to, as it removes the need to look elsewhere to match the string pattern of the value with one already in a hash table to check that the same identifier is only ever used for the same string. But 24+1 bits leaves 7 unused bits of space in half the possible identifiers, which reduces the maximum identifier density to (128+1)/256 = 50.3%, which isn't very good.

Looking at the benchmarks here, there seem to be much better methods than LZW, but most of them are neural net or model based, and I can't see how to map a sequence of hints for a neural net into anything that could have regex implemented on top of it without decompressing. They also are orders of magnitude heavier on the cpu, apparently enough to fade the total performance.

The other options are to not bother with the intrinsic identifiers - which means finding out how many of the URLs would fit in the LZW table for typical datasets (certain of the computer generated ones they would), finding other uses for the spare bits such as 11xx xxxx xxxx xxxx xxxx xxxx xyyy yyyy indicating a 23 bit LZW index and a 7 bit character. Maybe the simplest way would be to have aaaa aaaa xxxx xxxx xxxx xxxx xxxx xxxx where any non-zero in the a's indicates it's a hashkey, so you get full identifier use at the cost of more hashtable lookups.

Holding the full hash in memory isn't an option; even just a population bit would take half a gig. Since each node is single-threaded, it's possible to split URLs from literals and not care whether literals match outside of a smaller MRU table cache; that would increase search times for equality of literals with non-intrinsic ids, but not effect regex searches. It may also be worth maintaining separate LZW tables for literals and URIs. Eliminating duplications can be done in the background anyway.

I think the next experiment will be scanning the datasets I've got - swoogle's ten million triples, wikipedia in a couple of formats (dbpedia and the xml dump), and the uba owl generator - to see what hit rate and spread I'd get generating ids using LSW and hashes.


Labels: , ,

First compile and run on Solaris 10.

I've got first run of one of my quad-store benchmarks running on my Solaris 10 boxes.

I needed to make the machine architecture and a flag whether or not to use SSE2 environment variables, and add some extra #defines so none of the SSE stuff was included. I needed a macro to fake posix_memalign, though will think about using another malloc-based version. Compiling on Solaris also generates warnings for automatic promotion of -1 to an unsigned int so I added explicit casts (-1 is returned by some functions where 0 is a valid return and something is needed to signal an invalid value).

Running the compression benchmark, the sparc (440MHz) doesn't seem to get cpu bound for small files (100MiB):

tercel-1$ uname -a
SunOS tercel-1 5.10 Generic_118833-33 sun4u sparc SUNW,UltraSPARC-IIi-cEngine

tercel-1$ time bin/lzw_bench datasets/original/triples.sql 100
read: 100 MiB, compressed to: 21 MiB.

real 0m23.094s
user 0m20.973s
sys 0m1.715s

tercel-1$ time bin/lzw_bench datasets/original/triples.sql 1000
read: 1000 MiB, compressed to: 301 MiB.

real 3m55.124s
user 3m39.377s
sys 0m12.550s

fortinbras$ uname -a
Linux fortinbras 2.6.20-16-generic #2 SMP Thu Aug 30 23:16:15 UTC 2007 x86_64 GNU/Linux
fortinbras$ time bin/lzw_bench datasets/original/triples.sql 100
read: 100 MiB, compressed to: 21 MiB.

real 0m8.638s
user 0m5.784s
sys 0m0.296s

fortinbras$ time bin/lzw_bench datasets/original/triples.sql 1000
read: 1000 MiB, compressed to: 301 MiB.

real 0m56.176s
user 0m43.667s
sys 0m1.948s

The processor in fortinbras is Mobile AMD Athlon(tm) 64 Processor 3400+ at 2200 MHz, the Netra at 440 MHz, and appears to be about a quarter as fast at a fifth of the clock speed.

On the other hand, running the Netra at full doesn't heat my lap up, I've three of them, they've more, faster disks available and I can play with making the quad-store distribute between them.

Whether they hold up as well against the benchmarks which use SSE to make the Athlon wider I'll have to find out.


Labels: , ,


More rdf bits

I've spent a little more time on quad-store, and found some more realistic performance figures from Franz Inc and the ESW wiki.

There's now a reasonably fast in-memory sorting (using radix 256), around 2.2s/10 million triples, which should help indexing small stores. I also have had a look at MPI for linking nodes together.

I think the premise may well hold - as the limiting factor for growth is IO and memory rates, not CPU, so trade CPU for compression and limit the number of values a store can hold, so making the store's representation smaller, reducing IO. Structure for locality and divide the store into nodes when the limits are exceeded. I still haven't found a large enough dataset to push the 230 values it can hold. And 230 values is up to 290 triples, as each triple has 3 values. 2 bits of the 32 are reserved to indicate whether it's a compressed URI, or a hash table URI, a blank or an literal.

OpenLink Virtuoso also seem to be using something similar in their 'bitmap indexes', though limiting to 32 bit values gives you a set of sorted b-trees of blocks of quads, rather than indices - in a pure index you have to look up the value, so incur an indirection.

I'm not yet bored with playing with this, as I am learning quite a bit about how bandwidth limited systems behave, whereas most of my previous work has been cpu limited numerical stuff, distributed services, or km and visualisation apps.


Labels: , , ,


MegaData-QuadStore micro benchmarks

2007-04-05 21:36
Starting from Joe Gregorio's BitWorking post on 'megadata', I'm wondering what happens if you work this thing from the other end - rather than trying to make a database that scales well over distributed compute nodes, instead focus on optimising each node whilst trying not to compromise external scalability.

Getting parallelism working, whether via network or just concurrent threads in the same machine, is harder than straight-ahead computing in a single node.

Since my background is in engineering computing and simulation (with a little KM and recreational prolog), I've little idea why it takes 15 minutes to load 10,000,000 triples into a RDF store , whereas reading a 150 MiB file itself takes seconds.

So I'm trying to find the cost points for nodes, rather than looking at distribution schemes here.

For this, I'm taking a node as a virtual processor and half a GiB RAM, partly because my laptop has a single non-hyperthreaded core and 640 MiB RAM, so is a good model for a single node in a server array. A quad core hyperthreaded server with RAID should be eight nodes -eight times my laptop in Mips, disk speed and RAM.

At that size, a node doesn't have to handle more than a billion tuples, and I'm assuming a little bit that some of the strategies for distributing queries over nodes will still apply to such nodes. Limiting a node to a billion tuples also makes coding easier - you can use 32 bit identifiers, as long as they are reasonably local to the node. By 'reasonably local', I'd expect some number of IDs to be common to all nodes to represent the most frequently occurring term values, but distributed queries not to have to rely on having unique IDs that span between nodes.

Most of the RDF stores seem to be written in Java, and what I've seen of the implementations do a lot of pointer chasing, so this is a bit of fun to see what we can do with SSE2 to make a RDF-like query engine without pointer chasing.

If each tuple in the store is a quad vector of 32 bit ints, we can do quite fast searches using SSE2. As John Sowa predicted, many triple-stores have found that triples aren't enough, and are quad stores now. Though maybe a hexa-store is best for human-oriented data, as most of the relations we come up with have arity less than seven.

The log of my first week's hacking is available here.

Importing from the 1716 MiB SQL dump mentioned in the BitWorking post's comments, using a compression schema and a string table to generate a 160 MiB table of around 10 million triples and keep the full form of each quad within the memory limits takes around 2 minutes, around twice disk throughput speed.

Getting that far took most of Good Friday and Low Saturday, then I went to my sisters for the rest of Easter.

Coming back and coding in the evenings this week, I was playing with merge sort on the quads. Sorting one term over a 10 mega-quad table takes around 9 seconds, which is around 24 passes with merge sort, so visiting 28 million quads a second at ~85 clock cycles per quad (clock speed is 2.4GHz). That does seem a little high, as there's only around 20 instructions in the lowest level loop, though there is a branch. Using SSE to eliminate the branch costs more than the simple C variant.

Thinking about indexing next, it's noted here that
to generate indices where all quads matching two terms are contiguous, you need an index for each selection of two terms from four, ie six indices.
Let P, O and G denote the four terms in the quad - subject, predicate, object and graph.
For one or three term queries, four indices suffices - for one term, one index for each of SPOG, for three terms, one index sorted on the three terms in any order.
For a query of four terms, any index will have the matches contiguous.
So for quads, six indices will give you an O(log(N)) match on any single part of a query, as long as SPOG appear first and last in at least one of the indices.

So a selection such as these might be used:

SPOG -> S, SP, SPO matches contiguous, SP matches ordered by O over G.
POSG -> P, PO, matches contiguous, PO matches ordered by S over G.
OSPG -> O, SO, matches contiguous, SO matches ordered by S over G.
GSPO -> G, SG, SPG matches contiguous
GPOS -> PG, POG matches contiguous
GOSP -> OG, SOG matches contiguous

Based on the assumptions I've made about what a node in a distributed array needs to provide, ie 32 bit IDs is enough, and how well the sample quads compress, these don't have to be indices - a reference into a file takes as much space as the data itself - instead they are the IDs. I'm using these for direct matches, not comparisons such as SPARQL filter provides; you'd filter a potential, or create a union of a small range (such as 2 < x <= 5 translates to x=3|x=4|x=5 for integers).

I'd expect that most of the time you'd want the results around a single subject, particularly if you have lots of blank nodes, ie you'd be using one ordering to get ?PO_, then use SP?_ (? desired term, _ don't care, SPOG specified value).

Taking the example query from http://xtech06.usefulinc.com/schedule/paper/18 :

SELECT ?pub ?title ?issue ?article
?title rdf:type struct:Title .
?title dc:identifier .
?issue prism:isPartOf ?title .
?issue prism:volume "20" .
?issue prism:number "4" .
?article prism:isPartOf ?issue .
?article prism:startingPage "539" .

An optimiser may first look at the POGS and OGSP orderings to determine the spans of the predicates and literals to determine which patterns to match first.

Either you can build a topological sort and start with finding ?title candidates, then ?issue candidates, then ?article, or you can merge the results of all matches with one ungrounded term.

The thing about this index approach is that this pattern:
  ?title    rdf:type        struct:Title .

would use the POGS index, and the result would be far from the result for
  ?title    dc:identifier    .

This lack of locality, I think, is going to be a pain point for this kind of normalised store.

Whether it's an index with pointers into a larger store, or an ordering of a compressed form of the quads, you've got a cache (disk or memory) miss, which is why I'm trying to implement this without chasing pointers.

But so far I'm happy that it's possible make a midi-store node that's good for a billion quads, and then distribute a mega-store between them using a higher level protocol than that which is optimised for the node to use internally.


Labels: , ,