Tin Can Camera

2008-04-20

Bjarne Stroustrup on the Evolution of Languages

Link: http://msdn2.microsoft.com/en-us/magazine/cc500572.aspx

A few comments on this -

The illustrations in the interview don't appear related other than that they are of an IDE ("I might have a different opinion if a good IDE was universally available", queue Visual Studio, which implies that VS isn't a good IDE) and that the UML is illustrating multiple inheritance, but mechanism for dispatch in the paper is not based on multiple inheritance (at least as shown in the diagram).

It's an interesting paper though, and I would like open methods in a language.

I don't think that DSLs should be just extensions to general purpose languages. Mainly because that implies a specific execution model, and if you have a DSL then execution shouldn't be part of the domain model (unless it's something like Verilog, or the input to a program proof tool, in which case execution is a domain concept). DSLs should allow the modelling of a domain, and lose their benefit - abstraction away from execution - if they are tied to a programming language. If you have a model of the physical domain, then you can state 'F = m * a' and not have that imply that the values named by 'm' and 'a' are multiplied and stored in the location named 'F', as binding to C++ semantics would have it. Otherwise you need at least three models - 'F = m * a', 'm = F/a' and 'a = F/m' - to let you represent everything that the one entity in the domain represents. Which would create a mismatch between the language and the domain, eliminating the benefit of a DSL.

I've also been reading 'Programming Languages Pragmatics' and working on my model-based language kin a lot, and adapting the parser and code model of the modelling language I used to do a state model of some concurrent code to support kin. I was generating Java as a first cut to bootstrap a full compiler, but Java's generics aren't rich enough to support kin's parametric types, and without them it's not easier to generate Java than C. So I will go straight to a C back end, and generate an intermediate form which can be interpreted. It's more important for kin to be able to call C libraries such as the Intel Math Kernal Library than for Java libraries, and I don't really like the sort of compromises you have to make to fit in with Java's libraries, however nice the JVM's HotSpot and garbage collector are to have.

Labels: c++, kin, languages, programming

2008-03-29

More kin musings

Back from finishing a contract in Malvern, and have had a few ideas on kin.

If general allocation is monomorphic arrays in column order, then single object representation is just a stride of 1 and a length of 1, and padded as required.

Assumption is that the code will operate on arrays, so compiler should default to vector ops where possible.

Need some basic constructs before starting, including mono and poly VLists, and a base lisp-like language to write the parser in.

Problem is that it's easiest to built an interpreter on a higher level language, such as Java, but the reasons I'm looking at kin - such as SIMD and cache optimisations, pattern matching and relations - aren't available in Java.

Also, I'm getting more and more convinced that polymorphic references should be pairs rather than as in C++ and Java.

The big issue I'm having is that each time I think how to do something, I write it in kin, then can't be bothered to write it in C or NASM.

Starting with the string type, I end up finding there's only a couple of functions which are needed over seq<uint8>, and those are easier to specify in kin than code up in C. No points are performance critical - or rather, no points can be written faster in C than in kin.

So I will try to define a metacircular kin interpreter, and work out where the smallest subset lies.

Also, for my specific interest of event driven simulations, I'd been thinking about pushing events forward for handling collisions. This requires that updates on event don't change the pre-event state, but perform copy-on-modify, as the pre-event state is also used to derive the intermediate event state. If you do modify it, then the best you can do is say 'there was a collision in the time step', and you have to keep the time step short enough that it doesn't matter when the collision occured within it.

I'm trying to think where C++ style identity tied to memory is useful; usually it's a premature optimisation, causing inconsistent states if you're not careful with the visibility of an update to a simulation entity after an event (the work I was doing up to November suffered from this, but that architectural decision was made before I joined that team). In C++ it's hard to work out how to create a 'make an updated copy' operation which doesn't perform redundant copying. It either involves writing a constructor which is aware of the update operation (causing slicing issues if the update is polymorphic) or requiring redundant initialisation if you use something like:


void entity::update_kinematics (double time_step) {
    position += (velocity + 0.5 * acceleration * time_step) * time_step;
    velocity += acceleration * time_step;
}

// create a copy of previous step's state
entity* updated = new entity(*previous);

// update it
updated->update_kinematics (time_step);
// store it in the active entities list for the new time

Then all the updated entities go into the collision detection routine, and any collisions are handled before the next interaction stage, where acceleration is recalculated. If there is a collision, then a new time event is added before the current time step, so that the interactions due to the collision can be worked on, the collision is handled in some other way. Then any other interactions are handled, based on the positions (or trajectories) at the current time, and then the actors can updated their intentions based on the sensing done in the interaction phase. Each time there's a copy-on-write going on, so, for example, if aircraft X decides to shut its radar off at time t1, the radar emmision is still visible along its trajectory from time t0 to t1, and any other actors can sense it and respond.

For a 6DoF vector position and velocity, the cost of writing the data twice isn't that great, but for larger structures it can be. Maybe some sort of copy-on-write at the language level, and an escape analysis for whether to use that or mutable structures. That would of course need an experiment to see if copy-on-write costs more than mutability though; it does for large structures, but not always for smaller ones.

How this relates to the sort of purely functional data structures described by Chris Okasaki I've yet to think.

Other things I've been thinking of - modelling actors as state-machine transducers, implemented as coroutines (see this paper for some inspiration) using trace-based code generator. The problem is that coroutines fit well with transducers and the style of event-driven actor programming I would like to express, and also with mostly sequence based processing, but not with the VM implementations I'm familiar with (or C++ coding, especially RAII). Actors accept a set of values, and also yield symbols, or accept a different symbol.

For example, a translator from UTF-8 to Unicode codepoints would accept a variable number of uint8 values for each uint32 value it produces:


# Convert UTF-8 string to a sequence of Unicode codepoints
codepoints (str:string) : seq<uint32> =>
    yield str -> (x0:uint8) =>
        if ((x0 & 0x80) == 0) then
            yield uint32(x0)
        else
            accept (x1:uint8) where ((x1 & 0xc0) == 0x80) # UTF-8 encoding valid
                if ((x0 & 0xe0) == 0xc0) then # 2 byte sequence
                    yield (uint32(x0 & 0x1f) << 6) | uint32(x1 & 0x3f)
                else
                    accept (x2:uint8) where ((x2 & 0xc0) == 0x80) # UTF-8 encoding valid
                        if ((x0 & 0xf0) == 0xe0) then # 3 byte sequence
                            yield (uint32(x0 & 0x0f) << 12) | (uint32(x1 & 0x3f) << 6) | uint32(x2 & 0x3f)
                        else # 4 byte sequence
                            assert ((x0 & 0xf8) == 0xf0) # UTF-8 encoding valid

                            accept (x3:uint8) where ((x3 & 0xc0) == 0x80) # UTF-8 encoding valid
                                yield (uint32(x0 & 0x07) << 18) | (uint32(x1 & 0x3f) << 12) |
                                    (uint32(x2 & 0x3f) << 6) | uint32(x3 & 0x3f)

My lower-level implementation experience is only with fairly procedural languages, such as Java byte code or conventional use of assembly, so translating this to code that runs is a bit interesting; it sort of goes round in circles - if I already had a language which worked on the accept/yield principles (with enough control to create SIMD code and column-ordered VLists), then creating a parser and an interpreter would be easier.

Labels: c++, kin, languages

2007-11-01

One more wide-finder

Since previous efforts didn't print the results summary, they weren't really complete, so here are two more variants using a vector (tbrayA.cpp) and a priority queue (tbrayB.cpp) to get the top 10 entries.

There's a pthread parallel version of the B variant (tbrayB_parallel.cpp). The main thread reads chunks, scans back to a line break, and hands off the chunk to a pool of worker objects, which run the matching in their own thread. Once all the data is read, the workers' counts are summed and the top 10 counts found, then the results printed as in the single threaded version. Each worker's data buffer is written only by the main thread, for the most part in the call to the OS' read function; this helps reduce the amount of copying of data - though whether the T5120 would have to copy anyway I don't know (some CPUs share cache between cores, others don't, those that don't may have to copy the buffer from one cache to the other when the matching thread starts processing).

The code hasn't been built on anything other than linux this time, so I can't guarantee it will run on Solaris. There's also a deadlock on the scheduler if you only use 2 workers (you need at least two so the buffers are only being written or read by one thread at a time), which I have to track down.

Obviously, the multi-threaded version requires quite a lot of infrastructure - it's twice the size of the single-threaded version. It also runs significantly slower than the single-threaded version:

fortinbras:$ time bin/tbrayB_parallel datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real    0m1.358s
user    0m0.372s
sys     0m0.232s
fortinbras:$ time bin/tbrayB datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real    0m0.531s
user    0m0.232s
sys     0m0.156s

Commenting out the call to process_chunk(), so it just does the fork and join thread communications:

fortinbras:$ time ./widefinder datasets/hundred-o10k.ap 3

real    0m0.288s
user    0m0.056s
sys     0m0.220s

So there is some additional cost in the parallelism other than just the inter-thread messaging (372 > 232 + 56).

There's an archive of the code here; unpack it, run make and call ./widefinder with the path of the logfile and the number of worker threads. The number of worker threads is 3 minimum. The other examples compile to the ./bin directory from make tbray.

I'm still not overly convinced that it will be CPU bound if it has has a small, slow, miserable disk unless you are using a language whose virtual machine is excessively baggy; I've been careful not to use any optimisation which is specific to the data, rather than what a compiler for a higher-level language would know, given suitable annotations for parallelism.

The latest erlang effort from Pichi is twice as fast as the original ruby; ruby takes 2.580s CPU time on my machine, so the single-threaded C++ version is 11 times faster than ruby on one core, the parallel one 7 times. So once you have half-a-dozen cores then the erlang should beat the single threaded C++ code. Though I'd like to see whether on the disk on the T5120 means it is CPU limited or IO limited with the faster code, and I'm unconvinced that the T5120 is what future will look like - I'd expect more cache and flash on chip, rather than more cores, and a multi-node rather than a multi-core future.

Pete

Labels: c++, concurrency, programming, widefinder

2007-10-05

POITROAI

For this particular problem, the principles are more important that the exact results, but I'm happier with tbray7.cpp which gives the same results as the regex in the original ruby code:

fortinbras:$ time bin/tbray7 datasets/hundred-o10k.ap 
matches: 95400

real    0m0.439s
user    0m0.284s
sys     0m0.144s

The file was in cache for that one so ignore the real time; the important thing is that the user time wasn't adversely effected by getting the 'right' answer, instead of 1100 matches.

Pete

Labels: c++, programming, widefinder

2007-10-04

Of course, it helps if you tell gcc to generate a 64 bit executable

... when you're using 64 bit operations:

tercel-2:$ time bin32/tbray4 datasets/hundred-o10k.ap 
user    0m5.644s
tercel-2:$ time bin32/tbray6 datasets/hundred-o10k.ap 
user    0m4.048s
tercel-2:$ time bin/tbray4 datasets/hundred-o10k.ap 
user    0m5.725s
tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap 
user    0m3.839s

I'd assumed it would default as it does on linux, but no. Which also explains why I was confused about sizeof(size_t) being less than sizeof(long long) here.

It's still around 8 clock cycles per byte - the 5% gain is not enough to alter yesterday's estimates, which is a little encouraging actually, as I wasn't sure whether or not 32 bit Sparc instruction generators, such as gnu lightning (used in rubinius) or whatever erlang's hipe uses would cause a significant slow down.

Pete

Labels: c++, programming, solaris, widefinder

2007-10-02

Some fingers in the air.

I got Steve Vinoski's 2007/09/29 erlang code, installed hipe and the bfile module, and it ran on the laptop:

fortinbras:$ cat ../datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time erl -smp -noshell -run tbray5 main 512 ../datasets/hundred-o10k.ap 
110100 matches found

user    1m23.649s
real    1m33.683s
sys     0m1.620s

I'm not sure looking at either mine or Steve's code where the 1101th match comes from - there are #ifdefs in my line splitting code to print the lines, and if you run diff that output with the input it's the same, and if it was related to the last line in the sample it would give a difference of 1 not 100 for the hundred-times repeated file. But something's inconsistent between the two somewhere, and also with the original ruby code which gives 954 matches for the o10k.ap, or 1097 for the regex %r{GET /ongoing/When/([^ .]+) }.

From the gnome-panel dials, the erlang isn't IO bound in total on this machine - for the first few seconds it is running at max IO and 90% CPU, then for the remainder 99% CPU and does zero IO, so it's reading it all into memory then spawning processes to scan it. I'll leave it running overnight on the 10 million line file to how it fares when it can't fit the file into memory, though I doubt that has much of a bearing on what would happen on the T2 box, as that has plenty of RAM.

[Update: It took 52 minutes 26, and was doing disk IO throughout, but that could well have been paging rather than the IO it has to. Nothing to conclude, other than that it doesn't scale linearly - 10 times bigger file takes 34 times longer.]

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real    0m8.622s
user    0m0.284s
sys     0m0.248s

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time ~/projects/quad-store/bin/read_bench datasets/hundred-o10k.ap > /dev/null
real    0m8.754s
user    0m0.000s
sys     0m0.180s

So the 64 bit-wide matcher on the AMD64 laptop is very IO bound; the difference between the total time of just doing the IO and scannning the lines is negligible.

At 2200 MHz, it's scanning 201 MB data in 0.284s, which is 2200*0.284/201 = 3 cycles per byte processed.

Sun's Thumper gives 2GB/s transfer into memory; the T2 runs at 1.4 GHz and assuming the same IO rate, we have 0.7 clock cycles per byte delivered to play with. 64 hardware threads would give about 44.

Running read_bench on one of the 'ancient creaking' Netras, gives:

tercel-2:$ time bin/read_bench datasets/thousand-o10k.ap 
real    1m15.449s
user    0m0.362s
sys     0m27.998s

That's IO bound, 2,095 MB/75.5s = 27 MB/s. The laptop gets a similar transfer rate figure on the same test.

The T2 is 200 times the CPU, and the Thumper 80 times the disk transfer rate; I'll assume that the T2 system's IO is comparable to the Thumper.

tercel-2:$ time bin/read_bench datasets/hundred-o10k.ap 
real    0m6.159s
user    0m0.058s
sys     0m2.834s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap 
matches: 110000
real    0m7.617s
user    0m4.054s
sys     0m2.504s

At 440 MHz, it's scanning 201 MB in 4s, which is 440*4/201 = 8 cycles per byte processed; it's IO dominated but not IO bound. Optimising the matching code would at best give a 20% improvement.

It's also using about 5 cycles per byte system time for the IO.

Since the Netra runs Solaris 10 and is a Sparc variant, and in the absence of better information, I'm working on the assumption that the 1.4GHz T2 would be close to it in terms of clock cycles per work unit, so to process the log files it would also need around 13 cycles per byte delivered. So either we'd need 13/0.7 = 19 threads, assuming each thread provisions another 0.7 cycles worth of work, or run 64 threads and can afford code that's three times slower than the single threaded variant due to concurrency management. If it's closer to the fat-piped AMD64, then only 6 cycles per byte and 8 threads would do.

Guessing even more widely, with a large dataset and large grain concurrency, the C++ matching code should be at least three times faster than the IO - using 8 to 20 of the 64 threads. The erlang interpreter seems to be around 10 times slower , so would require 80 to 200 threads to balance the transfer rate [correction: only the CPU time is relevent here, so it's 83.649+1.620s vs 0.284+0.248s, which is 160 times slower, so would require 1280 threads to get the same throughput]. If the IO rate is less than 800MB/s and the erlang scales across all 64 cores then the faster matching in the C++ won't give any advantage. I've no idea how good erlang is across cores - the only times I've seen it praised is many (conceptual) processes on small numbers of cores - it's a concurrent language rather than a parallel one. Inter-core communication doesn't come free, and any message still need either lock based synchronisation or CAS/HTM retries; as it doesn't matter which actor works on the data, in a C++ implementation you'd use a fail fast lock and try with a different actor rather than it blocking, so going lock-free would cost more than it would gain. AFAIK you don't have that option in erlang, and are stuck with the given queuing and scheduling mechanisms, though you probably could subvert them.

In computer science there are only three numbers - zero, one and many. If my estimates are anything to go on (and I wouldn't bet more than half a penny on any of them), on the target system you need many threads to solve this problem most efficiently.

Looking at the code for the bfile library, it shouldn't be too hard to move the 64bit wide string match into an erlang library, which seems more fun than fiddling with MPI to get the required thread count, but the required thread count is far fewer than the number of processes in Steve's code, so MPI or pthreads should be a better model than large scale concurrency. But erlang may be to concurrency as lisp is to dynamism - I like lisp, but I haven't found been any interesting problems which I could apply it commercially under the constraints of the business, and even though I end up greenspunning in most systems I write, the subset of lisp-like features which gets implemented to give dynamic behaviour is tailored to the problem, and the rest of the code can be performance optimised more easily.

On the other hand, my fiancée comes back into the country on Saturday and I'm supposed to have finished sending out our wedding invitations by then, and all this isn't really helping that to happen.

Pete

Labels: c++, concurrency, programming, widefinder

2007-09-30

Data-wide wide-finder

Running a test with just breaking into lines without the matching, compared to the non-MPI version with matching runs on the 100x file in 380ms user time rather than 720ms. So there is work in both parts of the string processing.

Parallelising the scan for newlines to work on 8 chars at a time gives a reduction to around 220ms.

Restoring the matching, and parallelising the scan for the first character makes the CPU run at 800MHz.

A quick google search tells me that sudo dpkg-reconfigure gnome-applets enables me to use the gnome panel applet to lock the frequency at 2200MHz, so with that set, I can compare results between the linear and SIMD parallel forms. Typical results are around 400ms better for the parallelised version on the laptop:

fortinbras:$ time bin/tbray4 datasets/hundred-o10k.ap
matches: 110000
real    0m10.420s
user    0m0.688s
sys     0m0.268s

fortinbras:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real    0m8.622s
user    0m0.284s
sys     0m0.248s

This makes user+sys time go from 950ms to 532ms; if this process was CPU limited, that's as good as you could possibly get from running two cores.

Sorting out a few machine word dependencies so it runs on one of the Netra Solaris boxes (Linux/Solaris code is at tbray6.cpp; I was assuming size_t would be 64 bit on all 64 OSes), the results are:

tercel-2:$ time bin/tbray4 datasets/hundred-o10k.ap
matches: 110000

real    0m7.593s
user    0m5.638s
sys     0m1.894s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000

real    0m5.915s
user    0m4.027s
sys     0m1.855s

Here we get a much more significant speed up - about 25% faster overall time, due to a different balance between CPU and disk.

These simple benchmarks indicate a few things to me:

Not all gains from parallelism require concurrent processes - more transistors can mean wider data paths as well as concurrent cores. If I were using the architecture specific SIMD registers on the AMD64, then the paths would be twice as wide again.

A slower CPU with a 'server class' disk outperforms a faster CPU with a 'consumer class' disk in this test, but shows up potential to parallelise the code.

I like playing with bit-twiddly code on the weekends. I don't like having to debug bit-twiddly code in mission critical applications for my customers to a tight deadline. The optimisations should be in the regex library, rather than complicating user code - for example, this code avoids reading past the end of the buffer only because it has a pattern longer than the word size.

I'm also wondering whether some kind of hash based wide-scan can detect two or more characters of a pattern well enough to give an improvement, rather than just a wide-scan for first one - there's probably only a couple of bits of information per character in those log files.

Pete

ETA:

fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/tbray6.cpp | wc -l
85

It's hairy but it's not huge.

Labels: algorithms, c++, parallel, programming, widefinder

2007-09-29

Wide finder, parallelism and languages

Tim Bray, who knows about web search and was part of the XML specification process, is experimenting with exploiting parallelism in his wide finder project.

I'm interested in parallelism (running similar processes on multiple hardware) and concurrency (having multiple collaborating threads of control in a system), as many interesting problems are intrinsically concurrent, and as hardware isn't getting much faster but is getting wider, so will provide more parallel processing bandwidth.

The original problem is one of extracting the N most popular pages (not images) fetched from Tim's ongoing blog from the server's log files. These files are a few gigabytes in size.

Tim's ruby code runs in 13.5 seconds on a 1.67Ghz PowerBook, which is twin core G4. My use of my twin core 1.42 GHz G4 desktop petered out last year when I got a single core AMD64 laptop, as most of the time the laptop was more responsive. The laptop runs Ubuntu 7.04 64 bit.

Measuring the problem:

As it's not a lot of ruby code, it's easy to write a little C++ code to find out where the performance issues are.

First off I was playing with simple string matching (tbray1.cpp) vs KMP (tbray2.cpp), but as it's a requirement that the code should be as simple as possible, and KMP doesn't actually help modern CPUs as it's a jumpy algorithm, the third, simpler approach just calling strncmp works as well as far as using time to measure can determine (tbray3.cpp).

fortinbras:$ time bin/tbray3 datasets/original/thousand-o10k.ap
matches: 1100000

real    2m28.201s
user    0m10.193s
sys     0m4.752s

At that size of dataset, the laptop's CPU didn't get above 800MHz (it has 800MHz, 1600MHz, and 2200MHz speeds and stays at 800MHz). Smaller datasets which are already in the disk read-ahead buffer do cause its CPU to shift up; hence some of Steve Vinoski's figures for erlang times are not indicative of the problem - you don't run the same set of data through the system again and again, so you have to either load something else into the file cache to clear it (cat something-big > /dev/null), or use full-size datasets.

Counting non-comment lines which aren't a single { or }:

fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray3.cpp | wc -l
48

So a simple, sequential C++ implementation using a memory mapped file is 48 lines of code (my bracing formatting habits shouldn't count against the language), and isn't too slow.

Running the ruby for reference, which does a bit more in terms of counting occurances:

fortinbras:$ time ruby src/benchmarks/finder.rb datasets/original/thousand-o10k.ap
...
real    1m20.744s
user    0m24.710s
sys     0m4.420s

Presumably Tim's disk speed to CPU speed ratio is higher; on my laptop ruby's IO bound, and the CPU not maxxed, though it does shift to a faster speed. Fortinbras processes the same size dataset that Tim was using in 8.6 seconds.

But the ruby is doing IO much faster than my simple memory mapped code, so changing to use block IO rather than memory mapped (at the cost of a 25% longer program):

fortinbras:$ time bin/tbray4 datasets/original/thousand-o10k.ap
matches: 1100000

real    1m6.780s
user    0m9.593s
sys     0m2.464s

fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray4.cpp | wc -l
60

Again, the C++ implementation doesn't stress the CPU enough to get out of first gear, and it's about 14 seconds faster than the ruby, due entirely to less user CPU. I'd guess that the underlying C code the ruby interpreter calls for its line reading is similar; I don't know whether ruby strings are internally UTF-16 rather than UTF-8; if so that alone would account for CPU cost. I'm actually quite impressed that ruby isn't glacially slow, but I guess most of the work is between the lines of the program.

The C++ code also fewer lines of code than Steve Vinoski's 84 line erlang example, though the C++ doesn't show use of multi-core parallel processing. Parallel code in C++ can take more work. Concurrent code in C++ definately takes more work than in erlang.

Given an infinitely capable CPU and the same disk, ruby's 81 seconds or C++'s 67 will reduce to 55. If the CPU is using 12 seconds, to get anything from parallel CPUs, you'd need to get the data from disk in less than 12 seconds, which is probably not far off what current flash is capable of. I don't believe there's a software solution to make the IO much faster on the laptop's hardware.

Running on a Sun Netra T1 105, with a 440MHz Sparc CPU:

tercel-2:~/projects/tbray$ time bin/tbray4 datasets/thousand-o10k.ap
matches: 1100000

real    1m23.930s
user    0m57.070s
sys     0m24.636s

Much more CPU time, but the total real time is in the same ball park - the ten year old server has a fast wide SCSI disk but a slower CPU, so is close to being CPU bound.

If you have multiple cores and multiple disks to read from, you can launch multiple batch scripts to process different days' logs in parallel, like make -n or using MPI Scatter/Reduce.

MPI version

There's a simple extension of tbrayN.cpp with MPI at tbray5.cpp. I'm an MPI novice - most of the threading problems I've had to solve require concurrency rather than parallel processing. MPI defaults to use as many processes as there are cores, but you can force it to use a specific number of processes. On a single core machine, it slows from 1.2 seconds on a million line file with a single process to 14 seconds with 16 processes, to 54 seconds with 64 processes. Trying to launch 1000 processes causes the machine to spend forever context switching. Running lots of processes without context switching is what erlang is good at, but for optimal CPU throughput you only want as many processes as you have cores.

The MPI version loses in terms of conciseness, and has two separate code paths - the process to read the file into chunks, and those to scan the chunks. It's not as nice to read as the erlang. It comes in at 92 lines of code, 6 of which are error reporting that the erlang example lacks (presumably the error is passed to read-eval-print-loop to handle), so is within a couple of lines. Using pthreads would probably require more lines, as it lacks suitable message passing primitives.

Parallel languages, concurrent languages

IO aside, the actual processing seems to be a problem in parallelism - the task can be split up into work packets which can be processed independently, with a reduction at the end to a single result set. Only about one in nine lines contribute to the counts, so the dataset for the reduction is significantly smaller than the input dataset, so parallelising the matching and maybe one pass of sorting and counting should allow a speed-up on a multi-core machine.

Languages such as Sun's Fortress have intrinsically parallel constructs and atomic blocks to hide concurrency issues from the developer, and take advantage of multi-core hardware. Fortress has built in parallel for and reduction, equivalent to MPI code but without the programmer having to explicitly manage the processes.

In Fortress, the code for wide finder should be no more compicated than the ruby code; it's up to the implementation to parallelise the loop. Unfortunately, the current Fortress implementation is an interpreter on top of the JVM, and isn't speed optimised - it takes a couple of seconds to parse the scripts, and then interprets them rather than generating byte-codes.

Actor model languages such as Erlang, Alef, and Scala, are used where the problem is best expressed in terms of concurrent processes. Their implementation is designed to allow many, many concurrent processes on a finite number of hardware cores - they reduce requirements for locking, have strategies to mitigate blocking operations stalling other actor's execution, and solve many issues that OS using level threads for concurrent, communicating, mobile, robust processes exposes. I've written actor based code professionally where the problem fits that model.

The wide finder problem has no inter-process communication requirement until the workers terminate, it doesn't require more processes than there are CPUs, it doesn't write anything so doesn't suffer from multiple writer threads and locking issues, and it doesn't require much execution to continue if one worker blocks - the blocking will be due to IO, and it can't do anything anyway.

Erlang and Scala are optimised for the opposite problem to the wide finder problem - concurrency within a core rather than parallelism across cores. A better language choice would be one such as Fortress, which gives parallelism across cores without having to express the problem in anything other than terms of the operations on the data.

Conclusion

This problem shows up one of the reasons I'm working on compression for the quad-store - any data mining using spinning disks is IO bound; using compression is one way to shift work load off the IO pipe and onto the CPU. It's also why I spend a few tenners on a second fast disk for each of the Netras in the cluster, rather than a couple of thousand on a new multi-core box. I can't afford 30 dollars/giga-byte for a solid state wide-drive.

It's also an interesting comparison of compiled and dynamic languages - running ruby is CPU heavy, as heavy as running on the 10 year old hardware in my attic, but is allegedly easier to program in. I think I'd prefer it to Perl if I was to write text extraction and reporting scripts, but hate nearly all programming languages anyway once I have to use them. Some are more useful for certain problem classes.

Some of the discussion has been about approaches towards splitting the log file into chunks, and splitting it into lines.

None of the hits in the sample file require you to split it into lines - the pattern doesn't match anything it shouldn't if you ignore line breaks in the code. To split the file into lines you need to inspect every character; to match the pattern you may only need to inspect one in 18 (if it's not in "GET /ongoing/When/" you can then skip 18 characters and look if that's in the pattern, which is the basis of KMP); if the pattern matching can be done better than the line scanning, then that should give a boost on the CPU limited machine. It doesn't give a boost on the Netra, and I'm not sure why, possibly because KMP makes branch prediction worse.

All the code posted has been exact solutions. If you split a billion line file into 100 equal chunks for 100 cores to process, as you might using the built-in MPI Scatter function, you'll lose maybe 11 hits that get cut in half (as only one in nine requests matches the pattern), and you probably won't care - the most popular pages will have thousands of hits, so a dozen here or there won't be missed. Similarly, if you only want a 90% confidence for the most popular hits, you may only need to sample some of the log file. Whether a approximate solution is good enough is the sort of question you need to ask a customer before optimising for the exact case; it may well simplify the solution.

Pete

Labels: c++, concurrency, fortress, languages, parallel, programming, widefinder

2007-05-10

Dancing bear-ware

I missed a post last week because I was saying goodbye to my lass, who's now in Romania to do research for her PhD.

There's been a flurry all-over with Silverlight and now JavaFX, which is close to what I'm doing at work, and my interest in zero-install rich information applications, but what most interested me in the last week has been the use of template metaprogramming to express code constraints, which inspires the title of this blog entry.

I've not much to add, but there must be better ways of doing this without excessive cost in either developer productivity or runtime performance.

Pete

Labels: c++, metaprogramming