2007-11-01

One more wide-finder

Since previous efforts didn't print the results summary, they weren't really complete; so here are two more variants, using a vector (tbrayA.cpp) and a priority queue (tbrayB.cpp), to get the top 10 entries.
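Just for illustration, the priority-queue approach amounts to keeping a min-heap of at most N entries over the accumulated counts. This is a sketch of the idea rather than the contents of tbrayB.cpp (print_top_n and the map type are my own names here):

#include <cstdio>
#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

// Print the top n entries by count, largest first. The heap is a
// min-heap capped at n elements, so selection is O(entries * log n).
void print_top_n(const std::unordered_map<std::string, int>& counts,
                 size_t n) {
    typedef std::pair<int, std::string> entry; // (count, key)
    std::priority_queue<entry, std::vector<entry>,
                        std::greater<entry> > heap;
    for (const auto& kv : counts) {
        heap.push(entry(kv.second, kv.first));
        if (heap.size() > n)
            heap.pop(); // discard the current smallest
    }
    std::vector<entry> top;
    while (!heap.empty()) {
        top.push_back(heap.top());
        heap.pop();
    }
    for (auto it = top.rbegin(); it != top.rend(); ++it)
        std::printf("%d: %s\n", it->first, it->second.c_str());
}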

There's a pthread parallel version of the B variant (tbrayB_parallel.cpp). The main thread reads chunks, scans back to the last line break, and hands each chunk off to a pool of worker objects, which run the matching in their own threads. Once all the data is read, the workers' counts are summed and the top 10 counts found, then the results are printed as in the single-threaded version. Each worker's data buffer is written only by the main thread, for the most part in the call to the OS's read function; this helps reduce the amount of data copying - though whether the T5120 would have to copy anyway I don't know (some CPUs share cache between cores, others don't, and those that don't may have to copy the buffer from one cache to the other when the matching thread starts processing).
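The chunk/line-break dance looks roughly like this - a sketch of the shape described above, not the actual tbrayB_parallel.cpp (fill_chunk and CHUNK are invented names):

#include <unistd.h>
#include <cstddef>

const size_t CHUNK = 1 << 20; // per-worker buffer size; lines assumed shorter

// Read into buf after `carry` leftover bytes from the previous chunk.
// Returns the byte count up to and including the last newline; `tail`
// is the count of trailing bytes after it, to be copied into the front
// of the next worker's buffer by the main thread.
size_t fill_chunk(int fd, char* buf, size_t carry, size_t& tail) {
    ssize_t n = read(fd, buf + carry, CHUNK - carry);
    if (n <= 0) {            // EOF or error: flush whatever is left
        tail = 0;
        return carry;
    }
    size_t len = carry + static_cast<size_t>(n);
    size_t cut = len;
    while (cut > 0 && buf[cut - 1] != '\n')
        --cut;               // scan back to the last line break
    tail = len - cut;
    return cut;
}

The main thread then hands buf[0..cut) to an idle worker and copies the tail into the next buffer before its read call, which is why each buffer only ever has one writer.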

The code hasn't been built on anything other than linux this time, so I can't guarantee it will run on Solaris. There's also a deadlock in the scheduler if you use only two workers (you need at least two so that each buffer is only being written or read by one thread at a time), which I have yet to track down.

Obviously, the multi-threaded version requires quite a lot of infrastructure - it's twice the size of the single-threaded version. It also runs significantly slower than the single-threaded version:

fortinbras:$ time bin/tbrayB_parallel datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real 0m1.358s
user 0m0.372s
sys 0m0.232s
fortinbras:$ time bin/tbrayB datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real 0m0.531s
user 0m0.232s
sys 0m0.156s
Commenting out the call to process_chunk(), so it just does the fork and join thread communications:
fortinbras:$ time ./widefinder datasets/hundred-o10k.ap 3

real 0m0.288s
user 0m0.056s
sys 0m0.220s
So there is some additional cost in the parallelism other than just the inter-thread messaging (372 > 232 + 56).

There's an archive of the code here; unpack it, run make, and call ./widefinder with the path of the logfile and the number of worker threads (three minimum). The other examples compile to the ./bin directory from make tbray.

I'm still not overly convinced that it will be CPU bound given a small, slow, miserable disk, unless you are using a language whose virtual machine is excessively baggy; I've been careful not to use any optimisation which is specific to the data, rather than something a compiler for a higher-level language would know, given suitable annotations for parallelism.

The latest erlang effort from Pichi is twice as fast as the original ruby; ruby takes 2.580s CPU time on my machine, so the single-threaded C++ version is 11 times faster than ruby on one core, the parallel one 7 times. So once you have half a dozen cores the erlang should beat the single-threaded C++ code. Though I'd like to see whether the disk on the T5120 means it is CPU limited or IO limited with the faster code, and I'm unconvinced that the T5120 is what the future will look like - I'd expect more cache and flash on chip rather than more cores, and a multi-node rather than a multi-core future.


Pete


2007-10-05

A few more thoughts on concurrency and high-level languages

A little while ago, I started playing with a pattern-matching and relational language for meta-programming called 'kin' (it's got relations, and does pattern matching, so kin/kinship).

Being a machine elf rather than a language lawyer, I never could come up with a decent user syntax for it. I've toyed with making it a lisp (either a Lisp-1 like scheme or a Lisp-2 like common lisp; I wrote something that used lisp s-exprs to manipulate JVM class files, but never got further, and then Java 1.5 came out and increased the verbosity of the type system without increasing its power significantly, and I moved on), and with making it a prolog derivative (I rewrote the core of a Java WAM engine so it ran a few times faster, but never got anywhere much with that either, apart from looking at it and thinking, 'yep, it's faster now, so what do I do with prolog?'). Then I thought about a faster JavaScript engine, but Tamarin's handling that well enough for most use-cases, and I want better type inference rather than having to tell the compiler everything. Tamarin also makes the mistake of relying on programmer-specified types to give performance; usually the type can be inferred if you have the whole system available, and even if not, the Self and StrongTalk VMs show you don't need explicit types to help the compiler. Type systems should help programmers, not compilers.

But the JavaScript story doesn't really matter - the XULRunner front end I spend half my time on at work spends 90% of its time in Mozilla's SVG rendering code rather than my JavaScript anyway. I write my best software when there are users asking me for it, and when it's also a problem which interests me; without users I get bored too quickly, without interest I don't give more than the users expect.

I've spent a few hours poking around the Rubinius code; they appear to be making a ruby interpreter very like lisp interpreters were a couple of decades ago. I don't program ruby; I do heavy math backends and light front ends (with occasional forays into OpenGL), and never touch RDBMS, so I use C++, Fortran and Java for backends; C++, Java and XSLT for middleware; and C++, Java and JavaScript for front ends (depending what level of 3D visuals they require; I haven't had cause to look at canvas3D yet, so the OpenGL work has always been in C or Java, though given the current Gecko SVG performance I've gone off basing a front end on Mozilla code) - though I'd quite happily never use Java for a UI again (last time I got RSI from all those anonymous inner classes). I've been known to use lisp to prototype math code before coding it up in C++ for speed (though that's mainly been due to working for BAE at the time, who take 6 months to buy software for a 3 month project, or due to other business constraints on the implementation language, and isn't a reflection on professionally produced lisp compilers, which are fast enough not to require porting to C++).

An old housemate and good friend of mine did his PhD on an embedded Erlang system; that's as close as I came to it before a year or so ago, when I started using XMPP at work and tried installing ejabberd. It didn't work, and I gave up after a few tries, so I tested our server's federation code against Wildfire instead. That's nothing to do with Erlang; ejabberd appears to scale well in independent tests, and I suspect the Windows installer just wasn't a priority (our customers require Windows). I've read comments that there are smart people who have written good software in Erlang. This is true. Smart people also write good software in BASIC or COBOL or Ada; that's because smart people write good software. Writing a language that smart people can write good software in is easy; writing a language in which average people write non-crap software is far harder. In my opinion, for this sort of problem, ruby seems such a language (well, it seems much better than Perl or PHP, and a bit better than python), but the implementation is pants. Conversely, ejabberd scales to many more users than Twitter can, and I suspect it is a cleaner expression of the problem, because IM is a concurrency problem and erlang far exceeds ruby's capabilities at concurrency.

Here's some good advice on how a high level language should express a problem in something I started reading earlier:
"The structure of the program should exactly follow the structure of the problem. Each real world concurrent activity should be mapped onto exactly one concurrent process in our programming language. If there is a 1:1 mapping of the problem onto the program we say that the program is isomorphic to the problem. It is extremely important that the mapping is exactly 1:1." -- Joe Armstrong (PhD thesis, Dec 2003).

You have a log file. You scan each line (in any order) for a hit and accumulate the number of times each hit occurs. Then you list the hits in decreasing order of the accumulated hit counts.

I don't see any concurrency intrinsic in the problem. The only reason we're even thinking about concurrency is that we want the implementation of the solution to be parallelised over many cores for performance reasons, and (as SIMD isn't sufficiently general to scale much beyond 4-times-wider paths, except in GPUs) many concurrent cores are the only thing we can think of that will give computers more throughput. For this problem, concurrency is as much an accident as memory management. Languages which handle memory management automatically have already eclipsed those which require manual memory management in most areas where it isn't an essential part of the problem. As concurrency is not an essential part of this problem, no language where the concurrency is manifest in the implementation of the solution will be following the structure of the problem. Admittedly, I grew up on BASIC and 6502 assembler in my teens, and if you teach kids pi-calculus in high school the next generation might see the problem differently.

Neo: What are you trying to tell me? That I can dodge bullets?
Morpheus: No, Neo. I'm trying to tell you that when you're ready, you won't have to.


With Erlang you can express a problem in terms of concurrency.
It may be a long time before ruby's ready.


Pete


POITROAI

For this particular problem, the principles are more important than the exact results, but I'm happier with tbray7.cpp, which gives the same results as the regex in the original ruby code:
fortinbras:$ time bin/tbray7 datasets/hundred-o10k.ap 
matches: 95400

real 0m0.439s
user 0m0.284s
sys 0m0.144s
The file was in cache for that one, so ignore the real time; the important thing is that the user time wasn't adversely affected by getting the 'right' answer instead of 1100 matches.


Pete


2007-10-04

Of course, it helps if you tell gcc to generate a 64 bit executable

... when you're using 64 bit operations:
tercel-2:$ time bin32/tbray4 datasets/hundred-o10k.ap 
user 0m5.644s
tercel-2:$ time bin32/tbray6 datasets/hundred-o10k.ap
user 0m4.048s
tercel-2:$ time bin/tbray4 datasets/hundred-o10k.ap
user 0m5.725s
tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap
user 0m3.839s
I'd assumed it would default to 64 bit as it does on linux, but no. Which also explains why I was confused about sizeof(size_t) being less than sizeof(long long) here.
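For the record, the fix is just a flag; the actual makefile isn't shown here, but the compile line amounts to something like:

g++ -m64 -O3 -o bin/tbray6 src/tbray6.cpp

On Sparc Solaris gcc defaults to the 32 bit ABI; on x86-64 linux it defaults to 64 bit, hence the confusion.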

It's still around 8 clock cycles per byte - the 5% gain is not enough to alter yesterday's estimates, which is a little encouraging actually, as I wasn't sure whether or not 32 bit Sparc instruction generators, such as GNU lightning (used in Rubinius) or whatever erlang's HiPE uses, would cause a significant slow-down.


Pete


2007-10-02

Some fingers in the air.

I got Steve Vinoski's 2007/09/29 erlang code, installed hipe and the bfile module, and it ran on the laptop:
fortinbras:$ cat ../datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time erl -smp -noshell -run tbray5 main 512 ../datasets/hundred-o10k.ap
110100 matches found

user 1m23.649s
real 1m33.683s
sys 0m1.620s
I'm not sure, looking at either mine or Steve's code, where the 1101st match comes from - there are #ifdefs in my line-splitting code to print the lines, and if you diff that output with the input it's the same; and if it were related to the last line in the sample it would give a difference of 1, not 100, for the hundred-times-repeated file. But something's inconsistent between the two somewhere, and also with the original ruby code, which gives 954 matches for the o10k.ap, or 1097 for the regex %r{GET /ongoing/When/([^ .]+) }.

From the gnome-panel dials, the erlang isn't IO bound in total on this machine - for the first few seconds it runs at max IO and 90% CPU, then for the remainder at 99% CPU and zero IO, so it's reading it all into memory then spawning processes to scan it. I'll leave it running overnight on the 10 million line file to see how it fares when it can't fit the file into memory, though I doubt that has much of a bearing on what would happen on the T2 box, as that has plenty of RAM.

[Update: It took 52 minutes 26 seconds, and was doing disk IO throughout, but that could well have been paging rather than the IO it has to do. Nothing to conclude, other than that it doesn't scale linearly - a 10 times bigger file takes 34 times longer.]

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real 0m8.622s
user 0m0.284s
sys 0m0.248s

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time ~/projects/quad-store/bin/read_bench datasets/hundred-o10k.ap > /dev/null
real 0m8.754s
user 0m0.000s
sys 0m0.180s
So the 64 bit-wide matcher on the AMD64 laptop is very IO bound; the difference between the total time of just doing the IO and of scanning the lines as well is negligible.

At 2200 MHz, it's scanning 201 MB of data in 0.284s, which is 2200*0.284/201 = 3 cycles per byte processed.

Sun's Thumper gives 2GB/s transfer into memory; the T2 runs at 1.4 GHz and assuming the same IO rate, we have 0.7 clock cycles per byte delivered to play with. 64 hardware threads would give about 44.

Running read_bench on one of the 'ancient creaking' Netras gives:
tercel-2:$ time bin/read_bench datasets/thousand-o10k.ap 
real 1m15.449s
user 0m0.362s
sys 0m27.998s
That's IO bound, 2,095 MB/75.5s = 27 MB/s. The laptop gets a similar transfer rate figure on the same test.

The T2 is 200 times the CPU, and the Thumper 80 times the disk transfer rate; I'll assume that the T2 system's IO is comparable to the Thumper.

tercel-2:$ time bin/read_bench datasets/hundred-o10k.ap 
real 0m6.159s
user 0m0.058s
sys 0m2.834s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap 
matches: 110000
real 0m7.617s
user 0m4.054s
sys 0m2.504s
At 440 MHz, it's scanning 201 MB in 4s, which is 440*4/201 = 8 cycles per byte processed; it's IO dominated but not IO bound. Optimising the matching code would at best give a 20% improvement.

It's also using about 5 cycles per byte system time for the IO.

Since the Netra runs Solaris 10 and is a Sparc variant, and in the absence of better information, I'm working on the assumption that the 1.4GHz T2 would be close to it in terms of clock cycles per unit of work, so to process the log files it would also need around 13 cycles per byte delivered. So either we'd need 13/0.7 = 19 threads, assuming each thread provisions another 0.7 cycles' worth of work, or we run 64 threads and can afford code that's three times slower than the single-threaded variant due to concurrency management. If it's closer to the fat-piped AMD64, then only 6 cycles per byte and 8 threads would do.

Guessing even more widely, with a large dataset and large grain concurrency, the C++ matching code should be at least three times faster than the IO - using 8 to 20 of the 64 threads. The erlang interpreter seems to be around 10 times slower, so would require 80 to 200 threads to balance the transfer rate [correction: only the CPU time is relevant here, so it's 83.649+1.620s vs 0.284+0.248s, which is 160 times slower, and would require 1280 threads to get the same throughput]. If the IO rate is less than 800MB/s and the erlang scales across all 64 cores, then the faster matching in the C++ won't give any advantage.

I've no idea how good erlang is across cores - the only times I've seen it praised are for many (conceptual) processes on small numbers of cores; it's a concurrent language rather than a parallel one. Inter-core communication doesn't come free, and any message still needs either lock-based synchronisation or CAS/HTM retries; as it doesn't matter which actor works on the data, in a C++ implementation you'd use a fail-fast lock and try a different actor rather than blocking, so going lock-free would cost more than it would gain. AFAIK you don't have that option in erlang, and are stuck with the given queuing and scheduling mechanisms, though you probably could subvert them.
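The fail-fast hand-off in C++ would look something like this sketch (my illustration, not code from the archive; Worker and hand_off are invented names):

#include <pthread.h>
#include <sched.h>
#include <vector>

struct Worker {
    pthread_mutex_t lock; // held by the worker while it is busy
    // ... buffer, counts, thread handle
};

// Try each worker in turn and give the chunk to the first free one,
// rather than blocking on any particular worker's queue.
int hand_off(std::vector<Worker>& workers) {
    for (;;) {
        for (size_t i = 0; i < workers.size(); ++i)
            if (pthread_mutex_trylock(&workers[i].lock) == 0)
                return static_cast<int>(i); // caller fills the buffer, then
                                            // signals the worker to start
                                            // (signalling omitted here)
        sched_yield(); // all busy: yield and try again
    }
}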

In computer science there are only three numbers - zero, one and many. If my estimates are anything to go on (and I wouldn't bet more than half a penny on any of them), on the target system you need many threads to solve this problem most efficiently.

Looking at the code for the bfile library, it shouldn't be too hard to move the 64 bit wide string match into an erlang library, which seems more fun than fiddling with MPI to get the required thread count; but the required thread count is far fewer than the number of processes in Steve's code, so MPI or pthreads should be a better model than large-scale concurrency. But erlang may be to concurrency as lisp is to dynamism - I like lisp, but I haven't found any interesting problems to which I could apply it commercially under the constraints of the business, and even though I end up greenspunning in most systems I write, the subset of lisp-like features which gets implemented to give dynamic behaviour is tailored to the problem, and the rest of the code can be performance-optimised more easily.

On the other hand, my fiancée comes back into the country on Saturday and I'm supposed to have finished sending out our wedding invitations by then, and all this isn't really helping that to happen.


Pete


2007-09-30

Data-wide wide-finder

Running a test which just breaks the input into lines, without the matching, the non-MPI version processes the 100x file in 380ms user time rather than 720ms. So there is work in both parts of the string processing.

Parallelising the scan for newlines to work on 8 chars at a time reduces that to around 220ms.
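The 8-chars-at-a-time scan is the usual SIMD-within-a-register trick - subtracting one from each byte and masking the high bits finds a zero byte in a word. A sketch of the idea, assuming it resembles what tbray6.cpp does:

#include <cstdint>
#include <cstring>

// Non-zero iff some byte of `word` equals `target`: XOR makes the
// matching byte zero, then the classic zero-byte test finds it.
inline bool word_has_byte(uint64_t word, unsigned char target) {
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    uint64_t v = word ^ (ones * target);
    return ((v - ones) & ~v & highs) != 0;
}

const char* find_newline(const char* buf, const char* end) {
    while (buf + 8 <= end) {
        uint64_t w;
        memcpy(&w, buf, 8);            // unaligned-safe 8-byte load
        if (word_has_byte(w, '\n'))
            break;                     // it's somewhere in these 8 bytes
        buf += 8;
    }
    while (buf < end && *buf != '\n')  // locate it (or hit the end)
        ++buf;
    return buf;
}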

Restoring the matching, and parallelising the scan for the first character, makes the CPU run at 800MHz - the frequency governor drops the clock when the process isn't stressing the CPU.

A quick google search tells me that sudo dpkg-reconfigure gnome-applets enables me to use the gnome panel applet to lock the frequency at 2200MHz, so with that set, I can compare results between the linear and SIMD-parallel forms. Typical results are around 400ms less user time for the parallelised version on the laptop:
fortinbras:$ time bin/tbray4 datasets/hundred-o10k.ap
matches: 110000
real 0m10.420s
user 0m0.688s
sys 0m0.268s

fortinbras:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real 0m8.622s
user 0m0.284s
sys 0m0.248s
This makes the user+sys time go from 950ms to 532ms; if this process were CPU limited, that's as good as you could possibly get from running two cores.

Sorting out a few machine word dependencies so it runs on one of the Netra Solaris boxes (the Linux/Solaris code is at tbray6.cpp; I was assuming size_t would be 64 bit on all 64 bit OSes), the results are:
tercel-2:$ time bin/tbray4 datasets/hundred-o10k.ap
matches: 110000

real 0m7.593s
user 0m5.638s
sys 0m1.894s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000

real 0m5.915s
user 0m4.027s
sys 0m1.855s
Here we get a much more significant speed up - about 25% faster overall time, due to a different balance between CPU and disk.
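On that size_t point: the word-size assumption is the sort of thing an exact-width type avoids - a trivial sketch (scan_word is my name for it):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// size_t tracks the pointer size of the build, not the CPU's register
// width: it is 4 bytes in a 32 bit build even on a 64 bit CPU. Using an
// exact-width type for the 8-byte scan word avoids the surprise.
typedef std::uint64_t scan_word;

int main() {
    std::printf("sizeof(size_t) = %zu, sizeof(scan_word) = %zu\n",
                sizeof(std::size_t), sizeof(scan_word));
    return 0;
}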

These simple benchmarks indicate a few things to me:

Not all gains from parallelism require concurrent processes - more transistors can mean wider data paths as well as concurrent cores. If I were using the architecture specific SIMD registers on the AMD64, then the paths would be twice as wide again.

A slower CPU with a 'server class' disk outperforms a faster CPU with a 'consumer class' disk in this test, but shows up potential to parallelise the code.

I like playing with bit-twiddly code on the weekends. I don't like having to debug bit-twiddly code in mission-critical applications for my customers to a tight deadline. The optimisations should be in the regex library rather than complicating user code - for example, this code avoids reading past the end of the buffer only because the pattern is longer than the word size.

I'm also wondering whether some kind of hash-based wide-scan could detect two or more characters of the pattern well enough to give an improvement, rather than just a wide-scan for the first one - there's probably only a couple of bits of information per character in those log files.


Pete

ETA:
fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/tbray6.cpp | wc -l
85

It's hairy but it's not huge.


2007-09-29

Wide finder, parallelism and languages

Tim Bray, who knows about web search and was part of the XML specification process, is experimenting with exploiting parallelism in his wide finder project.

I'm interested in parallelism (running similar processes on multiple hardware) and concurrency (having multiple collaborating threads of control in a system), as many interesting problems are intrinsically concurrent, and as hardware isn't getting much faster but is getting wider, so it will provide more parallel processing bandwidth.

The original problem is one of extracting the N most popular pages (not images) fetched from Tim's ongoing blog from the server's log files. These files are a few gigabytes in size.

Tim's ruby code runs in 13.5 seconds on a 1.67GHz PowerBook, which is a twin core G4. My use of my own twin core 1.42GHz G4 desktop petered out last year when I got a single core AMD64 laptop, as most of the time the laptop was more responsive. The laptop runs Ubuntu 7.04, 64 bit.

Measuring the problem:

As it's not a lot of ruby code, it's easy to write a little C++ code to find out where the performance issues are.

First off I was playing with simple string matching (tbray1.cpp) vs KMP (tbray2.cpp), but as it's a requirement that the code be as simple as possible, and KMP doesn't actually help on modern CPUs as it's a jumpy algorithm, the third, simpler approach of just calling strncmp works as well, as far as using time to measure can determine (tbray3.cpp).
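The strncmp approach is essentially a linear scan for the pattern's first character, confirming with strncmp - a sketch of the shape (assumed, not tbray3.cpp verbatim):

#include <cstddef>
#include <cstring>

static const char pattern[] = "GET /ongoing/When/";
static const size_t plen = sizeof(pattern) - 1;

// Count occurrences of the fixed prefix in the buffer.
size_t count_matches(const char* data, size_t len) {
    size_t matches = 0;
    for (size_t i = 0; i + plen <= len; ++i) {
        if (data[i] == 'G' && strncmp(data + i, pattern, plen) == 0) {
            ++matches;
            i += plen - 1; // the pattern can't overlap itself
        }
    }
    return matches;
}

Timing that variant on the full-size dataset: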
fortinbras:$ time bin/tbray3 datasets/original/thousand-o10k.ap
matches: 1100000

real 2m28.201s
user 0m10.193s
sys 0m4.752s
At that size of dataset, the laptop's CPU didn't get above 800MHz (it has 800MHz, 1600MHz, and 2200MHz speeds, and stays at 800MHz). Smaller datasets which are already in the disk read-ahead buffer do cause its CPU to shift up; hence some of Steve Vinoski's figures for erlang times are not indicative of the problem - you don't run the same set of data through the system again and again, so you have to either load something else into the file cache to clear it (cat something-big > /dev/null), or use full-size datasets.

Counting non-comment lines which aren't a single { or }:
fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray3.cpp | wc -l
48
So a simple, sequential C++ implementation using a memory mapped file is 48 lines of code (my bracing formatting habits shouldn't count against the language), and isn't too slow.
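The memory-mapped skeleton is roughly this - a compressed sketch of the approach, not the 48-line file itself:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    size_t len = static_cast<size_t>(st.st_size);
    void* mapped = mmap(0, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) { perror("mmap"); return 1; }
    const char* data = static_cast<const char*>(mapped);

    // the strncmp scan loop sketched earlier
    static const char pat[] = "GET /ongoing/When/";
    const size_t plen = sizeof(pat) - 1;
    size_t matches = 0;
    for (size_t i = 0; i + plen <= len; ++i) {
        if (data[i] == 'G' && strncmp(data + i, pat, plen) == 0) {
            ++matches;
            i += plen - 1;
        }
    }
    printf("matches: %zu\n", matches);

    munmap(mapped, len);
    close(fd);
    return 0;
}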

Running the ruby for reference, which does a bit more in terms of counting occurrences:
fortinbras:$ time ruby src/benchmarks/finder.rb datasets/original/thousand-o10k.ap
...
real 1m20.744s
user 0m24.710s
sys 0m4.420s
Presumably Tim's disk-speed to CPU-speed ratio is higher; on my laptop the ruby is IO bound, and the CPU not maxed out, though it does shift to a faster speed. Fortinbras processes the same size dataset that Tim was using in 8.6 seconds.

But the ruby is doing IO much faster than my simple memory-mapped code, so changing to block IO rather than memory mapping (at the cost of a 25% longer program):
fortinbras:$ time bin/tbray4 datasets/original/thousand-o10k.ap
matches: 1100000

real 1m6.780s
user 0m9.593s
sys 0m2.464s

fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray4.cpp | wc -l
60
Again, the C++ implementation doesn't stress the CPU enough to get out of first gear, and it's about 14 seconds faster than the ruby, due entirely to less user CPU. I'd guess that the underlying C code the ruby interpreter calls for its line reading is similar; I don't know whether ruby strings are internally UTF-16 rather than UTF-8, but if so, that alone would account for the CPU cost. I'm actually quite impressed that ruby isn't glacially slow, but I guess most of the work is between the lines of the program.

The C++ code is also fewer lines than Steve Vinoski's 84 line erlang example, though the C++ doesn't show use of multi-core parallel processing. Parallel code in C++ can take more work. Concurrent code in C++ definitely takes more work than in erlang.

Given an infinitely capable CPU and the same disk, ruby's 81 seconds or C++'s 67 would reduce to 55. If the CPU is using 12 seconds, to get anything from parallel CPUs you'd need to get the data from disk in less than 12 seconds, which is probably not far off what current flash is capable of. I don't believe there's a software solution to make the IO much faster on the laptop's hardware.

Running on a Sun Netra T1 105, with a 440MHz Sparc CPU:
tercel-2:~/projects/tbray$ time bin/tbray4 datasets/thousand-o10k.ap
matches: 1100000

real 1m23.930s
user 0m57.070s
sys 0m24.636s
Much more CPU time, but the total real time is in the same ball park - the ten year old server has a fast wide SCSI disk but a slower CPU, so is close to being CPU bound.

If you have multiple cores and multiple disks to read from, you can launch multiple batch scripts to process different days' logs in parallel, like make -j, or using MPI Scatter/Reduce.

MPI version

There's a simple extension of tbrayN.cpp with MPI at tbray5.cpp. I'm an MPI novice - most of the threading problems I've had to solve require concurrency rather than parallel processing. MPI defaults to using as many processes as there are cores, but you can force it to use a specific number. On a single core machine, it slows from 1.2 seconds on a million line file with a single process, to 14 seconds with 16 processes, to 54 seconds with 64 processes. Trying to launch 1000 processes causes the machine to spend forever context switching. Running lots of processes without context switching is what erlang is good at, but for optimal CPU throughput you only want as many processes as you have cores.
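The scatter/reduce shape, for flavour - an illustration of the idea rather than tbray5.cpp itself, and it ignores the handful of lines that straddle slice boundaries:

#include <mpi.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (argc < 2) { MPI_Finalize(); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    // Each rank takes a contiguous slice of the file.
    off_t begin = st.st_size * rank / size;
    off_t end   = st.st_size * (rank + 1) / size;
    size_t len  = static_cast<size_t>(end - begin);
    char* buf   = static_cast<char*>(malloc(len));
    ssize_t got = pread(fd, buf, len, begin);

    static const char pat[] = "GET /ongoing/When/";
    const size_t plen = sizeof(pat) - 1;
    long local = 0;
    if (got > 0)
        for (size_t i = 0; i + plen <= static_cast<size_t>(got); ++i)
            if (buf[i] == 'G' && strncmp(buf + i, pat, plen) == 0)
                ++local;

    long total = 0; // sum the per-rank counts onto rank 0
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("matches: %ld\n", total);

    free(buf);
    close(fd);
    MPI_Finalize();
    return 0;
}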

The MPI version loses in terms of conciseness, and has two separate code paths - the process that reads the file into chunks, and those that scan the chunks. It's not as nice to read as the erlang. It comes in at 92 lines of code, 6 of which are error reporting that the erlang example lacks (presumably errors are passed to the read-eval-print loop to handle), so it is within a couple of lines. Using pthreads would probably require more lines, as it lacks suitable message-passing primitives.

Parallel languages, concurrent languages

IO aside, the actual processing seems to be a problem in parallelism - the task can be split into work packets which can be processed independently, with a reduction at the end to a single result set. Only about one in nine lines contributes to the counts, so the dataset for the reduction is significantly smaller than the input dataset, and parallelising the matching and maybe one pass of the sorting and counting should allow a speed-up on a multi-core machine.

Languages such as Sun's Fortress have intrinsically parallel constructs and atomic blocks to hide concurrency issues from the developer, and take advantage of multi-core hardware. Fortress has built-in parallel for loops and reductions, equivalent to the MPI code but without the programmer having to explicitly manage the processes.

In Fortress, the code for wide finder should be no more complicated than the ruby code; it's up to the implementation to parallelise the loop. Unfortunately, the current Fortress implementation is an interpreter on top of the JVM, and isn't speed-optimised - it takes a couple of seconds to parse the scripts, and then interprets them rather than generating byte-codes.

Actor model languages such as Erlang, Alef, and Scala are used where the problem is best expressed in terms of concurrent processes. Their implementations are designed to allow many, many concurrent processes on a finite number of hardware cores - they reduce the requirements for locking, have strategies to mitigate blocking operations stalling other actors' execution, and solve many of the issues that using OS level threads for concurrent, communicating, mobile, robust processes exposes. I've written actor based code professionally where the problem fits that model.

The wide finder problem has no inter-process communication requirement until the workers terminate; it doesn't require more processes than there are CPUs; it doesn't write anything, so doesn't suffer from multiple writer threads and locking issues; and it doesn't require much execution to continue if one worker blocks - the blocking will be due to IO, and it can't do anything anyway.

Erlang and Scala are optimised for the opposite problem to the wide finder problem - concurrency within a core rather than parallelism across cores. A better language choice would be one such as Fortress, which gives parallelism across cores without having to express the problem in anything other than the operations on the data.

Conclusion

This problem shows up one of the reasons I'm working on compression for the quad-store - any data mining using spinning disks is IO bound, and compression is one way to shift work load off the IO pipe and onto the CPU. It's also why I spent a few tenners on a second fast disk for each of the Netras in the cluster, rather than a couple of thousand on a new multi-core box. I can't afford 30 dollars per gigabyte for a solid state wide-drive.

It's also an interesting comparison of compiled and dynamic languages - running ruby is CPU heavy, as heavy as running the compiled code on the 10 year old hardware in my attic, but is allegedly easier to program in. I think I'd prefer it to Perl if I were writing text extraction and reporting scripts, but I hate nearly all programming languages anyway once I have to use them. Some are more useful for certain problem classes.

Some of the discussion has been about approaches towards splitting the log file into chunks, and splitting it into lines.

None of the hits in the sample file require you to split it into lines - the pattern doesn't match anything it shouldn't if you ignore line breaks. To split the file into lines you need to inspect every character; to match the pattern you may only need to inspect one character in 18 (if a character isn't in "GET /ongoing/When/" you can skip forward 18 characters and check whether the character there is in the pattern - the bad-character skip behind Boyer-Moore style searching); if the pattern matching can be done better than the line scanning, that should give a boost on the CPU limited machine. It doesn't give a boost on the Netra, and I'm not sure why - possibly because the jumpy skip loop makes branch prediction worse.
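That skip loop is easy to sketch - this is Boyer-Moore-Horspool written for illustration, not the tbray2.cpp code:

#include <cstddef>
#include <cstring>

static const char pat[] = "GET /ongoing/When/";
static const size_t PLEN = sizeof(pat) - 1; // 18 bytes

size_t count_with_skips(const char* data, size_t len) {
    // Shift table: how far the window may jump when its last byte is c.
    // Bytes not in the pattern allow a full 18-byte skip.
    size_t skip[256];
    for (size_t c = 0; c < 256; ++c)
        skip[c] = PLEN;
    for (size_t i = 0; i + 1 < PLEN; ++i)
        skip[static_cast<unsigned char>(pat[i])] = PLEN - 1 - i;

    size_t matches = 0, pos = 0;
    while (pos + PLEN <= len) {
        if (memcmp(data + pos, pat, PLEN) == 0) {
            ++matches;
            pos += PLEN;
        } else {
            pos += skip[static_cast<unsigned char>(data[pos + PLEN - 1])];
        }
    }
    return matches;
}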

All the code posted has been exact solutions. If you split a billion line file into 100 equal chunks for 100 cores to process, as you might using the built-in MPI Scatter function, you'll lose maybe 11 hits that get cut in half (as only one in nine requests matches the pattern), and you probably won't care - the most popular pages will have thousands of hits, so a dozen here or there won't be missed. Similarly, if you only want 90% confidence for the most popular hits, you may only need to sample some of the log file. Whether an approximate solution is good enough is the sort of question you need to ask a customer before optimising for the exact case; it may well simplify the solution.


Pete
