2009-06-19

Language-level Immutability and Shared Memory Concurrency

I've been a fan of Bartosz Milewski’s Programming Café ( or possibly Bartosz Milewski’s Programming Caff; I'm not sure if lack of an accent implies greasy chips and tea outside of the UK ). He's exploring how the leverage of type checking can be applied to concurrency problems within shared-memory systems. In a different camp, there's the Erlang crowd, typified by Steve Vinoski, which says that language-level immutability and message passing is the way to go. A third perspective, probably closer to what I would like to see, is CSP-based approaches such as Limbo and Stackless Python, which are shared-nothing and use channels to pass messages by value. The channels-and-pass-by-value approach I'm experimenting with in kin is close to the Limbo model, except that channels are objects which support different messages. Apart from Bartosz Milewski's D extensions, these techniques require that the language integrate with a scheduler, usually a green-threads system rather than the OS scheduler.
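
To make the channel model concrete, here's a minimal C++ sketch (C++11-style; the names are mine, and kin's channels additionally dispatch on the message type, which this omits). Messages go in and come out by value, so sender and receiver never share mutable state:

#include <condition_variable>
#include <mutex>
#include <queue>

// A CSP-style channel: messages are passed by value, so the sender and
// receiver never share mutable state. Illustrative sketch only.
template <typename T>
class channel {
public:
    void send(T message) {                  // the value is copied/moved in
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(message));
        ready_.notify_one();
    }
    T receive() {                           // and copied/moved out
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        T message = std::move(queue_.front());
        queue_.pop();
        return message;
    }
private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<T> queue_;
};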

One thing which bothers me about Erlang and pure-functional languages in general is the assumption that language level immutability has some bearing on concurrency.

It's fairly obvious that any message shared by machines on a network is unlikely to include memory addresses in one of the processes, other than possibly as opaque values. In other words, if it's on the net, it's pass by value. Similarly, most inter-process communication is pass by value, and the sharing of messages containing values and names of channels is well understood using CSP and pi-calculus. The net is unconcerned with the paradigm of the languages which implement the processes communicating by message passing; it only cares that the messages are values and names. It doesn't follow at all that the implementation language for a process which operates on messages needs to present immutable variables to the programmer.

The other area where functional languages have been touted is problems with shared-memory concurrency within a process.

A trivial counterexample is given by lazy evaluation - in simple implementations, expressions are passed around as thunks, consisting of a place to store the value of the expression and a function to evaluate it. The expression is evaluated lazily by calling the function the 'first' time the value is needed. Being a pure function, it doesn't matter as far as the value is concerned whether the function is called multiple times on different cores, but if the value returned is a non-primitive there are still write-order issues in sharing the value. So you end up with barrier-ordering problems similar to those in non-pure languages, and barriers cost performance.
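
To make the write-order issue concrete, here's a minimal sketch of such a thunk (my naming, using C++11 atomics to stand in for whatever barriers a runtime would emit). Even with a pure eval, publishing a non-primitive result needs release/acquire ordering:

#include <atomic>

struct big_value { int fields[16]; };         // a non-primitive result

struct thunk {
    std::atomic<big_value*> cell { nullptr }; // where the value is stored
    big_value* (*eval)();                     // pure function to compute it

    big_value* force() {
        // acquire: if another core published the value, also see the
        // writes it made to the fields the pointer refers to
        big_value* v = cell.load(std::memory_order_acquire);
        if (v) return v;
        v = eval();   // pure, so evaluating twice gives the same value...
        big_value* expected = nullptr;
        // ...but publishing still needs release ordering, or another core
        // can see the pointer before the pointed-to fields are written
        if (!cell.compare_exchange_strong(expected, v,
                                          std::memory_order_release,
                                          std::memory_order_acquire)) {
            delete v;          // lost the race; use the published copy
            v = expected;
        }
        return v;
    }
};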

For strict functional languages, you still need the barriers, and other synchronisation points, if you are going to communicate values between cores. If you are passing messages by value, then it is necessary that the sender or receiver of a message cannot change the value of the message. This isn't necessarily the same as language level immutability - in a shared memory space, values sent on a channel should not change, but that can be achieved by copy-on-send or copy-on-write approaches as well as by general immutability. Having a shared memory space between threads requires synchronisation with the garbage collector - otherwise memory can be reclaimed and rewritten, breaking the immutability. None of this is particularly surprising, and the implementation of any SMP shared-memory language with automatic memory allocation will have to deal with it.
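
For illustration, copy-on-write is only a few lines (a sketch of the idea, not any particular language's mechanism; note the caveat in the comments, which is exactly the synchronisation cost being discussed):

#include <memory>

// Copy-on-write value wrapper: readers share one copy; a writer clones
// the storage only if it is shared. Sketch only - checking use_count()
// is not a safe test if other threads may be copying this wrapper
// concurrently, so a real implementation needs more synchronisation.
template <typename T>
class cow {
public:
    explicit cow(T value) : data_(std::make_shared<T>(std::move(value))) {}
    const T& read() const { return *data_; }
    T& write() {
        if (data_.use_count() > 1)            // shared: clone before writing
            data_ = std::make_shared<T>(*data_);
        return *data_;
    }
private:
    std::shared_ptr<T> data_;
};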

The presence of immutability at the language level allows the optimisation that a message can be passed as an immutable reference rather than by value. This introduces the pessimisation that any garbage collection of the arena that the messages' storage is allocated from has to be coordinated between all threads. If the optimisation of mostly-concurrent memory management is made, where each thread has its own arena, memory for messages is either allocated from a special shared arena when constructed, requiring some synchronisation, or messages are copied when sent.

It's fairly obvious that if you are allocating from a shared arena, then you're having to serialise some of the requests, so are losing concurrency ( as well as having to do escape analysis so you know which objects constitute the message and so need to be allocated from the shared arena ), and that if you're copying messages into a shared arena when sent then you have lost the pass-by-immutable-reference optimisation. Either way, I've yet to see a truly convincing case that immutability in itself actually gives any performance gain for concurrent code and typical message sizes ( if you have very large messages which are not referenced by the sender, then any copying scheme will be inefficient, but that seems to be a case for escape analysis rather than immutability; copy-on-send would be at a disadvantage there, but Erlang style encourages small messages anyway ).


So the problems in terms of performance are acquiring a lock on the memory bus, which takes around 40 cycles for a lock cmpxchg; barriers to order access; and copying. Languages with channels as first-class constructs hide the details of locking and barriers, and can trade off various copying or shared-memory techniques.

I've not seen anything in Erlang that would suggest its shared-nothing messaging is less prone to programmer error than Limbo's channels. There are other reasons to use pure-functional languages, in terms of proving theorems about the program, and below the language level some optimisations are facilitated and some are harder to perform. But there seems to be confusion about which bits of a particular combination of facilities in a language actually supply the gains, and my opinion is that message passing combining values and channel names ( however implemented or enforced ) is both safe, and open to analysis at the systems level by pi-calculus.


2007-11-01

One more wide-finder

Since previous efforts didn't print the results summary, they weren't really complete, so here are two more variants using a vector (tbrayA.cpp) and a priority queue (tbrayB.cpp) to get the top 10 entries.

There's a pthread parallel version of the B variant (tbrayB_parallel.cpp). The main thread reads chunks, scans back to a line break, and hands off the chunk to a pool of worker objects, which run the matching in their own thread. Once all the data is read, the workers' counts are summed and the top 10 counts found, then the results printed as in the single threaded version. Each worker's data buffer is written only by the main thread, for the most part in the call to the OS' read function; this helps reduce the amount of copying of data - though whether the T5120 would have to copy anyway I don't know (some CPUs share cache between cores, others don't, those that don't may have to copy the buffer from one cache to the other when the matching thread starts processing).
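
Schematically, the chunk handoff looks something like this (a sketch of the pattern with hypothetical names, not the code from the archive):

#include <cstring>
#include <unistd.h>

// Fill the next worker's buffer with whole lines: append a read to any
// partial line carried from the previous chunk, then scan back to the
// last newline. The bytes after it are copied into the following
// worker's buffer rather than re-read. End-of-file flush omitted.
size_t fill_chunk(int fd, char* buf, size_t cap, size_t carried,
                  char* next_buf, size_t& next_carried) {
    ssize_t n = read(fd, buf + carried, cap - carried);
    size_t end = carried + (n > 0 ? size_t(n) : 0);
    size_t last_nl = end;
    while (last_nl > 0 && buf[last_nl - 1] != '\n') --last_nl;
    next_carried = end - last_nl;
    memcpy(next_buf, buf + last_nl, next_carried);  // partial last line
    return last_nl;     // the worker scans buf[0 .. last_nl)
}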

The code hasn't been built on anything other than linux this time, so I can't guarantee it will run on Solaris. There's also a deadlock on the scheduler if you only use 2 workers (you need at least two so the buffers are only being written or read by one thread at a time), which I have to track down.

Obviously, the multi-threaded version requires quite a lot of infrastructure - it's twice the size of the single-threaded version. It also runs significantly slower than the single-threaded version:

fortinbras:$ time bin/tbrayB_parallel datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real 0m1.358s
user 0m0.372s
sys 0m0.232s
fortinbras:$ time bin/tbrayB datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real 0m0.531s
user 0m0.232s
sys 0m0.156s
Commenting out the call to process_chunk(), so it just does the fork and join thread communications:
fortinbras:$ time ./widefinder datasets/hundred-o10k.ap 3

real 0m0.288s
user 0m0.056s
sys 0m0.220s
So there is some additional cost in the parallelism other than just the inter-thread messaging (372 > 232 + 56).

There's an archive of the code here; unpack it, run make and call ./widefinder with the path of the logfile and the number of worker threads. The number of worker threads is 3 minimum. The other examples compile to the ./bin directory from make tbray.

I'm still not overly convinced that it will be CPU bound if it has a small, slow, miserable disk, unless you are using a language whose virtual machine is excessively baggy; I've been careful not to use any optimisation which is specific to the data, rather than what a compiler for a higher-level language would know, given suitable annotations for parallelism.

The latest erlang effort from Pichi is twice as fast as the original ruby; ruby takes 2.580s CPU time on my machine, so the single-threaded C++ version is 11 times faster than ruby on one core, the parallel one 7 times. So once you have half-a-dozen cores the erlang should beat the single-threaded C++ code. Though I'd like to see whether the disk on the T5120 means it is CPU limited or IO limited with the faster code, and I'm unconvinced that the T5120 is what the future will look like - I'd expect more cache and flash on chip, rather than more cores, and a multi-node rather than a multi-core future.


Pete


2007-10-05

A few more thoughts on concurrency and high-level languages

A little while ago, I started playing with a pattern-matching and relational language for meta-programming called 'kin' (it's got relations, and does patterns, so kin/kinship).

Being a machine elf rather than a language lawyer, I never could come up with a decent user syntax for it. I've toyed with making it a lisp (either L1 like scheme or L2 like common lisp; I wrote something that used lisp s-exprs to manipulate JVM class files, but never got further, and then Java 1.5 came out and increased the verbosity of the type system without increasing its power significantly, and I moved on), and with making it a prolog derivative (I rewrote the core of a Java WAM engine so it ran a few times faster, but never got anywhere much with that either, apart from looking at it and thinking, 'yep, it's faster now, so what do I do with prolog?').

Then I thought about a faster JavaScript engine, but Tamarin's handling that well enough for most use-cases, and I want better type inference rather than having to tell the compiler everything. Tamarin also makes the mistake of relying on programmer-specified type to give performance; usually the type can be inferred if you have the whole system available, and even if not, the Self and StrongTalk VMs show you don't need explicit type to help the compiler. Type systems should help programmers, not compilers. But the JavaScript story doesn't really matter - the XULRunner front end I spend half my time on at work spends 90% of its time in Mozilla's SVG rendering code rather than my JavaScript anyway.

I write my best software when there are users asking me to, and when it's also a problem which interests me; without users I get bored too quickly, without interest I don't give more than the users expect.

I've spent a few hours poking around Rubinius code; they appear to be making a ruby interpreter very like lisp interpreters were a couple of decades ago. I don't program ruby; I do heavy math backends and light front ends (with occasional forays into OpenGL), and never touch RDBMS, so I use C++, Fortran and Java backends; C++, Java, XSLT middleware; and C++, Java, JavaScript front ends (depending what level of 3D visuals they require; I haven't had cause to look at canvas3D yet, so the OpenGL work has always been in C or Java, though given the current Gecko SVG performance I've gone off basing a front end on Mozilla code) - though I'd quite happily never use Java for a UI again (last time I got RSI from all those anonymous inner classes). I've been known to use lisp to prototype math code before coding it up in C++ for speed (though that's mainly been due to working for BAE at the time, who take 6 months to buy software for a 3 month project, or due to other business constraints on the implementation language, and isn't a reflection on professionally produced lisp compilers, which are fast enough not to require porting to C++).

An old housemate and good friend of mine did his PhD in an embedded Erlang system; that's as close as I came to it before a year or so ago, when I started using XMPP at work and tried installing ejabberd. It didn't work, and I gave up after a few tries, so I tested our server's federation code against Wildfire instead. That's nothing to do with Erlang; ejabberd appears to scale well in independent tests, and I suspect the Windows installer wasn't a priority (our customers require Windows). I've read comments that there are smart people who have written good software in Erlang. This is true. Smart people also write good software in BASIC or COBOL or Ada; that's because smart people write good software. Writing a language that smart people can write good software in is easy; writing a language that average people write non-crap software in is far harder. In my opinion, for this sort of problem, ruby seems such a language (well, it seems much better than Perl or PHP, and a bit better than python), but the implementation is pants. Conversely, ejabberd scales to many more users than Twitter can, and I suspect is a cleaner expression of the problem, because IM's a concurrency problem and erlang far exceeds ruby's capabilities at concurrency.

Here's some good advice on how a high-level language should express a problem, from something I started reading earlier:
"The structure of the program should exactly follow the structure of the problem. Each real world concurrent activity should be mapped onto exactly one concurrent process in our programming language. If there is a 1:1 mapping of the problem onto the program we say that the program is isomorphic to the problem. It is extremely important that the mapping is exactly 1:1." -- Joe Armstrong (PhD thesis, Dec 2003).

You have a log file. You scan each line (in any order) for a hit and accumulate the number of times each hit occurs. Then you list the hits in decreasing order of the accumulated hit counts.

I don't see any concurrency intrinsic in the problem. The only reason we're even thinking about concurrency is that we want the implementation of the solution to be parallelised over many cores for performance reasons, and (as SIMD isn't sufficiently general to scale much beyond 4-times-wider paths except in GPUs) many concurrent cores are the only thing we can think of that will give computers more throughput. For this problem, concurrency is as much an accident as memory management. Languages which handle memory management automatically have already eclipsed those which require manual memory management in most areas where it isn't an essential part of the problem. As concurrency is not an essential part of this problem, no language where the concurrency is manifest in the implementation of the solution will be following the structure of the problem. Admittedly, I grew up on BASIC and 6502 assembler in my teens, and if you teach kids pi-calculus in high school the next generation might see the problem differently.

Neo: What are you trying to tell me? That I can dodge bullets?
Morpheus: No, Neo. I'm trying to tell you that when you're ready, you won't have to.


With Erlang you can express a problem in terms of concurrency.
It may be a long time before ruby's ready.


Pete


2007-10-02

Some fingers in the air.

I got Steve Vinoski's 2007/09/29 erlang code, installed hipe and the bfile module, and it ran on the laptop:
fortinbras:$ cat ../datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time erl -smp -noshell -run tbray5 main 512 ../datasets/hundred-o10k.ap
110100 matches found

user 1m23.649s
real 1m33.683s
sys 0m1.620s
I'm not sure, looking at either my code or Steve's, where the 1101st match per repetition comes from - there are #ifdefs in my line-splitting code to print the lines, and if you diff that output with the input they're the same; and if it were related to the last line in the sample it would give a difference of 1, not 100, for the hundred-times repeated file. But something's inconsistent between the two somewhere, and also with the original ruby code, which gives 954 matches for o10k.ap, or 1097 for the regex %r{GET /ongoing/When/([^ .]+) }.

From the gnome-panel dials, the erlang isn't IO bound in total on this machine - for the first few seconds it runs at max IO and 90% CPU, then for the remainder at 99% CPU and zero IO, so it's reading it all into memory then spawning processes to scan it. I'll leave it running overnight on the 10 million line file to see how it fares when it can't fit the file into memory, though I doubt that has much of a bearing on what would happen on the T2 box, as that has plenty of RAM.

[Update: It took 52 minutes 26, and was doing disk IO throughout, but that could well have been paging rather than the IO it has to. Nothing to conclude, other than that it doesn't scale linearly - 10 times bigger file takes 34 times longer.]

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real 0m8.622s
user 0m0.284s
sys 0m0.248s

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time ~/projects/quad-store/bin/read_bench datasets/hundred-o10k.ap > /dev/null
real 0m8.754s
user 0m0.000s
sys 0m0.180s
So the 64 bit-wide matcher on the AMD64 laptop is very IO bound; the difference between the total time of just doing the IO and of scanning the lines as well is negligible.

At 2200 MHz, it's scanning 201 MB data in 0.284s, which is 2200*0.284/201 = 3 cycles per byte processed.

Sun's Thumper gives 2GB/s transfer into memory; the T2 runs at 1.4 GHz and assuming the same IO rate, we have 0.7 clock cycles per byte delivered to play with. 64 hardware threads would give about 44.

Running read_bench on one of the 'ancient creaking' Netras, gives:
tercel-2:$ time bin/read_bench datasets/thousand-o10k.ap 
real 1m15.449s
user 0m0.362s
sys 0m27.998s
That's IO bound, 2,095 MB/75.5s = 27 MB/s. The laptop gets a similar transfer rate figure on the same test.

The T2 is 200 times the CPU, and the Thumper 80 times the disk transfer rate; I'll assume that the T2 system's IO is comparable to the Thumper.

tercel-2:$ time bin/read_bench datasets/hundred-o10k.ap 
real 0m6.159s
user 0m0.058s
sys 0m2.834s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap 
matches: 110000
real 0m7.617s
user 0m4.054s
sys 0m2.504s
At 440 MHz, it's scanning 201 MB in 4s, which is 440*4/201 = 8 cycles per byte processed; it's IO dominated but not IO bound. Optimising the matching code would at best give a 20% improvement.

It's also using about 5 cycles per byte system time for the IO.

Since the Netra runs Solaris 10 and is a Sparc variant, and in the absence of better information, I'm working on the assumption that the 1.4GHz T2 would be close to it in terms of clock cycles per work unit, so to process the log files it would also need around 13 cycles per byte delivered. So either we'd need 13/0.7 = 19 threads, assuming each thread provisions another 0.7 cycles worth of work, or run 64 threads and can afford code that's three times slower than the single threaded variant due to concurrency management. If it's closer to the fat-piped AMD64, then only 6 cycles per byte and 8 threads would do.

Guessing even more widely, with a large dataset and large-grain concurrency, the C++ matching code should be at least three times faster than the IO - using 8 to 20 of the 64 threads. The erlang interpreter seems to be around 10 times slower, so would require 80 to 200 threads to balance the transfer rate [correction: only the CPU time is relevant here, so it's 83.649+1.620s vs 0.284+0.248s, which is 160 times slower, so would require 1280 threads to get the same throughput]. If the IO rate is less than 800MB/s and the erlang scales across all 64 cores, then the faster matching in the C++ won't give any advantage. I've no idea how good erlang is across cores - the only times I've seen it praised are for many (conceptual) processes on small numbers of cores - it's a concurrent language rather than a parallel one. Inter-core communication doesn't come free, and any message still needs either lock-based synchronisation or CAS/HTM retries; as it doesn't matter which actor works on the data, in a C++ implementation you'd use a fail-fast lock and try a different actor rather than blocking, so going lock-free would cost more than it would gain. AFAIK you don't have that option in erlang, and are stuck with the given queuing and scheduling mechanisms, though you probably could subvert them.

In computer science there are only three numbers - zero, one and many. If my estimates are anything to go on (and I wouldn't bet more than half a penny on any of them), on the target system you need many threads to solve this problem most efficiently.

Looking at the code for the bfile library, it shouldn't be too hard to move the 64-bit-wide string match into an erlang library, which seems more fun than fiddling with MPI to get the required thread count; but the required thread count is far lower than the number of processes in Steve's code, so MPI or pthreads should be a better model than large-scale concurrency. But erlang may be to concurrency as lisp is to dynamism - I like lisp, but I haven't found any interesting problems to which I could apply it commercially under the constraints of the business, and even though I end up greenspunning in most systems I write, the subset of lisp-like features which gets implemented to give dynamic behaviour is tailored to the problem, and the rest of the code can be performance-optimised more easily.

On the other hand, my fiancée comes back into the country on Saturday and I'm supposed to have finished sending out our wedding invitations by then, and all this isn't really helping that to happen.


Pete


2007-09-29

Wide finder, parallelism and languages

Tim Bray, who knows about web search and was part of the XML specification process, is experimenting with exploiting parallelism in his wide finder project.

I'm interested in parallelism (running similar processes on multiple hardware) and concurrency (having multiple collaborating threads of control in a system), as many interesting problems are intrinsically concurrent, and as hardware isn't getting much faster but is getting wider, so will provide more parallel processing bandwidth.

The original problem is one of extracting the N most popular pages (not images) fetched from Tim's ongoing blog from the server's log files. These files are a few gigabytes in size.

Tim's ruby code runs in 13.5 seconds on a 1.67 GHz PowerBook, which is a twin-core G4. My use of my twin-core 1.42 GHz G4 desktop petered out last year when I got a single-core AMD64 laptop, as most of the time the laptop was more responsive. The laptop runs Ubuntu 7.04 64 bit.

Measuring the problem:

As it's not a lot of ruby code, it's easy to write a little C++ code to find out where the performance issues are.

First off I was playing with simple string matching (tbray1.cpp) vs KMP (tbray2.cpp), but as it's a requirement that the code should be as simple as possible, and KMP doesn't actually help modern CPUs as it's a jumpy algorithm, the third and simplest approach, just calling strncmp, works as well as far as using time to measure can determine (tbray3.cpp).
fortinbras:$ time bin/tbray3 datasets/original/thousand-o10k.ap
matches: 1100000

real 2m28.201s
user 0m10.193s
sys 0m4.752s
At that size of dataset, the laptop's CPU didn't get above 800MHz (it has 800MHz, 1600MHz, and 2200MHz speeds and stays at 800MHz). Smaller datasets which are already in the disk read-ahead buffer do cause its CPU to shift up; hence some of Steve Vinoski's figures for erlang times are not indicative of the problem - you don't run the same set of data through the system again and again, so you have to either load something else into the file cache to clear it (cat something-big > /dev/null), or use full-size datasets.

Counting non-comment lines which aren't a single { or }:
fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray3.cpp | wc -l
48
So a simple, sequential C++ implementation using a memory mapped file is 48 lines of code (my bracing formatting habits shouldn't count against the language), and isn't too slow.
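
The core of those 48 lines is a loop of this shape (a paraphrase of the approach with error handling elided, not tbray3.cpp verbatim):

#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Count occurrences of the target prefix by memory-mapping the file and
// calling strncmp at each candidate position. Sketch of the approach.
long count_matches(const char* path) {
    static const char pat[] = "GET /ongoing/When/";
    const size_t plen = sizeof(pat) - 1;
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const char* data = static_cast<const char*>(
        mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0));
    long matches = 0;
    for (size_t i = 0; i + plen <= size_t(st.st_size); ++i)
        if (data[i] == 'G' && strncmp(data + i, pat, plen) == 0)
            ++matches;
    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
    return matches;
}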

Running the ruby for reference, which does a bit more in terms of counting occurrences:
fortinbras:$ time ruby src/benchmarks/finder.rb datasets/original/thousand-o10k.ap
...
real 1m20.744s
user 0m24.710s
sys 0m4.420s
Presumably Tim's disk-speed-to-CPU-speed ratio is higher; on my laptop ruby is IO bound, and the CPU not maxed, though it does shift to a faster speed. Fortinbras processes the same size dataset that Tim was using in 8.6 seconds.

But the ruby is doing IO much faster than my simple memory mapped code, so changing to use block IO rather than memory mapped (at the cost of a 25% longer program):
fortinbras:$ time bin/tbray4 datasets/original/thousand-o10k.ap
matches: 1100000

real 1m6.780s
user 0m9.593s
sys 0m2.464s

fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray4.cpp | wc -l
60
Again, the C++ implementation doesn't stress the CPU enough to get out of first gear, and it's about 14 seconds faster than the ruby, due entirely to less user CPU. I'd guess that the underlying C code the ruby interpreter calls for its line reading is similar; I don't know whether ruby strings are internally UTF-16 rather than UTF-8, but if so, that alone could account for the extra CPU cost. I'm actually quite impressed that ruby isn't glacially slow, but I guess most of the work is between the lines of the program.

The C++ code is also fewer lines than Steve Vinoski's 84-line erlang example, though the C++ doesn't show use of multi-core parallel processing. Parallel code in C++ can take more work. Concurrent code in C++ definitely takes more work than in erlang.

Given an infinitely capable CPU and the same disk, ruby's 81 seconds or C++'s 67 will reduce to 55. If the CPU is using 12 seconds, to get anything from parallel CPUs, you'd need to get the data from disk in less than 12 seconds, which is probably not far off what current flash is capable of. I don't believe there's a software solution to make the IO much faster on the laptop's hardware.

Running on a Sun Netra T1 105, with a 440MHz Sparc CPU:
tercel-2:~/projects/tbray$ time bin/tbray4 datasets/thousand-o10k.ap
matches: 1100000

real 1m23.930s
user 0m57.070s
sys 0m24.636s
Much more CPU time, but the total real time is in the same ball park - the ten year old server has a fast wide SCSI disk but a slower CPU, so is close to being CPU bound.

If you have multiple cores and multiple disks to read from, you can launch multiple batch scripts to process different days' logs in parallel, like make -j, or use MPI Scatter/Reduce.

MPI version

There's a simple extension of tbrayN.cpp with MPI at tbray5.cpp. I'm an MPI novice - most of the threading problems I've had to solve require concurrency rather than parallel processing. MPI defaults to use as many processes as there are cores, but you can force it to use a specific number of processes. On a single core machine, it slows from 1.2 seconds on a million line file with a single process to 14 seconds with 16 processes, to 54 seconds with 64 processes. Trying to launch 1000 processes causes the machine to spend forever context switching. Running lots of processes without context switching is what erlang is good at, but for optimal CPU throughput you only want as many processes as you have cores.

The MPI version loses in terms of conciseness, and has two separate code paths - the process which reads the file into chunks, and those which scan the chunks. It's not as nice to read as the erlang. It comes in at 92 lines of code, 6 of which are error reporting that the erlang example lacks (presumably errors are passed to the read-eval-print loop to handle), so it's within a couple of lines of the erlang's length. Using pthreads would probably require more lines, as it lacks suitable message-passing primitives.
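
For the shape of it, here's a sketch along MPI lines - note it uses a simpler decomposition than tbray5.cpp, with each rank seeking to its own slice rather than a reader process dealing chunks to scanners, and count_matches here is an assumed buffer-scanning variant of the earlier sketch:

#include <mpi.h>
#include <cstdio>
#include <vector>

long count_matches(const char* data, long len);  // assumed buffer scanner

// Each rank reads and scans an equal slice of the file; MPI_Reduce sums
// the per-rank counts on rank 0. Hits spanning slice boundaries are
// lost, which the discussion below argues is often tolerable.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, procs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);
    std::FILE* f = std::fopen(argv[1], "rb");
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    long share = size / procs;
    long begin = rank * share;
    long len = (rank == procs - 1) ? size - begin : share;
    std::vector<char> buf(len);
    std::fseek(f, begin, SEEK_SET);
    std::fread(buf.data(), 1, len, f);
    std::fclose(f);
    long local = count_matches(buf.data(), len);
    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("matches: %ld\n", total);
    MPI_Finalize();
    return 0;
}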

Parallel languages, concurrent languages

IO aside, the actual processing is a problem in parallelism - the task can be split into work packets which can be processed independently, with a reduction at the end to a single result set. Only about one in nine lines contributes to the counts, so the dataset for the reduction is significantly smaller than the input, and parallelising the matching, and maybe one pass of sorting and counting, should allow a speed-up on a multi-core machine.

Languages such as Sun's Fortress have intrinsically parallel constructs and atomic blocks to hide concurrency issues from the developer, and take advantage of multi-core hardware. Fortress has built-in parallel for and reduction, equivalent to the MPI code but without the programmer having to explicitly manage the processes.

In Fortress, the code for wide finder should be no more complicated than the ruby code; it's up to the implementation to parallelise the loop. Unfortunately, the current Fortress implementation is an interpreter on top of the JVM, and isn't speed optimised - it takes a couple of seconds to parse the scripts, and then interprets them rather than generating byte-codes.

Actor-model languages such as Erlang, Alef, and Scala are used where the problem is best expressed in terms of concurrent processes. Their implementations are designed to allow many, many concurrent processes on a finite number of hardware cores - they reduce requirements for locking, have strategies to mitigate blocking operations stalling other actors' execution, and solve many of the issues that using OS-level threads for concurrent, communicating, mobile, robust processes exposes. I've written actor-based code professionally where the problem fits that model.

The wide finder problem has no inter-process communication requirement until the workers terminate, it doesn't require more processes than there are CPUs, it doesn't write anything so doesn't suffer from multiple writer threads and locking issues, and it doesn't require much execution to continue if one worker blocks - the blocking will be due to IO, and it can't do anything anyway.

Erlang and Scala are optimised for the opposite problem to the wide finder problem - concurrency within a core rather than parallelism across cores. A better language choice would be one such as Fortress, which gives parallelism across cores without having to express the problem in anything other than terms of the operations on the data.

Conclusion

This problem shows up one of the reasons I'm working on compression for the quad-store - any data mining using spinning disks is IO bound; using compression is one way to shift workload off the IO pipe and onto the CPU. It's also why I spent a few tenners on a second fast disk for each of the Netras in the cluster, rather than a couple of thousand on a new multi-core box. I can't afford 30 dollars per gigabyte for a solid state wide-drive.

It's also an interesting comparison of compiled and dynamic languages - running ruby is CPU heavy, as heavy as running on the 10 year old hardware in my attic, but is allegedly easier to program in. I think I'd prefer it to Perl if I was to write text extraction and reporting scripts, but hate nearly all programming languages anyway once I have to use them. Some are more useful for certain problem classes.

Some of the discussion has been about approaches towards splitting the log file into chunks, and splitting it into lines.

None of the hits in the sample file require you to split it into lines - the pattern doesn't match anything it shouldn't if you ignore line breaks in the code. To split the file into lines you need to inspect every character; to match the pattern you may only need to inspect one in 18 (if a character isn't in "GET /ongoing/When/" you can skip the full 18 characters and look at whether the next candidate is in the pattern - Boyer-Moore style skipping, rather than KMP, which still reads every character); if the pattern matching can be done better than the line scanning, then that should give a boost on the CPU-limited machine. It doesn't give a boost on the Netra, and I'm not sure why; possibly because the jumpy access makes branch prediction worse.
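
Concretely, the skip idea looks like this (a Horspool-style sketch, not tbray2.cpp):

#include <cstring>

// Horspool-style skipping: shift by how far the byte under the last
// pattern position could move and still line up with the pattern; bytes
// not in the pattern at all allow a jump of the full 18 characters.
long count_skipping(const char* text, long n) {
    static const char pat[] = "GET /ongoing/When/";
    const long m = sizeof(pat) - 1;                 // 18
    long shift[256];
    for (int c = 0; c < 256; ++c) shift[c] = m;
    for (long j = 0; j < m - 1; ++j)
        shift[(unsigned char)pat[j]] = m - 1 - j;
    long matches = 0;
    for (long i = 0; i + m <= n; i += shift[(unsigned char)text[i + m - 1]])
        if (std::memcmp(text + i, pat, m) == 0)
            ++matches;
    return matches;
}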

All the code posted has been exact solutions. If you split a billion-line file into 100 equal chunks for 100 cores to process, as you might using the built-in MPI Scatter function, you'll lose maybe 11 hits that get cut in half (as only one in nine requests matches the pattern), and you probably won't care - the most popular pages will have thousands of hits, so a dozen here or there won't be missed. Similarly, if you only want 90% confidence for the most popular hits, you may only need to sample some of the log file. Whether an approximate solution is good enough is the sort of question you need to ask a customer before optimising for the exact case; it may well simplify the solution.


Pete


2007-04-01

Language of the year

For me, this year's language is assembly. I did a little 6502 assembly as a teen - wrote a disassembler for the BBC B's OS and mucked around with its interrupts in high school - and some 6800 during university, and did a tiny bit of 386 inline assembly for loop optimisation of the Gouraud-shading parts of my MSc in 1992, so it's not quite a new language, but it's one I haven't used in a few years.

It seems that there is more assembly around these days - some marginally disguised, such as C++ SIMD intrinsics, but quite a bit of the lock-free code I've been seeing is assembly. Perhaps I'm just looking at more parallel and concurrent code. It seems the high-level languages we currently have don't have abstractions which match multi-core architectures, and so we have to write our own low-level libraries to map onto them.

There are also big gains in using GPUs for parallel computation, which is again getting closer to the hardware, though there are C and C++ tool kits for GPUs out there. The last few years I've looked at higher-level languages - Python, Common Lisp, ML, JavaScript. Fortress stands somewhere else - it's a very high-level language, but very careful to have a rich enough type system to allow a strongly optimising compiler for parallel hardware. I still don't believe that type systems have to be static to be optimised, but they do need to be strong.

Targeting these new architectures, the PeakStream platform looks interesting - though it seems to only target workstation GPUs, and may not be a platform (which I take as synonymous with virtual machine) but rather a set of GPU-optimised libraries. As my only home machine is a laptop, I don't have a machine with its target GPU, or one for NVIDIA CUDA. The Sh shader metaprogramming library also looks like the sort of thing that may be useful - it uses C++ operator overloading, much like Boost lambda or RealLib, to generate an AST of the expression, which is then calculated on either a CPU or a GPU backend (RealLib expressions are calculated on the CPU, sometimes with SIMD intrinsics, to the required precision). I'm interested in how Fortress' parallel for might get implemented on a GPU, and whether it's still necessary to detect glitches in gamer rather than workstation class GPUs - traditionally the gamer-class GPUs ran hotter, so weren't guaranteed to produce accurate results all the time, but there's no mention of that in recent generations.



This post is delayed from Thursday as I was off in York, finding a venue for my wedding reception. My lovely lass stayed with me for a fortnight, but now she's back in Glasgow :(.


TME


2007-03-22

Prunes. Thursdays. Fortress. Locks.

I'm going to try and get a bit more regular with my blogging. Like every Thursday.

This week, I've got my hands on an iRex Iliad from work so I can read the 400-odd-page Fortress language specification (1.0 beta) without killing a small tree.

I like Fortress' traits system; it seems to have enough information for the sort of semantic compiler optimisations and automatic parallelism I'm interested in. Although the language doesn't expose these, it's very similar to some of the sorts of thing I was trying to do at BAE with code generation from a maths model, which led me to start playing with kin - though kin is relational rather than functional. The reference interpreter isn't a high-performance system; if I get some time and the kin interpreter gets anywhere, I may use Fortress as an input language, though I do want to do something more with the relational aspect. I like that its definition of an integer is based on rings and fields rather than bit-string operations. My fiancée is in Romania from May to September, so, instead of just pining away, I may actually stop talking and do something on kin rather than a sequence of 400-line micro-benchmarks.

Both for work and for kin I've been trying to get up to speed with these guys' work; the publish-subscribe idiom is used heavily within the virtual machine for the system I'm creating at work, and it seems to spend an inordinate proportion of its time waiting for locks on the properties it's published (I'm still getting to grips with interpreting VTune's output for multi-threaded apps).

kin related, I'm trying to think of a nice way of packing R32 and R64 vectors so you can both look up type information and operate on them using SSE. For objects with a reference to their type, if each type is aligned to a cache line, it gives more state bits in the word - 4 rather than 2 for a 32-bit-aligned pointer - which may be useful. For a primitive field in an object, the field type can be associated with the object's value rather than the field's value, as it can't be aliased, so all code which accesses it has to go through the object's traits first.

The cases I want to optimise, such as Array[/ (R32)^4, (ZZ32,ZZ32) /] (a 2D array of 4D 32-bit vectors, which has a very efficient contiguous representation), I imagine would also have access to the type of the array, and functions would operate primarily on the array. But mapping a magnitude function onto the Array would need to operate on the type of the element, and I was imagining a fully dynamic dispatch (which then solidifies and inlines the commonest branches). Maybe pushing the type onto the function argument stack would work in those cases, rather than making every value carry its type around. In that case, the array member case and the object field or method local variable cases are all the same - the context you got the value from can give the type of the value, which is either the global ref to the type of a primitive, or the reference to the metaclass of a reference type.

That should work for both Fortress- and ECMAScript-style objects constructed from traits. My current model for running ECMAScript is that each time you set a field on an ECMAScript object to a value of a different type, you create a new reference-type trait object to describe it; these anonymous traits are reference counted in addition to any memory management, so may modify rather than copy previous traits if they are only used by the one object. Not original, but amenable to having a common runtime for all the sorts of languages I tend to use - HPC, dynamic UI, and relational logic.
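
As a sketch of the state-bits idea (my encoding for illustration, assuming 16-byte-aligned type descriptors to free 4 bits; not kin's actual layout):

#include <cassert>
#include <cstdint>

// A type reference with state bits packed into the alignment padding:
// descriptors allocated on 16-byte boundaries leave the low 4 bits of
// the pointer always zero, so they can carry tags.
struct alignas(16) type_info_t { /* traits, dispatch tables, ... */ };

class type_ref {
public:
    type_ref(const type_info_t* type, unsigned tag)
        : bits_(reinterpret_cast<uintptr_t>(type) | tag) {
        assert(tag < 16);   // must fit in the alignment padding
    }
    const type_info_t* type() const {
        return reinterpret_cast<const type_info_t*>(bits_ & ~uintptr_t(15));
    }
    unsigned tag() const { return unsigned(bits_ & 15); }
private:
    uintptr_t bits_;
};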


TME
