2008-04-25

Some things I'm thinking about

I've been playing more with my mostly-functional modelling and simulation language (kin); here are some ideas I want to try out and profile, to see if they offer gains:

Carrying type and data as a tuple:


Some implementations of Scheme and JavaScript use a single void*-sized datum for each object: for integers the lowest bit is set and the value is encoded in the remaining bits; for other objects the low bits are zero due to alignment, and the whole datum is a pointer to the object. That doesn't work very well for doubles, and requires that the type is stored in the object's header. Java, C#, and C++ (for objects with a vtable) have type information in the object header, which costs in large arrays. I've been thinking about passing the type and the data around separately, which means you can have more tightly packed arrays for non-vtable types.
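
A minimal sketch of what that representation might look like (the names are illustrative, not kin's actual layout):

#include <stdint.h>
#include <cstddef>

struct type_info_t; // per-type descriptor, defined elsewhere

union datum_t {     // one machine word, no tag bits stolen
    int64_t i;
    double  d;      // doubles fit without boxing
    void*   ptr;    // heap objects still go via pointer
};

struct value_t {    // what gets passed around: type and data as a tuple
    const type_info_t* type;
    datum_t data;
};

// A homogeneous array needs one type for all elements, not a header each:
struct array_t {
    const type_info_t* element_type;
    size_t length;
    datum_t* elements; // tightly packed
};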

One-bit markers for escape detection:


Based on One-bit Reference Counting: a marker in the type field distinguishes 'shared' objects, which you can take a reference to, from 'non-shared' ones, which you must not.
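
Continuing the sketch above, the marker can hide in the alignment bits of the type field (again an illustration, not a committed design):

#include <stdint.h>

struct type_info_t;
const uintptr_t SHARED_BIT = 1; // type descriptors are aligned, so this bit is free

inline bool is_shared (const type_info_t* type) {
    return (uintptr_t)type & SHARED_BIT;
}

inline const type_info_t* mark_shared (const type_info_t* type) {
    return (const type_info_t*)((uintptr_t)type | SHARED_BIT);
}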

Write-once arrays:


As an extension to the 'one-bit reference counting' style: maintain a high-water mark instead of always copying. This is for use in structures such as VLists, so you don't automatically copy on passing, and it allows you to mutate an array in place when it can be proven that all reads of an element happen before writes to it, as in the inner loop of a matrix multiplication:

ax[i] <- ax[i] + row[i] * col[i]
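
A sketch of how the high-water mark might work (the names are mine; in a real implementation the mark would live with the storage rather than with each view of it):

#include <algorithm>
#include <cstddef>

struct wo_array {
    double* data;
    size_t  capacity;
    size_t  high_water; // elements [0, high_water) are frozen: may be shared
};

// 'a' is the caller's view, holding 'length' <= a.high_water elements
wo_array append (const wo_array& a, size_t length, double x) {
    if (length == a.high_water && length < a.capacity) {
        // extending past the mark: nobody has seen this slot, mutate in place
        a.data[length] = x;
        wo_array b = { a.data, a.capacity, length + 1 };
        return b;
    }
    // writing at or below the mark: another view may hold the prefix, so copy
    size_t cap = a.capacity ? a.capacity * 2 : 4;
    wo_array b = { new double[cap], cap, length + 1 };
    std::copy(a.data, a.data + length, b.data);
    b.data[length] = x;
    return b;
}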

Block order arrays for cache sensitive iteration and super compiled functors:


In a simulation, it's not uncommon to have a reduction on some function of a few attributes of an entity:

let total_force = reduce(array_of_entities, lambda(e:entity, ax) => e.mass * e.acceleration + ax, 0)

If this is over a homogeneous array, and mass and acceleration are simple double values, in C this would translate to picking 8 bytes here and there from the memory for the array. Instead, each field can be kept either in 'column order', or in what I'll call block order, where the fields for (cacheline size in bytes) objects are held contiguously. This should both reduce cache misses and allow the use of SIMD instructions to process the data. The obvious disadvantage is that an object in an array no longer has the same layout as a single object on the heap, so to exploit it you need either a super-compiler or a trace-based JIT.
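
A sketch of block order, assuming a 64-byte cacheline and so 64 entities per block (layout and names are mine, for illustration): each field of 64 consecutive entities is one contiguous run, so the reduction streams whole cachelines and the inner loop is trivially vectorisable.

#include <cstddef>

const size_t BLOCK = 64; // cacheline size in bytes

struct entity_block {
    double mass[BLOCK];
    double acceleration[BLOCK];
    // ... each further field as its own BLOCK-wide column ...
};

double total_force (const entity_block* blocks, size_t n_blocks) {
    double ax = 0.0;
    for (size_t b = 0; b < n_blocks; ++b)
        for (size_t i = 0; i < BLOCK; ++i) // contiguous, SIMD-friendly
            ax += blocks[b].mass[i] * blocks[b].acceleration[i];
    return ax;
}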

Effective model for scopes based on lists, symbols and objects:


Trying to build an interpreter in Java, where it's tempting to use maps for everything, I found that the properties of an object and the properties of a lexical scope are much the same (the duality of closures and objects is well known), so I will try to define the binding model for values in kin using only symbols, integers and lists.
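
A sketch of what that binding model might look like - a scope is just a list of (symbol, value) bindings with a link to its parent, and an object is the same shape (the types here are illustrative stand-ins):

#include <vector>
#include <cstddef>

typedef int symbol;   // assume symbols are interned to integers
struct value;         // whatever kin values turn out to be

struct binding { symbol name; value* val; };

struct scope {
    const scope* parent;           // enclosing scope, or prototype
    std::vector<binding> bindings; // could equally be a cons list
};

value* lookup (const scope* s, symbol name) {
    for (; s; s = s->parent)                       // walk outwards
        for (size_t i = 0; i < s->bindings.size(); ++i)
            if (s->bindings[i].name == name)
                return s->bindings[i].val;
    return 0; // unbound
}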

Using CAS for thread safe homogeneous VLists


Similar to the vector implementation in Lock-free Dynamically Resizable Arrays.
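
A sketch of the idea using C++11 atomics (simplified: the paper's algorithm uses a write descriptor to make the pending element-write visible to competing threads, and needs safe memory reclamation, both elided here):

#include <atomic>
#include <cstddef>

struct descriptor {
    size_t size;
    // ... block table, pending write, etc. elided ...
};

std::atomic<descriptor*> current;

void push_back (double x) {
    for (;;) {
        descriptor* old_d = current.load();
        descriptor* new_d = new descriptor(*old_d);
        new_d->size = old_d->size + 1;
        // ... stage the write of x into the slot for index old_d->size ...
        if (current.compare_exchange_weak(old_d, new_d))
            return;   // we won the race; the new descriptor is published
        delete new_d; // we lost: rebuild against the fresh descriptor
    }
}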

Options for compiling Verilog to shader language


Having had an interview last week as a systems modelling engineer with a company who were using C as the modelling language for timing simulations in embedded memory controllers - which is a bit like going for a job as a car mechanic and discovering that you're being quizzed on your wood-working skills - I was thinking about compiling an appropriate modelling language to something which executes efficiently. Though I would have thought their particular problem - timing conflicts - has an analytic solution.

Something after UML


Speaking of Verilog: Verilog came into existence because even standardized schematic diagrams don't carry strong enough semantics and are not amenable to algebraic analysis, and graphical notations don't scale to large systems without hiding information.
Pi-calculus came into existence because Petri nets don't scale to large systems and are not amenable to algebraic analysis.
UML is very much in the standardised schematic mould, lacks formal semantics, and relies on hiding information to scale.

Often the hidden information in UML is very important - what appears to be a good design is not a good design if its simplicity is achieved by hiding most of the attributes and operations of a class, as often happens when a class acquires multiple unrelated responsibilities. The notation allows you to hide details which should indicate that the design needs factoring into smaller units, and in fact encourages such hiding as a mechanism for scaling up to showing many classes. For example, in Altova UModel:
You can customize the display of classes in your diagram to show or hide individual class properties and operations. ... This feature lets you simplify the diagram to focus on the properties and operations relevant to the task at hand.

If the class has features which are not relevant, then refactor it; don't hide them to keep the documentation from looking complicated. The classes of well-factored systems have nothing to hide, but need more relations between the classes, which UML tools don't handle well (last time I checked, none provided basic auto-routing or anything like good layout engines), and they look more complicated even though they are simpler - in the same way a dry stone wall is simpler than a minimalist plastered white apartment, but is visually more complex.

So what might come after UML? What would not militate against refactoring, but still allow a visual overview? What notation might give good analytic properties for software, rather than being a system schematic with loose semantics?

I don't know yet.


2008-04-20

Bjarne Stroustrup on the Evolution of Languages

Link: http://msdn2.microsoft.com/en-us/magazine/cc500572.aspx

A few comments on this -

The illustrations in the interview don't appear related, other than being of an IDE ("I might have a different opinion if a good IDE was universally available" - cue Visual Studio, which implies that VS isn't a good IDE) and of UML illustrating multiple inheritance; the mechanism for dispatch in the paper is not based on multiple inheritance (at least as shown in the diagram).

It's an interesting paper though, and I would like open methods in a language.

I don't think that DSLs should be just extensions to general purpose languages. Mainly because that implies a specific execution model, and if you have a DSL then execution shouldn't be part of the domain model (unless it's something like Verilog, or the input to a program proof tool, in which case execution is a domain concept). DSLs should allow the modelling of a domain, and lose their benefit - abstraction away from execution - if they are tied to a programming language. If you have a model of the physical domain, then you can state 'F = m * a' and not have that imply that the values named by 'm' and 'a' are multiplied and stored in the location named 'F', as binding to C++ semantics would have it. Otherwise you need at least three models - 'F = m * a', 'm = F/a' and 'a = F/m' - to let you represent everything that the one entity in the domain represents. Which would create a mismatch between the language and the domain, eliminating the benefit of a DSL.
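
To make the mismatch concrete, this is what binding the relation to execution semantics forces on you - one procedure per direction of a single domain relation (a trivial sketch):

double force        (double m, double a) { return m * a; }
double mass         (double F, double a) { return F / a; }
double acceleration (double F, double m) { return F / m; }

A DSL worth the name could state the relation once and let the compiler orient it as each use requires.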


I've also been reading 'Programming Language Pragmatics' and working on my model-based language kin a lot, adapting the parser and code model of the modelling language I used for a state model of some concurrent code to support kin. I was generating Java as a first cut to bootstrap a full compiler, but Java's generics aren't rich enough to support kin's parametric types, and without them it's no easier to generate Java than C. So I will go straight to a C back end, and generate an intermediate form which can be interpreted. It's more important for kin to be able to call C libraries such as the Intel Math Kernel Library than Java libraries, and I don't really like the sort of compromises you have to make to fit in with Java's libraries, however nice the JVM's HotSpot and garbage collector are to have.


2008-03-29

More kin musings

Back from finishing a contract in Malvern, and have had a few ideas on kin.

If general allocation is of monomorphic arrays in column order, then a single object is just represented as an array with a stride of 1 and a length of 1, padded as required.

Assumption is that the code will operate on arrays, so compiler should default to vector ops where possible.

Need some basic constructs before starting, including mono and poly VLists, and a base lisp-like language to write the parser in.

The problem is that it's easiest to build an interpreter on a higher-level language, such as Java, but the reasons I'm looking at kin - such as SIMD and cache optimisations, pattern matching and relations - aren't available in Java.

Also, I'm getting more and more convinced that polymorphic references should be (type, pointer) pairs, rather than single pointers to headered objects as in C++ and Java.

The big issue I'm having is that each time I think how to do something, I write it in kin, then can't be bothered to write it in C or NASM.

Starting with the string type, I end up finding there are only a couple of functions which are needed over seq<uint8>, and those are easier to specify in kin than code up in C. No part is performance critical - or rather, no part can be written faster in C than in kin.

So I will try to define a metacircular kin interpreter, and work out where the smallest subset lies.

Also, for my specific interest of event-driven simulations, I'd been thinking about pushing events forward for handling collisions. This requires that updates on an event don't change the pre-event state, but perform copy-on-modify, as the pre-event state is also used to derive the intermediate event state. If you do modify it, then the best you can do is say 'there was a collision in the time step', and you have to keep the time step short enough that it doesn't matter when the collision occurred within it.

I'm trying to think where C++-style identity tied to memory is useful; usually it's a premature optimisation, causing inconsistent states if you're not careful with the visibility of an update to a simulation entity after an event (the work I was doing up to November suffered from this, but that architectural decision was made before I joined that team). In C++ it's hard to work out how to create a 'make an updated copy' operation which doesn't perform redundant copying: it either involves writing a constructor which is aware of the update operation (causing slicing issues if the update is polymorphic), or requires redundant initialisation if you use something like:

struct entity {
    double position, velocity, acceleration; // scalars for brevity; really 3-vectors

    void update_kinematics (double time_step) {
        // constant acceleration over the step
        position += (velocity + 0.5 * acceleration * time_step) * time_step;
        velocity += acceleration * time_step;
    }
};

// create a copy of previous step's state
entity* updated = new entity(*previous);

// update it
updated->update_kinematics (time_step);
// store it in the active entities list for the new time

Then all the updated entities go into the collision detection routine, and any collisions are handled before the next interaction stage, where acceleration is recalculated. If there is a collision, then a new time event is added before the current time step so that the interactions due to the collision can be worked on, or the collision is handled in some other way. Then any other interactions are handled, based on the positions (or trajectories) at the current time, and then the actors can update their intentions based on the sensing done in the interaction phase. Each time there's a copy-on-write going on, so, for example, if aircraft X decides to shut its radar off at time t1, the radar emission is still visible along its trajectory from time t0 to t1, and any other actors can sense it and respond.

For a 6DoF vector position and velocity, the cost of writing the data twice isn't that great, but for larger structures it can be. Maybe some sort of copy-on-write at the language level would help, with an escape analysis to decide between that and mutable structures. That would of course need an experiment to see whether copy-on-write costs more than mutability; it does for large structures, but not always for smaller ones.

How this relates to the sort of purely functional data structures described by Chris Okasaki, I've yet to think through.

Other things I've been thinking of: modelling actors as state-machine transducers, implemented as coroutines (see this paper for some inspiration) using a trace-based code generator. The problem is that coroutines fit well with transducers and the style of event-driven actor programming I would like to express, and also with mostly sequence-based processing, but not with the VM implementations I'm familiar with (or C++ coding, especially RAII). Actors accept a set of values, and also yield symbols or accept a different symbol.

For example, a translator from UTF-8 to Unicode codepoints would accept a variable number of uint8 values for each uint32 value it produces:

# Convert UTF-8 string to a sequence of Unicode codepoints
codepoints (str:string) : seq<uint32> =>
    yield str -> (x0:uint8) =>
        if ((x0 & 0x80) == 0) then
            yield uint32(x0)
        else
            accept (x1:uint8) where ((x1 & 0xc0) == 0x80) # UTF-8 encoding valid
            if ((x0 & 0xe0) == 0xc0) then # 2 byte sequence
                yield (uint32(x0 & 0x1f) << 6) | uint32(x1 & 0x3f)
            else
                accept (x2:uint8) where ((x2 & 0xc0) == 0x80) # UTF-8 encoding valid
                if ((x0 & 0xf0) == 0xe0) then # 3 byte sequence
                    yield (uint32(x0 & 0x0f) << 12) | (uint32(x1 & 0x3f) << 6) | uint32(x2 & 0x3f)
                else # 4 byte sequence
                    assert ((x0 & 0xf8) == 0xf0) # UTF-8 encoding valid

                    accept (x3:uint8) where ((x3 & 0xc0) == 0x80) # UTF-8 encoding valid
                    yield (uint32(x0 & 0x07) << 18) | (uint32(x1 & 0x3f) << 12) |
                          (uint32(x2 & 0x3f) << 6) | uint32(x3 & 0x3f)


My lower-level implementation experience is only with fairly procedural languages, such as Java byte code or conventional use of assembly, so translating this to code that runs is a bit interesting; it sort of goes round in circles - if I already had a language which worked on the accept/yield principles (with enough control to create SIMD code and column-ordered VLists), then creating a parser and an interpreter would be easier.
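
One mechanical translation I can see is to flatten the coroutine into an explicit state machine that's fed a byte at a time - a C++ sketch, hand-derived from the kin above, with the 'where' validity checks elided:

#include <stdint.h>

struct utf8_decoder {
    uint32_t codepoint;
    int pending;          // continuation bytes still to accept

    utf8_decoder () : codepoint(0), pending(0) {}

    // Feed one byte; returns true when 'codepoint' holds a complete value.
    bool push (uint8_t b) {
        if (pending == 0) {
            if ((b & 0x80) == 0)         { codepoint = b; return true; }
            else if ((b & 0xe0) == 0xc0) { codepoint = b & 0x1f; pending = 1; }
            else if ((b & 0xf0) == 0xe0) { codepoint = b & 0x0f; pending = 2; }
            else                         { codepoint = b & 0x07; pending = 3; }
        } else {
            codepoint = (codepoint << 6) | (b & 0x3f);
            if (--pending == 0) return true;
        }
        return false;
    }
};

Each accept becomes a return to the caller, and the coroutine's locals become fields of the struct - which is exactly the transformation I'd want a compiler to do for me.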


2008-01-28

Plays well with others?

> I know large systems can be built statically...I've also done it dynamically in more than one dynamic language.
That hasn't quite been my experience - every large system I've worked on has ended up getting layered into a static core with dynamic or DSL scripting. Greenspun's tenth rule.

On the other hand, even medium applications I've been involved in which were implemented only in a dynamic language got into difficulty with performance or modularity, and so either evolved into a mixed language system or were reimplemented.

Often I also use dynamic languages or environments to prototype algorithms, and sometimes then have to port to static languages for performance reasons - writing better code generators is rarely cost effective, as the dynamic languages are sufficiently good at the general cases; if you need to optimise, you're optimising for a specific requirement. The same happens (albeit more rarely) when you need to add inline assembler to C or Fortran to go below their abstraction level (CAS, bitwise rotation, integer overflow detection, and so on).

Also in large systems you often interface with extant libraries, which may be implemented in diverse languages. You're not going to port half a million lines of a numerical library into Smalltalk so you can call it.

So you end up with interfaces between the user code and the library, and having to specify in a machine processable manner the types at that interface, as you're now below the abstraction barrier of the dynamic language.

Other dynamic languages require definitions for such interfaces, though foreign ones may not be visible in the language's introspection facilities - for example, Mozilla's JavaScript uses IDL for the definition, but the type and arity of functions aren't visible via introspection.

This introduces cost and brittleness to the system; and in the types of systems I've most experience with it also makes it harder to move a prototype algorithm between the dynamic and library codebases than it feels it ought to be - the dynamic runtime usually has the facilities for doing most of the code generation work already, but there's no means of accessing the results and tweaking it. I am looking for mechanisms which make that sort of task easier, and which don't require custom interpreters for DSLs, which are a pain to get right or maintain.

>"What happens in Smalltalk projects where you grow larger? Where you want to divide a system up into components, and some of those components don't exist yet?"
> You have a need; you have an idea to meet the need; you try it; you grow it. I guess I don't understand the question.

In my experience, that only happens up to a certain size. My question was about large systems - systems where you have to subcontract components, either to another individual in your team, to a second local team or to a separate organisation. Bottom up approaches don't scale out to that level, top down approaches introduce brittleness and (code production) inefficiency. The Smalltalk browser appears to be strongly a bottom up tool; I'm curious as to whether it allows that approach to scale out better.

> Semi-formal notations, tests and other examples, documentation, good tools -- these all contribute to success. But if you are looking for some kind of a "static declaration" as the saving mechanism for large systems -- it may work for you, but I doubt you can prove they are necessary and sufficient. I have counter-examples.

I do consider some form of specification of the domain and codomain of a function to be required, and the stronger the specification which can be automatically tested, the better. Verifying natural language specifications by hand is simply not viable for large software projects. What's provided by the type systems of most languages is not sufficient even for that - you can't say the domain of a square root function is the set of non-negative reals. Specifying pre- and post-conditions helps, but for anything involving concurrency you need 'for all' rather than the 'there exists' specifications given by unit tests. But most of the system doesn't require formal specification, and good design minimises shared state.

So I'm looking for something with the gains of dynamic languages without barriers to performance, and good modularisation. As such, I'm more interested in gradual typing than Smalltalk. But if it does have anything to offer then I'll borrow from it.

> "Can such interfaces be browsed in the same way if the code can't be called?"
> Your question betrays your bias. "There must be some such mechanism as the one I am used to," he thinks.
Yes, I have a bias based on my experience and problem domains.

One project I was involved in was refreshing the library used for airworthiness stress analysis, with a view to improving the performance (it was costing around half a million pounds in CPU time to run the analysis for each design revision, so in the company at that time, making a 10% improvement in the run time would mean someone didn't have to get made redundant). This consisted of a set of Fortran libraries, some OS scripts, and a domain specific scripting language which encoded the models and analysis algorithms. A set of flight states was mapped over the model, and the outputs reduced with a filter to indicate any resonances. The core of the system was mostly unchanged since the late 1970s, only the models used being updated with successive revisions to the aircraft.

There was a large amount of model and library code, so I wrote a quick and dirty call browser to explore data flows in the parts of the system I had access to. Since the code was significantly larger than the memory capacity of the computer I was running the analyses on, let alone expanding that to the AST and call graph, the interactive image and execution based querying presented in the Smalltalk browser post would not have been suitable. The project ended up going nowhere, as translating the implementation of the non-standard dynamic language from the mainframe to PC would have taken more resources than the team had in the time available.

If the scripting was in a standardised dynamic language, then the maintenance cost would have been much lower, and the chances are the code could have been moved from the costly mainframe to instead run overnight on PCs.

So is there any way such tools can extend to a system of a thousand executables, rather than one? A system which doesn't have everything visible? The types are somewhat irrelevant to this, but most programs people are developing are not whole; the question is one of scaling approaches which assume you have a single image which is everything you care about, and that all changes to the system are monotonic. If you have a system with multiple teams working on it, if you are developing a library which may be used in a number of ways, or if you have a system which is so big you can't load it without paging, then I'm interested in what happens, since anything you deduce from calls to code you can't inspect is not monotonic - some other use of the function may be a correct use and widen its domain. Partly it's an issue of extensional rather than intensional analysis - for some systems you need to prove that the core is correct rather than show it is not incorrect in parts. But an awful lot of the code I work with doesn't need that level of confidence, and as much as possible I write that in dynamic languages. Sometimes I make that judgement wrongly, and so I'm looking for mechanisms which allow 'fixing' of parts of the system more easily, as well as allowing assertions to be made for compatibility between parts of a heterogeneous system.

> Most large systems developed with dynamic languages have not used the kind of mechanism you seek.
I've seen such mechanisms in JavaScript, Lisp, Erlang and Python. Maybe Smalltalk just doesn't play well with others, and is only suitable to monolithic applications.


2007-10-18

I hate everybody

Well, every programming language I've ever used anyway.

Not a lot to report. I've faded off doing the wide-finder thing, as the MPI ruby code was about as good as I could expect in elegance, and I'm more interested in other stuff at the moment. It did get me annoyed with C++ again, in that its primary mechanisms for specialisation - template metaprogramming and single-dispatch polymorphism - aren't quite a good fit for the problems I want to solve: the boost::regex version was much slower, and shouldn't be. I want to have good enough performance, and clarity, and open objects when developing, and strongly typed interfaces when building libraries, and extensible grammar, and access to the low-level machine when bit-twiddling (tbray9.cpp uses a better 8-wide '\n' scan, adapted off this zero-in-word hack; parallelising it with pthreads or MPI is left to the reader). I don't want to have to use unhygienic macros; and this discussion of units and measures in Java's verbose, strict, non-expressive generics made me think about kin, my pet lisp+prolog+strong types language, again.
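
For reference, the 8-wide scan works something like this (my reading of the zero-in-word hack; tbray9.cpp may differ in detail):

#include <stdint.h>

inline uint64_t has_newline (uint64_t word) {
    uint64_t x = word ^ 0x0a0a0a0a0a0a0a0aULL; // '\n' bytes become zero
    // classic zero-byte test: borrows out of zero bytes set their top bits
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

A non-zero result means the word contains a '\n', and the lowest set 0x80 bit locates the first one, so the common case costs a handful of ALU ops per eight bytes.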

So most of this week I've either been thinking how to make a fast VList-based stack work for a fork+join worker model, or playing with a higher-level language which looks a bit like Python with strong type annotations - I prefer infix syntax; Dylan has always been a bit weird-looking to me, and I don't want JavaScript's C-like braces. Python got a bit interesting when it looked like it was drifting towards lisp-land, but then drifted away again.

One thing that struck me reading the Dylan manual (it's always worth looking at how other languages do things) was its convention that
    bar(foo)
and
    foo.bar
were the same thing. If that was combined with a strong form of memoization such that writing
    bar(foo) := baz
guaranteed that in future
    bar(foo) == baz
you would get the ability to define dynamic fields in objects, similar to what JavaScript and other associative object based languages offer, as they could be written in the dot syntax as:
    foo.bar := baz
    ------------------
    foo.bar == baz
Assuming type specialisation on the function trace path is also cached, lookup becomes a single hashtable access. You do lose locality; if you want performance you'd have to add it as a field on the object's class, though I hate that - language features shouldn't be used for indirect effects; type safety and performance optimisation should be orthogonal. And lack of locality saps performance, though it depends on context: if you're iterating over lots of objects, accessing the same field, the locality could be better than with a conventional C++ layout (different fields in the same object local, different objects anywhere) or a JavaScript layout (different fields in the same object possibly local, using a hashtable or similar).
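
A sketch of how 'bar(foo) := baz' could be backed by a single table (all names and types here are illustrative only):

#include <map>
#include <utility>

typedef void* object;  // stand-in for a receiver reference
typedef double value;  // stand-in for whatever values are

// one table backs every memoized 'field', keyed on (function, receiver)
typedef std::pair<void* /*function*/, object> slot_key;
std::map<slot_key, value> slots;

void set_slot (void* fn, object obj, value v) {
    slots[std::make_pair(fn, obj)] = v;    // bar(foo) := baz
}

value get_slot (void* fn, object obj) {
    return slots[std::make_pair(fn, obj)]; // bar(foo) == baz thereafter
}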

As an aside, recursively rewriting the slot accessor form
    x.f.g
== (x.f).g
== f(x).g
== g(f(x))
Which, although apparently missing from the DRM, is the opposite of the normal math interpretation of (g o f)(x) = g(f(x)). Pity. Now I also hate Dylan.

My more recent ramblings about kin are on kintab.net; I've thought a bit about what I'd need to do units and measures nicely, and have been going through SICP coding up what the examples would look like in infix notation, which is a useful test for conciseness and clarity.


Pete


2007-09-29

Wide finder, parallelism and languages

Tim Bray, who knows about web search and was part of the XML specification process, is experimenting with exploiting parallelism in his wide finder project.

I'm interested in parallelism (running similar processes on multiple hardware) and concurrency (having multiple collaborating threads of control in a system), as many interesting problems are intrinsically concurrent, and as hardware isn't getting much faster but is getting wider, so will provide more parallel processing bandwidth.

The original problem is one of extracting the N most popular pages (not images) fetched from Tim's ongoing blog from the server's log files. These files are a few gigabytes in size.

Tim's ruby code runs in 13.5 seconds on a 1.67GHz PowerBook, which is a twin-core G4. My use of my twin-core 1.42GHz G4 desktop petered out last year when I got a single-core AMD64 laptop, as most of the time the laptop was more responsive. The laptop runs Ubuntu 7.04 64 bit.

Measuring the problem:

As it's not a lot of ruby code, it's easy to write a little C++ code to find out where the performance issues are.

First off I was playing with simple string matching (tbray1.cpp) vs KMP (tbray2.cpp), but the requirement is that the code should be as simple as possible, and KMP doesn't actually help on modern CPUs as it's a jumpy algorithm. A third, simpler approach just calling strncmp works as well, as far as using time to measure can determine (tbray3.cpp).
fortinbras:$ time bin/tbray3 datasets/original/thousand-o10k.ap
matches: 1100000

real 2m28.201s
user 0m10.193s
sys 0m4.752s
At that size of dataset, the laptop's CPU didn't get above 800MHz (it has 800MHz, 1600MHz and 2200MHz speeds available, and it stayed at 800MHz). Smaller datasets which are already in the disk read-ahead buffer do cause its CPU to shift up; hence some of Steve Vinoski's figures for erlang times are not indicative of the problem - you don't run the same set of data through the system again and again, so you have to either load something else into the file cache to clear it (cat something-big > /dev/null), or use full-size datasets.

Counting non-comment lines which aren't a single { or }:
fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray3.cpp | wc -l
48
So a simple, sequential C++ implementation using a memory mapped file is 48 lines of code (my bracing formatting habits shouldn't count against the language), and isn't too slow.

Running the ruby for reference, which does a bit more in terms of counting occurrences:
fortinbras:$ time ruby src/benchmarks/finder.rb datasets/original/thousand-o10k.ap
...
real 1m20.744s
user 0m24.710s
sys 0m4.420s
Presumably Tim's disk-speed to CPU-speed ratio is higher; on my laptop ruby is IO bound, and the CPU not maxed out, though it does shift to a faster speed. Fortinbras processes the same size dataset that Tim was using in 8.6 seconds.

But the ruby is doing IO much faster than my simple memory mapped code, so changing to use block IO rather than memory mapped (at the cost of a 25% longer program):
fortinbras:$ time bin/tbray4 datasets/original/thousand-o10k.ap
matches: 1100000

real 1m6.780s
user 0m9.593s
sys 0m2.464s

fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray4.cpp | wc -l
60
Again, the C++ implementation doesn't stress the CPU enough to get out of first gear, and it's about 14 seconds faster than the ruby, due entirely to less user CPU. I'd guess that the underlying C code the ruby interpreter calls for its line reading is similar; I don't know whether ruby strings are internally UTF-16 rather than UTF-8; if so that alone would account for CPU cost. I'm actually quite impressed that ruby isn't glacially slow, but I guess most of the work is between the lines of the program.

The C++ code is also fewer lines than Steve Vinoski's 84-line erlang example, though the C++ doesn't show use of multi-core parallel processing. Parallel code in C++ can take more work. Concurrent code in C++ definitely takes more work than in erlang.

Given an infinitely capable CPU and the same disk, ruby's 81 seconds or C++'s 67 will reduce to 55. If the CPU is using 12 seconds, to get anything from parallel CPUs, you'd need to get the data from disk in less than 12 seconds, which is probably not far off what current flash is capable of. I don't believe there's a software solution to make the IO much faster on the laptop's hardware.

Running on a Sun Netra T1 105, with a 440MHz Sparc CPU:
tercel-2:~/projects/tbray$ time bin/tbray4 datasets/thousand-o10k.ap
matches: 1100000

real 1m23.930s
user 0m57.070s
sys 0m24.636s
Much more CPU time, but the total real time is in the same ball park - the ten year old server has a fast wide SCSI disk but a slower CPU, so is close to being CPU bound.

If you have multiple cores and multiple disks to read from, you can launch multiple batch scripts to process different days' logs in parallel, like make -j, or using MPI Scatter/Reduce.

MPI version

There's a simple extension of tbrayN.cpp with MPI at tbray5.cpp. I'm an MPI novice - most of the threading problems I've had to solve require concurrency rather than parallel processing. MPI defaults to using as many processes as there are cores, but you can force it to use a specific number of processes. On a single-core machine, it slows from 1.2 seconds on a million-line file with a single process, to 14 seconds with 16 processes, to 54 seconds with 64 processes. Trying to launch 1000 processes causes the machine to spend forever context switching. Running lots of processes without context switching is what erlang is good at, but for optimal CPU throughput you only want as many processes as you have cores.

The MPI version loses in terms of conciseness, and has two separate code paths - the process which reads the file into chunks, and those which scan the chunks. It's not as nice to read as the erlang. It comes in at 92 lines of code, 6 of which are error reporting that the erlang example lacks (presumably the error is passed to the read-eval-print loop to handle), so it's within a couple of lines. Using pthreads would probably require more lines, as pthreads lacks suitable message passing primitives.
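
For reference, the shape of the scatter/reduce approach looks something like this (a sketch rather than tbray5.cpp itself; count_matches stands in for the strncmp scan from tbray3, and the file reading and remainder handling are elided):

#include <mpi.h>
#include <cstdio>
#include <cstddef>

long count_matches (const char* buf, int len); // hypothetical: the tbray3 scan

int main (int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, procs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);

    const int CHUNK = 1 << 20;
    char* chunk = new char[CHUNK];
    char* whole = 0;
    if (rank == 0) {
        whole = new char[(size_t)CHUNK * procs];
        // ... read CHUNK * procs bytes of the log into 'whole' ...
    }
    // deal a chunk to every process, including rank 0 itself
    MPI_Scatter(whole, CHUNK, MPI_CHAR, chunk, CHUNK, MPI_CHAR, 0, MPI_COMM_WORLD);

    long local = count_matches(chunk, CHUNK);
    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("matches: %ld\n", total);
    MPI_Finalize();
    return 0;
}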

Parallel languages, concurrent languages

IO aside, the actual processing is a problem in parallelism - the task can be split up into work packets which can be processed independently, with a reduction at the end to a single result set. Only about one in nine lines contributes to the counts, so the dataset for the reduction is significantly smaller than the input dataset; parallelising the matching, and maybe one pass of the sorting and counting, should allow a speed-up on a multi-core machine.

Languages such as Sun's Fortress have intrinsically parallel constructs and atomic blocks to hide concurrency issues from the developer, and take advantage of multi-core hardware. Fortress has built in parallel for and reduction, equivalent to MPI code but without the programmer having to explicitly manage the processes.

In Fortress, the code for wide finder should be no more complicated than the ruby code; it's up to the implementation to parallelise the loop. Unfortunately, the current Fortress implementation is an interpreter on top of the JVM, and isn't speed optimised - it takes a couple of seconds to parse the scripts, and then interprets them rather than generating byte-codes.

Actor model languages such as Erlang, Alef, and Scala are used where the problem is best expressed in terms of concurrent processes. Their implementation is designed to allow many, many concurrent processes on a finite number of hardware cores - they reduce requirements for locking, have strategies to mitigate blocking operations stalling other actors' execution, and solve many of the issues that using OS-level threads for concurrent, communicating, mobile, robust processes exposes. I've written actor based code professionally where the problem fits that model.

The wide finder problem has no inter-process communication requirement until the workers terminate, it doesn't require more processes than there are CPUs, it doesn't write anything so doesn't suffer from multiple writer threads and locking issues, and it doesn't require much execution to continue if one worker blocks - the blocking will be due to IO, and it can't do anything anyway.

Erlang and Scala are optimised for the opposite problem to the wide finder problem - concurrency within a core rather than parallelism across cores. A better language choice would be one such as Fortress, which gives parallelism across cores without having to express the problem in anything other than terms of the operations on the data.

Conclusion

This problem shows up one of the reasons I'm working on compression for the quad-store - any data mining using spinning disks is IO bound; using compression is one way to shift work load off the IO pipe and onto the CPU. It's also why I spent a few tenners on a second fast disk for each of the Netras in the cluster, rather than a couple of thousand on a new multi-core box. I can't afford 30 dollars per gigabyte for a solid state wide-drive.

It's also an interesting comparison of compiled and dynamic languages - running ruby is CPU heavy, as heavy as running on the 10 year old hardware in my attic, but is allegedly easier to program in. I think I'd prefer it to Perl if I was to write text extraction and reporting scripts, but hate nearly all programming languages anyway once I have to use them. Some are more useful for certain problem classes.

Some of the discussion has been about approaches towards splitting the log file into chunks, and splitting it into lines.

None of the hits in the sample file require you to split it into lines - the pattern doesn't match anything it shouldn't if you ignore line breaks. To split the file into lines you need to inspect every character; to match the pattern you may only need to inspect one in 18 - if a character isn't in "GET /ongoing/When/" you can skip 18 characters and check whether that one is in the pattern, which is the basis of Boyer-Moore style skipping (sketched below). If the pattern matching can be done better than the line scanning, that should give a boost on the CPU-limited machine. It doesn't give a boost on the Netra, and I'm not sure why; possibly the skipping makes branch prediction worse.
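
The skip-ahead scan, sketched (in_pattern is a hypothetical 256-entry table of the bytes occurring in the pattern, built once at startup):

#include <cstring>
#include <cstddef>

bool in_pattern (unsigned char c); // hypothetical lookup table

const char* find_get (const char* p, const char* end) {
    static const char pat[] = "GET /ongoing/When/";
    const size_t n = sizeof(pat) - 1; // 18
    while (p + n <= end) {
        // if the last byte of the window can't occur in the pattern,
        // no match can overlap it, so jump the whole window
        if (!in_pattern((unsigned char)p[n - 1])) { p += n; continue; }
        if (std::memcmp(p, pat, n) == 0) return p;
        ++p;
    }
    return end; // no match
}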

All the code posted has been exact solutions. If you split a billion-line file into 100 equal chunks for 100 cores to process, as you might using the built-in MPI Scatter function, you'll lose maybe 11 hits that get cut in half (as only one in nine requests matches the pattern), and you probably won't care - the most popular pages will have thousands of hits, so a dozen here or there won't be missed. Similarly, if you only want 90% confidence for the most popular hits, you may only need to sample some of the log file. Whether an approximate solution is good enough is the sort of question you need to ask a customer before optimising for the exact case; it may well simplify the solution.


Pete


2005-06-21

Smart or fluffy, Ozy. You can't have both

Link: http://www.ozyandmillie.org/d/20050608.html

Most of my earlier career has been in formal (usually graphical) notations surrounding systems engineering, trying to create strong, machine usable ontologies and tools that allow users to create structures that represent the systems they're designing.

Graphical tools that allow you to create these things are great fun to program.

But using them is like alphabetising your record collection in seven dimensions - there are product, process, validation, verification, stakeholder, structure, function, behaviour and purpose axes that each system has a position on.

And to a lesser extent, UML tools suffer the same problems, and as the UML gets more and more complicated it only gets worse.

Part of the problem is that getting machine-usable information about your systems has a high cost, and the cost is incurred by users who don't feel the benefit.

And part of the problem is the tools and notations themselves: the lack of direct manipulation, and of notation condensation.

Flow happens in tools when the user forgets they exist, and feels like they're directly manipulating the concepts. Part of that is in the UI design - you have to design for flow - and related to that is notation condensation.

In the UML graphical notation for class diagrams, there are conventions for warts to add to indicate access modifiers on attributes and methods; for example, +foo : integer[0..7] means the public attribute named 'foo' is a collection of zero to seven elements of the type named 'integer'. Of the four UML tools I can think of off hand (ARTiSAN RTS, Poseidon, Rose 2002, Enterprise Architect), only Poseidon lets you enter the text directly and updates the relevant properties in the abstract syntax; with the others you either have to open a dialog to edit them (EA), edit each part in a separate in-place widget (Rose), or use their own non-standard syntax (RTS). So you have to both learn the notation, and learn the tool's take on how to interact with the notation.

ArgoUML/Poseidon has a very good HCI; it's a pity it's currently broken by bugs in Apple's Java, so I can't use it. Of course, OmniGraffle lets you enter whatever you like, as it's not based on the semantics of UML but is a diagram editor.

But even with the condensation, to say something consistent in UML means there are many non-graphical links. Some of these are apparent in the source code, such as a method containing its sequence of operations - not possible to show in UML; others, such as embodiment or requirements trace, are relations between models, and UML doesn't show these well. There is notation in SysML for some of the interesting relations between models, but I've yet to see anything that implements it, and I'm not convinced it will get used - it's a notation that spans diagrams, and the unit of context in UML is the diagram, which is its greatest flaw. It's a language that hasn't the means to say the interesting things about a project.

So you can go to 'fluffy' approaches. One of the things I'm playing with is getting an Atom feed of changes to the codebase of one of my projects, and building a temporal web showing a measure of when each part was updated. I suspect there is a nexus around a couple of classes, which moves over time as the work progresses. Though I'm not sure what such a measure would mean, other than that I was updating these things at about the same time.

The thing we want is to use as many sources of ambient information - information that has zero user cost - as possible, to get the most useful view of the project. UML is high cost and fractured; what I want is something like the feeling I have of the shape of the system.

Smart and fluffy, I want both.


2005-06-14

Emerging Languages

Link: http://www.developerdotstar.com/community/node/144

In the last few days, Martin Fowler posted a few essays on language workbenches and their relation to MDA and MDD. Today John Cowan posted on his Recycled Knowledge blog about traits, as used in Sun's proposed technical computing language, Fortress. I'd been aware of Fortress, but had missed the draft spec coming out.

I don't want to knock Fortress, as it's hard to make a technical computing language better than Fortran and as net-savvy and secure as Java, but I'd agree with Edward G Nilges's blog. Having α := x² isn't enough.

The big thing that Fortress claims as its selling point isn't that it's a secure, robust, transactional, object oriented language. It tries to sell itself on its syntax looking like maths.

Whatever you do with monospaced text, unicode or no, it's not going to look like text-book maths. For that you need a decent document rendering system, and can use a block formatting model resulting from a relation on, or a transformation of, the abstract syntax of the language, much like the language workbench systems do. Fortress's examples of 'maths like' syntax simply don't do it well enough. There was also a quite sexy demo of Mathematica's means of exploring computational model spaces at Apple's WWDC, which shows what you can do with a compound-document style UI in mathematical programming.

But as well as not being quite enough in terms of becoming literate maths, it seems to ignore Fortran and the existing codebase.

The big problems with agile, high-productivity technical computing are how to build models quickly, and scripting languages that apply leverage to existing systems (such as SciPy, and my own (unfortunately closed source) Jini-based system for creating a distributed agile front-end to Fortran code) offer much better reuse of current code by integration rather than replacement. You don't want to rewrite your aerodynamic stress library to use it, so these techniques give better return on effort. Which is what high productivity is all about.

That said, there is also some hard-to-read code out there that results from the lack of data structures in F77 and the aggregation of patches to 20 or 30 year old systems. There is some merit in applying relational KR models, such as conceptual graphs, to documented legacy code to extract the intent, and I hope to put aside some time to play with applying something like a pattern match to some of our larger application sets (10⁷ lines of Fortran, mainly single-purpose stove-piped programs with no shared libraries) to find redundant code that can be extracted out to a manageable codebase.

There's simply no way we can rewrite our code in Fortress, and FFI and code transformation are more important than syntax.


TME
