2008-08-07

Why Type

Link: http://blogs.msdn.com/ericlippert/archive/2007/12/06/immutability-in-c-part-three-a-covariant-immutable-stack.aspx

For work, I'm demoing a GUI editor, which means undo would be nice, which is easiest to implement if you enforce purely functional data structures. So I'm doing a bit of reading as to how to implement suitable structures in Java 1.6, as that's the project's target language, and using Eric Lippert's examples as part of my reference material.

There was also yet another thread on forums.java.sun.com about what can be done with generics, erasure and making arrays safe.

I like having something which proves that all possible execution paths don't result in obviously erroneous actions. For many cases, a strong type system supports this.

But the more I look at it, being able to assert any simple thing about any action on an object would make a lot of the proofs simpler. There are better and better proof tools around, and the kinds of things you want to prove about concurrent systems are not easily expressed in a type.

I'm increasingly thinking that a dynamically typed runtime, with a good static model checker, would trump a strongly typed language in performance, productivity, and safety. But creating such a beast is highly non-trivial.

Labels: ,

2008-06-30

Cedric's challenge, C port and analytic attempt

I was a bit disappointed that the C port of the previous code wasn't quite as fast as another approach which cached more state instead of generating the i'th value. The second attempt, using the bcd lists was comparable to John Wilson's Java version (1.3 seconds rather that 1.5 reported by the Java), and an attempt using bitsets a bit slower. Timings on Eee pc 900, Ubuntu hardy heron. The code is nrnat2.c.

The biggest jumps are between 9(descending digits) and 10(ascending) and 10(descending) and 12(ascending). Generating those four values will give the largest jump in a range.

The number of values will be the total of combinations: 9 + 9 * 9 + 9 * 9 * 8 ... up to the max value.

The total for 1..9 is 9*(9+1)/2 = 45;
For numbers 10 to 98 there are 9 numbers starting with each digit, so you have 450*9 for the first digit, and 9 lots of a odd sum, where there's one digit missing - so 9 * 45 - 45 for taking one digit out each time (the range is 0..9 instead of 1..9).

That gives a total up to 10 of 45

Up to 100 we get the 45 up to 10, 45 * 10 * 9 for the first digit shifted once repeated 9 times, and 40 * 9 for the second digit repeated 9 times (as the first digit is always non-zero, it biases the remaining digit's average by 0.5 from 4.5 to 4). So total for up to 100 is 45 + 45 * 10 * 9 + 40 * 9 = 4455.

Up to 1000 we get the values so far, 45 * 100 * 9 * 8 for the first digit shifted twice repeated 9 * 8 times - the number of combinations of the other digits. The remaining digits occur the same number of times as each other in each position and once for each value of the first digit, so you get 40 * 9 * 8 * 11 from them, so total is (4455 + 45 * 100 * 9 * 8 + 40 * 9 * 8 * 11) = 360135.

Similarly sum up to 10000 is (360135 + 45 * 1000 * 9 * 8 * 7 + 40 * 9 * 8 * 7 * 111) = 25277895

Simplifying, the sum of the values with n digits is (9!/(9-(n-1))!) * ( 45 * pow(10, n-1) + 40 * floor(pow(10, n-1) / 9) ) and the maximum jump is floor( ( 1203456789-1029876543 ) / pow(10, 10-n) )


Pete

Labels: ,

2008-06-28

A coding challenge

An implementation of the coding challenge of Cedric's.

Somewhat coloured by the fact I normally use python as a prototyping language prior to production coding in C++.

The combinations technique (using a bcd string instead of list) gets faster than the filtering technique for max somewhere between 1,000,0000 and 10,000,0000. As python is bytecode, it may be slower to bit-twiddle than use a dynamic list; IIRC it has to allocate memory to hold longs anyway, but getting it right in Python and then porting to something with good performance is the way I often work, and longs are fast in production programming languages.

import time

def has_repeat_digits(i, available = (1 << 10) - 1):
digit = i % 10

if available & (1 << digit):
if i < 10:
return False
else:
return has_repeat_digits(i // 10, available &~(1< else:
return True

def brute_force_counter(max):
for i in range(1, max):
if not has_repeat_digits(i):
yield(i)

def test(max, counter, timed_run = False):
last_i = 0
biggest_jump = 0,0,0
total = 0
start_time = time.time()

for i in counter(max):
if not timed_run:
print i,
total = total + i
jump = i - last_i
if jump > biggest_jump[0]:
biggest_jump = jump, last_i, i
last_i = i
stop_time = time.time()
print
print 'total', total
print 'biggest jump', biggest_jump
if timed_run:
print 'time', stop_time - start_time

# select the index'th permutation
# first digit is special case, is in range(1,10) rather than range(0,10)
# otherwise it's a standard permutation. Using BCD encoded value to represent
# remaining digits to select from rather than a list as deleting from a python
# list is O(length) whereas deleting a digit from the BCD is O(1), and I like
# fiddling with bits.
def bcd_select(index):
if index < 10:
return index

pow = 1L
fact = 1L
digits_available = 10

index = index - 1L

while index >= fact * 9:
index = index - fact * 9
digits_available = digits_available - 1
fact = fact * digits_available
pow = pow * 10

#~ print 'fact', fact, 'pow', pow, 'i', index, 'i//fact', index//fact

# first digit is selected from 1..9
available = 0x9876543210
digits_available = 9
sel = (index // fact)
index = index - sel * fact
sel = sel + 1
available = (( available & ~((1 << ((sel + 1) << 2)) - 1)) >> 4) | ( available & ((1 << (sel << 2)) - 1) )
acc = sel * pow
pow = pow // 10
fact = fact // 9
#~ print fact
while pow > 0:
#~ print digits_available
#~ print 'fact', fact, 'pow', pow, 'i', index, 'i//fact', index//fact
#~ print 'available', digits_available, hex(available)
sel = (index // fact)
acc = acc + pow * (( available & (0xf << (sel << 2)) ) >> (sel << 2))
available = (( available & ~((1 << ((sel + 1) << 2)) - 1)) >> 4) | ( available & ((1 << (sel << 2)) - 1) )
index = index % fact
pow = pow // 10
digits_available = digits_available - 1
fact = fact // digits_available

return acc

def bcd_counter(max):
index = 1
while index < max:
value = bcd_select(index)
if value > max:
break
yield value
index = index + 1

Labels: ,

2008-06-03

On the map

Received a little gps 'mouse' receiver in the post today, and (having installed gpsd and python-gps) have got it running as a local service with my first google maps mashup. The server is here, it just shows a google map centred on the current location. I now need to wait until I get a mobile broadband dongle and the new tiny laptop I've ordered (Asus eee 900) and then can have gps on the road (though I don't really want to be using a laptop in the car unless I'm really lost)

Labels: , ,

2008-05-28

Python and EA

For my current contract, I'm part of a team producing scripts which define interfaces to equipment in a satellite communications system for monitoring and control. Various interfaces, serial, low-level TCP and SNMP.

The application is fairly old - originated on Unix in early 1990s, ported to MFC late 1990s - and there are many links between the various configuration files. It's a high reliability system, so uses the classic pattern of a watchdog process with spawns the active processes, monitors them for liveliness, CPU and memory use, and respawns them or kills them in event of failure.

The goal is to use Enterprise Architect for modelling the system in UML. EA has a limited custom code generation - one file per class, poor tree walking functionality - which is just sufficient to create the interface scripts, but none of the other configuration which is more highly intertwined - for each rack or group of racks of equipment to be monitored, there's a server which runs a set of monitoring processes, each monitoring a set of instances of the equipment. Various parts of the configuration, such as the port ID of each server process, need to be put into different configuration files, so the desired layout of generated files doesn't follow EA's assumptions.

Also, there needs to be a parser for the configurations which reverse engineers a UML model from the configuration files - we don't want to have to generate the UML models for existing equipments by hand.

So I started writing a little parser in Python which generated an XMI file for import into EA. That worked well enough, and impressed the line manager, so have been spending the last couple of days fleshing it out.

EA doesn't quite treat tagged values as first-class citizens of UML, relegating them to a second window (though it's nice to have them docked and always visible, but that could apply to all the panes in the properties dialogs). So I wanted a table view which looked like the config file layout, of all the extra tagged values we're using to define the application specific properities used to implement the attributes of the equipment in the configuration files.

I've been using GTK for that - I've played with GTK a bit in C, and quite like it's level of abstraction and simple event model. It's not too hard to get things working in Python (some of the handling of attributes is a bit counter intuitive), and it mostly looks native (compared to Java or Gecko, or even EA's UI, which I believe is written in C#, but isn't using WinAPI menus).

I quite like Python, but it's a bit too imperitive for my taste, and there's no type system to tell you if you've put a right-handed vector into a left-handed library, or referenced an attribute of an object which doesn't exist. But being able to create a simple dialog with a REPL (read/eval/print/loop) in it to do the initial exploration of the EA COM automation interface was very handy.

The thing I dislike most in Python is that it doesn't actually use indentation to indicate blocks (I read Ken Arnold's chapter in The Best Software Writing over the weekend, and think I agree). It uses the colon character to indicate a block, and following a colon you have to use indentation. I'm always just doing the indentation, and forgetting the colons.

Labels: , ,

2008-05-20

Lambda The Ultimate and Matrix Distractions

I had been reading up a bit on trace-based JIT. I'm not sure that there's any real difference between trace based and method based JIT if each branch is converted to CPS style call with type feedback. Which means removing the optimisation of having jumps and branches from the bytecode, and replacing them with dynamic calls.

But then I got distracted by a discussion of Strassen matrix multiplication, and haven't done anything with it yet. I need to get back reading through Golub & Van Loan too; it's over a decade since I last read it.

Labels:

2008-04-30

Test of syntax highlight

Test of the syntaxhighlighter:


# generic sequence trait
trait seq<T>
def empty : boolean
def head : T
def tail : seq<T>
def to_string => '[' ++ join(self, ', ') ++ ']'
def concat (other:seq<T>)
=> other.empty ? self : shift(other, reverse)
def reverse => shift([], self)

Labels:

2008-04-25

Some things I'm thinking about

I've been playing more with my mostly functional modelling and simulation language (kin), here are some ideas which I want to try and profile to see if they offer gains:

Carrying type and data as a tuple:


Some implementation of Scheme and JavaScript use a single void* size datum for each object, and for integers set the lowest bit and encode the value in the rest, and for other objects the lowest bits are zero due to alignment, and the whole is a pointer to object. That doesn't work very well for doubles, and requires that the type is stored in the object's header. Java, C#, and C++ (for object with a vtable) have type information in the object header, which costs in large arrays. I've been thinking about passing the type and the data around separately, which means you can have more tightly packed arrays for non-vtable types.

One-bit markers for escape detection:


Based on One-bit Reference Counting, a marker in the type field to distinguish 'shared' objects you can take a reference to, and 'non shared' which you must not.

Write-once arrays:


As an extension to 'one bit reference counting' style; you maintain a high water mark instead of always copying, for use in structures such as VLists so you don't copy on passing automatically, and it allows you to mutate an array if it can be proven that all accesses happen before writes in matrix multiplications:

ax[i] <- ax[i] + row[i] * col[i]

Block order arrays for cache sensitive iteration and super compiled functors:


In a simulation, it's not uncommon to have a reduction on some function of a few attributes of an entity:

let total_force = reduce(array_of_entities, lambda(e:entity, ax) => f.mass * f.acceleration + ax, 0)

If this is over a homomorphic array, and mass and acceleration are simple double values, this would in C translate to picking 8 bytes here and there from the memory for the array. If instead each field is kept either in 'column order', or in what I'll call block order - so the fields for (cacheline size in bytes) objects are held contiguously. This should both reduce cache misses, and allow the use of SIMD instructions to process the data. The obvious disadvantage is that an object in an array is no longer the same layout as a single object on the heap, and to exploit it you need either a super-compiler or a trace-based JIT.

Effective model for scopes based on lists, symbols and objects:


Trying to build an interpreter in Java, which makes it tempting to use maps for everything, I found that the properties of an object and the properties of a lexical scope are much the same, (the duality of closures and objects is well known) so will try and define the binding model for values in kin using symbols, integers and lists only.

Using CAS for thread safe homogeneous VLists


Similar to the vector implementation in Lock-free Dynamically Resizable Arrays.

Options for compiling Verilog to shader language


Having had an interview last week as a systems modelling engineer with a company who were using C as the modelling language for timing simulations in embedded memory controllers, which is a bit like going for a job as a car mechanic and discovering that you're being quizzed on your wood-working skills, I was thinking about compiling an appropriate modelling language to something which executes efficiently. Though their particulary problem - timing conflicts - I would have though as having an analytic solution.

Something after UML


Speaking of Verilog, Verilog came into existance because even standardized schematic diagrams don't carry strong enough semantics and are not amenable to algebraic analysis, and graphical notations don't scale to large systems without hiding information.
Pi-calculus came into existance as Petri nets don't scale to large systems and are not amenable to algebraic analysis.
UML is very much in the standardised schematic mould, lacks formal semantics, and relies on hiding information to scale.

Often the hidden information in UML is very important - what appears to be a good design is not a good design if its simplicity is achieved by hiding most of the attributes and operations of a class, as often happens when a class gets multiple unrelated responsibilities. The notation allow you to hide details which should indicate that the design should be factored into smaller units, and in fact encourage such behaviour as a mechanism to scale up to showing many classes. For example, in Altova UModel:
You can customize the display of classes in your diagram to show or hide individual class properties and operations. ... This feature lets you simplify the diagram to focus on the properties and operations relevant to the task at hand.

If the class has features which are not relevant, then refactor it. Don't hide it so it doesn't make the documentation more complicated. The classes for well-factored systems have nothing to hide, but require more relations between the classes, which UML tools don't handle well (last time I checked none provided basic auto-routing or anything like good layout engines) and look more complicated, even though they are more simple - in the same way a dry stone wall is simpler than a minimalist plastered white apartment, but is visually more complex.

So what might come after UML then? What would not mitigate against refactoring, but allow a visual overview? What notation might give good analytic properties for software, rather than being a system schematic with loose semantics?

I don't know yet.

Labels: , , ,

2008-04-20

Bjarne Stroustrup on the Evolution of Languages

Link: http://msdn2.microsoft.com/en-us/magazine/cc500572.aspx

A few comments on this -

The illustrations in the interview don't appear related other than that they are of an IDE ("I might have a different opinion if a good IDE was universally available", queue Visual Studio, which implies that VS isn't a good IDE) and that the UML is illustrating multiple inheritance, but mechanism for dispatch in the paper is not based on multiple inheritance (at least as shown in the diagram).

It's an interesting paper though, and I would like open methods in a language.

I don't think that DSLs should be just extensions to general purpose languages. Mainly because that implies a specific execution model, and if you have a DSL then execution shouldn't be part of the domain model (unless it's something like Verilog, or the input to a program proof tool, in which case execution is a domain concept). DSLs should allow the modelling of a domain, and lose their benefit - abstraction away from execution - if they are tied to a programming language. If you have a model of the physical domain, then you can state 'F = m * a' and not have that imply that the values named by 'm' and 'a' are multiplied and stored in the location named 'F', as binding to C++ semantics would have it. Otherwise you need at least three models - 'F = m * a', 'm = F/a' and 'a = F/m' - to let you represent everything that the one entity in the domain represents. Which would create a mismatch between the language and the domain, eliminating the benefit of a DSL.


I've also been reading 'Programming Languages Pragmatics' and working on my model-based language kin a lot, and adapting the parser and code model of the modelling language I used to do a state model of some concurrent code to support kin. I was generating Java as a first cut to bootstrap a full compiler, but Java's generics aren't rich enough to support kin's parametric types, and without them it's not easier to generate Java than C. So I will go straight to a C back end, and generate an intermediate form which can be interpreted. It's more important for kin to be able to call C libraries such as the Intel Math Kernal Library than for Java libraries, and I don't really like the sort of compromises you have to make to fit in with Java's libraries, however nice the JVM's HotSpot and garbage collector are to have.

Labels: , , ,

2008-04-13

Andy Goldsworthy - Rivers and Tides

We got the DVD as a wedding gift, and watched it last night.

I like Andy Goldsworthy, both for his aesthetic and the humility that lets a work exist within an environment which makes it transient, or fluid.

Watching him create a work, sometimes it would fall. Often it was compromised to fit with the time scale imposed by the environment by tide or light.

At each fall, a discovery. More of the nature of the work is revealed; the more failures, the more effortless and whole the final work becomes.

I really like the idea of a flowing form moving through or around obstacles, adapting to what's there, constructed out of what is available, but maintaining its own identity, a recognised figure.

The hanging works, stalks randomly pinned together with thorns, which then form a coherence to mark a hole. A form emerging from chaos, using the chaos, constructed with the chaos.

The works accept that they are not static, but will be changed after the artist commits them to the environment.

There's something in programming - called the quality without a name - which is what software architecture is about. It isn't about perfection, or locking everything down to nuts and bolts, or about hiding the chaff and chaos under languages and frameworks, but about habitability and fluidity, and coherence. It's what the original pattern movement was about.

Is this work good to live with? Does it fit with the environment? Can it flow and adapt with the rivers and tides?

Labels: , ,

2008-01-28

Plays well with others?

> I know large systems can be built statically...I've also done it dynamically in more than one dynamic language.
That hasn't quite been my experience - every large system I've worked on has ended up getting layered into a static core with dynamic or DSL scripting. Greenspun's tenth rule.

On the other hand, even medium applications I've been involved in which were implemented only in a dynamic language got into difficulty with performance or modularity, and so either evolved into a mixed language system or were reimplemented.

Often I also use dynamic languages or environments to prototype algorithms, and sometimes then have to port to static languages for performance reasons - writing better code generators is rarely cost effective, as the dynamic languages are sufficiently good at the general cases, if you need to optimise you're optimising for a specific requirement. The same happens (albeit more rarely) where you need to add inline assembler code to C or Fortran where you need to go below their abstraction level (CAS or bitwise rotation, integer overflow etc).

Also in large systems you often interface with extant libraries, which may be implemented in diverse languages. You're not going to port half a million lines of a numerical library into Smalltalk so you can call it.

So you end up with interfaces between the user code and the library, and having to specify in a machine processable manner the types at that interface, as you're now below the abstraction barrier of the dynamic language.

Other dynamic languages require definitions for such interfaces, though foreign ones may not be visible in the language's introspection facilities - for example, Mozilla's JavaScript uses IDL for the definition, but the type and arity of functions isn't visible via introspection.

This introduces cost and brittleness to the system; and in the types of systems I've most experience with it also makes it harder to move a prototype algorithm between the dynamic and library codebases than it feels it ought to be - the dynamic runtime usually has the facilities for doing most of the code generation work already, but there's no means of accessing the results and tweaking it. I am looking for mechanisms which make that sort of task easier, and which don't require custom interpreters for DSLs, which are a pain to get right or maintain.

>"What happens in Smalltalk projects where you grow larger? Where you want to divide a system up into components, and some of those components don't exist yet?"
> You have a need; you have an idea to meet the need; you try it; you grow it. I guess I don't understand the question.

In my experience, that only happens up to a certain size. My question was about large systems - systems where you have to subcontract components, either to another individual in your team, to a second local team or to a separate organisation. Bottom up approaches don't scale out to that level, top down approaches introduce brittleness and (code production) inefficiency. The Smalltalk browser appears to be strongly a bottom up tool; I'm curious as to whether it allows that approach to scale out better.

> Semi-formal notations, tests and other examples, documentation, good tools -- these all contribute to success. But if you are looking for some kind of a "static declaration" as they saving mechanism for large systems -- it may work for you, but I doubt you can prove they are necessary and sufficient. I have counter-examples.

I do consider some form of specification of the domain and codomain of a function to be required, and the stronger the specification which can be automatically tested, the better. Verifying natural language specifications by hand is simply not viable for large software projects. What's provided by the type systems of most languages is not sufficient even for that - you can't say the domain of a square root function is the set of positive reals. Specifying pre- and post- conditions helps, but for anything involving concurrency you need 'for all' rather than the 'there exists' specifications given by unit tests. But most of the system doesn't require formal specification, and good design minimises shared state.

So I'm looking for something with the gains of dynamic languages without barriers to performance, and good modularisation. As such, I'm more interested in gradual typing than Smalltalk. But if it does have anything to offer then I'll borrow from it.

> "Can such interfaces be browsed in the same way if the code can't be called?"
> Your question betrays your bias. "There must be some such mechanism as the one I am used to," he thinks.
Yes, I have a bias based on my experience and problem domains.

One project I was involved in was refreshing the library used for airworthiness stress analysis, with a view to improving the performance (it was costing around half a million pounds in CPU time to run the analysis for each design revision, so in the company at that time, making a 10% improvement in the run time would mean someone didn't have to get made redundant). This consisted of a set of Fortran libraries, some OS scripts, and a domain specific scripting language which encoded the models and analysis algorithms. A set of flight states was mapped over the model, and the outputs reduced with a filter to indicate any resonances. The core of the system was mostly unchanged since the late 1970s, only the models used being updated with successive revisions to the aircraft.

There was a large amount of model and library code, so I wrote a quick and dirty call browser to explore data flows in the parts of the system I had access to. Since the code was significantly larger than the memory capacity of the computer I was running the analyses on, let alone expanding that to the AST and call graph, the interactive image and execution based querying as presented in the Smalltalk browser post would not have been suitable. The project ended up going nowhere, as translating the implementation of non-standard dynamic language from the mainframe to PC would take more resources than the team had in the time available.

If the scripting was in a standardised dynamic language, then the maintenance cost would have been much lower, and the chances are the code could have been moved from the costly mainframe to instead run overnight on PCs.

So is there any way such tools can extend for a system of a thousand executables, rather than one? A system which doesn't have everything visible? The types are somewhat irrelevant to this, but most programs people are developing are not whole; the question is of scaling approaches which assume you have a single image which is everything you care about, and all changes to the system are monotonic. If you have a system with multiple teams working on it, if you are developing a library which may be used in a number of ways, or if you have a system which is so big you can't load it without paging, then I'm interested in what happens, since anything you deduce from calls to code you can't inspect is not monotonic - some other use of the function may be a correct use and widen its domain. Partly it's an issue of extensional rather than intentional analysis - some systems you need to prove that the core is correct rather than show it is not incorrect in parts. But an awful lot of the code I work with doesn't need that level of confidence, and as much as possible I write that in dynamic languages. Sometimes I make that judgement wrongly, and so I'm looking for mechanisms which allow 'fixing' of parts of the system more easily, as well as allowing assertions to be made for compatibility between parts of a heterogeneous system.

> Most large systems developed with dynamic languages have not used the kind of mechanism you seek.
I've seen such mechanisms in JavaScript, Lisp, Erlang and Python. Maybe Smalltalk just doesn't play well with others, and is only suitable to monolithic applications.

Labels: ,

2007-11-01

One more wide-finder

Since previous efforts didn't print the results summary, they weren't really complete, so here are two more variants using a vector (tbrayA.cpp) and a priority queue (tbrayB.cpp) to get the top 10 entries.

There's a pthread parallel version of the B variant (tbrayB_parallel.cpp). The main thread reads chunks, scans back to a line break, and hands off the chunk to a pool of worker objects, which run the matching in their own thread. Once all the data is read, the workers' counts are summed and the top 10 counts found, then the results printed as in the single threaded version. Each worker's data buffer is written only by the main thread, for the most part in the call to the OS' read function; this helps reduce the amount of copying of data - though whether the T5120 would have to copy anyway I don't know (some CPUs share cache between cores, others don't, those that don't may have to copy the buffer from one cache to the other when the matching thread starts processing).

The code hasn't been built on anything other than linux this time, so I can't guarantee it will run on Solaris. There's also a deadlock on the scheduler if you only use 2 workers (you need at least two so the buffers are only being written or read by one thread at a time), which I have to track down.

Obviously, the multi-threaded version requires quite a lot of infrastructure - it's twice the size of the single-threaded version. It also runs significantly slower than the single-threaded version:

fortinbras:$ time bin/tbrayB_parallel datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real 0m1.358s
user 0m0.372s
sys 0m0.232s
fortinbras:$ time bin/tbrayB datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real 0m0.531s
user 0m0.232s
sys 0m0.156s
Commenting out the call to process_chunk(), so it just does the fork and join thread communications:
fortinbras:$ time ./widefinder datasets/hundred-o10k.ap 3

real 0m0.288s
user 0m0.056s
sys 0m0.220s
So there is some additional cost in the parallelism other than just the inter-thread messaging (372 > 232 + 56).

There's an archive of the code here; unpack it, run make and call ./widefinder with the path of the logfile and the number of worker threads. The number of worker threads is 3 minimum. The other examples compile to the ./bin directory from make tbray.

I'm still not overly convinced that it will be CPU bound if it has has a small, slow, miserable disk unless you are using a language whose virtual machine is excessively baggy; I've been careful not to use any optimisation which is specific to the data, rather than what a compiler for a higher-level language would know, given suitable annotations for parallelism.

The latest erlang effort from Pichi is twice as fast as the original ruby; ruby takes 2.580s CPU time on my machine, so the single-threaded C++ version is 11 times faster than ruby on one core, the parallel one 7 times. So once you have half-a-dozen cores then the erlang should beat the single threaded C++ code. Though I'd like to see whether on the disk on the T5120 means it is CPU limited or IO limited with the faster code, and I'm unconvinced that the T5120 is what future will look like - I'd expect more cache and flash on chip, rather than more cores, and a multi-node rather than a multi-core future.


Pete

Labels: , , ,

2007-10-26

Garbage

I've been thinking a bit about garbage collection and stack allocation, and both Robert O'Callahan and phaeron at virtualdub seem to have been.

Chicken scheme seems to have something interesting - a single stack which is collected on exhaustion (rather than C# or Java VMs which have a traditional program stack and a stack based allocator), so I'm thinking about implementing a tail-call-elimination variant of the single stack. That requires escape analysis, which may be costly - in its simplest form, it requires a flag to indicate whether any reference was made to anything on the stack above the point the tail call returns to. Either you can be conservative and set the flag on all function calls where a reference to a stack object is passed as an argument, or you can use a single assignment model and test whether anything was allocated. Making the core single-assignment would possibly complicate supporting other languages, and means you trade space in one condition - consing to hold the assigned value - for space in another - being able to recover some of the stack early in a tail call.

Having just the one stack means that, on x64 at least, you can use the one stack register.

Hmm. These things are all rather hard to do right, and don't have simple 40 line benchmarks.

See also a simple worked example.


Pete

Labels: , ,

2007-10-18

I hate everybody

Well, every programming language I've ever used anyway.

Not a lot to report, I've faded off of doing the wide-finder thing as the MPI ruby code was about as good as I could expect in elegance, and I'm more interested in other stuff at the moment. It did get me annoyed with C++ again, in that its primary mechanisms for specialisation - template metaprogramming and single dispatch polymorphism - aren't quite a good fit for the problems I want to solve, - the boost::regex version was much slower, and shouldn't be. I want to have good enough performance and clarity and open objects when developing and strongly typed interfaces when building libraries and extensible grammar and access to the low-level machine when bit-twiddling (tbray9.cpp uses a better 8-wide '\n' scan, adapted off this zero-in-word hack; parallelising it with pthreads or MPI is a left to the reader.) I don't want to have to use unhygienic macros; and this discussion of units in measures in Java's verbose, strict, non-expressive generics made me think about kin, my pet lisp+prolog+strong types language again.

So most of this week I've either been thinking how to make fast VList based stack work for a fork+join worker model, or playing with a higher level language which looks a bit like python with strongly type annotations - I prefer infix syntax, Dylan has always been a bit wierd looking to me, and I don't want JavaScript's C-like braces. Python got a bit interesting when it looked like it was drifting towards lisp-land, but then drifted away again.

One thing that struck reading the dylan manual (it's always worth looking at how other languages do things) was its convention that
    bar(foo)
and
    foo.bar
were the same thing. If that was combined with a strong form of memoization such that writing
    bar(foo) := baz
guaranteed that in future
    bar(foo) == baz
you would get the ability to define dynamic fields in objects, similar to what JavaScript and other associative object based languages offer, as they could be written in the dot syntax as:
    foo.bar := baz
------------------
foo.bar == baz
Assuming type specialisation on the function trace path is also cached, lookup becomes a single hashtable access. You do lose locality, but if you want performance you'd have to add it as a field on the object's class, though I hate that - language features shouldn't be used for indirect effects; type safety and performance optimisation should be orthogonal, and lack of locality saps performance (though it depends on context - if you're iterating over lots of objects, accessing the same field, the locality could be better than if you had a conventional C++ (different fields in the same object local, different objects anywhere) or JavaScript (different fields in the same object possibly local, using a hashtable, or similar).

As an aside, recursively rewriting the slot accessor form
    x.f.g
== (x.f).g
== f(x).g
== g(f(x))
Which, although apparently missing from the DRM, is the opposite of the normal math interpretation of (g o f)(x) = g(f(x)). Pity. Now I also hate Dylan.

My more recent ramblings about kin are on kintab.net; I've thought a bit about what I'd need to do units and measures nicely, and have been going through SCIP coding up what the examples would look like in infix notation, which is a useful test for conciseness and clarity.


Pete

Labels: , , ,

2007-10-05

A few more thoughts on concurrency and high-level languages

I little while ago, I started playing with a pattern-matching and relational language for meta-programming called 'kin' (it's got relations, and does pattern, so kin/kinship).

Being a machine elf rather than a language lawyer, I never could come up with a decent user syntax for it; I've toyed with making it a lisp (either L1 like scheme or L2 like common lisp; I wrote something that used lisp s-exprs to manipulate JVM class files, but never got further, and then 1.5 came out and increased the verbosity of the type system without increasing its power significantly, and I moved on), with making it a prolog derivative (I rewrote the core of a Java WAM engine so it ran a few times faster, but never got anywhere much with that either, apart from looking at it and thinking, 'yep, it's faster now, so what do I do with prolog?'); then I thought about a faster JavaScript engine, but Tamarin's handling that well enough for most use-cases, and I want better type inference rather than having to tell the compiler everything. Tamarin also makes the mistake of relying on programmer specified type to give performance; usually the type can be inferred if you have the whole system available, and even if not, the Self and StrongTalk VMs show you don't need explicit type to help the compiler. Type systems should help programmers, not compilers. But the JavaScript story doesn't really matter - the XULRunner front end I spend half my time on at work spends 90% of its time in Mozilla's SVG rendering code rather than my JavaScript anyway. I write my best software when there are users asking me to, and when it's also a problem which interests me; without users I get bored too quickly, without interest I don't give more than the users expect.

I've spent a few hours poking around Rubinius code; they appear to be making a ruby interpreter very like lisp interpreters were a couple of decades ago. I don't program ruby; I do heavy math backends and light front ends (with occasional forays into OpenGL), and never touch RDBMS, so use C++, Fortran and Java backends, C++, Java, XSLT middleware, C++, Java, JavaScript front ends (depending what level of 3D visuals they require; I haven't had cause to look at canvas3D yet so the OpenGL work has always been in C or Java, though given the current Gecko SVG performance I've gone off basing a front end on Mozilla code) - though I'd quite happily never use Java for a UI again (last time I got RSI from all those anonymous inner classes). I've been known to use lisp to prototype math code before coding it up it C++ for speed (though that's mainly been due to working for BAE at the time, who take 6 months to buy software for a 3 month project, or due to other business constraints on the implementation language and isn't a reflection on professionally produced lisp compilers, which are fast enough not to require porting to C++).

An old housemate and good friend of mine did his PhD in an embedded Erlang system; that's as close as I came to it before a year or so ago, when I started using XMPP at work and tried installing ejabberd. It didn't work, and I gave up after a few tries, so I tested our server's federation code against Wildfire instead. That's nothing to do with Erlang; ejabberd appears to scale well in independent tests, and I suspect the Windows installer wasn't a priority (our customers require Windows). I've read comments that there are smart people who have written good software in Erlang. This is true. Smart people also write good software in BASIC or COBOL or ADA; that's because smart people write good software. Writing a language that smart people can write good software in is easy; writing a language that average people write non-crap software in is far harder. For my opinion, for this sort of problem, ruby seems such a language (well, it seems much better than Perl or PHP, and a bit better than python), but the implementation is pants. Conversely, ejabberd scales to many more users than Twitter can, and I suspect is a cleaner expression of the problem, because IM's a concurrency problem and erlang far exceeds ruby's capabilities at concurrency.

Here's some good advice on how a high level language should express a problem in something I started reading earlier:
"The structure of the program should exactly follow the structure of the problem. Each real world concurrent activity should be mapped onto exactly one concurrent process in our programming language. If there is a 1:1 mapping of the problem onto the program we say that the program is isomorphic to the problem. It is extremely important that the mapping is exactly 1:1." -- Joe Armstrong (PhD thesis, Dec 2003).

You have a log file. You scan each line (in any order) for a hit and accumulate the number of times each hit occurs. Then you list the hits in decreasing order of the accumulated hit counts.

I don't see any concurrency intrinsic in the problem. The only reason we're even thinking about concurrency is that we want the implementation of the solution to the problem to be parallelised over many cores for performance reasons, and (as SIMD isn't sufficiently general to scale much beyond 4 times wider paths except in GPUs), many concurrent cores is the only thing we can think that will make computers have more throughput. For this problem, concurrency is as much an accident as memory management. Languages which handle memory management automatically have already eclipsed those which require manual memory management in most areas where it isn't an essential part of the problem. As it's not an essential part of the problem, no language where the concurrency is manifest in the implementation of the solution will be following the structure of the problem. Admittedly, I grew up on BASIC and 6502 assembler in my teens, and if you teach kids pi-calculus in high-school the next generation might see the problem differently.

Neo: What are you trying to tell me? That I can dodge bullets?
Morpheus: No, Neo. I'm trying to tell you that when you're ready, you won't have to.


With Erlang you can express a problem in terms of concurrency.
It may be a long time before ruby's ready.


Pete

Labels: , , ,

POITROAI

For this particular problem, the principles are more important that the exact results, but I'm happier with tbray7.cpp which gives the same results as the regex in the original ruby code:
fortinbras:$ time bin/tbray7 datasets/hundred-o10k.ap 
matches: 95400

real 0m0.439s
user 0m0.284s
sys 0m0.144s
The file was in cache for that one so ignore the real time; the important thing is that the user time wasn't adversely effected by getting the 'right' answer, instead of 1100 matches.


Pete

Labels: , ,

2007-10-04

Of course, it helps if you tell gcc to generate a 64 bit executable

... when you're using 64 bit operations:
tercel-2:$ time bin32/tbray4 datasets/hundred-o10k.ap 
user 0m5.644s
tercel-2:$ time bin32/tbray6 datasets/hundred-o10k.ap
user 0m4.048s
tercel-2:$ time bin/tbray4 datasets/hundred-o10k.ap
user 0m5.725s
tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap
user 0m3.839s
I'd assumed it would default as it does on linux, but no. Which also explains why I was confused about sizeof(size_t) being less than sizeof(long long) here.

It's still around 8 clock cycles per byte - the 5% gain is not enough to alter yesterday's estimates, which is a little encouraging actually, as I wasn't sure whether or not 32 bit Sparc instruction generators, such as gnu lightning (used in rubinius) or whatever erlang's hipe uses would cause a significant slow down.


Pete

Labels: , , ,

2007-10-02

Some fingers in the air.

I got Steve Vinoski's 2007/09/29 erlang code, installed hipe and the bfile module, and it ran on the laptop:
fortinbras:$ cat ../datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time erl -smp -noshell -run tbray5 main 512 ../datasets/hundred-o10k.ap
110100 matches found

user 1m23.649s
real 1m33.683s
sys 0m1.620s
I'm not sure looking at either mine or Steve's code where the 1101th match comes from - there are #ifdefs in my line splitting code to print the lines, and if you run diff that output with the input it's the same, and if it was related to the last line in the sample it would give a difference of 1 not 100 for the hundred-times repeated file. But something's inconsistent between the two somewhere, and also with the original ruby code which gives 954 matches for the o10k.ap, or 1097 for the regex %r{GET /ongoing/When/([^ .]+) }.

From the gnome-panel dials, the erlang isn't IO bound in total on this machine - for the first few seconds it is running at max IO and 90% CPU, then for the remainder 99% CPU and does zero IO, so it's reading it all into memory then spawning processes to scan it. I'll leave it running overnight on the 10 million line file to how it fares when it can't fit the file into memory, though I doubt that has much of a bearing on what would happen on the T2 box, as that has plenty of RAM.

[Update: It took 52 minutes 26, and was doing disk IO throughout, but that could well have been paging rather than the IO it has to. Nothing to conclude, other than that it doesn't scale linearly - 10 times bigger file takes 34 times longer.]

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real 0m8.622s
user 0m0.284s
sys 0m0.248s

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time ~/projects/quad-store/bin/read_bench datasets/hundred-o10k.ap > /dev/null
real 0m8.754s
user 0m0.000s
sys 0m0.180s
So the 64 bit-wide matcher on the AMD64 laptop is very IO bound; the difference between the total time of just doing the IO and scannning the lines is negligible.

At 2200 MHz, it's scanning 201 MB data in 0.284s, which is 2200*0.284/201 = 3 cycles per byte processed.

Sun's Thumper gives 2GB/s transfer into memory; the T2 runs at 1.4 GHz and assuming the same IO rate, we have 0.7 clock cycles per byte delivered to play with. 64 hardware threads would give about 44.

Running read_bench on one of the 'ancient creaking' Netras, gives:
tercel-2:$ time bin/read_bench datasets/thousand-o10k.ap 
real 1m15.449s
user 0m0.362s
sys 0m27.998s
That's IO bound, 2,095 MB/75.5s = 27 MB/s. The laptop gets a similar transfer rate figure on the same test.

The T2 is 200 times the CPU, and the Thumper 80 times the disk transfer rate; I'll assume that the T2 system's IO is comparable to the Thumper.

tercel-2:$ time bin/read_bench datasets/hundred-o10k.ap 
real 0m6.159s
user 0m0.058s
sys 0m2.834s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap 
matches: 110000
real 0m7.617s
user 0m4.054s
sys 0m2.504s
At 440 MHz, it's scanning 201 MB in 4s, which is 440*4/201 = 8 cycles per byte processed; it's IO dominated but not IO bound. Optimising the matching code would at best give a 20% improvement.

It's also using about 5 cycles per byte system time for the IO.

Since the Netra runs Solaris 10 and is a Sparc variant, and in the absence of better information, I'm working on the assumption that the 1.4GHz T2 would be close to it in terms of clock cycles per work unit, so to process the log files it would also need around 13 cycles per byte delivered. So either we'd need 13/0.7 = 19 threads, assuming each thread provisions another 0.7 cycles worth of work, or run 64 threads and can afford code that's three times slower than the single threaded variant due to concurrency management. If it's closer to the fat-piped AMD64, then only 6 cycles per byte and 8 threads would do.

Guessing even more widely, with a large dataset and large grain concurrency, the C++ matching code should be at least three times faster than the IO - using 8 to 20 of the 64 threads. The erlang interpreter seems to be around 10 times slower , so would require 80 to 200 threads to balance the transfer rate [correction: only the CPU time is relevent here, so it's 83.649+1.620s vs 0.284+0.248s, which is 160 times slower, so would require 1280 threads to get the same throughput]. If the IO rate is less than 800MB/s and the erlang scales across all 64 cores then the faster matching in the C++ won't give any advantage. I've no idea how good erlang is across cores - the only times I've seen it praised is many (conceptual) processes on small numbers of cores - it's a concurrent language rather than a parallel one. Inter-core communication doesn't come free, and any message still need either lock based synchronisation or CAS/HTM retries; as it doesn't matter which actor works on the data, in a C++ implementation you'd use a fail fast lock and try with a different actor rather than it blocking, so going lock-free would cost more than it would gain. AFAIK you don't have that option in erlang, and are stuck with the given queuing and scheduling mechanisms, though you probably could subvert them.

In computer science there are only three numbers - zero, one and many. If my estimates are anything to go on (and I wouldn't bet more than half a penny on any of them), on the target system you need many threads to solve this problem most efficiently.

Looking at the code for the bfile library, it shouldn't be too hard to move the 64bit wide string match into an erlang library, which seems more fun than fiddling with MPI to get the required thread count, but the required thread count is far fewer than the number of processes in Steve's code, so MPI or pthreads should be a better model than large scale concurrency. But erlang may be to concurrency as lisp is to dynamism - I like lisp, but I haven't found been any interesting problems which I could apply it commercially under the constraints of the business, and even though I end up greenspunning in most systems I write, the subset of lisp-like features which gets implemented to give dynamic behaviour is tailored to the problem, and the rest of the code can be performance optimised more easily.

On the other hand, my fiancée comes back into the country on Saturday and I'm supposed to have finished sending out our wedding invitations by then, and all this isn't really helping that to happen.


Pete

Labels: , , ,

2007-09-30

Data-wide wide-finder

Running a test with just breaking into lines without the matching, compared to the non-MPI version with matching runs on the 100x file in 380ms user time rather than 720ms. So there is work in both parts of the string processing.

Parallelising the scan for newlines to work on 8 chars at a time gives a reduction to around 220ms.

Restoring the matching, and parallelising the scan for the first character makes the CPU run at 800MHz.

A quick google search tells me that sudo dpkg-reconfigure gnome-applets enables me to use the gnome panel applet to lock the frequency at 2200MHz, so with that set, I can compare results between the linear and SIMD parallel forms. Typical results are around 400ms better for the parallelised version on the laptop:
fortinbras:$ time bin/tbray4 datasets/hundred-o10k.ap
matches: 110000
real 0m10.420s
user 0m0.688s
sys 0m0.268s

fortinbras:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real 0m8.622s
user 0m0.284s
sys 0m0.248s
This makes user+sys time go from 950ms to 532ms; if this process was CPU limited, that's as good as you could possibly get from running two cores.

Sorting out a few machine word dependencies so it runs on one of the Netra Solaris boxes (Linux/Solaris code is at tbray6.cpp; I was assuming size_t would be 64 bit on all 64 OSes), the results are:
tercel-2:$ time bin/tbray4 datasets/hundred-o10k.ap
matches: 110000

real 0m7.593s
user 0m5.638s
sys 0m1.894s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000

real 0m5.915s
user 0m4.027s
sys 0m1.855s
Here we get a much more significant speed up - about 25% faster overall time, due to a different balance between CPU and disk.

These simple benchmarks indicate a few things to me:

Not all gains from parallelism require concurrent processes - more transistors can mean wider data paths as well as concurrent cores. If I were using the architecture specific SIMD registers on the AMD64, then the paths would be twice as wide again.

A slower CPU with a 'server class' disk outperforms a faster CPU with a 'consumer class' disk in this test, but shows up potential to parallelise the code.

I like playing with bit-twiddly code on the weekends. I don't like having to debug bit-twiddly code in mission critical applications for my customers to a tight deadline. The optimisations should be in the regex library, rather than complicating user code - for example, this code avoids reading past the end of the buffer only because it has a pattern longer than the word size.

I'm also wondering whether some kind of hash based wide-scan can detect two or more characters of a pattern well enough to give an improvement, rather than just a wide-scan for first one - there's probably only a couple of bits of information per character in those log files.


Pete

ETA:
fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/tbray6.cpp | wc -l
85

It's hairy but it's not huge.

Labels: , , , ,

2007-09-29

Wide finder, parallelism and languages

Tim Bray, who knows about web search and was part of the XML specification process, is experimenting with exploiting parallelism in his wide finder project.

I'm interested in parallelism (running similar processes on multiple hardware) and concurrency (having multiple collaborating threads of control in a system), as many interesting problems are intrinsically concurrent, and as hardware isn't getting much faster but is getting wider, so will provide more parallel processing bandwidth.

The original problem is one of extracting the N most popular pages (not images) fetched from Tim's ongoing blog from the server's log files. These files are a few gigabytes in size.

Tim's ruby code runs in 13.5 seconds on a 1.67Ghz PowerBook, which is twin core G4. My use of my twin core 1.42 GHz G4 desktop petered out last year when I got a single core AMD64 laptop, as most of the time the laptop was more responsive. The laptop runs Ubuntu 7.04 64 bit.

Measuring the problem:

As it's not a lot of ruby code, it's easy to write a little C++ code to find out where the performance issues are.

First off I was playing with simple string matching (tbray1.cpp) vs KMP (tbray2.cpp), but as it's a requirement that the code should be as simple as possible, and KMP doesn't actually help modern CPUs as it's a jumpy algorithm, the third, simpler approach just calling strncmp works as well as far as using time to measure can determine (tbray3.cpp).
fortinbras:$ time bin/tbray3 datasets/original/thousand-o10k.ap
matches: 1100000

real 2m28.201s
user 0m10.193s
sys 0m4.752s
At that size of dataset, the laptop's CPU didn't get above 800MHz (it has 800MHz, 1600MHz, and 2200MHz speeds and stays at 800MHz). Smaller datasets which are already in the disk read-ahead buffer do cause its CPU to shift up; hence some of Steve Vinoski's figures for erlang times are not indicative of the problem - you don't run the same set of data through the system again and again, so you have to either load something else into the file cache to clear it (cat something-big > /dev/null), or use full-size datasets.

Counting non-comment lines which aren't a single { or }:
fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray3.cpp | wc -l
48
So a simple, sequential C++ implementation using a memory mapped file is 48 lines of code (my bracing formatting habits shouldn't count against the language), and isn't too slow.

Running the ruby for reference, which does a bit more in terms of counting occurances:
fortinbras:$ time ruby src/benchmarks/finder.rb datasets/original/thousand-o10k.ap
...
real 1m20.744s
user 0m24.710s
sys 0m4.420s
Presumably Tim's disk speed to CPU speed ratio is higher; on my laptop ruby's IO bound, and the CPU not maxxed, though it does shift to a faster speed. Fortinbras processes the same size dataset that Tim was using in 8.6 seconds.

But the ruby is doing IO much faster than my simple memory mapped code, so changing to use block IO rather than memory mapped (at the cost of a 25% longer program):
fortinbras:$ time bin/tbray4 datasets/original/thousand-o10k.ap
matches: 1100000

real 1m6.780s
user 0m9.593s
sys 0m2.464s

fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray4.cpp | wc -l
60
Again, the C++ implementation doesn't stress the CPU enough to get out of first gear, and it's about 14 seconds faster than the ruby, due entirely to less user CPU. I'd guess that the underlying C code the ruby interpreter calls for its line reading is similar; I don't know whether ruby strings are internally UTF-16 rather than UTF-8; if so that alone would account for CPU cost. I'm actually quite impressed that ruby isn't glacially slow, but I guess most of the work is between the lines of the program.

The C++ code also fewer lines of code than Steve Vinoski's 84 line erlang example, though the C++ doesn't show use of multi-core parallel processing. Parallel code in C++ can take more work. Concurrent code in C++ definately takes more work than in erlang.

Given an infinitely capable CPU and the same disk, ruby's 81 seconds or C++'s 67 will reduce to 55. If the CPU is using 12 seconds, to get anything from parallel CPUs, you'd need to get the data from disk in less than 12 seconds, which is probably not far off what current flash is capable of. I don't believe there's a software solution to make the IO much faster on the laptop's hardware.

Running on a Sun Netra T1 105, with a 440MHz Sparc CPU:
tercel-2:~/projects/tbray$ time bin/tbray4 datasets/thousand-o10k.ap
matches: 1100000

real 1m23.930s
user 0m57.070s
sys 0m24.636s
Much more CPU time, but the total real time is in the same ball park - the ten year old server has a fast wide SCSI disk but a slower CPU, so is close to being CPU bound.

If you have multiple cores and multiple disks to read from, you can launch multiple batch scripts to process different days' logs in parallel, like make -n or using MPI Scatter/Reduce.

MPI version

There's a simple extension of tbrayN.cpp with MPI at tbray5.cpp. I'm an MPI novice - most of the threading problems I've had to solve require concurrency rather than parallel processing. MPI defaults to use as many processes as there are cores, but you can force it to use a specific number of processes. On a single core machine, it slows from 1.2 seconds on a million line file with a single process to 14 seconds with 16 processes, to 54 seconds with 64 processes. Trying to launch 1000 processes causes the machine to spend forever context switching. Running lots of processes without context switching is what erlang is good at, but for optimal CPU throughput you only want as many processes as you have cores.

The MPI version loses in terms of conciseness, and has two separate code paths - the process to read the file into chunks, and those to scan the chunks. It's not as nice to read as the erlang. It comes in at 92 lines of code, 6 of which are error reporting that the erlang example lacks (presumably the error is passed to read-eval-print-loop to handle), so is within a couple of lines. Using pthreads would probably require more lines, as it lacks suitable message passing primitives.

Parallel languages, concurrent languages

IO aside, the actual processing seems to be a problem in parallelism - the task can be split up into work packets which can be processed independently, with a reduction at the end to a single result set. Only about one in nine lines contribute to the counts, so the dataset for the reduction is significantly smaller than the input dataset, so parallelising the matching and maybe one pass of sorting and counting should allow a speed-up on a multi-core machine.

Languages such as Sun's Fortress have intrinsically parallel constructs and atomic blocks to hide concurrency issues from the developer, and take advantage of multi-core hardware. Fortress has built in parallel for and reduction, equivalent to MPI code but without the programmer having to explicitly manage the processes.

In Fortress, the code for wide finder should be no more compicated than the ruby code; it's up to the implementation to parallelise the loop. Unfortunately, the current Fortress implementation is an interpreter on top of the JVM, and isn't speed optimised - it takes a couple of seconds to parse the scripts, and then interprets them rather than generating byte-codes.

Actor model languages such as Erlang, Alef, and Scala, are used where the problem is best expressed in terms of concurrent processes. Their implementation is designed to allow many, many concurrent processes on a finite number of hardware cores - they reduce requirements for locking, have strategies to mitigate blocking operations stalling other actor's execution, and solve many issues that OS using level threads for concurrent, communicating, mobile, robust processes exposes. I've written actor based code professionally where the problem fits that model.

The wide finder problem has no inter-process communication requirement until the workers terminate, it doesn't require more processes than there are CPUs, it doesn't write anything so doesn't suffer from multiple writer threads and locking issues, and it doesn't require much execution to continue if one worker blocks - the blocking will be due to IO, and it can't do anything anyway.

Erlang and Scala are optimised for the opposite problem to the wide finder problem - concurrency within a core rather than parallelism across cores. A better language choice would be one such as Fortress, which gives parallelism across cores without having to express the problem in anything other than terms of the operations on the data.

Conclusion

This problem shows up one of the reasons I'm working on compression for the quad-store - any data mining using spinning disks is IO bound; using compression is one way to shift work load off the IO pipe and onto the CPU. It's also why I spend a few tenners on a second fast disk for each of the Netras in the cluster, rather than a couple of thousand on a new multi-core box. I can't afford 30 dollars/giga-byte for a solid state wide-drive.

It's also an interesting comparison of compiled and dynamic languages - running ruby is CPU heavy, as heavy as running on the 10 year old hardware in my attic, but is allegedly easier to program in. I think I'd prefer it to Perl if I was to write text extraction and reporting scripts, but hate nearly all programming languages anyway once I have to use them. Some are more useful for certain problem classes.

Some of the discussion has been about approaches towards splitting the log file into chunks, and splitting it into lines.

None of the hits in the sample file require you to split it into lines - the pattern doesn't match anything it shouldn't if you ignore line breaks in the code. To split the file into lines you need to inspect every character; to match the pattern you may only need to inspect one in 18 (if it's not in "GET /ongoing/When/" you can then skip 18 characters and look if that's in the pattern, which is the basis of KMP); if the pattern matching can be done better than the line scanning, then that should give a boost on the CPU limited machine. It doesn't give a boost on the Netra, and I'm not sure why, possibly because KMP makes branch prediction worse.

All the code posted has been exact solutions. If you split a billion line file into 100 equal chunks for 100 cores to process, as you might using the built-in MPI Scatter function, you'll lose maybe 11 hits that get cut in half (as only one in nine requests matches the pattern), and you probably won't care - the most popular pages will have thousands of hits, so a dozen here or there won't be missed. Similarly, if you only want a 90% confidence for the most popular hits, you may only need to sample some of the log file. Whether a approximate solution is good enough is the sort of question you need to ask a customer before optimising for the exact case; it may well simplify the solution.


Pete

Labels: , , , , , ,

2007-09-07

Thinking more about compression

I'm using compression on the URIs and literals in the quad-store to trade CPU for disk/memory bandwidth. The current algorithm is LZW, with a 24bit index look up table. Yesterday's benchmarks indicate it's not doing massively good compression - ; on the other hand it's small and I can see how to preform regular expression searches on the compressed data. 24bit index tables take 64MiB and give around 1:5 compression - reduces to 23%. You can't reset the table like you do in a streaming compressor, since you want the mapping of compressed bits . Bigger tables take more room both - more entries, and the entries become larger as the number of bits to reference another entry increases. If you had 25bit indexes, you'd need 5 extra bytes per entry, so take 154MiB rather than 128MiB and possibly suffer due to cache line mis-alignment.

Since each value in the quad is reduced to a 32 bit identifier in the store, and at least one bit is needed to indicate whether it's fully described by the identifier (uncompressing the identifier's bit-pattern gives the URL), or needs a look-up to longer bit pattern. Obviously, the more URLs which can be squished into 32bit patterns the faster the store will be written to, as it removes the need to look elsewhere to match the string pattern of the value with one already in a hash table to check that the same identifier is only ever used for the same string. But 24+1 bits leaves 7 unused bits of space in half the possible identifiers, which reduces the maximum identifier density to (128+1)/256 = 50.3%, which isn't very good.

Looking at the benchmarks here, there seem to be much better methods than LZW, but most of them are neural net or model based, and I can't see how to map a sequence of hints for a neural net into anything that could have regex implemented on top of it without decompressing. They also are orders of magnitude heavier on the cpu, apparently enough to fade the total performance.

The other options are to not bother with the intrinsic identifiers - which means finding out how many of the URLs would fit in the LZW table for typical datasets (certain of the computer generated ones they would), finding other uses for the spare bits such as 11xx xxxx xxxx xxxx xxxx xxxx xyyy yyyy indicating a 23 bit LZW index and a 7 bit character. Maybe the simplest way would be to have aaaa aaaa xxxx xxxx xxxx xxxx xxxx xxxx where any non-zero in the a's indicates it's a hashkey, so you get full identifier use at the cost of more hashtable lookups.

Holding the full hash in memory isn't an option; even just a population bit would take half a gig. Since each node is single-threaded, it's possible to split URLs from literals and not care whether literals match outside of a smaller MRU table cache; that would increase search times for equality of literals with non-intrinsic ids, but not effect regex searches. It may also be worth maintaining separate LZW tables for literals and URIs. Eliminating duplications can be done in the background anyway.

I think the next experiment will be scanning the datasets I've got - swoogle's ten million triples, wikipedia in a couple of formats (dbpedia and the xml dump), and the uba owl generator - to see what hit rate and spread I'd get generating ids using LSW and hashes.


TME

Labels: , ,

First compile and run on Solaris 10.

I've got first run of one of my quad-store benchmarks running on my Solaris 10 boxes.

I needed to make the machine architecture and a flag whether or not to use SSE2 environment variables, and add some extra #defines so none of the SSE stuff was included. I needed a macro to fake posix_memalign, though will think about using another malloc-based version. Compiling on Solaris also generates warnings for automatic promotion of -1 to an unsigned int so I added explicit casts (-1 is returned by some functions where 0 is a valid return and something is needed to signal an invalid value).

Running the compression benchmark, the sparc (440MHz) doesn't seem to get cpu bound for small files (100MiB):

tercel-1$ uname -a
SunOS tercel-1 5.10 Generic_118833-33 sun4u sparc SUNW,UltraSPARC-IIi-cEngine

tercel-1$ time bin/lzw_bench datasets/original/triples.sql 100
...
read: 100 MiB, compressed to: 21 MiB.

real 0m23.094s
user 0m20.973s
sys 0m1.715s

tercel-1$ time bin/lzw_bench datasets/original/triples.sql 1000
...
read: 1000 MiB, compressed to: 301 MiB.

real 3m55.124s
user 3m39.377s
sys 0m12.550s

fortinbras$ uname -a
Linux fortinbras 2.6.20-16-generic #2 SMP Thu Aug 30 23:16:15 UTC 2007 x86_64 GNU/Linux
fortinbras$ time bin/lzw_bench datasets/original/triples.sql 100
...
read: 100 MiB, compressed to: 21 MiB.

real 0m8.638s
user 0m5.784s
sys 0m0.296s

fortinbras$ time bin/lzw_bench datasets/original/triples.sql 1000
...
read: 1000 MiB, compressed to: 301 MiB.

real 0m56.176s
user 0m43.667s
sys 0m1.948s

The processor in fortinbras is Mobile AMD Athlon(tm) 64 Processor 3400+ at 2200 MHz, the Netra at 440 MHz, and appears to be about a quarter as fast at a fifth of the clock speed.

On the other hand, running the Netra at full doesn't heat my lap up, I've three of them, they've more, faster disks available and I can play with making the quad-store distribute between them.

Whether they hold up as well against the benchmarks which use SSE to make the Athlon wider I'll have to find out.


TME

Labels: , ,

2007-09-06

Dynamic languages tools

Link: http://blogs.activestate.com/ericp/2007/01/kid_adding_a_ne.html

I got a CD with Komodo Edit on it at XTech, but haven't tried it until now. The recent open-sourcing of the code, and then this post appeared on Planet Mozilla.

It looks like it would be good for DSL development, and playing with it Komodo Edit seems to be very usable. My current editor of choice is SciTE, which Komodo Edit embeds, but Komodo Edit gives projects, and language aware autocompletion and navigation (go to the definition of a function). I haven't found out if it's possible to go to definitions in other files, which would be very useful - I spend quite a lot of time doing Ctrl-C,Ctrl-O,[filename],Ctrl-F,Ctrl-V,Enter to jump around a JavaScript project. It also does polyglot programming - a simple XUL file with a script element switches nicely from XML highlighting to JavaScript highlighting in the script element's content.

The user defined language tool is lexer + state machine based; I'd have preferred it to be a packrat based one, which I've played with before. But I don't want to have to write a whole IDE, I like using the Scintilla editor control, and Komodo's approach seems to get the job done.

As a side note, del.icio.us seems awfully slow these days. You'd have though that Yahoo would be able to give it a couple more servers. I wonder if the competitors are any better.

Labels: , , , ,

2007-04-26

Rich Application Platforms

Well, there's been a bit of a kerfuffle about whether or not Silverlight (nee WPF/E) supports binding, widgets and XMLHttpRequest gubbins. And Adobe has open sourced Flex, as well as having had its Apollo platform out for a bit now. XULRunner is still going, being used for SongBird and Joost, and freshly the DVD version of Wikipedia, though whether or not Mozilla is promoting it well as a platform is under discussion.

Having done an awful lot of XULRunner programming in the last few months, there's still a lot to be done before its an easy to use platform. Quite a lot of undocumented interactions cause crashes - particularly race conditions between XBL bindings being attached and scripts in the body of the document being run, and any manipulation of the tree in script during an XBL constructor. But I don't have any idea how well Silverlight or Apollo handle such things, or even if they support anything as flexible as XBL. There needs to be an XULRunner 2.0 that is a lot more predictable in its behaviour, and less arcane in its invocations.

There's also a big question for the sort of system's I'm interested in - can a platform do something like the ClusterBall WikiPedia visualisation? XULRunner SVG grinds to a halt at a thousand or so arcs, and rotated font anti-aliasing is a bit off. Flash, using Haxe, can render ten thousand arcs, but getting text to rotate is a hassle - it requires embedded fonts, and that means more tools, you can't just set the rotation property. There's quite a lot of weirdness in Flash, half of it is object view graph based, the rest procedural, not everything works and it doesn't play nicely with the browser, for example capturing keyboard input and not being part of selections. As far as I know, there isn't an open high-quality 2D graphics library other than anti-grain 2.4 which will scale and isn't, well, a bit weird. But anti-grain has moved to GPL, and isn't hardware accelerated at 2.4, so what will become of it I don't know. I'd really like a good, object oriented language, with a mix of dynamic and static typing, support for mesh tuple spaces, event queues, first class functions and relations, orthogonal persistence, which runs fast enough, and generates portable, high quality visualisations and information displays. None of the options are quite there yet, but XULRunner's JavaScript and SVG seem to be the most sane.


Pete

Labels: , , , ,

2007-04-20

More rdf bits

I've spent a little more time on quad-store, and found some more realistic performance figures from Franz Inc and the ESW wiki.

There's now a reasonably fast in-memory sorting (using radix 256), around 2.2s/10 million triples, which should help indexing small stores. I also have had a look at MPI for linking nodes together.

I think the premise may well hold - as the limiting factor for growth is IO and memory rates, not CPU, so trade CPU for compression and limit the number of values a store can