Random Lists

As kin has support for pattern matching and sequences at its heart, it's natural to formulate a random number generator as a generator of a sequence of random numbers.

For example, this is the same linear congruent shift register as used in java.util.Random:

type lcsr ( seed : uint64 ) : seq [ uint32 ]
def head : uint32 => uint32 ( self.seed >> 16 )
def tail : lcsr => lcsr ( ( self.seed * 0x0005_deec_e66d + 0x0000_0000_000b ) & 0xffff_ffff_ffff )

You can wrap it in an actor if you want the same behaviour as a random function with hidden state:

type @actor random_source ( sequence : seq[uint32] )
def next_uint32 ( self ) : uint32
let r = self.sequence.head
self become ( sequence = self.sequence.tail )
=> r

def next_real64 ( self ) : real64
=> real64 ( natural ( self.next_uint32() ) ) / real64 ( natural ( uint32.max ) )

But you couldn't optimise matching on the value reported by a random function like you can with a pure sequence:

let rnd_seq = lcsr ( 0x12345 )

match rnd_seq
0x00000001 :: 0xadfc5fc8 :: 0x22222222 :: 0x33333333 :: _ =>
out <- "match one" <- endl
0x00000001 :: 0xadfc5fc8 :: 0x9d1a5de6 :: 0xf6b685cf :: _ =>
out <- "match two" <- endl
_ =>
out <- "no match" <- endl

The cons'd patterns are compiled to combinations of calls to seq.head or seq.tail. The first two matches get combined as the first couple of terms in the pattern are the same; if seq.head or seq.tail were not pure functions, the second match case would fail as the hidden state would change.

Tim Bray describes using this technique is "a sideways dance" in Haskell. I've always found Haskell a bit too 'stiff' and like to 'feel the bits between my toes' too much for it, but compared to the amount of messing around Java has to do with atomic longs (JDK7) or synchronized methods (GNU classpath) in the implementation of the random source, and then as an Iterable<Integer> if you want a sequence of random numbers, it doesn't seem too bad. In the Java case, either the runtime has to deduce that the interlocks are not required and remove them, or it has to process them; if the code is used in more than one place you may have to have more than one machine code compiled form of the method, or fall back to conservative locking always. For a pure sequence, no interlocks are required. Wrapping a sequence in a actor ( or a lazy list in a monad in Haskell ) gives you hidden variable behaviour if you want it without polluting the code of the sequence generator with synchronisation primitives.

Labels: ,


Language-level Immutablity and Shared Memory Concurrency

I've been a fan of Bartosz Milewski’s Programming Café ( or possibly Bartosz Milewski’s Programming Caff; I'm not sure if lack of an accent implies greasy chips and tea outside of the UK ). He's exploring the options of applying the leverage of type checking to concurrency problems within shared-memory systems. In a different camp, there's the Erlang crowd, typified by Steve Vinoski which says that language-level immutability and message passing is the way to go. A third perspective, probably closer to what I would like to see, are CSP based approaches such as Limbo and stackless Python, which are shared nothing and use channels to pass messages by value. The channels and pass-by-value approach I'm experimenting with in kin is close to the Limbo model, except that channels are objects with support different messages. Apart from the D extensions by Bartosz Milewski, these techniques require the language integrates with a scheduler, usually a green threads system rather than the OS scheduler.

One thing which bothers me about Erlang and pure-functional languages in general is the assumption that language level immutability has some bearing on concurrency.

It's fairly obvious that any message shared by machines on a network will be unlikely to include memory addresses in one of the processes, other than possibly as opaque values. In other words, if it's on the net, it's pass by value. Similarly most inter-process communication is pass by value, and the sharing of messages containing values and names of channels is well understood using CSP and pi-calculus. The net is unconcerned about what paradigm the languages which implement the processes communicating by message passing; it only cares that the messages are values and names. It doesn't follow at all that the implementation language for a process which operates on messages needs to present immutable variables to the programmer.

The other area functional languages have been touted has been for problems where there are shared memory concurrency within a process.

A trivial counterexample is given by lazy evaluation - in trivial implementations, expressions are passed around as thunks, consisting of a place to store the value of the expression, and a function to evaluate the value. The expression is evaluated lazily by calling the function the 'first' time the value is needed. Being a pure function, it doesn't matter as far as the value is concerned whether the function is called multiple times in different cores, but if the value returned is a non-primitive there are still write order issues with sharing the value. So you end up with similar barrier ordering problems that you have in non-pure languages, and barriers cost performance.

For strict functional languages, you still need the barriers, and other synchronisation points, if you are going to communicate values between cores. If you are passing messages by value, then it is necessary that the sender or receiver of a message cannot change the value of the message. This isn't necessarily the same as language level immutability - in a shared memory space, values sent on a channel should not change, but that can be achieved by copy-on-send or copy-on-write approaches as well as by general immutability. Having a shared memory space between threads requires synchronisation with the garbage collector - otherwise memory can be reclaimed and rewritten, breaking the immutability. None of this is particularly surprising, and the implementation of any SMP shared-memory language with automatic memory allocation will have to deal with it.

The presence of immutability at the language level allows the optimisation that a message can be passed as an immutable reference rather than by value. This introduces the pessimisation that any garbage collection of the arena that the messages' storage is allocated from has to be coordinated between all threads. If the optimisation of mostly-concurrent memory management is made, where each thread has its own arena, memory for messages is either allocated from a special shared arena when constructed, requiring some synchronisation, or messages are copied when sent.

It's fairly obvious that if you are allocating from a shared arena, then you're having to serialise some of the requests, so are losing concurrency ( as well as having to do escape analysis so you know what objects constitute the message and so require to be allocated from the shared arena ), and that if you're copying messages into a shared arena when sent then you have lost the pass-by-immutable-reference optimisation. Either way, I've yet to see a truly convincing case that immutability in itself actually gives any performance gain for concurrent code and typical message sizes ( if you have very large messages which are not referenced by the sender, then any copying scheme will be inefficient, but that seems to be a case for escape analysis rather than immutability; copy-on-send would be at a disadvantage there, but Erlang style encourages small messages anyway ).

So the problems in terms of performance are acquiring lock on the memory bus, which takes around 40 cycles for a lock cmpxchg, barriers to order access, and copying. Languages with channels as first-class constructs hide the details of locking and barriers, and can trade various copying or shared memory techniques.

I've not seen anything in Erlang that would suggest that its shared nothing messaging is less prone to programmer error than Limbo's channels. There are other reasons to use pure-functional languages in terms of proving theorems about the program, and below the language level some optimisations are facilitated, and some are harder to perform. But there seems to be a confusion between which bits of a particular combination of facilities in a language are actually supplying the gains, and my opinion is that message passing combining values and channel names ( however implemented or enforced ) is both safe, and open to analysis at the systems level by pi-calculus.

Labels: , , ,


Messing about with X windows

I've spent the last week or so adding bindings to kin for X11 and anti-grain geometry.

X11 gives basic windowing capability, but no widgets ( you run GTK or some other toolkit on top ), AGG gives high-quality software graphics rendering.

This is a window which attempts to draw some buttons:


The window has a transparent background. I've lost track of the number of hacks I've had to do to get transparent backgrounds working in either MFC or Java over the years ( though both now support them ), so making it happen out the box is nice. Just use a colour with a alpha less that 0xff, and the background is transparent.

Thanks to AGG, all the rendering has affine transformations applied to it; the code draws the buttons where they should be, then bigger for debugging, then messes around with rotated text and shading. It's shown running on top of SciTE, the editor I use.

The buttons are all drawn manually - currently the graphics class supports fill rectangles with colours, linear gradients or radial gradients, drawing text and finding the bounding box of text.

The code implements a tiny button_model/button_presenter.

The reason I posted now is that I realised this line of code does the most of the work of the horizontal box layout in Java Swing:

fold ( ( x, btn ) => btn.set_bounds ( x, 0.0, btn.desired_width ( gfx, `normal ), btn.desired_height ( gfx, `normal ) ).x + btn.width + 4.0, 0.0, buttons )

Having done far too much Swing programming, I often find that with what I'm trying to do with it there's nothing that fits. For example, last contract I was trying to make a nice split button with a menu, and the look and feel/ Swing component / MVC distinction meant that you couldn't reliably implement one that was both pluggable and looked good - either you built one out of two buttons, which didn't look or feel like one button with a split, or you did an owner draw and lost the pluggable look-and-feel. That contract also required a column based layout, and containers which lined up like tables - the normal typographic idioms. The intention is that kin's toolkit will be more typographic/infographic in style, rather than widgety - most of the time, it's for reading not clicking.

I'm quite happy that starting from X11 in C rather than with a toolkit such as GTK ( which again may get transparent backgrounds soon ), after a few days there's enough there to see that it'll be much easier to create your own application-specific toolkits.

Labels: , ,


Comma Quibbling

Link: http://blogs.msdn.com/ericlippert/archive/2009/04/15/comma-quibbling.aspx

An example in kin. This misses the difficulty of the C# problem - as sequences in kin are linear, rather than being the mutating iterator/enumerator pattern used in Java/C#, you can pattern match several elements at once trivially (being able to pattern match sequences and backtrack on failure is one of the main reasons for writing kin in the first place).

module examples::eric_lippert_20090415
# http://blogs.msdn.com/ericlippert/archive/2009/04/15/comma-quibbling.aspx
def comma_append ( out, s )
match s
h :: [] =>
out <- h
h :: m :: [] =>
out <- h <- " and " <- m
h :: t =>
out <- h <- ", "
comma_append ( out, t )
[] =>

def write_with_commas ( out, s )
out <- '{'
comma_append ( out,s )
out <- '}'
=> out

def to_comma_string ( s ) => write_with_commas( string_writer(), s ).str()

def main ( in, out, args )
write_with_commas ( out, args )
out <- '\n'
out <- to_comma_string( args ) <- '\n'

Labels: , ,


A new way to look at networking.

Link: http://video.google.com/videoplay?docid=-6972678839686672840

I find this video incredibly interesting. It's talking about the phase transition that occurred between telecoms networks and packet switched networks, and questioning whether the same can happen between packets and data.

In a routed telecom network, your number 1234 meant move first switch 1, second switch 2 and so on - it described a path between two endpoints. In TCP your packet has an address, and each node in the mesh routes in to the node it thinks is closer to the endpoint with the target address.

Currently I have two or three home machines (netbook, desktop and file server) where data lives locally, plus data 'in the cloud' on gmail.com and assembla.com, or on my site here at tincancamera.com . Though as far as I'm concerned, that data isn't in the cloud - it's on whatever server at those addresses resolve to. If instead I google for 'pete kirkham', about half the hits are me. If I put a UUID on a public site, then once indexed I could find that document by that ID by putting it into a web server.

In the semantic web, there's been a bit of evolution from triples without provenance to quads - triples and either an id or some other means of associating that id with the URI of the origin which asserts it. It's considered good form to have the URI for an object resolvable to some representation of that object, so there's a mix of identifier and address. The URI for kin is http://purl.oclc.org/net/kin, which uses HTTP redirection to point it to the current kin site. This depends on OCLC's generous provision of the persistent uniform resource locator service, and maps an address to another address rather than quite mapping an identifier to an address.

There isn't an obvious analogous mechanism for data to the TCP nodes routing the request to a node which might be closer to having the data, though is some ways a caching scheme with strong names might meet many of the requirements. It sort of brings to mind the pub-sub systems I've built in the past, but with a stronger form of authentication. Pub-sub replication isn't caching, as in a distributed system there isn't a master copy which the cached version is a copy of.

There's also a bit of discussion between broadcast and end-to-end messages; I've got to work out how to do zero-config messaging sometime, which again sort of makes sense in such systems - first reach your neighbours, then ask them for the either the data, or the address of a node which might have the data. But there still isn't an obvious mapping of the data space without the sort of hierarchy that TCP addresses have. (although pub-sub topics are often hierarchic, that hierarchy doesn't reflect the network topology even to the limited extent that TCP addresses do.) It also has some relation to p2p networks such as bit-torrent, and by the time you're talking about accessing my mail account in the cloud, to webs of trust. A paper on unique object references in support of large messages in distributed systems just popped up on LtU, which I'll add to my reading list this weekend.

I've also started reading Enterprise Integration Patterns. There's a bit of dissonance between the cocept of channel in these patterns, and I guess in the systems they are implemented in, and the concept of channel in pi-calculus. In EIP, a channel is a buffered pipe for messages, and is a static entity created as part of the system configuration, in pi-calculus a channel is not buffered, and the normal mechanism for a request/response pair is to create a new channel, sent that channel to the server of the request, and receive the response on the channel. The pi-calculus model is the pattern for communication I'm intending for inter-process communication in kin (though I'm not quite sure whether giving explicit support to other parts of pi-calculus is something I want). I'm not sure that having to mark request messages with an ID (as in XMPP IRQ) rather than having cheap channels is a good programming model; though there are reasons that pretending that an unreliable connection isn't a pi-calculus channel is a good idea. I'll see what I think after I've tried coding the channels. Certainly returning a channel from a query given a strong name which lets the requester communicate with an endpoint which has the named data could be a model worth pursuing. REST is a bit different in that the response channel is assumed by the protocol, but the URLs returned by the application are channels for further messages to be sent to, and of course can be addresses anywhere allowing the server for the next request to be chosen to be the one best suited to handle that request (eg, a list of the devices in a home SCADA system could list the URIs of the embedded servers on the sensor net rather than requiring the client to know the topology prior to making the initial request).

Labels: , , ,


Things I didn't know in C

I've been creating an interpreter in C99 for the last six weeks. This is the largest project I've done in C, as normally I get paid for writing C++. I've also read the C99 spec, as I was taught C in '91, and things change.

Here are a couple of things I've discovered which I didn't know before:

You can use field names in struct initialisers, such as
( foo_t ) { .bar = 3, .baz = 8 }
which give you an automatic struct with the named fields filled in. Unfortunately, astyle makes a pig's ear of formatting it.

Unlike all the other printf family, the return value of the snprintf function is not the number of bytes written, but the number of bytes which would have been written had there been space in the buffer. That caused a few bugs in some string building code I'd written before I RTFM. I'd assumed the maximum value returned would be the buffer length, which meant I was assuming this is safe:

size_t max = sizeof(buf);
size_t offs = snprintf ( buf, max, "foo" );
offs += snprintf ( buf + offs, max - offs, "foo" );

But when offs > max, max - offs is a large value.

I also have given up using splint as a checker, as that was exactly the sort of error I was hoping to get detected. The amount of annotation necessary to get splint to work with my allocation scheme was too painful to justify.

Labels: ,


SciTE and ctags

I've been using SciTE as editor of choice for some time, and spent a little time today getting a few extra things working for C99 projects. This year I'm refreshing my C a bit, as I haven't done anything which has specifically used C99.

There's support in SciTE for autocompletion of words within the file by hitting Ctrl+Enter, but you can also add .api files which list function prototypes for an API, giving autocompletion via Ctrl+I. You can either use the files provided for C and Java, or use ctags and a bit of Python to build your own.

The C.api file on scintilla.org is a bit small, but I couldn't get ctags to find most of the prototypes from /usr/include/*.h and /usr/include/sys/*.h. So I made a little bash script to pass each header through cpp, then extract it, which worked OK.

This seems to work to generate tags file for conversion to SciTE api file:

ctags -f - --c-kinds=+p-d --excmd=number /usr/include/*.h | grep -E "^[^~_]" > tags

or from the kin project folder:

ctags -f - --c-kinds=+p-d --excmd=number src/core/*.h | grep -E "^([^_~]|_[^_])" > tags

without --excmd=number the python script fails to convert the tags to api, as it uses patterns rather than line numbers.

Also changed some of user properties - to break it up a bit more with files into the ~/.SciTE/ folder. This moves the c and asm highlighing into separate files, a machine.properties file for machine specifics (such as where to put the window and whether to show tabs and status, which depends on screen size).

I also created a 'workbook.sh' script to write the date to today's workbook file and open it.

To get directory properties working, enable them in user properties with


Then the SciTEDirectory.properties file has its local settings for the project. This defines

# project home

# make
make.command=cd $(ProjectHome); make $(FileName)

# go
command.go.*.c=cd $(ProjectHome);bin/$(FileName)

# Extend C API with project API

# Retag
command.1.*=cd $(ProjectHome);\
ctags -f - --c-kinds=+p --excmd=number src/core/*.h | grep -E "^(kini?|opcode)_" > project.tags; \
python ~/.SciTE/tags2api.py project.tags > project.api\
rm project.tags

# Apropos
command.2.*=grep -E $(CurrentWord) /home/pete/.SciTE/c99.api $(ProjectHome)/project.api

# Info
command.3.*=firefox http://www.opengroup.org/onlinepubs/000095399/functions/$(CurrentWord).html

This means build and run are performed in the project folder, and adds three extra commands - the first runs ctags to rebuild the project's tag index. The second searches the tag index for a highlighted word as a substring (rather than the built-in popup, which you type the first few letters). The third opens the unix spec for the highlighted function, which includes the C standard library functions too.

I've put the contents of my ~/.SciTE folder here, along with my .SciTEUser.properties file. There's the generated C99.api, asm and cpp properties without bold fonts (which mess up formatting), a version of cleanup.cc which will compile with gcc, protags.sh which preprocesses and ctags include files, and workbook.sh which creates a workbook/TODAY.log file and opens it in SciTE.

Labels: , , ,


The Computational Beauty of Nature/Little things

Link: http://mitpress.mit.edu/books/FLAOH/cbnhtml/

Over the last few weeks, I've been reading The Computational Beauty of Nature an coding up a few of generative algorithms.

I've put them in the examples/java/fractals folder. Not much to them, just written on a netbook running Ubuntu.

I have two main computers - an Asus Eee PC 900 netbook and a Shuttle SX48P2 Deluxe desktop - and tend to use the Eee PC much more, even for programming, despite having a much smaller screen. Maybe it's because I have to sit at a desk all day at work, or maybe its that a full-screen editor on the Eee PC is a manageable chunk of code. I've been given dual monitors at work, and don't really use them - sometimes to have a web session in the one I've not got the editor open in. I use SciTE. It's no harder to Ctrl-Alt-Right arrow to switch a workspace on the Eee PC than to move your head to another screen.

I don't tend to stepwise-debug GUIs that much - which is why I was given two screens at work - I tend to break the code into smaller methods and write tests for them, which records the areas which you had difficulty with and had to pay extra attention too, information which is lost if you use a debugger.



Prolog is not co-operative

XProlog v.1.3, May 2002

?- help.
> No.

?- oh, go on.
> No.

?- please.
> No.

?- quit.



Why Type

Link: http://blogs.msdn.com/ericlippert/archive/2007/12/06/immutability-in-c-part-three-a-covariant-immutable-stack.aspx

For work, I'm demoing a GUI editor, which means undo would be nice, which is easiest to implement if you enforce purely functional data structures. So I'm doing a bit of reading as to how to implement suitable structures in Java 1.6, as that's the project's target language, and using Eric Lippert's examples as part of my reference material.

There was also yet another thread on forums.java.sun.com about what can be done with generics, erasure and making arrays safe.

I like having something which proves that all possible execution paths don't result in obviously erroneous actions. For many cases, a strong type system supports this.

But the more I look at it, being able to assert any simple thing about any action on an object would make a lot of the proofs simpler. There are better and better proof tools around, and the kinds of things you want to prove about concurrent systems are not easily expressed in a type.

I'm increasingly thinking that a dynamically typed runtime, with a good static model checker, would trump a strongly typed language in performance, productivity, and safety. But creating such a beast is highly non-trivial.

Labels: ,


Cedric's challenge, C port and analytic attempt

I was a bit disappointed that the C port of the previous code wasn't quite as fast as another approach which cached more state instead of generating the i'th value. The second attempt, using the bcd lists was comparable to John Wilson's Java version (1.3 seconds rather that 1.5 reported by the Java), and an attempt using bitsets a bit slower. Timings on Eee pc 900, Ubuntu hardy heron. The code is nrnat2.c.

The biggest jumps are between 9(descending digits) and 10(ascending) and 10(descending) and 12(ascending). Generating those four values will give the largest jump in a range.

The number of values will be the total of combinations: 9 + 9 * 9 + 9 * 9 * 8 ... up to the max value.

The total for 1..9 is 9*(9+1)/2 = 45;
For numbers 10 to 98 there are 9 numbers starting with each digit, so you have 450*9 for the first digit, and 9 lots of a odd sum, where there's one digit missing - so 9 * 45 - 45 for taking one digit out each time (the range is 0..9 instead of 1..9).

That gives a total up to 10 of 45

Up to 100 we get the 45 up to 10, 45 * 10 * 9 for the first digit shifted once repeated 9 times, and 40 * 9 for the second digit repeated 9 times (as the first digit is always non-zero, it biases the remaining digit's average by 0.5 from 4.5 to 4). So total for up to 100 is 45 + 45 * 10 * 9 + 40 * 9 = 4455.

Up to 1000 we get the values so far, 45 * 100 * 9 * 8 for the first digit shifted twice repeated 9 * 8 times - the number of combinations of the other digits. The remaining digits occur the same number of times as each other in each position and once for each value of the first digit, so you get 40 * 9 * 8 * 11 from them, so total is (4455 + 45 * 100 * 9 * 8 + 40 * 9 * 8 * 11) = 360135.

Similarly sum up to 10000 is (360135 + 45 * 1000 * 9 * 8 * 7 + 40 * 9 * 8 * 7 * 111) = 25277895

Simplifying, the sum of the values with n digits is (9!/(9-(n-1))!) * ( 45 * pow(10, n-1) + 40 * floor(pow(10, n-1) / 9) ) and the maximum jump is floor( ( 1203456789-1029876543 ) / pow(10, 10-n) )


Labels: ,


A coding challenge

An implementation of the coding challenge of Cedric's.

Somewhat coloured by the fact I normally use python as a prototyping language prior to production coding in C++.

The combinations technique (using a bcd string instead of list) gets faster than the filtering technique for max somewhere between 1,000,0000 and 10,000,0000. As python is bytecode, it may be slower to bit-twiddle than use a dynamic list; IIRC it has to allocate memory to hold longs anyway, but getting it right in Python and then porting to something with good performance is the way I often work, and longs are fast in production programming languages.

import time

def has_repeat_digits(i, available = (1 << 10) - 1):
digit = i % 10

if available & (1 << digit):
if i < 10:
return False
return has_repeat_digits(i // 10, available &~(1< else:
return True

def brute_force_counter(max):
for i in range(1, max):
if not has_repeat_digits(i):

def test(max, counter, timed_run = False):
last_i = 0
biggest_jump = 0,0,0
total = 0
start_time = time.time()

for i in counter(max):
if not timed_run:
print i,
total = total + i
jump = i - last_i
if jump > biggest_jump[0]:
biggest_jump = jump, last_i, i
last_i = i
stop_time = time.time()
print 'total', total
print 'biggest jump', biggest_jump
if timed_run:
print 'time', stop_time - start_time

# select the index'th permutation
# first digit is special case, is in range(1,10) rather than range(0,10)
# otherwise it's a standard permutation. Using BCD encoded value to represent
# remaining digits to select from rather than a list as deleting from a python
# list is O(length) whereas deleting a digit from the BCD is O(1), and I like
# fiddling with bits.
def bcd_select(index):
if index < 10:
return index

pow = 1L
fact = 1L
digits_available = 10

index = index - 1L

while index >= fact * 9:
index = index - fact * 9
digits_available = digits_available - 1
fact = fact * digits_available
pow = pow * 10

#~ print 'fact', fact, 'pow', pow, 'i', index, 'i//fact', index//fact

# first digit is selected from 1..9
available = 0x9876543210
digits_available = 9
sel = (index // fact)
index = index - sel * fact
sel = sel + 1
available = (( available & ~((1 << ((sel + 1) << 2)) - 1)) >> 4) | ( available & ((1 << (sel << 2)) - 1) )
acc = sel * pow
pow = pow // 10
fact = fact // 9
#~ print fact
while pow > 0:
#~ print digits_available
#~ print 'fact', fact, 'pow', pow, 'i', index, 'i//fact', index//fact
#~ print 'available', digits_available, hex(available)
sel = (index // fact)
acc = acc + pow * (( available & (0xf << (sel << 2)) ) >> (sel << 2))
available = (( available & ~((1 << ((sel + 1) << 2)) - 1)) >> 4) | ( available & ((1 << (sel << 2)) - 1) )
index = index % fact
pow = pow // 10
digits_available = digits_available - 1
fact = fact // digits_available

return acc

def bcd_counter(max):
index = 1
while index < max:
value = bcd_select(index)
if value > max:
yield value
index = index + 1

Labels: ,


On the map

Received a little gps 'mouse' receiver in the post today, and (having installed gpsd and python-gps) have got it running as a local service with my first google maps mashup. The server is here, it just shows a google map centred on the current location. I now need to wait until I get a mobile broadband dongle and the new tiny laptop I've ordered (Asus eee 900) and then can have gps on the road (though I don't really want to be using a laptop in the car unless I'm really lost)

Labels: , ,


Python and EA

For my current contract, I'm part of a team producing scripts which define interfaces to equipment in a satellite communications system for monitoring and control. Various interfaces, serial, low-level TCP and SNMP.

The application is fairly old - originated on Unix in early 1990s, ported to MFC late 1990s - and there are many links between the various configuration files. It's a high reliability system, so uses the classic pattern of a watchdog process with spawns the active processes, monitors them for liveliness, CPU and memory use, and respawns them or kills them in event of failure.

The goal is to use Enterprise Architect for modelling the system in UML. EA has a limited custom code generation - one file per class, poor tree walking functionality - which is just sufficient to create the interface scripts, but none of the other configuration which is more highly intertwined - for each rack or group of racks of equipment to be monitored, there's a server which runs a set of monitoring processes, each monitoring a set of instances of the equipment. Various parts of the configuration, such as the port ID of each server process, need to be put into different configuration files, so the desired layout of generated files doesn't follow EA's assumptions.

Also, there needs to be a parser for the configurations which reverse engineers a UML model from the configuration files - we don't want to have to generate the UML models for existing equipments by hand.

So I started writing a little parser in Python which generated an XMI file for import into EA. That worked well enough, and impressed the line manager, so have been spending the last couple of days fleshing it out.

EA doesn't quite treat tagged values as first-class citizens of UML, relegating them to a second window (though it's nice to have them docked and always visible, but that could apply to all the panes in the properties dialogs). So I wanted a table view which looked like the config file layout, of all the extra tagged values we're using to define the application specific properities used to implement the attributes of the equipment in the configuration files.

I've been using GTK for that - I've played with GTK a bit in C, and quite like it's level of abstraction and simple event model. It's not too hard to get things working in Python (some of the handling of attributes is a bit counter intuitive), and it mostly looks native (compared to Java or Gecko, or even EA's UI, which I believe is written in C#, but isn't using WinAPI menus).

I quite like Python, but it's a bit too imperitive for my taste, and there's no type system to tell you if you've put a right-handed vector into a left-handed library, or referenced an attribute of an object which doesn't exist. But being able to create a simple dialog with a REPL (read/eval/print/loop) in it to do the initial exploration of the EA COM automation interface was very handy.

The thing I dislike most in Python is that it doesn't actually use indentation to indicate blocks (I read Ken Arnold's chapter in The Best Software Writing over the weekend, and think I agree). It uses the colon character to indicate a block, and following a colon you have to use indentation. I'm always just doing the indentation, and forgetting the colons.

Labels: , ,


Lambda The Ultimate and Matrix Distractions

I had been reading up a bit on trace-based JIT. I'm not sure that there's any real difference between trace based and method based JIT if each branch is converted to CPS style call with type feedback. Which means removing the optimisation of having jumps and branches from the bytecode, and replacing them with dynamic calls.

But then I got distracted by a discussion of Strassen matrix multiplication, and haven't done anything with it yet. I need to get back reading through Golub & Van Loan too; it's over a decade since I last read it.



Test of syntax highlight

Test of the syntaxhighlighter:

# generic sequence trait
trait seq<T>
def empty : boolean
def head : T
def tail : seq<T>
def to_string => '[' ++ join(self, ', ') ++ ']'
def concat (other:seq<T>)
=> other.empty ? self : shift(other, reverse)
def reverse => shift([], self)

Labels: ,


Some things I'm thinking about

I've been playing more with my mostly functional modelling and simulation language (kin), here are some ideas which I want to try and profile to see if they offer gains:

Carrying type and data as a tuple:

Some implementation of Scheme and JavaScript use a single void* size datum for each object, and for integers set the lowest bit and encode the value in the rest, and for other objects the lowest bits are zero due to alignment, and the whole is a pointer to object. That doesn't work very well for doubles, and requires that the type is stored in the object's header. Java, C#, and C++ (for object with a vtable) have type information in the object header, which costs in large arrays. I've been thinking about passing the type and the data around separately, which means you can have more tightly packed arrays for non-vtable types.

One-bit markers for escape detection:

Based on One-bit Reference Counting, a marker in the type field to distinguish 'shared' objects you can take a reference to, and 'non shared' which you must not.

Write-once arrays:

As an extension to 'one bit reference counting' style; you maintain a high water mark instead of always copying, for use in structures such as VLists so you don't copy on passing automatically, and it allows you to mutate an array if it can be proven that all accesses happen before writes in matrix multiplications:

ax[i] <- ax[i] + row[i] * col[i]

Block order arrays for cache sensitive iteration and super compiled functors:

In a simulation, it's not uncommon to have a reduction on some function of a few attributes of an entity:

let total_force = reduce(array_of_entities, lambda(e:entity, ax) => f.mass * f.acceleration + ax, 0)

If this is over a homomorphic array, and mass and acceleration are simple double values, this would in C translate to picking 8 bytes here and there from the memory for the array. If instead each field is kept either in 'column order', or in what I'll call block order - so the fields for (cacheline size in bytes) objects are held contiguously. This should both reduce cache misses, and allow the use of SIMD instructions to process the data. The obvious disadvantage is that an object in an array is no longer the same layout as a single object on the heap, and to exploit it you need either a super-compiler or a trace-based JIT.

Effective model for scopes based on lists, symbols and objects:

Trying to build an interpreter in Java, which makes it tempting to use maps for everything, I found that the properties of an object and the properties of a lexical scope are much the same, (the duality of closures and objects is well known) so will try and define the binding model for values in kin using symbols, integers and lists only.

Using CAS for thread safe homogeneous VLists

Similar to the vector implementation in Lock-free Dynamically Resizable Arrays.

Options for compiling Verilog to shader language

Having had an interview last week as a systems modelling engineer with a company who were using C as the modelling language for timing simulations in embedded memory controllers, which is a bit like going for a job as a car mechanic and discovering that you're being quizzed on your wood-working skills, I was thinking about compiling an appropriate modelling language to something which executes efficiently. Though their particulary problem - timing conflicts - I would have though as having an analytic solution.

Something after UML

Speaking of Verilog, Verilog came into existance because even standardized schematic diagrams don't carry strong enough semantics and are not amenable to algebraic analysis, and graphical notations don't scale to large systems without hiding information.
Pi-calculus came into existance as Petri nets don't scale to large systems and are not amenable to algebraic analysis.
UML is very much in the standardised schematic mould, lacks formal semantics, and relies on hiding information to scale.

Often the hidden information in UML is very important - what appears to be a good design is not a good design if its simplicity is achieved by hiding most of the attributes and operations of a class, as often happens when a class gets multiple unrelated responsibilities. The notation allow you to hide details which should indicate that the design should be factored into smaller units, and in fact encourage such behaviour as a mechanism to scale up to showing many classes. For example, in Altova UModel:
You can customize the display of classes in your diagram to show or hide individual class properties and operations. ... This feature lets you simplify the diagram to focus on the properties and operations relevant to the task at hand.

If the class has features which are not relevant, then refactor it. Don't hide it so it doesn't make the documentation more complicated. The classes for well-factored systems have nothing to hide, but require more relations between the classes, which UML tools don't handle well (last time I checked none provided basic auto-routing or anything like good layout engines) and look more complicated, even though they are more simple - in the same way a dry stone wall is simpler than a minimalist plastered white apartment, but is visually more complex.

So what might come after UML then? What would not mitigate against refactoring, but allow a visual overview? What notation might give good analytic properties for software, rather than being a system schematic with loose semantics?

I don't know yet.

Labels: , , ,


Bjarne Stroustrup on the Evolution of Languages

Link: http://msdn2.microsoft.com/en-us/magazine/cc500572.aspx

A few comments on this -

The illustrations in the interview don't appear related other than that they are of an IDE ("I might have a different opinion if a good IDE was universally available", queue Visual Studio, which implies that VS isn't a good IDE) and that the UML is illustrating multiple inheritance, but mechanism for dispatch in the paper is not based on multiple inheritance (at least as shown in the diagram).

It's an interesting paper though, and I would like open methods in a language.

I don't think that DSLs should be just extensions to general purpose languages. Mainly because that implies a specific execution model, and if you have a DSL then execution shouldn't be part of the domain model (unless it's something like Verilog, or the input to a program proof tool, in which case execution is a domain concept). DSLs should allow the modelling of a domain, and lose their benefit - abstraction away from execution - if they are tied to a programming language. If you have a model of the physical domain, then you can state 'F = m * a' and not have that imply that the values named by 'm' and 'a' are multiplied and stored in the location named 'F', as binding to C++ semantics would have it. Otherwise you need at least three models - 'F = m * a', 'm = F/a' and 'a = F/m' - to let you represent everything that the one entity in the domain represents. Which would create a mismatch between the language and the domain, eliminating the benefit of a DSL.

I've also been reading 'Programming Languages Pragmatics' and working on my model-based language kin a lot, and adapting the parser and code model of the modelling language I used to do a state model of some concurrent code to support kin. I was generating Java as a first cut to bootstrap a full compiler, but Java's generics aren't rich enough to support kin's parametric types, and without them it's not easier to generate Java than C. So I will go straight to a C back end, and generate an intermediate form which can be interpreted. It's more important for kin to be able to call C libraries such as the Intel Math Kernal Library than for Java libraries, and I don't really like the sort of compromises you have to make to fit in with Java's libraries, however nice the JVM's HotSpot and garbage collector are to have.

Labels: , , ,


Andy Goldsworthy - Rivers and Tides

We got the DVD as a wedding gift, and watched it last night.

I like Andy Goldsworthy, both for his aesthetic and the humility that lets a work exist within an environment which makes it transient, or fluid.

Watching him create a work, sometimes it would fall. Often it was compromised to fit with the time scale imposed by the environment by tide or light.

At each fall, a discovery. More of the nature of the work is revealed; the more failures, the more effortless and whole the final work becomes.

I really like the idea of a flowing form moving through or around obstacles, adapting to what's there, constructed out of what is available, but maintaining its own identity, a recognised figure.

The hanging works, stalks randomly pinned together with thorns, which then form a coherence to mark a hole. A form emerging from chaos, using the chaos, constructed with the chaos.

The works accept that they are not static, but will be changed after the artist commits them to the environment.

There's something in programming - called the quality without a name - which is what software architecture is about. It isn't about perfection, or locking everything down to nuts and bolts, or about hiding the chaff and chaos under languages and frameworks, but about habitability and fluidity, and coherence. It's what the original pattern movement was about.

Is this work good to live with? Does it fit with the environment? Can it flow and adapt with the rivers and tides?

Labels: , ,


Plays well with others?

> I know large systems can be built statically...I've also done it dynamically in more than one dynamic language.
That hasn't quite been my experience - every large system I've worked on has ended up getting layered into a static core with dynamic or DSL scripting. Greenspun's tenth rule.

On the other hand, even medium applications I've been involved in which were implemented only in a dynamic language got into difficulty with performance or modularity, and so either evolved into a mixed language system or were reimplemented.

Often I also use dynamic languages or environments to prototype algorithms, and sometimes then have to port to static languages for performance reasons - writing better code generators is rarely cost effective, as the dynamic languages are sufficiently good at the general cases, if you need to optimise you're optimising for a specific requirement. The same happens (albeit more rarely) where you need to add inline assembler code to C or Fortran where you need to go below their abstraction level (CAS or bitwise rotation, integer overflow etc).

Also in large systems you often interface with extant libraries, which may be implemented in diverse languages. You're not going to port half a million lines of a numerical library into Smalltalk so you can call it.

So you end up with interfaces between the user code and the library, and having to specify in a machine processable manner the types at that interface, as you're now below the abstraction barrier of the dynamic language.

Other dynamic languages require definitions for such interfaces, though foreign ones may not be visible in the language's introspection facilities - for example, Mozilla's JavaScript uses IDL for the definition, but the type and arity of functions isn't visible via introspection.

This introduces cost and brittleness to the system; and in the types of systems I've most experience with it also makes it harder to move a prototype algorithm between the dynamic and library codebases than it feels it ought to be - the dynamic runtime usually has the facilities for doing most of the code generation work already, but there's no means of accessing the results and tweaking it. I am looking for mechanisms which make that sort of task easier, and which don't require custom interpreters for DSLs, which are a pain to get right or maintain.

>"What happens in Smalltalk projects where you grow larger? Where you want to divide a system up into components, and some of those components don't exist yet?"
> You have a need; you have an idea to meet the need; you try it; you grow it. I guess I don't understand the question.

In my experience, that only happens up to a certain size. My question was about large systems - systems where you have to subcontract components, either to another individual in your team, to a second local team or to a separate organisation. Bottom up approaches don't scale out to that level, top down approaches introduce brittleness and (code production) inefficiency. The Smalltalk browser appears to be strongly a bottom up tool; I'm curious as to whether it allows that approach to scale out better.

> Semi-formal notations, tests and other examples, documentation, good tools -- these all contribute to success. But if you are looking for some kind of a "static declaration" as they saving mechanism for large systems -- it may work for you, but I doubt you can prove they are necessary and sufficient. I have counter-examples.

I do consider some form of specification of the domain and codomain of a function to be required, and the stronger the specification which can be automatically tested, the better. Verifying natural language specifications by hand is simply not viable for large software projects. What's provided by the type systems of most languages is not sufficient even for that - you can't say the domain of a square root function is the set of positive reals. Specifying pre- and post- conditions helps, but for anything involving concurrency you need 'for all' rather than the 'there exists' specifications given by unit tests. But most of the system doesn't require formal specification, and good design minimises shared state.

So I'm looking for something with the gains of dynamic languages without barriers to performance, and good modularisation. As such, I'm more interested in gradual typing than Smalltalk. But if it does have anything to offer then I'll borrow from it.

> "Can such interfaces be browsed in the same way if the code can't be called?"
> Your question betrays your bias. "There must be some such mechanism as the one I am used to," he thinks.
Yes, I have a bias based on my experience and problem domains.

One project I was involved in was refreshing the library used for airworthiness stress analysis, with a view to improving the performance (it was costing around half a million pounds in CPU time to run the analysis for each design revision, so in the company at that time, making a 10% improvement in the run time would mean someone didn't have to get made redundant). This consisted of a set of Fortran libraries, some OS scripts, and a domain specific scripting language which encoded the models and analysis algorithms. A set of flight states was mapped over the model, and the outputs reduced with a filter to indicate any resonances. The core of the system was mostly unchanged since the late 1970s, only the models used being updated with successive revisions to the aircraft.

There was a large amount of model and library code, so I wrote a quick and dirty call browser to explore data flows in the parts of the system I had access to. Since the code was significantly larger than the memory capacity of the computer I was running the analyses on, let alone expanding that to the AST and call graph, the interactive image and execution based querying as presented in the Smalltalk browser post would not have been suitable. The project ended up going nowhere, as translating the implementation of non-standard dynamic language from the mainframe to PC would take more resources than the team had in the time available.

If the scripting was in a standardised dynamic language, then the maintenance cost would have been much lower, and the chances are the code could have been moved from the costly mainframe to instead run overnight on PCs.

So is there any way such tools can extend for a system of a thousand executables, rather than one? A system which doesn't have everything visible? The types are somewhat irrelevant to this, but most programs people are developing are not whole; the question is of scaling approaches which assume you have a single image which is everything you care about, and all changes to the system are monotonic. If you have a system with multiple teams working on it, if you are developing a library which may be used in a number of ways, or if you have a system which is so big you can't load it without paging, then I'm interested in what happens, since anything you deduce from calls to code you can't inspect is not monotonic - some other use of the function may be a correct use and widen its domain. Partly it's an issue of extensional rather than intentional analysis - some systems you need to prove that the core is correct rather than show it is not incorrect in parts. But an awful lot of the code I work with doesn't need that level of confidence, and as much as possible I write that in dynamic languages. Sometimes I make that judgement wrongly, and so I'm looking for mechanisms which allow 'fixing' of parts of the system more easily, as well as allowing assertions to be made for compatibility between parts of a heterogeneous system.

> Most large systems developed with dynamic languages have not used the kind of mechanism you seek.
I've seen such mechanisms in JavaScript, Lisp, Erlang and Python. Maybe Smalltalk just doesn't play well with others, and is only suitable to monolithic applications.

Labels: ,


One more wide-finder

Since previous efforts didn't print the results summary, they weren't really complete, so here are two more variants using a vector (tbrayA.cpp) and a priority queue (tbrayB.cpp) to get the top 10 entries.

There's a pthread parallel version of the B variant (tbrayB_parallel.cpp). The main thread reads chunks, scans back to a line break, and hands off the chunk to a pool of worker objects, which run the matching in their own thread. Once all the data is read, the workers' counts are summed and the top 10 counts found, then the results printed as in the single threaded version. Each worker's data buffer is written only by the main thread, for the most part in the call to the OS' read function; this helps reduce the amount of copying of data - though whether the T5120 would have to copy anyway I don't know (some CPUs share cache between cores, others don't, those that don't may have to copy the buffer from one cache to the other when the matching thread starts processing).

The code hasn't been built on anything other than linux this time, so I can't guarantee it will run on Solaris. There's also a deadlock on the scheduler if you only use 2 workers (you need at least two so the buffers are only being written or read by one thread at a time), which I have to track down.

Obviously, the multi-threaded version requires quite a lot of infrastructure - it's twice the size of the single-threaded version. It also runs significantly slower than the single-threaded version:

fortinbras:$ time bin/tbrayB_parallel datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real 0m1.358s
user 0m0.372s
sys 0m0.232s
fortinbras:$ time bin/tbrayB datasets/hundred-o10k.ap 10
8900: 2006/09/29/Dynamic-IDE
2000: 2006/07/28/Open-Data
1300: 2003/07/25/NotGaming
800: 2003/09/18/NXML
800: 2006/01/31/Data-Protection
800: 2003/10/16/Debbie
700: 2003/06/23/SamsPie
600: 2003/02/04/Construction
600: 2005/11/03/Cars-and-Office-Suites
600: 2004/04/27/RSSticker

real 0m0.531s
user 0m0.232s
sys 0m0.156s
Commenting out the call to process_chunk(), so it just does the fork and join thread communications:
fortinbras:$ time ./widefinder datasets/hundred-o10k.ap 3

real 0m0.288s
user 0m0.056s
sys 0m0.220s
So there is some additional cost in the parallelism other than just the inter-thread messaging (372 > 232 + 56).

There's an archive of the code here; unpack it, run make and call ./widefinder with the path of the logfile and the number of worker threads. The number of worker threads is 3 minimum. The other examples compile to the ./bin directory from make tbray.

I'm still not overly convinced that it will be CPU bound if it has has a small, slow, miserable disk unless you are using a language whose virtual machine is excessively baggy; I've been careful not to use any optimisation which is specific to the data, rather than what a compiler for a higher-level language would know, given suitable annotations for parallelism.

The latest erlang effort from Pichi is twice as fast as the original ruby; ruby takes 2.580s CPU time on my machine, so the single-threaded C++ version is 11 times faster than ruby on one core, the parallel one 7 times. So once you have half-a-dozen cores then the erlang should beat the single threaded C++ code. Though I'd like to see whether on the disk on the T5120 means it is CPU limited or IO limited with the faster code, and I'm unconvinced that the T5120 is what future will look like - I'd expect more cache and flash on chip, rather than more cores, and a multi-node rather than a multi-core future.


Labels: , , ,



I've been thinking a bit about garbage collection and stack allocation, and both Robert O'Callahan and phaeron at virtualdub seem to have been.

Chicken scheme seems to have something interesting - a single stack which is collected on exhaustion (rather than C# or Java VMs which have a traditional program stack and a stack based allocator), so I'm thinking about implementing a tail-call-elimination variant of the single stack. That requires escape analysis, which may be costly - in its simplest form, it requires a flag to indicate whether any reference was made to anything on the stack above the point the tail call returns to. Either you can be conservative and set the flag on all function calls where a reference to a stack object is passed as an argument, or you can use a single assignment model and test whether anything was allocated. Making the core single-assignment would possibly complicate supporting other languages, and means you trade space in one condition - consing to hold the assigned value - for space in another - being able to recover some of the stack early in a tail call.

Having just the one stack means that, on x64 at least, you can use the one stack register.

Hmm. These things are all rather hard to do right, and don't have simple 40 line benchmarks.

See also a simple worked example.


Labels: , ,


I hate everybody

Well, every programming language I've ever used anyway.

Not a lot to report, I've faded off of doing the wide-finder thing as the MPI ruby code was about as good as I could expect in elegance, and I'm more interested in other stuff at the moment. It did get me annoyed with C++ again, in that its primary mechanisms for specialisation - template metaprogramming and single dispatch polymorphism - aren't quite a good fit for the problems I want to solve, - the boost::regex version was much slower, and shouldn't be. I want to have good enough performance and clarity and open objects when developing and strongly typed interfaces when building libraries and extensible grammar and access to the low-level machine when bit-twiddling (tbray9.cpp uses a better 8-wide '\n' scan, adapted off this zero-in-word hack; parallelising it with pthreads or MPI is a left to the reader.) I don't want to have to use unhygienic macros; and this discussion of units in measures in Java's verbose, strict, non-expressive generics made me think about kin, my pet lisp+prolog+strong types language again.

So most of this week I've either been thinking how to make fast VList based stack work for a fork+join worker model, or playing with a higher level language which looks a bit like python with strongly type annotations - I prefer infix syntax, Dylan has always been a bit wierd looking to me, and I don't want JavaScript's C-like braces. Python got a bit interesting when it looked like it was drifting towards lisp-land, but then drifted away again.

One thing that struck reading the dylan manual (it's always worth looking at how other languages do things) was its convention that
were the same thing. If that was combined with a strong form of memoization such that writing
    bar(foo) := baz
guaranteed that in future
    bar(foo) == baz
you would get the ability to define dynamic fields in objects, similar to what JavaScript and other associative object based languages offer, as they could be written in the dot syntax as:
    foo.bar := baz
foo.bar == baz
Assuming type specialisation on the function trace path is also cached, lookup becomes a single hashtable access. You do lose locality, but if you want performance you'd have to add it as a field on the object's class, though I hate that - language features shouldn't be used for indirect effects; type safety and performance optimisation should be orthogonal, and lack of locality saps performance (though it depends on context - if you're iterating over lots of objects, accessing the same field, the locality could be better than if you had a conventional C++ (different fields in the same object local, different objects anywhere) or JavaScript (different fields in the same object possibly local, using a hashtable, or similar).

As an aside, recursively rewriting the slot accessor form
== (x.f).g
== f(x).g
== g(f(x))
Which, although apparently missing from the DRM, is the opposite of the normal math interpretation of (g o f)(x) = g(f(x)). Pity. Now I also hate Dylan.

My more recent ramblings about kin are on kintab.net; I've thought a bit about what I'd need to do units and measures nicely, and have been going through SCIP coding up what the examples would look like in infix notation, which is a useful test for conciseness and clarity.


Labels: , , ,


A few more thoughts on concurrency and high-level languages

I little while ago, I started playing with a pattern-matching and relational language for meta-programming called 'kin' (it's got relations, and does pattern, so kin/kinship).

Being a machine elf rather than a language lawyer, I never could come up with a decent user syntax for it; I've toyed with making it a lisp (either L1 like scheme or L2 like common lisp; I wrote something that used lisp s-exprs to manipulate JVM class files, but never got further, and then 1.5 came out and increased the verbosity of the type system without increasing its power significantly, and I moved on), with making it a prolog derivative (I rewrote the core of a Java WAM engine so it ran a few times faster, but never got anywhere much with that either, apart from looking at it and thinking, 'yep, it's faster now, so what do I do with prolog?'); then I thought about a faster JavaScript engine, but Tamarin's handling that well enough for most use-cases, and I want better type inference rather than having to tell the compiler everything. Tamarin also makes the mistake of relying on programmer specified type to give performance; usually the type can be inferred if you have the whole system available, and even if not, the Self and StrongTalk VMs show you don't need explicit type to help the compiler. Type systems should help programmers, not compilers. But the JavaScript story doesn't really matter - the XULRunner front end I spend half my time on at work spends 90% of its time in Mozilla's SVG rendering code rather than my JavaScript anyway. I write my best software when there are users asking me to, and when it's also a problem which interests me; without users I get bored too quickly, without interest I don't give more than the users expect.

I've spent a few hours poking around Rubinius code; they appear to be making a ruby interpreter very like lisp interpreters were a couple of decades ago. I don't program ruby; I do heavy math backends and light front ends (with occasional forays into OpenGL), and never touch RDBMS, so use C++, Fortran and Java backends, C++, Java, XSLT middleware, C++, Java, JavaScript front ends (depending what level of 3D visuals they require; I haven't had cause to look at canvas3D yet so the OpenGL work has always been in C or Java, though given the current Gecko SVG performance I've gone off basing a front end on Mozilla code) - though I'd quite happily never use Java for a UI again (last time I got RSI from all those anonymous inner classes). I've been known to use lisp to prototype math code before coding it up it C++ for speed (though that's mainly been due to working for BAE at the time, who take 6 months to buy software for a 3 month project, or due to other business constraints on the implementation language and isn't a reflection on professionally produced lisp compilers, which are fast enough not to require porting to C++).

An old housemate and good friend of mine did his PhD in an embedded Erlang system; that's as close as I came to it before a year or so ago, when I started using XMPP at work and tried installing ejabberd. It didn't work, and I gave up after a few tries, so I tested our server's federation code against Wildfire instead. That's nothing to do with Erlang; ejabberd appears to scale well in independent tests, and I suspect the Windows installer wasn't a priority (our customers require Windows). I've read comments that there are smart people who have written good software in Erlang. This is true. Smart people also write good software in BASIC or COBOL or ADA; that's because smart people write good software. Writing a language that smart people can write good software in is easy; writing a language that average people write non-crap software in is far harder. For my opinion, for this sort of problem, ruby seems such a language (well, it seems much better than Perl or PHP, and a bit better than python), but the implementation is pants. Conversely, ejabberd scales to many more users than Twitter can, and I suspect is a cleaner expression of the problem, because IM's a concurrency problem and erlang far exceeds ruby's capabilities at concurrency.

Here's some good advice on how a high level language should express a problem in something I started reading earlier:
"The structure of the program should exactly follow the structure of the problem. Each real world concurrent activity should be mapped onto exactly one concurrent process in our programming language. If there is a 1:1 mapping of the problem onto the program we say that the program is isomorphic to the problem. It is extremely important that the mapping is exactly 1:1." -- Joe Armstrong (PhD thesis, Dec 2003).

You have a log file. You scan each line (in any order) for a hit and accumulate the number of times each hit occurs. Then you list the hits in decreasing order of the accumulated hit counts.

I don't see any concurrency intrinsic in the problem. The only reason we're even thinking about concurrency is that we want the implementation of the solution to the problem to be parallelised over many cores for performance reasons, and (as SIMD isn't sufficiently general to scale much beyond 4 times wider paths except in GPUs), many concurrent cores is the only thing we can think that will make computers have more throughput. For this problem, concurrency is as much an accident as memory management. Languages which handle memory management automatically have already eclipsed those which require manual memory management in most areas where it isn't an essential part of the problem. As it's not an essential part of the problem, no language where the concurrency is manifest in the implementation of the solution will be following the structure of the problem. Admittedly, I grew up on BASIC and 6502 assembler in my teens, and if you teach kids pi-calculus in high-school the next generation might see the problem differently.

Neo: What are you trying to tell me? That I can dodge bullets?
Morpheus: No, Neo. I'm trying to tell you that when you're ready, you won't have to.

With Erlang you can express a problem in terms of concurrency.
It may be a long time before ruby's ready.


Labels: , , ,


For this particular problem, the principles are more important that the exact results, but I'm happier with tbray7.cpp which gives the same results as the regex in the original ruby code:
fortinbras:$ time bin/tbray7 datasets/hundred-o10k.ap 
matches: 95400

real 0m0.439s
user 0m0.284s
sys 0m0.144s
The file was in cache for that one so ignore the real time; the important thing is that the user time wasn't adversely effected by getting the 'right' answer, instead of 1100 matches.


Labels: , ,


Of course, it helps if you tell gcc to generate a 64 bit executable

... when you're using 64 bit operations:
tercel-2:$ time bin32/tbray4 datasets/hundred-o10k.ap 
user 0m5.644s
tercel-2:$ time bin32/tbray6 datasets/hundred-o10k.ap
user 0m4.048s
tercel-2:$ time bin/tbray4 datasets/hundred-o10k.ap
user 0m5.725s
tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap
user 0m3.839s
I'd assumed it would default as it does on linux, but no. Which also explains why I was confused about sizeof(size_t) being less than sizeof(long long) here.

It's still around 8 clock cycles per byte - the 5% gain is not enough to alter yesterday's estimates, which is a little encouraging actually, as I wasn't sure whether or not 32 bit Sparc instruction generators, such as gnu lightning (used in rubinius) or whatever erlang's hipe uses would cause a significant slow down.


Labels: , , ,


Some fingers in the air.

I got Steve Vinoski's 2007/09/29 erlang code, installed hipe and the bfile module, and it ran on the laptop:
fortinbras:$ cat ../datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time erl -smp -noshell -run tbray5 main 512 ../datasets/hundred-o10k.ap
110100 matches found

user 1m23.649s
real 1m33.683s
sys 0m1.620s
I'm not sure looking at either mine or Steve's code where the 1101th match comes from - there are #ifdefs in my line splitting code to print the lines, and if you run diff that output with the input it's the same, and if it was related to the last line in the sample it would give a difference of 1 not 100 for the hundred-times repeated file. But something's inconsistent between the two somewhere, and also with the original ruby code which gives 954 matches for the o10k.ap, or 1097 for the regex %r{GET /ongoing/When/([^ .]+) }.

From the gnome-panel dials, the erlang isn't IO bound in total on this machine - for the first few seconds it is running at max IO and 90% CPU, then for the remainder 99% CPU and does zero IO, so it's reading it all into memory then spawning processes to scan it. I'll leave it running overnight on the 10 million line file to how it fares when it can't fit the file into memory, though I doubt that has much of a bearing on what would happen on the T2 box, as that has plenty of RAM.

[Update: It took 52 minutes 26, and was doing disk IO throughout, but that could well have been paging rather than the IO it has to. Nothing to conclude, other than that it doesn't scale linearly - 10 times bigger file takes 34 times longer.]

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real 0m8.622s
user 0m0.284s
sys 0m0.248s

fortinbras:$ cat datasets/thousand-o10k.ap > /dev/null
fortinbras:$ time ~/projects/quad-store/bin/read_bench datasets/hundred-o10k.ap > /dev/null
real 0m8.754s
user 0m0.000s
sys 0m0.180s
So the 64 bit-wide matcher on the AMD64 laptop is very IO bound; the difference between the total time of just doing the IO and scannning the lines is negligible.

At 2200 MHz, it's scanning 201 MB data in 0.284s, which is 2200*0.284/201 = 3 cycles per byte processed.

Sun's Thumper gives 2GB/s transfer into memory; the T2 runs at 1.4 GHz and assuming the same IO rate, we have 0.7 clock cycles per byte delivered to play with. 64 hardware threads would give about 44.

Running read_bench on one of the 'ancient creaking' Netras, gives:
tercel-2:$ time bin/read_bench datasets/thousand-o10k.ap 
real 1m15.449s
user 0m0.362s
sys 0m27.998s
That's IO bound, 2,095 MB/75.5s = 27 MB/s. The laptop gets a similar transfer rate figure on the same test.

The T2 is 200 times the CPU, and the Thumper 80 times the disk transfer rate; I'll assume that the T2 system's IO is comparable to the Thumper.

tercel-2:$ time bin/read_bench datasets/hundred-o10k.ap 
real 0m6.159s
user 0m0.058s
sys 0m2.834s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap 
matches: 110000
real 0m7.617s
user 0m4.054s
sys 0m2.504s
At 440 MHz, it's scanning 201 MB in 4s, which is 440*4/201 = 8 cycles per byte processed; it's IO dominated but not IO bound. Optimising the matching code would at best give a 20% improvement.

It's also using about 5 cycles per byte system time for the IO.

Since the Netra runs Solaris 10 and is a Sparc variant, and in the absence of better information, I'm working on the assumption that the 1.4GHz T2 would be close to it in terms of clock cycles per work unit, so to process the log files it would also need around 13 cycles per byte delivered. So either we'd need 13/0.7 = 19 threads, assuming each thread provisions another 0.7 cycles worth of work, or run 64 threads and can afford code that's three times slower than the single threaded variant due to concurrency management. If it's closer to the fat-piped AMD64, then only 6 cycles per byte and 8 threads would do.

Guessing even more widely, with a large dataset and large grain concurrency, the C++ matching code should be at least three times faster than the IO - using 8 to 20 of the 64 threads. The erlang interpreter seems to be around 10 times slower , so would require 80 to 200 threads to balance the transfer rate [correction: only the CPU time is relevent here, so it's 83.649+1.620s vs 0.284+0.248s, which is 160 times slower, so would require 1280 threads to get the same throughput]. If the IO rate is less than 800MB/s and the erlang scales across all 64 cores then the faster matching in the C++ won't give any advantage. I've no idea how good erlang is across cores - the only times I've seen it praised is many (conceptual) processes on small numbers of cores - it's a concurrent language rather than a parallel one. Inter-core communication doesn't come free, and any message still need either lock based synchronisation or CAS/HTM retries; as it doesn't matter which actor works on the data, in a C++ implementation you'd use a fail fast lock and try with a different actor rather than it blocking, so going lock-free would cost more than it would gain. AFAIK you don't have that option in erlang, and are stuck with the given queuing and scheduling mechanisms, though you probably could subvert them.

In computer science there are only three numbers - zero, one and many. If my estimates are anything to go on (and I wouldn't bet more than half a penny on any of them), on the target system you need many threads to solve this problem most efficiently.

Looking at the code for the bfile library, it shouldn't be too hard to move the 64bit wide string match into an erlang library, which seems more fun than fiddling with MPI to get the required thread count, but the required thread count is far fewer than the number of processes in Steve's code, so MPI or pthreads should be a better model than large scale concurrency. But erlang may be to concurrency as lisp is to dynamism - I like lisp, but I haven't found been any interesting problems which I could apply it commercially under the constraints of the business, and even though I end up greenspunning in most systems I write, the subset of lisp-like features which gets implemented to give dynamic behaviour is tailored to the problem, and the rest of the code can be performance optimised more easily.

On the other hand, my fiancée comes back into the country on Saturday and I'm supposed to have finished sending out our wedding invitations by then, and all this isn't really helping that to happen.


Labels: , , ,


Data-wide wide-finder

Running a test with just breaking into lines without the matching, compared to the non-MPI version with matching runs on the 100x file in 380ms user time rather than 720ms. So there is work in both parts of the string processing.

Parallelising the scan for newlines to work on 8 chars at a time gives a reduction to around 220ms.

Restoring the matching, and parallelising the scan for the first character makes the CPU run at 800MHz.

A quick google search tells me that sudo dpkg-reconfigure gnome-applets enables me to use the gnome panel applet to lock the frequency at 2200MHz, so with that set, I can compare results between the linear and SIMD parallel forms. Typical results are around 400ms better for the parallelised version on the laptop:
fortinbras:$ time bin/tbray4 datasets/hundred-o10k.ap
matches: 110000
real 0m10.420s
user 0m0.688s
sys 0m0.268s

fortinbras:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000
real 0m8.622s
user 0m0.284s
sys 0m0.248s
This makes user+sys time go from 950ms to 532ms; if this process was CPU limited, that's as good as you could possibly get from running two cores.

Sorting out a few machine word dependencies so it runs on one of the Netra Solaris boxes (Linux/Solaris code is at tbray6.cpp; I was assuming size_t would be 64 bit on all 64 OSes), the results are:
tercel-2:$ time bin/tbray4 datasets/hundred-o10k.ap
matches: 110000

real 0m7.593s
user 0m5.638s
sys 0m1.894s

tercel-2:$ time bin/tbray6 datasets/hundred-o10k.ap
matches: 110000

real 0m5.915s
user 0m4.027s
sys 0m1.855s
Here we get a much more significant speed up - about 25% faster overall time, due to a different balance between CPU and disk.

These simple benchmarks indicate a few things to me:

Not all gains from parallelism require concurrent processes - more transistors can mean wider data paths as well as concurrent cores. If I were using the architecture specific SIMD registers on the AMD64, then the paths would be twice as wide again.

A slower CPU with a 'server class' disk outperforms a faster CPU with a 'consumer class' disk in this test, but shows up potential to parallelise the code.

I like playing with bit-twiddly code on the weekends. I don't like having to debug bit-twiddly code in mission critical applications for my customers to a tight deadline. The optimisations should be in the regex library, rather than complicating user code - for example, this code avoids reading past the end of the buffer only because it has a pattern longer than the word size.

I'm also wondering whether some kind of hash based wide-scan can detect two or more characters of a pattern well enough to give an improvement, rather than just a wide-scan for first one - there's probably only a couple of bits of information per character in those log files.


fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/tbray6.cpp | wc -l

It's hairy but it's not huge.

Labels: , , , ,


Wide finder, parallelism and languages

Tim Bray, who knows about web search and was part of the XML specification process, is experimenting with exploiting parallelism in his wide finder project.

I'm interested in parallelism (running similar processes on multiple hardware) and concurrency (having multiple collaborating threads of control in a system), as many interesting problems are intrinsically concurrent, and as hardware isn't getting much faster but is getting wider, so will provide more parallel processing bandwidth.

The original problem is one of extracting the N most popular pages (not images) fetched from Tim's ongoing blog from the server's log files. These files are a few gigabytes in size.

Tim's ruby code runs in 13.5 seconds on a 1.67Ghz PowerBook, which is twin core G4. My use of my twin core 1.42 GHz G4 desktop petered out last year when I got a single core AMD64 laptop, as most of the time the laptop was more responsive. The laptop runs Ubuntu 7.04 64 bit.

Measuring the problem:

As it's not a lot of ruby code, it's easy to write a little C++ code to find out where the performance issues are.

First off I was playing with simple string matching (tbray1.cpp) vs KMP (tbray2.cpp), but as it's a requirement that the code should be as simple as possible, and KMP doesn't actually help modern CPUs as it's a jumpy algorithm, the third, simpler approach just calling strncmp works as well as far as using time to measure can determine (tbray3.cpp).
fortinbras:$ time bin/tbray3 datasets/original/thousand-o10k.ap
matches: 1100000

real 2m28.201s
user 0m10.193s
sys 0m4.752s
At that size of dataset, the laptop's CPU didn't get above 800MHz (it has 800MHz, 1600MHz, and 2200MHz speeds and stays at 800MHz). Smaller datasets which are already in the disk read-ahead buffer do cause its CPU to shift up; hence some of Steve Vinoski's figures for erlang times are not indicative of the problem - you don't run the same set of data through the system again and again, so you have to either load something else into the file cache to clear it (cat something-big > /dev/null), or use full-size datasets.

Counting non-comment lines which aren't a single { or }:
fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray3.cpp | wc -l
So a simple, sequential C++ implementation using a memory mapped file is 48 lines of code (my bracing formatting habits shouldn't count against the language), and isn't too slow.

Running the ruby for reference, which does a bit more in terms of counting occurances:
fortinbras:$ time ruby src/benchmarks/finder.rb datasets/original/thousand-o10k.ap
real 1m20.744s
user 0m24.710s
sys 0m4.420s
Presumably Tim's disk speed to CPU speed ratio is higher; on my laptop ruby's IO bound, and the CPU not maxxed, though it does shift to a faster speed. Fortinbras processes the same size dataset that Tim was using in 8.6 seconds.

But the ruby is doing IO much faster than my simple memory mapped code, so changing to use block IO rather than memory mapped (at the cost of a 25% longer program):
fortinbras:$ time bin/tbray4 datasets/original/thousand-o10k.ap
matches: 1100000

real 1m6.780s
user 0m9.593s
sys 0m2.464s

fortinbras:$ grep -v "^ *\(\($\|[}{] *$\)\|\(//.*\)\)" src/benchmarks/tbray4.cpp | wc -l
Again, the C++ implementation doesn't stress the CPU enough to get out of first gear, and it's about 14 seconds faster than the ruby, due entirely to less user CPU. I'd guess that the underlying C code the ruby interpreter calls for its line reading is similar; I don't know whether ruby strings are internally UTF-16 rather than UTF-8; if so that alone would account for CPU cost. I'm actually quite impressed that ruby isn't glacially slow, but I guess most of the work is between the lines of the program.

The C++ code also fewer lines of code than Steve Vinoski's 84 line erlang example, though the C++ doesn't show use of multi-core parallel processing. Parallel code in C++ can take more work. Concurrent code in C++ definately takes more work than in erlang.

Given an infinitely capable CPU and the same disk, ruby's 81 seconds or C++'s 67 will reduce to 55. If the CPU is using 12 seconds, to get anything from parallel CPUs, you'd need to get the data from disk in less than 12 seconds, which is probably not far off what current flash is capable of. I don't believe there's a software solution to make the IO much faster on the laptop's hardware.

Running on a Sun Netra T1 105, with a 440MHz Sparc CPU:
tercel-2:~/projects/tbray$ time bin/tbray4 datasets/thousand-o10k.ap
matches: 1100000

real 1m23.930s
user 0m57.070s
sys 0m24.636s
Much more CPU time, but the total real time is in the same ball park - the ten year old server has a fast wide SCSI disk but a slower CPU, so is close to being CPU bound.

If you have multiple cores and multiple disks to read from, you can launch multiple batch scripts to process different days' logs in parallel, like make -n or using MPI Scatter/Reduce.

MPI version

There's a simple extension of tbrayN.cpp with MPI at tbray5.cpp. I'm an MPI novice - most of the threading problems I've had to solve require concurrency rather than parallel processing. MPI defaults to use as many processes as there are cores, but you can force it to use a specific number of processes. On a single core machine, it slows from 1.2 seconds on a million line file with a single process to 14 seconds with 16 processes, to 54 seconds with 64 processes. Trying to launch 1000 processes causes the machine to spend forever context switching. Running lots of processes without context switching is what erlang is good at, but for optimal CPU throughput you only want as many processes as you have cores.

The MPI version loses in terms of conciseness, and has two separate code paths - the process to read the file into chunks, and those to scan the chunks. It's not as nice to read as the erlang. It comes in at 92 lines of code, 6 of which are error reporting that the erlang example lacks (presumably the error is passed to read-eval-print-loop to handle), so is within a couple of lines. Using pthreads would probably require more lines, as it lacks suitable message passing primitives.

Parallel languages, concurrent languages

IO aside, the actual processing seems to be a problem in parallelism - the task can be split up into work packets which can be processed independently, with a reduction at the end to a single result set. Only about one in nine lines contribute to the counts, so the dataset for the reduction is significantly smaller than the input dataset, so parallelising the matching and maybe one pass of sorting and counting should allow a speed-up on a multi-core machine.

Languages such as Sun's Fortress have intrinsically parallel constructs and atomic blocks to hide concurrency issues from the developer, and take advantage of multi-core hardware. Fortress has built in parallel for and reduction, equivalent to MPI code but without the programmer having to explicitly manage the processes.

In Fortress, the code for wide finder should be no more compicated than the ruby code; it's up to the implementation to parallelise the loop. Unfortunately, the current Fortress implementation is an interpreter on top of the JVM, and isn't speed optimised - it takes a couple of seconds to parse the scripts, and then interprets them rather than generating byte-codes.

Actor model languages such as Erlang, Alef, and Scala, are used where the problem is best expressed in terms of concurrent processes. Their implementation is designed to allow many, many concurrent processes on a finite number of hardware cores - they reduce requirements for locking, have strategies to mitigate blocking operations stalling other actor's execution, and solve many issues that OS using level threads for concurrent, communicating, mobile, robust processes exposes. I've written actor based code professionally where the problem fits that model.

The wide finder problem has no inter-process communication requirement until the workers terminate, it doesn't require more processes than there are CPUs, it doesn't write anything so doesn't suffer from multiple writer threads and locking issues, and it doesn't require much execution to continue if one worker blocks - the blocking will be due to IO, and it can't do anything anyway.

Erlang and Scala are optimised for the opposite problem to the wide finder problem - concurrency within a core rather than parallelism across cores. A better language choice would be one such as Fortress, which gives parallelism across cores without having to express the problem in anything other than terms of the operations on the data.


This problem shows up one of the reasons I'm working on compression for the quad-store - any data mining using spinning disks is IO bound; using compression is one way to shift work load off the IO pipe and onto the CPU. It's also why I spend a few tenners on a second fast disk for each of the Netras in the cluster, rather than a couple of thousand on a new multi-core box. I can't afford 30 dollars/giga-byte for a solid state wide-drive.

It's also an interesting comparison of compiled and dynamic languages - running ruby is CPU heavy, as heavy as running on the 10 year old hardware in my attic, but is allegedly easier to program in. I think I'd prefer it to Perl if I was to write text extraction and reporting scripts, but hate nearly all programming languages anyway once I have to use them. Some are more useful for certain problem classes.

Some of the discussion has been about approaches towards splitting the log file into chunks, and splitting it into lines.

None of the hits in the sample file require you to split it into lines - the pattern doesn't match anything it shouldn't if you ignore line breaks in the code. To split the file into lines you need to inspect every character; to match the pattern you may only need to inspect one in 18 (if it's not in "GET /ongoing/When/" you can then skip 18 characters and look if that's in the pattern, which is the basis of KMP); if the pattern matching can be done better than the line scanning, then that should give a boost on the CPU limited machine. It doesn't give a boost on the Netra, and I'm not sure why, possibly because KMP makes branch prediction worse.

All the code posted has been exact solutions. If you split a billion line file into 100 equal chunks for 100 cores to process, as you might using the built-in MPI Scatter function, you'll lose maybe 11 hits that get cut in half (as only one in nine requests matches the pattern), and you probably won't care - the most popular pages will have thousands of hits, so a dozen here or there won't be missed. Similarly, if you only want a 90% confidence for the most popular hits, you may only need to sample some of the log file. Whether a approximate solution is good enough is the sort of question you need to ask a customer before optimising for the exact case; it may well simplify the solution.


Labels: , , , , , ,


Thinking more about compression

I'm using compression on the URIs and literals in the quad-store to trade CPU for disk/memory bandwidth. The current algorithm is LZW, with a 24bit index look up table. Yesterday's benchmarks indicate it's not doing massively good compression - ; on the other hand it's small and I can see how to preform regular expression searches on the compressed data. 24bit index tables take 64MiB and give around 1:5 compression - reduces to 23%. You can't reset the table like you do in a streaming compressor, since you want the mapping of compressed bits . Bigger tables take more room both - more entries, and the entries become larger as the number of bits to reference another entry increases. If you had 25bit indexes, you'd need 5 extra bytes per entry, so take 154MiB rather than 128MiB and possibly suffer due to cache line mis-alignment.

Since each value in the quad is reduced to a 32 bit identifier in the store, and at least one bit is needed to indicate whether it's fully described by the identifier (uncompressing the identifier's bit-pattern gives the URL), or needs a look-up to longer bit pattern. Obviously, the more URLs which can be squished into 32bit patterns the faster the store will be written to, as it removes the need to look elsewhere to match the string pattern of the value with one already in a hash table to check that the same identifier is only ever used for the same string. But 24+1 bits leaves 7 unused bits of space in half the possible identifiers, which reduces the maximum identifier density to (128+1)/256 = 50.3%, which isn't very good.

Looking at the benchmarks here, there seem to be much better methods than LZW, but most of them are neural net or model based, and I can't see how to map a sequence of hints for a neural net into anything that could have regex implemented on top of it without decompressing. They also are orders of magnitude heavier on the cpu, apparently enough to fade the total performance.

The other options are to not bother with the intrinsic identifiers - which means finding out how many of the URLs would fit in the LZW table for typical datasets (certain of the computer generated ones they would), finding other uses for the spare bits such as 11xx xxxx xxxx xxxx xxxx xxxx xyyy yyyy indicating a 23 bit LZW index and a 7 bit character. Maybe the simplest way would be to have aaaa aaaa xxxx xxxx xxxx xxxx xxxx xxxx where any non-zero in the a's indicates it's a hashkey, so you get full identifier use at the cost of more hashtable lookups.

Holding the full hash in memory isn't an option; even just a population bit would take half a gig. Since each node is single-threaded, it's possible to split URLs from literals and not care whether literals match outside of a smaller MRU table cache; that would increase search times for equality of literals with non-intrinsic ids, but not effect regex searches. It may also be worth maintaining separate LZW tables for literals and URIs. Eliminating duplications can be done in the background anyway.

I think the next experiment will be scanning the datasets I've got - swoogle's ten million triples, wikipedia in a couple of formats (dbpedia and the xml dump), and the uba owl generator - to see what hit rate and spread I'd get generating ids using LSW and hashes.


Labels: , ,

First compile and run on Solaris 10.

I've got first run of one of my quad-store benchmarks running on my Solaris 10 boxes.

I needed to make the machine architecture and a flag whether or not to use SSE2 environment variables, and add some extra #defines so none of the SSE stuff was included. I needed a macro to fake posix_memalign, though will think about using another malloc-based version. Compiling on Solaris also generates warnings for automatic promotion of -1 to an unsigned int so I added explicit casts (-1 is returned by some functions where 0 is a valid return and something is needed to signal an invalid value).

Running the compression benchmark, the sparc (440MHz) doesn't seem to get cpu bound for small files (100MiB):

tercel-1$ uname -a
SunOS tercel-1 5.10 Generic_118833-33 sun4u sparc SUNW,UltraSPARC-IIi-cEngine

tercel-1$ time bin/lzw_bench datasets/original/triples.sql 100
read: 100 MiB, compressed to: 21 MiB.

real 0m23.094s
user 0m20.973s
sys 0m1.715s

tercel-1$ time bin/lzw_bench datasets/original/triples.sql 1000
read: 1000 MiB, compressed to: 301 MiB.

real 3m55.124s
user 3m39.377s
sys 0m12.550s

fortinbras$ uname -a
Linux fortinbras 2.6.20-16-generic #2 SMP Thu Aug 30 23:16:15 UTC 2007 x86_64 GNU/Linux
fortinbras$ time bin/lzw_bench datasets/original/triples.sql 100
read: 100 MiB, compressed to: 21 MiB.

real 0m8.638s
user 0m5.784s
sys 0m0.296s

fortinbras$ time bin/lzw_bench datasets/original/triples.sql 1000
read: 1000 MiB, compressed to: 301 MiB.

real 0m56.176s
user 0m43.667s
sys 0m1.948s

The processor in fortinbras is Mobile AMD Athlon(tm) 64 Processor 3400+ at 2200 MHz, the Netra at 440 MHz, and appears to be about a quarter as fast at a fifth of the clock speed.

On the other hand, running the Netra at full doesn't heat my lap up, I've three of them, they've more, faster disks available and I can play with making the quad-store distribute between them.

Whether they hold up as well against the benchmarks which use SSE to make the Athlon wider I'll have to find out.


Labels: , ,


Dynamic languages tools

Link: http://blogs.activestate.com/ericp/2007/01/kid_adding_a_ne.html

I got a CD with Komodo Edit on it at XTech, but haven't tried it until now. The recent open-sourcing of the code, and then this post appeared on Planet Mozilla.

It looks like it would be good for DSL development, and playing with it Komodo Edit seems to be very usable. My current editor of choice is SciTE, which Komodo Edit embeds, but Komodo Edit gives projects, and language aware autocompletion and navigation (go to the definition of a function). I haven't found out if it's possible to go to definitions in other files, which would be very useful - I spend quite a lot of time doing Ctrl-C,Ctrl-O,[filename],Ctrl-F,Ctrl-V,Enter to jump around a JavaScript project. It also does polyglot programming - a simple XUL file with a script element switches nicely from XML highlighting to JavaScript highlighting in the script element's content.

The user defined language tool is lexer + state machine based; I'd have preferred it to be a packrat based one, which I've played with before. But I don't want to have to write a whole IDE, I like using the Scintilla editor control, and Komodo's approach seems to get the job done.

As a side note, del.icio.us seems awfully slow these days. You'd have though that Yahoo would be able to give it a couple more servers. I wonder if the competitors are any better.

Labels: , , , ,


Rich Application Platforms

Well, there's been a bit of a kerfuffle about whether or not Silverlight (nee WPF/E) supports binding, widgets and XMLHttpRequest gubbins. And Adobe has open sourced Flex, as well as having had its Apollo platform out for a bit now. XULRunner is still going, being used for SongBird and Joost, and freshly the DVD version of Wikipedia, though whether or not Mozilla is promoting it well as a platform is under discussion.

Having done an awful lot of XULRunner programming in the last few months, there's still a lot to be done before its an easy to use platform. Quite a lot of undocumented interactions cause crashes - particularly race conditions between XBL bindings being attached and scripts in the body of the document being run, and any manipulation of the tree in script during an XBL constructor. But I don't have any idea how well Silverlight or Apollo handle such things, or even if they support anything as flexible as XBL. There needs to be an XULRunner 2.0 that is a lot more predictable in its behaviour, and less arcane in its invocations.

There's also a big question for the sort of system's I'm interested in - can a platform do something like the ClusterBall WikiPedia visualisation? XULRunner SVG grinds to a halt at a thousand or so arcs, and rotated font anti-aliasing is a bit off. Flash, using Haxe, can render ten thousand arcs, but getting text to rotate is a hassle - it requires embedded fonts, and that means more tools, you can't just set the rotation property. There's quite a lot of weirdness in Flash, half of it is object view graph based, the rest procedural, not everything works and it doesn't play nicely with the browser, for example capturing keyboard input and not being part of selections. As far as I know, there isn't an open high-quality 2D graphics library other than anti-grain 2.4 which will scale and isn't, well, a bit weird. But anti-grain has moved to GPL, and isn't hardware accelerated at 2.4, so what will become of it I don't know. I'd really like a good, object oriented language, with a mix of dynamic and static typing, support for mesh tuple spaces, event queues, first class functions and relations, orthogonal persistence, which runs fast enough, and generates portable, high quality visualisations and information displays. None of the options are quite there yet, but XULRunner's JavaScript and SVG seem to be the most sane.


Labels: , , , ,


More rdf bits

I've spent a little more time on quad-store, and found some more realistic performance figures from Franz Inc and the ESW wiki.

There's now a reasonably fast in-memory sorting (using radix 256), around 2.2s/10 million triples, which should help indexing small stores. I also have had a look at MPI for linking nodes together.

I think the premise may well hold - as the limiting factor for growth is IO and memory rates, not CPU, so trade CPU for compression and limit the number of values a store can hold, so making the store's representation smaller, reducing IO. Structure for locality and divide the store into nodes when the limits are exceeded. I still haven't found a large enough dataset to push the 230 values it can hold. And 230 values is up to 290 triples, as each triple has 3 values. 2 bits of the 32 are reserved to indicate whether it's a compressed URI, or a hash table URI, a blank or an literal.

OpenLink Virtuoso also seem to be using something similar in their 'bitmap indexes', though limiting to 32 bit values gives you a set of sorted b-trees of blocks of quads, rather than indices - in a pure index you have to look up the value, so incur an indirection.

I'm not yet bored with playing with this, as I am learning quite a bit about how bandwidth limited systems behave, whereas most of my previous work has been cpu limited numerical stuff, distributed services, or km and visualisation apps.


Labels: , , ,


MegaData-QuadStore micro benchmarks

2007-04-05 21:36
Starting from Joe Gregorio's BitWorking post on 'megadata', I'm wondering what happens if you work this thing from the other end - rather than trying to make a database that scales well over distributed compute nodes, instead focus on optimising each node whilst trying not to compromise external scalability.

Getting parallelism working, whether via network or just concurrent threads in the same machine, is harder than straight-ahead computing in a single node.

Since my background is in engineering computing and simulation (with a little KM and recreational prolog), I've little idea why it takes 15 minutes to load 10,000,000 triples into a RDF store , whereas reading a 150 MiB file itself takes seconds.

So I'm trying to find the cost points for nodes, rather than looking at distribution schemes here.

For this, I'm taking a node as a virtual processor and half a GiB RAM, partly because my laptop has a single non-hyperthreaded core and 640 MiB RAM, so is a good model for a single node in a server array. A quad core hyperthreaded server with RAID should be eight nodes -eight times my laptop in Mips, disk speed and RAM.

At that size, a node doesn't have to handle more than a billion tuples, and I'm assuming a little bit that some of the strategies for distributing queries over nodes will still apply to such nodes. Limiting a node to a billion tuples also makes coding easier - you can use 32 bit identifiers, as long as they are reasonably local to the node. By 'reasonably local', I'd expect some number of IDs to be common to all nodes to represent the most frequently occurring term values, but distributed queries not to have to rely on having unique IDs that span between nodes.

Most of the RDF stores seem to be written in Java, and what I've seen of the implementations do a lot of pointer chasing, so this is a bit of fun to see what we can do with SSE2 to make a RDF-like query engine without pointer chasing.

If each tuple in the store is a quad vector of 32 bit ints, we can do quite fast searches using SSE2. As John Sowa predicted, many triple-stores have found that triples aren't enough, and are quad stores now. Though maybe a hexa-store is best for human-oriented data, as most of the relations we come up with have arity less than seven.

The log of my first week's hacking is available here.

Importing from the 1716 MiB SQL dump mentioned in the BitWorking post's comments, using a compression schema and a string table to generate a 160 MiB table of around 10 million triples and keep the full form of each quad within the memory limits takes around 2 minutes, around twice disk throughput speed.

Getting that far took most of Good Friday and Low Saturday, then I went to my sisters for the rest of Easter.

Coming back and coding in the evenings this week, I was playing with merge sort on the quads. Sorting one term over a 10 mega-quad table takes around 9 seconds, which is around 24 passes with merge sort, so visiting 28 million quads a second at ~85 clock cycles per quad (clock speed is 2.4GHz). That does seem a little high, as there's only around 20 instructions in the lowest level loop, though there is a branch. Using SSE to eliminate the branch costs more than the simple C variant.

Thinking about indexing next, it's noted here that
to generate indices where all quads matching two terms are contiguous, you need an index for each selection of two terms from four, ie six indices.
Let P, O and G denote the four terms in the quad - subject, predicate, object and graph.
For one or three term queries, four indices suffices - for one term, one index for each of SPOG, for three terms, one index sorted on the three terms in any order.
For a query of four terms, any index will have the matches contiguous.
So for quads, six indices will give you an O(log(N)) match on any single part of a query, as long as SPOG appear first and last in at least one of the indices.

So a selection such as these might be used:

SPOG -> S, SP, SPO matches contiguous, SP matches ordered by O over G.
POSG -> P, PO, matches contiguous, PO matches ordered by S over G.
OSPG -> O, SO, matches contiguous, SO matches ordered by S over G.
GSPO -> G, SG, SPG matches contiguous
GPOS -> PG, POG matches contiguous
GOSP -> OG, SOG matches contiguous

Based on the assumptions I've made about what a node in a distributed array needs to provide, ie 32 bit IDs is enough, and how well the sample quads compress, these don't have to be indices - a reference into a file takes as much space as the data itself - instead they are the IDs. I'm using these for direct matches, not comparisons such as SPARQL filter provides; you'd filter a potential, or create a union of a small range (such as 2 < x <= 5 translates to x=3|x=4|x=5 for integers).

I'd expect that most of the time you'd want the results around a single subject, particularly if you have lots of blank nodes, ie you'd be using one ordering to get ?PO_, then use SP?_ (? desired term, _ don't care, SPOG specified value).

Taking the example query from http://xtech06.usefulinc.com/schedule/paper/18 :

SELECT ?pub ?title ?issue ?article
?title rdf:type struct:Title .
?title dc:identifier .
?issue prism:isPartOf ?title .
?issue prism:volume "20" .
?issue prism:number "4" .
?article prism:isPartOf ?issue .
?article prism:startingPage "539" .

An optimiser may first look at the POGS and OGSP orderings to determine the spans of the predicates and literals to determine which patterns to match first.

Either you can build a topological sort and start with finding ?title candidates, then ?issue candidates, then ?article, or you can merge the results of all matches with one ungrounded term.

The thing about this index approach is that this pattern:
  ?title    rdf:type        struct:Title .

would use the POGS index, and the result would be far from the result for
  ?title    dc:identifier    .

This lack of locality, I think, is going to be a pain point for this kind of normalised store.

Whether it's an index with pointers into a larger store, or an ordering of a compressed form of the quads, you've got a cache (disk or memory) miss, which is why I'm trying to implement this without chasing pointers.

But so far I'm happy that it's possible make a midi-store node that's good for a billion quads, and then distribute a mega-store between them using a higher level protocol than that which is optimised for the node to use internally.


Labels: , ,


Gpu and meta programming

This last week I've been looking a bit at some of the gpu toolkits that I mentioned last week. Mostly that has concerned getting used to building them, since my home computers done have high enough spec graphics cards to run them. I'm getting tempted to buy a shuttle to play with, but like using my laptop more than a desktop, and it seems a waste buying a top-flight graphics card then interacting with it via a laptop and ssh (which is how I often use my Mac).

I've also started porting a large program from Windows to Linux so I can operate on it using oink static analysis tools and Mozilla's scriptable metaprogramming tools. Actually it only needs to preprocess on gcc rather than be ported, but it would be nice if I could run it at home.

Also this got me thinking about how to do an efficient triple-store.


(removed link to chocolate bunnies pic)

Labels: , ,


Language of the year

For me, this year's language is assembly. I did a little 6502 assembly as a teen - wrote a dissembler for the BBC B's OS and mucked around with its interrupts in high school and some 6800 during university, and did a tiny bit of 386 inline assembly for loop optimisation of Gouraud shading parts of my MSc in 1992, so it's not quite a new language, but it's one I haven't used in a few years.

It seems that there is more assembly around these days - some marginally disguised such as C++ SIMD intrinsics, but quite a bit of the lock-free code I've been seeing is assembly. Perhaps I'm just looking at more parallel and concurrent code. It seems the high level languages we currently have don't have abstractions which match multi core architectures, and so we have to make our own low-level libraries to map them to them.

There's also big gains in using GPUs for parallel computation, which is again getting closer to the hardware, though there are C and C++ tool kits for GPUs out there. The last few years I've looked at languages which have been higher - Python, Common Lisp, ML, JavaScript. Fortress stands somewhere else - it's a very high level language, but very careful to have a rich enough type system to allow a strongly optimising compiler for parallel hardware. I still don't believe that type systems have to be static to be optimised, but they do need to be strong.

Targeting these new architectures, the PeakStream platform looks interesting - though it seems to only target workstation GPUs, and may not be a platform (which I take as synonymous with virtual machine) but rather a set of GPU optimised libraries. As my only home machine is a laptop without I don't have machines which have its target GPU or one for NVIDIA CUDA. The Sh shader metaprogramming library also looks like the sort of thing that may be useful - it uses C++ operator overloading, much like Boost lambda or RealLib to generate an AST of the expression, which is then calculated on either a CPU or a GPU backend (RealLib expressions are calculated using the CPU, sometimes with SIMD intrinsics, to required precision). I'm interested as to how Fortress' parallel for might get implemented on a GPU, and whether it's still necessary to detect glitches in gamer rather than workstation class GPUs - traditionally the gamer class GPU's ran hotter so weren't guaranteed to produce accurate results all the time, but there's no mention of that in recent generations.

This post is delayed from Thursday as I was off in York, finding a venue for my wedding reception. My lovely lass stayed with me for a fortnight, but now she's back in Glasgow :(.


Labels: , , ,


Programmable toaster ovens

"In contrast the Java community has stagnated, stuck with a language designed in a hurry to power toasters. "

Random image of the Banana Jr. in 'Bloom Country' which, when it's owner phoned up the support hotline as it was misbehaving, was put on the phone and the technician threatened it with its chips being used for programmable toaster ovens.

So much of Java seems to be very same-y, putting a web front on a database, EJB for scaling out. Tuple spaces are interesting; C4I systems are blackboard based, some of those fielded ten years or more years ago. It went mainstream in that market a decade ago.


Labels: , ,


Getting started in C++ on Linux

I've recently updated my Ubuntu partition on my laptop to 6.10 Edgy Eft (presumably an orange newt rather than a religious movement), which plays much more nicely with its touch pad, and am using it about half the time - mostly I live in Firefox, which doesn't care much what the OS is.

I've hidden my Civilisation DVD at the office, so now have other things to waste time on, such as finding out how to program Linux. So I've downloaded Eclipse and am getting the C++ tools as I write. KDevelop doesn't find the help system on the Ubuntu I'm using (which is Gnome rather than KDE), and I think I need help.

Today at work I was trying to formulate a pi-calculus representation of the C++ networking library I'm maintaining, which has a concurrency bug I can't seem to locate. That was hard, so I'm falling back to creating an object syntax shorthand that can be simulated. I'll map the semantics to pi-calculus; it's all sugar, but means the representation that's animated in the simulator is closer to what's being modelled, and you don't have to use the rather complex forms for data structures and places presented in Milner's work.
Putting a CD of the installer image into the drive started package manager, which meant I could get rid of the errors in the install of graphviz-cairo, which had failed from Add Programs, and was causing other programs to fail to install.
Running Eclipse as a normal user doesn't allow you to add the C++ plugin; whereas normally programs ask you for permissions, Eclipse just fails.
Running Eclipse again using sudo, it has to download all the installer files again. But this time it seems to be able to write to the plugins directory where Eclipse is installed by default.
Restarting, still under super-user, seems to have C++ tools. HelloWorld.cpp compiles and runs.
Close and start under normal user account; it's forgotten my choices (which is not surprising), but appears to be all there.
Created a new C++ managed make project.
Opened a file on my Windows partition, chose save as, clicked OK assuming it would save to my project's workspace. Instead, Eclipse crashes.
I must say that the reason I have never used Eclipse is that each time I have done so - on Windows and OS X, it has fallen over far too quickly for me to get anything done with it. It's also got a pig-ugly UI. It looks like the same may be true on Ubuntu.
Restart Eclipse. File>Open doesn't locate previous project. Try to create project again, Eclipse says a project with that name already exists. Project > Open is greyed out, so I can't use that, though why I can't open the project if it exists, or create one with the same name if it doesn't.

Deleted the directory in the workspace, which was empty. Attempt to create a project with the name, but still Eclipse prevents me.

Exit Eclipse, find the empty project directory has re-appeared event though it could not be created, so delete it again and the workspace folder.

Start Eclipse, try to create project again, again it complains "A project with that name already exists in the workspace." A quick search says "This error appears when you have the folder with the same name as your project in your workspace." This is not the case.

Give up and call the project something else for now.

Exit and get an error reporting " Problems occurred while trying to save the state of the workbench."

Start Eclipse again. It reports that an error has occurred.

Remove workbench folder and try again, now Eclipse loads. Try and make project with desired name, and that works now.
Eclipse is saving workbench info in hidden folder in workspace; so the metadata there needs to change rather than just deleting the folder; deleting the folder does not cause an refresh of the metadata to ensure it is consistent.

So doing the obvious thing to erase a project doesn't work, and Eclipse requires that nothing else touches its project folders. Which is a bit of a problem, as I am wanting to use the same source folder layout for both Windows and Linux targets.
Open files from the Windows partition; they are read only, so it's not quite going to be as easy as I hoped. Use File>Save As and save first file to workspace, having to navigate there from /home/pete. Select second file (there are several), and it's forgotten where I am saving to. Instead of navigating, I enter the full path in the name box and hit Ok. After a brief pause, Eclipse crashes. F__k knows where my file went.
Copy files to workspace/project_name in command prompt. Attempt to add them to the project, get "Source is in the hierarchy of the destination." error. Try again, this time importing from the Windows partition directly. Select the overwrite without warning option, but get warnings that files are being overwritten. But now it seems to have imported the files.
Ok, so we are now compiling, and getting errors. The help option brings up a browser in a
tiny sidebar which searches Google. Yay. Given I was using Eclipse in the hope of getting proper help, this is not good.
What's the conditional define for Linux?
g++ gives some nice warnings, for example member initialisation order in constructors
SSE doesn't compile as the option for g++ is not set. Eclipse shows a blank page for the help topic about setting makefile options.
Found g++ options in project properties (which is fairly obvious place). No option to enable SSE, so add option manually. Not too difficult this step.
The little hash table example compiles and works; tomorrow if it snows I'll be working at home rather than driving into the office, so I may try to get a basic simulator for a process calculus running under Linux.


Labels: ,


SSE String

(to rhyme with silly string)

Since I'm not seeing the fine lass this weekend, I've spent some time twiddling my bits - playing with applying SSE2 to string algorithms.

SSE2 operations shift the costs of certain operations drastically, turning O(N*M) to O(N) for M <= 16, and making register comparisons far faster than loads. I've played with algorithms for searching for a single character in a string and for searching for a substring. The platform is WinXP, AMD 3400 mobile. Applying SSE to the naive algorithm for finding a character - loop and compare each - runs around 5.3 times faster; using _mm_cmpeq_epi8 to compare 16 bytes at a time.

Applying SSE to the naive algorithm for finding a substring - find the first character in the substring, then test the remainder - gives a 2 to 5 times speed up (best case is where the first character never occurs in the string) using SSE to match both the first character (compare a block of input with the first character 16 times in parallel), then to compare with the remainder (compare the current input to the pattern 16 bytes at a time).

Using Boyer-Moore on the same input gives results usually between the two. Attempting to use SSE for the inner part of Boyer-Moore doesn't help - extracting the index of the first mismatch from the 16 bit pattern indicating which characters in the pattern match the input costs more than the non-SSE version.

Boyer-Moore-Horspool is faster than Boyer-Moore, and faster than naive+SSE for longer strings, but its simplifications don't seem to be enough to admit any obvious parallelisations. The speed advantage over Boyer-Moore is probably that the algorithms are implemented as functions, and Boyer-Moore has to allocated and free an array on each call. So the performance would be different for a pattern matcher object.

Knuth–Morris–Pratt seems to under perform on these tests, again probably due to the dynamic memory.

I need to think about SSEifying Boyer-Moore-Horspool and Knuth–Morris–Pratt, and create another test harness that allows for pattern objects so the initialization of the tables in the more complicated algorithms doesn't count against them. But in a single function use, it appears that naive is best when the first character is rare, and Boyer-Moore-Horspool where the string is longer. Rareness also helps the algorithms with dynamic memory, as each is called until all the input is scanned, and the rarer the pattern the fewer the times the function is called.

Before I do that, I'll get the core object system in kin working (see next post) - I'm just about happy to go on with that now.


Labels: , ,


Current reading

Following the announcement of Mozilla Tamarin - the adaptation of Adobe's ActionScript virtual machine to give JIT compilation of JavaScript in Gecko applications, and the starting up of the SmallTalk.rb project to get Ruby running on a better VM, I've been doing a bit of reading:

StrongTalk - fast, typed smalltalk
- interestingly, unlike ActionScript, the typing is orthogonal to the speed optimisations
- everything is implemented as mixins, functions which add features to classes

Mirrors - a mechanism for separating reflection and conventional object APIs

JS Typechecking via JS-Sourcerer or a more formal approach which I don't know half the notation of
- JS-Sourcerer doesn't have a model for the interfaces to XUL, and doesn't accept IDL to define it (though it should be possible to create an IDL to its proprietary type definition language)
- there's also a more complete modal type model; I need to investigate more of the theory here, and here.
- at the least you can infer types at any execution point and then assume that changes to object's or prototypes relaxes assumptions
- I need to find whether most type inference systems for dynamic languages are Prolog-like, and whether a Rete-like system is more suitable for dynamic languages
- assuming the cost of the inference actually is worth it in execution terms; it may be better to do some pattern matching then allow constraint violations to push execution back to the interpreter. So choosing an inference engine which is suited to the pattern matching may be more important.
- Type Feedback vs. Concrete Type Inference - maybe TI in the bytecode generator, then TF in the VM?

And, in other news, there's the re-branding of the Semantic Web as Web3.0. Personally, I've been using Web3 for community driven 3D on the web. Let's see if RDF hacking get as popular as World of Warcraft machinima or Virtual Life bartering.


Labels: , , , ,


Getting annoyed again

Lately I've been rationalising and refactoring the code base of the large-ish XUL application which I've spent about half my work time this year prototyping.

Which naturally means I'm annoyed with JavaScript this week.

  • I really need a type checking JS lint. We're doing demos tomorrow, but after that I'll give JS Sorcerer a whirl. Last Friday I wrote my ToDo list, which currently has 117 items on it, my line manager is talking quite seriously about getting two or three extra staff to share my workload.

  • Managing packages and dependencies is a little clunky. This is true of all languages I've yet seen, and possibly Java is least worse (I've not done anything real in Common Lisp so don't know, though it seems neither better nor worse). I don't want to have to use a script to build the XUL files to ensure that everything is included; part of using XUL was to make it easily built.

  • I'm using the observer pattern a lot in the UI. This means virtually entity in the UI uses __defineSetter__ to map each of the object's properties to a structure containing the value and an observer list. I also am getting performance hits where multiple values are changed within a function, so observers are called repeatedly rather than firing them only one the values are all consistent. I really would like a language with transactions, concurrency and observers built in. (I must re-read the Fortress spec to see if it includes observers, or at least hooks on its equivalent of (setf (slot-value foo 'bar) baz) to allow them to be added).

  • There's nice JS syntax for initialising objects, and mechanisms for prototype inheritance, but there still seems to be repetition in creating constructors for objects with prototypes, lots of function Foo(a, b, c) { this.a = a; this.b = b; this.c = c; } Foo.prototype = new Bar();, where I'd like to get rid of all the this.'s. Maybe some type of prototype + merge/copy rather than classic prototype + constructor.

  • Probably my own fault for not using E4X, but creating SVG using the DOM is a pain.

  • Every now and again I still get bitten by XBL's asynchronous loading.

On the other hand, R. is back from paternity leave and so we now have subversion installed, so supporting the users who are using earlier branches should be less stressful.


Labels: ,


Better GUIs

Every now and then I go and have a look back at Java land. Yesterday I found this blog entry
John O'Conner's Blog: Better GUIs are one step closer.

It's very painful making good layouts in Java. For my last large Java UI project, which was based on porting a large mainframe ISPF application to run as a desktop application on PCs, I ended up implementing a layout and look-and-feel with most of the CSS box model on top of swing. I would have used XUL for it, but it was a Java shop and adding another platform was too political. There already is CSS look-and-feels in Java, so it shouldn't still be an issue getting things lined up right.

Anyway, I tried to do the same thing in XUL, so that it looks like this on a Mac:


Now, this took a little while while I remembered that you have to specify widths to ensure each flexible box ends up the same size, but the declarative syntax means you don't really need an IDE and a graphical editor, and the full CSS support means you can skin things if you like.

But that's not the real problem. Better guis are not just better aligned guis - they are concerned with user experience. And if you have to produce that amount of code just to align your fields, then you won't be agile enough to respond to your users.


Labels: , , ,


XTech 2006 - Friday

Link: http://xtech06.usefulinc.com/schedule/#d2006-05-19

(* TME: Since I was supposed to write this up for work, I may as well finish putting it up here *)

Eric van der Vlist, an xml-dev regular, introduced his Treebind framework for binding hierarchical data to a Java object model. It is based on coding conventions and reflection, with full paper here. Seems pragmatic and well designed.

Oleg Parashchenko showed us his XSieve transformation language. It incorporates pieces of scheme into XML for functional transformations, bridging XSLTproc and Guile. Technically interesting, but I'd rather use XSLT2 or scheme rather than mixing.

Jeff Barr spoke on Amazon Web Services . Amazon is providing various APIs into its compute infrastructure, for example the Alexa web index, S3 for storage, Mechanical Turk for coordinating human tasks. The S3 model (simple storage service) allows either thick client applications to use it for archive or publishing, or can be used for hosting application-on-a-page, such as this wiki. (* Amazon is also offering compute facilities, competing with Sun's Utility offering. *)

The Mechanical Turk is interesting - for non-time critical, non-secure 'AI' problems it shifts the balance. Already you can get speech transcribed; go the other way any you have the book from The Diamond Age. If it gets out to the world's population centres, which projects such as One laptop per child promise, then services like this could shift AI massively toward concerted augmentation.

There's obviously lots of things that can be done once you have large providers doing the systems management stuff and giving simple APIs to lightweight languages.

Brendon Eich introduced JavaScript 2.
They're working on an update that can be implemented on small devices (Opera having input on this), that also facilitates Programming in the Large.

They are adding packages and type annotations, which may or may not help PitL - inferring traits should give as good performance improvements, and the type system isn't much more powerful than Java's, though with the addition of indicating where a reference may be null and arrays as structured tuple types, a bit like Nice or even Haskell.

JS2 will also have classes and interfaces. Classes are sealed - you can't add arbitrary members; presumably interfaces will, as well as being contracts between JS objects, also allow JS objects to map to XPCOM interfaces without having to explicitly provide a QueryInterface implementation.

Array comprehensions and generators are also coming into the language, so it's not all Java influenced.

In some ways the improvements are good - moving JS towards 'industry best practice' for PitL - though I'd have preferred something that allows implementations to break encapsulation of objects, and JS2 seems to be strongly XPCOM component oriented.




XTech 2006 - Thursday

Link: http://xtech06.usefulinc.com/schedule#d2006-05-18

Same story. Trying to sleep off cold. Thursday was mostly in the browser technology stream, starting with the second seminar.

Daniel Glazman really believes in Etna (talk | project home). It's always good to hear someone who's both passionate and technically competent talk.

Etna is an open source XML editor, wysiwyg (using CSS). It is being built to allow non-techies to write XML reports, so its non-functional requirements include being able to be used by people with no wish to learn XML.

It uses its own RelaxNG engine, with extensions to provide initial content, and mappings for key events (such as ending a list when the user presses [↵] twice).

Further RNG extensions for external doctypes, processing instructions, label and description of content, localisation properties, and behavioural semantics. By behavioural semantics, Etna allows you to specify tabular, list, outline or block style editing controls - although the CSS could make any XML look like any of these, the idioms the user uses to navigate and edit such content varies, and Etna supports these idioms.

All operations in Etna yield valid XML - gone are such low level operations such as renaming a div tag to a paragraph tag - instead it uses pattern matching to transform between content models, preserving the text content and a maximum amount of structure. This approach was somewhat contentious with some of the audience, since it isn't obvious that it will allow all the operations an old sgml hack would want to do in emacs, but it is the right thing to do in a wysiwyg editor, and I think it's the right thing for a DSL editor too. (* TME: Though for a DSL, I would want the editor to use a different presentation HTML or XUL tree than the XML representation of the AST, and to be able to modify the type labels of the AST based on the text content, which is beyond where Etna is at the moment.)

Mark Birbeck gave a talk about building rich, encapsulated widgets using XBL, XForms and SVG. As before, no mention of breaking the SVG presentation apart into layers, nor how to cope with removing content from the document tree and losing the bindings, but still interesting to see other things you can do with what's available.

Thomas Meinike gave a talk about SVG under Firefox 1.5; nothing new, but he's collected some reference material.

Ryan King gave a talk on the design of microformats (I'll ignore his USA parochial title). Basically - don't repeat yourself + pave the cow-paths. Reuse existing HTML and annotate with semantics, allowing convergence and openness and a very low entry point.

When designing a new microformat, he asked these questions:
  • Is there an HTML element that will work?

  • Is there an HTML compound that will work?

  • Is there an existing microformat that will mostly work?

  • Are there any existing standards which are both established and interoperable?

If the answer's yes to any of them, then you don't create a new microformat.

When creating microformats, take a large sample of the existing documents on the web that contain information of the type you are trying to add semantic annotations to. Design your format so that it's in keeping with the consensus of those documents.

After that, I took myself away from the browser crowd and went to hear Benoît Marchal on UML modelling for XML. His main points were

  • Integrate XML schema into the rest of the design using the same UML toolset

  • Non-UML tools favour hierarchic schemata; UML allows more horizontal relations so can get a better data model

  • Use the UML tool as a database with a graph-based front end, not just a drawing tool.

  • Use XSLT on XMI to generate the schema. Use an XSLT pipeline, with an intermediate form to simplify the XMI and standardise it across tools.

(* TME: Although I think most current UML tools are poor from a usability stand point, and the lack of denotational semantics in UML2 and laxity in the XMI2 standard restrict model portability, I sympathise with most of what Benoît was saying. I also tend to go further and use XSLT for code generation instead of any vendor specific MDA implementation, again for portability and independence. Some of what he said about schema generation is specific to W3C XSD, and since I work on largely greenfield development I live in a RNG world, aren't as pressing *)

Next up on the browser track was Dean Jackson of the W3C, who is working with the industry to update web standards, such as XMLHttpRequest (why one acronym is all caps and the other is camel cased, I'll never know).

Then came XForms: an alternative to Ajax? You can do some things declaratively using XForms rather than procedurally using JavaScript. The same synchrony and complex rendering issues apply as to XBL.

That was the end of the day at 17:30, and after staying a while at drinks, I went to my room for a nap. Got thinking about yesterday's schema validation talk and today's Etna and other browser stuff, ordered pizza and coffee from room service and started hacking this, getting to bed around 5am.




XTech 2006 - Wednesday

Wednesday morning I had a cold and was late getting up, so missed Paul Graham's keynote on startups.

SQL, XQuery, and SPARQL: What's Wrong With This Picture? covered the lack of difference between the power of SPARQL vs recursive SQL. This means either that you can use existing optimised SQL databases to store and process RDF triples, converting SPARQL to SQL3 as required, or use RDF at the perimeter and SQL internally.

(* TME: Since SPARQL update seems to be a long time coming, and the restrictions imposed by triples make even simple context problems more complex, I'm not sure if there is ever going to be any merit in RDF. I overhead someone from Sesame saying they are moving to quads, because of the many use cases that triples don't meet (I think the tipping point would be hepts, but that's more anthropomorphic than graph-theoretical). Similarly others have suggested named graphs, eg TriX, as a solution to the context problem in RDF. Currently I haven't seen anything that would indicate a rest based, plain ol' xml interface onto an sql database wouldn't be better for anything with context, uncertainty or transitive relations. The better RDF applications all seem only to deal with information that is equally trusted and true. *)

Michael Kay spoke on Using XSLT and XQuery for life-size applications. Speaking of using fixed schemas to validate human data - for example the format of an address field which forces you to enter a house number forcing the operator to massage the data to fit - 'Integrity rules mean you only accept data if it's false.'

He tends to observe the documents which are collected in the business process, rather than trying to get the experts to create an abstract model. For example, an employee's record is a collection of completed forms. So if you want the current employee state, query the history and make a summary taking the most recent values.

Applications are built by composing pipelines which transform these documents. Using automatically generated schemata to describe the intermediate stages in these pipelines gives regression tests.

(* I like Michael's anthropological approach, and imagine that it would build applications that augment the current process, rather than trying to impose an idealised process - which is what some managers attempt to do. *)

Next up, Jonas Sicking of Mozilla talked about XBL2, a revision of Mozilla's XML binding language.

Some of the improvements are tidier means of binding XML (the original XBL is somewhat polluted from being an RDF binding language), support for better separation of implementation and presentation, richer binding mechanisms and plugin security improvements.

(* All of these seem good, but doesn't address the three show-stopping problems I've had with XBL which have meant I've had to stop using it everywhere I'd like to:

  • The binding of methods and properties is asynchronous, so you can't easily create an element with an XBL binding, then call Javascript methods on it.

  • The method and property bindings only exist when the element is in the document tree. This complicates Undo and Redo, as removing or re-adding an element changes its interface.

  • Trying to use XBL with SVG, sometimes the same data is presented in multiple layers, so there isn't a single point to insert content.

Everywhere I've hit the third point, I've had to move from XBL; most times I've hit the second, XBL got too painful and I've moved away. I'm still thinking about moving away from XBL for some of the first case, but probably will have by the time I'm selling software based on XUL as having intemittant glitches caused by races is not a good user experience. *)

Ralph Meijer (who was also at BarCamp) allegedly spoke on Publish-subscribe using Jabber. Or maybe he spoke on publish-subscribe using XMPP, since there are more applications than Jabber. This was interesting in as much as seeing how much take up there was so far - not much, though some other people have noticed that it has potential both for distributed computing, for human interaction, for location based systems, and for hybrid networks. It's been mooted that in the internet of things sensor nets will dominate over comms nets; but already we communicate by publishing our availability and playlists - your iPod is another sensor measuring your mood; a keyboard/mouse activity monitor on your pc is sensing your presence; pushing smarter sensors to the periphery mean it's all the same pub-sub tuple space.

Vladimir Vukićević spoke about canvas and SVG; more a putting a face to the implementer, since I've been playing with canvas for nearly a year, and SVG since last millennium. But all good, and there will also be OpenGL bindings for canvas in the future.

Henry S. Thompson, whom I've often enjoyed the wisdom of on xml-dev, gave a seminar on Efficient implementation of content models with numerical occurrence constraints. Transforming schemata to finite state machines, then adding counters which are incremented or reset on transitions, with maximum and minimum value guards. These allow you to transform large numerical constraints into simple state machines without having a very large number of states. Of course, this simply doesn't happen if you don't use a state machine for schema validation, which got me thinking and I ended up writing another little packrat parser. I'm sure you could create a packrat schema validator that wouldn't suffer the state explosion problem.

I hung around at the Mozilla reception for a bit, but was tired and cold-y, so went to bed early.

Thus ended the second day.




How fast is Java?

Link: http://blogs.sun.com/roller/page/dagastine?entry=java_is_faster_than_c

In response to David Dagastine's use of the SciMark numeric benchmark, where he finds that there's very little difference between in speed between Sun's JVM and native C compiled with Visual C++ 6, using a numeric benchmark that uses static methods to manipulate arrays of primitive data types, primarily doubles.

This agrees with my experience for numerical code. When I was lead software engineer in the technical computing group of the world's largest aerospace company, I moved several numerical projects from Fortran or C++ to Java. At the time, one the major reason were bugs in VC's optimisation code which meant the results were wrong, and discrepancies between release and debug builds that made C++ bugs harder to track, whereas Java had performance within 10% and much stronger guarantees of its results.

But MS got round to fixing the bugs, and in VS 2005 appear to have made some major gains in performance.

On my laptop (WinXP SP2, AMD Athlon 3400+), comparing Java 1.6.0-beta with VS 2005 express gives 388.67 and 631.12 mflops respectively, which is a much bigger difference than I observed between Java 1.4 and VS6. The SciMark code is portable C, so they don't use C++ intrinsics, which can give an order of magnitude speed up for certain code (though since it's working in double precision you probably wouldn't get that much improvement for quite a lot less readability, as you can only process 2 doubles at a time with SSE2, but can process 4 floats, and so keep whole vectors in single registers).

So I'd disagree that Java 1.6 is faster than C/C++ in this case on the two counts - firstly it's significantly slower than the C code it was benchmarked against, and secondly SciMark is not using current C++ techniques for optimising numeric code, which any serious numerical library would.

The SciMark code is optimised to a similar degree on both C and Java platforms, but the nature of the Java language means that there are further optimisations which are impossible (due to not having access to intrinsics or inline assembler), and many more that go against Java's intent of being a safe object oriented programming language (such as passing around char arrays rather than String objects).

I've observed before, that without any kind of meta-programming facility, writing optimised numerical Java is very painful (though I haven't had cause to try again - even with better function inlining, you have to copy all the parameters you need, and construct and object for any result which is more complicated than a single type, so it doesn't impact on the complexity much). For small projects, you can do it be hand, for others you have to either use a code generator from a higher level model, or just put up with it running slow. It would be nice if having to hard code optimisations disappear from Java code as Sun improves the JVM's capabilities. For example, copying all state out of an object into local variables, then performing a tight loop, then copying back and gives a measurable performance boost. The same applies for nested loops on multidimensional arrays - you hoist the reference to the row. Currently you have to do such tedious code by hand, though Fortress seems to favour such optimisations in its transactions, so may be a better way ahead for numeric code.

For something a little closer to Java's enterprise use, say date formatting functions, the difference between Java6 and C++ in small benchmarks is closer to a factor of 7 slower, for example see this thread.

However, what I learnt in that thread isn't that if you port C to Java it runs seven times slower (usually it's not as bad as that, and as the SciMark above shows - a good programmer can write Fortran in any language and it'll run quickly), but rather that even good Java programmers won't think of solving the problem in the faster way, but use Java idioms instead.

The idiomatic Java way of solving that problem was twenty times slower than the C code. If you're used to thinking that StringBuilder is faster than StringBuffer is faster than concatenating String, you won't write the sort of code that works fast, however good the VM gets. The granularity at which safety is assured in Java - immutable objects - means you can't optimise across method calls. You need to wrap up a string as an immutable object to protect it from modification, rather than proving its const via the type system.

Java is designed to be a safe, object oriented language. Fast code either operates above this level - for example, using a DSL specify that you want to multiple a matrix row by factor and it'll generate the code inline for it which works on the primitive types - or below this level, such as C can do, and the generated code would. Having to work only with objects with a statically typed interface means you don't have the flexibility to get a high level view of the logic, and get to feel the bits between your toes if you want to.

The granularity problem in Java also applies to array bounds and other checks - much of what can be known at compile time is lost by the Java type system. The attempt to fix this with generics didn't get far in terms of capability - you can only specify the generic type as far as the current Java type system allows, and the information is erased at runtime, so the JVM can't selectively remove redundant checks (other than checkcast). Traditionally, this hit you in loops of a known number of iterations against an array of known length – the array access instruction would still check the bounds. That one may have been mitigated, but only by means of specifically getting the JVM to look for that particular pattern. Traits allow you to provide a general mechanism. For example, a method which can only take a square matrix only needs to check its arguments if they are not known to be square at compile time, which depends where the method's being called from, so requires a backwards possible-caller analysis, and different code to be generated based on the results. Having an NxM matrix with a trait for squareness, also square matrix subclasses with the trait fixed, and knowing this trait doesn't change after the object is created would allow the check to be done both at compile and run time as appropriate. I'm not aware of any language yet that has that level of sophistication of traits inference. (I also don't like the compiler's lack of escape analysis for generics - if you've only ever put a Foo into a list, then you shouldn't have to tell it that it's a list of Foo. But that's for another time, as I'm not doing much Java programming in the real world now, and getting on to how traits can improve dynamic language performance is another post.)

So I don't think the JVM will ever be the fastest thing. Java's too much in the middle ground:

  • The language lacks the meta-programming facilities that hide the tedium of hand optimisations

  • The language lacks access to low level instructions that allow optimisations

  • There is no concept of a cross-object inference to prove unsafe operations cannot happen, rather it relies on information hiding to prevent them, which forces conservative restrictions and redundancy

  • Java idioms use object level safety and synchronised blocks, rather than traits and transactions. This prevents many possible optimisations based on direct access, thread-safe caching, and often requires redundant checks on information already known.

There were good reasons for these decisions in the evolution of Java, but the effects of them won't ever go away, and as long as it's the Java virtual machine, the JVM will have some compromise on its performance.

But it never was intended to compete with Fortran, and is fast enough for many uses.


Labels: , ,


XTech2006 - Tuesday

As usual, I don't get round to anything, but here's the notes I made when attending XTech 2006 last month.

The first day was an Ajax developer's day; I was there as I do quite a lot of UI work, this year I've shifted from Java front ends to using XulRunner; see earlier for some technical reasons (apart from having written a couple of million lines of Java over the last ten years and that's quite enough for the moment).

The Ajax developer's day had presentations from several framework vendors. As they are targeting web applications - professionally I'm employed to produce a framework for a suite of applications that sit on closed networks, we go as far as selling laptops with the software installed since the cost of a laptop is not significant compared to the software license, so it's rather a different model to Web2.0 - so some were interesting as examples of frameworks, but many Ajax frameworks simply exist to abstract away browser differences. Since we control our deployment platform, that's not an issue, but some of the showcased frameworks did quite a lot more.

The Yahoo! User Interface Library has some pretty effects, and compound controls for business type applications (calendars and so on). Personally, I prefer having a sensible date field like the one in DabbleDB, which accepts any unambiguous format and doesn't require clicking on, but with an option for a calendar as a sidebar.

Dojo and the future of the open web sort of want to standardize the api of the abstraction layer.

I looked at the OpenLaszlo platform a year and a half ago, and it seemed then and now to not quite be ready. It's a VM and interpreter for an XML language for constructing UIs. Currently, the interpreter is based on the Flash runtime. As a concept, it's interesting, but the execution was very slow on the corporate desktops of my previous employer, and the actual XML language loses many of the advantages of using a traditional XML pipeline such as separation of content, execution logic and presentation. The guy giving the demo apologized for the designer of the box model having been a MFC programmer who didn't know XML, which should be enough for anyone to be put off the product.

Hijax: Progressive Enhancement with Ajax was a solid presentation on enhancing a web application with Ajax, maintaining accessibility in a continuous integration style. Read the full paper if you're building anything on the web.

Developing Enterprise Applications with Ajax and XUL was closer to home - an intranet tool for a single client company - but the actual applications could be made by Dentrassi*. Soooo much grey, bad field alignment, no context, very Windows 95 VB.

Web 2.0 with Adobe Flex and Ajax seemed to involve installing a rather quantity of Adobe's software on your server, so sort of lost my interest. Filtered off my radar for the next couple of years.

As a bit of a language freak, I was looking forward to Combining E4X and AJAX, but Kurt Cagle's flight was too much delayed for him to give his talk. Instead, someone made an impromptu demo of some web-based IM clients, which were competently done but not particularly involving. But good on them for stepping in at the last moment.

AjaX with a Capital X! was an introduction to the BackBase framework. This is a framework using XML to define UI and data bindings. It was interesting, but was not quite what it says on the tin - putting method calls into XML syntax does not make procedural code magically declarative. Which is a pity, as the example introduced as 'you have to stop using declarative bindings when you need conditionals' could have been written in one line of XSTL2 (without conditionals, which are orthogonal to declarative/imperitive), rather than the several lines that the not-quite-declarative binding used. But they least know that they should be getting closer to pure functional code for anything they can, it's just a question of getting the right idioms out there into practice.

Unfortunately, having caught a bit of a cold, I went to bed early and didn't see anything of Amsterdam that evening.

More XTech days later.


*H2G2: The servants of the Vogons, excellent chefs, but who make tasteless food as they hate their masters.



Back to the Rats

Since it came up again in LtU, and I was thinking about it after Burning Chrome mentioned parsers for syntax highlighting, I was at Henry S. Thompson's seminar on XML schema validation (which is essentially a similar problem to parser design), then Brenden Eich mentioned Narcissus in his closing session at XTech I've been playing with a packrat parser written in JavaScript.

I started again 7pm on Thursday during XTech*, and spent most the night on it, then did more on it on Saturday at BarCamp Amsterdam. Current work in progress is in this directory, with a simple demo which parses a definition of the grammar used to define grammars, then creates divs to render it prettily in the iframe. The grammar below the iframe is the grammar used to parse what's in the iframe, and below it are debug messages. Editing what's in the iframe causes events to mark it dirty, but the incremental reparsing doesn't work yet.

The next stage is to add the incremental reparsing, so when you do edit it it tries to reparse, and to link that to the divs so that you don't have to throw away all of the content of a div when you reparse it, as at present.

The Triep stuff is a combination of Trie with predicates that was another idea for incremental parsing taylored for interactive development. I'm going to add some form of trie to the parse states for autocompletion, though I may go back to using a standard trie as the calculus for the predicates is tricky, and reusing the packrat grammar may mean it's not needed.

One thing about using PackRat using the visitor pattern (or any other tree-walking recursive descent), is that you don't need to convert the grammar into a finite state machine, which means that while Henry S. Thompson's handling of numerical constraints is an elegant solution, the problem simply doen't exist - you can simply put the counter in a for-loop in the visitor's method, and you're done. The parser doesn't do that since most text grammar don't require it - it only handles zeroOrMore, oneOrMore or zeroOrOne, but nToM would be handled in much the same as those, and I don't know in principal why you couldn't use tree-walking descent for XML - my ASN XER to Java data bindings used exactly that approach, and that's a superset of WXD.


Labels: , ,


Decamel and the Colossal Cave

One thing I've noticed, when working in languages sufficiently introspective that you can access names of functions, is that any large enough system ends up implementing a decamel method.

Basically it takes a string, being a method or class name, which (due to limitations language design though based in the days when the only way you'd get timely parsing on the available hardware was to use recursive descent) is MadeOfSeveralWordsUsingCapitalsToIndicateBreaks and break it apart for user presentation.

Normally it's the text of a menu that's late bound to a function call, or the default label of an entity type on the palette of a modelling tool. I've also used it with a unit test framework that output the test names and results in an XML file, and with the magic of Ant, Saxon and FOP presented a properly formatted PDF automated test report for my line manager.

There's an example of an OSS tool on code_poet of it being used for documenting unit tests (though his 'self describing system' label is more than a little off mark - labelling a unit test 'Dog walks when you kick it' doesn't guarantee that the test has anything to do with the label; and as the fact that both his labels are present simple but the test for barking is past perfect and for walking is present progressive indicates, even giving a long, meaningful label doesn't guarantee to uncover the underlying assumptions that make creating good unit tests tricky).

Now, that stuff is very easy, and a very powerful tool for creating agile applications (you don't want a thousand anonymous ActionListeners cluttering up your code - even if they're generated by your IDE, it's all hardwired and a pain to maintain), and maybe a help for readable unit tests.

Something else on my radar, this time via decafbad but also now on LtU, is Inform7, which is creating a bit of a buzz.

Now I've known about controlled English systems such CLCE and ACE, both of which are general purpose mappings of a subset of natural language onto first order logic, and given effort you can create a model that can be transformed into running code. But it seems that the generalness of the AI originated systems gets in the way - or maybe I'm just too lazy to dig into them far enough to get a return on my effort.

It's also true that both CLCE and ACE have been used to create running code for systems specified in natural language, but in my experience there are always problems with general purpose code generators (not that I'm saying anything of the quality of those projects in particular).

So instead of applying the full systems to the general case, what happens if you add a simple adventure game style parser to the decamelled code? Can you generate unit tests like that, or do you need the full linguistic systems to get the assumptions about tense correct?




Art, state of, one year on from last year (the).

Previous frustrations with XMLHttpRequest, and more recently finding DeltaV didn't appear to be supported even in Firefox at work may be changed if a bit of sensible flexibility gets the W3C spec to conform to the HTTP rfc's extension-method = token rather than a vendor specific white-list.

I'd still really like a browser based, version controlled, graph drawing tool for modelling and knowledge capture, but with the WhatWG's canvas and support for SVG in Firefox stable enough that I'm writing production code based on it, and the real possibility of single page applications such as this wiki using Amazon Simple Storage Solution, I'm thinking of retiring the Java based, serverside image code of my LapisBlue, which I never got round to connecting to a versioned store anyway.

So I'm thinking of retiring LapisBlue, since I'm paying monthly for a full featured server solution that's not getting any use, whereas I can pay for a tiny amount of data storage and get the clients to do the rendering work now. Though proper version control would be nice, saving deltas or labelled versions to S3 should also be possible, more fun that configuring a tomcat installation that pulls in a thousand or so libraries, and not reliant on extension methods as subversion's DeltaV implementation is. What you lose is a queryable database, but I'm thinking of using it for a pattern wiki rather than anything else.

In other news, I got rather exited over the weekend thinking about using SSE for a faster 'byte' code interpreter, and resurrecting kin - my toy language for graph matching based code generators to generate simulation models defined generically on traits, which I'd partly implemented on the JVM - as a scripting language plugin for the gecko platform. If you can SIMD the graph matching, and maybe also either SIMD the bytecode scripting, or (since kin uses pure visitor functions a lot) use SIMD optimised blocks with scripting, you may get close to Java performance without having to track Sun's generics cruft.

It's still easier for me to write Java than C++, especially when you need to use libraries - each library having its own code conventions and memory management model - or Lisp for that matter (since I've done far more Java than Lisp in anger), but for many things JavaScript's good enough. The only things I've found this year that I've written in Java have been something to test an algorithm for work, which could have been written in anything really, and an annealing based graph layout, which ran too slow in JS to be useable. But annealing graphs is exactly what kin would be suited to, and be designed to parallelise it, so it may be that the Java world gets even smaller for me.

I'm not sure how useful web-based simulation tools would be, and suspect a good enough interpreter + a really, really good code generator would be a better match to a lot of the problems I like thinking about than trying to do anything like Sun's Hotspot, brilliant though it is.

Third point of this summary - I'm also excited about building distributed clusters of collaborating applications and services on top of xmpp. It's something I've been pushing at work, and I've got enough of the infrastructure there that the rest of my team are starting to play with it, building models and connecting them to XUL UI's with xmpp pub-sub. I've got till mid June to build it out to a full distributed system with service discovery, which means a mix of quite easy xml binding and doing some fairly hard concurrency work to get the simulators' execution model and the pubsub code working well without excessive threads or critical sections.

Oh, and I'm going to XTech2006 next month. It's nice to be working for somewhere that's not too stingy to send it's people away again.


Labels: , , , , ,


Smart or fluffy, Ozy. You can't have both

Link: http://www.ozyandmillie.org/d/20050608.html

Most of my earlier career has been in formal (usually graphical) notations surrounding systems engineering, trying to create strong, machine usable ontologies and tools that allow users to create structures that represent the systems they're designing.

Graphical tools that allow you too create these things are great fun to program.

But using them is like alphabetising your record collection in seven dimensions - there are product, process, validation, verification, stakeholder, structure, function, behaviour, purpose axes that each system has a position on.

And to a lesser extent, UML tools suffer the same problems, and as the UML gets more and more complicated it only gets worse.

Part of the problem is that the value of getting machine usable information about your systems is high cost, and the cost is incurred by users who don't feel the benefit.

And part of the problem is the tools and notations themselves, the lack direct manipulation, and of notation condensation.

Flow happens in tools when the user forgets they exist, and feels like they're directly manipulating the concepts. Part of that is in the UI design - you have to design for flow - and related to that is notation condensation.

In the UML graphical notation for class diagrams, there are conventions for warts to add to indicate access modifiers on attributes and methods, for example +foo : integer[0..7] means the public attribute named 'foo' being of a collection of zero to seven elements of the type named 'integer'. Of the four UML tools I can think of off hand (ARTiSAN RTS, Poseidon, Rose 2002, Enterprise Architect) only Poseidon lets you enter the text directly and updates the relevent properties in the abstract syntax; the others you either have to open a dialog to edit them (EA), edit each part in a separate in-place widget (Rose) or use their own, non-standard syntax (RTS). So you have to both learn the notation, and learn the tool's take on how to interact with the notation.

ArgoUML/Poseidon has a very good HCI, it's a pity it's currently broken by bugs in Apple's Java so I can't use it. Of course, OmniGraffle lets you enter whatever you like, as it's not based on the semantics of UML, but is a diagram editor.

But even with the condensation, to say something consistent in UML means there are many non-graphical links. Some of these are apparent in the source code, such as a method containing its sequence of operations - not possible to show in UML; others, such as embodiment or requirements trace are relations between models, and the UML doesn't show these well. There is notation in SysML for some of the interesting relations between models, but I've yet to see anything that implements it, and I'm not convinced it will get used - it's a notation that spans diagrams, and the unit of context in UML is the diagram, which is it's greatest flaw. It's a language that hasn't the means to say the interesting things about a project.

So you can go to 'fluffy' approaches. One of the things I'm playing with is getting an atom feed of changes to the codebase of one of my projects, and building a temporal web showing a measure of the time each part was updated. I suspect there is a nexus around a couple of classes, which moves over time as the work progresses. Though I'm not sure what such a measure would mean, other than I was updating these things about the same time.

The thing that we want is to use as many sources of ambient information, information that is zero user cost, to get the most useful view of the project. UML is high cost and fractured; what I want is something like the feeling I have of the shape of the system.

Smart and fluffy, I want both.

Labels: ,


Emerging Languages

Link: http://www.developerdotstar.com/community/node/144

In the last few days, Martin Fowler posted a few essays on language workbenches and their relation to MDA and MDD. Today John Cowan posted on his Recycled Knowledge blog about traits, as used in Sun's proposed technical computing language, Fortress. I'd been aware of Fortress, but had missed the draft spec coming out.

I don't want to knock Fortress, as it's hard to have a technical computing language better than Fortran and as net-savvy and secure as Java, but I'd agree with Edward G Nilges's blog. Having α := x2 isn't enough.

The big thing that Fortress claims as its selling point isn't that it's a secure, robust, transactional, object oriented language. It tries to sell itself on its syntax looking like maths.

Whatever you do with monospaced text, unicode or no, it's not going to look like text-book maths. For that you need a decent document rendering system, and can use block formatting model result from on relation or transforming the abstract syntax of the language, much like the language workbench systems do. Fortress' examples of 'maths like' syntax simply don't do it well enough. There was also a quite sexy demo of Mathematica's means of exploring computational model spaces at Apple's WWDC, which shows what you can do with a compound document style UI in mathematical programming.

But as well as not being quite enough in terms of becoming literate maths, it seems to ignore Fortran and the existing codebase.

The big problems with agile, high productivity technical computing are how to build models quickly, and scripting languages that apply leverage to existing systems (such as SciPy and my own (unfortunately closed source) Jini based system for creating a distributed agile front-end to Fortran code) offer much better reuse of current code by integration rather replacement. You don't want to rewrite your aerodynamic stress library to use it, so these techniques give better return on effort. Which is what high productivity is all about.

That said, there is also some hard to read code out there that results from the lack of data structures in F77 and the aggregation of patches to 20 or 30 year old systems. There is some merit in applying relational KR models, such as conceptual graphs to documented legacy code to extract the intent, and I hope to put aside some time to play with applying something like a pattern match to some of our larger application sets (107 lines Fortran, mainly single purpose stove-piped programs with no shared libraries) for finding redundant code that can be extracted out to a managable codebase.

There's simply no way we can rewrite our code in Fortress, and FFI and code transformation are more important than syntax.


Labels: , ,