2009-10-05

Random Lists

As kin has support for pattern matching and sequences at its heart, it's natural to formulate a random number generator as a generator of a sequence of random numbers.

For example, this is the same linear congruential generator as used in java.util.Random:


type lcsr ( seed : uint64 ) : seq [ uint32 ]
  def head : uint32 => uint32 ( self.seed >> 16 )
  def tail : lcsr => lcsr ( ( self.seed * 0x0005_deec_e66d + 0x0000_0000_000b ) & 0xffff_ffff_ffff )
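For readers who don't speak kin, here's a minimal C++ sketch of the same idea (the struct and the demo loop are my illustration, not kin output): head and tail are pure functions of the seed, so a given lcsr value always denotes the same infinite sequence.

#include <cstdint>
#include <cstdio>

// Immutable value type whose head/tail are pure functions of the 48-bit
// LCG state used by java.util.Random (multiplier 0x5DEECE66D, increment 0xB).
struct lcsr {
    std::uint64_t seed;
    std::uint32_t head() const { return static_cast<std::uint32_t>(seed >> 16); }
    lcsr tail() const {
        return lcsr{ (seed * 0x5DEECE66Dull + 0xBull) & 0xFFFFFFFFFFFFull };
    }
};

int main() {
    lcsr s{ 0x12345 };
    for (int i = 0; i < 4; ++i) {
        std::printf("%08x\n", s.head());
        s = s.tail(); // rebind to the next value; no lcsr is ever mutated
    }
}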


You can wrap it in an actor if you want the same behaviour as a random function with hidden state:

type @actor random_source ( sequence : seq[uint32] )
  def next_uint32 ( self ) : uint32
    let r = self.sequence.head
    self become ( sequence = self.sequence.tail )
    => r

  def next_real64 ( self ) : real64
    => real64 ( natural ( self.next_uint32() ) ) / real64 ( natural ( uint32.max ) )

But you couldn't optimise matching on the value reported by a random function like you can with a pure sequence:

let rnd_seq = lcsr ( 0x12345 )

match rnd_seq
  0x00000001 :: 0xadfc5fc8 :: 0x22222222 :: 0x33333333 :: _ =>
    out <- "match one" <- endl
  0x00000001 :: 0xadfc5fc8 :: 0x9d1a5de6 :: 0xf6b685cf :: _ =>
    out <- "match two" <- endl
  _ =>
    out <- "no match" <- endl

The cons'd patterns are compiled to combinations of calls to seq.head or seq.tail. The first two matches get combined as the first couple of terms in the pattern are the same; if seq.head or seq.tail were not pure functions, the second match case would fail as the hidden state would change.


Tim Bray describes using this technique as "a sideways dance" in Haskell. I've always found Haskell a bit too 'stiff', and like to 'feel the bits between my toes' too much for it, but compared to the amount of messing around Java has to do with atomic longs (JDK7) or synchronized methods (GNU Classpath) in the implementation of the random source, and then as an Iterable<Integer> if you want a sequence of random numbers, it doesn't seem too bad. In the Java case, either the runtime has to deduce that the interlocks are not required and remove them, or it has to process them; if the code is used in more than one place you may have to have more than one machine-code compiled form of the method, or fall back to conservative locking always. For a pure sequence, no interlocks are required. Wrapping a sequence in an actor (or a lazy list in a monad in Haskell) gives you hidden-variable behaviour if you want it, without polluting the code of the sequence generator with synchronisation primitives.


2009-06-19

Language-level Immutability and Shared Memory Concurrency

I've been a fan of Bartosz Milewski's Programming Café (or possibly Bartosz Milewski's Programming Caff; I'm not sure if lack of an accent implies greasy chips and tea outside of the UK). He's exploring the options for applying the leverage of type checking to concurrency problems within shared-memory systems. In a different camp, there's the Erlang crowd, typified by Steve Vinoski, which says that language-level immutability and message passing is the way to go. A third perspective, probably closer to what I would like to see, is the CSP-based approaches such as Limbo and Stackless Python, which are shared-nothing and use channels to pass messages by value. The channels-and-pass-by-value approach I'm experimenting with in kin is close to the Limbo model, except that channels are objects which support different messages. Apart from Bartosz Milewski's D extensions, these techniques require that the language integrate with a scheduler, usually a green-threads system rather than the OS scheduler.

One thing which bothers me about Erlang and pure-functional languages in general is the assumption that language level immutability has some bearing on concurrency.

It's fairly obvious that any message shared by machines on a network is unlikely to include memory addresses in one of the processes, other than possibly as opaque values. In other words, if it's on the net, it's pass-by-value. Similarly, most inter-process communication is pass-by-value, and the sharing of messages containing values and names of channels is well understood using CSP and the pi-calculus. The net is unconcerned with the paradigm of the languages which implement the processes communicating by message passing; it only cares that the messages are values and names. It doesn't follow at all that the implementation language for a process which operates on messages needs to present immutable variables to the programmer.

The other area where functional languages have been touted is for problems with shared-memory concurrency within a process.

A trivial counterexample is given by lazy evaluation - in simple implementations, expressions are passed around as thunks, consisting of a place to store the value of the expression and a function to evaluate it. The expression is evaluated lazily by calling the function the 'first' time the value is needed. Being a pure function, it doesn't matter as far as the value is concerned whether the function is called multiple times on different cores, but if the value returned is a non-primitive there are still write-order issues with sharing the value. So you end up with barrier-ordering problems similar to those in non-pure languages, and barriers cost performance.
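In outline (my names, not any particular Haskell runtime's), a shared thunk needs a release/acquire pair even though evaluation is pure:

#include <atomic>

// A thunk: a slot for the value plus a pure function to compute it.
// Duplicate evaluation is value-safe, but publishing the result still
// needs ordering so another core can't observe 'done' before the value's
// bytes. (Two cores racing on the plain store to 'value' write identical
// bytes, but that is still a formal data race in C++; real runtimes use
// a CAS or per-thunk ownership here.)
template <typename T>
struct thunk {
    T (*eval)();
    T value{};
    std::atomic<bool> done{false};

    T force() {
        if (!done.load(std::memory_order_acquire)) {
            value = eval();                              // pure: re-runs agree
            done.store(true, std::memory_order_release); // publish after write
        }
        return value;
    }
};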

For strict functional languages, you still need the barriers, and other synchronisation points, if you are going to communicate values between cores. If you are passing messages by value, then it is necessary that the sender or receiver of a message cannot change the value of the message. This isn't necessarily the same as language level immutability - in a shared memory space, values sent on a channel should not change, but that can be achieved by copy-on-send or copy-on-write approaches as well as by general immutability. Having a shared memory space between threads requires synchronisation with the garbage collector - otherwise memory can be reclaimed and rewritten, breaking the immutability. None of this is particularly surprising, and the implementation of any SMP shared-memory language with automatic memory allocation will have to deal with it.
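As a sketch of copy-on-send in a mutable language (the class is my illustration): the channel takes its message by value, so the receiver can never observe later mutation by the sender, without any language-level immutability.

#include <condition_variable>
#include <deque>
#include <mutex>

// A channel which copies (or moves) the message at the send call, so
// sender and receiver never share mutable storage.
template <typename T>
class channel {
    std::deque<T> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void send(T msg) {   // pass by value: copied or moved at the call site
        {
            std::lock_guard<std::mutex> lock(m);
            q.push_back(std::move(msg));
        }
        cv.notify_one();
    }
    T receive() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); });
        T msg = std::move(q.front());
        q.pop_front();
        return msg;
    }
};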

The presence of immutability at the language level allows the optimisation that a message can be passed as an immutable reference rather than by value. This introduces the pessimisation that any garbage collection of the arena that the messages' storage is allocated from has to be coordinated between all threads. If the optimisation of mostly-concurrent memory management is made, where each thread has its own arena, memory for messages is either allocated from a special shared arena when constructed, requiring some synchronisation, or messages are copied when sent.

It's fairly obvious that if you are allocating from a shared arena then you're having to serialise some of the requests, so are losing concurrency (as well as having to do escape analysis so you know which objects constitute the message and so need to be allocated from the shared arena), and that if you're copying messages into a shared arena when sent then you have lost the pass-by-immutable-reference optimisation. Either way, I've yet to see a truly convincing case that immutability in itself actually gives any performance gain for concurrent code and typical message sizes. (If you have very large messages which are not referenced by the sender, then any copying scheme will be inefficient, but that seems to be a case for escape analysis rather than immutability; copy-on-send would be at a disadvantage there, but Erlang style encourages small messages anyway.)


So the performance problems are: acquiring a lock on the memory bus, which takes around 40 cycles for a lock cmpxchg; barriers to order access; and copying. Languages with channels as first-class constructs hide the details of locking and barriers, and can trade off various copying or shared-memory techniques.

I've not seen anything in Erlang that would suggest its shared-nothing messaging is less prone to programmer error than Limbo's channels. There are other reasons to use pure-functional languages, in terms of proving theorems about the program, and below the language level some optimisations are facilitated and some are harder to perform. But there seems to be a confusion about which parts of a particular combination of facilities in a language are actually supplying the gains, and my opinion is that message passing combining values and channel names (however implemented or enforced) is both safe, and open to analysis at the systems level by the pi-calculus.


2009-05-23

Messing about with X windows

I've spent the last week or so adding bindings to kin for X11 and Anti-Grain Geometry (AGG).

X11 gives basic windowing capability, but no widgets ( you run GTK or some other toolkit on top ), AGG gives high-quality software graphics rendering.

This is a window which attempts to draw some buttons:

screenshot

The window has a transparent background. I've lost track of the number of hacks I've had to do to get transparent backgrounds working in either MFC or Java over the years (though both now support them), so having it happen out of the box is nice. Just use a colour with an alpha less than 0xff, and the background is transparent.

Thanks to AGG, all the rendering has affine transformations applied to it; the code draws the buttons where they should be, then bigger for debugging, then messes around with rotated text and shading. It's shown running on top of SciTE, the editor I use.

The buttons are all drawn manually - currently the graphics class supports filling rectangles with colours, linear gradients or radial gradients, drawing text, and finding the bounding box of text.

The code implements a tiny button_model/button_presenter.

The reason I posted now is that I realised this line of code does most of the work of the horizontal box layout in Java Swing:


fold ( ( x, btn ) => btn.set_bounds ( x, 0.0, btn.desired_width ( gfx, `normal ), btn.desired_height ( gfx, `normal ) ).x + btn.width + 4.0, 0.0, buttons )
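The same fold, sketched in C++ with std::accumulate to show there's no magic (the button type and the fixed desired sizes are stand-ins for the font-metric queries in the kin version):

#include <numeric>
#include <vector>

struct button {
    double x = 0, y = 0, width = 0, height = 0;
    double desired_width() const  { return 80; }  // stand-in for gfx metrics
    double desired_height() const { return 24; }
    button& set_bounds(double nx, double ny, double w, double h) {
        x = nx; y = ny; width = w; height = h;
        return *this;
    }
};

// Lay the buttons out left to right, threading the running x coordinate
// through the fold as the accumulator; 4.0 is the gap, as in the kin code.
double layout_hbox(std::vector<button>& buttons) {
    return std::accumulate(buttons.begin(), buttons.end(), 0.0,
        [](double x, button& btn) -> double {
            btn.set_bounds(x, 0.0, btn.desired_width(), btn.desired_height());
            return btn.x + btn.width + 4.0;
        });
}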


Having done far too much Swing programming, I often find that there's nothing in it that fits what I'm trying to do. For example, last contract I was trying to make a nice split button with a menu, and the look-and-feel / Swing component / MVC distinction meant that you couldn't reliably implement one that was both pluggable and looked good - either you built one out of two buttons, which didn't look or feel like one button with a split, or you did owner-draw and lost the pluggable look-and-feel. That contract also required a column-based layout, and containers which lined up like tables - the normal typographic idioms. The intention is that kin's toolkit will be more typographic/infographic in style, rather than widgety - most of the time, it's for reading, not clicking.

I'm quite happy that, starting from X11 in C rather than with a toolkit such as GTK (which again may get transparent backgrounds soon), after a few days there's enough there to see that it'll be much easier to create your own application-specific toolkits.


2009-04-16

Comma Quibbling

Link: http://blogs.msdn.com/ericlippert/archive/2009/04/15/comma-quibbling.aspx

An example in kin. This misses the difficulty of the C# problem - as sequences in kin are linear, rather than the mutating iterator/enumerator pattern used in Java/C#, you can trivially pattern match several elements at once (being able to pattern match sequences and backtrack on failure is one of the main reasons for writing kin in the first place).

module examples::eric_lippert_20090415
  # http://blogs.msdn.com/ericlippert/archive/2009/04/15/comma-quibbling.aspx
  def comma_append ( out, s )
    match s
      h :: [] =>
        out <- h
      h :: m :: [] =>
        out <- h <- " and " <- m
      h :: t =>
        out <- h <- ", "
        comma_append ( out, t )
      [] =>
        pass

  def write_with_commas ( out, s )
    out <- '{'
    comma_append ( out, s )
    out <- '}'
    => out

  def to_comma_string ( s ) => write_with_commas( string_writer(), s ).str()

  def main ( in, out, args )
    write_with_commas ( out, args )
    out <- '\n'
    out <- to_comma_string( args ) <- '\n'
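For the input {"ABC", "DEF", "G", "H"} from Lippert's post, write_with_commas produces {ABC, DEF, G and H}.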


2009-04-09

A new way to look at networking.

Link: http://video.google.com/videoplay?docid=-6972678839686672840

I find this video incredibly interesting. It's talking about the phase transition that occurred between telecoms networks and packet switched networks, and questioning whether the same can happen between packets and data.

In a routed telecom network, your number 1234 meant: move the first switch to 1, the second switch to 2, and so on - it described a path between two endpoints. In TCP/IP your packet has an address, and each node in the mesh routes it to the node it thinks is closer to the endpoint with the target address.

Currently I have two or three home machines (netbook, desktop and file server) where data lives locally, plus data 'in the cloud' on gmail.com and assembla.com, or on my site here at tincancamera.com. Though as far as I'm concerned, that data isn't in the cloud - it's on whatever servers those addresses resolve to. If instead I google for 'pete kirkham', about half the hits are me. If I put a UUID on a public site, then once it's indexed I could find that document by that ID by putting it into a web search.

In the semantic web, there's been a bit of evolution from triples without provenance to quads - triples plus an id, or some other means of associating the triple with the URI of the origin which asserts it. It's considered good form to have the URI for an object resolvable to some representation of that object, so there's a mix of identifier and address. The URI for kin is http://purl.oclc.org/net/kin, which uses HTTP redirection to point to the current kin site. This depends on OCLC's generous provision of the persistent uniform resource locator service, and maps an address to another address, rather than quite mapping an identifier to an address.

There isn't an obvious analogue for data of the TCP nodes routing the request to a node which might be closer to having the data, though in some ways a caching scheme with strong names might meet many of the requirements. It sort of brings to mind the pub-sub systems I've built in the past, but with a stronger form of authentication. Pub-sub replication isn't caching, as in a distributed system there isn't a master copy which the cached version is a copy of.

There's also a bit of discussion of broadcast versus end-to-end messages; I've got to work out how to do zero-config messaging sometime, which again sort of makes sense in such systems - first reach your neighbours, then ask them for either the data, or the address of a node which might have the data. But there still isn't an obvious mapping of the data space without the sort of hierarchy that TCP addresses have (although pub-sub topics are often hierarchic, that hierarchy doesn't reflect the network topology even to the limited extent that TCP addresses do). It also has some relation to p2p networks such as BitTorrent, and, by the time you're talking about accessing my mail account in the cloud, to webs of trust. A paper on unique object references in support of large messages in distributed systems just popped up on LtU, which I'll add to my reading list this weekend.

I've also started reading Enterprise Integration Patterns. There's a bit of dissonance between the concept of a channel in these patterns (and, I guess, in the systems they are implemented in) and the concept of a channel in the pi-calculus. In EIP, a channel is a buffered pipe for messages, and is a static entity created as part of the system configuration; in the pi-calculus a channel is not buffered, and the normal mechanism for a request/response pair is to create a new channel, send that channel with the request to the server, and receive the response on it.

The pi-calculus model is the pattern for communication I'm intending for inter-process communication in kin (though I'm not quite sure whether giving explicit support to other parts of the pi-calculus is something I want). I'm not sure that having to mark request messages with an ID (as in XMPP IQ) rather than having cheap channels is a good programming model, though there are reasons why pretending an unreliable connection is a pi-calculus channel is a bad idea. I'll see what I think after I've tried coding the channels. Certainly returning a channel from a query given a strong name, letting the requester communicate with an endpoint which has the named data, could be a model worth pursuing.

REST is a bit different in that the response channel is assumed by the protocol, but the URLs returned by the application are channels for further messages to be sent to, and of course can be addresses anywhere, allowing the server for the next request to be chosen to be the one best suited to handle that request (eg, a list of the devices in a home SCADA system could list the URIs of the embedded servers on the sensor net, rather than requiring the client to know the topology prior to making the initial request).
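A toy C++ rendering of the fresh-channel idiom, using std::promise/std::future as the cheap one-shot reply channel (the request struct is my illustration):

#include <future>
#include <string>
#include <thread>

// Each request carries a freshly created reply channel, so no correlation
// IDs are needed: the response arrives on the channel the request named.
struct request {
    std::string query;
    std::promise<std::string> reply; // the fresh channel
};

int main() {
    request req{ "devices?", {} };
    std::future<std::string> response = req.reply.get_future();

    // The 'server' end answers on whatever channel the request carried.
    std::thread server([r = std::move(req)]() mutable {
        r.reply.set_value("sensor-1, sensor-2");
    });

    std::string answer = response.get(); // requester waits on its own channel
    server.join();
    return 0;
}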


2008-08-07

Why Type

Link: http://blogs.msdn.com/ericlippert/archive/2007/12/06/immutability-in-c-part-three-a-covariant-immutable-stack.aspx

For work, I'm demoing a GUI editor, which means undo would be nice, which is easiest to implement if you use purely functional data structures. So I'm doing a bit of reading on how to implement suitable structures in Java 1.6, as that's the project's target language, using Eric Lippert's examples as part of my reference material.
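As a minimal sketch of the kind of structure involved (in C++ here; the Java 1.6 version is structurally the same): push and pop return new stacks sharing structure with the old one, so every previous editor state stays valid for undo.

#include <memory>

// A persistent stack: operations return new stacks, never mutate.
template <typename T>
class pstack {
    struct node {
        T head;
        std::shared_ptr<const node> tail;
    };
    std::shared_ptr<const node> top;
    explicit pstack(std::shared_ptr<const node> n) : top(std::move(n)) {}
public:
    pstack() = default;
    bool empty() const { return !top; }
    const T& head() const { return top->head; }
    pstack push(T value) const {
        return pstack(std::make_shared<const node>(node{ std::move(value), top }));
    }
    pstack pop() const { return pstack(top->tail); }
};

An undo history is then just a list of these stacks (or whatever the document state is), each version sharing most of its nodes with its predecessor.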

There was also yet another thread on forums.java.sun.com about what can be done with generics, erasure and making arrays safe.

I like having something which proves that all possible execution paths don't result in obviously erroneous actions. For many cases, a strong type system supports this.

But the more I look at it, the more it seems that being able to assert any simple thing about any action on an object would make a lot of the proofs simpler. There are better and better proof tools around, and the kinds of things you want to prove about concurrent systems are not easily expressed in a type.

I'm increasingly thinking that a dynamically typed runtime, with a good static model checker, would trump a strongly typed language in performance, productivity, and safety. But creating such a beast is highly non-trivial.


2008-04-30

Test of syntax highlight

Test of the syntaxhighlighter:


# generic sequence trait
trait seq<T>
  def empty : boolean
  def head : T
  def tail : seq<T>
  def to_string => '[' ++ join(self, ', ') ++ ']'
  def concat (other:seq<T>)
    => other.empty ? self : shift(other, reverse)
  def reverse => shift([], self)


2008-04-25

Some things I'm thinking about

I've been playing more with my mostly functional modelling and simulation language (kin), here are some ideas which I want to try and profile to see if they offer gains:

Carrying type and data as a tuple:


Some implementations of Scheme and JavaScript use a single void*-sized datum for each object: for integers they set the lowest bit and encode the value in the rest, while for other objects the lowest bits are zero due to alignment and the whole is a pointer to the object. That doesn't work very well for doubles, and requires that the type is stored in the object's header. Java, C#, and C++ (for objects with a vtable) have type information in the object header, which costs in large arrays. I've been thinking about passing the type and the data around separately, which means you can have more tightly packed arrays for non-vtable types.
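For reference, the low-bit tagging trick looks something like this (a sketch, not kin's actual representation):

#include <cassert>
#include <cstdint>

// Odd words are small integers; even words are aligned object pointers.
struct value {
    std::uintptr_t bits;

    static value from_int(std::intptr_t i) {
        return value{ (static_cast<std::uintptr_t>(i) << 1) | 1u };
    }
    static value from_ptr(void* p) {
        assert((reinterpret_cast<std::uintptr_t>(p) & 1u) == 0); // alignment
        return value{ reinterpret_cast<std::uintptr_t>(p) };
    }
    bool is_int() const { return (bits & 1u) != 0; }
    std::intptr_t as_int() const {
        // relies on arithmetic right shift of negative values, which is
        // implementation-defined but universal in practice
        return static_cast<std::intptr_t>(bits) >> 1;
    }
    void* as_ptr() const { return reinterpret_cast<void*>(bits); }
};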

One-bit markers for escape detection:


Based on One-bit Reference Counting: a marker in the type field distinguishes 'shared' objects, which you can take a reference to, from 'non-shared' ones, which you must not.

Write-once arrays:


As an extension to the 'one-bit reference counting' style: maintain a high-water mark instead of always copying, for use in structures such as VLists, so you don't automatically copy on passing; it also allows you to mutate an array in place if it can be proven that all reads of an element happen before the write to it, as in matrix multiplications:

ax[i] <- ax[i] + row[i] * col[i]

Block order arrays for cache sensitive iteration and super compiled functors:


In a simulation, it's not uncommon to have a reduction on some function of a few attributes of an entity:

let total_force = reduce(array_of_entities, lambda(e:entity, ax) => e.mass * e.acceleration + ax, 0)

If this is over a homogeneous array, and mass and acceleration are simple double values, in C this would translate to picking 8 bytes here and there from the memory for the array. If instead each field is kept either in 'column order', or in what I'll call block order - the fields for a cache line's worth of objects held contiguously - then cache misses should be reduced, and SIMD instructions can be used to process the data. The obvious disadvantage is that an object in an array is no longer the same layout as a single object on the heap, and to exploit it you need either a super-compiler or a trace-based JIT.
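A sketch of what block order might look like for the mass/acceleration reduction above, with a block of 8 doubles per field (one 64-byte cache line; the names are mine):

#include <cstddef>
#include <vector>

// Fields are stored in per-field blocks of 8, so the reduction walks
// contiguous doubles and is straightforward to vectorise.
struct entity_block {
    double mass[8];
    double acceleration[8];
};

double total_force(const std::vector<entity_block>& blocks, std::size_t n) {
    double ax = 0.0;
    for (std::size_t b = 0; b < blocks.size(); ++b)
        for (std::size_t i = 0; i < 8 && b * 8 + i < n; ++i)
            ax += blocks[b].mass[i] * blocks[b].acceleration[i];
    return ax;
}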

Effective model for scopes based on lists, symbols and objects:


Trying to build an interpreter in Java, where it's tempting to use maps for everything, I found that the properties of an object and the properties of a lexical scope are much the same (the duality of closures and objects is well known), so I will try to define the binding model for values in kin using only symbols, integers and lists.

Using CAS for thread safe homogeneous VLists


Similar to the vector implementation in Lock-free Dynamically Resizable Arrays.
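The compare-and-swap idiom in miniature - a Treiber-stack push rather than the paper's full lock-free vector, which drives the same loop through a descriptor:

#include <atomic>

template <typename T>
struct node {
    T value;
    node* next;
};

// Lock-free push: propose 'n' as the new head until the CAS succeeds.
// compare_exchange_weak reloads the current head into n->next on failure.
template <typename T>
void push(std::atomic<node<T>*>& head, node<T>* n) {
    n->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
        // retry with the refreshed n->next
    }
}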

Options for compiling Verilog to shader language


Having had an interview last week as a systems modelling engineer with a company who were using C as the modelling language for timing simulations in embedded memory controllers - which is a bit like going for a job as a car mechanic and discovering that you're being quizzed on your wood-working skills - I was thinking about compiling an appropriate modelling language to something which executes efficiently. Though their particular problem - timing conflicts - I would have thought has an analytic solution.

Something after UML


Speaking of Verilog, Verilog came into existence because even standardized schematic diagrams don't carry strong enough semantics and are not amenable to algebraic analysis, and graphical notations don't scale to large systems without hiding information.
Pi-calculus came into existence because Petri nets don't scale to large systems and are not amenable to algebraic analysis.
UML is very much in the standardised schematic mould: it lacks formal semantics, and relies on hiding information to scale.

Often the hidden information in UML is very important - what appears to be a good design is not a good design if its simplicity is achieved by hiding most of the attributes and operations of a class, as often happens when a class gets multiple unrelated responsibilities. The notation allows you to hide details which should indicate that the design ought to be factored into smaller units, and in fact encourages such behaviour as a mechanism for scaling up to showing many classes. For example, in Altova UModel:
You can customize the display of classes in your diagram to show or hide individual class properties and operations. ... This feature lets you simplify the diagram to focus on the properties and operations relevant to the task at hand.

If a class has features which are not relevant, then refactor it; don't hide them just to keep the documentation simple. The classes of well-factored systems have nothing to hide, but require more relations between the classes, which UML tools don't handle well (last time I checked, none provided basic auto-routing or anything like good layout engines), and so look more complicated even though they are simpler - in the same way that a dry stone wall is simpler than a minimalist plastered white apartment, but is visually more complex.

So what might come after UML then? What would not militate against refactoring, but allow a visual overview? What notation might give good analytic properties for software, rather than being a system schematic with loose semantics?

I don't know yet.


2008-04-20

Bjarne Stroustrup on the Evolution of Languages

Link: http://msdn2.microsoft.com/en-us/magazine/cc500572.aspx

A few comments on this -

The illustrations in the interview don't appear related, other than that they are of an IDE ("I might have a different opinion if a good IDE was universally available" - cue Visual Studio, which implies that VS isn't a good IDE) and that the UML is illustrating multiple inheritance, while the mechanism for dispatch in the paper is not based on multiple inheritance (at least as shown in the diagram).

It's an interesting paper though, and I would like open methods in a language.

I don't think that DSLs should be just extensions to general purpose languages. Mainly because that implies a specific execution model, and if you have a DSL then execution shouldn't be part of the domain model (unless it's something like Verilog, or the input to a program proof tool, in which case execution is a domain concept). DSLs should allow the modelling of a domain, and lose their benefit - abstraction away from execution - if they are tied to a programming language. If you have a model of the physical domain, then you can state 'F = m * a' and not have that imply that the values named by 'm' and 'a' are multiplied and stored in the location named 'F', as binding to C++ semantics would have it. Otherwise you need at least three models - 'F = m * a', 'm = F/a' and 'a = F/m' - to let you represent everything that the one entity in the domain represents. Which would create a mismatch between the language and the domain, eliminating the benefit of a DSL.


I've also been reading 'Programming Language Pragmatics' and working on my model-based language kin a lot, and adapting the parser and code model of the modelling language I used to do a state model of some concurrent code to support kin. I was generating Java as a first cut to bootstrap a full compiler, but Java's generics aren't rich enough to support kin's parametric types, and without them it's no easier to generate Java than C. So I will go straight to a C back end, and generate an intermediate form which can be interpreted. It's more important for kin to be able to call C libraries such as the Intel Math Kernel Library than Java libraries, and I don't really like the sort of compromises you have to make to fit in with Java's libraries, however nice the JVM's HotSpot and garbage collector are to have.


2008-03-29

More kin musings

Back from finishing a contract in Malvern, and have had a few ideas on kin.

If general allocation is monomorphic arrays in column order, then single object representation is just a stride of 1 and a length of 1, and padded as required.

The assumption is that the code will operate on arrays, so the compiler should default to vector ops where possible.

Need some basic constructs before starting, including mono and poly VLists, and a base lisp-like language to write the parser in.

The problem is that it's easiest to build an interpreter on a higher-level language, such as Java, but the reasons I'm looking at kin - such as SIMD and cache optimisations, pattern matching and relations - aren't available in Java.

Also, I'm getting more and more convinced that polymorphic references should be (type, data) pairs, rather than single pointers to headered objects as in C++ and Java.

The big issue I'm having is that each time I think how to do something, I write it in kin, then can't be bothered to write it in C or NASM.

Starting with the string type, I end up finding there's only a couple of functions which are needed over seq<uint8>, and those are easier to specify in kin than code up in C. No points are performance critical - or rather, no points can be written faster in C than in kin.

So I will try to define a metacircular kin interpreter, and work out where the smallest subset lies.

Also, for my specific interest of event-driven simulations, I'd been thinking about pushing events forward for handling collisions. This requires that updates on an event don't change the pre-event state, but perform copy-on-modify, as the pre-event state is also used to derive the intermediate event state. If you do modify it, then the best you can do is say 'there was a collision in the time step', and you have to keep the time step short enough that it doesn't matter when the collision occurred within it.

I'm trying to think where C++-style identity tied to memory is useful; usually it's a premature optimisation, causing inconsistent states if you're not careful with the visibility of an update to a simulation entity after an event (the work I was doing up to November suffered from this, but that architectural decision was made before I joined the team). In C++ it's hard to work out how to create a 'make an updated copy' operation which doesn't perform redundant copying. It either involves writing a constructor which is aware of the update operation (causing slicing issues if the update is polymorphic), or requires redundant initialisation if you use something like:

void entity::update_kinematics (double time_step) {
    position += (velocity + 0.5 * acceleration * time_step) * time_step;
    velocity += acceleration * time_step;
}

// create a copy of previous step's state
entity* updated = new entity(*previous);

// update it
updated->update_kinematics (time_step);
// store it in the active entities list for the new time

Then all the updated entities go into the collision detection routine, and any collisions are handled before the next interaction stage, where acceleration is recalculated. If there is a collision, then a new time event is added before the current time step, so that the interactions due to the collision can be worked on, or the collision is handled in some other way. Then any other interactions are handled, based on the positions (or trajectories) at the current time, and then the actors can update their intentions based on the sensing done in the interaction phase. Each time there's a copy-on-write going on, so, for example, if aircraft X decides to shut its radar off at time t1, the radar emission is still visible along its trajectory from time t0 to t1, and any other actors can sense it and respond.

For a 6DoF vector position and velocity, the cost of writing the data twice isn't that great, but for larger structures it can be. Maybe some sort of copy-on-write at the language level, and an escape analysis for whether to use that or mutable structures. That would of course need an experiment to see if copy-on-write costs more than mutability though; it does for large structures, but not always for smaller ones.

How this relates to the sort of purely functional data structures described by Chris Okasaki I've yet to think through.

Other things I've been thinking of: modelling actors as state-machine transducers, implemented as coroutines (see this paper for some inspiration) using a trace-based code generator. The problem is that coroutines fit well with transducers, with the style of event-driven actor programming I would like to express, and with mostly sequence-based processing, but not with the VM implementations I'm familiar with (or C++ coding, especially RAII). Actors accept a set of values, and also yield symbols, or accept a different symbol.

For example, a translator from UTF-8 to Unicode codepoints would accept a variable number of uint8 values for each uint32 value it produces:

# Convert UTF-8 string to a sequence of Unicode codepoints
codepoints (str:string) : seq<uint32> =>
  yield str -> (x0:uint8) =>
    if ((x0 & 0x80) == 0) then
      yield uint32(x0)
    else
      accept (x1:uint8) where ((x1 & 0xc0) == 0x80) # UTF-8 encoding valid
      if ((x0 & 0xe0) == 0xc0) then # 2 byte sequence
        yield (uint32(x0 & 0x1f) << 6) | uint32(x1 & 0x3f)
      else
        accept (x2:uint8) where ((x2 & 0xc0) == 0x80) # UTF-8 encoding valid
        if ((x0 & 0xf0) == 0xe0) then # 3 byte sequence
          yield (uint32(x0 & 0x0f) << 12) | (uint32(x1 & 0x3f) << 6) | uint32(x2 & 0x3f)
        else # 4 byte sequence
          assert ((x0 & 0xf8) == 0xf0) # UTF-8 encoding valid
          accept (x3:uint8) where ((x3 & 0xc0) == 0x80) # UTF-8 encoding valid
          yield (uint32(x0 & 0x07) << 18) | (uint32(x1 & 0x3f) << 12) |
                (uint32(x2 & 0x3f) << 6) | uint32(x3 & 0x3f)
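For contrast, the same transducer flattened into the procedural style I'd currently have to target - plain C++, with the validation dropped and no bounds checks on truncated input:

#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 to codepoints with an explicit loop: the accept/yield
// structure of the version above becomes index bookkeeping.
std::vector<std::uint32_t> codepoints(const std::string& str) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < str.size(); ) {
        std::uint8_t x0 = str[i++];
        if ((x0 & 0x80) == 0) {
            out.push_back(x0);
        } else if ((x0 & 0xe0) == 0xc0) {                    // 2 byte sequence
            std::uint8_t x1 = str[i++];
            out.push_back((std::uint32_t(x0 & 0x1f) << 6) | (x1 & 0x3f));
        } else if ((x0 & 0xf0) == 0xe0) {                    // 3 byte sequence
            std::uint8_t x1 = str[i++], x2 = str[i++];
            out.push_back((std::uint32_t(x0 & 0x0f) << 12)
                        | (std::uint32_t(x1 & 0x3f) << 6) | (x2 & 0x3f));
        } else {                                             // 4 byte sequence
            std::uint8_t x1 = str[i++], x2 = str[i++], x3 = str[i++];
            out.push_back((std::uint32_t(x0 & 0x07) << 18)
                        | (std::uint32_t(x1 & 0x3f) << 12)
                        | (std::uint32_t(x2 & 0x3f) << 6) | (x3 & 0x3f));
        }
    }
    return out;
}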


My lower-level implementation experience is only with fairly procedural languages, such as Java byte code or conventional use of assembly, so translating this to code that runs is a bit interesting; it sort of goes round in circles - if I already had a language which worked on the accept/yield principles (with enough control to create SIMD code and column-ordered VLists), then creating a parser and an interpreter would be easier.


2007-10-26

Garbage

I've been thinking a bit about garbage collection and stack allocation, and both Robert O'Callahan and phaeron at virtualdub seem to have been too.

Chicken scheme seems to have something interesting - a single stack which is collected on exhaustion (rather than C# or Java VMs which have a traditional program stack and a stack based allocator), so I'm thinking about implementing a tail-call-elimination variant of the single stack. That requires escape analysis, which may be costly - in its simplest form, it requires a flag to indicate whether any reference was made to anything on the stack above the point the tail call returns to. Either you can be conservative and set the flag on all function calls where a reference to a stack object is passed as an argument, or you can use a single assignment model and test whether anything was allocated. Making the core single-assignment would possibly complicate supporting other languages, and means you trade space in one condition - consing to hold the assigned value - for space in another - being able to recover some of the stack early in a tail call.

Having just the one stack means that, on x64 at least, you can use the one stack register.

Hmm. These things are all rather hard to do right, and don't have simple 40 line benchmarks.

See also a simple worked example.


Pete


2007-03-22

Prunes. Thursdays. Fortress. Locks.

I'm going to try and get a bit more regular with my blogging. Like every Thursday.

This week, I've got my hands on an iRex Iliad from work, so I can read the 400-odd-page Fortress language specification (1.0 beta) without killing a small tree.

I like Fortress' traits system; it seems to have enough information for the sort of semantic compiler optimisations and automatic parallelism I'm interested in. Although the language doesn't expose these, it's very similar to some of the sorts of things I was trying to do at BAE with code generation from a maths model, which led me to start playing with kin - though kin is relational rather than functional. The reference interpreter isn't a high-performance system; if I get some time and the kin interpreter gets anywhere, I may use Fortress as an input language, though I do want to do something more with the relational aspect. I like that its definition of an integer is based on rings and fields rather than bit-string operations. My fiancée is in Romania from May to September, so, instead of just pining away, I may actually stop talking and do something on kin, rather than a sequence of 400-line micro-benchmarks.

Both for work and kin I've been trying to get up to speed with these guys' work; the publish-subscribe idiom is used heavily within the virtual machine for the system I'm creating at work, and it seems to spend an inordinate proportion of its time waiting for locks on the properties it's published (I'm still getting to grips with interpreting VTune's output for multi-threaded apps).

On the kin side, I'm trying to think of a nice way of packing R32 and R64 vectors so you can both look up type information and operate on them using SSE. For objects with a reference to their type, aligning each type to a cache line gives more state bits in the word - 4 rather than the 2 of a 32-bit-aligned pointer - which may be useful. For a primitive field in an object, the field type can be associated with the object rather than with the field's value, as the field can't be aliased, so all code which accesses it has to go through the object's traits first.

The cases I want to optimise, such as Array[/ (R32)^4, (ZZ32,ZZ32) /] (a 2D array of 4D 32-bit vectors, which has a very efficient contiguous representation), I imagine would also have access to the type of the array, and functions would operate primarily on the array. But mapping a magnitude function onto the array would need to operate on the type of the element, and I was imagining a fully dynamic dispatch (which then solidifies and inlines the commonest branches). Maybe pushing the type onto the function argument stack would work in those cases, rather than making every value carry its type around. In that case, the array member case and the object field or method local variable cases are all the same - the context you got the value from can give the type of the value, which is either the global ref to the type of a primitive, or the reference to the metaclass of a reference type. That should work for both Fortress and ECMAScript style objects constructed from traits.

My current model for running ECMAScript is that each time you set a field on an ECMAScript object to a value of a different type from the current one, you create a new reference type trait object to describe it; these anonymous traits are reference counted in addition to any memory management, so they may modify rather than copy previous traits if they are only used by the one object. Not original, but amenable to having a common runtime for all the sorts of languages I tend to use - HPC, dynamic UI, and relational logic.


TME


2006-11-19

SSE String

(to rhyme with silly string)

Since I'm not seeing the fine lass this weekend, I've spent some time twiddling my bits - playing with applying SSE2 to string algorithms.

SSE2 operations shift the costs of certain operations drastically, turning O(N*M) to O(N) for M <= 16, and making register comparisons far faster than loads. I've played with algorithms for searching for a single character in a string and for searching for a substring. The platform is WinXP, AMD 3400 mobile. Applying SSE to the naive algorithm for finding a character - loop and compare each - runs around 5.3 times faster; using _mm_cmpeq_epi8 to compare 16 bytes at a time.
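The single-character search is roughly this (a sketch of the approach rather than the benchmarked code; __builtin_ctz is GCC/Clang - on MSVC it would be _BitScanForward):

#include <emmintrin.h> // SSE2
#include <cstddef>

// Compare 16 input bytes at a time against the target character; the
// movemask picks out the first matching byte, and the tail is bytewise.
const char* find_char_sse2(const char* s, std::size_t len, char c) {
    const __m128i needle = _mm_set1_epi8(c);
    std::size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i block =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, needle));
        if (mask != 0)
            return s + i + __builtin_ctz(mask); // lowest set bit = first hit
    }
    for (; i < len; ++i)
        if (s[i] == c)
            return s + i;
    return nullptr;
}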

Applying SSE to the naive algorithm for finding a substring - find the first character in the substring, then test the remainder - gives a 2 to 5 times speed up (best case is where the first character never occurs in the string) using SSE to match both the first character (compare a block of input with the first character 16 times in parallel), then to compare with the remainder (compare the current input to the pattern 16 bytes at a time).

Using Boyer-Moore on the same input gives results usually between the two. Attempting to use SSE for the inner part of Boyer-Moore doesn't help - extracting the index of the first mismatch from the 16 bit pattern indicating which characters in the pattern match the input costs more than the non-SSE version.

Boyer-Moore-Horspool is faster than Boyer-Moore, and faster than naive+SSE for longer strings, but its simplifications don't seem to be enough to admit any obvious parallelisations. The speed advantage over Boyer-Moore is probably that the algorithms are implemented as functions, and Boyer-Moore has to allocate and free an array on each call, so the performance would be different for a pattern-matcher object.

Knuth–Morris–Pratt seems to underperform on these tests, again probably due to the dynamic memory.

I need to think about SSEifying Boyer-Moore-Horspool and Knuth–Morris–Pratt, and create another test harness that allows for pattern objects so the initialization of the tables in the more complicated algorithms doesn't count against them. But in a single function use, it appears that naive is best when the first character is rare, and Boyer-Moore-Horspool where the string is longer. Rareness also helps the algorithms with dynamic memory, as each is called until all the input is scanned, and the rarer the pattern the fewer the times the function is called.

Before I do that, I'll get the core object system in kin working (see next post) - I'm just about happy to go on with that now.


TME
