2008-08-07

Why Type

Link: http://blogs.msdn.com/ericlippert/archive/2007/12/06/immutability-in-c-part-three-a-covariant-immutable-stack.aspx

For work, I'm demoing a GUI editor, which means undo would be nice, which is easiest to implement if you enforce purely functional data structures. So I'm doing a bit of reading as to how to implement suitable structures in Java 1.6, as that's the project's target language, and using Eric Lippert's examples as part of my reference material.

There was also yet another thread on forums.java.sun.com about what can be done with generics, erasure and making arrays safe.

I like having something which proves that all possible execution paths don't result in obviously erroneous actions. For many cases, a strong type system supports this.

But the more I look at it, being able to assert any simple thing about any action on an object would make a lot of the proofs simpler. There are better and better proof tools around, and the kinds of things you want to prove about concurrent systems are not easily expressed in a type.

I'm increasingly thinking that a dynamically typed runtime, with a good static model checker, would trump a strongly typed language in performance, productivity, and safety. But creating such a beast is highly non-trivial.

Labels: ,

2008-04-25

Some things I'm thinking about

I've been playing more with my mostly functional modelling and simulation language (kin), here are some ideas which I want to try and profile to see if they offer gains:

Carrying type and data as a tuple:


Some implementation of Scheme and JavaScript use a single void* size datum for each object, and for integers set the lowest bit and encode the value in the rest, and for other objects the lowest bits are zero due to alignment, and the whole is a pointer to object. That doesn't work very well for doubles, and requires that the type is stored in the object's header. Java, C#, and C++ (for object with a vtable) have type information in the object header, which costs in large arrays. I've been thinking about passing the type and the data around separately, which means you can have more tightly packed arrays for non-vtable types.

One-bit markers for escape detection:


Based on One-bit Reference Counting, a marker in the type field to distinguish 'shared' objects you can take a reference to, and 'non shared' which you must not.

Write-once arrays:


As an extension to 'one bit reference counting' style; you maintain a high water mark instead of always copying, for use in structures such as VLists so you don't copy on passing automatically, and it allows you to mutate an array if it can be proven that all accesses happen before writes in matrix multiplications:

ax[i] <- ax[i] + row[i] * col[i]

Block order arrays for cache sensitive iteration and super compiled functors:


In a simulation, it's not uncommon to have a reduction on some function of a few attributes of an entity:

let total_force = reduce(array_of_entities, lambda(e:entity, ax) => f.mass * f.acceleration + ax, 0)

If this is over a homomorphic array, and mass and acceleration are simple double values, this would in C translate to picking 8 bytes here and there from the memory for the array. If instead each field is kept either in 'column order', or in what I'll call block order - so the fields for (cacheline size in bytes) objects are held contiguously. This should both reduce cache misses, and allow the use of SIMD instructions to process the data. The obvious disadvantage is that an object in an array is no longer the same layout as a single object on the heap, and to exploit it you need either a super-compiler or a trace-based JIT.

Effective model for scopes based on lists, symbols and objects:


Trying to build an interpreter in Java, which makes it tempting to use maps for everything, I found that the properties of an object and the properties of a lexical scope are much the same, (the duality of closures and objects is well known) so will try and define the binding model for values in kin using symbols, integers and lists only.

Using CAS for thread safe homogeneous VLists


Similar to the vector implementation in Lock-free Dynamically Resizable Arrays.

Options for compiling Verilog to shader language


Having had an interview last week as a systems modelling engineer with a company who were using C as the modelling language for timing simulations in embedded memory controllers, which is a bit like going for a job as a car mechanic and discovering that you're being quizzed on your wood-working skills, I was thinking about compiling an appropriate modelling language to something which executes efficiently. Though their particulary problem - timing conflicts - I would have though as having an analytic solution.

Something after UML


Speaking of Verilog, Verilog came into existance because even standardized schematic diagrams don't carry strong enough semantics and are not amenable to algebraic analysis, and graphical notations don't scale to large systems without hiding information.
Pi-calculus came into existance as Petri nets don't scale to large systems and are not amenable to algebraic analysis.
UML is very much in the standardised schematic mould, lacks formal semantics, and relies on hiding information to scale.

Often the hidden information in UML is very important - what appears to be a good design is not a good design if its simplicity is achieved by hiding most of the attributes and operations of a class, as often happens when a class gets multiple unrelated responsibilities. The notation allow you to hide details which should indicate that the design should be factored into smaller units, and in fact encourage such behaviour as a mechanism to scale up to showing many classes. For example, in Altova UModel:
You can customize the display of classes in your diagram to show or hide individual class properties and operations. ... This feature lets you simplify the diagram to focus on the properties and operations relevant to the task at hand.

If the class has features which are not relevant, then refactor it. Don't hide it so it doesn't make the documentation more complicated. The classes for well-factored systems have nothing to hide, but require more relations between the classes, which UML tools don't handle well (last time I checked none provided basic auto-routing or anything like good layout engines) and look more complicated, even though they are more simple - in the same way a dry stone wall is simpler than a minimalist plastered white apartment, but is visually more complex.

So what might come after UML then? What would not mitigate against refactoring, but allow a visual overview? What notation might give good analytic properties for software, rather than being a system schematic with loose semantics?

I don't know yet.

Labels: , , ,

2008-04-20

Bjarne Stroustrup on the Evolution of Languages

Link: http://msdn2.microsoft.com/en-us/magazine/cc500572.aspx

A few comments on this -

The illustrations in the interview don't appear related other than that they are of an IDE ("I might have a different opinion if a good IDE was universally available", queue Visual Studio, which implies that VS isn't a good IDE) and that the UML is illustrating multiple inheritance, but mechanism for dispatch in the paper is not based on multiple inheritance (at least as shown in the diagram).

It's an interesting paper though, and I would like open methods in a language.

I don't think that DSLs should be just extensions to general purpose languages. Mainly because that implies a specific execution model, and if you have a DSL then execution shouldn't be part of the domain model (unless it's something like Verilog, or the input to a program proof tool, in which case execution is a domain concept). DSLs should allow the modelling of a domain, and lose their benefit - abstraction away from execution - if they are tied to a programming language. If you have a model of the physical domain, then you can state 'F = m * a' and not have that imply that the values named by 'm' and 'a' are multiplied and stored in the location named 'F', as binding to C++ semantics would have it. Otherwise you need at least three models - 'F = m * a', 'm = F/a' and 'a = F/m' - to let you represent everything that the one entity in the domain represents. Which would create a mismatch between the language and the domain, eliminating the benefit of a DSL.


I've also been reading 'Programming Languages Pragmatics' and working on my model-based language kin a lot, and adapting the parser and code model of the modelling language I used to do a state model of some concurrent code to support kin. I was generating Java as a first cut to bootstrap a full compiler, but Java's generics aren't rich enough to support kin's parametric types, and without them it's not easier to generate Java than C. So I will go straight to a C back end, and generate an intermediate form which can be interpreted. It's more important for kin to be able to call C libraries such as the Intel Math Kernal Library than for Java libraries, and I don't really like the sort of compromises you have to make to fit in with Java's libraries, however nice the JVM's HotSpot and garbage collector are to have.

Labels: , , ,

2008-03-29

More kin musings

Back from finishing a contract in Malvern, and have had a few ideas on kin.

If general allocation is monomorphic arrays in column order, then single object representation is just a stride of 1 and a length of 1, and padded as required.

Assumption is that the code will operate on arrays, so compiler should default to vector ops where possible.

Need some basic constructs before starting, including mono and poly VLists, and a base lisp-like language to write the parser in.

Problem is that it's easiest to built an interpreter on a higher level language, such as Java, but the reasons I'm looking at kin - such as SIMD and cache optimisations, pattern matching and relations - aren't available in Java.

Also, I'm getting more and more convinced that polymorphic references should be pairs rather than as in C++ and Java.

The big issue I'm having is that each time I think how to do something, I write it in kin, then can't be bothered to write it in C or NASM.

Starting with the string type, I end up finding there's only a couple of functions which are needed over seq<uint8>, and those are easier to specify in kin than code up in C. No points are performance critical - or rather, no points can be written faster in C than in kin.

So I will try to define a metacircular kin interpreter, and work out where the smallest subset lies.

Also, for my specific interest of event driven simulations, I'd been thinking about pushing events forward for handling collisions. This requires that updates on event don't change the pre-event state, but perform copy-on-modify, as the pre-event state is also used to derive the intermediate event state. If you do modify it, then the best you can do is say 'there was a collision in the time step', and you have to keep the time step short enough that it doesn't matter when the collision occured within it.

I'm trying to think where C++ style identity tied to memory is useful; usually it's a premature optimisation, causing inconsistent states if you're not careful with the visibility of an update to a simulation entity after an event (the work I was doing up to November suffered from this, but that architectural decision was made before I joined that team). In C++ it's hard to work out how to create a 'make an updated copy' operation which doesn't perform redundant copying. It either involves writing a constructor which is aware of the update operation (causing slicing issues if the update is polymorphic) or requiring redundant initialisation if you use something like:

void entity::update_kinematics (double time_step) {
position += (velocity + 0.5 * acceleration * time_step) * time_step;
velocity += acceleration * time_step;
}

// create a copy of previous step's state
entity* updated = new entity(*previous);

// update it
updated->update_kinematics (time_step);
// store it in the active entities list for the new time

Then all the updated entities go into the collision detection routine, and any collisions are handled before the next interaction stage, where acceleration is recalculated. If there is a collision, then a new time event is added before the current time step, so that the interactions due to the collision can be worked on, the collision is handled in some other way. Then any other interactions are handled, based on the positions (or trajectories) at the current time, and then the actors can updated their intentions based on the sensing done in the interaction phase. Each time there's a copy-on-write going on, so, for example, if aircraft X decides to shut its radar off at time t1, the radar emmision is still visible along its trajectory from time t0 to t1, and any other actors can sense it and respond.

For a 6DoF vector position and velocity, the cost of writing the data twice isn't that great, but for larger structures it can be. Maybe some sort of copy-on-write at the language level, and an escape analysis for whether to use that or mutable structures. That would of course need an experiment to see if copy-on-write costs more than mutability though; it does for large structures, but not always for smaller ones.

How this relates to the sort of purely functional data structures described by Chris Okasaki I've yet to think.

Other things I've been thinking of - modelling actors as state-machine transducers, implemented as coroutines (see this paper for some inspiration) using trace-based code generator. The problem is that coroutines fit well with transducers and the style of event-driven actor programming I would like to express, and also with mostly sequence based processing, but not with the VM implementations I'm familiar with (or C++ coding, especially RAII). Actors accept a set of values, and also yield symbols, or accept a different symbol.

For example, a translator from UTF-8 to Unicode codepoints would accept a variable number of uint8 values for each uint32 value it produces:

# Convert UTF-8 string to a sequence of Unicode codepoints
codepoints (str:string) : seq<uint32> =>
yield str -> (x0:uint8) =>
if ((x0 & 0x80) == 0) then
yield uint32(x0)
else
accept (x1:uint8) where ((x1 & 0xc0) == 0x80) # UTF-8 encoding valid
if ((x0 & 0xe0) == 0xc0) then # 2 byte sequence
yield (uint32(x0 & 0x1f) << 6) | uint32(x1 & 0x3f)
else
accept (x2:uint8) where ((x2 & 0xc0) == 0x80) # UTF-8 encoding valid
if ((x0 & 0xf0) == 0xe0) then # 3 byte sequence
yield (uint32(x0 & 0x0f) << 12) | (uint32(x1 & 0x3f) << 6) | uint32(x2 & 0x3f)
else # 4 byte sequence
assert ((x0 & 0xf8) == 0xf0) # UTF-8 encoding valid

accept (x3:uint8) where ((x3 & 0xc0) == 0x80) # UTF-8 encoding valid
yield (uint32(x0 & 0x07) << 18) | (uint32(x1 & 0x3f) << 12) |
(uint32(x2 & 0x3f) << 6) | uint32(x3 & 0x3f)


My lower-level implementation experience is only with fairly procedural languages, such as Java byte code or conventional use of assembly, so translating this to code that runs is a bit interesting; it sort of goes round in circles - if I already had a language which worked on the accept/yield principles (with enough control to create SIMD code and column-ordered VLists), then creating a parser and an interpreter would be easier.

Labels: , ,

2007-10-26

Garbage

I've been thinking a bit about garbage collection and stack allocation, and both Robert O'Callahan and phaeron at virtualdub seem to have been.

Chicken scheme seems to have something interesting - a single stack which is collected on exhaustion (rather than C# or Java VMs which have a traditional program stack and a stack based allocator), so I'm thinking about implementing a tail-call-elimination variant of the single stack. That requires escape analysis, which may be costly - in its simplest form, it requires a flag to indicate whether any reference was made to anything on the stack above the point the tail call returns to. Either you can be conservative and set the flag on all function calls where a reference to a stack object is passed as an argument, or you can use a single assignment model and test whether anything was allocated. Making the core single-assignment would possibly complicate supporting other languages, and means you trade space in one condition - consing to hold the assigned value - for space in another - being able to recover some of the stack early in a tail call.

Having just the one stack means that, on x64 at least, you can use the one stack register.

Hmm. These things are all rather hard to do right, and don't have simple 40 line benchmarks.

See also a simple worked example.


Pete

Labels: , ,

2007-03-22

Prunes. Thursdays. Fortress. Locks.

I'm going to try and get a bit more regular with my blogging. Like every Thursday.

This week, I've got my hands on an iRex Iliad from work so I can read the 400 odd page Fortress language specification (1.0 beta) without killing a small tree.

I like Fortress' traits system, it seems to have enough information for the sort of semantic compiler optimisations and automatic parallelism I'm interested in. Although the language doesn't expose these, it's very similar to some of the sorts of thing I was trying to do at BAE with code generation from a maths model, which led me to start playing with kin. Though kin is relational rather than functional. The reference interpreter isn't a high performance system; if I get some time and the kin interpreter gets anywhere, I may use Fortress as an input language, though I do want to do something more with the relational aspect. I like that its definition of an integer is based on rings and fields rather than bit-string operations. My fiancée is in Romania from May to September, so, instead of just pining away, I may actually stop talking and do something on kin rather than a sequence of 400 line micro-benchmarks.

Both for work and kin I've been trying to get up to speed with these guy's work; the publish-subscribe idiom is used heavily within the virtual machine for the system I'm creating at work, and it seems to spend an inordinate proportion of its time waiting for locks on the properties it's published (I'm still getting to grips with interpreting VTune's output for multi-threaded apps).

kin related I'm trying to think of a nice way of packing R32 and R64 vectors so you can both look up type information and operate on them using SSE. For objects with a reference to their type, if each type is aligned to a cache line, it gives more state bits in the word - 4 rather than 2 for a 32bit aligned pointer, which may be useful. For a primitive field in an object, the field type can be associated with the object's value rather than the field's value, as it can't be aliases, so all code which accesses it has to go through the object's traits first. The cases I want to optimise, such as Array[/ (R32)^4, (ZZ32,ZZ32) /] (a 2D array of 4D 32 bit vectors, which has a very efficient contiguous representation) I imagine would also have access to the type of the array and functions would operate primarily on the array. But mapping a magnitude function onto the Array would need to operate on the type of the element, and I was imagining a fully dynamic dispatch (which then solidifies and inlines the commonest branches). Maybe pushing the type onto the function argument stack would work in those cases, rather than making every value carry its type around. In that case, the array member case and the object field or method local variable cases are all the same - the context you got the value from can give the type of the value, which is either the global ref to the type of a primitive, or the reference to the metaclass of a reference type. That should work for both Fortress and ECMAScript type objects constructed from traits. My current model for running ECMAScript is that each time you set a field on an ECMAScript object to a value of a different type to the type you create a new reference type trait object to describe it; these anonymous traits are reference counted in addition to any memory management so may modify rather than copy previous traits if they are only used by the one object; not original but amenable to having a common runtime for all the sorts of languages I tend to use - hpc, dynamic UI, and relational logic.


TME

Labels: , , ,

2006-11-19

SSE String

(to rhyme with silly string)

Since I'm not seeing the fine lass this weekend, I've spent some time twiddling my bits - playing with applying SSE2 to string algorithms.

SSE2 operations shift the costs of certain operations drastically, turning O(N*M) to O(N) for M <= 16, and making register comparisons far faster than loads. I've played with algorithms for searching for a single character in a string and for searching for a substring. The platform is WinXP, AMD 3400 mobile. Applying SSE to the naive algorithm for finding a character - loop and compare each - runs around 5.3 times faster; using _mm_cmpeq_epi8 to compare 16 bytes at a time.

Applying SSE to the naive algorithm for finding a substring - find the first character in the substring, then test the remainder - gives a 2 to 5 times speed up (best case is where the first character never occurs in the string) using SSE to match both the first character (compare a block of input with the first character 16 times in parallel), then to compare with the remainder (compare the current input to the pattern 16 bytes at a time).

Using Boyer-Moore on the same input gives results usually between the two. Attempting to use SSE for the inner part of Boyer-Moore doesn't help - extracting the index of the first mismatch from the 16 bit pattern indicating which characters in the pattern match the input costs more than the non-SSE version.

Boyer-Moore-Horspool is faster than Boyer-Moore, and faster than naive+SSE for longer strings, but its simplifications don't seem to be enough to admit any obvious parallelisations. The speed advantage over Boyer-Moore is probably that the algorithms are implemented as functions, and Boyer-Moore has to allocated and free an array on each call. So the performance would be different for a pattern matcher object.

Knuth–Morris–Pratt seems to under perform on these tests, again probably due to the dynamic memory.

I need to think about SSEifying Boyer-Moore-Horspool and Knuth–Morris–Pratt, and create another test harness that allows for pattern objects so the initialization of the tables in the more complicated algorithms doesn't count against them. But in a single function use, it appears that naive is best when the first character is rare, and Boyer-Moore-Horspool where the string is longer. Rareness also helps the algorithms with dynamic memory, as each is called until all the input is scanned, and the rarer the pattern the fewer the times the function is called.

Before I do that, I'll get the core object system in kin working (see next post) - I'm just about happy to go on with that now.


TME

Labels: , ,