2009-11-21

Trees, Forests, Vines.

My current clients are using Model Driven Development for a new product line. Obviously I'm not able to say what the product line is, but that doesn't matter for the work I have to do, which has to do with issues they are having with model consistency between distributed teams on three sites. It is actually quite close in spirit to a project on hypertext repositories for multi-view engineering models which I started, and couldn't get continued funding for, when I was a research associate at York.

They have a large UML/SysML and Simulink model in which they are trying to maintain traceability. The UML model is created in Enterprise Architect and shared between teams using XMI export and a ClearCase repository. They have modelled their requirements in EA, and are tracing these system requirements to requirements in each sub-system, and from these local requirements to the implementation. They aren't using separate domain and implementation models as is practised in some UML styles.

EA has a rather awkward XMI import mechanism when it comes to links between elements in different packages - when a package is imported, all links whose client end is in the imported package are erased, and all links defined in the XMI are created. Links are only created if both the client and supplier end exist in the model. This means that unless an inter-package link is recorded in the XMI at both ends, a link created to a new element might not get imported correctly, and so disappears from the XMI the next time it is committed to ClearCase.
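A minimal sketch of that import rule may make the failure mode clearer. The data model here is hypothetical (sets of links and a dict of elements, not EA's actual internals), but it follows the rule as described: erase links whose client end is in the imported package, then recreate only those XMI links whose both ends already exist.

```python
def import_package(model_links, model_elements, package, xmi_links):
    """Simulate the import rule: model_links is a set of (client, supplier)
    pairs, model_elements maps element name -> owning package."""
    # 1. erase all links whose client end is in the imported package
    kept = {(c, s) for (c, s) in model_links
            if model_elements.get(c) != package}
    # 2. create links recorded in the XMI, but only if both ends exist
    for (c, s) in xmi_links:
        if c in model_elements and s in model_elements:
            kept.add((c, s))
    return kept

# A link from A (in "pkgA") to element B that isn't yet in the model is
# silently dropped on import - and so vanishes from the next XMI export.
elements = {"A": "pkgA"}  # B not yet imported anywhere
links = import_package(set(), elements, "pkgA", [("A", "B")])
assert links == set()     # the A -> B link disappeared
```

Once B's package has been imported, re-importing pkgA recreates the link, which is why ordering the updates works around the problem.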

There are work-arounds for this, but basically the problem comes down to a mis-match between a hierarchical version control system, where users work on a package and the contents of a package, and a hyperlinked model, where links can exist between different nodes in the hierarchy, and really belong to neither end.

Once you also introduce baselines and viewpoints into such an environment, you get to a state where you have to both restrict the types of links - most UML links are designed so the client is dependent on the supplier, so the semantics are compatible with the link being recorded in the client's package only - and order the updates - the supplier must exist in the model before the client is loaded into the model for the link to be created. This makes it harder to update models, and for peer teams to push out baselines - you have to update the models for each subsystem team in an order consistent with any dependencies between the subsystems.

The difficulty in ordering baselines between subsystems is mitigated by designing to an interface, as this reduces the dependencies between subsystems, but it does not eliminate them. High integrity systems also have common effects which create dependencies between subsystems that cannot be designed around using interfaces ( thermal characteristics, electromagnetic interference, power use, etc. ), and a model without these has lower predictive value. One technique to get around these dependencies is to apportion a budget for them rather than calculating them, which pushes them up towards the root of the tree. The other is to create a dependent model which is traced to the design model but represents the EM view or the thermal view of the system. So in addition to having vines between the branches of the design tree, there can be a forest of trees which model different aspects of the system.

Having looked at EA's capabilities - and been a bit hopeful about 'model views' - I can't find anything to support multiple viewpoints on the model. You can't create a second tree based on a different viewpoint, and there is no navigation via a backward «trace» dependency ( you can create a second tree of packages which trace to the original tree, but navigation back to the original requires clicking through dialogs ). EA also doesn't create synthetic nesting or dependency relationships between packages in the tree, or between packages whose elements depend on each other, which are useful if you have more than one hierarchy in the system. Multiple hierarchies arise when you are dividing a system in different ways - for example, a structural view is divided into zones, a functional view into sub-functions, and a sub-systems view into sub-systems.

I have a strong feeling that a distributed model based on XMPP or Atom protocols between nodes for each team would be good, but that doesn't remove the requirement to externalise the model for baselining or backup using industry standard version control tools, or the issues with import into existing UML tools. There is also a distinct difference in view between the idea of 'the model' and 'a model'. Having moved away from CASE systems to distributed C4I systems for a couple of years, I know there are techniques for working with uncertain data and distributed databases which might be interesting to pursue, but they are not going to sell to most engineering organisations large enough to need them. If you don't control the situation, then you use distributed models based on the best information available at a time, correlating information from teams, rather than partitioning a system into a rigid hierarchy and trying to manage updates to that hierarchy. In such a distributed model, different viewpoints do not conflict - there is no one 'true' hierarchy of the system, only whatever hierarchy is useful for the team using that viewpoint - and assertions are made about an object which the owners of the object then choose to confirm or reject as authentic.

Last time I looked at RDF it was still disappointing when it comes to baselining parts of a model. Although most triple stores now have some kind of context or graph ID, there is no 'native' support for versioning, or for saying graph X contains Y at version Z; instead the movement seems to be towards having a graph ID per triple, which doesn't seem very manageable to apply version control to - the model has a few thousand packages, each with a few classes, each with a few members, each of which would require several triples to describe, so it would be of the order of one or two million triples to include in each baseline. Baselining on a graph dump would be possible - just dump all the triples in the baseline out to a file and put that in source control - but that moves the mechanics out of the RDF metadata. Doing that with XMI and source control means there is nothing other than diff'ing the XMI files between versions to say what has changed; part of wanting to move to a distributed graph format is to get a difference mechanism which understands that the model is a graph rather than a text file.
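The triple count above is back-of-envelope; a sketch with illustrative figures (the exact counts are assumptions, not measurements of the actual model) shows how quickly it reaches the millions:

```python
# Illustrative figures only - "a few thousand packages, each with a few
# classes, each with a few members, several triples per element".
packages = 4000
classes_per_pkg = 6
members_per_class = 10
triples_per_member = 5

triples = packages * classes_per_pkg * members_per_class * triples_per_member
assert triples == 1_200_000  # of the order of one or two million per baseline
```

With a graph ID per triple, each baseline would need on the order of a million extra context assertions as well, which is the manageability problem.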


2009-10-09

Re: Why is UML so hard?

In response to Why is UML so hard?

In the late '90s I was working as a research associate at the University of York looking at CASE ( computer aided systems engineering ) tools and notations when UML started to happen to the industry as a merger of some OO notations developed in industry from experience in the '70s and '80s.

The academic world had already learnt from the cognitive science and HCI people that graphical notations don't help understanding above a certain level of detail, and that the strongest aids to understanding are spatial rather than graph based. Ignoring this, most UML notations are graph based and attach no meaning to spatial information ( you get swimlanes in some diagrams, but in most diagrams the layout has no significance ).

In addition, as a species our brains have a million years of constructing graphs from narratives - it's how social groups evolve - and so creating graphs of relationships between objects from linear descriptions isn't that hard for us.

One thing which is hard for us to understand is concurrency, and having gone through a similar evolution - finding that Petri nets rapidly become too complex to read as diagrams - most concurrency notations by the late 1990s used algebraic notations instead. You can't make money selling tools for algebraic notations for software - for many applications, source code is the most useful algebraic notation ( notations such as Z or pi-calculus are useful in special cases, but for most systems the cost of producing a model in those notations isn't worth the benefit - you only do it if you want to analyse the system with those tools ).

The trend in academic CASE research at the time was away from single view models - these were considered unwieldy in practice - towards a 'viewpoints oriented' approach to modelling, where you had a web of different hyperlinked models of a system, each of which was tailored to a different purpose. It's obviously harder to make a tool that tolerates creative inconsistency than one based on a central authoritative model. UML sort of adopted this in its distinction between Domain, Platform Independent and Implementation Models, but within each model there is often only a single viewpoint. It's an open problem how different viewpoints and the don't-repeat-yourself principle apply to software engineering. I've never had any success with PIMs, finding it more useful to create a profile that maps to the domain, and to generate from that directly. But that requires hand-tooling of the code generation to the domain, which COTS UML tools obviously don't supply.

Sometimes a graph based notation is the right thing - module dependency diagrams are one CASE notation that can tell you something important about a system's complexity and design quality at a glance. A group did such an analysis at York for the Rolls Royce Trent FADECs in the '90s; some UML tools now available will do it for you as part of their reverse engineering, others force you to create the dependencies yourself. These UML package diagrams get used in presentations, so people will hide dependencies which make the diagram look cluttered and hard to read - but the clutter is the most useful thing the diagram is showing.

Quite a lot of the problems I've seen with applying UML seem to come from forgetting that it's a model. I take the Pragmatic view - a good model is one where you get useful results from applying that model; attempting to capture the 'truth' of a system isn't useful in itself.

Duffymo says 'We'd have bike shed arguments for hours about what "customer" meant.' If you're creating a customer class, you should already have

  • a customer requirement which has been described using

  • a user story or a persona taking part in a use case

  • which involves analysis use cases which have been decomposed into areas of responsibility

  • which are then allotted to packages

  • which contain classes, each of which has a single responsibility


Sometimes this works well, but it is brittle - there's always a temptation to jump to design because you know what a customer is - and requires management. It's a bit easier to do if you have a concrete system in mind, and a hard boundary on the responsibilities.

But the biggest step is to stop arguing about what "customer" means, and start arguing about how an object with the semantics represented by the "customer" class in the model satisfies the responsibilities of the "accounts payable ( account greater than one month in arrears )" use case ( or takes part in that user story, or fills out its class responsibility cards, or whatever driver you're using to inject the requirements into the model ). Software has to do something, so what matters is what "customer" does in the context of the software it's part of.

Maybe the most important thing about UML is that, deep down, it doesn't mean anything - there are no formal semantics; it's a notation with syntax only, and it's up to users to apply semantics from their domain. Add to that that almost all the actual designing is the work done between the diagrams, which is lost in UML ( the better tools let you create associations showing allotments of functionality to packages from the use cases, but there's no notation to record why and how such allotments were made ), and you get a tool that's not easy to use for its stated purpose - recording the design of a system.

So the goal when applying UML to a problem is for the UML model to be useful, rather than attempting to capture meaning. Arguments as to how to represent things in the model applying the semantics and notations of UML should be grounded in the effect such representations have on the utility of the model.

The first thing to do when applying a UML tool to a new system design is to sketch how the inputs and outputs of the process of creating a model of the system will interact with the design process as a whole.

A rule of thumb is to record only the information which has utility in the next phase of the software engineering process. Anything else will probably change as the system evolves, so it is just a rough note, and I've never seen a UML tool which was as easy to use for rough notes as a pencil and paper.

In terms of using the parts of UML which represent the system, you can try minimizing cost by reducing the level of detail and reverse engineering. You can maximise benefit by generating code or performing analyses before committing to designs. If you're not using a tool that allows you to either generate high quality code or run speculative analyses, it's unlikely to be worth creating a detailed model. If all you want from your model is documentation, think about using a tool such as Doxygen instead - any documentation of the system-as-built which isn't generated from the code is unlikely to be accurate.

Another approach is to use UML to record design commitments - which usually means designing down to package or façade level, rather than to implementation classes. I've used that for small projects ( 6 person months ), and it seemed to be about the right level. Larger systems should be broken into modules, and if you need more detail than the façade of an external module you're using then there's probably something wrong.

The time I've had most success with UML it has been used more as a configuration tool than as a design tool - we had an existing system which was to be customised, we agreed a simple profile of UML that had a mapping to the problem domain, and a code generation/reverse engineering tool specific to the system.


2009-04-01

On UML

Over the last few weeks I've been working on kin, and hanging around stackoverflow. For one question on the usefulness of UML, I ended up writing a rather long answer, so as I haven't blogged in a while I thought I'd post it here too.


There's a difference between modelling and models.

Initially in the design process, the value in producing a model is that you have to get to a concrete enough representation of the system that it can be written down. The actual models can be, and probably should be, temporary artefacts such as whiteboards, paper sketches or post-it notes and string.

In businesses where there is a requirement to record the design process for auditing, these artefacts need to be captured. You can encode these sketchy models in a UML tool, but you rarely get a great deal of value from it over just scanning the sketches. Here we see UML tools used as fussy documentation repositories. They don't have much added value for that use.

I've also seen UML tools used to convert freehand sketches to graphics for presentations. This is rarely a good idea, for two reasons -

1. most model-based UML tools don't produce high quality diagrams. They often don't anti-alias correctly, and have appalling 'autorouting' implementations.
2. understandable presentations don't have complicated diagrams; they show an abstraction. The abstraction mechanism in UML is packages, but every UML tool also has an option to hide the internals of classes. Getting into the habit of presenting UML models with the details missing hides complexity, rather than managing it. It means that a simple diagram of four classes with 400 members gets through code review, but one based on a better division of responsibilities will look more complicated.

During the elaboration of large systems (more than a handful of developers), it's common to break the system into sub-systems, and map these sub-systems to packages (functionally) and components (structurally). These mappings are again fairly broad-brush, but they are more formal than the initial sketch. You can put them into a tool, and then you will have a structure in the tool which you can later populate. A good tool will also warn you of circular dependencies, and (if you have recorded mappings from use cases to requirements to the packages to which the requirements are assigned) then you also have useful dependency graphs and can generate Gantt charts as to what you need for a feature and when you can expect that feature to ship. (AFAIK the state of the art is dependency modelling with added time attributes, but I haven't seen anything which goes as far as Gantt.)

So if you are in a project which has to record requirements capture and assignment, you can do that in a UML tool, and you may get some extra benefit on top in terms of being able to check the dependencies and extract information to plan work breakdown schedules.

Most of that doesn't help in small, agile shops which don't care about CMMI or ISO-9001 compliance.

(There are also some COTS tools which provide executable UML and BPML models. These claim to provide a rapid means to de-risk a design. I haven't used them myself so won't go into details.)

At the design stage, you can model software down to classes, methods and the procedural aspects of methods with sequence diagrams, state models and action languages. I've tended not to, and prefer to think in code rather than in the model at that stage. That's partly because the code generators in the tools I've used have either been poor, or too inflexible for creating high quality implementations.

OTOH I have written simulation frameworks which take SysML models of components and systems and simulate their behavior based on such techniques. In that case there is a gain, as such a model of a system doesn't assume an execution model, whereas the code generation tools assume a fixed execution model.

For a model to be useful, I've found it important to be able to decouple the domain model from execution semantics. You can't represent the relation f = m * a in action semantics; you can only represent the evaluation followed by the assignment f := m * a, so to get a general-purpose model that has three bidirectional ports f, m and a you'd have to write three actions: f := m * a, m := f / a, a := f / m. So in a model where a single constraint of a 7-ary relation would suffice, if your tool requires you to express it in action semantics you have to rewrite the relation 7 times. I haven't seen a COTS UML tool which can process constraint network models well enough to give a sevenfold gain over coding it yourself, but that sort of reuse can be had with a bespoke engine processing a standard UML model. If you have a rapidly changing domain model and build your own interpreter/compiler against the meta-model for that domain, then you can have a big win. I believe some BPML tools work in a similar way, but I haven't used them, as that isn't a domain I've worked in.
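The f = m * a example can be sketched directly: one declarative constraint versus three directional actions. This is an illustrative toy (a real constraint network engine would do propagation over many relations), but it shows the single relation solving for whichever port is unknown.

```python
def solve_f_eq_ma(f=None, m=None, a=None):
    """One constraint, f = m * a, solved for whichever port is None.
    An action-semantics tool would need a separate action per direction."""
    if f is None:
        return {"f": m * a, "m": m, "a": a}   # f := m * a
    if m is None:
        return {"f": f, "m": f / a, "a": a}   # m := f / a
    if a is None:
        return {"f": f, "m": m, "a": f / m}   # a := f / m
    raise ValueError("exactly one port must be unknown")

assert solve_f_eq_ma(m=2.0, a=3.0)["f"] == 6.0
assert solve_f_eq_ma(f=6.0, a=3.0)["m"] == 2.0
assert solve_f_eq_ma(f=6.0, m=2.0)["a"] == 3.0
```

The gain scales with the arity of the relation: for an n-ary relation, one constraint replaces n hand-written actions.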

Where the model is decoupled from the execution language, this process is called model driven development, and Matlab is the most common example; if you're generating software from a model which matches the execution semantics of the target language, it's called model driven architecture. In MDA you have both a domain and an implementation model; in MDD you have a domain model and a specialised transformation to map the domain to multiple executable implementations. I'm an MDD fan, and MDA seems to offer little gain - you're restricting yourself to whatever subset of the implementation language your tool supports and your model can represent, you can't tune to your environment, and graphical models are often much harder to understand than linear ones - we've a million years of evolution constructing complex relationships between individuals from linear narratives ( who was Pooh's youngest friend's mother? ), whereas constructing an execution flow from several disjoint graphs is something we've only had to do in the last century or so.

I've also created domain specific profiles of UML, and used it as a component description language. It's very good for that, and by processing the model you can create custom configuration and installation scripts for a complicated system. That's most useful where you have a system or systems comprising stock components with some parametrisation.

When working in environments which require UML documentation of the implementation of a software product, I tend to reverse engineer it from the code rather than the other way round.

When there's some compression of information to be had by using a machine-processable detailed model, and the cost of setting up code generation of sufficient quality is amortized across multiple uses of the model or by reuse across multiple models, then I use UML modelling tools. If I can spend a week setting up a tool which stamps out parameterised components like a cookie-cutter in a day each, and it takes 3 days to do one by hand, and I have ten such components in my systems, then I'll spend that week tooling up.
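The cookie-cutter arithmetic above, made explicit (the figures are the illustrative ones from the text, in days):

```python
tooling_cost = 5          # "spend a week setting up a tool"
per_component_tooled = 1  # tool stamps out a component in a day
per_component_hand = 3    # 3 days to do one by hand
components = 10

cost_with_tool = tooling_cost + components * per_component_tooled
cost_by_hand = components * per_component_hand
assert cost_with_tool == 15 and cost_by_hand == 30  # tooling wins by 15 days
```

The break-even point is tooling_cost / (per_component_hand - per_component_tooled) components - here, two and a half - which is why the approach only pays off for families of similar components.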

Apply the rules 'Once and Once Only' and 'You Aren't Gonna Need It' to the tools as much as to the rest of your programming.

So the short answer is yes, I've found modelling useful, but models less so. Unless you're creating families of similar systems, you don't gain sufficient benefit from detailed models to amortize the cost of creating them.


2008-05-28

Python and EA

For my current contract, I'm part of a team producing scripts which define interfaces to equipment in a satellite communications system for monitoring and control. Various interfaces, serial, low-level TCP and SNMP.

The application is fairly old - it originated on Unix in the early 1990s and was ported to MFC in the late 1990s - and there are many links between the various configuration files. It's a high reliability system, so it uses the classic pattern of a watchdog process which spawns the active processes, monitors them for liveness, CPU and memory use, and respawns or kills them in the event of failure.

The goal is to use Enterprise Architect for modelling the system in UML. EA has limited custom code generation - one file per class, poor tree-walking functionality - which is just sufficient to create the interface scripts, but none of the other configuration, which is more highly intertwined - for each rack or group of racks of equipment to be monitored, there's a server which runs a set of monitoring processes, each monitoring a set of instances of the equipment. Various parts of the configuration, such as the port ID of each server process, need to be put into different configuration files, so the desired layout of generated files doesn't follow EA's assumptions.

Also, there needs to be a parser for the configurations which reverse engineers a UML model from the configuration files - we don't want to have to generate the UML models for existing equipments by hand.

So I started writing a little parser in Python which generated an XMI file for import into EA. That worked well enough, and impressed the line manager, so have been spending the last couple of days fleshing it out.

EA doesn't quite treat tagged values as first-class citizens of UML, relegating them to a second window ( though it's nice to have them docked and always visible - that could usefully apply to all the panes in the properties dialogs ). So I wanted a table view, laid out like the config files, of all the extra tagged values we're using to define the application specific properties which implement the attributes of the equipment in the configuration files.

I've been using GTK for that - I've played with GTK a bit in C, and quite like its level of abstraction and simple event model. It's not too hard to get things working in Python ( some of the handling of attributes is a bit counter-intuitive ), and it mostly looks native ( compared to Java or Gecko, or even EA's UI, which I believe is written in C#, but isn't using WinAPI menus ).

I quite like Python, but it's a bit too imperative for my taste, and there's no type system to tell you if you've put a right-handed vector into a left-handed library, or referenced an attribute of an object which doesn't exist. But being able to create a simple dialog with a REPL ( read/eval/print loop ) in it to do the initial exploration of the EA COM automation interface was very handy.
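The core of that REPL dialog can be sketched with just the standard library's `code` module (the actual tool fed it the EA COM object and wrapped it in a GTK dialog; `ea_repository` here is a stand-in for whatever object you want to explore):

```python
import code

def make_repl(ea_repository):
    """Build an interpreter with the object under exploration bound
    to the name `repo` in its namespace."""
    return code.InteractiveConsole(locals={"repo": ea_repository})

# console.interact() would run a blocking read/eval/print loop;
# push() feeds one line at a time, which is what a dialog's "enter"
# handler does. push() returns True while more input is needed.
console = make_repl(object())
incomplete = console.push("x = 1 + 1")
assert not incomplete  # the statement was complete and was executed
```

From a GUI, you call `push()` from the text entry's callback and redirect the interpreter's output into the dialog's text view.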

The thing I dislike most in Python is that it doesn't actually use indentation to indicate blocks (I read Ken Arnold's chapter in The Best Software Writing over the weekend, and think I agree). It uses the colon character to indicate a block, and following a colon you have to use indentation. I'm always just doing the indentation, and forgetting the colons.


2008-04-25

Some things I'm thinking about

I've been playing more with my mostly functional modelling and simulation language (kin), here are some ideas which I want to try and profile to see if they offer gains:

Carrying type and data as a tuple:


Some implementations of Scheme and JavaScript use a single void*-sized datum for each object: for integers they set the lowest bit and encode the value in the rest; for other objects the lowest bits are zero due to alignment, and the whole is a pointer to the object. That doesn't work very well for doubles, and requires that the type is stored in the object's header. Java, C#, and C++ (for objects with a vtable) have type information in the object header, which costs in large arrays. I've been thinking about passing the type and the data around separately, which means you can have more tightly packed arrays for non-vtable types.
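The low-bit tagging scheme described above can be sketched in a few lines (the word size and tag assignment are illustrative; real implementations do this on raw machine words):

```python
def tag_int(n):
    """Encode an immediate integer: value shifted up, low bit set."""
    return (n << 1) | 1

def is_int(word):
    """Aligned pointers have a zero low bit, so bit 0 distinguishes them."""
    return word & 1 == 1

def untag_int(word):
    return word >> 1

assert is_int(tag_int(42))
assert untag_int(tag_int(42)) == 42
assert not is_int(0xBEEF0)  # an aligned "pointer": low bit is 0
```

The cost is one bit of integer range and a shift on every arithmetic operation; the gain is that small integers never touch the heap.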

One-bit markers for escape detection:


Based on One-bit Reference Counting, a marker in the type field to distinguish 'shared' objects you can take a reference to, and 'non shared' which you must not.

Write-once arrays:


As an extension to the 'one-bit reference counting' style: you maintain a high water mark instead of always copying, for use in structures such as VLists, so you don't automatically copy on passing, and it allows you to mutate an array if it can be proven that all accesses of an element happen before writes to it, as in matrix multiplications:

ax[i] <- ax[i] + row[i] * col[i]
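A sketch of the high-water-mark idea (naming and policy are illustrative, not kin's actual design): cells below the mark have been published to other holders and force a copy on write; cells at or above it were appended since publication and can be mutated in place.

```python
class WriteOnceArray:
    def __init__(self, cells):
        self.cells = list(cells)
        self.mark = len(self.cells)   # everything below mark is published

    def append(self, value):
        self.cells.append(value)      # above the mark: safe to mutate
        return self

    def write(self, i, value):
        if i >= self.mark:
            self.cells[i] = value     # not yet visible to anyone
            return self
        copy = WriteOnceArray(self.cells)  # published cell: copy first
        copy.mark = 0                      # fresh copy is wholly unpublished
        copy.cells[i] = value
        return copy

a = WriteOnceArray([1, 2, 3])
b = a.write(0, 99)                    # below the mark: returns a copy
assert a.cells == [1, 2, 3] and b.cells == [99, 2, 3]

c = WriteOnceArray([]).append(7)      # appended above the mark
d = c.write(0, 8)                     # so this mutates in place
assert d is c and c.cells == [8]
```

Compared with always copying, the mark lets the common grow-then-read pattern of VLists share structure without a reference count per cell.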

Block order arrays for cache sensitive iteration and super compiled functors:


In a simulation, it's not uncommon to have a reduction on some function of a few attributes of an entity:

let total_force = reduce(array_of_entities, lambda(e:entity, ax) => e.mass * e.acceleration + ax, 0)

If this is over a homogeneous array, and mass and acceleration are simple double values, in C this would translate to picking 8 bytes here and there from the memory for the array. If instead each field is kept either in 'column order', or in what I'll call block order - so the fields for (cache line size in bytes) objects are held contiguously - this should both reduce cache misses, and allow the use of SIMD instructions to process the data. The obvious disadvantage is that an object in an array is no longer the same layout as a single object on the heap, and to exploit it you need either a super-compiler or a trace-based JIT.
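The column-order layout can be sketched with the standard library's `array` module (the entity fields are illustrative): instead of a list of entity objects, each field lives in its own contiguous array of doubles, so the reduction walks dense memory.

```python
from array import array

class EntityColumns:
    """Structure-of-arrays layout: one contiguous column per field."""
    def __init__(self):
        self.mass = array("d")
        self.acceleration = array("d")

    def add(self, mass, acceleration):
        self.mass.append(mass)
        self.acceleration.append(acceleration)

    def total_force(self):
        # Walks two dense arrays of doubles instead of chasing one
        # heap object per entity and picking 8 bytes here and there.
        return sum(m * a for m, a in zip(self.mass, self.acceleration))

ents = EntityColumns()
ents.add(2.0, 3.0)
ents.add(1.0, 4.0)
assert ents.total_force() == 10.0
```

Block order is the middle ground: group the columns in cache-line-sized runs so a single entity's fields are still near each other, at the price of a more complicated indexing scheme.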

Effective model for scopes based on lists, symbols and objects:


Trying to build an interpreter in Java, which makes it tempting to use maps for everything, I found that the properties of an object and the properties of a lexical scope are much the same ( the duality of closures and objects is well known ), so I will try to define the binding model for values in kin using symbols, integers and lists only.
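A sketch of that binding model using only lists and symbols (illustrative, not kin's actual implementation): a scope is a list of bindings plus a parent link, so object properties and lexical scopes share one representation, and shadowing falls out of the lookup order.

```python
def make_scope(parent=None):
    return [[], parent]            # [bindings, parent] - lists only

def bind(scope, symbol, value):
    scope[0].append([symbol, value])

def lookup(scope, symbol):
    # Search the newest bindings first, then walk up the parent chain -
    # the same rule serves lexical scoping and property inheritance.
    while scope is not None:
        for s, v in reversed(scope[0]):
            if s == symbol:
                return v
        scope = scope[1]
    raise KeyError(symbol)

globals_ = make_scope()
bind(globals_, "x", 1)
bind(globals_, "y", 3)
inner = make_scope(parent=globals_)
bind(inner, "x", 2)                # shadows the outer x
assert lookup(inner, "x") == 2
assert lookup(inner, "y") == 3     # falls through to the parent
assert lookup(globals_, "x") == 1
```

An object is then just a scope whose parent is its prototype or class, which is the closure/object duality made concrete.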

Using CAS for thread safe homogeneous VLists


Similar to the vector implementation in Lock-free Dynamically Resizable Arrays.

Options for compiling Verilog to shader language


Having had an interview last week as a systems modelling engineer with a company who were using C as the modelling language for timing simulations in embedded memory controllers - which is a bit like going for a job as a car mechanic and discovering that you're being quizzed on your wood-working skills - I was thinking about compiling an appropriate modelling language to something which executes efficiently. Though their particular problem - timing conflicts - I would have thought has an analytic solution.

Something after UML


Speaking of Verilog: Verilog came into existence because even standardized schematic diagrams don't carry strong enough semantics and are not amenable to algebraic analysis, and graphical notations don't scale to large systems without hiding information.
Pi-calculus came into existence because Petri nets don't scale to large systems and are not amenable to algebraic analysis.
UML is very much in the standardised schematic mould - it lacks formal semantics, and relies on hiding information to scale.

Often the hidden information in UML is very important - what appears to be a good design is not a good design if its simplicity is achieved by hiding most of the attributes and operations of a class, as often happens when a class acquires multiple unrelated responsibilities. The notation allows you to hide details which should indicate that the design needs factoring into smaller units, and in fact encourages such behaviour as a mechanism for scaling up to showing many classes. For example, in Altova UModel:
You can customize the display of classes in your diagram to show or hide individual class properties and operations. ... This feature lets you simplify the diagram to focus on the properties and operations relevant to the task at hand.

If the class has features which are not relevant, then refactor it; don't hide them just to make the documentation look simpler. The classes of well-factored systems have nothing to hide, but require more relations between the classes, which UML tools don't handle well ( last time I checked, none provided basic auto-routing or anything like a good layout engine ), and so look more complicated even though they are simpler - in the same way a dry stone wall is simpler than a minimalist plastered white apartment, but is visually more complex.

So what might come after UML? What would not militate against refactoring, but still allow a visual overview? What notation might give good analytic properties for software, rather than being a system schematic with loose semantics?

I don't know yet.
