2009-11-21

Trees, Forests, Vines.

My current clients are using Model Driven Development for a new product line. Obviously I'm not able to say what the product line is, but that doesn't matter for the work I have to do, which is to do with issues they are having maintaining model consistency between distributed teams on three sites. It's actually quite close in spirit to a project on hypertext repositories for multi-view engineering models which I started, and couldn't get continued funding for, when I was a research associate at York.

They have a large UML/SysML and Simulink model across which they are trying to maintain traceability. The UML model is created in Enterprise Architect and shared between teams using XMI export and a ClearCase repository. They have modelled their requirements in EA, and are tracing those system requirements to requirements in each sub-system, and from those local requirements to the implementation. They aren't using separate domain and implementation models as is practiced in some UML styles.

EA has a rather awkward XMI import mechanism when it comes to links between elements in different packages - when a package is imported, all links whose client end is in the imported package are erased, and all links defined in the XMI are created. Links are only created if both the client and supplier ends exist in the model. This means that unless an inter-package link is recorded in the XMI at both ends, a link created to a new element might not get imported correctly, and so disappears from the XMI the next time it is committed to ClearCase.
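
To make the behaviour concrete, here is a rough sketch of the import rule as I understand it - illustrative pseudocode of the behaviour described above, not EA's actual logic, and the data structures are invented:

    # Sketch of the XMI package-import rule as described above, not EA's code.
    # 1. drop existing links whose client end lives in the imported package,
    # 2. recreate links from the XMI only where both ends already exist.

    def import_package(model_links, element_owners, imported_package, xmi_links):
        # model_links: set of (client_id, supplier_id) currently in the model
        # element_owners: dict of element_id -> owning package name
        # xmi_links: links recorded in the imported package's XMI

        # Step 1: erase links whose client end is in the imported package
        kept = {(client, supplier) for (client, supplier) in model_links
                if element_owners.get(client) != imported_package}

        # Step 2: create links from the XMI, but only if both ends exist
        for (client, supplier) in xmi_links:
            if client in element_owners and supplier in element_owners:
                kept.add((client, supplier))
            # otherwise the link silently fails to appear, and is lost from the
            # XMI the next time this package is exported and committed

        return kept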

There are work-arounds for this, but basically the problem comes down to a mis-match between a hierarchical version control system, where users work on a package and the contents of a package, and a hyperlinked model, where links can exist between different nodes in the hierarchy, and really belong to neither end.

Once you also introduce baselines and viewpoints into such an environment, you get to a state where you have to restrict the types of links - most UML links are designed so the client is dependent on the supplier, so the semantics are compatible with the link being recorded in the client's package only - and also order the updates - the supplier must exist in the model before the client is loaded for the link to be created. This makes it harder to update models, and for peer teams to push out baselines - you have to update the models for each subsystem team in an order consistent with any dependencies between the subsystems.
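
That ordering constraint is essentially a topological sort over the subsystem dependencies. A minimal sketch, assuming a hypothetical map of each subsystem to the subsystems it depends on (the subsystem names are made up):

    # Hypothetical dependency map: client subsystem -> suppliers it depends on.
    # Baselines have to be imported supplier-first, i.e. in topological order.

    def import_order(dependencies):
        """Return an order in which every supplier precedes its clients."""
        order, done = [], set()

        def visit(subsystem, seen=()):
            if subsystem in done:
                return
            if subsystem in seen:
                raise ValueError("circular dependency involving " + subsystem)
            for supplier in dependencies.get(subsystem, []):
                visit(supplier, seen + (subsystem,))
            done.add(subsystem)
            order.append(subsystem)

        for subsystem in dependencies:
            visit(subsystem)
        return order

    # e.g. flight_control depends on sensors and power, sensors on power:
    print(import_order({
        "flight_control": ["sensors", "power"],
        "sensors": ["power"],
        "power": [],
    }))  # -> ['power', 'sensors', 'flight_control']

A circular dependency between subsystems is exactly the case such an ordering cannot handle.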

The difficulty in ordering baselines between subsystems is mitigated by designing to an interface, as that reduces the dependencies between subsystems, but it does not eliminate them. High integrity systems also have common effects which create dependencies between subsystems that cannot be designed around using interfaces (thermal characteristics, electromagnetic interference, power use, etc.), and a model without these has lower predictive value. One technique for getting around these dependencies is to apportion a budget for them rather than calculating them, which pushes them up towards the root of the tree. The other is to create a dependent model which is traced to the design model but represents the EM view or the thermal view of the system. So in addition to having vines between the branches of the design tree, there can be a forest of trees which model different aspects of the system.

Having looked at EA's capabilities - and been a bit hopeful about 'model views' - it doesn't seem to have anything to support multiple viewpoints on the model. You can't create a second tree based on a different viewpoint, and there isn't navigation via backward «trace» dependencies (you can create a second tree of packages which trace to the original tree, but navigating back to the original requires clicking through dialogs). EA also doesn't create synthetic nesting or dependency relationships between packages in the tree whose elements depend on each other, which would be useful if you have more than one hierarchy in the system. Multiple hierarchies arise when you are dividing a system in different ways - for example, a structural view is divided into zones, a functional view into sub-functions, and a sub-systems view into sub-systems.

I have a strong feeling that a distributed model based on XMPP or Atom protocols between nodes for each team would be good, but that doesn't replace the requirement to externalise the model for baselining or backup using industry standard version control tools, or the issues with importing into existing UML tools. There is also a distinct difference in view between the idea of 'the model' and 'a model' - having moved away from CASE systems to distributed C4I systems for a couple of years, there are techniques for working with uncertain data and distributed databases which might be interesting to pursue, but which are not going to sell to most engineering organisations large enough to need them - if you don't control the situation, then you use distributed models based on the best information available at the time, correlating information from teams, rather than partitioning a system into a rigid hierarchy and trying to manage updates to that hierarchy. In such a distributed model, different viewpoints do not conflict - there is no one 'true' hierarchy of the system, only whatever hierarchy is useful for the team using that viewpoint - and assertions are made about an object which the owners of the object then choose to confirm or reject as authentic.
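
A minimal sketch of that last idea as data - an assertion made by one team about another team's object, which the owning team then confirms or rejects - with all the names invented for illustration:

    # Illustrative only: an assertion made by one team about an object owned by
    # another team, which the owner later confirms or rejects as authentic.

    from dataclasses import dataclass

    @dataclass
    class Assertion:
        subject: str          # identifier of the object the claim is about
        attribute: str        # e.g. "mass_kg" or "emc_class" (invented)
        value: object
        asserted_by: str      # team making the claim
        status: str = "proposed"   # "proposed" | "confirmed" | "rejected"

    class ObjectOwner:
        def __init__(self, team, owned_ids):
            self.team = team
            self.owned_ids = set(owned_ids)
            self.assertions = []

        def receive(self, assertion):
            # only claims about objects this team owns are held for review
            if assertion.subject in self.owned_ids:
                self.assertions.append(assertion)

        def review(self, assertion, accept):
            assertion.status = "confirmed" if accept else "rejected"

    # e.g. the sensors team asserts a mass figure about a structures-owned bracket
    owner = ObjectOwner("structures", {"bracket-42"})
    claim = Assertion("bracket-42", "mass_kg", 1.2, asserted_by="sensors")
    owner.receive(claim)
    owner.review(claim, accept=True)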

Last time I looked at RDF it was still disappointing when it comes to baselining parts of a model - although most triple stores have now got some kind of context or graph ID, there isn't 'native' support for versioning, or for saying graph X contains Y at version Z; instead the movement seems to be towards having a graph ID per triple, which doesn't seem very manageable to apply version control to. The model has a few thousand packages, each with a few classes, each with a few members, each of which would require several triples to describe, so there would be of the order of one or two million triples to include in each baseline. Baselining on a graph dump would be possible - just dump all the triples in the baseline out to a file and put that in source control - but that moves the mechanics out of the RDF metadata. Doing that with XMI and source control means there is nothing other than diff'ing the XMI files between versions to say what has changed; part of wanting to move to a distributed graph format is to get a difference mechanism which understands that the model is a graph rather than a text file.
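
For what it's worth, a back-of-envelope version of that triple count - the per-level figures here are guesses for illustration, not measurements of the actual model:

    # Rough order-of-magnitude estimate of triples per baseline; all the counts
    # below are assumptions, not measurements.
    packages            = 3000   # "a few thousand packages"
    classes_per_package = 5      # "each with a few classes"
    members_per_class   = 8      # "each with a few members"
    triples_per_member  = 6      # name, type, visibility, owner, ...
    triples_per_class   = 10     # plus triples describing the class itself
    triples_per_package = 5      # plus triples describing the package itself

    total = packages * (
        triples_per_package
        + classes_per_package * (
            triples_per_class + members_per_class * triples_per_member
        )
    )
    print(total)   # 885000 - of the order of a million triples per baseline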

2009-04-09

A new way to look at networking.

Link: http://video.google.com/videoplay?docid=-6972678839686672840

I find this video incredibly interesting. It's talking about the phase transition that occurred between telecoms networks and packet switched networks, and questioning whether the same can happen between packets and data.

In a routed telecom network, your number 1234 meant move the first switch to position 1, the second switch to position 2, and so on - it described a path between two endpoints. In TCP/IP your packet has an address, and each node in the mesh routes it on to the node it thinks is closer to the endpoint with the target address.
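
A toy contrast of the two styles, just to make the distinction concrete - the topology and table structures here are invented:

    # A dialled number encodes the path itself; a packet address names the
    # endpoint and lets each hop make its own forwarding decision.

    def telecom_route(digits, switch_fabric):
        """The digits ARE the path: each digit picks the port at each switch."""
        node, path = "origin", []
        for digit in digits:
            node = switch_fabric[node][digit]    # the digit is the decision
            path.append(node)
        return path

    def packet_route(destination, start, routing_tables):
        """The address names the endpoint; each hop chooses the next node."""
        node, path = start, [start]
        while node != destination:
            node = routing_tables[node][destination]   # local lookup per hop
            path.append(node)
        return path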

Currently I have two or three home machines (netbook, desktop and file server) where data lives locally, plus data 'in the cloud' on gmail.com and assembla.com, or on my site here at tincancamera.com. Though as far as I'm concerned, that data isn't in the cloud - it's on whatever servers those addresses resolve to. If instead I google for 'pete kirkham', about half the hits are me. If I put a UUID on a public site, then once it was indexed I could find that document by putting the ID into a web search.

In the semantic web, there's been a bit of evolution from triples without provenance to quads - triples plus an id, or some other means of associating a triple with the URI of the origin which asserts it. It's considered good form to have the URI for an object resolve to some representation of that object, so there's a mix of identifier and address. The URI for kin is http://purl.oclc.org/net/kin, which uses HTTP redirection to point to the current kin site. This depends on OCLC's generous provision of the persistent uniform resource locator service, and it maps an address to another address rather than quite mapping an identifier to an address.
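
As a minimal sketch of the difference, with the subject and asserting source made up for illustration:

    # The same statement as a bare triple and as a quad carrying the URI of the
    # origin which asserts it; subject and source are made-up examples.

    triple = ("http://example.org/kin",            # subject
              "http://purl.org/dc/terms/creator",  # predicate
              "Pete Kirkham")                      # object

    quad = triple + ("http://tincancamera.com/",)  # + graph/source URI

    # Keying a store by source makes "who asserted this?" a direct lookup:
    store = {}
    store.setdefault(quad[3], set()).add(quad[:3])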

There isn't an obvious analogue for data of the TCP nodes routing a request on to a node which might be closer to having the data, though in some ways a caching scheme with strong names might meet many of the requirements. It sort of brings to mind the pub-sub systems I've built in the past, but with a stronger form of authentication. Pub-sub replication isn't caching, though, as in a distributed system there isn't a master copy which the cached version is a copy of.

There's also a bit of discussion of broadcast versus end-to-end messages; I've got to work out how to do zero-config messaging sometime, which again sort of makes sense in such systems - first reach your neighbours, then ask them for either the data, or the address of a node which might have the data. But there still isn't an obvious mapping of the data space without the sort of hierarchy that TCP/IP addresses have (although pub-sub topics are often hierarchic, that hierarchy doesn't reflect the network topology even to the limited extent that TCP/IP addresses do). It also has some relation to p2p networks such as BitTorrent, and, by the time you're talking about accessing my mail account in the cloud, to webs of trust. A paper on unique object references in support of large messages in distributed systems just popped up on LtU, which I'll add to my reading list this weekend.
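
A very rough sketch of that 'ask your neighbours' lookup, with an invented topology and no claim that a real system would look like this:

    # Invented topology: each node holds some named data and knows its
    # neighbours. Lookup asks neighbours for the data, or for someone closer.

    def find(name, node, nodes, visited=None):
        visited = visited if visited is not None else set()
        if node in visited:
            return None
        visited.add(node)
        if name in nodes[node]["data"]:
            return node                      # this node has the named data
        for neighbour in nodes[node]["neighbours"]:
            found = find(name, neighbour, nodes, visited)
            if found:
                return found                 # a neighbour (or its neighbour) has it
        return None

    nodes = {
        "netbook":    {"data": set(),         "neighbours": ["desktop"]},
        "desktop":    {"data": set(),         "neighbours": ["netbook", "fileserver"]},
        "fileserver": {"data": {"uuid-1234"}, "neighbours": ["desktop"]},
    }
    print(find("uuid-1234", "netbook", nodes))   # -> fileserver

What this doesn't have is any notion of 'closer' without a hierarchy in the name space, which is the missing piece the paragraph above points at.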

I've also started reading Enterprise Integration Patterns. There's a bit of dissonance between the concept of a channel in these patterns - and I guess in the systems they are implemented in - and the concept of a channel in pi-calculus. In EIP, a channel is a buffered pipe for messages, and is a static entity created as part of the system configuration; in pi-calculus a channel is not buffered, and the normal mechanism for a request/response pair is to create a new channel, send that channel to the server of the request, and receive the response on that channel. The pi-calculus model is the pattern for communication I'm intending for inter-process communication in kin (though I'm not quite sure whether giving explicit support to other parts of pi-calculus is something I want). I'm not sure that having to mark request messages with an ID (as in XMPP IQ) rather than having cheap channels is a good programming model; though there are reasons not to pretend that an unreliable connection is a pi-calculus channel. I'll see what I think after I've tried coding the channels. Certainly, returning a channel from a query on a strong name - a channel which lets the requester communicate with an endpoint that has the named data - could be a model worth pursuing. REST is a bit different in that the response channel is assumed by the protocol, but the URLs returned by the application are channels for further messages to be sent to, and of course can point anywhere, allowing the server for the next request to be the one best suited to handle it (e.g. a list of the devices in a home SCADA system could list the URIs of the embedded servers on the sensor net rather than requiring the client to know the topology prior to making the initial request).
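
A sketch of the channel-per-request shape, using Python queues to stand in for channels - queues are buffered, which is already a departure from the pi-calculus notion, and the names are mine:

    import queue, threading

    # pi-calculus style request/response: create a fresh channel, send it with
    # the request, and receive the reply on it - no correlation IDs needed.

    def server(requests):
        while True:
            msg = requests.get()
            if msg is None:
                return                        # shutdown signal
            reply_to, payload = msg
            reply_to.put(payload.upper())     # reply on the channel we were given

    requests = queue.Queue()
    threading.Thread(target=server, args=(requests,), daemon=True).start()

    reply_channel = queue.Queue()             # the fresh, per-request channel
    requests.put((reply_channel, "hello"))
    print(reply_channel.get())                # -> HELLO

    requests.put(None)                        # stop the server

The EIP/XMPP alternative would be a single shared reply channel with an ID on each message to correlate responses to requests; the trade-off is the one described above.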
