Outline of a typed lambda calculus system

A brief overview of how a typed lambda calculus system could work in CellML

Abstract

Introducing a typed lambda calculus system (TLCS) for type specification in CellML has the potential to make CellML more powerful, while not deviating from the goal to make the model a declaration of the mathematics. However, it has become apparent that a number of details need to be worked out more fully before it will be productive to start writing normative rules for such a specification. The goal of this document is to bridge the gap between the vague idea of using typed lambda calculus for variable typing, and having concrete, well specified rules as part of a normative specification. It is hoped that this document will also make it easier for the CellML community to participate in discussions about the typing system.
The basic idea underlying the proposal is that types will be represented as variables, and relationships between types will be expressed as mathematical operations performed on types. These relationships will be expressed using MathML inside components, that is, exactly the same mechanism used to express relationships between real numbers at present.

Under the proposal, there will no longer be special units elements in CellML. Instead, units information is treated as part of the type (so for example, we will have a built-in type called something like real_metre, which is the type for a real number in metres).

It is worth noting that to fully implement everything that would be possible under the TLCS would probably not only be impractical, but it is not clear that it would even be possible to handle in general. However, the concept of secondary specifications allows for secondary specifications to further restrict what is and is not valid in a model. For example, one secondary specification might disallow the use of types other than type, real number types, and functions of real numbers, while another might also allow vectors, matrices and lists of real numbers. The core specification, on the other hand, needs to be general to accommodate as many secondary specifications as possible, but it also needs to be set up so that it clearly defines the semantics, so that models which are valid under more than one different secondary specification have the same meaning under all of them.

Variables as types

Under the TLCS, CellML variables are defined inside components to hold types. Variables in one component can be connected to variables in other components (subject to the standard encapsulation rules), allowing different types to be defined in different components. Components designed to be imported in another model may be coded using generic types (that is, using a type variable which could be connected to more than one different type) so that different final types can be connected up when the component is imported.

In order to provide built-in types, there will need to be built-in variables. There are several ways in which this could be done (for example, there could be a built in component which is always available). However, the simplest way seems to be to have built-in variable names which act as 'superglobals' in the sense that they are accessible in every single component. The type of these superglobals would be assumed to be entirely constrained, in the sense that it would be an error to try to define their type in terms of other built-in types. It would be valid to connect a built-in type using the name of one component to another variable (to allow, for example, an imported component taking a generic type to be connected up to something like real_metre). However, it would be invalid to connect two built-in variables to each other. The semantics of the built-in variables would be that they appear automatically in all components, with public and private interfaces turned on, and they are implicitly connected to the same unit in all other components (so, for example, if I defined component A and component B, then both would automatically have a built-in variable called real_metre, and those two variables would be treated as if they were connected to each other, even though the model contains no instruction to perform this connection).

Every variable element will have on it a 'type' attribute, which defines the type of that variable. For example, a variable storing a real number in metres might have a type="real_metre" attribute on it. The type attribute always references a variable (storing a type) present in the same component. Variables holding a type have type "type", where type is a built-in variable which itself has type "type".

MathML operators for dealing with types

We will need to define some additional operators for use in MathML to help us deal with types. Some things we would probably want to have in our type system:

Making a new real number type to represent the dimensions of a product of two existing real number dimensions. For example, to define something like real_metre . real_second. We could potentially just use times from MathML for this.
Taking real number types to an integral exponent. This is a shorthand for multiplication when the integer is positive, and negative values will allow real number types like metres per second to be defined in terms of real_metre and real_second when combined with the above operator.
Creating real number types which represent an offset off another real number type, e.g. to define a type for real numbers in degrees Fahrenheit in terms of real numbers in Kelvin.
Creating types which represent the product space of two (or more) other types. For example, given the types real_metre and real_second, creates the type for both a real_metre and real_second.
Creating the type of a function in terms of the return type and the argument types. For example, to create the type of a function which returns a real_metre given a real_second. This should not be confused with functions which take and return types, which would also be valid and could potentially be useful (the type of such a function would be the type of a function which takes a type and returns a type).
Reversing the operations of the previous two operators (that is, finding the type of a given slot of a product space, and finding the return type or an argument type of the type of a function).

Dealing with real numbers

In previous versions of CellML, there have been very special rules defined for how to deal with the units on real numbers. It would be a step backwards for this information to be lost. It is therefore important that the core CellML specification defines how dimensionally equivalent real numbers are determined.

We also need to specify in more general terms how variables in two types defined separately can be connected. For example, we might define a vector of real numbers separately in two different places, once in terms of real_metres and once in terms of a derived type for real numbers in millimetres. It would make sense to specify that if these variables are connected, that the types should be converted appropriately.

In order to do this, we need to define equivalence rules for real numbers in different types of units. The equivalence rules would enforce dimensional consistency, but automatically apply conversion factors.

For this to be possible, we would need to have operators for creating derived units from the base units. We need the following operators for this:

An operator which relates a real type A to two real types A and B such that the equation x[X] = a[A] * b[B] is dimensionally consistent (where square brackets after the variable indicate the units on that variable).
An operator which relates a real type A to a real type X and real number prefix / multiplier (or possibly also a name like 'milli'), with the semantics that multiplier units of A are the same as one unit of X.
An operator which relates a real type A to a real type X and a real number offset, with the semantics that values of real type A are equivalent to values of real type X offset by some real number.
An operator which relates a real type A to a real type X and a real exponent e such that x[X] = a[A]^e is dimensionally consistent.

The specification would define that two types are equivalent if they are structurally identical except different dimensionally consistent real number units are present. This would mean, for example, that a structured type with real number types in it could be connected to a different structure with different real number types in it, as long as the real number types are dimensionally consistent, and the connection would automatically perform conversion between the units.

This means that, unlike in many procedural languages, two types are equivalent and variables of those types can be connected if they are defined in the same way, even if they are defined separately in two different places. Identifier types are the exception, and can be used to provide semantics like 'tag types' in other languages when this is required.

Identifier types

A number of potential features of CellML 1.2 require that we have identifier types - that is, types which have no actual numerical value, but exist solely to be unique in a certain context.

Firstly, there will need to be a built-in type variable called 'identifier'.

Secondly, we new context-sensitive MathML constructors for values with type 'identifier'. It would be good to ensure that these are the only source of context sensitive information in the entire specification (that is, whenever we need such context sensitivity, we use these constructors).

A constructor which produces the same identifier for the particular fragment of MathML (that is, for the particular equation in the particular component in the particular XML infoset), even if the same component is imported more than once.
A constructor which produces an identifier which is specific to the particular instantiation of the import (that is, unlike identifiers from the above constructor, if you import the same component twice you will get different identifiers).

There are several key things for which these constructors can be used for:

Defining real number types in new base units. These types should be defined in terms of an identifier, using an operator which relates an identifier to a real number type in base units. In this case, importing the same component (containing units type variables) multiple times should result in the same base units, and so the first form of the constructor would normally be used.
Attaching discriminating identifiers to other numbers for set membership. By attaching a discriminating identifier to, say, a real number, we ensure that identifier annotated real number is not identical to any other set members added in another context, even if the value of the real number is the same. This is important in the semantic description of, say, a general rate law in terms of a sum of fluxes. We will need a new mathematical operator for relating identifiers and values to values annotated with the identifier. For most applications of this we would use the second type of identifier constructor.

Functions

In CellML 1.1, variables are implicitly related to the variable of integration (often time), but there is no explicit ability to define functions, and no explicit ability to reference the variables which are actually functions of time at any point other than the current time in a simulation.

The implicit formulation of functions of time as variables is a very convenient short-hand... it would be very tedious to always write things like x(t) = y(t) * a(t) + b(t) instead of just x = y*a + b. We therefore want to preserve the ability to use this short-hand in CellML 1.2, but make it an option which is available when it is explicitly asked for.

To do this, we can add a new attribute called implicit_function_of to variable. So for example, a component might contain something like:
<variable name="time" type="real_second" />

This would be the equivalent short-form to something like:

<variable name="time" type="real_second" />
<variable name="fn_seconds_to_metre" type="type" />
<variable name="x" type="fn_seconds_to_metre" />
...
<math xmlns="...">
<apply><equals/>
    <ci> fn_seconds_to_metre </ci>
    <apply><cellml:function_type />
      <ci> real_second </ci>
      <ci> real_metre </ci>
    </apply>
</apply>
<apply><equals/>
      <ci> x </ci>
      <lambda>
         <ci> time </ci>
         <apply><times/>
            <cn cellml:units="metres_per_second"> 5 </cn>
            <ci> time </ci>
         </apply>
      </lambda>
</apply>
</math>

Note that it would also be possible to define functions which return types. This could be convenient for metaprogramming style components which work with more than one different type (along the lines of how the template system works in C++).

To define functions which take more than one argument, we could either use list facilities (the function doesn't actually take several arguments, instead it takes one argument which happens to be a list), or we could, instead of making a function which takes 3 real arguments and returns a real value, make a function which takes one argument which returns a function which takes one argument which returns a function which takes one argument and returns a real number. The problem with both these approaches is they are a little more verbose to use than necessary, and we would probably end up needing to define some syntactic sugar around them anyway. So instead, we should allow functions to take one or more parameters (but functions which want to return more than one value can use another data structure).

Advantages of making all time dependence explicit include:

The ability to define things like rates in terms of historical functional values (time delays).
The ability to apply operators across all time, for example to estimate an average or maximum value of a function over a range of times.
More meaningful semantics for calculus operators and other functionals.

Structured types

MathML already has facilities for constructing several different types of structured type (vectors, matrices, sets, lists, intervals, and also has an imaginaryi constant for the construction of complex numbers).

However, simply mirroring these types is not necessarily sufficient. While it probably makes sense for complex numbers and intervals to consist of two different real numbers, vectors, matrices, and sets could meaningfully have elements made up of arbitrary other types. However, for these types, it most likely would be reasonable that all members are of the type would be of the same type, and so we should specify the element type for all members.

In addition, the question of whether to fix the size of the vector in the type arises. Since we are not defining an inheritance hierarchy for types, we do not have the facilities to say that something like a vector of size n is a special case of a vector of unspecified size. However, it is likely that it would very often be useful to say that the type of a variable is a vector of size n for some fixed n. It therefore makes sense that we should make the type of vectors and matrices include the size (but make sets and lists be arbitrarily sized objects). However, if the secondary specification allows it, the type could be defined in terms of size types which are also connected up, allowing components the option of making the vector / matrix dimensions controlled by other variables.

Another useful type to have would be a structured type which differently typed elements, with the types of each element defined a priori (similar to structs in languages like C / C++). We could define this type using a constructor which takes the types of each element as arguments.

In addition, to allow for better metaprogramming support, we need operators for:

Extracting the sizes / dimensions of vector / matrix / set / list types. This is different to the card operator for sets in MathML because it works on the type, not on instances of the type.
Extracting the types of elements for all of the above types from the aggregate type.

Other useful operators we should have

Extracting the type of a variable from a variable.
Extracting the return and argument types from a function.
Comparing types for equivalence, as opposed to identity.
Extracting members from instances of structured types. MathML supplies the selector operator for this on vectors, matrices, sets, and lists, and so we would only need to extend it to structs.