Curation Status Representation


by James Lawson, 28/08/07


This document discusses the issues with the current system of model curation status representation on the cellml.org model repository and outlines a proposal to change this representation from a sequential, level-based system, currently represented by one, two or three stars, to a discrete 'descriptive flag' system, represented by a set of icons or glyphs.


Almost all information is useless when out of context. The job of a curator of any information resource, whether it be a museum or an internet database of biological data, is to provide this context - curation metadata. Information in a curated digital database must be represented in a manner that fulfils two main requirements. Firstly, it must be easy and quick to read, because people are busy. Secondly, it must be comprehensive.


The cellml.org model repository has (as of 11:42am, 28/08/07) 282 unique models. All of these models have been curated to at least some extent. Their curation status (in brief), however, may be anything from "non-functional" to "can not currently be represented in CellML, as model contains discrete delay elements," to "model produces identical output to published model, but contains inherent unit inconsistencies." We need an effective, flexible, and intuitive system of informing repository users of the curation status of each model.


Description of the Current System:

The current system for representation of the curation status of each model in the CellML model repository uses stars.

The main model listing page uses a binary system, where a star next to the model name/link denotes that at least one version of that model has been curated to Level 1 of the general curation system. A description of the current system of curation levels can be found at http://www.cellml.org/repository-info/info and is as follows:

CellML Model Curation

The basic measure of curation in a CellML model is described by the curation level of the model document. We have defined four levels of curation:

  • Level 0: not curated;
  • Level 1: the CellML model is consistent with the mathematics in the original published paper;
  • Level 2: the CellML model has been checked for (i) typographical errors, (ii) consistency of units, (iii) that all parameters and initial conditions are defined, (iv) that the model is not over-constrained, in the sense that it contains equations or initial values which are either redundant or inconsistent, and (v) that running the model in an appropriate simulation environment reproduces the results published in the original paper;
  • Level 3: the model is checked for the extent to which it satisfies physical constraints such as conservation of mass, momentum, charge, etc. This level of curation needs to be conducted by specialised domain experts.

CellML Model Simulation

As part of the model curation, it is important to know what tools were used to simulate (run) the model and how well the model runs in a specific simulation environment. Thus, we provide a level of confidence as part of a simulator's metadata for each model. The four confidence levels are defined as:

  1. not curated;
  2. the model loads and runs in the specified simulation environment;
  3. the model produces results that are qualitatively similar to those previously published for the model;
  4. the model has been quantitatively and rigorously verified as producing identical results to the original published model.

On the model-specific pages, the representation of curation status of each model file is split into two categories:
  • the first, denoted by the term 'standard,' refers to the general curation status of the model (described in 'CellML Model Curation' above);
  • the second, denoted by the names of three CellML model simulation programs, 'PCEnv,' 'COR,' and 'JSim,' refers to the ability of those programs to perform a simulation of the model using the CellML file (described in 'CellML Model Simulation' above).

For each of 'standard,' 'PCEnv,' 'COR,' and 'JSim,' a maximum of three stars can be allocated as a representation of the curation level of the model file. It is therefore currently possible to inform the user of the general curation level of the model, and also of which software packages (out of the three currently shown) the model can be run in.

Ideally, a model conforming exactly to the CellML specification should be able to be run in any software that simulates CellML models. One of the major differences between the three programs mentioned here is how rigorously they apply the requirements of the CellML specification. For example, PCEnv will integrate a model whenever it is able to, whereas COR requires that no connections are duplicated, amongst other things that are required by the specification but not necessary to actually integrate the model. JSim requires that models have unit consistency, which virtually none of the models have, usually through the fault of the model author.

Andrew Miller comments, "there are currently not many limitations (as defined in the specification) on what can be represented in CellML. One can represent constructs which have a mathematical meaning but would require a specific algorithm to actually evaluate (such as various types of infinite sums)." A further example of this is the model "Modeling beta-adrenergic control of cardiac myocyte contractility in silico" by Saucerman et al. This model can be represented in CellML, but the simulation packages currently available for CellML models are not able to perform Newton-Raphson solve operations, and therefore cannot solve the set of non-linear simultaneous equations that part of the model depends on.


Motivations for Change:


There are a number of problems with the current system of representing the curation status of the models in the cellml.org repository.
  • The curation status of a model is a relatively complex issue. I believe the current system is too simple to accurately represent enough of the relevant information.
  • On the model specific pages, the star system is aesthetically unpleasing, and can be confusing.
  • We currently offer three software packages in which we apparently test each model. To date, most testing has been done in PCEnv, but more testing is beginning to be done in COR, and more testing in JSim will follow.
  • We do not specify which version of the simulation software was used to test the model. This is an issue which has been discussed among the CellML team and should be relatively simple to amend.
  • If we offer three packages, and a group develops another package, do we support testing in that package as well? Matt Halstead's response here is "I guess the answer there is how much resource we have to apply to this and how important this new package is to the community. It could also be left to the software package owners to do this testing."
  • The specification of the curation levels themselves needs refinement in the case of 'general curation,' and in the case of the simulator-specific levels, some of the requirements are not achievable. Replacing the 'level system' with the system of descriptive flags proposed herein would alleviate this problem.
  • One example of the above is that we cannot currently 'quantitatively and rigorously verify models as producing identical results to the original published model,' since in most cases, all we have for reference is a small, low-resolution graph in a PDF file. This is an issue that requires more flexibility in the system used to describe the current curation status of a model. We need to be able to describe what kinds of tests have been applied here, and give either a quantitative or qualitative rating of how well the CellML model output matches the publication.
  • In the case of 'general curation,' many of the better curated models in the repository at present fulfil all the conditions required to achieve level 2 curation, except for unit consistency. This is usually through no fault of the person who coded the model into CellML, or of the curator - many, if not most, of the models published in the scientific literature are neither dimensionally consistent nor units consistent. Fixing such a model so that it has units consistency would therefore require an investment of resources that is beyond the scope of our current model curation strategy and beyond the skills of the current curator, and would indeed require that the model in question be of disproportionate importance. As a result, there is a large disparity in the curation status of many models that cannot be represented using the current system. Models that have been extensively curated (i.e. models that, apart from consistency of units, meet the level 2 description: "the CellML model has been checked for (i) typographical errors, (ii) consistency of units, (iii) that all parameters and initial conditions are defined, (iv) that the model is not over-constrained, in the sense that it contains equations or initial values which are either redundant or inconsistent, and (v) that running the model in an appropriate simulation environment reproduces the results published in the original paper") must be given the same rating as models that are simply mathematically consistent with the original paper but no more. A solution to this particular problem would be to remove the requirement for units consistency from level 2 and either put it in level 3 or create a new level between the current level 2 and level 3. However, I believe this issue highlights the insufficiency of the current curation system.
  • Satisfying level 2 of the 'general curation' system might require changing the model to get it to reproduce the published results. This would thereby invalidate the level 1 requirement for mathematical consistency with the published paper. Currently it is not possible to give a model a second star without the first. This is simply a technical issue which could be fixed, but if it were fixed, I believe it would be confusing.
  • The current system is very general. Often there is very specific information about the curation status of a model that it would be useful to be able to represent. For example, there might be an issue which the curator is not qualified to fix, but that someone else might find very easy. If this information could be displayed, then someone might be able to fix the model.
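
The loss of information described in the points above can be illustrated with a minimal Python sketch. It assumes that each criterion in the level definitions is recorded as a simple pass/fail value and that each level requires the one below it. The class and field names are hypothetical and are not part of the repository software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurationChecks:
    """Pass/fail outcome of each check named in the level definitions (hypothetical names)."""
    matches_paper_maths: bool      # Level 1 criterion
    no_typos: bool                 # Level 2 (i)
    units_consistent: bool         # Level 2 (ii)
    all_values_defined: bool       # Level 2 (iii)
    not_over_constrained: bool     # Level 2 (iv)
    reproduces_results: bool       # Level 2 (v)
    physical_constraints_ok: bool  # Level 3 criterion


def sequential_level(c: CurationChecks) -> int:
    """Return the highest level reached, assuming each level requires the one below."""
    if not c.matches_paper_maths:
        return 0
    level2 = (c.no_typos and c.units_consistent and c.all_values_defined
              and c.not_over_constrained and c.reproduces_results)
    if not level2:
        return 1
    return 3 if c.physical_constraints_ok else 2


# A model that passes everything at level 2 except units consistency ...
extensively_curated = CurationChecks(True, True, False, True, True, True, False)
# ... and a model that only matches the mathematics of the paper.
barely_curated = CurationChecks(True, False, False, False, False, False, False)

print(sequential_level(extensively_curated))  # 1
print(sequential_level(barely_curated))       # 1
```

Both models collapse to one star, even though the individual checks tell them apart. This is the information a set of descriptive flags would preserve.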

Ideas for Improvement of the Curation Status Representation System:


Remove the 'level' system


Create two levels of representation:

1.) A simple tag system that uses glyphs, which can be used to sort/order/filter models as well as just display information. On the main repository page, several tags could be shown for each model in the listing (or set of models as grouped by the reference they are based on). The number of tags shown on the main page would ideally be kept to a minimum, but the general advantage over using a single star here would be the amount of information able to be represented. Discussions to date have identified several tags that should be shown on the main page for each model. Peter has suggested some kind of column format, which would allow different glyphs to be put in the same column to represent different permutations of the same type of information, e.g. for a 'runs in simulator X' column, the icon(s) of the simulator(s) in which it runs could be used. The actual display of this information is certainly still open to debate - again, the idea is just to represent more information about the curation status of the model on the front page. Glyphs could have question marks over them to represent that the information is unknown, or be crossed out to represent a FALSE state.

i.) Model is CellML X.Y compliant (some models are not CellML compliant, and contain vectors, matrices, discrete delays, etc. but the general consensus appears to be that these models should be kept in the repository until such time as CellML incorporates these features and the models can be properly represented.) Note, I do not entirely concur with displaying these models on the main listing - my personal opinion is that they should be moved to a separate listing of models which have been partially coded but are incomplete because they are either unable to be represented, or because the paper describes them incompletely. Glyph is simply the CellML logo or part of it, along with the most recent version of the specification the model complies with.

ii.) Model runs in simulator A, B, C. Glyph is the icon for the simulator. Implicit here is that levels describing how well the simulator works for a particular model will be dropped. If this information is relevant, in that a particular simulator handles a model incorrectly, this information will be flagged in the wiki. This is similar to a situation where a particular integrator may give distorted output.

iii.) Model agrees with published paper / model source code. (These can be two quite different things. Sometimes the source code is informally published by referring to it in the paper. Some papers say that any information about the model not included in the paper can be found in the source code.) This term is also a conglomeration of the requirements defined in "CellML Model Curation" Level 1, and "CellML Model Simulation" Levels 2 and 3. Which should we refer to here?

iv.) Model produces some/all correct output

v.) Units are consistent / not consistent.

vi.) Model is dimensionally valid and has been checked by a domain expert.

vii.) A wiki page is available providing further detail.

viii.) The model complies with MIRIAM (including what specification version of MIRIAM if further versions are released)

ix.) The model is published/unpublished

x.) The model is mathematical and quantitative / qualitative (i.e. it contains no mathematics and/or parameterisation).

Additionally, each glyph will have a 'tooltip,' so the user can put their mouse over the glyph, and some text will appear describing what the glyph means.
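
One way to picture the proposed flags is as a small data structure in which every flag has a tooltip and one of three states (true, false, unknown). The following Python sketch is illustrative only. The flag keys, the text markers standing in for glyphs, and the choice to treat a missing flag as unknown are all assumptions, not decisions made in this proposal.

```python
from dataclasses import dataclass
from enum import Enum


class FlagState(Enum):
    TRUE = "true"        # glyph shown normally
    FALSE = "false"      # glyph shown crossed out
    UNKNOWN = "unknown"  # glyph shown with a question mark


@dataclass(frozen=True)
class Flag:
    key: str      # hypothetical identifier, also usable for sorting/filtering
    tooltip: str  # text shown when the mouse is over the glyph


# A subset of the flags listed above, with hypothetical keys.
FLAGS = [
    Flag("cellml_compliant", "Model complies with a CellML specification version"),
    Flag("units_consistent", "Units are consistent"),
    Flag("matches_paper", "Model agrees with the published paper"),
    Flag("has_wiki", "A wiki page is available providing further detail"),
]

# Text stand-ins for the real glyph decorations.
MARKERS = {FlagState.TRUE: "+", FlagState.FALSE: "x", FlagState.UNKNOWN: "?"}


def render(flag: Flag, state: FlagState) -> str:
    """Render one flag as '[marker] key: tooltip'."""
    return f"[{MARKERS[state]}] {flag.key}: {flag.tooltip}"


def render_row(states: dict) -> list:
    """Render every flag for one model; flags missing from `states` count as unknown."""
    return [render(f, states.get(f.key, FlagState.UNKNOWN)) for f in FLAGS]


for line in render_row({"units_consistent": FlagState.FALSE,
                        "has_wiki": FlagState.TRUE}):
    print(line)
```

Keeping 'unknown' separate from 'false' means that a model nobody has checked for units consistency is not presented as a model known to be inconsistent.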

In the model pages, the list of versions and variants could be ascribed glyphs in a similar manner to the main listing.

2.) A wiki page providing further detail on the work that has been done on the model and its curation status. This wiki will link to any bugs/tracker issues or mailing list posts in the archives regarding the model. Any correspondence with model authors will also be posted (permission may be needed for this), and any source code for models will also be accessible. Perhaps some kind of restriction could be placed on this, as it is not our aim to provide model code in any form but CellML, except for the purpose of aiding curation of CellML files.

There will eventually be a wiki page for every model in the repository. While the glyph system is intended as a quick reference to curation status for people scanning through models in the repository, the curation wiki page for each model is intended as a comprehensive resource that not only contains all the information represented by the glyph system, but also contains all the information on the model that has been generated or collected. This includes notes on any communication with the model author (including transcripts of the communications themselves if consent is obtained), notes on what work has been done on the model, what further work is required, and any other information that could conceivably be useful to people who might use or work on the model. When it is fully implemented, this system of wikis is intended to hold everything that is known about each CellML model in the repository. At present, the wiki proposed here, which is to be a long-term, large-scale effort, will be generated and curated by the present cellml.org model repository curator, James Lawson. In the longer term, it is intended that the CellML community, specifically those people who create, modify and use the models in the repository, will be responsible for its content. Perhaps a salient analogy here is a 'Wikipedia' for the cellml.org repository.

James Lawson is currently in the process of recording information about model curation, and has in fact begun experimenting with using the wiki format for representing this information. It may be some time, however, until this develops into a useful resource. If anyone in the community has any information regarding specific models that they believe might be relevant to this 'CellML-curation-wikipedia,' they would be strongly encouraged to share this information with the community.

The more people who work on this project, the faster it will develop, which will make the site more useful to users, which will in turn grow the community, and so on.


Sorting by Curation Status

The ability to sort/filter/search by curation status would be a useful feature to have on the site. Currently we can only sort models by the number of 'general curation' stars assigned to the latest version of each model.
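
With descriptive flags, filtering and sorting become simple operations over per-model flag states. The Python sketch below is one possible reading of this feature. The model names and flag keys are invented for illustration, and ordering models by the number of flags in the TRUE state is only an assumed example of a sort order, not something this proposal specifies.

```python
from enum import Enum


class FlagState(Enum):
    TRUE = 1
    FALSE = 2
    UNKNOWN = 3


T, F, U = FlagState.TRUE, FlagState.FALSE, FlagState.UNKNOWN

# Hypothetical models and flag keys, for illustration only.
models = {
    "model_a": {"runs_in_pcenv": T, "units_consistent": F, "matches_paper": T},
    "model_b": {"runs_in_pcenv": T, "units_consistent": T, "matches_paper": T},
    "model_c": {"runs_in_pcenv": U, "units_consistent": U, "matches_paper": F},
}


def filter_models(models, **required):
    """Return the names of models whose flags match every required state."""
    return sorted(
        name for name, flags in models.items()
        if all(flags.get(key, U) is state for key, state in required.items())
    )


def sort_by_true_count(models):
    """Order model names by number of TRUE flags (most first), ties broken by name."""
    def true_count(name):
        return sum(1 for state in models[name].values() if state is T)
    return sorted(models, key=lambda name: (-true_count(name), name))


print(filter_models(models, runs_in_pcenv=T, units_consistent=F))  # ['model_a']
print(sort_by_true_count(models))  # ['model_b', 'model_a', 'model_c']
```

Unlike sorting by a single star count, a query such as "runs in PCEnv but is not units consistent" can be expressed directly.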


Pros:

  • We will be able to represent the model curation status in a clear, concise manner.
  • We will be able to record all the relevant information about the curation process a model has undergone.
  • The proposed system of wikis will create some proper documentation for each model, over and above the scope of the metadata.

Cons:

  • The user is presented with several icons they have to learn the meaning of before they can get any value out of the proposed system.
  • Displaying 6 or 7 different glyphs next to each model in the main model listing is going to take up a fair bit of space.
  • The glyphs might look tacky.
  • Displaying correspondence with authors in the model wikis might be a problem.
  • Each listing entry on the main page is an umbrella representing multiple versions and variants. The glyphs would probably only describe the most highly curated version/variant, which might be misleading if other versions/variants are in some way useful.

Questions & Considerations:


Do we have a model wiki for each version/variant? Perhaps we could have an umbrella wiki which describes all versions and variants.

What will this glyph/icon based system look like? How will we avoid 'cluttering' the glyphs/icons?

We would need quite a formal structure for the wikis to represent information in a clear, consistent manner - what will this be?

What will a CellML 1.1 repository look like and how might the representation of curation status information be required to change?

What considerations do we need to make to ensure the consistency of the proposed representation system throughout future versions of the cellml.org model repository?

The descriptive flags proposed above are limited to describing the mathematical status of the model. We will need flags for representing the curation status of models with respect to annotation with ontologies and database references, but we have not yet decided on exactly how this process will work.

Further discussion on curation theory is needed to make a set of standards (that can be represented by the proposed descriptive flags) that will continue to be relevant.

Input from other groups working with CellML and indeed other groups developing repositories of curated computational biology models would be very helpful. It is possible that a universal standard of curation status representation could be developed.


Discussion:


A mailing list thread on team@cellml.org has been started on this topic. The archives of this thread can be found at http://www.cellml.org/pipermail/team/2007-August/000102.html


At the Auckland CellML meeting of 15/08/07, these matters were discussed. The minutes for this meeting can be found at: http://www.cellml.org/meeting_minutes/MeetingMinutes15August2007/


The objective of this proposal is to kick-start a discussion with the wider community on how to better represent the curation status of models in our cellml.org repository. A 'descriptive flag' system is currently one of the better ideas we have come up with. If anyone has an idea that is outside the scope of this document, please feel free to express it.


Ideally, after discussion on the cellml-discussion mailing list, we will be able to flesh out this document to the point where it becomes an official action plan, and also we will have, at least to some extent, co-ordinated our efforts with the BioModels group and other groups who are providing curated, community oriented repositories/databases for computational models of biology.