CellML Metadata Library Observations

Some notes and observations on the CellML Metadata Library, taken while working on and with it alongside the model repository. Hopefully this will be quickly outdated.

CellML Metadata Library

The new CellML Metadata Library is fairly ready for use within the repository (whose metadata handling really needs an overhaul), but it is not quite ready for general consumption because the provided methods are still unstable (names can change). It also does not quite follow the naming convention of the CellML API, which matters if this code is eventually merged into it. There are various other shortcomings, such as:


  • Unstable naming (method names may change)

  • Inconsistencies within the code and algorithm (see next point)

  • Has not been refactored; there are duplicate algorithms that do nearly, but not quite, the same thing.

  • Does not preserve all data it does not deal with.

    • This case is arguable. If the subgraph created by the user had invalid data, should we bother keeping it?

    • However, if the library removes important data, this will be a problem. More extensive testcases will be written for this.

  • Does not deal with all forms of vCard (data will be lost if it's not compatible)

    • For instance, vCard:FN and vCard:N are both used, and it's somewhat hardcoded for whatever field I saw as the common value (i.e. vCard:FN used for cmeta:comment, vCard:N and its subclasses used for other places such as model creator and journal citations).

  • Does not deal with xml:base

    • Andrew Miller suggested that the repository should strip out all xml:base references to make things relative for easier management. I had suggested the same for similar reasons, along with removing all file:// URIs so that uploaders do not reveal their working path information.

  • Cannot reliably handle bad graphs in a graceful manner

    • Should this be an expected feature?

  • More on inconsistencies

    • Values returned. I blame the specifications for this though. Sometimes the values are unique (but not explicitly specified to be so in the metadata specification) and other times there are multiple ones (vCard:OtherName comes to mind). Should all values be in a list, be strings, or should objects contain native values, such that page numbers are integers and dates are datetime values in Python (or the equivalent in whatever language is used)?

      • Perhaps, but page numbers could potentially be alphabetical.

  • Does not handle multiple model creators

    • The repository didn't, and my interest was in getting this compatible with the repository in the first place.

    • I believe this should be addressed in the metadata spec, as whether creators are wrapped inside an rdf:Seq or not makes a difference in a Versa query. I don't know how other RDF query languages deal with rdf:Seq, but if they have the same issue and require multiple queries to retrieve the same data, I am not going to be pleased.

  • Assumptions taken which may or may not be true

    • Such as a single model creator, a single model title, and that extra data outside of these will be removed.

  • RDF nodes may be orphaned

    • Test cases, please.

  • JournalArticle, BookArticle... references/citations in general

    • Only JournalArticle is handled. The provided methods for retrieving all references are not yet stable.

      • Again, this was done in the interest of getting this running ASAP for the repository.

  • Other issues that I could not remember or that are not written down in the FIXME comments inside the CMLmetadata.py file of the CellMLMetadata project, found in the Subversion repository for the Physiome Project.
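
The xml:base point in the list above could be handled by a small preprocessing step. The following is only a rough sketch using Python's standard xml.etree.ElementTree, not code from the library: the function name strip_xml_base is made up for illustration, and it assumes that file:// URIs only need to be looked for in rdf:about and rdf:resource attributes. It drops every xml:base attribute and reports the file:// URIs it finds rather than rewriting them, since how they should be rewritten is still undecided.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved "xml" prefix to this namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: only these two attributes are checked for file:// URIs.
RDF_URI_ATTRS = ("{%s}about" % RDF_NS, "{%s}resource" % RDF_NS)


def strip_xml_base(xml_text):
    """Hypothetical helper: remove xml:base and list file:// URIs.

    Returns (cleaned_xml, leaked), where leaked holds every rdf:about or
    rdf:resource value that starts with file://.
    """
    root = ET.fromstring(xml_text)
    leaked = []
    for elem in root.iter():
        # Drop xml:base if present so remaining references stay relative.
        elem.attrib.pop(XML_BASE, None)
        for attr in RDF_URI_ATTRS:
            value = elem.get(attr)
            if value is not None and value.startswith("file://"):
                leaked.append(value)
    return ET.tostring(root, encoding="unicode"), leaked
```

Note that ElementTree rewrites namespace prefixes when serializing, so the output is equivalent XML but not byte-for-byte the same as the upload. Whether that is acceptable for the repository is a separate question.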


All the above issues (some more pressing than others) will have to be addressed eventually. However, the repository will be using this code nevertheless (as I am the maintainer, I can fix both code bases with relative ease).
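
To illustrate the rdf:Seq issue from the list above, here is a minimal sketch that treats a graph as a plain list of (subject, predicate, object) string triples. It is not the library's code and does not use Versa; the helper names are made up, and the use of dc:creator as the creator predicate is an assumption made for the example. It shows that a single flat lookup returns the creator node directly in one shape of graph but only the container node in the other, so the rdf:Seq case needs a second step over the rdf:_1, rdf:_2, ... membership properties.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption for this example: creators hang off dc:creator.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def is_seq(triples, node):
    """True if the node is explicitly typed as an rdf:Seq."""
    return (RDF + "Seq") in objects(triples, node, RDF + "type")


def seq_members(triples, node):
    """Members of an rdf:Seq, ordered by their rdf:_n index."""
    members = []
    for s, p, o in triples:
        if s == node and p.startswith(RDF + "_"):
            suffix = p[len(RDF) + 1:]
            if suffix.isdigit():
                members.append((int(suffix), o))
    return [o for _, o in sorted(members)]


def creators(triples, model):
    """Hypothetical helper: creators whether or not an rdf:Seq is used."""
    found = []
    for node in objects(triples, model, DC_CREATOR):
        if is_seq(triples, node):
            found.extend(seq_members(triples, node))
        else:
            found.append(node)
    return found


# The same kind of information in two shapes: unwrapped and in an rdf:Seq.
flat = [("#model", DC_CREATOR, "_:alice")]
wrapped = [
    ("#model", DC_CREATOR, "_:seq"),
    ("_:seq", RDF + "type", RDF + "Seq"),
    ("_:seq", RDF + "_2", "_:bob"),
    ("_:seq", RDF + "_1", "_:alice"),
]
```

With this data, objects(wrapped, "#model", DC_CREATOR) yields only the container node "_:seq", while creators(wrapped, "#model") yields "_:alice" and then "_:bob", and creators(flat, "#model") yields "_:alice". Having the spec mandate one shape would remove the need for the branch in creators.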


Unfortunately, as I have written elsewhere, the repository has very broken RDF graph capabilities. I tried to fix that but decided it wasn't worth the effort (hence a separate CellMLMetadata library written from scratch with minimal dependencies). That broken code generated broken RDF graphs, which will need to be fixed somehow, or else the CellMLMetadata library will not be able to read them. As a temporary measure, I will make provisions to address this problem (probably by keeping both code bases and going through the models one by one, perhaps with the assistance of a script). More testing will definitely be needed throughout the process. This is actually further off in the future, and will only happen once my library becomes more stable (as in, with the major kinks worked out). This will mean limited updates for the repository while I overhaul the old code and put in new code.