NeXML: Rich phyloinformatic data
NeXML is an exchange standard for representing phyloinformatic data — inspired by the commonly used NEXUS format, but more robust and easier to process.
The NEXUS flat file format is a commonly used syntax for phylogenetic data. Unfortunately, over time, non-compliant NEXUS implementations have overloaded the standard, which has caused interoperability problems. Meanwhile, mature technologies around the XML standard have emerged. These technologies have the potential to greatly simplify the processing of rich phylogenetic data and make it more robust. This website is the home of the community-driven NeXML project, which seeks to leverage XML technologies in the development of a data standard that translates NEXUS concepts into a syntax that is more easily validated and processed. This approach promises several advantages:
- Syntax validation: some of the issues hampering interoperability arise because no formal specification exists for NEXUS and other flat file formats, so there is no unambiguous way to validate them. Using XML Schema we have defined a versioned grammar against which data files can be validated syntactically. In addition, this website has a validation service (the orange box in the center of every page) that also checks the semantics of uploaded NeXML files in ways that go beyond what can be expressed in the XSD schema language.
- Semantic annotation: an issue in current file formats is that their semantics are not well-defined. For example, what does it mean to use an ambiguity code in a matrix? Is it uncertainty or polymorphism? With the wider EvoInfo working group we are developing an ontology onto which we are mapping NeXML schema types, so that the semantics of data files become well-defined. In addition, NeXML has a facility for annotating fundamental phylogenetic data objects (such as trees, character state matrices and taxa) with ontology predicates and objects using RDFa.
- Web services: a number of different technologies (such as XML-RPC, REST and SOAP) have emerged that allow disparate, XML-based services to be glued together over the internet. For example, the PhyloWS initiative seeks to develop conventions for RESTful phylogenetic web services, for which NeXML is one of the preferred response formats.
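To make the first two points more concrete, here is a minimal sketch in Python (standard library only) of what processing a NeXML-style document can look like. The sample document is a hand-written, illustrative fragment that has not been validated against the official schema; the ids, the `dc:description` property and the helper functions `otu_labels`, `dangling_references` and `annotations` are all assumptions made for this example. Only the namespace http://www.nexml.org/2009 is taken from this page. Full grammar validation against the XSD would need a schema-aware library and is not shown.

```python
import xml.etree.ElementTree as ET

# Namespace named on this page; everything else below is illustrative.
NEX = "http://www.nexml.org/2009"
NS = {"nex": NEX}

# Hand-written fragment for illustration only; NOT validated against the
# official NeXML schema.
SAMPLE = """<nex:nexml version="0.9"
    xmlns:nex="http://www.nexml.org/2009"
    xmlns="http://www.nexml.org/2009"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <otus id="taxa1">
    <otu id="t1" label="Homo sapiens">
      <meta id="m1" xsi:type="nex:LiteralMeta"
            property="dc:description" content="example annotation"/>
    </otu>
    <otu id="t2" label="Pan troglodytes"/>
  </otus>
  <trees id="trees1" otus="taxa1">
    <tree id="tree1" xsi:type="nex:FloatTree">
      <node id="n1" root="true"/>
      <node id="n2" otu="t1"/>
      <node id="n3" otu="t2"/>
      <edge id="e1" source="n1" target="n2" length="0.5"/>
      <edge id="e2" source="n1" target="n3" length="0.5"/>
    </tree>
  </trees>
</nex:nexml>
"""


def otu_labels(root):
    """Map OTU id -> label for every <otu> element in the document."""
    return {otu.get("id"): otu.get("label")
            for otu in root.iterfind(".//nex:otu", NS)}


def dangling_references(root):
    """Return ids of tree nodes whose 'otu' attribute names no declared OTU.

    A simple consistency check written for this example; it is not a
    description of what the project's validation service does.
    """
    known = set(otu_labels(root))
    return [node.get("id")
            for node in root.iterfind(".//nex:node", NS)
            if node.get("otu") is not None and node.get("otu") not in known]


def annotations(root):
    """List (subject id, property, content) for each <meta> child element."""
    found = []
    for parent in root.iter():
        for meta in parent.findall("nex:meta", NS):
            found.append((parent.get("id"),
                          meta.get("property"),
                          meta.get("content")))
    return found


# ET.fromstring raises ParseError if the text is not well-formed XML.
root = ET.fromstring(SAMPLE)
print(otu_labels(root))
print(dangling_references(root))
print(annotations(root))
```

Because every object carries an id and references other objects by id, checks like the one above reduce to ordinary XML queries rather than format-specific parsing.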
Because of the advantages of NeXML over current standards, developers of phylogenetic software have come together as part of the NESCent working group for evolutionary informatics to develop this new data exchange standard based on XML technologies.
We have recently published a description of NeXML in Systematic Biology. If you use NeXML in your research, please consider citing doi:10.1093/sysbio/sys025
What are we doing about it?
NeXML development is being undertaken along a number of tracks:
- In the first place, there's an XML schema. This schema (currently designated as namespace http://www.nexml.org/2009) is explained on our wiki and formally documented; the latest version is available from git.
- Secondly, the community is implementing NeXML read and/or write abilities in a number of software applications:
  - Carl Boettiger and Scott Chamberlain have developed an excellent NeXML library for R for rOpenSci.
  - TreeBASE supports serialization to NeXML.
  - The Mesquite project supports reading and writing of NeXML. Wayne Maddison and Peter Midford helped start the implementation, which is currently maintained by Rutger Vos.
  - Xuhua Xia's DAMBE version 5.2.31 for Windows Vista/7 reads and writes NeXML data.
  - The PhenoScape project uses NeXML to annotate complex morphological character states with ontology terms in its Phenex editor.
  - Jeet Sukumaran has implemented NeXML I/O for Python in the DendroPy package. There are many DendroPy code samples for dealing with NeXML data in the wiki manual.
  - Chase Miller has implemented Bio::NexmlIO for BioPerl, which under the hood reuses Rutger Vos's Bio::Phylo parser libraries.
  - Anurag Priyam and Rutger Vos have developed a NeXML I/O plugin for the BioRuby open source bioinformatics library for Ruby.
  - Jaime Huerta-Cepas' team is working on NeXML I/O for the ETE Python environment for tree exploration.
  - Matt Yoder has implemented NeXML serialization for the mx collaborative web-based content management system for evolutionary systematists.
  - Andrew Hill has added NeXML support to PhyloBox.
  - Sam Smits has added support for displaying NeXML trees to the jsPhyloSVG tree visualization widget.
  - Mike Keesey has added NeXML support to Names On Nodes, a web application that automatically applies biological nomenclature to datasets.
  - Daniel Huson's DendroScope development team has adopted NeXML as its primary file format for storing visualization (styling) metadata. However, at present, its implementation is not fully compliant, so it should not be used as a template for generating NeXML output.
  - For the 2011 Google Summer of Code, Apurv Verma added NeXML reading capability to phyloGeoRef.
  - Mark Jensen has implemented NeXML compatibility for the HIVQuery web application.
- Third, we're cross-referencing the NeXML schema with the Character Data Analysis Ontology, which is being developed by other members of the EvoInfo working group.
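As a rough illustration of what the "write" side of such an implementation involves, the following Python sketch (standard library only) serializes a small tree as NeXML-style nodes and edges. It is not taken from any of the libraries listed above. The function `build_nexml`, the helper `q`, the generated ids, the `version="0.9"` attribute and the `nex:FloatTree` type are assumptions made for this example, and the output has not been validated against the official schema.

```python
import xml.etree.ElementTree as ET

NEX = "http://www.nexml.org/2009"  # namespace given on this page
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Ask ElementTree to emit these prefixes instead of ns0, ns1, ...
ET.register_namespace("nex", NEX)
ET.register_namespace("xsi", XSI)


def q(tag):
    """Qualify a tag name with the NeXML namespace (hypothetical helper)."""
    return f"{{{NEX}}}{tag}"


def build_nexml(tips, edges, version="0.9"):
    """Build a NeXML-style element tree (illustrative, not schema-checked).

    tips:  dict mapping tip node id -> taxon label
    edges: list of (source node id, target node id, branch length)
    """
    root = ET.Element(q("nexml"), {"version": version})

    # One <otu> per tip; remember which OTU each tip node refers to.
    otus = ET.SubElement(root, q("otus"), {"id": "taxa1"})
    otu_for_node = {}
    for i, (node_id, label) in enumerate(sorted(tips.items()), start=1):
        otu_id = f"t{i}"
        ET.SubElement(otus, q("otu"), {"id": otu_id, "label": label})
        otu_for_node[node_id] = otu_id

    trees = ET.SubElement(root, q("trees"), {"id": "trees1", "otus": "taxa1"})
    tree = ET.SubElement(
        trees, q("tree"),
        {"id": "tree1", f"{{{XSI}}}type": "nex:FloatTree"},
    )

    # Collect node ids in order of first appearance in the edge list.
    node_ids = []
    for source, target, _ in edges:
        for node_id in (source, target):
            if node_id not in node_ids:
                node_ids.append(node_id)

    for node_id in node_ids:
        attrs = {"id": node_id}
        if node_id in otu_for_node:
            attrs["otu"] = otu_for_node[node_id]
        ET.SubElement(tree, q("node"), attrs)

    for i, (source, target, length) in enumerate(edges, start=1):
        ET.SubElement(tree, q("edge"), {
            "id": f"e{i}",
            "source": source,
            "target": target,
            "length": str(length),
        })
    return root


doc = build_nexml(
    tips={"n2": "Homo sapiens", "n3": "Pan troglodytes"},
    edges=[("n1", "n2", 0.5), ("n1", "n3", 0.5)],
)
print(ET.tostring(doc, encoding="unicode"))
```

The tree is written as a flat list of nodes and edges that refer to each other by id, rather than as a nested parenthetical string, which is why a generic XML toolkit is enough to produce it.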
If you are interested in being involved in the NeXML project in any way, please do! Here are some ways to get involved:
- Get informed: information about the NeXML project is distributed over the manual (for an overview of vision, plans, implementation), documentation (for formal description of the schema) and the mailing list (for immediate plans and discussion).
- Try it out: the download section of the website has nightly builds of bindings for various languages. Take these for a spin!
- Contribute: if you are a programmer interested in extending NeXML support, please contact us through the mailing list to get commit access to the Subversion repository.
The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° .
Rutger A. Vos, J. P. Balhoff, J. A. Caravas, M. T. Holder, H. Lapp, P. E. Midford, A. Priyam, J. Sukumaran, X. Xia, and A. Stoltzfus. 2012. NeXML: rich, extensible, and verifiable representation of comparative data and metadata. Systematic Biology 61(4): 675-689 [doi:10.1093/sysbio/sys025]