EBI - Making Australian Data Discoverable

BRIEF DESCRIPTION OF THE PROJECT

Populating Research Data Australia with collection descriptions of data held in the European Bioinformatics Institute databanks.

BACKGROUND

The European Bioinformatics Institute (EBI, part of the European Molecular Biology Laboratory, EMBL) provides international access to molecular bioscience data generated by researchers worldwide, including in Australia. At present, Australian-specific data is difficult to isolate within the EBI databases, particularly for non-domain users. The establishment of the EMBL Australia Bioinformatics Resource at the University of Queensland has provided the opportunity to link data of Australian interest deposited at the EBI to Research Data Australia (RDA), a cohesive repository of research data collections that enables Australian researchers to easily publish, discover, access and use research data.

The aim of this project was to develop software that makes nucleotide and protein sequence data of Australian interest discoverable through RDA in the form of collections.

The project was funded by the Australian National Data Service through the DIISRTE Education Infrastructure Fund.

WHAT WERE THE OUTCOMES?

In this project, more than 13,000 collection records describing Australian-related content of the EBI nucleotide and protein sequence databases were created. Considerable effort went into dividing the content of large databases into many smaller datasets, and describing each one, so that they are of potential interest to a wide and varied range of researchers. The collections encompass two types of Australian data: a) data submitted by Australian-based researchers; b) data associated with sets (and subsets thereof) of Australian species.

The link between RDA and the EBI is provided through landing pages that are simple to use and contain structured information useful to non-domain specialists who are unfamiliar with the content of the EBI databases (http://rda.ebi.edu.au). Molecular data of Australian interest held at the EBI is now easier to find, access and re-use through RDA (http://researchdata.ands.org.au).

The technical solutions developed for this project were:

  • Identification of Australian research institutions: A list of relevant Australian research institutions conducting biological research was compiled. This list includes institutions that were identified through ARC and NHMRC grant information and that have a National Library of Australia (NLA) Party persistent identifier. Research institutions were then grouped by state and territory.
  • Identification of Australian species: A list of Australian species was sourced from the Atlas of Living Australia through the IBIS taxonomy web services (http://www.ala.org.au/tools-services/species-name-services/). These species were assigned to approximately 800 higher-level taxonomic groups (eg genus, class, order) using the NCBI taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy). The higher-level groupings were selected in consultation with ANDS, whilst the NCBI taxonomy was used for species assignment because it is the taxonomy used in the EBI databases.
  • Extraction of data from EBI databases: Australian species or research institutions were used as query items to interrogate the EBI databases: UniProt (http://www.ebi.ac.uk/uniprot/) for protein sequences and ENA (http://www.ebi.ac.uk/ena/) for nucleotide sequences. A Java library was implemented which used EBI-hosted web services (http://www.ebi.ac.uk/Tools/webservices/) to query these databases. This library then inserted the extracted data into a MySQL database. Other EBI databases were not interrogated because they either did not contain data that could be definitively identified as Australian, or could not be queried using the web services.
  • Automatic generation of collections and submission to RDA: Data stored in the MySQL database was converted into ANDS-compliant RIF-CS XML (using an ANDS-supplied RIF-CS Java library) and made accessible to an RDA harvest data source. More than 13,000 collections were generated.
  • Landing page for collections: The landing page is a webpage that is accessible from RDA and acts as a link between RDA and the primary data housed at the EBI. The webpage lists basic metadata for the collection (eg a short description, synonyms for the collection) and displays a list of records (eg records of DNA or protein sequences) relevant to that collection. It also allows navigation back to the primary source at the EBI and to related collections. The webpage was developed using Java servlets, JSP and JavaScript. The web interface is deployed on an Apache Tomcat web server running on an ESX server with Red Hat Enterprise Linux 5.4.
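
The grouping and collection-generation steps listed above can be illustrated with a minimal Java sketch. It is not the project's code: the class and method names, the species-to-group mapping, the accession strings and the example.org URLs are all hypothetical placeholders, and the XML it emits is only a simplified, RIF-CS-like fragment rather than a schema-complete record (the project itself used the ANDS-supplied RIF-CS Java library for that step).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CollectionSketch {

    /** Hypothetical shape of one sequence record held in the local database. */
    public static final class SequenceRecord {
        public final String accession;
        public final String species;

        public SequenceRecord(String accession, String species) {
            this.accession = accession;
            this.species = species;
        }
    }

    /** Groups records by the higher-level taxon assigned to their species. */
    public static Map<String, List<SequenceRecord>> groupByTaxon(
            List<SequenceRecord> records, Map<String, String> speciesToTaxon) {
        Map<String, List<SequenceRecord>> groups = new LinkedHashMap<>();
        for (SequenceRecord r : records) {
            String taxon = speciesToTaxon.get(r.species);
            if (taxon == null) {
                continue; // species is not on the Australian list, so skip it
            }
            groups.computeIfAbsent(taxon, k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    /** Escapes the characters that are unsafe in XML text and attribute values. */
    public static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds a simplified, RIF-CS-like collection element (not schema-complete). */
    public static String toCollectionXml(String key, String title,
                                         String description, String landingUrl) {
        return "<registryObject>\n"
                + "  <key>" + escapeXml(key) + "</key>\n"
                + "  <collection type=\"dataset\">\n"
                + "    <name type=\"primary\"><namePart>" + escapeXml(title)
                + "</namePart></name>\n"
                + "    <description type=\"brief\">" + escapeXml(description)
                + "</description>\n"
                + "    <location><address><electronic type=\"url\"><value>"
                + escapeXml(landingUrl) + "</value></electronic></address></location>\n"
                + "  </collection>\n"
                + "</registryObject>";
    }

    public static void main(String[] args) {
        // Illustrative mapping only; the project derived groups from the NCBI taxonomy.
        Map<String, String> speciesToTaxon = new LinkedHashMap<>();
        speciesToTaxon.put("Macropus giganteus", "Macropus");
        speciesToTaxon.put("Macropus fuliginosus", "Macropus");

        // Placeholder accessions, not real database identifiers.
        List<SequenceRecord> records = Arrays.asList(
                new SequenceRecord("ACC0001", "Macropus giganteus"),
                new SequenceRecord("ACC0002", "Macropus fuliginosus"),
                new SequenceRecord("ACC0003", "Homo sapiens"));

        // One collection description is emitted per taxonomic group.
        for (Map.Entry<String, List<SequenceRecord>> e
                : groupByTaxon(records, speciesToTaxon).entrySet()) {
            String taxon = e.getKey();
            String slug = taxon.toLowerCase();
            System.out.println(toCollectionXml(
                    "example/" + slug,
                    taxon + " sequences",
                    e.getValue().size() + " sequence records for " + taxon,
                    "http://example.org/landing/" + slug));
        }
    }
}
```

In this sketch each taxonomic group becomes one collection whose location points at a landing page, mirroring the division of large databases into many smaller, separately described datasets. Records whose species is absent from the mapping are simply skipped.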

The software is freely available for download from SourceForge under the GNU General Public Licence: http://sourceforge.net/projects/ebi-rda-linkage/?source=directory

QFAB staff contact:

Dominique Gorse, Project Manager