Data Policy

Bioplatforms Australia Antibiotic Resistant Sepsis Pathogens Framework Initiative Data Policy  

 

1. Introduction

The Bioplatforms Australia (Bioplatforms)-sponsored Antibiotic Resistant Pathogens consortium[1] is generating a resource consisting of reference datasets of the core genome, transcript, protein and metabolite data for several infectious disease pathogens of importance to Australia. Ownership of these reference datasets resides with Bioplatforms.

The consortium reserves the right to conduct a ‘global analysis’ across these genome, transcript, protein and metabolite reference datasets and publish the results in the scientific literature. However, in accordance with the Bermuda[2] and Fort Lauderdale[3] agreements, and the more recent Toronto Statement[4], which provide guidelines for scientific data sharing, Bioplatforms is committed to ensuring that data produced in this effort are shared at appropriate times and with as few restrictions as possible, to advance scientific discovery and maximize the value to the community from this Australian Government National Collaborative Research Infrastructure Strategy (NCRIS)-funded dataset.

This policy describes the data associated with the consortium, roles and responsibilities of various consortium members and data users, and data release schedules.

 

2. Reference Dataset Description and Overall data or information flow

The reference datasets to be produced by the consortium will cover five microbial pathogen species that have been selected by a group of research champions:

  • Streptococcus pneumoniae 
  • Staphylococcus aureus
  • Streptococcus pyogenes 
  • Escherichia coli
  • Klebsiella pneumoniae

For each of the above species, several strains will be selected by the research champions, isolated and grown under several controlled conditions. DNA, RNA, protein and metabolite samples will be extracted, and data will then be produced by several Bioplatforms network data generation facilities to examine the genome, transcriptome, proteome and metabolome of each sample.

 

Table 1: The following technologies will be used to generate data for each “omic” type:

| Omics Type | Technology | Facility |
|---|---|---|
| Genomics | PacBio sequencing | Ramaciotti Centre for Genomics, Sydney[5] |
| Genomics | Illumina MiSeq sequencing | Ramaciotti Centre for Genomics, Sydney |
| Transcriptomics | Illumina HiSeq sequencing | Australian Genome Research Facility (AGRF), Melbourne[6] |
| Proteomics | MS1 quantification on DDA data | Monash Biomedical Proteomics Facility (MBPF), Melbourne[7] |
| Proteomics | MS2 quantification on DIA/SWATH data | Australian Proteomics Analysis Facility (APAF), Sydney[8] |
| Metabolomics | LC-MS analysis | Bio21, Metabolomics Australia (MA), Melbourne[9] |

 

Following production, raw data will be uploaded to a password-secured central data repository, which is physically located in Perth and managed by the Centre for Comparative Genomics (CCG)[10] on behalf of Bioplatforms. To provide for disaster recovery, all data in the CCG-managed repository will be mirrored at a second site in Brisbane that is managed by the Queensland Cyber Infrastructure Foundation (QCIF)[11]. Data from the QCIF mirror of the central repository will be made available for direct download via a password-protected web interface[12] to authorised users (see Section 3).

 

The metadata associated with each file, and the file names, will be made publicly available via a web portal and associated Application Programming Interface (API), which is managed by CCG for Bioplatforms[13]. Access to the actual data files via the web portal and API will be restricted to authorised users and will require password authentication.

 

As and when determined by Bioplatforms, in consultation with various research champions, copies of the data and associated metadata may also be stored elsewhere (e.g. it is intended to make a proportion of the dataset available as a bioinformatics training dataset[14]). When this option is exercised, access to any data copies must be controlled under the same conditions as those required for the primary copies.

 

Raw data will be shared with bioinformatics facilities and/or researchers at various sites around Australia to generate intermediate and analysed data for the consortium. These facilities will include (but are not necessarily limited to) the following:

 

Table 2: Bioinformatics facilities and/or researchers generating intermediate and analysed data

| Omics Type | Technology | Bioinformatics Facility |
|---|---|---|
| Genomics | PacBio sequencing | Victorian Life Sciences Computational Initiative (VLSCI), now Melbourne Bioinformatics (MB)[15]; Systems Biology Initiative (SBI), Sydney[16] |
| Genomics | Illumina MiSeq sequencing | Victorian Life Sciences Computational Initiative (VLSCI), now Melbourne Bioinformatics (MB); Systems Biology Initiative (SBI), Sydney |
| Transcriptomics | Illumina HiSeq sequencing | Australian Genome Research Facility (AGRF), Melbourne |
| Proteomics | MS1 quantification on DDA data | Monash Biomedical Proteomics Facility (MBPF), Melbourne |
| Proteomics | MS2 quantification on DIA/SWATH data | Australian Proteomics Analysis Facility (APAF), Sydney |
| Metabolomics | LC-MS analysis | Bio21, Metabolomics Australia (MA), Melbourne |

 

Intermediate and/or analysed data will be uploaded by bioinformatics staff to the secure central data repository in Perth and mirrored in Brisbane as described above. Data downloads and API access will be provided under the same set-up as for the raw data.

 

As and when determined by Bioplatforms, in consultation with various research champions, copies of the intermediate and analysed data may also be stored elsewhere. As noted above, when this option is exercised, access to any copies of the data and metadata must be controlled under the same conditions as those required for the primary copies.

 

Ultimately, the data generated in this project will be made available under open-access conditions to the international research community through a variety of relevant established international data repositories, including the European Nucleotide Archive (ENA)[17], ArrayExpress[18], ProteomeXchange[19] and MetaboLights[20] (see also Section 4, Data Sharing Schedule).

 

3. Roles and Responsibilities

3.1 Data Sponsor and Owner

Bioplatforms Australia is the Data Sponsor, undertakes the overall duties of ownership, and is responsible for the following tasks (in consultation with various research champions):

 

  • Defining the purpose of the data items;
  • Defining access arrangements;
  • Authorising any Data Users;
  • Appointing a Data Custodian for copies of the data stored at various sites/on various systems.

 

 

3.2 Data Producers

Two broad types of data will be produced: raw and processed. Both will be produced by the facilities listed in Section 2.

Producers of both raw and processed data are responsible for:

  • Assigning a Data Custodian for copies of the data stored locally;
  • Data generation and temporary storage;
  • Ensuring data use is compliant with this policy;
  • Quality assurance.

 

 

3.3 Data Infrastructure Providers

Data infrastructure providers provide data storage and/or compute infrastructure for the raw or processed data, and are responsible for:  

  • Assigning a Data Custodian for copies of the data stored locally.

 

 

3.4 Data Custodians

The Data Custodian undertakes the day-to-day management of each item of data stored at various sites or on various systems, and is responsible for:  

  • Data storage and disposal on that system;
  • Ensuring data use is compliant with this and other policies/agreements;
  • Providing access to Data Users that have been authorised by the Data Sponsor;
  • Ensuring that any Data User who is given access to the data is aware of any data use policies (including this Policy) and their responsibilities.

 

3.5 Data Users

Data users include all end-users of the raw or processed data generated by the consortium. These comprise consortium researchers, any collaborators, training dataset users and any other approved members of the international research community.

The Data User is any party who has been granted access, by a Data Custodian, to any item of data. They are responsible for:

  • Requesting authorisation from the Data Sponsor;  
  • Requesting access from the Data Custodian;
  • Using and safeguarding information according to the conditions stipulated by the Data Sponsor and/or Custodian - including observing any relevant ethics approvals, legislation, data use policies (including this Policy and other relevant data use policies imposed by the Data Owner) and their responsibilities.
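The two-step arrangement set out in Sections 3.4 and 3.5 (the Data Sponsor authorises a Data User, and only then does a Data Custodian provide access) can be illustrated with a minimal Python sketch. The class and method names are hypothetical and chosen for illustration only; they do not describe any actual Bioplatforms system.

```python
from dataclasses import dataclass, field


@dataclass
class AccessRegister:
    """Hypothetical in-memory record of access to one copy of the data."""

    authorised_by_sponsor: set = field(default_factory=set)
    granted_by_custodian: set = field(default_factory=set)

    def authorise(self, user: str) -> None:
        # Step 1: the Data Sponsor authorises the Data User.
        self.authorised_by_sponsor.add(user)

    def grant(self, user: str) -> None:
        # Step 2: a Data Custodian provides access, but only to users
        # the Data Sponsor has already authorised.
        if user not in self.authorised_by_sponsor:
            raise PermissionError(
                f"{user} has not been authorised by the Data Sponsor"
            )
        self.granted_by_custodian.add(user)

    def can_access(self, user: str) -> bool:
        # Access requires both steps to have been completed.
        return user in self.granted_by_custodian
```

In this sketch a Custodian cannot bypass the Sponsor: calling `grant` for a user who has not been authorised raises `PermissionError`.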

 

4. Data Sharing Schedule

Various data types will be made available at appropriate times throughout the multistep process of generating, processing, assembling, annotating and dispersing the reference datasets.

Broadly, this will fall into two phases: a “mediated-access” phase, where access to the data will be limited to members of the consortium and other authorised parties; and an “open-access” phase where the data will be made openly available from resources including International Data Repositories.

During the “mediated-access” phase, authorisation to access the data is requested by emailing data.access@bioplatforms.com with the requester's name, affiliation, the specific data for which access is being requested and a brief outline of the intended data use. This information will be assessed by the Data Sponsor (Bioplatforms Australia) and the appropriate consortium research champion(s). If the request is approved, Bioplatforms, as the Data Sponsor, will instruct an appropriate Data Custodian to provide access. Data sharing and collaborative interactions are encouraged to advance scientific discovery and maximize the value to the community from this Australian Government (NCRIS)-funded dataset.
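As an illustration of the access request described above, the following Python sketch assembles an email containing the four items the policy asks for. Only the destination address comes from this policy; the function name, the subject line and the layout of the message body are assumptions, and the sketch builds the message without sending it.

```python
from email.message import EmailMessage

ACCESS_ADDRESS = "data.access@bioplatforms.com"  # address given in this policy


def build_access_request(name, affiliation, data_requested, intended_use, sender):
    """Return an EmailMessage carrying the four details the policy requires."""
    fields = {
        "Name": name,
        "Affiliation": affiliation,
        "Data requested": data_requested,
        "Intended use": intended_use,
    }
    # Refuse to build an incomplete request.
    missing = [label for label, value in fields.items() if not value.strip()]
    if missing:
        raise ValueError("Missing required details: " + ", ".join(missing))

    msg = EmailMessage()
    msg["To"] = ACCESS_ADDRESS
    msg["From"] = sender
    msg["Subject"] = "Data access request"  # assumed subject line
    msg.set_content(
        "\n".join(f"{label}: {value}" for label, value in fields.items())
    )
    return msg
```

The returned message could then be passed to any mail client or to `smtplib`; how it is sent is left open here because the policy does not prescribe it.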

 

Table 3: Data Release Timescales for “mediated-access” and “open-access” phase

| Data type | Schedule for release of data to authorised users during the “mediated-access” phase | Schedule for public release of data, resulting in the “open-access” phase |
|---|---|---|
| Pre-pilot / Pilot | Immediately following deposition of data into CCG data repository | Pre-pilot or pilot data will only be made openly accessible on a case-by-case basis when there is appropriate value/utility in these data |
| Main dataset | Immediately following deposition of data into CCG data repository | No earlier than 9 months from deposition of data into CCG data repository |
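The 9-month minimum in Table 3 can be turned into a concrete date with a short calculation. The Python sketch below assumes that "9 months" means nine calendar months counted from the deposition date, with the day clamped to the end of shorter months; the policy does not state how months are counted, and the function names are illustrative.

```python
import calendar
from datetime import date

EMBARGO_MONTHS = 9  # Table 3: "No earlier than 9 months from deposition"


def add_months(start: date, months: int) -> date:
    """Add whole calendar months, clamping to the last day of short months."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def earliest_open_access(deposited: date) -> date:
    """Earliest date on which main-dataset files could become open access."""
    return add_months(deposited, EMBARGO_MONTHS)
```

For example, data deposited on 15 January 2018 would become eligible for public release no earlier than 15 October 2018 under this reading.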

 

5. Consortium Group Members and Roles

Data Sponsor

Bioplatforms Australia

Research Champions

Prof Mark Walker, University of Queensland
Prof Tim Stinear, University of Melbourne
Prof James Paton, University of Adelaide
Prof Mark Schembri, University of Sydney
Prof Dick Strugnell, University of Melbourne
A/Prof Scott Beatson, University of Queensland
Prof Jonathan Iredell, University of Sydney
Prof Benjamin Howden, University of Melbourne
Prof Tania Sorrell, University of Sydney
Prof Steven Djordjevic, University of Technology Sydney
Prof Stuart Cordwell, University of Sydney
Prof Anton Peleg, Monash University
Prof Trevor Lithgow, Monash University

Data Producers (raw)

Ramaciotti Centre for Genomics, Sydney
Australian Genome Research Facility (AGRF), Melbourne
Australian Proteomics Analysis Facility (APAF), Sydney
Monash Biomedical Proteomics Facility (MBPF), Melbourne
Bio21, Metabolomics Australia (MA), Melbourne

Data Producers (processed)

Victorian Life Sciences Computational Initiative (VLSCI), now Melbourne Bioinformatics (MB)
Systems Biology Initiative (SBI), Sydney
Australian Genome Research Facility (AGRF), Melbourne
Monash Biomedical Proteomics Facility (MBPF), Melbourne
Australian Proteomics Analysis Facility (APAF), Sydney
Bio21, Metabolomics Australia (MA), Melbourne

Data Infrastructure Providers

Centre for Comparative Genomics (CCG), Perth
Queensland Cyber Infrastructure Foundation (QCIF), Brisbane
Victorian Life Sciences Computational Initiative (VLSCI), now Melbourne Bioinformatics (MB)
VicNode @ The University of Melbourne, Melbourne
Intersect Australia Ltd, Sydney

Data Custodians

All groups above are required to appoint a designated Data Custodian to ensure that data assets generated throughout this project are managed according to the requirements of this policy.

 

Appendix 1

Table 4: Examples of raw data files generated

| Omics Type | Technology | File type | Data description |
|---|---|---|---|
| Genomics | PacBio sequencing | *.bax.h5, *.bas.h5 | Main output files produced by the primary analysis pipelines (Primary analysis) |
| Genomics | PacBio sequencing | *.fasta | FASTA-formatted sequence files containing either nucleic acid sequence (such as DNA) or protein sequence information (Secondary analysis) |
| Genomics | PacBio sequencing | *.fastq | Base call and quality information for all reads passing filtering (Secondary analysis) |
| Genomics | PacBio sequencing | *metadata.xml | Top-level information about the data, including what sequencing enzyme and chemistry were used, sample name, and other metadata |
| Genomics | Illumina MiSeq sequencing | *.fastq | Base call and quality information for all reads passing filtering |
| Transcriptomics | Illumina HiSeq sequencing | *.fastq | Base call and quality information for all reads passing filtering |
| Proteomics | MS1 quantification on DDA data | *.raw | Raw mass spectrometry data |
| Proteomics | MS2 quantification on DIA/SWATH data | *.wiff, *.wiff.scan | 1D IDA raw mass spectrometry data (individual samples); 2D IDA raw mass spectrometry data (pooled samples); SWATH mass spectrometry run files (individual samples) |
| Proteomics | MS2 quantification on DIA/SWATH data | *.txt, *.xlsx | Spectral libraries to identify peptides acquired in SWATH-MS runs |
| Metabolomics | LC-MS analysis | *.csv | Raw LC-MS files |
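The file types in Table 4 can be used to work out which facility workflow a raw file is likely to have come from. The Python sketch below transcribes the table into a lookup keyed on file suffix; the names `RAW_FILE_TYPES` and `candidate_sources` are hypothetical. Because some suffixes (notably `.fastq`) appear under more than one technology, the lookup returns every candidate rather than a single answer.

```python
# Suffix -> (omics type, technology) pairs, transcribed from Table 4.
PACBIO = ("Genomics", "PacBio sequencing")
MISEQ = ("Genomics", "Illumina MiSeq sequencing")
HISEQ = ("Transcriptomics", "Illumina HiSeq sequencing")
DDA = ("Proteomics", "MS1 quantification on DDA data")
SWATH = ("Proteomics", "MS2 quantification on DIA/SWATH data")
LCMS = ("Metabolomics", "LC-MS analysis")

RAW_FILE_TYPES = {
    ".bax.h5": [PACBIO],
    ".bas.h5": [PACBIO],
    ".fasta": [PACBIO],
    "metadata.xml": [PACBIO],
    ".fastq": [PACBIO, MISEQ, HISEQ],
    ".raw": [DDA],
    ".wiff": [SWATH],
    ".wiff.scan": [SWATH],
    ".txt": [SWATH],
    ".xlsx": [SWATH],
    ".csv": [LCMS],
}


def candidate_sources(filename: str) -> list:
    """Return the (omics type, technology) pairs whose raw files use this suffix."""
    name = filename.lower()
    # Check longer suffixes first so the most specific entry wins.
    for suffix in sorted(RAW_FILE_TYPES, key=len, reverse=True):
        if name.endswith(suffix):
            return RAW_FILE_TYPES[suffix]
    return []
```

Files with suffixes not listed in Table 4, including compressed variants such as `.fastq.gz`, return an empty list in this sketch.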

 

Table 5: Examples of analysed data files generated

| Omics Type | Technology | File type | Data description |
|---|---|---|---|
| Genomics | PacBio sequencing; Illumina MiSeq sequencing | *.faa | Chromosome - predicted protein sequences (amino acid FASTA) |
| Genomics | PacBio sequencing; Illumina MiSeq sequencing | *.ffn | Chromosome - predicted transcript sequences (nucleotide FASTA) |
| Genomics | PacBio sequencing; Illumina MiSeq sequencing | *.fsa | Chromosome - assembled sequence (nucleotide FASTA) |
| Genomics | PacBio sequencing; Illumina MiSeq sequencing | *.gbk | Chromosome - annotated sequence (GenBank flat file) |
| Genomics | PacBio sequencing; Illumina MiSeq sequencing | *.gff | Chromosome - annotated sequence (General Feature Format 3.0) |
| Genomics | PacBio sequencing; Illumina MiSeq sequencing | *.log | Chromosome - Prokka annotation log file (ASCII text) |
| Genomics | PacBio sequencing; Illumina MiSeq sequencing | *.txt | Chromosome - Prokka annotation summary statistics (ASCII text) |
| Transcriptomics | Illumina HiSeq sequencing | *.fastq.bam | Alignment file |
| Transcriptomics | Illumina HiSeq sequencing | *.txt | Gene list (raw counts) file |
| Proteomics | MS1 quantification on DDA data | *.zip | Zipped folder containing all intermediate and final MaxQuant search result files. The most important file is the proteinGroups.txt file within the combined/txt folder, which contains all quantitative information for all identified proteins |
| Proteomics | MS1 quantification on DDA data | *.sps | Proprietary sps file containing all Perseus analysis steps. Specific to Perseus version 1.5.5.3. The MaxQuant proteinGroups.txt file was used as input |
| Proteomics | MS1 quantification on DDA data | *.png | Image showing distribution of CVs; image showing first 2 dimensions of a principal component analysis |
| Proteomics | MS1 quantification on DDA data | *.txt | Raw quantitative matrix summarising the analysis from Perseus, with no imputation of missing values; final quantitative matrix summarising the analysis from Perseus, with missing values imputed |
| Proteomics | MS2 quantification on DIA/SWATH data | *.xlsx | PeakView output containing ion, peptide and protein level peak area quantitation and additional information (FDR, score) |
| Metabolomics | LC-MS analysis | *.csv | Final metabolite profile (data matrix) for each sample replicate |

 

 


[1] http://www.bioplatforms.com/antibiotic-resistant-pathogens/

[2] https://wellcome.ac.uk/funding/managing-grant/statement-genome-data-release

[3] https://www.genome.gov/pages/research/wellcomereport0303.pdf

[4] https://www.nature.com/articles/461168a.epdf

[5] http://www.ramaciotti.unsw.edu.au/

[6] http://www.agrf.org.au/

[7] http://monash.edu/proteomics/facility/

[8] http://www.proteome.org.au/

[9] http://www.bio21.unimelb.edu.au/metabolomics-australia

[10] https://ccg.murdoch.edu.au/

[11] https://www.qcif.edu.au/

[12] https://downloads-qcif.bioplatforms.com/bpa/sepsis/

[13] https://downloads.bioplatforms.com/antibiotic_resistant_pathogens/

[14] Selected subsets of the data generated through the Antibiotic Resistant Pathogens initiative will be used to develop training material for tutorials and in-person workshops. These "cut-down" and/or "synthetic" datasets will be developed by consortium bioinformaticians to illustrate key features and concepts for demonstration purposes. It is expected that the data subsets used for training purposes will be made available to training workshop participants after they have finished the workshops and returned to their host institutions. Training workshop participants will only be able to access the raw and/or analysed data in its entirety when the datasets are made publicly available, as outlined in the timescales in this document. Data for these workshops will be provided to workshop participants via the RDS OMICs platform analysis environment, which is access controlled.

[15] https://www.melbournebioinformatics.org.au/

[16] http://www.systemsbiology.org.au

[17] http://www.ebi.ac.uk/ena

[18] https://www.ebi.ac.uk/arrayexpress/

[19] http://www.proteomexchange.org/

[20] http://www.ebi.ac.uk/metabolights/

 

Version 1, updated on 2018-12-13