Australian Microbiome processed data access

The genomics data generated through the Australian Microbiome Initiative is processed into secondary data, which is made available to registered users of the data portal through a dedicated search facility.

Processed data available:

  • Denoised amplicon sequence variants (ASV) or zero radius operational taxonomic units (zOTU) 
    • Amplicon sequence abundance
    • Metagenome-derived taxonomy sequence abundance
  • Processed shotgun metagenome sequence data
  • All unique (non-denoised) sequences

Access denoised ASVs and metagenomes

Request non-denoised sequences

For information on these data types and their possible uses, see the background information below.

Analytical workflows used to produce these analysed datasets are documented and made available here.

Please refer to the Australian Microbiome Initiative Data use policy and Communication policy.

AMPLICON DATA TYPES BACKGROUND INFORMATION

Although denoised amplicon sequence variants have been advocated as the standard for microbiome analyses (e.g. Callahan et al., 2017), it has become apparent that they fall short for some lines of inquiry. Denoising algorithms must balance removing noise against retaining real sequences, and the resulting loss of sensitivity means that denoised ASV data has not fulfilled its promise for investigating closely related real sequences. We want to provide higher-resolution data to our user community for use cases where standard ASV data is not adequate, hence the provision of non-denoised data. We are seeking user feedback on the utility of this data and the format in which it is provided, so that we can make it more useful. If you would like to tell us how you are using the data, or how we could improve the way it is packaged, please let us know.

Denoised data

This data has been denoised to remove likely sequencing errors and chimeric sequences, as per the methods detailed here. These methods are commonly employed in microbial amplicon studies and seek to remove likely spurious sequences from datasets. Denoised ASV data has been recommended as the standard for many microbiome applications. ASVs should be thought of as zero-radius OTUs, not as a list of unique reads. How well the denoising steps split real, closely related sequences, and how well they remove (lump) sequence errors around real sequences, depends on the model arguments used in the denoising algorithm and in the chimera detection steps. As stated in the various algorithm descriptions, denoising is therefore a balance between removing as much noise as possible and not lumping true, closely related sequences into a single ASV. Because it is difficult to determine whether low-abundance sequences are biologically real, algorithms are generally optimised to keep sequences that are likely to be real and to remove those that are less certain. As a result, real sequences are sometimes removed (from the ASV list, not the abundance count), and spurious sequences are sometimes kept.
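The lumping and splitting trade-off can be illustrated with a deliberately simplified toy. The sketch below is not the algorithm used to produce the Australian Microbiome ASVs; the function `toy_denoise` and its thresholds `max_dist` and `max_ratio` are hypothetical and chosen only to show how a rare sequence can leave the variant list while its reads stay in the abundance count.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def toy_denoise(counts: dict[str, int], max_dist: int = 1,
                max_ratio: float = 0.1) -> dict[str, int]:
    """Lump each rare sequence into a more abundant close neighbour.

    A sequence is lumped when it is within max_dist mismatches of an already
    accepted sequence AND its abundance is at most max_ratio times that
    sequence's original abundance. Both thresholds are illustrative only.
    """
    accepted: dict[str, int] = {}
    # Visit sequences from most to least abundant (ties broken alphabetically).
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        parent = next(
            (p for p in accepted
             if len(p) == len(seq)
             and hamming(p, seq) <= max_dist
             and n <= max_ratio * counts[p]),
            None,
        )
        if parent is None:
            accepted[seq] = n        # kept as its own variant
        else:
            accepted[parent] += n    # reads move to the parent; sequence leaves the list
    return accepted


# Made-up counts for four unique sequences in one sample.
raw = {"ACGTACGT": 1000, "ACGTACGA": 40, "ACGTACGC": 400, "TTGTACGT": 5}
denoised = toy_denoise(raw)
```

In this toy run the 40-read sequence, which could be a real strain, is absorbed into its abundant neighbour, while the 5-read sequence, which could be spurious, survives because nothing abundant sits close to it. The total read count is unchanged.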

Denoised data is the most commonly used and presented data type in amplicon studies. It is useful for examining general patterns of diversity, including beta diversity and alpha diversity (but see Bissett & Brown, 2018), and for exploring patterns of sequence distribution (bearing in mind the limitations described above).

Denoised data is provided via a searchable database that allows users to filter on amplicon, taxonomy and sample-specific contextual data.

Non-denoised data 

Non-denoised data is provided as all of the unique sequences found in the dataset, with their abundances, after merging R1 and R2 reads into a single amplicon. No other QC or filtering is performed. While denoised data leans towards lumping (as explained above), non-denoised data contains ALL unique reads in the dataset (including sequencing errors, chimeras, etc.). Non-denoised data therefore emphasises splitting rather than lumping. It is particularly useful for investigating closely related sequences where at least one of those sequences may be rare in the sequencing run and could therefore be lumped or removed during denoising and chimera detection.
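As a minimal sketch of what "all unique sequences and their abundance" means in practice, the following counts identical merged reads per sample and applies no other filtering. The input structure and the function name `unique_sequence_rows` are assumptions made for illustration; this is not the pipeline used to produce the published tables.

```python
from collections import Counter


def unique_sequence_rows(merged_reads: dict[str, list[str]]) -> list[tuple[str, str, int]]:
    """Return (sample_id, sequence, abundance) rows, one per unique sequence.

    merged_reads maps a sample ID to its merged R1/R2 amplicon reads.
    Nothing is discarded: a sequence seen once still gets its own row.
    """
    rows = []
    for sample_id in sorted(merged_reads):
        counts = Counter(merged_reads[sample_id])
        # Most abundant first, ties broken alphabetically for stable output.
        for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            rows.append((sample_id, seq, n))
    return rows


# Made-up reads for two hypothetical samples.
reads = {
    "S1": ["ACGT", "ACGT", "ACGA"],
    "S2": ["ACGT", "TTTT"],
}
rows = unique_sequence_rows(reads)
```

Each singleton, whether a real rare variant or a sequencing error, appears as its own row, which is what makes the table suitable for splitting closely related sequences.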

The non-denoised data is provided as text-based abundance tables and is directly downloadable in this form. At present we do not host this data as a searchable database. The data is available as: 1. the full dataset; 2. a subset defined by user-supplied sample IDs; or 3. a subset defined by a user-supplied nucleotide sequence of interest.

We provide here two examples of how the data may be used.

1. You may be interested in the abundance of specific sequences across the dataset and have a question like: What is the abundance distribution, across all AM Marine samples, of strain 1 and strain 2, whose sequences I know and which differ by 1 bp?

To obtain the data you will need to place a request through the request page:

  • ascertain the sample and sequence run IDs for the samples of interest from the database
  • enter the sample IDs and the sequences for strains 1 and 2 into the non-denoised data download page
  • once your request is processed, you will receive an email saying the data is ready; you can then download it
  • you now have a sequence abundance table of strain 1 and strain 2 across your requested samples
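Once downloaded, a table like this can be pivoted into one record per sample. The sketch below assumes a tab-delimited layout with columns named `sample_id`, `sequence` and `abundance`; those column names, the sample IDs and the two strain sequences are placeholders, so adjust them to match the file you actually receive.

```python
import csv
import io
from collections import defaultdict

# Placeholder sequences that differ at a single position.
STRAIN_1 = "ACGTACGTAC"
STRAIN_2 = "ACGTACGTAT"


def mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def strain_abundance(table_text: str, targets: dict[str, str]) -> dict[str, dict[str, int]]:
    """Return {sample_id: {strain_name: abundance}} for exact sequence matches."""
    by_seq = {seq: name for name, seq in targets.items()}
    result = defaultdict(lambda: {name: 0 for name in targets})
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
        per_sample = result[row["sample_id"]]  # zero-filled entry for every sample seen
        name = by_seq.get(row["sequence"])
        if name is not None:
            per_sample[name] += int(row["abundance"])
    return dict(result)


# Made-up table contents standing in for a downloaded file.
lines = [
    ["sample_id", "sequence", "abundance"],
    ["M001", "ACGTACGTAC", "120"],
    ["M001", "ACGTACGTAT", "3"],
    ["M001", "GGGTACGTAC", "50"],
    ["M002", "ACGTACGTAT", "77"],
    ["M003", "GGGTACGTAC", "9"],
]
table_text = "\n".join("\t".join(fields) for fields in lines)

abundance = strain_abundance(table_text, {"strain_1": STRAIN_1, "strain_2": STRAIN_2})
```

Samples in which neither strain was observed are kept with zero counts, so absences remain visible when the distribution is plotted or compared across sites.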

2. You may be interested in the variability of closely related sequences within a specific taxon and have a question like: How can I identify all the likely oligotypes of Bradyrhizobia in all soil samples in the AM?

  • ascertain the sample and sequence run IDs for the samples of interest from the database
  • enter the sample IDs and the taxonomy string of interest into the non-denoised data download page
  • once your request is processed, you will receive an email saying the data is ready; you can then download it
  • you now have a table of abundance of all of the unique Bradyrhizobia sequences in soil samples that can now be used for, e.g. oligotype analysis
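One common starting point for oligotype-style analysis is to find the alignment positions where the unique sequences vary most, using Shannon entropy. The sketch below is an illustration only and does not reproduce any particular oligotyping tool; it assumes the sequences are already aligned to equal length, and the function name `positional_entropy` and the example counts are hypothetical.

```python
import math
from collections import Counter


def positional_entropy(seq_counts: dict[str, int]) -> list[float]:
    """Abundance-weighted Shannon entropy (in bits) at each alignment position.

    seq_counts maps each unique aligned sequence to its total abundance.
    Positions with entropy 0 are invariant; higher values mark positions
    that may discriminate closely related variants.
    """
    lengths = {len(s) for s in seq_counts}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    total = sum(seq_counts.values())
    entropies = []
    for i in range(lengths.pop()):
        column = Counter()
        for seq, n in seq_counts.items():
            column[seq[i]] += n  # weight each base by sequence abundance
        h = sum((c / total) * math.log2(total / c) for c in column.values())
        entropies.append(h)
    return entropies


# Two made-up variants of equal abundance that differ only at the last base.
entropies = positional_entropy({"ACGT": 2, "ACGA": 2})
variable_positions = [i for i, h in enumerate(entropies) if h > 0]
```

Because non-denoised data retains sequencing errors, low-entropy positions driven by a few singleton reads should be treated with caution before being used to define variants.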

References

Bissett A & Brown MV (2018) Alpha-diversity is strongly influenced by the composition of other samples when using multiplexed sequencing approaches. Soil Biol Biochem 127: 79-81.

Callahan BJ, McMurdie PJ & Holmes SP (2017) Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. The ISME Journal 11: 2639-2643.