Australian Microbiome amplicon datatypes

Following user feedback, the AM now provides two amplicon sequence abundance data types:

  1. Denoised amplicon sequence variants (ASVs), also called zero-radius operational taxonomic units (zOTUs)
  2. All unique (non-denoised) sequences

These datatypes, along with possible uses, are outlined below.  

Although denoised amplicon sequence variants have been advocated as the standard for microbiome analyses (e.g., Callahan et al., 2017), it has become apparent that for some lines of inquiry denoised ASV data has not fulfilled its promise for investigating closely related real sequences.  The cause is a lack of sensitivity in denoising algorithms, which must balance the removal of noise (denoising) against the retention of real sequences.  We want to provide higher-resolution data to our user community for use cases where standard ASV data is not adequate, hence the provision of non-denoised data.  We are seeking user feedback on the utility of this data and the format in which it is provided, so that we can make it more useful.  If you would like to tell us how you’re using the data and how we could improve the way it is packaged, please let us know.

Denoised data

This data has been denoised to remove likely sequencing errors and chimeric sequences as per the methods detailed here.  These methods are commonly employed in microbial amplicon studies and seek to remove likely spurious sequences from datasets.  Denoised ASV data has been recommended as the standard for many microbiome requirements.  ASVs should be thought of as zero-radius OTUs, not as a list of unique reads.  How well the denoising steps split real, closely related sequences, and how well they remove (lump) sequence errors around real sequences, depends on the model arguments used in the denoising algorithm and in the chimera detection steps.  As stated in the various algorithm descriptions, denoising is therefore a balance between removing as much noise as possible and not lumping true, closely related sequences into a single ASV.  Because it is difficult to determine whether low-abundance sequences are biologically real, algorithms are generally optimised to keep sequences we’re fairly sure about and remove those we’re less sure about.  The result is that sometimes real sequences are removed (from the ASV list, not the abundance count), and sometimes spurious sequences are kept.
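
The trade-off described above can be illustrated with a deliberately simplified sketch.  This is not the denoising method used by the AM, nor any published algorithm: the function `toy_denoise`, its `max_dist` and `min_ratio` parameters, and the example sequences are all hypothetical.  It only shows how a rare sequence one base away from an abundant one can disappear from the sequence list while its reads stay in the abundance count.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def toy_denoise(counts, max_dist=1, min_ratio=0.1):
    """Toy lumping rule (hypothetical, for illustration only).

    A sequence is folded into a more abundant sequence when it is within
    max_dist mismatches of it and its count is below min_ratio times the
    count of that neighbour. Its reads are added to the neighbour, so the
    total read count is preserved while the list of sequences shrinks.
    """
    # Visit sequences from most to least abundant.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = {}
    for seq, n in ordered:
        parent = next(
            (k for k in kept
             if len(k) == len(seq)
             and hamming(k, seq) <= max_dist
             and n < min_ratio * kept[k]),
            None,
        )
        if parent is None:
            kept[seq] = n          # treated as a real sequence
        else:
            kept[parent] += n      # lumped into its abundant neighbour
    return kept


unique_counts = {
    "ACGTACGT": 1000,  # abundant sequence
    "ACGTACGA": 12,    # 1 bp away and rare: an error, or a real rare strain?
    "ACGTACCA": 400,   # 2 bp away from the first and abundant
}
denoised = toy_denoise(unique_counts)
```

With these toy values the rare sequence is lumped into its abundant neighbour.  Raising or lowering `min_ratio` moves the balance between lumping and splitting, which is the kind of choice the model arguments mentioned above control.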

Denoised data is the form most commonly used and presented in amplicon studies.  It is useful for examining general patterns of diversity, including beta and alpha diversity (but see Bissett & Brown, 2018), and for exploring patterns of sequence distribution (bearing in mind the limitations described above).

Denoised data is provided via a searchable database that allows users to filter on amplicon, taxonomy and sample-specific contextual data.

To explore and download denoised data go here.

Non-denoised data

Non-denoised data is provided as all of the unique sequences found in the dataset and their abundance, after merging R1 and R2 reads into a single amplicon.  No other QC or filtering is performed.  While denoised data tends to lean towards lumping (as explained above), non-denoised data contains ALL unique reads in the dataset (including sequencing errors, chimeras, etc.).  Non-denoised data therefore emphasises splitting rather than lumping.  It is particularly useful for investigating closely related sequences where at least one of those sequences may be rare in the sequencing run, and could therefore be lumped or removed during denoising and chimera detection.
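
Counting unique sequences in this way can be sketched as follows.  This is a minimal illustration rather than the AM pipeline: the function name `dereplicate` and the toy reads are hypothetical, and the reads are assumed to have already been merged from R1 and R2.

```python
from collections import Counter


def dereplicate(merged_reads):
    """Count how many times each exact sequence occurs.

    No error correction or filtering is applied, so two reads that differ
    by a single base are counted as two different sequences.
    """
    return Counter(merged_reads)


# Toy merged reads from one sample (hypothetical).
merged_reads = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "ACGTACGT"]
table = dereplicate(merged_reads)
```

Every distinct string becomes its own row, which is why such a table retains rare real variants and sequencing errors alike.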

The non-denoised data is provided as text-based abundance tables, and is directly downloadable in this form.  At present we don’t have the capacity to host this as a searchable database (as the denoised data is), but this could change in the future depending on user feedback.  The data are available as (1) the full dataset, (2) a subset selected by user-supplied sample IDs, or (3) a subset selected by a user-supplied nucleotide sequence of interest.

We provide here two examples of how the data may be used.  If you have other examples you’d like us to include, please provide them.


  1. You may be interested in the abundance of specific sequences across the dataset and have a question like: What is the abundance distribution, across all AM Marine samples, of strain 1 and strain 2, whose sequences I know and which differ by 1 bp?
    To obtain the data you would:
    a.    Ascertain the sample and sequence run IDs for the samples of interest from the database (get the sample IDs for all marine samples)
    b.    Enter the sample IDs and the sequences for strains 1 and 2 into the non-denoised data download page
    c.    Wait for an email saying the data is ready, then download it
    d.    You now have a sequence abundance table of strain 1 and strain 2 across the marine samples
  2. You may be interested in the variability of closely related sequences within a specific taxon and have a question like: How can I identify all the likely oligotypes of Bradyrhizobia in all soil samples in the AM?
    To obtain the data you would:
    a.    Ascertain the sample and sequence run IDs for the samples of interest from the database (get the sample IDs for all soil samples)
    b.    Enter the sample IDs and the taxonomy string of interest into the non-denoised data download page
    c.    Wait for an email saying the data is ready, then download it
    d.    You now have an abundance table of all of the unique Bradyrhizobia sequences in soil samples, which can be used for, e.g., oligotyper analysis
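
If you download the full dataset instead, the same kind of subsetting can be done locally.  The sketch below is only an illustration under stated assumptions: the layout of the downloaded tables is not described here, so the tab-separated columns `sample_id`, `sequence` and `abundance`, the function `subset_table`, and the toy sample IDs and sequences are all hypothetical and would need adapting to the real files.

```python
import csv
import io


def subset_table(handle, sample_ids, sequences):
    """Return {sequence: {sample_id: abundance}} for the requested rows.

    Assumes a long-format, tab-separated table with the (hypothetical)
    columns sample_id, sequence and abundance. Requested combinations
    that are absent from the table are reported as 0.
    """
    wanted_samples = set(sample_ids)
    result = {seq: {s: 0 for s in sample_ids} for seq in sequences}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["sample_id"] in wanted_samples and row["sequence"] in result:
            result[row["sequence"]][row["sample_id"]] += int(row["abundance"])
    return result


# Toy table standing in for a downloaded file (hypothetical values).
example_tsv = (
    "sample_id\tsequence\tabundance\n"
    "S1\tACGTACGT\t120\n"
    "S1\tACGTACGA\t3\n"
    "S2\tACGTACGT\t45\n"
    "S3\tACGTACGA\t9\n"
    "S3\tTTTTACGT\t7\n"
)

strain_1, strain_2 = "ACGTACGT", "ACGTACGA"  # differ by 1 bp
abundances = subset_table(
    io.StringIO(example_tsv), ["S1", "S2"], [strain_1, strain_2]
)
```

For a real file, `io.StringIO(example_tsv)` would be replaced by an open file handle.  Matching is on the exact sequence string, so two strains that differ by 1 bp are kept apart.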

If you’d like the non-denoised data, go here.

Bissett A & Brown MV (2018) Alpha-diversity is strongly influenced by the composition of other samples when using multiplexed sequencing approaches. Soil Biol Biochem 127: 79-81.
Callahan BJ, McMurdie PJ & Holmes SP (2017) Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. ISME J 11: 2639-2643.