VAMB

Metagenomic Binning

VAMB (Variational Autoencoders for Metagenomic Binning) uses deep variational autoencoders to learn a latent representation of contigs from their tetranucleotide frequencies and co-abundance profiles across multiple samples, enabling accurate metagenomic binning. [1]

How to Obtain Output Model File

The workflow below is the one the team ran to produce the example output files presented on the tool page.

Input

Co-assembled contigs (FASTA) + per-sample BAM alignment files

Output

TSV with per-cluster metrics: cluster name, radius, peak/valley ratio, kind (normal/loner/fallback), total bp, contig count, medoid contig
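As a minimal sketch of how this table could be read for downstream work, the snippet below parses it into typed records. The header names (`clustername`, `radius`, `peak valley ratio`, `kind`, `bp`, `ncontigs`, `medoid`) are an assumption about the file layout and should be checked against your own file; the three rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    radius: float
    pvr: float
    kind: str
    bp: int
    ncontigs: int
    medoid: str


def read_clusters(handle):
    """Parse a tab-separated cluster metadata table into Cluster records."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Column names below are assumed; adjust them if your header differs.
    return [
        Cluster(
            name=row["clustername"],
            radius=float(row["radius"]),
            pvr=float(row["peak valley ratio"]),
            kind=row["kind"],
            bp=int(row["bp"]),
            ncontigs=int(row["ncontigs"]),
            medoid=row["medoid"],
        )
        for row in reader
    ]


# Made-up rows for illustration only; not taken from the real output.
EXAMPLE_TSV = (
    "clustername\tradius\tpeak valley ratio\tkind\tbp\tncontigs\tmedoid\n"
    "1\t0.105\t0.12\tnormal\t2450000\t87\tk141_10\n"
    "2\t0.000\t0.00\tloner\t1800\t1\tk141_22\n"
    "3\t0.090\t0.55\tfallback\t640000\t35\tk141_31\n"
)

clusters = read_clusters(io.StringIO(EXAMPLE_TSV))
for c in clusters:
    print(c.name, c.kind, c.bp, c.ncontigs)
```

For a real file, pass `open("vamb_out/vae_clusters_metadata.tsv", newline="")` instead of the in-memory string.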

conda install -c bioconda vamb

Docker image: quay.io/biocontainers/vamb:4.1.3--pyhdfd78af_0

Sample 1: gut microbiome (3 samples from PRJNA795985, 1M read pairs each), co-assembled with MEGAHIT and binned with VAMB v5.0.4, producing 7,212 clusters.

Download 3 samples from PRJNA795985
```
for SRR in SRR17531757 SRR17531762 SRR17531772; do fastq-dump --split-files --gzip -X 1000000 --outdir reads/ $SRR; done
```
Downloads 1M read pairs per sample (~70 MB compressed each) from the "Diet and Antimicrobial Resistance in Healthy US Adults" study (Shrestha et al., mBio 2022).

Co-assemble all samples with MEGAHIT

```
megahit -1 reads/SRR17531757_1.fastq.gz,reads/SRR17531762_1.fastq.gz,reads/SRR17531772_1.fastq.gz -2 reads/SRR17531757_2.fastq.gz,reads/SRR17531762_2.fastq.gz,reads/SRR17531772_2.fastq.gz -o assembly --min-contig-len 1500 -t 8 --presets meta-sensitive
```

Produces 14,934 contigs (≥1,500 bp).
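The contig count can be checked independently of the assembler log. The sketch below tallies sequence lengths from a FASTA stream and counts how many meet a length cutoff; the two sequences and their names are made up, and the cutoff is lowered to suit the toy input.

```python
import io


def contig_lengths(handle):
    """Return {contig_name: length} for a FASTA stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Made-up sequences; a real run would open assembly/final.contigs.fa instead.
EXAMPLE_FASTA = ">k141_1 flag=1\nACGTACGTAC\nGGCC\n>k141_2 flag=0\nACGT\n"
MIN_LENGTH = 10  # the workflow uses 1500; lowered here for the toy input

lengths = contig_lengths(io.StringIO(EXAMPLE_FASTA))
kept = [name for name, n in lengths.items() if n >= MIN_LENGTH]
print(len(kept), "of", len(lengths), "contigs pass the length cutoff")
```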

Map each sample to contigs

```
mkdir -p bams
for SRR in SRR17531757 SRR17531762 SRR17531772; do minimap2 -ax sr -t 8 assembly/final.contigs.fa reads/${SRR}_1.fastq.gz reads/${SRR}_2.fastq.gz | samtools sort -@ 4 -o bams/${SRR}.sorted.bam && samtools index bams/${SRR}.sorted.bam; done
```

Run VAMB v5.0.4

```
vamb bin default --outdir vamb_out --fasta assembly/final.contigs.fa --bamdir bams/ -m 1500 --minfasta 200000 -p 8
```

Produces vae_clusters_metadata.tsv with 7,212 clusters (715 normal, 6,472 loner, 25 fallback).

Upload vae_clusters_metadata.tsv to IntMeta

Materials Used

Sample 1	PRJNA795985 — Gut microbiome co-assembly, 3 samples, 7,212 clusters (single, comparison)
Sample 2	SRR14092310 — VAMB v5.0.4 from MEGAHIT assembly, 164 clusters (comparison, group-dysbiotic)
Sample 3	SRR14092160 — Human gut metagenome assembly, pre-VRE Day -9 (group-healthy)
Sample 4	SRR17531757 — Healthy adult gut metagenome assembly (group-healthy)
Sample 5	SRR17531762 — Healthy adult gut metagenome assembly (group-healthy)
Sample 6	SRR17531772 — Healthy adult gut metagenome assembly (group-healthy)
Sample 7	SRR14092162 — HCT patient gut assembly, Enterococcus-dominated (group-dysbiotic)
Sample 8	SRR14092309 — HCT patient gut assembly, Enterococcus-dominated (group-dysbiotic)
Sample 9	SRR14092284 — HCT patient gut assembly, Klebsiella + E. coli (group-dysbiotic)
Assembly	MEGAHIT assembly of metagenome reads
Docker Image	pip install vamb (v5.0.4)

Sample Output Files

Download the output files used in the tool page demos. You can upload these directly to IntMeta to explore the visualizations.

Single & Comparison

Group Analysis

Charts Reference

Detailed descriptions for all 22 visualizations generated by VAMB in IntMeta.

`cluster-kind-distribution`

Pie chart of cluster kinds from the 'kind' column: normal (separated by density peaks in latent space), loner (isolated single-contig clusters), and fallback (assigned the default clustering radius). A healthy assembly produces mostly normal clusters; many fallback clusters suggest poor separation.
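The proportions behind such a pie chart are a simple tally. As an illustration, the sketch below uses the Sample 1 counts reported on this page (715 normal, 6,472 loner, 25 fallback); the function name is hypothetical and is not part of IntMeta or VAMB.

```python
from collections import Counter


def kind_fractions(kinds):
    """Return the fraction of clusters belonging to each kind."""
    counts = Counter(kinds)
    total = sum(counts.values())
    return {kind: count / total for kind, count in counts.items()}


# Counts reported for Sample 1: 715 normal, 6,472 loner, 25 fallback.
kinds = ["normal"] * 715 + ["loner"] * 6472 + ["fallback"] * 25
fractions = kind_fractions(kinds)

for kind, frac in sorted(fractions.items(), key=lambda kv: -kv[1]):
    print(f"{kind}: {frac:.1%}")
```

By count, loners dominate this sample, so it can be worth viewing the chart alongside total bp per kind.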

`genome-size-distribution`

Bar chart of total bp per cluster, sorted by size and colored by kind. VAMB recommends filtering clusters below 250 Kbp and discarding fallback clusters for downstream analysis.
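The filtering described above can be sketched as follows. The 250 Kbp cutoff is taken from the text and read as 250,000 bp; the dictionary keys and the four clusters are made up for illustration.

```python
def filter_clusters(clusters, min_bp=250_000, drop_kinds=("fallback",)):
    """Keep clusters that meet the size cutoff and are not of an excluded kind."""
    return [
        c for c in clusters
        if c["bp"] >= min_bp and c["kind"] not in drop_kinds
    ]


# Made-up clusters for illustration only.
clusters = [
    {"name": "1", "kind": "normal", "bp": 2_450_000},
    {"name": "2", "kind": "loner", "bp": 1_800},
    {"name": "3", "kind": "fallback", "bp": 640_000},
    {"name": "4", "kind": "normal", "bp": 250_000},
]

kept = filter_clusters(clusters)
print([c["name"] for c in kept])
```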

`radius-vs-pvr`

Scatter plot of clustering radius vs peak-to-valley ratio (PVR, from the 'peak valley ratio' column), colored by kind. In VAMB's clustering a smaller PVR means the cluster's edge is better defined, so normal clusters typically combine a low PVR with a tight radius. A high PVR with a large radius suggests poorly resolved clusters.

`genome-size-vs-contigs`

Scatter plot of total bp (genome size) vs ncontigs per cluster. Well-resolved bins cluster at moderate contig counts with substantial genome sizes. Points in the upper-left (many contigs, small genome) may indicate chimeric clusters.

`contigs-per-cluster`

Bar chart of ncontigs per cluster, colored by kind. Very high contig counts may indicate over-fragmented or chimeric clusters that merged unrelated contigs.

`avg-contig-length`

Bar chart of average contig length (total bp / ncontigs) per cluster. Higher values indicate better-assembled genomes with longer individual contigs. Low averages suggest highly fragmented assemblies.

`metrics-by-kind`

Grouped box plot comparing distributions of genome size (bp), contig count, and radius across the three cluster kinds (normal, loner, fallback). Reveals systematic differences — e.g., loner clusters tend to be smaller with single contigs.
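A box plot summarizes each group by its quartiles; the simplest of these, the median, can be computed per kind as in the sketch below. The dictionary keys and values are made up, and this is not necessarily how IntMeta builds the chart.

```python
from collections import defaultdict
from statistics import median


def median_by_kind(clusters, metric):
    """Group clusters by kind and return the median of one metric per kind."""
    grouped = defaultdict(list)
    for c in clusters:
        grouped[c["kind"]].append(c[metric])
    return {kind: median(values) for kind, values in grouped.items()}


# Made-up clusters for illustration only.
clusters = [
    {"kind": "normal", "bp": 2_000_000},
    {"kind": "normal", "bp": 1_000_000},
    {"kind": "normal", "bp": 3_000_000},
    {"kind": "loner", "bp": 1_800},
    {"kind": "loner", "bp": 2_200},
    {"kind": "fallback", "bp": 640_000},
]

medians = median_by_kind(clusters, "bp")
print(medians)
```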

`cluster-metrics-heatmap`

Heatmap of min-max normalized metrics (bp, ncontigs, radius, peak valley ratio) across the top clusters. Each metric is scaled 0–1 within its column; darker cells = higher relative values. Useful for spotting outlier clusters.
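Min-max normalization rescales each column to the 0 to 1 range via (x - min) / (max - min). A minimal sketch follows, with made-up values; mapping a constant column to 0.0 is an assumption chosen here to avoid division by zero.

```python
def min_max_columns(rows, columns):
    """Return copies of rows with the given columns rescaled to [0, 1]."""
    scaled = [dict(row) for row in rows]
    for col in columns:
        values = [row[col] for row in rows]
        low, high = min(values), max(values)
        span = high - low
        for out, value in zip(scaled, values):
            # A constant column has no spread; map it to 0.0 (assumed convention).
            out[col] = (value - low) / span if span else 0.0
    return scaled


# Made-up clusters for illustration only.
rows = [
    {"name": "1", "bp": 100_000, "ncontigs": 5},
    {"name": "2", "bp": 300_000, "ncontigs": 5},
    {"name": "3", "bp": 200_000, "ncontigs": 5},
]

scaled = min_max_columns(rows, ["bp", "ncontigs"])
print([(r["name"], r["bp"], r["ncontigs"]) for r in scaled])
```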

`comp-quality-tiers`

Grouped bar chart comparing the distribution of heuristic quality tiers (High / Medium / Low, based on cluster kind and genome size thresholds) across samples. Reveals which sample produced more high-quality VAMB clusters.

`comp-genome-size`

Box plot or grouped bar chart comparing the genome size (total bp) distribution of VAMB clusters across samples. Differences may reflect varying community complexity or co-assembly depth.

`comp-kind-distribution`

Stacked or grouped bar chart comparing the proportion of normal, loner, and fallback clusters across samples. A higher fraction of normal clusters indicates better latent-space separation and more reliable binning.

`comp-quality-pct`

100% stacked bar chart showing the proportion of High / Medium / Low quality clusters per sample. Normalizes for different cluster counts, enabling direct comparison of binning quality. Standard in CAMI challenge benchmarks.
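The normalization behind a 100% stacked bar is a per-sample division by the total. A short sketch with made-up tier counts and hypothetical sample names:

```python
def tier_percentages(tier_counts):
    """Convert raw tier counts for one sample into percentages of its total."""
    total = sum(tier_counts.values())
    if total == 0:
        return {tier: 0.0 for tier in tier_counts}
    return {tier: 100.0 * n / total for tier, n in tier_counts.items()}


# Made-up counts for two hypothetical samples.
samples = {
    "sample_A": {"High": 5, "Medium": 15, "Low": 80},
    "sample_B": {"High": 2, "Medium": 2, "Low": 4},
}

percentages = {name: tier_percentages(counts) for name, counts in samples.items()}
for name, pct in percentages.items():
    print(name, pct)
```

Although sample_A has more High clusters in absolute terms, sample_B has the larger High share, which is the comparison this chart is meant to expose.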

`comp-total-recovery`

Stacked bar chart of total base pairs recovered per sample, split by quality tier. Measures actual data volume binned, complementing cluster count. Used in VAMB and CAMI evaluations.

`comp-cdf`

Empirical cumulative distribution function (CDF) of cluster sizes overlaid per sample. Reveals full distribution shape beyond boxplot summaries. Standard in CAMI benchmarking.
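An empirical CDF gives, for any size x, the fraction of clusters whose size is at most x. A minimal sketch with made-up cluster sizes:

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F


# Made-up cluster sizes in bp.
sizes = [1_000, 2_000, 2_000, 4_000]
F = ecdf(sizes)

for x in (500, 2_000, 4_000):
    print(x, F(x))
```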

`comp-contigs`

Boxplot of contig count per cluster across samples. Higher contig counts indicate more fragmented assemblies, potentially from lower sequencing depth or complex community composition.

`binning-group-quality-tiers`

Grouped bar chart of heuristic quality tier counts (High / Medium / Low) per experimental group, with Chi-square test. Tiers are based on cluster kind and genome size: High = normal + ≥500 Kbp, Medium = normal ≥100 Kbp or fallback ≥500 Kbp, Low = all else. Note: these are assembly-metric heuristics, not MIMAG classifications — VAMB does not output completeness or contamination.
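The tier rules stated above translate directly into a small function. This sketch reads 1 Kbp as 1,000 bp (an assumption) and is written from the description, not from IntMeta's source.

```python
def quality_tier(kind, bp):
    """Assign a heuristic tier from cluster kind and total bp."""
    if kind == "normal" and bp >= 500_000:
        return "High"
    if (kind == "normal" and bp >= 100_000) or (kind == "fallback" and bp >= 500_000):
        return "Medium"
    return "Low"


# Made-up (kind, bp) pairs for illustration only.
examples = [
    ("normal", 2_450_000),
    ("normal", 180_000),
    ("fallback", 640_000),
    ("loner", 2_000_000),
    ("normal", 40_000),
]

for kind, bp in examples:
    print(kind, bp, quality_tier(kind, bp))
```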

`binning-group-metric-boxplots`

Boxplots comparing a selected metric (genome size, contig count, radius, or peak-valley ratio) across groups. All individual cluster values are pooled per group. Kruskal-Wallis test (≥3 groups) or Mann-Whitney U test (2 groups) assesses significance; pairwise comparisons use Benjamini-Hochberg FDR correction.
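The Benjamini-Hochberg step can be sketched without external libraries: each p-value is scaled by m / rank, then made monotone from the largest rank downward. The p-values below are made up; the tests themselves (Kruskal-Wallis, Mann-Whitney U) would normally come from a statistics package and are not reproduced here.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up pairwise p-values for three group comparisons.
raw = [0.01, 0.04, 0.03]
adjusted = benjamini_hochberg(raw)
print(adjusted)
```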

`binning-group-recovery-rate`

Grouped bar chart of mean cluster count per sample at each quality tier (High, Medium, Low, Total) across groups, with ±1 SD error bars. Tests whether experimental groups differ in how many VAMB clusters they recover at each quality level.
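The bar heights and error bars reduce to a per-group mean and standard deviation. The sketch below uses the sample standard deviation (n - 1 denominator), which is an assumption about the chart; the group labels follow this page, while the per-sample counts are made up.

```python
from statistics import mean, stdev


def mean_and_sd(counts_by_group):
    """Return {group: (mean, sample SD)} for per-sample cluster counts."""
    return {
        group: (mean(counts), stdev(counts))
        for group, counts in counts_by_group.items()
    }


# Made-up High-tier cluster counts, one value per sample.
high_tier_counts = {
    "group-healthy": [12, 15, 9, 14],
    "group-dysbiotic": [4, 6, 5, 3],
}

summary = mean_and_sd(high_tier_counts)
for group, (avg, sd) in summary.items():
    print(f"{group}: {avg:.2f} +/- {sd:.2f}")
```

`stdev` needs at least two samples per group; a group with a single sample would need separate handling.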

`binning-group-pca`

PCA ordination of per-sample quality profiles (mean genome size, contig count, radius, peak-valley ratio, %normal, %loner, %fallback, %HQ). Features are z-score standardized before decomposition. PERMANOVA [2] on the Euclidean distance matrix tests whether group centroids differ significantly. 95% confidence ellipses are drawn for groups with ≥3 samples.
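The two preprocessing steps named above, z-score standardization and the Euclidean distance matrix, can be sketched with the standard library. The decomposition and the permutation test are left out, since they are better taken from a numerical package. Using the population standard deviation is an assumption, and the profile values are made up.

```python
import math
from statistics import fmean, pstdev


def zscore_columns(matrix):
    """Standardize each column to mean 0 and unit (population) SD."""
    columns = list(zip(*matrix))
    means = [fmean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, sds)]
        for row in matrix
    ]


def euclidean_distance_matrix(rows):
    """Return the full pairwise Euclidean distance matrix between rows."""
    return [[math.dist(a, b) for b in rows] for a in rows]


# Made-up per-sample profiles: [mean genome size in Mbp, % normal clusters].
profiles = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]

scaled = zscore_columns(profiles)
distances = euclidean_distance_matrix(scaled)
print(distances)
```

Standardizing first keeps a large-valued feature such as genome size from dominating the distances.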

`binning-group-cdf`

Overlaid empirical cumulative distribution functions (CDFs) of a selected metric across groups. The two-sample Kolmogorov-Smirnov test quantifies the maximum vertical distance between curves — significant D-statistics indicate the groups' metric distributions differ in location, spread, or shape.
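The D-statistic is the largest gap between the two empirical CDFs, and that gap is always attained at one of the observed values. A minimal sketch with made-up data follows; the p-value is not computed here.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov D: max |F_a(x) - F_b(x)| over all x."""
    xs, ys = sorted(a), sorted(b)
    d = 0.0
    # Both ECDFs are step functions, so checking the observed values suffices.
    for v in xs + ys:
        fa = bisect_right(xs, v) / len(xs)
        fb = bisect_right(ys, v) / len(ys)
        d = max(d, abs(fa - fb))
    return d


# Made-up metric values for two groups.
group_a = [1, 2, 3, 4]
group_b = [3, 4, 5, 6]

d_stat = ks_statistic(group_a, group_b)
print(d_stat)
```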

`binning-group-cluster-types`

Stacked bar chart of cluster kind counts (Normal / Loner / Fallback) per group, with Chi-square test. These three cluster types are defined by VAMB's latent-space density-peak clustering. Significant p-values suggest experimental conditions influence the proportion of well-separated vs. poorly-resolved clusters.
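The Chi-square statistic compares observed counts with the counts expected if kind proportions were identical across groups. The sketch below computes the statistic and degrees of freedom only. It applies no continuity correction and assumes every row and column total is non-zero; the counts are made up.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and kind.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Made-up counts: rows are groups, columns are Normal / Loner / Fallback.
table = [
    [300, 500, 10],
    [150, 700, 40],
]

statistic, dof = chi_square_statistic(table)
print(round(statistic, 2), dof)
```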

`binning-group-total-recovery`

Stacked bar chart of mean total base pairs recovered per group, split by quality tier. Error bars show ±1 SD. Measures actual data volume binned per group, complementing cluster count and recovery rate charts.

References

[1] Nissen, J.N., Johansen, J., Allesøe, R.L. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat Biotechnol 39, 555–560 (2021).

[2] Anderson, M.J. A new method for non-parametric multivariate analysis of variance. Austral Ecology 26, 32–46 (2001).
