Kaiju

Taxonomic Classification

Kaiju is a protein-level metagenomic classifier that translates DNA reads into amino acid sequences and matches them against a reference protein database using the Burrows-Wheeler transform (BWT). [1]

How to Obtain Output Model File

The team ran the brief workflow below to produce the example output files presented on the tool page.

Input

FASTQ/FASTA files (single-end or paired-end reads)

Output

Tab-separated classification with taxon IDs, plus summary tables at each taxonomic rank

conda install -c bioconda kaiju

Docker image: quay.io/biocontainers/kaiju:1.10.1--h43eeafb_0

Sample 1 gut metagenome reads (200K subset) were classified with Kaiju against a curated gut microbiome database (644 representative/reference genomes from NCBI RefSeq).

Download reads from NCBI SRA

prefetch SRR14092310 && fasterq-dump SRR14092310 -O /data --split-files && gzip /data/SRR14092310_*.fastq

Subset to 200K read pairs

zcat /data/SRR14092310_1.fastq.gz | head -n 800000 > subset_R1.fastq && zcat /data/SRR14092310_2.fastq.gz | head -n 800000 > subset_R2.fastq

Each FASTQ record spans 4 lines, so 800,000 lines correspond to 200K reads per file. The 200K subset keeps the output under IntMeta's 50 MB upload limit.
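The subsetting step can also be sketched in Python for environments without zcat and head. This is only an illustration: the function name subset_fastq is hypothetical, and it assumes standard 4-line FASTQ records with no wrapped sequence lines.

```python
import gzip
from itertools import islice


def subset_fastq(src_gz, dest, n_reads=200_000):
    """Copy the first n_reads records from a gzipped FASTQ file.

    Assumes each record occupies exactly 4 lines (hypothetical helper,
    not part of Kaiju or IntMeta). Returns the number of records written.
    """
    lines_written = 0
    with gzip.open(src_gz, "rt") as src, open(dest, "w") as out:
        # 4 lines per record: header, sequence, '+', qualities
        for line in islice(src, 4 * n_reads):
            out.write(line)
            lines_written += 1
    return lines_written // 4
```

Calling it once per mate file with the same n_reads keeps read pairs in sync, as the head-based command does.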

Run Kaiju classification

kaiju -t nodes.dmp -f kaiju_db_gut.fmi -i subset_R1.fastq -j subset_R2.fastq -o kaiju_raw.tsv -z 12

Add taxon names

kaiju-addTaxonNames -t nodes.dmp -n names.dmp -i kaiju_raw.tsv -o kaiju_output.tsv -r superkingdom,phylum,class,order,family,genus,species -u

Upload kaiju_output.tsv to IntMeta
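Before uploading, the annotated file can be sanity-checked with a short script. The sketch below assumes the layout produced by the commands above: tab-separated columns for status (C or U), read name, taxon ID, and a semicolon-separated lineage in the rank order passed to -r. The function name count_deepest_taxon and the handling of "NA" placeholders for missing ranks are assumptions made for illustration.

```python
from collections import Counter


def count_deepest_taxon(lines):
    """Count classified reads by the deepest named rank in the lineage column.

    Assumed columns: status, read name, taxon ID, lineage
    (e.g. 'Bacteria; Firmicutes; ...; Enterococcus faecium; ').
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip unclassified or malformed rows defensively
        if len(fields) < 4 or fields[0] != "C":
            continue
        names = [n.strip() for n in fields[3].split(";")]
        # Drop empty trailing entries and assumed 'NA' placeholders
        names = [n for n in names if n and n != "NA"]
        if names:
            counts[names[-1]] += 1
    return counts


if __name__ == "__main__":
    with open("kaiju_output.tsv") as handle:
        for taxon, reads in count_deepest_taxon(handle).most_common(10):
            print(f"{reads}\t{taxon}")
```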

Materials Used

Sample 1	SRR14092310 — Human gut metagenome, VRE Day 0, Enterococcus-dominated (single, comparison, group-dysbiotic)
Sample 2	SRR14092160 — Human gut metagenome, pre-VRE Day -9, diverse community (comparison, group-healthy)
Sample 3	SRR17531757 — Healthy adult gut metagenome (group-healthy)
Sample 4	SRR17531762 — Healthy adult gut metagenome (group-healthy)
Sample 5	SRR17531772 — Healthy adult gut metagenome (group-healthy)
Sample 6	SRR14092162 — HCT patient gut, Enterococcus-dominated (group-dysbiotic)
Sample 7	SRR14092309 — HCT patient gut, Enterococcus-dominated (group-dysbiotic)
Sample 8	SRR14092284 — HCT patient gut, Klebsiella + E. coli (group-dysbiotic)
Sample 9	SRR14092299 — HCT patient gut, Enterococcus-dominated 99% (group-dysbiotic)
Sample 10	SRR14092304 — HCT patient gut, Klebsiella-dominated 97% (group-dysbiotic)
Sample 11	SRR17531755 — Healthy adult gut metagenome (group-healthy)
Sample 12	SRR17531758 — Healthy adult gut metagenome (group-healthy)
Database	Curated gut microbiome protein database (644 representative/reference genomes, 2.19M proteins, 1.3 GB FM-index)
Docker Image	quay.io/biocontainers/kaiju:1.10.1--h43eeafb_0

Sample Output Files

Download the output files used in the tool page demos. You can upload these directly to IntMeta to explore the visualizations.

Single & Comparison

Group Analysis

Charts Reference

Detailed descriptions for all 30 visualizations generated by Kaiju in IntMeta.

`distribution`

Bar chart of the top organisms ranked by read count from Kaiju's protein-level classification. Because Kaiju translates reads to amino acids before matching, it can detect divergent organisms that nucleotide k-mer methods miss.

`composition`

Pie/donut chart showing the top taxa as a percentage of their combined read count at the selected rank, derived from Kaiju's BWT protein-level assignments.

`richness`

Counts every distinct organism with at least one assigned read at each major rank (Domain through Species). Protein-level classification tends to detect additional divergent or novel taxa compared to DNA k-mer tools.

`diversity`

Alpha diversity indices computed from per-taxon read counts, where nᵢ is the read count of taxon i, N = Σnᵢ is the total, and pᵢ = nᵢ/N: Shannon H = −Σ(pᵢ · ln pᵢ), Simpson D = 1 − Σ(nᵢ(nᵢ−1))/(N(N−1)), and Pielou's evenness J = H / ln(richness). Together these indices summarize the richness and evenness of the community from Kaiju abundance data.
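The three formulas above translate directly into a few lines of Python. This is a minimal sketch rather than IntMeta's implementation; the function name alpha_diversity and the choice to return 0 for degenerate inputs (a single taxon or a single read) are assumptions.

```python
import math


def alpha_diversity(counts):
    """Shannon H, Simpson D and Pielou J from per-taxon read counts."""
    counts = [c for c in counts if c > 0]   # taxa with zero reads do not count
    total = sum(counts)
    richness = len(counts)
    # Shannon H = -sum(p_i * ln p_i)
    shannon = -sum((c / total) * math.log(c / total) for c in counts)
    # Simpson D = 1 - sum(n_i (n_i - 1)) / (N (N - 1)); assumed 0 when N < 2
    if total > 1:
        simpson = 1 - sum(c * (c - 1) for c in counts) / (total * (total - 1))
    else:
        simpson = 0.0
    # Pielou J = H / ln(richness); assumed 0 when only one taxon is present
    pielou = shannon / math.log(richness) if richness > 1 else 0.0
    return {"richness": richness, "shannon": shannon,
            "simpson": simpson, "pielou": pielou}
```

For two taxa with equal counts, H equals ln 2 and J equals 1, the maximum evenness.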

`multilevel-composition`

Stacked bar chart with the top taxa (by read count) at each major rank; remaining taxa grouped as 'Other'. Reveals compositional shifts from Phylum to Species level.
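The "top taxa plus Other" grouping described above could look like the following sketch. The function name top_n_with_other, the default of 10 taxa, and the alphabetical tie-break are illustrative assumptions; the cutoff IntMeta actually uses is not stated here.

```python
def top_n_with_other(counts, n=10):
    """Keep the n taxa with the most reads and pool the rest as 'Other'."""
    # Sort by descending read count, then by name for a stable order
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top = dict(ranked[:n])
    remainder = sum(reads for _, reads in ranked[n:])
    if remainder:
        top["Other"] = remainder
    return top
```

Applying it once per rank yields the segments of each stacked bar.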

`dependency-wheel`

Chord diagram linking parent to child taxa across ranks from the Kaiju lineage output. Connection thickness is proportional to shared read count; edges below the minimum coverage threshold are filtered.

`sankey-flow`

Sankey flow diagram showing how reads distribute from higher to lower taxonomic ranks through the Kaiju classification hierarchy. Band width equals read count; low-coverage edges are removed.

`comp-grouped-abundance`

Grouped bar chart placing the top taxa from each sample side by side at the selected rank. Each group contains one bar per sample, colored by sample identity, enabling direct visual comparison of absolute read counts for the same organism across Kaiju-classified samples.

`comp-relative-abundance`

100% stacked bar chart where each bar represents one sample and segments show the proportional contribution of each taxon. Useful for comparing community composition when samples have very different sequencing depths, since all bars are normalized to the same height.

`comp-abundance-heatmap`

Color-matrix heatmap with taxa on one axis and samples on the other. Cell color intensity is proportional to read abundance at the selected rank. Hierarchical clustering on both axes groups similar samples and co-occurring taxa together.

`comp-diversity-indices`

Multi-panel chart displaying Shannon entropy, Simpson diversity, Observed Richness, and Pielou's Evenness for each sample on its own y-axis scale. Enables quick cross-sample comparison of alpha diversity without scale distortion.

`comp-shared-taxa`

Venn diagram showing the number of taxa shared between samples versus taxa exclusive to each individual sample. Computed at the selected taxonomic rank from presence/absence, where a taxon counts as present in a sample if it has at least one assigned read.
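The presence/absence logic can be sketched with plain set operations. The helper below, shared_and_unique, is a hypothetical name, and it reports only two kinds of Venn region (taxa found in every sample, and taxa found in exactly one); intermediate overlaps are left out for brevity.

```python
def shared_and_unique(samples):
    """samples maps sample name -> {taxon: read count}.

    Returns (taxa present in all samples, {sample: taxa found only there}).
    """
    # A taxon is present when it has at least one assigned read
    present = {name: {t for t, reads in profile.items() if reads >= 1}
               for name, profile in samples.items()}
    shared = set.intersection(*present.values()) if present else set()
    unique = {}
    for name, taxa in present.items():
        others = set().union(*(v for k, v in present.items() if k != name))
        unique[name] = taxa - others
    return shared, unique
```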

`group-alpha-diversity`

Boxplots of Shannon entropy, Simpson diversity, Observed Richness, and Pielou's Evenness per group. Each box summarizes within-group variation. Kruskal-Wallis p-values test for significant differences between groups.

`group-pcoa`

Principal Coordinates Analysis (PCoA) on a Bray-Curtis dissimilarity matrix. Points are colored by group assignment with 95% confidence ellipses. Axis labels show the percentage of variance explained by each coordinate.

`group-nmds`

Non-metric Multidimensional Scaling (NMDS) ordination with the stress value displayed. Points are colored by group with confidence ellipses. Lower stress (<0.2) indicates a good representation of the original distances in 2D.

`group-distance-boxplots`

Boxplots of within-group vs between-group Bray-Curtis distances. PERMANOVA R² and p-value quantify the fraction of variance explained by grouping. ANOSIM R statistic measures the degree of group separation.
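As an illustration of the quantities behind these boxplots, the sketch below computes Bray-Curtis dissimilarity, BC = Σ|aᵢ − bᵢ| / Σ(aᵢ + bᵢ), and splits the pairwise distances into within-group and between-group lists. Whether IntMeta applies the formula to raw counts or to relative abundances is not stated, so the input scale is an assumption, as are the function names. PERMANOVA and ANOSIM are not implemented here.

```python
from itertools import combinations


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles."""
    taxa = set(a) | set(b)
    numerator = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    denominator = sum(a.values()) + sum(b.values())
    return numerator / denominator if denominator else 0.0


def split_distances(profiles, groups):
    """Pairwise distances separated into within-group and between-group."""
    within, between = [], []
    for s1, s2 in combinations(sorted(profiles), 2):
        d = bray_curtis(profiles[s1], profiles[s2])
        (within if groups[s1] == groups[s2] else between).append(d)
    return within, between
```

A value of 0 means identical profiles and 1 means no taxa in common.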

`group-relative-abundance`

Group-averaged relative abundance as 100% stacked bars at the selected taxonomic rank. Each bar shows the mean proportional composition across all samples in the group, enabling direct group-level comparison.

`group-differential-abundance`

Bar chart of taxa with significant abundance differences between groups (Kruskal-Wallis test, Benjamini-Hochberg FDR correction). Color indicates which group has higher abundance. Only taxa passing the significance threshold are shown.
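The Benjamini-Hochberg correction mentioned above can be sketched as follows. The function takes the raw per-taxon p-values (for example from the Kruskal-Wallis test, which is not implemented here) and returns adjusted values in the same order; the name benjamini_hochberg is illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A taxon is then reported when its adjusted p-value falls below the chosen FDR threshold.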

`group-lefse`

LEfSe (Linear Discriminant Analysis Effect Size) biomarker chart. Horizontal bars show LDA scores for taxa that significantly discriminate between groups. Higher LDA scores indicate stronger association with the respective group.

`group-shared-taxa`

Venn diagram showing shared and unique taxa across experimental groups. Each group's taxa set is the union of all taxa detected in any sample belonging to that group. Click regions to view the specific taxa list with read counts.

`group-classification`

Stacked column chart showing mean classified vs unclassified reads per experimental group. Tooltip shows percentage breakdown and sample count per group.

`group-distribution`

Grouped column chart showing mean read counts for top N taxa at the selected taxonomic level, grouped by experimental group. Enables comparison of taxonomic abundance patterns between groups.

`group-heatmap`

Heatmap of mean log₁₀-transformed read counts per experimental group (columns) and taxa (rows). Color intensity reflects abundance, allowing quick visual comparison of taxonomic profiles across groups.

`rarefaction`

Rarefaction curve plotting observed taxa vs subsampled read depth at the selected rank. Computed analytically using the hypergeometric expectation. A curve that plateaus indicates sufficient sequencing depth; a curve still rising suggests under-sampling.
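The hypergeometric expectation has a closed form: when n reads are drawn without replacement from N total, the expected number of observed taxa is E[Sₙ] = Σᵢ [1 − C(N − nᵢ, n) / C(N, n)], where nᵢ is the read count of taxon i. The sketch below evaluates it with exact binomial coefficients; the function name is hypothetical, and for very deep samples a log-gamma formulation would likely be faster than the big-integer arithmetic used here.

```python
from math import comb


def expected_taxa(counts, depth):
    """Expected number of taxa seen when subsampling `depth` reads."""
    total = sum(counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must lie between 0 and the total read count")
    denominator = comb(total, depth)
    # comb(total - c, depth) is 0 when fewer than `depth` reads belong to
    # other taxa, so the taxon is then certain to be observed
    return sum(1 - comb(total - c, depth) / denominator
               for c in counts if c > 0)
```

Evaluating it over a grid of depths from 0 to N gives the points of the curve.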

`rank-abundance`

Rank-abundance curve (Whittaker plot) with taxa ranked by decreasing relative abundance on a logarithmic y-axis. The curve's length along the x-axis reflects richness, while the slope indicates evenness — a steep drop means a few taxa dominate.

`comp-rarefaction`

Overlaid rarefaction curves for all uploaded samples. Enables visual comparison of sequencing effort and taxonomic saturation across samples.

`comp-rank-abundance`

Overlaid rank-abundance curves for all uploaded samples. Steeper curves indicate more uneven communities dominated by fewer taxa.

`group-taxonomic-sunburst`

Interactive sunburst displaying the full taxonomic hierarchy as concentric rings. LEfSe biomarker nodes are colored by their enriched group; non-significant nodes are gray. Click any ring segment to drill down.

`group-volcano`

Volcano plot: log₂ fold-change (x) vs −log₁₀ adjusted p-value (y). Points above the horizontal line pass significance (BH FDR). Colored points are significantly enriched in the respective group.
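The two axes of the volcano plot can be computed per taxon as in this sketch. The pseudocount added to both group means (to avoid division by zero) and the floor applied to the adjusted p-value are assumptions made for illustration, as is the function name; the values IntMeta uses are not documented here.

```python
import math


def volcano_point(mean_a, mean_b, adj_p, pseudocount=1.0):
    """Return (log2 fold-change of A over B, -log10 adjusted p-value)."""
    # Pseudocount (assumed) keeps the ratio finite when a mean is zero
    log2_fc = math.log2((mean_a + pseudocount) / (mean_b + pseudocount))
    # Floor (assumed) avoids log10(0) for extremely small p-values
    neg_log10_p = -math.log10(max(adj_p, 1e-300))
    return log2_fc, neg_log10_p
```

Positive x values indicate higher abundance in group A, negative values higher abundance in group B.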

`comp-classification`

Stacked bar chart showing total classified vs unclassified reads per sample. Enables quick comparison of classification rates across samples — large unclassified fractions may indicate novel organisms or database limitations.

References

[1] Menzel, P., Ng, K.L. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun 7, 11257 (2016).

[2] Anderson, M.J. A new method for non-parametric multivariate analysis of variance. Austral Ecology 26, 32–46 (2001).

[3] Segata, N., Izard, J., Waldron, L. et al. Metagenomic biomarker discovery and explanation. Genome Biol 12, R60 (2011).
