CLARK
Taxonomic Classification
CLARK (CLAssifier based on Reduced K-mers) is a supervised sequence classification method that uses discriminative k-mers (k-mers unique to a specific taxon) for fast and precise metagenomic classification. [1]
How to Obtain Output Model File
Below is the brief workflow the team ran to produce the example output files shown on the tools page.
Input
FASTQ/FASTA files (single-end or paired-end reads)
Output
CSV classification results with confidence scores, plus an estimated abundance table
conda install -c bioconda clark
Sample: 1 human gut metagenome (SRR14092160, 5M read subset) classified with CLARK-l against a custom database of 31 gut-associated bacterial genomes.
- 1
Download reads from NCBI SRA
prefetch SRR14092160 && fasterq-dump SRR14092160 -O /data --split-files && gzip /data/SRR14092160_*.fastq
Human gut metagenome, pre-VRE colonization timepoint (Day -9). Illumina paired-end, ~5.5M read pairs (partial extraction).
- 2
Subset to 5M reads and decompress
zcat SRR14092160_1.fastq.gz | head -20000000 > subset_1.fastq && zcat SRR14092160_2.fastq.gz | head -20000000 > subset_2.fastq
CLARK does not read gzipped FASTQ, so the subset is written uncompressed. Each FASTQ record spans 4 lines, so 5M reads = 20M lines.
- 3
Build custom 31-genome database
Download 31 RefSeq genomes via NCBI Datasets, prepare targets.txt and NCBI taxonomy files (nodes.dmp, names.dmp).
- 4
Run CLARK-l classification
CLARK-l -T /db/targets.txt -D /db/custom_0/ -P subset_1.fastq subset_2.fastq -R clark_result -n 8 -m 0
CLARK-l is the lightweight variant (~4 GB RAM). It classified 0.20% of reads (10,049 / 5,000,000) and detected 26 species.
- 5
Estimate abundance
estimate_abundance.sh -F clark_result.csv -D /db --highconfidence
Produces the CSV abundance file that IntMeta reads. The --highconfidence flag filters low-confidence assignments.
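The read-count arithmetic in steps 2 and 4 can be checked with a short Python sketch. It is not part of the workflow the team ran; the function names `subset_fastq` and `classified_percent` are hypothetical, and the subsetting assumes standard 4-line FASTQ records (no wrapped sequence lines).

```python
import gzip
from itertools import islice

LINES_PER_RECORD = 4  # header, sequence, '+' separator, quality string


def subset_fastq(src_gz, dst, n_reads):
    """Write the first n_reads records of a gzipped FASTQ to a plain-text file.

    Assumes 4-line records. Returns the number of complete records written.
    """
    n_lines = 0
    with gzip.open(src_gz, "rt") as fin, open(dst, "w") as fout:
        for line in islice(fin, n_reads * LINES_PER_RECORD):
            fout.write(line)
            n_lines += 1
    return n_lines // LINES_PER_RECORD


def classified_percent(n_classified, n_total):
    """Percentage of reads that received a taxonomic assignment."""
    return 100.0 * n_classified / n_total


# 5M reads correspond to the 20M lines passed to `head` in step 2.
print(5_000_000 * LINES_PER_RECORD)                     # 20000000
# Step 4 reports 10,049 classified reads out of 5,000,000.
print(round(classified_percent(10_049, 5_000_000), 2))  # 0.2
```

The subsetting function does the same job as the `zcat | head` pipeline: it decompresses and truncates in one pass, so the output can be passed to CLARK directly.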
clark_abundance.clark to IntMeta
Materials Used
Sample Output Files
Download the output files used in the tool page demos. You can upload these directly to IntMeta to explore the visualizations.
Single & Comparison
Group Analysis
Charts Reference
Detailed descriptions for all 30 visualizations that IntMeta generates from CLARK output.

distribution
Bar chart of the top organisms ranked by read count from CLARK's discriminative k-mer classification. Only reads matching k-mers unique to a single taxon are counted, producing high-confidence assignments.

composition
Pie/donut chart showing the top classified taxa as a percentage of their combined read count. Each slice is (taxon reads / sum of displayed reads) × 100 at the selected rank.
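As a rough illustration of the slice formula, the sketch below computes each displayed taxon's share of the displayed total. The function name `composition_percent`, the `top_n` default, and the counts are made up for the example; IntMeta's own implementation is not shown in this page.

```python
def composition_percent(read_counts, top_n=10):
    """Percentage share of each of the top_n taxa, relative to the displayed taxa only."""
    top = sorted(read_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    displayed_total = sum(count for _, count in top)
    if displayed_total == 0:
        return {}
    return {taxon: 100.0 * count / displayed_total for taxon, count in top}


# Hypothetical read counts at one rank.
counts = {"taxon_A": 600, "taxon_B": 300, "taxon_C": 100, "taxon_D": 50}
print(composition_percent(counts, top_n=3))
# {'taxon_A': 60.0, 'taxon_B': 30.0, 'taxon_C': 10.0}
```

Because the denominator is the sum of displayed reads, taxon_D does not dilute the shares: the three slices always add up to 100%.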

richness
Counts every distinct organism with at least one assigned read at each major rank (Domain through Species). CLARK's discriminative k-mer approach typically reports fewer taxa than Kraken2, but each assignment carries higher specificity.

diversity
Alpha diversity indices computed from per-taxon read-count proportions: Shannon H = −Σ(pᵢ · ln pᵢ), Simpson D = 1 − Σ(nᵢ(nᵢ−1))/(N(N−1)), and Pielou's evenness J = H / ln(richness). Useful for comparing community evenness across samples.
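The three formulas above translate directly into a few lines of Python. This is a minimal sketch from the stated definitions, not IntMeta's code; the handling of edge cases (fewer than two reads, a single taxon) is an assumption.

```python
import math


def shannon(counts):
    """Shannon H = -sum(p_i * ln p_i) over taxa with at least one read."""
    n_total = sum(counts)
    return -sum((n / n_total) * math.log(n / n_total) for n in counts if n > 0)


def simpson(counts):
    """Simpson D = 1 - sum(n_i * (n_i - 1)) / (N * (N - 1))."""
    n_total = sum(counts)
    if n_total < 2:
        return 0.0  # assumption: treat the undefined case as zero diversity
    return 1.0 - sum(n * (n - 1) for n in counts) / (n_total * (n_total - 1))


def pielou(counts):
    """Pielou's J = H / ln(richness)."""
    richness = sum(1 for n in counts if n > 0)
    if richness < 2:
        return 0.0  # assumption: ln(1) = 0 would otherwise divide by zero
    return shannon(counts) / math.log(richness)


even = [10, 10]  # two taxa with equal read counts
print(round(shannon(even), 4))  # 0.6931
print(round(simpson(even), 4))  # 0.5263
print(round(pielou(even), 4))   # 1.0
```

A perfectly even community gives J = 1; as one taxon comes to dominate, H and J both fall toward 0.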

multilevel-composition
Stacked bar chart with the top taxa (by read count) at each major rank; remaining taxa grouped as 'Other'. Shows how the dominant organisms shift across classification depths.
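The 'Other' grouping can be sketched as follows. The function name `top_with_other` and the `top_n` default are hypothetical; the page does not state how many taxa IntMeta keeps per rank.

```python
def top_with_other(read_counts, top_n=10, other_label="Other"):
    """Keep the top_n taxa by read count and sum the rest into one bucket."""
    ranked = sorted(read_counts.items(), key=lambda kv: kv[1], reverse=True)
    result = dict(ranked[:top_n])
    remainder = sum(count for _, count in ranked[top_n:])
    if remainder > 0:
        result[other_label] = remainder
    return result


# Hypothetical counts at one rank.
print(top_with_other({"a": 5, "b": 3, "c": 2, "d": 1}, top_n=2))
# {'a': 5, 'b': 3, 'Other': 3}
```

Running this once per rank yields one stacked bar per rank, each bar summing to the same total read count.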

dependency-wheel
Chord diagram linking parent to child taxa across ranks, with connection thickness proportional to shared read count. Edges below the minimum coverage threshold (default 5%) are filtered out.

sankey-flow
Sankey flow diagram tracing how reads distribute from the start rank to the end rank through the lineage hierarchy. Band width is proportional to the read count between connected taxa; low-coverage edges are removed.

comp-classification
Stacked bar chart showing total classified vs unclassified reads per sample. Enables quick comparison of classification rates across samples — large unclassified fractions may indicate novel organisms or database limitations.

comp-grouped-abundance
Grouped bar chart placing the top taxa from each sample side by side at the selected rank. Each group contains one bar per sample, colored by sample identity, enabling direct visual comparison of absolute read counts for the same organism across samples.

comp-relative-abundance
100% stacked bar chart where each bar represents one sample and segments show the proportional contribution of each taxon. Useful for comparing community composition when samples have very different sequencing depths, since all bars are normalized to the same height.

comp-abundance-heatmap
Color-matrix heatmap with taxa on one axis and samples on the other. Cell color intensity is proportional to read abundance at the selected rank. Hierarchical clustering on both axes groups similar samples and co-occurring taxa together.

comp-diversity-indices
Multi-panel chart displaying Shannon entropy, Simpson diversity, Observed Richness, and Pielou's Evenness for each sample on its own y-axis scale. Enables quick cross-sample comparison of alpha diversity without scale distortion.

group-alpha-diversity
Boxplots of Shannon entropy, Simpson diversity, Observed Richness, and Pielou's Evenness per group. Each box summarizes within-group variation. Kruskal-Wallis p-values test for significant differences between groups.
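For readers unfamiliar with the test, here is a sketch of the Kruskal-Wallis H statistic computed on per-group index values. It uses average ranks for ties but omits the tie-correction factor, and it stops at H: the p-value comes from a chi-square distribution with (number of groups − 1) degrees of freedom, which a statistics library would supply. The function name is hypothetical and every group must be non-empty.

```python
def kruskal_wallis_h(groups):
    """H statistic for a list of groups (each a list of numbers); no tie correction."""
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)
    # Assign each distinct value the average of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n_total * (n_total + 1)) * rank_term - 3.0 * (n_total + 1)


# Hypothetical Shannon values for two groups of three samples each.
print(round(kruskal_wallis_h([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]), 4))  # 3.8571
```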

group-pcoa
Principal Coordinates Analysis (PCoA) on a Bray-Curtis dissimilarity matrix. Points are colored by group assignment with 95% confidence ellipses. Axis labels show the percentage of variance explained by each coordinate.
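The Bray-Curtis dissimilarity underlying this ordination can be sketched for a single pair of samples as below. Whether IntMeta computes it on raw read counts or on relative abundances is not stated here, so the input scale is an assumption; the function name is hypothetical.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i) over all taxa."""
    taxa = set(sample_a) | set(sample_b)
    diff = sum(abs(sample_a.get(t, 0) - sample_b.get(t, 0)) for t in taxa)
    total = sum(sample_a.get(t, 0) + sample_b.get(t, 0) for t in taxa)
    return diff / total if total else 0.0


# Hypothetical per-taxon abundances for two samples.
print(bray_curtis({"a": 6, "b": 4}, {"a": 4, "b": 6}))  # 0.2
print(bray_curtis({"a": 10}, {"b": 10}))                # 1.0
```

The value ranges from 0 (identical profiles) to 1 (no shared taxa). Applying the function to every pair of samples fills the matrix that PCoA decomposes.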

group-nmds
Non-metric Multidimensional Scaling (NMDS) ordination with the stress value displayed. Points are colored by group with confidence ellipses. Stress below 0.2 indicates a good representation of the original distances in 2D.

group-distance-boxplots
Boxplots of within-group vs between-group Bray-Curtis distances. PERMANOVA R² and p-value quantify the fraction of variance explained by grouping. ANOSIM R statistic measures the degree of group separation.
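The two boxplot populations come from splitting the pairwise distances by whether both samples share a group. A minimal sketch, assuming a square symmetric distance matrix and a parallel list of group labels (both hypothetical inputs); PERMANOVA and ANOSIM themselves are not reproduced here.

```python
from itertools import combinations


def split_distances(matrix, labels):
    """Split each unordered sample pair's distance into within- or between-group."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(matrix[i][j])
    return within, between


# Hypothetical distances for three samples.
matrix = [
    [0.0, 0.1, 0.8],
    [0.1, 0.0, 0.7],
    [0.8, 0.7, 0.0],
]
print(split_distances(matrix, ["ctrl", "ctrl", "case"]))  # ([0.1], [0.8, 0.7])
```

When between-group distances sit clearly above within-group distances, the grouping explains a large share of the variation.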

group-relative-abundance
Group-averaged relative abundance as 100% stacked bars at the selected taxonomic rank. Each bar shows the mean proportional composition across all samples in the group, enabling direct group-level comparison.

group-differential-abundance
Bar chart of taxa with significant abundance differences between groups (Kruskal-Wallis test, Benjamini-Hochberg FDR correction). Color indicates which group has higher abundance. Only taxa passing the significance threshold are shown.
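The Benjamini-Hochberg step can be sketched as below: each p-value is scaled by m / rank and the results are made monotone from the largest p-value downward. This follows the standard procedure; the function name is hypothetical and the significance cutoff IntMeta applies afterwards is not stated on this page.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase
    # as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-taxon p-values from the Kruskal-Wallis test.
print([round(q, 4) for q in benjamini_hochberg([0.01, 0.04, 0.03, 0.005])])
# [0.02, 0.04, 0.04, 0.02]
```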

group-lefse
LEfSe (Linear Discriminant Analysis Effect Size) biomarker chart. Horizontal bars show LDA scores for taxa that significantly discriminate between groups. Higher LDA scores indicate stronger association with the respective group.

group-classification
Stacked column chart showing mean classified vs unclassified reads per experimental group. Tooltip shows percentage breakdown and sample count per group.

group-distribution
Grouped column chart showing mean read counts for top N taxa at the selected taxonomic level, grouped by experimental group. Enables comparison of taxonomic abundance patterns between groups.

group-heatmap
Heatmap of mean log₁₀-transformed read counts per experimental group (columns) and taxa (rows). Color intensity reflects abundance, allowing quick visual comparison of taxonomic profiles across groups.

rarefaction
Rarefaction curve plotting observed taxa vs subsampled read depth at the selected rank. Computed analytically using the hypergeometric expectation. A curve that plateaus indicates sufficient sequencing depth; a curve still rising suggests under-sampling.
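The hypergeometric expectation is E[S_n] = Σᵢ [1 − C(N − Nᵢ, n) / C(N, n)], where Nᵢ is the read count of taxon i, N the total, and n the subsample depth. The sketch below evaluates it with log-gamma terms so it stays fast at millions of reads; the function name and the choice of `math.lgamma` are assumptions for illustration, not IntMeta's implementation.

```python
import math


def expected_richness(counts, depth):
    """Expected number of taxa observed when drawing `depth` reads without replacement."""
    n_total = sum(counts)
    if not 0 <= depth <= n_total:
        raise ValueError("depth must be between 0 and the total read count")
    expected = 0.0
    for n_i in counts:
        if n_i == 0:
            continue
        absent = n_total - n_i  # reads that do not belong to this taxon
        if absent < depth:
            p_missing = 0.0  # the subsample must contain this taxon
        else:
            # log of C(absent, depth) / C(n_total, depth)
            log_p = (math.lgamma(absent + 1) + math.lgamma(n_total - depth + 1)
                     - math.lgamma(absent - depth + 1) - math.lgamma(n_total + 1))
            p_missing = min(1.0, math.exp(log_p))
        expected += 1.0 - p_missing
    return expected


# Two equally abundant taxa, 10 reads in total: full depth recovers both.
print(expected_richness([5, 5], 10))  # 2.0
```

Evaluating the function over a grid of depths from 0 to N gives the curve's points without any random subsampling.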

rank-abundance
Rank-abundance curve (Whittaker plot) with taxa ranked by decreasing relative abundance on a logarithmic y-axis. The curve's length along the x-axis reflects richness, while the slope indicates evenness — a steep drop means a few taxa dominate.

comp-rarefaction
Overlaid rarefaction curves for all uploaded samples. Enables visual comparison of sequencing effort and taxonomic saturation across samples.

comp-rank-abundance
Overlaid rank-abundance curves for all uploaded samples. Steeper curves indicate more uneven communities dominated by fewer taxa.

group-taxonomic-sunburst
Interactive sunburst displaying the full taxonomic hierarchy as concentric rings. LEfSe biomarker nodes are colored by their enriched group; non-significant nodes are gray. Click any ring segment to drill down.

group-volcano
Volcano plot: log₂ fold-change (x) vs −log₁₀ adjusted p-value (y). Points above the horizontal line pass significance (BH FDR). Colored points are significantly enriched in the respective group.
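The two plot coordinates can be sketched for a single taxon as follows. The fold-change direction (group A over group B), the pseudocount that guards against zero means, and the floor on the p-value are all assumptions; the page does not say how IntMeta handles these cases.

```python
import math


def volcano_point(mean_a, mean_b, p_adj, pseudocount=1.0):
    """Return (log2 fold-change of A over B, -log10 adjusted p-value)."""
    log2_fc = math.log2((mean_a + pseudocount) / (mean_b + pseudocount))
    # Floor the p-value so an exact zero does not raise a math domain error.
    neg_log10_p = -math.log10(max(p_adj, 1e-300))
    return log2_fc, neg_log10_p


# Hypothetical group means and BH-adjusted p-value for one taxon.
x, y = volcano_point(7.0, 1.0, 0.01)
print(round(x, 4), round(y, 4))  # 2.0 2.0
```

Points far to the left or right have large abundance differences, and points high on the y-axis have small adjusted p-values; taxa of interest sit in the upper corners.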

