Overview
This page runs two complementary analyses on the query protein to identify CAZyme domains:
- Active Site Conservation Analysis — Aligns the query against 283 pre-built CAZyme family
profile alignments to map catalytic residue positions and assess whether the active site is conserved.
- dbCAN HMM Scan — Searches the query against the dbCAN HMM database (875 CAZyme family
profile HMMs) using HMMER hmmscan to identify domain hits by probabilistic sequence-profile comparison.
Together, these provide both a statistical domain classification (HMM E-values and scores) and a
residue-level view of active site conservation.
Part 1: Active Site Conservation Analysis
Input Data
- Profile alignments — Pre-computed multiple sequence alignments (MSAs) of all known members of
each CAZyme family (GH, CBM, CE, PL, GT, AA classes). Each MSA captures the conserved structural and functional
patterns of that family.
- Conservation analysis — Per-column statistics for each family alignment: consensus residue,
conservation frequency, information content, and flags marking which positions are conserved and which
are catalytic candidates (highly conserved residues characteristic of the active site).
Step-by-Step Process
1 Profile alignment
The query protein is aligned against each family MSA using DECIPHER::AlignProfiles().
This aligns the single query against the profile (consensus pattern) of all family members simultaneously,
preserving the gap structure of the original MSA. The query is appended to the alignment as an additional row.
2 Column mapping
The merged alignment may have new columns inserted to accommodate query insertions. A column mapping is built
by walking the first profile sequence before and after merging, tracking which original profile columns correspond
to which merged columns. A second mapping converts merged columns to query sequence positions (1-based).
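The two mappings in this step can be sketched as follows. This is a minimal Python illustration, not the pipeline's own code (which, per step 1, works with DECIPHER alignments); the function names and the 0-based column indexing are assumptions made for the example. Walking a single row cannot tell an inserted gap column apart from a gap the profile row already had, so within a run of gaps this greedy walk assigns the leftmost merged columns.

```python
from typing import List, Optional


def map_profile_columns(before: str, after: str) -> List[int]:
    """For each original profile column, return its merged column (0-based).

    `before` is the first profile row prior to merging, `after` is the same
    row in the merged alignment (hypothetical inputs for this sketch).
    """
    mapping = []
    j = 0
    for ch in before:
        # Skip merged columns that do not match the next original character;
        # these must be gap columns opened to fit query insertions.
        while j < len(after) and after[j] != ch:
            if after[j] != "-":
                raise ValueError("rows disagree at merged column %d" % j)
            j += 1
        if j == len(after):
            raise ValueError("merged row is shorter than expected")
        mapping.append(j)
        j += 1
    return mapping


def map_columns_to_query(query_row: str) -> List[Optional[int]]:
    """For each merged column, return the 1-based query position, or None at a gap."""
    positions = []
    pos = 0
    for ch in query_row:
        if ch == "-":
            positions.append(None)
        else:
            pos += 1
            positions.append(pos)
    return positions


# Toy example: one column was inserted at merged index 2 for a query insertion.
col_map = map_profile_columns("ACDE", "AC-DE")
query_pos = map_columns_to_query("ACWD-")
```

Chaining the two lists (original column to merged column, merged column to query position) gives the lookup that the later steps rely on.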
3 Coverage scoring
For each active profile column, the analysis checks whether the query has a residue (not a gap) at the corresponding position.
Coverage = (query residues aligned to profile columns) / (total active profile columns).
A minimum coverage of 70% is required.
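As a rough illustration of the coverage formula, the Python sketch below counts non-gap query characters over a supplied list of active columns. How a column qualifies as "active" is not defined on this page, so the list is taken as given; the function name and the 0-based merged-column indices are assumptions.

```python
from typing import Sequence


def coverage(query_row: str, active_columns: Sequence[int]) -> float:
    """Fraction of active profile columns where the query has a residue."""
    if not active_columns:
        return 0.0
    aligned = sum(1 for col in active_columns if query_row[col] != "-")
    return aligned / len(active_columns)


# Toy merged query row with two gaps; every column is treated as active.
row = "ACD-EFGH-K"
cov = coverage(row, range(len(row)))
passes = cov >= 0.70  # the 70% minimum stated above
```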
4 Conservation scoring
At every column marked as conserved in the family profile, the query’s residue is compared to the consensus residue.
Conservation score = (matching residues) / (conserved positions checked, excluding query gaps).
A minimum conservation score of 40% is required.
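The conservation score can be sketched in Python as below. The function name, the dictionary layout (merged column index to consensus residue), and the case-insensitive comparison are assumptions for the example; the formula itself follows the definition above, with query gaps left out of the denominator.

```python
from typing import Dict


def conservation_score(query_row: str, consensus: Dict[int, str]) -> float:
    """Matching residues divided by conserved positions checked.

    Conserved columns where the query has a gap are skipped entirely, so they
    lower coverage but not the conservation score.
    """
    checked = 0
    matches = 0
    for col, expected in consensus.items():
        actual = query_row[col]
        if actual == "-":
            continue  # query gap: position is not checked
        checked += 1
        if actual.upper() == expected.upper():
            matches += 1
    return matches / checked if checked else 0.0


# Toy example: four conserved columns, one falls on a query gap.
score = conservation_score("ACD-EFGH", {0: "A", 2: "E", 3: "W", 5: "F"})
```

In the toy example three positions are checked (the gap at column 3 is skipped) and two of them match.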
5 Hit decision
A domain hit is reported when both coverage ≥ 70% and conservation ≥ 40%.
Domain boundaries are determined by the first and last query residue positions that align to the profile.
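A small Python sketch of the decision rule and the boundary calculation, under the thresholds stated above. The function names are hypothetical, and the input to `domain_bounds` is assumed to be the list of 1-based query positions found at profile columns (with `None` where the query has a gap).

```python
from typing import List, Optional, Tuple

MIN_COVERAGE = 0.70      # thresholds as stated on this page
MIN_CONSERVATION = 0.40


def is_domain_hit(coverage: float, conservation: float) -> bool:
    """A hit requires both thresholds to be met."""
    return coverage >= MIN_COVERAGE and conservation >= MIN_CONSERVATION


def domain_bounds(positions: List[Optional[int]]) -> Optional[Tuple[int, int]]:
    """First and last query positions aligned to the profile, or None if none align."""
    aligned = [p for p in positions if p is not None]
    if not aligned:
        return None
    return aligned[0], aligned[-1]


# Toy example: the query residues 12..17 align to the profile, with one gap.
hit = is_domain_hit(coverage=0.82, conservation=0.55)
bounds = domain_bounds([None, 12, 13, None, 14, 15, 16, 17, None])
```

Because both conditions must hold, a query that spans the whole profile but shares few conserved residues is not reported, and neither is a short fragment that happens to match well.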
6 Catalytic site mapping
For each hit, the positions flagged as catalytic candidates in the profile are mapped onto the query.
Using the column mappings from step 2, each catalytic profile position is traced to its corresponding query
sequence position. The expected (consensus) residue is compared to what the query actually has:
Match — The query has the expected catalytic residue at this position, indicating the
active site is conserved and likely functional.
Mismatch — The query has a different residue. This may indicate altered enzymatic
activity, a non-functional site, or natural subfamily variation.
Gap — The catalytic profile position has no corresponding residue in the query alignment,
suggesting a deletion or truncation in this region of the protein.
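The three outcomes can be sketched in Python as follows. This is an illustration only: `SiteCall`, `classify_sites`, and the input layout (a dictionary from 0-based original profile column to expected residue, plus the profile-to-merged column map from step 2) are assumptions for the example.

```python
from typing import Dict, List, NamedTuple, Optional


class SiteCall(NamedTuple):
    status: str                # "Match", "Mismatch", or "Gap"
    expected: str              # consensus residue from the profile
    actual: Optional[str]      # query residue, None for a gap
    query_pos: Optional[int]   # 1-based query position, None for a gap


def classify_sites(catalytic: Dict[int, str], col_map: List[int],
                   query_row: str) -> Dict[int, SiteCall]:
    """Trace each catalytic profile column to the query and classify it."""
    # Merged column -> 1-based query position (None where the query has a gap).
    query_pos: List[Optional[int]] = []
    n = 0
    for ch in query_row:
        if ch == "-":
            query_pos.append(None)
        else:
            n += 1
            query_pos.append(n)

    calls = {}
    for profile_col, expected in catalytic.items():
        merged_col = col_map[profile_col]
        actual = query_row[merged_col]
        if actual == "-":
            calls[profile_col] = SiteCall("Gap", expected, None, None)
        elif actual.upper() == expected.upper():
            calls[profile_col] = SiteCall("Match", expected, actual, query_pos[merged_col])
        else:
            calls[profile_col] = SiteCall("Mismatch", expected, actual, query_pos[merged_col])
    return calls


# Toy example: profile columns 0..3 map to merged columns 0, 1, 3, 4.
calls = classify_sites({0: "D", 1: "E", 2: "D"}, [0, 1, 3, 4], "AEW-K")
```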
Part 2: dbCAN HMM Scan
The query protein is searched against dbCAN-HMMdb-V14, a database of 875 profile hidden Markov models (HMMs) built from CAZyme family sequences. The search uses HMMER 3.4 hmmscan, which compares the query against every profile HMM in the database.
How HMM scanning works
- Each profile HMM is a probabilistic model trained on a multiple sequence alignment of known family members.
It captures position-specific amino acid preferences, insertion/deletion rates, and transition probabilities.
- HMMER uses the Forward algorithm to sum over all possible alignments of the query (or a subsequence) to each HMM,
producing a log-odds bit score and a statistical E-value; Forward/Backward posterior decoding then defines the domain envelopes and the per-residue alignment confidence.
- A multi-stage filter pipeline (MSV → Viterbi → Forward) rapidly eliminates non-homologous models
before running the full computation, making the search fast even against hundreds of profiles.
- For each significant hit, HMMER reports domain coordinates on the query:
- Ali (alignment) — the query region directly aligned to the HMM profile.
- Env (envelope) — a slightly wider region representing the posterior probability
envelope of where the domain likely resides, accounting for alignment uncertainty at the boundaries.
- HMM range — which positions of the profile model were matched (out of the model length).
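The coordinates described above appear as columns in HMMER's per-domain table, which hmmscan writes when run with the `--domtblout` option. The Python sketch below reads that table; it assumes the page's backend uses this output format, which is not stated here, and the `DomainHit` type, function name, and sample line are made up for illustration (the sample values are not real results).

```python
from typing import List, NamedTuple


class DomainHit(NamedTuple):
    family: str        # target name (the profile HMM)
    model_length: int  # tlen: length of the profile model
    i_evalue: float    # independent (per-domain) E-value
    score: float       # per-domain bit score
    hmm_from: int
    hmm_to: int
    ali_from: int
    ali_to: int
    env_from: int
    env_to: int
    acc: float


def parse_domtblout(text: str) -> List[DomainHit]:
    """Parse the whitespace-delimited rows of a --domtblout file."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split(None, 22)  # 22 fixed columns, then a free-text description
        hits.append(DomainHit(
            family=f[0], model_length=int(f[2]),
            i_evalue=float(f[12]), score=float(f[13]),
            hmm_from=int(f[15]), hmm_to=int(f[16]),
            ali_from=int(f[17]), ali_to=int(f[18]),
            env_from=int(f[19]), env_to=int(f[20]),
            acc=float(f[21]),
        ))
    return hits


# Illustrative line only; the numbers are invented for the example.
SAMPLE = (
    "# target name  accession  tlen  query name ...\n"
    "GH5.hmm - 280 query1 - 450 1.2e-80 270.5 0.1 1 1 3.0e-83 1.2e-80 "
    "270.3 0.1 2 279 35 320 34 322 0.97 -\n"
)
hits = parse_domtblout(SAMPLE)
```

In this sample the envelope (34 to 322) is slightly wider than the alignment (35 to 320), which is the usual relationship between the two ranges.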
Key metrics
- E-value — Expected number of hits with this score or better by chance alone. Lower is more significant;
values < 1e-5 are typically strong hits.
- Score — Log-odds bit score measuring how well the query matches the HMM versus a random model.
Higher is better.
- Acc — Mean posterior probability of the aligned residues (0–1). Values near 1.0 indicate
high confidence in the alignment coordinates.
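To make the first two metrics concrete, the Python sketch below shows the general form of a log-odds bit score and of an E-value as a P-value scaled by the number of comparisons. It is a simplified illustration, not HMMER's implementation: HMMER derives the P-value from fitted score distributions, and the assumption here is that the search size equals the number of models in the database (875 for this page). The function names are hypothetical.

```python
import math


def bit_score(p_given_model: float, p_given_null: float) -> float:
    """Log-odds score in bits: how much better the HMM explains the sequence
    than the null (random) model does."""
    return math.log2(p_given_model / p_given_null)


def e_value(p_value: float, n_models: int) -> float:
    """Expected number of chance hits at this score across all models searched."""
    return p_value * n_models


# A sequence 8 times more probable under the HMM than under the null model
# scores about 3 bits.
s = bit_score(8e-10, 1e-10)

# The same P-value becomes less significant as the database grows.
e_small = e_value(1e-8, 100)
e_dbcan = e_value(1e-8, 875)
```

This is why E-values from searches against databases of different sizes are not directly comparable, while bit scores are.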
Reading the Results
- The summary table at the top is divided into two sections: conservation analysis hits (with coverage,
conservation, and catalytic site counts) and dbCAN HMM hits (with E-values, scores, and coordinates).
- The domain bar shows where all hits map along the full protein length. Conservation hits appear as
solid bars; HMM hits appear as dashed bars. Both are staggered vertically so overlapping domains remain visible.
- Each conservation entry detail shows two aligned rows in a scrollable box:
Query (the actual protein sequence, with catalytic positions highlighted) and
Profile (the expected consensus residues at catalytic positions, dots elsewhere).
- The colored chips below each conservation entry summarize every catalytic site mapping in compact
Expected{profile_pos}>Actual{query_pos} notation.
- Each HMM entry detail shows the query sequence with the alignment region underlined in the
domain’s color (= marks on the HMM line), and the broader envelope region marked with a
dotted underline (- marks).
- GO terms and InterPro/Pfam cross-references provide functional context for conservation hits.
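As an illustration of the chip notation, the Python sketch below builds the Expected{profile_pos}>Actual{query_pos} string for a site where the query has a residue. The function name is hypothetical, and how the page renders a Gap chip is not specified here, so the sketch covers only Match and Mismatch sites.

```python
def format_chip(expected: str, profile_pos: int, actual: str, query_pos: int) -> str:
    """Compact label: expected residue and profile position, then the query's
    actual residue and its position."""
    return "{}{}>{}{}".format(expected, profile_pos, actual, query_pos)


match_chip = format_chip("E", 152, "E", 148)     # same residue on both sides
mismatch_chip = format_chip("D", 231, "N", 227)  # query carries a different residue
```

Read this way, a chip with the same letter on both sides of the ">" is a Match, and differing letters mark a Mismatch.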
Software & Data Sources