CasMiner: Accelerating Cas9 Discovery and Engineering Through Deep Learning

Why We Need a New Approach to Cas9 Mining

The CRISPR-Cas9 system has become a cornerstone of modern molecular biology, powering applications from gene knockout to base editing across diverse organisms. Yet the repertoire of Cas9 proteins available for practical use remains remarkably small. Most Cas9 variants in wide use today — SpCas9, SaCas9, Nme2Cas9, and a handful of others — were discovered through homology-based or similarity-driven mining strategies that rely on multiple rounds of BLAST searches and subjective parameter tuning.

This dependence on sequence alignment introduces several bottlenecks:

Query sequence diversity skews the search space, causing distantly related Cas9 proteins to be missed
Subjective decisions about sequence quality thresholds and representative cluster selection bias the output
Proteins that share functional similarity but diverge in primary sequence may be overlooked entirely

Here lies a critical insight: Cas9 proteins can differ dramatically in amino acid sequence while performing essentially the same DNA recognition and cleavage functions. This suggests that important determinants of Cas9 identity exist beyond the handful of conserved domains that homology searches rely on. Deep learning, with its capacity to learn latent features from sequence data without explicit annotation, offers a natural solution to this problem. While neural networks have already been applied to predict gRNA efficiency and off-target rates, their potential for de novo Cas9 protein mining had remained largely unexplored — until now.

Building CasMiner: A CNN-LSTM Framework for Cas9 Identification

Training Data Strategy

The developers of CasMiner constructed their training datasets from UniRef90, extracting 1,946 protein sequences annotated as Cas9 nucleases (801–1,820 amino acids) as the positive set. For the negative set, they employed a shuffling strategy: each positive sequence was randomly permuted at varying thresholds (10% to 100%, in 10% increments), preserving amino acid composition while destroying the hidden sequence features that define Cas9 function. This produced 10 distinct negative datasets, each paired with the same positive set to train a separate model.
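The shuffling strategy can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the article does not say how positions are chosen at each threshold, so the sketch assumes that a random subset covering the given fraction of positions is permuted among itself. The function names `partial_shuffle` and `build_negative_sets` are hypothetical.

```python
import random
from collections import Counter


def partial_shuffle(seq: str, fraction: float, rng: random.Random) -> str:
    """Permute a randomly chosen `fraction` of the positions in `seq`.

    Assumption: residues at the selected positions are shuffled among
    themselves and all other positions are left untouched, so the overall
    amino acid composition never changes.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n_selected = round(len(seq) * fraction)
    positions = rng.sample(range(len(seq)), n_selected)
    residues = [seq[i] for i in positions]
    rng.shuffle(residues)
    shuffled = list(seq)
    for pos, residue in zip(positions, residues):
        shuffled[pos] = residue
    return "".join(shuffled)


def build_negative_sets(positives, seed=0):
    """Return {percent: [shuffled sequences]} for 10%, 20%, ..., 100%."""
    rng = random.Random(seed)
    return {
        pct: [partial_shuffle(seq, pct / 100, rng) for seq in positives]
        for pct in range(10, 101, 10)
    }
```

Under this reading, the set keyed by 80 would play the role of Dataset80, and every negative sequence has exactly the same residue counts as the positive it was derived from.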

Model Architecture and Performance

Each of the 10 models uses a CNN-LSTM architecture. The convolutional layers capture local sequence motifs, while the LSTM layers model longer-range dependencies. Among the 10 models, the one trained on Dataset80 — where negative sequences were 80% shuffled — delivered the best performance:

Training accuracy: 99.63 ± 0.25%
Test set accuracy, precision, recall, F1 score, and AUC all approached 100%
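For readers less familiar with these metrics, the sketch below shows how accuracy, precision, recall, and F1 follow from the four confusion-matrix counts of a binary Cas9 versus non-Cas9 classifier. The function name is hypothetical and any counts passed in are illustrative, not values from the paper. AUC is left out because it needs the full list of prediction scores rather than counts.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary classification metrics from confusion counts.

    tp: Cas9 sequences predicted as Cas9
    fp: non-Cas9 sequences predicted as Cas9
    tn: non-Cas9 sequences predicted as non-Cas9
    fn: Cas9 sequences predicted as non-Cas9
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```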

Generalization Across Protein Families

A classifier is only useful if it can distinguish Cas9 from proteins that share functional similarities. CasMiner was therefore tested on an independent set of Cas9 sequences and on several challenging negative sets:

UniRef100 Cas9 dataset (1,097 sequences): 99.77 ± 1.13% accuracy
Helicase dataset: 3.40 ± 2.92% misclassification rate
Nuclease dataset: 3.41 ± 3.00% misclassification rate
Cas12/Cas13 dataset: 2.46 ± 1.52% misclassification rate
Glycoside hydrolase dataset: 4.81 ± 4.78% misclassification rate

These results indicate that CasMiner does not simply learn surface-level sequence statistics: it captures features specific to Cas9 that are not shared by the other protein families tested, including non-Cas9 nucleases.

The Cas9 Fingerprint

Using Grad-CAM to extract features from the convolutional layers, the researchers identified a distinctive activation pattern in Cas9 sequences that they termed the Cas9 fingerprint. Non-Cas9 sequences showed no such pattern. Each fingerprint contains two prominent peaks corresponding to active-site residues — for example, D10 and H840 in SpCas9 — and most also exhibit peaks associated with DNA-binding residues such as H983 in SpCas9 and H701 in SaCas9. This internal representation demonstrates that CasMiner learns biologically meaningful features rather than statistical artifacts.
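Reading peaks off such a fingerprint can be illustrated with a simple local-maximum search. The sketch below assumes the Grad-CAM output has already been mapped to one importance value per residue, which is an assumption because convolutional layers may change the resolution. The function name, height threshold, and peak count are hypothetical choices, not parameters from the paper.

```python
def top_peaks(profile, n_peaks=2, min_height=0.5):
    """Return 1-based residue positions of the highest local maxima.

    A position counts as a peak when its value is at least `min_height`,
    strictly greater than its left neighbour, and not smaller than its
    right neighbour (so a flat plateau yields a single peak at its left
    edge). The first and last positions are never reported.
    """
    peaks = []
    for i in range(1, len(profile) - 1):
        value = profile[i]
        if value >= min_height and value > profile[i - 1] and value >= profile[i + 1]:
            peaks.append((value, i + 1))  # store 1-based residue number
    peaks.sort(reverse=True)  # highest peaks first
    return sorted(position for _, position in peaks[:n_peaks])
```

With a real SpCas9 profile, one would check whether the returned positions fall at or near residues such as D10 and H840.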

Discovering VpCas9: A New Cas9 From Vagococcus penaei

Mining Pipeline

Armed with CasMiner, the team turned to the proGenomes non-redundant genome collection. They pre-filtered proteins to the 800–1,600 residue range, subdivided them into eight subsets by 100-amino-acid increments, and used Pfam domain annotation to prioritize the subset most enriched for Cas9-associated domains. The 1,301–1,400 AA subset contained the most RuvC_III, Cas9_BH, Cas9_REC, HNH_4, and Cas9_PI domain annotations.
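The length pre-filter and binning step might look like the sketch below. It assumes the eight bins are 801–900 through 1,501–1,600, as implied by the "1,301–1,400" label, so a protein of exactly 800 residues falls outside every bin. It also assumes that "most enriched" means the highest total count of Cas9-associated Pfam annotations. Function names and the input layout are hypothetical.

```python
from collections import defaultdict

# Pfam domain names listed in the article as Cas9-associated.
CAS9_DOMAINS = {"RuvC_III", "Cas9_BH", "Cas9_REC", "HNH_4", "Cas9_PI"}


def length_bin(length, low=800, high=1600, width=100):
    """Return the (start, end) bin holding `length`, or None if out of range.

    Bins are low+1..low+width, low+width+1..low+2*width, and so on.
    """
    if length <= low or length > high:
        return None
    start = low + ((length - low - 1) // width) * width + 1
    return (start, start + width - 1)


def most_enriched_bin(proteins):
    """Pick the length bin with the most Cas9-associated domain annotations.

    `proteins` is an iterable of (length, pfam_domain_names) pairs.
    """
    counts = defaultdict(int)
    for length, domains in proteins:
        bin_range = length_bin(length)
        if bin_range is not None:
            counts[bin_range] += len(CAS9_DOMAINS & set(domains))
    return max(counts, key=counts.get) if counts else None
```

For example, a 1,332-residue protein such as VpCas9 lands in the (1301, 1400) bin.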

CasMiner scanned this subset and identified 333 sequences with prediction scores above 50%, of which 262 scored at or above 90%. After downloading the corresponding 214 genomes and recovering 218 candidate sequences, the researchers used the CRISPR Recognition Tool to associate candidates with CRISPR repeat arrays. The top-scoring candidate lacked associated repeats, so the second-highest candidate — linked to 56 repeats in the genome of Vagococcus penaei CD276T — was selected and named VpCas9.
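The selection logic described above, taking the best-scoring candidate that also sits next to a CRISPR array, can be written as a short filter. This is a sketch under assumptions: the `Candidate` record, its field names, and the 0.9 default cutoff are illustrative, and repeat counts would come from a separate tool run.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    score: float    # prediction score between 0 and 1
    n_repeats: int  # CRISPR repeats associated with the gene, 0 if none


def pick_candidate(candidates, min_score=0.9):
    """Return the highest-scoring candidate with at least one CRISPR repeat.

    Candidates below `min_score` are ignored. Returns None when no
    candidate passes both filters.
    """
    passing = [c for c in candidates if c.score >= min_score]
    passing.sort(key=lambda c: c.score, reverse=True)
    for candidate in passing:
        if candidate.n_repeats > 0:
            return candidate
    return None
```

If the top-ranked entry has no associated repeats, the function falls through to the next one, which mirrors how the second-highest candidate became VpCas9.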

VpCas9 Characteristics

1,332 amino acids with 48.8% sequence identity to SpCas9
The genomic locus encodes Cas1, Cas2, and Csn2 alongside a CRISPR repeat array
The sgRNA shares 86.0% identity with SpCas9 sgRNA, with differences concentrated in the 3′ tracrRNA region
Secondary structure divergences between Vp-sgRNA and Sp-sgRNA are mainly in the Linker, Stem-loop 3, and Stem-loop 4 regions

VpCas9 Activity and PAM Specificity

PAM Preference and Cleavage

VpCas9 recognizes the PAM NGGV (N: A/T/C/G; V: G/T). The first three positions, NGG, match SpCas9, while the fourth position prefers G and tolerates T. Cleavage occurs 3–4 nucleotides upstream of the PAM, consistent with SpCas9.
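A quick way to see what this PAM rule means in practice is to scan a DNA string for candidate sites. The sketch below takes the fourth PAM position as G or T exactly as stated above; note that this differs from the standard IUPAC meaning of V (A, C, or G), so the pattern should be adjusted if the original paper defines it otherwise. The 20-nt protospacer length and the choice of reporting the cut 3 nt upstream of the PAM are assumptions, and only the given strand is searched.

```python
import re

# N-G-G followed by G or T, as the PAM is described in this article.
# The lookahead lets overlapping PAM candidates be reported.
PAM_PATTERN = re.compile(r"(?=([ACGT]GG[GT]))")


def find_target_sites(dna, protospacer_len=20):
    """List candidate target sites on the given strand.

    Each hit is (protospacer_start, pam_start, pam, cut_index) using
    0-based indices. `cut_index` marks a cut between dna[cut_index - 1]
    and dna[cut_index], i.e. 3 nt upstream of the PAM (assumption; the
    article reports 3 to 4 nt).
    """
    dna = dna.upper()
    sites = []
    for match in PAM_PATTERN.finditer(dna):
        pam_start = match.start(1)
        if pam_start >= protospacer_len:  # need a full protospacer upstream
            sites.append(
                (pam_start - protospacer_len, pam_start, match.group(1), pam_start - 3)
            )
    return sites
```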

In vitro cleavage assays confirmed that VpCas9 robustly cuts substrate DNA. Overall cleavage rates were 75.52% on target strands and 76.91% on non-target strands.

In Vivo Validation in E. coli

An E. coli fluorescence reporter system provided further evidence:

Both CRISPR-VpCas9 and CRISPR-SpCas9 reduced growth rate and fluorescence intensity
Vp-sgRNA mediated higher cleavage activity than Sp-sgRNA
SpCas9 showed no preference between the two sgRNAs
VpCas9 and SpCas9 could interchangeably use each other's sgRNA

Structural Insights From AlphaFold3

AlphaFold3 predicted structures revealed an RMSD of only 1.508 Å between VpCas9 and SpCas9. A notable structural difference is that Vp-sgRNA's Stem-loop 3 forms a taller, narrower structure. VpCas9 accommodates this through a groove formed by residues E726 and R729, while SpCas9 uses a flatter surface (D718 and H721) that can accept both sgRNA scaffolds.

Engineering VpCas9: Rational Design of High-Efficiency Mutants

From Evolution to Engineering

The same deep-learning features that enable Cas9 identification can also guide protein engineering. The researchers retrieved 851 VpCas9 homologs from UniRef90, extracted their feature matrices, and computed a core-site matrix and a position-specific amino acid probability (PSAP) conservation matrix. By calculating the difference (Diff) between mutant and wild-type residues across both matrices, they ranked candidate mutations and selected 12 target sites with total ranking scores below 30.
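One plausible reading of this ranking step is a rank-sum: rank every candidate mutation by its Diff in each matrix, add the two ranks, and keep those with a total below the cutoff. The article does not spell out how the total ranking score is computed, so the sketch below is an assumption throughout, including the convention that a larger Diff earns a better (smaller) rank. All names are hypothetical.

```python
def rank_by_diff(diffs):
    """Map each mutation to a 1-based rank, largest Diff first.

    Ties keep the insertion order of `diffs` (sorted() is stable).
    """
    ordered = sorted(diffs, key=diffs.get, reverse=True)
    return {mutation: rank for rank, mutation in enumerate(ordered, start=1)}


def select_mutations(core_diff, psap_diff, max_total=30):
    """Return [(mutation, total_rank)] with total_rank strictly below max_total.

    Only mutations scored in both matrices are considered. The total is
    assumed to be the sum of the two per-matrix ranks.
    """
    shared = [m for m in core_diff if m in psap_diff]
    core_rank = rank_by_diff({m: core_diff[m] for m in shared})
    psap_rank = rank_by_diff({m: psap_diff[m] for m in shared})
    totals = {m: core_rank[m] + psap_rank[m] for m in shared}
    selected = [(m, t) for m, t in totals.items() if t < max_total]
    return sorted(selected, key=lambda item: (item[1], item[0]))
```

Each input is a dictionary from a mutation label to its Diff value in the core-site or PSAP matrix. The output is ordered from best to worst total rank.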

Single and Double Mutants

A qPCR-based method was developed to quantify editing efficiency. Among the 12 single mutants, 9 outperformed wild-type VpCas9, with V623I showing the highest efficiency. Building on this result, the team generated three double mutants on the V623I background:

VPM2-1 (V623I-I439L)
VPM2-2 (V623I-E372K)
VPM2-3 (V623I-I526V)

All three double mutants achieved significantly higher editing efficiency than wild-type VpCas9.

Molecular Mechanisms Behind Enhanced Editing

Increased Structural Rigidity

Molecular dynamics simulations over 500 ns revealed that all three mutants exhibit reduced fluctuations in the HNH domain, indicating increased structural rigidity. VPM2-1, VPM2-2, and VPM2-3 also showed dampened fluctuations in the PI and REC2 domains. Notably, all mutation sites reside in the REC1_2 domain — distant from the HNH and PI domains — suggesting that these substitutions modulate protein dynamics through long-range allosteric effects.

Enhanced Electrostatic Complementarity

Surface electrostatic potential analysis showed that all mutants carry more positive charge than wild-type VpCas9:

VPM2-1: +12.25%
VPM2-2: +9.79%
VPM2-3: +2.54%

The added positive charge concentrates at the entrance of the HNH-REC1 channel or along the inner wall of the PI-REC1 cavity, enhancing electrostatic attraction to the negatively charged DNA target strand.

Conformational Landscape

Principal component analysis and free energy landscape calculations further demonstrated that the mutants explore a narrower conformational space with broader minimum free energy basins — VpCas9: 315.37 Å²; VPM2-2: 1,050.41 Å²; VPM2-3: 987.88 Å² — implying that the mutants adopt more stable conformations favorable for DNA binding and cleavage.
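For context, a free energy landscape of this kind is commonly built by binning the trajectory's projections onto the first two principal components and converting bin populations to free energies with F = -kT ln(P / P_max). The sketch below follows that textbook recipe and measures a basin as the area of all bins under an energy cutoff. The bin width, temperature, cutoff, and function names are assumptions, and the paper's exact procedure may differ.

```python
import math
from collections import Counter

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol*K)


def free_energy_landscape(points, bin_width=1.0, temperature=300.0):
    """Return {(i, j): free energy in kJ/mol} for a 2D projection.

    `points` is an iterable of (pc1, pc2) pairs. The most populated bin
    is the reference and has a free energy of zero.
    """
    counts = Counter(
        (math.floor(x / bin_width), math.floor(y / bin_width)) for x, y in points
    )
    if not counts:
        raise ValueError("no points supplied")
    max_count = max(counts.values())
    kt = K_B * temperature
    return {b: -kt * math.log(n / max_count) for b, n in counts.items()}


def basin_area(landscape, cutoff, bin_width=1.0):
    """Area covered by bins whose free energy is at most `cutoff`.

    The unit is the square of the projection unit, e.g. square angstroms.
    """
    n_bins = sum(1 for energy in landscape.values() if energy <= cutoff)
    return n_bins * bin_width ** 2
```

Under this definition, a larger basin area means more of the sampled conformations lie close to the free energy minimum.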

Multi-Species Validation: From Rice to Human Cells

Figure 1. Editing efficiency of VpCas9 and the mutants VPM2-1, VPM2-2, and VPM2-3 in rice callus, maize protoplasts, and HEK293T cells, compared with SpCas9 (Xu et al., 2026).

Rice Callus

Twelve target sites across six genes (OsCKX1, OsDWF4, OsGhd8, OsHd3a, OsNramp5, OsSD1) were tested in rice callus:

Editing efficiency ranged from 37.00% to 81.20%
VPM2-1, VPM2-2, and VPM2-3 outperformed wild-type VpCas9 at five of the twelve target sites
Average efficiencies: VPM2-1 (68.46%), VPM2-2 (65.88%), VPM2-3 (67.12%) versus VpCas9 (61.35%)
Off-target effects were virtually undetectable
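To put the average efficiencies in perspective, the short sketch below converts them into gains over wild-type VpCas9, both in percentage points and as a relative change. The averages are the ones listed above; the function name is hypothetical.

```python
# Average editing efficiencies (%) in rice callus, as listed above.
AVERAGE_EFFICIENCY = {
    "VpCas9": 61.35,
    "VPM2-1": 68.46,
    "VPM2-2": 65.88,
    "VPM2-3": 67.12,
}


def gain_over_wild_type(averages, wild_type="VpCas9"):
    """Return {variant: (percentage_point_gain, relative_gain_percent)}."""
    baseline = averages[wild_type]
    gains = {}
    for name, value in averages.items():
        if name == wild_type:
            continue
        difference = value - baseline
        gains[name] = (round(difference, 2), round(100 * difference / baseline, 2))
    return gains
```

Calling `gain_over_wild_type(AVERAGE_EFFICIENCY)` shows that the best mutant, VPM2-1, gains about 7 percentage points, or roughly a 12% relative improvement over the wild type.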

Head-to-head comparison with SpCas9 at five additional OsNramp5 target sites showed that VPM2-1 (65.23%) and VPM2-3 (66.03%) significantly outperformed both VpCas9 (58.23%) and SpCas9 (56.06%). Off-target rates of all three VpCas9 mutants were comparable to SpCas9.

In stably transformed rice plants, overall editing efficiency was consistent with callus results, with VPM2-3 showing the highest average activity at 70.12%.

Maize Protoplasts

Two mCherry reporter target sites were tested:

VpCas9 total editing efficiency: 47.97%; SpCas9: 46.13% — essentially equivalent
VPM2-3 total editing efficiency: 53.68%, with a fragment deletion rate of 18.55%, significantly exceeding both VpCas9 and SpCas9
Off-target activity was low or undetectable

HEK293T Human Cells

Six genomic loci were tested in HEK293T cells. VpCas9 and its mutants edited all targets. At AAVS2, Dnmt1, and Emx1, efficiencies matched SpCas9, while at FANCF, PVALB, and Zscan2, they fell short. Importantly, off-target analysis revealed that SpCas9 produced substantial off-target editing at AAVS2_OTS_219 (51.31%) and FANCF_OTS_256 (23.57%), whereas VpCas9 and its mutants showed no detectable off-target events.

What Sets CasMiner Apart From Existing Methods

Systematic benchmarking against MP-TRANS, ESM2, random forest, and SVM classifiers showed that CasMiner outperformed all four. The shuffling ratio proved to be a critical hyperparameter, with 80% emerging as optimal: lower ratios leave residual Cas9 features in the negative sequences, while higher ratios make classification trivially easy.

When compared with traditional homology tools:

BlastP achieves high coverage within known Cas9 clusters but poor sensitivity across distant clusters
HMMER relies on profile hidden Markov models, yet only the HNH_4 domain is truly universal across all Cas9 families
CasMiner requires no pre-specified query sequence, making it a powerful complement to existing homology-based searches

The recommended practical workflow combines keyword and Pfam domain pre-filtering with CasMiner scoring, balancing computational cost and discovery power.

Current Limitations and Future Directions

The optimal shuffling ratio must be empirically validated for each new protein family model
VpCas9 underperforms SpCas9 at certain loci in mammalian cells
The mechanisms underlying cross-species variation in editing efficiency — potentially involving expression levels, chromatin accessibility, and DNA repair pathway differences — require further investigation

Despite these limitations, CasMiner establishes a proof of concept that deep learning can move beyond predicting the behavior of known CRISPR tools to discovering and engineering entirely new ones. For researchers working on plant genome editing, the demonstration that VPM2-3 outperforms SpCas9 in both rice and maize is particularly encouraging, suggesting that the Cas9 toolbox for crop improvement still has substantial room to expand.

Reference

Xu, G., et al. (2026). CasMiner: a deep learning tool for high-throughput mining and rational design of efficient Cas9. National Science Review, nwag090. DOI: 10.1093/nsr/nwag090.
