cgRNASP-CN: a minimal coarse-grained representation-based statistical potential for RNA 3D structure evaluation

Ling Song; Shixiong Yu; Xunxun Wang; Ya-Lan Tan; Zhi-Jie Tan

doi:10.1088/1572-9494/ac7042

Communications in Theoretical Physics, 2022, Vol. 74, Issue 7: 075602

Statistical Physics, Soft Matter and Biophysics

Ling Song¹, Shixiong Yu¹, Xunxun Wang¹, Ya-Lan Tan²,∗, Zhi-Jie Tan¹,∗

¹Department of Physics and Key Laboratory of Artificial Micro & Nano-structures of Education, School of Physics and Technology, Wuhan University, Wuhan 430072, China
²Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430073, China

∗Authors to whom any correspondence should be addressed.

Received date: 2022-04-19

Revised date: 2022-05-16

Accepted date: 2022-05-17

Online published: 2022-07-01

Abstract

Knowledge of RNA three-dimensional (3D) structures is critical to understanding the important biological functions of RNAs, and various models have been developed to predict RNA 3D structures in silico. However, there is still a lack of a reliable and efficient statistical potential for RNA 3D structure evaluation. For this purpose, we developed a statistical potential based on a minimal coarse-grained representation and residue separation, where every nucleotide is represented by the C4' atom for the backbone and the N1 (or N9) atom for the base. In analogy to the newly developed all-atom rsRNASP, cgRNASP-CN is composed of short-ranged and long-ranged potentials, and the short-ranged one is treated more subtly. The examination indicates that the performance of cgRNASP-CN is close to that of the all-atom rsRNASP and is superior to other top all-atom traditional statistical potentials and neural-network-based scoring functions on two realistic test datasets including the RNA-Puzzles dataset. Importantly, cgRNASP-CN is about 100 times more efficient than existing all-atom statistical potentials/scoring functions including rsRNASP. cgRNASP-CN is available at https://github.com/Tan-group/cgRNASP-CN.

Keywords: RNA structure prediction; statistical potential; structure evaluation

Cite this article

Ling Song, Shixiong Yu, Xunxun Wang, Ya-Lan Tan, Zhi-Jie Tan. cgRNASP-CN: a minimal coarse-grained representation-based statistical potential for RNA 3D structure evaluation[J]. Communications in Theoretical Physics, 2022, 74(7): 075602. DOI: 10.1088/1572-9494/ac7042

1. Introduction

Noncoding RNAs have crucial biological functions such as regulating gene expression and catalyzing some biochemical reactions [1–4], and the functions of RNAs are generally correlated with their structures, especially their three-dimensional (3D) structures [5, 6]. Due to the high cost of experimental methods such as x-ray crystallography, NMR spectroscopy and cryo-electron microscopy, the high-resolution 3D structures of RNAs deposited in the Protein Data Bank (PDB) are still very limited [7]. In parallel, some theoretical/computational models have been developed to predict the 3D structures of RNAs [8–14], based either on certain physical principles or on existing structures in the PDB [7]; correspondingly, the models can be roughly divided into physics-based ones [15–22] and knowledge-based ones [23–25]. The physics-based models such as SimRNA [26, 27], IsRNA [28–30], iFold [31], NAST [32], HiRE-RNA [33], and our model with salt effect [34–40] are generally based on coarse-grained (CG) representations, specified CG force fields, and certain conformation sampling strategies. The knowledge-based models such as the MC-fold/MC-sym pipeline, FARNA [25], Vfold3D [41–44], RNAComposer [45, 46], and 3dRNA [47, 48] are generally based on various fragment libraries and fragment-assembly strategies. An RNA 3D structure prediction model generally generates a large number of 3D structure candidates for a target RNA, and consequently a reliable statistical potential/scoring function is required to identify the structure closest to the native one [49, 50]. Furthermore, a reliable statistical potential can be used to guide RNA conformational sampling [26–30].

Knowledge-based statistical potentials have been shown to be rather effective and efficient in structure prediction and evaluation for proteins [51–57], protein-ligand complexes [58] and protein-protein complexes [59, 60]. Six kinds of reference states have been commonly used in building statistical potentials, i.e. the average reference state [54], quasi-chemical approximation reference state [57], atomic-shuffle reference state [61], finite ideal-gas reference state [62], spherical non-interaction reference state [63] and random-walk chain reference state [64]. For RNA 3D structure evaluation, some statistical potentials have been developed based on different reference states [4, 65–67]. Bernauer et al developed differentiable statistical potentials of KB at both all-atom and CG representations based on the quasi-chemical approximation reference state [65]. Capriotti et al built all-atom and CG statistical potentials of RASP based on the average reference state [66]. Wang et al derived an all-atom distance- and torsion-angle-dependent statistical potential of 3dRNAscore based on the average reference state [4]. Zhang et al proposed an all-atom distance-dependent statistical potential of DFIRE-RNA based on the finite ideal-gas reference state [68]. By building six statistical potentials based on the same training set and the six existing reference states, we found that the finite ideal-gas and random-walk chain reference states are slightly better than the other reference states in identifying native structures and ranking decoy structures [67]. Recently, machine learning/deep learning approaches have been used in building scoring functions for RNA 3D structure evaluation [69, 70]. Compared with the top traditional statistical potentials, RNA3DCNN, constructed with a 3D convolutional neural network, shows excellent performance in identifying native structures of the RNA-Puzzles dataset [69], and the newly developed ARES [70], a deep neural network trained on data from FARFAR2, showed rather good performance in evaluating structures from FARFAR2 [71]. Very recently, we developed an all-atom residue-separation-based statistical potential of rsRNASP by distinguishing short-ranged and long-ranged potentials, and rsRNASP shows a visibly improved performance over existing statistical potentials and scoring functions from neural networks [72].

However, almost all existing physics-based models for RNA 3D structure prediction are based on CG representations of different levels rather than the all-atom one, in order to reduce the conformational space, while the existing high-performance statistical potentials/scoring functions are all based on the all-atom representation. Consequently, a reliable CG statistical potential is crucially important for CG-based 3D structure prediction models. Furthermore, a reliable CG statistical potential can also be applied to all-atom structure evaluation at much higher efficiency than an all-atom one, since far fewer atoms are involved. Therefore, reliable CG statistical potentials are still highly desirable, not only for CG structure evaluation but also for all-atom structure evaluation at high efficiency.

In this work, we developed a CG statistical potential of cgRNASP-CN for RNA 3D structure evaluation based on a minimal CG representation for nucleotides. Specifically, we used two real heavy atoms, C4' and N1 (for pyrimidines, or N9 for purines), to describe a nucleotide; the C4' atoms and the N1 (or N9) atoms describe the backbone and the bases of an RNA chain, respectively. The examinations show that cgRNASP-CN has a good performance in structure evaluation for realistic test sets including the RNA-Puzzles dataset. Furthermore, for the RNA-Puzzles dataset, the performance of cgRNASP-CN is very close to that of the newly developed all-atom rsRNASP and superior to that of other top all-atom statistical potentials/scoring functions. Importantly, cgRNASP-CN is on the order of 100 times more efficient than existing top all-atom statistical potentials/scoring functions.

2. Methods

A minimal CG representation

First, we surveyed the physics-based RNA 3D structure prediction models [26, 29, 38], and found that the C4' and N1 (for pyrimidines, or N9 for purines) atoms were used very frequently for describing the backbone and the base of a nucleotide, respectively. For example, C4' and N1 (or N9) atoms were used in SimRNA [26, 27], Vfold [41–44], Shapiro's model [73], and our CG model with salt effect [34–40]. Moreover, the CG representation with the two atoms C4' and N1 (or N9) can be considered a minimal one for describing the backbone and the base of a nucleotide, respectively; see figure 1. Thus, based on this minimal CG representation, we developed a new CG statistical potential, namely cgRNASP-CN, for RNA 3D structure evaluation.

Figure 1. (A) Illustration of a minimal coarse-grained (CG) representation used in developing the statistical potential of cgRNASP-CN for RNA 3D structure evaluation, where 2 CG beads at C4' and N1/N9 heavy atoms describe backbone and base, respectively; (B) illustration for top-1 structure identified by cgRNASP-CN from the structure candidates.
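As an illustration of the 2-bead mapping, the following minimal Python sketch keeps only the C4' atom and the N1 (pyrimidine) or N9 (purine) atom from PDB-format ATOM records. It is not taken from the cgRNASP-CN code: the function name `extract_cg_beads`, the `BASE_ATOM` table and the choice to skip non-standard residues are assumptions made here for illustration, and the column slices follow the fixed-width PDB format.

```python
# Base bead per residue type: purines (A, G) attach to the sugar via N9,
# pyrimidines (C, U) via N1.
BASE_ATOM = {"A": "N9", "G": "N9", "C": "N1", "U": "N1"}


def extract_cg_beads(pdb_text):
    """Return (chain, res_seq, res_name, atom_name, (x, y, z)) tuples for the
    C4' bead and the N1/N9 bead of every standard nucleotide."""
    beads = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        # Older files write the prime as '*', e.g. C4*.
        atom_name = line[12:16].strip().replace("*", "'")
        res_name = line[17:20].strip()
        if res_name not in BASE_ATOM:
            continue  # modified or non-RNA residues are skipped in this sketch
        if atom_name not in ("C4'", BASE_ATOM[res_name]):
            continue
        chain = line[21]
        res_seq = int(line[22:26])
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        beads.append((chain, res_seq, res_name, atom_name, xyz))
    return beads
```

For an RNA of N standard nucleotides with complete coordinates, this returns 2N beads, which is the input size that a 2-bead potential has to process.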

A CG statistical potential based on a minimal CG representation and residue separation

RNA folding is generally hierarchical [74], and the interactions at different residue separations may play different roles in stabilizing RNA 3D structures [51]. Similar to the newly developed all-atom rsRNASP [72], the total energy of an RNA conformation C of a given sequence is composed of short-ranged energy and long-ranged energy in the present cgRNASP-CN [75]:

(1) $\begin{eqnarray}E\left(C\right)={E}_{\mathrm{short}\,(1\leqslant k\leqslant {k}_{0})}+\omega {E}_{\mathrm{long}\,({k}_{0}\lt k)},\end{eqnarray}$

where k₀ is a residue separation threshold to distinguish short- and long-ranged interactions and ω is a weight to balance the two contributions. The long-ranged energy E_long in cgRNASP-CN can be given by [72, 75]

(2) $\begin{eqnarray}{E}_{\mathrm{long}\,}=\displaystyle \sum _{{k}_{0}\lt k}{E}_{{\rm{long}}}\left(i,j,r\right),\end{eqnarray}$

and

$\begin{eqnarray*}{E}_{{\rm{long}}}\left(i,j,r\right)=-{k}_{{\rm{B}}}T\,\mathrm{ln}\,\displaystyle \frac{{P}_{{k}_{0}\lt k}^{{\rm{obs}}}\left(i,j,r\right)}{{P}_{{k}_{0}\lt k}^{{\rm{ref}}}\left(i,j,r\right)},\end{eqnarray*}$

where ${P}_{{k}_{0}\lt k}^{{\rm{obs}}}\left(i,j,r\right)$ and ${P}_{{k}_{0}\lt k}^{{\rm{ref}}}\left(i,j,r\right)$ are the probabilities of the distance between atom pairs of types $i$ and $j$ falling in the distance interval $(r,r+dr]$ within the residue-separation range k₀ < k, for the native and reference states, respectively. The short-ranged energy E_short in cgRNASP-CN is given by [75]

(3) $\begin{eqnarray}\begin{array}{l}{E}_{\mathrm{short}\,}=\displaystyle \sum {E}_{k=1}(i,j,r)+\alpha \displaystyle \sum {E}_{k=2}\left(i,j,r\right)\\ \,+\beta \displaystyle \sum {E}_{3\leqslant k\leqslant {k}_{0}}(i,j,r),\end{array}\end{eqnarray}$

and

$\begin{eqnarray*}{E}_{k\in {\rm{range}}}\left(i,j,r\right)=-{k}_{{\rm{B}}}T\,\mathrm{ln}\,\displaystyle \frac{{P}_{k\in {\rm{range}}}^{{\rm{obs}}}\left(i,j,r\right)}{{P}_{k\in {\rm{range}}}^{{\rm{ref}}}\left(i,j,r\right)},\end{eqnarray*}$

where k ∈ range stands for the residue separation k in the k ranges in equation (3). ${P}_{k\in {\rm{range}}}^{{\rm{obs}}}\left(i,j,r\right)$ and ${P}_{k\in {\rm{range}}}^{{\rm{ref}}}\left(i,j,r\right)$ are the probabilities of the distance between atom pairs of types $i$ and $j$ in distance interval (r, r + dr] and in the range of residue separation k ∈ range (in equation (3)) for the native and reference states, respectively. For the short-ranged and long-ranged potentials, we used the average and finite ideal-gas reference states, respectively. For the details about the use of the reference states for deriving the short- and long-ranged potentials, please see section S1 (available online at stacks.iop.org/CTP/74/075602/mmedia) in the supplementary materials, and please see [24, 67, 72] for the detailed description of the existing reference states.

Training set and parameters

In cgRNASP-CN, we used the same non-redundant training set of native structures that was recently used to derive the all-atom rsRNASP [72]; the dataset is available at https://github.com/Tan-group/rsRNASP [72]. It should be noted that several RNAs in the training set have over 80% identity with RNAs in the test sets; these RNAs were retained in the training set to keep the complete structure spectrum, and for them we used the leave-one-out method following previous works [67, 72, 75]. To optimize the weights (α, β, ω) in equations (1) and (3) for short- and long-ranged interactions, we used a training decoy dataset previously built for deriving the all-atom rsRNASP, which is available at https://github.com/Tan-group/rsRNASP [72]. Following the all-atom rsRNASP [72], ${k}_{0}$ was taken as 4. Because the long-ranged interactions span a large residue-separation range, the number of CG bead pairs contributing to them depends on the RNA length N; a length-dependent function f(N) was therefore introduced to normalize this contribution. Consequently, $\omega $ in equation (1) is given by $\omega ={\omega }_{0}/f(N).$ Based on the examinations on the training decoy dataset, $\alpha ,$ β and ${\omega }_{0}$ were determined for cgRNASP-CN. For the details of f(N) and $\alpha ,$ β and ${\omega }_{0},$ please see section S3 and figures S2, S3 in the supplementary material.
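To make the weighting in equations (1) and (3) concrete, the following Python sketch combines pre-computed pair energies according to their residue separation k. It is only an illustration under stated assumptions: the function names are hypothetical, and α, β, ω₀ and f(N) are passed in by the caller because their actual values and functional form are given in the supplementary material, not here.

```python
def separation_class(k, k0=4):
    """Map a residue separation k >= 1 to the ranges of equations (1) and (3)."""
    if k == 1:
        return "k1"
    if k == 2:
        return "k2"
    if k <= k0:
        return "k3_k0"
    return "long"


def total_energy(pair_energies, n_res, alpha, beta, omega0, f, k0=4):
    """Combine pair energies into E(C) = E_short + omega * E_long.

    pair_energies: iterable of (k, energy) for every scored bead pair.
    n_res:         RNA length N.
    f:             caller-supplied normalization function f(N) (assumption:
                   its form is taken from the supplementary material).
    """
    sums = {"k1": 0.0, "k2": 0.0, "k3_k0": 0.0, "long": 0.0}
    for k, energy in pair_energies:
        if k < 1:
            continue  # equation (1) starts at k = 1
        sums[separation_class(k, k0)] += energy
    omega = omega0 / f(n_res)
    e_short = sums["k1"] + alpha * sums["k2"] + beta * sums["k3_k0"]
    return e_short + omega * sums["long"]
```

With placeholder weights, for example `total_energy(pairs, n_res=6, alpha=0.5, beta=2.0, omega0=3.0, f=lambda n: n)`, the short-ranged terms are weighted individually while all pairs with k > k₀ share the single length-normalized weight ω.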

In cgRNASP-CN, the distance bin width is 0.3 Å [4, 67, 72, 75], and the cut-off distances of the potentials for the residue-separation ranges k = 1, k = 2, 3 ≤ k ≤ 4 and $4\lt k$ were set according to the distance distributions between CG beads in the respective ranges [72]; please see section S2 and figure S1 in the supplementary materials for more detailed information. When no bead pairs of a given type were observed within a certain distance bin, the potential in that bin was set to the highest potential value over the whole distance range for the corresponding CG bead-pair type. When the distance between two CG beads is less than 3.9 Å (the mean van der Waals diameter of C4' and N1 (or N9) atoms), the potential was set to a high value of 50, where ${k}_{{\rm{B}}}T$ is taken as the unit of potential energy.
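The rules above can be sketched in Python for a single bead-pair type in a single residue-separation range. This is a simplified illustration, not the released implementation: the reference probabilities are assumed to be supplied by the caller, bins are assumed to start at r = 0, and returning zero beyond the last bin is an assumption about how the cut-off is applied.

```python
import math

BIN_WIDTH = 0.3      # Å, distance bin width stated in the text
CLASH_DIST = 3.9     # Å, below this the potential is a fixed penalty
CLASH_ENERGY = 50.0  # in units of k_B T


def potential_from_counts(obs_counts, ref_probs):
    """Return per-bin energies -ln(P_obs / P_ref) in units of k_B T.

    Bins with no observations receive the highest energy found in the
    observed bins, as described in the text."""
    total = sum(obs_counts)
    if total == 0:
        raise ValueError("no observations for this bead-pair type")
    energies = []
    for n_obs, p_ref in zip(obs_counts, ref_probs):
        if n_obs > 0 and p_ref > 0:
            energies.append(-math.log((n_obs / total) / p_ref))
        else:
            energies.append(None)  # filled in below
    highest = max(e for e in energies if e is not None)
    return [highest if e is None else e for e in energies]


def lookup(energies, r):
    """Energy for a bead pair at distance r (Å)."""
    if r < CLASH_DIST:
        return CLASH_ENERGY
    index = math.ceil(r / BIN_WIDTH) - 1  # bin covering (index*dr, (index+1)*dr]
    if index >= len(energies):
        return 0.0  # assumption: no contribution beyond the cut-off
    return energies[index]
```

Summing `lookup` over all bead pairs of a structure, with one table per bead-pair type and residue-separation range, gives the per-range sums that enter equations (1) and (3).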

Test datasets

To test the performance of cgRNASP-CN, we used two realistic test sets, the PM and Puzzles datasets, instead of those generated by perturbation methods [72]. The PM dataset was built by us previously with four RNA 3D structure prediction models using given native secondary structures; it is composed of decoy structures for 20 RNAs and is available at https://github.com/Tan-group/rsRNASP [72]. The Puzzles dataset was generated from the CASP-like competition of RNA 3D structure prediction, and is composed of the decoy structures of 22 RNAs from various top research groups around the world [67]. The Puzzles dataset is available at https://github.com/RNA-Puzzles/standardized_dataset, and is of particular importance since its decoys come from blind predictions made with only the sequences given [67].

Measuring RNA structure similarity

To measure the structural difference between the two RNA 3D structures, we used both root-mean-square-deviation (RMSD) and deformation index (DI) metrics. The DI value between structures A and B is calculated as follows [76]:

(4) $\begin{eqnarray}{\rm{DI}}({\rm{A}},{\rm{B}})=\displaystyle \frac{{\rm{RMSD}}({\rm{A}},{\rm{B}})}{{\rm{INF}}({\rm{A}},{\rm{B}})},\end{eqnarray}$

where ${\rm{RMSD}}({\rm{A}},{\rm{B}})$ and ${\rm{INF}}({\rm{A}},{\rm{B}})$ represent geometric and topological differences between structures A and B, respectively. INF describes interaction network fidelity and can be measured by Matthews correlation coefficient of base-pairing and base-stacking interactions [77]. If structures A and B have very similar hydrogen bond interaction networks, the DI will be similar to the RMSD, otherwise the DI value will be relatively larger than the RMSD value. The tools for calculating DI and INF are available at https://github.com/RNA-Puzzles/BasicAssessMetrics [78].
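Equation (4) can be illustrated with a small Python sketch. It assumes that the base-pairing and base-stacking interactions of both structures have already been annotated and are given as sets of hashable items, and it approximates the Matthews correlation coefficient by the geometric mean of precision and sensitivity, a form commonly used when true negatives are not well defined. The function names are hypothetical; the tools linked above should be used for actual assessments.

```python
import math


def inf_score(ref_interactions, model_interactions):
    """Interaction network fidelity from two sets of annotated interactions."""
    ref, model = set(ref_interactions), set(model_interactions)
    tp = len(ref & model)   # interactions present in both structures
    fp = len(model - ref)   # predicted but absent from the reference
    fn = len(ref - model)   # in the reference but missed by the model
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    return math.sqrt(precision * sensitivity)


def deformation_index(rmsd, inf):
    """DI = RMSD / INF; infinite when no interaction is reproduced."""
    if inf <= 0:
        return math.inf
    return rmsd / inf
```

For example, a model that reproduces three of four reference interactions and adds one spurious interaction has INF = 0.75, so an RMSD of 6.0 Å corresponds to a DI of 8.0 Å.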

3. Results and discussion

In the following, we tested the performance of cgRNASP-CN against two realistic datasets PM and Puzzles, in a comparable way with existing top all-atom statistical potentials/scoring functions including rsRNASP [72], RNA3DCNN [69], ARES [70], DFIRE-RNA [68], 3dRNAscore [4], and RASP [66]. First, we examined the overall performance of cgRNASP-CN against the two datasets, and afterwards focused on the performance of cgRNASP-CN against the RNA-Puzzles dataset. Finally, we examined the computation efficiency of cgRNASP-CN, compared with the existing top all-atom statistical potentials/scoring functions.

Evaluation metrics

To describe the performance of cgRNASP-CN, we used the following three metrics: (a) the number of identified native structures; (b) the DI values of the lowest-energy structure (including and excluding the native structure), and (c) the Pearson correlation coefficient (PCC) between energies and DIs of decoy structures. The PCC value is calculated as follows:

(5) $\begin{eqnarray}\mathrm{PCC}=\frac{\displaystyle {\sum }_{m=1}^{{M}_{\mathrm{decoys}}}({E}_{m}-\bar{E})({R}_{m}-\bar{R})}{\sqrt{\displaystyle {\sum }_{m=1}^{{M}_{\mathrm{decoys}}}{({E}_{m}-\bar{E})}^{2}}\sqrt{\displaystyle {\sum }_{m=1}^{{M}_{\mathrm{decoys}}}{({R}_{m}-\bar{R})}^{2}}},\end{eqnarray}$

where ${M}_{{\rm{decoys}}}$ is the total number of decoy structures for an RNA. ${E}_{m}$ and ${R}_{m}$ are the energy and DI of the mth decoy structure, respectively. $\bar{E}$ and $\bar{R}$ are the averaged energy and DI of all decoy structures, respectively. The PCC value ranges from −1 to 1, and when PCC is equal to 1, the statistical potential has a perfect performance in ranking decoys.
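Equation (5) translates directly into a few lines of Python. The sketch below uses only the standard library; the function name `pearson` is chosen here for illustration.

```python
import math


def pearson(energies, dis):
    """Pearson correlation coefficient between decoy energies and DI values."""
    n = len(energies)
    if n != len(dis) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    e_mean = sum(energies) / n
    r_mean = sum(dis) / n
    cov = sum((e - e_mean) * (r - r_mean) for e, r in zip(energies, dis))
    var_e = sum((e - e_mean) ** 2 for e in energies)
    var_r = sum((r - r_mean) ** 2 for r in dis)
    if var_e == 0 or var_r == 0:
        raise ValueError("PCC is undefined when one variable is constant")
    return cov / math.sqrt(var_e * var_r)
```

A potential whose energies increase linearly with DI over the decoys of an RNA gives a value of 1.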

Overall performance of cgRNASP-CN for PM and Puzzles datasets

In identifying native structures

As shown in figure 2(A) and table S2 in the supplementary material, cgRNASP-CN identifies 30 native structures from the decoys of 42 RNAs for the PM and Puzzles datasets, i.e. cgRNASP-CN identifies ∼71% of the native structures for the two realistic datasets. In contrast, rsRNASP, RNA3DCNN, ARES, DFIRE-RNA, 3dRNAscore, and RASP identify 32, 27, 2, 20, 4, and 4 native structures, respectively, from the decoys of the 42 RNAs. This indicates that the performance of cgRNASP-CN in identifying native structures is slightly lower than that of the all-atom rsRNASP, while it is higher than that of the other all-atom statistical potentials/scoring functions.

Figure 2. (A) Number of identified native structures, (B) average DI values of the lowest-energy structures (including native ones), (C) average DI values of the lowest-energy decoys (excluding native ones), and (D) average PCC values between DIs and energies by cgRNASP-CN and other all-atom statistical potentials. Panels (A)–(D) are for the two realistic datasets (PM + Puzzles), and the PCC values were averaged over the mean values of respective test sets since decoys in a dataset were generated with the same method and have similar structure features.

In identifying near-native structures

We also examined the performance of cgRNASP-CN in identifying near-native structures for the two realistic datasets with native structures included. As shown in figure 2(B) and table S2 in the supplementary material, the mean DI of the lowest-energy structures from cgRNASP-CN is ∼3.5 Å for the two realistic datasets with native structures. The corresponding values are 3.9 Å for rsRNASP, 6.1 Å for RNA3DCNN, 16.1 Å for ARES, 8.9 Å for DFIRE-RNA, 16.5 Å for 3dRNAscore, and 17.4 Å for RASP. Namely, the DI from cgRNASP-CN is slightly smaller than that from the all-atom rsRNASP, and is clearly smaller than those from the other top all-atom statistical potentials/scoring functions including RNA3DCNN, ARES, DFIRE-RNA, 3dRNAscore, and RASP. This indicates the overall better performance of cgRNASP-CN compared with other top all-atom statistical potentials and scoring functions in identifying near-native structures for the two realistic datasets with native structures included.

Furthermore, we examined the ability of cgRNASP-CN to identify near-native structures for the two datasets with native structures excluded, since a 3D prediction model generally cannot generate native structures. As shown in figure 2(C) and table S2 in the supplementary material, the mean DI from cgRNASP-CN is 12.6 Å, a smaller value than those from the other existing top statistical potentials and scoring functions: 12.8 Å for rsRNASP, 15.6 Å for RNA3DCNN, 17.1 Å for ARES, 13.8 Å for DFIRE-RNA, 18.3 Å for 3dRNAscore, and 19.4 Å for RASP. This suggests that cgRNASP-CN has very slightly better performance than rsRNASP and visibly better performance than RNA3DCNN, ARES, DFIRE-RNA, 3dRNAscore, and RASP in identifying near-native structures for the two realistic datasets without native structures.
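The two ways of selecting the lowest-energy structure, with and without the native structure in the candidate pool, can be sketched as follows. The record layout (dictionaries with `energy`, `di` and `is_native` keys) and the function name are assumptions made for illustration only.

```python
def evaluate_top1(structures):
    """Summarize top-1 selection for the candidates of one RNA.

    structures: list of dicts with keys 'energy', 'di' and 'is_native';
    the list is assumed to contain at least one non-native decoy."""
    best_overall = min(structures, key=lambda s: s["energy"])
    decoys = [s for s in structures if not s["is_native"]]
    best_decoy = min(decoys, key=lambda s: s["energy"])
    return {
        "native_identified": best_overall["is_native"],
        "di_with_native": best_overall["di"],
        "di_without_native": best_decoy["di"],
    }
```

Averaging `di_with_native` and `di_without_native` over all RNAs in a dataset gives quantities of the kind reported in figures 2(B) and 2(C), and counting `native_identified` gives the kind of number reported in figure 2(A).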

In ranking decoy structures

A good statistical potential can not only identify the near-native structures from decoys, but also rank the decoy structures according to their similarity to the native ones. We used the PCC between energies and DIs of decoys to assess the ability of cgRNASP-CN to rank decoy structures of RNAs. As shown in figure 2(D) and table S2 in the supplementary material, the PCC value from cgRNASP-CN is ∼0.60 for the two realistic datasets. This value is very slightly smaller than that of rsRNASP (PCC ∼ 0.61), while it is visibly larger than those of RNA3DCNN (PCC ∼ 0.41), ARES (PCC ∼ 0.38), DFIRE-RNA (PCC ∼ 0.53), 3dRNAscore (PCC ∼ 0.27), and RASP (PCC ∼ 0.26). Thus, cgRNASP-CN is very similar to rsRNASP and visibly superior to the other statistical potentials and scoring functions in ranking decoy structures.

Therefore, for the two realistic datasets, the present cgRNASP-CN is very similar to the all-atom rsRNASP and is visibly superior to other top all-atom statistical potentials/scoring functions for RNA 3D structure evaluation. It is encouraging that cgRNASP-CN appears very slightly better than rsRNASP in identifying near-native structures for the realistic PM and Puzzles datasets.

Performance of cgRNASP-CN for RNA-Puzzles dataset

The Puzzles dataset was generated from the CASP-like competition of RNA 3D structure prediction, and is composed of the decoy structures of 22 RNAs from various top research groups around the world. Due to the particular importance of the Puzzles dataset, in the following, we explicitly examined the performance of cgRNASP-CN against the Puzzles dataset.

In identifying native/near-native structures

As shown in figures 3(A)–(C) and table S3 in the supplementary material, cgRNASP-CN identifies 14 native structures from the decoys of 22 RNAs in the Puzzles dataset. This number is slightly smaller than that of the all-atom rsRNASP (16 out of 22), while it is larger than those of the other all-atom statistical potentials/scoring functions including RNA3DCNN (13 out of 22), ARES (2 out of 22), DFIRE-RNA (10 out of 22), 3dRNAscore (2 out of 22), and RASP (2 out of 22). Moreover, the DI values for the Puzzles dataset with and without native structures from cgRNASP-CN are 5.1 and 13.2 Å, which are similar to those from rsRNASP (4.6 and 14.4 Å) and smaller than those from RNA3DCNN (5.9 and 18.5 Å), ARES (18.1 and 18.8 Å), DFIRE-RNA (7.6 and 14.4 Å), 3dRNAscore (17.1 and 19.4 Å), and RASP (17.8 and 20.0 Å). This indicates that cgRNASP-CN is similar to the all-atom rsRNASP in identifying near-native structures and appears superior to the other statistical potentials and scoring functions. Importantly, the DI value of cgRNASP-CN for the Puzzles dataset without native structures is slightly smaller than that from rsRNASP. Since a native structure is generally absent in a blind structure prediction, this suggests that cgRNASP-CN can identify structures closer to native ones than the all-atom rsRNASP in that setting.

Figure 3. (A) Number of identified native structures, (B) average DI values of the lowest-energy structures (including native ones), (C) average DI values of the lowest-energy decoys (excluding native ones), and (D) average values of PCCs between DIs and energies by cgRNASP-CN and other all-atom statistical potentials for the Puzzles dataset.

In ranking decoy structures

The PCC values between energies and DIs of decoys of the Puzzles dataset are shown in figure 3(D) and table S3 in the supplementary material. The PCC from cgRNASP-CN is 0.55, a slightly lower value than that from rsRNASP (0.57). However, the PCC from cgRNASP-CN is higher than those from the other top all-atom statistical potentials/scoring functions including RNA3DCNN (0.35), ARES (0.40), DFIRE-RNA (0.52), 3dRNAscore (0.35), and RASP (0.38). This suggests that cgRNASP-CN is close to rsRNASP and appears superior to the other all-atom statistical potentials/scoring functions in ranking decoy structures for the Puzzles dataset.

Therefore, for the Puzzles dataset, the performance of cgRNASP-CN is overall similar to that of the all-atom rsRNASP and is better than other top all-atom statistical potentials/scoring functions. Notably, cgRNASP-CN can identify the structures closer to native ones when native structures are not involved in the Puzzles dataset, since a blind structure prediction generally does not involve a native structure.

Computation efficiency of cgRNASP-CN

As shown above, for the two realistic datasets of PM and Puzzles, the present cgRNASP-CN has a very similar performance with the newly developed all-atom rsRNASP and an overall better performance than other top all-atom statistical potentials/scoring functions. Since cgRNASP-CN is a statistical potential based on a minimal 2-bead CG representation, cgRNASP-CN can be employed not only for related CG structure evaluation, but also for all-atom structure evaluation at high efficiency. In the following, we quantitatively examined the computation efficiency of cgRNASP-CN for the RNAs in the Puzzles dataset, in a comparison with existing all-atom statistical potentials/scoring functions.

As shown in figure 4, for the RNAs in the Puzzles dataset, cgRNASP-CN is significantly more efficient than the all-atom statistical potentials/scoring functions rsRNASP, RNA3DCNN, and DFIRE-RNA. Specifically, for the Puzzles dataset, the computation time of cgRNASP-CN is about 1/130 of that of rsRNASP, while the computation time of rsRNASP is comparable to that of DFIRE-RNA and is about 1/10 of that of RNA3DCNN. This is understandable, since cgRNASP-CN involves a minimal 2-bead CG representation for a nucleotide and the computation time of a statistical potential is generally proportional to the square of the number of atoms involved. Therefore, cgRNASP-CN, with its good performance, is significantly more efficient than existing all-atom statistical potentials/scoring functions, which would enable it to greatly reduce the evaluation time for a given ensemble of candidates or to evaluate many more structure candidates within a given time.

Figure 4. Computation times of cgRNASP-CN and other top all-atom statistical potentials/scoring functions for the Puzzles dataset containing decoys of 22 RNAs, relative to that of cgRNASP-CN. The PDB IDs of the 22 RNAs in the Puzzles dataset are shown as the x-axis labels.
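The quadratic scaling argument can be checked with a back-of-the-envelope estimate in Python. The heavy-atom counts per nucleotide below are an assumption of this sketch (standard internal residues including the phosphate group, hydrogens excluded), and the estimate ignores distance cut-offs, file parsing and other overheads, so it indicates only the expected order of magnitude.

```python
# Assumed heavy-atom counts for standard internal RNA residues.
HEAVY_ATOMS = {"A": 22, "G": 23, "C": 20, "U": 20}
CG_BEADS_PER_NUCLEOTIDE = 2  # C4' and N1/N9


def n_pairs(n):
    """Number of distinct pairs among n particles."""
    return n * (n - 1) // 2


def pair_count_ratio(sequence):
    """Ratio of all-atom to 2-bead pair counts for an RNA sequence."""
    n_all_atom = sum(HEAVY_ATOMS[base] for base in sequence)
    n_cg = CG_BEADS_PER_NUCLEOTIDE * len(sequence)
    return n_pairs(n_all_atom) / n_pairs(n_cg)
```

For a 40-nucleotide sequence such as `"ACGU" * 10`, the ratio is about 114, which is of the same order as the speed-up relative to rsRNASP reported above.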

4. Conclusion

In this work, we developed the CG statistical potential of cgRNASP-CN based on a minimal CG representation for a nucleotide. The examinations against the realistic datasets show that cgRNASP-CN has a performance similar to that of the newly developed all-atom rsRNASP, and can even identify structures slightly closer to the native ones for the realistic datasets without native structures. Furthermore, cgRNASP-CN is superior to other top existing all-atom statistical potentials/scoring functions for the realistic datasets. More importantly, cgRNASP-CN is significantly (over 100 times) more efficient than existing top all-atom statistical potentials/scoring functions including rsRNASP. Therefore, cgRNASP-CN can be used not only for evaluating CG structure candidates with the corresponding CG atoms but also for evaluating all-atom structure candidates at very high efficiency.

However, the performance of cgRNASP-CN, although relatively good, is still limited. For example, for the realistic datasets, the percentage of identified native structures is ∼71% and the PCC value between DIs and energies of decoys is ∼0.6, and both values are still clearly lower than the ideal value of 1. Therefore, the present CG statistical potential of cgRNASP-CN still needs to be improved for a more reliable evaluation of RNA 3D structure candidates. First, since the number of native RNA structures in the current PDB is limited [7], cgRNASP-CN can be continuously improved as more RNA structures are deposited in the PDB. Second, in addition to the distance between CG atoms, other geometric parameters such as torsion angle and orientation can be included in a statistical potential to capture the geometry of RNA 3D structures more completely [4, 24, 79]. Third, multi-body potentials can be explicitly included in cgRNASP-CN, which would improve the description of correlated atom-atom distance distributions [24, 80]. Nevertheless, the present statistical potential of cgRNASP-CN based on a minimal CG representation would be very beneficial for related CG-based 3D structure evaluation and for all-atom-based 3D structure evaluation at high computation efficiency.

Data availability statement

All relevant data are within the paper and its supplementary material files. The potential of cgRNASP-CN is available at https://github.com/Tan-group/cgRNASP-CN.

Author contributions

Z J T, Y L T and L S designed the research. L S, X W and S X Y performed the research. Z J T, Y L T, X W and L S analyzed the data. L S, Y L T, X W, and Z J T wrote the manuscript.

Funding

This work was supported by grants from the National Science Foundation of China (12075171, 11774272).

We are grateful to Profs Shi-Jie Chen (University of Missouri) and Jian Zhang (Nanjing University) for valuable discussions. The numerical calculations in this work were performed on the super computing system in the Super Computing Center of Wuhan University.

References

[1] Breaker R R, Gesteland R, Cech T, Atkins J 2006 The RNA World 3rd edn (New York: Cold Spring Harbor Laboratory Press)

[2] Dethoff E A, Chugh J, Mustoe A M, Al-Hashimi H M 2012 Functional complexity and regulation through RNA dynamics Nature 482 322–330

[3] Wang J, Mao K, Zhao Y, Zeng C, Xiang J, Zhang Y, Xiao Y 2017 Optimization of RNA 3D structure prediction using evolutionary restraints of nucleotide–nucleotide interactions from direct coupling analysis Nucleic Acids Res. 45 6299–6309

[4] Wang J, Zhao Y, Zhu C, Xiao Y 2015 3dRNAscore: a distance and torsion angle dependent evaluation function of 3D RNA structures Nucleic Acids Res. 43 e63

[5] Doherty E A, Doudna J A 2001 Ribozyme structures and mechanisms Annu. Rev. Biophys. Biomol. Struct. 30 457–475

[6] Edwards T E, Klein D J, Ferré-D'Amaré A R 2007 Riboswitches: small-molecule recognition by gene regulatory RNAs Curr. Opin. Struct. Biol. 17 273–279

[7] Rose P W, Prlić A, Altunkaya A, Bi C, Bradley A R, Christie C H, Costanzo L D, Duarte J M, Dutta S, Feng Z 2016 The RCSB protein data bank: integrative view of protein, gene and 3D structural information Nucleic Acids Res. 45 D271–D281

[8] Das R, Baker D 2007 Automated de novo prediction of native-like RNA tertiary structures Proc. Natl Acad. Sci. 104 14664–14669

[9] Das R, Karanicolas J, Baker D 2010 Atomic accuracy in predicting and designing noncanonical RNA structure Nat. Methods 7 291–294

[10] Sun L-Z, Zhang D, Chen S-J 2017 Theory and modeling of RNA structure and interactions with metal ions and small molecules Annu. Rev. Biophys. 46 227–246

[11] Miao Z, Westhof E 2017 RNA structure: advances and assessment of 3D structure prediction Annu. Rev. Biophys. 46 483–503

[12] Zhang J, Bian Y, Lin H, Wang W 2012 RNA fragment modeling with a nucleobase discrete-state model Phys. Rev. E 85 021909

[13] Zhang J, Lin M, Chen R, Wang W, Liang J 2008 Discrete state model and accurate estimation of loop entropy of RNA secondary structures J. Chem. Phys. 128 125107

[14] Zhang J, Dundas J, Lin M, Chen R, Wang W, Liang J 2009 Prediction of geometrically feasible three-dimensional structures of pseudoknotted RNA through free energy estimation RNA 15 2248–2263

[15] Sim A Y, Levitt M, Minary P 2012 Modeling and design by hierarchical natural moves Proc. Natl Acad. Sci. 109 2890–2895

[16] Cao S, Chen S-J 2005 Predicting RNA folding thermodynamics with a reduced chain representation model RNA 11 1884–1897

[17] Tan R K, Petrov A S, Harvey S C 2006 YUP: a molecular simulation program for coarse-grained and multiscaled models J. Chem. Theory Comput. 2 529–540

[18] Pasquali S, Derreumaux P 2010 HiRE-RNA: a high resolution coarse-grained energy model for RNA J. Phys. Chem. B 114 11957–11966

[19] Denesyuk N A, Thirumalai D 2013 Coarse-grained model for predicting RNA folding thermodynamics J. Phys. Chem. B 117 4901–4911

[20] Šulc P, Romano F, Ouldridge T E, Doye J P, Louis A A 2014 A nucleotide-level coarse-grained model of RNA J. Chem. Phys. 140 235102

21	Xia Z Bell D R Shi Y Ren P 2013 RNA 3D structure prediction by using a coarse-grained model and experimental data J. Phys. Chem. B 117 3135 3144 DOI

22	Bell D R Cheng S Y Salazar H Ren P 2017 Capturing RNA folding free energy with coarse-grained molecular dynamics simulations Sci. Rep. 7 45812 DOI

23	Parisien M Major F 2008 The MC-Fold and MC-Sym pipeline infers RNA structure from sequence data Nature 452 51 55 DOI

24	Tan Y-L Feng C-J Wang X Zhang W Tan Z-J 2021 Statistical potentials for 3D structure evaluation: from proteins to RNAs Chin. Phys. B 30 028705 DOI

25	Alam T Uludag M Essack M Salhi A Ashoor H Hanks J B Kapfer C Mineta K Gojobori T Bajic V B 2017 FARNA: knowledgebase of inferred functions of non-coding RNA transcripts Nucleic Acids Res. 45 2838 2848 DOI

26	Boniecki M J Lach G Dawson W K Tomala K Lukasz P Soltysinski T Rother K M Bujnicki J M 2016 SimRNA: a coarse-grained method for RNA folding simulations and 3D structure prediction Nucleic Acids Res. 44 e63 e63 DOI

27	Magnus M Boniecki M J Dawson W Bujnicki J M 2016 SimRNAweb: a web server for RNA 3D structure modeling with optional restraints Nucleic Acids Res. 44 W315 W319 DOI

28	Zhang D Chen S-J 2018 IsRNA: an iterative simulated reference state approach to modeling correlated interactions in RNA folding J. Chem. Theory Comput. 14 2230 2239 DOI

29	Zhang D Chen S-J Zhou R 2021 Modeling noncanonical RNA base pairs by a coarse-grained IsRNA2 model J. Phys. Chem. B 125 11907 11915 DOI

30	Zhang D Li J Chen S-J 2021 IsRNA1: de novo prediction and blind screening of RNA 3D structures J. Chem. Theory Comput. 17 1842 1857 DOI

31	Ding F Sharma S Chalasani P Demidov V V Broude N E Dokholyan N V 2008 Ab initio RNA folding by discrete molecular dynamics: from structure prediction to folding mechanisms RNA 14 1164 1173 DOI

32	Jonikas M A Radmer R J Laederach A Das R Pearlman S Herschlag D Altman R B 2009 Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters RNA 15 189 199 DOI

33	Cragnolini T Laurin Y Derreumaux P Pasquali S 2015 Coarse-grained HiRE-RNA model for ab initio RNA folding beyond simple molecules, including noncanonical and multiple base pairings J. Chem. Theory Comput. 11 3510 3522 DOI

34	Shi Y-Z Wang F-H Wu Y-Y Tan Z-J 2014 A coarse-grained model with implicit salt for RNAs: predicting 3D structure, stability and salt effect J. Chem. Phys. 141 105102 DOI

35	Shi Y-Z Wu Y-Y Wang F-H Tan Z-J 2014 RNA structure prediction: progress and perspective Chin. Phys. B 23 078701 DOI

36	Shi Y-Z Jin L Wang F-H Zhu X-L Tan Z-J 2015 Predicting 3D structure, flexibility, and stability of RNA hairpins in monovalent and divalent ion solutions Biophys. J. 109 2654 2665 DOI

37	Jin L Shi Y-Z Feng C-J Tan Y-L Tan Z-J 2018 Modeling structure, stability, and flexibility of double-stranded RNAs in salt solutions Biophys. J. 115 1403 1416 DOI

38	Shi Y-Z Jin L Feng C-J Tan Y-L Tan Z-J 2018 Predicting 3D structure and stability of RNA pseudoknots in monovalent and divalent ion solutions PLoS Comput. Biol. 14 e1006222 DOI

39	Jin L Tan Y-L Wu Y Wang X Shi Y-Z Tan Z-J 2019 Structure folding of RNA kissing complexes in salt solutions: predicting 3D structure, stability, and folding pathway RNA 25 1532 1548 DOI

40	Feng C Tan Y-L Cheng Y-X Shi Y-Z Tan Z-J 2021 Salt-dependent RNA pseudoknot stability: effect of spatial confinement Front. Mol. Biosci. 8 666369 DOI

41	Cao S Chen S-J 2011 Physics-based de novo prediction of RNA 3D structures J. Phys. Chem. B 115 4216 4226 DOI

42	Xu X Zhao P Chen S-J 2014 Vfold: a web server for RNA structure and folding thermodynamics prediction PLoS One 9 e107504 DOI

43	Xu X Chen S-J 2017 Hierarchical assembly of RNA three-dimensional structures based on loop templates J. Phys. Chem. B 122 5327 5335 DOI

44	Xu X Chen S-J 2020 Topological constraints of RNA pseudoknotted and loop-kissing motifs: applications to three-dimensional structure prediction Nucleic Acids Res. 48 6503 6512 DOI

45	Popenda M Szachniuk M Antczak M Purzycka K J Lukasiak P Bartol N Blazewicz J Adamiak R W 2012 Automated 3D structure composition for large RNAs Nucleic Acids Res. 40 e112 e112 DOI

46	Antczak M Popenda M Zok T Sarzynska J Ratajczak T Tomczyk K Adamiak R W Szachniuk M 2016 New functionality of RNAComposer: an application to shape the axis of miR160 precursor structure Acta Biochim. Pol. 63 737 744 DOI

47	Zhao Y Huang Y Gong Z Wang Y Man J Xiao Y 2012 Automated and fast building of three-dimensional RNA structures Sci. Rep. 2 734 DOI

48	Wang J Wang J Huang Y Xiao Y 2019 3dRNA v2.0: an updated web server for RNA 3D structure prediction Int. J. Mol. Sci. 20 4116 DOI

49	Sippl M J 1990 Calculation of conformational ensembles from potentials of mean force: an approach to the knowledge-based prediction of local structures in globular proteins J. Mol. Biol. 213 859 883 DOI

50	Wienecke A Laederach A 2022 A novel algorithm for ranking RNA structure candidates Biophys. J. 121 7 10 DOI

51	Tanaka S Scheraga H A 1976 Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins Macromolecules 9 945 950 DOI

52	Miyazawa S Jernigan R L 1985 Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation Macromolecules 18 534 552 DOI

53	Thomas P D Dill K A 1996 Statistical potentials extracted from protein structures: how accurate are they? J. Mol. Biol. 257 457 469 DOI

54	Samudrala R Moult J 1998 An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction J. Mol. Biol. 275 895 916 DOI

55	Skolnick J Kolinski A Ortiz A 2000 Derivation of protein-specific pair potentials based on weak sequence fragment similarity Proteins 38 3 16 DOI

56	Gromiha M M Selvaraj S 2004 Inter-residue interactions in protein folding and stability Prog. Biophys. Mol. Biol. 86 235 277 DOI

57	Lu H Skolnick J 2001 A distance-dependent atomic knowledge-based potential for improved protein structure selection Proteins 44 223 232 DOI

58	Ma Z Zou X 2021 MDock: a suite for molecular inverse docking and target prediction In Protein-Ligand Interactions and Drug Design vol 2266 New York Humana 313 322 DOI

59	Huang S Y Zou X 2008 An iterative knowledge-based scoring function for protein-protein recognition Proteins 72 557 579 DOI

60	Feng Y Huang S-Y 2020 ITScore-NL: an iterative knowledge-based scoring function for nucleic acid–ligand interactions J. Chem. Inf. Model. 60 6698 6708 DOI

61	Rykunov D Fiser A 2007 Effects of amino acid composition, finite size of proteins, and sparse statistics on distance-dependent statistical pair potentials Proteins 67 559 568 DOI

62	Zhou H Zhou Y 2002 Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction Protein Sci. 11 2714 2726 DOI

63	Shen M Y Sali A 2006 Statistical potential for assessment and prediction of protein structures Protein Sci. 15 2507 2524 DOI

64	Zhang J Zhang Y 2010 A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction PLoS One 5 e15386 DOI

65	Bernauer J Huang X Sim A Y Levitt M 2011 Fully differentiable coarse-grained and all-atom knowledge-based potentials for RNA structure evaluation RNA 17 1066 1075 DOI

66	Capriotti E Norambuena T Marti-Renom M A Melo F 2011 All-atom knowledge-based potential for RNA structure prediction and assessment Bioinformatics 27 1086 1093 DOI

67	Tan Y-L Feng C-J Jin L Shi Y-Z Zhang W Tan Z-J 2019 What is the best reference state for building statistical potentials in RNA 3D structure evaluation? RNA 25 793 812 DOI

68	Zhang T Hu G Yang Y Wang J Zhou Y 2020 All-atom knowledge-based potential for RNA structure discrimination based on the distance-scaled finite ideal-gas reference state J. Comput. Biol. 27 856 867 DOI

69	Li J Zhu W Wang J Li W Gong S Zhang J Wang W 2018 RNA3DCNN: local and global quality assessments of RNA 3D structures using 3D deep convolutional neural networks PLoS Comput. Biol. 14 e1006514 DOI

70	Townshend R J Eismann S Watkins A M Rangan R Karelina M Das R Dror R O 2021 Geometric deep learning of RNA structure Science 373 1047 1051 DOI

71	Watkins A M Rangan R Das R 2020 FARFAR2: improved de novo Rosetta prediction of complex global RNA folds Structure 28 963 976 DOI

72	Tan Y-L Wang X Shi Y-Z Zhang W Tan Z-J 2022 rsRNASP: a residue-separation-based statistical potential for RNA 3D structure evaluation Biophys. J. 121 142 156 DOI

73	Paliy M Melnik R Shapiro B A 2010 Coarse-graining RNA nanostructures for molecular dynamics simulations Phys. Biol. 7 036001 DOI

74	Brion P Westhof E 1997 Hierarchy and dynamics of RNA folding Annu. Rev. Biophys. Biomol. Struct. 26 113 137 DOI

75	Tan Y-L Wang X Yu S Zhang B Tan Z-J 2022 cgRNASP: coarse-grained statistical potentials with residue separation for RNA structure evaluation bioRxiv:10.1101/2022.03.13.484152

76	Parisien M Cruz J A Westhof É Major F 2009 New metrics for comparing and assessing discrepancies between RNA 3D structures and models RNA 15 1875 1885 DOI

77	Matthews B W 1975 Comparison of the predicted and observed secondary structure of T4 phage lysozyme Biochim. Biophys. Acta 405 442 451 DOI

78	Magnus M Antczak M Zok T Wiedemann J Lukasiak P Cao Y Bujnicki J M Westhof E Szachniuk M Miao Z 2020 RNA-Puzzles toolkit: a computational resource of RNA 3D structure benchmark datasets, structure manipulation, and evaluation tools Nucleic Acids Res. 48 576 588 DOI

79	Xiong P Wu R Zhan J Zhou Y 2021 Pairing a high-resolution statistical potential with a nucleobase-centric sampling algorithm for improving RNA model refinement Nat. Commun. 12 2777 DOI

80	Masso M 2018 All-atom four-body knowledge-based statistical potential to distinguish native tertiary RNA structures from nonnative folds J. Theor. Biol. 453 58 67 DOI

Figure 4. Computation times of cgRNASP-CN and other top all-atom statistical potentials/scoring functions for the Puzzles dataset containing decoys of 22 RNAs, relative to that of cgRNASP-CN. The PDB IDs of the 22 RNAs in the Puzzles dataset are shown as the x-axis labels.
