Structure Motif Search
What is a structure motif?
Structure motifs are the spatial or 3D arrangement of a small number of amino acids (at least 2) that have significance (e.g., form a catalytic or binding site). The amino acid residues making up the motif may be remote from one another in the 1D sequence, or even be located in different polymer chains, as long as they are close to each other in 3D space (within 15 Å of each other). The structure motif search service (Bittrich et al., 2020) retrieves all occurrences of specific structure motifs in 3D structures available from RCSB.org.
The active site of the enolase superfamily is used as an example here (Meng et al., 2004). The enolase superfamily is a group of proteins diverse in sequence, yet largely similar in 3D structure that all catalyze removal of a proton from a carboxylic acid (Babbitt et al., 1996).
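The proximity requirement described above can be illustrated with a minimal Python sketch. It assumes that each residue is represented by a single atom position (for example Cα); which atoms the service actually measures is not stated here, and the coordinates are made up for illustration.

```python
import math
from itertools import combinations

MAX_DISTANCE = 15.0  # Å, the proximity limit quoted in the text


def within_motif_range(coords, max_distance=MAX_DISTANCE):
    """Return True if every pair of positions lies within max_distance.

    coords: one (x, y, z) tuple per residue. Using a single representative
    atom per residue is an assumption of this sketch.
    """
    return all(
        math.dist(a, b) <= max_distance
        for a, b in combinations(coords, 2)
    )


# Hypothetical coordinates in Å, not taken from a real structure.
compact = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (0.0, 9.0, 0.0)]
spread = compact + [(0.0, 0.0, 20.0)]

print(within_motif_range(compact))  # True: all pairs are closer than 15 Å
print(within_motif_range(spread))   # False: one residue is 20 Å away
```

Note that the check is pairwise over all residues, so a single distant residue disqualifies the whole group regardless of where the residues sit in the sequence or which chain they belong to.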
When is this search useful?
The structure motif search service is particularly useful when you are interested in exploring the local structural properties of protein structures. This search service complements the structure search service and finds local, structural similarities between proteins. Search results only depend on the residues specified in the query, so it can identify local structural similarities even when the proteins have limited sequence or overall structural similarity. So, for example, this search can find similar ligand binding sites in unrelated proteins, regardless of whether the structures have a ligand bound in that neighborhood.
Detection of such structure motifs can provide valuable insights into the function(s) of previously uncharacterized proteins, especially ones that do not resemble other proteins at either the sequence or global structure level.
The structure motif search service is accessible via the Mol* interface, where query residues (amino acids and nucleotides) can be specified in a visualized molecular structure, and the ‘Advanced Search’ panel, where the query residue details can be specified by typing them into the interface.
Defining queries using Mol*
The RCSB Mol* plugin provides a convenient way to visualize a structure and define structure motif queries. The general Mol* documentation can be found here. The steps for specifying a structure motif query are described below.
To define a structure motif query for the enolase superfamily based on mandelate racemase (PDB ID 2mnr), using the template described in Meng et al., 2004, use the following steps.
In the Mol* interface, click and expand the ‘Structure Motif Search’ menu in the control panel on the right. Activate the selection mode of Mol* by clicking the mouse pointer icon and set the selection level to Residue (default). This allows you to select individual residues that will define the query.
The 5 residues that constitute the template described in literature are used here to define the query motif.
Select individual residues by clicking on them either in the 3D canvas or in the sequence panel. The selected residues will be populated in the Structure Motif Search list in the control panel. Up to 10 residues may be included in this list. Add to the selection by clicking on additional residues, or remove residues by clicking on the trash icon in the residue list. The ‘Structure Motif Search’ element of Mol* behaves like the ‘Measurements’ panel.
Hover over the residue of interest to verify its label_seq_id. The information will appear in the tooltip in the bottom right corner of the Mol* panel. If the label and author identifiers differ, Mol* shows the author-defined chain IDs and residue numbers (e.g., auth_seq_id) in square brackets. The sequence view at the top is particularly helpful when selecting residues by author numbering. Learn more about Identifiers in the PDB.
In cases where a range of amino acids (or nucleotides) may realize the same biological function or bind the same ligand, it is possible to define position-specific exchanges in the query to accommodate possible variations in specific locations of the query structure motif.
For each entry of the residue list, exchanges can be specified individually by clicking on the options icon (three horizontal bars with short vertical lines intersecting them). This will open a panel with 20 amino acid and 8 nucleotide names. Click on all three-letter codes that should be considered as valid exchanges at the corresponding position. Only the original residue type is valid if no exchanges are defined. Make sure to include the original residue type when additional exchanges are defined.
The number of exchanges per position is limited to 4.
Click the ‘Submit Search’ button. This will open a new browser tab and your query will be shown in the ‘Advanced Search’ panel.
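The constraints mentioned in the steps above (2 to 10 residues, at most 4 exchanges per position, and the original residue type included whenever exchanges are defined) can be checked before submitting a query. The following is a minimal sketch; the function name, the data layout, and the choice to count the original residue type toward the limit of 4 are assumptions made for illustration and are not part of Mol* or the RCSB.org interface.

```python
MIN_RESIDUES = 2
MAX_RESIDUES = 10
MAX_EXCHANGES = 4  # assumed here to count every selected three-letter code


def check_motif(residues):
    """Return a list of human-readable problems found in a motif definition.

    residues: list of (original_type, exchanges) tuples, where exchanges is
    a list of three-letter codes (empty if no exchanges are defined).
    """
    problems = []
    if not MIN_RESIDUES <= len(residues) <= MAX_RESIDUES:
        problems.append(
            f"motif has {len(residues)} residues, "
            f"expected {MIN_RESIDUES} to {MAX_RESIDUES}"
        )
    for position, (original, exchanges) in enumerate(residues, start=1):
        if not exchanges:
            continue  # no exchanges: only the original residue type is valid
        if len(exchanges) > MAX_EXCHANGES:
            problems.append(
                f"position {position}: more than {MAX_EXCHANGES} exchanges"
            )
        if original not in exchanges:
            problems.append(
                f"position {position}: original type {original} "
                "is not among the exchanges"
            )
    return problems


# Hypothetical motif: the third position omits its original residue type.
motif = [("LYS", ["LYS", "HIS"]), ("ASP", []), ("GLU", ["ASP"])]
for problem in check_motif(motif):
    print(problem)
```

Omitting the original residue type is reported rather than rejected outright, because it can be a deliberate choice when only the substituted types are of interest.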
Defining queries using the 'Advanced Search' panel
As a second option, the query motif can be specified directly in the ‘Advanced Search’ panel. List a structure ID (PDB ID or RCSB.org assigned CSM ID) and all the residues that make up the motif. Alternatively, use the Mol* wizard from above to autofill this panel. The page also gives you opportunities to verify, refine, or extend your search (click ‘Open In Query Builder’ when arriving from Mol*).
- Insert the PDB ID or RCSB.org assigned CSM ID that contains the query 3D structure motif.
- Specify 2 to 10 residues that make up the group of residues you want to find in other structures in the archive.
- The first box is for the polymer Chain ID (label_asym_id) of the residues. Note that a motif may include amino acids from multiple polymer chains.
- The Operator box is for optionally specifying the transformation operation that was used to generate a bioassembly (see PDB ID 2mnr as example). Identify operations by their struct_oper_id. Combinations of operators are annotated like 1x61 or Px61. Set the value to '1' if you are referencing original coordinates.
- The residue numbers included in the query are identified by their label_seq_id. Note that in publications, residues are likely referenced by their auth_seq_id, an identifier assigned by the authors. However, to define queries and report results, the RCSB PDB website uses label_seq_id.
- Exchanges - Optionally, define position-specific exchanges or substitutions. By default, only the residue type observed in the reference structure is considered valid. A set of comma-separated three-letter codes allows searching for different amino acids (or nucleotides) at the specified position. If exchanges are defined, the original residue type must be listed explicitly for it to remain valid at that position.
- Use the ‘Add Residue’ button to extend your selection to include additional amino acid residues in the structure motif, or use the ‘x’ button on the right to delete individual residues.
- The 'RMSD cutoff' parameter can be used to filter out high-RMSD hits that are unlikely to be biologically relevant.
- The 'Atom Pairing' parameter gives fine-grained control over the atom set that is considered for the alignment. By default all atoms are evaluated. Alternatively, only backbone, only side-chain, or only Cα/C4′ and Cβ/C1′ atoms can be selected for the RMSD calculation.
- Make sure to set the result type to ‘Assemblies’ to get detailed information on the result page that includes matched residue identifiers and reports the score of this hit. Note: A single entry may have more than one occurrence of the query structure motif. Since the motif may span more than one polymer chain, each occurrence is an assembly. If this option is not selected only PDB entries that contain the query motif are listed in the results.
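The fields listed above can also be collected into a single structured query. The sketch below assembles such a query as a Python dictionary. The JSON field names (`service`, `residue_ids`, `rmsd_cutoff`, `atom_pairing_scheme`, `exchanges`, `return_type`) are modeled on the RCSB Search API but should be treated as assumptions and checked against the current API documentation; the residue numbers and the default cutoff are placeholders, not the published enolase template.

```python
import json


def build_motif_query(entry_id, residues, exchanges=None,
                      rmsd_cutoff=2.0, atom_pairing="ALL",
                      return_type="assembly"):
    """Assemble a structure motif query as a plain dictionary.

    residues:  list of (label_asym_id, struct_oper_id, label_seq_id) tuples
    exchanges: optional dict mapping a residue index to allowed codes
    All field names below are assumptions of this sketch.
    """
    if not 2 <= len(residues) <= 10:
        raise ValueError("a motif needs 2 to 10 residues")
    residue_ids = [
        {"label_asym_id": chain, "struct_oper_id": oper, "label_seq_id": seq}
        for chain, oper, seq in residues
    ]
    parameters = {
        "value": {"entry_id": entry_id, "residue_ids": residue_ids},
        "rmsd_cutoff": rmsd_cutoff,
        "atom_pairing_scheme": atom_pairing,
    }
    if exchanges:
        parameters["exchanges"] = [
            {"residue_id": residue_ids[index], "allowed": sorted(allowed)}
            for index, allowed in exchanges.items()
        ]
    return {
        "query": {
            "type": "terminal",
            "service": "strucmotif",
            "parameters": parameters,
        },
        "return_type": return_type,
    }


# Placeholder residues (chain A, operator '1'); the numbers are hypothetical.
query = build_motif_query(
    "2MNR",
    [("A", "1", 10), ("A", "1", 42)],
    exchanges={0: {"LYS", "HIS"}},
)
print(json.dumps(query, indent=2))
```

Setting `return_type` to the assembly level mirrors the recommendation above to choose ‘Assemblies’ as the result type, so that matched residues and scores are reported per occurrence.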
How to interpret the result score?
The results are displayed as ‘Assemblies’.
All assemblies in the PDB archive that contain groups of residues resembling the query motif are returned. Sets of residues that match the query are identified by their label_seq_id; discrepancies between label_seq_id and auth_seq_id are reported in square brackets. The label_comp_id of each residue and the RMSD score of the match are reported as well.
All potential matches are reported with a root-mean-square deviation (RMSD) score, which is computed by aligning each identified match to the query motif and measuring the displacement of each matched atom. A value of 0.0 Å indicates a perfect alignment; higher values indicate increasingly dissimilar groups of residues.
The 'Align' button at the beginning of each line launches a Mol* view that shows the superposition of query motif and selected match.
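To make the RMSD score described above concrete, the sketch below computes it for two sets of paired atoms. It assumes the match has already been superposed onto the query, so the optimal alignment step that the service performs is not shown, and the coordinates are invented for illustration.

```python
import math


def rmsd(query_atoms, match_atoms):
    """Root-mean-square deviation between paired, already superposed atoms.

    Both arguments are equally long lists of (x, y, z) tuples in Å, with
    atom i of the query paired with atom i of the match.
    """
    if not query_atoms or len(query_atoms) != len(match_atoms):
        raise ValueError("need two non-empty atom lists of equal length")
    squared = sum(
        math.dist(q, m) ** 2 for q, m in zip(query_atoms, match_atoms)
    )
    return math.sqrt(squared / len(query_atoms))


# Hypothetical coordinates: the match is the query shifted by 0.3 Å along z.
query_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
match_atoms = [(x, y, z + 0.3) for x, y, z in query_atoms]

print(rmsd(query_atoms, query_atoms))            # 0.0 for identical atoms
print(round(rmsd(query_atoms, match_atoms), 3))  # about 0.3
```

Because every atom in this example is displaced by the same amount, the RMSD equals that displacement; in real matches the score averages over atoms that deviate by different amounts.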
Limitations of Structure Motif Search
The structure motif search service is a heuristic search with a false negative rate below 2%. This means that fewer than 1 in 50 relevant hits are missed when compared to a much slower exhaustive search strategy. The service uses 3 features to describe the geometric properties of all residue pairs present in the query motif: backbone distance (db), side-chain distance (ds), and the angle θ between the CαCβ vectors of the two residues. Hits will be missed if one of these properties differs too much from the query. Tolerance values are 1 Å for the distances and 20° for the angle.
Because the side-chain distance and the angle depend on Cβ positions, no hits will be found in structures that contain only a Cα trace. The false positive rate for hits with low RMSD values (<0.5 Å) tends to be 0, but it increases for hits with higher RMSD values.
Three geometric properties are used to describe residue pairs: backbone distance between Cα atoms, side-chain distance between Cβ atoms, and angle between the corresponding vectors.
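The three pairwise properties and their tolerances can be sketched in a few lines of Python. This is a direct comparison written for illustration only, not the inverted-index implementation described in Bittrich et al., 2020; the function names and coordinates are hypothetical.

```python
import math


def _vector(start, end):
    """Vector pointing from start to end."""
    return tuple(e - s for s, e in zip(start, end))


def _angle_deg(u, v):
    """Angle between two vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    cosine = dot / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def pair_descriptor(ca1, cb1, ca2, cb2):
    """Return (backbone distance, side-chain distance, angle) for a pair."""
    d_b = math.dist(ca1, ca2)  # Cα to Cα
    d_s = math.dist(cb1, cb2)  # Cβ to Cβ
    theta = _angle_deg(_vector(ca1, cb1), _vector(ca2, cb2))
    return d_b, d_s, theta


def compatible(query, candidate, dist_tol=1.0, angle_tol=20.0):
    """True if all three properties agree within the stated tolerances."""
    return (
        abs(query[0] - candidate[0]) <= dist_tol
        and abs(query[1] - candidate[1]) <= dist_tol
        and abs(query[2] - candidate[2]) <= angle_tol
    )


# Hypothetical coordinates in Å for one query pair and two candidate pairs.
query_pair = pair_descriptor((0, 0, 0), (1.5, 0, 0), (6, 0, 0), (6, 1.5, 0))
close_pair = pair_descriptor((0, 0, 0), (1.5, 0, 0), (6.5, 0, 0), (6.5, 1.5, 0))
far_pair = pair_descriptor((0, 0, 0), (1.5, 0, 0), (8, 0, 0), (8, 1.5, 0))

print(compatible(query_pair, close_pair))  # True: distances differ by ~0.5 Å
print(compatible(query_pair, far_pair))    # False: backbone differs by 2 Å
```

The second candidate is rejected on backbone distance alone, which illustrates why a relevant hit can be missed when a single property falls outside its tolerance.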
Details about the search algorithm and scoring are discussed in Bittrich et al., 2020. In particular, see Figure 3 and the accompanying discussion of observed false negatives. The 'For advanced users' section provides information on how to run structure motif queries with increased tolerance values that lower false negative rates at the expense of higher runtimes.
The structure motif search service finds resemblances of 2 to 10 residues that are in spatial proximity. Interesting motifs are defined in literature and available in resources such as the Catalytic Site Atlas (CSA). It is applicable for a number of example queries. All given identifiers are
Template of the enolase superfamily
The enolase superfamily is a group of proteins diverse in sequence, yet largely similar in 3D structure that all catalyze removal of a proton from a carboxylic acid (Babbitt et al., 1996). The structure motif supporting this catalytic function (Meng et al., 2004) is represented in PDB ID 2mnr.
Catalytic triad of serine proteases
Many hydrolases use a serine nucleophile during catalysis. Canonical serine protease catalytic triads are composed of His, Asp, and Ser residues (PDB ID 4cha). Typically these residues occur within two polypeptide chains, because many of these proteases are initially made as zymogens that require activation by proteolytic processing (Hedstrom, 2002) to prevent uncontrolled digestion of proteins within the cell.
You can also combine your query with keywords to narrow the result set and find more interesting occurrences of the query motif.
Aminopeptidases play important roles in protein degradation by removing residues from the N- or amino terminus of polypeptide chains (Burley et al., 1990). Bovine leucine aminopeptidase (BLLAP) is a homohexameric enzyme with 32 quaternary symmetry. The active site of BLLAP contains two adjacent zinc ions separated by ∼2.9 Å and coordinated by the side chains of five conserved residues: Lys, Asp, Asp, Asp, and Glu (PDB ID 1lap).
Eukaryotic transcription factors often contain His2/Cys2 Zinc Finger domains (PDB ID 1g2f) that bind DNA. These motifs are composed of two cysteine and two histidine residues, which stabilize a small ββα domain structure that envelops and coordinates a single zinc ion (Pabo et al., 2001). In the absence of the zinc ion, these domains do not adopt compact, folded structures and are incapable of binding DNA.
G-tetrads are a common nucleic acid association motif (PDB ID 3mij). They are composed of guanines and stabilized by Hoogsteen base pairings. The four O6 oxygen atoms coordinate monovalent ions, such as K+, and individual tetrads tend to be stacked one atop the other (Burge et al., 2006).
For advanced users
All Java source code is publicly available on GitHub (github.com/rcsb/strucmotif-search), and the project is distributed as a Maven artifact.
We encourage interested users to set up a local installation of the structure motif search service. This allows you to configure the tool for your exact requirements and gives fine-grained control over all parameters, some of which are not exposed on rcsb.org. Additional features include:
- Increased tolerance values that allow one to retrieve more dissimilar hits
- Definition of query motifs using custom structures that are not part of the PDB archive (such as AlphaFold structures)
- Screening for occurrences of known motifs in a structure of unknown function
References
- Bittrich S, Burley SK, Rose AS (2020) Real-time structural motif searching in proteins using an inverted index strategy. PLoS computational biology. 16(12): e1008502, doi: 10.1371/journal.pcbi.1008502.
- Meng EC, Polacco BJ, Babbitt PC (2004) Superfamily active site templates. PROTEINS: Structure, Function, and Bioinformatics. 55(4): 962–976, doi: 10.1002/prot.20099.
- Babbitt PC, Hasson MS, Wedekind JE, Palmer DR, Barrett WC, Reed GH, et al. (1996) The enolase superfamily: a general strategy for enzyme-catalyzed abstraction of the α-protons of carboxylic acids. Biochemistry. 35(51): 16489–16501, doi: 10.1021/bi9616413.
- Hedstrom L. (2002) Serine protease mechanism and specificity. Chemical reviews. 102(12): 4501–4524, doi: 10.1021/cr000033x.
- Burley SK, David PR, Taylor A, Lipscomb WN (1990) Molecular structure of leucine aminopeptidase at 2.7-Å resolution. Proceedings of the National Academy of Sciences. 87(17): 6878–6882.
- Pabo CO, Peisach E, Grant RA (2001) Design and selection of novel Cys2His2 zinc finger proteins. Annual review of biochemistry. 70(1):313–340, doi: 10.1146/annurev.biochem.70.1.313.
- Burge S, Parkinson GN, Hazel P, Todd AK, Neidle S (2006) Quadruplex DNA: sequence, topology and structure. Nucleic acids research. 34(19): 5402–5415, doi: 10.1093/nar/gkl655.