Motivation
Protein function is chiefly encoded in the amino acid sequence of the protein. But how is that function encoded?
Protein sequence data can give us some clues to the answer. If a group or cluster of related proteins always use the same amino acid at the same position in the protein (in other words, if the position is conserved), there is a high probability that it is very important for the protein. If it were not, one would expect tha mutations at that position would have only a relatively small effect on the protein, and over time these mutations would accumulate. Another database we developed, the CoC website, uses conservation information, as well as structural alignments of different sequence families, to find positions that are important for protein folding and protein function. Likewise, experiment data has consistently supported the importance of conserved positions.
But what about residues that are not conserved? We know that related proteins often have different functions. These differences should likewise be encoded in the protein sequence. Experimentally, it has been found that only one or a few mutations can change protein function. For example, three mutations in the transcription factor estrogen receptor change its DNA binding specificity. These positions are termed specificity-determining residues. In the following, we describe a method that predicts these specificity-determining residues, the residue positions where mutations can change the function of the protein.
Constructing the database
Specificity-determining residues are expected to be conserved within groups of proteins that have the same function but to be different between groups of proteins that have different functions. For example,
In the hypothetical protein alignment above, proteins in the same group have the same function, but proteins in different groups have different functions. The first and fourth positions are complete conserved and are not expected to be specificity-determining. The second position, in red, is conserved within the groups that have the same function but is different between the different groups. For this alignment, this would be our prediction of the residue most likely to be specificity-determining.
To find the specificity-determining residues for a protein family, the first step is to group the proteins by function. In previous work, we developed a method that groups proteins into approximate functional groups based solely on sequence data. Using only sequence data allowed us to use all available protein sequences. Other methods lose information by removing sequences where functional data or evolutionary relationships are not known.
The method first places the proteins on a mathematical graph, connecting proteins if they have a minimum level of sequence identity. As this sequence identity cutoff is modified from zero to one, the number of clusters changes from one fully connected cluster to as many different clusters as there are distinct protein sequences. One way to visualize this is to look at the size of the largest cluster. For example, the plot below shows the size of the largest cluster (also known as the giant component) for the different sequence identity cutoffs in the IclR helix-turn-helix domain family. The normalized giant component size is the size of the largest cluster divided by the total number of proteins in the family.
The clustering method uses the phase transition in the size of the giant component to determine the best cutoff to use. Using this cutoff leaves moderately sized clusters of proteins where the proteins in a cluster are more closely related to each other than to the other proteins in the family. Because high sequence identity usually means that two proteins have the same or very similar functions, the clusters should likewise have closely related functions.
With these functional clusters, we can search for specificity-determining residues. In later work, we combined the automated method of grouping proteins by function with a statistical method of predicting specificity-determining residues developed by Mirny and Gelfand. Their method calculates the mutual information of the residues (x) and groups (y) in a column (i) by the equation:

where f(x,y) is the fraction of occurences in the multiple sequence alignment that have residue x and group y. Then, because the value of the mutual information is dependent on the overall distribution of amino acids in a particular column, the statistical significance of the information is determined. We follow their first model, where an expected mutual information is calculated by, first, shuffling the rows of the multiple sequence alignment and recalculating the mutual information, then, second, transforming this information using a maximum likelihood estimator. This estimator is used because proteins in the same group are, by construction, more similar to each other than to proteins in different groups. The method linearly fits the shuffled mutual information to the true information such that the sum of the squares of the z-scores are minimized. This z-score, defined below, is then used as the measure of significance of the true mutual information observed, and therefore the estimate of what is a specificity-determining residue.

Because specificity-determining residues should have a higher than expected amount of mutual information, residues with a large, positive z-score are most signficant. We present two ways of determing which residues are significant predictions. The simplest method is to simply take the ten residue positions with the most significant z-scores. The other method we use is the Bernoulli estimator, first used for specificity-determining residues by Kalinina, Mironov, Gelfand, and Rakhmaninova. This method assumes that the z-scores come from a Gaussian distribution, then calculates the set of most significant positions that is least likely to be the tail of the Guassian distribution, meaning that they are the least likely to be random predictions. In our previous work, we used both methods to choose the specificity-determining residues.
Datasets and methods used for the website
The specificity-determining residues were calculated for protein families in the PFAM database and the GPCRDB database. For PFAM, families with at least 500 proteins in the full multiple sequence alignment were used. Families were required to contain positions with no more than 30% gaps. If a PFAM family has more than 5000 sequences, 5000 sequences were randomly selected from the alignment for the calculation. This restriction was necessary because of computer memory constraints, and it creates a multiple sequence alignment similar to older alignments were fewer protein-coding genes had been sequenced.
For G-protein coupled receptor (GPCR) alignments from the GPCRDB, five multiple sequence alignments were used: the full class A family and four class A subfamilies (amine, olfactory, opsin, and peptide). Class A is the largest GPCR class, and the four subfamilies used are the four largest subfamilies in class A. Considering subfamilies allows one to find positions that have been used for determining the different specificities within a subfamily but that may not have been used in other subfamilies. We only considered the predicted transmembrane helices of the GPCR families because of their functional importance and the higher reliablility of multiple sequence alignments in this region.
To make the website, Raster3D was used to make the static protein images, based on protein structures listed in PFAM or found on the PDB website. Xmgrace was used to show the plots of the giant component size. For the logo plots showing the distribution of amino acids in different functional groups, the program WebLogo was used. For the rotatable images of protein structures with specificity-determining residues highlighted, the program Jmol is used. For searching the database to match a protein sequence, the program HMMER is used.
Copyright © 2007 Donald and Shakhnovich





