We would like to work out the potential function of the protein encoded by
the sequence and find the protein family to which it belongs. A common
starting point for this is to identify previously characterised sequences
which the protein is similar to, and infer function and family relationships
on the basis of this similarity. This can be done by a pairwise comparison
between the query sequence and a database of proteins. One way in which to
do this is to perform a BLAST search.
2) Searching primary databases: BLASTing our sequence
The BLAST programs (Basic Local Alignment Search Tool) are a set of sequence
comparison algorithms that are used to search sequence databases for
high-scoring local alignments to a query (if you would like to know more about
BLAST, tutorials are available here).
Click here to open the NCBI protein-protein BLAST site in a new window.
Copy and paste the query sequence into the box provided.
Now click on the BLAST! button to perform the search. Once your job has been
submitted to the queue, click on the Format! button to receive your results
(this might take a minute or two to run if the server is busy).
The analysis returns a list of hits in descending order of significance.
The column labelled E-value gives the number of alignments with an equal or
better score that would be expected to occur by chance in a database of this
size. An E-value of 0, or one with a large negative exponent (e.g. 1e-50),
indicates a potentially significant alignment, whereas E-values approaching 1
or above represent a level of similarity that is more likely to have arisen
by chance.
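To make the E-value column more concrete, the sketch below uses the standard relationship between a BLAST bit score S' and its E-value, E = m x n x 2^(-S'), where m is the query length and n is the total number of residues in the database. The function names, the 1e-5 threshold and the example lengths are illustrative assumptions rather than values taken from this practical, and the NCBI server also applies effective-length corrections that are omitted here.

```python
def e_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least `bit_score`.

    Uses E = m * n * 2**(-S'), ignoring effective-length corrections.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def looks_significant(e: float, threshold: float = 1e-5) -> bool:
    """Return True when the E-value is below an (assumed) working threshold."""
    return e < threshold


if __name__ == "__main__":
    # Hypothetical search space: a 400-residue query against 10**8 residues.
    query_len, db_len = 400, 100_000_000
    for score in (20, 50, 100, 400):
        e = e_value(score, query_len, db_len)
        print(f"bit score {score:>3}: E = {e:.3g}, significant: {looks_significant(e)}")
```

Very high bit scores give E-values so small that they are reported as 0, which is why a value of 0 marks the strongest hits in the results table.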
Examine your results and click on the links to learn more about the matches.
|
Questions: |
| Which sequences are the top scoring matches against your query sequence? |
| Are they all the same type of proteins? |
| Which sequences might we define as poor matches? |
| Can we infer the function of our query protein from the matches? |
| Can we infer the family to which it belongs? |
3) Searching 'pattern' databases
As a next step in the analysis of our sequence, we might want to search secondary,
or "pattern", databases. Such a search would reveal whether the sequence matches
a signature for a given protein family. A number of different pattern
resources are available, each with different underlying analysis methods.
It is therefore sensible to look at more than one to get an overall picture.
For the moment we will focus on three different databases: PROSITE, Pfam and
PRINTS.
PROSITE: PROSITE is a database of patterns and profiles which can be used to identify
known families to which uncharacterised sequences belong. If you'd like to find
out more about PROSITE and its construction, the user manual can be found here.
To search PROSITE click here and copy and paste the sequence into the box provided.
Check the 'Exclude patterns with a high probability of occurrence' box
and click on 'QuickScan' to perform a search of PROSITE. Click on the links
to learn more about the matches.
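To see what a PROSITE pattern match involves, the following sketch translates PROSITE pattern syntax into a Python regular expression and scans a sequence with it. It uses the N-glycosylation site pattern (PS00001, N-{P}-[ST]-{P}), which is one of the frequently occurring patterns that the 'Exclude patterns with a high probability of occurrence' box filters out. The function names and the toy sequence are hypothetical, the translation covers only the common syntax elements, and this is not how the PROSITE server itself is implemented.

```python
import re

# PROSITE syntax element -> regular expression equivalent.
# Handled: '-' separators, x (any residue), [..] allowed residues,
# {..} forbidden residues, (n) and (n,m) repeats, '<' and '>' terminus anchors.
# Not handled: the rare case of '>' inside square brackets.
_TRANSLATION = {
    "-": "", "x": ".", "{": "[^", "}": "]",
    "(": "{", ")": "}", "<": "^", ">": "$",
}


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE pattern string into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    return "".join(_TRANSLATION.get(ch, ch) for ch in pattern)


def scan(sequence: str, pattern: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched residues) for every match, overlaps included."""
    regex = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_sequence = "AANGSAA"  # hypothetical sequence, not the practical's query
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(scan(toy_sequence, "N-{P}-[ST]-{P}."))
```

Short patterns such as this one match many unrelated proteins by chance, which is why a pattern hit is best treated as a hint to be checked against the BLAST results rather than as proof of family membership.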
|
Questions: |
| What matches does the search return? |
| Do these correspond with our BLAST results? |
| Does this tell us any additional information about our query sequence? |
Pfam: Next, we will perform a search of Pfam, which houses a collection of profile
hidden Markov models which can be used to diagnose protein families and domains.
If you would like to find out more about Pfam, a selection of help and FAQ pages is
available online.
To search the database, click here and copy and paste the sequence into the box
provided, clicking on 'Search Pfam' to submit the sequence. The job is submitted to a
queue and may take several minutes to run - click on 'retrieve' every 30 seconds or
so to check whether it has completed. Click on the graphic to learn more about your
results.
|
Questions: |
| What matches does the search return? |
| Do these correspond with our BLAST and PROSITE results? |
| Do these tell us any additional information about our query sequence? |
PRINTS: Now we will search PRINTS, which is a compendium of protein fingerprints.
A fingerprint is a group of conserved motifs used to characterise a protein family.
If you'd like to know more about PRINTS and its construction methods, the user guide is
available here.
To search PRINTS, click here and once again paste the sequence into the box provided,
clicking on 'Send Query' to begin the search (note: the PRINTS search tool FPScan
only accepts raw sequences as input, which means you may have to remove the ">new_sequence"
line to get it to run). The PRINTS match-list also provides a hierarchy so that you
can determine any relationship of the matches to each other. Once you have clicked on
your results and read their annotation, click on the 'view relations' link to
see if your hits are related.
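If you prefer not to edit the sequence by hand, the conversion from FASTA to the raw sequence that FPScan expects can be sketched as below. The function name and the example residues are hypothetical; only the ">new_sequence" header comes from this practical.

```python
def fasta_to_raw(text: str) -> str:
    """Drop FASTA header lines (those starting with '>') and all whitespace."""
    residue_lines = [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith(">")
    ]
    # Join the remaining lines and remove any spaces left inside them.
    return "".join("".join(residue_lines).split()).upper()


if __name__ == "__main__":
    # Hypothetical residues used purely to show the input layout.
    example = ">new_sequence\nMRSP SAPW\nllga\n"
    print(fasta_to_raw(example))
```

The output is a single unbroken string of residues that can be pasted straight into the PRINTS search box.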
|
Questions: |
| What are the highest scoring matches? |
| What are their relationships to each other? |
| What does the weight of evidence suggest your query sequence might be? |
| Does this correspond with our previous results? |
| Have we learned any additional information about our query sequence? |
To return a graphical overview showing where the motifs have matched the query
sequence, click on the GRAPHScan and Motif3D links.
|
Question: |
| How do these graphical views of the motif matches relate to each other? |
4) An integrated family database: InterPro
InterPro is an integrated documentation resource for protein families, domains
and sites. It combines a number of databases that use different methodologies
and a varying degree of biological information on well-characterised proteins
to derive protein signatures. To learn more about InterPro you can read the user
manual here.
A composite search of PRINTS, Pfam and PROSITE, as well as several other
pattern resources, can be performed by searching InterPro. Click here and
paste the sequence into the box provided. You will then need to enter an
email address into the appropriate space and click on 'Submit Job'. The
analysis may take a few minutes to complete.
|
Questions: |
| Do the matches correspond with previous results? |
| Do the InterPro results differ from those obtained by searching the |
| different databases individually? |
5) Annotation tools
Using primary database information
As we have seen, the results of searching pattern databases differ from
those of searching primary databases with BLAST. One difference is that BLAST
hits are to single sequences and are not gathered together into related
groups. In the course of developing automatic tools to aid database
construction, we created PRECIS, which gathers together sets of related primary
database (Swiss-Prot) entries and collates the results into a structured report.
To use PRECIS, click here. Copy and paste your sequence into the box provided
and press the 'Send Query' button. Examine the output.
|
Question: |
| What happens to the report if you rerun the search with a higher BLAST |
| cut-off (e.g. 1e-30) in the box provided, allowing more sequences into the |
| annotation culling process? |
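The effect of the cut-off can be illustrated with a small sketch. It shows only the general behaviour of an E-value threshold, not how PRECIS itself works, and the hit names and E-values are invented for illustration.

```python
# Hypothetical BLAST hits as (identifier, E-value) pairs; not real results.
HITS = [
    ("hit_A", 1e-120),
    ("hit_B", 3e-75),
    ("hit_C", 2e-42),
    ("hit_D", 5e-31),
    ("hit_E", 4e-12),
]


def hits_within_cutoff(hits, cutoff):
    """Keep the hits whose E-value is at or below the cut-off."""
    return [name for name, e in hits if e <= cutoff]


if __name__ == "__main__":
    for cutoff in (1e-50, 1e-30):
        kept = hits_within_cutoff(HITS, cutoff)
        print(f"cut-off {cutoff:g}: {len(kept)} hits kept -> {kept}")
```

A higher (less stringent) cut-off admits more distantly related sequences, so any report built from those hits may draw on a broader and possibly noisier set of annotations.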
Mining online literature
Although primary sequence databases hold lots of useful data about their
entries, much more information can be found in the biomedical literature.
However, because the literature is so large, finding relevant information is
time-consuming. Click this link to go to BioIE, a rule-based information extraction
system which can be used to mine PubMed.
In the keywords box, enter the probable name of your protein, which you should
have deduced by now. If you have not, the name is proteinase activated receptor 2.
In the number of abstracts box, enter 50, and press the 'Retrieve Abstracts'
button. You should now see a page containing your PubMed results with links to
word distributions which you might like to explore. When you are ready to perform
the information extraction proper, choose "Diseases & Therapeutic Compounds" from
the list provided in the 'type of extraction' pull down menu and click on the
'Extraction' button. Examine your results.
|
Questions: |
| What diseases/syndromes may the protein be involved in? |
| Was this information found in any of the databases we've examined? |
6) Alignment tools: CINEMA
Finally, now that we know what our sequence is, we would like to refine the
alignment of the family from which we create our fingerprints. We have begun
this using an automated alignment tool, but the results are jumbled.
To access CINEMA, our alignment editor, click 'exceed' and log on with your
username and password to start a Linux session. Type 'cinema5' to start the
editor, then choose 'open' and select proteasear.seqs. Align the sequences!