Sequence Analysis Workshop

Alex Mitchell, Paul Bradley, Anna Divoli & Terri Attwood
School of Biological Sciences and Department of Computer Science,
University of Manchester


AIM: The aim of this workshop is to cover some of the sequence analysis tools
available to researchers to help characterise unknown sequences.

1) Introduction

We have been given the following protein sequence, cloned and sequenced by a researcher working in a laboratory:
>new_sequence
MRSSSAAWLLGAAILLAASLSCSGTIQGTNRSSKGRSLIGKVDGTSHVTGKGVTVETVFS
VDEFSASVLTGKLTTVFLPIVYTIVFVVPLPSNGMALWVFLFRTKKKHPAVIYMANLALA
DLLSVIWFPLKIAYHIHGNNWIYGEALCNVLIGFFYGNMYCSILFMTCLSVQRYWVIVNP
MGHSRKKANIAIGISLAIWLLVVIPLYVVKQTIFIPALNITTCHDVLPEQLLVGDMFNYF
LSLAIGVFLFPAFLTYVLMIRMLRSSAMDENSEKKRKRAIKLIVTVLATYLICFTPSNLL
LVVHYFLIKSQGQSHVYALYIVALCLSTLNSCIDPFVYYFVSHDFRDHAKNALLCRSVRT
VKQMQVSLTSKKHSRSYSSSSTTVKTS
We would like to work out the potential function of the protein encoded by
the sequence and find the protein family to which it belongs. A common 
starting point for this is to identify previously characterised sequences 
which the protein is similar to, and infer function and family relationships
on the basis of this similarity. This can be done by a pairwise comparison 
between the query sequence and a database of proteins. One way in which to 
do this is to perform a BLAST search.

2) Searching primary databases: BLASTing our sequence

The BLAST programs (Basic Local Alignment Search Tool) are a set of sequence comparison algorithms that are used to search sequence databases for optimal local alignments to a query (if you would like to know more about BLAST, tutorials are available here).

Click here to open the NCBI protein-protein BLAST site in a new window. Copy and paste the query sequence into the box provided, then click on the BLAST! button to perform the search. Once your job has been submitted to the queue, click on the Format! button to receive your results (this might take a minute or two if the server is busy).

The analysis returns a list of hits in descending order of significance. The column labelled E-value gives the number of alignments scoring at least as well that would be expected to arise by chance in a database of this size. An E-value of 0, or one with a large negative exponent (e.g. 1e-50), indicates a potentially significant alignment; E-values approaching 1, or exceeding it, represent a level of similarity that is more likely to have arisen by chance. Examine your results and click on the links to learn more about the matches.
Questions:
Which sequences are the top scoring matches against your query sequence?
Are they all the same type of proteins?
Which sequences might we define as poor matches?
Can we infer the function of our query protein from the matches?
Can we infer the family to which it belongs?
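
To make the meaning of the E-value more concrete, the sketch below uses the standard relationship between a normalised (bit) score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n the total database length. The function name and the numbers used (a 400-residue query, a database of 1e8 residues) are illustrative assumptions, not values from the workshop; real BLAST reports also apply corrections to m and n that are ignored here.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance alignments with at least this bit score.

    Simplified form E = m * n * 2**(-S'); BLAST itself uses corrected
    ("effective") lengths, which this sketch does not model.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical query and database sizes, chosen only for illustration.
    query_length = 400
    database_length = 1e8
    for bits in (30, 50, 100, 200):
        e = expect_value(bits, query_length, database_length)
        print(f"bit score {bits:>3}: E-value {e:.1e}")
```

Running it shows why strong hits have E-values with large negative exponents, and why E-values are not capped at 1: a low-scoring alignment can be expected many times over by chance in a large database.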

3) Searching 'pattern' databases

As a next step in the analysis of our sequence, we might want to search secondary, or "pattern", databases. Such a search would reveal whether the sequence matches a signature for a given protein family. A number of different pattern resources are available, each with different underlying analysis methods. It is therefore sensible to look at more than one to get an overall picture. For the moment we will focus on three different databases: PROSITE, Pfam and PRINTS.

PROSITE: PROSITE is a database of patterns and profiles which can be used to identify known families to which uncharacterised sequences belong. If you'd like to find out more about PROSITE and its construction, the user manual can be found here. To search PROSITE, click here and copy and paste the sequence into the box provided. Check the 'Exclude patterns with a high probability of occurrence' box and click on 'QuickScan' to perform a search of PROSITE. Click on the links to learn more about the matches.
Questions:
What matches does the search return?
Do these correspond with our BLAST results?
Does this give us any additional information about our query sequence?
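
To give a feel for what a PROSITE pattern is, the sketch below translates the pattern notation into a Python regular expression and scans a sequence with it. It handles only the common elements of the syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for the sequence ends) and is a simplification, not the method used by the PROSITE search tools. The example pattern, N-{P}-[ST]-{P}, is the N-glycosylation site pattern; the helper names are assumptions made for this illustration, and only the first line of the query sequence is scanned, for brevity.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                          # patterns end with a full stop
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} means any residue except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) means 2 to 4 repeats
    regex = regex.replace("x", ".")                      # x means any residue
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
    return regex


def find_matches(pattern, sequence):
    """Return (1-based start, matched text) for each match, overlaps included."""
    # The lookahead lets matches that overlap each other all be reported.
    compiled = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in compiled.finditer(sequence)]


if __name__ == "__main__":
    # First line of the workshop query sequence only (truncated for brevity).
    first_line = "MRSSSAAWLLGAAILLAASLSCSGTIQGTNRSSKGRSLIGKVDGTSHVTGKGVTVETVFS"
    print(prosite_to_regex("N-{P}-[ST]-{P}."))           # N[^P][ST][^P]
    print(find_matches("N-{P}-[ST]-{P}.", first_line))   # [(30, 'NRSS')]
```

Short patterns such as this one match many unrelated sequences by chance, which is presumably why the search form offers the option to exclude patterns with a high probability of occurrence.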
Pfam: Next, we will perform a search of Pfam, which houses a collection of profile hidden Markov models which can be used to diagnose protein families and domains. If you would like to find out more about Pfam, a selection of help and FAQ pages is available online. To search the database, click here and copy and paste the sequence into the box provided, clicking on 'Search Pfam' to submit the sequence. The job is submitted to a queue and may take several minutes to run - click on 'retrieve' every 30 seconds or so to check whether it has completed. Click on the graphic to learn more about your results.
Questions:
What matches does the search return?
Do these correspond with our BLAST and PROSITE results?
Do these give us any additional information about our query sequence?
PRINTS: Now we will search PRINTS, which is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family. If you'd like to know more about PRINTS and its construction methods, the user guide is available here. To search PRINTS, click here and once again paste the sequence into the box provided, clicking on 'Send Query' to begin the search (note: the PRINTS search tool, FPScan, only accepts raw sequences as input, so you may have to remove the ">new_sequence" line to get it to run). The PRINTS match-list also provides a hierarchy so that you can determine any relationship of the matches to each other. Once you have clicked on your results and read their annotation, click on the 'view relations' link to see if your hits are related.
Questions:
What are the highest scoring matches?
What are their relationships to each other?
What does the weight of evidence suggest your query sequence might be?
Does this correspond with our previous results?
Have we learned any additional information about our query sequence?
To return a graphical overview showing where the motifs have matched the query sequence, click on the GRAPHScan and Motif3D links.
Question:
How do these graphical views of the motif matches relate to each other?
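
The raw sequence that FPScan expects (see the note on the ">new_sequence" line above) can be prepared by hand, but for many sequences a small script is more convenient. The following is a minimal sketch, assuming the input is a single FASTA record held in a string; the function name is hypothetical and the example uses only the first line of the query sequence.

```python
def fasta_to_raw(text):
    """Strip the FASTA header and line breaks, returning the bare sequence."""
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the ">name" header line
        residues.append(line)
    return "".join(residues)


if __name__ == "__main__":
    # Header plus the first line of the workshop sequence (truncated for brevity).
    record = (
        ">new_sequence\n"
        "MRSSSAAWLLGAAILLAASLSCSGTIQGTNRSSKGRSLIGKVDGTSHVTGKGVTVETVFS\n"
    )
    print(fasta_to_raw(record))
```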

4) An integrated family database: InterPro

InterPro is an integrated documentation resource for protein families, domains and sites. It combines a number of databases that use different methodologies and a varying degree of biological information on well-characterised proteins to derive protein signatures. To learn more about InterPro you can read the user manual here.

A composite search of PRINTS, Pfam and PROSITE, as well as several other pattern resources, can be performed by searching InterPro. Click here and paste the sequence into the box provided. You will then need to enter an email address into the appropriate space and click on 'Submit Job'. The analysis may take a few minutes to complete.
Questions:
Do the matches correspond with previous results?
Are there any differences between the InterPro results and those obtained by
searching the different databases individually?

5) Annotation tools

Using primary database information

As we have seen, the results of searching pattern databases are different from those of searching primary databases with BLAST. One difference is that BLAST hits are to single sequences and are not gathered together into related groups. In the course of developing automatic tools to aid database construction, we created PRECIS, which gathers together sets of related primary database (Swiss-Prot) entries and collates the results into a structured report. To use PRECIS, click here. Copy and paste your sequence into the box provided and press the 'Send Query' button. Examine the output.
Question:
What happens to the report if you rerun the search with a higher BLAST
cut-off (e.g. 1e-30) in the box provided, allowing more sequences into the
annotation culling process?
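
The effect of the cut-off can be pictured as a simple filter on E-values: a hit is kept only if its E-value is no larger than the cut-off, so raising the cut-off lets more (and weaker) hits through. The sketch below illustrates this with invented hit names and E-values; it is not how PRECIS is implemented, and the rule that hits exactly at the cut-off are kept is an assumption.

```python
def filter_hits(hits, max_evalue):
    """Keep the names of hits whose E-value does not exceed the cut-off."""
    return [name for name, evalue in hits if evalue <= max_evalue]


if __name__ == "__main__":
    # Hypothetical hits, made up purely to illustrate the filtering step.
    hits = [
        ("hit_A", 1e-120),
        ("hit_B", 1e-45),
        ("hit_C", 1e-28),
        ("hit_D", 0.5),
    ]
    print(filter_hits(hits, 1e-50))  # stricter cut-off: ['hit_A']
    print(filter_hits(hits, 1e-30))  # higher cut-off:   ['hit_A', 'hit_B']
```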
Mining online literature

Although primary sequence databases hold lots of useful data about their entries, much more information can be found in the biomedical literature. However, because the literature is so large, finding relevant information is time-consuming. Click this link to go to BioIE, a rule-based information extraction system which can be used to mine PubMed.

In the keywords box, enter the probable name of your protein, which you should have deduced by now (if not, it is proteinase activated receptor 2). In the number of abstracts box, enter 50, and press the 'Retrieve Abstracts' button. You should now see a page containing your PubMed results, with links to word distributions which you might like to explore.

When you are ready to perform the information extraction proper, choose "Diseases & Therapeutic Compounds" from the 'type of extraction' pull-down menu and click on the 'Extraction' button. Examine your results.
Questions:
What diseases/syndromes may the protein be involved in?
Was this information found in any of the databases we've examined?

6) Alignment tools: CINEMA

Finally, now that we know what our sequence is, we would like to refine the alignment of the family from which we create our fingerprints. We have begun this using an automated alignment tool, but the results are jumbled. To access CINEMA, our alignment editor, click 'exceed' and log on with your username and password to start a Linux session. Type 'cinema5' to start the editor, then choose 'open' and select proteasear.seqs. Align the sequences!