#####################################
#                                   #
# SHARKhunt v1.0                    #
#                                   #
# John Pinney (University of Leeds) #
#                                   #
#####################################

SHARKhunt searches genomic DNA for matches to protein profile models.

1. Installation
===============


You will need a Unix-type system with a recent Java installation
(v1.4+). The SHARKhunt tarball contains everything else you need.


2. The SHARKhunt protocol
=========================



3. Running SHARKhunt
====================


3.1 Input data
--------------

First you will need to prepare your genome data. SHARKhunt will
accept EITHER 
  
a single FASTA-format file containing several genomic sequences
(e.g. contigs, chromosomes etc.) 

OR 

a directory structure like that of the "blankdata" directory
supplied with SHARKhunt. You will see that there are several
subdirectories into which FASTA-format files containing one or many
sequences may be placed:

EST          for EST files
GSS          for GSS files
chromosome   for chromosomes (whole or as contigs)
organelle    for organelle genomes

This structure allows SHARKhunt to treat organellar and EST data
(which does not contain introns) separately from genomic DNA (which
may contain introns). The easiest thing to do is to copy the
"blankdata" directory to your workspace, change its name and then put
the data in the right place.
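
The copy step described above can be sketched in Python. This is only
an illustration: the function name new_genome_dir is hypothetical, and
the check assumes that the four subdirectories listed above are the
ones the template is expected to contain.

```python
import shutil
from pathlib import Path

# Subdirectory names taken from the list in this section.
EXPECTED_SUBDIRS = ("EST", "GSS", "chromosome", "organelle")


def new_genome_dir(blankdata, dest):
    """Copy the blankdata template to dest (which must not exist yet).

    Returns the names of any expected subdirectories that are missing
    from the copy, so an empty list means the layout looks complete.
    """
    shutil.copytree(blankdata, dest)
    return [name for name in EXPECTED_SUBDIRS
            if not (Path(dest) / name).is_dir()]
```

For example, new_genome_dir("blankdata", "my_genome") creates
my_genome/ next to the template; the FASTA files can then be moved
into the matching subdirectories by hand.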

It is a good idea to check that all of your sequences have unique
FASTA comment lines, since these will be used to report the location
of any hits found.
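
A quick way to run this check is a short script such as the one
below. It is a sketch, not part of SHARKhunt: the function name is
hypothetical, and it treats the whole line after ">" as the comment
line to compare.

```python
from collections import Counter


def duplicate_fasta_headers(lines):
    """Return the FASTA header lines that occur more than once.

    lines is any iterable of text lines, for example an open file.
    The leading '>' and surrounding whitespace are removed before
    the headers are compared.
    """
    headers = [line[1:].strip() for line in lines if line.startswith(">")]
    counts = Counter(headers)
    return sorted(header for header, n in counts.items() if n > 1)
```

Usage: open each FASTA file and pass the file object to
duplicate_fasta_headers; an empty list means every comment line in
that file is unique. To check several files together, chain their
lines (for example with itertools.chain) before calling the function.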


3.2 Command-line options
------------------------

Run SHARKhunt using the command:

sharkhunt [options] genome_path genome_id

where genome_path is the single FASTA file or genome directory
containing the data and genome_id is a short identifier for this run
(containing no spaces!). 


The possible options are:

-euk        Eukaryotic genome [T/F]
            (default=true)

	    Specify eukaryotic or prokaryotic DNA.

-blastcut   Cutoff E-value for initial BLAST search [real]
            (default=1.0)

            Sets the sensitivity of the first-pass PSI-BLAST search.

-leeway     No. of bases to extract each side of BLAST hit [integer]
            (default=2000)
	    
	    Determines the size of the region to take forward from the
	    initial PSI-BLAST search to Wise2 analysis. See section 2
	    for further explanation.
	    
-maxregions No. of best BLAST regions to take forward to Wise2 search [integer]
            (default=5)

	    Limits the number of regions analysed by Wise2 for each
	    profile. This is necessary to avoid processing large
	    gene families. By default, the best 5 BLAST hits are taken.

-mode       Wise2 running mode [wing|global|both]
            (default=both)

	    When using HMM profiles, Wise2 can run in "wing" mode,
	    where local alignment is used for the first and last 15
	    model positions and global alignment in between, or
	    "global" mode, where global alignment is used throughout.
	    "wing" mode is more sensitive to remote homologues, but
	    "global" mode tends to result in better E-value
	    scores. Using "both" will make SHARKhunt run slower, but
	    gives the best overall results.

-finalcut   Final E-value cut-off [real]
            (default=0.1)

	    The PSI-BLAST cut-off for the final predictions.

-tmp        Directory for working files [string]
            (default=current directory)

	    Different SHARKhunt runs are able to share the same
	    working directory.

-out        Output directory [string]
            (default=current directory)
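
The effect of the -leeway and -maxregions options can be pictured
with the following sketch. It is an illustration under stated
assumptions, not SHARKhunt's actual code: hits are assumed to be
(start, end, evalue) tuples with 0-based coordinates, "best" is
assumed to mean lowest E-value, and regions are assumed to be clipped
at the ends of the sequence. Overlapping regions are not merged here.

```python
def select_regions(hits, seq_length, leeway=2000, maxregions=5):
    """Pick the best BLAST hits and extend each one by leeway bases.

    hits       -- list of (start, end, evalue) tuples (assumed format)
    seq_length -- length of the sequence the hits lie on
    Returns a list of (region_start, region_end) tuples.
    """
    # Lowest E-value first, then keep at most maxregions hits.
    best = sorted(hits, key=lambda hit: hit[2])[:maxregions]
    return [(max(0, start - leeway), min(seq_length, end + leeway))
            for start, end, _evalue in best]
```

With the defaults, a hit at positions 5000 to 5300 would give a
region from 3000 to 7300 on a sufficiently long sequence.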


Advanced options:

--profiles  Specify profile directory [string]
            (default=./priam_04)

	    For use with user-defined profiles made using SHARKmodel
	    (see section 5).

--blasthome Specify BLAST directory [string]
            (default=./blast)

	    Use an alternative BLAST distribution.

--wisehome  Specify Wise2 directory [string]
            (default=./wise2)

	    Use an alternative version of Wise2.

--Xmx       Specify JVM size [string]
            (default=64M)

	    Increase the size of the Java virtual machine if SHARKhunt
	    runs out of memory.
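
When SHARKhunt is driven from a script, the command line can be
assembled as in the sketch below. The helper name build_command is
hypothetical. It also assumes that each option is given as the flag
followed by its value as a separate word, which this document does
not state explicitly; adjust the code if the real syntax differs.

```python
def build_command(genome_path, genome_id, options=None,
                  executable="sharkhunt"):
    """Return the argument list for one SHARKhunt run.

    options maps flags to values, e.g. {"-blastcut": 0.5}. The
    'flag value' layout is an assumption, not documented syntax.
    """
    if not genome_id or any(ch.isspace() for ch in genome_id):
        raise ValueError("genome_id must be non-empty and contain no spaces")
    command = [executable]
    for flag, value in (options or {}).items():
        command += [flag, str(value)]
    # Positional arguments come last: sharkhunt [options] genome_path genome_id
    command += [genome_path, genome_id]
    return command
```

The resulting list can be passed to subprocess.run(command,
check=True) from the Python standard library.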


4. Output files
===============

All output file names are prefixed with the "genome_id" command-line
argument.

SHARKhunt generates log files during the run:

.search:  Shows the progress of the profile search.

.filter:  Shows which hits have been discarded in favour of
	  alternative functional assignments.

.run:	  Contains information about the options selected for this
	  run.

On completion, a number of output files are created:

.ec:      A list of all functions (EC numbers) predicted along with
          their E-values.

.faa:     Translated coding regions in FASTA format.

.gff:     Locations of hits in GFF format.

.prof:    Breakdown of hits found for each profile.

.xml:     Complete results in XML format, ready for upload to the
          metaSHARK website.
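
To confirm that a run finished, a script can look for the result
files listed above. The sketch below assumes that each file is named
by joining genome_id directly to the extension (for example
"eco1.ec"); the function name is hypothetical.

```python
from pathlib import Path

# Extensions of the files created on completion, as listed above.
RESULT_SUFFIXES = (".ec", ".faa", ".gff", ".prof", ".xml")


def missing_outputs(out_dir, genome_id):
    """Return the result-file extensions not found in out_dir."""
    out = Path(out_dir)
    return [suffix for suffix in RESULT_SUFFIXES
            if not (out / (genome_id + suffix)).is_file()]
```

An empty list means all five result files are present in the output
directory.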


5. SHARKmodel
=============

A script is provided for creating user-defined profiles for use with
SHARKhunt. Make a directory containing FASTA-format files with names
ending in .faa, one per profile, where each file contains one or more
sequences representing a protein domain. 

To run SHARKmodel, type

sharkmodel model_dir

where model_dir is the name of the model directory. 

For each FASTA file containing more than one sequence, the script will
construct a multiple alignment, an HMM and a PSI-BLAST profile. Single
sequences are converted directly to PSI-BLAST models.
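
Before running SHARKmodel it can be useful to see which files will
receive which treatment. The sketch below only counts FASTA records
in each .faa file and applies the rule just described; it does not
call SHARKmodel, and the function names are hypothetical.

```python
from pathlib import Path


def count_sequences(fasta_text):
    """Count FASTA records by counting lines that start with '>'."""
    return sum(1 for line in fasta_text.splitlines()
               if line.startswith(">"))


def classify_profiles(model_dir):
    """Map each .faa file name in model_dir to its expected treatment."""
    plan = {}
    for path in sorted(Path(model_dir).glob("*.faa")):
        n = count_sequences(path.read_text())
        if n == 0:
            plan[path.name] = "no sequences found"
        elif n == 1:
            plan[path.name] = "PSI-BLAST model only"
        else:
            plan[path.name] = "alignment, HMM and PSI-BLAST profile"
    return plan
```

Files reported as "no sequences found" are worth fixing before the
directory is passed to sharkmodel.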

The input directory is now ready to be used in SHARKhunt searches: use
the --profiles option to specify the new profile directory.

You can also take advantage of the PRIAM-style AND and OR rules for
combining profiles. To do so, name the initial data files according
to the convention

XpY.faa

where Y represents the family of profiles (for example, an EC number)
and X is a unique identifier. For example, PRIAM contains four
profiles representing the function EC 3.6.3.15:

1p3.6.3.15
3p3.6.3.15 
8p3.6.3.15
10p3.6.3.15 
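
The naming convention can be checked with a small parser such as the
one below. It is a sketch that assumes the identifier X never contains
the letter "p", so the first "p" in the name separates X from Y; the
function name is hypothetical.

```python
def parse_profile_name(name):
    """Split 'XpY' or 'XpY.faa' into (identifier, family).

    Assumes the identifier X contains no 'p', so the name is split
    at the first 'p'. Raises ValueError for names that do not fit.
    """
    stem = name[:-4] if name.endswith(".faa") else name
    identifier, separator, family = stem.partition("p")
    if not separator or not identifier or not family:
        raise ValueError("not in XpY form: " + name)
    return identifier, family
```

For example, parse_profile_name("10p3.6.3.15.faa") gives the
identifier "10" and the family "3.6.3.15".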

By default, the detection of any one of the profiles representing a
family is sufficient to declare that the function is present in an
organism - this is called an "OR" rule and corresponds to having
several distinct families for a single function. Alternatively,
profiles may represent components of an enzyme complex. In this case,
all profiles must be found in order for the function to be present -
this is the PRIAM "AND" rule. Rules containing an "AND" must be
specified in a file called "collections_AND" within the profile
directory. Each line in this file relates to a different function. For
3.6.3.15, the rule is

 8p3.6.3.15 OR 10p3.6.3.15 OR 3p3.6.3.15 AND 1p3.6.3.15

Here AND binds more tightly than OR: the function is asserted if
8p3.6.3.15 is present, or if 10p3.6.3.15 is present, or if both
3p3.6.3.15 and 1p3.6.3.15 are present.
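
The way such a rule is read can be expressed in a few lines of
Python. This is an illustrative sketch of the precedence described
above, not SHARKhunt's own parser; it assumes that profile names are
separated by the words OR and AND with single spaces, as in the
example line, and the function name is hypothetical.

```python
def rule_satisfied(rule, found):
    """Evaluate one collections_AND line against detected profiles.

    rule  -- a line such as 'A OR B OR C AND D'
    found -- a set of profile names that were detected
    AND binds more tightly than OR, so the rule is split on OR first
    and every name within one alternative must be present.
    """
    for alternative in rule.strip().split(" OR "):
        required = [name.strip() for name in alternative.split(" AND ")]
        if all(name in found for name in required):
            return True
    return False
```

With the 3.6.3.15 rule above, finding only 3p3.6.3.15 does not
satisfy the rule, while finding 3p3.6.3.15 together with 1p3.6.3.15
does.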

