MINOTAUR Help Section
Click on the required topic to access help on that subject.
- What is MINOTAUR?
- MINOTAUR is a suite of tools, developed in collaboration with text mining experts at institutes across Europe. It aims to help users extract information from UniProtKB and published literature, or from users' own uploaded text. The package is likely to be useful to users wishing to search PubMed for specific types of information, such as database annotators aiming to collate information relating to a particular sequence or set of sequences.
- The PubMed query tab.
- This entry point allows users to search PubMed directly. Submitting a query leads to an intermediate results page, which displays the query alongside a range of alternative search terms (generated through the application of different PubMed search tags) and the number of documents matching each one. Users can then either rank the matched abstracts (ordering them according to relevancy) or extract sentences that relate to various topics of interest from the documents. Sentence extraction is available through two different approaches: SVM-based or template-based sentence classification. For more information on these approaches, see the SVMs, Rules and Templates section below.
The Advanced Options drop-down menu allows users to select a ranking module with which to order the documents. Users may also fine-tune the sentence extraction process by adjusting the precision and recall settings of the SVM classifiers using a slider bar.
Users can select the number of abstracts to be passed to the rankers/classifiers (up to a maximum of 200). An option to modify the PubMed query terms, and regenerate the page accordingly, is also provided.
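As a rough illustration of how alternative search terms might be generated by applying different PubMed search tags, the sketch below appends a few standard PubMed field tags to a query. The selection of tags and the names `tagged_variants` and `clamp_abstract_count` are assumptions made for this example; this help page does not state which tags MINOTAUR actually applies.

```python
# Hypothetical sketch: the tags MINOTAUR applies are not documented here.
# These are standard PubMed field tags, chosen only for illustration.
PUBMED_TAGS = ("Title", "Title/Abstract", "MeSH Terms", "All Fields")

MAX_ABSTRACTS = 200  # upper limit stated in the help text


def tagged_variants(query, tags=PUBMED_TAGS):
    """Return the original query plus one quoted variant per search tag."""
    query = " ".join(query.split())  # collapse stray whitespace
    if not query:
        raise ValueError("empty query")
    variants = [query]
    for tag in tags:
        variants.append(f'"{query}"[{tag}]')
    return variants


def clamp_abstract_count(requested):
    """Keep the number of abstracts passed on within 1..MAX_ABSTRACTS."""
    return max(1, min(int(requested), MAX_ABSTRACTS))


if __name__ == "__main__":
    for variant in tagged_variants("prion protein"):
        print(variant)
```

Each variant could then be submitted to PubMed separately to obtain the per-query document counts shown on the intermediate results page.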
- The Corpus tab.
- Here users can upload their own corpus (set of abstracts) in PubMed XML format, and utilise the built-in rankers to rank their documents and/or sentence classifiers to extract informative sentences.
When uploading the XML-formatted file, users must provide the query terms that will be used to rank the documents and choose the action to perform (rank documents, rank and extract sentences with SVMs, or rank and extract sentences with templates). For more details on the document ranking and sentence extraction processes, see the corresponding sections below.
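For readers preparing a corpus, the following minimal sketch shows how a file in PubMed XML format could be read into (PMID, title, abstract) records before upload, for example to check that it parses. The sample record is made up, and `read_corpus` is a hypothetical helper, not part of MINOTAUR.

```python
import xml.etree.ElementTree as ET

# Made-up sample record using the element names of the PubMed XML format.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract>
          <AbstractText>First sentence. Second sentence.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def read_corpus(xml_text):
    """Return a list of (pmid, title, abstract) tuples from PubMed XML."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID", default="")
        title = article.findtext(
            "MedlineCitation/Article/ArticleTitle", default=""
        )
        # An abstract may be split into several AbstractText sections.
        parts = [
            "".join(node.itertext()).strip()
            for node in article.iterfind(
                "MedlineCitation/Article/Abstract/AbstractText"
            )
        ]
        records.append((pmid, title, " ".join(p for p in parts if p)))
    return records


if __name__ == "__main__":
    for record in read_corpus(SAMPLE):
        print(record)
```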
- The BLAST tab.
- This entry point allows users to BLAST their protein sequence (in FASTA format, or as a UniProtKB/Swiss-Prot ID or accession number) against UniProtKB, and construct a PRECIS report based on information collated from the top hits. Alternatively, users can utilise the PRECIS engine to suggest PubMed search terms and use those to query PubMed via an intermediate results page, similar to that described in the PubMed query tab section above.
NB: the Advanced Options menu can be used to limit the number of sequences returned by BLAST. Also note that UPIs (UniParc identifiers) do not currently work with the system.
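The accepted input types can be told apart with a few pattern checks. The sketch below is one possible way to do so and is not MINOTAUR's own validation code: the accession pattern follows the one published in the UniProt documentation, while the entry-name and UniParc patterns and the function name `classify_blast_input` are simplifying assumptions for illustration.

```python
import re

# Accession pattern as published in the UniProt documentation.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumed simplification: entry names (IDs) look like PRIO_HUMAN.
ENTRY_NAME = re.compile(r"[A-Z0-9]{1,5}_[A-Z0-9]{1,5}")
# Assumed: UniParc identifiers are "UPI" plus 10 hexadecimal characters.
UPI = re.compile(r"UPI[0-9A-F]{10}")


def classify_blast_input(text):
    """Guess which kind of BLAST input the user has supplied."""
    text = text.strip()
    if text.startswith(">"):
        return "fasta"          # FASTA records begin with a '>' header line
    if UPI.fullmatch(text):
        return "upi"            # not accepted by the system at this time
    if ACCESSION.fullmatch(text):
        return "accession"
    if ENTRY_NAME.fullmatch(text):
        return "entry_name"
    return "unknown"


if __name__ == "__main__":
    for example in (">seq1\nMANLGCWMLV", "P04156", "PRIO_HUMAN", "UPI0000000001"):
        print(repr(example), "->", classify_blast_input(example))
```

Checking for UniParc identifiers first allows an explicit "not supported" message to be shown rather than a generic error.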
- The UniProtKB/Swiss-Prot IDs tab.
- This entry point allows users to supply a set of UniProtKB/Swiss-Prot IDs, and to construct a PRECIS report based on information collated from the corresponding database entries. Alternatively, users may utilise the PRECIS engine to suggest PubMed search terms and use those to query PubMed in the manner described in the PubMed Query tab section above.
- The PRECIS engine.
- The PRECIS engine takes a list of UniProtKB/Swiss-Prot entries and analyses specific fields (DE, CC, DR, KW, etc.) to filter and extract relevant information. It then collates the data into a report, containing information relating to the matched protein family, including database cross-references, structural information, disease associations, functional annotation, keywords, and literature references.
When PRECIS is utilised to suggest PubMed search terms, it processes the Swiss-Prot ID, DE fields and CC -!- SIMILARITY and CONTAINS subfields. Keywords are also collected from the KW field, and output as additional potentially useful search terms.
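The field processing described above can be sketched as a small parser over a Swiss-Prot flat-file entry. This is an illustrative approximation only: the sample entry is made up, `suggest_terms` is a hypothetical name, and the real PRECIS engine applies filtering that is not described on this page.

```python
# Made-up entry in Swiss-Prot flat-file layout (two-letter line codes).
SAMPLE_ENTRY = """\
ID   EXMP_HUMAN              Reviewed;         100 AA.
DE   RecName: Full=Example kinase;
CC   -!- FUNCTION: Made-up function text that is not used for
CC       search terms.
CC   -!- SIMILARITY: Belongs to the example protein kinase
CC       family.
KW   Kinase; Transferase.
//
"""


def suggest_terms(entry_text):
    """Collect the ID, DE lines, selected CC topics and KW keywords."""
    entry_id, descriptions, keywords = "", [], []
    comments = []  # list of [topic, text] pairs from CC -!- blocks
    for line in entry_text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code == "ID" and not entry_id:
            entry_id = content.split()[0]
        elif code == "DE":
            descriptions.append(content.rstrip(";"))
        elif code == "KW":
            keywords += [
                k.strip() for k in content.rstrip(".").split(";") if k.strip()
            ]
        elif code == "CC":
            if content.startswith("-!-"):
                topic, _, text = content[3:].partition(":")
                comments.append([topic.strip(), text.strip()])
            elif comments:
                comments[-1][1] += " " + content  # continuation line
    wanted = [
        text for topic, text in comments if topic in ("SIMILARITY", "CONTAINS")
    ]
    return {"id": entry_id, "de": descriptions, "cc": wanted, "kw": keywords}


if __name__ == "__main__":
    print(suggest_terms(SAMPLE_ENTRY))
```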
- Document rankers.
- MINOTAUR uses two ranking modes to order the abstracts returned from PubMed. The first (rank by word) attempts to match each word in the query terms, scoring each abstract according to the frequency of word occurrence. The second (rank by subphrase) attempts to match all of the words in a multiword query in order, awarding the highest score to those abstracts in which the subphrase occurs most frequently. The ranker then attempts to match progressively truncated subphrases by removing words from the end of the query term, awarding a lower score the shorter the matching subphrase. For both ranking modes, the final scores are normalized, so that each abstract receives a score between 0 (no query terms found) and 1 (the most query terms found).
Abstracts in which no query terms are found are not fed into the sentence classifier components of MINOTAUR, since they are deemed irrelevant to the query. However, they are displayed on the Rank Abstracts According to Relevancy results page (identifiable by their zero score).
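The two ranking modes can be approximated as follows. This is a minimal sketch under stated assumptions, not MINOTAUR's implementation: the tokenizer, the weighting of truncated subphrases (subphrase length divided by query length) and all function names are chosen here for illustration, since the help text gives only a qualitative description.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_by_word(query, abstract):
    """Rank by word: total frequency of the query words in the abstract."""
    words = tokenize(abstract)
    return float(sum(words.count(term) for term in set(tokenize(query))))


def count_phrase(phrase, words):
    """Count occurrences of a token sequence within a token list."""
    n = len(phrase)
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase)


def score_by_subphrase(query, abstract):
    """Rank by subphrase: full phrase first, then drop words from the end."""
    terms, words = tokenize(query), tokenize(abstract)
    score = 0.0
    for length in range(len(terms), 0, -1):
        weight = length / len(terms)  # assumed: shorter subphrase, lower weight
        score += weight * count_phrase(terms[:length], words)
    return score


def rank(query, abstracts, scorer):
    """Return (normalized score, abstract) pairs, best first."""
    raw = [scorer(query, a) for a in abstracts]
    top = max(raw, default=0.0)
    scores = [r / top if top else 0.0 for r in raw]
    return sorted(zip(scores, abstracts), key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    docs = [
        "The prion protein binds copper. Prion protein misfolds.",
        "A protein kinase.",
        "Nothing relevant here.",
    ]
    ranked = rank("prion protein", docs, score_by_word)
    # Zero-scoring abstracts are shown but not passed to the classifiers.
    relevant = [abstract for score, abstract in ranked if score > 0]
    print(ranked)
    print(relevant)
```

Dividing by the top raw score gives the best abstract a score of 1 and leaves abstracts with no matches at 0, in line with the normalization described above.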
- SVMs, Rules and Templates.
- MINOTAUR contains a sentence classifier module, the function of which is to excise relevant sentences from the documents passed to it by the rankers. Two separate approaches are utilised to achieve this: SVM- and template-based.
Support Vector Machines (SVMs):
MINOTAUR's SVMs represent a machine-learning approach to sentence classification. They have been trained to identify sentences relating to structure, function, disease, subcellular localisation and tissue specificity. For each category of interest, five different classifier settings are available, each with a different bias towards precision or recall. These settings may be adjusted via the precision/recall slider bar found in the Advanced Options drop-down menu, where contextually relevant.
Templates:
MINOTAUR's templates represent a manual approach to sentence classification, and draw upon a combination of word patterns and rules that have been identified by annotators as pertinent to structure, function, diseases and therapeutic compounds, localisation and familial relationships. Where matching sentences are found, the rule or template is overlaid upon the sentence on the results page, so that reasons for the match can be easily visualised.
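To give a feel for the template-based approach, the sketch below matches sentences against two word patterns and marks the matched span, loosely mimicking the overlay on the results page. The patterns, category names and functions are invented for this example; MINOTAUR's actual rules and templates are not listed on this page.

```python
import re

# Made-up templates for illustration only.
TEMPLATES = {
    "localisation": re.compile(
        r"\b(?:is|are)\s+locali[sz]ed\s+(?:to|in)\s+the\s+\w+", re.I
    ),
    "disease": re.compile(r"\bmutations?\s+in\s+\w+\s+cause\b", re.I),
}


def classify(sentence):
    """Return (category, matched text, start, end) for each template that fires."""
    hits = []
    for category, pattern in TEMPLATES.items():
        match = pattern.search(sentence)
        if match:
            hits.append((category, match.group(0), match.start(), match.end()))
    return hits


def overlay(sentence):
    """Mark matched spans in the sentence (assumes spans do not overlap)."""
    marked = sentence
    # Insert markers from the right so that earlier offsets stay valid.
    for category, _, start, end in sorted(
        classify(sentence), key=lambda hit: hit[2], reverse=True
    ):
        marked = f"{marked[:start]}[{category}: {marked[start:end]}]{marked[end:]}"
    return marked


if __name__ == "__main__":
    print(overlay("The protein is localized to the nucleus."))
```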
- Input formats.
- MINOTAUR takes a range of inputs, depending on the entry point selected.
- For a PubMed query, users can input free text (as if querying PubMed directly).
- For a Corpus query, users may upload a file in PubMed XML format.
- For BLAST, there are two formats accepted:
(a) FASTA format
(b) UniProtKB/Swiss-Prot accession numbers and identifiers.
- For the UniProtKB/Swiss-Prot IDs query, users must supply a list of valid UniProtKB/Swiss-Prot IDs.
- Output formats & example results.
- Results can be output in the following formats:
- Abstracts can be downloaded in plain text or XML format. An example of the abstract output format is available here.
- Extracted sentences are output to the screen, grouped according to the abstract from which they were excised. A hyperlinked PubMed identifier is displayed for each abstract, along with its ranking score. Each sentence has a corresponding checkbox, allowing users to quickly weed out erroneous sentences, selecting only those that they consider useful. Selected sentences can then be visualised and downloaded in plain text format. An example of this kind of output can be found here.
- How to access the Web services.
- To facilitate machine access, the BLAST and Swiss-Prot ID-based entry points are also available as fledgling Web services: the WSDL files are located at www.bioinf.manchester.ac.uk/dbbrowser/minotaur/precis.wsdl and precis_blast.wsdl
- Selected publications.
- Mitchell, A.L., Selimas, I. & Attwood, T.K. (2006)
Challenges for protein family annotation.
In The 17th European Conference on Machine Learning & the 10th European Conference on Principles & Practice of Knowledge Discovery in Databases (ECML/PKDD-2006), September 18-22, Berlin, Germany.
- Mitchell, A.L., Divoli, A., Kim, J-H., Hilario, M., Selimas, I. & Attwood, T.K. (2005)
METIS: multiple extraction techniques for informative sentences.
Bioinformatics, 21(22), 4196-7.
- Divoli, A. & Attwood, T.K. (2005)
BioIE: extracting informative sentences from the biomedical literature.
Bioinformatics, 21(9), 2138-9.
- Mitchell, A.L., Reich, J.R. & Attwood, T.K. (2003)
PRECIS - Protein Reports Engineered from Concise Information in Swiss-Prot.
Bioinformatics, 19(13), 1664-1671.
The METIS, BioIE and PRECIS publications listed above describe previously developed Web-based annotation tools, which have been refined, adapted and interconnected in order to underpin some of the key MINOTAUR modules.
- If you cannot find what you are looking for, try consulting our FAQ or searching the Web.