MINOTAUR FAQ Section
The following topics are addressed (click on the question to see the answer):
- Why would I want to use MINOTAUR?
- MINOTAUR is a tool that gathers information from UniProt and online biomedical abstracts in response to user queries. It was developed to help database curators write annotation about protein families, but we hope it will prove useful for anyone interested in grabbing salient facts from biomedical literature.
- I have a set of abstracts - what information can MINOTAUR help extract from them?
- Provided the abstracts are in PubMed XML format, MINOTAUR can help order them to find the most relevant to your particular interest. In addition, the software can attempt to extract sentences from these abstracts which relate to a variety of topics that are important to biologists and annotators (protein structure, function, disease associations, and so on), and allow you to download the results as a text file.
- I have a query sequence - what can I learn about it using MINOTAUR?
- Using MINOTAUR's BLAST-based entry point, you can search for Swiss-Prot sequences that are related to your query sequence. The PRECIS component of MINOTAUR can then collate the Swiss-Prot information on these sequences into a report. MINOTAUR can also use an adapted version of this process to generate search terms that can be used to interrogate PubMed to find relevant online abstracts, and can extract informative sentences from these documents.
- I have a set of Swiss-Prot entries - what can MINOTAUR tell me about
them?
- MINOTAUR can examine the entries and attempt to detect the relationship between them. It will then create a report (using the PRECIS engine - see "What is PRECIS" below), tailored according to this relationship, based on common Swiss-Prot annotation. MINOTAUR can also use an adapted version of this process to suggest query terms that are relevant to your set of entries. These can be used to search PubMed to find relevant online abstracts. The software can then extract informative sentences from these documents.
- I'm interested in a particular protein or family of proteins - what
kind of information can MINOTAUR help me to retrieve about them?
- MINOTAUR can help search PubMed to identify the online abstracts most relevant to your protein or protein family. Using the in-built set of sentence classifiers, it can then attempt to identify information (in the form of sentences) from these abstracts that relates to protein structure, function, disease associations, subcellular localisation, and tissue specificity, along with sentences that contain family terms.
- What is PRECIS?
- PRECIS is a component of MINOTAUR that takes a list of UniProtKB/Swiss-Prot entries and analyses specific database fields to filter and extract relevant information. It can be used to produce a concise report on the sequences, or to suggest query terms that can be used to search online literature for relevant documents. For more information on PRECIS, see the "PRECIS Engine" section on the Help page. Alternatively, you can find a publication on PRECIS here.
- How does the ranking system work?
- The ranking part of MINOTAUR is very simple. Essentially, it takes the abstracts returned in response to a PubMed query and counts the number of times the query terms are found in each document. The abstracts containing the most terms (hopefully the most relevant) are placed at the top of the list, and those containing the fewest are placed at the bottom. This means that the first results you see should, in theory, be the most relevant abstracts. When used in conjunction with the sentence classifiers, the ranking system discards abstracts in which no query terms are found; these can sometimes occur because the PubMed search engine pulls in articles that are related through their MeSH terms but do not refer directly to the query terms.
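The ranking idea described above can be sketched in a few lines of Python. This is only an illustration, not MINOTAUR's actual code: the function names, the case-insensitive whole-word matching, and the tie-breaking behaviour (ties keep their original order) are all assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple


def count_term_hits(text: str, terms: List[str]) -> int:
    """Count occurrences of every query term in one abstract.

    Assumption: matching is case-insensitive and on whole words only.
    """
    total = 0
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def rank_abstracts(
    abstracts: Dict[str, str], terms: List[str], drop_zero: bool = False
) -> List[Tuple[str, int]]:
    """Sort abstracts by query-term count, highest first.

    With drop_zero=True, abstracts containing no query terms are discarded,
    mirroring the filtering applied before sentence classification.
    """
    scored = [(key, count_term_hits(text, terms)) for key, text in abstracts.items()]
    if drop_zero:
        scored = [item for item in scored if item[1] > 0]
    # sorted() is stable, so abstracts with equal counts keep their input order.
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For example, given three short abstracts keyed "A", "B" and "C" and the query terms "rhodopsin" and "GPCR", an abstract that mentions neither term ends up at the bottom of the list, or is removed altogether when `drop_zero=True`.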
- What are the precision and recall values for the document rankers?
- The Information Retrieval (IR) part of MINOTAUR is handled by the PubMed search engine. The function of the document rankers is simply to sort the results of this process, placing those abstracts in which the query terms occur most frequently at the top of the list (see "How does the ranking system work?" above). It therefore makes little sense to calculate precision and recall levels for the rankers, since they will perform in line with the PubMed engine (albeit with a slight boost to precision at the sentence classification stage, since abstracts containing no query terms are excluded from this step).
- What do the sentence classifiers do?
- The sentence classifiers take the documents passed to them by the rankers and attempt to identify sentences which are related to your selected topic of interest. Two types of classifier are available in MINOTAUR: SVM- and template-based.
- What are SVMs?
- Support vector machines (SVMs) are a set of supervised learning methods that can be used for classification. More information on the methodology can be found here.
- SVM development for MINOTAUR
For MINOTAUR, SVMs were trained to recognise sentences that relate to a variety of topics (structure, function, disease associations, subcellular localisation, and tissue specificity). Investigation of a number of SVM kernels and hyperparameter values found that linear SVMs with a C parameter value of 0.1 performed best at identifying such sentences, so this type of classifier was selected for use by the system.
- What are templates?
- The templates are a combination of word patterns and rules that can be used to identify sentences relating to different topics.
- Template development for MINOTAUR
The development of the templates used by MINOTAUR has been previously described here. To briefly summarise, word patterns relating to structure, function, diseases and therapeutic compounds, localisation and familial relationships were manually identified by annotators. The target patterns chosen included single words, word pairs or small phrases (including non-contiguous phrases). The patterns were then used to build templates.
The level of template complexity rises from word pairs (made of two keywords, or a keyword followed by a preposition), to more complex templates (containing keywords and prepositions, and allowing for a certain number of words in between). For example, for the "diseases and therapeutic compounds" category, valid templates include:
- "[a-z]{1,16}pathy"
- "[A-Z][a-z]{2,10} disease"
- "treatment of ... by"
(where "..." allows for a defined number of words)
The system is configured to find all matching templates in a sentence, but to prioritise the most complex structures.
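As a rough illustration of how the three example templates above might be applied, the following Python sketch encodes them as regular expressions and returns every match, most complex first. It is not MINOTAUR's implementation: the numeric complexity levels, the limit of five intervening words for "...", and the function name are hypothetical choices made for the example.

```python
import re
from typing import List, Tuple

# Assumption: "..." may stand for between one and five intervening words.
MAX_GAP_WORDS = 5

# Each entry is (complexity level, template label, compiled pattern).
# The complexity levels are illustrative, not taken from MINOTAUR.
TEMPLATES = [
    (1, "[a-z]{1,16}pathy", re.compile(r"\b[a-z]{1,16}pathy\b")),
    (2, "[A-Z][a-z]{2,10} disease", re.compile(r"\b[A-Z][a-z]{2,10} disease\b")),
    (
        3,
        "treatment of ... by",
        re.compile(r"\btreatment of(?: \w+){1,%d} by\b" % MAX_GAP_WORDS),
    ),
]


def match_templates(sentence: str) -> List[Tuple[int, str, str]]:
    """Find all template matches in a sentence, most complex template first."""
    hits = []
    for level, label, pattern in TEMPLATES:
        for match in pattern.finditer(sentence):
            hits.append((level, label, match.group(0)))
    # Stable sort: matches from equally complex templates keep discovery order.
    return sorted(hits, key=lambda hit: hit[0], reverse=True)
```

Applied to the sentence "The treatment of diabetic neuropathy by insulin is common in Alzheimer disease patients.", this sketch reports the phrase "treatment of diabetic neuropathy by" first, followed by "Alzheimer disease" and then "neuropathy".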
- How well do the sentence classifiers perform?
- The performance of the classifiers varies according to which approach has been chosen (template- or SVM-based) and the topic under investigation. Overviews of the template-based sentence classification system, including its development and performance, can be found here and here.
For the SVMs, evaluation is slightly complicated, since there are five precision/recall settings for each topic. We calculate that the best-performing precision-biased classifiers display an average precision in the 60-70% range, while the best-performing recall-biased classifiers display an average recall in the 60-80% range, over the five topics. The individual abstracts and results files used to estimate these values are available as a tarball from the following link: SVM results.
- Some of the sentences I got back don't relate to the protein I was
searching for! Why is that?
- At the present time, MINOTAUR returns *all* of the sentences it finds relating to the topic you select (e.g., all of the function sentences, or all of the sentences that contain disease information, etc.). This is because selecting only those involving your query protein would be a very complicated task involving difficult text mining issues such as named entity recognition and anaphora resolution (e.g., "*the protein* activates a signalling cascade", "*it* is a kinase", and so on). This is something we would like to tackle in future projects.
- I selected the "Extract Structure Sentences" option, but some of the
results don't look very structure-like to me! Is the system broken?
- Unfortunately, sentence classification is a non-trivial computational task,
so no system performs with 100% accuracy. We estimate that our best-performing
classifiers achieve 60-70% precision, so it is likely that at least some
of the results you get back will not be correct.
You can use the checkboxes to select only the correct sentences on the results page and regenerate the output (after ticking the checkboxes, choose the 'select sentences' option at the bottom of the page).
Alternatively, if you used the SVM extraction option to generate your results, you can attempt to reduce the number of false positives by increasing the precision of the classifier (use the slider bar in the Advanced Options drop-down menu). This may, however, reduce the number of results you get back, and might mean you miss some potentially useful sentences.
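The trade-off described above can be illustrated with a small Python sketch. It assumes, purely for illustration, that a precision-biased setting behaves like a higher cut-off on a classifier score; how the slider actually works inside MINOTAUR is not described here, and the scores and labels below are invented.

```python
from typing import List, Tuple

# Invented example data: (classifier score, whether the sentence is truly relevant).
SCORED: List[Tuple[float, bool]] = [
    (0.9, True),
    (0.7, True),
    (0.6, False),
    (0.4, True),
    (0.2, False),
    (0.1, False),
]


def precision_recall(
    scored: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Compute precision and recall when sentences scoring >= threshold are kept."""
    tp = sum(1 for score, relevant in scored if score >= threshold and relevant)
    fp = sum(1 for score, relevant in scored if score >= threshold and not relevant)
    fn = sum(1 for score, relevant in scored if score < threshold and relevant)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

With this toy data, a cut-off of 0.3 keeps four sentences (precision 0.75, recall 1.0), while a stricter cut-off of 0.65 keeps only two (precision 1.0, recall about 0.67): fewer false positives, but one relevant sentence is missed.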
- SVMs or templates - which sentence classification approach should I use?
- Our previous work has shown that the different sentence-classification systems perform differently, depending upon the sentence type evaluated. For example, under one set of test conditions, the templates performed better at classifying disease-related sentences than the SVM-based approach, while for structure-related sentences the opposite was true. We suggest you experiment with different template/SVM and precision/recall settings in order to find the best method and optimal calibration for your needs.
- This looks like it might be useful for my research. Can I go ahead and use it?
- Absolutely! MINOTAUR is free and open to all users.
- Who funds MINOTAUR?
- MINOTAUR's development initially received six months' funding from BBSRC's Tools and Resources Development Fund. Refinement and expansion of the application is currently funded by the European Commission, as part of the IMPACT project.