The methods we choose for an analysis strongly influence its outcome. Across science, it is common to find many different methods that perform the same basic task, each making its own assumptions about the data. This variety of methods and assumptions makes it harder to communicate the methodology used in an experiment. It also makes it harder for new researchers to decide which methods to use in their own experiments.
My work has centred on the computational collection and analysis of methodologies from full-text scientific articles. We have already conducted a survey of 22,000 articles related to phylogenetic analysis. Our results highlight field-specific and temporal trends in phylogenetic practice. We have also investigated the influence of 'Expert' authors relative to others in their field.
Text mining software is now highly sophisticated. For various reasons, however, much of this software is not made available to the wider research community. I have been working on core text mining software and services (see below) to automate common text mining tasks, such as corpus collection. I have deliberately made my software available through a range of interfaces (browser, web service, API, local GUI) to support different needs. Most of the software is written in Java, which runs on a wide range of machines. I have also released the vast majority of the code for these projects under standard open source licenses, so it is available to view, download and generally mess around with. All of my software projects make use of third-party open source libraries and projects; without these, much of the work would have been significantly more laborious.
Scientific literature is now widely available in a variety of electronic formats. This enables us to employ computational methods to survey vast literature collections for information. An article can contain many different forms of information that are relevant to other researchers (e.g. citations, protocols, algorithms, hypotheses, sequence data). As a preliminary step to isolating and extracting these different forms of information, I have developed a text classifier that can label the sections of an article (e.g. Introduction, Results). This enables us to target information mining software at specific sections of an article, based on the information we are seeking, thus improving accuracy and reducing the computational effort required.
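As an illustration only, the idea of labelling sections and then restricting mining to one of them can be sketched in Python. This is not the classifier described above: it swaps in a simple heading-keyword rule, and the keyword table, function names and sample article text are all hypothetical.

```python
import re

# Hypothetical mapping from canonical section labels to heading keywords.
SECTION_KEYWORDS = {
    "Introduction": ("introduction", "background"),
    "Methods": ("methods", "materials and methods", "methodology"),
    "Results": ("results",),
    "Discussion": ("discussion", "conclusion", "conclusions"),
}


def label_heading(line):
    """Return a canonical label if the line looks like a section heading, else None."""
    # Drop leading numbering such as "2." or "3.1" before comparing.
    text = re.sub(r"^\s*\d+(\.\d+)*\.?\s*", "", line).strip().lower()
    for label, keywords in SECTION_KEYWORDS.items():
        if text in keywords:
            return label
    return None


def split_sections(article_text):
    """Group non-empty lines under the most recently recognised heading."""
    sections = {}
    current = "Unlabelled"
    for line in article_text.splitlines():
        label = label_heading(line)
        if label is not None:
            current = label
            sections.setdefault(current, [])
        elif line.strip():
            sections.setdefault(current, []).append(line.strip())
    return sections


def mine_section(sections, label, pattern):
    """Run a regular expression over one labelled section only."""
    text = " ".join(sections.get(label, []))
    return re.findall(pattern, text)


# Made-up article text, used purely to exercise the sketch.
ARTICLE = """Introduction
Earlier studies used PAUP for tree building.
2. Materials and Methods
Trees were inferred with MrBayes and RAxML.
Results
The MrBayes tree was well supported.
"""

sections = split_sections(ARTICLE)
# Only the Methods section is searched, so the mention of PAUP
# in the Introduction is not reported.
print(mine_section(sections, "Methods", r"\b(?:PAUP|MrBayes|RAxML)\b"))
# prints ['MrBayes', 'RAxML']
```

Restricting the search to the Methods section means software that is merely mentioned in passing elsewhere is not counted as having been used, and less text has to be scanned.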
Mining Semantic Networks of Bioinformatics e-Resources from the Literature
Proceedings of the SWAT4LS Workshop, 2009, Amsterdam, The Netherlands
Biomedical event detection using rules, conditional random fields and parse tree distances
Proceedings of the Workshop on BioNLP: Shared Task, NAACL-HLT 2009, Boulder, USA
Full-Text Mining: Linking Practice, Protocols and Articles in Biological Research
Proceedings of the BioLink SIG, ISMB 2008, Toronto, Canada
Methodology capture: discriminating between the "best" and the rest of community practice
BMC Bioinformatics 9:359, doi:10.1186/1471-2105-9-359