Distributing the computation is also an important requirement because bioinformatics applications can require very different resources depending on the analysis to perform: multiple sequence alignment, genome assembly or intensive protein sequence comparison. Biologists and bioinformaticians regularly combine multiple software packages to analyze their data. They run these packages for their intensive processing from Web portals, from shell commands, or from scripts written in interpreted languages.
Regarding the computations and the virtual machines, the main requirements are satisfying the software dependencies and accommodating the very different behavior of biological applications in terms of CPU and memory. Some applications require only one CPU but a lot of memory (96 or 128MB), whereas others require many CPUs that are accessed through MPI. We have built a virtual machine with pre-installed bioinformatics software. To install the required software we used a script system, called `bioapps', that we had developed. This tool downloads the application package from the reference site and installs the compiled binary on the machine. Because bioinformatics applications need access to reference data to run their analyses, this bioinformatics compute appliance is linked to the biological databases repository appliance and must mount the exported volumes containing the biological data. We have predefined a bioinformatics appliance with software such as ClustalW, BLAST, FastA and SSearch. For now, users must connect and run the applications by hand, but we are planning to add Web interfaces. These could be a local Web portal where users connect to submit their data and run a tool, or Web service interfaces (with SOAP or RESTful endpoints) that users could integrate into their standard bioinformatics workflows.
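The download-and-install step described above can be pictured with a short Python sketch. It is not the `bioapps' code itself, whose implementation is not shown here; the function names `fetch_package` and `install_binary`, the tar archive format and the target directory are all assumptions made for illustration.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path


def fetch_package(url: str, dest: Path) -> Path:
    """Download an application archive to dest (hypothetical helper, needs network access)."""
    urllib.request.urlretrieve(url, dest)
    return dest


def install_binary(archive: Path, binary_name: str, bin_dir: Path) -> Path:
    """Copy one compiled binary out of a tar archive into bin_dir.

    Assumes the package is a tar archive that already contains a compiled
    binary called binary_name somewhere in its tree.
    """
    bin_dir.mkdir(parents=True, exist_ok=True)
    target = bin_dir / binary_name
    with tarfile.open(archive) as tar:
        for member in tar:
            # Match on the file name only, whatever directory it sits in.
            if member.isfile() and Path(member.name).name == binary_name:
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                target.chmod(0o755)  # make the installed binary executable
                return target
    raise FileNotFoundError(f"{binary_name} not found in {archive}")
```

A call such as `install_binary(Path("package.tar.gz"), "clustalw2", Path("/usr/local/bin"))` would then place the binary on the command-line path; both the archive name and the destination are placeholders.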
You can deploy your own instance of the Bioinformatics Compute Node appliance. It is available from the StratusLab appliances repository under the 'bio/compute' sub-directory. Once deployed, you connect to your instance through ssh as with any usual StratusLab virtual machine. We have pre-installed several bioinformatics applications that are available from the command line: BLAST, FastA, SSearch and ClustalW2. During deployment, the appliance automatically mounts the StratusLab reference biological databases repository volume as the local directory '/biodb'. These databases can then be used for bioinformatics analyses with sequence similarity tools such as BLAST, FastA and SSearch. In further developments, we are planning to add Web interfaces to these bioinformatics tools to make them easier for scientists to use on the cloud.
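As an illustration of using the mounted databases from the command line, the sketch below wraps a protein similarity search in Python. It assumes the BLAST+ `blastp` executable is the one installed (the appliance may instead ship the legacy `blastall` program, which takes different options), and the database location under '/biodb' used in the usage note is a made-up placeholder, since the layout of the repository volume is not described here.

```python
import os
import shutil
import subprocess
from pathlib import Path

BIODB = Path("/biodb")  # mount point of the reference databases volume


def blastp_command(query, database, output):
    """Build a BLAST+ blastp command line (assumes BLAST+ is the installed flavor)."""
    return ["blastp", "-query", str(query), "-db", str(database), "-out", str(output)]


def run_blastp(query, database, output):
    """Run blastp after checking that the databases volume and the tool are present."""
    if not os.path.ismount(BIODB):
        raise RuntimeError(f"{BIODB} is not mounted; the databases are unavailable")
    if shutil.which("blastp") is None:
        raise RuntimeError("blastp was not found on the PATH")
    # check=True raises CalledProcessError if blastp exits with an error.
    subprocess.run(blastp_command(query, database, output), check=True)
    return Path(output)
```

On a deployed instance, a call would look like `run_blastp("query.fasta", BIODB / "some_bank" / "some_db", "result.txt")`, where the database path is hypothetical and must be replaced by one that actually exists on the volume.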