A new version of the filter tool is now available as part of a preconfigured workflow (currently the workflow is on its own, new workflows will be available soon with the filter tool included).
The Filter 2 program retains all of the functionality of the original tool, with the added ability to update the RFAM database files. The original Filter program used an old version of the t/rRNA database taken from the original sRNA tools programs. The updated database is likely to be much larger and will therefore add to the processing time.
We used the small RNA Workbench among other bioinformatics tools to conduct a “miRNA Workshop 2016” at the Earlham Institute. The goal was to train early-career bioinformaticians from across the globe in how to analyse small RNA data generated by next-generation sequencing, for both animal and plant research.
Main fixes relating to the preconfigured pipeline for Quality Checking, Normalisation and Differential Expression
The tutorial video for this pipeline is now available and can be found here
Interface updated to show more information about which nodes are running, correctly configured, completed, and awaiting user input (a key is displayed on the interface)
File Hierarchy Wizard
Replicate files now index from 1 instead of 0
Multiple files can be selected when adding replicates to samples
Options added for outputting aligned sequence files
Bug fixes including:
File removal works correctly for all file types
Performance and UI improvements
Major performance improvements
Users can now remove ranges of size classes from the dataset using the sliders at the top of the interface, or individual size classes using the input text box found in the options menu
Bug fixes for file removal, which now works as expected
Performance improved greatly
Export options improved
Some fixes for JavaFX CSS validation problems that caused black marks to appear. The issue still remains in some plots; if a plot is unusable, use the export menu to view the file in a web browser.
Fixes for menu disable functionality when certain graphs are not available
The Jaccard Index table is no longer produced for the second report. Producing the data for this table is extremely time consuming and added little information to the plot produced in the second report.
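For context, the Jaccard index between two sets of sequences is the size of their intersection divided by the size of their union, which is why computing it across every pair of large sRNA samples is so time consuming. A minimal sketch (the class and method names are illustrative, not the workbench's own code):

```java
import java.util.HashSet;
import java.util.Set;

public class JaccardExample {
    // Jaccard index: |A ∩ B| / |A ∪ B|; 1.0 by convention for two empty sets.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> sample1 = Set.of("AACG", "TTGA", "GGCA");
        Set<String> sample2 = Set.of("AACG", "TTGA", "CCCT");
        // 2 shared sequences out of 4 distinct sequences overall.
        System.out.println(jaccard(sample1, sample2)); // prints 0.5
    }
}
```

For millions of unique sequences per sample, building the intersection and union for every pair of samples dominates the runtime, which is what motivated dropping the table from the second report.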
Normalisation now waits for the user to press continue before running; this gives users time to select the desired normalisations before the module begins
KL divergence graph now renders more clearly
If only one normalisation is selected in the previous stage the node automatically selects it
Users can now control the window length (default 4000nt)
Fixes for some of the issues with annotation display
This release is an Alpha test for the newest version of the workbench (check the frequently encountered problems for a list of known issues). This version comes with an early release of a new tool that will be included in the final 4.0 Workbench Release and a new look to the interface.
The workbench will be migrating completely to the new interface over the coming months.
The new interface discards the single-tool, multi-document style used from Version 2 up to this point in favour of a workflow style. Users currently have only one preconfigured workflow to choose from; more preconfigured options will be coming soon, along with the ability to completely customise a workflow for a user’s own needs.
In addition to the new workflow, a completely new backend has been included. This replaces all of the previous IO-based in-memory calculations with a database module that should help to counteract some of the memory-requirement problems faced by the software. More details will be given in a post on the Workbench Diaries section, similar to that of the Adapter Removal low memory post.
Some new functionality for sequence alignment has also been included
A new front end in the form of a Workflow model has been included. Users can still view the old style from the new interface by clicking the version 3.2 button and using the program as before.
Bowtie can now be used from the wizard to align sequences; however, currently only pre-indexed genome files should be used in this mode. Pre-indexed genomes can be found at the Bowtie website: http://bowtie-bio.sourceforge.net/index.shtml
Quality Checking and Normalisation
A brand new tool for quality checking and normalisation of small RNA data. This tool works as a flow from setup through review and completion. Users can set up the software to reflect exactly how their experiment was designed, check the initial quality of the data, deselect problematic files, and then normalise the results.
Users can then export the data in CSV format for use in GEO uploads, or as normalised FASTA files. Users can also export all of the graphs shown in the quality checking stage of the workflow. Further details can be found here
A second workflow containing all of the previous normalisation and QC analysis nodes but with the addition of a differential expression node is also included. Further details can be found here
Changes to the Version 3 build…
As previously mentioned, the version 3.2 release can still be accessed from this build by clicking the “Show version 3.2” button, and the Version 4.0 alpha can be returned to from the tools menu.
The version 3 tools will continue to be updated until a time that they move to the new interface or are completely replaced. As such, some tools from the previous version have received improvements. Details below:
Low memory version has received some back-end improvements to processing speed
Low memory version no longer produces errors when processing HD adapter samples
Users can now specify the output directory for miRCat results when running from the command line. The extra flag -output_directory followed by the desired location can be added to the miRCat instruction.
A more informative message is now given to the user and written to the log if the input file has bad formatting or invalid lines that will cause the software to crash
A fix for occasional command-line crashes during GA tracking data initialization
Adapter Removal LM (Low Memory)
The first in a set of major changes to the underlying data structure of the sRNA Workbench. Memory was becoming a major issue, so our focus switched toward strategies for counteracting the large volumes of data our users wanted to process using the workbench. The new Adapter Removal tool (currently shown as Adapter Removal LM beta) can be used in exactly the same way as the old tool, with one difference: users will notice a redundant version of the direct FASTQ-to-FASTA conversion in their output directory. Further details can be found in the new series of diary posts I will be making for the new tools. The Adapter Removal low memory diary can be found here.
Adapter Removal (All Versions)
The GUI no longer outputs HD stats if the HD adapter trimming option was not selected
Users now have the option to change the background colours and the colours of the glyphs used to represent GFF annotations, aligned small RNA files (including the aggregated mode), and the backgrounds of all tiers to suit their usage. Users should use the Help->Settings menu on the VisSR panel to change the colours they are interested in.
A new version of Ghostscript has been included that no longer requires an X11 build to be present on OS X computers (as support for X11 was removed in Mountain Lion)
A bug that left a hardcoded value in place for the offset absolute differential expression value has been fixed
Rounded P values that read 0 have now been replaced with a more informative message. Remaining P values have been rounded to two significant figures
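One way such formatting can be implemented (an illustrative sketch, not the workbench's actual code; the method name and threshold parameter are invented for the example):

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.math.RoundingMode;

public class SigFigExample {
    // Round a p-value to two significant figures; report values below the
    // chosen threshold with a message rather than a misleading "0".
    static String formatP(double p, double minRepresentable) {
        if (p < minRepresentable) {
            return "< " + minRepresentable; // more informative than printing 0
        }
        BigDecimal rounded = new BigDecimal(p)
                .round(new MathContext(2, RoundingMode.HALF_UP));
        return rounded.toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(formatP(0.012345, 1e-4)); // prints 0.012
        System.out.println(formatP(0.00001, 1e-4));  // prints < 1.0E-4
    }
}
```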
The CLI output now matches the updated output style created for the GUI mode
Fixed an issue in GUI mode where genome matches were not being summed up correctly
Fixed an issue where normalised counts were being summed over rows incorrectly
This is the first of the existing tools to take advantage of the low memory updates that will be rolling out over this development period. This tool was chosen as the precursor because it is a relatively simple algorithm and lends itself well to the chosen design strategy.
What did we change?
Previously, a typical run of the Adapter Removal tool would see a FASTQ file loaded into memory, processed, then output; stats would be gathered on the input file and on what happened during processing, and then reported to the user. The new design replaces the traditional Java style of loading data into the JVM with memory-mapped buffers of data, with all IO operations handled exclusively by the operating system rather than Java. Chunks of information are read directly into RAM and incrementally handed off to disk as they are processed. Memory-mapped files are independent of Java heap space and therefore trigger fewer out-of-heap exceptions at runtime, which are typically unrecoverable. In addition, this type of IO allows full random access to the data, something that will come in handy for the other tools in the workbench.

One of the main issues with this type of process is that creating a non-redundant version of small RNA data typically requires information about the entire file, because small RNA sequences can appear in any order in the original FASTQ data. Originally the program would record each sequence it found and then simply add to the abundance count when it came across the sequence later in the file. Obviously, this leads to each unique sequence being stored in memory, which takes up heap space. The new system uses an external sorting algorithm to sort the processed sequences and then update the counts after the last occurrence of each sequence in the incoming stream of data. Additional stats about the state of the data at each stage can also be gathered in this way, allowing the same graph output functionality as the original tool.
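The counting step described above, in which each sequence's abundance is finalised at its last occurrence in the sorted stream, can be sketched as follows (class and method names are invented for illustration; the real tool consumes a sorted on-disk stream rather than an in-memory list):

```java
import java.util.ArrayList;
import java.util.List;

public class SortedStreamCounter {
    // After an external sort, identical sequences arrive adjacently, so each
    // unique sequence's count can be emitted at its last occurrence without
    // ever holding all unique sequences in memory at once.
    static List<String> countSorted(List<String> sortedSeqs) {
        List<String> out = new ArrayList<>();
        String current = null;
        long count = 0;
        for (String seq : sortedSeqs) {
            if (!seq.equals(current)) {
                if (current != null) {
                    out.add(current + "\t" + count); // last occurrence passed
                }
                current = seq;
                count = 0;
            }
            count++;
        }
        if (current != null) {
            out.add(current + "\t" + count); // flush the final sequence
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sorted = List.of("AACG", "AACG", "AACG", "TTGA", "TTGA");
        // Two unique sequences, with abundances 3 and 2.
        System.out.println(countSorted(sorted));
    }
}
```

Only one sequence and one counter need to be live at any moment, which is what keeps the heap footprint flat regardless of how many unique sequences the file contains.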
How effective were the changes?
Heap memory usage was tracked using the Java monitoring console on the OS X operating system with a fixed data chunk size of 1MB (this value will be modifiable in the future, allowing users to favour speed or RAM depending on requirements and hardware). Additional tests have been performed on Windows 7 and Ubuntu 11 with very similar results. If any readers would like to see the screen grabs from these runs, let me know in the comments below and I will upload them. Below is a trace of the heap memory used when processing four FASTQ datasets with a total size of 9.3GB. It is worth noting at this point that Java will make the most of any heap it finds in order to reduce the amount of garbage collection, so the same data can use more heap depending on how much is available. This is important here because the old version of the tool required a certain amount of heap or it would crash with an out of memory exception; the new version was tested with the same heap size but can in fact run with a far lower amount. A low heap size run is also shown in the gallery (for the new tool only, for obvious reasons). This used 500MB of heap to demonstrate that the tool can run with smaller overall RAM sizes.
It is clear that the new version can function with orders of magnitude less memory. However, run time can be dramatically affected too, mainly because of the extra work required for sorting and preparation of data (the original version could, for example, process the FASTQ data directly without the need for prior conversion). In the gallery above a run is shown with a 1MB chunk size. Therefore, users may wish to make use of the upcoming chunk size modifier, although there is likely to be a sweet spot I am yet to discover. Allowing 1GB chunks of data to be processed at one time, for example, would reduce the IO operations significantly and reduce runtime dramatically if required.
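The chunked, memory-mapped reading discussed in this post can be sketched in Java roughly as follows (the file contents and chunk size are invented for the demo; this is not the workbench's actual code):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedReadExample {
    // Walk a file in memory-mapped chunks instead of loading it onto the
    // Java heap; here we simply count bytes, where the real tool would
    // parse FASTQ records from each buffer.
    static long countBytes(Path input, long chunkSize) throws IOException {
        long total = 0;
        try (FileChannel channel = FileChannel.open(input, StandardOpenOption.READ)) {
            long fileSize = channel.size();
            for (long offset = 0; offset < fileSize; offset += chunkSize) {
                long length = Math.min(chunkSize, fileSize - offset);
                // The mapping lives outside the Java heap; the OS pages the
                // data in and out on demand.
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
                while (buffer.hasRemaining()) {
                    buffer.get(); // ...process one byte of data here...
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".fastq");
        Files.write(tmp, "@read1\nACGT\n+\nIIII\n".getBytes());
        System.out.println(countBytes(tmp, 8)); // prints 19 for this 19-byte file
        // Delete at exit; on Windows an immediate delete can fail while a
        // mapping is still referenced.
        tmp.toFile().deleteOnExit();
    }
}
```

A larger chunk size means fewer `map` calls and fewer trips through the loop setup, which is the speed/RAM trade-off the upcoming chunk size modifier is intended to expose.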
Any known problems so far?
Yes, the major problem with this type of setup is in fact only present on the Windows OS. Unfortunately, Windows does not allow any process to delete memory-mapped files even if they are no longer required by the program. According to the Java developers, this problem can never be fixed as it is down to Microsoft. In short, I ask the garbage collector to kick in; if it does, the file can be deleted, but there is no way to force the garbage collector, only to hint. If the file is not deleted, the user will have to do this by clearing the User/temp directory after running. Other problems I will add here as they are reported to me.
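The hint-and-retry pattern described above looks roughly like this (an illustrative sketch, not the workbench's exact code):

```java
import java.io.File;

public class MappedFileCleanup {
    // On Windows a memory-mapped file cannot be deleted while any mapping is
    // still referenced. All we can do is drop our references, hint the
    // garbage collector, and retry the delete a few times.
    static boolean tryDelete(File file, int attempts) throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (file.delete()) {
                return true;
            }
            System.gc();       // a hint only; the JVM may ignore it
            Thread.sleep(100); // give the collector a moment to run
        }
        return false; // caller falls back to asking the user to clear temp
    }

    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("mapped", ".tmp");
        // With no live mapping the first delete succeeds.
        System.out.println(tryDelete(tmp, 5)); // prints true
    }
}
```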
A work in progress…
The reason I am releasing this tool as a beta is to get a grip on some of the problems that will inevitably arise as the tool is stressed by our users. I will continue to modify the tool to get closer to the run time of the original program while keeping RAM usage low.
This series of posts is about a set of major changes that will be happening under the hood of the sRNA Workbench.
These changes have a single and specific aim: to reduce the memory footprint of the workbench.
One of the major problems facing the workbench in its current state (and a large proportion of bioinformatic tools in my opinion) is the amount of RAM required to process and analyse large datasets.
While the size of a single dataset (that is, one sample forming part of an experiment) is unlikely to increase unless the research requires deeper than normal sequencing, the overall size of an entire experiment is increasing and continues to do so. As the cost of producing such data decreases, the volume will increase (a problem that is mentioned time and time again).
As more complex analysis tools are added and the features of existing tools are improved the amount of memory required to run the workbench is increasing. Therefore, a major goal over the coming development period is to drastically reduce this need for resources for all existing tools in the workbench and any new tools that are added.
Interested users can follow the changes in the posts made for each release that contains a new version of a tool. Also, if you download the new versions of the software and use the new tools, you can really help by reporting any problems you find with them.
Links to posts on the new tools can be found below and will be updated as they are ready: