We used the small RNA Workbench, among other bioinformatics tools, to run a “miRNA Workshop 2016” at the Earlham Institute. The goal was to train early-career bioinformaticians from across the globe to analyse small RNA data generated by next-generation sequencing for both animal and plant research.
This is the first of the existing tools to take advantage of the low memory updates that will be rolling out over this development period. It was chosen as the precursor because it is a relatively simple algorithm and lends itself well to the chosen design strategy.
What did we change?
Previously, a typical run of the Adapter Removal tool loaded a FASTQ file into memory, processed it, wrote the output, and then reported statistics on the input file and on what happened during processing. The new design replaces the traditional Java approach of loading data onto the JVM heap with memory-mapped buffers, with all IO operations handled by the operating system rather than by Java. Chunks of data are read directly into RAM and incrementally handed off to disk as they are processed. Memory-mapped files are independent of the Java heap and should therefore trigger fewer out-of-memory errors at runtime, which are typically unrecoverable. In addition, this type of IO allows full random access to the data, something that will come in handy for the other tools in the workbench.
One of the main issues with this type of process is that creating a non-redundant version of small RNA data typically requires information about the entire file, because small RNA sequences can appear in any order in the original FASTQ data. Originally the program recorded each sequence it found and simply added to the abundance count when it came across the same sequence later in the file. This means every unique sequence is held in memory, which takes up heap space. The new system uses an external sorting algorithm to sort the processed sequences, so that identical sequences sit next to each other and the count for a sequence can be finalised after its last occurrence in the incoming stream of data. Additional statistics about the state of the data at each stage can be gathered in the same way, so the new tool offers the same graph output as the original.
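The counting step described above can be sketched in a few lines of Java. This is only an illustration of the idea, not the code used in the workbench: the class name `SortedAbundanceSketch` and the method `collapse` are made up for this example, and an in-memory sort stands in for the external sort so that the sketch stays self-contained. The point is that once the sequences arrive in sorted order, only the current sequence and its running count need to be kept in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical sketch: collapse a SORTED stream of sequences into
// (sequence, abundance) pairs while holding only one sequence at a time.
final class SortedAbundanceSketch {

    static void collapse(Iterator<String> sorted, BiConsumer<String, Integer> out) {
        String current = null; // sequence whose run is being counted
        int count = 0;
        while (sorted.hasNext()) {
            String seq = sorted.next();
            if (seq.equals(current)) {
                count++; // same run, just bump the abundance
            } else {
                // The run for 'current' has ended, so its count is final.
                if (current != null) {
                    out.accept(current, count);
                }
                current = seq;
                count = 1;
            }
        }
        if (current != null) {
            out.accept(current, count); // flush the last run
        }
    }

    public static void main(String[] args) {
        List<String> reads = new ArrayList<>(
                Arrays.asList("TTGA", "ACGT", "TTGA", "GGCA", "ACGT", "TTGA"));
        // Assumption: a real run would sort on disk (external sort) instead.
        Collections.sort(reads);
        collapse(reads.iterator(), (seq, n) -> System.out.println(seq + "\t" + n));
    }
}
```

Running `main` prints ACGT with a count of 2, GGCA with 1 and TTGA with 3. In a real low memory run the iterator would read from the sorted file on disk rather than from a list.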
How effective were the changes?
Heap memory usage was tracked with the Java monitoring console on OS X, using a fixed data chunk size of 1MB (this value will be modifiable in the future, allowing users to favour speed or RAM depending on requirements and hardware). Additional tests have been performed on Windows 7 and Ubuntu 11 with very similar results. If any readers would like to see the screen grabs from these runs, let me know in the comments below and I will upload them.
Below is a trace of the heap memory used when processing four FASTQ datasets with a total size of 9.3GB. It is worth noting at this point that Java will make the most of any heap it finds in order to reduce the amount of garbage collection, so the same data can use more heap depending on how much is available. This is important here because the old version of the tool required a certain amount of heap or it would crash with an out of memory exception. The new version was tested with the same heap size but can in fact run with far less. A low heap size run is also shown in the gallery (for the new tool only, for obvious reasons). It used 500MB of heap to demonstrate that the tool can run with smaller overall RAM sizes.
It is clear that the new version can function with orders of magnitude less memory. However, run time can be dramatically affected too, mainly because of the extra work required to sort and prepare the data (the original version could, for example, process the FASTQ data directly without the need for prior conversion). The run shown in the gallery above used a 1MB chunk size. Users may therefore wish to make use of the upcoming chunk size modifier, although there is likely to be a sweet spot that I am yet to discover. Allowing 1GB chunks of data to be processed in RAM at one time, for example, would cut the number of IO operations significantly and reduce run time dramatically if required.
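To make the chunk size trade-off concrete, here is a minimal Java sketch of reading a file through memory-mapped chunks. It is not the workbench code: the class `ChunkedMapSketch`, the methods `scan` and `demoFile`, and the tiny generated FASTQ-like file are all assumptions made for illustration. The sketch maps one chunk at a time with `FileChannel.map`, counts the lines it sees, and reports how many mappings were needed, which is roughly the file size divided by the chunk size.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: walk a file in fixed-size memory-mapped chunks.
final class ChunkedMapSketch {

    // Writes a small FASTQ-like temp file (illustrative data only).
    static Path demoFile(int records) {
        try {
            Path file = Files.createTempFile("chunk-demo", ".fastq");
            file.toFile().deleteOnExit();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < records; i++) {
                sb.append("@r").append(i).append("\nACGT\n+\nIIII\n");
            }
            Files.write(file, sb.toString().getBytes(StandardCharsets.US_ASCII));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {number of chunks mapped, number of newline bytes seen}.
    static long[] scan(Path file, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long chunks = 0;
        long newlines = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += chunkSize) {
                long len = Math.min(chunkSize, size - pos);
                // The mapped bytes live outside the Java heap.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                chunks++;
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') {
                        newlines++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new long[] {chunks, newlines};
    }

    public static void main(String[] args) {
        Path file = demoFile(1000);
        for (int chunk : new int[] {1 << 10, 1 << 20}) {
            long[] r = scan(file, chunk);
            System.out.println(chunk + " byte chunks: " + r[0]
                    + " mappings, " + r[1] + " lines");
        }
    }
}
```

Larger chunks mean fewer mappings for the same file, at the cost of more RAM in use at any one moment, which is the trade-off the chunk size modifier is meant to expose.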
Any known problems so far?
Yes, the major problem with this type of setup is in fact only present on Windows. Unfortunately, Windows does not allow a process to delete a memory-mapped file while the mapping still exists, even if the program no longer needs it. According to the Java developers, this problem can never be fixed as it is down to Microsoft. In short, I ask the garbage collector to kick in. If it does so, the file can be deleted, but there is no way to force the garbage collector to run, only to hint at it. If the file is not deleted, the user will have to remove it by clearing the User/temp directory after running. I will add other problems here as they are reported to me.
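The workaround described above looks roughly like the following Java sketch. It is an illustration under stated assumptions rather than the workbench's own code: the class `MappedDeleteSketch`, the methods `deleteWithGcHint` and `demo`, the retry count and the 50ms pause are all invented for this example. It tries to delete the file, hints at garbage collection with `System.gc()` if the delete fails, and as a last resort registers the file for deletion when the JVM exits.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: best-effort deletion of a file that was memory mapped.
final class MappedDeleteSketch {

    // Returns true if the file is gone, false if deletion was only scheduled.
    static boolean deleteWithGcHint(Path file, int attempts) {
        for (int i = 0; i < attempts; i++) {
            try {
                Files.deleteIfExists(file);
                return true;
            } catch (IOException stillMapped) {
                // Only a hint: the JVM may or may not collect the buffer now.
                System.gc();
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        // Fallback, which may also fail on Windows if the mapping survives.
        file.toFile().deleteOnExit();
        return false;
    }

    // Maps a small temp file, drops the buffer, then tries to delete the file.
    static boolean demo() {
        try {
            Path file = Files.createTempFile("mapped-demo", ".tmp");
            Files.write(file, new byte[] {1, 2, 3, 4});
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                buf.get(); // touch the mapping
            }
            // Closing the channel does not release the mapping itself.
            return deleteWithGcHint(file, 5);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo() ? "deleted" : "left for manual clean-up");
    }
}
```

On OS X and Linux the first delete attempt normally succeeds. On Windows the outcome depends on whether the garbage collector acts on the hint, which is why a manual clean-up of the temp directory may still be needed.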
A work in progress…
The reason I am releasing this tool as a beta is to attempt to get a grip on some of the problems that will inevitably arise as the tool is stressed by our users. I will continue to modify the tool in an attempt to get close to the run time of the original program but keep the RAM usage low.
This series of posts is about a set of major changes that will be happening under the hood of the sRNA Workbench.
These changes have a single, specific aim: to reduce the memory footprint of the workbench.
One of the major problems facing the workbench in its current state (and a large proportion of bioinformatics tools, in my opinion) is the amount of RAM required to process and analyse large datasets.
The size of a single dataset (that is, one sample that forms part of an experiment) is unlikely to increase unless the research requires a deeper than normal sequencing depth. The overall size of an entire experiment, however, is increasing and continues to do so. As the cost of producing such data decreases, the volume will increase (a problem that is mentioned time and time again).
As more complex analysis tools are added and the features of existing tools are improved, the amount of memory required to run the workbench is increasing. Therefore, a major goal over the coming development period is to drastically reduce this need for resources, both for all existing tools in the workbench and for any new tools that are added.
Interested users can follow the changes through the posts that accompany each release containing a new version of a tool. If you download the new versions of the software and use the new tools, you can also really help by reporting any problems you find with them.
Links to posts on the new tools can be found below and will be updated as they are ready: