Adapter Removal LM

The next iteration of the Adapter Removal Tool

This is the first of the existing tools to take advantage of the low-memory updates that will be rolling out over this development period. It was chosen as the precursor because it is a relatively simple algorithm and lends itself well to the chosen design strategy.

What did we change?

Previously, a typical run of the Adapter Removal tool would load a FASTQ file into memory, process it, and write the output. Stats on the input file and on what happened during processing were gathered along the way and reported to the user. The new design replaces the traditional Java style of loading data into the JVM with memory mapped buffers, with all IO operations handled by the operating system rather than Java. Chunks of information are read directly into RAM and incrementally handed off to disk as they are processed. Memory mapped files are independent of Java heap space and will therefore trigger fewer out-of-memory errors at runtime, which are typically unrecoverable. In addition, this type of IO allows full random access to the data, something that will come in handy for the other tools in the workbench.

One of the main issues with this type of process is that creating a non-redundant version of small RNA data typically needs information about the entire file, because small RNA sequences can appear in any order in the original FASTQ data. Originally the program would record each sequence it found and then simply add to the abundance count when it came across the same sequence later in the file. This means every unique sequence is stored in memory, which takes up heap space. The new system uses an external sorting algorithm to sort the processed sequences, so that identical sequences sit next to each other, and writes out the count for a sequence after its last occurrence in the incoming stream of data. Additional stats about the state of the data at each stage can also be gathered in this way, allowing the same graph output as the original tool.
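The counting step after the sort can be sketched roughly as follows. This is only a minimal illustration and not the workbench code: the class name `SortedRunCounter` and the method `collapse` are hypothetical, and the input is assumed to be already sorted (the external sort itself is not shown). The point is that only the current sequence and its running count are held in memory at any time.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.BiConsumer;

// Hypothetical sketch: collapse an already sorted stream of sequences
// into (sequence, abundance) pairs without storing every unique sequence.
class SortedRunCounter {

    /**
     * Walks the sorted input once. Each time a run of identical sequences
     * ends, the sequence and its count are handed to the sink.
     * Returns the number of distinct sequences emitted.
     */
    static long collapse(Iterator<String> sorted, BiConsumer<String, Long> sink) {
        String current = null; // sequence of the run being counted
        long count = 0;        // abundance seen so far for that run
        long distinct = 0;
        while (sorted.hasNext()) {
            String seq = sorted.next();
            if (seq.equals(current)) {
                count++;
            } else {
                if (current != null) {
                    sink.accept(current, count); // last occurrence has passed
                    distinct++;
                }
                current = seq;
                count = 1;
            }
        }
        if (current != null) {
            sink.accept(current, count); // flush the final run
            distinct++;
        }
        return distinct;
    }

    public static void main(String[] args) {
        // Toy input, assumed to have been sorted beforehand.
        Iterator<String> sorted =
                Arrays.asList("AAGT", "AAGT", "CCGA", "TTAG", "TTAG", "TTAG").iterator();
        long distinct = collapse(sorted, (seq, n) -> System.out.println(seq + "\t" + n));
        System.out.println(distinct + " distinct sequences");
    }
}
```

In a real run the iterator would be backed by the sorted file on disk and the sink would write to the output file, so heap use stays flat regardless of how many unique sequences there are.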

How effective were the changes?

Heap memory usage was tracked with the Java monitoring console on OS X, using a fixed data chunk size of 1MB (this value will be modifiable in the future, allowing users to favour speed or RAM depending on requirements and hardware). Additional tests have been performed on Windows 7 and Ubuntu 11 with very similar results. If any readers would like to see the screen grabs from these runs, let me know in the comments below and I will upload them. Below is a trace of the heap memory used when processing four FASTQ datasets with a total size of 9.3GB. It is worth noting at this point that Java will make the most of any heap it finds in order to reduce the amount of garbage collection, so the same data can use more heap depending on how much is available. This is important here because the old version of the tool required a certain amount of heap or it would crash with an out of memory exception. The new version was tested with the same heap size but can in fact run with far less. A low heap size run is also shown in the gallery (for the new tool only, for obvious reasons). This run used 500MB of heap to demonstrate that the tool can run with smaller overall RAM sizes.

It is clear that the new version can function with orders of magnitude less memory. However, run time can be dramatically affected too, mainly because of the extra work required for sorting and preparing the data (the original version could, for example, process the FASTQ data directly without the need for prior conversion). The gallery above shows a run with a 1MB chunk size. Users may therefore wish to make use of the upcoming chunk size modifier, although there is likely to be a sweet spot I am yet to discover. Clearly, allowing 1GB chunks of data to be held in RAM at one time would reduce the number of IO operations significantly, for example, and reduce run time dramatically if required.
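To make the chunk size trade-off concrete, here is a minimal sketch of reading a file through memory mapped chunks of a configurable size. It is not the tool's actual reader: the class `ChunkedMappedReader`, its `Stats` holder and the `countLines` method are hypothetical names, and the per-chunk "work" is just counting newlines. It uses the standard `FileChannel.map` call and reports how many chunks had to be mapped, which drops as the chunk size grows.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: scan a file through read-only memory mapped chunks.
class ChunkedMappedReader {

    /** Result of one pass: newline count and number of chunks mapped. */
    static final class Stats {
        final long lines;
        final long chunksMapped;

        Stats(long lines, long chunksMapped) {
            this.lines = lines;
            this.chunksMapped = chunksMapped;
        }
    }

    static Stats countLines(Path file, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long lines = 0;
        long chunks = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long pos = 0; pos < size; pos += chunkSize) {
                long len = Math.min(chunkSize, size - pos);
                // The mapped region lives outside the Java heap.
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
                chunks++;
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') {
                        lines++;
                    }
                }
            }
        }
        return new Stats(lines, chunks);
    }
}
```

Calling `countLines(path, 1 << 20)` would correspond to the 1MB chunk size used in the tests above. A larger value means fewer mapping operations at the cost of more RAM per chunk, and where the sweet spot lies would need measuring on real data.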

Any known problems so far?

Yes. The major problem with this type of setup is in fact only present on Windows. Unfortunately, Windows does not allow any process to delete a memory mapped file while the mapping is still held, even if the program no longer needs it. According to the Java developers, this problem can never be fixed as it is down to Microsoft. In short, I ask the garbage collector to kick in. If it does so, the file can be deleted, but there is no way to force the garbage collector, only to hint. If the file is not deleted, the user will have to remove it by clearing the User/temp directory after running. I will add other problems here as they are reported to me.
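The "hint, then try to delete" approach can be sketched like this. It is an illustration under assumptions rather than the tool's code: `MappedTempCleanup` and `deleteWithGcHint` are hypothetical names, and the retry count and 50ms pause are arbitrary. `System.gc()` is only a request, so the method can still fail, in which case it falls back to `deleteOnExit()` (which may itself fail if the mapping is still alive at shutdown).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: best-effort deletion of a temp file that may
// still be pinned by a memory mapping that has not been collected yet.
class MappedTempCleanup {

    /**
     * Tries to delete the file up to 'attempts' times, hinting the garbage
     * collector between tries. Returns true if the file is gone, false if
     * deletion had to be deferred to JVM exit.
     */
    static boolean deleteWithGcHint(Path file, int attempts) {
        for (int i = 0; i < attempts; i++) {
            try {
                Files.deleteIfExists(file);
                return true; // deleted now, or already absent
            } catch (IOException stillLocked) {
                System.gc(); // a hint only; the JVM is free to ignore it
                try {
                    Thread.sleep(50); // arbitrary pause before retrying
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        file.toFile().deleteOnExit(); // last resort, not guaranteed either
        return false;
    }
}
```

For the hint to have any chance of working, the program must first drop every reference to the mapped buffer, since the mapping is only released once the buffer object itself has been collected.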

A work in progress…


I am releasing this tool as a beta to get a grip on some of the problems that will inevitably arise as the tool is stressed by our users. I will continue to modify the tool in an attempt to get close to the run time of the original program while keeping RAM usage low.

  • Andrew

    Is there any way for the adaptor removal to be applied to the fastq file and return a fastq file? (instead of fasta)
