Adapter Removal

Available from Version: 1.0

When sequencing devices produce a list of small RNAs, the minimum read length often exceeds the length of the small RNA. Depending on the device, this results in sequenced reads with adaptor sequences at one or both ends of the read. The Adaptor Removal tool can remove these adaptor sequences, making sRNA data ready for analysis and processing by other tools.

The tool quickly and efficiently processes high-throughput sequencing data in FastQ or FastA format to produce a FastA file containing trimmed reads with redundancy removed. The tool processes the input file in the following manner:

  • Optionally trim 5′ adaptor from beginning of all reads. Reads not containing a 5′ adaptor, if specified, are discarded.
  • Trim 3′ adaptor from end of all reads. Reads not containing a 3′ adaptor are discarded.
  • All trimmed sequences outside a user specified length range are discarded.
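The three steps above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the workbench's actual implementation: the function name `trim_read` is hypothetical, the default length range is borrowed from the example parameter file quoted in the comments further down, and taking the first occurrence of each adaptor is an assumption.

```python
def trim_read(read, adaptor_3, adaptor_5=None, min_length=16, max_length=35):
    """Return the trimmed sequence, or None if the read is discarded.

    Illustrative sketch only: exact string matching, first occurrence wins.
    """
    # Step 1 (optional): trim the 5' adaptor; discard reads that lack it.
    if adaptor_5 is not None:
        start = read.find(adaptor_5)
        if start == -1:
            return None
        read = read[start + len(adaptor_5):]

    # Step 2: trim the 3' adaptor; discard reads that lack it.
    end = read.find(adaptor_3)
    if end == -1:
        return None
    trimmed = read[:end]

    # Step 3: discard trimmed sequences outside the requested length range.
    if not (min_length <= len(trimmed) <= max_length):
        return None
    return trimmed
```

For example, `trim_read("TCGGGCCAGAGATTCGGACCTT" + "TCGTATGCCGTC", "TCGTATGC")` keeps the 22nt insert, while a read with no 3′ adaptor match returns `None`.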

5′ adaptor trimming is an optional step because some sequencing devices automatically trim the 5′ adaptors from sequenced data. For example, Solexa/Illumina reads start at the first base of the sRNA and contain only the 3′ adaptor, whereas 454 datasets contain both the 5′ and the 3′ adaptors, as shown in the diagram below.

The adaptor matching process looks for exact matches in each read to the adaptor sequence. Therefore it will not trim reads with adaptors containing mismatches. In addition it is common, particularly in reads from Illumina/Solexa devices, that the adaptor is truncated in the raw read. For these reasons it is preferable to match using a truncated version of the adaptor sequence. In practice, the first 8nt of the 3′ adaptor and/or the last 8nt of the 5′ adaptor sequence are normally sufficient. This behaviour is easily controlled via the GUI (as shown below), which allows users to specify the full adaptor sequence and then enter the number of nucleotides to use in the matching algorithm. In addition, a set of commonly used adaptor sequences is readily available from drop-down menus, saving the user time when processing data sets that use common adaptors.
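A small Python sketch may help show why the truncated match is preferable. The helper name `matching_fragments` is hypothetical, the 3′ adaptor is the one from the example parameter file in the comments below, and the 5′ sequence is included purely as a placeholder for illustration.

```python
def matching_fragments(adaptor_3=None, adaptor_5=None, n=8):
    """Return the fragments used for exact matching:
    the first n nt of the 3' adaptor and the last n nt of the 5' adaptor."""
    frag_3 = adaptor_3[:n] if adaptor_3 else None
    frag_5 = adaptor_5[-n:] if adaptor_5 else None
    return frag_3, frag_5


FULL_3 = "TCGTATGCCGTCTTCTGCTTG"        # 3' adaptor from the example parameter file
FULL_5 = "GTTCAGAGTTCTACAGTCCGACGATC"   # placeholder 5' adaptor, for illustration only

frag_3, frag_5 = matching_fragments(FULL_3, FULL_5)

# A read in which the sequencer captured only the first 12 nt of the 3' adaptor.
read = "TCGGGCCAGAGATTCGGACCTT" + FULL_3[:12]

print(FULL_3 in read)   # the full adaptor cannot match a truncated copy
print(frag_3 in read)   # the 8 nt fragment still matches
```

The full adaptor fails to match because the read ends before the adaptor does, whereas the 8nt fragment is still found.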

Adapter Removal Panel
The main interface for the adapter removal tool

After processing has completed, the user can view a table of the job’s execution statistics and a table representing the length distribution of trimmed reads in the results panel below the input panel, as shown in the figure below. This information is also written to file.

Output from AR
The output window for the Adapter Removal tool
  • sRNA Workbench Admin

    Hi Kara,

    Discarding the reads with no adapter match, the sequences containing unassigned nucleotides (Ns), and those with low sequence complexity (defined as fewer than 3 distinct bases) is done automatically within the adapter removal tool. The rationale is that sequences for which the adapter is not present are probably longer than the 20-40nt sRNA range; the sequences containing Ns or with low sequence complexity will only slow down the subsequent steps of the analysis (mainly the matching to the reference genome).
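The two sequence filters described here are simple to express in code. The following Python sketch follows the definition given above (fewer than 3 distinct bases counts as low complexity); the function name is hypothetical and the workbench's own checks may differ in detail.

```python
def passes_sequence_filters(seq, min_distinct_bases=3):
    """Return True if the sequence has no unassigned nucleotides (N)
    and contains at least `min_distinct_bases` different bases."""
    seq = seq.upper()
    if "N" in seq:
        return False            # unassigned nucleotide present
    return len(set(seq)) >= min_distinct_bases
```

For instance, `ATATATATATATATATAT` is rejected because it contains only two distinct bases, while `TCGGGCCAGAGATTCGGACCTT` passes.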

    More on the adapter removal tool can be read here:
    http://srna-workbench.cmp.uea.ac.uk/tools/helper-tools/adapter-removal/

    Irina Mohorianu

  • Noa Sela

    dear all,
    I am trying to run adapter removal from the command line (no gui):
    when I run this command:
    java -jar ~/srna-workbenchV3.2/srna-workbench/Workbench.jar -tool adaptor -f -srna_file 11_S1_L001_R1_001.fastq -out_file /storage1/Data/Aviv_victor_june_2015/29_3_16_Miseq -adaptor TGGAATTCTCGGGTGCCAAGG

    I get the following error:

    uk.ac.uea.cmp.srnaworkbench.utils.LOGGERS.WorkbenchLogger log

    SEVERE: WORKBENCH: AR: Message: 3′ adaptor is mandatory. Please provide a 3′ adaptor sequence (adaptor_sequence_3).;

    Stack Trace: uk.ac.uea.cmp.srnaworkbench.tools.adaptorremover.AdaptorRemover.process(AdaptorRemover.java:223)

    uk.ac.uea.cmp.srnaworkbench.tools.RunnableTool.run(RunnableTool.java:344)

    uk.ac.uea.cmp.srnaworkbench.tools.ToolBox$2.startTool(ToolBox.java:85)

    uk.ac.uea.cmp.srnaworkbench.Main.startWorkbench(Main.java:230)

    uk.ac.uea.cmp.srnaworkbench.Main.main(Main.java:47)

    how do I introduce the 3′ adaptor sequence into the command line?
    Thanks,
    Noa

    • The sRNA Workbench

      Hi Noa,

      So sorry the reply is this late, it only just appeared today in my comments moderation which is very strange.

      Anyway, you must use a parameter file to control the tools from the command line (examples are provided in the data/default_params folder). Enter the 3′ adapter into that file and then give the file to the tool as part of the command.

      Let me know if this helps or if you need any further info,

      Cheers,
      Matt

  • Bhavik

    What is the memory required to run the program? I am getting this error repeatedly: “Out of memory occurred. We advice increasing the amount of memory available to JVM using the -Xmx argument or running the smaller datasets through this machine”.

    • The sRNA Workbench

      Hi Bhavik,

      The amount of memory is variable depending on the input. Have you tried the Low Memory version available in the latest release?

      Cheers,
      Matt

  • Andrew

    Hi, is it possible for the adaptor trimming to just apply to the fastq file, and return a trimmed fastq file?

    Thanks,

    • The sRNA Workbench

      Hi Andrew,

      Sorry for the delay in response. I have been away from the office.

      This currently is not possible, would you like to have this function added to a future release?

      Cheers,
      Matt

  • Ki

    Hi,

    I was wondering how to interpret the number after the alignment in the output stats file. Also, does the program produce information about the location of adapter sequence or is there any way to calculate it based on the stats output? Thanks a lot.

    Ki

    • Hi,

      I am a little unsure what you mean here. We do not give an alignment in the stats file, but rather a length distribution of the sequences that remained after the adapter was trimmed. For example:

      ——-
      19 10228 10226
      20 90544 90519
      21 9263918 9261876
      22 54747 54732
      ——-

      would illustrate a major peak in 21mers left within your sample after processing. We give both a redundant and non-redundant count of sequences and the file is output in non-redundant format. See the example below:

      redundant:
      >TCGGGCCAGAGATTCGGACCTT
      TCGGGCCAGAGATTCGGACCTT
      >TCGGGCCAGAGATTCGGACCTT
      TCGGGCCAGAGATTCGGACCTT

      non-redundant:
      >TCGGGCCAGAGATTCGGACCTT(2)
      TCGGGCCAGAGATTCGGACCTT

      i.e. non-redundant counts are counts of unique sequences (each sequence in a non-redundant file will appear only once in the entire result)
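As a rough illustration of the two ideas above, the Python sketch below collapses a list of trimmed reads into the non-redundant FASTA form and builds a per-length table. The function names are hypothetical, and the column order (length, redundant count, non-redundant count) is an assumption rather than a documented layout of the stats file.

```python
from collections import Counter


def collapse(reads):
    """Return non-redundant FASTA records: one '>SEQ(count)' entry per unique read."""
    counts = Counter(reads)
    return [f">{seq}({n})\n{seq}" for seq, n in counts.items()]


def length_distribution(reads):
    """Return (length, redundant count, non-redundant count) rows, sorted by length."""
    counts = Counter(reads)
    redundant = Counter()
    non_redundant = Counter()
    for seq, n in counts.items():
        redundant[len(seq)] += n        # every copy of the sequence
        non_redundant[len(seq)] += 1    # each unique sequence once
    return [(length, redundant[length], non_redundant[length])
            for length in sorted(redundant)]
```

Calling `collapse` on two copies of `TCGGGCCAGAGATTCGGACCTT` yields the single record `>TCGGGCCAGAGATTCGGACCTT(2)` shown in the example above.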

      The location of the adapter sequence will vary all the way through the file which is how you will end up with sequences of differing length. If you wish to know where the adapter sequence was found in an individual read, you should refer back to your original FASTQ file.

      Does this answer your question?

      cheers,
      Matt

      • Ki

        Thanks for the quick reply.
        I am posting the top line of my .pair.stats file here:
        @SN189:99:C21GKACXX:1:1101:1244:2126 1:N:0:TAGCTTA alignment:130 adapter:29 MM:1 Aligned @SN189:99:C21GKACXX:1:1101:1244:2126 2:N:0:TAGCTTA alignment:130 adapter:29 MM:1 Aligned
        I was wondering what “alignment:130” means in this case. Thanks again.

        • Ki

          Oops! I think I posted in the wrong place 🙁 Sorry

          • Hi,

            Yes I do not think our software produced that file!

            Cheers,
            Matt

  • toral

    The adaptor trimming process trims the 3′ adaptor from the end of all reads, and reads not containing a 3′ adaptor are discarded. My question is: why are reads not containing the 3′ adaptor discarded? Should those sequences not be retained instead?

    • Hi,

      For a 50nt read matched against the first 8nt of the adapter, any read that does not contain the adapter must hold a fragment longer than 50 – 8 = 42nt.
      The sRNAs are regulatory ncRNAs in the 20-30nt range, so if we clone a >35nt RNA fragment it will not be a sRNA, but maybe a mRNA fragment or a long non-coding RNA that the instrument has not fully captured. Using the Solexa sequencing strategy we can only capture small non-coding RNAs. For the longer ones we would use paired-end reads. Therefore we discard the reads that do not contain the adapter, as they will not be used in any further analysis.

      Cheers,
      Matt

  • DIvya Patel

    Hi Dr.Matt
    I am new to sRNA Workbench and at present I am using the Adapter Removal tool. I have SRA data from a plant, 5GB to 6GB in size. My system is 32-bit Linux CentOS and the Java version is 1.6.0_22. When I run the Adaptor Removal tool I get an error like:
    Out of Memory Error occured.
    We advise increasing the amount of memory available to the JVM using -Xmx argument or running smaller datasets through this machine.
    I tried -Xmx and -Xms to increase the memory but the same error occurred. I provided almost 10GB of memory.
    Is there any other way of using the tool to get a result?
    Kindly guide me in this.
    Thank You.
    Sincerely,
    Divya

    • Hi Divya,

      Can you confirm the sra data has been first converted to FASTQ before running through the adapter removal tool?

      Cheers,
      Matt

      • DIvya Patel

        Hi,
        I have converted sra data into FASTQ file.
        hope you reply soon.
        thanks.

        • Hi,

          Ok, a couple of questions, how large is the FASTQ file after extraction from SRA?

          Also, when using the Xms commands, did you apply that to Workbench.jar or sRNAWorkbenchStartup.jar ?

          Cheers,
          Matt

          • DIvya Patel

            hello Matt,
            My sample 1 FASTQ file size is 4.9 Gb and sample 2 FASTQ file size is 5.6 Gb.
            the command i have used to increase memory space is as follows:
            “java -Xms 7000Mb -jar sRNAWorkbenchStartup.jar”
            please help out
            thanks Matt

          • DIvya Patel

            hi matt
            actually i had tried both Workbench.jar or sRNAWorkbenchStartup.jar .
            thank you.

  • Indranil

    Dear Matt,
    I am new to sRNA workbench and at present I am using the Adapter Removal tool. I have a microRNA illumina sequencing file. Here we have used barcodes to mark individual samples and we have mixed 20 different samples simultaneously. The barcode is situated at the 3′ end of the microRNA followed by the illumina adapter sequence. At present I have used the Adapter removal tool to remove the illumina adapter. However to get the sub-samples I guess I have to use the tool several times to remove different barcodes. Is there any other way of using the tool or the workbench so that I don’t have to run the tool several times to get my sub-samples.
    Thanks for your help.

    Best regards,
    Indranil.

    • Hi Indranil,

      Partially the answer is yes, you can use the adapter removal tool to strip the barcode and the adapter sequence in one single run for one file.

      (1) What type of barcoding was used? the inline (for which the barcode is still there) or multiplex (for which the reads are already split into samples).

      (2) First split the samples using only the barcodes.

      (3) Then, for each sample run the adaptor removal tool with barcode+adaptor (to make sure that the resulting sequences do not contain the barcode) see below.

      The barcodes are typically 6nt in length, by default we suggest you remove the adapter sequence based on the first 8nt of the sequence entered into the window, however, for this you will want to type the barcode sequence in, followed by the adapter sequence and use the first 10nt instead (there is no right or wrong answer here but 10 may be ok for your data).

      However, there is currently no functionality for running the tool in batch mode unless you wrap the commands to run the tool from the CLI in a script. I will be looking to add a batch mode for the GUI allowing multiple files to be entered for adapter removal in a future release.

      • Indranil

        Dear Matt,

        Thank you very much for your reply. We have used inline barcoding and the barcode length is 5nt in length. Is there any easy way of splitting the samples using only the barcodes?

        Best regards,
        Indranil.

        • Hi,

          I might mention that, the total number of reads is approx 120m per lane, if you mix 20 samples and if the barcodes introduce the same sequencing bias you will get at most 6m reads per sample which may not be deep enough. Out of those 6m reads only around 70% may match to the genome, in addition Illumina supply 12 stable barcodes of 6nt in length so you may wish to check the efficiency of all of the 5nt barcodes you used to ensure all samples are equally biased.

          However, in order to process your data run your original FASTQ file 20 times with each of your barcodes + adapter sequence and use ~8nt for the matching. The resulting files will then be split per sample (be sure to remember which barcode you used for each output file perhaps encode this into the filename)
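Since there is no batch mode, the twenty runs could be scripted. The Python sketch below only prepares the work: for each barcode it builds the parameter-file text and the command line, reusing the parameter names and command form from the worked example given to Nicolas further down this page. The barcodes, file names and the `plan_runs` helper are hypothetical, and whether the command is accepted exactly as written may depend on the workbench version.

```python
# Full 3' adaptor taken from the example parameter file; replace with your own.
ADAPTOR_3 = "TCGTATGCCGTCTTCTGCTTG"


def params_text(barcode, adaptor_3=ADAPTOR_3, match_length=8,
                min_length=16, max_length=35):
    """Parameter-file text for one sample: barcode followed by the 3' adaptor."""
    return (
        f"adaptor_sequence_3={barcode}{adaptor_3}\n"
        f"adaptor_sequence_3_length={match_length}\n"
        f"min_length={min_length}\n"
        f"max_length={max_length}\n"
    )


def plan_runs(fastq, barcodes, jar="Workbench.jar"):
    """Return one (params_path, params_text, command) tuple per barcode."""
    runs = []
    for barcode in barcodes:
        params_path = f"adaptorremover_{barcode}.cfg"   # hypothetical file name
        out_file = f"{barcode}_no_adap.fa"              # barcode encoded in the name
        command = ["java", "-jar", jar, "-tool", "adaptor",
                   "-srna_file", fastq, "-out_file", out_file,
                   "-params", params_path]
        runs.append((params_path, params_text(barcode), command))
    return runs
```

To execute the plan, write each `params_text` to its `params_path` and pass the corresponding command list to `subprocess.run`. Keeping the barcode in the output file name records which sample each file belongs to.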

          Thanks,
          Matt

  • G.Velmurugan

    Dear Dr. Matt,

    Thank you for your suggestions regarding the rat genome. I have one more question. I am a biologist by training. My fastq/fasta files are around 4-6 GB in size. When I tried to trim adaptors, I received a message saying “JVM memory is not enough, try -Xmx argument or use small size”. I increased the memory up to -Xmx5000m, but in vain.

    Then I split the files using an online program (HJSplit) into files of around 2 GB. Initially trimming worked for a 2 GB file, but now it is not working and ends with the same message: not enough memory.

    Kindly guide me in this regard. The system has 4 GB RAM and the hard disk has around 300 GB free space. The JAVA version 1.7 is installed.

    Thank you,

    Sincerely,
    Velmurugan

    • Hi Velmurugan,

      4GB of RAM may not be enough to process your datasets in one go as you have discovered. If you have split the files into 2GB chunks then you may have enough total RAM but that does not mean there was enough allocated to the actual workbench program. The sRNAWorkbenchStartup.jar program will attempt to discover how much RAM on your system is not being used by other programs and give a portion of it to the workbench.

      I see you have tried with the -Xmx command but did you give this to the Workbench.jar or sRNAWorkbenchStartup.jar program? It needs to be allocated to the Workbench.jar, this effectively bypasses the startup program and forces an amount of RAM to be allocated to the Workbench.

      In addition what operating system are you using? If it is linux or unix based (MAC OSX) then you may need to purge memory before starting (as these operating systems can often use inactive memory for other tasks). If you need some information on how to do this on your operating system please let me know. (or just restart your computer before running the workbench)

      I would first ensure there are no other programs running when you are using the workbench to ensure as much memory as possible can be used for the workbench then try something like:

      java -Xms3g -Xmx3g -jar Workbench.jar

      this will give the program 3GB of RAM to work with your smaller 2GB datasets. Alternatively the program will run on a server if you have access to one?

      Let me know if this helps!

      Matt

      • G.Velmurugan

        Dear Dr. Matt,

        Thank you very much for your immediate response. As per your suggestion, I closed all other programs while running the workbench. Unfortunately, your suggestion (java -Xms3g -Xmx3g -jar Workbench.jar) did not work. The software still says there is not enough memory.

        Regarding the OS, I am using windows 7.

        We plan to buy a new computer dedicated to bioinformatics analysis work in our lab. Please let me know the minimum configuration (RAM, hard disk space, graphics card and other parameters) required for small RNA analysis of files around 7 GB in size.

        Meanwhile, I would be grateful if you could provide some ideas for overcoming the memory problem on our 4 GB RAM system itself.

        Sincerely,
        Velmurugan

        • Hi Velmurugan,

          Well, in terms of minimum specifications it is hard to say, because different types of data require more or less RAM depending on biological factors. In my experience so far, animal data often requires less memory than plant data and is at times faster to process. Moreover, different programs within the workbench may require more or less RAM for the same data set (for example, it will require more memory to predict micro RNA sequences within a dataset than just to filter that dataset).

          My advice would be, determine the maximum amount of funds you have to purchase your new computer and dedicate as much of it as possible to memory. Still consider a fair amount of disc space (especially if you are generating large amounts of data) and get a fairly decent processor if possible. But as data sets increase in size in the future the bottleneck in your analysis will always be memory, even if the slightly slower processor means you will wait a little longer for the results, at least you will get the results! In addition, many of the tools in the workbench use concurrent multi-threading for the procedure so for the most part, the more computing cores the better.

          To overcome your current problem there are a few options open to you, are your 4GB/2GB data files FASTQ data taken directly from a sequencing instrument or are they FASTA formatted? If they are FASTQ then you will probably reduce the file size considerably by converting them to FASTA (this is because each sequence in a FASTA file is represented with two lines in a file rather than the four lines in FASTQ data).
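If a conversion tool is not to hand, FASTQ can be turned into FASTA with a few lines of Python. This is a minimal sketch that assumes the common four-lines-per-record FASTQ layout (no line-wrapped sequences); the function name is hypothetical.

```python
import itertools


def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines (header, sequence) from four-line FASTQ records."""
    # Ignore blank lines, e.g. a trailing newline at the end of the file.
    lines = (line for line in fastq_lines if line.strip())
    while True:
        record = list(itertools.islice(lines, 4))
        if not record:
            break                                   # end of input
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header = record[0].rstrip("\r\n")
        sequence = record[1].rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        yield ">" + header[1:]                      # '@id' becomes '>id'
        yield sequence                              # '+' and quality lines are dropped
```

A typical use is to open the FASTQ file for reading and a new FASTA file for writing, then write each yielded line followed by a newline. Because the input is read record by record, the whole file never has to fit in memory.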

          If they are already FASTA then you will need to split them up for the initial adapter removal process. After this process is complete you will find that your files reduce in size considerably again. This is because we convert the files into a “non-redundant” format: in short, each sequence is represented only once in the resulting file, and the descriptor line tells you how many times that sequence appeared in the original data.

          For example:

          original file:
          >SEQUENCE
          ACTGACTGACTGACTG
          >SEQUENCE
          ACTGACTGACTGACTG
          >SEQUENCE
          ACTGACTGACTGACTG

          resulting file:

          >ACTGACTGACTGACTG(3)
          ACTGACTGACTGACTG

          As you can see this reduces the size of the data set dramatically.

          If you have already converted your dataset to FASTA then I would suggest splitting the data into more manageable chunks or removing the adapters using our web based toolkit. You can find this at:

          http://srna-tools.cmp.uea.ac.uk
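When splitting a FASTA file into more manageable chunks, it is worth cutting on record boundaries so that no header is separated from its sequence. The Python sketch below does this; the function name and the choice of a fixed number of records per chunk are illustrative assumptions.

```python
def split_fasta(lines, records_per_chunk):
    """Yield lists of FASTA lines, each holding at most `records_per_chunk` records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk, n_records = [], 0
    for line in lines:
        if line.startswith(">"):
            # A new record starts here; close the chunk first if it is full.
            if n_records == records_per_chunk:
                yield chunk
                chunk, n_records = [], 0
            n_records += 1
        chunk.append(line)
    if chunk:
        yield chunk                 # final, possibly smaller, chunk
```

Each yielded chunk can be written to its own numbered file and passed through adapter removal separately.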

          Unfortunately, we had a power outage last night and the web interface is currently down. You may find it accessible within a few hours (hopefully!)

          please let me know if this helps.
          Matt

          • Just to let you know, the srna-tools webserver is back online 🙂

  • Nicolas

    Hi,

    I am trying to run the adapter removal tool, without success. Could you give me a concrete example (terminal command) with a concrete srna file to understand how to use this tool?

    Thank you,

    Nicolas

    • admin

      Hi Nicolas,

      we have identified a slight issue when running this tool from the command line. A fix is in place and will be available in our next patch. Please subscribe to the RSS feed for information on this release.

      For now please use the tool from the graphical interface to gain full usage.

      Thanks,
      Matt

      • Nicolas

        Hi Matt,

        Thanks for your reply.
        I have downloaded and installed the new version of the software suite. When using the adaptor removal tool from the command line (I need to use the command line version), I get this error message: “Message: 3′ adaptor is mandatory. Please provide a 3′ adaptor sequence (adaptor_sequence_3).;”
        What is the name of this parameter and where do you place it? Could you give me a concrete example of how to use this tool from the command line?

        Thanks,

        Nicolas

        • admin

          Hi Nicolas,

          in the user/data/default_params directory you will find a file called ‘default_adaptorremover_params.cfg’

          please make a copy of this file and open it in a text editor, then fill in the 3′ adapter sequence and remove any unneeded options (for example 5′ adapter).

          An example of the text inside a completed parameter file is below:

          adaptor_sequence_3=TCGTATGCCGTCTTCTGCTTG
          adaptor_sequence_3_length=8
          min_length=16
          max_length=35

          (the length 8 refers to how many adapter characters you wish to use for the matching)

          Assuming you have a srna file next to the workbench called ‘tomato.fastq’ for example (this file can also be in FASTA format)

          The instruction you will need to enter into the command line is:

          java -jar Workbench.jar -tool adaptor -srna_file tomato.fastq -out_file tomato_no_adap.fa -params default_adaptorremover_params.cfg

          this will then create the trimmed file called tomato_no_adap.fa. Replace the file names with the correct path for your data.

          I hope this helps, please feel free to comment with any further problems or for more information.

          Matt

          • Nandan

            Hi Matt,

            I am trying to use the standalone tool for sRNA analysis..
            I have my mouse data from illumina HiSeq. Are there any standard adapter sequences associated with this kind of dataset (for trimming) or do I need to get them from the lab guys who sequenced them?

            The tool looks pretty impressive and I am hoping to use it all the way for sRNA analysis.

            cheers,

            Nandan

          • Hi Nandan,

            In theory you should use the Illumina 1.5 adapter sequence which we have pre-loaded into the workbench (LMN_2)

            however, you should check with the wet lab biologists which adapter sequence they used just in case!

            Thanks,
            Matt

          • Nandan

            Thanks Matt,

            I will check how things go.

            cheers,

            Nandan
