Filter

Available from Version: 1.0

This tool filters sRNA sequences according to user-defined criteria in a multi-stage pipeline. The tool accepts an unfiltered FastA format file as input and produces a filtered FastA format file as output. In addition, the tool produces statistics summarising the number of sequences removed at each step in the filter pipeline.

Filter Window
The main interface for the Filter tool

The filter tool can discard sequences based on these user-defined criteria:

    • Sequence Length

User defines minimum and maximum length boundaries for valid sequences, within the range 16 nt to 50 nt.

    • Sequence Complexity

If selected, low-complexity sequences (those containing fewer than 3 distinct nucleotides) are removed.

    • Sequence Abundance

User defines minimum and maximum abundance levels for valid sequences.

    • Sequence Validity

If selected, sequences not containing known nucleotides are discarded.

    • Kill Known Sequences

User may specify a list of sequences, which, if found, should be discarded. The kill-list must be a FastA format file.

    • Discard Known Transfer and Ribosomal RNA

This filtering is commonly conducted on sRNA datasets, since reads mapping to tRNA and rRNA might be degradation products. If selected, sequences matching known transfer or ribosomal RNA sequences are discarded. The user may also specify whether matches should be allowed on the sense strand only. Known t/rRNAs are stored in a FastA file at $INSTALL_PATH/data/t_and_r_RNAs.fa. The file contains t/rRNAs obtained from RFAM, version 10 (Jan-2010) [12, 10], the Genomic tRNA Database [6] and EMBL [16], release 95 (09-Jun-2008). The file can be replaced with any FastA file containing t/rRNA sequences.

    • Discard Sequences Not In Genome

If selected, sequences not aligning to a user-specified genome will be discarded. Reads that do not map to the genome are usually considered sequencing errors or minor contamination, and are generally discarded. Another application for genome filtering is the analysis of reads produced from virus-treatment experiments. For example, these sRNA reads can be partitioned into three categories: reads found in both the host and viral genomes; reads unique to the host genome; and reads unique to the viral genome. This can be achieved by running the filter tool several times with the different genomes.
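The three-way partition described above can be sketched with plain set operations. This is only an illustration, not part of the Workbench: the function name `partition_reads` and its inputs are hypothetical, and it assumes you have already collected, from separate filter runs, the sets of reads that aligned to the host genome and to the viral genome.

```python
def partition_reads(all_reads, host_hits, virus_hits):
    """Split reads into shared, host-only, virus-only and unmapped sets.

    all_reads  -- every read sequence in the sample
    host_hits  -- reads that aligned to the host genome (assumed input)
    virus_hits -- reads that aligned to the viral genome (assumed input)
    """
    all_reads = set(all_reads)
    # Ignore any hit that is not actually part of this sample.
    host_hits = set(host_hits) & all_reads
    virus_hits = set(virus_hits) & all_reads
    return {
        "both": host_hits & virus_hits,
        "host_only": host_hits - virus_hits,
        "virus_only": virus_hits - host_hits,
        "unmapped": all_reads - host_hits - virus_hits,
    }
```

The extra "unmapped" set is included for convenience; it corresponds to the reads a genome filter would discard against both genomes.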

The user also has the option to produce a log file containing all discarded sequences. Each sequence in the log will be associated with the reason it was discarded.
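As a rough illustration of how a multi-stage filter with a discard log might behave, here is a minimal sketch covering the length, complexity, abundance and validity criteria. It is not the Workbench's implementation. The function name, the parameter names, the stage order, and the reading of "known nucleotides" as the letters A, C, G and T are all assumptions made for the example.

```python
VALID_NUCLEOTIDES = set("ACGT")  # assumption: anything else counts as unknown


def filter_reads(reads, min_len=16, max_len=50, min_abundance=1,
                 max_abundance=None, low_complexity=True, validity=True):
    """Filter a {sequence: abundance} dict in stages.

    Returns (kept, discarded, remaining) where `discarded` maps each removed
    sequence to the stage that removed it, and `remaining` lists the number
    of distinct sequences left after each stage.
    """
    stages = [
        ("length", lambda s, n: min_len <= len(s) <= max_len),
        # Low complexity: fewer than 3 distinct nucleotides.
        ("complexity", lambda s, n: not low_complexity or len(set(s)) >= 3),
        ("abundance", lambda s, n: n >= min_abundance
            and (max_abundance is None or n <= max_abundance)),
        ("validity", lambda s, n: not validity or set(s) <= VALID_NUCLEOTIDES),
    ]
    kept = dict(reads)
    discarded = {}
    remaining = []
    for name, passes in stages:
        survivors = {}
        for seq, count in kept.items():
            if passes(seq, count):
                survivors[seq] = count
            else:
                discarded[seq] = name  # record the reason for the log
        kept = survivors
        remaining.append((name, len(kept)))
    return kept, discarded, remaining
```

Each sequence is tested only by the stages it reaches, so the reason recorded for a discarded sequence is the first stage it failed.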

After execution the user can see an overview of the job in the results panel shown below. The results show the number of reads remaining after each filter stage has been completed. In this example, complexity, validity and genome filtering had no effect on the data. Only length, abundance and t/rRNA filtering had an effect.

Filter output
The output table for the Filter tool
  • Katja

    Hi Matt,

    After the filtering step I would like to obtain redundant output format. I am able to do that using the GUI, but not the CLI. When I add “make_nr” to my configuration file, as suggested in the configuration file with default parameters, the parameter is not recognised. The message I get is:

    WORKENCH ERROR: Unrecoverable exception occured: java.io.IOException: Unknown Parameter: “make_nr”
    Feb 09, 2015 11:35:24 AM uk.ac.uea.cmp.srnaworkbench.utils.LOGGERS.WorkbenchLogger log
    SEVERE: WORKBENCH:
    java.io.IOException: Unknown Parameter: “make_nr”
    at uk.ac.uea.cmp.srnaworkbench.tools.ToolParameters.load(ToolParameters.java:195)
    at uk.ac.uea.cmp.srnaworkbench.tools.ToolBox$5.startTool(ToolBox.java:181)
    at uk.ac.uea.cmp.srnaworkbench.Main.startWorkbench(Main.java:230)
    at uk.ac.uea.cmp.srnaworkbench.Main.main(Main.java:47)

    Please let me know if you have any solution to this problem. I would really appreciate being able to run the filter tool from the CLI, as it is much more convenient for multiple files.

    Thank you,

    Katja

    • The sRNA Workbench

      Hi Katja,

      If it works in the GUI but not from the CLI it is probably a small bug in the CLI version that I have introduced somewhere. I will investigate it, fix it and add it to the patch notes for the next release!

      Thanks for pointing it out to me. It is a great help! Hopefully a new version will be ready soon, quite a lot of stuff had to change for the next release but I am confident it will be available in the next few weeks. Sorry I cannot help further for now.

      Cheers,
      Matt

  • Vijay

    Hi Matt,
    I am using CentOS with 16GB RAM. After starting UEA smallRNAWorkbench 3.1 with the command java -Xmx300g -jar Workbench.jar, adapter removal completed successfully. But when using the filter tool it showed the error below….

    [root@localhost srna-workbenchV3.1_Linux]# java -Xmx300g -jar Workbench.jar
    UEA sRNA Workbench startup…
    Apr 22, 2014 3:09:54 PM uk.ac.uea.cmp.srnaworkbench.utils.LOGGERS.WorkbenchLogger log
    SEVERE: WORKBENCH: FILTER: Message: /home/Vijay/srna-workbenchV3.1_Linux/User/temp/_1/filter.kill-list.out.patman (No such file or directory);
    Stack Trace: java.io.FileInputStream.open(Native Method)
    java.io.FileInputStream.<init>(FileInputStream.java:146)
    uk.ac.uea.cmp.srnaworkbench.utils.patman.PatmanReader.process(PatmanReader.java:68)
    uk.ac.uea.cmp.srnaworkbench.utils.patman.PatmanReader.process(PatmanReader.java:42)
    uk.ac.uea.cmp.srnaworkbench.tools.filter.Filter.filterWithPatman(Filter.java:390)
    uk.ac.uea.cmp.srnaworkbench.tools.filter.Filter.process(Filter.java:227)
    uk.ac.uea.cmp.srnaworkbench.tools.RunnableTool.run(RunnableTool.java:339)
    java.lang.Thread.run(Thread.java:744)

    Here I haven’t given the kill-list file path and I haven’t selected Kill Known Sequences.

    Many thanks

    • The sRNA Workbench

      Hi Vijay,

      I am not sure why it has tried to look for a file you have not selected.

      Did the error repeat every time you tried it?

      Cheers,
      Matt

  • ken

    Hi Matt,

    I’d just like to confirm that the filter tool uses Patman, with no mismatches or gaps allowed when scanning the t/rRNA database, as well as the genome?

    many thanks,

    Ken

    • The sRNA Workbench

      Hi Ken,

      Yes that is correct, filtering is done with exact matches only. Would it help to be able to configure this on the GUI/CLI to allow for mis-matches etc?

      Cheers,
      Matt
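To make the exact-match behaviour confirmed above concrete, here is a small sketch of what "no mismatches or gaps, optionally sense strand only" means for a single read. It uses a naive substring search as a stand-in for PatMaN, so it only illustrates the matching rule and says nothing about how the Workbench or PatMaN search internally. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def matches_reference(read, references, sense_only=False):
    """True if the read occurs exactly (no mismatches, no gaps) in any
    reference sequence, on the sense strand or, unless sense_only is set,
    on the antisense strand as well."""
    read = read.upper()
    candidates = [read] if sense_only else [read, reverse_complement(read)]
    return any(c in ref.upper() for ref in references for c in candidates)
```

A read for which `matches_reference` returns True against the t/rRNA sequences would be discarded by a filter of this kind.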

      • ken

        Hi Matt,

        Thanks for the info. I do think it would help to be able to configure this, but actually my problem was I was running into memory limitations so I opted to run bowtie to do the filtering and wanted to know what the parameters were to mimic. (After the non-redundancy run, I had a 514Mb reads file, but the t/rRNA db I replaced was around 1.4Gb. After restarting the workbench, the db filtering step maxed out at around 30Gb (limit of my machine is 32Gb). I’m not sure why it is taking up so much memory despite the input file sizes, but maybe there were too many hits per read. I did java -Xmx29g -jar Workbench.jar to start the GUI.)

        I might use the Workbench.jar from the command line setting the -Xmx to a larger size on a server with higher RAM. I believe the params file format is under data/default_params?

        Also, if the filtering tool could also have an option to output both filtered and non-filtered reads, that would be awesome. (actually I just found that the temp dir contains this, so I might work off this).

        Cheers,
        Ken

        • The sRNA Workbench

          Hi Ken,

          Ok no problem I will add this for the next version! I am also surprised it is using so much memory for filtering as this is usually not so intensive. Out of interest what OS are you using?

          Cheers,
          Matt

          • ken

            Hi Matt,

            I’m using Ubuntu 10.04 LTS.

            Yes I was surprised it was using so much memory. I did find out that a large proportion of my reads were only single copy. Fyi, here are some stats (after uniquifying):

            Input unique reads – 10,202,773
            After t/rRNA filter by bowtie – 10,081,368

            t/rRNA filtered, min 5 counts reads – 735,876

            So more than 9mil reads were of < 5 counts.

            When I ran the 'Input seqs' reads against the 1.4Gb t/rRNA db with standalone patman, I was able to get it to complete and the peak mem usage was around 16Gb. So for whatever reason the GUI is adding some other mem usage. The difference between standalone patman and GUI output is that the former doesn't output the filtered sequences in fasta format – maybe there is some intermediate processing in the GUI that is taking up lots of memory? Or maybe it is a Java JRE/Linux combination and memory leak problem?

            I can also report that miRCat fell over in memory too (max 31Gb on my workstation) when I used all sequences (t/rRNA filtered seqs) against my genome (2.3Gb). I ran the same analysis on a HPC server with 60Gb allocated, and after 28 hours it was still running with about 50Gb mem usage.
            When I used the min5 count seqs for mirCat on my workstation, against this genome the peak RAM usage was around 19Gb and I was able to get it to complete.

            In short it seems that running patman via the GUI is eating up a lot of memory when including low count reads.

            Despite the memory issues as my datasets get larger and larger, I am still going out of my way to use this software as it is great, easy and intuitive to use!

            cheers,

            Ken

          • The sRNA Workbench

            Hi Ken,

            Thanks a lot for the info, I will use it as a starting point for the investigation! This sort of data is really helpful to me.

            Thanks also for the kind words on the software overall. The memory problems are an issue at the moment, one I am looking forward to sorting out if our next round of funding gets awarded. It is one of the major objectives I have outlined and I have a few neat strategies in mind that will hopefully consign at least some of them to the past.

            I will make a note in the next release post if I can figure out why the GUI version is chewing up so much resources

            Cheers,
            Matt

  • Kangquan YIN

    Hi, I am a new user of sRNA Workbench for analysing miRNA deep-seq data. Can the FILTER tool remove snoRNA and snRNA?
    Another question is how I can use this tool to remove plant virus sequences, because the samples are infected by the plant virus. However, I only have the full-length sequence of the virus in fasta format rather than small sequences derived from the virus.
    Thanks ahead for your reply!

    • Hi and welcome!

      Currently we cannot do this type of filtering with our tool. I am considering adding a new option to the tool that will allow users to specify their own files to be matched against for removal (rather than having pre-installed files for t/rRNA, for example). Does this sound like it might help?

      Thanks,
      Matt

      • Kangquan YIN

        Thanks! I really need that tool, hope it will be incorporated into sRNAworkbench soon~

  • I’m looking at the best way to use the workbench to analyse a large number of files and I guess using the tools at the command line is the best way to do this. Is there documentation of what should be in the parameter files and the format of those files? I couldn’t find it at this site so apologies if I just haven’t looked hard enough.

    • Hi Nathaniel,

      For now the CLI is probably the best option, depending on what you want to do with those files. If it is one of the more intensive tasks (such as miRCat) I would recommend running a few jobs first to ensure the parameters and file locations etc are all looking ok. Of course, some of the tools such as miRProf will allow you to use multiple files that are all part of one experiment. If you wish to see the changes in expression over a time series for example.

      I will be adding a batch mode to the Adapter Removal and Filter tools as soon as possible as it seems this is the most requested feature for these tools at present. It will hopefully not be a large amount of work, just a few changes to the interface.

      If you want to find example parameter files for running the tools from the command line, I have provided these in the Workbench download: navigate to the data/default_params directory inside your extracted Workbench directory,

      then just copy and paste (to keep a fresh param file for later) the desired param file (.cfg files with the required tool name) into a new location and open in a text editor. Remove any lines for parameters you do not wish to change and modify the lines for those you do want to change (any unfound parameters should just revert to their default parameters at run time)
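The copy-and-trim step described above could be automated along these lines. This is only a sketch under stated assumptions: it assumes the parameter files are plain key=value lines (as in the filter_genome_hits and genome lines quoted later in this thread), that lines beginning with "#" can be skipped, and the helper name `trim_params` is hypothetical. Check the real .cfg files in data/default_params for the actual parameter names before relying on it.

```python
def trim_params(default_text, overrides):
    """Build a reduced parameter file from a default one.

    Keeps only the parameters named in `overrides`, written with their new
    values, so every other parameter falls back to its default at run time.
    Keys in `overrides` that are absent from the default file are dropped,
    since unknown parameter names may be rejected by the tool.
    """
    lines = []
    for raw in default_text.splitlines():
        line = raw.strip()
        # Skip blanks, assumed comment lines, and anything not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key in overrides:
            lines.append(f"{key}={overrides[key]}")
    return "\n".join(lines) + "\n"
```

For example, passing the text of a copied default file together with `{"filter_genome_hits": "true"}` would yield a one-line file that changes only that setting.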

      Let me know if this helps.
      Thanks,
      Matthew

      • A batch mode for adaptor removal and filtering will be great and much appreciated.

        Those example default files are exactly what I needed thanks. I have run the filter tool but can not get it to use a genome fasta file for filtering matches. My filter has these lines

        filter_genome_hits=true
        genome=”/mnt/its/spruceBLAST/BLAST/genome_assemblies/ASSEMBLYLOCK_2012_JULY/picea_abies.master.july2012.fa”

        I tried with and without the quotes around the path but in both cases, the genome filter was not performed.

        I also wanted to run the sequence alignment tool but it is not listed as one of the tool options and when I tried to use ‘-tool sequencealignment’ this gives a tool not found error. What am I doing wrong? There is also no default param example file for this tool.

        When I last used your tools at the web server version, it was always the case that the output should be kept redundant. Is that the same here for tools later in the pipeline, such as mircat, or can they now handle non-redundant input from the tools coming before in the pipeline?

        A last question for now is how would you suggest handling an experiment with sequence data from multiple samples when it comes to tools such as mircat? Would it be best to cat all of the input files together and run mirna, siloco etc on the combined data for maximum sensitivity?

        Thanks for your help getting me started.

        • Hi,

          I have just checked through the CLI version of Filter and you are correct, there seems to be an issue with the latest version of the parameter loading code that prevents genome files from being loaded, it is highly possible this will affect other tools too. Unfortunately I only have limited testing time as I work alone and the CLI is probably the least tested area of the program. The genome filtering will definitely work from the GUI however (the workbench can make use of X window forwarding if you are logged into a server to run the tool on the server but load the GUI onto your console for example).

          I am working on fixing this now and will release an updated version ASAP, with any luck I can squeeze in the batch mode at least from the GUI with this release too…

          Sequence Alignment is not currently available from the CLI. I will consider adding it in future if it is something people will need, at present you must access this tool from the GUI (or just run the patman program alone, the Sequence Alignment tool is just a wrapper for this. However, please ensure the input small RNA data is in the non-redundant bracket format that is output from the other tools in the Workbench. You can find the patman binary in the ExeFiles directory).

          Yes all tools are specifically designed to take input from the “helper tools” now. So non-redundant input is fine (and in some cases necessary). Some of the tools such as miRCat will detect non-redundant input and convert it internally but some tools such as miRProf will require the files to be in the non-redundant format prior to load. The filter tool can produce a “non-destructive” filtered version of the file to create non-redundant data sets without removal of any sequences if you need it.

          Multiple sample data: I assume by this you mean varying time points or treatments not biological replicates? If so, then SiLoCo and miRProf are designed to have all of your sample files input into one run (not joined, but as a list) so they can make use of all the data for expression profiling/loci prediction etc (if you want to track a miRNA expression level across several time points for example). However, miRCat and TA-SI should have each sample file processed separately. I would recommend only combining your files if they are replicates of the same treatment/time point in your experiment.

          Let me know if this helps,
          Matt

          • Thanks for all the details. I’ll stick to the GUI for now and wait for the CLI fixes. I understand the time constraints so anything you can do is much appreciated.

            For my last point about multiple samples what I have are time points so for the tools looking more at expression style information I would keep everything separate. I was thinking specifically about miRCat, siLoCo and ta-si prediction where I’m more interested in, for example, identifying all novel miRNAs and then post analysing their expression. I was thinking to pool all data for those tools to maximise coverage for detecting e.g. novel miRNAs.

          • No problem,

            I would definitely recommend running the tools without combining the files. Underneath, SiLoCo will actually combine all the files to generate the loci anyway, so this will be the same as if you had combined the files yourself. However, it will then report back to you the statistical information on each file. You basically will lose nothing by running SiLoCo with all your files at once, and the GUI allows you to select all your files in one go. The same goes for miRProf: if you load all of your samples in at once you can keep track of which known miRNA came from which file and track the expression across your time series.

            As for miRCat and TA-SI. You will in fact get less information by combining the files in this instance (depending on how far apart your time points are!) Imagine a situation where you have a TASI in one out of ten of your samples only. By combining the files you will have increased the noise factor of your data by ten but not the TASI! Therefore you may end up not detecting it at all… Same is true even more so for miRCat. You really will get a much better coverage in fact by running each sample separately. We have done several runs where we combined data through miRCat (and TASI) and ended up not finding features that we did find when running the samples alone. Basically it all boils down to noise in the data. And without knowing what the noise is it is tricky to remove it…

  • Anurag gautam

    The t_and_r_RNAs.fa file of sRNA Workbench contains only 1989 sequences, whereas the Rfam database (the Wellcome Trust Sanger Institute database containing a collection of RNA families other than miRNA) contains 3,57,924 sequences. Please can you explain why sRNA Workbench uses so few sequences for filtering?

    • admin

      We use an old version of the RFAM database and we will be providing a new version soon, along with functionality to specify your own T/R RNA file, and possibly an auto update (in the same way that the miRBase updater works).

      You can download from RFAM the current version and replace the t_and_r_RNAs.fa file with this one (rename the new RFAM file with this name) to use the latest version for filtering. We aim to provide a general filtering but the user should aim to use a T and R RNA filtering file specific to the organism of interest.
