Filter

Available from Version: 1.0

This tool filters sRNA sequences according to user-defined criteria in a multi-stage pipeline. The tool accepts an unfiltered FastA format file as input and produces a filtered FastA format file as output. In addition, the tool produces statistics summarising the number of sequences removed at each step in the filter pipeline.

Figure: Filter window, the main interface for the Filter tool.

The Filter tool can discard sequences based on these user-defined criteria:

    • Sequence Length

User defines minimum and maximum length boundaries for valid sequences. The boundaries must lie within the range 16 nt to 50 nt.

    • Sequence Complexity

If selected, low-complexity sequences (those containing fewer than 3 distinct nucleotides) are removed.

    • Sequence Abundance

User defines minimum and maximum abundance levels for valid sequences.

    • Sequence Validity

If selected, sequences containing characters other than the known nucleotides are discarded.

    • Kill Known Sequences

User may specify a list of sequences, which, if found, should be discarded. The kill-list must be a FastA format file.

    • Discard Known Transfer and Ribosomal RNA

This filtering is commonly conducted on sRNA datasets, since reads mapping to tRNA and rRNA might be degradation products. If selected, sequences matching known transfer or ribosomal RNA sequences are discarded. The user may also specify whether matches should be allowed only on the sense strand. Known t/rRNAs are stored in a FastA file at $INSTALL_PATH/data/t_and_r_RNAs.fa. The file contains t/rRNAs obtained from RFAM, version 10 (Jan-2010) [12, 10], the Genomic tRNA Database [6] and EMBL [16], release 95 (09-Jun-2008). The file can be replaced with any FastA file containing t/rRNA sequences.

    • Discard Sequences Not In Genome

If selected, sequences that do not align to a user-specified genome will be discarded. Reads that do not map to the genome are usually considered sequencing errors or minor contamination, and are generally discarded. Another application of genome filtering is the analysis of reads produced from virus-treatment experiments. For example, these sRNA reads can be partitioned into three categories: reads found in both the host and viral genomes; reads found only in the host genome; and reads found only in the viral genome. This can be achieved by running the Filter tool several times with the different genomes.
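The three-way partition described above can be sketched with set operations. This is only an illustration under stated assumptions: the function name partition_reads and its inputs are hypothetical, and the two "mapped" lists stand for the sequences that survived separate genome-filter runs against the host and the viral genome. The Filter tool itself is not called here.

```python
def partition_reads(all_reads, host_mapped, virus_mapped):
    """Split reads into host/virus categories.

    host_mapped and virus_mapped are assumed to be the sequences kept by
    two separate genome-filter runs (hypothetical inputs).
    """
    all_reads = set(all_reads)
    host = set(host_mapped) & all_reads
    virus = set(virus_mapped) & all_reads
    return {
        "both": host & virus,               # found in host and viral genome
        "host_only": host - virus,          # found only in the host genome
        "virus_only": virus - host,         # found only in the viral genome
        "unmapped": all_reads - host - virus,
    }


reads = ["AAAA", "CCCC", "GGGG", "TTTT"]
parts = partition_reads(
    reads,
    host_mapped=["AAAA", "CCCC"],
    virus_mapped=["CCCC", "GGGG"],
)
```

The "unmapped" category is an extra included for completeness; it holds reads that passed neither genome filter.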

The user also has the option to produce a log file containing all discarded sequences. Each sequence in the log will be associated with the reason it was discarded.
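The staged filtering, per-stage counts and discard log described above can be sketched as follows. This is a minimal illustration and not the Workbench's implementation: the function name run_filter, the order of the stages, and the choice to count total reads rather than distinct sequences are all assumptions. Only the length bounds (16 nt to 50 nt) and the complexity threshold (fewer than 3 distinct nucleotides) come from the text.

```python
from collections import Counter

MIN_LEN, MAX_LEN = 16, 50   # length bounds stated in the text
MIN_DISTINCT = 3            # fewer than 3 distinct nucleotides = low complexity


def run_filter(reads, min_abundance=1, max_abundance=float("inf")):
    """Filter redundant reads in stages (stage order is an assumption).

    Returns (kept, stats, log):
      kept  - mapping of surviving sequence -> abundance
      stats - list of (stage name, reads remaining after that stage)
      log   - list of (sequence, abundance, reason discarded)
    """
    counts = dict(Counter(reads))
    stages = [
        ("length", lambda seq, n: MIN_LEN <= len(seq) <= MAX_LEN),
        ("complexity", lambda seq, n: len(set(seq)) >= MIN_DISTINCT),
        ("abundance", lambda seq, n: min_abundance <= n <= max_abundance),
    ]
    stats = [("input", sum(counts.values()))]
    log = []
    for name, keep in stages:
        survivors = {}
        for seq, n in counts.items():
            if keep(seq, n):
                survivors[seq] = n
            else:
                log.append((seq, n, name))  # record why it was discarded
        counts = survivors
        stats.append((name, sum(counts.values())))
    return counts, stats, log


example = (
    ["ACGTACGTACGTACGTAC"] * 3      # 18 nt, kept
    + ["A" * 20] * 2                # low complexity
    + ["ACGT"]                      # too short
    + ["ACGTTGCAACGTTGCAAC"]        # abundance 1, below the minimum used here
)
kept, stats, log = run_filter(example, min_abundance=2)
```

Each log entry pairs a discarded sequence with the stage that removed it, mirroring the optional log file described above.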

After execution, the user can see an overview of the job in the results panel shown below. The results show the number of reads remaining after each filter stage has been completed. In this example, complexity filtering, validity filtering and genome filtering had no effect on the data. Only length, abundance and t/rRNA filtering had an effect.

Figure: Filter output, the output table for the Filter tool.

11 comments on “Filter”

  1. Kangquan YIN said:

    Hi, I am a new user of the sRNA Workbench, analysing miRNA deep-sequencing data. Can the Filter tool remove snoRNA and snRNA?
    Another question: how can I use this tool to remove plant virus sequences, since my samples are infected by a plant virus? I only have the full-length sequence of the virus in FastA format, rather than small sequences derived from the virus.
    Thanks in advance for your reply!

    • Hi and welcome!

      Currently we cannot do this type of filtering with our tool. I am considering adding a new option that will allow users to specify their own files to be matched against for removal (rather than relying only on pre-installed files, such as the one for t/rRNA). Does this sound like it might help?

      Thanks,
      Matt

  2. I’m looking at the best way to use the workbench to analyse a large number of files and I guess using the tools at the command line is the best way to do this. Is there documentation of what should be in the parameter files and the format of those files? I couldn’t find it at this site so apologies if I just haven’t looked hard enough.

    • Hi Nathaniel,

      For now the CLI is probably the best option, depending on what you want to do with those files. If it is one of the more intensive tasks (such as miRCat), I would recommend running a few jobs first to ensure the parameters, file locations etc. are all looking OK. Of course, some of the tools, such as miRProf, will allow you to use multiple files that are all part of one experiment, for example if you wish to see the changes in expression over a time series.

      I will be adding a batch mode to the Adapter Removal and Filter tools as soon as possible as it seems this is the most requested feature for these tools at present. It will hopefully not be a large amount of work, just a few changes to the interface.

      If you want to find example parameter files for running the tools from the command line, I have provided these in the Workbench download, navigate to /data/default_params (where is your extracted directory)

      then just copy and paste (to keep a fresh param file for later) the desired param file (.cfg files with the required tool name) into a new location and open in a text editor. Remove any lines for parameters you do not wish to change and modify the lines for those you do want to change (any unfound parameters should just revert to their default parameters at run time)
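The fallback behaviour described here (only the lines you keep are overridden, everything else reverts to its default) can be sketched as a small key=value loader. This is an illustration only, not the Workbench's parameter loading code: the function name load_params, the default values, the min_length key and the "#" comment syntax are all assumptions. The keys filter_genome_hits and genome are taken from this thread.

```python
# Hypothetical defaults; the real values live in the shipped .cfg files.
DEFAULTS = {
    "filter_genome_hits": "false",
    "genome": "",
    "min_length": "16",   # hypothetical key, for illustration only
}


def load_params(text, defaults=DEFAULTS):
    """Overlay key=value lines from a param file onto the defaults."""
    params = dict(defaults)
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments (assumed '#') and lines without '='.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


trimmed_file = "# my run\nfilter_genome_hits=true\ngenome=/data/genome.fa\n"
params = load_params(trimmed_file)
```

Here min_length is absent from the trimmed file, so it keeps its default value.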

      Let me know if this helps.
      Thanks,
      Matthew

      • A batch mode for adaptor removal and filtering will be great and much appreciated.

        Those example default files are exactly what I needed thanks. I have run the filter tool but can not get it to use a genome fasta file for filtering matches. My filter has these lines

        filter_genome_hits=true
        genome="/mnt/its/spruceBLAST/BLAST/genome_assemblies/ASSEMBLYLOCK_2012_JULY/picea_abies.master.july2012.fa"

        I tried with and without the quotes around the path but in both cases, the genome filter was not performed.

        I also wanted to run the sequence alignment tool but it is not listed as one of the tool options and when I tried to use ‘-tool sequencealignment’ this gives a tool not found error. What am I doing wrong? There is also no default param example file for this tool.

        When I last used your tools at the web server version, it was always the case that the output should be kept redundant. Is that the same here for tools later in the pipeline, such as mircat, or can they now handle non-redundant input from the tools coming before in the pipeline?

        A last question for now is how would you suggest handling an experiment with sequence data from multiple samples when it comes to tools such as mircat? Would it be best to cat all of the input files together and run mirna, siloco etc on the combined data for maximum sensitivity?

        Thanks for your help getting me started.

        • Hi,

          I have just checked through the CLI version of Filter and you are correct: there seems to be an issue with the latest version of the parameter loading code that prevents genome files from being loaded. It is highly possible this will affect other tools too. Unfortunately I only have limited testing time as I work alone, and the CLI is probably the least tested area of the program. The genome filtering will definitely work from the GUI, however (if you are logged into a server, the Workbench can use X window forwarding to run the tool on the server while displaying the GUI on your own console).

          I am working on fixing this now and will release an updated version ASAP, with any luck I can squeeze in the batch mode at least from the GUI with this release too…

          Sequence Alignment is not currently available from the CLI. I will consider adding it in future if it is something people need; at present you must access this tool from the GUI, or run the patman program on its own, since the Sequence Alignment tool is just a wrapper for it. You can find the patman binary in the ExeFiles directory. However, please ensure the input small RNA data is in the non-redundant bracket format that is output by the other tools in the Workbench.

          Yes all tools are specifically designed to take input from the “helper tools” now. So non-redundant input is fine (and in some cases necessary). Some of the tools such as miRCat will detect non-redundant input and convert it internally but some tools such as miRProf will require the files to be in the non-redundant format prior to load. The filter tool can produce a “non-destructive” filtered version of the file to create non-redundant data sets without removal of any sequences if you need it.
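Collapsing a redundant read set into a non-redundant one, without removing any sequences, can be sketched as below. This is an illustration rather than the Workbench's code. The header layout used here, ">SEQUENCE(count)", is an assumption about what the "bracket format" looks like and should be checked against a file actually produced by the Filter tool. The function name collapse is hypothetical.

```python
from collections import Counter


def collapse(sequences):
    """Collapse redundant reads into one FastA record per distinct sequence.

    Assumed record layout: '>SEQUENCE(count)' followed by the sequence.
    Records are ordered by descending abundance, then alphabetically.
    """
    counts = Counter(sequences)
    lines = []
    for seq, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f">{seq}({n})")
        lines.append(seq)
    return "\n".join(lines) + "\n"


fasta = collapse(["ACGT", "TTTT", "ACGT"])
```

No read is discarded: each distinct sequence appears once, with its abundance recorded in the header.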

          Multiple sample data: I assume by this you mean varying time points or treatments not biological replicates? If so, then SiLoCo and miRProf are designed to have all of your sample files input into one run (not joined, but as a list) so they can make use of all the data for expression profiling/loci prediction etc (if you want to track a miRNA expression level across several time points for example). However, miRCat and TA-SI should have each sample file processed separately. I would recommend only combining your files if they are replicates of the same treatment/time point in your experiment.

          Let me know if this helps,
          Matt

          • Thanks for all the details. I’ll stick to the GUI for now and wait for the CLI fixes. I understand the time constraints so anything you can do is much appreciated.

            For my last point about multiple samples what I have are time points so for the tools looking more at expression style information I would keep everything separate. I was thinking specifically about miRCat, siLoCo and ta-si prediction where I’m more interested in, for example, identifying all novel miRNAs and then post analysing their expression. I was thinking to pool all data for those tools to maximise coverage for detecting e.g. novel miRNAs.

          • No problem,

            I would definitely recommend running the tools without combining the files. Internally, SiLoCo will actually combine all the files to generate the loci anyway, so this will be the same as if you had combined the files yourself. However, it will then report back the statistical information for each file. You basically lose nothing by running SiLoCo with all your files at once, and the GUI allows you to select all your files in one go. The same goes for miRProf: if you load all of your samples at once, you can keep track of which known miRNA came from which file and track the expression across your time series.

            As for miRCat and TA-SI. You will in fact get less information by combining the files in this instance (depending on how far apart your time points are!) Imagine a situation where you have a TASI in one out of ten of your samples only. By combining the files you will have increased the noise factor of your data by ten but not the TASI! Therefore you may end up not detecting it at all… Same is true even more so for miRCat. You really will get a much better coverage in fact by running each sample separately. We have done several runs where we combined data through miRCat (and TASI) and ended up not finding features that we did find when running the samples alone. Basically it all boils down to noise in the data. And without knowing what the noise is it is tricky to remove it…

  3. Anurag gautam said:

    The t_and_r_RNAs.fa file of the sRNA Workbench contains only 1989 sequences, whereas the Rfam database (the Wellcome Trust Sanger Institute's collection of RNA families, excluding miRNA) contains 357,924 sequences. Could you please explain why the sRNA Workbench uses so few sequences for filtering?

    • admin said:

      We use an old version of the RFAM database and we will be providing a new version soon, along with functionality to specify your own T/R RNA file, and possibly an auto update (in the same way that the miRBase updater works).

      You can download from RFAM the current version and replace the t_and_r_RNAs.fa file with this one (rename the new RFAM file with this name) to use the latest version for filtering. We aim to provide a general filtering but the user should aim to use a T and R RNA filtering file specific to the organism of interest.
