This tool filters sRNA sequences according to user-defined criteria in a multi-stage pipeline. The tool accepts an unfiltered FastA format file as input and produces a filtered FastA format file as output. In addition, the tool produces statistics summarising the number of sequences removed at each step in the filter pipeline.
The filter tool can discard sequences based on these user-defined criteria:
- Sequence Length
User defines minimum and maximum length boundaries for valid sequences; both boundaries must lie within the range 16 nt to 50 nt.
- Sequence Complexity
If selected, low-complexity sequences (those containing fewer than 3 distinct nucleotides) are removed.
- Sequence Abundance
User defines minimum and maximum abundance levels for valid sequences.
- Sequence Validity
If selected, sequences containing characters other than known nucleotides are discarded.
- Kill Known Sequences
User may specify a list of sequences to be discarded wherever they are found in the input. The kill-list must be a FastA format file.
- Discard Known Transfer and Ribosomal RNA
This filtering is commonly conducted on sRNA datasets, since reads mapping to tRNA and rRNA might be degradation products. If selected, sequences matching known transfer or ribosomal RNA sequences are discarded. The user may also specify whether matches should be allowed only on the sense strand. Known t/rRNAs are stored in the FastA file $INSTALL_PATH/data/t_and_r_RNAs.fa. The file contains t/rRNAs obtained from RFAM, version 10 (Jan-2010) [12, 10], the Genomic tRNA Database and EMBL, release 95 (09-Jun-2008). The file can be replaced with any FastA file containing t/rRNA sequences.
- Discard Sequences Not In Genome
If selected, sequences not aligning to a user-specified genome are discarded. Reads that do not map to the genome are usually considered sequencing errors or minor contamination, and are generally discarded. Another application for genome filtering is the analysis of reads produced from virus-treatment experiments. For example, these sRNA reads can be partitioned into three categories: reads identified in both the host and viral genomes; reads unique to the host genome; and reads unique to the viral genome. This can be achieved by running the filter tool several times, once with each genome.
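The three-way partition described for virus-treatment experiments can be sketched in Python as follows. This is only an illustration, not part of the tool: it assumes the filter has already been run once against the host genome and once against the viral genome, and that the sequences kept by each run have been loaded into Python collections. The function name partition_reads and the toy sequences are hypothetical.

```python
def partition_reads(kept_by_host, kept_by_virus):
    """Partition reads using the outputs of two genome-filter runs.

    kept_by_host  -- sequences that survived filtering against the host genome
    kept_by_virus -- sequences that survived filtering against the viral genome
    (Both inputs are assumed to have been loaded from the filtered FastA files.)
    """
    host = set(kept_by_host)
    virus = set(kept_by_virus)
    return {
        "both": host & virus,        # reads found in both genomes
        "host_only": host - virus,   # reads unique to the host genome
        "virus_only": virus - host,  # reads unique to the viral genome
    }


# Toy sequences, invented purely for illustration.
example = partition_reads(
    kept_by_host=["ACGTACGTACGTACGTAC", "TTGACCATGGACCATGGA"],
    kept_by_virus=["TTGACCATGGACCATGGA", "GGCATTAGCCATTAGCCA"],
)
```

Reads that appear in neither run's output mapped to neither genome and fall outside all three categories.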
The user also has the option to produce a log file containing all discarded sequences. Each sequence in the log will be associated with the reason it was discarded.
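As a rough illustration of how each discarded sequence can be paired with a reason, the following Python sketch applies the length, validity, complexity, abundance and kill-list criteria to a dictionary mapping sequences to abundances. It is a minimal sketch under stated assumptions, not the tool's implementation: the function names, the order of the checks, the reason labels, the default thresholds and the valid alphabet (A, C, G, T, U) are all hypothetical.

```python
# Assumed alphabet of "known" nucleotides; the tool's actual alphabet is not stated.
VALID_NUCLEOTIDES = set("ACGTU")


def discard_reason(seq, abundance, min_len=16, max_len=50,
                   min_abundance=1, max_abundance=None, kill_list=frozenset()):
    """Return a label explaining why seq is discarded, or None if it is kept."""
    seq = seq.upper()
    if not min_len <= len(seq) <= max_len:
        return "length"
    if not set(seq) <= VALID_NUCLEOTIDES:
        return "invalid"
    if len(set(seq)) < 3:  # fewer than 3 distinct nucleotides
        return "low complexity"
    if abundance < min_abundance:
        return "abundance"
    if max_abundance is not None and abundance > max_abundance:
        return "abundance"
    if seq in kill_list:
        return "kill-list"
    return None


def filter_with_log(reads, **criteria):
    """Split {sequence: abundance} into kept reads and a (sequence, reason) log."""
    kept, log = {}, []
    for seq, abundance in reads.items():
        reason = discard_reason(seq, abundance, **criteria)
        if reason is None:
            kept[seq] = abundance
        else:
            log.append((seq, reason))
    return kept, log


# Toy input, invented for illustration.
reads = {
    "ACGTACGTACGTACGTACGT": 5,   # passes every check
    "ACGT": 10,                  # shorter than min_len
    "AAAAAAAAAAAAAAAAAAAA": 3,   # a single distinct nucleotide
    "ACGTNACGTACGTACGTACG": 2,   # contains N
    "ACGTACGTACGTACGTACGA": 1,   # below min_abundance=2
}
kept, log = filter_with_log(reads, min_abundance=2)
```

Because the checks run in a fixed order, a sequence that fails several criteria is logged only under the first one it fails.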
After execution, the user can see an overview of the job in the results panel shown below. The results show the number of reads remaining after each filter stage has been completed. In this example, complexity, invalid-sequence and genome filtering had no effect on the data; only length, abundance and t/rRNA filtering removed reads.
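The per-stage overview can be mimicked with a short Python sketch that applies a list of stages in order and records how many reads remain after each one. This is a hypothetical illustration rather than the tool's own code: run_stages, stages_without_effect, the stage names and the toy reads are all assumptions made for the example.

```python
def run_stages(reads, stages):
    """Apply (name, keep_predicate) stages in order.

    Returns the surviving reads and a list of (stage name, reads remaining).
    """
    remaining = list(reads)
    summary = [("input", len(remaining))]
    for name, keep in stages:
        remaining = [read for read in remaining if keep(read)]
        summary.append((name, len(remaining)))
    return remaining, summary


def stages_without_effect(summary):
    """Names of stages whose remaining count equals that of the previous stage."""
    return [name for (_, before), (name, after) in zip(summary, summary[1:])
            if before == after]


# Hypothetical stages; the thresholds mirror the length and complexity criteria.
stages = [
    ("length", lambda read: 16 <= len(read) <= 50),
    ("complexity", lambda read: len(set(read)) >= 3),
    ("validity", lambda read: set(read) <= set("ACGTU")),
]
toy_reads = ["ACGTACGTACGTACGTACGT", "ACGT", "AAAAAAAAAAAAAAAAAAAA"]
survivors, summary = run_stages(toy_reads, stages)
```

A stage listed by stages_without_effect corresponds to a filter that "had no effect" in the results panel: the number of reads remaining is unchanged from the previous stage.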