Sequence Alignment

The Sequence Alignment tool aligns a FastA file of short reads against a FastA file of long reads. The most common use case is likely to be aligning an sRNA data file to a genome file to produce a list of hits. The tool's pipeline performs some basic pre-processing (which the user can partly customise), runs the sequence aligner itself, currently patman (Prüfer et al., 2008), to align the sequences to the genome, and finally performs some optional post-processing. The tool makes sequence alignment a much simpler task for the user by providing an easy-to-use GUI and by pipelining commonly performed actions together with the alignment itself.

To use the tool, the user must enter a short read file path and a long read file path, and specify where the output file should be saved. All other settings in the GUI are optional.

The tool performs a certain amount of pre-processing before sequence alignment, regardless of whether the user checks the “Enable pre-processing” box. This ensures that the short read data passed to the sequence aligner is in a suitable format and contains only reads that are likely to deliver sensible results. Specifically, the short read data is filtered to remove:

  • Any sequences shorter than 16nt
  • Any sequences containing fewer than 3 distinct nucleotides
  • Any sequences containing invalid nucleotides (i.e. nucleotides other than {‘A’, ‘T’, ‘G’, ‘C’})
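
The three mandatory checks above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not the tool's own code: the function name and its parameters are hypothetical, and treating lower-case letters as equivalent to upper-case is an assumption.

```python
VALID_NUCLEOTIDES = frozenset("ACGT")


def passes_mandatory_filter(seq, min_length=16, min_distinct=3):
    """Return True if a short read survives the three always-on checks."""
    seq = seq.upper()  # assumption: case is ignored when validating reads
    # Rule 1: discard sequences shorter than 16nt.
    if len(seq) < min_length:
        return False
    distinct = set(seq)
    # Rule 3: discard sequences containing anything other than A, T, G, C.
    if not distinct <= VALID_NUCLEOTIDES:
        return False
    # Rule 2: discard sequences with fewer than 3 distinct nucleotides.
    return len(distinct) >= min_distinct
```

For instance, a 16nt read made up of repeated "ACGT" would pass, whereas a read made up only of "A" and "C" would be removed whatever its length.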

Optionally, the user can specify additional pre-processing parameters:

  • Minimum sequence length (lower limit of 16nt)
  • Maximum sequence length (upper limit of 50nt)
  • Minimum abundance
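
As a rough sketch of how these optional parameters might be applied, the Python below keeps only reads within a length range and at or above a minimum abundance. The function name is hypothetical, and it assumes that abundance means the number of times an identical sequence occurs in the input; the tool itself may obtain abundance differently.

```python
from collections import Counter


def optional_filter(reads, min_length=16, max_length=50, min_abundance=1):
    """Return {sequence: abundance} for reads that pass the optional filters.

    Assumption: abundance is the count of identical sequences in `reads`.
    """
    counts = Counter(reads)
    return {
        seq: abundance
        for seq, abundance in counts.items()
        if min_length <= len(seq) <= max_length and abundance >= min_abundance
    }
```

With `min_abundance=2`, for example, a sequence seen only once in the input would be dropped even if its length is within range.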

After the short read file has been pre-processed, it is automatically passed to the patman alignment program along with the user-specified parameters, which are:

  • Maximum allowed mismatches (capped at 3nt)
  • Maximum allowed consecutive gaps (capped at 3nt)
  • Whether to align to the positive strand of the long read file only (unchecking this box searches both the positive and negative strands of the long reads)
  • Chunk size for the short read file. This option limits the amount of data patman is required to process in one go. On some platforms patman is limited to 2GB of process space, so reducing the amount of data it has to process may allow it to proceed if memory is a problem. The default setting of 3000000 should allow the vast majority of jobs to succeed on any platform, but if out-of-memory errors are reported, consider reducing this figure. Small performance gains are possible by increasing this figure when processing large sRNA files; however, the larger the figure, the higher the risk of the alignment failing.
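
The idea behind the chunk size setting can be illustrated with a minimal Python sketch that splits a stream of short reads into fixed-size batches, so that each batch could be handed to the aligner separately. The function name is hypothetical, and the sketch assumes the chunk size counts records; the unit the tool actually uses is not stated here.

```python
from itertools import islice


def chunked(records, chunk_size=3_000_000):
    """Yield successive lists of at most `chunk_size` records.

    Assumption: chunk size is measured in records, which this page
    does not confirm.
    """
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

A smaller chunk size means more, smaller batches, which lowers peak memory use per batch at the cost of some overhead.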

After alignment, the user can optionally filter the aligned hits further based on each short read's weighted abundance. The weighted abundance is calculated by dividing the abundance of the distinct short read by the number of times that read aligned to the genome.
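
The weighted abundance calculation can be sketched as follows. The data layout (one `(sequence, abundance)` pair per alignment) and the function names are assumptions made for illustration, not the tool's internal representation.

```python
from collections import Counter


def weighted_abundances(hits):
    """Map each distinct sequence to abundance / number of alignments.

    `hits` holds one (sequence, abundance) pair per alignment (assumed layout).
    """
    hits = list(hits)
    hit_counts = Counter(seq for seq, _ in hits)
    return {seq: abundance / hit_counts[seq] for seq, abundance in hits}


def filter_by_weighted_abundance(hits, minimum):
    """Keep only alignments whose read has weighted abundance >= minimum."""
    hits = list(hits)
    weights = weighted_abundances(hits)
    return [(seq, abundance) for seq, abundance in hits if weights[seq] >= minimum]
```

A read with abundance 10 that aligns to two locations therefore has a weighted abundance of 5, while a read with abundance 3 that aligns once keeps a weighted abundance of 3.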

Prüfer K, et al. PatMaN: rapid alignment of short sequences to large databases. Bioinformatics 2008;24:1530-1531
