Sequence Alignment

Available from Version: 1.0

The Sequence Alignment tool allows a user to align a FastA file containing short reads to another FastA file containing long reads. The most common use case maybe to align sRNA data file with a genome file to produce a list of hits. The tools pipeline involves performing some basic pre-processing (which the user has some ability to customise), running the sequence alignment tool itself, which currently is patman (Prüfer et al., 2008), to align the sequences to the genome, and finally to perform some optional post-processing. This tool make sequence alignment a much simpler task for the user by providing an easy to use GUI and pipelineing commonly performed actions together with the actual sequence alignment process.

To use the tool, the user must enter a short read file path, a long read file path and specify where the file output file should be saved. All other options on the GUI are optional.

The tool does a certain amount of pre-processing before sequence alignment, regardless of whether the user checks the “Enable pre-processing” box. This is to ensure that short read data being passed to the sequence aligner is in a suitable format and only contains reads that are likely to deliver sensible results. Specifically, the short read data is filtered to remove:

  • Any sequences less than 16nt long
  • Any sequences containing less than 3 distinct nucleotides
  • Any sequences containing invalid nucleotides (i.e. those containing nucleotides other than {‘A’,T’,’G’,’C’}.

Optionally, the user can specify additional pre-processing parameters:

  • Minimum sequence length (capped at 16nt)
  • Maximum sequence length (capped at 50nt)
  • Minimum abundance

After the short reads file has been pre-processed, it is automatically passed to the patman alignment program, with the user specified parameters, which are:

    • Maximum allowed mismatches (capped at 3nt)
    • Maximum allowed consecutive gaps (capped at 3nt)
    • Select whether to align to the positive strand of the long read file only (unchecking this box allows for searching of both positive and negative strands along the long reads).
    • Specify the chunk size for the short read file. This option is there to limit the amount of data patman is required to process in one go. On some platforms patman is limited to 2GB of process space so reducing the amount of data it is required to process may allow it to proceed if memory is a problem. The default setting of 3000000 should allow the vast majority of jobs to succeed on any platform, but if out of memory errors are reported, consider reducing this figure. Small performance gains are possible by increasing this figure when processing large sRNA files, however please keep in mind that the larger this figure the higher the risk of the alignment failing.

After alignment the user can optionally decide to do further filtering of aligned hits based on the short reads weighted abundance. The weighted abundance is calculated by dividing the distinct short read abundance by the number of times the distinct short read aligned to the genome.

Prüfer K, et al. PatMaN: rapid alignment of short sequences to large databases. Bioinformatics 2008;24:1530-1531

  • The sRNA Workbench

    Hi Kostya,

    Could you go into the folder user/logs and delete all of the log files you find inside (and in the new user directory) and try again?

    Cheers,
    Matt

  • Kostya

    Hello,
    I am testing your seq aligner on a Mac using the Arabidopsis files from the tutotial. But this is not working with the “xxx_align.patman” file being created in a second. But this file is empty. Any ideas why this is not working for me?
    Thank you

    • The sRNA Workbench

      Hi Kostya,

      Could you let me know if anything is written into the user/logs folder? Are the error logs empty?

      Cheers,
      Matt

  • Marius Snyman

    Hi Matt,

    I’ve been unable to change any alignment parameters as well as pre/post-processing preferences. Its as if what appears in the blocks are fixed and I can’t type anything into them.
    Do you have a solution? Maybe I must configure the software on our server somehow?
    Cheers,
    Marius

    • Hi Marius,

      Did you enable the check boxes above each set of parameters?

      Cheers,
      Matt

      • Marius Snyman

        Matt, thanx for the reply. Yes I did check the boxes. The check boxes are the only things I can select. The other boxes with values I can not change after I checked the “Enable pre-processing” box. I’m enabled to change these parameters on my colleague’s Mac. But I’m using Putty SSH to run the program from our server on my Windows machine.

        • Ahh, so you are forwarding the GUI to your windows machine through X11? If so I have not tested this. I usually login through a UNIX or LINUX terminal.

          I will just switch on my windows machine and try to configure a putty session. I will get back to you shortly…

          Cheers,
          Matt

          • Marius Snyman

            Okay. I use putty, so all I did was selecting “Enable X11 forwarding” under putty’s SSH settings. Also, one must have the Xming server installed as well. Just double-click on Xming icon after/before you changed the abovementioned putty SSH setting. Then login via putty onto your server. The terminal that open’s afterwards works just as any Linux terminal. Thanx for checking it out.

          • Hi Marius,

            I have just tested it out on my W7 machine. It seemed to work ok but I had to maximise the window before it would register the mouse clicks for some reason. However, after the main workbench window produced by XMing was maximised I was able to check the boxes and modify the params.

            Can you try with the window maximised if you havent already and let me know the result?

            Cheers,
            Matt

          • Marius Snyman

            Hi Matt.
            No, its still not registering the mouse click. Weird…

          • Marius Snyman

            Also, the Xming window were maximised when the workbench started.
            Marius

          • That is really strange, my Workbench window was maximised but the XMing window wasnt… I know it sounds a little odd but did you try a minimise (not to toolbar but just to small window) and maximise again?

            To begin with it would not allow the menus to remain dropped until I fiddled with it. I will do some googling to see if this is a problem related to Java swing or if it is a bug I have introduced somewhere (but these are pretty default buttons and seem to work ok on other distributions so far) perplexing indeed…

            Also, did you fire up the program from sRNAWorkbenchStartup.jar or just from Workbench.jar ?

          • Marius Snyman

            Hi Matt,
            Yes,I tried going from minimised to maximised and vise versa. I started the program from both jar files, and both did the same.
            Marius

          • After a bit of searching I found this information:

            “There is an apparently known issue with running Java programs via X which causes all sorts of problems in painting the windows correctly. In my case, the mouse position was being shown in one place but the menus were being highlighted about 50 pixels lower”

            http://www.yonahruss.com/unix/issues-running-java-programs-remotely-via-x.html

            I am just looking into a workaround. Unfortunately the likely workaround will be using a different X server on your windows machine…

          • Ok so the workarounds I have found used to be using a different AWT toolkit:

            http://docs.oracle.com/javase/6/docs/technotes/guides/awt/1.5/xawt.html

            http://superuser.com/questions/57407/problem-with-remoted-display-of-java-applications

            most websites I have read recommend using the MToolkit. (Although I have to say, it is weird that it works for me and not you, I will keep digging…) However, I have also read that the latest versions of Java do not support it and when I tried it Java crashed.

            Unfortunately the best thing to do will be to use a different X server on your windows machine because XMing is problematic for Java. Do you want me to recommend any?

          • Marius Snyman

            Hi Matt,
            I found a X server called MobaXterm. The workbench runs with it, and I can change settings, but only within 1 minute from start-up. Afterwards I get the same problem…clicking disabled.
            Marius

          • wow, that is even more strange! There must be something very wrong with how java forwards X to windows based consoles, unfortunately it will most likely affect all java programs you try to use. I will submit a new bug report to oracle. I dont know if you use it very often but I have heard the x client on Cygwin is ok. I will get the admin people to install it on my computer to check it out.

            The other option I use to interface with my servers is to have a virtual box with a linux dist on it but that is a bit of work to setup.

            As before I will keep attempting to find a solution but unfortunately Sun have closed their page with the original info on it

        • Marius Snyman

          Hi again Matt,
          I found a solution..
          I don’t know if you usually archive things like this, but here is a solution for W7 users who run into the same problem with Java:
          Install MobaXterm as X server. http://mobaxterm.mobatek.net/support/documentation.html#4_2_5 “I have an issue with an X11 remote program (Java/X11/Motif) which does not accept keyboard input: everytime I press a keyboard key, nothing is written on the text field

          This problem can generally be solved easily by using X11 with a window manager instead of using X11 in “multiwindow” (transparent) mode: go to MobaXterm “Settings” window –> “X11” tab –> select “Windowed mode with Fvwm” setting in the combo box and apply. Restart your remote program and it should take keyboard input correctly.
          If you want to use the standard “multiwindow” (transparent) X11 server and occasionally use the windowed X11 server with the window manager, you can go to MobaXterm “Settings” window –> “X11” tab –> select “Multiwindow mode” setting in the combo box and apply, then open the “tools” menu and select either “X11 tab with Dwm” or “X11 window with Fvwm2”.

          Cheers,
          Marius

          • Fantastic, this is very useful for other users which is why I like the comments section on this website!

            I am going to put a FAQ style page up at some point in the near future and I will be sure to add this to it!

            Thanks for sharing with the users 🙂

            Cheers,
            Matt

  • Gennady

    Hi,

    I am using your workbench for our small RNA sample and have a couple of questions.

    – Can the read file be fastQ or only fastA? If fastQ is ok, is the quality used?

    – When I use miRProf, it seems I cannot provide the patman aligned file; I have to provide fastA. Does patman count how many identical reads are in the fasta file and aligns only once? Or it will be much slower if the same read is repeated many times? I didn’t find such information in the patman paper.
    I am asking because I created a fasta file containing only unique reads, aligned it with patman (through helper tools) and then edited the aligned file to contain the correct count in brackets after each sequence. However, I couldn’t use this file as input for further analysis. Original fasta file has more than 100 million reads, while processed/unique-reads one is less than 1 million.

    Thank you

    • Hi Gennady,

      Both the Adapter removal tool and the Sequence Alignment tool can read FASTQ data files. The rest of the tools should be in FASTA format.

      Some of the tools in the workbench convert any FASTA data into a non-redundant format prior to processing. By non redundant, I mean that each sequence will be represented only once in the processing and output of any results with the abundance count shown in brackets on the descriptor line, no quality data is considered at this point however. miRCat for example, can handle FASTA files in redundant and non-redundant format but most of the other tools currently require non-redundant files (updates will change this to allow all tools to process redundant data very soon).

      If you wish to produce a non redundant FASTA file from a redundant FASTA file you can use the Filter program to conduct a “non-destructive” filtering. Just load the data into the tool and uncheck all filtering options apart from “make output non-redundant” (you can of course do other types of filtering at this point, either way all output from this tool will be in non-redundant format)

      Patman is not a program written by me (we use it because it is very nice for exact alignments), if you use the patman binary without the workbench it will align the sequences as they appear in the file (redundant or non-redundant) if you use the sequence alignment program through the workbench it will perform the data compression for you (and also split the file into manageable chunks with a chunk size you can modify through the GUI to allow the data to fit into memory)

      Does this answer your question?

      Thanks,
      Matt

      • Gennady

        Hi Matt,

        Sorry for the late reply – I thought I’ll get an email when someone answers.

        My problem is that the (java version) workbench eats all the memory – I used -Xmx64g and it’s not enough. I have a (trimmed before using the workbench) fasta file of 5.6G, which has many redundant reads, so I thought to use non-redundant formats somehow, but I couldn’t. So I made a 1% and a 10% redundant files (51M and 554M in size). Running miRProf (and miRCat, SiLoCo) worked fine for 1% file, but miRProf was again out of memory for a 10% file (with -Xmx32g; I didn’t try miRCat and SiLoCo for this file). While using 1% is useful, I am still hopeful to use the whole dataset. Do I do something wrong? Do I have to specify some parameters for patman even though I don’t explicitly call it (I don’t use helper tools, just the analysis tools now)?

        I am now trying to work with the Perl version but it seems to be somewhat outdated. I’ve made it to start running (just testing now) but its mirbase data is outdated and the update command doesn’t work. I’d like to avoid updating it manually, so if you could suggest of an easy way/fix to update that would be great. Or will this Perl version also eat a lot of memory?

        Finally, in the miRCat results for a 1% file, I’ve looked at a couple of high-count predictions that are indicated as novel (no associated known miRNA assigned). However, there are known miRNAs in the genome in this place, and they are actually present in the mirbase database used by the workbench. So it seems like a small bug to me, but maybe I miss something (maybe there is a snp).

        Thank you for your help,
        -Gennady

        • Gennady

          I’ve actually fixed the perl version to update to current mirbase (v19) and testing how it works …

          • Hi Gennady,

            Yes sorry about no notifications on reply. It might be a good idea for me to try and set something up that allows that but I can see it being a potential way in for spammers (I have to block many from this site!).

            If you wish to make a file non-redundant please try the filter tool. You can run it with no filtering options selected and only select “make output non redundant” Then use this file for input into miRProf. This should greatly reduce the size of the input FASTA file.

            The miRCat bug you mention is known to me, the way I check for known miRNA is not very accurate because of the variations that can be found in miRNA sequences, most people will run their output FASTA of predicted miRNA through miRProf. I am probably going to add this as a step that can be configured in miRCat so this doesn’t have to be done by the user…

            Please let me know if you manage to produce the non-redundant file through the filter tool and if it reduces your data set to a workable size because this kind of information is very valuable to the development of the software and working alone means I am always very dependant on (and grateful for) user feedback!

            Thanks,
            Matt

          • Gennady

            Hi Matt,

            Here is what I did to use miRProf (java workbench GUI). As I described previously, I gave it redundant fasta files and I’ve seen the pipeline produce non-redundant files in a temporary directory. But then it was going out of memory for 10% and full fasta files (after patman finished running). So I thought I have to (or can) provide a redundant fasta to miRProf. You are saying in your first response that input to miRProf should be non-redundant, and tell me to do a non-redundant file with the filter tool. I just did it — the filtering works fine, and I get a 41M non-redundant fasta file. But when I provide this small (41M), non-redundant fasta as an input for miRProf I get out of memory (after some time) just as I was getting out of memory before with redundant fastas. So it seems to me the problem is not in providing redundant or non-redundant files, but somehow maybe if the read counts are big it causes a problem.

            Meanwhile, the perl version of miRProf worked on the whole (5.6G) redundant fasta file. One issue with it, though, is that the weighted count is identical to raw count for all the miRNAs (while in the java version run with 1% as descibed in my previous post this is not the case, for the same miRNAs).

            Thank you,
            -Gennady

            THERE IS NO “REPLY” BUTTON UNDER YOUR LAST RESPONSE

          • Hi Gennady,

            Well that is very odd, Again I have definitely used far larger non-redundant files through miRProf on the java version and had no trouble even with small amounts of RAM. I will take my largest non-redundant file now and see if I can re-create the problem. Are your small RNA files private data? Or can they be provided to me for testing? Which organism is it?

            One question, did you supply a genome file to the Perl version? One advantage our original scripts have over the java code is that they can run miRProf without a genome (I will add this to the java miRProf asap). But this means that it cannot calculate genome normalised values.

            EDIT: Just ran some large (ish) non-redundant files through miRProf on my laptop using just 10GB of RAM (300M+) and it seemed to work ok, so as you say, it must be something with that particular dataset that is causing an issue… Very hard for me to diagnose or propose a solution I am afraid, is there anyway I can get a look at this data or something similar so I can see if we have a bug or need to prepare something for this?

            Thanks,
            Matt

          • Gennady

            Hi Matt,

            Presently I cannot send the data to you – my lab chief is abroad.

            Regarding the perl version – yes, I did use genome, and the progress report seems to reflect that – “matching to genome” is in the list:

            Uncompression of files: pre-processing_input_files
            checking miRBase files
            extracting_valid_sequences
            matching to t and rRNAs
            filtering matches
            matching to genome
            filtering matches
            matching to miRBase
            miRNA profiling: generating miRProf output

            So I don’t know why the weighted counts are not weighted (the number is just rounded to three significant digits but otherwise is identical to raw counts). I’ll try to look at the code.
            Thank you,
            -Gennady

        • Yiang

          Hi Gennady!
          I am in the same track of you!
          Could you share your experience with the PERL version of miRCat. I could not get it run at all!
          I feel lost as the PERL version is not as other modules that can be installed thru CPAN. I tried to export the libs, but there are bunch of error coming out. Too many to start with and put here.
          cc: Matt
          1) Does the PERL version take less memory as the workbench GUI, as the whole job is split into individual steps? Am I right?
          2) Do you have more detailed instruction for the PERL version beside the pdf document of the link?
          Thanks a lot!
          Yifang

          • Hi,

            1) yes in all of my tests the Perl version has used more memory than the Java version of miRCat and as it is sequential (rather than the java multi threaded version) it takes far longer too. I usually use A.th samples for testing and in nearly all of them it completes the entire procedure in less than 20 minutes compared to over 2 hours for the perl version and consumes around half the amount of memory.
            2)Unfortunately not, the pdf is the best set of instructions we have for the Perl version. We do not officially support that code anymore as we are attempting to completely supersede the Perl with the Java program in the next year or so (when the web server dies we do not have any resources to replace it so are trying to offer a downloadable equivalent along with installing it on other we resources such as iPlant)

          • Gennady

            Hi Yifang,

            I am not sure at what stage you’re with the perl code. If you are interested in miRCat specifically, then I can just tell you that it is presently running for me for the whole dataset (redundant fasta, 6.3G).

            On the other hand, miRProf finished fine for the whole dataset, and the only issue I see now is that weighted counts are reported identical to raw counts – see my previous posts.

            If you are unable to start the program at all – here is what I did. I downloaded the two zip archives, and unzipped them in my directory; then I added the bin directory to the PATH variable. Then I took an example command from the documentation.pdf file, and put my genome and my fasta file: ./srna-tools.pl –tool mirprof … Initially it was complaining that it does not find certain package(s) in the @INC etc., so I, as a superuser, installed the mentioned packages, one by one, using cpan command (there were 4 or 5 packages to install). I didn’t have to install bioperl because I had it already. I also had to modify the tools.conf file so that it would include needed keys that are given on the command line – see mirbase_dbs hash around line 108. In order to update to a new version of mirbase (v19) I also had to modify lib/SrnaTools/Application/CliApp.pm around lines 403-417 because now instead of the “organism.txt” file mirbase has “organisms.txt.gz” (in plural, and gzipped).

            HTH,
            Let me know if you have more questions,
            -Gennady

  • bharath

    Hi,
    I am trying to use alignment tool using sample test data of fastq but i got error
    “String index out of range: -10”. could you please suggest me what wolud be the reason and how to fix this error.

    thanks,
    bharath

    • Hi Bharath,

      It is hard to say exactly what is causing this without a little more information. Sometimes this sort of thing is related to the input data. Could you email me the first 100 lines of your input FASTQ file?

      matthew.stocks@uea.ac.uk

      perhaps I can figure out what the problem is from there.

      Thanks,
      Matt

    • bharath

      Hi,

      Apologies for the delay.
      I used tutorial dataset downloaded from here only. The file name is tomato.fastq and reference genome file is Ath_TAIR9.fa

      thanks,
      bharath

      • Hi Barath,

        The issue is just the wrong tutorial files being used. If you check out the tutorial page the files you should use to complete this tutorial are the genome file you are using (ATH) and:

        GSM118373_Rajagopalan_col0_leaf_nr.fa

        you have the tomato fastq data that is to be used in the adapter removal tutorial (the files that should be used for each tutorial are listed just above the video along with the location inside the tutorial data directory that you have downloaded).

        In any future use with real data, fastq data must be processed with the adapter removal tool first to be trimmed and converted into FASTA data before further analysis.

        Thanks,
        Matt

  • Mister Mustard

    Hi,
    could you explain which values are shown in the .patman output file? Is the number between brackets just behind the sRNA sequence the sRNA accumulation value? Thanks in advance.

    • admin

      Hi,

      Apologies for the delay in reply!

      Yes the value is the non-redundant count for the sample. For example:

      if the sequence: TTCAGAGTTTACAGTCCGAC appeared 50 times within a FASTA file being aligned, the result would appear just once in the output file as:

      >TTCAGAGTTTACAGTCCGAC(50)

      So all aligned sRNA sequences in the resultant .patman file are unique. (they may of course match more than once to the genome but they will be the same sequence with the same abundance score)

      I hope this helps. Please feel free to reply with any further questions

  • Chintan Vora

    Can you tell me which mapping tool is sRNA workbench using?

A suite of tools for analysing micro RNA and other small RNA data from High-Throughput Sequencing devices