SiLoCo

Available from Version: 2.0

This tool predicts sRNA loci using the method described in [1] and [2]. It also enables the user to compare the expression profiles of sRNA loci between different samples. To determine the relative positions of sRNAs, the reads are mapped to the reference genome using PatMaN [3]. Only full-length, perfect matches are accepted as hits.

The genome-matching reads are normalised [4] and weighted by repetitiveness. The normalisation method divides hit counts by the number of redundant reads that match the genome. The normalised count for each distinct read is given in “hits per 1 million matching reads”. Because it is impossible to decide where a sRNA with multiple matches to the genome originated, we correct the normalised read abundance for repetitiveness by dividing it by the number of matches to the genome. The result is a weighted hit count.

The method uses the normalised and weighted read abundance and the relative positions of sRNAs on the reference genome to predict the sRNA loci. A locus must have a minimum of 3 weighted sRNA hits (this threshold can be adjusted using the min hits parameter) and no gap (absence of sRNA hits) longer than 300nt (this threshold can be adjusted using the sRNA loci distance parameter).
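The weighting and gap rule described above can be sketched roughly as follows. This is an illustration only, not SiLoCo's own code: the names `Hit`, `weighted_count` and `predict_loci` are hypothetical, the gap is measured from the end of the furthest hit seen so far to the start of the next hit, and "minimum of 3 weighted sRNA hits" is read here as "at least 3 hits in the locus", which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    chrom: str       # chromosome or scaffold name
    start: int       # start position of the match
    end: int         # end position of the match (inclusive)
    weighted: float  # normalised count divided by number of genome matches


def weighted_count(raw_abundance: int, total_matching_reads: int,
                   genome_matches: int) -> float:
    """Hits per 1 million genome-matching reads, corrected for repetitiveness."""
    normalised = raw_abundance * 1_000_000 / total_matching_reads
    return normalised / genome_matches


def _summarise(group: List[Hit]) -> Tuple[str, int, int, int, float]:
    # (chromosome, locus start, locus end, number of hits, summed weighted count)
    return (group[0].chrom,
            min(h.start for h in group),
            max(h.end for h in group),
            len(group),
            sum(h.weighted for h in group))


def predict_loci(hits: List[Hit], max_gap: int = 300, min_hits: int = 3):
    """Group hits into loci; assumed rule: split when the gap exceeds max_gap."""
    loci = []
    current: List[Hit] = []
    current_end = 0
    for hit in sorted(hits, key=lambda h: (h.chrom, h.start, h.end)):
        same_chrom = bool(current) and hit.chrom == current[0].chrom
        if same_chrom and hit.start - current_end <= max_gap:
            current.append(hit)
            current_end = max(current_end, hit.end)
        else:
            # Close the previous group and keep it only if it has enough hits.
            if len(current) >= min_hits:
                loci.append(_summarise(current))
            current = [hit]
            current_end = hit.end
    if len(current) >= min_hits:
        loci.append(_summarise(current))
    return loci
```

With the defaults, three hits on the same chromosome separated by gaps of 300nt or less form one locus, while an isolated hit further away is discarded.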

The datasets must contain sRNA sequence reads in FASTA format, in redundant form, i.e. with one entry for each read. Sequences shorter than 18nt (minsize parameter) or longer than 30nt (maxsize parameter) will be removed.
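As a rough illustration of the size filter, the sketch below reads redundant FASTA records and keeps only sequences within the stated length range. The function names `read_fasta` and `filter_by_length` are hypothetical and this is not the workbench's own parser; the 18 and 30 defaults are the values quoted in the paragraph above.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def filter_by_length(records: Iterable[Tuple[str, str]],
                     min_size: int = 18,
                     max_size: int = 30) -> Iterator[Tuple[str, str]]:
    """Keep reads whose length lies between min_size and max_size inclusive."""
    for header, seq in records:
        if min_size <= len(seq) <= max_size:
            yield header, seq
```

Because the input is in redundant form, identical sequences appear as separate entries and are all kept or all removed together.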

Required Parameters:

  • Genome File: The location of the genome file in FASTA format.
  • sample names: The locations of the sRNA samples.

Input files are entered using the box displayed below:

Input dialogue for SiLoCo. Enter your databases for each sample and your genome file

Optional Parameters:

  • sRNA loci distance: The maximum gap length in a locus (default max gap = 300).
  • max size: The maximum length of a sRNA (18 ≤ maxsize ≤ 35, default maxsize = 25).
  • min size: The minimum length of a sRNA (18 ≤ minsize ≤ 35, default minsize = 25).
  • min sRNA locus size: The minimum number of sRNAs in a locus (1 ≤ min hits, default min hits = 3).
  • max genome hits: The maximum number of times a sRNA can hit the genome (18 ≤ minsize ≤ 35, default minsize = 18).

The results are presented in a table, as shown in the image below.

An example of the results from a two sample SiLoCo run

The header of each column contains the description of the data and the name of the sample file. Locus data is shown in a table with the following columns:

  • Chromosome, start/end position and length: Genomic location and length of the locus in nucleotides. Some incomplete genomes may not yet be assembled into chromosomes, and the accessions listed here may be scaffolds or BACs instead. The list is initially sorted by chromosome and position.
  • Raw count: Sum of read abundances in samples 1 and 2 that form the locus (not corrected for repetitiveness).
  • Weighted count: Sum of raw read abundances, each divided by the number of matches of that sequence to the genome.
  • Normalised count: Sum of weighted counts divided by the total number of genome-matching reads in each sample, given in “hits per 1 million genome-matching reads”. Normalised counts (abundances) are comparable between samples.
  • Uniquely matching reads (optional): Number of sequence reads in the locus that have only a single match to the genome.
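To make the column definitions concrete, here is a small sketch that computes them for one locus in one sample. The function name `locus_counts` and its input layout are hypothetical, and counting "uniquely matching reads" as the summed abundance of sequences with exactly one genome match is an assumption; the tool may count distinct sequences instead.

```python
from typing import Dict, List, Tuple


def locus_counts(reads: List[Tuple[int, int]],
                 total_matching_reads: int) -> Dict[str, float]:
    """Compute the table columns for one locus in one sample.

    reads: (raw_abundance, genome_matches) for each distinct sequence in the locus.
    total_matching_reads: total genome-matching (redundant) reads in the sample.
    """
    raw = sum(abundance for abundance, _ in reads)
    # Each sequence's abundance is divided by its number of genome matches.
    weighted = sum(abundance / matches for abundance, matches in reads)
    # Scale to hits per 1 million genome-matching reads.
    normalised = weighted * 1_000_000 / total_matching_reads
    # Assumption: reads whose sequence matches the genome exactly once.
    unique = sum(abundance for abundance, matches in reads if matches == 1)
    return {"raw": raw, "weighted": weighted,
            "normalised": normalised, "unique": unique}
```

For example, three sequences with abundances 10, 4 and 6 that match the genome 1, 2 and 3 times give a raw count of 20 and a weighted count of 14; in a sample with 2 million genome-matching reads the normalised count is 7.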

The context menu operates on the currently selected result line. ‘Show in VisSR’ will display the selected locus in VisSR. An example of a locus shown in VisSR is below:

A locus, predicted in SiLoCo and displayed in VisSR

[1] Attila Molnar, Frank Schwach, David J Studholme, Eva C Thuenemann, and David C Baulcombe. miRNAs control gene expression in the single-cell alga Chlamydomonas reinhardtii. Nature, 447(7148):1126-1129, Jun 2007.
[2] Rebecca A Mosher, Frank Schwach, David Studholme, and David C Baulcombe. PolIVb influences RNA-directed DNA methylation independently of its role in siRNA biogenesis. Proc Natl Acad Sci U S A, 105(8):3145-3150, Feb 2008.
[3] Kay Prüfer, Udo Stenzel, Michael Dannemann, Richard E Green, Michael Lachmann, and Janet Kelso. PatMaN: rapid alignment of short sequences to large databases. Bioinformatics, 24(13):1530-1531, Jul 2008.
[4] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods, 5(7):621-628, Jul 2008.

  • The sRNA Workbench

    Hi Cllorens,

    Were you able to run the test data through on that computer?

    As a piece of software, it requires far more memory than I had hoped, but I just haven’t had enough time to fully optimise it yet. I usually just run it on our server with lots of memory to ensure it finishes.

  • cllorens

    Hi Mat,

    May I use replicates to perform a SiLoCo analysis? Or, if I have replicates, may I cat them into a single file? Thank you in advance.

    cheers

    • The sRNA Workbench

      Hi Cllorens,

      Yes, it can use replicates for loci prediction. It will perform the concatenation for you; just enter each file you wish to use on a new line (or use the file dialog to select all of the files required).

      In addition, CoLIde can use replicates as well as separating each replicate into samples to improve the loci prediction.

      Let me know if this helps or if you need further info!

      Cheers,
      Matt

      • cllorens

        Hi Matt
        Thank you for your fast reply. Yes, I am clear about how to proceed with CoLIde, but it is still unclear to me how to proceed with SiLoCo using replicates.

        Did you mean that if I upload the files, either using the file dialog of the GUI or comma-separated via the command line, the tool will differentiate the replicates? I would understand it if I labelled them using a common sample name that also indicates the number of each replicate.

        For instance, in my case I have two conditions (say nexo and texo) and three replicates. To let the tool differentiate the replicates, should I label the samples as NEXO1, NEXO2, NEXO3, TEXO1, TEXO2, TEXO3?

        • The sRNA Workbench

          Hi Cllorens,

          No, SiLoCo will merge all of the files it finds into one for the alignment, then remember which files contained each individual sequence and use that in the results. Therefore SiLoCo has no way of telling samples apart. It just uses all the data it finds to predict a locus. Only CoLIde can set samples apart from replicates.

          Cheers,
          Matt

          • cllorens

            Ok Matt,
            Thank you, I did it as you said and got it. I am also trying to run CoLIde but have out-of-memory problems even if I assign more memory using -Xml etc. I don’t think my raw material is large enough to require more than the 30Gb of RAM I am using (six samples of 5 million reads each against GRCh38). I think I can try to tune Java a little more to solve this; if I find a way I will post it in the SiLoCo forum.

          • The sRNA Workbench

            Hi Cllorens,

            Yes, it is quite memory intensive, like many of the applications in the original workbench. The latest version has a new system of using disk space instead of RAM for its computation. However, CoLIde and SiLoCo have not yet received the update (I am trying to get each tool done as soon as possible).

            It does seem like a large amount of RAM, but it really is down to the data itself more than anything; lots of repeated elements, for example, can cause huge alignments even from seemingly small datasets. Again, the latest version has the ability to cap the alignments, but only for the newest tools (Quality Check, Normalisation, Differential Expression); this will filter through to the other tools.

            Out of interest, what command did you use to up the RAM given to the JVM?

            Cheers,
            Matt

  • Devika Parvathy

    Hi, is the sRNA Workbench suitable for analysis of small RNAs of bacterial origin?

    • The sRNA Workbench

      Hi Devika,

      It may be suitable for some analyses. What type of things are you looking to do?

      Cheers,
      Matt

      • Devika Parvathy

        Hi, I am a little late to follow up. I intend to identify differentially expressed sRNAs from bacterial sRNA-seq under different conditions. Thank you.

  • Luisa Fernanda Bermudez

    Hi!
    I ran SiLoCo using the command line and I would like to know if there is any option to report the % of reads mapped. Also, is it possible to visualize some loci in VisSR after a command-line run?
    Thank you very much! This program helped me a lot in solving many problems. It is excellent work!

    • The sRNA Workbench

      Hi Luisa,

      Thanks for your message, and sorry for the delay in responding. Something broke with the comments system and I was not notified until I checked manually!

      Unfortunately, the outputs of the tools cannot be directly loaded into VisSR (apart from the alignment tool which can show the aligned sequences as a new track).

      By percentage of reads mapped, do you mean those that mapped to the genome or each locus?

      Cheers,
      Matt

      • Luisa Fernanda Bermudez

        Hi Matt,
        Thanks for your reply.
        By percentage of reads mapped I refer to the total mapped to the genome.

        Thank you!
        Luisa

        • The sRNA Workbench

          Hi Luisa,

          That information cannot be given directly in SiLoCo, but the Version 4 alpha can give you this information using the quality check pipeline.

          It is at a very early stage but it should be good to run on the majority of datasets. I am going to release a new alpha build very soon, along with some tutorial videos. The first report in the pipeline will give you percentages for non-redundant and redundant mapping, along with complexity information for all files in the dataset and many other useful statistics.

          Let me know if you have any trouble using it, or if you need further information.

          Cheers,
          Matt

  • The sRNA Workbench

    Hi Nathan,

    The abundance value is the normalised (per total) count for that sample. The raw abundance is not reported in either the GUI or CLI version (it should usually be reporting a decimal value of some kind?)

    Cheers,
    Matt

  • zehong D

    Hello,

    As posted before, the following columns are in output.csv:
    Abundance,Unique sRNAs,Average Size Class,Strand Bias,Mean Count

    and the abundance column should be the RPM normalised value if we provided a genome.

    My question is: what kind of test should be used to find differentially expressed sRNAs between two samples when using normalised RPM as the input? Is there a way to generate the raw count data for each sRNA? Thanks.

    Best,
    Zehong

    • The sRNA Workbench

      Hi Zehong,

      Typical tests for differential expression between two samples could include calculating the offset fold change in abundance for a specific sRNA you are interested in.

      Yes, if you want the count for each small RNA in a locus you can get this when running the software in GUI mode. Navigate the table to the locus you are interested in, right click, then select export individual sequences to write these to file (alternatively, you can write the entire locus out).

      Hope this helps,
      Let me know if you need any further information!

      Matt
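One common way to write an offset fold change is sketched below. The reply above does not define the formula, so the log2 form, the function name `offset_fold_change` and the default offset of 20 are all assumptions for illustration; choose an offset that suits the scale of your normalised abundances.

```python
import math


def offset_fold_change(abundance_a: float, abundance_b: float,
                       offset: float = 20.0) -> float:
    """log2 ratio of two abundances after adding a constant offset to each.

    The offset (an assumed value here) damps large ratios that arise
    purely from very low counts, and avoids division by zero.
    """
    return math.log2((abundance_a + offset) / (abundance_b + offset))
```

Positive values mean higher abundance in the first sample, negative values higher abundance in the second, and two low-abundance values such as 0 and 2 give a result close to zero.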

  • Evan Foley

    Hello,

    I am working with a sRNA set that I would like to map to a custom database using SiLoCo. The data was sequenced in 2010 by LC Sciences and, because there was no established genome at the time, mapping the reads to a database was unsuccessful. I currently have all of the processed and filtered sRNAs that have been determined mappable, because LC Sciences returns a file of mappable reads. I am having difficulty taking these mappable read files and comparing them to my genome in SiLoCo, and am wondering if the issue is due to the company that sequenced them. Should I start from scratch with my .Raw data using the UEA workbench?

    Thanks!

    • The sRNA Workbench

      Hi Evan,

      Could you send me a small exact sample of one of the files you are trying to use in SiLoCo? A few lines should be enough, just so I can have a look at the format. My email address is:

      matthew.stocks@uea.ac.uk

      Cheers,
      Matt

  • Carlos Pérez Arques

    Hello,

    I’ve run SiLoCo several times and it seems I can’t get it to run with params other than the default ones. For instance, I select a minimum locus size of 100, and when I retrieve the .csv file I find loci of smaller, varying sizes (even 16 bp loci); same thing with minimum abundance, which I set to 5 and then obtain some practically empty loci. These are the only params I can check, because neither sRNA min and max size nor cluster sentinel are shown in the output, so I really don’t know if these work. I’ve tried running SiLoCo from the command line too, specifying -params from a .cfg file (copied from the default params in the data folder and adjusted to the values I want), but this didn’t work either. Am I doing something wrong? I would have expected other people to be discussing this issue, but it seems I’m the only one.

    Edit: I’m running the latest version, which at this moment is 3.2, but I tried all versions available (2.0 and so forth) with the same result.

    Thanks in advance,

    Carlos.

    • The sRNA Workbench

      Dear user,

      The cluster sentinel determines how far apart small RNA hits can be while still being considered part of the same locus. The rest of the params you are having issues with appear to be affected by a bug I have introduced where they are not being read correctly. I will fix this for the next version. It probably hasn’t been spotted yet because many people are using CoLIde for their locus prediction instead. Also, tools that use SiLoCo as part of their process (miRCat) are not affected by the bug.

      Thanks for bringing this to my attention, I will add it to the list! Let me know if you need any further information.

      Best wishes,
      Matt

  • Sarah Rogans

    Hi Matt
    I was just wondering what the values for strand bias mean? I am getting either 0, 1 or values like 0.25, 0.75, 0.89. I was just wondering which values mean negative or positive strand.
    Thanks so much
    Sarah Rogans

    • The sRNA Workbench

      Hi Sarah,

      The strand bias is calculated as follows:

      1: Sum up all sequence abundances on the positive and negative strands for that locus.

      2: Subtract one total from the other, add the two totals together, and (assuming both results are positive) divide the difference by the sum:

      (totalpositive-totalnegative)/(totalpositive+totalnegative)

      The result is a value indicating which strand the locus as a whole is biased toward (closer to one means a bias toward positive), for example:

      total positive = 10
      total negative = 1

      9/11 = 0.82

      clearly a bias toward positive in this case.

      I hope this helps!

      Cheers,
      Matt
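The strand bias calculation in the reply above can be written as a short function. This sketch implements the formula exactly as quoted, so it returns values between -1 and 1; whether the tool itself reports the absolute value is not stated in the thread, and the function name `strand_bias` is hypothetical.

```python
def strand_bias(total_positive: float, total_negative: float) -> float:
    """(positive - negative) / (positive + negative) for one locus."""
    total = total_positive + total_negative
    if total == 0:
        raise ValueError("locus has no abundance on either strand")
    return (total_positive - total_negative) / total


# Worked example from the reply: 10 on the positive strand, 1 on the negative.
print(round(strand_bias(10, 1), 2))  # 0.82
```

Under this form of the formula, a value of 0 means both strands contribute equally, and a negative value means most of the abundance lies on the negative strand.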

  • The sRNA Workbench

    Hi nb,

    if the library size is 1.5 M then for a read with raw abundance 1 (raw abundance = number of times you see a sequence in the sample) you will obtain a normalized expression level 1/1.5 = 0.6(6).

    However, when you say library size, do you mean redundant or non redundant? The RPM normalisation is done on redundant counts. Our latest libraries are getting a non-redundant count of around 1.5-2m.

    Did you supply a genome?

    Cheers,
    Matt

    • nb

      Yes, I did provide a genome.
      1.5 M is the non-redundant, in the original fastq files.
      So, which of the following quantities should I use as the normalized expression values?

      Abundance,Unique sRNAs, Average Size Class,Strand Bias,Mean Count,

      Thanks, any prompt response will be highly appreciated.
      Best.

      • The sRNA Workbench

        Hi,

        Sorry for the delay. If you provided a genome to the program, then the abundance column will contain the RPM normalised value.

        Cheers,
        Matt

  • The sRNA Workbench

    Hi,

    1:
    An sRNA is one sequence with a specific “task”/target, e.g. miRNAs.
    An sRNA locus is the genomic region which can produce sRNAs, e.g. a miRNA locus or a heterochromatin locus. You can review our CoLIde paper here:

    https://www.landesbioscience.com/journals/rnabiology/article/25538/?nocache=1520344760

    and this paper from Molnar:

    http://www.ncbi.nlm.nih.gov/pubmed/17538623

    2:
    The length for plant sRNAs strongly depends on what type of sRNA you are looking for. miRNAs for example will usually be 21nt long and piRNAs 28nt.

    I hope this helps,

    Matt

    • nb

      Thanks a lot Matt, That is very helpful. I have a follow-up question:
      Then how do I go from the sRNA loci that SiLoCo predicts to the actual sRNAs, and their expression in a given NGS library?
      Thanks again Matt.

      • The sRNA Workbench

        Hi,

        You can export the sequences that form any locus by right clicking on the row in the table you are interested in. From this menu you can choose to output each sequence to FASTA (with the abundances embedded into the FASTA header) or the sequence of the entire locus. You can also view the locus in the VisSR tool from this menu.

        I hope this helps,
        Matt

  • Kenlee Nakasugi

    Hi Matt,
    I’m using the latest v2.5, and the output from SiLoCo doesn’t appear to have normalized values as described above, just the raw abundance and unique hits and a few other columns. I’m hoping to use the normalized values for some other analysis. Will this field be back?
    Also, the default parameter for min. size is set to 16 in the params window – is this intended?
    Some other notes:
    – parameters won’t write to disk even when clicking save
    – when exporting the output to csv, if one wants to over-write a previously saved file, it actually appends to it instead of overwriting. I had a 20Mb file saved initially, and after a second analysis it doubled to 40Mb. The number of lines doubled exactly too.
    Cheers
    Ken

    • Hi Ken,

      Yes the minimum size is set to 16 by default. Do you need to examine smaller sequences?

      I am looking into the file write issues you have reported. Hopefully it should not be too tricky to re-create and fix!

      Thanks for letting me know,
      Matt

  • I am using a parameter file that contains this

    max_genome_hits=100
    min_abundance=1
    cluster_sentinel=100
    min_length=18
    max_length=26
    min_locus_size=100

    I get this error

    Illegal min_length parameter value. Valid values: 16 <= min_length <= 0.

    But I cannot see what the problem is. I have run this from the GUI using these settings and it works fine there. When I try to save the settings from the GUI no file gets written to disk.

    I also cannot see how to name the output if I run this from the command line. I ran once without specifying -params and the program ran but no output was written in the current directory.

    When I run this with a large number of samples (24) I get many loci generated with no or very low expression in all samples. Should it be the case that at least one of the samples has a clear expression signal for each locus detected?

    • Hi,

      I have figured out the problem here. It is just a silly bug I have introduced while setting up the program. Clearly <=0 is incorrect.

      I have fixed it and will add it to the change list for the next release. Thanks for pointing it out! Output for SiLoCo is placed into the user/SiLoCoData directory in a time/date stamped folder; you should find your results there (but you will not be able to modify the command line params until I release the fixed version of the code, I am afraid).

      Also, for your large 24-sample experiment: with SiLoCo, no, you may not find that each locus has a strong expression profile, because the rule-based conditions used to determine loci are likely to produce many false positives (we have found in tomato and A.th data that 1/3 of the predicted loci had a high chance of being real and the other 2/3 could be degradation products).

      For this reason, we have developed a new locus detection tool that will improve detection for your experiment. It is based on a statistical approach to locus detection and, although it will not replace SiLoCo completely, will undoubtedly provide you with a result set that is far more usable!

      We hope to make this tool available in the next few weeks. Most likely we will have two releases: one with the bug fixes and a few features, and then a major release with the new tool.
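While waiting for a fixed release, a parameter file like the one posted above can be sanity-checked outside the tool. The sketch below is a stand-alone helper, not part of the workbench: `parse_params` and `check_lengths` are hypothetical names, skipping lines that start with `#` is an assumption about the file format, and the only check performed is that min_length does not exceed max_length, since the tool's true valid ranges are not given in this thread.

```python
from typing import Dict


def parse_params(text: str) -> Dict[str, int]:
    """Parse key=value lines (integer values) into a dictionary."""
    params: Dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        params[key.strip()] = int(value.strip())
    return params


def check_lengths(params: Dict[str, int]) -> None:
    """Raise ValueError if the length window is empty."""
    if params["min_length"] > params["max_length"]:
        raise ValueError("min_length must not exceed max_length")


example = """max_genome_hits=100
min_abundance=1
cluster_sentinel=100
min_length=18
max_length=26
min_locus_size=100
"""
params = parse_params(example)
check_lengths(params)  # passes: 18 <= 26
```

A check like this only confirms that the file is internally consistent; it cannot tell you whether the running version of the tool actually reads the values.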

  • In the first column of the csv file output there seems to be a formatting issue. I have values like these

    Locus
    scaffold_11989-2047
    scaffold_16151-6168

    I guess there should be a delimiter between the scaffolds name and the start position?

    • Hmm, that is strange; usually there is a ‘/’ between the chromosome name and the start and stop.

      To be honest this format is just a legacy from the original scripts. For the next release I will just put all data into separate columns anyway and this should stop any type of problem like this appearing in the future.
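For anyone post-processing the csv in the meantime, the sketch below splits a Locus value of the form name/start-end and refuses values where the ‘/’ is missing, because a string such as scaffold_11989-2047 cannot be split unambiguously. The exact layout is assumed from the reply above, the example value `scaffold_1/1989-2047` is hypothetical, and `parse_locus` is not part of the workbench.

```python
import re
from typing import Tuple

# Assumed layout: <name>/<start>-<end>, where the name may itself contain digits.
LOCUS_RE = re.compile(r"^(?P<name>.+)/(?P<start>\d+)-(?P<end>\d+)$")


def parse_locus(value: str) -> Tuple[str, int, int]:
    """Split a Locus column value into (name, start, end)."""
    match = LOCUS_RE.match(value)
    if match is None:
        raise ValueError(f"no '/' delimiter found in locus value: {value!r}")
    return match.group("name"), int(match.group("start")), int(match.group("end"))
```

Raising an error on the delimiter-less form is deliberate: guessing where the scaffold name ends would silently produce wrong coordinates.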

A suite of tools for analysing micro RNA and other small RNA data from High-Throughput Sequencing devices