CoLIde

Available from Version: 3.0

This tool infers the location of significant biological units, known as sRNA loci, by combining genomic location with the analysis of other information such as variation in expression levels (expression pattern) and size class distribution. In the CoLIde tool we define a locus as a union of regions sharing the same pattern, located in close proximity on the genome. Biological relevance, detected through the analysis of size class distribution, is presented for each locus.

This tool can be used on an ordered (e.g. time-dependent) or unordered (e.g. organ, mutant) series of samples, both with and without biological/technical replicates. The tool reliably identifies known types of loci and shows improved performance on sequencing data from both plants (e.g. A. thaliana, S. lycopersicum) and animals (e.g. D. melanogaster) when compared to existing locus detection techniques.

Setup

The tool first requires information on how many samples form the experiment.

CoLide Setup
First select the sample count

From here it will require you to input the files that relate to each sample:

Required Parameters:

  • Genome File: The location of the genome file in FASTA format.
  • Sample Names: The locations of the sRNA samples and their optional replicates

Input files are entered using the box displayed below:

Either use the history browser, file browser or type the path to the list of files for each sample

Each sample can be modified (i.e. have files added and removed) individually by selecting the desired sample number from the table below:

The tabbed interface allows you to access all the samples and modify them individually

Series Type Parameters

  • Ordered Series: (select this option if order is important to the experiment e.g. time series)
  • Unordered Series: (select this option if order is not important to the experiment e.g. organ series)

The confidence intervals (CIs) are also controlled using the following parameters, which determine the percentage of replicated measurements to be included in each CI.

Non Replicate Data – Confidence Interval Control

  • Percentage CI: This determines the percentage to add to either side of the normalised expression

Replicate Data – Confidence Interval Control

  • Min Max: Use the minimum and maximum normalised expression value to determine the confidence interval (100%)
  • +-SD: CI is mean +- 1 standard deviation (67%)
  • +-r(2)SD: CI is mean +- standard deviation divided by the square root of 2 (50%)
  • +-2SD: CI is mean +- 2 x standard deviation (95%)
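
The interval options above can be sketched in a few lines of Python. This is a minimal illustration, not the Workbench's own code: the function names and method keys are hypothetical, the sample standard deviation is assumed (the page does not say whether sample or population SD is used), and "Percentage CI" is assumed to mean a percentage of the normalised expression value itself.

```python
import math
from statistics import mean, stdev  # stdev is the sample standard deviation (assumption)


def replicate_ci(values, method="minmax"):
    """Return (low, high) for one sample's replicate expression values.

    method is one of the hypothetical keys: "minmax", "sd",
    "sd_over_sqrt2", "2sd", mirroring the four options listed above.
    """
    if method == "minmax":
        return (min(values), max(values))
    m = mean(values)
    sd = stdev(values)
    half_widths = {
        "sd": sd,                          # mean +- 1 SD
        "sd_over_sqrt2": sd / math.sqrt(2),  # mean +- SD / sqrt(2)
        "2sd": 2 * sd,                     # mean +- 2 SD
    }
    hw = half_widths[method]
    return (m - hw, m + hw)


def percentage_ci(value, percent):
    """Non-replicate case: widen a single value by `percent` on either side.

    Assumes the percentage is taken of the value itself.
    """
    delta = value * percent / 100.0
    return (value - delta, value + delta)
```

For example, `replicate_ci([10.0, 12.0, 14.0], "2sd")` builds the interval around the mean of the three replicates, while `percentage_ci(100.0, 10.0)` widens a single measurement by 10% on each side.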

Percentage Overlap: controls the amount each confidence interval must overlap to be considered a straight pattern
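
One way to picture the Percentage Overlap setting is sketched below. Everything here is an assumption made for illustration: the page does not say how overlap is normalised (the shorter interval is used here), and the symbols "S", "U" and "D" for straight, up and down are hypothetical labels rather than the tool's documented output.

```python
def overlap_percent(ci_a, ci_b):
    """Overlap of two (low, high) intervals as a percentage of the shorter one."""
    low = max(ci_a[0], ci_b[0])
    high = min(ci_a[1], ci_b[1])
    overlap = max(0.0, high - low)
    shorter = min(ci_a[1] - ci_a[0], ci_b[1] - ci_b[0])
    if shorter == 0:
        # A zero-width interval counts as fully overlapping if it lies inside the other.
        return 100.0 if high >= low else 0.0
    return 100.0 * overlap / shorter


def pattern_symbol(ci_a, ci_b, min_overlap_percent):
    """Classify the change between two consecutive samples.

    Returns "S" (straight) when the intervals overlap enough, otherwise
    "U" or "D" depending on the direction of the interval midpoints.
    """
    if overlap_percent(ci_a, ci_b) >= min_overlap_percent:
        return "S"
    mid_a = (ci_a[0] + ci_a[1]) / 2
    mid_b = (ci_b[0] + ci_b[1]) / 2
    return "U" if mid_b > mid_a else "D"
```

Under these assumptions, raising the threshold makes it harder for two samples to be called "straight", so more changes between samples are reported as up or down.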

The results are presented in a Table as shown in the image below.

The table presents itself in a tree structure, click the icon on each chromosome you are interested in to view the loci

The header of each column contains a description of the data and the name of the sample file.
Locus-data is shown in a table with the following columns:

  • ID: Split by chromosome/scaffold: each locus is numbered per chromosome.
  • Start: Start coordinate for locus
  • End: End coordinate for locus
  • Length: Locus length
  • P-Val: The probability value for the locus as calculated from the chi-square statistic
  • Sample 1-n: The expression series for this locus
  • Chromosome: The chromosome this locus resides on
  • Differential Expression: The absolute differential expression for this locus
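
The Differential Expression column can be illustrated with the "absolute offset differential expression" that the maintainers describe in the comments on this page: an offset is added to the minimal and maximal expression values of the series, and the result is ln(min/max) / ln(2), reported as a non-negative value. The sketch below assumes the absolute value is taken to obtain that non-negative result; the function name is hypothetical and the offset is left as a parameter because the page does not state its value.

```python
import math


def absolute_offset_de(expression_series, offset):
    """Absolute offset differential expression on a log2 scale.

    offset must be positive if the series can contain zeros, otherwise
    the ratio below would be zero and its logarithm undefined.
    """
    lo = min(expression_series) + offset
    hi = max(expression_series) + offset
    # ln(min/max) / ln(2) is <= 0; the absolute value gives the extent of the change.
    return abs(math.log(lo / hi) / math.log(2))
```

As the maintainers note, this value gives only the extent of the change across the series, not its direction.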

The context menu operates on the currently selected result line.

CoLIDE Right Click

  • Export individual sequences: Export the sequences that form the locus to FASTA
  • Output entire locus: Export the entire locus sequence from the genome to FASTA
  • Show locus in genome view: display the selected locus using standard arrow view in VisSR
  • Show locus in aggregate genome view: display the selected locus as a compressed view in VisSR

Viewing Results

Users have two options when viewing loci predicted in CoLIde. The standard arrow view as shown below:

The classic view of a small RNA alignment

Or the aggregated view as shown below (same data and location)

The aggregated view available from CoLIde

The aggregated view groups all small RNAs in the locus into windows of 100nt and generates a histogram of the abundance of all small RNAs within each window.
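
The windowing behind the aggregated view can be sketched as follows. This is an illustration under stated assumptions rather than the VisSR implementation: each small RNA is assigned to a window by its start coordinate, windows are aligned to multiples of the window size, and the function name and input layout are hypothetical.

```python
from collections import defaultdict


def aggregate_abundance(reads, window=100):
    """Sum sRNA abundance per fixed-size window.

    reads: iterable of (start_coordinate, abundance) pairs.
    Returns a dict mapping each window's start coordinate to the total
    abundance of the reads that start inside it.
    """
    totals = defaultdict(int)
    for start, abundance in reads:
        window_start = (start // window) * window
        totals[window_start] += abundance
    return dict(sorted(totals.items()))
```

Each key of the returned dictionary corresponds to one bar of the histogram, with the summed abundance as its height.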

  • Andrew

    Hi,

    I’ve been running Colide on some samples (.fastq -> adapter removal -> .fa -> colide).

    Setup -> Ubuntu, 100GB given to java

    Experiment -> timecourse, 3 timepoints. 3,3 and 2 biological replicates per time point respectively. Reference genome -> Cow .fa sequence from Ensembl

    I ran Colide yesterday, and it’s been running for ~24hrs, using 100GB in main memory, another 100GB of cache and has been utilising 40 cores. What sort of time scale should I expect for completion? Does it sound like I’ve done something wrong in my setup?

    Thanks,

    Andrew

    • Andrew

      After two days of processing I received an out of memory error. Can anyone give me any suggestions?

      • The sRNA Workbench

        Hi Andrew,

        I have replied to your previous message. Again sorry for the delays, I am currently on leave.

        Best wishes,
        Matt

    • The sRNA Workbench

      Hi Andrew,

      Again, sorry for the delay (see my previous message)

      CoLIde is an intensive tool but perhaps not that intensive! There could be an issue with the setup, or there could be huge amounts of repeat mapping over the genome causing large alignment files.

      I have an improvement in the pipeline that should prevent this from happening. It will be coming in a future release, along with a brand new back end that should allow the software to process your data much more efficiently.

      Hopefully this will make it into the next release, and I am working flat out to make it available as soon as possible.

      Best wishes,
      Matt

      • Andrew

        Thanks for getting back to me Matthew.

        I downloaded the repeat masked Fasta reference file and that seems to have done the trick… partially.

        I now get an error of -1 after producing the histogram and asking for 4 numbers (around ~ 20 – 22).

        Thanks for the help!

        • The sRNA Workbench

          Hi Andrew,

          I am not sure what has caused that issue. Has anything been written into the error logs (user/logs)? You can send them directly to my email address if that helps:

          matthew.stocks@uea.ac.uk

          no personal information is contained within them.

          cheers,
          Matt

  • The sRNA Workbench

    Hi Sanjay,

    The DE calculation reported in the CoLIde program is known as the ‘absolute offset differential expression’. The calculation is to take the minimal and maximal expression values from the series and add the offset value to them. Then the formula is:

    ln(min/max) / ln(2)

    I hope this helps, let me know if you have any further questions

    Cheers,
    Matthew

    • sanjay

      Hi Matthew,
      Thanks for the information. Is there any threshold here to say which loci are up regulated and which are down regulated?
      Also, the p values reported here are either zero or close to 1. What could be the reason for this.

      Thanks again.

      • The sRNA Workbench

        Hi Sanjay,

        Unfortunately, when using absolute DE you cannot differentiate between up/down regulated. All the proportions that come out of that formula will be >= 0 (in log scale).
        The absolute DE only tells you the extent of the DE, not its direction. But we will be adding new DE methods in the future that should be more informative.

        Usually, the threshold for DE is 1 (log2 scale), which corresponds to 2 Offset Fold Change.
        The limit for validation (northern or qPCR) is ~1.5 OFC, which corresponds to ~0.58 in log2 scale. However, the 1.5-2 OFC range is considered a grey area and should be avoided.

        The p values shown as 0 are probably the result of a rounding or precision error. I will replace these with a <0.001 message or something more informative.

        Pvals close to 1 suggest that the locus could be obtained by chance (the distribution of size classes is random uniform so likely to be random degradation products)

        Close to 0 means it is highly significant in some class. In the RNAi context, only the DICERs produce a specific size class, so a locus with some specificity is an argument that the locus could have been produced by a DICER. However, it must be noted that the size class test is only an argument towards a conclusion, it is not a full proof that the locus is indeed a true one.

        I hope this helps!

        Cheers,
        Matt

  • Magdy Alabady

    Hi there,

    I have been trying CoLIde with 4 samples (2 replicates each). I use the filtered, redundant files that I obtain from the Filter tool in the kit, but the job keeps failing without giving a reason. The job fails after I press OK on the popup window about the major length classes (with and without modification). Once it gave a failure message that had “-1” in it. I am puzzled. Any suggestions?

    I am running version 3 on Mac OS 10.8.4 (2 x 2.4 GHz 6-core Intel Xeon processor, 32 GB 1333 MHz DDR3 memory, and 4 TB storage).

    Another problem with this version: the mapping tool seems to be nonfunctional. All mapping jobs produced empty files, although the jobs completed successfully. When I run patman on the command line for the same jobs with the same parameters, it produces large mapping files (in the GB range). Any suggestions?

    • The sRNA Workbench

      Hi Magdy,

      It is tricky to say what is causing the problems you are experiencing. Could I have a little more info? Please let me know the OS you are using, the organism and the data (if it is available) so I can attempt to recreate the problem on my own computer. Please also send me the log files for the failed runs of both CoLIde and Sequence Alignment (you will find these in user/logs), as they might shed some light on the situation.

      You can email them to matthew.stocks@uea.ac.uk

      Cheers,
      Matt


  • Gopal Joshi

    Dear Sir,

    I have tried new CoLIDE tool with default parameters for four samples. My server configuration is

    32GB RAM, 2 Quadcore, 1.6 TB HDD.

    Process terminated with following error.. Kindly help for successful execution. Thanks.

    root@server1 srna-workbenchV3.0_ALPHA]# java -Xmx70g -jar Workbench.jar
    UEA sRNA Workbench startup…
    Pre-processing time: 0:28:43.662
    Feb 19, 2013 3:41:25 PM uk.ac.uea.cmp.srnaworkbench.utils.WorkbenchLogger log
    SEVERE: WORKBENCH: COLIDE: Message: -1;
    Stack Trace: java.util.ArrayList.elementData(ArrayList.java:338)
    java.util.ArrayList.get(ArrayList.java:351)
    uk.ac.uea.cmp.srnaworkbench.tools.colide.CoLIDEProcess.calcMedian(CoLIDEProcess.java:593)
    uk.ac.uea.cmp.srnaworkbench.tools.colide.CoLIDEProcess.createOverlappingGroups(CoLIDEProcess.java:512)
    uk.ac.uea.cmp.srnaworkbench.tools.colide.CoLIDEProcess.process(CoLIDEProcess.java:252)
    uk.ac.uea.cmp.srnaworkbench.tools.RunnableTool.run(RunnableTool.java:339)
    java.lang.Thread.run(Thread.java:679)
