CCseqBasic - for CaptureC data analysis

Page updated by Jelena Telenius - 11:10 09/Oct/2018


The CCseqBasic analyser is a pipeline built around James Davies' CCanalyser.pl script (nowadays called analyseMappedReads.pl).



Capture-C analysis performs the following steps

ANALYSIS

  • Maps the reads, identifies capture-containing reads.
  • Reports the interaction fragments, for each capture-containing RE-fragment (RE = restriction enzyme).
  • Makes a UCSC data hub to visualise the interactions
FILTERING

  • Extra tracks in UCSC data hub, to illustrate the read filtering process (duplicate, blacklist, homologous region filters).
  • Gives a boxplot illustration of read counts along the mapping and filtering process (visible via the UCSC data hub).
OUTPUT DATA

  • Gives the interaction counts (for each capture) per Restriction Enzyme fragment as a gff file.
  • Gives the reads in bam files as well, for further analysis.



  • CM5-rainbow (parallel) runs further divide the above steps

    The steps for the parallel runs are the same as for the serial runs, but the workflow contains three parallel execution loops : fastq-wise analysis steps are run first in a fastq-wise parallel loop, then a second loop combines the fastq-wise results to yield capture-site-wise files, and finally the capture-site-wise analysis is run in the last parallel loop. More details on the rainbow run file structure can be found here : http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html


    Capture-C analysis needs the following "inputs"

  • Paired end sequencing fastq files, from a Capture-C library.
  • Fastq read names in Illumina format like this @SIM:1:FCX:1:15:6329:1045 1:N:0:2
  • The coordinates of each DpnII (or NlaIII) fragment, within which the capture sites are located
  • Desired "exclusion zones" around each capture-site-containing DpnII fragment (recommended : +/- 1000 bases on both sides)
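The exclusion-zone input above can be sketched as a simple interval extension. This is an illustrative Python sketch, not code from the pipeline itself, and the fragment coordinates are hypothetical:

```python
# Sketch: deriving an "exclusion zone" around a capture-site-containing
# DpnII fragment. Coordinates are illustrative, not from a real oligo file.

def exclusion_zone(frag_start, frag_end, flank=1000):
    """Return the exclusion zone: the RE fragment extended by
    'flank' bases on both sides (recommended flank = 1000)."""
    return (max(0, frag_start - flank), frag_end + flank)

# A hypothetical capture-site-containing DpnII fragment:
zone = exclusion_zone(1_000_000, 1_000_450)
print(zone)  # -> (999000, 1001450)
```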



  • Step-by-step : all analysis steps explained

    This section also contains a detailed listing of all the OUTPUT DATA FOLDERS and their contents.

    1) Preparing the data for analysis

  • Reading the paired end FASTQ files in
  • Trimming adaptors in the 3' end of the reads   (1)
  • Combining R1 and R2 reads to a single entity   (2)
  • In-silico Restriction Enzyme digestion, to obtain mappable fragments
  • Mapping the fragments to the genome   (3)

  • Tools used above : (1) trim_galore, (2) FLASH, (3) bowtie1/2
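The in-silico digestion step above can be illustrated as follows. This is a simplified sketch (the real tools also carry along quality strings and read names); DpnII cuts 5' of its GATC recognition site, so every downstream sub-fragment starts with GATC:

```python
# Sketch of the in-silico DpnII digestion step: each read is cut at GATC
# motifs, yielding sub-fragments that are then mapped independently.

DPNII_SITE = "GATC"

def digest(read, site=DPNII_SITE):
    """Split a read sequence at every occurrence of the RE site.
    DpnII cuts before GATC, so each downstream fragment starts with GATC."""
    fragments, start = [], 0
    pos = read.find(site, 1)          # a site at position 0 is not a cut
    while pos != -1:
        fragments.append(read[start:pos])
        start = pos
        pos = read.find(site, pos + 1)
    fragments.append(read[start:])
    return fragments

print(digest("AAGATCTTTTGATCCC"))  # -> ['AA', 'GATCTTTT', 'GATCCC']
```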



    2) Analysing mapped reads

    In Capture-C analysis we need to distinguish between several different fragment types.
    The nomenclature is explained here : readsAndFragments.pdf


    Finding reads which contain both a capture and at least one interaction partner

  • Identifies reads containing a capture fragment
  • Identifies reads also containing a reporter fragment
  •       (a fragment mapping farther away from the capture-site-containing DpnII fragment than the set "exclusion zone"
          : recommended +/- 1000 bases on both sides of the RE fragment containing the capture site)
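The capture / exclusion / reporter distinction above boils down to interval overlap tests. A minimal sketch, with hypothetical coordinates and per-chromosome bookkeeping omitted:

```python
# Sketch of fragment classification: a mapped fragment is a "capture" if
# it overlaps the capture-site RE fragment, discarded if it falls in the
# exclusion zone, and a "reporter" otherwise. Coordinates are illustrative.

def classify(frag, capture, flank=1000):
    """frag and capture are (start, end) intervals on the same chromosome."""
    s, e = frag
    cs, ce = capture
    if s < ce and e > cs:                      # overlaps the capture fragment
        return "capture"
    if s < ce + flank and e > cs - flank:      # within the exclusion zone
        return "exclusion"
    return "reporter"

capture_frag = (1_000_000, 1_000_450)
print(classify((1_000_100, 1_000_300), capture_frag))  # -> capture
print(classify((1_000_500, 1_000_800), capture_frag))  # -> exclusion
print(classify((2_000_000, 2_000_400), capture_frag))  # -> reporter
```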

    Excluding reads and fragments which cannot be interpreted

  • Excludes fragments mapping onto two Restriction Enzyme fragments
  • Excludes reads containing "double captures" - where two or more DIFFERENT capture sites contribute to the same read
  •       (allowing multiple encounters of SAME capture site)
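The "double capture" rule above can be stated very compactly: a read is excluded only if its capture fragments come from more than one distinct capture site. A sketch, with hypothetical capture-site names:

```python
# Sketch of the "double capture" rule: a read is kept only if all its
# capture fragments come from one and the same capture site (multiple
# encounters of the SAME site are allowed). Site names are hypothetical.

def is_double_capture(capture_names):
    """capture_names: the capture-site name of each capture fragment in a read."""
    return len(set(capture_names)) > 1

print(is_double_capture(["Hba-1", "Hba-1"]))   # -> False (read is kept)
print(is_double_capture(["Hba-1", "Hbb-b1"]))  # -> True  (read is excluded)
```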

    Filtering out duplicates (exactly-same-fragments-seen-in-multiple-reads)

  • If all the fragments of the read have the same coordinates as in a read already seen before, the read is filtered as a duplicate.

    Filtering out intra-read duplicates (same fragment counted twice)

  • For each read, counting each Restriction Enzyme fragment only once
  •       (multiple fragments from the same RE fragment within one read are interpreted as "intra-read duplicates")
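The intra-read rule above collapses repeated hits of one RE fragment within a single read. A minimal sketch:

```python
# Sketch of intra-read deduplication: within one read, every RE fragment
# is counted at most once, so repeated hits of the same fragment collapse.

def unique_fragments(frags):
    """Keep each RE fragment (chrom, start, end) once per read,
    preserving the order of first occurrence."""
    seen, out = set(), []
    for f in frags:
        if f not in seen:
            seen.add(f)
            out.append(f)
    return out

read = [("chr11", 100, 400), ("chr11", 100, 400), ("chr11", 9000, 9400)]
print(unique_fragments(read))  # the repeated fragment is counted once
```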

    Output log files, reports, output data files, visualisations

  • Writing a detailed report, and illustrative figure, of all steps above (how many reads and fragments in each stage)
  • Writing a detailed report of reported interaction counts.
  • Plotting all the data in UCSC data hub
  • Providing all data as
    (a) gff files (counts per restriction enzyme fragment) and
    (b) bam files (all fragments) for further downstream analysis
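One line of the per-RE-fragment count output might look like the sketch below. The exact column conventions of CCseqBasic's gff files may differ; here the interaction count is written into the standard GFF score column, and all values are illustrative:

```python
# Sketch of a per-RE-fragment interaction count as a generic GFF line.
# GFF columns: seqname source feature start end score strand frame attribute.

def gff_line(chrom, start, end, count, source="CCseqBasic", feature="interaction"):
    """Format one RE fragment's interaction count as a tab-separated GFF line."""
    return "\t".join(
        [chrom, source, feature, str(start), str(end), str(count), ".", ".", "."]
    )

print(gff_line("chr11", 9000, 9400, 17))
```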



  • 3) Iterative analysis to illustrate read filtering

    The above (2 Analysing mapped reads) is done multiple times, like so :

    First time analysing reads

  • Prepare reads (above : 1 Preparing the data for analysis) - trim, combine, RE digest, map
  • Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.

    Duplicate filtering

  • Duplicate filtering (reads where all fragments map to exactly the same coordinates are seen as duplicates of each other)
  • Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.

    Homology region and blacklisted region filtering

  • Homology region filtering :
  •       Each of the capture-site-containing RE fragments is mapped to the whole genome.
          If homologous regions are found, the reporter fragments from these regions are eliminated
           (+/- 20 000 bases filter around each found homology region)

  • Blacklist-filtering :
  •       reporter fragments overlapping known blacklisted regions (in-house peak call for mm9, a lift-over of this for mm10,
          Duke blacklisted regions for hg18 and hg19) are removed from the results.

  • Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.
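Both filters above reduce to the same operation: drop any reporter fragment that overlaps an excluded interval. A sketch under that assumption, with illustrative coordinates (homology hits are extended by 20 000 bases on both sides; blacklist intervals would be used as-is):

```python
# Sketch of the homology / blacklist filter: reporter fragments that
# overlap any excluded interval are removed. Coordinates are illustrative.

def expand(region, flank=20_000):
    """Extend a homology hit by 'flank' bases on both sides."""
    chrom, start, end = region
    return (chrom, max(0, start - flank), end + flank)

def overlaps(frag, region):
    chrom, s, e = frag
    rchrom, rs, re_ = region
    return chrom == rchrom and s < re_ and e > rs

def filter_reporters(frags, excluded):
    """Keep only reporter fragments overlapping none of the excluded intervals."""
    return [f for f in frags if not any(overlaps(f, r) for r in excluded)]

excluded = [expand(("chr11", 1_000_000, 1_000_500))]       # a homology hit
frags = [("chr11", 1_010_000, 1_010_400),                  # inside the +/- 20 kb zone
         ("chr11", 2_000_000, 2_000_400)]                  # far away -> kept
print(len(filter_reporters(frags, excluded)))  # -> 1
```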



  • 4) Combining the final results

    Each of these iterative steps (step 3 above) is done twice - once for FLASH-combined "flashed" reads, once for "non-flashed" (non-combinable) reads. This makes troubleshooting easier : the "flashed" and "non-flashed" reads often have different kinds of problems in the analysis.

    What are "flashed" and "non-flashed" reads ?
    The nomenclature is explained here : FAQflashed.txt

    In the end, these two analyses are combined, to yield the final set of reported reads :

  • Combining the final filtered files of "flashed" and "non-flashed reads".
  • Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.



  • 5) Output data folders

    The (1-4) order above is also the order of the output folders of the pipeline :

    All the output folders are explained here : From input to output