CCseqBasic - for CaptureC data analysis

Page updated by Jelena Telenius - 11:10 09/Oct/2018


The CCseqBasic analyser is a pipeline built around James Davies' CCanalyser.pl script (nowadays called analyseMappedReads.pl).



Capture-C analysis performs the following steps

ANALYSIS

  • Maps the reads, identifies capture-containing reads.
  • Reports the interaction fragments, for each capture-containing RE-fragment (RE = restriction enzyme).
  • Makes a UCSC data hub to visualise the interactions
FILTERING

  • Extra tracks in UCSC data hub, to illustrate the read filtering process (duplicate, blacklist, homologous region filters).
  • Gives a boxplot illustration of read counts along the mapping and filtering process (visible via the UCSC data hub).
OUTPUT DATA

  • Gives the interaction counts (for each capture) per Restriction Enzyme fragment as a gff file.
  • Gives the reads in bam files as well, for further analysis.



  • CM5-rainbow (parallel) runs further divide the above steps

    The steps for the parallel runs are the same as for the serial runs, but the workflow contains three parallel execution loops : fastq-wise analysis steps are run first in a fastq-wise parallel loop, then a second loop combines the fastq-wise results to yield capture-site-wise files, and finally the capture-site-wise analysis is run in the last parallel loop. More details on the rainbow run file structure can be found here : http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html


    Capture-C analysis needs the following "inputs"

  • Paired end sequencing fastq files, from a Capture-C library.
  • Fastq read names in Illumina format like this @SIM:1:FCX:1:15:6329:1045 1:N:0:2
  • The coordinates of each DpnII (or NlaIII) fragment, within which the capture sites are located
  • Desired "exclusion zones" around each capture-site-containing DpnII fragment (recommended : +/- 1000 bases on both sides)
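The exclusion-zone input above can be sketched as a simple interval extension. This is an illustrative Python sketch, not code from the pipeline itself, and the fragment coordinates are hypothetical:

```python
# Sketch: deriving an "exclusion zone" around a capture-site-containing
# DpnII fragment. Coordinates are illustrative, not from a real oligo file.

def exclusion_zone(frag_start, frag_end, flank=1000):
    """Return the exclusion zone: the RE fragment extended by
    'flank' bases on both sides (recommended flank = 1000)."""
    return (max(0, frag_start - flank), frag_end + flank)

# A hypothetical capture-site-containing DpnII fragment:
zone = exclusion_zone(1_000_000, 1_000_450)
print(zone)  # -> (999000, 1001450)
```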



  • Step-by-step : all analysis steps explained

    This section also contains a detailed listing of all the OUTPUT DATA FOLDERS and their contents.

    1) Preparing the data for analysis

  • Reading the paired end FASTQ files in
  • Trimming adaptors in the 3' end of the reads   (1)
  • Combining R1 and R2 reads to a single entity   (2)
  • In-silico Restriction Enzyme digestion, to obtain mappable fragments
  • Mapping the fragments to the genome   (3)

  • Tools used above : (1) trim_galore, (2) FLASH, (3) bowtie1/2
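The in-silico digestion step above can be illustrated as follows. This is a simplified sketch (the real tools also carry along quality strings and read names); DpnII cuts 5' of its GATC recognition site, so every downstream sub-fragment starts with GATC:

```python
# Sketch of the in-silico DpnII digestion step: each read is cut at GATC
# motifs, yielding sub-fragments that are then mapped independently.

DPNII_SITE = "GATC"

def digest(read, site=DPNII_SITE):
    """Split a read sequence at every occurrence of the RE site.
    DpnII cuts before GATC, so each downstream fragment starts with GATC."""
    fragments, start = [], 0
    pos = read.find(site, 1)          # a site at position 0 is not a cut
    while pos != -1:
        fragments.append(read[start:pos])
        start = pos
        pos = read.find(site, pos + 1)
    fragments.append(read[start:])
    return fragments

print(digest("AAGATCTTTTGATCCC"))  # -> ['AA', 'GATCTTTT', 'GATCCC']
```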



    2) Analysing mapped reads

    In Capture-C analysis we need to distinguish between several different fragment types.
    The nomenclature is explained here : readsAndFragments.pdf


    Finding reads which contain both a capture and at least one interaction partner

  • Identifies reads containing a capture fragment
  • Identifies reads also containing a reporter fragment
  •       (a fragment mapping farther away from the capture-site-containing DpnII fragment than the set "exclusion zone"
          : recommended +/- 1000 bases on both sides of the RE fragment containing the capture site)
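The capture / exclusion / reporter distinction above boils down to interval overlap tests. A minimal sketch, with hypothetical coordinates and per-chromosome bookkeeping omitted:

```python
# Sketch of fragment classification: a mapped fragment is a "capture" if
# it overlaps the capture-site RE fragment, discarded if it falls in the
# exclusion zone, and a "reporter" otherwise. Coordinates are illustrative.

def classify(frag, capture, flank=1000):
    """frag and capture are (start, end) intervals on the same chromosome."""
    s, e = frag
    cs, ce = capture
    if s < ce and e > cs:                      # overlaps the capture fragment
        return "capture"
    if s < ce + flank and e > cs - flank:      # within the exclusion zone
        return "exclusion"
    return "reporter"

capture_frag = (1_000_000, 1_000_450)
print(classify((1_000_100, 1_000_300), capture_frag))  # -> capture
print(classify((1_000_500, 1_000_800), capture_frag))  # -> exclusion
print(classify((2_000_000, 2_000_400), capture_frag))  # -> reporter
```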

    Excluding reads and fragments which cannot be interpreted

  • Excludes fragments mapping onto two Restriction Enzyme fragments
  • Excludes reads containing "double captures" - where two or more DIFFERENT capture sites contribute to the same read
  •       (allowing multiple encounters of SAME capture site)
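The "double capture" rule above can be stated very compactly: a read is excluded only if its capture fragments come from more than one distinct capture site. A sketch, with hypothetical capture-site names:

```python
# Sketch of the "double capture" rule: a read is kept only if all its
# capture fragments come from one and the same capture site (multiple
# encounters of the SAME site are allowed). Site names are hypothetical.

def is_double_capture(capture_names):
    """capture_names: the capture-site name of each capture fragment in a read."""
    return len(set(capture_names)) > 1

print(is_double_capture(["Hba-1", "Hba-1"]))   # -> False (read is kept)
print(is_double_capture(["Hba-1", "Hbb-b1"]))  # -> True  (read is excluded)
```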

    Filtering out duplicates (exactly-same-fragments-seen-in-multiple-reads)

  • If all the fragments of the read have the same coordinates as in a read already seen before, the read is filtered as a duplicate.

    Filtering out intra-read duplicates (same fragment counted twice)

  • For each read, counting each Restriction Enzyme fragment only once
  •       (multiple fragments from the same RE fragment within one read are interpreted as "intra-read duplicates")
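The intra-read rule above collapses repeated hits of one RE fragment within a single read. A minimal sketch:

```python
# Sketch of intra-read deduplication: within one read, every RE fragment
# is counted at most once, so repeated hits of the same fragment collapse.

def unique_fragments(frags):
    """Keep each RE fragment (chrom, start, end) once per read,
    preserving the order of first occurrence."""
    seen, out = set(), []
    for f in frags:
        if f not in seen:
            seen.add(f)
            out.append(f)
    return out

read = [("chr11", 100, 400), ("chr11", 100, 400), ("chr11", 9000, 9400)]
print(unique_fragments(read))  # the repeated fragment is counted once
```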

    Output log files, reports, output data files, visualisations

  • Writing a detailed report, and illustrative figure, of all steps above (how many reads and fragments in each stage)
  • Writing a detailed report of reported interaction counts.
  • Plotting all the data in UCSC data hub
  • Providing all data as
    (a) gff files (counts per restriction enzyme fragment) and
    (b) bam files (all fragments) for further downstream analysis
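One line of the per-RE-fragment count output might look like the sketch below. The exact column conventions of CCseqBasic's gff files may differ; here the interaction count is written into the standard GFF score column, and all values are illustrative:

```python
# Sketch of a per-RE-fragment interaction count as a generic GFF line.
# GFF columns: seqname source feature start end score strand frame attribute.

def gff_line(chrom, start, end, count, source="CCseqBasic", feature="interaction"):
    """Format one RE fragment's interaction count as a tab-separated GFF line."""
    return "\t".join(
        [chrom, source, feature, str(start), str(end), str(count), ".", ".", "."]
    )

print(gff_line("chr11", 9000, 9400, 17))
```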



  • 3) Iterative analysis to illustrate read filtering

    The above (2 Analysing mapped reads) is done multiple times, like so :

    First time analysing reads

  • Prepare reads (above : 1 Preparing the data for analysis) - trim, combine, RE digest, map
  • Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.

    Duplicate filtering

  • Duplicate filtering (reads where all fragments map to exactly the same coordinates are seen as duplicates of each other)
  • Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.

    Homology region and blacklisted region filtering

  • Homology region filtering :
  •       Each of the capture-site-containing RE fragments is mapped to the whole genome.
          If homologous regions are found, the reporter fragments from these regions are eliminated
           (+/- 20 000 bases filter around each found homology region)

  • Blacklist-filtering :
  •       reporter fragments overlapping known blacklisted regions (in-house peak call for mm9, a lift-over of this for mm10,
          Duke blacklisted regions for hg18 and hg19) are removed from the results.

  • Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.
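Both filters above reduce to the same operation: drop any reporter fragment that overlaps an excluded interval. A sketch under that assumption, with illustrative coordinates (homology hits are extended by 20 000 bases on both sides; blacklist intervals would be used as-is):

```python
# Sketch of the homology / blacklist filter: reporter fragments that
# overlap any excluded interval are removed. Coordinates are illustrative.

def expand(region, flank=20_000):
    """Extend a homology hit by 'flank' bases on both sides."""
    chrom, start, end = region
    return (chrom, max(0, start - flank), end + flank)

def overlaps(frag, region):
    chrom, s, e = frag
    rchrom, rs, re_ = region
    return chrom == rchrom and s < re_ and e > rs

def filter_reporters(frags, excluded):
    """Keep only reporter fragments overlapping none of the excluded intervals."""
    return [f for f in frags if not any(overlaps(f, r) for r in excluded)]

excluded = [expand(("chr11", 1_000_000, 1_000_500))]       # a homology hit
frags = [("chr11", 1_010_000, 1_010_400),                  # inside the +/- 20 kb zone
         ("chr11", 2_000_000, 2_000_400)]                  # far away -> kept
print(len(filter_reporters(frags, excluded)))  # -> 1
```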



  • 4) Combining the final results

    Each of these iterative steps (step 3 above) is done twice - once for FLASH-combined "flashed" reads, once for "non-flashed" (non-combinable) reads. This makes troubleshooting easier : the "flashed" and "non-flashed" reads often have different kinds of problems in the analysis.

    What are "flashed" and "non-flashed" reads ?
    The nomenclature is explained here : FAQflashed.txt

    In the end, these two analyses are combined, to yield the final set of reported reads :

  • Combining the final filtered files of "flashed" and "non-flashed reads".
  • Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.



  • 5) Output data folders

    The (1-4) order above is also the order of the output folders of the pipeline :

    All the output folders are explained here : From input to output