CCseqBasic - for CaptureC data analysis
Page updated by Jelena Telenius - 11:10 09/Oct/2018
The CCseqBasic analyser is a pipeline built around James Davies' CCanalyser.pl script (nowadays called analyseMappedReads.pl).
Capture-C analysis consists of the following steps :
ANALYSIS
Maps the reads, identifies capture-containing reads.
Reports the interacting fragments for each capture-containing RE fragment (RE = restriction enzyme).
Makes a UCSC data hub to visualise the interactions
FILTERING
Extra tracks in UCSC data hub, to illustrate the read filtering process (duplicate, blacklist, homologous region filters).
Gives a boxplot illustration of read counts along the mapping and filtering process (visible via the UCSC data hub).
OUTPUT DATA
Gives the interaction counts (for each capture) per Restriction Enzyme fragment as a gff file.
Gives the reads in bam files as well, for further analysis.
CM5-rainbow (parallel) runs further divide the above steps
The steps for the parallel runs are the same as for the serial runs,
but the workflow contains 3 parallel execution loops :
First, the fastq-wise analysis steps are run in a fastq-wise parallel loop,
then a second loop combines the fastq-wise results to yield capture-site-wise files.
Last, the capture-site-wise analysis is run in the final parallel loop.
More details of the rainbow run file structure can be found here :
http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html
Capture-C analysis needs the following inputs :
Paired end sequencing fastq files, from a Capture-C library.
Fastq read names in Illumina format, like this : @SIM:1:FCX:1:15:6329:1045 1:N:0:2
The coordinates of each DpnII (or NlaIII) fragment, within which the capture sites are located
Desired "exclusion zones" around each of the capture-site-containing DpnII fragments (recommended : +/- 1000 bases)
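As an illustration, the exclusion zone can be derived from the coordinates of the capture-site-containing RE fragment like this. This is a minimal Python sketch : the function name and the coordinate convention are assumptions for illustration, not the pipeline's own code.

```python
def exclusion_zone(frag_start, frag_end, margin=1000):
    """Return the exclusion-zone interval: the capture-site-containing
    RE fragment extended by `margin` bases on both sides
    (recommended +/- 1000 bases), clipped at the chromosome start."""
    return (max(0, frag_start - margin), frag_end + margin)

# Example: a capture fragment at 10,000-10,400 gets zone 9,000-11,400.
print(exclusion_zone(10_000, 10_400))  # -> (9000, 11400)
```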
Step-by-step : all analysis steps explained
This section also contains a detailed listing of all the OUTPUT DATA FOLDERS and their contents.
1) Preparing the data for analysis
Reading the paired end FASTQ files in
Trimming adaptors in the 3' end of the reads   (1)
Combining R1 and R2 reads to a single entity   (2)
In-silico Restriction Enzyme digesting, to reach mappable fragments
Mapping the fragments to the genome   (3)
Tools used above : (1) trim_galore, (2) FLASH, (3) bowtie1/2
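The in-silico Restriction Enzyme digestion step can be sketched as follows : splitting a (combined) read at every DpnII recognition site (GATC) so that each piece can be mapped independently. This is a minimal Python illustration; the function name and the exact handling of the cut site are assumptions, not the pipeline's Perl implementation.

```python
def dpnii_digest(seq, site="GATC"):
    """Split `seq` at each occurrence of the RE recognition site,
    keeping the site at the start of each downstream fragment
    (DpnII cuts 5' of GATC)."""
    fragments, start = [], 0
    pos = seq.find(site, 1)          # a site at position 0 starts the first fragment
    while pos != -1:
        fragments.append(seq[start:pos])
        start = pos
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments

print(dpnii_digest("AAGATCTTTTGATCCC"))
# -> ['AA', 'GATCTTTT', 'GATCCC']
```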
2) Analysing mapped reads
In Capture-C analysis we need to distinguish between multiple different fragment types.
Here is the nomenclature :
readsAndFragments.pdf
Finding reads which contain both a capture and at least one interaction partner
Identifies reads containing a capture fragment
Identifies reads containing also a reporter fragment
      (a fragment farther away from the capture-site-containing DpnII fragment than the set "exclusion zone"
      : recommended +/- 1000 bases on both sides of the RE fragment containing the capture site)
Excluding reads and fragments which cannot be interpreted
Excludes fragments mapping onto two Restriction Enzyme fragments
Excludes reads containing "double captures" - where two or more DIFFERENT capture sites contribute to the same read
      (allowing multiple encounters of SAME capture site)
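The classification of a single mapped fragment relative to one capture site can be sketched like this. This is a minimal Python illustration : the function name, category labels, and coordinate convention are assumptions for clarity, not the pipeline's own code.

```python
def classify_fragment(frag, capture_frag, margin=1000):
    """Classify one mapped fragment relative to one capture RE fragment.
    `frag` and `capture_frag` are (chrom, start, end) tuples.
    Returns 'capture' (overlaps the capture RE fragment),
    'proximity-excluded' (inside the +/- `margin` exclusion zone),
    or 'reporter' (a true interaction partner)."""
    chrom, start, end = frag
    cchrom, cstart, cend = capture_frag
    if chrom == cchrom and start < cend and end > cstart:
        return "capture"
    if chrom == cchrom and start < cend + margin and end > cstart - margin:
        return "proximity-excluded"
    return "reporter"
```

A read is reported only if it contains a capture fragment plus at least one reporter fragment.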
Filtering out duplicates (exactly-same-fragments-seen-in-multiple-reads)
If all the fragments of the read have the same coordinates as in a read already seen before, the read is filtered as a duplicate.
Filtering out intra-read duplicates (the same fragment counting twice)
For each read, counting each Restriction Enzyme fragment only once
      (multiple fragments from the same RE fragment within one read are interpreted as "intra-read duplicates")
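The two duplicate filters above can be sketched together in a few lines of Python. This is a minimal illustration under assumed names and data shapes; the pipeline itself implements this in Perl.

```python
def filter_duplicates(reads):
    """`reads` maps read name -> list of (chrom, start, end) fragments.
    Returns the read names that survive duplicate filtering."""
    seen, kept = set(), []
    for name, frags in reads.items():
        # Intra-read duplicates: count each RE fragment only once per read.
        unique_frags = tuple(sorted(set(frags)))
        # Inter-read duplicates: skip reads whose full fragment set
        # has the same coordinates as an already-seen read.
        if unique_frags in seen:
            continue
        seen.add(unique_frags)
        kept.append(name)
    return kept
```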
Output log files, reports, output data files, visualisations
Writing a detailed report, and illustrative figure, of all steps above (how many reads and fragments in each stage)
Writing a detailed report of reported interaction counts.
Plotting all the data in UCSC data hub
Providing all data as :
(a) gff files (counts per restriction enzyme fragment), and
(b) bam files (all fragments) for further downstream analysis
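A gff line of per-fragment interaction counts can be sketched as below (tab-separated, 1-based coordinates). The source and feature column values here are placeholders, not necessarily the pipeline's exact output.

```python
def gff_line(chrom, start, end, count, source="CCseqBasic"):
    """Emit one GFF line: seqname, source, feature, start, end,
    score (here the interaction count), strand, frame, attribute."""
    cols = [chrom, source, "interaction", str(start), str(end),
            str(count), ".", ".", "."]
    return "\t".join(cols)

print(gff_line("chr1", 10001, 10400, 17))
```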
3) Iterative analysis to illustrate read filtering
The above (2 Analysing mapped reads) is done multiple times, like so :
First time analysing reads
Prepare reads (above : 1 Preparing the data for analysis) - trim, combine, RE digest, map
Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.
Duplicate filtering
Duplicate filtering (reads where all fragments map to exactly the same coordinates are seen as duplicates of each other)
Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.
Homology region and blacklisted region filtering
Homology region filtering :
      Each of the capture-site-containing RE fragments is mapped to the whole genome.
      If homology regions are found, the reporter fragments from these homology regions are eliminated
      (+/- 20 000 bases filter around each found homology region)
Blacklist filtering :
      reporter fragments overlapping known blacklisted regions (in-house peak call for mm9, lift-over of this for mm10,
      Duke Blacklisted regions for hg18 and hg19) are removed from the results.
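Both filters above reduce to removing reporter fragments that overlap a set of filtered intervals; for homology regions the intervals are first padded by +/- 20 000 bases. A minimal Python sketch, with illustrative names and data shapes :

```python
def pad(regions, margin=20_000):
    """Extend each (chrom, start, end) region by `margin` bases on both
    sides, e.g. the +/- 20,000 base filter around homology regions."""
    return [(c, max(0, s - margin), e + margin) for c, s, e in regions]

def filter_reporters(reporters, filtered_regions):
    """Drop reporter fragments overlapping any filtered interval
    (blacklisted region or padded homology region)."""
    def overlaps(frag):
        c, s, e = frag
        return any(c == fc and s < fe and e > fs
                   for fc, fs, fe in filtered_regions)
    return [f for f in reporters if not overlaps(f)]
```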
Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.
4) Combining the final results
Each of these iterative steps (step 3 above) is done twice - once for FLASH-combined ("flashed") reads, once for "non-flashed" (non-combinable) reads.
This is to make troubleshooting easier : often the "flashed" and "non-flashed" reads have different kinds of problems in the analysis.
What are "flashed" and "non-flashed" reads ?
Here is the nomenclature :
FAQflashed.txt
In the end, these two analyses are combined
to yield the final set of reported reads :
Combining the final filtered files of the "flashed" and "non-flashed" reads.
Step (2) above - to produce read counts, data files, and UCSC data hub visualisation.
5) Output data folders
The order of steps (1-4) above is also the order of the output folders of the pipeline :
Here all the output folders are explained :
From input to output