updated by Jelena 12Jan2017

#################################################
#
# Some documentation about the CCanalyser report.
#
#################################################

In general it is very straightforward.

The code more or less reports "every step of the way".


So - all the counter numbers correspond to CHRONOLOGICAL steps of the analysis.
The only exception is counter 06c (duplicate reads) - it appears in the report much earlier than the duplicate filtering itself.

All the other numbers correspond to the filtering order in the script.


####################################################

We have divided each READ into RE-cut FRAGMENTS - and we count them in various ways

i.e. counters (1-10)

###################################################


So, first we take all the fragments, and count them in various ways (counters 01-10) :

01 Number of capture sites loaded:      16
02 Restriction enzyme fragments loaded: 6199203
03 Lines in sam file header:    23
04 Data lines in sam file:      59265143
06 Unmapped fragments in SAM file:      2851157
06c Duplicate reads:    9801446
07 Mapped fragments:    56413986
09 Proximity exclusion fragments (Pre PCR duplicate removal):   2101978
10 Reporter fragments (Pre PCR duplicate removal):      10374399
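
A minimal sketch (not part of CCanalyser) of how these report lines could be read into a lookup table in Python.
The function name parse_counters, and the assumption that every line is "counter label, description, number",
are ours and based only on the lines shown above.
In this example report the counters 06 and 07 add up to counter 04.

```python
import re

# Three lines copied from the example report above.
REPORT = """\
04 Data lines in sam file:      59265143
06 Unmapped fragments in SAM file:      2851157
07 Mapped fragments:    56413986
"""

# Assumed line shape: counter label, description, integer at the end of the line.
COUNTER_LINE = re.compile(r"^(\d+[a-z]*)\s+.*?(\d+)\s*$")


def parse_counters(text):
    """Return {label: value}; a repeated label keeps its last value."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


counters = parse_counters(REPORT)
# In this example every data line is either unmapped (06) or mapped (07).
print(counters["06"] + counters["07"] == counters["04"])
```

Labels are kept as text ("04", "06c"), because the letter suffixes are part of the counter name.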


####################################################

We reconstruct the READS from the RE-cut FRAGMENTS - and we count them in various ways

i.e. counters (11-15)

###################################################


After mapping in bowtie, the reads are ordered in the file so that all fragments of one "read" come one after another.

(see nomenclature "read" and "fragment" in this pdf : http://userweb.molbiol.ox.ac.uk/public/telenius/captureManual/CCanalyser2_interpretAndTroubleshoot.pdf )

Then, once we know we have all the fragments of a read, we look at them together,
and if the read was not lost in the filtering steps above (counters 1-10),
we filter it further.

Reads entering the further analysis stages have to have

1) a capture fragment
2) a reporter fragment

but they MAY STILL contain multiple different captures


We also count the FRAGMENTS within the reads that had at least one mapped fragment (counter 13)


####################################################

Detailed counts of fragments and their composition (before duplicate-filtering)

i.e. counters (11e,11ee)

###################################################


Before duplicate filtering your "preliminary counts" of reporters are in 11e and 11ee

How to read these lines :

11ee Total number of reads having captures in composition Hom9:1  , having 1 reporters and 0 exclusion fragments :     918274
11ee Total number of reads having captures in composition Hom9:1 Pam6:1  , having 1 reporters and 0 exclusion fragments :      3
11ee Total number of reads having captures in composition Hom9:2  , having 1 reporters and 0 exclusion fragments :     255796

The first line means :

Hom9 capture. Capture fragments seen : 1. 
Reporter fragments seen 1. Exclusion fragments seen 0. 
Total count of these : 918274

Second line :

Hom9 capture, having also Pam6 capture. Capture fragments seen : 1 (in Hom9), Capture fragments seen : 1 (in Pam6). 
Reporter fragments seen 1. Exclusion fragments seen 0. 
Total count of these : 3 (so very rare)

Third line :

Hom9 capture. Capture fragments seen : 2. (so, Hom9 capture was seen in 2 fragments of the read)
Reporter fragments seen 1. Exclusion fragments seen 0. 
Total count of these : 255796
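
The reading rules above can be sketched as a small Python parser.
This is only an illustration: the regular expression is an assumption derived from the three example lines,
and the names LINE_RE and parse_composition_line are hypothetical (they do not exist in CCanalyser).

```python
import re

# Assumed layout, taken from the three example lines only.
LINE_RE = re.compile(
    r"^(?P<counter>\S+) Total number of reads having captures in composition "
    r"(?P<composition>.*?)\s*, having (?P<reporters>\d+) reporters and "
    r"(?P<exclusions>\d+) exclusion fragments\s*:\s*(?P<count>\d+)\s*$"
)


def parse_composition_line(line):
    """Split one composition line into its parts, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    captures = {}
    # Each item looks like "Hom9:2" = capture name : capture fragments seen.
    for item in match.group("composition").split():
        name, seen = item.rsplit(":", 1)
        captures[name] = int(seen)
    return {
        "captures": captures,
        "reporters": int(match.group("reporters")),
        "exclusions": int(match.group("exclusions")),
        "count": int(match.group("count")),
    }


example = (
    "11ee Total number of reads having captures in composition Hom9:1 Pam6:1  ,"
    " having 1 reporters and 0 exclusion fragments :      3"
)
print(parse_composition_line(example))
```

For the example line this gives the captures Hom9 and Pam6 with one capture fragment each, 1 reporter, 0 exclusions, and a total count of 3.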


####################################################

We duplicate filter our reads

i.e. counters (12-16)

###################################################


Then the code runs the duplicate filter (counter 16).
So - we get rid of reads which most probably are each other's PCR duplicates.
The "composition" is reported the same way as in 11e and 11ee above.

After the duplicate filtering we still continue reporting "global statistics",
and continue all the way up to number 25.

12 Total number of reads entering duplicate-filtering - should be same count as 11f :   9886207
13 Count of fragments in Reads having at least one informative fragment :       21410249
14a Reads having 2 fragments:   8291768
14a Reads having 3 fragments:   1551218
14a Reads having 4 fragments:   43048
14a Reads having 5 fragments:   171
14a Reads having 6 fragments:   2
14b Reads having 2 informative fragments:       8291768
14b Reads having 3 informative fragments:       1551218
14b Reads having 4 informative fragments:       43048
14b Reads having 5 informative fragments:       171
14b Reads having 6 informative fragments:       2
16 Non-duplicated reads:        84761
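
A quick consistency check of the numbers above, written as a Python sketch.
In this example report the 14a lines add up to counter 12, and the fragments they imply add up to counter 13.
Whether this holds for every run is an assumption - we only checked it against the numbers shown here.

```python
# Counter 14a from the example report: {fragments per read: number of reads}.
reads_by_fragment_count = {2: 8291768, 3: 1551218, 4: 43048, 5: 171, 6: 2}

counter_12 = 9886207   # reads entering duplicate-filtering
counter_13 = 21410249  # fragments in reads having at least one informative fragment

# Sum of the reads over all 14a lines.
total_reads = sum(reads_by_fragment_count.values())

# Each read with n fragments contributes n fragments.
total_fragments = sum(n * reads for n, reads in reads_by_fragment_count.items())

print(total_reads == counter_12)
print(total_fragments == counter_13)
```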


####################################################

We count our duplicate filtered reads in various ways

i.e. counters (16-23)

###################################################

16b and 16bb are the counters after the duplicate filter.
Interpret these like the counters 11e and 11ee above.

16c Proximity exclusion fragments (After PCR duplicate removal):        1191
16d Reporter fragments (After PCR duplicate removal):   92067
16f Total fragment count (after PCR duplicate removal): 84761
16g Reads having 2 informative fragments (after PCR duplicate whole-read removal):      52924
16g Reads having 3 informative fragments (after PCR duplicate whole-read removal):      29677
16g Reads having 4 informative fragments (after PCR duplicate whole-read removal):      2094
16g Reads having 5 informative fragments (after PCR duplicate whole-read removal):      64
16g Reads having 6 informative fragments (after PCR duplicate whole-read removal):      2
23 Reporters before final filtering steps       92067
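
As a sketch, the numbers above can be used to see how much of the data survived the duplicate filter.
In this example the 16g lines add up to counter 16 (non-duplicated reads), and dividing by counter 12
gives the fraction of reads kept. The variable names are ours, and the calculation is an illustration only.

```python
# Counter 16g from the example report: {informative fragments per read: reads}.
reads_after = {2: 52924, 3: 29677, 4: 2094, 5: 64, 6: 2}

counter_12 = 9886207  # reads entering duplicate-filtering
counter_16 = 84761    # non-duplicated reads

# In this example the 16g lines sum to counter 16.
print(sum(reads_after.values()) == counter_16)

# Fraction of reads that were kept by the duplicate filter.
surviving_fraction = counter_16 / counter_12
print(f"{surviving_fraction:.2%} of the reads survived duplicate filtering")
```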


####################################################

We do final filtering to the reads

i.e. counters (23-25)

###################################################


The "final filtering steps" the report mentions are :
- if the SAME reporter fragment is reported twice WITHIN the same read, it is counted only once.
- if a reporter fragment maps so that one end of it is in one DpnII fragment, and the other end is in another DpnII fragment
( so that the mapped reporter fragment OVERLAPS a DpnII cut site), it is filtered out, as it is believed to be a mismapped read.


23 Reporters before final filtering steps       92067
24 Duplicate reporters (duplicate-excluded if stringent was on) 40776
25 Reporter fragments reporting the same RE fragment within a single read (duplicate-excluded)  3708
25e Error in Reporter fragment assignment to in silico digested genome (see 24ee for details)   440
25ee Binary search error - fragment overlapping multiple restriction sites:     440
26 Actual reported fragments :  87919
Counters with reporters 110329
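
The arithmetic of the final filtering can be sketched in Python like this.
In this example report, counter 26 equals counter 23 minus counters 25 and 25e.
Counter 24 is not subtracted here; that is our reading of these particular numbers
(the report says it is excluded only "if stringent was on"), not a documented rule.

```python
counter_23 = 92067  # reporters before final filtering steps
counter_25 = 3708   # same RE fragment reported twice within a single read
counter_25e = 440   # reporter overlapping a restriction site (assignment error)
counter_26 = 87919  # actual reported fragments

# Remove the within-read duplicates and the assignment errors.
reported = counter_23 - counter_25 - counter_25e

print(reported == counter_26)
```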

####################################################

The FINAL COUNTS (the "most important counters" )

###################################################


Now that we have reached the FINAL counts for everything,
we split them into CAPTURE-SITE specific statistics.

So - basically the same counts as the ones above, but divided by CAPTURE SITE.
These report lines START with the capture site name, followed by numbers 12-17.
These 12-17 have nothing to do with the counters 12-17 described above.
These are the FINAL counts for each capture site.

So basically after filtering duplicates we count the situation JUST BEFORE the duplicate filtering,
and just after. 


These are the most important counters :

Olig_aGlobin 12 Reporters before final filtering steps	752513
Olig_aGlobin 13 Duplicate reporters (duplicate-excluded if stringent was on)	720951
Olig_aGlobin 14 Reporter fragments reporting the same RE fragment within a single read (duplicate-excluded)	783
Olig_aGlobin 14e Error in Reporter fragment assignment to in silico digested genome (see 25ee for details)	1244
Olig_aGlobin 15 Capture fragments (final count):	728013
Olig_aGlobin 16 Proximity exclusions (final count):	2538
Olig_aGlobin 17 Reporter fragments (final count) :	750486

The one you are really interested in is number 17 : the actual count of the interactions with other fragments.
Usually you can only interpret the data if you have more than 30 000 fragments for each capture in counter (17) here.
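
A minimal Python sketch of this check, assuming the capture-site lines look like the example above
(capture name first, counter number second, count last) and that capture names contain no spaces.
The function names are ours, and "Olig_example" is a made-up capture used only to show a failing case.

```python
MIN_REPORTERS = 30000  # rule of thumb from the text above


def final_reporter_counts(lines):
    """Collect counter 17 (final reporter count) for each capture site."""
    counts = {}
    for line in lines:
        parts = line.split()
        # parts[0] = capture name, parts[1] = counter number, parts[-1] = count
        if len(parts) >= 3 and parts[1] == "17" and parts[-1].isdigit():
            counts[parts[0]] = int(parts[-1])
    return counts


def interpretable(counts, minimum=MIN_REPORTERS):
    """True for captures having more than `minimum` reporter fragments."""
    return {name: n > minimum for name, n in counts.items()}


lines = [
    "Olig_aGlobin 16 Proximity exclusions (final count):\t2538",
    "Olig_aGlobin 17 Reporter fragments (final count) :\t750486",
    "Olig_example 17 Reporter fragments (final count) :\t1200",  # hypothetical capture
]
print(interpretable(final_reporter_counts(lines)))
```

Only the counter 17 lines are used; all the other capture-site lines are skipped.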