updated by Jelena 12Jan2017

#################################################
#
# Some documentation about the CCanalyser report.
#
#################################################

In general it is very straightforward.
The code more or less reports "every step of the way".

So - all the numbers correspond to CHRONOLOGICAL steps during the analysis.
The only exception is number 6 (duplicate reads, listed as 06c below) - which comes way too early in the report.
All the other numbers correspond to the filtering order in the script.

####################################################
We have divided each READ into RE-cut FRAGMENTS - and we count them in various ways i.e. counters (1-10)
###################################################

So, first we have all fragments (1), and count them in various ways (1-10)

01 Number of capture sites loaded: 16
02 Restriction enzyme fragments loaded: 6199203
03 Lines in sam file header: 23
04 Data lines in sam file: 59265143
06 Unmapped fragments in SAM file: 2851157
06c Duplicate reads: 9801446
07 Mapped fragments: 56413986
09 Proximity exclusion fragments (Pre PCR duplicate removal): 2101978
10 Reporter fragments (Pre PCR duplicate removal): 10374399

####################################################
We reconstruct the READS from the RE-cut FRAGMENTS - and we count them in various ways i.e. counters (11-15)
###################################################

After mapping in bowtie, the reads are ordered in the file so that all fragments of one "read" follow one another.
(see the nomenclature of "read" and "fragment" in this pdf : http://userweb.molbiol.ox.ac.uk/public/telenius/captureManual/CCanalyser2_interpretAndTroubleshoot.pdf )

Then, once we know we have all the fragments of a read, we look at them, and if not everything was lost in the filtering steps above (counters 1-10), we filter the reads further.
Reads entering further analysis stages must have
1) a capture fragment
2) a reporter fragment
but MAY STILL contain multiple different captures.

We also count the FRAGMENTS within the reads which had at least one mapped fragment (counter 13)

####################################################
Detailed counts of fragments and their composition (before duplicate-filtering) i.e. counters (11e,11ee)
###################################################

Before duplicate filtering, your "preliminary counts" of reporters are in 11e and 11ee

How to read these lines :

11ee Total number of reads having captures in composition Hom9:1 , having 1 reporters and 0 exclusion fragments : 918274
11ee Total number of reads having captures in composition Hom9:1 Pam6:1 , having 1 reporters and 0 exclusion fragments : 3
11ee Total number of reads having captures in composition Hom9:2 , having 1 reporters and 0 exclusion fragments : 255796

First line : Hom9 capture.
Capture fragments seen : 1. Reporter fragments seen : 1. Exclusion fragments seen : 0.
Total count of these : 918274

Second line : Hom9 capture, having also Pam6 capture.
Capture fragments seen : 1 (in Hom9), Capture fragments seen : 1 (in Pam6). Reporter fragments seen : 1. Exclusion fragments seen : 0.
Total count of these : 3 (so very rare)

Third line : Hom9 capture.
Capture fragments seen : 2 (so, the Hom9 capture was seen in 2 fragments of the read). Reporter fragments seen : 1. Exclusion fragments seen : 0.
Total count of these : 255796

####################################################
We duplicate filter our reads i.e. counters (12-16)
###################################################

Then the code applies the duplicate filter (counter 16).
So - we get rid of reads which are most probably each other's PCR duplicates.

The "composition" is reported the same way as in 11e and the others above.

We still continue reporting "global statistics" after the duplicate filtering, all the way up to number 25.
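As an illustration of how the composition lines can be read mechanically, here is a minimal Python sketch. It assumes the line layout is exactly the one shown in the 11ee examples above; the function name `parse_composition_line` and the regular expression are hypothetical helpers written for this note and are not part of CCanalyser.

```python
import re

# Assumed layout, copied from the 11ee example lines in this document:
# "<counter> Total number of reads having captures in composition <Name:N ...> ,
#  having <R> reporters and <E> exclusion fragments : <count>"
COMPOSITION_RE = re.compile(
    r"^(?P<counter>\S+)\s+Total number of reads having captures in composition\s+"
    r"(?P<captures>.+?)\s*,\s*having\s+(?P<reporters>\d+)\s+reporters and\s+"
    r"(?P<exclusions>\d+)\s+exclusion fragments\s*:\s*(?P<count>\d+)\s*$"
)


def parse_composition_line(line):
    """Return a dict describing one composition line, or None if it does not match."""
    match = COMPOSITION_RE.match(line.strip())
    if match is None:
        return None
    captures = {}
    for token in match.group("captures").split():
        # Each token looks like "Hom9:1": capture name, then capture fragments seen.
        name, sep, seen = token.rpartition(":")
        if sep and seen.isdigit():
            captures[name] = int(seen)
    return {
        "counter": match.group("counter"),
        "captures": captures,
        "reporters": int(match.group("reporters")),
        "exclusions": int(match.group("exclusions")),
        "count": int(match.group("count")),
    }


example = ("11ee Total number of reads having captures in composition "
           "Hom9:1 Pam6:1 , having 1 reporters and 0 exclusion fragments : 3")
print(parse_composition_line(example))
```

Lines that do not follow this layout simply return None, so the same helper can be run over a whole report file.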

12 Total number of reads entering duplicate-filtering - should be same count as 11f : 9886207
13 Count of fragments in Reads having at least one informative fragment : 21410249
14a Reads having 2 fragments: 8291768
14a Reads having 3 fragments: 1551218
14a Reads having 4 fragments: 43048
14a Reads having 5 fragments: 171
14a Reads having 6 fragments: 2
14b Reads having 2 informative fragments: 8291768
14b Reads having 3 informative fragments: 1551218
14b Reads having 4 informative fragments: 43048
14b Reads having 5 informative fragments: 171
14b Reads having 6 informative fragments: 2
16 Non-duplicated reads: 84761

####################################################
We count our duplicate filtered reads in various ways i.e. counters (16-23)
###################################################

16b and 16bb are the composition counters after the duplicate filter.
Interpret these like the counters 11e and 11ee above.

16c Proximity exclusion fragments (After PCR duplicate removal): 1191
16d Reporter fragments (After PCR duplicate removal): 92067
16f Total fragment count (after PCR duplicate removal): 84761
16g Reads having 2 informative fragments (after PCR duplicate whole-read removal): 52924
16g Reads having 3 informative fragments (after PCR duplicate whole-read removal): 29677
16g Reads having 4 informative fragments (after PCR duplicate whole-read removal): 2094
16g Reads having 5 informative fragments (after PCR duplicate whole-read removal): 64
16g Reads having 6 informative fragments (after PCR duplicate whole-read removal): 2
23 Reporters before final filtering steps 92067

####################################################
We do final filtering to the reads i.e. counters (23-25)
###################################################

The "last filtering steps" that the report mentions are :

- if the SAME reporter fragment is reported twice WITHIN the same read, it is counted only once.
- if the reporter fragment is mapped so that one end of it lies in one DpnII fragment and the other end in another DpnII fragment (so that the mapped reporter fragment OVERLAPS a DpnII cut site), it is filtered out, as it is believed to be a mismapped read.

23 Reporters before final filtering steps 92067
24 Duplicate reporters (duplicate-excluded if stringent was on) 40776
25 Reporter fragments reporting the same RE fragment within a single read (duplicate-excluded) 3708
25e Error in Reporter fragment assignment to in silico digested genome (see 24ee for details) 440
25ee Binary search error - fragment overlapping multiple restriction sites: 440
26 Actual reported fragments : 87919
Counters with reporters 110329

####################################################
The FINAL COUNTS (the "most important counters" )
###################################################

Now that we have reached the FINAL counts for everything, we split them into CAPTURE-SITE specific statistics.
So - basically the same counts as the ones above, but divided by CAPTURE SITE.

These report lines START with the capture site name, followed by the numbers 12-17.
These 12-17 have nothing to do with the 12-17 described above.

These are the FINAL counts for each capture site.
So, basically, once duplicates have been filtered, we count the situation JUST BEFORE the duplicate filtering, and just after.

These are the most important counters :

Olig_aGlobin 12 Reporters before final filtering steps 752513
Olig_aGlobin 13 Duplicate reporters (duplicate-excluded if stringent was on) 720951
Olig_aGlobin 14 Reporter fragments reporting the same RE fragment within a single read (duplicate-excluded) 783
Olig_aGlobin 14e Error in Reporter fragment assignment to in silico digested genome (see 25ee for details) 1244
Olig_aGlobin 15 Capture fragments (final count): 728013
Olig_aGlobin 16 Proximity exclusions (final count): 2538
Olig_aGlobin 17 Reporter fragments (final count) : 750486

The one you are really interested in is number 17 : the actual count of the interactions with other fragments.

Usually you can only interpret the data if you have more than 30 000 fragments for each capture in counter (17) here.
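
The check on counter 17 can be automated. The Python sketch below pulls the capture-site specific counters out of report text and flags which capture sites exceed the 30 000 threshold. It assumes each such line has the form "<capture site name> <counter> <description> <value>", as in the Olig_aGlobin example; the function names and the regular expression are hypothetical helpers written for this note, not part of CCanalyser. In the example numbers, 752513 - 783 - 1244 = 750486, so counter 12 minus counters 14 and 14e equals counter 17. That is only an observation about this sample and not a documented guarantee.

```python
import re

# Assumed layout: "<site> <counter> <free-text description> <integer value>".
# The counter is digits plus optional lowercase letters (e.g. "12", "14e").
SITE_LINE_RE = re.compile(
    r"^(?P<site>\S+)\s+(?P<counter>\d+[a-z]*)\s+(?P<label>.*\D)(?P<value>\d+)\s*$"
)


def parse_capture_site_counters(report_text):
    """Collect {site: {counter: value}} from capture-site specific report lines."""
    sites = {}
    for line in report_text.splitlines():
        match = SITE_LINE_RE.match(line.strip())
        if match is None:
            continue  # global counters and free text do not match this layout
        counters = sites.setdefault(match.group("site"), {})
        counters[match.group("counter")] = int(match.group("value"))
    return sites


def sites_with_enough_reporters(sites, minimum=30000):
    """Map each site to True if counter 17 is strictly above the threshold."""
    return {site: counters.get("17", 0) > minimum
            for site, counters in sites.items()}


example_report = """\
Olig_aGlobin 12 Reporters before final filtering steps 752513
Olig_aGlobin 14 Reporter fragments reporting the same RE fragment within a single read (duplicate-excluded) 783
Olig_aGlobin 14e Error in Reporter fragment assignment to in silico digested genome (see 25ee for details) 1244
Olig_aGlobin 17 Reporter fragments (final count) : 750486
"""

sites = parse_capture_site_counters(example_report)
print(sites_with_enough_reporters(sites))
```

The threshold is passed as a parameter because "more than 30 000" is described above as the usual rule of thumb, not as a hard limit.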