Page updated by Jelena 09Oct2018

#########################################################
The CCseqBasic analyser is a pipeline wrapped around James Davies' CCanalyser.pl script (nowadays called analyseMappedReads.pl).
#########################################################

CCseqBasic runs all the steps below automatically, in one go.
How to set up the run is explained elsewhere :
http://userweb.molbiol.ox.ac.uk/public/telenius/captureManual/runpipe.html

The analysis consists of three stages :

(i) PREPARE the data for analysis
(ii) RUN THE ANALYSIS SEVERAL TIMES, generate useful visualisations and reports on the fly
(iii) COMBINE THE FINAL RESULTS, and generate a summary figure

If you are running this via CM5-Rainbow, the above steps are further divided into folders A,B,C,D, which handle the different parallelisation rounds of the run.
Folder A prepares for the real run, folder B takes care of fastq-wise analysis (step (i) above for each fastq), folder C combines the fastq-wise analyses, and folder D runs all the steps (steps (i),(ii),(iii)) for each capture site.
The results can then be loaded with the text file E_hubAddresses.txt.
More details of the rainbow run file structure are here :
http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html

Details below

#########################################################

The (i-iii) order is also the order of the output folders of the pipeline :

(i) PREPARE the data for analysis
- Folder F1_beforeCCanalyser

--------------------------------------------------------

(ii) RUN THE ANALYSIS SEVERAL TIMES, generate useful visualisations and reports on the fly
- Folders F2-F5 like so :

F2_redGraphs
    run the analysis once, generate the "red graphs" of the UCSC data hub
F3_orangeGraphs
    run the analysis once, generate the "orange graphs" of the UCSC data hub
F4_blatPloidyFiltering ( and its results folder F5_greenGraphs_separate )
    run the analysis once, generate the "green graphs" of the UCSC data hub

( The above holds for CB4 series, but in CB3 series there is no blat filter, and the final results are generated in folder F4. )

--------------------------------------------------------

(iii) COMBINE THE FINAL RESULTS, and generate a summary figure
- Folders F6-F7 like so :

F6_greenGraphs_combined
    combine the "flashed" and "nonflashed" green graphs, run the analysis once, generate the "combined green graphs" of the UCSC data hub
F7_summaryFigure
    generate a summary figure, to make it easier to troubleshoot the above workflow at a single glance

( The above holds for CB4 series, but in CB3 series there is no need to combine data, as all flashed and nonflashed data are merged before the analysis starts. There is no summary figure in CB3 series either. )

#########################################################
DETAILS OF EACH STEP
#########################################################

If you are running this via CM5-Rainbow, the below steps are further divided into folders A,B,C,D, which handle the different parallelisation rounds of the run.
Folder A prepares for the real run, folder B takes care of fastq-wise analysis (step (i) above for each fastq), folder C combines the fastq-wise analyses, and folder D runs all the steps (steps (i),(ii),(iii)) for each capture site.
The results can then be loaded with the text file E_hubAddresses.txt.
More details of the rainbow run file structure are here :
http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html

--------------------------------------------------------

(i) PREPARE the data for analysis
- Folder F1_beforeCCanalyser

CM5-Rainbow runs do this twice - once fastq-wise in folder B, and then oligo-wise in folder D.
More details of rainbow runs in :
http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html

--------------------------------------------------------

F1_beforeCCanalyser

Preparing the sample for the analyseMappedReads.pl script.
This includes : read trimming, read combining (when R1 and R2 overlap), cutting reads at the restriction enzyme cut sites found in them, and read mapping.

READ COMBINING :

"Flashed read" :
A program called "flash" is used to combine the reads (when R1 and R2 overlap) into a single read.
"Flashed reads" are the reads the program was able to combine into a single entity (an overlap of the R1 and R2 reads was found).

    R1 |------------
              ----------------| R2
       |----------------------| flashed read (combining R1 and R2 to a single entity)

"Non-flashed read" :
A program called "flash" is used to combine the reads (when R1 and R2 overlap) into a single read.
"Non-flashed reads" are the reads the program was not able to combine into a single entity (no overlap of the R1 and R2 reads was found).
This means : there was no overlap, or the overlap was not "convincing enough" (it contained a lot of mismatches, or was very short).

    R1 |---------       -----------| R2
       |---------       -----------| non-flashed read (R1 and R2 cannot be combined, and continue to analysis as separate entities)

These two read sets, "flashed" and "non-flashed", are analysed separately (all result files come "twice" - once for the flashed read set, and once for the non-flashed one).
Thus, the output files have one restriction enzyme cut output file for "flashed" reads, and one for "non-flashed" reads.
The same goes for the bowtie (mapping) results in the .bam files, as well as all subsequent steps.

#########################################################

(ii) RUN THE ANALYSIS SEVERAL TIMES, generate useful visualisations and reports on the fly
- Folders F2-F5 like so :

F2_redGraphs
    run the analysis once, generate the "red graphs" of the UCSC data hub
F3_orangeGraphs
    run the analysis once, generate the "orange graphs" of the UCSC data hub
F4_blatPloidyFiltering ( and its results folder F5_greenGraphs_separate )
    run the analysis once, generate the "green graphs" of the UCSC data hub

The above holds for CB4 series, but in CB3 series there is no blat filter, and the final results are generated into folder F4.

CM5-Rainbow runs do this oligo-wise in folder D.
More details of rainbow runs in :
http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html

--------------------------------------------------------------------------------------
F2_redGraphs : run the analysis once, generate the "red graphs" of the UCSC data hub
--------------------------------------------------------------------------------------

( CM5-Rainbow runs do this oligo-wise in folder D.
  More details of rainbow runs in : http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html )

When generating folder F2_redGraphs, the code reads in the mapped .bam files from the F1_beforeCCanalyser folder (one file for "flashed" and one for "non-flashed" reads).
The code will then run the "CCanalyser" code analyseMappedReads.pl for the first time.
This program takes the mapped fragments, and :

1) Checks if there is a capture site (or several) amongst the fragments of that read.
   "Read" and "fragment" : see this document : http://userweb.molbiol.ox.ac.uk/public/telenius/captureManual/CCanalyser2_interpretAndTroubleshoot.pdf
   If a capture site is found, continue with the analysis of that read, otherwise skip that read.

2) Checks if there is at least one reporter fragment amongst the fragments of that read.
   "Reporter fragment" : see this document : http://userweb.molbiol.ox.ac.uk/public/telenius/captureManual/CCanalyser2_interpretAndTroubleshoot.pdf
   If one or more reporter fragments are found, continue with the analysis of that read, otherwise skip that read.

3) Checks if there is only ONE capture site (not several different captures) among the fragments of that read.
   If many different capture sites are found, skip that read, otherwise continue with the analysis of that read.

4) Checks if the read contains the same mapped fragments, in the same order, as a previously analysed read (is a true read duplicate).
   Duplicates are not filtered in the F2 folder yet - only their counters are updated.

This is done separately for the flashed and non-flashed reads ("flashed" and "non-flashed" : see above, chapter "READ COMBINING").

Details of these steps are listed in this document :
http://userweb.molbiol.ox.ac.uk/public/telenius/captureManual/reportInterpret.txt

Thus, the F2 output folder contains non-filtered results for the flashed reads and the non-flashed reads.

Each capture site gets 2 sets of output files : the FLASHED_ and NONFLASHED_ file sets.
Both of these sets contain, for each capture site :

(1) bam file (printout of all fragments for that specific capture site - from reads containing at least 1 capture and 1 reporter, in BAM format).
(2) windowed-wig file (the same data as the bam file, but in UCSC genome browser supported wig format, windowed over 300 bases with a 30 base sliding window, to show the signal better).
(3) gff file (counts of reporter fragments, in restriction enzyme cut site bins, for the whole genome. Lists only the bins with a non-zero count of reporter fragments).
(4) wig file (the same data as the gff file, but in UCSC genome browser supported wig format).

"Read" and "fragment" : see this document : http://userweb.molbiol.ox.ac.uk/public/telenius/captureManual/CCanalyser2_interpretAndTroubleshoot.pdf

In addition to the above, both "flashed" and "nonflashed" get summary files :

(1) bam file (printout of all fragments for that specific capture site - from reads containing at least 1 capture and 1 reporter, in BAM format.
    This file lists, in the CO: column of the sam file, which exact capture the read contained, whether the fragment in question was a Reporter/Capture/Exclusion fragment, whether the reporter was Cis or Trans, and whether the fragment originates from a Duplicated Read).
(2) report file (a helpful summary of the counts of the reads along the analysis. To interpret this, see this document : http://userweb.molbiol.ox.ac.uk/public/telenius/captureManual/reportInterpret.txt )

The output of the F2 folder is visualised in the UCSC data hub, as separate "flashed" and "nonflashed" tracks for each capture oligo.
The F2 folder data is the red graph of the UCSC data hub (thus, the output folder is called F2_redGraphs).

--------------------------------------------------------
F3_orangeGraphs

( CM5-Rainbow runs do this oligo-wise in folder D.
  More details of rainbow runs in : http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html )

The output sam file of F2_redGraphs (containing all the reads with at least one capture and one reporter) enters another round of analyseMappedReads.pl.
Again this step is done separately for the flashed and nonflashed reads.
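Because this round repeats the read-level checks (1)-(4) listed for F2_redGraphs, a minimal Python sketch of those checks may help here. It is only an illustration of the logic as this page describes it, not the code of analyseMappedReads.pl (which is a Perl script). The Fragment structure, the label names and the filter_duplicates switch are all hypothetical names invented for the sketch. The switch stands for the one difference between the rounds: off means duplicates are only counted (F2), on means they are dropped (F3).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One mapped fragment of a read (hypothetical structure for this sketch)."""
    chrom: str
    start: int
    end: int
    kind: str               # "capture", "reporter" or "exclusion"
    capture_site: str = ""  # name of the capture site, set for capture fragments


def check_read(fragments, seen, filter_duplicates):
    """Apply checks (1)-(4) to one read and return (label, keep)."""
    captures = {f.capture_site for f in fragments if f.kind == "capture"}
    if not captures:                                      # check 1: any capture site ?
        return "no_capture", False
    if not any(f.kind == "reporter" for f in fragments):  # check 2: any reporter ?
        return "no_reporter", False
    if len(captures) > 1:                                 # check 3: only ONE capture site ?
        return "multiple_captures", False
    # check 4: same mapped fragments, in the same order, as an earlier read ?
    key = tuple((f.chrom, f.start, f.end) for f in fragments)
    if key in seen:
        # Counted only (F2-like) or counted and dropped (F3-like).
        return "duplicate", not filter_duplicates
    seen.add(key)
    return "unique", True


def run_round(reads, filter_duplicates):
    """Run the checks over all reads; return the kept reads and the counters."""
    seen = set()
    counters = Counter()
    kept = []
    for fragments in reads:
        label, keep = check_read(fragments, seen, filter_duplicates)
        counters[label] += 1
        if keep:
            kept.append(fragments)
    return kept, counters
```

In this sketch, calling run_round(reads, filter_duplicates=False) keeps duplicated reads and only counts them, while run_round(reads, filter_duplicates=True) keeps just the first copy of each duplicated read. In the real pipeline the flashed and non-flashed read sets would each go through such a round separately.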
The analysis and output files are exactly the same as in F2_redGraphs, except that step (4) now looks like this :

4) Checks if the read contains the same mapped fragments, in the same order, as a previously analysed read (is a true read duplicate).
   Filtering duplicates - printing only reads which are not PCR duplicates.

The output of the F3 folder is visualised in the UCSC data hub, as separate "flashed" and "nonflashed" tracks for each capture oligo.
The F3 folder data is the orange graph of the UCSC data hub (thus, the output folder is called F3_orangeGraphs).

--------------------------------------------------------
F4_blatPloidyFiltering ( and its results folder F5_greenGraphs_separate )

( CM5-Rainbow runs do this oligo-wise in folder D.
  More details of rainbow runs in : http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html )

The output sam files of F3_orangeGraphs, one for each capture site, enter further filtering.
Again this step is done separately for the flashed and nonflashed reads.

BLAT-filtering (removing false positive hits due to homologous regions) :

This is done to avoid embarrassment, when long range cis and trans interactions are seen just because the capture site happens to closely resemble ANOTHER location in the genome (and mismapped reads cause this to show up as a long range "true" interaction).

This filtering is done like so :
Each capture site (the whole region between the neighbouring restriction enzyme sites within which the capture site resides) is in turn subjected to a BLAT run, where this site is mapped against the whole genome.
If hits are found (i.e. if the site has homologous sites elsewhere in the genome), +/- 20 000 bases from each hit are eliminated as reporter fragments (except hits +/- 200 000 bases from the target site itself).
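The elimination rule above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the pipeline's own code. The function names and the tuple layout (chromosome, start, end) are hypothetical. How the pipeline measures "+/- 200 000 bases from the target site" is not spelled out on this page, so the sketch assumes it means "overlapping the capture fragment padded by 200 000 bases on both sides". Coordinates are treated as inclusive.

```python
def blat_exclusion_zones(hits, capture_fragment, flank=20_000, protect=200_000):
    """Turn BLAT hits into regions where reporter fragments get eliminated.

    hits             : list of (chrom, start, end) BLAT hits of the capture fragment
    capture_fragment : (chrom, start, end) of the capture restriction fragment
    flank            : bases eliminated on both sides of each hit
    protect          : hits this close to the capture fragment are ignored
                       (assumed interpretation of the +/- 200 000 rule)
    """
    cap_chrom, cap_start, cap_end = capture_fragment
    zones = []
    for chrom, start, end in hits:
        near_target = (
            chrom == cap_chrom
            and start <= cap_end + protect
            and end >= cap_start - protect
        )
        if near_target:
            continue  # hit at or near the capture site itself: not a homology artefact
        zones.append((chrom, max(0, start - flank), end + flank))
    return zones


def in_exclusion_zone(fragment, zones):
    """True if a reporter fragment overlaps any of the exclusion zones."""
    chrom, start, end = fragment
    return any(
        chrom == z_chrom and start <= z_end and end >= z_start
        for z_chrom, z_start, z_end in zones
    )
```

With a capture fragment at chr1:1000000-1001000, a hit on chr5:50000-50400 would produce the zone chr5:30000-70400 in this sketch, while a hit on chr1 150 000 bases away from the capture fragment would be ignored.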
Thus, the sam files of each oligo in turn are searched for these "to-be-eliminated" regions, and if reads mapping there are found, the reporter fragments in there are deleted.

The blat parameters used are :
minMatch=2 tileSize=11 maxIntron=4000 stepSize=5 minScore=10
(these can be confirmed from the qsub.out file of the pipeline run)

This means that any two fully matching 11-base wide regions, separated by at most 4000 bases from each other, trigger a homologous region - and are to be removed in the blat-filtering step.
Step size determines how often this "search for homology" is restarted. Here we do it in 5-base steps along the whole genome.
Min score sets "how similar" the sequences need to be to trigger a homologous region. Value 10 (used here) flags regions as homologous relatively easily.

PLOIDY-filtering :

The sam files are also subjected to filtering of "blacklisted regions".
These are generally very repeat-rich regions, which generate deep (but artifactual) signals in the data.
These "false positive" peaks are seen in all samples, in all data types (DNase-seq, ATAC-seq, RNA-seq, captureC), and if some of these regions are within the interaction region of any of the capture probes, a convincing-looking interaction signal can result, and this may lead to false interpretation of the data.
If you are interested in repeat-rich regions, you may wish to turn this filter off, by running the pipeline with --noPloidyFilter

The F4_blatPloidyFiltering folder contains the various log files and the to-be-filtered-regions listings of the above two filters.
The filtered files are stored in folder F5_greenGraphs_separate.

The output of the F5 folder is visualised in the UCSC data hub, as separate "flashed" and "nonflashed" tracks for each capture oligo.
The F5 folder data is the green graph of the UCSC data hub (thus, the output folder is called F5_greenGraphs_separate).

#########################################################

(iii) COMBINE THE FINAL RESULTS, and generate a summary figure

( CM5-Rainbow runs do this oligo-wise in folder D.
  More details of rainbow runs in : http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html )

- Folders F6-F7 like so :

F6_greenGraphs_combined
    combine the "flashed" and "nonflashed" green graphs, run the analysis once, generate the "combined green graphs" of the UCSC data hub
F7_summaryFigure (serial runs) or E_hubAddresses.txt (parallel CM5-rainbow runs)
    generate a summary figure, to make it easier to troubleshoot the above workflow at a single glance

The above holds for CB4 series, but in CB3 series there is no need to combine data, as all flashed and nonflashed data are merged before the analysis starts. There is no summary figure in CB3 series either.

--------------------------------------------------------
F6_greenGraphs_combined

The output of the F5 folder now contains the final results of the run.
However, the data is still separated into two files, the "flashed" and the "nonflashed" analysis. The visualisations are also still separate for these two.

In the F6 folder, analyseMappedReads.pl is run one final time (as it was run in the F2 folder, so without duplicate filtering), to generate a single analysis result for a combined bam file.
The output and report files are as described above.
This generates the COMBINED tracks of the UCSC data hub (thus, the output folder is called F6_greenGraphs_combined).

--------------------------------------------------------
F7_summaryFigure (serial runs) or E_hubAddresses.txt (parallel CM5-rainbow runs)

The above data flow can be quite exhausting to troubleshoot, so the reporter counter files are transformed into an easy-to-interpret graphical representation (available in the data hub by clicking any of the data tracks, or by opening the "track preferences" page by right clicking the track).

This folder contains the figure, and the log files about making it.

The corresponding structure for the parallel CM5-rainbow runs is E_hubAddresses.txt - containing the data hub addresses, as well as key run statistics collected into a html page.
To find out what happened during each parallel run, each run also has its error and output logs in its corresponding folder.

----------------------------------------------------------

If you are running this via CM5-Rainbow, the above steps are further divided into folders A,B,C,D, which handle the different parallelisation rounds of the run.
Folder A prepares for the real run, folder B takes care of fastq-wise analysis (step (i) above for each fastq), folder C combines the fastq-wise analyses, and folder D runs all the steps (steps (i),(ii),(iii)) for each capture site.
The results can then be loaded with the text file E_hubAddresses.txt.
More details of the rainbow run file structure are here :
http://userweb.molbiol.ox.ac.uk/public/telenius/DataHubs/damien7000oligos/htmlfiles/outputDataAndNotes.html