Updated by Jelena 15:00 06/July/2015

At the moment we are PRE-RELEASE, so we still have things to fix before we are ready to GitHub it :)

Send all confusion and weird things to jelena__telenius__at__gmail__com
( and if needed, she will forward them to James )

-------------------------------------------------------

A) CCanalyser script
B) Shell wrapper for the whole pipeline

-------------------------------------------------------

A) CCanalyser2.pl script

BUGS / NON-STANDARD BEHAVIOR :

1) Relative paths for input files don't work
   ( At the moment you should give the full path to all files - sam, oligo, genome digest )
2) Dump.fastq file gets generated even when the --dump flag is not on (generates an empty file)
3) Some duplicate-filtered reads are actually not duplicates.
   Currently the filter does not take into account the strandedness of fragments or their ligation order.
4) Multiple captures are not dealt with in the same way in SNP and non-SNP runs.
5) Jelena hasn't tested - should ask James - what happens to the last read in the SAM file (all its fragments) ?
   Does it get analysed, or do we need an extra block after the loop for the last read ?

INSTABILITY ISSUES :

1) Not enough tests for file integrity in place - wrong type / empty input files, non-parseable paths for public files etc.
   will lead to NONSENSE data, or non-existing data.
   --> will be fixed so that all file integrity issues make the code CRASH loudly and clearly

DEVELOPMENT TARGETS :

For GitHub release :

1) Log file reporting (script will tell what it does and report counter values "in real time")
2) Error file reporting (script will crash reliably when something goes totally wrong or the input format looks intolerably wrong)
3) Listing the parameters which were read in - the user can check that things were read in properly,
   and also use the log file as a "notebook" of what was done in the run
4) File integrity tests and related program crashes
5) Report file improvements : adding more counters and "self-debugging" log output
6) Oligo-file auto-generator
7) Improving the statistics output : more counters, and a more logical print-out order,
   to ensure that the user gets things in the "order of importance"
8) Dividing the output logs into 4 files : runtime log, runtime error, run results statistics (compact), run results statistics (very detailed)
9) Providing a test data set with its output files - for new users to test that they get the same results

For later releases :

1) Improvements to duplicate filtering. Now filters with too heavy a hand
   (does not take into account the ligation order and strand of fragments).
2) Making the debugging flags and debugging output files more flexible - now hardcoded changes are needed to turn reporting on.
   Should provide optional debugger output of most hashes (each to be turned on with a separate flag).
3) Ensuring that ALL DATA read in ends up in some reporting result - that all counters "add up" and nothing leaks into "anonymity".
   Also, keeping the original order of things (see duplicate filtering above) and reporting it -
   to help further development and the interpretation of data sets containing "esoteric" reads.
4) Adding facets from log files to the MIG output (counter values which are now only exported to logs/debugging files)
5) Ensuring that only those sam files which contain more than just the heading get printed.
   Same for the rest of the output files - only capture sites which actually have something to report will be reported.
6) Ensuring that SNP runs behave the same way as non-SNP runs (consistency of the duplicate capture reporting)

-------------------------------------------------------

B) Shell wrapper for the whole pipeline

BUGS / NON-STANDARD BEHAVIOR :

None reported - keep Jelena updated

INSTABILITY ISSUES :

Does not kill the script properly when any of the underlying scripts fails.
Only checks for output files, and not even that for the last script (CCanalyser).

DEVELOPMENT TARGETS :

1) Better crashing behavior - the wrapper crashes when any of the perl scripts crashes.
2) Input format integrity testing - not allowing any of the files to go on to the next steps if they are not of proper format
3) Deleting files when they are no longer needed (currently saves all files from the FLASHing step onwards)
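
One possible way to approach development target 1) of the shell wrapper is sketched below. This is only an illustration and not the wrapper's actual code : run_step and require_nonempty are hypothetical helper names made up for this note. The idea is that every step is started through a helper which reports the failing step and passes on its exit code, and every expected output file is checked for being present and non-empty.

```shell
# Sketch only : run_step and require_nonempty are hypothetical helpers,
# they are not part of the current pipeline wrapper.

# Run one pipeline step. If it fails, say which step it was
# and return the exit code of the failed command.
run_step() {
    local name="$1"
    shift
    local status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "ERROR : step '${name}' failed with exit code ${status}" >&2
        return "$status"
    fi
}

# Fail if an expected output file is missing or empty.
require_nonempty() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "ERROR : expected output '${file}' is missing or empty" >&2
        return 1
    fi
}
```

In the wrapper, each call would then be followed by `|| exit 1`, for example `run_step "CCanalyser" perl CCanalyser2.pl <arguments> || exit 1` and `require_nonempty <output file> || exit 1`, so that the first failure stops the whole pipeline instead of letting the later steps run on missing or empty data.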