#########################################################
FAQ : What is the difference between VS05 and VS04 and VS03 ?
#########################################################

VS05, VS04 and VS03 are by and large the same code.
The differences are SOLELY in how the duplicate filtering is done for some rare read types.

First changes   1) VS03 vs VS04
Second changes  2) VS04 vs VS05
Third changes   3) wobbly ends

-------------------------------------------------------
1) VS03 vs VS04 : the first changes
-------------------------------------------------------

VS03 and VS04 are by and large the same code.
The changes make the duplicate filtering more stringent.

VS03 lets many duplicates through to the analysis : reads which just "look different" but are actually duplicates.
These are the non-flashed reads, where the 3' ends do not really mean anything.

So, for a read like this :

R1 |------------         -----------|R2

These ends are important
   |
   v
R1 |------------         -----------|R2
                                    ^
                                    |

These ends are not (they are just sequencing length based duplicates)
               |
               v
R1 |------------         -----------|R2
                         ^
                         |

So, VS04 allows us to go to deeper sequencing depths (from MiSeq runs to NextSeq runs) without getting so many PCR-duplicate related artifacts.
These artifacts are easy to spot - their profiles don't look "smooth" but are "spiky".

Even this filtering is not perfect, but it is much better than CC3.
However - the best is to use VS05 : that is explained below !

If you don't want VS05 for some reason, and are uncertain whether to use VS03 or VS04, just run both, and use the one which worked !

In general, if you sequence relatively short reads (40-50 bases), you are better off with VS03, as VS04 is too stringent for these and loses most of your signal, mistaking it for PCR duplicates.

So - if you have long reads (75-150 bases), VS04 is better; if you have short reads (40-50 bases), VS03 is better.
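
The difference between the two filters can be illustrated with a small Python sketch. This is not the pipeline's code : the `Pair` record, its field names and the two key functions are hypothetical, and the sketch assumes that VS03 compares all four read ends while VS04 compares only the read starts of non-flashed pairs.

```python
from typing import Callable, Iterable, NamedTuple, Tuple


class Pair(NamedTuple):
    """A non-flashed read pair. All field names are hypothetical."""
    chrom: str
    r1_start: int  # start of R1 (meaningful)
    r1_end: int    # 3' end of R1 (depends on read length)
    r2_start: int  # start of R2 (meaningful)
    r2_end: int    # 3' end of R2 (depends on read length)


def key_all_ends(p: Pair) -> Tuple:
    """VS03-like key (assumption): every end takes part in the comparison."""
    return (p.chrom, p.r1_start, p.r1_end, p.r2_start, p.r2_end)


def key_starts_only(p: Pair) -> Tuple:
    """VS04-like key (assumption): only the read starts are compared."""
    return (p.chrom, p.r1_start, p.r2_start)


def count_unique(pairs: Iterable[Pair], key: Callable[[Pair], Tuple]) -> int:
    """Number of pairs that survive duplicate filtering under the given key."""
    return len({key(p) for p in pairs})


pairs = [
    Pair("chr1", 100, 175, 900, 825),
    Pair("chr1", 100, 170, 900, 830),  # same starts, shorter reads
    Pair("chr1", 120, 195, 900, 825),  # genuinely different R1 start
]

print(count_unique(pairs, key_all_ends))     # 3 : the length-based duplicate gets through
print(count_unique(pairs, key_starts_only))  # 2 : the length-based duplicate is removed
```

With short reads, many genuinely different fragments share their start coordinates, which is one way to picture why the stricter key can remove real signal.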

Whenever you can, however, you should use VS05 (as that is actually the correct way).

-------------------------------------------------------
2) VS04 vs VS05 : the second changes
-------------------------------------------------------

( continuing from above )

In VS04 we stopped counting anything other than the beginning of the read for non-flashed reads.

So, in VS04

These ends are important
   |
   v
R1 |------------         -----------|R2
                                    ^
                                    |

These ends are not (they are just sequencing length based duplicates)
               |
               v
R1 |------------         -----------|R2
                         ^
                         |

However, that is true only to an extent :

Before entering duplicate filtering, the reads have already been cut at restriction enzyme (RE) cut sites, like this :

|--------GA|TC----     ( example using GA:TC - the DpnII cut site sequence )

to become

|--------GA
|TC----

Thus, RE-cutting before analysing generates two kinds of fragments :

|--------RE     ( RE cut site at the end of the read )
|------         ( no RE cut site at the end of the read )

Now the ends of the RE-ending fragments naturally do not depend on sequencing length, and these ends should be considered exact.

So now - in VS05 :

These ends are important
   |            |
   v            v                   |
R1 |------------RE                  v
                       RE-----------|R2
                        ^
                        |
   |
   v
R1 |------------         -----------|R2
                                    ^
                                    |

These ends are not (they are just sequencing length based duplicates)
               |
               v
R1 |------------         -----------|R2
                         ^
                         |

This is the fix from VS04 to VS05, and it salvages all the "flattened" profiles generated in VS04, where the filtering was too stringent.

VS05 suits all sequencing lengths (it finally does the filtering properly for all read types).

-------------------------------------------------------
3) Wobbly ends - optional way to improve this further
-------------------------------------------------------

As we go very deep in sequencing, all (even minute) flaws in the duplicate filtering come into play.

The sequencer will sometimes "jump" a couple of bases, if the sequence just after the sequencing adaptor is somehow tricky for it to bind to.
This results in the starts of the R1 and R2 reads being a bit "wobbly" :

R1 |---------------
R1   |-------------
R1  |--------------
R1    |------------
R1  |--------------

This naturally makes the ends of the sequences a bit "wobbly" as well :

R1 |---------------
R1   |---------------
R1  |---------------
R1    |--------------
R1  |---------------

This is what is called "wobbly ends" - and in CS5 and newer pipeline versions you can set a "wobbly end" size, so that the pipeline treats these "almost the same" reads as duplicates.

A good value is "a couple of bases" - and if you use UMI indices, it is best to use --wobblyEndBinWidth 20 to be on the safe side.

Page updated by Jelena 11Jun2018
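
As a rough illustration of the wobbly end idea in section 3, the Python sketch below groups read coordinates into fixed-width bins before comparing them. How the pipeline actually applies --wobblyEndBinWidth is not described on this page, so the floor-division binning and the names `wobbly_bin` and `wobbly_key` are assumptions made for this example only.

```python
from typing import List, Tuple


def wobbly_bin(coord: int, bin_width: int) -> int:
    """Map a coordinate to its bin index (assumed scheme: floor division)."""
    return coord // bin_width


def wobbly_key(chrom: str, r1_start: int, r2_start: int, bin_width: int) -> Tuple:
    """Duplicate key in which both read starts are compared per bin, not per base."""
    return (chrom, wobbly_bin(r1_start, bin_width), wobbly_bin(r2_start, bin_width))


# Hypothetical read pairs : (chromosome, R1 start, R2 start)
reads: List[Tuple[str, int, int]] = [
    ("chr1", 1001, 5000),
    ("chr1", 1002, 5001),  # starts wobble by one base
    ("chr1", 1000, 5002),  # starts wobble by one or two bases
    ("chr1", 1040, 5000),  # clearly different R1 start
]

bin_width = 4  # "a couple of bases"

exact = len(set(reads))
binned = len({wobbly_key(c, r1, r2, bin_width) for c, r1, r2 in reads})

print(exact)   # 4 : exact comparison keeps every wobbly copy
print(binned)  # 2 : the three wobbly copies collapse into one
```

Note that with plain floor division, two reads one base apart can still fall on either side of a bin boundary and be kept as separate reads; a wider bin reduces how often that happens, at the cost of merging more reads that are genuinely different.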