Walkthrough with real data set (GSE67959)
You should now have run the CapSequm web tool (the first step of the analysis).
The next step is to generate the lists of genomic regions that share micro-homology with the capture sites. These lists are design-specific, and can thus be re-used for all samples generated using the same capture oligo design.
Generating these lists is time-consuming, so it is worth starting the homology-list generating runs as soon as the capture design is decided and sent for oligonucleotide synthesis.
The homology region lists are generated by running the CCseqBasic tool with the flag --onlyBlat.
To run CCseqBasic --onlyBlat, you need the CapSequm output file called "CCseqBasicFragmentFile.txt", which contains the coordinates of the RE fragments for all capture sites of your design.
You can also make this file manually, but the CapSequm output is a good choice for first testing: it is guaranteed to be in the right format, so you can use it without problems.
A readymade CapSequm output file, "CCseqBasicFragmentFile.txt", for the 30g data set used in this walkthrough is available here:
More about the above file, its file format, and editing instructions is available here: fragmentfile.html
A readymade run.sh file to run CCseqBasic --onlyBlat is available here:
You need to edit the run.sh file to reflect the location of the CCseqBasicFragmentFile.txt input file:
capturesiteFile="/t1-data/user/hugheslab/telenius/3_CCseqBasic/CCseqBasicFragmentFile.txt"
You also need to change the pipePath to reflect the correct location of the pipeline:
pipePath="/t1-home/molhaem2/telenius/CCseqBasic/CS5/RELEASE/"
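Before submitting a long run, it can help to confirm that both edited paths actually exist. The following is a minimal sketch only: check_run_paths is a hypothetical helper (not part of CCseqBasic), and the example paths in the comment are placeholders to be replaced with your own locations.

```shell
#!/usr/bin/env bash
# Sanity-check the two paths that run.sh needs before submitting the job.
# check_run_paths is a hypothetical helper, not part of CCseqBasic.
check_run_paths() {
    local capturesiteFile="$1"   # path to CCseqBasicFragmentFile.txt
    local pipePath="$2"          # folder holding the CCseqBasic pipeline
    local status=0

    # -s : file exists and is not empty
    if [ ! -s "${capturesiteFile}" ]; then
        echo "Missing or empty capture-site file: ${capturesiteFile}" >&2
        status=1
    fi
    # -d : directory exists
    if [ ! -d "${pipePath}" ]; then
        echo "Pipeline folder not found: ${pipePath}" >&2
        status=1
    fi
    if [ "${status}" -eq 0 ]; then
        echo "Paths look OK"
    fi
    return "${status}"
}

# Example call (placeholder paths, replace with your own):
# check_run_paths "/path/to/CCseqBasicFragmentFile.txt" "/path/to/CCseqBasic/"
```

The function prints "Paths look OK" and returns 0 only when the file is present and non-empty and the pipeline folder exists; otherwise it reports what is missing on stderr and returns 1.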
You can also write your run.sh yourself, as a one-liner like this:
CCseqBasic5.sh -c CCseqBasicFragmentFile.txt --genome mm9 --onlyBlat
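If you prefer to keep the pipePath and capturesiteFile variables in a hand-written run.sh, the one-liner can be assembled from them. The sketch below only prints the command (a dry run) so you can inspect it before submitting. build_onlyblat_cmd is a hypothetical helper, and it assumes that CCseqBasic5.sh sits directly inside the pipePath folder; check this against your own installation.

```shell
#!/usr/bin/env bash
# Build the --onlyBlat command line from the variables used in run.sh.
# build_onlyblat_cmd is a hypothetical helper; it prints the command only.
# Assumption: CCseqBasic5.sh sits directly inside ${pipePath}.
build_onlyblat_cmd() {
    local pipePath="${1%/}"      # strip one trailing slash, if present
    local capturesiteFile="$2"   # CapSequm fragment file
    local genome="$3"            # genome build, e.g. mm9
    echo "${pipePath}/CCseqBasic5.sh -c ${capturesiteFile} --genome ${genome} --onlyBlat"
}

# Example call (placeholder pipeline path):
# build_onlyblat_cmd "/path/to/CCseqBasic/" "CCseqBasicFragmentFile.txt" "mm9"
```

Printing the command first makes it easy to spot a wrong path or genome name before the job spends hours in the queue.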
Now you are ready to run.
This is a long production run, and needs to be submitted to a queue system (if you have one). The exact submission command depends on the queue system used. The command below was used when the script was run on the SGE (Sun Grid Engine) cluster at the WIMM:
qsub -cwd -e qsub.err -o qsub.out -N onlyBlat < ./run.sh
You can keep an eye on the run by reading the output and error logs, qsub.out and qsub.err. Expect the run to take about 1-2 hours per capture site for mouse genomes and about 1-2 days per capture site for human genomes, so be prepared for a long wait.
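The per-site figures above can be turned into a rough total for your own design. The sketch below is an estimate under two assumptions that are not stated in this walkthrough: that the fragment file holds one capture site per non-blank line, and that the sites are processed one after another. estimate_runtime is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Rough run-time estimate from the per-capture-site figures quoted above.
# estimate_runtime is a hypothetical helper, not part of CCseqBasic.
# Assumptions: one capture site per non-blank line, sites processed serially.
estimate_runtime() {
    local capturesiteFile="$1"
    local n
    # Count lines that contain at least one non-whitespace character.
    n="$(grep -c '[^[:space:]]' "${capturesiteFile}")"
    # 1-2 h per site (mouse), 1-2 days per site (human).
    echo "${n} capture sites: mouse ~${n}-$(( 2 * n )) h, human ~${n}-$(( 2 * n )) days"
}

# Example call:
# estimate_runtime CCseqBasicFragmentFile.txt
```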
Once the run has finished, read through qsub.err to confirm that you did not get any errors.
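A quick keyword scan can serve as a first pass over the error log before reading it in full. This is a sketch only: scan_error_log is a hypothetical helper, and the search pattern is a guess at typical failure messages, not a list of messages that CCseqBasic is known to emit. It does not replace reading qsub.err through.

```shell
#!/usr/bin/env bash
# First-pass scan of the error log for lines that look like failures.
# scan_error_log is a hypothetical helper; the pattern is an assumption
# and may both miss real problems and flag harmless lines.
scan_error_log() {
    local logfile="$1"
    # -i: ignore case, -n: print line numbers, -E: extended regex
    if grep -inE 'error|exception|cannot|no such file' "${logfile}"; then
        return 1
    fi
    echo "No suspicious lines found in ${logfile}"
}

# Example call:
# scan_error_log qsub.err
```

Matching lines are printed with their line numbers and the function returns 1; if nothing matches, it prints a short confirmation and returns 0.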
You should now have an output folder full of .psl files, which can be re-used for all your samples in the actual data analysis CCseqBasic runs. Make a note of the location of this folder: you need it when setting up the CCseqBasic run.
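To confirm that the output folder really contains .psl files, you can count them. A minimal sketch follows; count_psl_files is a hypothetical helper, and the folder in the example call is taken from the readymade files for this data set (BlatPloidyFilterRun/REUSE_blat), so it is an assumption that your own run writes to the same place.

```shell
#!/usr/bin/env bash
# Count the .psl files under a folder (searched recursively).
# count_psl_files is a hypothetical helper, not part of CCseqBasic.
count_psl_files() {
    local blatDir="$1"
    # tr strips the padding that some wc implementations add
    find "${blatDir}" -type f -name '*.psl' | wc -l | tr -d ' '
}

# Example call (folder name is an assumption, adjust to your run):
# count_psl_files BlatPloidyFilterRun/REUSE_blat
```

A count of 0 suggests the run did not finish or wrote its output elsewhere.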
More information on the --onlyBlat filter generation, and why it is needed, is given in the chapter "Details of Blacklist filtering and Homology region filtering" of: 2_workflow/index.html
Fine-tuning the homology region search, and skipping it altogether, are explained in: 3_run/onlyblat.html
The readymade BlatPloidyFilterRun/REUSE_blat .psl files for the 30g data set are available here: BlatPloidyFilterRun.zip