To finalise the filtering of data in CCseqBasic runs, the locations of regions homologous to the captured sites need to be listed. These regions are then filtered out of the results, to avoid reporting false positive long-range cis and trans interactions.
To better understand the value and use of blat filtering, it is recommended to read the following pages:
Analysis steps (what does the run do?) - the parts relevant to "homology region filtering"
Output folders (what do you get and how to browse) - the parts relevant to "homology region filtering"
To generate the homology region lists, you can run CCseqBasic with the --onlyBlat flag, even before you have your sequencing data. This generates the lists of genomic regions homologous to the captured RE fragments of your design. You only need the coordinates of the RE fragments in which your capture oligonucleotides reside. The CapSequm tool should have already generated these for you as a file called CCseqBasicFragmentFile (see above - chapter (1) Generate RE fragment coordinate file).
The homology region search takes quite a while to run, but needs to be done only once per designed RE fragment. It saves you time in the future - you can re-use these homology region lists for any subsequent CCseqBasic runs that include (some of) the same RE fragments.
You can store all your "already-generated" RE fragment homology region lists in a single folder, and point to that folder in your CCseqBasic runs by using the parameter --BLATforREUSEfolderPath /full/path/to/folder/where/these/files
Just remember to call the same RE fragment by the same name every time you run CCseqBasic. If you make a new capture with a "slightly different design" that still targets the same RE fragment, just rename and re-use the ready-made homology list files.
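Re-using a ready-made list under a new fragment name can be sketched as below. This is only an illustration, not part of CCseqBasic: the fragment names (Hba-1, Hba-1_v2) are hypothetical, and it assumes the stored lists follow the TEMP_<name>_blat.psl naming mentioned elsewhere on this page. The demo works in a throw-away temporary folder, so it can be tried safely.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical reuse folder; in real use this would be your
# --BLATforREUSEfolderPath folder.
reuse_dir="$(mktemp -d)"

old_name="Hba-1"      # name used when the list was generated
new_name="Hba-1_v2"   # hypothetical name used in the new design

# Stand-in for an already-generated homology list
# (assumed naming: TEMP_<name>_blat.psl).
printf 'placeholder psl content\n' > "${reuse_dir}/TEMP_${old_name}_blat.psl"

# Copy rather than move, so runs using the old name still find their list.
cp "${reuse_dir}/TEMP_${old_name}_blat.psl" "${reuse_dir}/TEMP_${new_name}_blat.psl"

ls "${reuse_dir}"
```

Copying keeps both names available; use mv instead if the old name will never be used again.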
If you don't run the homology search separately, the full CCseqBasic run will generate these lists on the fly, into a subfolder of the run folder:
F4_blatPloidyFilteringLog_CC4/BlatPloidyFilterRun/REUSE_blat
The run will take longer, as the homology search step is slow.
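Lists generated on the fly can afterwards be collected into your shared reuse folder. The sketch below is an assumption-laden illustration: the run folder name (myRun) and reuse folder name (blatReuseFolder) are hypothetical, and a mock run folder is built in a temporary directory so the snippet can be run as-is. Only the REUSE_blat subfolder path is taken from this page.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo setup: a mock "finished run" and a shared reuse folder (both hypothetical).
demo="$(mktemp -d)"
run_dir="${demo}/myRun"
reuse_dir="${demo}/blatReuseFolder"
blat_out="${run_dir}/F4_blatPloidyFilteringLog_CC4/BlatPloidyFilterRun/REUSE_blat"
mkdir -p "${blat_out}" "${reuse_dir}"

# Stand-in for a list generated by a full run.
printf 'placeholder\n' > "${blat_out}/TEMP_Hba-1_blat.psl"

# Copy each list into the shared folder, never overwriting one already stored.
for f in "${blat_out}"/*; do
    [ -f "${f}" ] || continue
    dest="${reuse_dir}/$(basename "${f}")"
    if [ -e "${dest}" ]; then
        echo "kept existing: ${dest}"
    else
        cp "${f}" "${dest}"
        echo "stored: ${dest}"
    fi
done
```

Skipping files that already exist protects lists you stored earlier for the same fragment name.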
CCseqBasic5.sh -c /t1-data/user/telenius/capturesiteREfragments.txt --genome mm9 --onlyBlat
Example run.sh file available here : examplerun.sh
CCseqBasic5.sh -c /t1-data/user/telenius/capturesiteREfragments.txt --genome mm9 --onlyBlat
(and optionally)
--onlyCis (if you are 100% sure you are not interested in analysing trans interactions)
--sonicationSize 300
(and optionally - if you want to have different filtering defaults)
--stepSize 5 (spacing between tiles; if you want blat to run faster, set this to 11)
--tileSize 11 (the size of match that triggers an alignment)
--minScore 10 (minimum match score - the default 10 is quite stringent)
--maxIntron 4000 (max intron size; set to 4000 to make blat run quicker - the blat default value is 750000)
--oneOff 0 (set this to 1 if you want to allow one mismatch per tile; setting this to 1 will make blat slow)
Details below.
--onlyCis (only look for homology regions within the cis chromosome of the capture RE fragment) : this assumes that the full CCseqBasic run will also be run with --onlyCis (that is, trans homology does not matter)
--sonicationSize 300 (how far from the RE cut site the 'valid fragments' reach. This is the maximum expected library fragment size after sonication.) This determines which parts of the RE cut fragment and its exclusion fragment are used to look for homology regions: parts farther away from the RE cut sites than the given --sonicationSize are omitted from the homology region search. The sonicationSize parameter should be the same in the full CCseqBasic runs, where fragments farther away than this are filtered out as 'improperly mapped fragments' before they ever enter the detailed analysis steps.
BLAT OPTIONS:
--onlyCis (to generate blat-filtering files for only cis chromosomes)
--stepSize 5 (spacing between tiles; if you want blat to run faster, set this to 11)
--tileSize 11 (the size of match that triggers an alignment)
--minScore 10 (minimum match score)
--maxIntron 4000 (max intron size; defaults here to 4000 to make blat run quicker)
--oneOff 0 (set this to 1 if you want to allow one mismatch per tile; setting this to 1 will make blat slow)
These default parameters mean that the program searches for 11-base-wide homology regions (a minimum of 11 fully matching bases). If it finds at least 2 of these separated by at most 4000 bases from each other, this triggers a homologous region, which is added to the list of homologous regions (to be filtered out as false positive long-range regions).
Step size determines how often this "search for homology" is restarted. Here it is restarted in 5-base steps along the whole genome.
Min score sets "how similar" the sequences need to be to trigger a homologous region. The value 10 (used here) flags regions as homologous relatively easily.
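The options above can be gathered into a small run script. The following is a minimal sketch, not the official examplerun.sh: it uses only the flags and values listed on this page, and the DRY_RUN variable is a hypothetical convenience of this sketch (not a CCseqBasic feature) that makes the script print the command instead of executing it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Path and genome copied from the example on this page; adjust to your design.
fragment_file="/t1-data/user/telenius/capturesiteREfragments.txt"
genome="mm9"

# Build the command as an array so that each option stays a separate word.
cmd=(CCseqBasic5.sh -c "${fragment_file}" --genome "${genome}" --onlyBlat
     --sonicationSize 300
     --stepSize 5 --tileSize 11 --minScore 10 --maxIntron 4000 --oneOff 0)

# Dry run by default: only print the command. Set DRY_RUN=0 to execute it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s ' "${cmd[@]}"
    printf '\n'
else
    "${cmd[@]}"
fi
```

Add --onlyCis to the array if you are not interested in trans interactions.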
If you want to run without filtering these regions (as the homology search takes quite a while to finish), you can skip this step by tricking the CCseqBasic analysis run like so:
--BLATforREUSEfolderPath /full/path/to/folder/where/these/files
The folder /full/path/to/folder/where/these/files should now contain EMPTY files - one for each of the capture RE fragments, named like so:
TEMP_Hba-1_blat.psl
where 'Hba-1' is the name the capture RE fragment was given in the input parameter file CCseqBasicFragmentFile.
This acts as a sign to the run that "1) homology-region search was already done, and 2) no homology regions were found in the search", so this filter is essentially skipped in the CCseqBasic run.
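The empty placeholder files described above could be created with a short loop over the fragment file. This is a sketch under assumptions: it assumes the fragment name is the first whitespace-separated column of the CCseqBasicFragmentFile, and the toy file contents and folder names below are made up for demonstration only. Check your own file's layout before using it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical locations inside a temporary folder; replace with your own paths.
workdir="$(mktemp -d)"
fragment_file="${workdir}/capturesiteREfragments.txt"
reuse_dir="${workdir}/emptyBlatFiles"

# Toy fragment file (dummy coordinates; assumed layout: name in column 1).
printf 'Hba-1\tchr1\t1000\t2000\n'  > "${fragment_file}"
printf 'Hbb-b1\tchr1\t5000\t6000\n' >> "${fragment_file}"

mkdir -p "${reuse_dir}"

# Create one empty TEMP_<name>_blat.psl per capture RE fragment.
while read -r name _rest; do
    if [ -n "${name}" ]; then
        : > "${reuse_dir}/TEMP_${name}_blat.psl"
    fi
done < "${fragment_file}"

ls "${reuse_dir}"
```

The resulting folder is the one to pass to --BLATforREUSEfolderPath when you want the homology filter skipped.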