Page updated by Jelena Telenius - 17:00 28/Nov/2018

CCseqBasic

~ Generate the list of homologous genomic regions



Homologous regions lists are needed to filter false positive long range cis and trans interactions

In order to finalise the filtering of data in CCseqBasic runs, the locations of homologous regions for the captured sites need to be listed. These regions will then be filtered out of the results, to avoid reporting false positive long range cis and trans interactions.

To better understand the value and use of blat filtering, see the following pages :

Analysis steps (what does the run do ?) - the parts relevant to "homology region filtering"

Output folders (what do you get and how to browse) - the parts relevant to "homology region filtering"


These lists are RE-fragment specific, and can be generated right after running CapSequm (which designed your capture sites)

To generate the homology region lists, you can run CCseqBasic with the --onlyBlat flag even before you have your sequencing data. This generates the lists of genomic regions homologous to the captured RE fragments of your design. You only need the coordinates of the RE fragments in which your capture oligonucleotides reside. The CapSequm tool should have already generated these for you as a file called CCseqBasicFragmentFile (see above - chapter (1) Generate RE fragment coordinate file).

The homology region search takes quite a while to run, but needs to be done only once for each designed RE fragment.
This saves you time in the future : you can re-use these homology region lists in any subsequent CCseqBasic runs that share (some of) the same RE fragments.


The homology lists can be collected into a single directory for re-use

The output folder of your --onlyBlat run will contain one .psl file for each RE fragment listed in your CCseqBasicFragmentFile, named like so : TEMP_Hba-1_blat.psl , where 'Hba-1' is the name the capture RE fragment was given in the input parameter file CCseqBasicFragmentFile

You can store all your "already-generated" RE-fragment homology regions lists in a single folder,
and point to that folder in your CCseqBasic runs by using parameter --BLATforREUSEfolderPath /full/path/to/folder/where/these/files

Just remember to call the same RE fragment by the same name every time you run CCseqBasic. If you make a new capture with a "slightly different design" that still targets the same RE fragment, simply rename and re-use the ready-made homology list files.
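
The collecting and renaming described above can be sketched as two small shell helpers. This is only a sketch : collect_psl and alias_psl are hypothetical names (not part of CCseqBasic), and the folder paths in the usage lines are placeholders for your own.

```shell
# Hypothetical helpers - NOT part of CCseqBasic.

# collect_psl RUN_DIR REUSE_DIR
# Copy every TEMP_<name>_blat.psl from a finished --onlyBlat output folder
# into the shared re-use folder (the folder is created if missing).
collect_psl() {
    cp_run_dir=$1
    cp_reuse_dir=$2
    mkdir -p "$cp_reuse_dir"
    for cp_psl in "$cp_run_dir"/TEMP_*_blat.psl; do
        [ -e "$cp_psl" ] || continue    # the glob matched nothing
        cp "$cp_psl" "$cp_reuse_dir"/
    done
}

# alias_psl REUSE_DIR OLD_NAME NEW_NAME
# Make an existing list available under a new RE fragment name
# (same RE fragment, but named differently in a new design).
alias_psl() {
    cp "$1/TEMP_${2}_blat.psl" "$1/TEMP_${3}_blat.psl"
}

# Usage (placeholder paths and names) :
#   collect_psl /path/to/onlyBlat/run /full/path/to/folder/where/these/files
#   alias_psl /full/path/to/folder/where/these/files Hba-1 Hba-1_newdesign
```

The re-use folder can then be given to later runs with --BLATforREUSEfolderPath.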


Pre-generating these lists is not obligatory

If you don't run the homology search separately, the full CCseqBasic run will generate these lists on the fly, into a subfolder of the run folder :

F4_blatPloidyFilteringLog_CC4/BlatPloidyFilterRun/REUSE_blat folder
The run will take longer, as the homology search step is slow.
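
Lists generated on the fly can also be harvested for later re-use. The sketch below assumes that the files in the REUSE_blat subfolder follow the same TEMP_<name>_blat.psl naming as the --onlyBlat output; harvest_psl is a hypothetical helper name, and lists already present in the re-use folder are left untouched.

```shell
# Hypothetical helper - NOT part of CCseqBasic.

# harvest_psl FULL_RUN_DIR REUSE_DIR
# Copy the on-the-fly generated lists of a full run into the re-use folder,
# skipping any list that is already stored there.
harvest_psl() {
    hv_src="$1/F4_blatPloidyFilteringLog_CC4/BlatPloidyFilterRun/REUSE_blat"
    hv_reuse_dir=$2
    mkdir -p "$hv_reuse_dir"
    for hv_psl in "$hv_src"/TEMP_*_blat.psl; do
        [ -e "$hv_psl" ] || continue                 # the glob matched nothing
        hv_target="$hv_reuse_dir/$(basename "$hv_psl")"
        [ -e "$hv_target" ] && continue              # keep the existing list
        cp "$hv_psl" "$hv_target"
    done
}

# Usage (placeholder paths) :
#   harvest_psl /path/to/full/run /full/path/to/folder/where/these/files
```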


Example run command for generating the homology regions lists

CCseqBasic5.sh \
    -c /t1-data/user/telenius/capturesiteREfragments.txt \
    --genome mm9 \
    --onlyBlat


Example run script

Example run.sh file available here : examplerun.sh


Detailed explanation of all run parameters

All possible run parameters are given here :
CCseqBasic5.sh
    -c /t1-data/user/telenius/capturesiteREfragments.txt
    --genome mm9
    --onlyBlat

    (and optionally)
    --onlyCis  (if you are 100% sure that you are not interested in analysing trans interactions)
    --sonicationSize 300
    
    (and optionally - if you want to have different filtering defaults)
    
    --stepSize 5 (spacing between tiles). If you want blat to run faster, set this to 11.
    --tileSize 11 (the size of match that triggers an alignment)
    --minScore 10 (minimum match score - the default 10 flags regions as homologous relatively easily, so the filtering is quite stringent)
    --maxIntron 4000 (to make blat run quicker) (blat default value is 750000) - max intron size
    --oneOff 0 (set this to 1, if you want to allow one mismatch per tile. Setting this to 1 will make blat slow.)
    
details below

--onlyCis (only look for homology regions within the cis chromosome of the capture RE fragment)
     : it is assumed that the full CCseqBasic is also to be run with --onlyCis
       (when trans homology does not matter, naturally)
--sonicationSize 300 (how far from the RE enzyme cut site the 'valid fragments' reach.
    This is the maximum expected library fragment size after sonication.)
    This determines which parts of the RE cut fragment and its exclusion fragment
    are used to look for homology regions : the parts that are farther away from the RE cut sites
    than the given --sonicationSize are omitted from the homology region search.
    
    The sonicationSize parameter should be the same for the full CCseqBasic runs, where the fragments farther away
    than this are filtered out as 'improperly mapped fragments' before they ever enter the detailed analysis steps.
BLAT OPTIONS :
--onlyCis        (to generate blat-filtering files for only cis chromosomes )
--stepSize 5     (spacing between tiles). If you want blat to run faster, set this to 11.
--tileSize 11    (the size of match that triggers an alignment)
--minScore 10    (minimum match score)
--maxIntron 4000 (max intron size. defaulting here to 4000 to make blat run quicker)
--oneOff 0       (set this to 1, if you want to allow one mismatch per tile. Setting this to 1 will make blat slow.)


  These default parameters mean that the program searches for :
  
  11-base wide homology regions (at least 11 fully matching bases)
  and if it finds at least 2 of these,
  separated by at most 4000 bases from each other,
  this triggers a homologous region, which is added to the list of homologous regions
  ( to be filtered out as a false positive long range region ).
  Step size determines how often this "search for homology" is restarted. Here it is done in 5-base steps
  along the whole genome.
  Min score sets "how similar" the sequences need to be to trigger a homologous region.
  Value 10 (used here) flags regions as homologous relatively easily.
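
The parameter names above match blat's own command line options (-stepSize, -tileSize, -minScore, -maxIntron, -oneOff). The sketch below prints the blat option string that the given values would correspond to. How CCseqBasic actually assembles its blat call is an assumption here, and blat_opts is a hypothetical helper name used for illustration only.

```shell
# Hypothetical helper - NOT part of CCseqBasic.

# blat_opts [STEPSIZE] [TILESIZE] [MINSCORE] [MAXINTRON] [ONEOFF]
# Print the blat options matching the given values.
# Missing arguments fall back to the defaults listed above.
blat_opts() {
    echo "-stepSize=${1:-5} -tileSize=${2:-11} -minScore=${3:-10} -maxIntron=${4:-4000} -oneOff=${5:-0}"
}

blat_opts        # the defaults : -stepSize=5 -tileSize=11 -minScore=10 -maxIntron=4000 -oneOff=0
blat_opts 11     # the faster setting, with non-overlapping tiles (step size = tile size)
```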


Skipping the homology filter altogether

If you want to run without filtering these regions (as these runs take quite a while to finish),
you can skip this step by tricking the CCseqBasic analysis run like so :

--BLATforREUSEfolderPath /full/path/to/folder/where/these/files

where the folder

/full/path/to/folder/where/these/files

should contain EMPTY files - one for each of the capture RE fragments, named like so :

TEMP_Hba-1_blat.psl
where 'Hba-1' was the name the capture RE fragment was given in the input parameter file CCseqBasicFragmentFile

This will act as a sign to the run that "1) homology-region search was already done, and 2) no homology regions were found in the search",
so this filter is essentially skipped over in the CCseqBasic run.
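
The empty files can be created with a short shell helper. This is a sketch under two assumptions : the first whitespace-separated column of the CCseqBasicFragmentFile holds the RE fragment name, and make_empty_psl is a hypothetical helper name (not part of CCseqBasic). Use a dedicated folder for this, as the helper empties any file of the same name that already exists there.

```shell
# Hypothetical helper - NOT part of CCseqBasic.

# make_empty_psl FRAGMENT_FILE EMPTY_DIR
# Create one empty TEMP_<name>_blat.psl per RE fragment listed in FRAGMENT_FILE.
# ASSUMPTION : the fragment name is the first whitespace-separated column.
# WARNING : existing files of the same name in EMPTY_DIR are truncated.
make_empty_psl() {
    me_fragment_file=$1
    me_empty_dir=$2
    mkdir -p "$me_empty_dir"
    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while read -r me_name me_rest || [ -n "$me_name" ]; do
        [ -n "$me_name" ] || continue       # skip blank lines
        : > "$me_empty_dir/TEMP_${me_name}_blat.psl"
    done < "$me_fragment_file"
}

# Usage (placeholder paths) :
#   make_empty_psl /path/to/CCseqBasicFragmentFile /full/path/to/folder/where/these/files
```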