Page updated by Jelena Telenius - 17:00 28/Nov/2018

CCseqBasic
SNP-specific analysis (for allelic skew)

Walkthrough with real data set ( GSE67959 )



Pre-requisites for a successful SNP-specific run

The pre-requisite for a data-rich SNP-specific run is a SNP of interest lying closer to an RE cut site than half the --ampliconSize (default 350b, i.e. within 175b).
Preferably the SNP of interest should be within +/- 50 bases of the RE cut site; the closer to the cut site, the better.
If all your SNPs are farther from the RE cut sites than this, you cannot probe them with this method.
However, note that having multiple SNPs of interest under the capture oligo itself (i.e. within the very oligonucleotide you used to capture the data) may affect your pull-down.


Disclaimer about data depth

As this method only probes reads which overlap the SNPs of interest, the data depth is greatly reduced (the farther the SNP of interest lies from the closest RE cut site, the greater the reduction).
Do not expect a high-quality interaction profile, but rather a plot from which you can roughly determine whether the general landscape of interactions differs between the allelic forms.


Selecting (and building) a non-biasing reference genome for SNP-specific runs

You are recommended to use the --bowtie2 flag in your CCseqBasic SNP-specific run, as bowtie2 tolerates SNPs better during mapping.
However, the duplicate filtering of CCseqBasic struggles a little with bowtie2 multimapped reads, so the best practice (at the time of writing, 01Feb2019) is to map with both --bowtie1 and --bowtie2, then check that no artifacts accumulate in the bowtie2 run, and that the SNPs get treated fairly in the bowtie1 run.
The very best approach is to generate an N-masked genome (and its associated bowtie1 build), in which the SNPs of interest are replaced with Ns,
and add that genome to the /conf/genomeBuildSetup.sh file.
This genome can then be used in bowtie1 mapping without biasing the true SNP distribution over the sites of interest.
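
Below is a minimal sketch of one way to build such an N-masked genome, assuming you have a BED file of your SNP positions and that bedtools and bowtie are available on your path (all file and genome names below are placeholders, not part of the pipeline) :

# Replace the SNP positions (listed in snps_of_interest.bed) with Ns in the reference fasta :
bedtools maskfasta -fi mm9.fa -bed snps_of_interest.bed -fo mm9_Nmasked.fa
# Build the bowtie1 index for the masked genome :
bowtie-build mm9_Nmasked.fa mm9_Nmasked
# Then add the new genome (here called mm9_Nmasked) to the /conf/genomeBuildSetup.sh file.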


Run instructions

You should now have run the CapSequm web tool (the first step of the analysis),
and the --onlyBlat run of CCseqBasic (the second step of the analysis).

The above are optional pre-requisites of the CCseqBasic run (but are highly recommended).

To run CCseqBasic, you need the CapSequm output file called "CCseqBasicFragmentFile.txt",
which contains the coordinates of the RE fragments for all the capture sites of your design.

You can also make this file manually, but the CapSequm output is good for first testing :
it is guaranteed to be in the right format, so you can use it without problems.

A readymade CapSequm output file "CCseqBasicFragmentFile.txt" ,
for the 30g data set used in this walk-through, is available here :

  • CCseqBasicFragmentFile.txt
  • The above file only contains 7 columns. To make the run SNP-specific, we need to add the 8th and 9th column,
    describing the desired SNP and the desired allele (a sketch of this is given after this list).

    Each SNP-specific file can contain each capture site ONLY ONCE
    ( if you want to analyse multiple alleles of the base of interest, you need to set up several runs ).

    In this example we run the SNP-specific analysis for one of the capture sites (Oct-4) of the sample.
    The selected SNP has 2 alleles of interest (G, A) at the position of interest.
    This gives us 2 runs.
    Even though only one of our capture sites has a SNP of interest, we also include the other capture sites in the analysis,
    to enable the run to properly filter the ambiguous interactions which are assigned to more than one capture site.

    The readymade "CCseqBasicFragmentFile.txt" files,
    for the 2 SNP runs, are available here :

  • CCseqBasicFragmentFile_SNP_allele1.txt
  • CCseqBasicFragmentFile_SNP_allele2.txt
  • More about the above file, its file format, and editing instructions available here : fragmentfile.html
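
    For orientation, below is a purely illustrative awk sketch of how the two allele-specific files could be generated
    from the 7-column CapSequm file. It assumes the capture-site name is in the first column, that the 8th column holds
    the SNP coordinate, and the 9th the allele base ; the coordinate 35506500 is a placeholder, not the real value -
    see fragmentfile.html for the authoritative column definitions :

    # Append the assumed 8th (SNP coordinate) and 9th (allele base) columns
    # to the Oct-4 line only - all appended values below are placeholders :
    awk 'BEGIN{OFS="\t"} $1=="Oct-4"{print $0,35506500,"G";next}{print}' \
        CCseqBasicFragmentFile.txt > CCseqBasicFragmentFile_SNP_allele1.txt
    awk 'BEGIN{OFS="\t"} $1=="Oct-4"{print $0,35506500,"A";next}{print}' \
        CCseqBasicFragmentFile.txt > CCseqBasicFragmentFile_SNP_allele2.txt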

    In addition to this file, you need the Fastq files from the sequencing of the Capture-C library.

    Below are instructions on how to fetch them for the 30g data set, which is used as the example data set in this walk-through.

    Here is a readymade script to fetch the fastqs for GSE67959, stored in the European Nucleotide Archive (ENA) :

  • downloadFastqs.sh.txt
  • You need to run the above script (or something similar) to fetch the fastq files to your own data area.

    Here are detailed instructions on how the above script was constructed
    - i.e. how to get from the GEO archive number GSE67959
    to the European Nucleotide Archive (ENA) file download locations :

  • fromGEOtoFastq.txt
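
    For orientation only, a minimal sketch of what such a download could look like for the ery1 sample,
    assuming the standard ENA FTP directory layout (the authoritative download locations are listed in
    downloadFastqs.sh.txt and fromGEOtoFastq.txt above) :

    # Fetch the ery1 fastqs (run accession SRR1976563) from the ENA FTP mirror :
    mkdir -p GSE67959_fastqs/ery1
    cd GSE67959_fastqs/ery1
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/003/SRR1976563/SRR1976563_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR197/003/SRR1976563/SRR1976563_2.fastq.gz
    cd ../..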

  • In the below example we are only using the 'ery1' sample (not all 6 downloaded samples)

    The expected file sizes are :

    ery1:   3.0G    SRR1976563_1.fastq.gz
            3.2G    SRR1976563_2.fastq.gz
    



    Now you should have these files :

    CCseqBasicFragmentFile_SNP_allele1.txt   
    CCseqBasicFragmentFile_SNP_allele2.txt   
    
    and
    GSE67959_fastqs
    `-- ery1
        |-- SRR1976563_1.fastq.gz
        `-- SRR1976563_2.fastq.gz
    
    

    These files can now be used to set up the CCseqBasic runs.

    Below is a .zip archive of all the needed files, in the correct folder structure, to start both SNP runs :

    CCseqBasic_runExample_SNP.zip

    After unpacking, you should see these files and folders :

    |-- CCseqBasicFragmentFile_SNP_allele1.txt
    |-- CCseqBasicFragmentFile_SNP_allele2.txt
    |
    |-- ery1_SNP_allele1
    |   |-- PIPE_fastqPaths.txt
    |   `-- run.sh
    |
    `-- ery1_SNP_allele2
        |-- PIPE_fastqPaths.txt
        `-- run.sh
    
    

    The ery1_SNP_allele1 and ery1_SNP_allele2 folders are the run folders :
    each sample needs to be started in its own, otherwise empty, folder (containing only the PIPE_fastqPaths.txt and run.sh files)

    You need to edit the PIPE_fastqPaths.txt files, to reflect the actual location of the fastq files in your system.

    The ery1_SNP_allele1/PIPE_fastqPaths.txt and ery1_SNP_allele2/PIPE_fastqPaths.txt files below show the 2 supported formats of this file :

    ery1_SNP_allele1/PIPE_fastqPaths.txt
    ------------------------
    /t1-data/GSE67959_fastqs/ery1/SRR1976563_1.fastq.gz	/t1-data/GSE67959_fastqs/ery1/SRR1976563_2.fastq.gz
    ------------------------
    
    ery1_SNP_allele2/PIPE_fastqPaths.txt
    ------------------------
    SRR1976563_1.fastq.gz	SRR1976563_2.fastq.gz	/t1-data/GSE67959_fastqs/ery1
    ------------------------
    

    You also need to edit the run.sh file, to reflect the location of the other input files in your system :

    capturesiteFile="/t1-data/user/hugheslab/telenius/3_CCseqBasic/CCseqBasicFragmentFile_SNP1.txt"
    PublicPath="/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS"
    reuseBLATdir='/t1-data/user/hugheslab/telenius/2_CCseqBasic_onlyBlat/BlatPloidyFilterRun/REUSE_blat'
    

    You also need to change the pipePath to reflect the correct location of the pipeline :

    pipePath="/t1-home/molhaem2/telenius/CCseqBasic/CS5/RELEASE/"
    

    The public path is the location where you want to store your run output data, for browsing via the UCSC genome browser.
    More about setting this can be found on the installation instructions page : CCseqBasic/1_setup/index.html ,
    step (9.) "server address".

    The reuseBLATdir is the directory where your CCseqBasic --onlyBlat run results are.
    If you didn't run the --onlyBlat run, set :

    reuseBLATdir='.'
    

    Here you can download a readymade BlatPloidyFilterRun/REUSE_blat folder for the 30g data set, to be used as the reuseBLATdir : BlatPloidyFilterRun.zip

    Note that if you are using the REUSE_blat directory from your earlier (non-SNP) 30g test runs,
    you need to generate correctly named copies (or symlinks) for the SNP-run capture sites, like so :

    cp  TEMP_Oct-4_blat.psl  TEMP_Oct-4_1_blat.psl
    cp  TEMP_Oct-4_blat.psl  TEMP_Oct-4_2_blat.psl
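
    If you prefer symlinks over copies (to save disk space), the equivalent commands would be :

    ln -s TEMP_Oct-4_blat.psl TEMP_Oct-4_1_blat.psl
    ln -s TEMP_Oct-4_blat.psl TEMP_Oct-4_2_blat.psl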
    

    Now you are ready to run.

    This is a long production run, and needs to be submitted to a queue system (if you have one).
    The exact submission command depends on the queue system used.
    The commands below were used when the script was run on the SGE (Sun Grid Engine) cluster at the WIMM :

    cd ery1_SNP_allele1
    qsub -cwd -e qsub.err -o qsub.out -N ery1_SNP_allele1 < ./run.sh
    cd ../ery1_SNP_allele2/
    qsub -cwd -e qsub.err -o qsub.out -N ery1_SNP_allele2 < ./run.sh
    cd ..
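
    If your cluster runs Slurm instead of SGE, a minimal sketch of the equivalent submission
    (resource requests omitted - adjust these to your own system) could look like this :

    cd ery1_SNP_allele1
    sbatch --job-name=ery1_SNP_allele1 -o qsub.out -e qsub.err ./run.sh
    cd ../ery1_SNP_allele2
    sbatch --job-name=ery1_SNP_allele2 -o qsub.out -e qsub.err ./run.sh
    cd ..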
    

    You can keep an eye on the run by reading the output and error logs qsub.out and qsub.err.

    Once the run has finished, the last 10 lines of the qsub.out file will give the location of the visualisation data hubs :

    [telenius@deva ery1]$ tail -n 10 qsub.out 
    
    Generated a data hub for RAW data in :                http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
    ( pre-filtered data for DEBUGGING purposes is here :  http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
    Generated a data hub for FILTERED data in :           http://userweb.molbiol.ox.ac.uk/public/telenius/test7/C
    Generated a data hub for FILTERED flashed+nonflashed combined data in :     http://userweb.molbiol.ox.ac.uk/p
    
    Generated a COMBINED data hub (of all the above) in :     http://userweb.molbiol.ox.ac.uk/public/telenius/tes
    How to load this hub to UCSC : http://sara.molbiol.ox.ac.uk/public/telenius/DataHubs/ReadMe/HUBtutorial_Allgr
    
    [telenius@deva 1]$     
    

    Of these, the most important is the COMBINED data hub (of all the above)
    ( the others are needed mostly for troubleshooting, if runs crash unexpectedly before finishing )

    As you can see, these lines also point to a help document on how to load this data hub into UCSC :
    http://sara.molbiol.ox.ac.uk/public/telenius/DataHubs/ReadMe/HUBtutorial_AllGroups_160813.pdf


    More information about these data hubs : 6_visualisation/index.html and 7_QC/index.html

    The output folders of the run are explained more thoroughly here : 4_outputfolders/index.html
    The interaction counters are explained here : 5_counters/index.html
    And the workflow of the CCseqBasic tool is explained here : 2_workflow/index.html


    Optional run parameters, and other fine-tuning of the run, are explained here : 3_run/index.html


    Here are the readymade COMBINED data hubs for both the ery1_SNP1 and ery1_SNP2 runs, for loading into UCSC :

  • Ery1_SNP1
  • http://userweb.molbiol.ox.ac.uk/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS/30g_ery1_SNP1/CS5_dpnII/30g_ery1_SNP1_CS5_hub.txt
  • Ery1_SNP2
  • http://userweb.molbiol.ox.ac.uk/public/telenius/CaptureCompendium/CCseqBasic/3_run/30gRUNS/30g_ery1_SNP2/CS5_dpnII/30g_ery1_SNP2_CS5_hub.txt


    How to load the above data hubs into UCSC : HUBtutorial_AllGroups_160813.pdf