Page updated by Jelena Telenius - 17:00 28/Nov/2018

CCseqBasic --snp

~ investigate the allelic skew




Pre-requsites for a succesfull SNP-specific run

The pre-requisite of a data-rich SNP-style run is a SNP of interest closer than half the --ampliconSize (default 350b) from a RE cut site (175b).
Preferably the SNP of interest should be within +/- 50 bases from the RE cut site, and the closer to the cut site, the better.
If all your SNPs are farther from RE cut sites than this, you cannot probe them via this method.
However, note that having multiple SNPs of interest under your very capture oligo (within the very oligonucleotide you used to capture the data with) may affect your pull-down.


Disclaimer of the data depth

As this method only probes reads which overlap the SNPs of interest - the data depth is greatly reduced (the more the farther away the SNP of interest is from the closest RE cut site).
Do not expect high quality interaction profile, but rather a plot from which you can "roughly determine" whether the general landscape of interactions was different between the allelic forms.


Selecting (and building) a non-biasing reference genome for SNP-specific runs

You are recommended to use --bowtie2 as a flag in your CCseqBasic SNP-specific run, as it has better tolerance for SNPs in mapping.
However, the duplicate filtering of CCseqBasic struggles a little with bowtie2 multimapped reads, so best practise (at the time of writing this 01Feb2019) is to map with both --bowtie1 and --bowtie2 , and check no artifacts accumulate in bowtie2, and check that SNPs get treated fairly in bowtie1
The very best way would be to generate a N-masked genome (and its associated bowtie1 build), which replaces the SNPs of interest with Ns ,
and update that genome to the /conf/genomeBuildSetup.sh file.
This genome can now be used in bowtie1 mapping without biasing the true SNP distribution over the sites of interest.


Run instructions for --snp run

  1. Run command for --snp run
  2. Capture fragment parameter file for --snp run
  3. Full walk through (and test data set)


(1) Run command

To run a SNP-specific run, run the pipeline normally, for example :

CCseqBasic5.sh \
    -c /t1-data/user/telenius/capturesiteREfragments.txt \
    -s C57captureTest \
    --pf /public/telenius/CAPTUREC_DATA/C57captureTest_run4 \
    --genome mm9 \
    --chunkmb 1012 \
    --R1 /t1-data/user/telenius/R1_001.fastq \
    --R2 /t1-data/user/telenius/R2_001.fastq \
To turn on the SNP-specific mode , add flag
--snp
To yield run command like :
CCseqBasic5.sh \
    -c /t1-data/user/telenius/capturesiteREfragments.txt \
    -s C57captureTest \
    --pf /public/telenius/CAPTUREC_DATA/C57captureTest_run4 \
    --genome mm9 \
    --chunkmb 1012 \
    --R1 /t1-data/user/telenius/R1_001.fastq \
    --R2 /t1-data/user/telenius/R2_001.fastq \
    --snp

(2) Capture fragment parameter file for SNP run

This command will read the SNP locations of interest from the last 2 columns of the capture fragment parameter file :

( these are 1-based coordinates - like f.ex. gff and sam files, not 0-based like f.ex. bed files.
     In 1-based coordinates a single base is defined by for example chr1 1 1 (chr1 region from the 1st base to the 1st base - i.e. the 1st base itself)
     I.e. in 1-based coordinates only ONE coordinate is needed to define a SNP location, instead of both start and end coordinates.
     More about 1-based and 0-based files f.ex. in here
)

I.e. to make the run SNP-specific, we need to add in the 8th and 9th column,
to describe the desired SNP, and the desired allele, like so :

Oct-4	17	35641755	35643135	17	35640755	35644135	35641789	G
Pkdrej  15      85651837        85652458        15      85650837        85653458        85652165        A

Keep all capture sites in the file (also the ones with no SNPs )

Even if you only have SNPs of interest in some of your capture sites, you should include all of your capture sites to your capture fragment parameter file, to enable the run properly filter the ambiguous interactions which are assigned to more than one capture site.
To exclude these "only for filtering" capture sites from the output reports and visualisations, give "dummy coordinates" 1 A as the last columns of the capture fragment parameter file, like so :

TP53	11	69392395	69394473	11	69391395	69395473	1	A

You need one CCseqBasic run per allele of interest (G,A,T,C)

Each SNP-specific file can contain each capture site ONLY ONCE
( if you want to run for multiple alleles of the base of interest, you need to run several runs ).

For example, to probe both major alleles of the above example Oct-4 (G,A) and Pkdrej (A,G), we need 2 runs like so :

Oct-4	17	35641755	35643135	17	35640755	35644135	35641789	G
Pkdrej  15      85651837        85652458        15      85650837        85653458        85652165        A
TP53	11	69392395	69394473	11	69391395	69395473	1	A
and
Oct-4	17	35641755	35643135	17	35640755	35644135	35641789	A
Pkdrej  15      85651837        85652458        15      85650837        85653458        85652165        G
TP53	11	69392395	69394473	11	69391395	69395473	1	A

More about the capture site parameter file

Details about the capture site parameter file, its file format, and editing instructions available here : fragmentfile.html


(3) Full walk through (and test data set)

Walk through 'vignette' using the data set of original CaptureC method publication : walkthrough