Commit 63cd6f1c authored by Yujuan Gui's avatar Yujuan Gui
Browse files

useful

parent 80e3f14b
## 180604 ATAC-seq
## adapted from Aurelien
### Kundaje lab
main protocol is from https://github.com/kundajelab/atac-seq-pipeline
and the ATAC-pipeline as [googledoc](https://docs.google.com/document/d/1f0Cm4vRyDQDu0bMehHD7P7KOMxTOP-HiNoIvL1VcBt8/edit)
### Multi QC
short reads with contamination of adapter Nextera Transposase
### PALEOMIX
Remove adapter and map the reads with BWA
#### makefile
```
module load lang/Python/2.7.14-intel-2018a
paleomix bam_pipeline mkfile > template.yaml
```
- edit the compression from `bz2` to `gz`
` CompressionFormat: gz`
- edit the makefile to specify the correct nextera sequences, `AdapterRemoval` needs to look for
comment out the original TruSeq adapter sequences
```
AdapterRemoval:
# Adapter sequences, set and uncomment to override defaults
# --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
--adapter1: CTGTCTCTTATACACATCT
# --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
--adapter2: CTGTCTCTTATACACATCT
--minlength: 35
--collapse: no
```
- edit the ExcludeReads:
```
Singleton: yes
CollapsedTruncated: yes
```
- edit the behaviour for duplicates, mark and not filter
` PCRDuplicates: mark`
- edit the features, _i.e_ the files that paleomix needs to produce
```
Features:
# Generate BAM without realignment around indels (yes / no)
RawBAM: yes
# Generate indel-realigned BAM using the GATK Indel realigner (yes / no)
RealignedBAM: no
# To disable mapDamage, write 'no'; to generate basic mapDamage plots,
# write 'plot'; to build post-mortem damage models, write 'model',
# and to produce rescaled BAMs, write 'rescale'. The 'model' option
# includes the 'plot' output, and the 'rescale' option includes both
# 'plot' and 'model' results. All analyses are carried out per library.
mapDamage: no
# Generate coverage information for the raw BAM (wo/ indel realignment).
# If one or more 'RegionsOfInterest' have been specified for a prefix,
# additional coverage files are generated for each alignment (yes / no)
Coverage: yes
# Generate histogram of number of sites with a given read-depth, from 0
# to 200. If one or more 'RegionsOfInterest' have been specified for a
# prefix, additional histograms are generated for each alignment (yes / no)
Depths: no
# Generate summary table for each target (yes / no)
Summary: yes
# Generate histogram of PCR duplicates, for use with PreSeq (yes / no)
DuplicateHist: no
```
we want only RawBaM, Coverage and Summary files.
- edit the References (Prefixes section), here both the nuclear and mito genome
```
Prefixes:
GRCh38.p3:
Path: /work/projects/daneurogen/genome/mouseGenomemm10/GRCm38.p3.mt.fasta
```
- edit for the samples using R (proviving the HPC disk is mounted on your machine)
Refer to paleomix_makefile.Rmd
- add samples to main yaml
`cat samples.yaml >> template.yaml`
- validate the makefile with
` paleomix bam_pipeline run --dry-run 180401.yaml`
if fine, then proceed
#### paleomix pipeline
launcher bash looks like:
- edit: job name `-J`
- edit: the `time` here **24* hours
- edit: `mail` recipient
- edit the makefile name, here `180604.yaml`
```
#!/bin/bash -l
#SBATCH -J paleomix
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=yujuan.gui@uni.lu
#SBATCH -N 1
#SBATCH -c 28
#SBATCH --ntasks-per-node=1
#SBATCH --time=48:00:00
#SBATCH -p batch
#SBATCH --qos=qos-batch
echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
module load lang/Java
module load tools/bzip2
# prevent sub-spawn, 28 cores -> 28 processes
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
srun paleomix bam_pipeline run --bwa-max-threads=4 \
--adapterremoval-max-threads=4 --max-threads=28 180604.yaml
```
- move truncated reads in a dedicated folder, renaming them (assuming first name is the id)
# awk: text processing
```
mkdir -p truncated
# create a dummy try first!!
find . -name "*truncated*" | while read -r file
do id=$(cut -d'/' -f2 <<<$file)
idpath=$(cut -d'/' -f1 <<<$file)
mv $file ${idpath}/${id}.truncated.gz
done
```
- QC `module load bio/FastQC; parallel -j 8 "fastqc {}" ::: truncated/*gz` => don't work under parallel mode, have to do it one by one
- filter dup and poor mapping qual + keep properly paired
`parallel -j 4 "samtools view -b -q30 -f 2 -F 1024 {} > {.}.nodup.q30.bam" ::: *p3.bam`
### rm mt
`sambamba view -f bam -F "ref_name!='mouseMT'" [input.bam] > [output.bam]`
####################################################
bulky version to remove the mt
sort by coordinate
`parallel "samtools sort {} -o ./test/{.}.sort.q30.bam" ::: *.nodup.q30.bam` => wildcard for names, check
index
`parallel "samtools index {} " ::: *..nodup.sort.q30.bam`
subset chromosome
`samtools idxstats MB17-C0031_2-5k.mm10.p3.nodup.sort.q30.bam | cut -f 1 | grep -v MT | xargs samtools view -b MB17-C0031_2-5k.mm10.p3.nodup.sort.q30.bam > MB17-C0031_2-5k.mm10.p3.nodup.noMT.sort.q30.bam`
sort by name
#########################################################
### processing BAM
- sort by names using `sambamba`
`srun -p interactive --cpu-bind=none --ntasks-per-node=1 --mem=32GB -c 8 --time=4:0:0 --pty bash -i`
`parallel "srun sambamba sort -n -o {.}.names.bam -m 32GB -t 4 --tmpdir /dev/shm {}" ::: *p3.nodup.q30.bam`
by default, give 1 node, 1 core, 1 thread)
### length distribution
grep -v: select non-matching lines
^: regular expression. When "^" appears at the beginning of a regular expression or after a new line, it represents the beginning of a string
`parallel "sambamba view {} | cut -f 9 | grep -v '^-' | gzip > {.}.lg.gz" ::: [input.bam]`
Then go to R to plot lg_distribution.Rmd
#### convert BAM to BEDPE then TAGfile
BEDPE: describe disjoint genome features, such as structural variations or paired-end sequence alignments.
reason: 1. bed12 format does not allow inter-chromosomal feature definition; 2. bed12 only has one strand field insufficient for paired-end sequencing
tagalign:bed3+3 https://genome.ucsc.edu/FAQ/FAQformat.html#format15
zcat: uncompress files to standard output
awk: awkward. data extraction and reporting tool
gzip: compress or uncompres files. -n: do not save or restore the original name and time stamp
-c: write on standard output, keep original files unchanged
```
parallel "bedtools bamtobed -bedpe -mate1 -i {} | gzip > {.}.bedpe.gz" ::: [*.input.bam]
for f in *.bedpe.gz
do zcat ${f} | \
awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\tN\t1000\t%s\n%s\t%s\t%s\tN\t1000\t%s\n",$1,$2,$3,$9,$4,$5,$6,$10}' | \
gzip -nc > ${f/bedpe/tagAlign}
done
```
#### subsample for cross-correlation plots
shuf: write a random permutation of the input lines to standard output
```
NREADS=25000000
for f in *.bedpe.gz
do zcat ${f} | shuf -n ${NREADS} --random-source=${f/bedpe/tagAlign} | \
awk 'BEGIN{OFS="\t"}{print $1,$2,$3,"N","1000",$9}' | gzip -nc > ${f/bedpe/subspl.tagAlign}
done
```
############################################################
NSC > 2, RSC < 1
`https://github.com/kundajelab/phantompeakqualtools` => not working either
does not work. so using [the singularity image](https://github.com/ENCODE-DCC/chip-seq-pipeline2/blob/master/docs/tutorial_local_singularity.md)
```
module load tools/Singularity
singularity shell -s /bin/bash --bind /work/projects/lsru/atac-seq/:/180604 /scratch/users/aginolhac/chip-seq-pipeline-v1.1.simg
#
# --bind: [where is the data]:[where I would like the container to run]
module load lang/R
for f in *.subspl.tagAlign.gz
do Rscript ~/program/phantompeakqualtools/run_spp.R -c=${f} -p=8 \
-savp=${f/tagAlign.gz/tagAlign.CCplot.pdf} -out=${f/tagAlign.gz/tagAlign.CCscores.txt}
done
```
interpret using those guidelines
https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part0_quality_control.md
> NSC: normalized stand coefficient
> RSC: relative strand correlation
> NSC values range from a minimum of 1 to larger positive numbers. 1.1 is the critical threshold. Datasets with NSC values much less than 1.1 (< 1.05) tend to have low signal to noise or few peaks (this could be biological eg.a factor that truly binds only a few sites in a particular tissue type OR it could be due to poor quality). RSC values range from 0 to larger positive values. 1 is the critical threshold. RSC values significantly lower than 1 (< 0.8) tend to have low signal to noise. The low scores can be due to failed and poor quality ChIP, low read sequence quality and hence lots of mismappings, shallow sequencing depth (significantly below saturation) or a combination of these. Like the NSC, datasets with few binding sites (< 200) which is biologically justifiable also show low RSC scores.
------------------------------------------------
col. abbreviation description
----- ----------------- -------------------------
1 Filename tagAlign/BAM filename
2 numReads effective sequencing depth i.e. total number of mapped reads in input file
3 estFragLen comma separated strand cross-correlation peak(s) in decreasing order of correlation.
4 corr_estFragLen comma separated strand cross-correlation value(s) in decreasing order (COL2 follows the same order)
5 phantomPeak Read length/phantom peak strand shift
6 corr_phantomPeak Correlation value at phantom peak
7 argmin_corr strand shift at which cross-correlation is lowest
8 min_corr minimum value of cross-correlation
9 NSC Normalized strand cross-correlation coefficient (NSC) = COL4 / COL8
10 RSC Relative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8)
11 QualityTag Quality tag based on thresholded RSC (codes= -2:veryLow, -1:Low, 0:Medium, 1:High, 2:veryHigh)
------------------------------------------------
####################################################################
#### create pseudo-replicates
```
for FINAL_TA_FILE in *names.tagAlign.gz
do zcat ${FINAL_TA_FILE} | sed 'N;s/\n/\t/' | gzip -nc > ${FINAL_TA_FILE}.joined
PR_PREFIX="${FINAL_TA_FILE}.filt.nodup"
# Get total number of read pairs
nlines=$( zcat ${FINAL_TA_FILE}.joined | wc -l )
nlines=$(( (nlines + 1) / 2 ))
# Shuffle and split BEDPE file into 2 equal parts
zcat -f ${FINAL_TA_FILE}.joined | shuf --random-source=${FINAL_TA_FILE} | \
split -d -l ${nlines} - ${PR_PREFIX} # Will produce ${PR_PREFIX}00 and ${PR_PREFIX}01
# Convert fake BEDPE to reads into standard tagAlign file
awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\t%s\t%s\t%s\n%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' "${PR_PREFIX}00" | \
gzip -nc > ${FINAL_TA_FILE/tagAlign.gz/PE2SE.pr1.tagAlign.gz}
\rm "${PR_PREFIX}00"
awk 'BEGIN{OFS="\t"}{printf "%s\t%s\t%s\t%s\t%s\t%s\n%s\t%s\t%s\t%s\t%s\t%s\n",$1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' "${PR_PREFIX}01" | \
gzip -nc > ${FINAL_TA_FILE/tagAlign.gz/PE2SE.pr2.tagAlign.gz}
\rm "${PR_PREFIX}01"
\rm ${FINAL_TA_FILE}.joined
done
```
- shift transposase motif: `shifted_tag = "$prefix.tn5.tagAlign.gz"` => what's this?
```
for tag in *pr{1,2}.tagAlign.gz
do shifted_tag=${tag/tagAlign.gz/shifted.tagAlign.gz}
zcat $tag | awk -F $'\t' 'BEGIN {OFS = FS}{ if ($6 == "+") {$2 = $2 + 4} else if ($6 == "-") {$3 = $3 - 5} print $0}' | \
gzip -nc > $shifted_tag
done
```
- create pooled BED file
```
for f in *p3.nodup.noMT.q30.names.PE2SE.pr1.shifted.tagAlign.gz
do echo ${f/pr1/pool} ; cat $f ${f/pr1/pr2} > ${f/pr1/pool}
done
```
### peak calling with MACS2
- calling peaks on BED, shifted tagAlign pseudoreplicates
#### using BED and using fixed shift and extension sizes
```
module load lang/Python/2.7.14-intel-2018a
parallel -j 8 "macs2 callpeak -t {} --SPMR -f BED --keep-dup all --call-summits --nomodel --shift 37 --extsize 73\
-g hs -n {.}nomodelPE -B -q 0.01 --outdir {.}nomodelPE" ::: *.mm10.p3.nodup.noMT.q30.names.PE2SE.p*.shifted.tagAlign.gz
```
```
for f in *pool.shifted.tagAlignnomodelPE/*pileup.bdg
do echo "macs2 bdgcmp -t $f -c ${f/treat_pileup/control_lambda} --o-prefix ${f/treat_pileup/nonoise} -m FE"; done | parallel -j 4
# not run, as bigwig are just for visualization
# slopBed -i "$prefix"_FE.bdg -g "$chrsz" -b 0 | bedClip stdin "$chrsz" $fc_bedgraph
```
- sort bedgraph before converting to bigwig
```
find *pool.shifted.tagAlignnomodelPE -name '*nonoise*.bdg' | parallel -j 4 "sort -k1,1 -k2,2n {} > {.}.sort.bdg"
```
`bedGraphToBigWig` not working on iris, so use the singularity image
```
module load tools/Singularity
singularity shell -s /bin/bash --bind /work/projects/daneurogen/atac_201901/data/02_paleomix/ /scratch/users/aginolhac/ubuntu-chip-seq.simg
# delete the chr in chromosome name
cd /work/projects/daneurogen/atac_201901/data/02_paleomix
find *pool.shifted.tagAlignnomodelPE -name '*sort.bdg' | parallel -j 2 "bedGraphToBigWig {} \
./GRCm38.p5.refchrom.sizes {.}.bigwig"
```
#### IDR for pseudo-replicates
```
module load lang/Python/3.6.4-intel-2018a
cd ~/install
wget https://github.com/nboley/idr/archive/2.0.3.zip
python3 idr-2.0.3/setup.py install --user
```
- run IDR
`sed -i 's/-1/0/'` on all narrowPeak because of https://github.com/nboley/idr/issues/27
```
for f in *nomodelPE/*narrowPeak
do echo $f; sed -i 's/-1/0/' $f
done
```
```
module load lang/Python/3.6.4-intel-2018a
for sample in MB17-C0031_2-5k MB17-C0031_25k
do echo $sample
REP1_PEAK_FILE=`ls ${sample}*nomodelPE/*pr1.shifted.tagAlignnomodelPE_peaks.narrowPeak`
REP2_PEAK_FILE=${REP1_PEAK_FILE//pr1/pr2}
POOLED_PEAK_FILE=${REP1_PEAK_FILE//pr1/pool}
idr --samples ${REP1_PEAK_FILE} ${REP2_PEAK_FILE} --peak-list ${POOLED_PEAK_FILE} \
--input-file-type narrowPeak --output-file ${sample}.idrout.narrowPeak --rank p.value \
--soft-idr-threshold 0.05 --plot --use-best-multisummit-IDR
done
```
MB17-C0031_2-5k
Initial parameter values: [0.10 1.00 0.20 0.50]
Final parameter values: [1.61 0.76 0.85 0.31]
Number of reported peaks - 65509/65509 (100.0%)
Number of peaks passing IDR cutoff of 0.05 - 28180/65509 (43.0%)
MB17-C0031_25k
Initial parameter values: [0.10 1.00 0.20 0.50]
Final parameter values: [1.71 0.72 0.88 0.33]
Number of reported peaks - 197098/197098 (100.0%)
Number of peaks passing IDR cutoff of 0.05 - 99854/197098 (50.7%)
- filter poor mappability regions
```
wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz
```
```
module load bio/BEDtools
for idr in *idrout.narrowPeak
do sed 's/^/chr/' $idr | bedtools intersect -v -a stdin \
-b /work/projects/daneurogen/genome/mouseGenomemm10/mm10.blacklist.bed.gz > ${idr/.narrowPeak/.lowmp.narrowPeak}
done
```
############################################################################################
### HINT-ATAC
install as virtualenv with python2, as
```
pip install cython numpy scipy
pip install RGT
```
```
ln -s ~/lsru/atac-seq/180604/references/GRCh38.p1.fasta /home/users/aginolhac/rgtdata/hg38/genome_hg38.fa
sed 's/^chr//' T4-6mut50k.idrout.lowmp.narrowPeak > T4-6mut50k.idrout.lowmp.nochr.narrowPeak
module load lang/Python/2.7.14-intel-2018a
rgt-hint footprinting --atac-seq --organism hg38 --paired-end --output-prefix=T4-50k-hint\
T4-6mut50k.GRCh38.p1.q30.nodup.nomt.0.25.bam T4-6mut50k.idrout.lowmp.nochr.narrowPeak
```
needs to keep best SignalValue per interval, see R code below
### IDR
chrom string
Name of the chromosome for common peaks
chromStart int
The starting position of the feature in the chromosome or scaffold for common peaks, shifted based on offset. The first base in a chromosome is numbered 0.
chromEnd int
The ending position of the feature in the chromosome or scaffold for common peaks. The chromEnd base is not included in the display of the feature.
name string
Name given to a region (preferably unique) for common peaks. Use '.' if no name is assigned.
score int
Contains the scaled IDR value, min(int(log2(-125IDR), 1000). e.g. peaks with an IDR of 0 have a score of 1000, idr 0.05 have a score of int(-125log2(0.05)) = 540, and idr 1.0 has a score of 0.
strand [+-.] Use '.' if no strand is assigned.
signalValue float
Measurement of enrichment for the region for merged peaks. When a peak list is provided this is the value from the peak list.
p-value float
Merged peak p-value. When a peak list is provided this is the value from the peak list.
q-value float
Merged peak q-value. When a peak list is provided this is the value from the peak list.
summit int
Merged peak summit
localIDR float -log10(Local IDR value)
globalIDR float -log10(Global IDR value)
rep1_chromStart int
The starting position of the feature in the chromosome or scaffold for common replicate 1 peaks, shifted based on offset. The first base in a chromosome is numbered 0.
rep1_chromEnd int
The ending position of the feature in the chromosome or scaffold for common replicate 1 peaks. The chromEnd base is not included in the display of the feature.
rep1_signalValue float
Signal measure from replicate 1. Note that this is determined by the --rank option. e.g. if --rank is set to signal.value, this corresponds to the 7th column of the narrowPeak, whereas if it is set to p.value it corresponds to the 8th column.
rep1_summit int
The summit of this peak in replicate 1.
```{r}
col_idr <- c("chr", "start", "end", "name", "score", "strand",
"signalValue", "pvalue", "qvalue", "summit",
"local_idr", "global_idr", "rep1_start", "rep1_end",
"rep1_signalValue", "rep1_summit", "rep2_start", "rep2_end",
"rep2_signalValue", "rep2_summit")
type_idr <- "ciicicdddiddiidiiidi"
td4_idr <- read_tsv(file.path(neuro, "T4-6mut50k.idrout.lowmp.nochr.narrowPeak"),
col_names = col_idr, col_types = type_idr)
td4_idr %>%
group_by_at(vars(chr:end)) %>%
add_count() %>% filter(n > 1)
#td4_idr %>% filter(chr == 1, start == 248837883, end == 248839182) %>% View
td4_idr %>%
group_by_at(vars(chr:end)) %>%
filter(rank(desc(signalValue), ties.method = "first") == 1) %>%
# filter score below 1% IDR (-125 * log2(0.01) from https://github.com/nboley/idr)
filter(score >= 830) %>%
write_tsv(file.path(neuro, "T4-6mut50k.idrout.lowmp.nochr.topsignal.narrowPeak"))
```
Running HINT-ATAC with original BAMS and all idrout.lowmp.nochr.topsignal.narrowPeak files
- HFFTH-POS
- T4-6mut50k
## 180604 ATAC-seq
Enquirer: Jochen Ohnmacht
Date: 20180810
### Kundaje lab
main protocol is from
https://github.com/kundajelab/atac-seq-pipeline
and the ATAC-pipeline as [googledoc](https://docs.google.com/document/d/1f0Cm4vRyDQDu0bMehHD7P7KOMxTOP-HiNoIvL1VcBt8/edit)
### FASTQ
short reads with contamination of adapter Nextera Transposase
### using PALEOMIX
#### makefile
- generate a makefile template (after installing paleomix and its dependencies)
```
module load lang/Python/2.7.14-intel-2018a
paleomix bam_pipeline mkfile > template.yaml
```
- edit the compression from `bz2` to `gz`
` CompressionFormat: gz`
- edit the makefile to specify the correct nextera sequences, `AdapterRemoval` needs to look for
comment out the original TruSeq adapter sequences
```
AdapterRemoval:
# Adapter sequences, set and uncomment to override defaults
# --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
--adapter1: CTGTCTCTTATACACATCT
# --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
--adapter2: CTGTCTCTTATACACATCT
--minlength: 35
--collapse: no
```
- edit the ExcludeReads:
```
Singleton: yes
CollapsedTruncated: yes
```
- edit the behaviour for duplicates, mark and not filter
` PCRDuplicates: mark`
- edit the features, _i.e_ the files that paleomix needs to produce
```
Features:
# Generate BAM without realignment around indels (yes / no)
RawBAM: yes
# Generate indel-realigned BAM using the GATK Indel realigner (yes / no)
RealignedBAM: no
# To disable mapDamage, write 'no'; to generate basic mapDamage plots,
# write 'plot'; to build post-mortem damage models, write 'model',
# and to produce rescaled BAMs, write 'rescale'. The 'model' option
# includes the 'plot' output, and the 'rescale' option includes both
# 'plot' and 'model' results. All analyses are carried out per library.
mapDamage: no
# Generate coverage information for the raw BAM (wo/ indel realignment).
# If one or more 'RegionsOfInterest' have been specified for a prefix,
# additional coverage files are generated for each alignment (yes / no)
Coverage: yes
# Generate histogram of number of sites with a given read-depth, from 0
# to 200. If one or more 'RegionsOfInterest' have been specified for a
# prefix, additional histograms are generated for each alignment (yes / no)
Depths: no
# Generate summary table for each target (yes / no)
Summary: yes
# Generate histogram of PCR duplicates, for use with PreSeq (yes / no)
DuplicateHist: no
```
we want only RawBaM, Coverage and Summary files.
- edit the References (Prefixes section), here both the nuclear and mito genome
```
Prefixes:
GRCh38.p1:
Path: references/GRCh38.p1.fasta