Throughout the pipeline, several temporary files are generated. Some are used only for settings and transitions, others are needed to continue to the next step, and the rest are for publishing and interpreting a biological story.
Note
a better way to organize your ChIP-seq project
Format | Type | Instruction |
---|---|---|
FASTQ | text | single end |
FASTQ.gz | gz | single end |
FASTQ | text | paired end |
FASTQ.gz | gz | paired end |
The command for the demo data is as follows:
chilin simple -p narrow -t foxa1_t1.fastq -c foxa1_c1.fastq -i local -o local -s hg19 --skip 10,12 --dont_remove
See skip option for details.
This is the main and easiest mode for running ChiLin on single-end data with the default bwa mapper. For single-end data, use commas to separate the sample replicates of the IP and input ChIP-seq samples:
chilin simple -u your_name -s your_species --threads 8 -i id -o output -t treat1.fastq,treat2.fastq -c control1.fastq,control2.fastq -p narrow -r tf
For paired-end data, use semicolons to separate sample replicates and commas to separate the two files of a pair. Do not forget to put quotes (") around your sample file paths:
chilin simple --threads 8 -i H3K27me3_PairEnd -o H3K27me3_PairEnd -u you -s mm9 -t "GSM905438.fastq_R1.gz,GSM905438.fastq_R2.gz" -c "GSM905434.fastq_R1.gz,GSM905434.fastq_R2.gz;GSM905436.fastq_R1.gz,GSM905436.fastq_R2.gz" -p both --pe
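The separator rules above can be sketched in Python. This is only an illustration and not part of ChiLin; the helper names `single_end_arg` and `paired_end_arg` are hypothetical.

```python
import shlex


def single_end_arg(replicates):
    """Join single-end replicate files with commas."""
    return ",".join(replicates)


def paired_end_arg(replicate_pairs):
    """Join the two files of each pair with a comma, and replicates with semicolons."""
    return ";".join(",".join(pair) for pair in replicate_pairs)


treat = single_end_arg(["treat1.fastq", "treat2.fastq"])
control = paired_end_arg([
    ("GSM905434.fastq_R1.gz", "GSM905434.fastq_R2.gz"),
    ("GSM905436.fastq_R1.gz", "GSM905436.fastq_R2.gz"),
])

# shlex.quote adds the quoting a POSIX shell needs, because an unquoted ';'
# would otherwise end the command.
print("-t", shlex.quote(treat))
print("-c", shlex.quote(control))
```

Quoting matters only for the paired-end value here, since the semicolon is a shell command separator.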
For more options of simple mode, run:
chilin simple -h
This mode generates a config file for run mode:
chilin gen -o test.conf
The [basics] section of a config file looks like this:
[basics]
user = anonymous
id = local
time = 2014-05-09
species = hg19
factor = tf
treat = foxa1_t1.fastq,foxa1_t1.fastq
cont = foxa1_c1.fastq,foxa1_c1.fastq
output = output_directory
version = 2.0.0
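As a rough sketch of how such a section can be inspected before a run, the snippet below reads it with Python's standard `configparser`. It assumes the ChiLin config file is plain INI that `configparser` accepts, and the helper `read_basics` is hypothetical; it only handles the comma-separated single-end layout.

```python
import configparser

CONFIG_TEXT = """
[basics]
user = anonymous
id = local
time = 2014-05-09
species = hg19
factor = tf
treat = foxa1_t1.fastq,foxa1_t1.fastq
cont = foxa1_c1.fastq,foxa1_c1.fastq
output = output_directory
version = 2.0.0
"""


def read_basics(text):
    """Parse the [basics] section and split treat/cont into replicate lists."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    basics = dict(parser["basics"])
    for key in ("treat", "cont"):
        # Single-end replicates are separated by commas.
        basics[key] = [p.strip() for p in basics[key].split(",") if p.strip()]
    return basics


basics = read_basics(CONFIG_TEXT)
print(basics["id"], basics["species"], basics["treat"])
```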
After filling in the config file above, you can use run mode:
chilin gen -o my_config
## modify tool parameters and run
chilin run -c my_config
Batch mode helps users run datasets one by one in a single process.
After filling in a batch of config files as above, e.g. 1.conf, 2.conf and 3.conf, list them in a file called batch.conf:
1.conf
2.conf
3.conf
You can then run batch mode with a single command:
chilin batch -b batch.conf
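The batch file is simply one config path per line, so it can be generated by a small script. The following is a minimal sketch; `write_batch_file` is a hypothetical helper, and a temporary directory is used only to keep the example self-contained.

```python
import tempfile
from pathlib import Path


def write_batch_file(config_paths, batch_path):
    """Write one config file path per line, as batch mode expects."""
    lines = [str(path) for path in config_paths]
    Path(batch_path).write_text("\n".join(lines) + "\n")
    return Path(batch_path)


with tempfile.TemporaryDirectory() as workdir:
    batch_file = write_batch_file(
        ["1.conf", "2.conf", "3.conf"], Path(workdir) / "batch.conf"
    )
    batch_text = batch_file.read_text()

print(batch_text, end="")
```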
Common options can be used in simple, run and batch modes. Step control is tolerant: the pipeline keeps running even if a tool fails.
--skip, step control, e.g.:
chilin simple -s hg19 -i id -p narrow -o output -u user --skip 1,3,5,9,10,11 -t treat1.fastq,treat2.fastq
--dont_resume: by default, each re-run uses the previous temporary files to resume from the step where it crashed. When --dont_resume is on, ChiLin starts from the first step, so the user does not need to clean up the work directory.
--dont_remove, keep temporary files.
--dry-run, mimic running the chilin command.
--threads, multithreading option for BWA, Bowtie and FastQC.
--mapper, choose the mapping tool; it should match the genome index given in genome_index.
[basics]
Lists all the metadata of the current workflow. It consists of the following options:
user
User name.
time
Time you start the run.
species
The name of the species, written to the QC report and log.
Limit: a string that (1) consists of numbers, letters or '_' and (2) is shorter than 20 characters.
id
Used as the output prefix. For example, with id test, the output file would be test_treat.bam.
factor
The name of the factor, written to the DC summary, QC report and log.
Limit: a string that (1) comes from a GO standard term.
treat
The paths of the treatment files.
Limit: absolute path of files in supported formats.
cont
The paths of the control files.
Limit: absolute or relative path of files in supported formats.
output
The path of the output directory.
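The limit stated for the species name (numbers, letters or '_', shorter than 20 characters) can be checked before a run. This is a sketch under two assumptions: "letters" means ASCII letters, and an empty name is rejected. The function name `is_valid_species` is hypothetical.

```python
import re

# 1 to 19 characters drawn from ASCII letters, digits and underscore.
SPECIES_PATTERN = re.compile(r"[A-Za-z0-9_]{1,19}")


def is_valid_species(name):
    """Return True if the name satisfies the documented species limit."""
    return SPECIES_PATTERN.fullmatch(name) is not None


for candidate in ("hg19", "mm9", "hg-19", "x" * 20):
    print(candidate, is_valid_species(candidate))
```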
The tool section looks like this:
#ChiLin is dependent on several tools, please specify the absolute path to
#these tools--ALL FIELDS ARE REQUIRED
#put bedClip, bedGraphToBigWig, bowtie, star, bwa, fastqc, bedtools, macs2, samtools, seqtk, wigCorrelate
#in executable PATH
#other system tool includes convert, pdflatex, R, python2.7
[tool]
mdseqpos =
macs2 =
You can add as many species as you need. To add a species, first read the dependent data section to prepare the files listed below. Then fill in the species section of the config file following this pattern, e.g. for the hg19 assembly:
[hg19]
genome_index =
# fasta file separated by chromosome, such as chr1.fa
genome_dir =
chrom_len =
dhs =
## blacklist region
velcro =
conservation =
geneTable =
And for the mm9 assembly:
[mm9]
genome_index =
# fasta file separated by chromosome, such as chr1.fa
genome_dir =
chrom_len =
dhs =
## blacklist region
velcro =
conservation =
geneTable =
And for the hg38 assembly:
[hg38]
genome_index =
# fasta file separated by chromosome, such as chr1.fa
genome_dir =
chrom_len =
dhs =
## blacklist region
velcro =
conservation =
geneTable =
[species]
The specific species assembly version you want to analyze. It consists of the following options:
genome_index
Absolute path to the genome index of the corresponding mapper. If you use the default bwa, this should be a bwa index.
genome_dir
Absolute path to the genome fasta files, separated by chromosome, like chr1.fa, chr2.fa, chr3.fa ...
chrom_len
Absolute path to the chromosome length text file.
dhs
Absolute path to the union DHS regions.
velcro
Absolute path to the blacklist regions.
conservation
Absolute path to the directory containing the UCSC PhastCons score bigwig files.
geneTable
Standard refSeq annotation table from the UCSC table browser.
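Since most options of a species section are documented as absolute paths, a quick check can catch empty or relative entries early. The sketch below assumes the config is INI that Python's `configparser` can read; the function `non_absolute_options` and all paths in `EXAMPLE` are hypothetical placeholders.

```python
import configparser
from pathlib import PurePosixPath

# Options that the documentation describes as absolute paths.
PATH_KEYS = ("genome_index", "genome_dir", "chrom_len", "dhs", "velcro", "conservation")

# Hypothetical paths, for illustration only.
EXAMPLE = """
[hg19]
genome_index = /data/hg19/bwa/hg19.fa
genome_dir = /data/hg19/fasta
chrom_len = chromInfo_hg19.txt
dhs =
velcro = /data/hg19/blacklist.bed
conservation = /data/hg19/phastcons
geneTable = /data/hg19/refGene.txt
"""


def non_absolute_options(config_text, section):
    """Return the path options of a species section that are empty or relative."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names such as geneTable case-sensitive
    parser.read_string(config_text)
    options = parser[section]
    return [
        key for key in PATH_KEYS
        if not PurePosixPath(options.get(key, "")).is_absolute()
    ]


problems = non_absolute_options(EXAMPLE, "hg19")
print(problems)
```

The check is purely syntactic: it does not verify that the files exist or that the index matches the chosen mapper.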
You can add any species that you suspect of being involved in a sample swap or library contamination.
#------------------------------------------------------------------------------
# Contamination
#------------------------------------------------------------------------------
#OPTIONAL- our contamination module can screen for any species defined below
#specify the species name and the path to the bwa index as follows: e.g.
#ECOLI = /some/path/ecoli
[contamination]
mycoplasma = mycoplasma
# ecoli =
# yeast =
[contamination]
The specific species assembly path that you want to screen.
ChiLin has some user-defined parameters for macs2, regulatory potential, conservation score and motif analysis.
#the only thing you can affect here is the number of threads used.
[macs2]
#refer to the macs2 help message to find out what these mean, species for effective genome size
extsize = 146
# effective genome sizes, support hs, mm, other species, please refer to chromInfo
species =
type = both
fdr = 0.01
keep_dup = 1
[reg]
## regulatory potential score prediction top peaks
peaks = 10000
dist = 100000
[conservation]
## for tf/dnase we suggest 400bp width around summit, for histone 4000
type = tf
peaks = 5000
width = 400
[seqpos]
peaks = 5000
mdscan_width = 200
mdscan_top_peaks = 200
seqpos_mdscan_top_peaks_refine = 500
width = 600
pvalue_cutoff = 0.001
db = cistrome.xml
[macs2]
macs2 parameters.
extsize
Fixed extension size for macs2 peak calling.
type [for macs2]
Peak calling type. The user can choose narrow, broad or both. We suggest narrow for TFs and active histone marks, broad for broad histone marks, and both for chromatin regulators.
fdr
FDR cutoff for macs2 peak calling.
keep_dup
Duplicate level. We suggest 1 to remove redundancy, or all to preserve all redundancy for DNase-seq.
[reg]
Regulatory potential score parameters.
[conservation]
Conservation score parameters.
[seqpos]
Motif analysis parameters.
peaks [for seqpos]
Number of top peaks for seqpos (search in the motif database).
mdscan_width
Motif scan window width around the peak summit.
mdscan_top_peaks
Number of top peaks for the de novo motif scan.
width
seqpos width.
seqpos_mdscan_top_peaks_refine
Number of top peaks used to refine seqpos and mdscan, see mdseqpos.
db
The mdseqpos motif database, default cistrome.xml.
pvalue_cutoff
Cutoff for motif analysis.
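The [macs2] options have simple types and a small set of allowed values, which a short script can check. This sketch assumes INI syntax readable by Python's `configparser`; `read_macs2` is a hypothetical helper, and the accepted values are only those named in the descriptions above.

```python
import configparser

MACS2_TEXT = """
[macs2]
extsize = 146
species =
type = both
fdr = 0.01
keep_dup = 1
"""


def read_macs2(text):
    """Parse the [macs2] section into typed values and check the peak type."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["macs2"]

    peak_type = section.get("type")
    if peak_type not in ("narrow", "broad", "both"):
        raise ValueError(f"unsupported peak calling type: {peak_type!r}")

    # keep_dup is either the word "all" or an integer duplicate level.
    keep_dup = section.get("keep_dup")
    if keep_dup != "all":
        keep_dup = int(keep_dup)

    return {
        "extsize": section.getint("extsize"),
        "type": peak_type,
        "fdr": section.getfloat("fdr"),
        "keep_dup": keep_dup,
    }


macs2_params = read_macs2(MACS2_TEXT)
print(macs2_params)
```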
The output prefix comes from the id option of the [basics] section, and the output directory from the output option of the [basics] section.
Without the --dont_remove option, the work directory is cleaned up:
id
|-- attic
| |-- json
| | |-- id_conserv.json
| | |-- id_contam.json
| | |-- id_dhs.json
| | |-- id_enrich_meta.json
| | |-- id_fastqc.json
| | |-- id_frag.json
| | |-- id_frip.json
| | |-- id_macs2.json
| | |-- id_macs2_rep.json
| | |-- id_map.json
| | |-- id_meta.json
| | |-- id_pbc.json
| | `-- id_rep.json
| |-- id_conserv.pdf
| |-- id_control.bam
| |-- id_control_rep1.bam
| |-- id_control_rep2.bam
| |-- id_gene_score.txt
| |-- id_treat_rep1.bam
| |-- id_treat_rep2.bam
| `-- id_treatment.bam
|-- id_report.pdf
|-- id_control.bw
|-- id_peaks.xls
|-- id_sort_peaks.narrowPeak
|-- id_sort_summits.bed
|-- id_treat.bw
|-- id_treat_rep1_control.bw
|-- id_treat_rep1_peaks.xls
|-- id_treat_rep1_sort_peaks.narrowPeak
|-- id_treat_rep1_treat.bw
|-- id_treat_rep2_control.bw
|-- id_treat_rep2_peaks.xls
|-- id_treat_rep2_sort_peaks.narrowPeak
`-- id_treat_rep2_treat.bw
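The top-level file names in the listing above follow a regular pattern built from the id prefix and the replicate number. The sketch below reproduces that pattern so a script can check whether a run finished. It assumes narrow peak calling with a control, as in the listing; `final_outputs` is a hypothetical helper.

```python
def final_outputs(prefix, n_treat_reps):
    """List the top-level result files expected after cleanup."""
    names = [
        f"{prefix}_report.pdf",
        f"{prefix}_control.bw",
        f"{prefix}_peaks.xls",
        f"{prefix}_sort_peaks.narrowPeak",
        f"{prefix}_sort_summits.bed",
        f"{prefix}_treat.bw",
    ]
    # Each treatment replicate contributes its own peaks and bigwig files.
    for rep in range(1, n_treat_reps + 1):
        stem = f"{prefix}_treat_rep{rep}"
        names += [
            f"{stem}_control.bw",
            f"{stem}_peaks.xls",
            f"{stem}_sort_peaks.narrowPeak",
            f"{stem}_treat.bw",
        ]
    return names


expected = final_outputs("id", 2)
print(len(expected))
print("\n".join(expected))
```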
With the --dont_remove option, the temporary files are kept:
output
|-- json ## qc statistics
| |-- id_conserv.json ## conservation scores
| |-- id_contam.json ## library contamination evaluation
| |-- id_dhs.json ## union dhs overlap
| |-- id_enrich_meta.json ## meta regions reads ratio
| |-- id_fastqc.json ## fastqc evaluation
| |-- id_frag.json ## fragment size evaluation
| |-- id_frip.json ## FRiP scores
| |-- id_macs2.json ## merged macs2 peak calling number
| |-- id_macs2_rep.json ## macs2 replicates peaks number
| |-- id_map.json ## mapping ratio statistics
| |-- id_meta.json ## peak meta regions distribution
| |-- id_pbc.json ## PBC score
| `-- id_rep.json ## replicates consistency
|-- latex ## rendered latex document
| |-- id_begin.tex
| |-- id_conserv.tex
| |-- id_contam.tex
| |-- id_end.tex
| |-- id_fastqc.tex
| |-- id_fastqc_gc.tex
| |-- id_frip.tex
| |-- id_map.tex
| `-- id_summary_table.tex
|-- id.aux ## latex log file
|-- id.cor ## correlation analysis temporary file
|-- id.dhs ## dhs overlap analysis temporary file
|-- id.log ## latex log file
|-- id.meta ## meta regions peak distribution temporary file
|-- id.out ## latex log file
|-- id_report.pdf ## pdf document generated
|-- id.tex ## file latex file
|-- id_0_1.overlap ## replicates peak overlap
|-- id_bwa_compare.R ## R script for comparing new data to historic data
|-- id_bwa_compare.pdf ## pdf generated by R script above
|-- id_conserv.R ## conservation plot R code
|-- id_conserv.pdf ## pdf generated by R script above
|-- id_conserv.txt ## 7 or 5 point conservation scores around summits
|-- id_conserv_cluster.R ## conservation scores clustering plot
|-- id_conserv_compare.pdf ## conservation pdf generated by R script above
|-- id_conserv_img.pdf ## low resolution image of conservation plot
|-- id_control.bam ## merged control bam files
|-- id_control.bw ## control bigwiggle file
|-- id_control_lambda.bdg ## control bedgraph file
|-- id_control_lambda.bdg.tmp ## bedClip filtered bedgraph file
|-- id_control_rep1.bam ## sorted, mapping quality 1 filtered replicate 1st bam file
|-- id_control_rep1.enrich.dhs ## reads ratio in DHS regions
|-- id_control_rep1.enrich.exon ## reads ratio in exon regions
|-- id_control_rep1.enrich.promoter ## reads ratio in promoter regions
|-- id_control_rep1.fastq ## copied fastq file
|-- id_control_rep1.frip ## FRiP score from replicate control 1st
|-- id_control_rep1.hist ## read locations histogram of replicate control 1st
|-- id_control_rep1.nochrM ## chromosome information without chrM
|-- id_control_rep1.pbc ## bwa PBC score
|-- id_control_rep1.sai ## bwa sai file
|-- id_control_rep1.sam ## bwa sam file
|-- id_control_rep1.tmp.bam ## mapping quality filtered bam files, without sorting
|-- id_control_rep1_100k.fastq ## subsampled fastq reads
|-- id_control_rep1_100k_fastqc ## fastqc temporary results
| |-- Icons
| | |-- error.png
| | |-- fastqc_icon.png
| | |-- tick.png
| | `-- warning.png
| |-- Images
| | |-- duplication_levels.png
| | |-- kmer_profiles.png
| | |-- per_base_gc_content.png
| | |-- per_base_n_content.png
| | |-- per_base_quality.png
| | |-- per_base_sequence_content.png
| | |-- per_sequence_gc_content.png
| | |-- per_sequence_quality.png
| | `-- sequence_length_distribution.png
| |-- fastqc_data.txt
| |-- fastqc_report.html
| `-- summary.txt
|-- id_control_rep1_100k_fastqc.zip
|-- id_control_rep1_4000000.bam ## subsampled 4M reads bam file
|-- id_control_rep1_4000000_nochrM.bam ## subsampled non-chrM 4M reads bam file
|-- id_control_rep1_mapped.bwa ## replicate control 1st mapped reads statistics
|-- id_control_rep1_nochrM.bam ## sorted, mapping quality filtered bam file
|-- id_control_rep1_nochrM.sam ## mapped sam files without chrM
|-- id_control_rep1_nochrM.sam.4000000 ## subsampled 4M reads without chrM
|-- id_control_rep1_total.bwa ## total reads statistics from bwa
|-- id_control_rep1_u.sam ## unique reads SAM file
|-- id_control_rep1_u.sam.4000000 ## subsampled unique reads SAM file
|-- id_control_rep1mbr.bam ## cross species mapping to mbr, or species you specified
|-- id_control_rep1mbr.sai
|-- id_control_rep1mbr.sam
|-- id_control_rep1mbr.tmp.bam
|-- id_control_rep1mbr_mapped.bwa
|-- id_control_rep1mbr_total.bwa
|-- id_control_rep2.bam ## control replicates 2nd bam file
|-- id_control_rep2.enrich.dhs
|-- id_control_rep2.enrich.exon
|-- id_control_rep2.enrich.promoter
|-- id_control_rep2.fastq
|-- id_control_rep2.frip
|-- id_control_rep2.hist
|-- id_control_rep2.nochrM
|-- id_control_rep2.pbc
|-- id_control_rep2.sai
|-- id_control_rep2.sam
|-- id_control_rep2.tmp.bam
|-- id_control_rep2_100k.fastq
|-- id_control_rep2_100k_fastqc
| |-- Icons
| | |-- error.png
| | |-- fastqc_icon.png
| | |-- tick.png
| | `-- warning.png
| |-- Images
| | |-- duplication_levels.png
| | |-- kmer_profiles.png
| | |-- per_base_gc_content.png
| | |-- per_base_n_content.png
| | |-- per_base_quality.png
| | |-- per_base_sequence_content.png
| | |-- per_sequence_gc_content.png
| | |-- per_sequence_quality.png
| | `-- sequence_length_distribution.png
| |-- fastqc_data.txt
| |-- fastqc_report.html
| `-- summary.txt
|-- id_control_rep2_100k_fastqc.zip
|-- id_control_rep2_4000000.bam
|-- id_control_rep2_4000000_nochrM.bam
|-- id_control_rep2_mapped.bwa
|-- id_control_rep2_nochrM.bam
|-- id_control_rep2_nochrM.sam
|-- id_control_rep2_nochrM.sam.4000000
|-- id_control_rep2_total.bwa
|-- id_control_rep2_u.sam
|-- id_control_rep2_u.sam.4000000
|-- id_control_rep2mbr.bam
|-- id_control_rep2mbr.sai
|-- id_control_rep2mbr.sam
|-- id_control_rep2mbr.tmp.bam
|-- id_control_rep2mbr_mapped.bwa
|-- id_control_rep2mbr_total.bwa
|-- id_gene_score.txt ## regulatory potential for top 10000 peaks
|-- id_peaks.narrowPeak ## merged peak call for narrowPeak or broadPeak
|-- id_peaks.xls ## macs2 excel file
|-- id_peaks_top_conserv.bed ## top peaks for conservation plot
|-- id_peaks_top_reg.bed ## top peaks for regulatory potential score calculation
|-- id_raw_sequence_qc.R ## median raw sequence quality plot
|-- id_raw_sequence_qc.pdf
|-- id_sort_peaks.narrowPeak ## sorted merged peak calling
|-- id_sort_summits.bed ## sorted summits of peaks
|-- id_summary.txt ## plain text for qc summary
|-- id_summits.bed ## merged peak calling summits file
|-- id_treat.bw ## merged pileup treatment bigwiggle file
|-- id_treat_pileup.bdg ## merged pileup treatment bedgraph file
|-- id_treat_pileup.bdg.tmp ## merged pileup treatment bedgraph temporary file
|-- id_treat_rep1 ## MACS2 predictd R script
|-- id_treat_rep1.bam ## bam file generated by bwa and samtools
|-- id_treat_rep1.enrich.dhs
|-- id_treat_rep1.enrich.exon
|-- id_treat_rep1.enrich.promoter
|-- id_treat_rep1.fastq
|-- id_treat_rep1.frip
|-- id_treat_rep1.hist
|-- id_treat_rep1.nochrM
|-- id_treat_rep1.pbc
|-- id_treat_rep1.sai
|-- id_treat_rep1.sam
|-- id_treat_rep1.tmp.bam
|-- id_treat_rep1_100k.fastq
|-- id_treat_rep1_100k_fastqc
| |-- Icons
| | |-- error.png
| | |-- fastqc_icon.png
| | |-- tick.png
| | `-- warning.png
| |-- Images
| | |-- duplication_levels.png
| | |-- kmer_profiles.png
| | |-- per_base_gc_content.png
| | |-- per_base_n_content.png
| | |-- per_base_quality.png
| | |-- per_base_sequence_content.png
| | |-- per_sequence_gc_content.png
| | |-- per_sequence_quality.png
| | `-- sequence_length_distribution.png
| |-- fastqc_data.txt
| |-- fastqc_report.html
| `-- summary.txt
|-- id_treat_rep1_100k_fastqc.zip
|-- id_treat_rep1_4000000.bam
|-- id_treat_rep1_4000000_nochrM.bam
|-- id_treat_rep1_control.bw
|-- id_treat_rep1_control_lambda.bdg
|-- id_treat_rep1_control_lambda.bdg.tmp
|-- id_treat_rep1_frag_sd.R ## fragment analysis script for parsing macs2 R script
|-- id_treat_rep1_mapped.bwa
|-- id_treat_rep1_model.R ## MACS2 R script for analyzing fragment size
|-- id_treat_rep1_nochrM.bam
|-- id_treat_rep1_nochrM.sam
|-- id_treat_rep1_nochrM.sam.4000000
|-- id_treat_rep1_peaks.narrowPeak ## replicate 1 peak calling
|-- id_treat_rep1_peaks.xls
|-- id_treat_rep1_sort_peaks.narrowPeak
|-- id_treat_rep1_summits.bed
|-- id_treat_rep1_total.bwa
|-- id_treat_rep1_treat.bw
|-- id_treat_rep1_treat_pileup.bdg
|-- id_treat_rep1_treat_pileup.bdg.tmp
|-- id_treat_rep1_u.sam ## uniquely mapping sam file, defined by mapping quality above 1
|-- id_treat_rep1_u.sam.4000000
|-- id_treat_rep1mbr.bam
|-- id_treat_rep1mbr.sai
|-- id_treat_rep1mbr.sam
|-- id_treat_rep1mbr.tmp.bam
|-- id_treat_rep1mbr_mapped.bwa
|-- id_treat_rep1mbr_total.bwa
|-- id_treat_rep2
|-- id_treat_rep2.bam
|-- id_treat_rep2.enrich.dhs
|-- id_treat_rep2.enrich.exon
|-- id_treat_rep2.enrich.promoter
|-- id_treat_rep2.fastq
|-- id_treat_rep2.frip
|-- id_treat_rep2.hist
|-- id_treat_rep2.nochrM
|-- id_treat_rep2.pbc
|-- id_treat_rep2.sai
|-- id_treat_rep2.sam
|-- id_treat_rep2.tmp.bam
|-- id_treat_rep2_100k.fastq
|-- id_treat_rep2_100k_fastqc
| |-- Icons
| | |-- error.png
| | |-- fastqc_icon.png
| | |-- tick.png
| | `-- warning.png
| |-- Images
| | |-- duplication_levels.png
| | |-- kmer_profiles.png
| | |-- per_base_gc_content.png
| | |-- per_base_n_content.png
| | |-- per_base_quality.png
| | |-- per_base_sequence_content.png
| | |-- per_sequence_gc_content.png
| | |-- per_sequence_quality.png
| | `-- sequence_length_distribution.png
| |-- fastqc_data.txt
| |-- fastqc_report.html
| `-- summary.txt
|-- id_treat_rep2_100k_fastqc.zip
|-- id_treat_rep2_4000000.bam
|-- id_treat_rep2_4000000_nochrM.bam
|-- id_treat_rep2_control.bw
|-- id_treat_rep2_control_lambda.bdg
|-- id_treat_rep2_control_lambda.bdg.tmp
|-- id_treat_rep2_frag_sd.R
|-- id_treat_rep2_mapped.bwa
|-- id_treat_rep2_model.R
|-- id_treat_rep2_nochrM.bam
|-- id_treat_rep2_nochrM.sam
|-- id_treat_rep2_nochrM.sam.4000000
|-- id_treat_rep2_peaks.narrowPeak
|-- id_treat_rep2_peaks.xls
|-- id_treat_rep2_sort_peaks.narrowPeak
|-- id_treat_rep2_summits.bed
|-- id_treat_rep2_total.bwa
|-- id_treat_rep2_treat.bw
|-- id_treat_rep2_treat_pileup.bdg
|-- id_treat_rep2_treat_pileup.bdg.tmp
|-- id_treat_rep2_u.sam ## uniquely mapping sam file, defined by mapping quality above 1
|-- id_treat_rep2_u.sam.4000000
|-- id_treat_rep2mbr.bam
|-- id_treat_rep2mbr.sai
|-- id_treat_rep2mbr.sam
|-- id_treat_rep2mbr.tmp.bam
|-- id_treat_rep2mbr_mapped.bwa
|-- id_treat_rep2mbr_total.bwa
`-- id_treatment.bam ## samtools merged filtered bam files
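The QC statistics in the json directory can be gathered into one dictionary for downstream inspection. This is a minimal sketch: `load_qc_json` is a hypothetical helper, the internal structure of the JSON files is not assumed, and the demonstration uses empty placeholder files in a temporary directory rather than real ChiLin output.

```python
import json
import tempfile
from pathlib import Path


def load_qc_json(output_dir, prefix):
    """Return {statistic name: parsed JSON} for every <prefix>_*.json file."""
    stats = {}
    for path in sorted(Path(output_dir, "json").glob(f"{prefix}_*.json")):
        name = path.stem[len(prefix) + 1:]  # "id_frip" -> "frip"
        with path.open() as handle:
            stats[name] = json.load(handle)
    return stats


# Demonstration on a throwaway directory with placeholder content.
with tempfile.TemporaryDirectory() as workdir:
    json_dir = Path(workdir, "json")
    json_dir.mkdir()
    (json_dir / "id_frip.json").write_text("{}")
    (json_dir / "id_pbc.json").write_text("{}")
    qc = load_qc_json(workdir, "id")

print(sorted(qc))
```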