================= Instructions ================= Through the pipeline, several temporary files will be generated, some of them are only used for settings and transitions, others for continuing the next step, the rest for publishing and interpreting a biological story. .. note:: a better way to organize your ChIP-seq project Raw Data ======== .. _Raw Data: ======== ======= ====================================================== Format Type instruction ======== ======= ====================================================== FASTQ text single end FASTQ.gz gz single end FASTQ text pair end FASTQ.gz gz pair end ======== ======= ====================================================== ----- .. _Manual: Instructions to usage ========================= Simple mode (The major mode) --------------------------------------------- Demo data command is as follows:: chilin simple -p narrow -t foxa1_t1.fastq -c foxa1_c1.fastq -i local -o local -s hg19 --skip 10,12 --dont_remove See skip_ option for details. This is major and the easiest mode to run ChiLin for single end data with default bwa mapper, for single end data using comma to separate sample replicates for IP and input ChIP-seq sample:: chilin simple -u your_name -s your_species --threads 8 -i id -o output -t treat1.fastq,treat2.fastq -c control1.fastq,control2.fastq -p narrow -r tf For pair end data, use semicolon to separate sample replicates, use comma to separate pairs, do not forget to add `quotes(")` of your sample file path:: chilin simple --threads 8 -i H3K27me3_PairEnd -o H3K27me3_PairEnd -u you -s mm9 -t "GSM905438.fastq_R1.gz,GSM905438.fastq_R2.gz" -c "GSM905434.fastq_R1.gz,GSM905434.fastq_R2.gz;GSM905436.fastq_R1.gz,GSM905436.fastq_R2.gz" -p both --pe See more options about *simple* by:: chilin simple -h .. _simple-mode: simple mode useful option ------------------------- * -t In simple mode, this is the options for specifying path to treatment. * -c In simple mode, this is the options for specifying path to control. * -p peaks calling type, narrow or broad or both, #e.g., If your factor is transcription factor, and you want to call narrow peaks only, or broad histone mark with broad peak calling * -i prefix of the output name * -o output results directory * -s species, must be filled, the -s option is corresponding to your filled python config file [section], see section for details, :envvar:`[species] <[species]>`. * -u user * --pe pair end mode * --pe pair end mode * --maper to choose your mapping tool * --threads threads number for mapping tool * -r factor type, some default settings for three factor types, tf,histone,dnase (optional) gen mode ------------------------------ This mode is to generate config file for `run-mode`_. A config file is look like this, - The major section user needs to fill is the :envvar:`[basics] <[basics]>` section. .. code-block:: bash chilin gen -o test.conf .. code-block:: ini [basics] user = anonymous id = local time = 2014-05-09 species = hg19 factor = tf treat = foxa1_t1.fastq,foxa1_t1.fastq cont = foxa1_c1.fastq,foxa1_c1.fastq output = output_directory version = 2.0.0 run mode usage ------------------------------ .. _run-mode: After configurating the config files above, you could use run mode with a single command:: chilin gen -o my_config ## modify tool parameters and run chilin run -c my_config batch mode usage ------------------------------ This mode help user run dataset one by one with one process. After configurating a batch of the config files above, such as e.g. 1.conf, 2.conf, 3.conf, then you fill in a file called *batch.conf*:: 1.conf 2.conf 3.conf you could use batch mode with a single command:: chilin batch -b batch.conf .. _skip: Step control, mapper choice and threads control, mimic run ---------------------------------------------------------- Common options can be used for simple mode, run and batch modes. Each step control is tolerant, continue running even tool failed processing. - --skip, step control, e.g:: chilin simple -s hg19 -i id -p narrow -o output -u user --skip 1,3,5,9,10,11 -t treat1.fastq,treat2.fastq` - step 1(*can skip*): FastQC sequence quality evaluation, reads GC contents evaluation and library contamination, this step can be skipped. - step 2: bwa(default), bowtie or star mapping, this step cannot be skipped, because this provided necessary BAM files. - step 3: sub-sample bam files and do macs2 fragment size estimation. - step 4: sub-sample bam files(if step 3 is run, skip) and do PBC evaluation - step 5: call peak for replicates samples, and do replicates peak overlap/correlation analysis - step 6: call peak for merged bam file, this step cannot be skipped, because this provided peak for annotation step - step 7: calculate FRiP scores for each sample. - step 8: use bedAnnotate.py script to evaluate merged peak calling meta regions distribution(promoters, exons, introns, intergenic, dhs, black list regions) - step 9: sub-sample bam files(if step 3 is run, skip) and calculate reads ratio in meta regions - step 10(*can skip*): draw Phastcon scores distribution around peak call summits, if you do not have Phastcon score bigwig files, use --skip 10 or leave chilin.conf blank for that reference - step 11(*can skip*): Regulatory potential score calculation on top 10k peaks - step 12(*can skip*): use MDSeqPos to perform motif analysis, for peaks number less than 200, ChiLin skip this step automatically - step 13(*can skip*): generate report, use explicitly skip to skip 10-12 if necessary. - --dont_resume, by default, each re-run would use previous temporary files to resume from the step it crashed. When dont_resume is on, ChiLin would start from first step, so user do not to clean up the work directory. - --dont_remove, keep temporary files - --dry-run, mimic run chilin command - --threads, BWA, Bowtie and FastQC multithreads options. - --mapper, to choose mapping tools, should match your genome index in :envvar:`genome index` Instructions to config file ============================== basics ---------------------------- .. envvar:: [basics] Lists all the meta-data of current workflow. Consist of the following options: .. envvar:: user user name .. envvar:: time time you start to run .. envvar:: species The name of species, written to the QCreport and log Limit: a string (1) consist of ``numbers``, ``alphabets`` or ``'_'`` (2) shorter than 20 characters .. envvar:: id This is used as output prefix, such as input id: test, output file would be: test_treat.bam .. envvar:: factor The name of species, writen to DC summary and QCreport, log Limit: a string (1) come from GO standard term .. envvar:: treat The paths of treatment files Limit: absolute ``path`` of files in :ref:`supported formats` .. envvar:: cont The paths of control files Limit: absolute or relative ``path`` of files in :ref:`supported formats` .. envvar:: output The paths of output directory. tool ---------------------------- The tool section is like this: .. literalinclude:: ../chilin.conf :language: ini :lines: 8-16 :emphasize-lines: 13-15 :linenos: .. envvar:: [tool] Lists all the meta-data of current workflow. Consist of the following options: .. envvar:: mdseqpos absolute path to ``MDSeqPos.py`` .. envvar:: macs2 absolute path to ``macs2`` species ---------------------------- You can add as many species as possible. To add species, first you need to read :ref:`dependent data ` section to fill the following. Then, you should fill the config files species section, the rule is like follows, e.g. hg19 assembly. .. literalinclude:: hg19.conf :language: ini :lines: 1-10 :emphasize-lines: 1-10 :linenos: And mm9 assembly, .. literalinclude:: hg19.conf :language: ini :lines: 12-21 :emphasize-lines: 12-21 :linenos: And hg38 assembly, .. literalinclude:: hg19.conf :language: ini :lines: 23-32 :emphasize-lines: 23-32 :linenos: .. envvar:: [species] specific species assembly version you want to analyze Consist of the following options: .. envvar:: genome_index absolute path to corresponding mappers genome index, if you use default bwa, this should be bwa index. .. envvar:: genome_dir absolute path to genome fasta files, separated by chromosome, like `chr1.fa, chr2.fa, chr3.fa ...` .. envvar:: chrom_len absolute path to chromosome length text file .. envvar:: dhs absolute path union DHS regions .. envvar:: velcro absolute path black list regions .. envvar:: conservation absolute path to the directory containing UCSC Phastcon score bigwig files .. envvar:: geneTable standard refSeq annotation table from UCSC table browser contamination ---------------------------- you can add all species you are suspicious of sampling swap or library contamination. .. literalinclude:: ../chilin.conf :language: ini :lines: 52-61 :emphasize-lines: 52-61 :linenos: .. envvar:: [contamination] specific species assembly path that you want to screen. software options ---------------------------- ChiLin has some user-defined parameters for macs2, regulatory potential, conservation score and motif analysis. .. literalinclude:: ../chilin.conf :language: ini :lines: 22-51 :emphasize-lines: 23-50 :linenos: .. envvar:: [macs2] macs2 parameters .. envvar:: extsize fixed extension size for macs2 peak calling .. envvar:: type [for macs2] peak calling types, user can choose narrow, broad and both, we suggest user use narrow for TF and active histone marks, use broad for broad histone marks, use both for chromatin regulators. .. envvar:: fdr FDR cutoff for macs2 peak calling .. envvar:: keep_dup duplicates level, suggest 1 for removing redundancy, or all for preserving all redudancy for DNase-seq .. envvar:: [reg] specific species assembly version you want to analyze .. envvar:: peaks [for reg] top peaks number(sorted by macs2 score) for estimating regulatory score. .. envvar:: dist distance to TSS cutoff when calculating regulatory potential score .. envvar:: [conservation] specific species assembly version you want to analyze .. envvar:: type [for conservation] transcription factor or histone mark .. envvar:: peaks top peaks (sorted by macs2 score) for plotting conservation distribution .. envvar:: width window width around peaks summits for plotting conservation .. envvar:: [seqpos] specific species assembly version you want to analyze .. envvar:: peaks [for conservation] top peaks number for seqpos (search in the motif database) .. envvar:: mdscan_width motif scan window width around peak summit .. envvar:: mdscan_top_peaks top peaks for denovo motif scan .. envvar:: width seqpos width .. envvar:: seqpos_mdscan_top_peaks_refine seqpos and mdscan top peaks refine, see mdseqpos .. envvar:: db choose mdseqpos motif database, default cistrome.xml .. envvar:: pvalue_cutoff cutoff for motif analysis .. _Instructions_table: .. _Instructions_results: Instructions to results ========================= The output prefix is from: * `simple-mode`_ -i id specified, or `run-mode`_ filled in :envvar:`[basics] <[basics]>` section :envvar:`id ` part. * The output directory is `simple mode` -o output specified or 2. `run-mode`_ filled in :envvar:`[basics] <[basics]>` section :envvar:`output ` part. * For a fully test dataset with replicates of treatments and replicates of controls, the results folder are like following, which are generated with *-dont_remove* option, to preserve all temporary files, use *-dont_remove* option. .. _cleaner: Without `--dont_remove` option, the work directory would be cleaned up:: id |-- attic | |-- json | | |-- id_conserv.json | | |-- id_contam.json | | |-- id_dhs.json | | |-- id_enrich_meta.json | | |-- id_fastqc.json | | |-- id_frag.json | | |-- id_frip.json | | |-- id_macs2.json | | |-- id_macs2_rep.json | | |-- id_map.json | | |-- id_meta.json | | |-- id_pbc.json | | `-- id_rep.json | |-- id_conserv.pdf | |-- id_control.bam | |-- id_control_rep1.bam | |-- id_control_rep2.bam | |-- id_gene_score.txt | |-- id_treat_rep1.bam | |-- id_treat_rep2.bam | `-- id_treatment.bam |-- id.pdf |-- id_control.bw |-- id_peaks.xls |-- id_sort_peaks.narrowPeak |-- id_sort_summits.bed |-- id_treat.bw |-- id_treat_rep1_control.bw |-- id_treat_rep1_peaks.xls |-- id_treat_rep1_sort_peaks.narrowPeak |-- id_treat_rep1_treat.bw |-- id_treat_rep2_control.bw |-- id_treat_rep2_peaks.xls |-- id_treat_rep2_sort_peaks.narrowPeak `-- id_treat_rep2_treat.bw With `--dont_remove` option, .. code-block:: bash output |-- json ## qc statistics | |-- id_conserv.json ## conservation scores | |-- id_contam.json ## library contamination evaluation | |-- id_dhs.json ## union dhs overlap | |-- id_enrich_meta.json ## meta regions reads ratio | |-- id_fastqc.json ## fastqc evaluation | |-- id_frag.json ## fragment size evaluation | |-- id_frip.json ## FRiP scores | |-- id_macs2.json ## merged macs2 peak calling number | |-- id_macs2_rep.json ## macs2 replicates peaks number | |-- id_map.json ## mapping ratio statistics | |-- id_meta.json ## peak meta regions distribution | |-- id_pbc.json ## PBC score | `-- id_rep.json ## replicates consistency |-- latex ## rendered latex document | |-- id_begin.tex | |-- id_conserv.tex | |-- id_contam.tex | |-- id_end.tex | |-- id_fastqc.tex | |-- id_fastqc_gc.tex | |-- id_frip.tex | |-- id_map.tex | `-- id_summary_table.tex |-- id.aux ## latex log file |-- id.cor ## correlation analysis temporary file |-- id.dhs ## dhs overlap analysis temporary file |-- id.log ## latex log file |-- id.meta ## meta regions peak distribution temporary file |-- id.out ## latex log file |-- id.pdf ## pdf document generated |-- id.tex ## file latex file |-- id_0_1.overlap ## replicates peak overlap |-- id_bwa_compare.R ## R script for comparing new data to historic data |-- id_bwa_compare.pdf ## pdf generated by R script above |-- id_conserv.R ## conservation plot R code |-- id_conserv.pdf ## pdf generated by R script above |-- id_conserv.txt ## 7 or 5 point conservation scores around summits |-- id_conserv_cluster.R ## conservation scores clustering plot |-- id_conserv_compare.pdf ## conservation pdf generated by R script above |-- id_conserv_img.pdf ## low resolution image of conservation plot |-- id_control.bam ## merged control bam files |-- id_control.bw ## control bigwiggle file |-- id_control_lambda.bdg ## control bedgraph file |-- id_control_lambda.bdg.tmp ## bedClip filtered bedgraph file |-- id_control_rep1.bam ## sorted, mapping quality 1 filtered replicate 1st bam file |-- id_control_rep1.enrich.dhs ## reads ratio in DHS regions |-- id_control_rep1.enrich.exon ## reads ratio in exon regions |-- id_control_rep1.enrich.promoter ## reads ratio in promoter regions |-- id_control_rep1.fastq ## copied fastq file |-- id_control_rep1.frip ## FRiP score from replicate control 1st |-- id_control_rep1.hist ## read locations histogram of replicate control 1st |-- id_control_rep1.nochrM ## chromosome information without chrM |-- id_control_rep1.pbc ## bwa PBC score |-- id_control_rep1.sai ## bwa sai file |-- id_control_rep1.sam ## bwa sam file |-- id_control_rep1.tmp.bam ## mapping quality filtered bam files, without sorting |-- id_control_rep1_100k.fastq ## subsampled fastq reads |-- id_control_rep1_100k_fastqc ## fastqc temporary results | |-- Icons | | |-- error.png | | |-- fastqc_icon.png | | |-- tick.png | | `-- warning.png | |-- Images | | |-- duplication_levels.png | | |-- kmer_profiles.png | | |-- per_base_gc_content.png | | |-- per_base_n_content.png | | |-- per_base_quality.png | | |-- per_base_sequence_content.png | | |-- per_sequence_gc_content.png | | |-- per_sequence_quality.png | | `-- sequence_length_distribution.png | |-- fastqc_data.txt | |-- fastqc_report.html | `-- summary.txt |-- id_control_rep1_100k_fastqc.zip |-- id_control_rep1_4000000.bam ## subsampled 4M reads bam file |-- id_control_rep1_4000000_nochrM.bam ## subsampled non-chrM 4M reads bam file |-- id_control_rep1_mapped.bwa ## replicate control 1st mapped reads statistics |-- id_control_rep1_nochrM.bam ## sorted, mapping quality filtered bam file |-- id_control_rep1_nochrM.sam ## mapped sam files without chrM |-- id_control_rep1_nochrM.sam.4000000 ## subsampled 4M reads without chrM |-- id_control_rep1_total.bwa ## total reads statistics from bwa |-- id_control_rep1_u.sam ## unique reads SAM file |-- id_control_rep1_u.sam.4000000 ## subsampled unique reads SAM file |-- id_control_rep1mbr.bam ## cross species mapping to mbr, or species you specified |-- id_control_rep1mbr.sai |-- id_control_rep1mbr.sam |-- id_control_rep1mbr.tmp.bam |-- id_control_rep1mbr_mapped.bwa |-- id_control_rep1mbr_total.bwa |-- id_control_rep2.bam ## control replicates 2nd bam file |-- id_control_rep2.enrich.dhs |-- id_control_rep2.enrich.exon |-- id_control_rep2.enrich.promoter |-- id_control_rep2.fastq |-- id_control_rep2.frip |-- id_control_rep2.hist |-- id_control_rep2.nochrM |-- id_control_rep2.pbc |-- id_control_rep2.sai |-- id_control_rep2.sam |-- id_control_rep2.tmp.bam |-- id_control_rep2_100k.fastq |-- id_control_rep2_100k_fastqc | |-- Icons | | |-- error.png | | |-- fastqc_icon.png | | |-- tick.png | | `-- warning.png | |-- Images | | |-- duplication_levels.png | | |-- kmer_profiles.png | | |-- per_base_gc_content.png | | |-- per_base_n_content.png | | |-- per_base_quality.png | | |-- per_base_sequence_content.png | | |-- per_sequence_gc_content.png | | |-- per_sequence_quality.png | | `-- sequence_length_distribution.png | |-- fastqc_data.txt | |-- fastqc_report.html | `-- summary.txt |-- id_control_rep2_100k_fastqc.zip |-- id_control_rep2_4000000.bam |-- id_control_rep2_4000000_nochrM.bam |-- id_control_rep2_mapped.bwa |-- id_control_rep2_nochrM.bam |-- id_control_rep2_nochrM.sam |-- id_control_rep2_nochrM.sam.4000000 |-- id_control_rep2_total.bwa |-- id_control_rep2_u.sam |-- id_control_rep2_u.sam.4000000 |-- id_control_rep2mbr.bam |-- id_control_rep2mbr.sai |-- id_control_rep2mbr.sam |-- id_control_rep2mbr.tmp.bam |-- id_control_rep2mbr_mapped.bwa |-- id_control_rep2mbr_total.bwa |-- id_gene_score.txt ## regulatory potential for top 10000 peaks |-- id_peaks.narrowPeak ## merged peak call for narrowPeak or broadPeak |-- id_peaks.xls ## macs2 excel file |-- id_peaks_top_conserv.bed ## top peaks for conservation plot |-- id_peaks_top_reg.bed ## top peaks for regulatory potential score calculation |-- id_raw_sequence_qc.R ## median raw sequence quality plot |-- id_raw_sequence_qc.pdf |-- id_sort_peaks.narrowPeak ## sorted merged peak calling |-- id_sort_summits.bed ## sorted summits of peaks |-- id_summary.txt ## plain text for qc summary |-- id_summits.bed ## merged peak calling summits file |-- id_treat.bw ## merged pileup treatment bigwiggle file |-- id_treat_pileup.bdg ## merged pileup treatment bedgraph file |-- id_treat_pileup.bdg.tmp ## merged pileup treatment bedgraph temporary file |-- id_treat_rep1 ## MACS2 predictd R script |-- id_treat_rep1.bam ## bam file generated by bwa and samtools |-- id_treat_rep1.enrich.dhs |-- id_treat_rep1.enrich.exon |-- id_treat_rep1.enrich.promoter |-- id_treat_rep1.fastq |-- id_treat_rep1.frip |-- id_treat_rep1.hist |-- id_treat_rep1.nochrM |-- id_treat_rep1.pbc |-- id_treat_rep1.sai |-- id_treat_rep1.sam |-- id_treat_rep1.tmp.bam |-- id_treat_rep1_100k.fastq |-- id_treat_rep1_100k_fastqc | |-- Icons | | |-- error.png | | |-- fastqc_icon.png | | |-- tick.png | | `-- warning.png | |-- Images | | |-- duplication_levels.png | | |-- kmer_profiles.png | | |-- per_base_gc_content.png | | |-- per_base_n_content.png | | |-- per_base_quality.png | | |-- per_base_sequence_content.png | | |-- per_sequence_gc_content.png | | |-- per_sequence_quality.png | | `-- sequence_length_distribution.png | |-- fastqc_data.txt | |-- fastqc_report.html | `-- summary.txt |-- id_treat_rep1_100k_fastqc.zip |-- id_treat_rep1_4000000.bam |-- id_treat_rep1_4000000_nochrM.bam |-- id_treat_rep1_control.bw |-- id_treat_rep1_control_lambda.bdg |-- id_treat_rep1_control_lambda.bdg.tmp |-- id_treat_rep1_frag_sd.R ## fragment analysis script for parsing macs2 R script |-- id_treat_rep1_mapped.bwa |-- id_treat_rep1_model.R ## MACS2 R script for analyzing fragment size |-- id_treat_rep1_nochrM.bam |-- id_treat_rep1_nochrM.sam |-- id_treat_rep1_nochrM.sam.4000000 |-- id_treat_rep1_peaks.narrowPeak ## replicate 1 peak calling |-- id_treat_rep1_peaks.xls |-- id_treat_rep1_sort_peaks.narrowPeak |-- id_treat_rep1_summits.bed |-- id_treat_rep1_total.bwa |-- id_treat_rep1_treat.bw |-- id_treat_rep1_treat_pileup.bdg |-- id_treat_rep1_treat_pileup.bdg.tmp |-- id_treat_rep1_u.sam ## uniquely mapping sam file, defined by mapping quality above 1 |-- id_treat_rep1_u.sam.4000000 |-- id_treat_rep1mbr.bam |-- id_treat_rep1mbr.sai |-- id_treat_rep1mbr.sam |-- id_treat_rep1mbr.tmp.bam |-- id_treat_rep1mbr_mapped.bwa |-- id_treat_rep1mbr_total.bwa |-- id_treat_rep2 |-- id_treat_rep2.bam |-- id_treat_rep2.enrich.dhs |-- id_treat_rep2.enrich.exon |-- id_treat_rep2.enrich.promoter |-- id_treat_rep2.fastq |-- id_treat_rep2.frip |-- id_treat_rep2.hist |-- id_treat_rep2.nochrM |-- id_treat_rep2.pbc |-- id_treat_rep2.sai |-- id_treat_rep2.sam |-- id_treat_rep2.tmp.bam |-- id_treat_rep2_100k.fastq |-- id_treat_rep2_100k_fastqc | |-- Icons | | |-- error.png | | |-- fastqc_icon.png | | |-- tick.png | | `-- warning.png | |-- Images | | |-- duplication_levels.png | | |-- kmer_profiles.png | | |-- per_base_gc_content.png | | |-- per_base_n_content.png | | |-- per_base_quality.png | | |-- per_base_sequence_content.png | | |-- per_sequence_gc_content.png | | |-- per_sequence_quality.png | | `-- sequence_length_distribution.png | |-- fastqc_data.txt | |-- fastqc_report.html | `-- summary.txt |-- id_treat_rep2_100k_fastqc.zip |-- id_treat_rep2_4000000.bam |-- id_treat_rep2_4000000_nochrM.bam |-- id_treat_rep2_control.bw |-- id_treat_rep2_control_lambda.bdg |-- id_treat_rep2_control_lambda.bdg.tmp |-- id_treat_rep2_frag_sd.R |-- id_treat_rep2_mapped.bwa |-- id_treat_rep2_model.R |-- id_treat_rep2_nochrM.bam |-- id_treat_rep2_nochrM.sam |-- id_treat_rep2_nochrM.sam.4000000 |-- id_treat_rep2_peaks.narrowPeak |-- id_treat_rep2_peaks.xls |-- id_treat_rep2_sort_peaks.narrowPeak |-- id_treat_rep2_summits.bed |-- id_treat_rep2_total.bwa |-- id_treat_rep2_treat.bw |-- id_treat_rep2_treat_pileup.bdg |-- id_treat_rep2_treat_pileup.bdg.tmp |-- id_treat_rep2_u.sam ## uniquely mapping sam file, defined by mapping quality above 1 |-- id_treat_rep2_u.sam.4000000 |-- id_treat_rep2mbr.bam |-- id_treat_rep2mbr.sai |-- id_treat_rep2mbr.sam |-- id_treat_rep2mbr.tmp.bam |-- id_treat_rep2mbr_mapped.bwa |-- id_treat_rep2mbr_total.bwa `-- id_treatment.bam ## samtools merged filtered bam files Built-in tools ================= * conservation_plot.py for generating conservation profiles * bedAnnotate.py is used to calculate meta gene distribuiton.