Appendix: Dependent data

Get dependent data

ChiLin supports all species listed on the UCSC website, provided the following dependent data, as listed under [species], are available:

(Must) Genome index for your species; we recommend a bwa index.
(Must) Chromosome length information text file.
(Must) Standard RefSeq files.
(Optional) PhastCons conservation bigWig files.
(Optional) Genome directory containing one sequence fasta file per chromosome.
(Optional) Union DHS and blacklist regions.
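
Before a run it can help to confirm that the mandatory files are actually in place. The following is a minimal sketch and not part of ChiLin: the function name `check_species_data` and its three arguments are hypothetical, and it assumes the index was built with bwa, which writes `.amb`, `.ann`, `.bwt`, `.pac` and `.sa` files next to the index prefix.

```shell
#!/bin/sh
# Hypothetical helper (not part of ChiLin): verify that the three mandatory
# inputs exist and are non-empty before starting a run.
#   $1  bwa index prefix (the fasta path that was given to `bwa index`)
#   $2  chromosome length text file
#   $3  RefSeq gene table
check_species_data() {
    index_prefix=$1
    chrom_len_file=$2
    gene_table=$3
    rc=0

    # bwa index writes these five files next to the prefix.
    for ext in amb ann bwt pac sa; do
        if [ ! -s "${index_prefix}.${ext}" ]; then
            echo "missing bwa index file: ${index_prefix}.${ext}" >&2
            rc=1
        fi
    done

    for f in "$chrom_len_file" "$gene_table"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty file: $f" >&2
            rc=1
        fi
    done

    return "$rc"
}
```

A call such as `check_species_data /path/to/genome.fasta /path/to/chrom_len /path/to/sp.refgene` returns non-zero and names each missing file on stderr; the paths here are placeholders.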

We have packaged all dependent data for hg19, hg38, mm9, mm10.

Data details

The first group is data with large disk usage:

Data Name	Used by	Data Source
`genome_index`	bwa/bowtie/star	raw fasta indexed files
`genome_dir`	bwa/bowtie/star	genome fasta files
`conservation`	conservation_plot.py	wiggle files

Genome version	Raw genome sequence	Masked genome sequence
hg19	hg19_raw	hg19_mask
hg38	hg38_raw	hg38_mask
mm9	mm9_raw	mm9_mask
mm10	mm10_raw	mm10_mask

The second group is small reference files:

Data Name	Used by	Data Source
`chrom_len`	samtools	UCSC table browser
`dhs`	bedtools	Union DHS regions from Cistrome DB
`velcro`	bedtools	blacklist regions
`geneTable`	bedAnnotate	UCSC table browser
`[contamination]`	bwa	Mycoplasma genome index (set by --mapper)

The following sections describe how we generate these reference files. If you work with a species other than hg19/hg38/mm9/mm10, you can obtain its reference files in a similar way.

Mycoplasma genome

Mycoplasma appears to be a major source of sample contamination, so we recommend downloading the Mycoplasma fasta and indexing it. The sequence is available from the mycoplasma genome link, or from the NCBI Nucleotide database.

Then index it with `bwa index -a is mycoplasma.fasta`.
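
The `is` algorithm is used here for the small Mycoplasma genome, while the full genome in the next section is indexed with `bwtsw`. As a rough sketch of that choice: the helper `pick_bwa_algo` is hypothetical, and the 2 GB cutoff is an assumption based on the bwa manual's note that `is` does not work with databases larger than 2 GB.

```shell
#!/bin/sh
# Hypothetical helper: choose a `bwa index -a` algorithm from the fasta size.
# Assumption: files below roughly 2 GB use `is`, larger ones use `bwtsw`.
pick_bwa_algo() {
    fasta=$1
    limit=2000000000
    # wc -c may pad its output with spaces on some systems; strip them.
    size=$(wc -c < "$fasta" | tr -d ' ')
    if [ "$size" -lt "$limit" ]; then
        echo is
    else
        echo bwtsw
    fi
}
```

It could then be used as `bwa index -a "$(pick_bwa_algo mycoplasma.fasta)" mycoplasma.fasta`.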

BWA Index

Download the raw genome sequence data, unpack the archive with `tar xvfz`, and concatenate the chromosome files with `cat *fa > genome.fasta`. Then index the result:

bwa index -a bwtsw genome.fasta
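
The unpack and concatenate steps can be sketched as a small function. This is an illustration under assumptions: the function name `build_genome_fasta`, the archive name and the working directory are placeholders rather than names fixed by ChiLin, and the archive is assumed to unpack its `*.fa` files flat into the target directory.

```shell
#!/bin/sh
# Hypothetical helper: unpack a per-chromosome fasta archive and merge the
# chromosome files into a single genome.fasta.
#   $1  gzipped tar archive of *.fa files
#   $2  working directory that receives the unpacked files and genome.fasta
build_genome_fasta() {
    archive=$1
    workdir=$2
    mkdir -p "$workdir" || return 1
    tar -xzf "$archive" -C "$workdir" || return 1
    # The output name ends in .fasta, so a re-run never feeds the merged
    # file back into the *.fa glob.
    cat "$workdir"/*.fa > "$workdir/genome.fasta"
}
```

For example, `build_genome_fasta chromFa.tar.gz genome_dir && bwa index -a bwtsw genome_dir/genome.fasta`, where `chromFa.tar.gz` and `genome_dir` stand in for your own archive and directory.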

UCSC table browser

Using the browser step by step

To get the RefSeq files, go to the UCSC table browser.
Select desired species and assembly, such as hg19
Select group: Genes and Gene Prediction Tracks
Select track: RefSeq Genes
Select table: refGene
Select region: genome
Select output format: all fields from selected table
Enter output file: species.refgene
Hit the ‘get output’ button
Download the file and remove the header line with the command:

sed 1d species.refgene > sp.refgene
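
Note that `sed 1d` removes the first line unconditionally, so running it a second time on the cleaned file would drop the first gene record. A slightly more defensive variant is sketched below; the function name `strip_refgene_header` is hypothetical, and it assumes that the table browser header line starts with `#` (as in `#bin`) and that refGene rows have 16 tab-separated columns.

```shell
#!/bin/sh
# Hypothetical helper: drop line 1 only if it starts with '#', then fail
# if any remaining row does not have 16 tab-separated columns.
#   $1  file saved from the table browser
#   $2  cleaned output file
strip_refgene_header() {
    src=$1
    dst=$2
    sed '1{/^#/d;}' "$src" > "$dst" || return 1
    bad=$(awk -F '\t' 'NF != 16' "$dst" | wc -l | tr -d ' ')
    if [ "$bad" -ne 0 ]; then
        echo "warning: $bad line(s) in $dst do not have 16 columns" >&2
        return 1
    fi
}
```

Usage would be `strip_refgene_header species.refgene sp.refgene`; applying it again to `sp.refgene` leaves the gene records untouched.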

Conservation score

(Optional) Get the PhastCons conservation scores for the most common assemblies (hg19_conserv, hg38_conserv, mm10_conserv, mm9_conserv) and use wigToBigWig to convert them to bigWig. We provide hg19/mm9 conservation scores on our server; for other species, just leave the conservation section of chilin.conf blank. Take hg19 as an example:

wget -r -np -nd --accept=gz http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/phastCons46way/placentalMammals/
# chrom_len is the path to your reference chromosome length file
for c in chr*wig*gz
do
    # e.g. chr1.phastCons46way.placental.wigFix.gz -> chr1.bw
    bw=${c%phastCons46way.placental.wigFix.gz}bw
    echo "$bw"
    gunzip -c "$c" | wigToBigWig stdin chrom_len "$bw"
done
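
Two details of the loop are easy to get wrong: the `${c%...}` expansion that derives the output name, and the format of `chrom_len`, which wigToBigWig reads as a two-column file of chromosome name and size. The sketch below isolates both so they can be checked before a long conversion; the function names `bw_name` and `check_chrom_len` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: derive the bigWig name from a downloaded file name
# using the same suffix-stripping expansion as the loop.
bw_name() {
    echo "${1%phastCons46way.placental.wigFix.gz}bw"
}

# Hypothetical helper: succeed only if the file is non-empty and every line
# is "<name> <positive integer>" separated by whitespace.
check_chrom_len() {
    awk 'NF != 2 || $2 !~ /^[0-9]+$/ || $2 == 0 { bad = 1 }
         END { exit (bad || NR == 0) }' "$1"
}
```

For instance, `bw_name chr1.phastCons46way.placental.wigFix.gz` prints `chr1.bw`, and `check_chrom_len chrom_len || echo "fix chrom_len first"` flags a malformed length file.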