Appendix: Dependent data

Get dependent data

ChiLin supports all species listed on the UCSC website, provided the following dependent data are available (see [species]):

  • (Must) Genome index for your species; we recommend a bwa index.
  • (Must) Chromosome length information text file.
  • (Must) Standard RefSeq files.
  • (Optional) PhastCons conservation bigWig files.
  • (Optional) Genome directory containing one sequence FASTA file per chromosome.
  • (Optional) Union DHS and blacklist regions.

We have packaged all dependent data for hg19, hg38, mm9, mm10.

Data details

  • First, the data sets that take up a large amount of disk space:
Data Name      Used by                Data Source
genome_index   bwa/bowtie/star        raw fasta indexed files
genome_dir     bwa/bowtie/star        genome fasta files
conservation   conservation_plot.py   wiggle files
Genome version   Raw genome sequence   Masked genome sequence
hg19             hg19_raw              hg19_mask
hg38             hg38_raw              hg38_mask
mm9              mm9_raw               mm9_mask
mm10             mm10_raw              mm10_mask
  • Second, the small reference files:
Data Name         Used by       Data Source
chrom_len         samtools      UCSC table browser
dhs               bedtools      Union DHS regions from Cistrome DB
velcro            bedtools      blacklist regions
geneTable         bedAnnotate   UCSC table browser
[contamination]   bwa           Mycoplasma genome index (set by --mapper)
  • The following sections describe how we generate these reference files. For species or assemblies other than hg19/hg38/mm9/mm10, you can obtain the reference files in a similar way.

Mycoplasma genome

Mycoplasma appears to be a major source of contamination, so we recommend downloading the Mycoplasma FASTA file and indexing it. The data are available via the mycoplasma genome link, or from the NCBI Nucleotide database.

Then build the index with the command below (-a is selects bwa's IS algorithm):

bwa index -a is mycoplasma.fasta
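After indexing, it can be useful to confirm that the index was written completely before pointing ChiLin at it. The following is a minimal sketch, assuming the usual set of files that bwa index writes next to the FASTA (.amb, .ann, .bwt, .pac, .sa). The helper name bwa_index_complete is hypothetical and not part of ChiLin or bwa.

```shell
# Sketch: check that all bwa index files exist and are non-empty.
# bwa_index_complete is a hypothetical helper name.
bwa_index_complete() {
    local prefix=$1 ext
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$prefix.$ext" ]; then
            echo "missing or empty: $prefix.$ext" >&2
            return 1
        fi
    done
    return 0
}

# Example:
#   bwa_index_complete mycoplasma.fasta && echo "index looks complete"
```

The same check applies to any bwa index prefix, including the genome index described in the next section.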

BWA Index

Download the raw genome sequence data, extract the archive with tar xvfz, and concatenate the chromosomes with cat *.fa > genome.fasta. Then index the result:

bwa index -a bwtsw genome.fasta
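The concatenation step can also be wrapped in a small function that never appends the output file to itself, which can otherwise happen when the merged file lives in the same directory and matches the glob. This is only a sketch: the helper name concat_chromosomes and the example paths are hypothetical, and it assumes the extracted per-chromosome files end in .fa.

```shell
# Sketch: merge per-chromosome FASTA files into one genome FASTA.
# concat_chromosomes is a hypothetical helper name, not part of ChiLin.
concat_chromosomes() {
    local src_dir=$1 out=$2 f
    : > "$out" || return 1                 # start from an empty output file
    for f in "$src_dir"/*.fa; do
        [ -e "$f" ] || continue            # the glob matched nothing
        [ "$f" -ef "$out" ] && continue    # skip the output file itself
        cat "$f" >> "$out"
    done
    # Print how many sequences ended up in the merged file.
    grep -c '^>' "$out"
}

# Example (paths are placeholders):
#   concat_chromosomes ./chromFa genome.fasta
#   bwa index -a bwtsw genome.fasta
```

The printed sequence count gives a quick sanity check against the number of chromosomes expected for the assembly.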

UCSC table browser

Using the browser step by step

  • To get the RefSeq files, go to the UCSC table browser.
  • Select desired species and assembly, such as hg19
  • Select group: Genes and Gene Prediction Tracks
  • Select track: RefSeq Genes
  • Select table: refGene
  • Select region: genome
  • Select output format: all fields from selected table
  • Enter output file: species.refgene
  • Hit the ‘get output’ button
  • Download the file and remove the header line with the command:
sed 1d species.refgene > sp.refgene
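The sed 1d command removes the first line unconditionally, so running it twice would drop a gene record. A slightly more defensive variant is sketched below: it deletes the first line only when it starts with '#', then checks the number of tab-separated columns. The helper name strip_refgene_header is hypothetical, and the default of 16 columns is an assumption based on the refGene table layout (bin through exonFrames); pass a different count if your table differs.

```shell
# Sketch: remove a leading '#' header line and verify the column count.
# strip_refgene_header is a hypothetical helper name.
strip_refgene_header() {
    local src=$1 dst=$2 ncol=${3:-16}   # 16 columns is an assumption
    # Delete line 1 only if it begins with '#'.
    sed '1{/^#/d;}' "$src" > "$dst" || return 1
    # Fail if any remaining row does not have the expected column count.
    awk -F'\t' -v n="$ncol" 'NF != n { bad = 1 } END { exit (bad ? 1 : 0) }' "$dst"
}

# Example:
#   strip_refgene_header species.refgene sp.refgene
```

A non-zero exit status suggests the download was truncated or a different output format was selected in the table browser.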

Conservation score

  • (Optional) Get PhastCons conservation scores. For the most common assemblies, see hg19_conserv, hg38_conserv, mm10_conserv and mm9_conserv, and use wigToBigWig to convert the files to bigWig. We provide hg19/mm9 conservation scores on our server; for other species, simply leave the conservation section of chilin.conf blank. Taking hg19 as an example:
wget -r -np -nd --accept=gz http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/phastCons46way/placentalMammals/
for c in chr*wig*gz
do
    # chr1.phastCons46way.placental.wigFix.gz -> chr1.bw
    bw="${c%phastCons46way.placental.wigFix.gz}bw"
    echo "$bw"
    # chrom_len is the path to your reference chromosome length file
    gunzip -c "$c" | wigToBigWig stdin chrom_len "$bw"
done
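The loop above needs a chrom_len file. If you have no table browser export at hand, the lengths can also be derived from the genome FASTA itself. The sketch below is an alternative to the table browser route, not the method used to build the packaged data. It writes two tab-separated columns (sequence name, length), which is the layout wigToBigWig reads as its chromosome sizes argument; whether every other ChiLin step accepts exactly this layout is an assumption. The helper name fasta_chrom_lengths is hypothetical.

```shell
# Sketch: print "name<TAB>length" for every sequence in a FASTA file.
# fasta_chrom_lengths is a hypothetical helper name.
fasta_chrom_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($1, 2)    # header text up to the first whitespace
            len = 0
            next
        }
        { len += length($0) }       # add up the bases on each sequence line
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Example:
#   fasta_chrom_lengths genome.fasta > chrom_len
```

The function assumes Unix line endings; a FASTA file with carriage returns would report each line one character too long.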