BETA Input Files Format Description

BETA Basic and BETA Plus require both a binding file and an expression file; BETA Minus requires only a binding file.


Required: Binding data, BED format with 3 columns (chrom, chromStart, chromEnd) or 5 columns (chrom, chromStart, chromEnd, name, score)

Example
chrom   start     end       name         score
chr1    1208689   1209509   AR_LNCaP_2   51.58
chr1    1334246   1335348   AR_LNCaP_7   54.55
chr1    2179351   2180790   AR_LNCaP_9   257.72

*** Do not include a header line in the BED file, and make sure the file is tab-delimited.
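The requirements above (3 or 5 tab-separated columns, no header) can be checked before running BETA. The following is a minimal sketch, not part of BETA itself; the function name parse_peak_line is hypothetical, and the check that chromStart is smaller than chromEnd is an added assumption based on the usual BED convention.

```python
def parse_peak_line(line):
    """Parse one line of a 3- or 5-column, tab-delimited BED peak file.

    Hypothetical helper for illustration; raises ValueError on a header
    line or on a line that is not tab-delimited.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (3, 5):
        raise ValueError(f"expected 3 or 5 tab-separated columns, got {len(fields)}")
    chrom = fields[0]
    # int() fails on header words such as "start", which rejects header lines.
    start, end = int(fields[1]), int(fields[2])
    if start >= end:
        raise ValueError("chromStart must be smaller than chromEnd")
    name = fields[3] if len(fields) == 5 else None
    score = float(fields[4]) if len(fields) == 5 else None
    return chrom, start, end, name, score


print(parse_peak_line("chr1\t1208689\t1209509\tAR_LNCaP_2\t51.58"))
```

A space-delimited line or a header line fails with ValueError, which is usually the quickest way to spot a malformed peak file.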

Required: Differential expression data

  • LIMMA standard output (LIM)

     ID (optional), RefseqID, logFC, AveExpre, T, Pvalue, Adj.P.Value, B

Example
12196   NM_001548_at   -6.945783684   9.632803007   -138.2402671   6.92E-10       2.08E-05   11.83285762
15675   NM_005409_at   -6.11280866    6.322508161   -117.5664651   1.51E-09       2.08E-05   11.57790488
12213   NM_001565_at   -6.352395593   7.838465214   -113.6000902   -113.6000902   2.08E-05   11.51589687

  • Cuffdiff standard output (CUF)

     Test_id, gene_id, gene, locus, sample1, sample2, status, value_1, value_2, log2(foldchange), test_stat, p_value, q_value, significant

Example
NM_000014   NM_000014   -   chr12:9217772-9268558      q1   q2   NOTEST   0.102845   0.0820513   -0.325878   0.498271    0.618293   1   no
NM_000015   NM_000015   -   chr8:18248754-18258723     q1   q2   NOTEST   0.127358   0.30975     1.28221     -1.32328    0.185744   1   no
NM_000016   NM_000016   -   chr1:76190042-76229355     q1   q2   NOTEST   0          0           0           0           1          1   no
NM_000017   NM_000017   -   chr12:121163570-121177811  q1   q2   NOTEST   3.47702    3.62422     0.0598207   -0.195815   0.844755   1   no

   • BETA specific format (BSF)

     GeneID, regulatory status (a signed value: + for up-regulated, - for down-regulated), statistical value (e.g. FDR or P value; the smaller the value, the more significant)

Example
NM_000014      -0.325878      0.618293
NM_000015      1.28221      0.185744
NM_000016      0      1
NM_000017      0.0598207      0.844755

   • Other format

    BETA also supports other differential expression formats, as long as the file contains a gene ID, a regulatory status and a statistical value. Specify the corresponding columns with the --info parameter.

*** The differential expression file should contain all the genes in the genome. BETA uses all of them to identify the static (unchanged) genes, and separates the up-regulated and down-regulated genes based on the threshold you supply.

*** Make sure your differential expression file has no header line, or put a '#' at the start of the header line.

*** If your gene ID is the official gene symbol, please add the parameter --gname2.

*** Although you can select the type of your differential expression format, it is safer to set the column information explicitly via --info, unless your file has exactly the same format as one of the examples above.
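The column triple given to --info can be pictured as a plain column selection that turns any supported file into the three-column BETA specific format. The sketch below is only an illustration of that idea and is not BETA's own parser; the function name to_bsf is hypothetical. Columns are 1-based. The defaults here (2, 10, 12) are chosen to reproduce the BSF example above from the Cuffdiff example, which appears to use p_value; the documented Cuffdiff default for --info is 2,10,13 (q_value).

```python
def to_bsf(lines, id_col=2, status_col=10, stat_col=12):
    """Yield 'geneID<TAB>status<TAB>statistic' rows from tab-delimited lines.

    Hypothetical helper: column numbers are 1-based, and lines that are
    empty or start with '#' (a commented-out header) are skipped.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[c - 1] for c in (id_col, status_col, stat_col))


cuffdiff_row = "\t".join([
    "NM_000014", "NM_000014", "-", "chr12:9217772-9268558", "q1", "q2",
    "NOTEST", "0.102845", "0.0820513", "-0.325878", "0.498271",
    "0.618293", "1", "no",
])
for row in to_bsf(["#header line", cuffdiff_row]):
    print(row)
```

Writing the selected columns to a separate file and checking a few rows by eye is a cheap way to confirm that the column numbers passed to --info point at the intended values.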

Option: boundary file (--bf): BED format (6 columns, as shown below)

Example
chrom   start    end      name   score   strand
chr1    521336   521779   3      0.986   +
chr1    839881   840447   19     0.986   +
chr1    968212   968748   48     1       +

*** Do not include a header line in the BED file, and make sure the file is tab-delimited.

Option: Genome annotation (-r): Downloaded from UCSC

BETA provides hg38, hg19, hg18, mm10, and mm9 annotation. The annotation reference file should contain the following columns, in this order: refseqID, chrom, strand, txStart, txEnd, geneSymbol.

Example
refseqID       chrom   strand   start      end        gname2
NM_032291      chr1    +        66999824   67210768   SGIP1
NM_001301823   chr1    +        33546729   33586132   AZIN2
NM_032785      chr1    -        48998526   50489626   AGBL4

*** Do not include a header line in the annotation file, and make sure the file is tab-delimited.
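As an illustration of the six-column layout above, the sketch below parses one annotation line and derives the transcription start site (TSS). It is not BETA's own code; the names Gene and parse_annotation_line are hypothetical, and the TSS rule (txStart on the + strand, txEnd on the - strand) is the usual convention, assumed here rather than taken from this document.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """One row of the annotation file (hypothetical container)."""
    refseq_id: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    symbol: str

    @property
    def tss(self):
        # Assumed convention: the TSS is txStart on '+', txEnd on '-'.
        return self.tx_start if self.strand == "+" else self.tx_end


def parse_annotation_line(line):
    """Parse 'refseqID chrom strand txStart txEnd geneSymbol' (tab-delimited)."""
    refseq_id, chrom, strand, start, end, symbol = line.rstrip("\n").split("\t")
    return Gene(refseq_id, chrom, strand, int(start), int(end), symbol)


gene = parse_annotation_line("NM_032785\tchr1\t-\t48998526\t50489626\tAGBL4")
print(gene.symbol, gene.tss)
```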

Option: Whole genome sequence data: fasta format (--gs)

The format is like:

Example

>chr1: xxxx-yyyyy

ATCGGGACTTGACCC…

>chr2: xxxx-yyyyy

AGCGTGACTAGAGCC…

...

BETA Basic

BETA Basic performs factor function prediction and direct target detection.

Command Line: $ BETA basic -p 3656_peaks.bed -e AR_expr.xls -k LIM -g hg19 --da 500 -n basic


-p specifies the name of TF binding data

-e specifies the name of the corresponding differential expression data

-k LIM stands for the LIMMA format

-g specifies the genome of your data: hg19 for human or mm9 for mouse; for other genomes, omit this option

-n specifies the prefix of the output files; if omitted, BETA uses 'NA'

--da 500 selects the top 500 most significantly changed genes, for up-regulated and down-regulated genes separately
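When BETA is driven from a script, the command line above can be assembled as an argument list. The sketch below only builds and prints the command; the function name beta_basic_command is hypothetical, and it uses no options beyond those described above.

```python
import shlex


def beta_basic_command(peaks, expr, kind="LIM", genome="hg19", top=500, name="basic"):
    """Build the argument list for a 'BETA basic' run (hypothetical helper)."""
    return ["BETA", "basic", "-p", peaks, "-e", expr, "-k", kind,
            "-g", genome, "--da", str(top), "-n", name]


cmd = beta_basic_command("3656_peaks.bed", "AR_expr.xls")
print(shlex.join(cmd))
# To actually run it (assumes BETA is installed and on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.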

BETA Plus

BETA Plus performs TF activating and repressive function prediction, direct target detection, and motif analysis in the target regions.

Command Line: $ BETA plus -p 3656_peaks.bed -e AR_expr.xls -k LIM -g hg19 --gs hg19.fa --bl


--gs is required for the motif scan; it specifies the whole genome sequence in FASTA format.

--bl is an optional switch; when it is on, boundaries are taken into account

For other parameters, see BETA Basic above

BETA Minus

BETA Minus detects TF target genes based on the regulatory potential score, using binding data only.

Command Line: $ BETA minus -p 3656_peaks.bed --bl -g hg19


For all the parameters in this command line, see BETA Basic and BETA Plus above

BETA Optional Parameters

BETA provides optional parameters for analyzing different datasets.


-n NAME, --name NAME

This argument is used to name the result file.

-o OUTPUT, --output OUTPUT

The directory to store all the output files.

--gname2

If this switch is on, gene or transcript IDs in the file given through -e are treated as official gene symbols. DEFAULT=FALSE

--info EXPREINFO

Specify the columns holding the gene ID, up/down status and statistical value in your expression data. DEFAULT: 2,5,7 for LIMMA; 2,10,13 for Cuffdiff; and 1,2,3 for BETA specific format

--pn PEAKNUMBER

The number of peaks that contribute to the regulatory potential score. DEFAULT=10000

-d DISTANCE, --distance DISTANCE

A distance in bases. BETA uses the peaks within this distance from the gene TSS. DEFAULT=100000 (100kb)

--df DIFF_FDR

A number between 0 and 1 used as a threshold to pick out the most significantly differentially expressed genes by FDR or another statistical value. DEFAULT=1, that is, select all the genes

--da DIFF_AMOUNT

Select the most significantly differentially expressed genes by fraction (0-1) or by number (larger than 1). Genes are ranked by the statistical value. For example, with 2000, BETA takes the top 2000 up-regulated and the top 2000 down-regulated genes as the differentially expressed genes. DEFAULT=0.5, that is, the top 50 percent of the up-regulated and of the down-regulated genes separately

-c CUTOFF, --cutoff CUTOFF

A number between 0 and 1 used as the threshold of the one-tailed KS test to determine whether two datasets differ significantly. DEFAULT=1e-3

-r REFERENCE, --reference REFERENCE

The refgene info file downloaded from the UCSC Genome Browser. Supply this file only if your genome is neither hg19 nor mm9

--bl BOUNDARY LIMIT

Boolean value. Whether or not to use CTCF boundaries to find a peak's associated gene. DEFAULT=FALSE

--bf BOUNDARYFILE

A BED format boundary file. Use this only when you set --bl and the genome is neither hg19 nor mm9
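The behaviour of --da described above (a fraction between 0 and 1, or a gene count larger than 1, applied to up-regulated and down-regulated genes separately) can be sketched as follows. This is an illustration under stated assumptions, not BETA's implementation: the function name select_top is hypothetical, genes are ranked with the smallest statistical value first, fractions are rounded down, and the interaction with --df is not modelled.

```python
def select_top(genes, amount=0.5):
    """Split genes into (top_up, top_down) lists.

    genes: iterable of (gene_id, change, stat) tuples, where the sign of
    'change' gives the direction and a smaller 'stat' is more significant.
    amount: fraction in (0, 1], or a gene count if larger than 1.
    """
    def top(group):
        ranked = sorted(group, key=lambda g: g[2])
        # Assumption: values above 1 are counts, everything else a fraction.
        n = int(amount) if amount > 1 else int(len(ranked) * amount)
        return ranked[:n]

    genes = list(genes)
    up = [g for g in genes if g[1] > 0]
    down = [g for g in genes if g[1] < 0]
    return top(up), top(down)


example = [("a", 1.2, 0.01), ("b", 2.0, 0.001), ("c", -1.0, 0.2),
           ("d", -0.5, 0.05), ("e", 0.0, 1.0)]
up, down = select_top(example, 0.5)
print([g[0] for g in up], [g[0] for g in down])
```

Genes with no change (such as "e" in the example) fall into neither list, which corresponds to the static genes mentioned in the input file notes.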

BETA Output Files

Active or Repressive Function Prediction

Direct Target Prediction Files


Up Regulate Target; Down Regulate Target Format
Chroms   txStart     txEnd       RefseqID       RP          Strands   GeneSymbol
chr19    51376688    51383823    NM_001256080   2.186e-07   +         KLK2
chr19    51376688    51383823    NM_005551      2.186e-07   +         KLK2
chr19    51376688    51383823    NR_045762      2.186e-07   +         KLK2
chr19    51376688    51383823    NR_045763      2.186e-07   +         KLK2
chr19    51376688    51383823    NM_001002231   2.186e-07   +         KLK2
chr1     207191865   207206101   NM_023938      8.822e-07   -         C1orf116
chr1     207191865   207206101   NM_001083924   8.822e-07   -         C1orf116
chr21    42836477    42880085    NM_005656      1.033e-06   -         TMPRSS2
chr21    42836477    42879992    NM_001135099   1.041e-06   -         TMPRSS2

Target Gene Associated Peaks

chrom   start      end        RefseqID       GeneSymbol   Distance   Score
chr19   51354060   51354999   NM_001256080   KLK2         -22159     0.249983590819
chr19   51372841   51373704   NM_001256080   KLK2         -3416      0.529067106385
chr19   51392207   51393248   NM_001256080   KLK2         16039      0.319320493096
chr19   51354060   51354999   NM_005551      KLK2         -22159     0.249983590819
chr19   51372841   51373704   NM_005551      KLK2         -3416      0.529067106385
chr19   51392207   51393248   NM_005551      KLK2         16039      0.319320493096
chr19   51354060   51354999   NR_045762      KLK2         -22159     0.249983590819
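The Score column in the table above is consistent with an exponential decay of the peak's contribution with its distance from the TSS. The sketch below assumes the form exp(-(0.5 + 4 * |distance| / 100000)) given in the BETA publication; the function name peak_score and the scale parameter are hypothetical, and whether the scale follows the -d option is not stated in this document.

```python
import math


def peak_score(distance, scale=100_000):
    """Per-peak score, assuming exp(-(0.5 + 4 * |distance| / scale)).

    'distance' is the signed peak-to-TSS distance in bases, as in the
    Distance column; 'scale' is an assumed normalisation of 100 kb.
    """
    return math.exp(-0.5 - 4.0 * abs(distance) / scale)


# Distances taken from the KLK2 rows of the example table.
for distance in (-22159, -3416, 16039):
    print(distance, peak_score(distance))
```

Under this assumption, a peak directly at the TSS scores exp(-0.5), and the score drops by a factor of exp(-4) for every 100 kb of distance.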

Target Gene Associated Motif Analysis

Motif analysis results are summarized in an HTML file.

Motif Text Files Also Provided

MotifID   Species   Symbol   DNA BindDom   PSSM   Tscore   Pvalue
