MARGE Input Files and Output Files

BAM file


If you have an H3K27ac ChIP-seq dataset and want to identify the key genes in a cell type, a tissue, or a specific condition, MARGE can help you find the key regulated genes (or master regulators).

When you first obtain H3K27ac ChIP-seq raw data, it is in FASTQ format. We recommend mapping the reads with BWA (version 0.7.10); you can also use other aligners such as Bowtie.

Once you have mapped your FASTQ file to the reference genome, you will normally end up with a SAM or BAM alignment file. SAM stands for Sequence Alignment/Map format, and BAM is the binary version of a SAM file. We recommend using BAM format alignments as the input to MARGE. You can use samtools to convert SAM format to BAM format.


*** Make sure the BAM files have '.bam' as the suffix.
*** If you want to process multiple BAM files at the same time, put them in the same directory.
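
The two notes above can be checked before running MARGE. The following is a minimal sketch and not part of MARGE: the helper names scan_alignment_dir and samtools_convert_command are hypothetical, and the command it builds (samtools view -b -o) is the usual SAM-to-BAM conversion, which is only constructed here, not executed.

```python
from pathlib import Path


def scan_alignment_dir(directory):
    """Return (bam_names, sam_files) found directly inside `directory`.

    Hypothetical helper, not part of MARGE: it only inspects file suffixes.
    """
    directory = Path(directory)
    files = [p for p in directory.iterdir() if p.is_file()]
    # Files usable as input: the suffix must be exactly '.bam'.
    bam_names = sorted(p.stem for p in files if p.suffix == ".bam")
    # SAM files still have to be converted to BAM first.
    sam_files = sorted(p for p in files if p.suffix == ".sam")
    return bam_names, sam_files


def samtools_convert_command(sam_path):
    """Build (but do not run) the samtools command that converts SAM to BAM."""
    sam_path = Path(sam_path)
    bam_path = sam_path.with_suffix(".bam")
    return ["samtools", "view", "-b", "-o", str(bam_path), str(sam_path)]
```

The returned command list can be passed to subprocess.run(..., check=True) once samtools is installed.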

If the BAM file is aligned to hg38 or mm10, MARGE will output Regulatory Potential and Relative Regulatory Potential.
If the BAM file is aligned to hg19 or mm9, MARGE will output Regulatory Potential.
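
The assembly-dependent behaviour described in this document can be summarized as a small lookup. This is only an illustrative sketch: the names OUTPUTS_BY_ASSEMBLY, CIS_REGION_ASSEMBLIES, and describe_assembly are hypothetical and do not come from MARGE.

```python
# Outputs for BAM input, keyed by genome assembly, as stated in this document.
OUTPUTS_BY_ASSEMBLY = {
    "hg38": ["Regulatory Potential", "Relative Regulatory Potential"],
    "mm10": ["Regulatory Potential", "Relative Regulatory Potential"],
    "hg19": ["Regulatory Potential"],
    "mm9": ["Regulatory Potential"],
}

# Assemblies for which cis-region prediction from gene lists is available.
CIS_REGION_ASSEMBLIES = {"hg38", "mm10"}


def describe_assembly(assembly):
    """Return what this document says MARGE can do for a given assembly."""
    if assembly not in OUTPUTS_BY_ASSEMBLY:
        raise ValueError(f"assembly not covered by this document: {assembly!r}")
    return {
        "bam_outputs": OUTPUTS_BY_ASSEMBLY[assembly],
        "cis_region_prediction": assembly in CIS_REGION_ASSEMBLIES,
    }
```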

Gene List file


Suppose you have a list of genes of interest and want to detect the cis-regulatory regions that might regulate the expression of these genes. In that case, you can run MARGE with the following input format.

MARGE supports two Gene List formats: Gene_Only and Gene_Response.
*** Both Gene_Only and Gene_Response format files should have '.txt' as the suffix.
*** If you want to process multiple gene list files at the same time, put them in the same directory.

Gene_Only format: one column, one gene per row. Both GeneSymbol IDs and RefSeq IDs are supported by MARGE.

Example
AR
BAGE4
CST1
DLX3

Gene_Response format: two columns, one gene per row. The first column is the gene ID and the second column is 1 (target) or 0 (non-target). Both GeneSymbol IDs and RefSeq IDs are supported by MARGE.

Example
NM_000397	1
NR_045675	0
NM_033518	1
NM_052939	1
NM_002290	0
NM_000495	0

*** Do not include a header in the gene list file, and make sure the Gene_Response file is tab-delimited.
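
The format rules above can be checked before handing a file to MARGE. The sketch below is a hypothetical validator, not MARGE's own parser: the function name parse_gene_list is an assumption, and skipping blank lines is a choice of this sketch rather than documented MARGE behaviour. A header row in a Gene_Response file is caught because its second column is neither 0 nor 1.

```python
def parse_gene_list(text, exptype):
    """Parse Gene_Only or Gene_Response content; raise ValueError on bad rows.

    Hypothetical validator: it only enforces the rules stated in this document.
    """
    if exptype not in ("Gene_Only", "Gene_Response"):
        raise ValueError(f"unknown gene list format: {exptype!r}")
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored by this sketch
        fields = line.split("\t")
        if exptype == "Gene_Only":
            if len(fields) != 1:
                raise ValueError(f"line {lineno}: expected exactly one column")
            rows.append(fields[0].strip())
        else:
            if len(fields) != 2:
                raise ValueError(f"line {lineno}: expected two tab-separated columns")
            gene, response = fields[0].strip(), fields[1].strip()
            if response not in ("0", "1"):
                raise ValueError(f"line {lineno}: second column must be 0 or 1")
            rows.append((gene, int(response)))
    return rows
```

For a file on disk, read it first (for example with pathlib.Path(path).read_text()) and pass the text together with the format name.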

!! MARGE can perform cis-region prediction only when the assembly is hg38 or mm10.

Output: the 10 most relevant H3K27ac samples, which best explain the expression status of these genes. MARGE then uses the information from these 10 samples to predict cis-regions that might regulate the expression of these genes.

Usage

All steps below have to be executed in a terminal.


Testing MARGE

We provide test data and a config.json file on the Download page.

For usage of MARGE on real data, please see all steps below.

Step 1: Choosing a workflow directory

To predict the key regulated genes or cis-regulatory regions in human or mouse with MARGE, first select a directory where the workflow will be executed and the results will be stored. Choose a meaningful name and ensure that the directory is empty.

Step 2: Initializing a new workflow

We assume that you have chosen a workflow directory (here path/to/my/workflow) and that you want to analyze with MARGE a set of BAM files (sample1.bam sample2.bam sample3.bam sample4.bam...) in one directory (path/to/marge_testdata/), a set of Gene List files (genelist1.txt genelist2.txt genelist3.txt...) in one directory (path/to/marge_testdata/), or both. You can initialize the workflow with

$ marge init path/to/my/workflow

This will initialize the MARGE workflow in the given directory. It will install a Snakefile and a config file in this directory. Configure the config file according to your needs.
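
Steps 1 and 2 can be scripted. The sketch below is a hypothetical wrapper, not part of MARGE: prepare_workflow_dir and run_init are illustrative names, and creating a missing directory beforehand is a choice of this sketch. The only MARGE command used is the marge init call shown above.

```python
import subprocess
from pathlib import Path


def prepare_workflow_dir(workflow_dir):
    """Ensure the workflow directory exists and is empty; return the init command."""
    workflow_dir = Path(workflow_dir)
    # Creating the directory first is this sketch's choice, not a MARGE requirement.
    workflow_dir.mkdir(parents=True, exist_ok=True)
    if any(workflow_dir.iterdir()):
        raise ValueError(f"workflow directory is not empty: {workflow_dir}")
    return ["marge", "init", str(workflow_dir)]


def run_init(workflow_dir):
    """Run 'marge init' in a checked directory (requires MARGE to be installed)."""
    subprocess.run(prepare_workflow_dir(workflow_dir), check=True)
```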

Step 3: Configure the workflow

Now change the parameters and paths in the config.json manually. First change into the workflow directory with

$ cd path/to/my/workflow

and open the config.json file. Edit it to suit your needs. In particular, define ASSEMBLY for this job; MARGEdir, the path to the MARGE source code directory; REFdir, the path to the reference directory, which you can download from MARGE Library; SAMPLESDIR and SAMPLES for the BAM samples; EXPSDIR and EXPS for the Gene List samples; and EXPTYPE for the Gene List file format, together with the ID type (RefSeq or GeneSymbol) of the genes used in the Gene List. Also set the paths to tools such as MACS2, bedGraphToBigWig, bedClip, bigWigAverageOverBed, and bigWigSummary.
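
A quick sanity check of the edited config.json can catch omissions before Snakemake runs. The sketch below is hypothetical and rests on assumptions: it assumes the parameters named above are top-level keys of the JSON object and that EXPTYPE takes the values Gene_Only or Gene_Response as spelled in this document. Check the config.json generated by marge init for the actual layout.

```python
import json

# Parameter names taken from the text above; their placement at the top
# level of config.json is an assumption of this sketch.
EXPECTED_KEYS = [
    "ASSEMBLY", "MARGEdir", "REFdir",
    "SAMPLESDIR", "SAMPLES", "EXPSDIR", "EXPS", "EXPTYPE",
]
KNOWN_ASSEMBLIES = {"hg38", "mm10", "hg19", "mm9"}
KNOWN_EXPTYPES = {"Gene_Only", "Gene_Response"}


def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = [f"missing key: {key}" for key in EXPECTED_KEYS if key not in config]
    assembly = config.get("ASSEMBLY")
    if assembly and assembly not in KNOWN_ASSEMBLIES:
        problems.append(f"unexpected ASSEMBLY: {assembly}")
    exptype = config.get("EXPTYPE")
    if exptype and exptype not in KNOWN_EXPTYPES:
        problems.append(f"unexpected EXPTYPE: {exptype}")
    return problems


def check_config_file(path):
    """Load a config.json file and run check_config on it."""
    with open(path) as handle:
        return check_config(json.load(handle))
```

An empty list means none of the checks in this sketch failed; it does not guarantee that MARGE will accept the file.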

Step 4: Execute the workflow

Once configured, the workflow can be executed with Snakemake. First, it is advisable to invoke a dry-run of the workflow with

$ snakemake -n

which will display all jobs that will be executed. If there are errors in your config file, you will be notified by Snakemake. The actual execution of the workflow can be started with

$ snakemake --cores 8

here using 8 CPU cores in parallel. For more information on how to use Snakemake, refer to the Snakemake Documentation.
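
The dry-run followed by the real run can also be driven from Python. This is a minimal sketch using only the two snakemake invocations shown above; the function names snakemake_commands and run_workflow are hypothetical, and running them requires Snakemake on the PATH.

```python
import subprocess


def snakemake_commands(cores=8):
    """Return the dry-run and real-run commands used in Step 4."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return [["snakemake", "-n"], ["snakemake", "--cores", str(cores)]]


def run_workflow(workflow_dir, cores=8):
    """Dry-run first; start the real run only if the dry-run succeeds."""
    dry_run, real_run = snakemake_commands(cores)
    # check=True raises CalledProcessError if Snakemake reports an error.
    subprocess.run(dry_run, cwd=workflow_dir, check=True)
    subprocess.run(real_run, cwd=workflow_dir, check=True)
```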