CaSNP Project:


CaSNP Guide

Intro     Input     Browse     Output     Visualization     Heatmap


Cancer genomes are known to carry abundant copy number alterations (CNAs), which are characterized by distinct, structurally variant regions across the genome. Such genomic alterations contribute greatly to cancer pathogenesis and progression, and interrogating CNA regions could identify oncogenes and tumor suppressor genes and help infer cancer mechanisms. Comparative genomic hybridization (CGH) and array CGH (aCGH) have been the traditional approaches for detecting CNAs; however, recent technological developments in SNP arrays allow CNAs to be identified at unprecedented resolution.
In this study, we analyzed 6348 Affymetrix SNP arrays covering 34 different cancer types from 64 studies to profile the genome-wide CNAs and SNPs in each. This includes all the cancer SNP profiles generated on Affymetrix SNP arrays (10K to 6.0) with raw data available from GEO, plus additional arrays from the TCGA consortium and a few individual publications.
Despite its large collection of SNP array data, CaSNP is user friendly. The user simply inputs a genomic region (a gene or a genomic coordinate range), optional threshold values for Gain and Loss, and an optional filter for cancer type. The database returns a .bed file with the CNAs displayed in the UCSC Genome Browser, together with a summary of the Gain/Loss frequencies and the average copy number, both overall and for each individual study returned.



When querying the database, only one input is strictly required: the 'Genomic position'. All other input items are optional. The 'Genomic position' can be specified by the accession number of a gene, a cytological band, or a chromosomal coordinate range. Listed below are examples of valid inputs for positions.
A gene(refSeqID): BRCA1
A coordinate range: chr3:1-1000000
A gene name: ERBB2
A miRNA ID: mir-217


  • In terms of gene ID systems, CaSNP currently recognizes RefSeq IDs only.
  • For efficiency, only coordinate ranges no longer than 1 Mbp are accepted.
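The coordinate-range rule above can be illustrated with a short check. This is only a sketch, not CaSNP's actual server code: the function name `parse_coordinate_range`, the regular expression, and the choice to count both endpoints when measuring the range length are all assumptions made for the example.

```python
import re

MAX_RANGE_BP = 1_000_000  # 1 Mbp limit stated in the guide

# Hypothetical pattern for inputs such as "chr3:1-1000000"
_RANGE_RE = re.compile(r"^(chr[0-9A-Za-z]+):(\d+)-(\d+)$")


def parse_coordinate_range(text):
    """Return (chrom, start, end) or raise ValueError for an invalid range."""
    # Tolerate spaces and thousands separators, e.g. "chr22: 18,632,757 - 18,660,162"
    cleaned = text.replace(",", "").replace(" ", "")
    match = _RANGE_RE.match(cleaned)
    if match is None:
        raise ValueError(f"not a coordinate range: {text!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    # Assumption: the length counts both endpoints (inclusive coordinates)
    if end - start + 1 > MAX_RANGE_BP:
        raise ValueError("range is longer than 1 Mbp")
    return chrom, start, end


print(parse_coordinate_range("chr3:1-1000000"))
```

Under these assumptions the guide's own example, chr3:1-1000000, is exactly at the limit and is accepted, while chr3:1-2000000 would be rejected.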
A two-level dynamic pull-down menu is used to specify the cancer types of interest. The primary level selects a general cancer type, e.g. lung cancer; the secondary level selects a detailed subtype within that category, e.g. small cell lung carcinoma or lung adenocarcinoma (usually a histological/pathological classification). Most cancer types have a subtype whose name ends with 'NOS', which means 'Unspecified'. This category contains all the samples that could not be assigned to a specific subtype, usually because clinical annotation is lacking. The 'Cancer type' input is optional: if the primary menu is left as '---Choose---', CaSNP uses all cancer samples in the database, and if only the secondary menu is left as '---Choose---', it uses all samples of the selected primary type.
Here are examples of several combinations.
Primary level    Secondary level        Response
---choose---     ---choose---           All cancer samples in the database
lung cancer      ---choose---           All lung cancer samples
lung cancer      lung adenocarcinoma    All lung adenocarcinoma samples
lung cancer      lung carcinoma NOS     All lung cancer samples that could not be assigned to a subtype
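The menu combinations above amount to a two-step filter. The sketch below shows one way to express it; the sample records and their keys `primary` and `secondary` are hypothetical and do not reflect CaSNP's internal data model.

```python
PLACEHOLDER = "---choose---"  # menu value meaning "no selection"


def select_samples(samples, primary=PLACEHOLDER, secondary=PLACEHOLDER):
    """Filter sample records by primary and, if given, secondary cancer type.

    `samples` is a list of dicts with the hypothetical keys
    'primary' and 'secondary'.
    """
    if primary.lower() == PLACEHOLDER:
        # No primary type chosen: use every sample in the database
        return list(samples)
    chosen = [s for s in samples if s["primary"] == primary]
    if secondary.lower() != PLACEHOLDER:
        chosen = [s for s in chosen if s["secondary"] == secondary]
    return chosen


# Illustrative records only, not real CaSNP data
demo = [
    {"id": "a", "primary": "lung cancer", "secondary": "lung adenocarcinoma"},
    {"id": "b", "primary": "lung cancer", "secondary": "lung carcinoma NOS"},
    {"id": "c", "primary": "breast cancer", "secondary": "breast carcinoma NOS"},
]
print([s["id"] for s in select_samples(demo, "lung cancer")])
```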

The third input item on the main page, also optional, is the threshold for the degree of copy number aberration. When it is set, CaSNP returns the percentage of samples whose copy numbers fall outside, or within, the user-specified bounds.
Input          Response
GT2.2          Return the percentage of samples with copy number greater than 2.2
GT2.2 LT1.8    Return the percentage of samples with copy number greater than 2.2, or less than 1.8, separately
GT1.8 LT2.2    Return the percentage of samples with copy number less than 2.2 and greater than 1.8
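The three cases in the table can be read as a single rule: when the GT value is smaller than the LT value the two bounds enclose an interval, otherwise each bound is reported separately. The sketch below works under that reading; the function names, the result keys, and the use of one copy number per sample are assumptions for illustration rather than CaSNP's implementation.

```python
def parse_thresholds(text):
    """Parse a string such as 'GT2.2 LT1.8' into (gt, lt); missing parts are None."""
    gt = lt = None
    for token in text.split():
        kind, value = token[:2].upper(), float(token[2:])
        if kind == "GT":
            gt = value
        elif kind == "LT":
            lt = value
        else:
            raise ValueError(f"unknown threshold token: {token!r}")
    return gt, lt


def threshold_percentages(copy_numbers, gt=None, lt=None):
    """Return the percentage of samples matching each threshold."""
    total = len(copy_numbers)
    if total == 0:
        return {}
    if gt is not None and lt is not None and gt < lt:
        # Bounds enclose an interval, e.g. GT1.8 LT2.2
        inside = sum(1 for cn in copy_numbers if gt < cn < lt)
        return {f"{gt}<CN<{lt}": 100.0 * inside / total}
    result = {}
    if gt is not None:
        result[f">{gt}"] = 100.0 * sum(1 for cn in copy_numbers if cn > gt) / total
    if lt is not None:
        result[f"<{lt}"] = 100.0 * sum(1 for cn in copy_numbers if cn < lt) / total
    return result


# Illustrative per-sample copy numbers, not real data
values = [1.5, 2.0, 2.0, 2.5]
print(threshold_percentages(values, *parse_thresholds("GT2.2 LT1.8")))
print(threshold_percentages(values, *parse_thresholds("GT1.8 LT2.2")))
```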



On the 'Browse Data' page, users can skim through all the datasets (series) available in CaSNP together with their annotation, such as the experiment abstract, overall design, and platform, as well as authorship and citation information. More importantly, users can put any individual dataset, or any combination of datasets, into an analysis simply by ticking the checkboxes on the left side, without being restricted by the cancer categories.



CaSNP provides both tabulated and graphical views of the user's query. In the result table, each row represents an individual dataset. The title of the table indicates the genomic coordinates of the user-defined region*. Query results, including series information, platform information, and statistical results, are displayed in columns. The first two columns show the GEO ID of the dataset and the types of array involved in the experiment. The average copy number is calculated for each sample within the series, and the average of these per-sample values is displayed in the column 'Avg. CN'. The columns 'UCSC' and 'Get BED' are described in the next section, Data Visualization. The column '#Samples' indicates how many samples in the series contain data for the defined query. The next column, '#SNP', gives the number of SNP markers located in the query region. For example, if the 'Array Type' column shows that two types of arrays, Affy 250K Sty and Affy 250K Nsp, are involved in the series, and these arrays have 1 and 3 markers respectively within the specified region, then the value of '#SNP' is 4. Two further columns may not appear in every query. If a threshold is set on the query page, a column titled 'Freq: <##' or 'Freq: >##' is generated for each threshold value. Each such column contains two sub-columns giving the number and the percentage of samples within the threshold. A column shaded sky-blue can be used to sort the results by clicking on it.

  • When a narrow region is given for a query, the system needs a broader region in order to provide better statistical results. Therefore any type of user input is first converted into genomic coordinates (for example, NM_017414 is converted to chr22: 18632757 - 18660162) and then extended by 10 kb both upstream and downstream.
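The region extension and the 'Avg. CN' and '#SNP' columns described above can be sketched in a few lines. The function names and input layouts here are hypothetical, 10 kb is taken to mean 10,000 bp, and clamping the start at position 1 is an added assumption.

```python
from statistics import mean

FLANK_BP = 10_000  # assumption: "10 kb" means 10,000 bp


def extend_region(chrom, start, end, flank=FLANK_BP):
    """Extend a region by `flank` bp on both sides, never going below position 1."""
    return chrom, max(1, start - flank), end + flank


def series_average_cn(samples):
    """Average the per-sample mean copy numbers of one series.

    `samples` maps a sample ID to the list of copy numbers of its
    markers inside the query region (hypothetical layout).
    """
    per_sample = [mean(values) for values in samples.values() if values]
    return mean(per_sample) if per_sample else None


def count_snps(markers_per_array):
    """Sum the markers contributed by each array type, as in the '#SNP' column."""
    return sum(markers_per_array.values())


print(extend_region("chr22", 18632757, 18660162))
print(series_average_cn({"s1": [2.0, 2.0], "s2": [3.0, 1.0, 2.0, 4.0]}))
print(count_snps({"Affy 250K Sty": 1, "Affy 250K Nsp": 3}))
```

Note that averaging per sample first and then across samples can differ from pooling all marker values, which is why the two steps are kept separate in the sketch.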



Data visualization is integrated with the query output. After a query is performed, two columns named 'UCSC' and 'Get BED' appear in the middle of the result page. Their contents are hyperlinks that point to the UCSC Genome Browser and to a file download, respectively. By clicking a UCSC link, the user can browse a bedGraph track in the UCSC Genome Browser. Below is a typical example of a bedGraph.
In the example, the user defined a query for NM_017414 and set the thresholds to >2.3 and <1.7. The custom track is loaded automatically. The track title shows the GEO ID and the number of markers in the query region. The values in the track are the percentages of samples whose copy number at each individual SNP marker passes a threshold. The percentage is calculated for each SNP marker site: if 20 samples have a marker at one site and 12 of them have copy numbers larger than 2.3, the percentage for this site is 60%. Likewise, if 2 of them have copy numbers smaller than 1.7, the percentage is -10%. In the image, bars for positive percentages are drawn in orange and bars for negative percentages are drawn in blue. Users can download the custom track file by clicking the 'Get BED' link next to the UCSC link.
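The per-marker arithmetic in this example can be reproduced with a short sketch. The function names are hypothetical, and the bedGraph helper assumes a 1-based marker position written as a one-base interval in bedGraph's 0-based, half-open coordinates; how CaSNP actually lays out its track file is not described in this guide.

```python
def marker_percentages(copy_numbers, gain=2.3, loss=1.7):
    """Return (gain %, loss %) for one marker site; loss is reported as negative."""
    total = len(copy_numbers)
    gain_pct = 100.0 * sum(1 for cn in copy_numbers if cn > gain) / total
    loss_pct = -100.0 * sum(1 for cn in copy_numbers if cn < loss) / total
    return gain_pct, loss_pct


def bedgraph_line(chrom, position, value):
    """Format one bedGraph data line for a single 1-based marker position."""
    return f"{chrom}\t{position - 1}\t{position}\t{value:g}"


# 20 illustrative samples: 12 gained, 2 lost, 6 near normal
site = [2.5] * 12 + [1.5] * 2 + [2.0] * 6
gain_pct, loss_pct = marker_percentages(site)
print(gain_pct, loss_pct)
print(bedgraph_line("chr22", 18640000, gain_pct))
```

The marker position used in the last line is made up for the demonstration.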



CaSNP provides a heatmap function to give users a more direct view of the raw data in the database. The heatmap page offers a set of parameters similar to that of the query page. After the query is finished, the resulting heatmaps are displayed. The legend indicates the color key used for drawing the heatmaps. The raw copy numbers are log-transformed and 1 is then subtracted, so a raw copy number of 2 (in most cases the normal copy number) is converted to 0 for drawing. Red represents higher copy number, blue represents lower copy number, and white represents normal. The query region is also extended for heatmap drawing, by 100 kb both upstream and downstream.
In each heatmap, rows are the samples involved and columns are the individual markers within the query region. The heatmaps are sorted first by the number of columns (markers included) and then by the number of rows (samples included). Two black blocks above each heatmap mark the boundaries of the specified query region (before the 100 kb extension). The columns between the two blocks are the markers within the query region.

  • If a copy number is still larger than 2 or smaller than -2 after the log conversion, its value is limited to 2 or -2.
  • Please note that gender information is not available for some series, so in some cases a male sample may be treated as female during copy number calculation. Therefore, when genes located on the X chromosome are queried, such as 'AR' in prostate cancer, the heatmap may show an apparently significant deletion, because males carry only one X chromosome.
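The conversion and clipping used for heatmap drawing can be summarized in one small function. This is a sketch under two assumptions: the logarithm is base 2 (the base for which a raw copy number of 2 maps to 0, as the guide states), and non-positive raw values, which have no logarithm, are mapped to the lower limit. The name `heatmap_value` is hypothetical.

```python
import math


def heatmap_value(raw_cn, limit=2.0):
    """Convert a raw copy number to the value used for coloring.

    Computes log2(raw_cn) - 1 and clips the result to [-limit, limit].
    """
    if raw_cn <= 0:
        # Assumption: treat values without a logarithm as maximal loss
        return -limit
    value = math.log2(raw_cn) - 1.0
    return max(-limit, min(limit, value))


for cn in (1, 2, 4, 32):
    print(cn, heatmap_value(cn))
```

With these assumptions a raw copy number of 2 is drawn as 0 (white), 1 as -1 (blue), 4 as 1 (red), and 32 is clipped to the upper limit of 2.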