Cancer genomes are known to carry abundant copy number alterations (CNAs), which are characterized by distinct and structurally variant regions across the genome. Such genomic alterations contribute greatly to cancer pathogenesis and progression, and interrogating CNA regions could identify oncogenes and tumor suppressor genes and shed light on cancer mechanisms. CGH and aCGH have been the traditional approaches for detecting CNA; however, recent technological developments in SNP arrays have allowed CNA to be identified at unprecedented resolution.
In this study, we analyzed 6348 Affymetrix SNP arrays covering 34 different cancer types from 64 studies to profile genome-wide CNA and SNPs in each. The collection includes all cancer SNP profiles generated on Affymetrix SNP arrays (10K to 6.0) with raw data available from GEO, plus additional arrays from the TCGA consortium and a few individual publications.
Despite its large collection of SNP array data, CaSNP is user friendly. A user simply enters a genomic region (a gene or a genomic coordinate), optional Gain and Loss threshold values, and an optional cancer type filter. The database returns a .bed file with the CNA displayed in the UCSC Genome Browser, along with a summary of the Gain/Loss frequencies and the average copy number, both overall and for each individual study returned.
When querying dCHIP, only one input is required: the 'Genomic position'. All other inputs are optional. The 'Genomic position' can be specified by the accession number of a gene, a cytological band, or a chromosomal coordinate range. Examples of valid position inputs are listed below.
|A coordinate range:||chr3:1-1000000|
|A gene name:||ERBB2|
|A miRNA ID:||mir-217|
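The position inputs listed above can be told apart with a small amount of pattern matching. The following is a minimal sketch in Python, not CaSNP's actual parser: the `Position` type, the `parse_position` function, and the rule that anything starting with `mir-` is a miRNA ID are all assumptions made for illustration. Gene names, accession numbers, and cytological bands are simply passed through as names that a later lookup step would resolve.

```python
import re
from typing import NamedTuple, Optional


class Position(NamedTuple):
    """Hypothetical container for a parsed 'Genomic position' input."""
    kind: str                 # "range", "mirna", or "name"
    chrom: Optional[str]
    start: Optional[int]
    end: Optional[int]
    name: Optional[str]


# Matches coordinate ranges written like chr3:1-1000000.
_RANGE = re.compile(r"^(chr[0-9A-Za-z]+):(\d+)-(\d+)$")


def parse_position(text: str) -> Position:
    """Classify a position string as a coordinate range, miRNA ID, or name."""
    text = text.strip()
    if not text:
        raise ValueError("a genomic position is required")

    match = _RANGE.match(text)
    if match:
        start, end = int(match.group(2)), int(match.group(3))
        if start > end:
            raise ValueError("range start must not exceed range end")
        return Position("range", match.group(1), start, end, None)

    # Assumption: miRNA IDs follow the 'mir-217' pattern shown in the examples.
    if text.lower().startswith("mir-"):
        return Position("mirna", None, None, None, text)

    # Everything else (gene name, accession, band) needs a lookup elsewhere.
    return Position("name", None, None, None, text)
```

For example, `parse_position("chr3:1-1000000")` yields a `range` with chromosome `chr3`, while `parse_position("ERBB2")` is returned as a `name` to be resolved against a gene annotation.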
|Primary level:||Secondary level:||Response:|
|---choose---||---choose---||All cancer samples in the database|
|lung cancer||---choose---||All lung cancer samples|
|lung cancer||lung adenocarcinoma||All adenocarcinoma samples of lung|
|lung cancer||lung carcinoma NOS||All lung cancer samples which could not be assigned to a subtype|
|GT2.2||Return the percentage of samples with copy number greater than 2.2|
|GT2.2 LT1.8||Return the percentage of samples with copy number greater than 2.2, or less than 1.8, separately|
|GT1.8 LT2.2||Return the percentage of samples with copy number less than 2.2 and greater than 1.8|
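One way to read the three threshold examples above: a lone `GT` cutoff reports the gain frequency, a `GT` cutoff above the `LT` cutoff reports gain and loss separately, and a `GT` cutoff below the `LT` cutoff selects the window between the two values. The sketch below implements that reading in Python. The function names and the returned dictionary keys are hypothetical, and the input is assumed to be one copy number value per sample.

```python
import re


def parse_thresholds(expr):
    """Return (gt_cutoff, lt_cutoff) from text like 'GT2.2 LT1.8'.

    Either cutoff is None when the corresponding token is absent.
    """
    gt = lt = None
    for token in expr.split():
        match = re.fullmatch(r"(GT|LT)(\d+(?:\.\d+)?)", token)
        if match is None:
            raise ValueError(f"unrecognized threshold token: {token!r}")
        if match.group(1) == "GT":
            gt = float(match.group(2))
        else:
            lt = float(match.group(2))
    return gt, lt


def threshold_frequencies(copy_numbers, expr):
    """Percentage of samples satisfying the thresholds in expr."""
    if not copy_numbers:
        raise ValueError("at least one sample is required")
    gt, lt = parse_thresholds(expr)
    n = len(copy_numbers)

    # GT cutoff below LT cutoff: count samples inside the window.
    if gt is not None and lt is not None and gt < lt:
        hits = sum(1 for cn in copy_numbers if gt < cn < lt)
        return {"between": 100 * hits / n}

    # Otherwise report gain and loss frequencies separately.
    result = {}
    if gt is not None:
        result["gt"] = 100 * sum(1 for cn in copy_numbers if cn > gt) / n
    if lt is not None:
        result["lt"] = 100 * sum(1 for cn in copy_numbers if cn < lt) / n
    return result
```

With five samples at copy numbers 1.5, 1.9, 2.0, 2.1, and 2.5, `GT2.2 LT1.8` gives 20% gain and 20% loss, while `GT1.8 LT2.2` gives 60% inside the window.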
On the 'Browse Data' page, users can skim through all the datasets (series) available in CaSNP together with their annotation, such as the experiment abstract, overall design, platform, authorship, and citation information. More importantly, users can submit any individual dataset, or any combination of datasets, for analysis simply by ticking the checkboxes on the left, free of the constraints of cancer categorization.
CaSNP provides both tabulated and graphical views of a user's query. In the result table, each row represents an individual dataset. The title of the table gives the genomic coordinates of the user-defined region*. The query results, including series information, platform information, and summary statistics, are displayed in columns. The first two columns show the GEO ID of the dataset and the types of array used in the experiment. The average copy number is first calculated for each sample within the series, and the average of these per-sample values is then displayed in the column 'Avg. CN'. The columns 'UCSC' and 'Get BED' are described in the next section, Data Visualization. The column '#Samples' gives the number of samples in the series that contain data for the query region. The next column, '#SNP', gives the number of SNP markers located in the query region. For example, if the 'Array Type' column shows that the series involves two array types, Affy 250K Sty and Affy 250K Nsp, and these arrays have 1 and 3 markers respectively within the specified region, then the value of '#SNP' is 4. Two further columns do not appear in every query. If a threshold is set on the query page, a column titled 'Freq: <##' or 'Freq: >##' is generated, corresponding to the threshold value. Each such column contains two sub-columns giving the number and the percentage of samples that meet the threshold. Columns shaded sky blue are sortable: clicking the column sorts all results by it.
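The 'Avg. CN' and '#SNP' columns described above can be illustrated with a short Python sketch. This is not CaSNP's code: the function names and the input layout (a dictionary mapping each sample ID to its marker copy numbers in the query region, and a dictionary mapping each array type to its marker count) are assumptions chosen for the example.

```python
from statistics import mean


def series_avg_cn(samples):
    """Average the per-sample mean copy numbers of one series.

    samples maps a sample ID to the copy numbers of its markers in the
    query region. Samples without data in the region are skipped.
    """
    per_sample = [mean(values) for values in samples.values() if values]
    if not per_sample:
        raise ValueError("no sample has data in the query region")
    return mean(per_sample)


def sample_count(samples):
    """'#Samples': samples that contain data for the query region."""
    return sum(1 for values in samples.values() if values)


def snp_count(markers_per_array):
    """'#SNP': markers in the query region, summed over array types."""
    return sum(markers_per_array.values())
```

Note that averaging per sample first is not the same as pooling all markers. For two samples with values `[2.0, 2.0]` and `[3.0, 3.0, 3.0]`, the per-sample means are 2.0 and 3.0, so the series average is 2.5, whereas a pooled mean over all five values would be 2.6.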
Data visualization is integrated with the query output. After a query is performed, two columns named 'UCSC' and 'Get BED' appear in the middle of the result page. The entries in these columns are hyperlinks that open the UCSC Genome Browser and download the track file, respectively. Clicking a UCSC link displays a bedGraph track in the UCSC Genome Browser. Below is a typical example of such a bedGraph.
In the example, the user defined a query for NM_017414 and set the thresholds to >2.3 and <1.7. The custom track is loaded automatically. The track title shows the GEO ID and the number of markers in the query region. The values in the track are the percentages of samples whose copy number passes a threshold at each individual SNP marker. The percentage is calculated per SNP marker site: if 20 samples have data at a site and 12 of them have copy numbers larger than 2.3, the percentage for this site is 60%. Likewise, if 2 of those samples have copy numbers smaller than 1.7, the percentage is -10%. In the image, bars for positive percentages are drawn in orange and bars for negative percentages are drawn in blue. Users can download the custom track file by clicking the 'Get BED' link next to the UCSC link.
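The per-marker percentages in the track can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and the conversion of a 1-based marker position to a 0-based, half-open bedGraph interval is an assumption about how a single SNP would be written, not a description of the files CaSNP generates.

```python
def marker_track_values(copy_numbers, gain_cutoff, loss_cutoff):
    """Signed gain/loss percentages for one SNP marker site.

    copy_numbers holds one value per sample with data at this site.
    Gains are reported as a positive percentage, losses as a negative one.
    """
    if not copy_numbers:
        raise ValueError("no samples have data at this site")
    n = len(copy_numbers)
    gains = sum(1 for cn in copy_numbers if cn > gain_cutoff)
    losses = sum(1 for cn in copy_numbers if cn < loss_cutoff)
    return 100 * gains / n, -100 * losses / n


def bedgraph_line(chrom, position, value):
    """Format one bedGraph row: chrom, start, end, value (tab separated).

    Assumes position is 1-based, so the interval is [position - 1, position).
    """
    return f"{chrom}\t{position - 1}\t{position}\t{value:g}"
```

Using the numbers from the example, a site with 20 samples of which 12 exceed 2.3 and 2 fall below 1.7 gives `(60.0, -10.0)`.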
CaSNP provides a heatmap function to give users a more direct view of the raw data in the database. The heatmap page offers a set of parameters similar to that of the query page. After the query process is finished, the resulting heatmaps are displayed. The legend shows the color key used for drawing the heatmaps. Raw copy numbers are log-transformed (base 2) and then reduced by 1, so a raw copy number of 2 (in most cases the normal copy number) is converted to 0 for drawing. Red represents a higher copy number, blue a lower copy number, and white a normal copy number. The query region is also extended for heatmap drawing, by 100 kb upstream and 100 kb downstream.
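The heatmap transform and the region extension described above amount to two small calculations, sketched here in Python. The function names are hypothetical, the base-2 logarithm follows from the statement that a copy number of 2 maps to 0, and clamping the extended start at position 1 is an assumption (the upper end is left unclamped because chromosome lengths are not given here).

```python
import math

FLANK = 100_000  # 100 kb added on each side of the query region


def heatmap_value(copy_number):
    """log2(copy number) - 1, so the normal copy number 2 maps to 0."""
    if copy_number <= 0:
        raise ValueError("copy number must be positive")
    return math.log2(copy_number) - 1


def extend_region(start, end, flank=FLANK):
    """Extend a query region by flank bases upstream and downstream."""
    return max(1, start - flank), end + flank
```

Under this transform a copy number of 4 maps to 1 (drawn toward red), 2 maps to 0 (white), and 1 maps to -1 (drawn toward blue).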
In each heatmap, rows are the samples involved and columns are the individual markers within the extended query region. The heatmaps are sorted first by the number of columns (markers included) and then by the number of rows (samples included). Two black blocks above each heatmap mark the boundaries of the specified query region (before the 100 kb extension). The columns between the two blocks are the markers within the query region.