Friday, February 1, 2013

How to obtain SNP genotype data? (to be continued)

From dbSNP
I want to use either R or Python to analyze dbSNP data for different populations. We are especially interested in clinically associated human variants. Ensembl imports dbSNP data and shows clinically associated SNPs along with their frequencies in different populations. An example is rs1333049.

The dbSNP FAQ offers some tips on downloading flat files. Some are relevant to my purpose here.
Q: I would like to use a script to fetch average allele frequency data for each human SNP from every web page, but I’m afraid that my IP will be blocked by the server for the heavy usage.
A: In general, large amounts of data can be obtained using our ftp, efetch or batch query services. Specifically, if you only need SNP allele frequency data, then use SNPAlleleFreq.bcp.gz, which is located on our ftp site. The Allele.bcp.gz file, also available on the ftp site, has Allele_id to allele_string mapping.
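Once a file such as SNPAlleleFreq.bcp.gz has been downloaded, it can be streamed row by row without unpacking it first. Below is a minimal Python sketch. It assumes the .bcp.gz dumps are tab-delimited text with no header row, which is worth checking against the dbSNP schema documentation. The function name read_bcp_rows is my own, and no column names are assumed.

```python
import csv
import gzip
from typing import Iterator, List


def read_bcp_rows(path: str) -> Iterator[List[str]]:
    """Yield each row of a gzipped, tab-delimited dump as a list of strings."""
    # Assumption: the file is tab-separated text without a header row.
    with gzip.open(path, mode="rt", newline="") as handle:
        # QUOTE_NONE keeps quote characters in the data as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

A call like `for row in read_bcp_rows("SNPAlleleFreq.bcp.gz"): ...` then reads one record at a time, which matters because these files are large.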
The SNP Eutility website lists parameters for batch fetching of various types of data, including Genotype XML.
 eFetch params for EntrezSNP:
# (id=NNNNNN[,NNNN,etc]) or (query_key=NNN, where NNN - number in the history, 0 - clipboard content for current database)
# db=snp (mandatory)
# report= (listed below)
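The parameters above can be assembled into a request URL in Python. This is only a sketch. The endpoint is assumed to be the general NCBI E-utilities efetch address, the ids are assumed to be rs numbers without the "rs" prefix, and the function name build_snp_efetch_url is my own. The report value is left to the caller, since the valid values are the ones listed on the SNP Eutility page.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode

# Assumption: the general NCBI E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_snp_efetch_url(ids: Iterable[int], report: Optional[str] = None) -> str:
    """Build an efetch URL for db=snp from a list of numeric SNP ids."""
    id_list = [str(int(i)) for i in ids]
    if not id_list:
        raise ValueError("at least one SNP id is required")
    # db=snp is mandatory; ids are joined as NNNNNN,NNNN,...
    params = {"db": "snp", "id": ",".join(id_list)}
    if report is not None:
        # The report value must be one of those listed by the SNP Eutility page.
        params["report"] = report
    # safe="," keeps the comma-separated id list readable in the URL.
    return EFETCH_URL + "?" + urlencode(params, safe=",")


# Example with the two rs numbers mentioned in this post.
url = build_snp_efetch_url([1333049, 35568883])
```

The function only builds the URL. The actual download (for example with urllib.request.urlopen) should be throttled to stay within NCBI's usage limits.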

A BioStar discussion on obtaining SNP information covered UCSC, Ensembl, and dbSNP. Apparently, SNP XML data needs different Python parsers than other Entrez XML data. Some Python parsers for SNP XML were discussed on BioStar.

From Ensembl
A Perl example can be found at BioStar.

From Bioconductor
Bioconductor provides some dbSNP builds, as recent as build 137.

From UCSC MySQL
On the UCSC website, a discussion on SNPs suggests downloading the XML files from dbSNP. These are the files UCSC used to integrate dbSNP into its annotations. One user commented that dbSNP merges frequencies reported by different labs, which can lead to biases.
The help page is:  http://genome.ucsc.edu/goldenPath/help/mysql.html
However, "Bot access and excessive program-driven use are not permitted" by UCSC.

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 -e 'select name,chrom,chromStart,chromEnd,observed from snp130 where name="rs35568883"'
+------------+-------+------------+----------+----------+
| name       | chrom | chromStart | chromEnd | observed |
+------------+-------+------------+----------+----------+
| rs35568883 | chr21 |   38782125 | 38782126 | A/G      |
+------------+-------+------------+----------+----------+
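For occasional lookups from Python, the same command can be built as an argument list and its output parsed. This is a sketch and not an official UCSC client. The helper names are my own, and I add the mysql client's -B (batch) flag so the result comes back tab-separated with a header row instead of as the boxed table above. Given the restriction on bot access quoted earlier, it should not be run in a tight loop.

```python
import re
from typing import Dict, List

RS_ID = re.compile(r"rs\d+")
IDENTIFIER = re.compile(r"\w+")


def build_ucsc_query(rs_id: str, database: str = "hg18",
                     table: str = "snp130") -> List[str]:
    """Return the mysql argv for looking up one SNP by rs id."""
    # Validate everything that ends up inside the SQL string.
    if not RS_ID.fullmatch(rs_id):
        raise ValueError(f"not an rs id: {rs_id!r}")
    if not IDENTIFIER.fullmatch(database) or not IDENTIFIER.fullmatch(table):
        raise ValueError("database and table must be plain identifiers")
    sql = ("select name,chrom,chromStart,chromEnd,observed "
           f"from {table} where name='{rs_id}'")
    return ["mysql", "--user=genome", "--host=genome-mysql.cse.ucsc.edu",
            "-A", "-B", "-D", database, "-e", sql]


def parse_batch_output(text: str) -> List[Dict[str, str]]:
    """Turn tab-separated mysql -B output into one dict per result row."""
    lines = [line for line in text.splitlines() if line]
    if not lines:
        return []
    header = lines[0].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]
```

The argument list can be passed to `subprocess.run(argv, capture_output=True, text=True, check=True)`, and the resulting `stdout` handed to `parse_batch_output`.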
 
 
 
From HapMap
 
 
http://hapmap.ncbi.nlm.nih.gov/



From 1000 Genomes

  FTP site: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data 
 
http://www.1000genomes.org/data
 
  Links and References:
  • https://cgsmd.isi.edu/dbsnpq/
  • XML files for genotype data: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/genotype/
  • SNP Eutility: http://www.ncbi.nlm.nih.gov/SNP/SNPeutils.htm
  • http://www.1000genomes.org/data
  • http://hapmap.ncbi.nlm.nih.gov/
  • Human disease database: http://www.genecards.org/cgi-bin/listdiseasecards.pl?type=full
