BEDtools suite provides command-line functionality when dealing with genomic coordinate based operations, such as overlapping bed files or getting coverage of a bed file over a genome (similar, not exactly same, functionality in R is provided by IRanges package in bioconductor). If you have the BEDtools suite installed and it is in your path, you can easily call BEDtools executables from R using the "system()" command. See BEDtools website to learn about BEDtools installation.
The function pasted below calls BEDtools executables on two data-frames. Both data-frames are structured as bed files. The function first writes out the data-frames as temporary files and calls the BEDtools programs on those temporary files and writes the output to another temporary file. In the end the function reads in the resulting file as a data frame and deletes the temporary files. I guess you can do this more elegantly with pipe() command in R, but that is left as an exercise to you: )
Showing posts with label UCSC. Show all posts
Showing posts with label UCSC. Show all posts
Tuesday, February 22, 2011
Monday, February 21, 2011
Access all UCSC wiggle tracks from R and your terminal
EDIT: Current version (13 November 2013) of the rtracklayer lets you import bigWig files over predefined regions, which is very convenient way to access most UCSC tracks that come in Wig format. All you need to do is to download the bigWig version of the tracks (if only in wiggle Format you can convert them to bigWig using UCSC command line tools) and use the import() function from rtracklayer.
rtracklayer package allows you to access most of the UCSC wiggle tracks from R. However, there is another way which might more practical in situations where you need to summarize the wig track scores over a given set of genomic coordinates. Although you can get a similar information from rtracklayer, you will need to do the summary statistics for each genomic coordinate and you will need to loop over a RangedList object, which may not be practical if you have a large set of genomic coordinates.
The other way to do this is to use hgWiggle program from UCSC source code. You can download the program from here, and you will need to prepare a three line ".hg.conf" file in your home directory.
.hg.conf should contain the exact following lines:
You should download the related .wib file from UCSC. For human hg18 assembly they are located at http://hgdownload.cse.ucsc.edu/gbdb/hg18/ . Then do "chmod 600 .hg.conf " on the file. Now we are ready to use hgWiggle. See this wiki page for all the other details.
As an example, you can get the statistics for chr16 for phastCons scores as shown below.
hgWiggle -db=hg18 -chr=chr16 -doStats phastCons44wayPlacental
you can also give a bed file with genomic coordinates using -bedFile option, then you will get phastCons summary for the regions in the bed file.
It is also pretty straight forward to call the hgWiggle from R using system() function, so you can include this functionality in your R code. Although I intended to do that here when I started the post, I thought that it is trivial to use the system() command and call any command line tool, so I won't be doing that for now. I might add it later on when I have more time.
rtracklayer package allows you to access most of the UCSC wiggle tracks from R. However, there is another way which might more practical in situations where you need to summarize the wig track scores over a given set of genomic coordinates. Although you can get a similar information from rtracklayer, you will need to do the summary statistics for each genomic coordinate and you will need to loop over a RangedList object, which may not be practical if you have a large set of genomic coordinates.
The other way to do this is to use hgWiggle program from UCSC source code. You can download the program from here, and you will need to prepare a three line ".hg.conf" file in your home directory.
.hg.conf should contain the exact following lines:
db.host=genome-mysql.cse.ucsc.edu db.user=genomep db.password=password
You should download the related .wib file from UCSC. For human hg18 assembly they are located at http://hgdownload.cse.ucsc.edu/gbdb/hg18/ . Then do "chmod 600 .hg.conf " on the file. Now we are ready to use hgWiggle. See this wiki page for all the other details.
As an example, you can get the statistics for chr16 for phastCons scores as shown below.
hgWiggle -db=hg18 -chr=chr16 -doStats phastCons44wayPlacental
you can also give a bed file with genomic coordinates using -bedFile option, then you will get phastCons summary for the regions in the bed file.
It is also pretty straight forward to call the hgWiggle from R using system() function, so you can include this functionality in your R code. Although I intended to do that here when I started the post, I thought that it is trivial to use the system() command and call any command line tool, so I won't be doing that for now. I might add it later on when I have more time.
Subscribe to:
Comments (Atom)