Wednesday, December 30, 2015

parallel R using foreach and doMC

Using foreach and doMC for parallel R computing. The foreach run has to be assigned to an variable. 



while ((j <= popSize) && ( count < popSize*5)) {
  myParrallStep = inputCores * 5
  bufferAges = foreach(i=1:myParrallStep) %dopar% {
    currentNetworkAge = single_network_failure_v2(lambda1, lambda2, degreeThreshold, p, pairs, essenLookupTb)
  goodAges = bufferAges [bufferAges>0] #goodAges() return NULL? 
  count = count + myParrallStep
  if( length(goodAges)> 0 ){ 
    currentEnd = j + length(goodAges)-1
    popAges[j:currentEnd] = goodAges
    j = currentEnd + 1
  } else {
    print(paste("newk_aging_sim:: goodAges", goodAges))
  if(debug>2) { 
  if(debug) { 
    print(paste("netwk_aging_sim::count=",count, "  j=", j))
}# end of j while-loop, population loop

git retrieve a single file

git checkout 08618129e66127921fbfcbc205a06153c92622fe -- [full/path]
To clarify with an example:
git checkout mybranchname ~/src/myapp/myfile.txt

In my case, I need to roll back netwk_aging_sim.v0.1.R.

$ git checkout 4bb272d6a304168fc6711479a8b0b34c55e47182 netwk_aging_sim.v0.1.R
It worked.

The longnumber (branchname?) was found here:

*** CRISPR/Cas9 in crevisiae

multigene knockout in yeast

[Mans15] Mans 2015 FEMS Yeast, CRISPR/Cas9: a molecular Swiss army knife for simultaneous introduction of multiple genetic modifications in Saccharomyces cerevisiae
Mans15 designed guide RNA (gRNA), a chimeric crRNA-tracrRNA, and provide a webtool, Yeastriction webtool,

Yeastriction webtool. The tool is written in Javascript and based on the
stack (MongoDB, Express, AngularJS and Node.js). The source code is available at and ORF sequences were downloaded from SGD
( in GFF and FASTA file format, respectively. ORFs, including their 1 kb up- and downstream sequences, were extracted and imported into Yeastriction, with the aid of an in-house script

linear mixed effect model, lme4


CRISPR/cas 9 systems, Cong and Zheng 2015, Chapter 10

There are 3 known CRISPR systems, types I, II, and III.  Type II CRISPR systems usually only require a single protein, Cas9, to perform the target cleavage. 

Cas9 is a RNA-guided nuclease that is capable of binding to a target DNA and introduce a double strand break in a sequence-specific manner. 

The guide sequence within sgRNA has a length of 20 bp and is the exact complementary sequence of the target site within the genome, sometimes referred to as “protospacer” following the
convention of microbial CRISPR field.

In choosing the target site, it is important to note the requirement of having the “NGG” trinucleotide motif, called protospacer adjacent motif (PAM), right next to the protospacer target on the 3end.

The PAM sequence depends on the Cas9 protein employed in the CRISPR-Cas9 system, and hence, the “NGG” PAM sequence applies specifi cally to the Streptococcus pyogenes Cas9 (SpCas9). 

Monday, December 28, 2015

RNAseq end calling

change point analysis

RNAseq discussion forum

yeast ngs data resource, Waern and Snyder 2013 G3.

Extensive Transcript Diversity and Novel Upstream Open Reading Frame Regulation in Yeast

Karl Waern*,† and Michael Snyder*,†,1
*Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, and †Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06520

Some discussion on problems during using these data
"Data is single-end. I just checked the protocol accompanying the data- it confirms that the reads are indeed strand-specific."

Github http.postBuffer out of range problem

git config --global http.postBuffer 1048576000

applied spatial data analysis with R

applied spatial data analysis with R

Sunday, December 27, 2015


Mitochondrial and Cytoplasmic ROS Have Opposing Effects on Lifespan

  • Claire E. Schaar ,

  • Dylan J. Dues ,

  • Katie K. Spielbauer ,

  • Emily Machiela,
  • Jason F. Cooper,
  • Megan Senchuk,
  • Siegfried Hekimi,
  • Jeremy M. Van Raamsdonk mail


Saturday, December 26, 2015

plot viability and mortality curves for the simulated grid networks 'net1' hqin$ pwd


update lifespan.r to lifespan.20140711.R

modify analyze_netwk_aging.v0.0.R to do batch plot mortality and survival curves 
change this to analyze_netwk_aging.v0.1.R

Plots show 100 cells are not enough. So, I modified '' with -n 1000 and re-run the shell script.
8:47pm ->

Sunday, December 20, 2015

Gillespie algorithm, notes


Cancer genomics resource

McFarland, Sunyaev, Mirny PNAS, 2013 impact of deleterious passernger mutations on cancer progression.
Catalogue of Somatic Mutations in Cancer (COSMIC) and The Cancer Genome Atlas (TCGA). We classified them as driver and passenger mutation groups and then characterized their effects
using PolyPhen, a tool widely used in population and medical genetics to predict the damaging effect of missense mutations (15).

Ref 15: Boyko AR, et al. (2008) Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 4(5):e1000083.

(todo) R GIS

Friday, December 18, 2015

Wednesday, December 16, 2015

R code to find the range of lambda.

After summarized network aging simulation runs, I realized I need an automated way to find the range of parameters, especially for lambda.

$ Rscript identify_lambda1.v0.0.R -if1 net1/Degree4N1000_network.csv -if2 net1/Degree4N1000_EssenLookupTb.csv -l1 0.006 -l2 0.0002 -dt 0 -p 1.0 -n 20  -op net1 -od net1 -tLS 30 -ml 20 -cs 1.05 -el 0.5

In comparison, here is my analytic simulation results:

degree threshold=0 for grid networks or edges with uniform lambda

# To use lambda1 for all edges, choose threshold = 0
single_network_failure_v2 = function(lambda1, lambda2=lambda1/10, threshold=5, p, pairs, essenLookupTb ) {

It took sometime before I realize this tricky choice.

bug, dataframe and vector, R 3.2.3

After update R to 3.2.3,
I have a bug due to implicit conversion from single column dataframe and vector, but strict requirement in R3.2.3.

essenLookupTb = read.csv(inLookupTbFile, row.names=1);
essenLookupTb = as.vector(essenLookupTb[,1]); #20151215 after update to R3.2.3

bug: R passing variable to function

When I used a variable name in the passing interface that is the same without inside variable, it seems to cause confusion by R, and claimed that the variable is undefined.  I rename the variables, and the bug went away.


File: analyze_netwk_aging.v0.0.R
$ Rscript analyze_netwk_aging.v0.0.R -id net1 -ip net1 -o __test.csv

When I moved the function into 'network.r', a bug was discovered in passing of the variables. I renamed the variables inside and outside of the function scope, and fixed the bug. 

# 2015Dec 16. 
# Plan, loopover netwk-aging files in a directory and generate a report file
# Rscript analyze_netwk_aging.v0.0.R -id net1 -ip net1 -o __test.csv


summarize_mean_from_files = function(myfiles){
  outtb = data.frame(myfiles)
  outtb$mean = NA;
  for( i in 1:length(myfiles)) {
    currentFile = paste( inputdir, '/', myfiles[i], sep='');
    tb = read.csv(currentFile)
    outtb$mean[i] = mean(tb[,1])

inputdir = getwd(); 
inputprefix = '';
outputFile = "_tmp_networkaging_summary.csv";
debug = 0;
  "inputdir|id=s", "input dir of network aging data in csv format. default: getwd()",
  "inputprefix|ip=s", "prefix of input netwk aging csv data files. default empty",
  "outputFile|of=s", "output csv file name",
  "debug|d=i", "debug. default 0"

# inputdir = 'net1'; inputprefix = 'net1'; i = 1
myfiles = list.files(inputdir)
myfiles = myfiles[ grep(inputprefix, myfiles) ]

out = summarize_mean_from_files( myfiles )
#timestamp = format(Sys.time(), "%Y%b%d_%H%M%S")
write.csv( out, outputFile, row.names=F)

R GetoptLong, update to R 3.2.3

The R GetoptLong library was borrowed from PERL GetoptLong.

It requires R 3.2.3. So, I updated R to 3.2.3.

#example from GetoptLong

# Rscript test-getoptlong.R --number 4 --verbose T
# Rscript test-getoptlong.R -n 4 --verbose T

cutoff = 0.05
  "number|n=i", "Number of items, integer, mandatory option",
  "cutoff|c=f", "cutoff to filter results, optional, default (0.05)",
  "verbose|v", "print messages"

print(c("input:", 'number'=number, 'cutoff'=cutoff, 'verbose'=verbose))

Tuesday, December 15, 2015

Generic network aging simulation code

First no-bug run:
R -f _todo.20151213-netsim-generic.R --args net1/Degree4N1000_network.csv  net1/net1/Degree4N1000_EssenLookupTb.csv 0.002 0.0002 4 0.9 5 net1

change name to netwk_aging_sim.v0.1.R

Monday, December 14, 2015

Essential gene flag: zero is false, everything else is true

 essenTb$essenflagNumeric = ifelse( essenTb$essenflag=='essential', 1, 0)  #zero is false, everything else is true

See: 20151012-factorize_networks.R

todo student project, gillespie simulation of yeast replicative aging

mother cells
mitotic arrest
G0, G1, phase etc basec on dissection data

Q: how to estimate parameters using Gillspecie algorithm?

Sunday, December 13, 2015

Gillespie, predator-prey simulation

Example from Pineda-Krch08 J of Statistical Software.

parms = c(b=2, d=1, K=1000, alpha=0.007, w=0.0035, c=2,g=2)
x0 = c(N=1000, P=100)
nu = matrix(c(+1,-1,-1, 0,0,
              0,  0, 0,+1,-1), nrow=2, byrow=TRUE)
a = c("b*N", 
tf = 100
method = "D"
simName = "Predator-prey model"
##       b       d       K   alpha       w       c       g 
## 2.0e+00 1.0e+00 1.0e+03 7.0e-03 3.5e-03 2.0e+00 2.0e+00
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1   -1   -1    0    0
## [2,]    0    0    0    1   -1
## [1] "b*N"                 "(d+(b-d)*N/K)*N"     "alpha/(1+w*N)*N*P"  
## [4] "c*alpha/(1+w*N)*N*P" "g*P"
out = ssa(x0, a, nu, parms, tf, method, simName, 
          verbose = TRUE, consoleInterval = 10, maxWallTime = 30)
## Running D method with console output every 10 time step
## Start wall time: 2015-12-13 20:44:52...
## t=0 : 1000,100
## (4.082s) t=10 : 639,313
## (10.931s) t=20.00066 : 123,499
## (18.091s) t=30.00005 : 112,193
## (26.468s) t=40.00018 : 271,95
## t=43.15012 : 374,361
## tf: 43.15012
## TerminationStatus: maxWallTime
## Duration: 30.019 seconds
## Method: D
## Nr of steps: 96521
## Mean step size: 0.0004470542+/-0.0006694862
## End wall time: 2015-12-13 20:45:22
## --------------------
##        V1               N                P      
##  Min.   : 0.000   Min.   :  46.0   Min.   :  9  
##  1st Qu.: 9.575   1st Qu.: 294.0   1st Qu.: 92  
##  Median :19.063   Median : 496.0   Median :206  
##  Mean   :20.357   Mean   : 487.3   Mean   :237  
##  3rd Qu.:31.002   3rd Qu.: 677.0   3rd Qu.:361  
##  Max.   :43.150   Max.   :1014.0   Max.   :622
plot( out$data[,2] ~ out$data[,1], ylab = "N")

plot( out$data[,3] ~ out$data[,1], ylab = "Predator")