《Modern Statistics for Modern Biology》Chapter 一: 离散数据模型的预测(1.1 – 1.3)
《Modern Statistics for Modern Biology》Chapter 一: 离散数据模型的预测(1.4 – 1.5)
《Modern Statistics for Modern Biology》Chapter 二: 统计建模(2.1-2.3)
《Modern Statistics for Modern Biology》Chapter 二: 统计建模(2.4-2.5)
《Modern Statistics for Modern Biology》Chapter 二 统计建模(2.6 – 2.7)
《Modern Statistics for Modern Biology》Chapter 二 统计建模(2.8 – 2.9)
从这章开始最开始记录一些markdown等小知识
$\hat{p}=\frac{1}{12}$
: Table 1:Low-level manipulation ofDNAStringSetandAAStringSetobjects.Table 2:Counting / tabulating.
Table 3:Sequence transformation and editing.
Table 4:String matching / alignments.► 解
- 第一行打印遗传密码信息,第二行返回
IUPAC
核苷酸类型。第三行列出了Biostring软件包中所有可用的
vignettes(有哪些参考说明书),第四行显示了一个特定的
vignettes`(功能参数预览)。
> GENETIC_CODE
TTT TTC TTA TTG TCT TCC TCA TCG TAT TAC TAA TAG TGT TGC TGA TGG CTT CTC CTA CTG CCT CCC CCA CCG CAT CAC CAA CAG
"F" "F" "L" "L" "S" "S" "S" "S" "Y" "Y" "*" "*" "C" "C" "*" "W" "L" "L" "L" "L" "P" "P" "P" "P" "H" "H" "Q" "Q"
CGT CGC CGA CGG ATT ATC ATA ATG ACT ACC ACA ACG AAT AAC # AAG AGT AGC AGA AGG GTT GTC GTA GTG GCT GCC GCA GCG
"R" "R" "R" "R" "I" "I" "I" "M" "T" "T" "T" "T" "N" "N" "K" "K" "S" "S" "R" "R" "V" "V" "V" "V" "A" "A" "A" "A"
GAT GAC GAA GAG GGT GGC GGA GGG
"D" "D" "E" "E" "G" "G" "G" "G"
attr(,"alt_init_codons")
[1] "TTG" "CTG"
> IUPAC_CODE_MAP
A C G T M R W S Y K V H D B N
"A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG" "ACT" "AGT" "CGT" "ACGT"
> vignette(package = "Biostrings")
> vignette("BiostringsQuickOverview", package = "Biostrings")
BSgenome
包提供了许多物种的基因组信息,您可以通过键入以下命令来访问包含整个基因组序列的数据包的名称
> library("BSgenome")
> ag = available.genomes()
> length(ag) ## 可以看到有91个基因组的信息
[1] 91
> ag[1:2]
[1] "BSgenome.Alyrata.JGI.v1" "BSgenome.Amellifera.BeeBase.assembly4"
- 接下来我们将探索
AGGAGGT
在E.coli
(大肠杆菌)中出现。我们使用一个特定菌株的基因组序列(Escherichia coli str. K12 substr.DH10B),其NCBI登录号为NC_010473。
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("BSgenome.Ecoli.NCBI.20080805", version = "3.8")
library("BSgenome.Ecoli.NCBI.20080805")
> Ecoli
E. coli genome:
# organism: Escherichia coli (E. coli)
# provider: NCBI
# provider version: 2008/08/05
# release date: NA
# release name: NA
# 13 sequences:
# NC_008253 NC_008563 NC_010468 NC_004431 NC_009801 NC_009800 NC_002655 NC_002695 NC_010498 NC_007946 NC_010473 NC_000913
# AC_000091
# (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator to access a given sequence)
> shineDalgarno = "AGGAGGT"
> ecoli = Ecoli$NC_010473
- 我们可以使用
countPattern
函数计算宽度为50000的窗口中出现的pattern
。
> window = 50000
> starts = seq(1, length(ecoli) - window, by = window)
> ends = starts + window - 1
> numMatches = vapply(seq_along(starts), function(i) {
+ countPattern(shineDalgarno, ecoli[starts[i]:ends[i]],
+ max.mismatch = 0)
+ }, numeric(1))
> table(numMatches)
numMatches
0 1 2 3 4
48 32 8 3 2
> numMatches
[1] 0 1 0 1 4 0 0 0 1 0 0 1 0 0 1 0 0 0 0 3 0 0 4 0 1 0 2 2 1 0 1 0 1 1 1 0 0 0 0 1 3 1 1 1 1 0 0 0 0 1 0 0 1 0 2 1 2 1 1 0 0 0 0
[64] 0 1 0 1 1 0 1 2 0 0 0 0 0 0 0 0 1 1 0 3 1 0 1 2 1 2 1 2 0 1
> str(starts)
num [1:93] 1 50001 100001 150001 200001 ...
> str(numMatches)
num [1:93] 0 1 0 1 4 0 0 0 1 0 ...►问题 What distribution might this table fit ?
► 解
- 我们可以使用
matchPattern
函数检查AGGAGGT
匹配到的位置信息。
> sdMatches = matchPattern(shineDalgarno, ecoli, max.mismatch = 0)
> sdMatches
Views on a 4686137-letter DNAString subject
subject: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATT##AGAGTGTC...TGCTGCATGATATTG##ATATCACC#T##CGCCTTAGTAAGTATTTTTC
views:
start end width
[1] 56593 56599 7 [AGGAGGT]
[2] 199644 199650 7 [AGGAGGT]
[3] 202176 202182 7 [AGGAGGT]
[4] 214433 214439 7 [AGGAGGT]
[5] 217429 217435 7 [AGGAGGT]
... ... ... ... ...
[61] 4438786 4438792 7 [AGGAGGT]
[62] 4498085 4498091 7 [AGGAGGT]
[63] 4536658 4536664 7 [AGGAGGT]
[64] 4546821 4546827 7 [AGGAGGT]
[65] 4611626 4611632 7 [AGGAGGT]
- 可以在R命令行中键入
sdMatches
以获取此对象的summary
。它包含所有65
个模式匹配的位置,表示为一组所谓的原始序列视图。它们之间的距离是多少?
> betweenmotifs = gaps(sdMatches)
> betweenmotifs
Views on a 4686137-letter DNAString subject
subject: AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATT##AGAGTGTC...TGCTGCATGATATTG##ATATCACC#T##CGCCTTAGTAAGTATTTTTC
views:
start end width
[1] 1 56592 56592 [AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATT#...TGAATAATTTCGCTACCTACCACGCTCCAGGTGTCAGAACCCGGCAGAC]
[2] 56600 199643 143044 [#GCTACCGTTATCCAGAATGGTATAGCGGTTTTCGCCACGTTGTTTC...ACATGTTATGGGGCTATAGCTCAGCTGGGAGAGCGCCTGCTTTGCACGC]
[3] 199651 202175 2525 [CTGCGGTTCGATCCCGCATAGCTCCACCATCTCTGTAGTGGTT#TAA...T#GAGTAACGGAGGAGCACGAAGGTTGGCTAATCCTGGTCGGACATC]
[4] 202183 214432 12250 [TAGTGCAATGGCATAAGCCAGCTTGACTGCGAGCGTGACGGCGCGAGCA...GATATTGG#TATCTGATTTGC#TTATCGTGTTATCGCCAGGCTTT]
[5] 214440 217428 2989 [TAATAACATGGGCAGGATAAGCTCGGGAGGAATGATGTTTAAGGCAATA...CCGTAGCGAGAATACTC#ATCATCATAACG#AGCCCCTTACTTGT]
... ... ... ... ...
[62] 4438793 4498084 59292 [GGATTTAATCACGGTAACATTTTGCTGGCTT#GCATCTGCCAGACGC...GAGCAACAGGCGGCAGAGCAAGGTTGGGAGTCATTGCATCGTCAACTTC]
[63] 4498092 4536657 38566 [AGATCCGGTTGCGGCAGCAAGGATTCATCC#TGATCCAC#GGCTT...AGGATATAGTAAGC#TT#ACGGGGT#TTTGAACTCC#TACC]
[64] 4536665 4546820 10156 [GGAATT#GAATGCGATGGATGGTATATTTACG##TAATTGATG...GGGGT#ACGTTTCCTGTAGCACCGTGAGTTATACTTTGTATAACTTA]
[65] 4546828 4611625 64798 [GCAGATGCGTATTACCAT#AAGATGGGGGAACAGTGCAGGTATGGTC...ATGGGGCACCGACACACCGAGCGTGGTGGCGCGCGCATCGCCGAGTGCA]
[66] 4611633 4686137 74505 [CGAGATCGCGGC#AACTCAGGCTCAGCGGCAG#T#ATCATCAG...TATTG##ATATCACC#T##CGCCTTAGTAAGTATTTTTC]
- 因此,这些实际上是66个互补的区域。现在,让我们找到一个模型,分布之间的差距大小的主题。如果模体发生在随机位置,我们期望间隙长度服从指数分布。下面的代码(其输出如
图2.23
所示)评估了这一假设。如果指数分布是一个很好的拟合
,点应该大致在一条直线上。指数分布只有一个参数,即速率
,并给出了与数据估计相对应的斜率直线
。
> library("Renext")
载入需要的程辑包:evd
Warning messages:
1: 程辑包‘Renext’是用R版本3.5.2 来建造的
2: 程辑包‘evd’是用R版本3.5.2 来建造的
> expplot(width(betweenmotifs), rate = 1/mean(width(betweenmotifs)),
+ labels = "fit")图 2.23
2.10.1在依赖关系的情况下建模(Modeling in the case of dependencies)
正如我们在第2.8节中看到的,核苷酸序列通常是依赖的:
在给定的位置发现某个核苷酸的概率往往取决于周围的序列
。在这里,我们将使用马尔可夫链
进行依赖建模。我们将研究人类基因组第8号染色体的区域,并试图发现被称为CpG岛的区域与其他区域之间的差异。我们使用数据(由Irizarry,Wu,和Feinberg(2009)生成),这些数据告诉我们
CpG
的起点和终点在基因组中的什么位置,并观察核苷酸
的频率和‘CG’、‘CT’、‘CA’、‘CC’
的频率。因此,我们可以问,核苷酸的出现之间是否存在依赖关系,如果存在,如何对它们进行建模。
> library("BSgenome.Hsapiens.UCSC.hg19")
> chr8 = Hsapiens$chr8
> setwd("e://compute language/NGS/Modern Statistics for Modern Biology/data/")
> CpGtab = read.table("model-based-cpg-islands-hg19.txt",header = TRUE)
> nrow(CpGtab)
[1] 65699
> head(CpGtab)
chr start end length CpGcount GCcontent pctGC obsExp
1 chr10 93098 93818 721 32 403 0.559 0.572
2 chr10 94002 94165 164 12 97 0.591 0.841
3 chr10 94527 95302 776 65 538 0.693 0.702
4 chr10 119652 120193 542 53 369 0.681 0.866
5 chr10 122133 122621 489 51 339 0.693 0.880
6 chr10 180265 180720 456 32 256 0.561 0.893
> irCpG = with(dplyr::filter(CpGtab, chr == "chr8"), IRanges(start = start, end = end))
> head(irCpG)
IRanges object with 6 ranges and 0 metadata columns:
start end width
[1] 32437 33708 1272
[2] 40354 40656 303
[3] 44536 46203 1668
[4] 99457 100721 1265
[5] 155954 156761 808
[6] 179033 179756 724
- 在上面的行中,我们将数据集
CpGtab
的子集(筛选)仅设为第8染色体
,然后创建一个IRange
对象,该对象的开始
和结束
位置由该数据集中命名相同的列定义。在IRange
函数调用(它从其参数构造对象)中,第一个开始是函数的参数名称,第二个开始是作为filter
的输出在数据集中获得的列;对于end
也是类似的。IRange
是一种通用的数学间隔容器。IRange中
的“I”代表“间隔”,Granges
中的“G”代表“基因组”。
> grCpG = GRanges(ranges = irCpG, seqnames = "chr8", strand = "+")
> grCpG
GRanges object with 2855 ranges and 0 metadata columns:
seqnames ranges strand
[1] chr8 32437-33708 +
[2] chr8 40354-40656 +
[3] chr8 44536-46203 +
[4] chr8 99457-100721 +
[5] chr8 155954-156761 +
... ... ... ...
[2851] chr8 146126089-146128165 +
[2852] chr8 146175867-146176338 +
[2853] chr8 146228006-146228907 +
[2854] chr8 146232962-146233086 +
[2855] chr8 146276730-146278360 +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> genome(grCpG) = "hg19"
> library("Gviz")
载入需要的程辑包:grid
Warning message:
程辑包‘Gviz’是用R版本3.5.1 来建造的
> ideo = IdeogramTrack(genome = "hg19", chromosome = "chr8")
Warning messages:
1: In readChar(f, 1) : 在non-UTF-8 MBCS语言环境里只能读取字节
2: In readChar(f, 1) : 在non-UTF-8 MBCS语言环境里只能读取字节
> ideo
Ideogram track 'chr8' for chromosome 8 of the hg19 genome
> plotTracks(
+ list(GenomeAxisTrack(),
+ AnnotationTrack(grCpG, name = "CpG"), ideo),
+ from = 2200000, to = 5800000,
+ shape = "box", fill = "#006400", stacking = "dense")
- 我们现在定义所谓的染色体序列的观点,对应于
CpG岛
,irCpG
,和之间的区域(gap(IrCpG)
)。生成的对象CGIview
和nonCGIview
只包含坐标,而不包含序列本身(这些对象保存在对象Hsapiens$chr8
中),因此它们在存储方面是相当轻量级的。
> CGIview = Views(unmasked(Hsapiens$chr8), irCpG)
> CGIview
Views on a 146364022-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
start end width
[1] 32437 33708 1272 [CGGGTGCCATCTCAGCAGCTCACGGTGTGG#CTGCGACACTCACA...ATTGGAACTGTGCGTTTGCGGAAGAACACGGCGGGGGGGGCGGCGC]
[2] 40354 40656 303 [CGATTAGCGGGAGCGC#GCGAAGGGGCGGCCTGCGACGTTTTCCT...ACGAGGAGGGAGCCGCGCACGCGCGGTCGGCTCGGCGAGGAGCCGG]
[3] 44536 46203 1668 [CGGCGCTGCAGGAGAGGAGATGCCCAGGCCAGGCGGCCGGCGCACGT...TGTGATGTTGTGTTCTCGGTGTGAGTTCATGGGTGTGACGGGGCGT]
[4] 99457 100721 1265 [CGGCTGGCCGGGCAGGGGGGCTGACCCCCCCCACCTCCCTCCTGGAC...GCTCCGCTGCCCGCTGAACTCCATCCTCCCGGCGGTCGGGCGGCGG]
[5] 155954 156761 808 [CGGG#GATTTTATTCACCGTCGATGCGGCCCCGAGTTGTCCC#...GGCACGGCAGGGCTCTCTTGCTCGCAGTATACTGGCGGCACGCCGC]
... ... ... ... ...
[2851] 146126089 146128165 2077 [CGGCAAGTAGCACACTCGACCAACTGCTG#CG#GCAGCTCGTT...#GAGAAGGCTTTATTATTTCAAGGTCGACGGCCAAGGAGACCGG]
[2852] 146175867 146176338 472 [CGACAGGCGGGAACGCCACCAGCCTGTGGGCGGAGTCCTGGCTCGGC...TATCTGGGGCAGCGTAGCCTCTCTAGATGGCGGGAGGCCGGGGCGA]
[2853] 146228006 146228907 902 [CGCCACCCTCGCCCTGCCCGCGCCTCGGCCAACAC#GCGGGCGCT...GGGAGAGGCCAGGTGCGACCCGGGCAGGTGAGGAGCACGGGCCCGG]
[2854] 146232962 146233086 125 [CGCCTCCAGCTGCGCAAGGGCAGGTGCGGGCAAGGAAGTCCCACTGG...TGGGTGAGGGTAGAGAGGATGCCTCAGTGCGTCCGGACACCACCGA]
[2855] 146276730 146278360 1631 [CGTCTCAGG#CACCGCTTTCCACTCCTGTGTCCCCG#GGAATC...ATCCACGGGTGCGAATCCACGGGCACGTGTGATCTGGGGTCCGCGG]
> NOnCGIview= Views(unmasked(Hsapiens$chr8), gaps(irCpG))
> NonCGIview
Views on a 146364022-letter DNAString subject
subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
views:
start end width
[1] 33709 40353 6645 [TCTTCTCATCACGTGCTCTCAGGGGCCACGATGTCAACATGCCTCA...AATGCTGGGTGCCCGCAAGAACGCGGAGCAGCAGGCACTCATTCC]
[2] 40657 44535 3879 [TCTCCAAGTGCCGCCAGCTGCGGGATTTCCTCTGC#AGAC#C...CCAGGCCTGGCGGCCGGCGCACGCGGGTTCTCTGTGGCCAGCAGG]
[3] 46204 99456 53253 [GTGCTGTGTGAGAACATGTGTGTAGTGTTCACATGTCCTCTGTGCG...AGTAGGGGCGGCCGGGCAGAGGCGCCCCTCACCTCCCGGACGGGG]
[4] 100722 155953 55232 [CGGCTGCCCTAAGTGATCTTTTTAAGT#GGAG#CTCACCAGG...AACCGTTCTAACTGGTCTCTGACCTTGATTATTAACGGCTGCAAC]
[5] 156762 179032 22271 [CTGCTGGCAGCTAGGGACATTGCAGGGCCCTCTTGCTCACATTGTA...CCATGTCTGCATCCCGGGCGGTATGTACGCTCTGAGAGATACATG]
... ... ... ... ...
[2850] 146125886 146126088 203 [GGGGGATGACTTTTATATATTTCTGACATGCCTGCATTAGAAGAAG...AG#AGACAGCATCGACAGC#ATGGGGAGAGGATAAGACACC]
[2851] 146128166 146175866 47701 [AGGCATGGCTC#TGTGGCTCCCACATGTGGGCTTAGGTC#GT...CCACCCCTTCACCCCGTCGCTCCGCCCGAGCTCGAATGGTGGCAC]
[2852] 146176339 146228005 51667 [CTCTGTGGTTCAGCCCCAGACCTGGTCCCACTAGCCCTGAGGCCAG...ACAGCCCGTAGACCTGCACGCCCCGCTGCCTCCGCTCGTGCCTCA]
[2853] 146228908 146232961 4054 [AGGGAGGACTTACGAGAATGG#GG#GCAGCCCGTCTTCAGTG...CTGAAGACAGCGCAGCTATCCGTAAGGATCACATGCGGCTTCCCT]
[2854] 146233087 146276729 43643 [GACCATAGTTCTCCAAGGGCACCTAGAGTGGATCCAGCTGAGTTCA...CCTTCGAGACCTTGCAGACCAGGCAGCTAACACTTCCCACCCCAC]
- 我们利用这些数据计算了
CpG岛
和非CpG岛
的跃迁计数。
> seqCGI = as(CGIview, "DNAStringSet")
> seqCGI
A DNAStringSet instance of length 2855
width seq
[1] 1272 CGGGTGCCATCTCAGCAGCTCACGGTGTGG#CTGCGACACTCACACGGGTGCCATC...TCTCTGATTACATTGGAACTGTGCGTTTGCGGAAGAACACGGCGGGGGGGGCGGCGC
[2] 303 CGATTAGCGGGAGCGC#GCGAAGGGGCGGCCTGCGACGTTTTCCTGT#GTTGGG...GCTCAGCACGGACGAGGAGGGAGCCGCGCACGCGCGGTCGGCTCGGCGAGGAGCCGG
[3] 1668 CGGCGCTGCAGGAGAGGAGATGCCCAGGCCAGGCGGCCGGCGCACGTGGGTTCTCTGT...GAGTCCCTGTGTGTGATGTTGTGTTCTCGGTGTGAGTTCATGGGTGTGACGGGGCGT
[4] 1265 CGGCTGGCCGGGCAGGGGGGCTGACCCCCCCCACCTCCCTCCTGGACGGGGCGGCTGG...AGACTGAGACAGCTCCGCTGCCCGCTGAACTCCATCCTCCCGGCGGTCGGGCGGCGG
[5] 808 CGGG#GATTTTATTCACCGTCGATGCGGCCCCGAGTTGTCCC#GCCAGGCAGTG...GCTGGCGACGGGGCACGGCAGGGCTCTCTTGCTCGCAGTATACTGGCGGCACGCCGC
... ... ...
[2851] 2077 CGGCAAGTAGCACACTCGACCAACTGCTG#CG#GCAGCTCGTTTTCGTCTTGAC...AGTTTTTGAG#AGAGAAGGCTTTATTATTTCAAGGTCGACGGCCAAGGAGACCGG
[2852] 472 CGACAGGCGGGAACGCCACCAGCCTGTGGGCGGAGTCCTGGCTCGGCCCCTCGCGGCG...GCCGCTCGTTCTATCTGGGGCAGCGTAGCCTCTCTAGATGGCGGGAGGCCGGGGCGA
[2853] 902 CGCCACCCTCGCCCTGCCCGCGCCTCGGCCAACAC#GCGGGCGCTCAGCACCCCGC...AGCGGAGTGAGGGGAGAGGCCAGGTGCGACCCGGGCAGGTGAGGAGCACGGGCCCGG
[2854] 125 CGCCTCCAGCTGCGCAAGGGCAGGTGCGGGCAAGGAAGTCCCACTGGGGCTTTTGGGG...GCTCGGGCGAGTGGGTGAGGGTAGAGAGGATGCCTCAGTGCGTCCGGACACCACCGA
[2855] 1631 CGTCTCAGG#CACCGCTTTCCACTCCTGTGTCCCCG#GGAATCATCAGACATCT...GGAAGGGTGGAATCCACGGGTGCGAATCCACGGGCACGTGTGATCTGGGGTCCGCGG
> seqNOnCGI= as(NonCGIview, "DNAStringSet")
> dinucCpG = sapply(seqCGI, dinucleotideFrequency)
> dinucNOnCpG= sapply(seqNonCGI, dinucleotideFrequency)
> dinucNonCpG[, 1]
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
389 351 400 436 498 560 112 603 359 336 403 336 330 527 519 485
> dinucCpG[, 1]
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
49 127 55 8 132 44 116 112 54 129 107 100 4 104 112 18
> NOnICounts= rowSums(dinucNonCpG)
> IslCounts = rowSums(dinucCpG)
> NonICounts
AA AC AG AT CA CC CG CT GA GC GG GT TA TC
14223322 7129981 9795572 11291315 10241505 6931732 1174927 9779692 8372468 5706958 6934755 7112531 9602902 8359398
TG TT
10221420 14197986
> IslCounts
AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
64103 83109 122131 44000 113346 193847 133549 122377 105186 177326 194517 86752 31210 106738 114592 65861
- For a four state Markov chain as we have, we define the transition matrix as a matrix where the rows are the from state and the columns are the to state. (这里放原文比较好理解)
> TI = matrix( IslCounts, ncol = 4, byrow = TRUE)
> TI
[,1] [,2] [,3] [,4]
[1,] 64103 83109 122131 44000
[2,] 113346 193847 133549 122377
[3,] 105186 177326 194517 86752
[4,] 31210 106738 114592 65861
> TnI = matrix(NonICounts, ncol = 4, byrow = TRUE)
> dimnames(TI) = dimnames(TnI) =
+ list(c("A", "C", "G", "T"), c("A", "C", "G", "T"))
> TI
A C G T
A 64103 83109 122131 44000
C 113346 193847 133549 122377
G 105186 177326 194517 86752
T 31210 106738 114592 65861
- 我们使用每种类型的跃迁次数的计数来计算频率,并将它们放入两个矩阵中 。
> TI
A C G T
A 64103 83109 122131 44000
C 113346 193847 133549 122377
G 105186 177326 194517 86752
T 31210 106738 114592 65861
> MI = TI /rowSums(TI)
> MI
A C G T
A 0.20457773 0.2652333 0.3897678 0.1404212
C 0.20128250 0.3442381 0.2371595 0.2173200
G 0.18657245 0.3145299 0.3450223 0.1538754
T 0.09802105 0.3352314 0.3598984 0.2068492
> MN = TnI / rowSums(TnI)
> MN
A C G T
A 0.3351380 0.1680007 0.23080886 0.2660524
C 0.3641054 0.2464366 0.04177094 0.3476871
G 0.2976696 0.2029017 0.24655406 0.2528746
T 0.2265813 0.1972407 0.24117528 0.3350027► 问题
- 这意味着不同行中的转换是否不同?例如, 图 2.25
2.11 Summary of this chapter
Goodness of fit 我们使用了不同的可视化,并展示了如何运行模拟实验来检验我们的数据是否可以用一个公平的四盒多项式模型进行拟合。我们遇到了
卡方检验
,看到了如何使用QQ图
将模拟和理论进行比较。Estimation 我们解释了
最大似然
和贝叶斯估计
过程。举例说明了这些方法,包括核苷酸模式发现
和单倍型估计
。Prior and posterior distributions 在评估以前研究过的类型的数据(如单倍型)时,计算数据的后验分布可能是有益的。这使我们能够通过简单的计算,在决策过程中加入不确定性。只要有足够的数据,先验的选择对结果影响不大。
CpG islands and Markov chains 我们了解了如何通过
马尔可夫链
转移来模拟DNA序列中的依赖关系
。我们利用这一点来建立基于似然比的分数
,这样我们就可以看到长DNA序列是否来自CpG岛
。当我们制作分数直方图时,我们在图 2.25
中看到了一个值得注意的特征:它似乎是由两个部分组成的。这种双峰是我们第一次遇到混合物,它们是第四章的主题。这是在一些训练数据上建立模型的第一个例子:我们知道序列在CpG岛上,我们以后可以用这些序列对新的数据进行分类。我们将在
第12章
中制定一种更完整的方法来实现这一点。References
Cleveland, William S. 1988. The Collected Works of John W. Tukey: Graphics 1965-1985. Vol. 5. CRC Press.
Rice, John. 2006. Mathematical Statistics and Data Analysis. Cengage Learning.
Elson, D, and E Chargaff. 1952. “On the Desoxyribonucleic Acid Content of Sea Urchin Gametes.” Experientia 8 (4). Springer: 143–45.
Mourant, AE, Ada Kopec, and K Domaniewska-Sobczak. 1976. “The Distribution of the Human Blood Groups 2nd Edition.” Oxford University Press London.
Finetti, Bruno de. 1926. “Considerazioni Matematiche Sull’ereditarieta Mendeliana.” Metron 6: 3–41.
Cannings, Chris, and Anthony WF Edwards. 1968. “Natural Selection and the de Finetti Diagram.” Annals of Human Genetics 31 (4). Wiley Online Library: 421–28.
Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-seq Data with DESeq2.” Gnome Biology 15 (12): 1–21.
Irizarry, Rafael A, Hao Wu, and Andrew P Feinberg. 2009. “A Species-Generalized Probabilistic Model-Based Definition of Cpg Islands.” Mammalian Genome 20 (9-10). Springer: 674–80.
Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison. 1998. Biological Sequence Analysis. Cambridge University Press.
Freedman, David, Robert Pisani, and Roger Purves. 1997. Statistics. New York, NY: WW Norton.
Agresti, Alan. 2007. An Introduction to Categorical Data Analysis. John Wiley.
Grantham, Richard, Christian Gautier, Manolo Gouy, M Jacobzone, and R Mercier. 1981. “Codon Catalog Usage Is a Genome Strategy Modulated for Gene Expressivity.” Nucleic Acids Research 9 (1). Oxford Univ Press: 213–13.
Perrière, Guy, and Jean Thioulouse. 2002. “Use and Misuse of Correspondence Analysis in Codon Usage Studies.” Nucleic Acids Research 30 (20). Oxford Univ Press: 4548–55.
Robert, Christian, and George Casella. 2009. Introducing Monte Carlo Methods with R. Springer Science & Business Media.
Marin, Jean-Michel, and Christian Robert. 2007. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer Science & Business Media.
McElreath, Richard. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman; Hall/CRC.
放在最后的话 : 这一章真的看得很难受,虽然给的代码中学习到了很多。