Gene Data Processing 33: Avocado Run Log (Reference Genome)
2016-05-28 19:51
1. Data download:
From avocado's test resources
2. Preprocessing:
3. Run log:
4. Results:
Appendix:
(1) Partial code:
(2) Insufficient memory (chr22 reference attempt, contig mismatch):
1. Data download:
From avocado's test resources.
2. Preprocessing:
# Index the FASTA headers with their line numbers, to locate each chromosome
grep -i -n '>' Homo_sapiens_assembly19.fasta > Homo_sapiens_assembly19Head.txt
cat Homo_sapiens_assembly19Head.txt
# Cut out the chr20 record: keep everything up to line 34770016,
# then take the last 787820 lines of that slice
head -34770016 Homo_sapiens_assembly19.fasta | tail -787820 > Homo_sapiens_assembly19chr20.fasta
# Upload the single-chromosome reference to HDFS
hadoop fs -put Homo_sapiens_assembly19chr20.fasta /xubo/ref
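For reference, the `head`/`tail` arguments above can be derived from the header index: given `head -34770016 | tail -787820`, the chr20 record must occupy lines 33982197–34770016 of the FASTA. The start line 33982197 is implied by those counts rather than read from the index file here, so treat it as an assumption. A minimal sketch of the arithmetic:

```shell
# Derive head/tail arguments for extracting one chromosome, given the line
# numbers of its header and of the next header (taken from the grep -n index).
# start=33982197 is inferred from the counts above, not read from the index.
start=33982197   # line of the ">20" header (assumed)
next=34770017    # line of the following ">21" header (assumed)
end=$((next - 1))            # last line of the chr20 record
count=$((end - start + 1))   # number of lines to keep
echo "head -$end Homo_sapiens_assembly19.fasta | tail -$count"
```

This reproduces the exact command used above, so the same recipe works for any chromosome once its two header line numbers are known.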
3. Run log:
hadoop@Master:~/xubo/data/testTools/avocado$ avocado-submit /xubo/avocado/NA12878_snp_A2G_chr20_225058.sam /xubo/ref/Homo_sapiens_assembly19chr20.fasta /xubo/avocado/test201605281620AvocadoZidai6 /home/hadoop/xubo/data/testTools/basic.properties
Using SPARK_SUBMIT=/home/hadoop/cloud/spark-1.5.2//bin/spark-submit
Loading reads in from /xubo/avocado/NA12878_snp_A2G_chr20_225058.sam
[Stage 8:> (0 + 2) / 3]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
4. Results:
{"variant": {"variantErrorProbability": null, "contig": {"contigName": "20", "contigLength": 63025520, "contigMD5": null, "referenceURL": null, "assembly": null, "species": null, "referenceIndex": null}, "start": 224970, "end": 224971, "referenceAllele": "G", "alternateAllele": "A", "svAllele": null, "isSomatic": false}, "variantCallingAnnotations": {"variantIsPassing": null, "variantFilters": [], "downsampled": null, "baseQRankSum": null, "fisherStrandBiasPValue": 0.07825337, "rmsMapQ": 60.0, "mapq0Reads": null, "mqRankSum": 0.0, "readPositionRankSum": null, "genotypePriors": [], "genotypePosteriors": [], "vqslod": null, "culprit": null, "attributes": {}}, "sampleId": "NA12878", "sampleDescription": null, "processingDescription": null, "alleles": ["Ref", "Alt"], "expectedAlleleDosage": null, "referenceReadDepth": 3, "alternateReadDepth": 5, "readDepth": 8, "minReadDepth": null, "genotypeQuality": 2147483647, "genotypeLikelihoods": [-32.696815, -5.5451775, -53.880547], "nonReferenceLikelihoods": [-32.696815, -5.5451775, -53.880547], "strandBiasComponents": [], "splitFromMultiAllelic": false, "isPhased": false, "phaseSetId": null, "phaseQuality": null}

{"variant": {"variantErrorProbability": null, "contig": {"contigName": "20", "contigLength": 63025520, "contigMD5": null, "referenceURL": null, "assembly": null, "species": null, "referenceIndex": null}, "start": 225057, "end": 225058, "referenceAllele": "A", "alternateAllele": "G", "svAllele": null, "isSomatic": false}, "variantCallingAnnotations": {"variantIsPassing": null, "variantFilters": [], "downsampled": null, "baseQRankSum": null, "fisherStrandBiasPValue": 0.79760426, "rmsMapQ": 59.228653, "mapq0Reads": null, "mqRankSum": -0.23090047, "readPositionRankSum": null, "genotypePriors": [], "genotypePosteriors": [], "vqslod": null, "culprit": null, "attributes": {}}, "sampleId": "NA12878", "sampleDescription": null, "processingDescription": null, "alleles": ["Ref", "Alt"], "expectedAlleleDosage": null, "referenceReadDepth": 49, "alternateReadDepth": 41, "readDepth": 90, "minReadDepth": null, "genotypeQuality": 2147483647, "genotypeLikelihoods": [-507.5003, -62.383247, -409.40555], "nonReferenceLikelihoods": [-507.5003, -62.383247, -409.40555], "strandBiasComponents": [], "splitFromMultiAllelic": false, "isPhased": false, "phaseSetId": null, "phaseQuality": null}
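As a quick sanity check on the second record above (this check is mine, not part of avocado's output): the reported depths should satisfy readDepth = referenceReadDepth + alternateReadDepth, and an alt-allele fraction near 0.5 is consistent with the heterozygous call favored by the genotype likelihoods:

```shell
# Check the 225057 A->G record: 49 ref reads + 41 alt reads = 90 total,
# alt fraction ~0.46, consistent with a heterozygote.
ref=49; alt=41; depth=90
[ $((ref + alt)) -eq "$depth" ] && echo "depths consistent"
awk "BEGIN { printf \"alt fraction = %.2f\n\", $alt / $depth }"
```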
Appendix:
(1) Partial code:
package org.bdgenomics.avocado.cli

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}
import org.bdgenomics.adam.rdd.ADAMContext._

/**
 * Created by xubo on 2016/5/27.
 * Reads the avocado-called genotype data back from HDFS.
 */
object parquetRead { ... }
(2) Insufficient memory (chr22 reference attempt, contig mismatch):
hadoop@Master:~/xubo/data/testTools/avocado$ avocado-submit /xubo/avocado/NA12878_snp_A2G_chr20_225058.sam /xubo/ref/Homo_sapiens_assembly19chr22.fasta /xubo/avocado/test201605281620AvocadoZidai5 /home/hadoop/xubo/data/testTools/basic.properties
Using SPARK_SUBMIT=/home/hadoop/cloud/spark-1.5.2//bin/spark-submit
Loading reads in from /xubo/avocado/NA12878_snp_A2G_chr20_225058.sam
16/05/28 19:18:10 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 6)
java.lang.IllegalArgumentException: requirement failed: Received key (ReferenceRegion(20,225058,225059,Independent)) that did not map to a known contig. Contigs are: 22
    at scala.Predef$.require(Predef.scala:233)
    at org.bdgenomics.adam.rdd.GenomicPositionPartitioner.getPartition(GenomicPartitioners.scala:81)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:121)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
16/05/28 19:18:10 ERROR TaskSetManager: Task 0 in stage 7.0 failed 1 times; aborting job
Command body threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 6, localhost): java.lang.IllegalArgumentException: requirement failed: Received key (ReferenceRegion(20,225058,225059,Independent)) that did not map to a known contig. Contigs are: 22
    at scala.Predef$.require(Predef.scala:233)
    at org.bdgenomics.adam.rdd.GenomicPositionPartitioner.getPartition(GenomicPartitioners.scala:81)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:121)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 1 times, most recent failure: Lost task 0.0 in stage 7.0 (TID 6, localhost): java.lang.IllegalArgumentException: requirement failed: Received key (ReferenceRegion(20,225058,225059,Independent)) that did not map to a known contig. Contigs are: 22
    at scala.Predef$.require(Predef.scala:233)
    at org.bdgenomics.adam.rdd.GenomicPositionPartitioner.getPartition(GenomicPartitioners.scala:81)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:121)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1055)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:998)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:998)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:938)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:930)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:930)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:930)
    at org.apache.spark.rdd.InstrumentedPairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(InstrumentedPairRDDFunctions.scala:485)
    at org.apache.spark.rdd.InstrumentedPairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(InstrumentedPairRDDFunctions.scala:485)
    at org.apache.spark.rdd.InstrumentedPairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(InstrumentedPairRDDFunctions.scala:485)
    at org.apache.spark.rdd.Timer.time(Timer.scala:57)
    at org.apache.spark.rdd.InstrumentedRDD$.recordOperation(InstrumentedRDD.scala:378)
    at org.apache.spark.rdd.InstrumentedPairRDDFunctions.saveAsNewAPIHadoopFile(InstrumentedPairRDDFunctions.scala:484)
    at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$adamParquetSave$1.apply$mcV$sp(ADAMRDDFunctions.scala:75)
    at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$adamParquetSave$1.apply(ADAMRDDFunctions.scala:60)
    at org.bdgenomics.adam.rdd.ADAMRDDFunctions$$anonfun$adamParquetSave$1.apply(ADAMRDDFunctions.scala:60)
    at org.apache.spark.rdd.Timer.time(Timer.scala:57)
    at org.bdgenomics.adam.rdd.ADAMRDDFunctions.adamParquetSave(ADAMRDDFunctions.scala:60)
    at org.bdgenomics.avocado.cli.Avocado$$anonfun$run$1.apply$mcV$sp(Avocado.scala:229)
    at org.bdgenomics.avocado.cli.Avocado$$anonfun$run$1.apply(Avocado.scala:229)
    at org.bdgenomics.avocado.cli.Avocado$$anonfun$run$1.apply(Avocado.scala:229)
    at org.apache.spark.rdd.Timer.time(Timer.scala:57)
    at org.bdgenomics.avocado.cli.Avocado.run(Avocado.scala:228)
    at org.bdgenomics.utils.cli.BDGSparkCommand$class.run(BDGCommand.scala:54)
    at org.bdgenomics.avocado.cli.Avocado.run(Avocado.scala:82)
    at org.bdgenomics.utils.cli.BDGCommandCompanion$class.main(BDGCommand.scala:32)
    at org.bdgenomics.avocado.cli.Avocado$.main(Avocado.scala:52)
    at org.bdgenomics.avocado.cli.Avocado.main(Avocado.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Received key (ReferenceRegion(20,225058,225059,Independent)) that did not map to a known contig. Contigs are: 22
    at scala.Predef$.require(Predef.scala:233)
    at org.bdgenomics.adam.rdd.GenomicPositionPartitioner.getPartition(GenomicPartitioners.scala:81)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:121)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
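The root cause in this log is that the reads are aligned to contig 20 while the chr22 reference only declares contig 22, so the genomic partitioner rejects the key; the fix is to submit against a reference containing contig 20, such as the chr20 FASTA built in the preprocessing step. The check avocado is effectively enforcing can be illustrated in miniature (the /tmp path and toy FASTA below are made up for the demo):

```shell
# A read aligned to contig "20" can only be partitioned if the reference
# FASTA actually contains a ">20" record. Build a toy chr22-only reference:
cat > /tmp/ref_demo.fasta <<'EOF'
>22 dna:chromosome
ACGTACGT
EOF
read_contig=20
if grep -q "^>${read_contig}\b" /tmp/ref_demo.fasta; then
  echo "contig $read_contig present"
else
  echo "contig $read_contig missing: same mismatch as the failure above"
fi
```

Running `grep '>'` on the reference before submitting is a cheap way to catch this before a Spark job dies mid-shuffle.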