Task not serializable是Spark开发过程最令人头疼的问题之一,这里记录下出现这个问题的两个实例,一个是自己遇到的,另一个是stackoverflow上看到。等有时间了再仔细探究出现Task not serialiazable的各种原因以及出现问题后如何快速定位问题的所在,至少目前阶段碰到此类问题,没有什么章法
1.
package spark.examples.streaming import org.apache.spark.SparkConf import org.apache.spark.streaming._ import scala.collection.mutable object NetCatStreamingWordCount3 { def main(args: Array[String]) { val conf = new SparkConf().setAppName("NetCatWordCount") conf.setMaster("local[3]") val ssc = new StreamingContext(conf, Seconds(5)) val lines = ssc.socketTextStream("localhost", 9999) lines.foreachRDD(rdd => { rdd.foreachPartition(partitionIterable=> { val map = mutable.Map[String, String]() while(partitionIterable.hasNext) { val v = partitionIterable.next() map += v ->v } map.foreach(entry => { if (entry._1.equals("abc")) { return; //return语句导致Task无法序列化,两个字:诡异,三个字:太诡异 } }) }) }) ssc.start() ssc.awaitTermination() } }
异常信息:
org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1622) at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:805) at spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1.apply(NetCatStreamingWordCount3.scala:15) at spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1.apply(NetCatStreamingWordCount3.scala:14) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.io.NotSerializableException: java.lang.Object Serialization stack: - object not serializable (class: java.lang.Object, value: java.lang.Object@143d53c) - field (class: spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1, name: nonLocalReturnKey1$1, type: class java.lang.Object) - object (class spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1, <function1>) - field (class: spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1$$anonfun$apply$1, name: $outer, type: class spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1) - object (class spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1$$anonfun$apply$1, <function1>) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) ... 20 more Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1622) at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:805) at spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1.apply(NetCatStreamingWordCount3.scala:15) at spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1.apply(NetCatStreamingWordCount3.scala:14) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534) at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.io.NotSerializableException: java.lang.Object Serialization stack: - object not serializable (class: java.lang.Object, value: java.lang.Object@143d53c) - field (class: spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1, name: nonLocalReturnKey1$1, type: class java.lang.Object) - object (class spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1, <function1>) - field (class: spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1$$anonfun$apply$1, name: $outer, type: class spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1) - object (class spark.examples.streaming.NetCatStreamingWordCount3$$anonfun$main$1$$anonfun$apply$1, <function1>) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) ... 20 more
2.
package spark.examples.rdd import org.apache.spark.{SparkConf, SparkContext} object TaskNotSerializationTest { def main(args: Array[String]) { new Testing().runJob } } class Testing { val conf = new SparkConf().setMaster("local").setAppName("TaskNotSerializationTest") val sc = new SparkContext(conf) val rdd = sc.parallelize(List(1, 2, 3)) def runJob = { rdd.map(someFunc).collect().foreach(println) } def someFunc(a: Int) = a + 1 }
异常信息:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158) at org.apache.spark.SparkContext.clean(SparkContext.scala:1622) at org.apache.spark.rdd.RDD.map(RDD.scala:286) at spark.examples.rdd.Testing.runJob(TaskNotSerializationTest.scala:20) at spark.examples.rdd.TaskNotSerializationTest$.main(TaskNotSerializationTest.scala:10) at spark.examples.rdd.TaskNotSerializationTest.main(TaskNotSerializationTest.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) Caused by: java.io.NotSerializableException: spark.examples.rdd.Testing Serialization stack: - object not serializable (class: spark.examples.rdd.Testing, value: spark.examples.rdd.Testing@b8972) - field (class: spark.examples.rdd.Testing$$anonfun$runJob$1, name: $outer, type: class spark.examples.rdd.Testing) - object (class spark.examples.rdd.Testing$$anonfun$runJob$1, <function1>) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164) ... 11 more
第二个问题:stackoverflow上有比较详细的讨论:
http://stackoverflow.com/questions/22592811/task-not-serializable-java-io-notserializableexception-when-calling-function-ou
相关推荐
org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298) at org.apache.spark.util.ClosureCleaner$.org$apache$spark...
Job aborted due to stage failure: Task not serializable 缺失依赖 执行 start-all.sh 错误 - Connection refused Spark 组件之间的网络连接问题 性能 & 优化 一个 RDD 有多少个分区 数据本地性 Spark Streaming ...
Python json 错误xx is not JSON serializable解决办法 在使用json的时候经常会遇到xxx is not JSON serializable,也就是无法序列化某些对象。经常使用django的同学知道django里面有个自带的Encoder来序列化时间等...
序列化 serializable demo ! 序列化 serializable demo !
說明如何將Serializable物件轉成stream
Serializable的增删改查操作,已经经过验证,可以直接运行。
问题描述 因为numpy的int类型无法被json化,所以需要将numpy的int转为原生类型。 解决方案 # pandas返回的 sex_cnt = marks['sex'].value_counts() type(sex_cnt['男']) # numpy.int64 # 3种转化方法 ...
java->serializable深入了解 java->serializable深入了解 java->serializable深入了解
Laravel开发-serializable-values Luminark可序列化值包。
Java_Serializable(序列化) 的理解和总结
Intent传递数据是android开发中最长用的数据传递方式,可是要传递对象不怎么常用,这里介绍第一种传递对象的方法Serializable传递
java.io.Serializable序列化问题
详细讲解了C#中关于对象序列化的知识,包括基本序列化、选择序列化、自定义序列化;对于了解在C#中如何进行对象的序列化有价值
bundle传递基本数据,Parcelable类型数据,Serializable类型数据
Android序列化——Serializable与Parcelable
java 序列化 对象 Serializable 写着玩的Demo 简单 实用
Android中的Serializable
java 将对象序列化 输出对象的值,不懂可以百度序列化干啥的,为什么要用序列化,好处。
[Serializable]在C_中的作用-NET_中的对象序列化,希望有所帮助