can_do
Storm vs Spark

Update: additional question about Storm

The question is to compare Spark to Storm (see comments below).

Spark is still based on the idea that, when the existing data volume is huge, it is cheaper to move the process to the data than to move the data to the process. Each node stores (or caches) its dataset, and jobs are submitted to the nodes, so the process moves to the data. It is very similar to Hadoop map/reduce, except that memory storage is used aggressively to avoid I/O, which makes it efficient for iterative algorithms (where the output of the previous step is the input of the next step). Shark is only a query engine built on top of Spark, supporting ad-hoc analytical queries.
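The in-memory iteration idea can be sketched in plain Python (no Spark involved; `load_partition` and `iterate` are hypothetical names used purely for illustration):

```python
# A minimal sketch of why caching the dataset in memory helps iterative
# algorithms: the data is read once, and every step reuses the cached
# copy instead of re-reading it from storage between steps.

def load_partition():
    # Stand-in for one expensive read from distributed storage.
    return [1.0, 2.0, 3.0, 4.0]

def iterate(cached, steps):
    result = cached
    for _ in range(steps):
        # The output of the previous step is the input of the next;
        # no I/O happens between steps.
        result = [x * 2 for x in result]
    return result

data = load_partition()   # read once, then kept in memory (as Spark caches RDDs)
print(iterate(data, 3))   # [8.0, 16.0, 24.0, 32.0]
```

The point of the sketch is only the shape of the computation: the expensive read happens once, outside the loop.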

You can see Storm as the complete architectural opposite of Spark. Storm is a distributed streaming engine. Each node implements a basic process, and data items flow into and out of a network of interconnected nodes (contrary to Spark). With Storm, the data moves to the process.
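That "network of small processing nodes" can be sketched in plain Python as a chain of generators (this is a conceptual sketch only, not Storm's actual API; the word-count example and the names `spout`, `split_bolt`, `count_bolt` just echo Storm's spout/bolt vocabulary):

```python
# A fixed network of small processing nodes; each data item flows through
# the nodes as it arrives. The processes stay put; the data moves.

def spout(items):
    for item in items:            # source node: emits items one at a time
        yield item

def split_bolt(stream):
    for sentence in stream:       # node 1: split each sentence into words
        yield from sentence.split()

def count_bolt(stream):
    counts = {}
    for word in stream:           # node 2: emit a running per-word count
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

pipeline = count_bolt(split_bolt(spout(["a b a"])))
print(list(pipeline))
# [('a', 1), ('b', 1), ('a', 2)]
```

Note that each item is fully processed and forwarded before the next one is read, which is the per-item behavior the answer attributes to Storm.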

Both frameworks are used to parallelize computations over massive amounts of data.

However, Storm is good at dynamically processing numerous small generated/collected data items (such as calculating an aggregation function or analytics in real time on a Twitter stream).

Spark applies to a corpus of existing data (like Hadoop) which has been imported into the Spark cluster; it provides fast scanning capabilities thanks to in-memory management, and it minimizes the global number of I/O operations for iterative algorithms.

The Spark Streaming module is comparable to Storm (both are streaming engines), but they work differently. Spark Streaming accumulates batches of data and then submits these batches to the Spark engine as if they were immutable Spark datasets. Storm processes and dispatches items as soon as they are received. I don't know which one is more efficient in terms of throughput; in terms of latency, it is probably Storm.
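The micro-batch vs. per-item contrast can be sketched in a few lines of plain Python (again a conceptual sketch, not either framework's API; `micro_batches` and `per_item` are hypothetical names):

```python
# Micro-batching (Spark Streaming style) accumulates items and hands each
# batch over as one immutable mini-dataset; per-item processing (Storm
# style) forwards every item as soon as it is received.

def micro_batches(stream, batch_size):
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield tuple(batch)    # one immutable mini-dataset per batch
            batch = []
    if batch:
        yield tuple(batch)        # flush the final partial batch

def per_item(stream):
    for item in stream:
        yield item                # no buffering: minimal latency per item

print(list(micro_batches(range(5), 2)))   # [(0, 1), (2, 3), (4,)]
print(list(per_item(range(3))))           # [0, 1, 2]
```

Batching amortizes per-submission overhead (throughput), while the item-at-a-time path avoids waiting for a batch to fill (latency), which matches the trade-off described above.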




Here are my 2 cents: Spark Streaming has the concept of a sliding window, while in Storm you have to maintain the window yourself.
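What "maintain the window yourself" means can be shown with a small plain-Python sketch (hypothetical helper, not Storm code) of the bookkeeping a Storm bolt would have to carry, which Spark Streaming's window operations provide built in:

```python
from collections import deque

def sliding_window_sums(stream, window_size):
    """After each new item, yield the sum over the last `window_size` items."""
    window = deque(maxlen=window_size)  # deque evicts the oldest item itself
    for item in stream:
        window.append(item)
        yield sum(window)

print(list(sliding_window_sums([1, 2, 3, 4, 5], 3)))
# [1, 3, 6, 9, 12]
```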
**************************************

>> Key point: the difference between "move the process to the data" and "move the data to the process".
>>> Spark is very similar to Hadoop MapReduce; the difference is that Spark caches datasets in memory to improve performance and to make iterative algorithms (where the output of one step is the input of the next) more efficient.
==> This is Spark's RDD computation model.
>>> Shark is just a query engine built on top of Spark, supporting ad-hoc analytical queries.
>>> Spark is still based on the idea that, when the existing data volume is huge, it is cheaper to move the process to the data than to move the data to the process.
>>> With Spark the process moves to the data, while with Storm the data flows to the process.
>>> In other words, with Spark the data waits to be processed, whereas with Storm the data actively seeks out its processing.

>>> Storm is the complete architectural opposite of Spark: it is a distributed stream-processing engine in which each node implements a basic process, and data items flow into and out of a network of interconnected nodes. So with Storm, the data moves to the process.

>>> Both Spark and Storm can be used for parallel computation over massive amounts of data.

>>> However, Storm is good at dynamically processing large numbers of generated or collected small data items (for example, computing aggregation functions or real-time analytics on a Twitter stream).
==> Storm is suited to real-time analysis of large volumes of changing data.

>>> Spark, like Hadoop, applies to a corpus of existing data that has been imported into the Spark cluster.
>>> Spark provides fast scanning because of its in-memory management, and it minimizes the total amount of I/O for iterative algorithms.

>>> Spark's streaming module is comparable to Storm (both are stream-processing engines), but they work differently: Spark Streaming accumulates batches of data and then submits those batches to the Spark engine, which treats them as immutable Spark datasets.

>>> Storm, by contrast, processes and dispatches each item as soon as a Storm node receives it.

>>> I don't know which one is more efficient in terms of throughput, but in terms of latency it is probably Storm that is lower.
==> The performance metrics in play are high throughput and low latency.

>>> Spark is coded in Scala; Storm is coded in Clojure.

>>> Spark Streaming has the concept of a sliding window, whereas in Storm you must maintain the window yourself.