Spark overview

 

1、Resilient Distributed Datasets (RDDs)

Immutable, partitioned collections of objects

Created through parallel transformations (map, filter, groupBy, join, …) on data in stable storage

Can be cached in memory for efficient reuse (see the sketch below)
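
A minimal Scala sketch of these three points, assuming a local SparkContext and a hypothetical HDFS path (both are illustrative, not from the original notes):

import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // An RDD created from data in stable storage (hypothetical path)
    val lines = sc.textFile("hdfs://namenode:8020/logs/access.log")

    // Parallel transformations produce new immutable, partitioned RDDs
    val errors = lines.filter(_.contains("ERROR")).map(_.split("\t")(0))

    // Cache in memory so later actions reuse the data instead of re-reading HDFS
    errors.cache()
    println(errors.count())  // first action computes and caches the partitions
    println(errors.first())  // served from the cache

    sc.stop()
  }
}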

 

2、RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct lost partitions
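
As an illustration (not from the original notes), toDebugString prints an RDD's lineage; the path is hypothetical and `sc` is the SparkContext from the sketch above:

// Each RDD records the transformations that produced it from its parents.
val counts = sc.textFile("hdfs://namenode:8020/corpus/*.txt")   // hypothetical path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Prints the lineage graph; Spark replays only the steps needed to rebuild a lost partition.
println(counts.toDebugString)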

 

 

3、Aggregations on many keys with the same WHERE clause run roughly 40x faster than Hive, because of:

Not re-reading unused columns or filtered records

Avoiding repeated decompression

In-memory storage of deserialized objects
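
A small sketch of the last point, assuming the SparkContext `sc` and hypothetical paths from above: the MEMORY_ONLY storage level keeps partitions as deserialized JVM objects, so repeated scans pay neither a re-read nor a deserialization cost, while MEMORY_ONLY_SER trades CPU for a smaller footprint.

import org.apache.spark.storage.StorageLevel

// Deserialized in-memory caching (what the note above describes)
val hotRows = sc.textFile("hdfs://namenode:8020/warehouse/table/*")   // hypothetical path
  .filter(_.contains("2013"))
  .persist(StorageLevel.MEMORY_ONLY)

// Serialized alternative: more compact, but every access re-deserializes
val compactRows = sc.textFile("hdfs://namenode:8020/warehouse/table/*")
  .filter(_.contains("2013"))
  .persist(StorageLevel.MEMORY_ONLY_SER)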

 

 

4、Runs on Apache Mesos to share resources with Hadoop and other applications

Can read from any Hadoop input source, e.g. HDFS (see the sketch after this list)

No changes to the Scala compiler (plain Scala)
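
A sketch of reading Hadoop input sources, again assuming the existing SparkContext `sc`; all paths are hypothetical:

import org.apache.hadoop.io.{IntWritable, Text}

// textFile accepts any Hadoop-supported URI: HDFS, the local filesystem, S3, ...
val fromHdfs  = sc.textFile("hdfs://namenode:8020/data/input.txt")
val fromLocal = sc.textFile("file:///tmp/input.txt")

// Arbitrary Hadoop file formats work too, e.g. a SequenceFile of (Text, IntWritable)
val fromSeq = sc
  .sequenceFile("hdfs://namenode:8020/data/counts.seq", classOf[Text], classOf[IntWritable])
  .map { case (k, v) => (k.toString, v.get) }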

 

5、Spark scheduler

Dryad-like DAGs

Pipelines functions within a stage

Cache-aware work reuse and locality

Partitioning-aware, to avoid shuffles
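
A sketch of the partitioning-aware point, assuming the SparkContext `sc` from above and tiny made-up data: pre-partitioning and caching one side of a join lets the scheduler shuffle only the other side.

import org.apache.spark.HashPartitioner

val visits = sc.parallelize(Seq((1, "home"), (2, "cart"), (1, "checkout")))
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))

// Hash-partition the larger side once and cache it; the RDD remembers its partitioner.
val partitionedVisits = visits.partitionBy(new HashPartitioner(8)).cache()

// The scheduler sees the existing partitioning, so only `users` is shuffled here.
val joined = partitionedVisits.join(users)
joined.collect().foreach(println)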

 

 

 
