
Spark: RDD vs DataFrame vs Dataset

 

 

In summation, the choice of when to use an RDD versus a DataFrame or Dataset seems obvious. While the former offers you low-level functionality and control, the latter allow a custom view and structure, offer high-level and domain-specific operations, save space, and execute at superior speeds.
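To make the contrast concrete, here is a minimal sketch in Scala (the sample data, column names, and app name are illustrative, not from the original article): the RDD version spells out step by step how to compute a per-device average, while the DataFrame version only declares what result it wants and leaves the execution plan to Spark.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.avg

    val spark = SparkSession.builder.appName("rdd-vs-df").master("local[*]").getOrCreate()
    import spark.implicits._

    val readings = Seq(("dev-a", 1200), ("dev-a", 1400), ("dev-b", 900))

    // Low-level RDD: hand-wire the key-value pairs and write the average yourself
    val rddAvg = spark.sparkContext.parallelize(readings)
      .mapValues(v => (v, 1))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum.toDouble / count }

    // High-level DataFrame: declare the intent; Spark plans the execution
    val dfAvg = readings.toDF("device", "c02_level")
      .groupBy("device")
      .agg(avg("c02_level"))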

As we examined the lessons learned from early releases of Spark (how to simplify Spark for developers, how to optimize it and make it performant), we decided to elevate the low-level RDD API into the high-level DataFrame and Dataset abstractions, and to build this unified data abstraction across libraries atop the Catalyst optimizer and the Tungsten execution engine.
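As a small illustration, reusing the dfAvg DataFrame from the sketch above, you can ask Spark to print the plans Catalyst derives for a declarative query before Tungsten executes the final physical plan:

    // Prints the parsed, analyzed, and optimized logical plans,
    // plus the physical plan Catalyst produces for the query above
    dfAvg.explain(true)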

Pick the API (DataFrame, Dataset, or RDD) that meets your needs and use case, but I would not be surprised if you fall into the camp of most developers, who work with structured and semi-structured data.

Note that you can always seamlessly interoperate with, or convert from, a DataFrame or Dataset to an RDD with a simple method call, .rdd. For instance:
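A minimal sketch, again reusing the dfAvg DataFrame from the first example (any DataFrame or Dataset converts the same way):

    // .rdd drops down to the underlying RDD; for a DataFrame the element type is Row
    val eventsRDD = dfAvg.rdd
    eventsRDD.take(2).foreach(println)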


That is, the layering of the abstractions looks like this:

    --------------------
    |     Dataset      |
    |- - - - - - - - - |
    |    DataFrame     |
    --------------------
    |       RDD        |
    --------------------
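This layering reflects a literal fact of the Scala API: DataFrame is a type alias defined in the org.apache.spark.sql package object as type DataFrame = Dataset[Row]. Below is a minimal sketch of moving between the typed and untyped views; the DeviceEvent case class is hypothetical:

    import org.apache.spark.sql.{DataFrame, Dataset}
    // assumes the SparkSession `spark` and `import spark.implicits._` from the first sketch

    case class DeviceEvent(device: String, c02_level: Int) // hypothetical schema

    val ds: Dataset[DeviceEvent] = Seq(DeviceEvent("dev-a", 1200)).toDS() // typed Dataset
    val df: DataFrame = ds.toDF()                       // untyped view, i.e. Dataset[Row]
    val back: Dataset[DeviceEvent] = df.as[DeviceEvent] // recover the typed view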

ref:

[1] A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets: When to Use Them and Why (Databricks blog)
[2] Spark SQL: Relational Data Processing in Spark (SIGMOD 2015)

 
