Spring Batch: a framework for high-volume, parallel batch processing
- Blog categories:
- J2EE
- Cloud Computing/NoSQL/Data Analysis
Spring Batch Documentation:
http://static.springsource.org/spring-batch/reference/index.html
Use Cases for Spring Batch:
http://static.springsource.org/spring-batch/cases/index.html
Spring Batch Tutorial:
http://www.mkyong.com/tutorials/spring-batch-tutorial/comment-page-1/#comment-138186
Anything Spring Batch can do, Hadoop can also do. But Spring Batch is a Java batch-job framework: its value is in managing your jobs for you (monitoring, flow control, restart, and so on), and it can also be viewed as a standard, saving you from missing many details if you were to write such a framework yourself. That said, pushing Spring Batch's kind of work into Hadoop would likely be overkill for Hadoop. Spring Batch is quite good for batch tasks, while Hadoop is better suited to data mining and the like; Spring Batch fits data processing at moderate scale, whereas Hadoop is for truly large-scale computation and processing.
The batch framework in Java EE 7 (JSR 352) is essentially in line with Spring Batch. For a comparison of the two, see:
https://blog.codecentric.de/en/2013/07/spring-batch-and-jsr-352-batch-applications-for-the-java-platform-differences/
On Steps:
http://docs.spring.io/spring-batch/reference/html/configureStep.html
http://www.mkyong.com/spring-batch/spring-batch-hello-world-example/
http://java.dzone.com/articles/chunk-oriented-processing
Quote:
Spring Batch provides two kinds of step:
1. The chunk-oriented task, also known as the READ-PROCESS-WRITE task.
2. The TaskletStep-oriented task, also known as the single-operation task (i.e. the Tasklet interface). The Tasklet is a simple interface that has one method, execute, which will be called repeatedly by the TaskletStep until it either returns RepeatStatus.FINISHED or throws an exception to signal a failure. Each call to the Tasklet is wrapped in a transaction (that is, all DB operations within a single TaskletStep invocation run in one transaction, so you need not worry about how a failure during the invocation affects your data). Tasklet implementors might call a stored procedure, a script, or a simple SQL update statement. To create a TaskletStep, the 'ref' attribute of the <tasklet/> element should reference a bean defining a Tasklet object; no <chunk/> element should be used within the <tasklet/>.
In Spring Batch, a Job consists of many Steps, and each Step consists of a READ-PROCESS-WRITE task or a single-operation task (tasklet).
1 Job = Many Steps.
1 Step = 1 READ-PROCESS-WRITE or 1 Tasklet (exactly one: either a chunk-oriented task or a TaskletStep-oriented task).
Job = {Step 1 -> Step 2 -> Step 3} (chained together).
As a rule of thumb: for a step that fits the IN-PROCESS-OUT model, use a chunk-oriented task; for a step that needs only one of IN or OUT, or neither (e.g. one that just cleans up resources or truncates a DB table), use a TaskletStep-oriented task.
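The two step types above can be sketched in Spring Batch's XML namespace. This is a minimal sketch only, assuming the `batch:` namespace is declared; the job/step ids and the bean names (itemReader, itemProcessor, itemWriter, cleanupTasklet) are hypothetical placeholders to be defined elsewhere in the context:

```xml
<!-- Sketch of a two-step job: a chunk-oriented step followed by a tasklet step.
     All ids and bean names here are placeholders. -->
<batch:job id="sampleJob">
    <batch:step id="readProcessWrite" next="cleanup">
        <batch:tasklet>
            <!-- Chunk-oriented step: READ-PROCESS-WRITE, committed every 10 items -->
            <batch:chunk reader="itemReader" processor="itemProcessor"
                         writer="itemWriter" commit-interval="10"/>
        </batch:tasklet>
    </batch:step>
    <batch:step id="cleanup">
        <!-- TaskletStep-oriented step: 'ref' points at a Tasklet bean;
             no <chunk/> element appears inside the <tasklet/> -->
        <batch:tasklet ref="cleanupTasklet"/>
    </batch:step>
</batch:job>
```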
Using Spring Batch together with Spring Integration:
http://blog.springsource.org/2010/02/15/practical-use-of-spring-batch-and-spring-integration/
http://static.springsource.org/spring-batch-admin/trunk/spring-batch-integration/
Open questions:
1. How to pass data between steps? A workable approach for the single-threaded case:
http://wangxiangblog.blogspot.com/2013/02/spring-batch-pass-data-across-steps.html
2. Can the processor handle items in batches? Likewise, can the reader read items in batches? Related:
http://forum.spring.io/forum/spring-projects/batch/63873-itemreader-returning-one-list
3. When one item read becomes multiple items after processing (i.e. the processor's input is one object but its output is a list), how is it handed to the writer? Related:
http://forum.spring.io/forum/spring-projects/batch/111650-itemprocessor-receiving-one-item-returning-more-than-one
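For the first question above (passing data between steps), a commonly used approach is to write values into the StepExecution's ExecutionContext and promote them to the Job's ExecutionContext with ExecutionContextPromotionListener. A hedged sketch, assuming the `batch:` namespace is declared; step ids, bean names (reader, writer), and the key name sharedData are placeholders:

```xml
<batch:step id="step1" next="step2">
    <batch:tasklet>
        <batch:chunk reader="reader" writer="writer" commit-interval="10"/>
    </batch:tasklet>
    <batch:listeners>
        <batch:listener>
            <!-- After step1 completes, copy the 'sharedData' key from the
                 StepExecution's ExecutionContext up to the JobExecution's
                 ExecutionContext, where step2 can read it. -->
            <bean class="org.springframework.batch.core.listener.ExecutionContextPromotionListener">
                <property name="keys" value="sharedData"/>
            </bean>
        </batch:listener>
    </batch:listeners>
</batch:step>
```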
Spring Batch ref:
A Job has one to many Steps, each of which has exactly one ItemReader, ItemProcessor, and ItemWriter. A Job needs to be launched (JobLauncher), and meta data about the currently running process needs to be stored (JobRepository):
Batch Stereotypes(Chapter 3. The Domain Language of Batch)
A JobLauncher uses the JobRepository to create new JobExecution objects and run them. Job and Step implementations later use the same JobRepository for basic updates of the same executions during the running of a Job. The basic operations suffice for simple scenarios, but in a large batch environment with hundreds of batch jobs and complex scheduling requirements, more advanced access of the meta data is required (4.5. Advanced Meta-Data Usage)
2.4. Meta Data Access Improvements
3.1. Job
Spring Batch uses a 'Chunk Oriented' processing style within its most common implementation. Chunk oriented processing refers to reading the data one at a time, and creating 'chunks' that will be written out, within a transaction boundary (one chunk-oriented read/write cycle happens inside a single transaction!). One item is read in from an ItemReader, handed to an ItemProcessor, and aggregated. Once the number of items read equals the commit interval, the entire chunk is written out via the ItemWriter, and then the transaction is committed.
5.1. Chunk-Oriented Processing
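The chunk-oriented loop described above can be sketched as a simplified model in plain Java. These are not the actual Spring Batch classes: Iterator stands in for the ItemReader, Function for the ItemProcessor, and each appended list stands in for one ItemWriter call plus transaction commit.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Simplified model of Spring Batch's chunk-oriented loop (conceptual only):
// read items one at a time, process each, and write the whole chunk in one
// transaction once the commit interval is reached.
public class ChunkLoop {

    public static <I, O> List<List<O>> run(Iterator<I> reader,
                                           Function<I, O> processor,
                                           int commitInterval) {
        List<List<O>> committedChunks = new ArrayList<>(); // stands in for ItemWriter + commit
        List<O> chunk = new ArrayList<>();
        while (reader.hasNext()) {                         // ItemReader signals exhaustion (null in the real API)
            O processed = processor.apply(reader.next());  // ItemProcessor transforms one item
            chunk.add(processed);
            if (chunk.size() == commitInterval) {
                committedChunks.add(chunk);                // ItemWriter.write(chunk); transaction commits here
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) {
            committedChunks.add(chunk);                    // final partial chunk is still written
        }
        return committedChunks;
    }

    public static void main(String[] args) {
        List<Integer> items = List.of(1, 2, 3, 4, 5);
        List<List<Integer>> chunks = run(items.iterator(), i -> i * 10, 2);
        System.out.println(chunks); // [[10, 20], [30, 40], [50]]
    }
}
```

Note how a failure mid-chunk would roll back only the current chunk's writes, which is exactly the transaction boundary the quote describes.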
APIs:
Quote:
Job - A Job is an entity that encapsulates an entire batch process.
JobInstance - A JobInstance refers to the concept of a logical job run. Think of it as: JobInstance = Job + JobParameters.
JobExecution - A JobExecution refers to the technical concept of a single attempt to run a Job. An execution may end in failure or success, but the JobInstance corresponding to a given execution will not be considered complete unless the execution completes successfully.
JobParameters - JobParameters is a set of parameters used to start a batch job. "how is one JobInstance distinguished from another?" The answer is: JobParameters.
Job conclusion - A Job defines what a job is and how it is to be executed, and JobInstance is a purely organizational object to group executions together, primarily to enable correct restart semantics. A JobExecution, however, is the primary storage mechanism for what actually happened during a run.
Step - A Step is a domain object that encapsulates an independent, sequential phase of a batch job. Therefore, every Job is composed entirely of one or more steps. A Step contains all of the information necessary to define and control the actual batch processing. As with Job, a Step has an individual StepExecution that corresponds with a unique JobExecution.
StepExecution - A StepExecution represents a single attempt to execute a Step. A new StepExecution will be created each time a Step is run, similar to JobExecution. However, if a step fails to execute because the step before it fails, there will be no execution persisted for it. A StepExecution will only be created when its Step is actually started.
Tasklet - a simple interface with a single execute method, called repeatedly by the TaskletStep until it returns RepeatStatus.FINISHED or throws an exception (see the two step types above).
Chunk - the set of items read and processed within one commit interval and then written out together in a single transaction.
ExecutionContext - An ExecutionContext is a collection of key/value pairs that are persisted by the framework and provide a place to store persistent data that is scoped to a StepExecution or JobExecution. This storage is useful for example in stateful ItemReaders where the current row being read from needs to be recorded.
JobListener -
JobRepository - JobRepository is the persistence mechanism for all of the Stereotypes such as JobInstance/JobParameters/JobExecution/StepExecution/ExecutionContext and so on. It provides CRUD operations for JobLauncher, Job, and Step implementations. When a Job is first launched, a JobExecution is obtained from the repository, and during the course of execution StepExecution and JobExecution implementations are persisted by passing them to the repository.
JobLauncher - JobLauncher represents a simple interface for launching a Job with a given set of JobParameters. It is expected that implementations will obtain a valid JobExecution from the JobRepository and execute the Job.
JobExplorer - provides the ability to query the repository for existing executions; you can think of it as a read-only version of the JobRepository.
JobRegistry - A JobRegistry (and its parent interface JobLocator) is not mandatory, but it can be useful if you want to keep track of which jobs are available in the context. It is also useful for collecting jobs centrally in an application context when they have been created elsewhere (e.g. in child contexts). Custom JobRegistry implementations can also be used to manipulate the names and other properties of the jobs that are registered.
JobOperator - the JobRepository provides CRUD operations on the meta-data, and the JobExplorer provides read-only operations on the meta-data. However, those operations are most useful when used together to perform common monitoring tasks such as stopping, restarting, or summarizing a Job, as is commonly done by batch operators. Spring Batch provides for these types of operations via the JobOperator interface.
ItemReader - ItemReader is an abstraction that represents the retrieval of input for a Step, one item at a time. When the ItemReader has exhausted the items it can provide, it will indicate this by returning null. The basic contract of the ItemReader is that it is forward only.
ItemProcessor - ItemProcessor is an abstraction that represents the business processing of an item. While the ItemReader reads items and the ItemWriter writes them, the ItemProcessor provides a hook to transform an item or apply other business processing.
ItemWriter - ItemWriter is an abstraction that represents the output of a Step, one batch or chunk of items at a time. Generally, an item writer has no knowledge of the input it will receive next, only the item that was passed in its current invocation.
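To make the glossary concrete, the infrastructure beans (JobRepository plus JobLauncher) can be wired minimally as in the sketch below. This assumes the Spring Batch 2.x XML style; the in-memory MapJobRepositoryFactoryBean is suitable for demos and tests, not production, where a database-backed repository would be used instead:

```xml
<!-- Minimal in-memory wiring: meta-data (JobInstance, JobExecution,
     StepExecution, ExecutionContext, ...) is kept in memory rather than a DB. -->
<bean id="transactionManager"
      class="org.springframework.batch.support.transaction.ResourcelessTransactionManager"/>

<bean id="jobRepository"
      class="org.springframework.batch.core.repository.support.MapJobRepositoryFactoryBean">
    <property name="transactionManager" ref="transactionManager"/>
</bean>

<!-- The JobLauncher obtains a JobExecution from the repository and runs the Job -->
<bean id="jobLauncher"
      class="org.springframework.batch.core.launch.support.SimpleJobLauncher">
    <property name="jobRepository" ref="jobRepository"/>
</bean>
```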
Introducing Spring Batch series (three parts):
http://keyholesoftware.com/2012/06/22/introducing-spring-batch/
Batch processing in Java with Spring batch (four parts):
http://java-success.blogspot.com/2012/06/batch-processing-in-java-with-spring.html
Sources:
A rough introduction (slides, in Chinese):
http://www.slideshare.net/chijq/spring-batch
Spring Batch – Imperfect Yet Worthwhile:
http://www.summa-tech.com/blog/2012/01/23/spring-batch-imperfect-yet-worthwhile/
http://www.davenkin.me/post/2012-10-17/40039048526
Looking for some good examples?
Spring Batch - Hello World:
http://java.dzone.com/news/spring-batch-hello-world-1
Quote:
A batch Job is composed of one or more Steps. A JobInstance represents a given Job, parametrized with a set of typed properties called JobParameters. Each run of a JobInstance is a JobExecution. Imagine a job reading entries from a database, generating an XML representation of them, and then doing some clean-up. We have a Job composed of 2 steps: reading/writing and clean-up. If we parametrize this job by the date of the generated data, then our Friday the 13th job is a JobInstance. Each time we run this instance (if a failure occurs, for instance) is a JobExecution. This model gives great flexibility regarding how jobs are launched and run. This naturally brings us to launching jobs with their job parameters, which is the responsibility of JobLauncher. Finally, various objects in the framework require a JobRepository to store runtime information related to the batch execution. In fact, the Spring Batch domain model is much more elaborate, but this will suffice for our purpose.
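The identity rules in the quote (JobInstance = Job + JobParameters; each launch of the same instance is a new JobExecution) can be modeled in a few lines of plain Java. This is a conceptual sketch only, not the framework's actual classes; the job name and parameter values are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual model only: a JobInstance is identified by (job name + parameters),
// and every launch of the same instance appends a new JobExecution.
public class InstanceModel {
    // key = jobName + params (the JobInstance identity), value = its execution attempts
    private final Map<String, List<String>> executions = new HashMap<>();

    /** Launches the job and returns how many executions this instance now has. */
    public int launch(String jobName, Map<String, String> params) {
        String instanceKey = jobName + params;              // JobInstance = Job + JobParameters
        List<String> runs = executions.computeIfAbsent(instanceKey, k -> new ArrayList<>());
        runs.add("execution-" + (runs.size() + 1));         // each launch = a new JobExecution
        return runs.size();
    }

    public static void main(String[] args) {
        InstanceModel model = new InstanceModel();
        Map<String, String> friday13 = Map.of("date", "2013-09-13");
        System.out.println(model.launch("xmlExport", friday13)); // 1: first execution of this instance
        System.out.println(model.launch("xmlExport", friday13)); // 2: same instance re-run (e.g. a restart)
        System.out.println(model.launch("xmlExport", Map.of("date", "2013-09-14"))); // 1: a new instance
    }
}
```

Changing any parameter value yields a different instance key, which is exactly how the framework answers "how is one JobInstance distinguished from another?".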
What happens if a process throws an exception?
http://alain-cieslik.com/2011/06/06/springbatch-what-append-if-a-process-throws-an-exception/
http://forum.springsource.org/showthread.php?61042-Spring-Batch-beginners-tutorial
http://stackoverflow.com/questions/1609793/how-can-i-get-started-with-spring-batch