
Performance optimization for bulk data inserts into a database

Posted: 2004-12-02

Here's the situation:
A user submits a data file on a web page, or points to one already on the server, and a backend program then analyzes it. The analysis is essentially string matching and produces a large amount of data that has to be inserted into the database, which is where the performance problem comes in. The current implementation is Spring + Hibernate + MySQL, and the persistent objects have two simple unidirectional many-to-one associations. I tested with a 174 KB data file that produces twelve thousand records in total. It originally took close to 10 minutes; it is now down to under 20 seconds. I have already made the following improvements:
1. Adopted Open Session in View to cut unnecessary open/close session operations.
2. Increased batch-size to 50 (see the sketch after this list).
3. Removed logger output inside the loop.
4. Changed the original bidirectional associations to unidirectional to reduce memory consumption (so that objects already inserted into the database are no longer kept referenced).
5. Moved loop-invariant objects outside the loop, and turned invariant local variables of frequently called methods into final instance fields.
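
For reference, a minimal sketch of the batching pattern behind points 1 and 2, assuming hibernate.jdbc.batch_size is set to 50 in the Hibernate configuration; the LogRecord entity and the record source are hypothetical placeholders:

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

public class BulkInserter {
    private static final int BATCH_SIZE = 50; // should match hibernate.jdbc.batch_size

    // Hypothetical persistent class standing in for the real mapped PO.
    public static class LogRecord { /* mapped fields omitted */ }

    // Save parsed records in batches, flushing and clearing the session
    // periodically so already-inserted objects can be garbage collected.
    public void insertAll(SessionFactory factory, Iterable<LogRecord> records) {
        Session session = factory.openSession();
        Transaction tx = session.beginTransaction();
        int count = 0;
        for (LogRecord record : records) {
            session.save(record);
            if (++count % BATCH_SIZE == 0) {
                session.flush(); // push the batched INSERTs to the database
                session.clear(); // drop persistent objects from the session cache
            }
        }
        tx.commit();
        session.close();
    }
}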
Profiling with JProfiler shows that the analysis step and the database operations take the most time (as expected), each around 40%. The analysis is all string and regular-expression work, so I don't think there is much room left to optimize there.

Questions:
1. Is anything unreasonable in the improvements above?
2. How long should inserting twelve thousand-odd rows normally take? Each record is short, around 60 characters, and I don't know how much optimization headroom is left on the database side.
3. Some people say LOAD DATA is much faster than INSERT. Why is that? I'm not familiar with the internals of either. Is LOAD DATA suitable for my case?
4. Is there anything else that could be optimized?
Caching doesn't help much here either: it's almost all inserts, with practically no queries. I've been at this for three days and I'm really out of ideas.
Posted: 2004-12-02

1. Reasonable.
2. 12000 / 20 = 600 per second. Your DB server is pretty strong......
3. Isn't LOAD DATA a query? What does it have to do with INSERT?
4. No idea.

You've already got the operation down to under 20 seconds and the users still aren't satisfied? Your users are a tough crowd...... What's the target?

BTW, I have a system whose month-end settlement run used to take 16 hours. After much effort I got it down to 1 hour, and the users were moved to tears. Thinking about it, I'm a lucky guy......
Posted: 2004-12-02

Try multiple threads: split the file into several smaller parts and have the threads all do the same work, parsing and inserting at the same time......
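
A rough sketch of that idea, assuming the file can be split on line boundaries and each chunk parsed independently; the chunking and the parseAndInsert method are hypothetical placeholders:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelLoader {
    // Parse and insert several file chunks concurrently. Each worker must
    // use its own Hibernate Session, since sessions are not thread-safe.
    public void load(List<List<String>> chunks, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (final List<String> chunk : chunks) {
            pool.submit(new Runnable() {
                public void run() {
                    parseAndInsert(chunk); // hypothetical: parse lines, batch-insert rows
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void parseAndInsert(List<String> lines) {
        // ... same analysis and batched insert as the single-threaded version ...
    }
}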
Posted: 2004-12-02

That was only a 174 KB data file. My boss tells me users may submit files as large as 2 GB, which floors me. Nobody mentioned data volumes like that when we started; now I'm wondering whether Java is even suitable for this kind of data analysis.
That DB server is actually a lousy box, a PIII 800 with 256 MB of RAM, shared with the web server. It's probably only fast because my records are short. I have no sense of the orders of magnitude involved in database speed, heh.
LOAD DATA is a MySQL command that imports data from a text file into MySQL. A colleague tells me it's much faster than doing INSERTs from a program, but I don't know why that would be.
Posted: 2004-12-02

I'm not confident about going multithreaded: it would certainly raise synchronization issues, and I don't know how much it would actually gain. Besides, my boss is already pushing, so there's no time for major changes. :(
Posted: 2004-12-03

A 2 GB file? Your boss must be joking. Who knows how long it would take a user to upload a 2 GB file over HTTP......

I checked the MySQL documentation, and it says the LOAD DATA INFILE command is 20 times faster than INSERT. Impressive......

When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using a lot of INSERT statements. See section 14.1.5 LOAD DATA INFILE Syntax.

Quote

7.2.14 Speed of INSERT Statements

The time to insert a record is determined by the following factors, where the numbers indicate approximate proportions:

    * Connecting: (3)
    * Sending query to server: (2)
    * Parsing query: (2)
    * Inserting record: (1 x size of record)
    * Inserting indexes: (1 x number of indexes)
    * Closing: (1)

This does not take into consideration the initial overhead to open tables, which is done once for each concurrently running query.

The size of the table slows down the insertion of indexes by log N, assuming B-tree indexes.

You can use the following methods to speed up inserts:

    * If you are inserting many rows from the same client at the same time, use INSERT statements with multiple VALUES lists to insert several rows at a time. This is much faster (many times faster in some cases) than using separate single-row INSERT statements. If you are adding data to a non-empty table, you may tune the bulk_insert_buffer_size variable to make it even faster. See section 5.2.3 Server System Variables.
    * If you are inserting a lot of rows from different clients, you can get higher speed by using the INSERT DELAYED statement. See section 14.1.4 INSERT Syntax.
    * With MyISAM tables you can insert rows at the same time that SELECT statements are running if there are no deleted rows in the tables.
    * When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using a lot of INSERT statements. See section 14.1.5 LOAD DATA INFILE Syntax.
    * With some extra work, it is possible to make LOAD DATA INFILE run even faster when the table has many indexes. Use the following procedure:
         1. Optionally create the table with CREATE TABLE.
         2. Execute a FLUSH TABLES statement or a mysqladmin flush-tables command.
         3. Use myisamchk --keys-used=0 -rq /path/to/db/tbl_name. This will remove all use of all indexes for the table.
         4. Insert data into the table with LOAD DATA INFILE. This will not update any indexes and will therefore be very fast.
         5. If you are going to only read the table in the future, use myisampack to make it smaller. See section 15.1.3.3 Compressed Table Characteristics.
         6. Re-create the indexes with myisamchk -r -q /path/to/db/tbl_name. This will create the index tree in memory before writing it to disk, which is much faster because it avoids lots of disk seeks. The resulting index tree is also perfectly balanced.
         7. Execute a FLUSH TABLES statement or a mysqladmin flush-tables command.
      Note that LOAD DATA INFILE also performs the preceding optimization if you insert into an empty MyISAM table; the main difference is that you can let myisamchk allocate much more temporary memory for the index creation than you might want the server to allocate for index re-creation when it executes the LOAD DATA INFILE statement. As of MySQL 4.0, you can also use ALTER TABLE tbl_name DISABLE KEYS instead of myisamchk --keys-used=0 -rq /path/to/db/tbl_name and ALTER TABLE tbl_name ENABLE KEYS instead of myisamchk -r -q /path/to/db/tbl_name. This way you can also skip the FLUSH TABLES steps.
    * You can speed up INSERT operations that are done with multiple statements by locking your tables:

LOCK TABLES a WRITE;
INSERT INTO a VALUES (1,23),(2,34),(4,33);
INSERT INTO a VALUES (8,26),(6,29);
UNLOCK TABLES;

      A performance benefit occurs because the index buffer is flushed to disk only once, after all INSERT statements have completed. Normally there would be as many index buffer flushes as there are different INSERT statements. Explicit locking statements are not needed if you can insert all rows with a single statement. For transactional tables, you should use BEGIN/COMMIT instead of LOCK TABLES to get a speedup. Locking also lowers the total time of multiple-connection tests, although the maximum wait time for individual connections might go up because they wait for locks. For example:

Connection 1 does 1000 inserts
Connections 2, 3, and 4 do 1 insert
Connection 5 does 1000 inserts

      If you don't use locking, connections 2, 3, and 4 will finish before 1 and 5. If you use locking, connections 2, 3, and 4 probably will not finish before 1 or 5, but the total time should be about 40% faster. INSERT, UPDATE, and DELETE operations are very fast in MySQL, but you will obtain better overall performance by adding locks around everything that does more than about five inserts or updates in a row. If you do very many inserts in a row, you could do a LOCK TABLES followed by an UNLOCK TABLES once in a while (about each 1,000 rows) to allow other threads access to the table. This would still result in a nice performance gain. INSERT is still much slower for loading data than LOAD DATA INFILE, even when using the strategies just outlined.
    * To get some more speed for MyISAM tables, for both LOAD DATA INFILE and INSERT, enlarge the key cache by increasing the key_buffer_size system variable. See section 7.5.2 Tuning Server Parameters.
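
To connect this back to the original question: from a Java program, LOAD DATA INFILE can be issued through plain JDBC rather than per-row Hibernate saves. A hedged sketch, assuming the MySQL Connector/J driver, a file readable by the MySQL server process, and a hypothetical records table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class LoadDataExample {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver"); // load Connector/J (pre-JDBC4 drivers need this)
        // URL and credentials are placeholders.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/testdb", "user", "password");
        Statement stmt = conn.createStatement();
        // One record per line, columns separated by tabs (hypothetical table layout).
        stmt.execute(
            "LOAD DATA INFILE '/tmp/records.txt' INTO TABLE records " +
            "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'");
        stmt.close();
        conn.close();
    }
}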
Posted: 2004-12-03

As I said above, there are two paths: upload (for smaller files), or letting the user pick a data file already on the server, put there by FTP or otherwise. So a 2 GB file would definitely not arrive via HTTP upload.
Posted: 2004-12-03

I suspect your app would keel over just parsing a 2 GB file to generate the insert statements.
Does your current approach generate all the insert statements first and then execute them, or does it insert as it parses?
Posted: 2004-12-03

It inserts as it parses, of course; otherwise the JVM would fall over as soon as the file got big. Sigh, the code can't change as fast as the requirements do. There was no performance requirement like this at the start; who knows which genius dreamed it up.
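
A minimal sketch of that parse-as-you-go structure, reading the file line by line so memory use stays flat regardless of file size; matchAndSave is a hypothetical stand-in for the string matching and batched insert steps:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingParser {
    // Read the data file one line at a time; only the current line and the
    // current unflushed batch are ever held in memory.
    public void process(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                matchAndSave(line); // hypothetical: regex matching + batched insert
            }
        } finally {
            reader.close();
        }
    }

    private void matchAndSave(String line) {
        // ... analysis and session.save/flush/clear as in the batching sketch ...
    }
}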
Posted: 2004-12-03

Handling a data file that big in Java is asking for trouble. How much memory does the JVM even get, against 2 GB......

You'd be better off writing a shell script that calls LOAD DATA directly. Simple and convenient.
