Sphinx + Python: indexing uses too much memory

youngerblue

Blog categories: search, python

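I have been using coreseek for search, with a Python data source. This worked fine while the data set was small, but once the data grew, building the index through Python started to cause problems.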

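Before the optimization, the first step of the Python indexing process was to pull all the data into memory, which is also what the official documentation does. With a large data set this inevitably uses too much memory, so the process gets killed or the server load climbs too high.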

Actual numbers: with 2 million rows, running the Python data source pushed memory usage above 2 GB.

After the optimization: memory stayed at about 427 MB and did not keep growing.

The original Python code put all the SQL queries in the def Connected(self) method, fetched everything in one go, and then iterated over the result in def NextDocument(self).

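The queries are now moved into the NextDocument(self) method.
The basic idea is:
1. First query the table's maximum id.
2. Split the query into pages, for example 10 pages, with limit = maxId/10.
3. On the first call to NextDocument, or whenever the number of calls is a multiple of limit, run the SQL query and fetch the result set for that page.
4. Loop over the result set.
5. When the page count reaches 11, return False.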
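
The steps above can be sketched roughly as follows. This is a minimal illustration, not the original code from this post: the class name PagedSource, the table name documents, its columns id and subject, and the PAGES constant are all hypothetical, and sqlite3 is used only so the sketch runs without a database server. It also differs from the description in one detail: instead of counting calls, it fetches the next id range whenever the current page has been used up, which still works when there are gaps in the ids. It assumes ids are positive integers.

```python
import sqlite3


class PagedSource(object):
    """Sketch of a coreseek-style Python source that loads rows page by page."""

    PAGES = 10  # number of id ranges to split the table into (assumed value)

    def __init__(self, conn):
        self.conn = conn    # DB-API connection; sqlite3 here is only for the demo
        self.rows = []      # rows of the current page only, never the whole table
        self.pos = 0        # position inside self.rows
        self.next_id = 1    # lower bound of the next id range (assumes ids >= 1)
        self.max_id = 0
        self.step = 1

    def Connected(self):
        # Step 1: only look up the largest id; do not fetch any documents yet.
        cur = self.conn.execute("SELECT MAX(id) FROM documents")
        self.max_id = cur.fetchone()[0] or 0
        # Step 2: ceiling division so that PAGES ranges cover every id.
        self.step = max(1, -(-self.max_id // self.PAGES))

    def _load_next_page(self):
        # Step 3: fetch id ranges until one contains rows or the table is exhausted.
        while self.next_id <= self.max_id:
            lo, hi = self.next_id, self.next_id + self.step
            self.next_id = hi
            cur = self.conn.execute(
                "SELECT id, subject FROM documents "
                "WHERE id >= ? AND id < ? ORDER BY id", (lo, hi))
            self.rows = cur.fetchall()
            self.pos = 0
            if self.rows:
                return True
        return False

    def NextDocument(self):
        # Steps 4 and 5: walk the current page; stop once no pages are left.
        if self.pos >= len(self.rows) and not self._load_next_page():
            return False
        self.docid, self.subject = self.rows[self.pos]
        self.pos += 1
        return True
```

With this layout, at most one page of rows is held in memory at a time, so peak memory depends on the page size rather than on the size of the table. A larger PAGES value gives smaller pages and more queries.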

This approach largely ensures that building an index over a large data set does not run into memory problems.

2012-06-08 23:06

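#2 yangguangmeng 2016-12-02

Hello, could you give me some pointers? I am reading the data from files in a directory; how should I modify this? The code below does not work. Please advise, thanks!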

# -*- coding:utf-8 -*-
# coreseek3.2 python source
# author: HonestQiao
# date: 2010-06-03 11:46

def IsSubString(SubStrList,Str):
    flag=True
    for substr in SubStrList:
        if not(substr in Str):
            flag=False

    return flag

'''
data/1.txt
data/2.txt
'''
def GetFileList(FindPath,FlagStr=[]):
    import os
    FileList=[]
    FileNames=os.listdir(FindPath)
    if (len(FileNames)>0):
        for fn in FileNames:
            if (len(FlagStr)>0):
                if (IsSubString(FlagStr,fn)):
                    fullfilename=os.path.join(FindPath,fn)
                    FileList.append(fullfilename)
            else:
                fullfilename=os.path.join(FindPath,fn)
                FileList.append(fullfilename)

    if (len(FileList)>0):
        FileList.sort()

    return FileList

class MainSource(object):
    def __init__(self, conf):
        self.conf = conf
        self.idx = 0
        self.data = []

    def SetFile(self, filename):
        self.file = filename

    def GetScheme(self):
        return [
            ('id' , {'docid':True, } ),
            ('subject', { 'type':'text'} ),
            ('context', { 'type':'text'} ),
            ('published', {'type':'integer'} ),
            ('author_id', {'type':'integer'} ),
        ]

    def GetFieldOrder(self):
        return [('subject', 'context')]

    def NextDocument(self):
        if self.idx < len(self.data):
            item = self.data[self.idx]
            self.docid = self.id = item['id']
            self.subject = item['subject']
            self.context = item['context']
            self.published = item['published']
            self.author_id = item['author_id']
            self.idx += 1
            return True
        else:
            return False

    def Connected(self):
        pass




if __name__ == "__main__":
    conf = {}
    source = MainSource(conf)
    source.Connected()

    FileList = []
    FileList = GetFileList("data/")
    for file in FileList:
        f = open(file, 'rb')
        print file
        lines = f.readlines()
        #print len(lines)

        i =1
        TP=""
        BT=""
        ID=""

        while i<= len(lines)/154:
            TP = lines[66+154*(i-1)].strip('\n').decode('gbk').encode('utf-8')[1:]
            BT = lines[84+154*(i-1)].strip('\n').decode('gbk').encode('utf-8')[1:]
            ID = lines[92+154*(i-1)].strip('\n').decode('gbk').encode('utf-8')[2:]
            i = i+1

            if TP!="" and BT!="" and ID!="":
                source.data.append({'id':int(ID), 'subject':BT,'context':TP, 'published':int(100001), 'author_id':int(100001)})
                TP=""
                BT=""
                ID=""

        f.close()

    while 1:
        if source.NextDocument():
            #print "id=%d, subject=%s" % (source.docid, source.subject)
            continue
        else:
            break

#eof

#1 muzi1012 2012-09-15

Hello, I am using Python as the data source, but the index files that get built are only 1 KB or 0 KB in size. Is something wrong?
E:\GeoSearch\bin>indexer -c /GeoSearch/etc/scgeo.conf --all
Coreseek Fulltext 4.1 [ Sphinx 2.0.2-dev (r2922)]
Copyright (c) 2007-2011,
Beijing Choice Software Technologies Inc (http://www.coreseek.

using config file '/GeoSearch/etc/scgeo.conf'...
indexing index 'scgeo'...
WARNING: Attribute count is 0: switching to none docinfo
collected 99 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 99 docs, 891 bytes
total 6.113 sec, 145 bytes/sec, 16.19 docs/sec
total 1 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 6 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg

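This is what the indexing run prints. Could you help me figure out the cause? I have searched online for a long time without finding any clue. Thanks.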
