
phperwuhan's blog

Recording a PHPer's journey! phperwuhan.blog.163.com

 
 
 


Zend_Search_Lucene Chinese Search

2014-08-22 16:54:30 | Category: Search Technology

Source: http://bzyyc.happy.blog.163.com/blog/static/6143064720116292389163/

1. General

Zend_Search_Lucene is a general-purpose text search engine written entirely in PHP 5. It stores its index on the filesystem and does not require a database server.

2. How to install Zend Lucene

Download site: http://www.zend.com/community/downloads

Zend Framework version: Zend Framework 1.9 minimal

Download Zend Framework 1.9 minimal from the download site.

Delete everything in the Zend folder except the following files and directories:

Exception.php

Loader/

Loader.php

Search/
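
For the require_once 'Zend/Search/Lucene.php' lines in the examples to resolve, the directory that contains the trimmed Zend folder has to be on PHP's include path. A minimal sketch, assuming (hypothetically) that the Zend folder was copied into a "library" directory next to the script; adjust the path to your own layout:

```php
<?php
// Assumed layout (not from the original article): ./library/Zend/...
$libraryPath = __DIR__ . DIRECTORY_SEPARATOR . 'library';

// Prepend the library directory so that 'Zend/Search/Lucene.php'
// is looked up there first.
set_include_path($libraryPath . PATH_SEPARATOR . get_include_path());
```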

 

3. How to create an index

The following example creates an index:

<?php
// File name: createindex.php
require_once 'Zend/Search/Lucene.php';

// Sample product data; "lan" holds the language code of each record.
$productsData = array(
    0 => array("PID" => 1, "url" => "http://www.cybozu.jp", "productName" => "garoon", "Description" => "garoon Description", "lan" => "en"),
    1 => array("PID" => 2, "url" => "http://www.cybozu.jp", "productName" => "share360", "Description" => "share360 Description", "lan" => "en"),
    2 => array("PID" => 3, "url" => "http://www.cybozu.jp a", "productName" => "日本語の製品名前", "Description" => "日本語の製品", "lan" => "jp"),
    3 => array("PID" => 4, "url" => "http://www.cybozu.jp a", "productName" => "中文产品名", "Description" => "中文产品描述", "lan" => "zh")
);

// The second argument (true) creates a new index in the "index" directory.
$index = new Zend_Search_Lucene('index', true);

foreach ($productsData as $productData)
{
    // Use a fresh document per product so fields do not accumulate.
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::keyword('PID', $productData['PID'], 'UTF-8'));
    $doc->addField(Zend_Search_Lucene_Field::text('url', $productData['url'], 'UTF-8'));
    $doc->addField(Zend_Search_Lucene_Field::text('productName', $productData['productName'], 'UTF-8'));
    $doc->addField(Zend_Search_Lucene_Field::text('Description', $productData['Description'], 'UTF-8'));
    // Stored but not indexed: returned with hits, not searchable.
    $doc->addField(Zend_Search_Lucene_Field::unIndexed('lan', $productData['lan'], 'UTF-8'));
    $index->addDocument($doc);
}

// Commit and optimize once, after all documents have been added.
$index->commit();
$index->optimize();

echo 'index has been created!';

In the KB project the data to index comes from a database. Using the method above, we can index all of the text stored in the database.
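
As a rough sketch of the database side, the rows can be fetched into the same array shape as $productsData and then passed through the indexing loop above. The table name "products", its columns, and the helper fetchProductsForIndexing() are assumptions made for illustration (the article does not describe the KB schema), and an in-memory SQLite database stands in for the real one:

```php
<?php
// Hypothetical helper: returns one associative array per product row.
function fetchProductsForIndexing(PDO $pdo)
{
    $stmt = $pdo->query(
        'SELECT PID, url, productName, Description, lan FROM products ORDER BY PID'
    );
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Demo data in an in-memory SQLite database (assumed schema).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE products (PID INTEGER PRIMARY KEY, url TEXT, productName TEXT, Description TEXT, lan TEXT)');
$insert = $pdo->prepare('INSERT INTO products (PID, url, productName, Description, lan) VALUES (?, ?, ?, ?, ?)');
$insert->execute(array(1, 'http://www.cybozu.jp', 'garoon', 'garoon Description', 'en'));
$insert->execute(array(4, 'http://www.cybozu.jp', '中文产品名', '中文产品描述', 'zh'));

// Each element has the keys PID, url, productName, Description and lan.
$productsData = fetchProductsForIndexing($pdo);
```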

 

4. Searching the index

After creating the index, we can search it as follows:

<?php
// File name: search.php
require_once 'Zend/Search/Lucene.php';

// Open the existing index.
$index = new Zend_Search_Lucene('index');

$keywords = 'garoon';

echo "Index contains {$index->count()} documents.\n";

// Parse the query string as UTF-8 and run the search.
$query = Zend_Search_Lucene_Search_QueryParser::parse($keywords, 'utf-8');
$hits = $index->find($query);

foreach ($hits as $hit)
{
    echo 'PID: ' . $hit->PID . '<br>';
    echo 'Score: ' . $hit->score . '<br>';
    echo 'url: ' . $hit->url . '<br>';
    echo 'productName: ' . $hit->productName . '<br>';
    echo 'lan: ' . $hit->lan . '<br>';
}

To search text in multiple languages, we can read the lan value of each hit and display the results differently depending on the language.
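
One possible way to do this is to group the hits by their stored lan value before rendering. The sketch below uses plain objects as stand-ins for the hits so that it runs without an index; groupHitsByLanguage() is a hypothetical helper, not part of Zend_Search_Lucene:

```php
<?php
// Group hits by their "lan" value. $hits may be any list of objects that
// expose a "lan" property, such as the hits returned by $index->find().
function groupHitsByLanguage($hits)
{
    $grouped = array();
    foreach ($hits as $hit) {
        $grouped[$hit->lan][] = $hit;
    }
    return $grouped;
}

// Stand-in hits for illustration only.
$hits = array(
    (object) array('PID' => 1, 'productName' => 'garoon', 'lan' => 'en'),
    (object) array('PID' => 3, 'productName' => '日本語の製品名前', 'lan' => 'jp'),
    (object) array('PID' => 2, 'productName' => 'share360', 'lan' => 'en'),
);

$byLanguage = groupHitsByLanguage($hits);
```

Each language group can then be rendered in its own block or with its own template.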

 

5. Deleting and updating the index

To update the index, we first find the document by its keyword, delete it, and then add a new document. The following example deletes the product with PID 1 and re-adds it with an updated description.

<?php
require_once 'Zend/Search/Lucene.php';

$index = new Zend_Search_Lucene('index');

// New product data to write to the index
$productNewData = array("PID" => 1, "url" => "http://www.cybozu.jp", "productName" => "garoon", "Description" => "update garoon Description", "lan" => "en");

// The query string is run through the current default analyzer, so the
// digit is only kept if that analyzer tokenizes digits (the analyzer in
// section 6 does).
$keywords = "PID:1";
$hits = $index->find($keywords);

// Delete every document matching PID:1
foreach ($hits as $hit)
{
    echo 'PID: ' . $hit->PID . ' has been deleted<br>';
    $index->delete($hit->id);
}
$index->commit();

// Add the new product data to the index
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::keyword('PID', $productNewData['PID'], 'UTF-8'));
$doc->addField(Zend_Search_Lucene_Field::text('url', $productNewData['url'], 'UTF-8'));
$doc->addField(Zend_Search_Lucene_Field::text('productName', $productNewData['productName'], 'UTF-8'));
$doc->addField(Zend_Search_Lucene_Field::text('Description', $productNewData['Description'], 'UTF-8'));
$doc->addField(Zend_Search_Lucene_Field::unIndexed('lan', $productNewData['lan'], 'UTF-8'));
$index->addDocument($doc);

$index->commit();
$index->optimize();

 

6. How to search Japanese or Chinese text with Lucene

By default, Lucene can only search English text. In this project we must search English, Japanese, and Chinese text, so we have to replace Lucene's default analyzer.

The following analyzer extends Lucene's common analyzer class:

<?php
// File name: chinese.php
require_once 'Zend/Search/Lucene/Analysis/Analyzer.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common.php';

/**
 * Emits ASCII alphanumeric words as single tokens and multi-byte
 * (Chinese/Japanese) text as overlapping two-character tokens.
 * Assumes UTF-8 input in which every non-ASCII character is 3 bytes long.
 */
class CN_Lucene_Analyzer extends Zend_Search_Lucene_Analysis_Analyzer_Common
{
    private $_position;
    private $_cnStopWords = array();

    public function setCnStopWords($cnStopWords)
    {
        $this->_cnStopWords = $cnStopWords;
    }

    /**
     * Reset token stream
     */
    public function reset()
    {
        $this->_position = 0;

        // Punctuation stripped before tokenizing. The original list also held
        // further full-width punctuation marks that were lost in this copy of
        // the article; only the recoverable entries are listed.
        $search = array(',', '/', '\\', '.', ';', ':', '"', '!', '~', '`', '^', '(', ')', '?', '-', "'", '<', '>', '$', '&', '%', '#', '@', '+', '=', '{', '}', '[', ']', '“', '”', '‘', '’', '—', '…');

        $this->_input = str_replace($search, '', $this->_input);
        $this->_input = str_replace($this->_cnStopWords, ' ', $this->_input);
    }

    /**
     * Tokenization stream API
     * Get next token
     * Returns null at the end of stream
     *
     * @return Zend_Search_Lucene_Analysis_Token|null
     */
    public function nextToken()
    {
        if ($this->_input === null)
        {
            return null;
        }

        $len = strlen($this->_input);

        while ($this->_position < $len)
        {
            // Skip spaces at the beginning
            while ($this->_position < $len && $this->_input[$this->_position] == ' ')
            {
                $this->_position++;
            }
            // Only trailing spaces were left: end of stream
            if ($this->_position >= $len)
            {
                break;
            }

            $termStartPosition = $this->_position;
            $temp_char = $this->_input[$this->_position];
            $isCnWord = false;

            if (ord($temp_char) > 127)
            {
                // Multi-byte text: take up to two 3-byte characters as one token
                $i = 0;
                while ($this->_position < $len && ord($this->_input[$this->_position]) > 127)
                {
                    $this->_position = $this->_position + 3;
                    $i++;
                    if ($i == 2)
                    {
                        $isCnWord = true;
                        break;
                    }
                }

                // A single multi-byte character on its own is skipped
                if ($i == 1) continue;
            }
            else
            {
                // ASCII text: take the whole alphanumeric run as one token
                while ($this->_position < $len && ctype_alnum($this->_input[$this->_position]))
                {
                    $this->_position++;
                }
            }

            // Nothing consumed (unhandled character): skip one byte
            if ($this->_position == $termStartPosition)
            {
                $this->_position++;
                continue;
            }

            $tmp_str = substr($this->_input, $termStartPosition, $this->_position - $termStartPosition);
            $token = new Zend_Search_Lucene_Analysis_Token($tmp_str, $termStartPosition, $this->_position);
            $token = $this->normalize($token);

            // Step back one character so that consecutive two-character tokens overlap
            if ($isCnWord)
            {
                $this->_position = $this->_position - 3;
            }

            if ($token !== null)
            {
                return $token;
            }
        }

        return null;
    }
}

 

With chinese.php in place, we can search Japanese and Chinese text in the KB. The following code must also be added before creating an index and before searching:

 

require_once 'chinese.php';

Zend_Search_Lucene_Analysis_Analyzer::setDefault(new CN_Lucene_Analyzer());
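
To make the tokenization strategy of CN_Lucene_Analyzer easier to see, here is a standalone sketch that does not depend on Zend Framework. bigramTokens() is a hypothetical helper written for this illustration. Unlike the analyzer above, it splits on real UTF-8 character boundaries instead of assuming 3 bytes per character, and it does not strip punctuation or stop words first:

```php
<?php
// ASCII alphanumeric runs stay whole; runs of non-ASCII characters
// become overlapping two-character tokens.
function bigramTokens($text)
{
    $tokens = array();

    // Match either an ASCII letter/digit run or a non-ASCII run.
    preg_match_all('/[A-Za-z0-9]+|[^\x00-\x7F]+/u', $text, $matches);

    foreach ($matches[0] as $run) {
        if (ord($run[0]) < 128) {
            $tokens[] = $run;
            continue;
        }
        // Split the run into single characters, then pair neighbours.
        $chars = preg_split('//u', $run, -1, PREG_SPLIT_NO_EMPTY);
        $count = count($chars);
        for ($i = 0; $i + 1 < $count; $i++) {
            $tokens[] = $chars[$i] . $chars[$i + 1];
        }
    }

    return $tokens;
}

$tokens = bigramTokens('garoon 中文产品名');
// $tokens is array('garoon', '中文', '文产', '产品', '品名')
```

Because the same tokenization is applied to both indexed text and query strings, a query such as 产品 matches any document whose text contains those two characters next to each other. A run consisting of a single non-ASCII character produces no token, which mirrors the analyzer's behaviour of skipping lone characters.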

 

7. Does Zend Lucene need downtime?

With Zend Lucene we do not need any downtime. When a new article is added, we can add it to the index at the same time. When an article is edited, we delete the old document and add the new one to the index.


Original: http://www.cnblogs.com/likwo/archive/2009/10/28/1591319.html
