Chapter 33. Zend_Search_Lucene

Table of Contents

33.1. Overview
33.1.1. Introduction
33.1.2. Document and Field Objects
33.1.3. Understanding Field Types
33.1.4. HTML documents
33.2. Building Indexes
33.2.1. Creating a New Index
33.2.2. Updating Index
33.2.3. Updating Documents
33.2.4. Retrieving Index Size
33.2.5. Index optimization
33.2.5.1. MaxBufferedDocs auto-optimization option
33.2.5.2. MaxMergeDocs auto-optimization option
33.2.5.3. MergeFactor auto-optimization option
33.2.6. Permissions
33.2.7. Limitations
33.2.7.1. Index size
33.2.7.2. Supported Filesystems
33.3. Searching an Index
33.3.1. Building Queries
33.3.1.1. Query Parsing
33.3.2. Search Results
33.3.3. Limiting the Result Set
33.3.4. Results Scoring
33.3.5. Search Result Sorting
33.3.6. Search Results Highlighting
33.4. Query Language
33.4.1. Terms
33.4.2. Fields
33.4.3. Starting in 1.5, Wildcards
33.4.4. Term Modifiers
33.4.5. Starting in 1.5, Range Searches
33.4.6. Starting in 1.5, Fuzzy Searches
33.4.7. Proximity Searches
33.4.8. Boosting a Term
33.4.9. Boolean Operators
33.4.9.1. AND
33.4.9.2. OR
33.4.9.3. NOT
33.4.9.4. &&, ||, and ! operators
33.4.9.5. +
33.4.9.6. -
33.4.9.7. No Operator
33.4.10. Grouping
33.4.11. Field Grouping
33.4.12. Escaping Special Characters
33.5. Query Construction API
33.5.1. Query Parser Exceptions
33.5.2. Term Query
33.5.3. Multi-Term Query
33.5.4. Boolean Query
33.5.5. Starting in 1.5, Wildcard Query
33.5.6. Starting in 1.5, Fuzzy Query
33.5.7. Phrase Query
33.5.8. Starting in 1.5, Range Query
33.6. Character Set
33.6.1. UTF-8 and single-byte character set support
33.6.2. Default text analyzer.
33.6.3. UTF-8 compatible text analyzers.
33.7. Extensibility
33.7.1. Text Analysis
33.7.2. Tokens Filtering
33.7.3. Scoring Algorithms
33.7.4. Storage Containers
33.8. Interoperating with Java Lucene
33.8.1. File Formats
33.8.2. Index Directory
33.8.3. Java Source Code
33.9. Advanced
33.9.1. Using the index as static property
33.10. Best Practices
33.10.1. Field names
33.10.2. Indexing performance
33.10.3. Index during Shut Down
33.10.4. Retrieving documents by unique id
33.10.5. Memory Usage
33.10.6. Encoding
33.10.7. Index maintenance

33.1. Overview

33.1.1. Introduction

Zend_Search_Lucene is a general purpose text search engine written entirely in PHP 5. Since it stores its index on the filesystem and does not require a database server, it can add search capabilities to almost any PHP-driven website. Zend_Search_Lucene supports the following features:

  • Ranked searching - best results returned first

  • Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more [6]

  • Search by specific field (e.g., title, author, contents)

Zend_Search_Lucene was derived from the Apache Lucene project. The currently supported Lucene version is 2.2 [7]. For more information on Lucene, visit http://lucene.apache.org/java/docs/ (http://lucene.apache.org/java/2_2_0/).

[Note]

Previous Zend_Search_Lucene implementations support the Lucene 1.9 index format.

Any index created with those versions is automatically upgraded to the Lucene 2.1 format after Zend_Search_Lucene is updated, and will then no longer be compatible with previous Zend_Search_Lucene versions.

33.1.2. Document and Field Objects

Zend_Search_Lucene operates with documents as atomic objects for indexing. A document is divided into named fields, and fields have content that can be searched.

A document is represented by the Zend_Search_Lucene_Document class, and objects of this class contain instances of Zend_Search_Lucene_Field that represent the fields of the document.

It is important to note that any information can be added to the index. Application-specific information or metadata can be stored in the document fields, and later retrieved with the document during search.

It is the responsibility of your application to control the indexer. This means that data can be indexed from any source that is accessible by your application. For example, this could be the filesystem, a database, an HTML form, etc.

The Zend_Search_Lucene_Field class provides several static methods to create fields with different characteristics:

<?php
$doc = new Zend_Search_Lucene_Document();

// Field is not tokenized, but is indexed and stored within the index.
// Stored fields can be retrieved from the index.
$doc->addField(Zend_Search_Lucene_Field::Keyword('doctype',
                                                 'autogenerated'));

// Field is neither tokenized nor indexed, but is stored in the index.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
                                                   time()));

// Binary string-valued field that is neither tokenized nor indexed,
// but is stored in the index.
$doc->addField(Zend_Search_Lucene_Field::Binary('icon',
                                                $iconData));

// Field is tokenized and indexed, and is stored in the index.
$doc->addField(Zend_Search_Lucene_Field::Text('annotation',
                                              'Document annotation text'));

// Field is tokenized and indexed, but is not stored in the index.
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  'My document content'));
            

Each of these methods (excluding the Zend_Search_Lucene_Field::Binary() method) has an optional $encoding parameter for specifying input data encoding.

Encoding may differ for different documents as well as for different fields within one document:

<?php
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents, 'utf-8'));
                

If the encoding parameter is omitted, the current locale is used at processing time. For example:

<?php
setlocale(LC_ALL, 'de_DE.iso-8859-1');
// ...
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents));
                

Fields are always stored and returned from the index in UTF-8 encoding. Any required conversion to UTF-8 happens automatically.

Text analyzers (see below) may also convert text to other encodings. In fact, the default analyzer converts text to the 'ASCII//TRANSLIT' encoding. Be careful, however: this transliteration may depend on the current locale.
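
To see how the locale can influence such a conversion, the sketch below runs PHP's iconv() with the 'ASCII//TRANSLIT' target under several locales. The helper name transliterateToAscii() and the sample string are illustrative assumptions. This is not the analyzer's actual code, only the same kind of conversion. Results vary with the platform's iconv implementation and the installed locales, so no particular output is shown.

```php
<?php
// Hypothetical helper, not part of Zend_Search_Lucene: it mimics the kind of
// conversion described above by calling PHP's iconv extension directly.
function transliterateToAscii($utf8Text)
{
    return iconv('UTF-8', 'ASCII//TRANSLIT', $utf8Text);
}

// "Café München" written as UTF-8 bytes, so the source file encoding
// does not matter.
$sample = "Caf\xC3\xA9 M\xC3\xBCnchen";

// Try a few locales; names that are not installed on this system are skipped.
foreach (array('C', 'en_US.UTF-8', 'de_DE.UTF-8') as $locale) {
    if (setlocale(LC_CTYPE, $locale) === false) {
        echo $locale, ": not available on this system\n";
        continue;
    }
    // var_export() also makes a failed conversion (false) visible.
    echo $locale, ': ', var_export(transliterateToAscii($sample), true), "\n";
}
```

If the printed strings differ between locales, text indexed under one locale may not match queries analyzed under another, so it is safest to index and search with the same locale setting.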

Field names are defined at your discretion when you create the fields passed to the addField() method.

Java Lucene uses the 'contents' field as a default field to search. Zend_Search_Lucene searches through all fields by default, but the behavior is configurable. See the "Default search field" chapter for details.
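
As a minimal sketch of that configuration, the following restricts unprefixed query terms to a single field and then restores the default behavior. It assumes the Zend Framework library directory is on the include path, and the field name 'contents' is only an example.

```php
<?php
// Assumes the Zend Framework library directory is on the include path.
require_once 'Zend/Search/Lucene.php';

// Mimic Java Lucene: terms without a field prefix are searched
// in the 'contents' field only.
Zend_Search_Lucene::setDefaultSearchField('contents');
echo Zend_Search_Lucene::getDefaultSearchField(), "\n";

// Passing null restores the Zend_Search_Lucene default:
// unprefixed terms are searched through all fields.
Zend_Search_Lucene::setDefaultSearchField(null);
```

The setting is static, so it applies to every query parsed afterwards in the same request, not to one index object.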

33.1.3. Understanding Field Types

  • Keyword fields are stored and indexed, meaning that they can be searched as well as displayed in search results. They are not split up into separate words by tokenization. Enumerated database fields usually translate well to Keyword fields in Zend_Search_Lucene.

  • UnIndexed fields are not searchable, but they are returned with search hits. Database timestamps, primary keys, file system paths, and other external identifiers are good candidates for UnIndexed fields.

  • Binary fields are not tokenized or indexed, but are stored for retrieval with search hits. They can be used to store any data encoded as a binary string, such as an image icon.

  • Text fields are stored, indexed, and tokenized. Text fields are appropriate for storing information like subjects and titles that need to be searchable as well as returned with search results.

  • UnStored fields are tokenized and indexed, but not stored in the index. Large amounts of text are best indexed using this type of field. Storing data creates a larger index on disk, so if you need to search but not redisplay the data, use an UnStored field. UnStored fields are practical when using a Zend_Search_Lucene index in combination with a relational database. You can index large data fields with UnStored fields for searching, and retrieve them from your relational database by using a separate field as an identifier.

    Table 33.1. Zend_Search_Lucene_Field Types

    Field Type   Stored   Indexed   Tokenized   Binary
    Keyword      Yes      Yes       No          No
    UnIndexed    Yes      No        No          No
    Binary       Yes      No        No          Yes
    Text         Yes      Yes       Yes         No
    UnStored     No       Yes       Yes         No
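
The combination of an UnStored field with a relational database, described in the list above, might look like the following sketch. The $rows array stands in for a database table, and the field name 'db_id', the temporary index path, and the sample texts are all illustrative assumptions. The Zend Framework library directory is assumed to be on the include path.

```php
<?php
// Assumes the Zend Framework library directory is on the include path.
require_once 'Zend/Search/Lucene.php';

// Hypothetical rows standing in for a relational table (primary key => text).
$rows = array(
    17 => 'Full text of the first article about search engines',
    42 => 'Full text of the second article about databases',
);

// Throwaway index location, used for this example only.
$indexPath = sys_get_temp_dir() . '/zsl-example-' . uniqid();
$index = Zend_Search_Lucene::create($indexPath);

foreach ($rows as $id => $body) {
    $doc = new Zend_Search_Lucene_Document();
    // Stored but not searchable: the key used to find the row again later.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('db_id', $id));
    // Searchable but not stored: keeps the index small.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $body, 'utf-8'));
    $index->addDocument($doc);
}
$index->commit();

// Collect the primary keys of matching rows; the full text itself would
// then be fetched from the database using these keys.
$ids = array();
foreach ($index->find('databases') as $hit) {
    $ids[] = (int) $hit->db_id;
}
print_r($ids);
```

Only the identifier travels with each hit, so the index stays small and the database remains the single source of the full text.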

33.1.4. HTML documents

Zend_Search_Lucene offers an HTML parsing feature. Documents can be created directly from an HTML file or string:

<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
$index->addDocument($doc);
// ...
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);
            

The Zend_Search_Lucene_Document_Html class uses the DOMDocument::loadHTML() and DOMDocument::loadHTMLFile() methods to parse the source HTML, so it doesn't need the HTML to be well formed or to be XHTML. On the other hand, it is sensitive to the encoding specified by the "meta http-equiv" header tag.

The Zend_Search_Lucene_Document_Html class recognizes the document title, the body, and the document header meta tags.

The 'title' field is actually the /html/head/title value. It's stored within the index, tokenized and available for search.

The 'body' field is the actual body content of the HTML file or string. It doesn't include scripts, comments or attributes.

The loadHTML() and loadHTMLFile() methods of the Zend_Search_Lucene_Document_Html class also accept an optional second argument. If it is set to true, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

Other document header meta tags produce additional document fields. The field name is taken from the 'name' attribute, and the 'content' attribute populates the field value. Both are tokenized, indexed and stored, so documents may be searched by their meta tags (for example, by keywords).
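
The effect of the second argument and of the meta tags can be checked on a small HTML string, as in the sketch below. The markup is made up for illustration, the Zend Framework library directory is assumed to be on the include path, and reading the field's public isStored property is an assumption about the current implementation rather than a documented contract.

```php
<?php
// Assumes the Zend Framework library directory is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document/Html.php';

// Illustrative markup with one meta tag in the document header.
$html = '<html><head><title>Sample page</title>'
      . '<meta name="keywords" content="lucene, search"></head>'
      . '<body><p>Hello from the body.</p></body></html>';

// Default: the body is tokenized and indexed, but not stored.
$indexedOnly = Zend_Search_Lucene_Document_Html::loadHTML($html);

// Second argument set to true: the body is stored as well.
$stored = Zend_Search_Lucene_Document_Html::loadHTML($html, true);

// isStored is read directly from the field object (assumed public property).
var_dump($indexedOnly->getField('body')->isStored);
var_dump($stored->getField('body')->isStored);

// The meta tag becomes a field named after its 'name' attribute.
print_r($stored->getFieldNames());
echo $stored->getFieldValue('keywords'), "\n";
```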

Parsed documents may be augmented by the programmer with any other field:

<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
                                                   time()));
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('updated',
                                                   time()));
$doc->addField(Zend_Search_Lucene_Field::Text('annotation',
                                              'Document annotation text'));
$index->addDocument($doc);
            

Document links are not included in the generated document, but may be retrieved with the Zend_Search_Lucene_Document_Html::getLinks() and Zend_Search_Lucene_Document_Html::getHeaderLinks() methods:

<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$linksArray = $doc->getLinks();
$headerLinksArray = $doc->getHeaderLinks();
            



[6] Term, multi-term, phrase queries, boolean expressions and subqueries are supported at this time.

[7] Lucene 2.1 index format support (which is also used in Lucene 2.2) is included in the current "trunk" branch. It is available via SVN in current nightly snapshots.

We hope to include Lucene 2.1 index format support in ZF 1.5.0. The current release (ZF V1.0.4) works with Lucene 1.9-2.0 index formats.