Great Architect & Artist :: Apache Lucene

Introduction to Lucene's API : a high-level summary of the different Lucene packages

Materials

Apache Lucene is a high-performance, full-featured text search engine library. Here is a simple example of how to use Lucene for indexing and searching (JUnit is used to check that the results are what we expect):

    Analyzer analyzer = new StandardAnalyzer();

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead (FSDirectory.open takes a java.nio.file.Path):
    //Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();
     
    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();

The Lucene API is divided into several packages:
org.apache.lucene.analysis : defines the abstract Analyzer API for converting text from a Reader into a TokenStream.
org.apache.lucene.codecs : provides an abstraction layer over the encoding and decoding of the inverted index structure.
org.apache.lucene.document : provides a simple Document class. A Document is a set of Fields, whose values may be Strings or Readers.
org.apache.lucene.index : provides two primary classes. IndexWriter creates and adds documents to an index; IndexReader accesses the data in the index.
org.apache.lucene.search : provides data structures to represent queries (e.g. TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the IndexSearcher, which turns queries into TopDocs. A number of QueryParsers are provided for producing query structures from strings or XML.
org.apache.lucene.store : defines an abstract class for storing persistent data. A Directory is a collection of named files written by an IndexOutput and read by an IndexInput. Several implementations are provided, including FSDirectory; RAMDirectory is an implementation that keeps the data structures in memory.
org.apache.lucene.util : contains a few handy data structures and utility classes, such as FixedBitSet and PriorityQueue.
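
To make the terms "analysis" and "inverted index" a little more concrete, here is a rough sketch in plain Java. It is only a toy illustration and not how Lucene implements anything: the class ToyInvertedIndex and its methods analyze(), addDocument(), search() and storedText() are hypothetical names made up for this example, and it uses only java.util collections. It shows the general idea of turning text into terms and mapping each term to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: plain Java collections, not Lucene classes.
// All names here are hypothetical.
final class ToyInvertedIndex {
    // term -> ids of the documents that contain it, in ascending order
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    // document id -> original text, so that hits can be displayed
    private final List<String> storedTexts = new ArrayList<>();

    // Stand-in for an analyzer: lowercase, then split on anything
    // that is not a letter or a digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stand-in for adding a document: returns the new document id.
    int addDocument(String text) {
        int docId = storedTexts.size();
        storedTexts.add(text);
        for (String term : analyze(text)) {
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    // Stand-in for a single-term search: ids of documents containing the term.
    List<Integer> search(String term) {
        List<String> analyzed = analyze(term);
        if (analyzed.isEmpty()) {
            return Collections.emptyList();
        }
        List<Integer> docs = postings.get(analyzed.get(0));
        return docs == null ? Collections.emptyList() : Collections.unmodifiableList(docs);
    }

    String storedText(int docId) {
        return storedTexts.get(docId);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("This is the text to be indexed.");
        index.addDocument("Another document, without that word.");
        // The query term goes through the same analysis as the indexed text.
        for (int docId : index.search("Text")) {
            System.out.println(docId + ": " + index.storedText(docId));
        }
        // prints "0: This is the text to be indexed."
    }
}
```

Running the query term through the same analyze() step as the indexed text is the reason the search for "Text" finds the lowercase term "text"; this is also why the Lucene example passes the same analyzer to both the IndexWriterConfig and the QueryParser.
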
To use Lucene, an application should:
- Create Documents by adding Fields.
- Create an IndexWriter and add documents to it with addDocument().
- Call QueryParser.parse() to build a Query from a String.
- Create an IndexSearcher and pass the query to its search() method.
Some simple code examples are:
- IndexFiles.java : creates an index for all the files contained in a directory.
- SearchFiles.java : prompts for queries and searches an index.

To demonstrate these, try something like:

> java -cp lucene-core.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles -index index -docs rec.food.recipes/soups
adding rec.food.recipes/soups/abalone-chowder
 [ ... ]

> java -cp lucene-core.jar:lucene-demo.jar:lucene-queryparser.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles
Query: chowder
Searching for: chowder
34 total matching documents
1. rec.food.recipes/soups/spam-chowder
 [ ... thirty-four documents contain the word "chowder" ... ]
Query: "clam chowder" AND Manhattan
Searching for: +"clam chowder" +manhattan
2 total matching documents
1. rec.food.recipes/soups/clam-chowder
 [ ... two documents contain the phrase "clam chowder" and the word "manhattan" ... ]
 [ Note: "+" and "-" are canonical, but "AND", "OR" and "NOT" may be used. ]
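
The session above shows the query "clam chowder" AND Manhattan being rewritten with "+" prefixes. As a rough sketch of what required ("+") and prohibited ("-") clauses mean, the plain Java below checks a document's terms against both kinds of clause. This is an assumption-laden toy, not Lucene's BooleanQuery: the class ToyBooleanMatch and the method matches() are hypothetical, the term sets are made up, and the sketch ignores phrases, optional clauses and scoring.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only: hypothetical names, not Lucene's BooleanQuery.
final class ToyBooleanMatch {
    // A document matches when it contains every required ("+") term
    // and none of the prohibited ("-") terms.
    static boolean matches(Set<String> docTerms, List<String> required, List<String> prohibited) {
        for (String term : required) {
            if (!docTerms.contains(term)) {
                return false;
            }
        }
        for (String term : prohibited) {
            if (docTerms.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up term sets standing in for two analyzed documents.
        Set<String> manhattanChowder = new HashSet<>(Arrays.asList("manhattan", "clam", "chowder"));
        Set<String> spamChowder = new HashSet<>(Arrays.asList("spam", "chowder"));

        // +chowder +manhattan
        List<String> required = Arrays.asList("chowder", "manhattan");
        List<String> none = Collections.emptyList();
        System.out.println(matches(manhattanChowder, required, none)); // prints "true"
        System.out.println(matches(spamChowder, required, none));      // prints "false"

        // +chowder -spam
        System.out.println(matches(spamChowder, Arrays.asList("chowder"), Arrays.asList("spam"))); // prints "false"
    }
}
```
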

Apache Lucene - Getting Started 02
