Lucene 4 Source Code Analysis, Part 2: An Introduction to Lucene

    Posted: 2013-7-2

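        Lucene is a high-performance, scalable information retrieval library that lets you easily add full-text search to your own applications. Put simply, Lucene is a base library that implements search engine functionality. A search engine has one job: given the query a user types in, find the relevant documents. With thousands, millions, or hundreds of millions of documents, scanning them one by one with direct string matching is unbearably slow, which is why the index was invented. An index is, roughly speaking, a mapping from terms to documents that lets you quickly find the documents relevant to a term. Lucene therefore has two basic functions: (1) building an index over documents; (2) using that index to quickly find the documents relevant to a user's input.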

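        The index is the core of a modern search engine. Building an index means processing the source data into index files that are very convenient to query. Why does the index matter so much? Suppose you want to find the documents containing a certain keyword in a large collection. Without an index, you would have to read the documents into memory one after another and check each for the keyword, which takes a very long time, whereas a search engine returns results within milliseconds. It can do so because it has built an index. You can think of the index as a data structure that gives you fast random access to the keywords stored in it, and from each keyword to the documents associated with it. Lucene uses a mechanism called an inverted index. An inverted index maintains a table of terms or phrases, and for each entry in the table a list records which documents contain that term or phrase. When a user enters a query, the search results can then be obtained very quickly.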
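
        The term-to-documents mapping described above can be sketched with plain Java collections. This is only a toy illustration, not how Lucene stores its index: the class name InvertedIndexSketch, its methods, and the whitespace-only tokenization are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene code): term -> sorted ids of documents containing it. */
final class InvertedIndexSketch {
  private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
  private int nextDocId = 0;

  /** Adds a document, splitting on whitespace as a stand-in for real analysis, and returns its id. */
  int addDocument(String text) {
    int docId = nextDocId++;
    for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
      if (!token.isEmpty()) {
        postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
      }
    }
    return docId;
  }

  /** Returns the ids of the documents containing the term, in ascending order. */
  List<Integer> lookup(String term) {
    TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
    return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
  }

  public static void main(String[] args) {
    InvertedIndexSketch index = new InvertedIndexSketch();
    index.addDocument("Lucene is a search library");        // doc 0
    index.addDocument("An index maps terms to documents");  // doc 1
    index.addDocument("Lucene builds an inverted index");   // doc 2
    System.out.println(index.lookup("lucene"));  // [0, 2]
    System.out.println(index.lookup("index"));   // [1, 2]
    System.out.println(index.lookup("missing")); // []
  }
}
```

        A lookup touches only the entry for the queried term, so its cost does not depend on scanning every document, which is the point the paragraph above makes.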

        To index documents, Lucene provides five basic classes: Document, Field, IndexWriter, Analyzer, and Directory. Their purposes are described below:

    Document

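    Document describes a document, which here can be an HTML page, an e-mail, or a text file. A Document object is made up of multiple Field objects. You can think of a Document as a record in a database, with each Field being one column of that record.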

    Field

    A Field object describes one attribute of a document. For example, the subject and the body of an e-mail can be described by two separate Field objects.

    Analyzer

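    Before a document is indexed, its content must first be tokenized, and that work is done by an Analyzer. Analyzer is an abstract class with multiple implementations; choose one that suits the language and the needs of the application. The Analyzer hands the tokenized content to IndexWriter, which builds the index.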
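
    To make the tokenization step concrete, here is a minimal sketch of what an analyzer does conceptually: split the text, lowercase each token, and drop stop words. It does not use Lucene's Analyzer API; the class SimpleAnalyzerSketch, its analyze method, and the short stop-word list are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy tokenizer (hypothetical, not Lucene's StandardAnalyzer). */
final class SimpleAnalyzerSketch {
  // Assumed stop-word list, kept short for the example.
  private static final Set<String> STOP_WORDS =
      new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

  /** Splits on anything that is not a letter or digit, lowercases, and removes stop words. */
  static List<String> analyze(String text) {
    List<String> tokens = new ArrayList<>();
    for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
      String token = raw.toLowerCase(Locale.ROOT);
      if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
        tokens.add(token);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(analyze("The Title of an E-mail")); // [title, e, mail]
  }
}
```

    Real analyzers differ in how they treat punctuation, case, and language-specific rules, which is why the choice of Analyzer depends on the language and the application.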

    IndexWriter

    IndexWriter is the core class Lucene uses to create an index. Its job is to add Document objects to the index one by one.

    Directory

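    This class represents the location where a Lucene index is stored. It is an abstract class with two commonly used implementations: FSDirectory, which represents an index stored in the file system, and RAMDirectory, which represents an index held in memory.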

        Below is a simple piece of indexing code:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.Date;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.LongField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexWriterConfig.OpenMode;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexFiles {
      private IndexFiles() {}

      /** Index all text files under a directory. */
      public static void main(String[] args) {
        String usage = "java org.apache.lucene.demo.IndexFiles"
                     + " [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
                     + "This indexes the documents in DOCS_PATH, creating a Lucene index "
                     + "in INDEX_PATH that can be searched with SearchFiles";
        String indexPath = "index";
        String docsPath = null;
        boolean create = true;
        for(int i=0;i<args.length;i++) {
          // Directory the index is written to
          if ("-index".equals(args[i])) {
            indexPath = args[i+1];
            i++;
          // Directory of the documents to be indexed
          } else if ("-docs".equals(args[i])) {
            docsPath = args[i+1];
            i++;
          // Update an existing index instead of creating a new one
          } else if ("-update".equals(args[i])) {
            create = false;
          }
        }

        if (docsPath == null) {
          System.err.println("Usage: " + usage);
          System.exit(1);
        }

        final File docDir = new File(docsPath);
        if (!docDir.exists() || !docDir.canRead()) {
          System.out.println("Document directory '" +docDir.getAbsolutePath()+ "' does not exist or is not readable, please check the path");
          System.exit(1);
        }
        Date start = new Date();
        try {
          System.out.println("Indexing to directory '" + indexPath + "'...");
          // Directory the index is written to
          Directory dir = FSDirectory.open(new File(indexPath));
          // Instantiate an analyzer
          Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
          // IndexWriter configuration; takes the analyzer as a parameter
          IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);

          if (create) {
            // Create a new index in the directory, removing any
            // previously indexed documents:
            iwc.setOpenMode(OpenMode.CREATE);
          } else {
            // Add new documents to an existing index:
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
          }

          // Optional: for better indexing performance, if you
          // are indexing many documents, increase the RAM
          // buffer.  But if you do this, increase the max heap
          // size to the JVM (eg add -Xmx512m or -Xmx1g):
          //
          // iwc.setRAMBufferSizeMB(256.0);

          // The IndexWriter; takes the index directory and the writer configuration
          IndexWriter writer = new IndexWriter(dir, iwc);
          // Build the index; takes the IndexWriter and the directory of documents to index
          indexDocs(writer, docDir);

          // NOTE: if you want to maximize search performance,
          // you can optionally call forceMerge here.  This can be
          // a terribly costly operation, so generally it's only
          // worth it when your index is relatively static (ie
          // you're done adding documents to it):
          //
          // writer.forceMerge(1);

          writer.close();

          Date end = new Date();
          System.out.println(end.getTime() - start.getTime() + " total milliseconds");

        } catch (IOException e) {
          System.out.println(" caught a " + e.getClass() +
           "\n with message: " + e.getMessage());
        }
      }

      /**
       * Indexes the given file using the given writer, or if a directory is given,
       * recurses over files and directories found under the given directory.
       *
       * NOTE: This method indexes one document per input file.  This is slow.  For good
       * throughput, put multiple documents into your input file(s).  An example of this is
       * in the benchmark module, which can create "line doc" files, one document per line,
       * using the
       * <a href="../../../../../contrib-benchmark/org/apache/lucene/benchmark/byTask/tasks/WriteLineDocTask.html"
       * >WriteLineDocTask</a>.
       *
       * @param writer Writer to the index where the given file/dir info will be stored
       * @param file The file to index, or the directory to recurse into to find files to index
       * @throws IOException If there is a low-level I/O error
       */
      static void indexDocs(IndexWriter writer, File file)
        throws IOException {

        if (file.canRead()) {
          // If this is a directory, recurse into it
          if (file.isDirectory()) {
            String[] files = file.list();
            if (files != null) {
              for (int i = 0; i < files.length; i++) {
                indexDocs(writer, new File(file, files[i]));
              }
            }
          } else {
            // Index a single file
            FileInputStream fis;
            try {
              fis = new FileInputStream(file);
            } catch (FileNotFoundException fnfe) {
              return;
            }

            try {

              // Make a new, empty document
              Document doc = new Document();

              // Add the path of the file as a field named "path".  Use a
              // field that is indexed (i.e. searchable), but don't tokenize
              // the field into separate words and don't index term frequency
              // or positional information:
              Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
              doc.add(pathField);

              // Add the last modified date of the file a field named "modified".
              // Use a LongField that is indexed (i.e. efficiently filterable with
              // NumericRangeFilter).  This indexes to milli-second resolution, which
              // is often too fine.  You could instead create a number based on
              // year/month/day/hour/minutes/seconds, down the resolution you require.
              // For example the long value 2011021714 would mean
              // February 17, 2011, 2-3 PM.
              doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));

              // Add the contents of the file to a field named "contents".  Specify a Reader,
              // so that the text of the file is tokenized and indexed, but not stored.
              // Note that the file is expected to be in UTF-8 encoding.
              // If that's not the case searching for special characters will fail.
              doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

              if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
                // New index, so we just add the document (no old document can be there):
                System.out.println("adding " + file);
                writer.addDocument(doc);
              } else {
                // Existing index (an old copy of this document may have been indexed) so
                // we use updateDocument instead to replace the old one matching the exact
                // path, if present:
                System.out.println("updating " + file);
                writer.updateDocument(new Term("path", file.getPath()), doc);
              }
            } finally {
              fis.close();
            }
          }
        }
      }
    }
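
        The comment on the "modified" field in the indexing code notes that millisecond resolution is often too fine and gives 2011021714 as an example of an hour-resolution value. A minimal sketch of that conversion, using java.time from the JDK, might look like the following. The class name CoarseTimestampSketch, the method toHourResolution, and the choice of UTC in the demo are assumptions for illustration and are not part of the Lucene demo.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: reduces an epoch-millisecond timestamp to a yyyyMMddHH number. */
final class CoarseTimestampSketch {
  private static final DateTimeFormatter HOUR_FORMAT = DateTimeFormatter.ofPattern("yyyyMMddHH");

  /** Formats the instant in the given zone as yyyyMMddHH and parses the digits as a long. */
  static long toHourResolution(long epochMillis, ZoneId zone) {
    String digits = HOUR_FORMAT.format(Instant.ofEpochMilli(epochMillis).atZone(zone));
    return Long.parseLong(digits);
  }

  public static void main(String[] args) {
    // February 17, 2011, 14:30 UTC, expressed as epoch milliseconds.
    long millis = LocalDateTime.of(2011, 2, 17, 14, 30)
        .toInstant(ZoneOffset.UTC)
        .toEpochMilli();
    System.out.println(toHourResolution(millis, ZoneOffset.UTC)); // 2011021714
  }
}
```

        The resulting value could be passed to the "modified" field in place of file.lastModified(). Which time zone to use is a decision for the application.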

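        Searching with Lucene is just as convenient as building an index. In the previous part we built an index over the text files in a directory; now we search that index for documents containing a given keyword or phrase. Lucene provides several basic classes for this: IndexSearcher, Term, Query, TermQuery, and Hits (replaced by TopDocs in Lucene 4). Each is introduced below.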

    Query

    This is an abstract class with multiple implementations, such as TermQuery, BooleanQuery, and PrefixQuery. Its purpose is to wrap the user's query string in a form that Lucene can process.
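
    As a conceptual illustration of how a query that requires two terms can be answered from an inverted index, the sketch below intersects two sorted lists of document ids. It is not Lucene's BooleanQuery implementation; the class PostingsIntersectionSketch and its intersect method are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: documents matching "termA AND termB" are those present in both id lists. */
final class PostingsIntersectionSketch {

  /** Intersects two ascending arrays of document ids with a linear two-pointer merge. */
  static List<Integer> intersect(int[] a, int[] b) {
    List<Integer> result = new ArrayList<>();
    int i = 0;
    int j = 0;
    while (i < a.length && j < b.length) {
      if (a[i] == b[j]) {
        result.add(a[i]);
        i++;
        j++;
      } else if (a[i] < b[j]) {
        i++;
      } else {
        j++;
      }
    }
    return result;
  }

  public static void main(String[] args) {
    int[] docsWithFirstTerm = {0, 2, 5, 9};
    int[] docsWithSecondTerm = {2, 3, 9, 12};
    System.out.println(intersect(docsWithFirstTerm, docsWithSecondTerm)); // [2, 9]
  }
}
```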

    Term

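    Term is the basic unit of search. A Term object consists of two String-valued parts. A Term can be created with a single statement: Term term = new Term("fieldName", "queryWord"); The first argument names the Field to search in, and the second is the keyword to look for.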

    TermQuery

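    TermQuery is a subclass of the abstract class Query and one of the most basic query types Lucene supports. A TermQuery is created with the following statement: TermQuery termQuery = new TermQuery(new Term("fieldName", "queryWord")); Its constructor takes a single argument, a Term object.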

    IndexSearcher

    IndexSearcher searches an index that has already been built. It opens the index read-only, so multiple IndexSearcher instances can operate on the same index at once.

    Hits

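    Hits was the class that held search results in older Lucene releases. It was removed in Lucene 3.0. In Lucene 4, IndexSearcher.search returns a TopDocs object whose scoreDocs array holds the matching ScoreDoc entries, which is what the search code in this article uses.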

        Below is a simple piece of search code:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.Date;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SearchFiles {

      private SearchFiles() {}

      /** Simple command-line based search demo. */
      public static void main(String[] args) throws Exception {
        String usage =
          "Usage:\tjava org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]\n\nSee http://lucene.apache.org/core/4_1_0/demo/ for details.";
        if (args.length > 0 && ("-h".equals(args[0]) || "-help".equals(args[0]))) {
          System.out.println(usage);
          System.exit(0);
        }

        String index = "index";
        String field = "contents";
        String queries = null;
        int repeat = 0;
        boolean raw = false;
        String queryString = null;
        int hitsPerPage = 10;
        for(int i = 0;i < args.length;i++) {
          if ("-index".equals(args[i])) {
            index = args[i+1];
            i++;
          } else if ("-field".equals(args[i])) {
            field = args[i+1];
            i++;
          } else if ("-queries".equals(args[i])) {
            queries = args[i+1];
            i++;
          } else if ("-query".equals(args[i])) {
            queryString = args[i+1];
            i++;
          } else if ("-repeat".equals(args[i])) {
            repeat = Integer.parseInt(args[i+1]);
            i++;
          } else if ("-raw".equals(args[i])) {
            raw = true;
          } else if ("-paging".equals(args[i])) {
            hitsPerPage = Integer.parseInt(args[i+1]);
            if (hitsPerPage <= 0) {
              System.err.println("There must be at least 1 hit per page.");
              System.exit(1);
            }
            i++;
          }
        }
        // Open the index read-only and wrap it in a searcher
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

        BufferedReader in = null;
        if (queries != null) {
          in = new BufferedReader(new InputStreamReader(new FileInputStream(queries), "UTF-8"));
        } else {
          in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
        }
        QueryParser parser = new QueryParser(Version.LUCENE_40, field, analyzer);
        while (true) {
          if (queries == null && queryString == null) {                        // prompt the user
            System.out.println("Enter query: ");
          }

          String line = queryString != null ? queryString : in.readLine();

          // readLine() returns null at end of input
          if (line == null) {
            break;
          }

          line = line.trim();
          if (line.length() == 0) {
            break;
          }
          Query query = parser.parse(line);
          System.out.println("Searching for: " + query.toString(field));
          if (repeat > 0) {                           // repeat & time as benchmark
            Date start = new Date();
            for (int i = 0; i < repeat; i++) {
              searcher.search(query, null, 100);
            }
            Date end = new Date();
            System.out.println("Time: "+(end.getTime()-start.getTime())+"ms");
          }

          doPagingSearch(in, searcher, query, hitsPerPage, raw, queries == null && queryString == null);

          if (queryString != null) {
            break;
          }
        }
        reader.close();
      }

      /**
       * This demonstrates a typical paging search scenario, where the search engine presents
       * pages of size n to the user. The user can then go to the next page if interested in
       * the next hits.
       *
       * When the query is executed for the first time, then only enough results are collected
       * to fill 5 result pages. If the user wants to page beyond this limit, then the query
       * is executed another time and all hits are collected.
       *
       */
      public static void doPagingSearch(BufferedReader in, IndexSearcher searcher, Query query,
                                         int hitsPerPage, boolean raw, boolean interactive) throws IOException {
        // Collect enough docs to show 5 pages
        TopDocs results = searcher.search(query, 5 * hitsPerPage);
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = results.totalHits;
        System.out.println(numTotalHits + " total matching documents");

        int start = 0;
        int end = Math.min(numTotalHits, hitsPerPage);
        while (true) {
          if (end > hits.length) {
            System.out.println("Only results 1 - " + hits.length + " of " + numTotalHits + " total matching documents collected.");
            System.out.println("Collect more (y/n) ?");
            String line = in.readLine();
            if (line == null || line.length() == 0 || line.charAt(0) == 'n') {
              break;
            }

            hits = searcher.search(query, numTotalHits).scoreDocs;
          }
          end = Math.min(hits.length, start + hitsPerPage);
          for (int i = start; i < end; i++) {
            if (raw) {                              // output raw format
              System.out.println("doc="+hits[i].doc+" score="+hits[i].score);
              continue;
            }

            Document doc = searcher.doc(hits[i].doc);
            String path = doc.get("path");
            if (path != null) {
              System.out.println((i+1) + ". " + path);
              String title = doc.get("title");
              if (title != null) {
                System.out.println("   Title: " + doc.get("title"));
              }
            } else {
              System.out.println((i+1) + ". " + "No path for this document");
            }
          }

          if (!interactive || end == 0) {
            break;
          }

          if (numTotalHits >= end) {
            boolean quit = false;
            while (true) {
              System.out.print("Press ");
              if (start - hitsPerPage >= 0) {
                System.out.print("(p)revious page, ");
              }
              if (start + hitsPerPage < numTotalHits) {
                System.out.print("(n)ext page, ");
              }
              System.out.println("(q)uit or enter number to jump to a page.");
              String line = in.readLine();
              if (line == null || line.length() == 0 || line.charAt(0)=='q') {
                quit = true;
                break;
              }
              if (line.charAt(0) == 'p') {
                start = Math.max(0, start - hitsPerPage);
                break;
              } else if (line.charAt(0) == 'n') {
                if (start + hitsPerPage < numTotalHits) {
                  start+=hitsPerPage;
                }
                break;
              } else {
                int page = Integer.parseInt(line);
                if ((page - 1) * hitsPerPage < numTotalHits) {
                  start = (page - 1) * hitsPerPage;
                  break;
                } else {
                  System.out.println("No such page");
                }
              }
            }
            if (quit) break;
            end = Math.min(numTotalHits, start + hitsPerPage);
          }
        }
      }
    }
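
        The paging arithmetic inside doPagingSearch is easier to follow in isolation. The sketch below computes the range of hit indexes shown for a given page number. The class PagingSketch and the method pageBounds are hypothetical helpers written for this illustration, and unlike the demo they also reject page numbers below 1.

```java
import java.util.Arrays;

/** Hypothetical helper that isolates the start/end computation used when paging through hits. */
final class PagingSketch {

  /**
   * Returns {start, end}, a half-open range of hit indexes for a 1-based page number,
   * or null when the page does not exist.
   */
  static int[] pageBounds(int page, int hitsPerPage, int totalHits) {
    if (page < 1 || hitsPerPage < 1) {
      return null;
    }
    long start = (long) (page - 1) * hitsPerPage; // long avoids int overflow for large pages
    if (start >= totalHits) {
      return null; // corresponds to the "No such page" branch
    }
    int end = (int) Math.min(totalHits, start + hitsPerPage);
    return new int[] {(int) start, end};
  }

  public static void main(String[] args) {
    // 25 hits at 10 per page: page 3 holds the last 5 hits.
    System.out.println(Arrays.toString(pageBounds(3, 10, 25))); // [20, 25]
    System.out.println(pageBounds(4, 10, 25) == null);          // true
  }
}
```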
