Book: 从Lucene到Elasticsearch：全文检索实战 (From Lucene to Elasticsearch: Full-Text Search in Practice)
Author: 姚攀 (Yao Pan)
Last updated: 2020-11-28 14:50:08

3.5 Indexing Documents

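Once the project is set up, the first step is to build the index. The objects to be searched are files, and to keep things simple we index only the file name and the file content. Following Java's object-oriented style, create an entity class named FileModel in the lucene.file.search.model package to represent a file. Give it two String member variables, title and content, with the corresponding getter and setter methods, and add a constructor that takes both values. See Listing 3-3.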

Listing 3-3 The FileModel entity class

        package lucene.file.search.model;

        // Entity class representing one file to be indexed
        public class FileModel {
            private String title;   // file title
            private String content; // file content

            public String getTitle() {
                return title;
            }

            public void setTitle(String title) {
                this.title = title;
            }

            public String getContent() {
                return content;
            }

            public void setContent(String content) {
                this.content = content;
            }

            public FileModel() {
            }

            public FileModel(String title, String content) {
                this.title = title;
                this.content = content;
            }
        }

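In the lucene.file.search.service package, create a Java class named CreateIndex for building the index, and add two static methods to it: extractFile() and ParserExtraction(). The extractFile() method lists all files under the WebContent/files directory and returns a list of FileModel objects. The Java code is as follows: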

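        // Requires java.io.File, java.io.IOException, java.util.ArrayList, java.util.List
        public static List<FileModel> extractFile() throws IOException {
            List<FileModel> list = new ArrayList<>();
            File fileDir = new File("WebContent/files");
            File[] allFiles = fileDir.listFiles();
            if (allFiles == null) { // directory is missing or cannot be read
                return list;
            }
            for (File f : allFiles) {
                // The title is the file name; the content is extracted by Tika
                FileModel sf = new FileModel(f.getName(), ParserExtraction(f));
                list.add(sf);
            }
            return list;
        }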
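The listing step can also be written so that it skips subdirectories and never returns null. The following is a minimal sketch under stated assumptions, not the book's implementation: FileLister and listRegularFiles are hypothetical names introduced only for illustration, and sorting by path is an added convenience so that the indexing order is predictable.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the book's CreateIndex class.
final class FileLister {

    // Returns the regular files directly under dir, sorted by pathname.
    // Returns an empty list when dir is missing, unreadable, or not a directory.
    static List<File> listRegularFiles(File dir) {
        File[] entries = dir.listFiles(File::isFile); // null if dir cannot be listed
        if (entries == null) {
            return new ArrayList<>();
        }
        Arrays.sort(entries); // File is Comparable and orders by pathname
        return new ArrayList<>(Arrays.asList(entries));
    }

    public static void main(String[] args) {
        // "WebContent/files" is the directory used in this section.
        for (File f : listRegularFiles(new File("WebContent/files"))) {
            System.out.println(f.getName());
        }
    }
}
```

With such a helper, extractFile() could loop over listRegularFiles(fileDir) instead of calling listFiles() directly, so that a missing directory yields an empty index rather than a NullPointerException.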

The ParserExtraction() method uses Tika to extract the content of a document. It takes a File object as its parameter. The Java code is as follows:

        // Requires java.io.*, org.xml.sax.SAXException, and from Tika:
        // TikaException, Metadata, AutoDetectParser, ParseContext, Parser,
        // BodyContentHandler
        public static String ParserExtraction(File file) {
            String fileContent = ""; // receives the extracted text
            BodyContentHandler handler = new BodyContentHandler();
            Parser parser = new AutoDetectParser(); // detects the file format automatically
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            // try-with-resources closes the stream even if parsing fails
            try (InputStream inputStream = new FileInputStream(file)) {
                parser.parse(inputStream, handler, metadata, context);
                fileContent = handler.toString();
            } catch (IOException | SAXException | TikaException e) {
                // FileNotFoundException is a subclass of IOException
                e.printStackTrace();
            }
            return fileContent;
        }

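Tokenization is handled by the IK analyzer. Place IKTokenizer6x.java and IKAnalyzer6x.java in the lucene.file.search.service package (see Chapter 2 for how to use the IK analyzer with Lucene 6.0). Then add a main method to the CreateIndex class: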

          // Requires java.io.IOException, java.nio.file.*, java.util.*, and from Lucene:
          // Analyzer, Document, Field, FieldType, IndexOptions, IndexWriter,
          // IndexWriterConfig, IndexWriterConfig.OpenMode, Directory, FSDirectory
          public static void main(String[] args) throws IOException {
              // IK analyzer instance
              Analyzer analyzer = new IKAnalyzer6x();
              IndexWriterConfig icw = new IndexWriterConfig(analyzer);
              icw.setOpenMode(OpenMode.CREATE); // overwrite any existing index
              Directory dir = null;
              IndexWriter inWriter = null;
              Path indexPath = Paths.get("WebContent/indexdir");
              // Field type shared by title and content:
              // stored, tokenized, with positions, offsets and term vectors
              FieldType fileType = new FieldType();
              fileType.setIndexOptions(
                  IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
              fileType.setStored(true);
              fileType.setTokenized(true);
              fileType.setStoreTermVectors(true);
              fileType.setStoreTermVectorPositions(true);
              fileType.setStoreTermVectorOffsets(true);
              Date start = new Date(); // start time
              if (!Files.isReadable(indexPath)) {
                  System.out.println(indexPath.toAbsolutePath()
                      + " does not exist or is not readable; please check!");
                  System.exit(1);
              }
              dir = FSDirectory.open(indexPath);
              inWriter = new IndexWriter(dir, icw);
              List<FileModel> fileList = extractFile();
              // Iterate over fileList and add one document per file
              for (FileModel f : fileList) {
                  Document doc = new Document();
                  doc.add(new Field("title", f.getTitle(), fileType));
                  doc.add(new Field("content", f.getContent(), fileType));
                  inWriter.addDocument(doc);
              }
              inWriter.commit();
              inWriter.close();
              dir.close();
              Date end = new Date(); // end time
              // Print the time spent indexing
              System.out.println("Indexing finished in "
                  + (end.getTime() - start.getTime()) + " ms.");
          }
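Note that the readability check in main stops the program whenever WebContent/indexdir does not exist, which is the case on a first run unless the directory has been created by hand. One possible alternative, sketched here as an assumption rather than the book's approach, is to create the directory before opening it. IndexDirs and ensureDirectory are hypothetical names used only for illustration; the sketch relies solely on java.nio.file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the book's CreateIndex class.
final class IndexDirs {

    // Creates dir and any missing parent directories, then returns it.
    // Does nothing if the directory already exists.
    static Path ensureDirectory(Path dir) {
        try {
            return Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot create " + dir.toAbsolutePath(), e);
        }
    }

    public static void main(String[] args) {
        // "WebContent/indexdir" is the index location used in this section.
        Path indexPath = ensureDirectory(Paths.get("WebContent/indexdir"));
        System.out.println("Index directory ready: " + indexPath.toAbsolutePath());
    }
}
```

In main, calling ensureDirectory(indexPath) before the Files.isReadable check would leave the check in place as a guard against permission problems while removing the manual setup step.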

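After the main method of the CreateIndex class runs, the file name and file content of every document under /filesearch/WebContent/files, whatever its format, have been written to the Lucene index.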