common_grams 过滤器

common_grams 过滤器

common_grams 过滤器是针对短语查询能更高效的使用停用词而设计的。它与 shingles 过滤器类似（参见查找相关词（shingles), 为每个相邻词对生成，用示例解释更为容易。

common_grams 过滤器根据 query_mode 设置的不同而生成不同输出结果：false （为索引使用）或 true （为搜索使用），所以我们必须创建两个独立的分析器：

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "index_filter": { (1)
          "type":         "common_grams",
          "common_words": "_english_" (2)
        },
        "search_filter": { (1)
          "type":         "common_grams",
          "common_words": "_english_", (2)
          "query_mode":   true
        }
      },
      "analyzer": {
        "index_grams": { (3)
          "tokenizer":  "standard",
          "filter":   [ "lowercase", "index_filter" ]
        },
        "search_grams": { (3)
          "tokenizer": "standard",
          "filter":  [ "lowercase", "search_filter" ]
        }
      }
    }
  }
}

<1> 首先我们基于 common_grams 过滤器创建两个过滤器： index_filter 在索引时使用（此时 query_mode 的默认设置是 false ）， search_filter 在查询时使用（此时 query_mode 的默认设置是 true ）。

<2> common_words 参数可以接受与 stopwords 参数同样的选项（参见指定停用词 specifying-stopwords ）。这个过滤器还可以接受参数 common_words_path ，使用存于文件里的常用词。

<3> 然后我们使用过滤器各创建一个索引时分析器和查询时分析器。

有了自定义分析器，我们可以创建一个字段在索引时使用 index_grams 分析器：

PUT /my_index/_mapping/my_type
{
  "properties": {
    "text": {
      "type":            "string",
      "analyzer":  "index_grams", (1)
      "search_analyzer": "standard" (1)
    }
  }
}

<1> text 字段索引时使用 index_grams 分析器，但是在搜索时默认使用 standard 分析器，稍后我们会解释其原因。

索引时（At Index Time）

如果我们对短语 The quick and brown fox 进行拆分，它生成如下词项：

Pos 1: the_quick
Pos 2: quick_and
Pos 3: and_brown
Pos 4: brown_fox

新的 index_grams 分析器生成以下词项：

Pos 1: the, the_quick
Pos 2: quick, quick_and
Pos 3: and, and_brown
Pos 4: brown
Pos 5: fox

所有的词项都是以 unigrams 形式输出的（the、quick 等等），但是如果一个词本身是常用词或者跟随着常用词，那么它同时还会在 unigram 同样的位置以 bigram 形式输出：the_quick ， quick_and ， and_brown 。

单字查询（Unigram Queries）

因为索引包含 unigrams，可以使用与其他字段相同的技术进行查询，例如：

GET /my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "the quick and brown fox",
        "cutoff_frequency": 0.01
      }
    }
  }
}

上面这个查询字符串是通过为文本字段配置的 search_analyzer 分析器 —本例中使用的是 standard 分析器— 进行分析的，它生成的词项为： the ， quick ， and ， brown ， fox 。

因为 text 字段的索引中包含与 standard 分析去生成的一样的 unigrams ，搜索对于任何普通字段都能正常工作。

二元语法短语查询（Bigram Phrase Queries）

但是，当我们进行短语查询时，我们可以用专门的 search_grams 分析器让整个过程变得更高效：

GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": {
        "query":    "The quick and brown fox",
        "analyzer": "search_grams" (1)
      }
    }
  }
}

<1> 对于短语查询，我们重写了默认的 search_analyzer 分析器，而使用 search_grams 分析器。

search_grams 分析器会生成以下词项：

Pos 1: the_quick
Pos 2: quick_and
Pos 3: and_brown
Pos 4: brown
Pos 5: fox

分析器排除了所有常用词的 unigrams，只留下常用词的 bigrams 以及低频的 unigrams。如 the_quick 这样的 bigrams 比单个词项 the 更为少见，这样有两个好处：

the_quick 的位置信息要比 the 的小得多，所以它读取磁盘更快，对系统缓存的影响也更小。
词项 the_quick 没有 the 那么常见，所以它可以大量减少需要计算的文档。

两词短语（Two-Word Phrases）

我们的优化可以更进一步，因为大多数的短语查询只由两个词组成，如果其中一个恰好又是常用词，例如：

GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": {
        "query":    "The quick",
        "analyzer": "search_grams"
      }
    }
  }
}

那么 search_grams 分析器会输出单个语汇单元：the_quick 。这将原来昂贵的查询（查询 the 和 quick ）转换成了对单个词项的高效查找。