로켓🐾
Published 2021. 8. 24. 15:09
[Elastic Search] TokenFilter - NGram ...

์ด ๊ธ€์€ ๊น€์ข…๋ฏผ(kimjmin@gmail.com)๋‹˜์˜ ๊ธ€์ž…๋‹ˆ๋‹ค. ๋ฌด๋‹จ ๋ณต์ œ/์ˆ˜์ •์„ ๊ธˆํ•ฉ๋‹ˆ๋‹ค.

 

Elasticsearch achieves fast search by splitting text into terms ahead of time and storing them in an inverted index. Some use cases, such as searching a scientific glossary, nevertheless need to match on part of a word rather than on a whole term. Elasticsearch does support wildcard and regexp (regular expression) queries, which behave like a LIKE search in an RDBMS, but these queries consume a lot of memory and are slow, so they do not take advantage of Elasticsearch's strengths. For such cases, fragments of each term can be split out and stored in advance; these fragments of a word are called NGrams. They are usually named by length: unigram (1 character), bigram (2 characters), and so on.

 

Elasticsearch provides a token filter that handles NGrams, configured with "type": "ngram" (the camel-case spelling "nGram" is deprecated in recent versions). Processing the word "house" into 2-character NGrams (bigrams) extracts four tokens: "ho", "ou", "us", and "se". With the ngram token filter, all of these 2-character terms are stored as searchable tokens. In an index configured this way, searching for just "ho" returns the documents that contain house.
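The bigram idea can be tried outside Elasticsearch. The following is a minimal Python sketch, not the Elasticsearch implementation: the helper names `bigrams` and `build_index` and the two sample documents are made up for illustration. It splits each word into 2-character fragments, stores them in a toy inverted index, and looks up "ho".

```python
from collections import defaultdict


def bigrams(word):
    """Return every 2-character substring of word, left to right."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for gram in bigrams(word):
                index[gram].add(doc_id)
    return index


# Hypothetical sample documents, used only for this illustration.
docs = {1: "my house", 2: "a boat"}
index = build_index(docs)

print(bigrams("house"))      # ['ho', 'ou', 'us', 'se']
print(sorted(index["ho"]))   # [1]
```

Only document 1 contains a word with the fragment "ho", so the lookup finds it without scanning any text at query time.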

 


์ฃผ์˜ : ngram ํ† ํฐํ•„ํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ €์žฅ๋˜๋Š” ํ…€์˜ ๊ฐฏ์ˆ˜๋„ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ๋Š˜์–ด๋‚˜๊ณ  ๊ฒ€์ƒ‰์–ด๋ฅผ "ho"๋กœ ๊ฒ€์ƒ‰ ํ–ˆ์„ ๋•Œ house, shoes ์ฒ˜๋Ÿผ ๊ฒ€์ƒ‰ ๊ฒฐ๊ณผ๋ฅผ ์˜ˆ์ƒํ•˜๊ธฐ ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์— ์ผ๋ฐ˜์ ์ธ ํ…์ŠคํŠธ ๊ฒ€์ƒ‰์—๋Š” ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค. ngram์„ ์‚ฌ์šฉํ•˜๊ธฐ ์ ํ•ฉํ•œ ์‚ฌ๋ก€๋Š” ์นดํ…Œ๊ณ ๋ฆฌ ๋ชฉ๋ก์ด๋‚˜ ํƒœ๊ทธ ๋ชฉ๋ก๊ณผ ๊ฐ™์ด ์ „์ฒด ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ ์ง‘๋‹จ์— ์ž๋™์™„์„ฑ ๊ฐ™์€ ๊ธฐ๋Šฅ์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐ์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค.

The ngram token filter has two options, min_gram (default 1) and max_gram (default 2). As the names suggest, they set the minimum and maximum number of characters per token. Analyzing house with "min_gram": 2, "max_gram": 3 stores seven tokens: "ho", "hou", "ou", "ous", "us", "use", and "se".
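To see where the seven tokens come from, here is a small Python sketch of the same splitting rule. It is an illustration, not the filter's actual source: the function name `ngrams` is hypothetical, and only the parameter names min_gram and max_gram are taken from the options above. For each start position it emits every substring whose length lies between min_gram and max_gram.

```python
def ngrams(word, min_gram=1, max_gram=2):
    """Return substrings of word with length min_gram..max_gram.

    Tokens are grouped by start position, shortest first.
    """
    tokens = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                tokens.append(word[start:start + size])
    return tokens


tokens = ngrams("house", min_gram=2, max_gram=3)
print(tokens)       # ['ho', 'hou', 'ou', 'ous', 'us', 'use', 'se']
print(len(tokens))  # 7
```

The last start positions produce fewer tokens because a 3-character window no longer fits, which is why the count is seven rather than eight.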


๋‹ค์Œ์€ my_ngram ์ธ๋ฑ์Šค์— "min_gram": 2, "max_gram": 3 ์ธ my_ngram_f ํ† ํฐํ•„ํ„ฐ๋ฅผ ๋งŒ๋“ค๊ณ  house ๋ฅผ ๋ถ„์„ํ•˜๋Š” ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค.

 

Create the my_ngram index
PUT my_ngram
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngram_f": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      }
    }
  }
}

 

Request

Analyze "house" with the my_ngram_f token filter
GET my_ngram/_analyze
{
  "tokenizer": "keyword",
  "filter": [
    "my_ngram_f"
  ],
  "text": "house"
}

Response

Result of analyzing "house" with the my_ngram_f token filter
{
  "tokens" : [
    {
      "token" : "ho",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hou",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ou",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ous",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "us",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "use",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "se",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    }
  ]
}