3) Testing the nori Korean analyzer and defining parts of speech

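Let's test how the nori Korean analyzer configured earlier actually behaves.

We will look at how it differs from the default analyzer and at the Korean part-of-speech tags.

First, let's tokenize the content of the search data registered earlier with both the default analyzer and the nori analyzer and compare the results.

Test data

이것은 테스트하고 있는 컨텐츠 (roughly, "This is the content being tested")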

Let's start with the default analyzer. The test uses the _analyze API, one of the Elasticsearch APIs.

localhost:9200/content-index-1654676543411/_analyze

(request body)
{
    "analyzer": "default",
    "text": "이것은 테스트하고 있는 컨텐츠",
    "explain": false
}

(result)

{
    "tokens": [
        {
            "token": "이것은",
            "start_offset": 0,
            "end_offset": 3,
            "type": "<HANGUL>",
            "position": 0
        },
        {
            "token": "테스트하고",
            "start_offset": 4,
            "end_offset": 9,
            "type": "<HANGUL>",
            "position": 1
        },
        {
            "token": "있는",
            "start_offset": 10,
            "end_offset": 12,
            "type": "<HANGUL>",
            "position": 2
        },
        {
            "token": "컨텐츠",
            "start_offset": 13,
            "end_offset": 16,
            "type": "<HANGUL>",
            "position": 3
        }
    ]
}

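As the result shows, the default analyzer split this text only at the spaces (SP), so each token is a whole space-delimited word with its particles and endings still attached.

Now let's see how the Korean analyzer nori behaves.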
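The same check can be scripted. The Python sketch below is not from the original post: the helper names (build_analyze_request, token_strings, analyze) are made up for illustration, and analyze assumes an unauthenticated cluster reachable at the given base URL. The offline part reuses the response shown above, trimmed to the token field, and compares it with a plain whitespace split.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, explain=False):
    """Return the JSON body for the _analyze API."""
    return {"analyzer": analyzer, "text": text, "explain": explain}


def token_strings(response):
    """Pull just the token strings out of a non-explain _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(base_url, index, body):
    """POST the body to <base_url>/<index>/_analyze (needs a running cluster)."""
    req = urllib.request.Request(
        f"{base_url}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Offline check: the default-analyzer response from above, trimmed to "token".
sample_response = {
    "tokens": [
        {"token": "이것은"},
        {"token": "테스트하고"},
        {"token": "있는"},
        {"token": "컨텐츠"},
    ]
}

default_tokens = token_strings(sample_response)
whitespace_tokens = "이것은 테스트하고 있는 컨텐츠".split()
print(default_tokens == whitespace_tokens)
```

Against a live cluster, analyze("http://localhost:9200", "content-index-1654676543411", build_analyze_request("default", text)) would be the corresponding call; it is not executed in the sketch.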

localhost:9200/content-index-1654676543411/_analyze
(request body)
{
    "analyzer": "nori",
    "text": "이것은 테스트하고 있는 컨텐츠",
    "explain": false
}

(result)
{
    "tokens": [
        {
            "token": "이것",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "이거",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "은",
            "start_offset": 2,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "테스트",
            "start_offset": 4,
            "end_offset": 7,
            "type": "word",
            "position": 2
        },
        {
            "token": "하",
            "start_offset": 7,
            "end_offset": 8,
            "type": "word",
            "position": 3
        },
        {
            "token": "고",
            "start_offset": 8,
            "end_offset": 9,
            "type": "word",
            "position": 4
        },
        {
            "token": "있",
            "start_offset": 10,
            "end_offset": 11,
            "type": "word",
            "position": 5
        },
        {
            "token": "는",
            "start_offset": 11,
            "end_offset": 12,
            "type": "word",
            "position": 6
        },
        {
            "token": "컨텐츠",
            "start_offset": 13,
            "end_offset": 16,
            "type": "word",
            "position": 7
        }
    ]
}
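To see how the two results line up, the character offsets can be used to group each nori token under the default-analyzer token that covers it. This is only an illustrative sketch; the function name group_by_source is hypothetical, and the tuples are copied by hand from the two responses above as (token, start_offset, end_offset).

```python
# (token, start_offset, end_offset) copied from the default-analyzer result.
default_tokens = [
    ("이것은", 0, 3),
    ("테스트하고", 4, 9),
    ("있는", 10, 12),
    ("컨텐츠", 13, 16),
]

# (token, start_offset, end_offset) copied from the nori result.
nori_tokens = [
    ("이것", 0, 2), ("이거", 0, 2), ("은", 2, 3),
    ("테스트", 4, 7), ("하", 7, 8), ("고", 8, 9),
    ("있", 10, 11), ("는", 11, 12),
    ("컨텐츠", 13, 16),
]


def group_by_source(coarse, fine):
    """Map each coarse token to the fine tokens whose offsets fall inside it."""
    groups = {}
    for word, c_start, c_end in coarse:
        groups[word] = [tok for tok, s, e in fine if c_start <= s and e <= c_end]
    return groups


groups = group_by_source(default_tokens, nori_tokens)
for word, parts in groups.items():
    print(word, "->", parts)
```

Each space-delimited word is broken into a stem plus its particles or endings; only 컨텐츠 stays the same in both results.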

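The text is now split into units that are more meaningful in Korean. But which of these tokens do we actually need?

With the search data indexed through this nori analyzer, a search for a token such as "하" or "고" also returns results.

In the search results, a query for just "하" matches every piece of content that contains 하 as a token.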
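The reason is that every token the analyzer emits becomes a searchable term. The toy inverted index below is a simplified model written for this explanation, not how Elasticsearch is implemented internally. The token list for doc-1 is the nori output shown above; doc-2 and its tokens are hand-written assumptions, not real analyzer output.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


docs = {
    # nori tokens for "이것은 테스트하고 있는 컨텐츠" (from the result above)
    "doc-1": ["이것", "이거", "은", "테스트", "하", "고", "있", "는", "컨텐츠"],
    # hypothetical second document, tokens written by hand for illustration
    "doc-2": ["검색", "하", "는", "방법"],
}

index = build_inverted_index(docs)

hits_for_ha = sorted(index["하"])        # both documents contain the token 하
hits_for_test = sorted(index["테스트"])  # only doc-1 contains 테스트
print(hits_for_ha, hits_for_test)
```

Because 하 is stored as a term for both documents, a query for 하 matches both, even though it carries no meaning of its own.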

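That may occasionally be what you want, but in most cases it is not. Usually we want only meaningful words to be searchable.

So if we want to improve search quality by keeping meaningless parts of speech such as particles out of the token stream, what should we do?

First, let's check which parts of speech were tokenized in the phrase we just tested.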

Set the explain option of the previous test to true.

(request body)
{
    "analyzer": "nori",
    "text": "이것은 테스트하고 있는 컨텐츠",
    "explain": true
}

(result)
{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "nori_mixed",
            "tokens": [
                {
                    "token": "이것",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "bytes": "[ec 9d b4 ea b2 83]",
                    "leftPOS": "NP(Pronoun)",
                    "morphemes": "이거/NP(Pronoun)",
                    "posType": "INFLECT",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "NP(Pronoun)",
                    "termFrequency": 1
                },
                {
                    "token": "이거",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "bytes": "[ec 9d b4 ea b1 b0]",
                    "leftPOS": "NP(Pronoun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "NP(Pronoun)",
                    "termFrequency": 1
                },
                {
                    "token": "은",
                    "start_offset": 2,
                    "end_offset": 3,
                    "type": "word",
                    "position": 1,
                    "bytes": "[ec 9d 80]",
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "J(Ending Particle)",
                    "termFrequency": 1
                },
                {
                    "token": "테스트",
                    "start_offset": 4,
                    "end_offset": 7,
                    "type": "word",
                    "position": 2,
                    "bytes": "[ed 85 8c ec 8a a4 ed 8a b8]",
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "NNG(General Noun)",
                    "termFrequency": 1
                },
                {
                    "token": "하",
                    "start_offset": 7,
                    "end_offset": 8,
                    "type": "word",
                    "position": 3,
                    "bytes": "[ed 95 98]",
                    "leftPOS": "XSV(Verb Suffix)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "XSV(Verb Suffix)",
                    "termFrequency": 1
                },
                {
                    "token": "고",
                    "start_offset": 8,
                    "end_offset": 9,
                    "type": "word",
                    "position": 4,
                    "bytes": "[ea b3 a0]",
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "E(Verbal endings)",
                    "termFrequency": 1
                },
                {
                    "token": "있",
                    "start_offset": 10,
                    "end_offset": 11,
                    "type": "word",
                    "position": 5,
                    "bytes": "[ec 9e 88]",
                    "leftPOS": "VX(Auxiliary Verb or Adjective)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "VX(Auxiliary Verb or Adjective)",
                    "termFrequency": 1
                },
                {
                    "token": "는",
                    "start_offset": 11,
                    "end_offset": 12,
                    "type": "word",
                    "position": 6,
                    "bytes": "[eb 8a 94]",
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "E(Verbal endings)",
                    "termFrequency": 1
                },
                {
                    "token": "컨텐츠",
                    "start_offset": 13,
                    "end_offset": 16,
                    "type": "word",
                    "position": 7,
                    "bytes": "[ec bb a8 ed 85 90 ec b8 a0]",
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "NNG(General Noun)",
                    "termFrequency": 1
                }
            ]
        },
        "tokenfilters": []
    }
}

The leftPOS and rightPOS fields give the part of speech of each token. The meaning of each tag is listed below.

Tag	English name	Korean name	Example / note
E	Verbal endings	어미	사랑/하/다(E)
IC	Interjection	감탄사	와우(IC), 맙소사(IC)
J	Ending Particle	조사	나/는(J)/너/에게(J)
MAG	General Adverb	일반 부사	빨리(MAG)/달리다, 과연(MAG)/범인/은/누구/인가
MAJ	Conjunctive adverb	접속 부사	그런데(MAJ), 그러나(MAJ)
MM (*)	ES: Modifier (한정사); Lucene API: Determiner (관형사)	descriptions differ	맨(MM)/밥
NA	Unknown	unknown
NNB	Dependent noun (following nouns)	의존명사
NNBC	Dependent noun	의존명사 (noun denoting a unit)
NNG	General Noun	일반 명사	강아지(NNG)
NNP	Proper Noun	고유 명사	비숑(NNP)
NP	Pronoun	대명사	그것(NP), 이거(NP)
NR	Numeral	수사	하나(NR)/밖에, 칠(NR)/더하기/삼(NR)
SC (*)	Separator (· / :)	구분자	nori_tokenizer removes special characters
SE (*)	Ellipsis	줄임표 (...)	nori_tokenizer removes special characters
SF (*)	Terminal punctuation (? ! .)	물음표, 느낌표, 마침표	nori_tokenizer removes special characters
SH	Chinese character	한자	中國(SH)
SL	Foreign language	외국어	hello(SL)
SN	Number	숫자	1(SN)
SP	Space	공백
SSC (*)	Closing brackets	닫는 괄호 ),]	nori_tokenizer removes special characters
SSO (*)	Opening brackets	여는 괄호 (,[	nori_tokenizer removes special characters
SY	Other symbol	심벌
UNA	Unknown	unknown
UNKNOWN	Unknown	unknown
VA	Adjective	형용사	하얀(VA)/눈
VCN	Negative designator	부정 지정사 (서술격조사)	사람/이/아니(VCN)/다
VCP	Positive designator	긍정 지정사 (서술격조사)	사람/이(VCP)/다
VSV	Unknown	unknown
VV	Verb	동사	움직이(VV)/다, 먹(VV)/다
VX	Auxiliary Verb or Adjective	보조 용언	가지/고/싶(VX)/다, 먹/어/보(VX)/다
XPN (*)	Prefix	접두사 (체언 접두사?)	no matching word found in ES
XR (*)	Root	어근	no matching word found in ES
XSA	Adjective Suffix	형용사 파생 접미사	멋/스럽(XSA)/다
XSN (*)	Noun Suffix	명사 파생 접미사	no matching word found in ES
XSV	Verb Suffix	동사 파생 접미사	테스트/하(XSV)/고 (as in the explain output above)

Here, let's remove the tokens that look meaningless: "은", "하", "고", "있", and "는".
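Which tags to exclude can be read straight off the explain output. The sketch below is an illustration only; pos_tag is a hypothetical helper, and the token list is the tokenizer section of the explain response above, trimmed to the token and leftPOS fields.

```python
def pos_tag(label):
    """Strip the description: 'NNG(General Noun)' -> 'NNG'."""
    return label.split("(", 1)[0]


# Tokenizer output from the explain response, trimmed to two fields.
explain_tokens = [
    {"token": "이것", "leftPOS": "NP(Pronoun)"},
    {"token": "이거", "leftPOS": "NP(Pronoun)"},
    {"token": "은", "leftPOS": "J(Ending Particle)"},
    {"token": "테스트", "leftPOS": "NNG(General Noun)"},
    {"token": "하", "leftPOS": "XSV(Verb Suffix)"},
    {"token": "고", "leftPOS": "E(Verbal endings)"},
    {"token": "있", "leftPOS": "VX(Auxiliary Verb or Adjective)"},
    {"token": "는", "leftPOS": "E(Verbal endings)"},
    {"token": "컨텐츠", "leftPOS": "NNG(General Noun)"},
]

# Tokens we decided we do not want to be searchable.
unwanted = {"은", "하", "고", "있", "는"}

# Collect the distinct tags carried by the unwanted tokens.
stoptags = sorted(
    {pos_tag(t["leftPOS"]) for t in explain_tokens if t["token"] in unwanted}
)
print(stoptags)  # ['E', 'J', 'VX', 'XSV']
```

Note that a tag removes every token with that part of speech, not just these five surface forms, so it is worth checking a few more sample sentences before settling on the list.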

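The index template created earlier was modified as follows.

A filter with stoptags (the parts of speech listed there are excluded) was attached to the analyzer so that only the desired parts of speech come out.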

{
    "order": 0,
    "version": 1,
    "index_patterns": [
        "content-index-*"
    ],
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "nori": {
                        "type": "custom",
                        "tokenizer": "nori_mixed",
                        "filter": [
                            "my_posfilter"
                        ]
                    }
                },
                "tokenizer": {
                    "nori_mixed": {
                        "type": "nori_tokenizer",
                        "decompound_mode": "mixed"
                    }
                },
                "filter": {
                    "my_posfilter": {
                        "type": "nori_part_of_speech",
                        "stoptags": [
                            "J",
                            "XSV",
                            "E",
                            "VX"
                        ]
                    }
                }
            },
            "number_of_shards": "1",
            "number_of_replicas": "0"
        }
    },
    "mappings": {
        "properties": {
            "contentText": {
                "analyzer": "nori",
                "type": "text"
            },
            "title": {
                "analyzer": "nori",
                "type": "text"
            }
        }
    },
    "aliases": {}
}
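What the my_posfilter filter is expected to do with the sample sentence can be approximated in a few lines. This is a simplified stand-in, not the plugin's code: it assumes the decision is made on the leftPOS tag of each token, and apply_stoptags is a hypothetical name. The token list is the tokenizer output shown earlier, trimmed to three fields.

```python
STOPTAGS = {"J", "XSV", "E", "VX"}  # same list as in the template above


def apply_stoptags(tokens, stoptags):
    """Keep only tokens whose leftPOS tag is not in stoptags."""
    kept = []
    for t in tokens:
        tag = t["leftPOS"].split("(", 1)[0]  # 'J(Ending Particle)' -> 'J'
        if tag not in stoptags:
            kept.append(t)
    return kept


tokens = [
    {"token": "이것", "position": 0, "leftPOS": "NP(Pronoun)"},
    {"token": "이거", "position": 0, "leftPOS": "NP(Pronoun)"},
    {"token": "은", "position": 1, "leftPOS": "J(Ending Particle)"},
    {"token": "테스트", "position": 2, "leftPOS": "NNG(General Noun)"},
    {"token": "하", "position": 3, "leftPOS": "XSV(Verb Suffix)"},
    {"token": "고", "position": 4, "leftPOS": "E(Verbal endings)"},
    {"token": "있", "position": 5, "leftPOS": "VX(Auxiliary Verb or Adjective)"},
    {"token": "는", "position": 6, "leftPOS": "E(Verbal endings)"},
    {"token": "컨텐츠", "position": 7, "leftPOS": "NNG(General Noun)"},
]

kept = [(t["token"], t["position"]) for t in apply_stoptags(tokens, STOPTAGS)]
print(kept)  # [('이것', 0), ('이거', 0), ('테스트', 2), ('컨텐츠', 7)]
```

The surviving tokens keep their original positions, so gaps remain where tokens were dropped.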

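A template is applied only when an index is created, so after modifying the template the search data has to be indexed again into a newly created index. Once that is done, let's see how the string tested earlier is tokenized.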
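For the re-indexing step, the new index only picks up the modified settings if its name matches the template's index_patterns. The sketch below is an assumption-heavy illustration: it guesses that the numeric suffix in content-index-1654676543411 is an epoch-millisecond timestamp, the helper names are hypothetical, and Python's fnmatch is used as a rough stand-in for Elasticsearch's own wildcard matching.

```python
import fnmatch
import time

INDEX_PATTERNS = ["content-index-*"]  # from the template above


def new_index_name(prefix="content-index-", now_ms=None):
    """Build an index name with an epoch-millisecond suffix (assumed convention)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return f"{prefix}{now_ms}"


def template_applies(index_name, patterns=INDEX_PATTERNS):
    """Rough check that the name matches at least one template pattern."""
    return any(fnmatch.fnmatchcase(index_name, p) for p in patterns)


name = new_index_name()
print(name, template_applies(name))
```

An index whose name does not match the pattern would be created with default settings and would not get the nori analyzer or the part-of-speech filter.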

(request body)
{
    "analyzer": "nori",
    "text": "이것은 테스트하고 있는 컨텐츠",
    "explain": true
}

(result)
{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "nori_mixed",
            "tokens": [
                {
                    "token": "이것",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "bytes": "[ec 9d b4 ea b2 83]",
                    "leftPOS": "NP(Pronoun)",
                    "morphemes": "이거/NP(Pronoun)",
                    "posType": "INFLECT",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "NP(Pronoun)",
                    "termFrequency": 1
                },
                {
                    "token": "이거",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "bytes": "[ec 9d b4 ea b1 b0]",
                    "leftPOS": "NP(Pronoun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "NP(Pronoun)",
                    "termFrequency": 1
                },
                {
                    "token": "은",
                    "start_offset": 2,
                    "end_offset": 3,
                    "type": "word",
                    "position": 1,
                    "bytes": "[ec 9d 80]",
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "J(Ending Particle)",
                    "termFrequency": 1
                },
                {
                    "token": "테스트",
                    "start_offset": 4,
                    "end_offset": 7,
                    "type": "word",
                    "position": 2,
                    "bytes": "[ed 85 8c ec 8a a4 ed 8a b8]",
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "NNG(General Noun)",
                    "termFrequency": 1
                },
                {
                    "token": "하",
                    "start_offset": 7,
                    "end_offset": 8,
                    "type": "word",
                    "position": 3,
                    "bytes": "[ed 95 98]",
                    "leftPOS": "XSV(Verb Suffix)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "XSV(Verb Suffix)",
                    "termFrequency": 1
                },
                {
                    "token": "고",
                    "start_offset": 8,
                    "end_offset": 9,
                    "type": "word",
                    "position": 4,
                    "bytes": "[ea b3 a0]",
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "E(Verbal endings)",
                    "termFrequency": 1
                },
                {
                    "token": "있",
                    "start_offset": 10,
                    "end_offset": 11,
                    "type": "word",
                    "position": 5,
                    "bytes": "[ec 9e 88]",
                    "leftPOS": "VX(Auxiliary Verb or Adjective)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "VX(Auxiliary Verb or Adjective)",
                    "termFrequency": 1
                },
                {
                    "token": "는",
                    "start_offset": 11,
                    "end_offset": 12,
                    "type": "word",
                    "position": 6,
                    "bytes": "[eb 8a 94]",
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "E(Verbal endings)",
                    "termFrequency": 1
                },
                {
                    "token": "컨텐츠",
                    "start_offset": 13,
                    "end_offset": 16,
                    "type": "word",
                    "position": 7,
                    "bytes": "[ec bb a8 ed 85 90 ec b8 a0]",
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "positionLength": 1,
                    "reading": null,
                    "rightPOS": "NNG(General Noun)",
                    "termFrequency": 1
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "my_posfilter",
                "tokens": [
                    {
                        "token": "이것",
                        "start_offset": 0,
                        "end_offset": 2,
                        "type": "word",
                        "position": 0,
                        "bytes": "[ec 9d b4 ea b2 83]",
                        "leftPOS": "NP(Pronoun)",
                        "morphemes": "이거/NP(Pronoun)",
                        "posType": "INFLECT",
                        "positionLength": 1,
                        "reading": null,
                        "rightPOS": "NP(Pronoun)",
                        "termFrequency": 1
                    },
                    {
                        "token": "이거",
                        "start_offset": 0,
                        "end_offset": 2,
                        "type": "word",
                        "position": 0,
                        "bytes": "[ec 9d b4 ea b1 b0]",
                        "leftPOS": "NP(Pronoun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "positionLength": 1,
                        "reading": null,
                        "rightPOS": "NP(Pronoun)",
                        "termFrequency": 1
                    },
                    {
                        "token": "테스트",
                        "start_offset": 4,
                        "end_offset": 7,
                        "type": "word",
                        "position": 2,
                        "bytes": "[ed 85 8c ec 8a a4 ed 8a b8]",
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "positionLength": 1,
                        "reading": null,
                        "rightPOS": "NNG(General Noun)",
                        "termFrequency": 1
                    },
                    {
                        "token": "컨텐츠",
                        "start_offset": 13,
                        "end_offset": 16,
                        "type": "word",
                        "position": 7,
                        "bytes": "[ec bb a8 ed 85 90 ec b8 a0]",
                        "leftPOS": "NNG(General Noun)",
                        "morphemes": null,
                        "posType": "MORPHEME",
                        "positionLength": 1,
                        "reading": null,
                        "rightPOS": "NNG(General Noun)",
                        "termFrequency": 1
                    }
                ]
            }
        ]
    }
}

The result shows both the tokenizer output before any filter is applied and the token list after filtering.

The parts of speech we wanted to drop are gone.
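Rather than comparing the two token lists by eye, the difference can be computed from the explain response. This is a sketch with a hypothetical helper name, removed_by_filters; the sample response is the one above, trimmed to the token and start_offset fields.

```python
def removed_by_filters(explain_response):
    """List tokens present after the tokenizer but absent after the last filter."""
    detail = explain_response["detail"]
    before = detail["tokenizer"]["tokens"]
    filters = detail["tokenfilters"]
    after = filters[-1]["tokens"] if filters else before
    kept = {(t["token"], t["start_offset"]) for t in after}
    return [t["token"] for t in before if (t["token"], t["start_offset"]) not in kept]


def tok(token, start):
    """Build a trimmed token entry."""
    return {"token": token, "start_offset": start}


sample = {
    "detail": {
        "tokenizer": {
            "name": "nori_mixed",
            "tokens": [
                tok("이것", 0), tok("이거", 0), tok("은", 2),
                tok("테스트", 4), tok("하", 7), tok("고", 8),
                tok("있", 10), tok("는", 11), tok("컨텐츠", 13),
            ],
        },
        "tokenfilters": [
            {
                "name": "my_posfilter",
                "tokens": [
                    tok("이것", 0), tok("이거", 0),
                    tok("테스트", 4), tok("컨텐츠", 13),
                ],
            }
        ],
    }
}

removed = removed_by_filters(sample)
print(removed)  # ['은', '하', '고', '있', '는']
```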

Now let's run an actual query through the Elasticsearch search API and check what searching the indexed data for "하" returns.

No results are returned. Searching for "테스트" instead returns the expected results.
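The post does not show the query that was used, so the following is only one plausible form: a match query on the contentText field from the template mappings, built in Python. The helper names and both sample responses are hypothetical; the responses merely mimic the general shape of a _search reply to show how the hit IDs would be read.

```python
def match_query(field, text):
    """Build a request body for the _search API using a match query."""
    return {"query": {"match": {field: text}}}


def hit_ids(search_response):
    """Return the document IDs listed in a _search response."""
    return [h["_id"] for h in search_response["hits"]["hits"]]


query_for_ha = match_query("contentText", "하")
query_for_test = match_query("contentText", "테스트")

# Hand-written responses that mirror the outcome described above.
empty_response = {"hits": {"hits": []}}
one_hit_response = {"hits": {"hits": [{"_id": "example-id"}]}}

print(query_for_ha)
print(hit_ids(empty_response), hit_ids(one_hit_response))
```

The bodies would be sent to localhost:9200/content-index-*/_search (or to a specific index) in the same way as the _analyze requests earlier.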

That is how adjusting the part-of-speech filter can be used to improve search quality.

개발하는 데브라쿤의 개발일기
