가장 흔한 단어 찾기

카테고리 없음 2021. 3. 3. 18:06

728x90

가장 흔한 단어 찾기

금지된 단어를 제외한 가장 흔하게 등장하는 단어를 출력하라.

대소문자는 구별하지 않으며 구두점(, . ...)등은 무시한다.

개인적으로 너무 불친절한 문제가 아닐까 한다.

추가로 주어지는 Input 값과 banned 관련해서 테스트케이스가 계속해서 오류가 남을 뒤늦게 확인 할 수 있었다.

Input: 
paragraph = "Bob hit a ball, the hit BALL flew far after it was hit."
banned = ["hit"]
Output: "ball"

Explanation: 
"hit" occurs 3 times, but it is a banned word.
"ball" occurs twice (and no other word does), so it is the most frequent non-banned word in the paragraph. 
Note that words in the paragraph are not case sensitive,
that punctuation is ignored (even if adjacent to words, such as "ball,"), 
and that "hit" isn't the answer even though it occurs more because it is banned.

Note

1 <= paragraph.length <= 1000.
0 <= banned.length <= 100.
1 <= banned[i].length <= 10.

대답은 고유하며 소문자로 작성됩니다 (단락에 대문자 기호가있을 수 있고 고유 명사 일지라도).
단락은 문자, 공백 또는 구두점 기호!? ',;로만 구성됩니다.
하이픈이나 하이픈이있는 단어가 없습니다.
단어는 문자로만 구성되며 아포스트로피 나 기타 구두점 기호는 사용할 수 없습니다.

키포인트

대소문자 , 쉼표 등의 구두점이 존재한다. 따라서 데이터 클랜징이라 부르는 입력값에 대한 전처리 작업이 필요하다.

이 전처리를 lower 와 split으로 처음 작성하였으나 정규표현식을 사용하여 처리를 해야만 모든 테스트 케이스를 통과할 수 있었다.

아래와 같은 테이스 케이스에서 굉장히 애를 먹었으며, 정규표현식을 활용하여 전처리 해주는 방법이 베스트 케이스라고 생각한다.

정규 표현식을 이용해 입렵값 전처리 하는 과정에 익숙해지고

collections 모듈의 Counter 객체로 데이터의 개수를 샐 수 있다는 점으로 약 2개의 반복문을 줄일 수 있었다.

paragraph = "a, a, a, a, b,b,b,c, c"

banned = ["a"]

1번째 풀이

# paragraph의 케이스에서 통과하지 못한다.

paragraph = "Bob hit a ball, the hit BALL flew far after it was hit."
# paragraph = "a, a, a, a, b,b,b,c, c"

banned = ["hit"]
# banned = ["a"]

class Solution:
    def mostCommonWord(self, paragraph: str, banned: [str]) -> str:
        lower_paragraph = paragraph.lower().split()
        strip_paragraph = []
        dict_paragraph = {}
        for char in lower_paragraph:
            strip_paragraph.append(char.strip(",."))

        for key in strip_paragraph:
            if key not in banned:
                count = strip_paragraph.count(key)
                dict_paragraph[key] = count

        word = [key for key, value in dict_paragraph.items() if max(dict_paragraph.values()) == value]
        
        return word[0]

2번째 코드

정규식에서 \w 는 단어 문자를 뜻하며 ^은 not을 의미한다.

아래 정규식은 단어 문자가 아닌 모든 문자를 공백으로 치환하는 역할을 해준다.

테스트 모두 통과하긴 하지만 속도가 굉장히 느리다. 리스트 컴프리 헨션을 2번이나 사용했으므로 코드에 대한 가독성도 떨어진다.

class Solution:
    def mostCommonWord(self, paragraph: str, banned: [str]) -> str:
        lower_paragraph = [char for char in re.sub(r'[^\w]', ' ',paragraph).lower().split()]
        dict_paragraph = {}

        for key in lower_paragraph:
            if key not in banned:
                count = lower_paragraph.count(key)
                dict_paragraph[key] = count

        word = [key for key, value in dict_paragraph.items() if max(dict_paragraph.values()) == value]
        
        return word[0]

3번째 풀이

이전까지는 key라는 단어의 갯수를 새서 새로 dict를 만드는 방법으로 문제를 풀었다.

파이썬에서는 강력한 collections 이라는 모듈이 있다.

데이터의 개수를 셀 때 유용한 파이썬의 collections 모듈의 Counter 클래스 사용하여 간단하게 규현이 가능하다고 한다.

collections 모듈의 Counter 객체를 사용하여 단어 목록이 저장된 리스트에서 각 단어의 개수를 Counter 객체로 만들 수 있으며

Counter 객체는 내부 dict 형태로 만들어진다.(어떤 단어가 몇번 사용되었는지)

most_common(1) 메서드를 이용해 첫번째 값을 도출한다

이때 [(형태1, 형태1값)] ==> [0][0] 최종적으로 첫번째 인덱스의 키를 추출하게 된다!

import re
import collections

class Solution:
    def mostCommonWord(self, paragraph: str, banned: [str]) -> str:
        lower_paragraph = [char for char in re.sub(r'[^\w]', ' ',paragraph).lower().split() if char not in banned]
        counts = collections.Counter(lower_paragraph)
        return counts.most_common(1)[0][0]

import re
import collections

class Solution:
    def mostCommonWord(self, paragraph: str, banned: [str]) -> str:
        lower_paragraph = [char for char in re.sub(r'[^\w]', ' ',paragraph).lower().split() if char not in banned]
        
        print(lower_paragraph)
        # ['bob', 'a', 'ball', 'the', 'ball', 'flew', 'far', 'after', 'it', 'was']
        
        counts = collections.Counter(lower_paragraph)
        
        print(counts)
        # Counter({'ball': 2, 'bob': 1, 'a': 1, 'the': 1, 'flew': 1, 'far': 1, 'after': 1, 'it': 1, 'was': 1})
        
        print(counts.most_common(1))
        # [('ball', 2)]

        return counts.most_common(1)[0][0]

시간복잡도

이번 문제에 대해서는 시간 복잡도를 가늠하기 어려운 것 같다.

해결하는데 굉장히 오랜 시간이 걸린 문제였다.

2번째 풀이

3번째 풀이

728x90

저작자표시 동일조건 (새창열림)

ABOUT ME

muntari Log muntari Log

키포인트

시간복잡도

티스토리툴바