spaCy Sentence Segmentation

This post on spaCy sentence segmentation covers how to split a document into sentences and how to add or change the rules used for sentence segmentation.

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

for sent in doc.sents:
    print(sent)
This is the first sentence.
This is another sentence.
This is the last sentence.

spaCy's default segmentation splits sentences at sentence-ending punctuation such as the period; in the standard English models the boundaries are set by the dependency parser.

Note that doc.sents is a generator: the sentences are only produced when doc.sents is iterated. This means we cannot index into it directly with doc.sents[index]; the generator has to be materialized first, for example into a list.

If you need to work with the contents of doc.sents, store them in a variable first.

# list comprehension: also allows conditional filtering
doc_sents = [sent for sent in doc.sents]
doc_sents

# this works as well, but does not allow conditional filtering
doc_sents = list(doc.sents)
[This is the first sentence.,
 This is another sentence.,
 This is the last sentence.]
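
Once the generator has been materialized, the sentences can be indexed like a normal list. Each element is a Span object, so attributes such as start and end (the sentence's token offsets within the Doc) are available. A minimal illustration (the offsets in the comment assume the example text above):

doc_sents[1]                                  # This is another sentence.
print(doc_sents[1].start, doc_sents[1].end)   # token offsets within the Doc, e.g. 6 11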

Adding a Segmentation Rule

# spaCy default rule
doc3 = nlp(u'"Beware of monotony; it\'s the mother of all the deadly sins." -Edith Wharton')

for sent in doc3.sents:
    print(sent)
"Beware of monotony; it's the mother of all the deadly sins.
" -Edith Wharton

For example, suppose we want to add a rule that also splits sentences at the semicolon ";":

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ';':
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before='parser')

nlp.pipe_names
['tagger', 'set_custom_boundaries', 'parser', 'ner']

The new rule has been added. If we re-run the code below, spaCy will now treat the semicolon as a sentence boundary as well.

doc4 = nlp(u'"Beware of monotony; it\'s the mother of all the deadly sins." -Edith Wharton')

for sent in doc4.sents:
    print(sent)
"Beware of monotony;
it's the mother of all the deadly sins.
" -Edith Wharton

Changing the Segmentation Rule

By default, spaCy's segmentation rule splits sentences at periods and other sentence-final punctuation. We can replace this rule entirely.

# reload the model to reset the changes made earlier
nlp = spacy.load('en_core_web_sm')

mystr = u"This is a sentence. This is another.\n\nThis is a \nthird sentence."

# SPACY DEFAULT:
doc = nlp(mystr)

for sent in doc.sents:
    print([token.text for token in sent])
['This', 'is', 'a', 'sentence', '.']
['This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n', 'third', 'sentence', '.']
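
Under the default rules the newline characters do not create sentence boundaries; '\n\n' and '\n' are simply whitespace tokens attached to the surrounding sentences. A quick check (a minimal sketch using the same doc as above):

for token in doc:
    if token.is_space:
        print(token.i, repr(token.text))   # the whitespace tokens, e.g. '\n\n' and '\n'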

The following code replaces the segmentation rule:

from spacy.pipeline import SentenceSegmenter

def split_on_newlines(doc):
    start = 0
    seen_newline = False
    for word in doc:
        if seen_newline:
            yield doc[start:word.i]
            start = word.i
            seen_newline = False
        elif word.text.startswith('\n'): # handles multiple occurrences
            seen_newline = True
    yield doc[start:]      # handles the last group of tokens


sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_newlines)
nlp.add_pipe(sbd)

The function split_on_newlines can be given any name, as long as it is a valid Python function name. The same applies to the sbd variable that holds the SentenceSegmenter instance; spaCy identifies the pipeline component by its own name attribute, not by the variable name.

doc = nlp(mystr)
for sent in doc.sents:
    print([token.text for token in sent])
['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', '.', '\n\n']
['This', 'is', 'a', '\n']
['third', 'sentence', '.']

After the segmentation rule has been replaced, spaCy splits sentences on line breaks instead of periods.
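
A caveat worth adding (this note is mine, not from the original example): SentenceSegmenter is part of the spaCy 2.x API and is not available in spaCy 3.x. A rough equivalent there is a custom component, registered with @Language.component just like the semicolon rule sketched earlier, that marks the token after each newline as a sentence start and is added before the parser. This approach adds newline boundaries on top of the parser's own splits instead of replacing segmentation entirely, and the component name 'newline_boundaries' below is just an illustrative choice:

from spacy.language import Language

@Language.component('newline_boundaries')
def newline_boundaries(doc):
    # mark the token that follows a newline as the start of a new sentence
    for token in doc[:-1]:
        if token.text.startswith('\n'):
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('newline_boundaries', before='parser')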
