An Hybrid Part of Speech Tagger for Setswana Language using a Voting Method

Authors

  • Mary Dibitso
  • Pius A. Owolawi
  • Sunday O. Ojo

Keywords:

PoS tagging, SVM, MaxEnt, Setswana, voting method

Abstract

Part-of-Speech (PoS) tagging is a corpus linguistics task that assigns the appropriate lexical category (noun, verb, adjective, adverb, etc.) to each word in a sentence. To address the challenges of PoS tagging, several Natural Language Processing (NLP) modelling techniques have been employed across diverse languages, including Conditional Random Fields (CRF), Support Vector Machines (SVM), and Decision Trees. However, creating language resources is expensive for many languages, including the indigenous languages of South Africa, which are classified as resource-scarce. Taking Setswana as an example of a language with limited resources, this study therefore explores and applies methods to make better use of existing resources and to increase tagger accuracy. It uses two Setswana PoS taggers, a Maximum Entropy (MaxEnt) tagger and an SVM tagger, which achieve accuracies of 94.4 per cent and 95.59 per cent respectively. An error analysis is carried out to identify the errors made by the taggers. The hybrid Setswana PoS tagger is then built using a voting algorithm that combines the two taggers, and it attains 97.06 per cent accuracy. The combination of taggers reduces the error rate by up to 2.01 per cent.
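
The abstract does not spell out the voting scheme, so the following is only a minimal sketch of how the outputs of two taggers might be combined token by token. It assumes that each tagger returns a (tag, confidence) pair per token and that disagreements are resolved in favour of the more confident tagger; the function names, the tag labels, and the confidence values are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One tagger's prediction for one token: (tag, confidence).
# This output format is an assumption made for illustration.
Prediction = Tuple[str, float]


def vote(token_predictions: Sequence[Prediction]) -> str:
    """Choose one tag for a token from several taggers' predictions."""
    counts = Counter(tag for tag, _ in token_predictions)
    top = max(counts.values())
    tied = {tag for tag, n in counts.items() if n == top}
    # Among the most-voted tags, keep the one backed by the highest
    # confidence. With two taggers that disagree, this simply picks
    # the more confident tagger (an assumed tie-breaking rule).
    best_tag, _ = max(
        (p for p in token_predictions if p[0] in tied),
        key=lambda p: p[1],
    )
    return best_tag


def combine(*tagger_outputs: Sequence[Prediction]) -> List[str]:
    """Combine aligned per-token predictions from several taggers."""
    if len({len(out) for out in tagger_outputs}) != 1:
        raise ValueError("all taggers must tag the same number of tokens")
    return [vote(preds) for preds in zip(*tagger_outputs)]


# Hypothetical predictions for a three-token sentence.
maxent_out = [("N", 0.90), ("V", 0.55), ("ADV", 0.80)]
svm_out = [("N", 0.95), ("N", 0.70), ("ADV", 0.60)]

combined = combine(maxent_out, svm_out)
print(combined)
```

In this sketch the taggers agree on the first and third tokens, and the second token takes the tag of the tagger with the higher confidence. A real combination could instead weight each tagger by its per-tag accuracy from the error analysis; which rule the study uses is not stated here.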

https://doi.org/10.59200/ICONIC.2022.027

Published

2022-12-31