The Status of Arabic Stemming Algorithms

This article presents a comprehensive report on the current landscape of open-source Arabic language stemming algorithms and services.

Stemming, a fundamental task in natural language processing, involves reducing Arabic words to their root forms, thereby enabling efficient information retrieval, text analysis, and machine learning applications. The study evaluates five different stemming algorithms, each developed by independent researchers and communities. The aim of this research is to compare and analyze the performance of these stemmers in terms of accuracy, computational efficiency, and language coverage. The evaluation includes a diverse set of Arabic texts, encompassing various domains and genres. The results highlight the strengths and weaknesses of each algorithm, shedding light on their applicability to different scenarios. The findings contribute to the Arabic NLP community by offering an in-depth analysis of open-source stemming tools and facilitating informed decision-making for researchers, developers, and practitioners.

Introduction

Arabic, with its rich linguistic heritage and diverse dialects, poses unique challenges for natural language processing (NLP) tasks, including stemming. Stemming plays a vital role in various NLP applications, such as information retrieval, text analysis, and machine learning. By reducing Arabic words to their root forms, stemming enables efficient indexing, search, and analysis of textual data. As the demand for Arabic language processing continues to grow, the availability of accurate and reliable open-source stemming algorithms becomes crucial.

This article aims to address the specific needs of a dictionary project by conducting a comprehensive study of the currently available open-source Arabic language stemming algorithms and services. The dictionary project, focused on the Arabic language, requires a stemmer that can accurately identify the root forms of words while accounting for the intricacies of Arabic morphology and orthography.

To fulfill this objective, we evaluate five distinct Arabic stemming algorithms. The study also considers the language coverage of each algorithm to determine its applicability to the various domains and genres typically encountered in a dictionary project.

Integration of a stemmer into a dictionary project requires ease of use, scalability, and well-documented resources. Assessing the suitability of these services is an essential aspect of our analysis, as it provides insights into their practicality and usability within the context of a dictionary project.

By presenting a detailed report on the strengths and weaknesses of each stemmer, along with an examination of available services, this article aims to assist researchers, developers, and practitioners in making informed decisions regarding the selection of a suitable Arabic language stemming algorithm for their dictionary projects. The findings of this study contribute to the growing body of knowledge in Arabic NLP and serve as a valuable resource for those working on Arabic language processing and related applications.

The algorithms and implementations studied are (Boudlal and Lakhouaja, 2010), (Eldesouki, 2017), (Chelli, 2018), and (Taghva et al., 2005). We also draw on an older report (Sawalha and Atwell, 2008), which studied the performance of other major stemmers, including (Khoja, 1999) and (Al-Shalabi et al., 2003), as well as other analyzers. Only the final results of that report are used here, to support the argument about the status of modern Arabic stemmers.

Methodology

Rather than investing time in probing each algorithm's weak points in depth, the test uses a randomly chosen set of verbs with known morphological balances (patterns). This gives a quick view of where each algorithm fails and of its overall accuracy. (github)

All the algorithms are tested as shipped by their main developers, wrapped in a generic Common Lisp harness (code on github) that gathers the elapsed running time, accuracy, and return values.
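The kind of measurement described above can be sketched in a few lines. The following is a minimal illustration in Python, not the Common Lisp harness used in the study; the function name `evaluate`, the shape of the returned report, and the demo cases are all assumptions made for this example.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


def evaluate(stem: Callable[[str], str],
             cases: Iterable[Tuple[str, str]]) -> Dict[str, object]:
    """Run one stemmer over (word, expected_root) pairs and collect metrics.

    Hypothetical helper: the name and report layout are illustrative only.
    """
    cases = list(cases)
    returned = []
    passed = 0
    start = time.perf_counter()  # wall-clock time for the whole run
    for word, expected in cases:
        result = stem(word)
        returned.append((word, result))
        if result == expected:
            passed += 1
    elapsed = time.perf_counter() - start
    return {
        "passed": passed,
        "total": len(cases),
        "accuracy": passed / len(cases) if cases else 0.0,
        "elapsed_seconds": elapsed,
        "returned": returned,
    }


if __name__ == "__main__":
    # Placeholder "stemmer" that returns the word unchanged.
    demo_cases = [("كتب", "كتب"), ("كاتب", "كتب")]
    report = evaluate(lambda w: w, demo_cases)
    print(report["passed"], "of", report["total"], "test cases passed")
```

Any stemmer that can be exposed as a function from a word to a string can be passed in as `stem`, so the same loop yields comparable pass counts and timings across implementations.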

The results are then plotted with a simple gnuplot script. The test program reports the test set for each morphological root separately.

Testing

This section presents the findings and analysis of the evaluation conducted on the algorithms and services. The results provide insights into the performance, accuracy, computational efficiency, and language coverage of each stemmer, shedding light on their suitability for integration into a dictionary project.

(Eldesouki, 2017)

The first stemmer tested was developed in 2017. It takes a simple approach: it strips articles and suffixes from a word in an attempt to normalize it to its root.

wordlen = len(token)  # not shown in the original excerpt; assumed

# Strip at most one leading definite article.
for article in Light10stemmer.larkey_defarticles:
    length = len(article)
    if (wordlen > length + 1) and (token[:length] == article):
        token = token[length:]
        break

# Strip every matching suffix while enough letters remain.
if len(token) > 2:
    wordlen = len(token)
    for suffix in Light10stemmer.larkey_suffixes:
        suflen = len(suffix)
        if (wordlen > suflen + 1) and token.endswith(suffix):
            token = token[:wordlen - suflen]
            wordlen = len(token)

This is a very intuitive approach; however, it performed the worst of the stemmers we tested, passing only 1 test case.
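To make the behaviour of affix stripping concrete, here is a self-contained sketch of a light stemmer in the same spirit. It is not the tested implementation: the function name `light_stem` is hypothetical, and the two affix lists are assumed for illustration (modelled on light10-style lists).

```python
# Affix lists assumed for illustration only.
DEFINITE_ARTICLES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(token: str) -> str:
    """Strip one leading article and any matching suffixes from a token."""
    # Remove at most one prefix, keeping at least two letters.
    for article in DEFINITE_ARTICLES:
        if len(token) > len(article) + 1 and token.startswith(article):
            token = token[len(article):]
            break
    # Remove suffixes in list order, keeping at least two letters.
    for suffix in SUFFIXES:
        if len(token) > len(suffix) + 1 and token.endswith(suffix):
            token = token[:-len(suffix)]
    return token


if __name__ == "__main__":
    for word in ["الكتاب", "المعلمون", "كاتب"]:
        print(word, "->", light_stem(word))
```

Because this approach only removes affixes, a word such as كاتب is returned unchanged rather than reduced to the root كتب. This may explain why a light stemmer scores poorly on a test set that expects roots.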

(Taghva et al., 2005)

The ISRI Arabic Stemmer, distributed with the Natural Language Toolkit (NLTK), was developed by a group of researchers at the University of Nevada. It extracts roots considerably better than most other stemmers that do not use a root dictionary, and it works by matching words against morphological patterns. It implements 25 balances. It passed 17 of the 35 test cases.

def pro_w53(self, word):
    """process length five patterns and extract length three roots"""
    if word[2] in self.pr53[0] and word[0] == "\u0627":
        word = word[1] + word[3:]
    elif word[4] == "\u0629" and word[1] == "\u0627":
        word = word[0] + word[2:4]
    # ... (further pattern branches elided) ...
    elif word[4] == "\u064a" and word[2] == "\u0627":
        word = word[:2] + word[3]
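The idea behind pattern (balance) matching can be shown with a much smaller, self-contained sketch. This is not the ISRI code: the function `extract_root` and the short list of balances are assumptions for illustration. Each balance is written with ف, ع, and ل marking the three root slots, and every other letter must match the word literally.

```python
from typing import Optional

# Hypothetical sample of balances; the tested stemmer implements more.
BALANCES = ["فاعل", "مفعول", "فعال", "فعيل", "مفعل"]
ROOT_SLOTS = "فعل"  # letters that stand for root positions in a balance


def extract_root(word: str) -> Optional[str]:
    """Return the triliteral root if the word fits a known balance."""
    for balance in BALANCES:
        if len(balance) != len(word):
            continue
        root = []
        for template_char, char in zip(balance, word):
            if template_char in ROOT_SLOTS:
                root.append(char)       # root slot: keep the word's letter
            elif template_char != char:
                break                   # fixed letter does not match
        else:
            return "".join(root)        # every position matched
    return None


if __name__ == "__main__":
    for word in ["كاتب", "مكتوب", "كتاب"]:
        print(word, "->", extract_root(word))
```

With these sample balances, كاتب, مكتوب, and كتاب all reduce to كتب, while a word that fits none of the listed balances yields `None`. A production stemmer also has to handle affixes, weak letters, and words that fit more than one balance, which this sketch ignores.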

(Chelli, 2018)

Assem’s Arabic Light Stemmer, authored by Assem Chelli, is built on the Snowball stemming framework; it implements the well-known algorithm for the Arabic script and was trained on a set of data. Its accuracy matches that of the ISRI stemmer, but it is about 200% faster. It passed 17 of the 35 test cases.

(Boudlal and Lakhouaja, 2010)

The Natural Language Processing Team at the Computer Science Laboratory developed one of the most accurate systems so far for Arabic stemming. They take a brute-force-like approach to the problem or, as they describe it in their publication:

“Our approach is based on modelling a very large set of Arabic morphological rules, and also on integrating linguistic resources, such as the root database, vocalized patterns associated with roots, and proclitic and enclitic tables”.

This leads to the main issues with the stemmer. It is far from lightweight (compared with well-known stemmers for Latin-script languages), since it depends on a large database system. In addition, (1) its search of its own word dictionary is rigid: for example, if the diacritics differ, it returns no results; and (2) it returns nothing for words that are not recorded in its database, which leaves only one remedy: recording every word in the language.
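One common way to soften the diacritic problem is to normalize both the query and the dictionary keys before lookup. The sketch below only illustrates that idea and is not part of the system discussed above: the names `strip_diacritics` and `lookup_root` and the in-memory `root_table` are hypothetical.

```python
import re
from typing import Dict, Optional

# Arabic tanween, short vowels, shadda, and sukun: U+064B to U+0652.
DIACRITICS = re.compile("[\u064b-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove vocalization marks so differently vocalized forms compare equal."""
    return DIACRITICS.sub("", text)


def lookup_root(word: str, root_table: Dict[str, str]) -> Optional[str]:
    """Look up a word ignoring diacritics.

    Assumes the keys of root_table are already unvocalized.
    """
    return root_table.get(strip_diacritics(word))


if __name__ == "__main__":
    table = {"كاتب": "كتب"}  # hypothetical one-entry dictionary
    print(lookup_root("كَاتِب", table))
```

The trade-off is that stripping diacritics can merge words that differ only in vocalization, so a dictionary that must distinguish them would need a fallback of this kind rather than a replacement for exact lookup.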

(Khoja, 1999) and (Al-Shalabi et al., 2003)

Neither (Khoja, 1999) nor (Al-Shalabi et al., 2003) was tested in this study. However, a comparative study (Sawalha and Atwell, 2008) reported the following results using the Qur’an gold standard:

Conclusion

In conclusion, this article conducted a comprehensive analysis of open-source Arabic language stemming algorithms and services. Stemming is a vital task in natural language processing, enabling efficient information retrieval, text analysis, and machine learning applications. The evaluation of five different stemming algorithms revealed their respective strengths and weaknesses.

Among the tested stemmers, the (Eldesouki, 2017) algorithm performed poorly, while the (Taghva et al., 2005) and (Chelli, 2018) stemmers showed better results, passing a significant number of test cases. However, it is important to note that even the most accurate stemmer achieved only around 60% accuracy, indicating room for improvement in the field. Further research and development are needed to enhance the accuracy and effectiveness of Arabic stemming algorithms.

References

Al-Shalabi, R., Kanaan, G., Al-Serhan, H., 2003. New approach for extracting Arabic roots, in: Proceedings of the International Arab Conference on Information Technology (ACIT’2003). Egypt.

Boudlal, A., Lakhouaja, A., 2010. Alkhalil Morpho Sys1: A morphosyntactic analysis system for Arabic texts. Int. Arab Conf. Inf. Technol. 1–6.

Chelli, A., 2018. Assem’s Arabic Stemmer. https://doi.org/10.6084/m9.figshare.7295690.v1

Eldesouki, M., 2017. Arabicprocessingcog. GitHub repository.

Khoja, S., 1999. Stemming Arabic text.

Sawalha, M., Atwell, E., 2008. Comparative evaluation of Arabic language morphological analysers and stemmers. 107–110.

Taghva, K., Elkhoury, R., Coombs, J., 2005. Arabic stemming without a root dictionary. 152–157 Vol. 1. https://doi.org/10.1109/ITCC.2005.90

لّermontov