MasakhaNER: Named Entity Recognition for African Languages
Summary
The paper addresses the under-representation of African languages in natural language processing (NLP) by creating a high-quality dataset for named entity recognition (NER) in ten African languages. These languages are often overlooked in NLP research, and the study aims to make resources and tools for them more widely available. To that end, the authors bring together language speakers, dataset curators, and NLP practitioners to develop and evaluate NER datasets and models.
The researchers curated NER datasets from local news sources to ensure relevance for native speakers. They trained and evaluated multiple NER models, including CNN-BiLSTM-CRF, mBERT, and XLM-R, across the ten languages. The study also explored cross-domain and cross-lingual transfer learning, using datasets such as WikiAnn and CoNLL-2003, to assess how well the models generalize across languages and domains.
Key results indicate that pre-trained language models such as XLM-R and mBERT perform well even on languages that were absent from their pre-training data, although languages included in pre-training tend to score higher. The study found that fine-tuning language-specific models and using gazetteers can further improve NER performance. Cross-lingual transfer was most effective when the source and target languages were geographically and linguistically similar.
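To make the gazetteer idea concrete, the following is a minimal sketch of one common way to turn a gazetteer into per-token features: a greedy longest-match lookup over token n-grams. This summary does not describe how the authors integrate gazetteers, so the function `gazetteer_features`, the `max_len` parameter, the lowercasing step, and the toy entries are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Maps a lowercased token sequence to an entity type (hypothetical format).
Gazetteer = Dict[Tuple[str, ...], str]


def gazetteer_features(
    tokens: List[str], gazetteer: Gazetteer, max_len: int = 4
) -> List[str]:
    """Tag each token with a gazetteer-match feature using greedy longest match."""
    features = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so "Addis Ababa" beats "Addis".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            label = gazetteer.get(key)
            if label is not None:
                features[i] = f"B-{label}"
                for j in range(i + 1, i + n):
                    features[j] = f"I-{label}"
                i += n  # skip past the matched span
                matched = True
                break
        if not matched:
            i += 1
    return features


# Toy gazetteer, invented for this example only.
toy_gazetteer: Gazetteer = {
    ("addis", "ababa"): "LOC",
    ("lagos",): "LOC",
    ("wole", "soyinka"): "PER",
}

tokens = ["Wole", "Soyinka", "visited", "Addis", "Ababa", "and", "Lagos"]
features = gazetteer_features(tokens, toy_gazetteer)
print(list(zip(tokens, features)))
```

Features of this kind are typically fed to the tagger alongside the word representation, so the model can learn how much to trust a match rather than treating the gazetteer as ground truth.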
The study highlights the challenges of cross-domain transfer, particularly when the source domain has limited data. It also identifies specific difficulties in recognizing zero-frequency entities (those that never appear in the training data) and long entities, suggesting areas for future research. The authors release the datasets, code, and models to encourage further research on African NLP.
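An analysis of zero-frequency and long entities can be sketched as a bucketed recall count over gold entity spans. The code below is an illustrative sketch, not the authors' evaluation script: it assumes BIO tags, treats an entity as zero-frequency when its lowercased surface form never occurs as an entity in the training data, and uses an arbitrary cutoff of three tokens (`long_threshold`) for "long". The function names and toy sentences are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)
Sentence = Tuple[List[str], List[str]]  # (tokens, BIO tags)


def extract_spans(tags: List[str]) -> List[Span]:
    """Collect entity spans from BIO tags; a stray I- tag opens a new span."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.append((start, i, label))
            if tag.startswith(("B-", "I-")):
                start, label = i, tag[2:]
            else:
                start, label = None, None
    return spans


def bucket_counts(
    train: Sequence[Sentence],
    test: Sequence[Sentence],
    predictions: Sequence[List[str]],
    long_threshold: int = 3,  # assumed cutoff, not taken from the paper
) -> Dict[str, Tuple[int, int]]:
    """Return (correct, total) gold-entity counts per bucket."""
    # Surface forms of every entity seen in training.
    seen: Set[Tuple[str, ...]] = set()
    for tokens, tags in train:
        for start, end, _ in extract_spans(tags):
            seen.add(tuple(t.lower() for t in tokens[start:end]))

    counts = {"all": [0, 0], "zero_freq": [0, 0], "long": [0, 0]}
    for (tokens, gold_tags), pred_tags in zip(test, predictions):
        predicted = set(extract_spans(pred_tags))
        for span in extract_spans(gold_tags):
            start, end, _ = span
            buckets = ["all"]
            if tuple(t.lower() for t in tokens[start:end]) not in seen:
                buckets.append("zero_freq")
            if end - start >= long_threshold:
                buckets.append("long")
            for name in buckets:
                counts[name][1] += 1
                # Exact match on boundaries and type counts as correct.
                counts[name][0] += int(span in predicted)
    return {name: (hits, total) for name, (hits, total) in counts.items()}


# Toy data, invented for this example only.
train = [(["Wole", "Soyinka", "spoke"], ["B-PER", "I-PER", "O"])]
test = [(
    ["Wole", "Soyinka", "met", "Chimamanda", "Ngozi", "Adichie"],
    ["B-PER", "I-PER", "O", "B-PER", "I-PER", "I-PER"],
)]
predictions = [["B-PER", "I-PER", "O", "B-PER", "I-PER", "O"]]

result = bucket_counts(train, test, predictions)
print(result)
```

Dividing `correct` by `total` within each bucket gives a per-bucket recall, which makes it easy to see whether errors concentrate on unseen or multi-token entities rather than being spread evenly.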
The authors suggest that future work focus on increasing the number of annotated sentences per language, expanding the dataset to more African languages, and exploring the use of pre-trained word embeddings for model initialization. The findings have implications for improving NLP tools and resources for African languages, which in turn can enhance information retrieval and other applications for speakers of these languages.