Capstone Project: English to Amharic Translation using Neural Machine Translation

(Photo credit: Africa-facts)

Introduction

My name is Edosa Leta, and I am a recent computer science graduate from the African Leadership University (ALU). I am originally from Ethiopia. I have been interested in technology since I was young, and ALU gave me the chance to explore further and identify my real passion within the field. I have recently become fond of Artificial Intelligence and its potential for solving the problems we face in Africa. While exploring AI, I also learned about Machine Learning, the study of computer algorithms that improve automatically through experience and by the use of data (Wikipedia).

Motivation

In 2018, I joined Black in AI, a place for sharing ideas, fostering collaborations, and discussing initiatives to increase the presence of Black people in the field of Artificial Intelligence, where I met my current mentor and the founder of Alliance4AI. Since joining Alliance4AI as a Program Manager, I have helped budding AI startups become more organized, find capital, and strengthen their portfolios. While working with Widebot, a chatbot startup that detects varying dialects of the Arabic language, I became convinced that machine translation can be immensely transformative for disenfranchised communities. My home country of Ethiopia has over 86 individual languages and countless dialects, and I know firsthand how this vast array of languages creates barriers to entry in technology and business. I have become fascinated with the promise that machine translation and natural language processing hold for propelling the development of communities that speak low-resource languages like mine.

Capstone Project

For my final-year capstone, I worked on a Neural Machine Translation project for translating English to Amharic. I used the JoeyNMT framework, a beginner-friendly toolkit for people who are just getting started with Neural Machine Translation.

This research project is based on translating content from English to Amharic and vice versa. It aspires to help researchers from Ethiopia understand scientific research papers written in other languages with ease. Amharic uses a script that originated from the Ge’ez alphabet. It has 33 primary characters, each with seven forms, one for each consonant-vowel combination. This large set of characters makes it difficult for the model to handle the script and translate it correctly.
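A small Python sketch can make the size of the script concrete. It multiplies out the figures given above and lists the seven forms of one character. It assumes the layout of the Unicode Ethiopic block, where the seven forms of a basic consonant sit at consecutive code points; the function name seven_forms is only illustrative.

```python
import unicodedata

BASE_CHARACTERS = 33      # primary characters, as stated in the text
FORMS_PER_CHARACTER = 7   # consonant-vowel forms per primary character


def seven_forms(first_form):
    """Return the seven forms of a character, given its first form.

    Assumption: in the Unicode Ethiopic block, the seven forms of a
    basic consonant occupy consecutive code points.
    """
    start = ord(first_form)
    return [chr(start + i) for i in range(FORMS_PER_CHARACTER)]


total_symbols = BASE_CHARACTERS * FORMS_PER_CHARACTER
print(total_symbols)  # 231

forms = seven_forms("ለ")
print(" ".join(forms))  # ለ ሉ ሊ ላ ሌ ል ሎ

# Show the code point and Unicode name of each form.
for ch in forms:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

A translation model that works at the character level therefore has to tell apart a couple of hundred basic symbols on the Amharic side, compared with the 26 letters of the English alphabet.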

JoeyNMT Model

The Masakhane community developed a notebook that I used as a guide for this project. I also used a pre-trained model, since I do not have the computing power (GPU) to train a model for a long time. The notebook has several stages. The first stage imports the relevant data that was collected. After that, the data needs to be preprocessed and converted into a format that the model can use. The next step is critical: creating the system configuration, which contains the basic settings needed to train the model.
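The stages above can be sketched roughly as follows. This is not the code from the Masakhane notebook. It is a minimal illustration under my own assumptions: the sentence pairs are toy data, the file layout (one sentence per line, in files such as train.en and train.am) is a common convention for parallel corpora, and the helper names and configuration keys are hypothetical. The real keys depend on the JoeyNMT version, so check its documentation before reusing any of them.

```python
import random
import tempfile
from pathlib import Path


def clean_pairs(pairs):
    """Normalize whitespace and drop pairs where either side is empty."""
    cleaned = []
    for src, trg in pairs:
        src, trg = " ".join(src.split()), " ".join(trg.split())
        if src and trg:
            cleaned.append((src, trg))
    return cleaned


def split_pairs(pairs, dev_size, test_size, seed=42):
    """Shuffle reproducibly, then carve out dev and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return {
        "dev": shuffled[:dev_size],
        "test": shuffled[dev_size:dev_size + test_size],
        "train": shuffled[dev_size + test_size:],
    }


def write_parallel_files(splits, out_dir, src_lang="en", trg_lang="am"):
    """Write line-aligned files, e.g. train.en and train.am."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, pairs in splits.items():
        src_text = "".join(src + "\n" for src, _ in pairs)
        trg_text = "".join(trg + "\n" for _, trg in pairs)
        (out_dir / f"{name}.{src_lang}").write_text(src_text, encoding="utf-8")
        (out_dir / f"{name}.{trg_lang}").write_text(trg_text, encoding="utf-8")


def build_config(data_dir, src_lang="en", trg_lang="am"):
    """Hypothetical training settings; key names and values are illustrative."""
    data_dir = Path(data_dir)
    return {
        "name": f"{src_lang}{trg_lang}_sketch",
        "data": {
            "src": src_lang,
            "trg": trg_lang,
            "train": str(data_dir / "train"),
            "dev": str(data_dir / "dev"),
            "test": str(data_dir / "test"),
        },
        "training": {"epochs": 30, "batch_size": 64},
    }


if __name__ == "__main__":
    # Toy data only; a real corpus would contain many thousands of pairs.
    raw = [
        ("Hello", "ሰላም"),
        ("  Thank   you ", "አመሰግናለሁ"),
        ("How are you?", "እንዴት ነህ?"),
        ("No translation yet", "   "),  # dropped during cleaning
    ]
    splits = split_pairs(clean_pairs(raw), dev_size=1, test_size=1)
    with tempfile.TemporaryDirectory() as tmp:
        write_parallel_files(splits, tmp)
        print(sorted(p.name for p in Path(tmp).iterdir()))
        print(build_config(tmp)["data"]["train"])
```

Keeping the source and target files line-aligned is the important part: line n of train.en must be the translation of line n of train.am, otherwise the model learns from mismatched pairs.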

To learn more about the system or the starter notebook for this project, you can refer to the original notebook here.

Next Steps

I am going to start my Master's in Computer Science at the University of Maryland, Baltimore County this coming fall. I am going to focus on Data Science, and in particular on Natural Language Processing and its applications. I am eager to create a crowdsourcing platform where people can contribute data for academic research on Ethiopian languages. The low-resource languages we have in Africa do not have enough academic data for researchers to work with and train models. I want to keep working on this project to bridge that gap for researchers who love working with their native languages, and to help create more content in those languages.

Resources

  • Capstone Project Notebook – link
  • Original Masakhane Notebook – link
  • Video on how to use the Original Notebook – link

References

  1. Kreutzer, J., Bastings, J., & Riezler, S. (2018). Joey NMT: A Minimalist NMT Toolkit for Novices.
  2. Kreutzer, J., Bastings, J., & Riezler, S. (2019). Joey NMT: A Minimalist NMT Toolkit for Novices. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, 109–114. https://doi.org/10.18653/v1/D19-3019
  3. Chala, S. A. (n.d.). A thesis submitted to the School of Graduate Studies of Addis Ababa University in partial fulfillment of the requirements for the Degree of Master of Science in Information Science. 86.
  4. Heyi, G. T. (n.d.). A Thesis Submitted to the Department of Computer Science in Partial Fulfilment for the Degree of Master of Science in Computer Science. 103.
