Abdessamad Nafissi, AI Language Engineer
Computational Linguist bridging linguistics and machine learning to build reliable language systems. Specializing in **LLM training data, evaluation datasets, and advanced Arabic AI engineering**.
Pioneering solutions at the intersection of human linguistics and machine computation.
I’m a Computational Linguist and AI Language Engineer who bridges linguistics and machine learning to build reliable language systems. With 15+ years of experience in translation, proofreading, and localization (Arabic, French, and English), I developed a deep intuition for how language works, down to morphology, syntax, semantics, and the details that most systems struggle with, like Arabic diacritics and dialect variation.
Today, I focus on LLM training and evaluation data, dataset quality, and language tooling. I use Python, SQL, and regex to extract and validate high-signal data, and I build practical NLP workflows with tools like Hugging Face, PyTorch, spaCy, NLTK, and scikit-learn. My goal is simple: turn linguistic precision into measurable improvements in AI quality, especially for underrepresented and high-complexity languages like Arabic.
A customized combination of expert linguistic analysis, software engineering, and machine learning.
A timeline of my academic education and my two professional tracks.
Specialized in advanced multilingual translation techniques, linguistics methodologies, and legal/economic terminology preparation across English, French, and Arabic.
Thorough training in structural linguistics, grammatical synthesis, semantics, translation protocols, and literature studies.
Rigorous self-directed curriculum in advanced calculus, probability, neural network architecture (RNNs, LSTMs, Transformers), NLP pipeline design, and machine translation tuning.
Focusing on LLM training and evaluation data, dataset quality curation, and custom language tooling. Utilizing Python, SQL, and regex to build high-signal dataset mixtures, validate model outputs, and establish rigorous alignment guidelines to improve AI quality, with a primary specialization in underrepresented and high-complexity languages like Arabic.
Promoted to lead linguistic asset curation, localization QA, and data evaluation workflows for Siri voice features. Engineered custom NLP workflows, developed core terminology structures, and collaborated closely with machine learning engineers to turn linguistic precision into measurable Siri quality improvements.
Analyzed complex morphological, syntactic, and semantic patterns to build voice assistant systems. Designed large-scale text datasets and resolved systemic edge cases, including Arabic diacritics, dialectal variation, and cultural nuance, to train Siri NLU classifiers.
Provided expert localization and translation services across English, French, and Arabic. Specialized in app/web software localization and dense subject matters (ecology, economics, law, AI/IT) using CAT tools like Trados Studio and MemoQ.
Explore the tangible outputs of my work in advanced computational linguistics and machine learning.
This project investigates how Maghribi Arabizi, an informal Romanized form of Moroccan Arabic, can be automatically converted into Arabic script. I benchmarked three de-romanization approaches: rule-based character mapping, statistical MLE word mapping, and a neural character-level Seq2Seq model. The results show that the MLE approach performs strongest on the held-out test set, highlighting the value of data-driven lexical mappings for noisy low-resource dialect text.
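As a rough illustration of the statistical MLE word-mapping idea (a minimal sketch, not the project's actual code), each Arabizi word can be mapped to the Arabic-script form it is most often paired with in aligned training data. The function names and the toy pairs below are hypothetical and chosen only for the example.

```python
from collections import Counter, defaultdict

def train_mle_lexicon(pairs):
    """Map each Arabizi word to its most frequent Arabic-script counterpart.

    `pairs` is an iterable of (arabizi_word, arabic_word) tuples.
    Picking the most frequent target is the maximum likelihood
    estimate of argmax P(arabic | arabizi) under relative-frequency counts.
    """
    counts = defaultdict(Counter)
    for arabizi, arabic in pairs:
        counts[arabizi.lower()][arabic] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def transliterate(sentence, lexicon):
    """Replace known words; leave out-of-vocabulary tokens unchanged."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in sentence.split())

# Toy aligned pairs (hypothetical, for illustration only).
toy_pairs = [
    ("salam", "سلام"),
    ("salam", "سلام"),
    ("salam", "سلم"),
    ("bzaf", "بزاف"),
]

lexicon = train_mle_lexicon(toy_pairs)
print(transliterate("Salam bzaf khoya", lexicon))
```

In this sketch an unseen word such as "khoya" passes through untouched, which is the gap a character-level method is meant to cover.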
A highly customizable Python NLP pipeline designed to scrape, clean, and structure unstructured social media texts. Built-in elements parse custom URLs, extract base domain names, sanitize characters, and convert visual emojis into high-semantic text tokens without compromising grammatical syntax or context.
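The cleaning steps described above can be sketched roughly as follows. This is a simplified illustration using only the Python standard library, not the pipeline itself: the `<url:...>` placeholder format, the function names, and the two-entry emoji table are assumptions made for the example, and the "base domain" logic only strips a leading `www.` rather than consulting a public-suffix list.

```python
import re
import unicodedata
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")

# Hypothetical emoji-to-token table; a real pipeline would need a far larger one.
EMOJI_TOKENS = {
    "😂": ":face_with_tears_of_joy:",
    "🔥": ":fire:",
}

def base_domain(url):
    """Return the lowercased host without a leading 'www.' (naive)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def clean_post(text):
    # 1. Replace each URL with a placeholder that keeps only its domain.
    text = URL_RE.sub(lambda m: f"<url:{base_domain(m.group())}>", text)
    # 2. Turn known emojis into text tokens, padded so they stay separate words.
    for emoji, token in EMOJI_TOKENS.items():
        text = text.replace(emoji, f" {token} ")
    # 3. Drop non-whitespace control characters (Unicode category Cc).
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # 4. Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(clean_post("Check https://www.example.com/a?b=1 😂\x00  wow"))
```

Replacing URLs and emojis with tokens, rather than deleting them, keeps the word order of the surrounding sentence intact.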
Contributed to the manuscript preparation and technical translation of the academic volume "Coviability of Social and Ecological Systems: Reconnecting Mankind to the Biosphere". Resolved dense economic, regulatory, and ecological terminology across English, French, and Arabic.
Test the hybrid MLE and RegEx fallback decoding system for Moroccan Arabizi (Darija) proposed in our forthcoming research.
Arabizi is a romanized writing system in which Latin characters and numerals transcribe Arabic dialects, especially in chat messages. Moroccan Darija is morphologically rich and spelled phonetically, and training data for it is scarce, so general Seq2Seq models often struggle.
Our research proposes a hybrid approach combining a Maximum Likelihood Estimation (MLE) model with a robust phonological RegEx Fallback decoder. On my held-out blind test split, the MLE system achieved the strongest performance among the three implemented baselines.
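The hybrid idea can be sketched as a two-stage decoder: look each token up in the MLE lexicon first, and fall back to ordered regex rewrite rules only when the word is unknown. The snippet below is a toy illustration under that assumption, not the system from the research; the lexicon entries, the rule list, and the function names are hypothetical, and the handful of rules (including mapping every "a" to alif) is far cruder than a real phonological rule set.

```python
import re

# Hypothetical MLE lexicon: Arabizi word -> most likely Arabic-script form.
MLE_LEXICON = {"salam": "سلام", "bzaf": "بزاف"}

# Hypothetical fallback rules. Order matters: digraphs such as "ch"
# must be rewritten before any single-letter rule could split them.
FALLBACK_RULES = [
    (re.compile(r"ch"), "ش"),
    (re.compile(r"kh"), "خ"),
    (re.compile(r"3"), "ع"),
    (re.compile(r"7"), "ح"),
    (re.compile(r"9"), "ق"),
    (re.compile(r"l"), "ل"),
    (re.compile(r"a"), "ا"),
]

def regex_fallback(word):
    """Rewrite a word character by character using the ordered rules."""
    for pattern, replacement in FALLBACK_RULES:
        word = pattern.sub(replacement, word)
    return word

def decode(sentence):
    """Use the MLE lexicon when possible, the regex fallback otherwise."""
    decoded = []
    for token in sentence.lower().split():
        if token in MLE_LEXICON:
            decoded.append(MLE_LEXICON[token])
        else:
            decoded.append(regex_fallback(token))
    return " ".join(decoded)

print(decode("salam 3lach"))
```

Here "salam" is resolved by the lexicon, while the out-of-vocabulary "3lach" goes through the rules. Letters with no rule are left in Latin script, which makes gaps in the rule set easy to spot.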
Internal evaluation on a held-out blind test split from my research notebook.
Note: Results are from an internal research experiment and are not a published benchmark.
Dataset Credit: Powered by the UBC-NLP/nilechat-arabizi-mor dataset, which is based on the original NileChat corpus.
Looking to collaborate on LLM training data, custom NLP tooling, or advanced Arabic linguistics? Send me a message below.
Tampa Bay Area, FL