WHERE LINGUISTICS MEETS CODE

Abdessamad Nafissi, AI Language Engineer

Computational Linguist bridging linguistics and machine learning to build reliable language systems. Specializing in **LLM training data, evaluation datasets, and advanced Arabic AI engineering**.

About Me

Bridging Two Intellectual Worlds

Pioneering solutions at the intersection of human linguistics and machine computation.

I’m a Computational Linguist and AI Language Engineer who bridges linguistics and machine learning to build reliable language systems. Over 15+ years of translation, proofreading, and localization work (Arabic, French, and English), I have developed a deep intuition for how language works, from morphology, syntax, and semantics down to the details that most systems struggle with, such as Arabic diacritics and dialect variation.

Today, I focus on LLM training and evaluation data, dataset quality, and language tooling. I use Python, SQL, and regex to extract and validate high-signal data, and I build practical NLP workflows with tools like Hugging Face, PyTorch, spaCy, NLTK, and scikit-learn. My goal is simple: turn linguistic precision into measurable improvements in AI quality, especially for underrepresented and high-complexity languages like Arabic.

15+ Years Experience: Translation & Localization
3+ Core Languages: Arabic, French, English
10+ NLP Systems: Pipelines, Summarizers, MT
1 Published Volume: Manuscript Preparation
Skills Matrix

My Technical Toolkit

A customized combination of expert linguistic analysis, software engineering, and machine learning.

Linguistic Engineering

Translation & Proofreading 95%
Software Localization 90%
Morphological & Syntactic Parsing 90%
Multilingual Terminology DBs 85%

Natural Language Processing

Custom Preprocessing Pipelines 90%
Sentiment Analysis (VAD/Categorical) 85%
Transformers & MT (Seq2Seq) 80%
Named Entity Recognition (NER) 80%

Machine Learning & Code

Python (pandas, NumPy, scikit-learn) 85%
PyTorch & Neural Networks 75%
Hugging Face APIs & Fine-Tuning 80%
Azure Cloud & MLOps 70%
Timeline

My Professional Journey

A timeline of my education and of my professional work in both linguistics and engineering.

Education

2011 - 2013

Master of Translation (MIT)

Specialized in advanced multilingual translation techniques, linguistics methodologies, and legal/economic terminology preparation across English, French, and Arabic.

2008 - 2011

Bachelor of Arts (BA) in English Studies

Thorough training in structural linguistics, grammatical synthesis, semantics, translation protocols, and literature studies.

2021 - Present

Self-Directed Advanced NLP & DL Specialization

Continuous Academic Study

Rigorous self-directed curriculum in advanced calculus, probability, neural network architecture (RNNs, LSTMs, Transformers), NLP pipeline design, and machine translation tuning.

Experience

Sep 2024 - May 2026

AI Language Engineer

Meta

Focusing on LLM training and evaluation data, dataset quality curation, and custom language tooling. Using Python, SQL, and regex to build high-signal dataset mixtures, validate model outputs, and establish rigorous alignment guidelines that improve AI quality, with a primary specialization in underrepresented and high-complexity languages such as Arabic.

Jan 2022 - Mar 2022

Senior Data Linguist

Apple

Promoted to lead linguistic asset curation, localization QA, and data evaluation workflows for Siri voice features. Engineered custom NLP workflows, developed core terminology structures, and collaborated closely with machine learning engineers to turn linguistic precision into measurable Siri quality improvements.

May 2021 - Dec 2021

Data Linguist

Apple

Analyzed complex morphological, syntactic, and semantic patterns to build voice assistant systems. Designed large-scale text datasets and resolved systemic edge cases, including Arabic diacritics, dialectal variation, and cultural nuances, to train Siri NLU classifiers.

2013 - May 2021

Freelance Translator & Localization Specialist

Self-employed / Professional Translation Platforms

Provided expert localization and translation services across English, French, and Arabic. Specialized in app/web software localization and dense subject areas (ecology, economics, law, AI/IT) using CAT tools such as Trados Studio and memoQ.

My Portfolio

Featured Research & Projects

Explore the tangible outputs of my work in advanced computational linguistics and machine learning.

Research Project • Maghribi Arabizi • Lexical Mapping • NLP Decoders

Maghribi Arabizi De-Romanization into Arabic Script

This project investigates how Maghribi Arabizi, an informal Romanized form of Moroccan Arabic, can be automatically converted into Arabic script. I benchmarked three de-romanization approaches: rule-based character mapping, statistical MLE word mapping, and a neural character-level Seq2Seq model. The results show that the MLE approach performs strongest on the held-out test set, highlighting the value of data-driven lexical mappings for noisy low-resource dialect text.
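As a rough illustration of the statistical MLE word mapping mentioned above, the sketch below counts how often each Arabizi token is paired with each Arabic-script token and keeps the most frequent target. The function names, the toy pairs, and the assumption of one-to-one whitespace token alignment are hypothetical and are not taken from the project notebook.

```python
from collections import Counter, defaultdict


def train_mle_lexicon(pairs):
    """Build an Arabizi -> Arabic-script lexicon from aligned sentence pairs.

    Assumes (hypothetically) that source and target sentences contain the
    same number of whitespace tokens; pairs that do not are skipped.
    """
    counts = defaultdict(Counter)
    for arabizi, arabic in pairs:
        src, tgt = arabizi.lower().split(), arabic.split()
        if len(src) != len(tgt):
            continue
        for s, t in zip(src, tgt):
            counts[s][t] += 1
    # MLE choice: the target with the highest relative frequency for each source.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def decode(sentence, lexicon):
    """Replace each known token; leave unknown tokens unchanged."""
    return " ".join(lexicon.get(tok, tok) for tok in sentence.lower().split())


# Toy illustrative pairs, not drawn from the real dataset.
toy_pairs = [
    ("salam khoya", "سلام خويا"),
    ("salam sahbi", "سلام صاحبي"),
    ("khoya labas", "خويا لاباس"),
]
lexicon = train_mle_lexicon(toy_pairs)
```

Unknown tokens pass through untouched here, which is the gap that a rule-based or neural fallback would need to cover.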

Python API • Pipeline • Data Cleaning • Social Media NLP

Modular Social Media Text Pipeline

A highly customizable Python NLP pipeline designed to scrape, clean, and structure unstructured social media texts. Built-in elements parse custom URLs, extract base domain names, sanitize characters, and convert visual emojis into high-semantic text tokens without compromising grammatical syntax or context.
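The cleaning steps described above could look roughly like the following standard-library sketch. It is not the pipeline's actual code: the function names, the `<url:...>` token format, and the `:name:` emoji format are assumptions made for illustration, and the domain extraction is deliberately naive (a real implementation would consult a public-suffix list).

```python
import re
import unicodedata
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")


def base_domain(url):
    """Return the host without a leading 'www.' (naive domain extraction)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def emoji_to_token(ch):
    """Map a character in Unicode category 'So' to a ':name:' text token."""
    if unicodedata.category(ch) == "So":
        name = unicodedata.name(ch, "")
        if name:
            return " :" + name.lower().replace(" ", "_") + ": "
    return ch


def clean_post(text):
    # 1. Replace each URL with a token that carries its base domain.
    text = URL_RE.sub(lambda m: f"<url:{base_domain(m.group())}>", text)
    # 2. Normalize compatibility forms, then drop control characters.
    text = unicodedata.normalize("NFKC", text)
    text = "".join(
        c for c in text if unicodedata.category(c) != "Cc" or c in "\n\t"
    )
    # 3. Convert symbol characters (including most emoji) to text tokens.
    text = "".join(emoji_to_token(c) for c in text)
    # 4. Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

This sketch treats every character in category `So` as an emoji and does not handle multi-codepoint sequences (skin-tone modifiers, zero-width joiners) or punctuation trailing a URL, so it should be read as a starting point only.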

ACADEMIC PRESS
Coviability of Social & Ecological Systems
RABIAA • SOUGRI • NAFISSI
Book Translation • Scientific Terminology • Ecology

Book Translation: Ecological Coviability

Jointly provided professional manuscript preparation and technical translation for the academic volume "Coviability of Social and Ecological Systems: Reconnecting Mankind to the Biosphere". Resolved dense economic, regulatory, and ecological terminology across English, French, and Arabic.

Research Sandbox

Arabizi De-romanization

Test the hybrid MLE and RegEx fallback decoding system for Moroccan Arabizi (Darija) proposed in our forthcoming research.

Arabizi is a romanized writing system in which Latin characters and numerals are used to transcribe Arabic dialects, especially in chat messages. Because Moroccan Darija is morphologically rich, is spelled phonetically, and has little training data available, general Seq2Seq models often struggle with it.

Our research proposes a hybrid approach combining a Maximum Likelihood Estimation (MLE) model with a robust phonological RegEx Fallback decoder. On my held-out blind test split, the MLE system achieved the strongest performance among the three implemented baselines.
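One way to picture the hybrid idea is a dictionary lookup that falls back to character-level regex substitution for unseen tokens. The sketch below is only an assumption about how such a decoder might be wired together: the grapheme map follows common Arabizi conventions (for example `3` for ع and `7` for ح) but is simplified and is not the rule set used in the research, and `hybrid_decode` and `regex_fallback` are hypothetical names.

```python
import re

# Illustrative grapheme map (simplified; not the research rule set).
GRAPHEMES = {
    "kh": "خ", "gh": "غ", "ch": "ش", "sh": "ش",
    "3": "ع", "7": "ح", "9": "ق",
    "a": "ا", "b": "ب", "d": "د", "k": "ك", "l": "ل",
    "m": "م", "n": "ن", "r": "ر", "s": "س", "t": "ت",
    "i": "ي", "y": "ي", "o": "و", "u": "و", "w": "و",
}

# Longest keys first so digraphs such as "kh" win over single letters.
PATTERN = re.compile(
    "|".join(sorted(map(re.escape, GRAPHEMES), key=len, reverse=True))
)


def regex_fallback(token):
    """Transliterate character by character; unmapped characters pass through."""
    return PATTERN.sub(lambda m: GRAPHEMES[m.group()], token)


def hybrid_decode(sentence, lexicon):
    """Use the MLE lexicon when the token is known, else the regex fallback."""
    out = []
    for tok in sentence.lower().split():
        out.append(lexicon[tok] if tok in lexicon else regex_fallback(tok))
    return " ".join(out)


# Hypothetical one-entry lexicon standing in for the learned MLE dictionary.
demo_lexicon = {"salam": "سلام"}
demo_output = hybrid_decode("salam 3lik", demo_lexicon)
```

Here `salam` is resolved by the lexicon while `3lik` is not in it and goes through the character rules, so the lexicon handles frequent words and the rules keep out-of-vocabulary tokens from staying in Latin script.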

Notebook Experiment Results

Internal evaluation on a held-out blind test split from my research notebook.

Rule-Based: 7.24
MLE Prediction: 35.44
Seq2Seq NMT: 9.62

Note: Results are from an internal research experiment and are not a published benchmark.

Dataset Credit: Powered by the UBC-NLP/nilechat-arabizi-mor dataset, which is based on the original NileChat corpus.

Get in Touch

Start a Conversation

Looking to collaborate on LLM training data, custom NLP tooling, or advanced Arabic linguistics? Send me a message below.

Location

Tampa Bay Area, FL

Connect Internationally

For open-source developments, resume deep-dives, or professional network inquiries, find me on these networks.