Technology · AI EPUB Vocabulary Assistant
About the AI EPUB Vocabulary Annotator
Our mission is to help Spanish learners and translators stay immersed in authentic texts. This page explains the models, datasets, and design choices that power the AI annotations, so you know exactly how each gloss ends up in your EPUB.
Annotation Pipeline at a Glance
- Tokenization & CEFR scoring: We parse chapter XHTML, normalize tokens, and score each word against CEFR-ranked frequency lists.
- Gloss selection: Words above your selected mastery threshold enter the annotation queue. Cognates and named entities are filtered automatically.
- Translation aggregation: The assistant first checks curated glossaries, then queries LibreTranslate or MyMemory (based on your configuration) to build a concise English hint.
- EPUB reconstruction: The output merges the original XHTML with ruby tags, leaving CSS and layout assets untouched.
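The steps above can be pictured with a small sketch. It is an illustration under stated assumptions, not the annotator's actual code: the names CEFR_LEXICON, GLOSSARY, needs_gloss, and annotate are hypothetical, the word lists are three-entry stand-ins, and the function works on a plain text string, whereas a real pipeline would walk the text nodes of parsed chapter XHTML.

```python
import html
import re

# Hypothetical stand-ins: a real CEFR list would hold thousands of entries.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
CEFR_LEXICON = {"casa": "A1", "aunque": "B1", "desasosiego": "C1"}
GLOSSARY = {"aunque": "although", "desasosiego": "unease"}

# Runs of Unicode letters (no digits, no underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def needs_gloss(word: str, mastery: str) -> bool:
    """True when the word's CEFR level is above the reader's mastery level."""
    level = CEFR_LEXICON.get(word.lower())
    if level is None:
        return False  # unknown words are skipped in this sketch
    return CEFR_LEVELS.index(level) > CEFR_LEVELS.index(mastery)


def annotate(text: str, mastery: str) -> str:
    """Wrap words above the mastery level in ruby markup with an English hint."""
    out = []
    pos = 0
    for match in WORD_RE.finditer(text):
        # Copy the text between words (spaces, punctuation), escaped for XHTML.
        out.append(html.escape(text[pos:match.start()], quote=False))
        word = match.group()
        gloss = GLOSSARY.get(word.lower())
        if gloss and needs_gloss(word, mastery):
            out.append(
                f"<ruby>{html.escape(word)}<rt>{html.escape(gloss)}</rt></ruby>"
            )
        else:
            out.append(html.escape(word))
        pos = match.end()
    out.append(html.escape(text[pos:], quote=False))
    return "".join(out)


if __name__ == "__main__":
    print(annotate("Aunque la casa", "A2"))
```

With a mastery level of A2, only "Aunque" (B1 in the toy list) is wrapped. Raising the level to B1 leaves the sentence unchanged.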
Design Pillars
Context-aware vocabulary detection
We blend CEFR-ranked word lists with part-of-speech tagging and heuristic filters to avoid glossing proper nouns or obvious cognates.
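As a rough illustration of what such heuristic filters might look like, the sketch below flags likely cognates by spelling similarity and likely proper nouns by capitalization. Both functions and the 0.8 threshold are assumptions made for this example. The capitalization check is a simple stand-in for the part-of-speech tagging described above, not a description of it.

```python
from difflib import SequenceMatcher


def is_probable_cognate(spanish: str, english: str, threshold: float = 0.8) -> bool:
    """Treat a pair as an obvious cognate when the spellings are very similar.

    The threshold is an assumed value for illustration only.
    """
    ratio = SequenceMatcher(None, spanish.lower(), english.lower()).ratio()
    return ratio >= threshold


def is_probable_proper_noun(token: str, sentence_initial: bool) -> bool:
    """Capitalized tokens that do not start a sentence are treated as names."""
    return token[:1].isupper() and not sentence_initial


if __name__ == "__main__":
    print(is_probable_cognate("importante", "important"))
    print(is_probable_cognate("desasosiego", "unease"))
    print(is_probable_proper_noun("Madrid", sentence_initial=False))
```

A spelling-similarity check will miss false friends and will not catch cognates whose spelling has drifted, so a filter like this would be only one signal among several.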
Transparent translation sourcing
Translations come from curated bilingual datasets, LibreTranslate, or MyMemory, with priority given to custom glossaries you upload.
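The priority order can be sketched as a fallback chain: try each source in turn and keep the first hint found. Everything named here (Lookup, from_mapping, first_hint, and the source labels) is hypothetical. The remote step is a stub that returns nothing, since the real LibreTranslate or MyMemory calls depend on your configuration.

```python
from typing import Callable, Mapping, Optional, Sequence, Tuple

# A lookup takes a word and returns an English hint, or None if it has none.
Lookup = Callable[[str], Optional[str]]


def from_mapping(entries: Mapping[str, str]) -> Lookup:
    """Build a case-insensitive lookup from an in-memory glossary."""
    return lambda word: entries.get(word.lower())


def first_hint(
    word: str, sources: Sequence[Tuple[str, Lookup]]
) -> Optional[Tuple[str, str]]:
    """Return (source name, hint) from the first source that knows the word."""
    for name, lookup in sources:
        hint = lookup(word)
        if hint:
            return name, hint
    return None


if __name__ == "__main__":
    custom = from_mapping({"aunque": "even though"})
    curated = from_mapping({"aunque": "although", "casa": "house"})
    remote_stub: Lookup = lambda word: None  # placeholder for a translation API

    sources = [("custom", custom), ("curated", curated), ("remote", remote_stub)]
    print(first_hint("Aunque", sources))
    print(first_hint("casa", sources))
```

Returning the source name alongside the hint keeps the origin of each gloss visible, which fits the goal of transparent sourcing.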
Privacy & security by default
Processing stays in memory, any temporary files are wiped immediately, and you can route translation calls through your own infrastructure.
Pedagogy-aligned output
Gloss formatting mirrors language-learning best practices, displaying short definitions and example hints without overwhelming the reader.
Roadmap for the Annotator
We are expanding beyond static glosses by experimenting with grammar hints, custom word banks, and collaborative classroom dashboards. Sign up for updates to test new features as soon as they ship.
Ready to See the Annotator in Action?
Upload a Spanish EPUB, choose your CEFR mastery level, and watch the AI pipeline generate glosses that match your study goals.