DP-63 — Multilingual labels & CIDOC @lang convention across the EM stack

Core Language v1.6 StratiGraph · s3Dgraphy · EMtools · s3D config (rules)

Description

EM has been multi-disciplinary from the start (archaeology, philology, conservation, 3D modelling) and is increasingly multi-country (Italy, Spain, Germany, Romania, Greece, Arabic-speaking sites, and contributors from the UK and Northern Europe). The data has always been multilingual; the formalism, until DP-63, has not. The TBox of em.ttl (the OWL ontology declared inside the em: namespace) already uses RDF 1.1 language tags consistently: every rdfs:label carries @en. The ABox (the instance data the RDF exporter writes out), however, emits literals without any tag, and the s3dgraphy datamodel has no notion of “this text is in language X”. Two consequences follow: a triplestore can’t answer FILTER(lang(?label) = "it") on EM data, and a node imported from a non-English site loses the linguistic provenance of its text the moment it crosses into the property graph.

DP-63 closes that gap end-to-end. The handle is the RDF 1.1 language-tag convention — "Tempio Grande"@it is a distinct literal from "Great Temple"@en, queryable in SPARQL, idiomatic for CIDOC-CRM consumers, native to every triplestore. The work splits across three layers (TBox, ABox, datamodel) plus the import / UI plumbing that ties the source language into each new node.
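
To make the convention concrete, here is a minimal stdlib-only sketch of why the tag matters. `TaggedLiteral` and `labels_in` are hypothetical stand-ins invented for illustration (they are not part of s3dgraphy or of any RDF library); the filter is only a rough analogue of what `FILTER(lang(?label) = "it")` does in a triplestore.

```python
from typing import List, NamedTuple, Optional


class TaggedLiteral(NamedTuple):
    """Stand-in for an RDF literal: lexical form plus optional language tag."""
    text: str
    lang: Optional[str] = None


def labels_in(labels: List[TaggedLiteral], lang: str) -> List[str]:
    """Rough analogue of SPARQL FILTER(lang(?label) = "<lang>")."""
    return [lit.text for lit in labels if lit.lang == lang]


labels = [
    TaggedLiteral("Tempio Grande", "it"),
    TaggedLiteral("Great Temple", "en"),
    TaggedLiteral("Tempio Grande"),  # untagged, as the ABox emits today
]

# Same characters, different tag: the two literals are distinct values.
assert TaggedLiteral("Tempio Grande", "it") != TaggedLiteral("Tempio Grande")

print(labels_in(labels, "it"))  # ['Tempio Grande'] (the untagged copy is not matched)
```

The untagged copy is invisible to the language filter, which is the situation EM data is in before DP-63.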

TBox — curated translations of em.ttl. The existing 50 rdfs:labels become 50 × N as we add languages alongside English. Italian first (the EM home language), then probably Spanish and German given the active 3DSC + StratiGraph user base. The work is curated, not architectural: a translator goes through JSON_config/em.ttl and writes the @it companion next to every @en. SKOS gets layered on top for the E55_Type qualia (skos:prefLabel + skos:altLabel per language) so the v4.0 expanded qualia vocabulary (height, color, aesthetic_value, …) becomes properly multilingual. The Subjectivity Project (DP-08) and the Vocabulary Project (DP-09) are the natural homes for the curated lexicographic work — DP-63 is the enabling shape, not the translation effort itself, which sits with the people who do that translation well.
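
Because the TBox work is curation rather than architecture, a small coverage check may help a translator see what is left to do. The sketch below is an assumption-laden illustration: the term names are placeholders (not real em.ttl terms), and it assumes the labels have already been loaded into a plain `term -> {language: label}` mapping by some other means.

```python
from typing import Dict, List


def missing_translations(
    labels: Dict[str, Dict[str, str]], lang: str, base: str = "en"
) -> List[str]:
    """Return the terms that have a `base` label but no `lang` companion yet."""
    return sorted(
        term
        for term, by_lang in labels.items()
        if base in by_lang and lang not in by_lang
    )


# Placeholder excerpt, NOT real em.ttl content: term -> {language tag: label}
labels = {
    "em:ExampleTermA": {"en": "Example term A", "it": "Termine di esempio A"},
    "em:ExampleTermB": {"en": "Example term B"},
}

print(missing_translations(labels, "it"))  # ['em:ExampleTermB']
```

Running the same check per target language would give a simple progress measure as the 50 labels grow to 50 × N.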

ABox — language-tagged literals from rdf_exporter.py. Today exporter/rdf_exporter.py writes (line 542, 545, etc.) every text literal as Literal(text) with no lang=. The change is mechanical for the surface but architectural for the source: the exporter needs to know the language of each text it emits. Two approaches that we’d implement together: a graph-level defaultLang declared on EMGraph metadata (most projects are single-language and shouldn’t pay a per-field tax), plus an optional per-node lang override when a node carries genuinely multi-language content (a name in Italian plus a description in English, common when an Italian team publishes findings in English). Triple-store consumers can then issue language-aware SPARQL out of the box.
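
The resolution order described above (per-field override first, graph default second) could be sketched as follows. This is not the actual exporter code: `resolve_lang` and `turtle_literal` are hypothetical helpers, and the hand-rolled Turtle rendering only stands in for whatever the RDF library does when a literal is created with a language.

```python
from typing import Optional


def resolve_lang(field_lang: Optional[str], default_lang: Optional[str]) -> Optional[str]:
    """Per-field override wins; otherwise fall back to the graph default."""
    return field_lang or default_lang


def turtle_literal(text: str, lang: Optional[str] = None) -> str:
    """Render a Turtle string literal, appending @lang when one is known."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'


graph_default = "it"  # hypothetical defaultLang declared on the graph metadata

# Name inherits the graph default; description carries its own override.
name = turtle_literal("Tempio Grande", resolve_lang(None, graph_default))
description = turtle_literal("Collapse layer", resolve_lang("en", graph_default))

print(name)         # "Tempio Grande"@it
print(description)  # "Collapse layer"@en
```

With neither an override nor a default, the helper falls back to an untagged literal, which matches today's output.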

Datamodel — lang field on text properties. This is the smallest piece of code and the largest piece of design. Nodes get a lang field where text properties live (name, description, notes); the field is optional and falls back to the graph default when absent. For multi-language nodes — uncommon but real — an alternates dict allows {"it": "Tempio Grande", "en": "Great Temple"} alongside the primary text. The shape mirrors the s3dgraphy approach already in PR #22 for unita_tipo: preserve the original in the canonical attribute slot, expose the tag (or alias map) for downstream consumers that need to discriminate. The datamodel change is additive — old graphs without lang import as defaultLang and write back identically.
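
One possible shape for the additive datamodel change is sketched below. `TextProperty`, `load_text` and `dump_text` are hypothetical names chosen for illustration, and the on-disk dict layout is an assumption; the point is only that a legacy bare string loads without a tag and writes back unchanged.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class TextProperty:
    text: str
    lang: Optional[str] = None                   # None means "use the graph default"
    alternates: Optional[Dict[str, str]] = None  # unset for most nodes

    def effective_lang(self, default_lang: str) -> str:
        return self.lang or default_lang


def load_text(raw: Union[str, dict]) -> TextProperty:
    """Accept a legacy bare string or the (assumed) tagged dict form."""
    if isinstance(raw, str):
        return TextProperty(text=raw)
    return TextProperty(raw["text"], raw.get("lang"), raw.get("alternates"))


def dump_text(prop: TextProperty) -> Union[str, dict]:
    """Legacy-shaped properties write back as the same bare string."""
    if prop.lang is None and prop.alternates is None:
        return prop.text
    out: dict = {"text": prop.text}
    if prop.lang is not None:
        out["lang"] = prop.lang
    if prop.alternates is not None:
        out["alternates"] = dict(prop.alternates)
    return out


legacy = load_text("Tempio Grande")
tagged = load_text({
    "text": "Tempio Grande",
    "lang": "it",
    "alternates": {"it": "Tempio Grande", "en": "Great Temple"},
})

print(dump_text(legacy))            # Tempio Grande
print(legacy.effective_lang("it"))  # it
```

Keeping `alternates` as `None` by default (rather than an empty dict) is one way to avoid allocating anything for single-language nodes.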

Importer plumbing — language detection at the source. pyArchInit knows its UI language: the same pyarchinit_i18n_stratigraphic.UNIT_TYPE_ABBREV table that PR #22 uses to map SU / WSU / SE / … back to canonical US / USM also encodes which language the running install is using. The pyArchInit bridge stamps that language onto every text field it imports — descriptions, notes, interpretations, all the prose that today crosses the boundary as a bare string. The XLSX importer takes the language from a top-level manifest field in em_data.xlsx. The GraphML importer reads it from a graph-level metadata attribute. yEd-authored graphs default to the graph language declared on the canvas header (DP-40’s slot is the obvious place).
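
As a rough illustration of the stamping step, the sketch below wraps each prose field of an imported record with the source language while leaving coded fields alone. `stamp_language`, the field names and the record layout are all assumptions made for the example; how each importer actually discovers its source language (UI language, manifest field, graph metadata) is outside the sketch.

```python
from typing import Any, Dict, Tuple


def stamp_language(
    record: Dict[str, Any],
    source_lang: str,
    text_fields: Tuple[str, ...] = ("description", "notes", "interpretation"),
) -> Dict[str, Any]:
    """Return a copy of `record` whose prose fields carry the source language."""
    stamped = dict(record)
    for name in text_fields:
        value = record.get(name)
        if isinstance(value, str) and value:
            stamped[name] = {"text": value, "lang": source_lang}
    return stamped


# Hypothetical record from an Italian-language install.
row = {"unita_tipo": "US", "description": "Strato di crollo", "notes": ""}
print(stamp_language(row, "it"))
```

Empty strings are left untouched here on the assumption that there is nothing to tag; a real importer might prefer to drop them.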

UI plumbing — language picker in EM Tools. EM Setup gains a language selector that drives two things at once: which @xx-tagged variant is shown in the panels (when multiple are available), and which @xx is applied to new text the user types into the graph. Default from the Blender locale on first install; explicit override per-graph for projects where the author works in a different language from their OS. This is the surface the modeller actually sees; everything else is plumbing under it.
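
The two jobs of the picker (choosing the tag for new text, choosing the variant to display) might look like this in outline. Both functions are hypothetical, and the locale format (`it_IT`-style strings) is an assumption about what the host application reports, not a statement about the Blender API.

```python
from typing import Dict, Optional


def picker_language(graph_override: Optional[str], locale: str) -> str:
    """Per-graph override wins; otherwise derive a tag from a locale like 'it_IT'."""
    if graph_override:
        return graph_override
    return locale.replace("-", "_").split("_")[0].lower()


def display_text(
    text: str, lang: str, alternates: Optional[Dict[str, str]], preferred: str
) -> str:
    """Show the preferred-language variant when one exists, else the primary text."""
    if lang == preferred or not alternates:
        return text
    return alternates.get(preferred, text)


print(picker_language(None, "it_IT"))  # it
print(picker_language("en", "it_IT"))  # en
print(display_text("Tempio Grande", "it", {"en": "Great Temple"}, "en"))  # Great Temple
print(display_text("Tempio Grande", "it", {"en": "Great Temple"}, "de"))  # Tempio Grande
```

Falling back to the primary text when no variant matches keeps the panels populated even for languages nobody has translated yet.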

How DP-63 relates to PR #22 (DP-62, the precedent). PR #22 solved one half of the multilingual problem: stratigraphic codes (US, USM) that pyArchInit localises (SU/WSU/SE/MSE/UE/UEM/USZ/ΣΜ/ΤΣΜ). For codes there is a canonical form, and the rest are aliases that resolve back to it: UNITA_TIPO_CANONICAL is the alias map, the original code stays in attributes['unita_tipo'] for round-trip, and the canonical form is used only for structural decisions. DP-63 is the other half: free-form text in different languages, where there is no canonical form and all variants are equally valid. The architectural shape is the same (preserve original + tag), but the semantics differ (tag-to-discriminate vs alias-to-canonicalise). Together, the two pieces give EM proper multilingual support across both vocabulary and prose.
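
The contrast between the two halves can be shown side by side. The alias table below is a two-entry illustrative subset written for this example, not the real `UNITA_TIPO_CANONICAL`, and `canonical_code` is a hypothetical helper.

```python
# Illustrative subset only; the real alias map lives in s3dgraphy.
ALIASES = {"SU": "US", "WSU": "USM"}


def canonical_code(code: str) -> str:
    """Alias-to-canonicalise: many localised spellings resolve to one form."""
    return ALIASES.get(code, code)


# Codes: keep the original for round-trip, canonicalise only for decisions.
attributes = {"unita_tipo": "WSU"}
structural = canonical_code(attributes["unita_tipo"])
print(attributes["unita_tipo"], "->", structural)  # WSU -> USM

# Prose: tag-to-discriminate. No variant is "the" canonical one.
names = [
    {"text": "Tempio Grande", "lang": "it"},
    {"text": "Great Temple", "lang": "en"},
]
print(sorted(n["lang"] for n in names))  # ['en', 'it']
```

In both cases the original value is stored untouched; what differs is whether the extra information collapses variants (codes) or keeps them apart (text).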

Open decisions for the review call (we’ll thread these on the issue when DP-63 moves out of concept).

Single-default vs per-node-alternates on the datamodel. Recommendation: ship both. The default keeps the common single-language project ergonomic; the alternates dict makes the multi-team case expressible. The alternates dict is unset for 99% of nodes — zero per-node cost.
Untagged-text fallback policy on import. Recommendation: graph-level defaultLang is used, with a one-time warning surfaced through the existing GraphML Warnings panel. Fail-loud is too aggressive; silent-default is too quiet. The warning gives the importer a chance to declare the missing tag once and move on.
Translation provenance on em.ttl. Recommendation: each language pair declares its dcterms:contributor block in a SKOS-Concept-style sidecar, so the curated @it translations can be cited back to the translator the same way EM cites papers back to their authors. Aligns with the Author Node work in 1.5 (DP-51).
Where to source-of-truth the language packs. Recommendation: the curated translations of em.ttl ship in s3dgraphy itself (one TTL file per language under JSON_config/lang/em-it.ttl, etc.) so the language pack lives with the schema. Community-contributed packs for archaeological-school-specific lexicons can flow through DP-61’s EM Mappings Registry mechanism — a translator submits a PR with their language pack, registry mirrors with Zenodo DOI.
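
The recommended fallback policy (graph default plus a one-time warning) is small enough to sketch. `LangFallback` is a hypothetical helper; in practice the message would be routed to the existing warnings panel rather than collected in a list.

```python
from typing import List, Optional


class LangFallback:
    """Resolve missing language tags to a default, warning only once."""

    def __init__(self, default_lang: str) -> None:
        self.default_lang = default_lang
        self.warnings: List[str] = []

    def resolve(self, lang: Optional[str]) -> str:
        if lang:
            return lang
        if not self.warnings:  # first untagged text seen during this import
            self.warnings.append(
                f"Untagged text found; assuming defaultLang={self.default_lang!r}"
            )
        return self.default_lang


fallback = LangFallback("it")
resolved = [fallback.resolve(tag) for tag in (None, "en", None, None)]
print(resolved)                # ['it', 'en', 'it', 'it']
print(len(fallback.warnings))  # 1
```

Three untagged fields produce a single warning, which sits between the fail-loud and silent-default extremes.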

Targeted to EM 1.6 within the StratiGraph project umbrella. StratiGraph (the EU project pulling in Romanian, Spanish and other archaeological traditions with their own controlled vocabularies) is the natural funder + first beneficiary: the project’s deliverables involve multi-national EM authoring, which is exactly the multilingual workflow DP-63 enables. The work is incremental — TBox curation can land independently of the ABox / datamodel / UI plumbing, and the ABox change is backward-compatible with pre-DP-63 graphs. Sequencing within 1.6: TBox curation (@it first) → datamodel lang field + exporter language tags → importer plumbing → EM Tools UI picker. The whole arc fits inside the 1.6 cycle if scheduled deliberately.

Status

Concept

Target EM Version

1.6

Impacts

s3Dgraphy · EMtools · s3D config (rules)

Components

TBox: `JSON_config/em.ttl` already uses `@en` on all 50 `rdfs:label`s — add curated `@it` translations alongside, plus `skos:prefLabel` + `skos:altLabel` for E55_Type qualia (so the controlled vocabularies become properly multilingual).
ABox: `s3dgraphy.exporter.rdf_exporter` currently emits `Literal(name)` with no `lang=`. Switch every literal where a language meaningfully applies (`RDFS.label`, `DCTERMS.description`, `CRM.P3_has_note`, `CRM.P102_has_title`, `CRM.P90_has_value` when free-text) to `Literal(text, lang='xx')`.
Datamodel: add a `lang` field on the s3dgraphy node's text properties (`name`, `description`, `notes`). Two shapes are on the table: single-language nodes (`name_lang: str = 'en'`) vs per-language alternatives (`name: [(lang, text), ...]`). Likely both: a default for the common case plus an optional alternates dict for genuinely multilingual sites.
Importer plumbing: pyArchInit knows its UI language via `pyarchinit_i18n_stratigraphic.UNIT_TYPE_ABBREV` (the same source PR #22 uses for the canonical `unita_tipo` aliasing) — propagate that as the source language for every text field on import. XLSX importer needs a graph-level `lang` declaration in the manifest.
EM Tools UI: language picker in EM Setup that drives both display (which `@xx` to show in the panels) and graph-write (which `@xx` is applied to newly-authored text). Default from the user's Blender locale; explicit override per-graph.
SKOS discipline for qualia: every qualia type in the v4.0 expanded set (height, color, aesthetic_value, …) gets `skos:prefLabel`@en/@it/… plus `skos:altLabel` for the local nomenclature variations (e.g. archaeological-school-specific terms). The Subjectivity Project (DP-08) and the Vocabulary Project (DP-09) are the natural homes for the curated translation work.
Round-trip discipline: importing a `@it`-tagged literal must round-trip identically on re-export — no canonicalisation to English, no stripping of the tag. Pattern mirrors PR #22's `attributes['unita_tipo']` preservation: the original-language text is the data, the language tag is meta, and exporters consume the tag rather than rewrite the text.
Migration: an EM graph authored against pre-DP-63 s3dgraphy (text without `lang`) imports as `lang='en'` (or graph-level default) on first read; re-export becomes language-tagged. No breaking change for existing graphs; the upgrade path is opt-in via the new field.
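
The round-trip rule can be expressed as a check that parsing and re-serialising a literal gives back the same string. The sketch below handles only simple Turtle-style `"text"@lang` literals with a hand-written regular expression; it is an illustration of the invariant under that assumption, not a Turtle parser, and both helper names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Quoted lexical form (escapes kept as-is) plus an optional language tag.
_LITERAL = re.compile(r'^"((?:[^"\\]|\\.)*)"(?:@([A-Za-z]+(?:-[A-Za-z0-9]+)*))?$')


def parse_literal(source: str) -> Tuple[str, Optional[str]]:
    """Split a simple Turtle-style literal into (lexical form, language tag)."""
    match = _LITERAL.match(source)
    if match is None:
        raise ValueError(f"not a simple string literal: {source!r}")
    return match.group(1), match.group(2)


def serialize_literal(text: str, lang: Optional[str]) -> str:
    """Write the literal back without touching the text or the tag."""
    return f'"{text}"@{lang}' if lang else f'"{text}"'


samples = ['"Tempio Grande"@it', '"Great Temple"@en', '"Tempio Grande"']
for sample in samples:
    text, lang = parse_literal(sample)
    assert serialize_literal(text, lang) == sample  # no rewrite, no stripped tag

print(parse_literal('"Tempio Grande"@it'))  # ('Tempio Grande', 'it')
```

A check of this kind could run as a regression test on the importer and exporter pair, so that canonicalisation to English or tag stripping is caught early.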

Key Study

Needed

Notes

Triggered by the architectural conversation 2026-06-06 around Enzo Cocca's PR #22 (s3dgraphy fix(sync): multilingual unita_tipo codes, closes issue #21) and the broader observation that PR #22 solves the LOCALIZED CODE problem (US/SU/SE/UE/WSU/MSE/…) one specific way (preserve original + map to canonical for structural decisions), but the same architectural shape is needed for the LANGUAGE TAG problem on free text. The two are related but distinct: codes have a canonical form (US/USM) and aliases that resolve back to it; free text in different languages is equi-valid and only needs the tag to discriminate. The latter is the RDF 1.1 / CIDOC-CRM idiom.

Surface coverage: `rdfs:label` on every IRI instance, `crm:P3_has_note` / `crm:P102_has_title` / `dcterms:description` for narratives, `skos:prefLabel` + `skos:altLabel` on E55_Type qualia, `crm:E41_Appellation` for naming.

Open decisions: (1) single-default vs per-language-alternates on the datamodel: recommend BOTH (default field + optional alternates dict), so the common single-language case stays ergonomic and the multi-team excavation case is expressible; (2) when an importer sees text without a language hint, fall back to a graph-level `defaultLang` or fail-loud: recommend graph-level default with a one-time warning; (3) translation provenance: should curated `@it` translations of `em.ttl` carry `dcterms:contributor` for each translator? Worth a SKOS-Concept-style block per language pair.

Cross-refs: DP-08 (Subjectivity Project) and DP-09 (Vocabulary Project) are the natural homes for the curated SKOS work on qualia and controlled vocabularies; DP-61 (EM Mappings Registry) could host community-contributed language packs for `em.ttl` (a translator submits a PR with `@it` labels and the registry mirrors them with Zenodo DOI). DP-62 (PyArchInit canonical-edges) established the 'preserve original + canonicalise for structural decisions' pattern; DP-63 is the natural follow-up that applies the same architectural shape to free-text content with a language tag instead of a canonical alias map.