Pseudonymizer Tool

by Axelle Abbadie

Pseudonymize and correct interactional transcripts (Jefferson, ICOR, SRT, CHAT/CHA). Designed for qualitative researchers in linguistics and conversation analysis.


An Obsidian plugin for pseudonymizing and correcting interactional transcripts, aimed at researchers in linguistics and conversation analysis. It fills a gap: existing tools (Sonal, Whispurge) do not integrate into a note-taking and analysis environment, and few applications support multimodal transcription conventions (Jefferson, ICOR).

Français — README.fr.md

Workflow

noScribe (.html, .vtt)  or  raw transcript (.srt, .cha, .md, .txt)
        ↓ automatic import  (audio file imported alongside)
Obsidian — native Markdown editing  (**S00** [HH:MM:SS] : text)
        ↓ pseudonymization  (manual · NER · dictionary scan)
Annotated source file  +  correspondence table  +  word timestamps (.words.json)
        ↓ export
Pseudonymized transcript (.pseudonymized.md / .pseudonymized.vtt)
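The Markdown turn layout shown in the diagram can be read back with a small parser. This is a minimal sketch, not the plugin's own parser: the `Turn` interface, the `parseTurnLine` function and the regular expression are illustrative assumptions based only on the `**S00** [HH:MM:SS] : text` pattern above.

```typescript
// Hypothetical shape for one speaker turn; the plugin's own types may differ.
interface Turn {
  speaker: string; // speaker label, e.g. "S00"
  seconds: number; // timestamp converted to seconds from the start
  text: string;    // everything after the colon
}

// Assumed line layout, copied from the workflow diagram: **S00** [HH:MM:SS] : text
const TURN_LINE = /^\*\*([^*]+)\*\*\s*\[(\d{2}):(\d{2}):(\d{2})\]\s*:\s*(.*)$/;

function parseTurnLine(line: string): Turn | null {
  const m = TURN_LINE.exec(line.trim());
  if (m === null) return null; // not a turn line (heading, blank line, comment)
  const seconds = Number(m[2]) * 3600 + Number(m[3]) * 60 + Number(m[4]);
  return { speaker: m[1], seconds: seconds, text: m[5] };
}

const exampleTurn = parseTurnLine("**S01** [00:01:05] : bonjour {{Pierre}}");
// exampleTurn: { speaker: "S01", seconds: 65, text: "bonjour {{Pierre}}" }
```

Lines that do not match the pattern return `null`, so headings and free notes in the same file are simply skipped.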

Two approaches, freely combined:

Manual — right-click a term → choose a pseudonym → the plugin replaces it and saves the rule
Automatic (NER) — an AI model detects identifying entities and highlights them in blue in the editor → the researcher validates or rejects each candidate

Pseudonyms are wrapped in {{...}} markers, both in the source file and in exports, so that they remain visually distinct from raw data. Correspondence tables are never included in exports.
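As an illustration of the marker convention, the sketch below replaces whole-word occurrences of a source term with a `{{...}}` pseudonym. It is an assumption-laden example, not the plugin's engine: `pseudonymize` and `isWordChar` are hypothetical names, and existing `{{...}}` spans are not protected here.

```typescript
// True for letters that have distinct upper/lower case forms (so "é" counts),
// plus digits and underscore. Used instead of \b, which is ASCII-only in JavaScript.
function isWordChar(ch: string): boolean {
  if (ch === "") return false;
  return ch.toLowerCase() !== ch.toUpperCase() || /[0-9_]/.test(ch);
}

// Replace every whole-word occurrence of `source` with `{{pseudonym}}`.
function pseudonymize(text: string, source: string, pseudonym: string): string {
  if (source === "") return text;
  let out = "";
  let i = 0;
  while (i < text.length) {
    const hit = text.indexOf(source, i);
    if (hit === -1) break;
    const before = hit > 0 ? text.charAt(hit - 1) : "";
    const after = text.charAt(hit + source.length); // "" at end of text
    if (!isWordChar(before) && !isWordChar(after)) {
      out += text.slice(i, hit) + "{{" + pseudonym + "}}";
      i = hit + source.length;
    } else {
      out += text.slice(i, hit + 1); // part of a longer word: copy and move on
      i = hit + 1;
    }
  }
  return out + text.slice(i);
}

const pseudonymizedLine = pseudonymize("Marie parle à Mariette.", "Marie", "Claire");
// pseudonymizedLine: "{{Claire}} parle à Mariette."
```

Hyphens are treated as word boundaries in this sketch, so compound names need a rule of their own with a higher priority.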

Getting started

First launch

A setup wizard opens automatically on first load:

Welcome — plugin overview
Language — choose the interface language
Storage — configure vault folders (we recommend one vault per corpus)
NER detection — enable the local AI model (WASM download ~19 MB) or work with manual rules only
Dictionaries — install replacement candidate dictionaries from the online catalogue

The wizard can be relaunched at any time: Settings → Pseudonymizer Tool → Setup wizard.

Supported formats

| Format | Extension | Notes |
| --- | --- | --- |
| noScribe HTML | `.html` | Qt Rich Text from noScribe: speaker labels, word timestamps, audio path |
| noScribe VTT | `.vtt` | noScribe v0.7 output; also standard Whisper WebVTT with word timestamps |
| Timestamped subtitles | `.srt` | Whisper / AI output; timestamps and structure preserved |
| CHAT / CLAN | `.cha`, `.chat` | `@`, `*`, `%` lines preserved |
| Annotated Markdown | `.md` | Jefferson or ICOR conventions |
| Plain text | `.txt` | No convention markers |
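To make the SRT row concrete, here is a minimal sketch of turning SRT cues into Markdown turn lines. It is not the plugin's `SrtParser`: the `SrtCue` interface, both function names and the placeholder speaker label are assumptions, and milliseconds are dropped to fit an `[HH:MM:SS]` display format.

```typescript
// Hypothetical cue shape for this example only.
interface SrtCue {
  index: number; // cue number from the first line of the block
  start: string; // "HH:MM:SS" (milliseconds dropped in this sketch)
  end: string;
  text: string;  // cue text, multi-line cues joined with spaces
}

function parseSrt(srt: string): SrtCue[] {
  const cues: SrtCue[] = [];
  // Cues are separated by blank lines; normalise Windows line endings first.
  const blocks = srt.replace(/\r\n/g, "\n").trim().split(/\n{2,}/);
  for (const block of blocks) {
    const lines = block.split("\n");
    if (lines.length < 3) continue; // need index, timing and at least one text line
    const timing = /^(\d{2}:\d{2}:\d{2}),\d{3} --> (\d{2}:\d{2}:\d{2}),\d{3}/.exec(lines[1]);
    if (timing === null) continue;
    cues.push({
      index: Number(lines[0]),
      start: timing[1],
      end: timing[2],
      text: lines.slice(2).join(" "),
    });
  }
  return cues;
}

// The speaker label is supplied by the caller because plain SRT carries none.
function cueToMarkdown(cue: SrtCue, speaker: string): string {
  return "**" + speaker + "** [" + cue.start + "] : " + cue.text;
}
```

For a cue starting at `00:00:01,000` with the text `bonjour`, `cueToMarkdown(cue, "S00")` yields `**S00** [00:00:01] : bonjour`.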

All formats are automatically converted to Markdown on import. Alongside the .md, the plugin creates:

<basename>.mapping.json — pseudonymization rules
<basename>.words.json — word-level timestamps (noScribe / Whisper only), used for VTT re-export
If an audio file is referenced in the transcript, it is imported to the vault automatically.
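The naming scheme for the companion files can be sketched as follows. This is only an illustration of the `<basename>` convention described above: `sidecarPaths` and `SidecarPaths` are hypothetical names, and placing all files in the same folder is an assumption.

```typescript
// Hypothetical result shape: the converted note and its two companion files.
interface SidecarPaths {
  markdown: string;
  mapping: string;
  words: string;
}

function sidecarPaths(importedPath: string): SidecarPaths {
  const slash = importedPath.lastIndexOf("/");
  const dot = importedPath.lastIndexOf(".");
  // Strip the extension only when the dot belongs to the file name itself
  // (not to a folder name) and is not the first character of the file name.
  const base = dot > slash + 1 ? importedPath.slice(0, dot) : importedPath;
  return {
    markdown: base + ".md",
    mapping: base + ".mapping.json",
    words: base + ".words.json",
  };
}

const interviewFiles = sidecarPaths("corpus/interview01.srt");
// interviewFiles.mapping: "corpus/interview01.mapping.json"
```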

Install the Data Files Editor plugin to view mapping JSON files directly in your vault.

Installation

Via the Obsidian community plugins directory (under review):

Obsidian → Settings → Community plugins → Browse → "Pseudonymizer Tool"

Manual installation from the latest GitHub release:

Download main.js, manifest.json, styles.css
Copy into .obsidian/plugins/pseudonymizer-tool/ in your vault
Enable in Settings → Community plugins

WASM files (NER) are downloaded automatically by the wizard on first launch if you enable automatic detection. They can also be downloaded manually from the release.

Features

Manual pseudonymization

All actions are available via right-click on a selection in the editor:

| Action | Description |
| --- | --- |
| Pseudonymize | Replace the term (this occurrence or all) with `{{...}}` markers |
| Pseudonymize with Prof. Baptiste Coulmont | Queries coulmont.com and suggests sociologically equivalent first names (same social background, same decade) |
| Create a rule | Saves the correspondence in the mapping JSON without modifying the text |
| Edit rule | Modifies an existing rule (available on orange or green highlighted terms) |
| Cancel pseudonymization | Restores the original term for this occurrence (available on green terms) |

Automatic NER detection

The named entity recognition engine detects first names, surnames, places and institutions without a pre-existing list, using syntactic context. Unlike a dictionary approach, it distinguishes "Florence" the person from "Florence" the city. Sub-terms of a known compound entity are automatically filtered (if "Saint-Jean-de-Luz" is a rule, "Jean" and "Luz" do not appear as NER candidates).
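The sub-term filter can be pictured with the sketch below. It is a simplified assumption, not the plugin's scanner: `Candidate`, `entityParts` and `dropSubTerms` are hypothetical names, and the filter here works on whole tokens across the file, whereas the plugin may only filter inside matched spans.

```typescript
// Hypothetical candidate shape for this example.
interface Candidate {
  text: string;
  score: number;
}

// Split a compound entity into its parts on spaces, hyphens and apostrophes.
function entityParts(entity: string): string[] {
  return entity.split(/[\s\-']+/).filter((p) => p.length > 0);
}

// Drop candidates that are merely one part of a compound entity already covered by a rule.
function dropSubTerms(candidates: Candidate[], ruleSources: string[]): Candidate[] {
  return candidates.filter(
    (c) => !ruleSources.some((src) => src !== c.text && entityParts(src).indexOf(c.text) !== -1)
  );
}
```

With a rule for "Saint-Jean-de-Luz", candidates "Jean" and "Luz" are removed while "Bayonne" is kept.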

Usage:

Side panel → NER tab → Identify candidates
Entities appear highlighted in blue in the editor
Right-click a blue term → Pseudonymize or Create a rule

Model: Xenova/bert-base-multilingual-cased-ner-hrl via transformers.js. 100% local execution. One-time downloads on first use: WASM (~19 MB) + NER model (~66 MB). Works offline after the first download.

Settings (NER tab in the panel):

Confidence threshold (0.50–1.00): increasing it reduces false positives
Excluded function words: editable list of tokens to always ignore ("de", "du", "la"…)
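The effect of these two settings can be sketched as a post-filter over model output. This is an illustrative assumption rather than the plugin's code: `NerCandidate` and `keepCandidates` are hypothetical names, and the comparison against function words is made case-insensitive here by choice.

```typescript
// Hypothetical candidate shape: surface text, entity label and model confidence.
interface NerCandidate {
  text: string;
  label: string; // e.g. "PER", "LOC", "ORG"
  score: number; // confidence between 0 and 1
}

function keepCandidates(
  candidates: NerCandidate[],
  threshold: number,       // e.g. 0.5 to 1.0; higher means fewer false positives
  functionWords: string[]  // tokens that are always ignored, e.g. "de", "du", "la"
): NerCandidate[] {
  const ignored = functionWords.map((w) => w.toLowerCase());
  return candidates.filter(
    (c) => c.score >= threshold && ignored.indexOf(c.text.toLowerCase()) === -1
  );
}
```

Raising the threshold from 0.5 to 0.9 would, for instance, drop a candidate scored 0.85 that the lower setting keeps.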

Side panel

Accessible via the ribbon icon or Ctrl+P → Pseudonymization: open panel.

| Tab | Content |
| --- | --- |
| Mappings | Active rules · Edit · Delete · Add · Scan file · Exceptions section |
| Dictionaries | Mini cards · Dictionary scan · Local import |
| Exports | Create pseudonymized version · Export correspondence table · Re-export as VTT/SRT/CHAT |
| NER | Visible if NER enabled · Identify candidates · Confidence threshold · Function words |
| Corpus | Files by class · Class management · Move files · Final export destination |

Highlighting and markers

Highlighting is active in all open files, including .pseudonymized.* export files, which automatically inherit rules from the source file.

| Colour | Meaning |
| --- | --- |
| 🟠 Orange + outline | Source term still present, to be pseudonymized |
| 🟢 Green + underline | Pseudonym applied directly in the file |
| 🔵 Blue + outline | NER candidate, no rule yet |
| 🔴 Red + underline | Exception: specific occurrence explicitly ignored (context-aware; persisted in mapping) |

In exported files, pseudonyms are wrapped in {{...}} markers (for example {{Pierre}}) to distinguish them from raw data. This is enabled by default and configurable in settings.
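A minimal sketch of how an export step might honour that setting: markers are either kept as they are or stripped so that only the pseudonym remains. `renderExport` and its `keepMarkers` parameter are hypothetical names, not the plugin's settings keys.

```typescript
// Keep or strip {{...}} markers when producing an export.
function renderExport(annotated: string, keepMarkers: boolean): string {
  if (keepMarkers) return annotated;
  // Replace each {{pseudonym}} with its inner text.
  return annotated.replace(/\{\{([^{}]+)\}\}/g, "$1");
}

const strippedLine = renderExport("{{Pierre}} habite à {{Ville_1}}.", false);
// strippedLine: "Pierre habite à Ville_1."
```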

Correspondence tables

Three scope levels: file.mapping.json · folder.mapping.json · vault.mapping.json
Per-rule statuses: Active (validated), Partial, Ignored, Suggested
Per-occurrence ignored exceptions: IgnoredOccurrence {text, contextBefore, contextAfter} stored in the rule
Z-index priority: free integer — longer entities take precedence (Saint-Jean-de-Luz > Jean)
JSON format documented in SPECS.md §5
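The points above can be pictured with the following sketch. Only the `IgnoredOccurrence` fields come from this README; the other field names, the override order (file over folder over vault) and the "higher priority first" sort are assumptions, and the authoritative format is the one in SPECS.md §5.

```typescript
type RuleStatus = "active" | "partial" | "ignored" | "suggested";

interface IgnoredOccurrence {
  text: string;
  contextBefore: string;
  contextAfter: string;
}

// Hypothetical rule shape; field names other than `ignored` entries are illustrative.
interface MappingRule {
  source: string;
  replacement: string;
  status: RuleStatus;
  priority: number; // assumed: higher values are applied first
  ignored: IgnoredOccurrence[];
}

// Merge the three scopes; a narrower scope overrides a wider one for the same source term.
function resolveRules(vault: MappingRule[], folder: MappingRule[], file: MappingRule[]): MappingRule[] {
  const bySource: Record<string, MappingRule> = {};
  for (const scope of [vault, folder, file]) {
    for (const rule of scope) bySource[rule.source] = rule;
  }
  const merged = Object.keys(bySource).map((k) => bySource[k]);
  // Higher priority first; ties broken by longer source so compounds win over their parts.
  return merged.sort((a, b) => b.priority - a.priority || b.source.length - a.source.length);
}

// True if the occurrence starting at `start` matches a stored exception and its context.
function isIgnored(rule: MappingRule, text: string, start: number): boolean {
  const before = text.slice(0, start);
  const after = text.slice(start + rule.source.length);
  return rule.ignored.some(
    (o) =>
      o.text === rule.source &&
      before.slice(before.length - o.contextBefore.length) === o.contextBefore &&
      after.slice(0, o.contextAfter.length) === o.contextAfter
  );
}
```

Storing the surrounding context lets a single occurrence (for example "Florence" the city) be skipped while other occurrences of the same term are still replaced.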

Pseudonymization dictionaries

Dictionaries provide replacement candidates and feed automatic detection. They are hosted in a dedicated repository and downloaded into your vault — no transcript text ever leaves Obsidian.

Installing from the catalogue:

Wizard → Dictionaries step → choose from the online catalogue
Or: Settings → Setup wizard → Dictionaries step

Usage:

Side panel → Dictionaries tab
Check the dictionaries to activate for the scan
Click Scan active file (or the magnifier button on each card for a single dictionary)
A review modal opens: each detected entity is shown with its context excerpt, proposed class and automatically computed replacement (Ville_1, Métropole_2…)
Uncheck false positives · edit the prefix if needed · Create rules

Available dictionary — French communes (GeoAPI INSEE):

34,957 French communes, 5 classes: Village · Petite_Ville · Ville · Grande_Ville · Métropole
Roles: detection + class-based replacement with incremental index
Recommended scope: file (each file gets its own numbering)
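Class-based replacement with an incremental index can be sketched as below. This is an assumption about the numbering: the sketch keeps one counter per class, although the "Ville_1, Métropole_2" example could also be read as a single shared counter. `DictionaryHit` and `assignReplacements` are hypothetical names.

```typescript
// Hypothetical scan result: the detected place name and its dictionary class.
interface DictionaryHit {
  text: string;
  cls: string; // e.g. "Village", "Ville", "Métropole"
}

function hasKey(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Assign each distinct entity a label of the form <class>_<n>, numbering per class
// in order of first appearance. Repeated entities keep their first label.
function assignReplacements(hits: DictionaryHit[]): Record<string, string> {
  const counters: Record<string, number> = {};
  const result: Record<string, string> = {};
  for (const hit of hits) {
    if (hasKey(result, hit.text)) continue;
    const next = (hasKey(counters, hit.cls) ? counters[hit.cls] : 0) + 1;
    counters[hit.cls] = next;
    result[hit.text] = hit.cls + "_" + next;
  }
  return result;
}
```

Because numbering starts afresh for each call, running this once per file matches the recommended file scope, where each file gets its own numbering.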

Dedicated repository: core-hn/pseudobsidian-dictionaries — contributions welcome.

Corpus organization

The plugin helps you structure your corpus in folders before starting work. Use the command Organize corpus (Ctrl+P) to define classes (sub-folders). For each class, the plugin automatically creates mirrored folders for transcriptions, mapping tables and exports. When adding a transcription, you are prompted to select a class.
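The mirrored layout might look like the sketch below. The three top-level folder names are purely hypothetical placeholders chosen for this example; the plugin's actual folder names are set in its storage settings and are not given here.

```typescript
// Hypothetical folder triple for one corpus class.
interface ClassFolders {
  transcriptions: string;
  mappings: string;
  exports: string;
}

// Build the three mirrored folder paths for a class under a corpus root.
// "transcriptions", "mappings" and "exports" are assumed names, not plugin defaults.
function classFolders(root: string, className: string): ClassFolders {
  const prefix = root === "" ? "" : root.replace(/\/+$/, "") + "/";
  return {
    transcriptions: prefix + "transcriptions/" + className,
    mappings: prefix + "mappings/" + className,
    exports: prefix + "exports/" + className,
  };
}

const interviewFolders = classFolders("corpus/", "interviews");
// interviewFolders.exports: "corpus/exports/interviews"
```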

We recommend one Obsidian vault per corpus. This keeps mapping files, dictionaries and exports together with their transcriptions, and makes archiving or sharing a corpus straightforward.

Privacy and security

All processing is local — no transcript text is sent to an external server
The NER model runs in Obsidian via WASM, with no network call once the one-time WASM and model downloads have completed
Documented exception: "Pseudonymize with Coulmont" sends the source first name (not transcript content) to coulmont.com/bac to suggest equivalent names. B. Coulmont states on his website that searches are not logged.
Correspondence tables are never included in pseudonymized exports.

For developers

git clone https://gitlab.huma-num.fr/aabbadie/pseudobsidian-ization.git
cd pseudobsidian-ization
npm install
npm run dev           # watch build (esbuild)
npm test              # Jest test suite
npm run build         # production build
npm run deploy        # build + copy to test_vault/
npm run build:cities  # regenerate assets/cities.dict.json from GeoAPI INSEE

Repository structure:

src/
├── main.ts               # Obsidian entry point
├── settings.ts           # Persistent settings
├── types.ts              # Shared types (MappingRule, IgnoredOccurrence, …)
├── i18n/                 # Internationalization (en, fr)
├── parsers/              # SrtParser, ChatParser, VttParser,
│                         # NoScribeHtmlParser, NoScribeVttParser, TranscriptConverter
├── mappings/             # MappingStore, ScopeResolver
├── pseudonymizer/        # Engine, ReplacementPlanner, SpanProtector, Redaction
├── scanner/              # OccurrenceScanner, OnnxNerScanner
├── dictionaries/         # DictionaryLoader
└── ui/                   # PseudonymizationView, modals (incl. OccurrencesContextModal),
                          # CM6 highlighting (PseudonymHighlighter)

Status — v0.1.x

| Phase | Status | Description |
| --- | --- | --- |
| 0–6 | ✅ | Parsers · Engine · Commands · Scopes · Highlighting · Validation |
| 7: Coulmont | ✅ | Equivalent first name suggestions · JSON/CSV import |
| 8: Side panel | ✅ | 3 tabs · Embedded NER · Wizard · Cancellation · Export highlighting |
| 9: Structured dictionaries | ✅ | Format v1.1 · DictionaryLoader · Dictionary scan · Review modal · French communes |
| 10: Refinement & noScribe | 🔄 | i18n · Corpus UI · noScribe import · per-occurrence scan · exceptions · rename cascade · export destination |
| 11: EMCA functions | ⏳ | Turn navigation · Jefferson/ICOR correction · ELAN export |

See ROADMAP.md for the full phase breakdown and planned features.

Contributing

Contributions are welcome, particularly:

Dictionaries: first name lists, place names, institutions for specific corpora (regional languages, non-French fields, historical periods) — see core-hn/pseudobsidian-dictionaries
Transcription conventions: parsers for other systems (ELAN, Praat TextGrid, EXMARaLDA)
Usage feedback: issues to report edge cases encountered on real corpora

Please open an issue before submitting a pull request for significant features.

Licence

GPL 3.0

Code

The Beerware License (Revision 42)

Axelle Abbadie wrote this code. You can do whatever you want with it
as long as you keep this notice. If we meet someday, and you think
it was worth it, you can buy me a beer.

This plugin is made to be modified. If your fieldwork involves particular transcription conventions, a regional dialect, multilingual corpora or institution-specific export formats, adapt the code to your needs.

Credits

Intellectual property

Axelle Abbadie — design, specifications, development direction, UX research. (cvHAL)
Vibe-coded with Claude Sonnet 4.6 (Anthropic)

Inspiration

Sonal-pi: the idea for this plugin followed a meeting with Maxime Beligné at the "Pseudonymiser, Anonymiser?" study day organized by MSH-Sud. Sonal-pi has been developed since 2008 by Alex Alber.

Acknowledged work

Baptiste Coulmont — pseudonymization tool (coulmont.com/bac) used for first name suggestions.
Stefan Schweter (Bayerische Staatsbibliothek) — multilingual NER model bert-base-multilingual-cased-ner-hrl, used for automatic entity detection.
Joshua Lochner / Xenova — ONNX model conversion and transformers.js library, enabling local execution in Obsidian without a Python dependency.
