Lyrics Writer

Technical Details


1. IPA Phonetic Transcription

Every word typed into a lyric line is converted to its IPA representation in real time, entirely in the browser — no network requests.

Pipeline

    flowchart TD
        W([Word typed by user])
        W --> D{In built-in\ndictionary?}
        D -- Yes --> IPA1[Exact IPA from dictionary]
        D -- No --> R[Rule-based converter]
        R --> IPA2[Approximate IPA]
        IPA1 --> OUT([IPA display + rhyme detection])
        IPA2 --> OUT

Built-in dictionary

Each language ships with a hand-curated dictionary of irregular or hard-to-convert words: common pronouns, contractions, verb forms, and prepositions. These are looked up first for exact results.

Rule-based converters

For any word not in the dictionary, a deterministic set of substitution rules converts spelling to IPA. Quality depends on how regular the language's orthography is:

| Language | Regularity | Notes |
|---|---|---|
| 🇪🇸 Spanish | Very high | Nearly one-to-one letter↔sound mapping |
| 🇮🇹 Italian | High | A few digraphs: sci, gli, gn, ch/gh |
| 🏛️ Latin | High | Classical pronunciation; digraphs ae/oe/au/ph/th |
| 🇫🇷 French | Medium | Silent letters, liaison, nasal vowels |
| 🇬🇧 English | Low | Highly irregular; rule results are approximate |
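As a sketch of the idea (the actual rule tables ship with the app and are not reproduced here), a converter for a highly regular orthography like Spanish can be little more than an ordered substitution list, with multi-letter rules tried before single-letter ones. The rules below are illustrative, not the app's real rule set:

```python
# Illustrative subset of Spanish spelling→IPA substitution rules.
# Multi-character rules come first so "ch" wins over "c".
SPANISH_RULES = [
    ("ch", "tʃ"),
    ("ll", "ʝ"),
    ("qu", "k"),
    ("rr", "r"),
    ("ñ", "ɲ"),
    ("c", "k"),   # simplification: ignores c→θ before e/i
    ("j", "x"),
    ("v", "b"),
    ("z", "θ"),   # Castilian; Latin American Spanish would use "s"
    ("h", ""),    # silent
]

def spell_to_ipa(word: str, rules=SPANISH_RULES) -> str:
    """Apply ordered substitution rules left to right over the word."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        for src, dst in rules:
            if word.startswith(src, i):
                out.append(dst)
                i += len(src)
                break
        else:
            out.append(word[i])  # unmatched letter ≈ its own IPA symbol
            i += 1
    return "".join(out)
```

Because Spanish spelling is nearly phonemic, even this tiny table gives usable IPA; for English, the equivalent approach only yields approximations.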

2. Word Embedding Data Pipeline

The suggestion engine needs two things per vocabulary word: its IPA (for rhyme scoring) and its embedding vector (for meaning scoring). These are precomputed offline by scripts/build_word_data.py and saved as data/words_{lang}.js.

    flowchart TD
        subgraph Download
            A["FastText Wikipedia .vec\nhosted by Meta AI"]
            A -->|"Stream first 300 000 lines\n~270 MB partial download"| B["300k words\n300d float32 vectors"]
        end
        subgraph Filter
            B --> F{"Keep word?"}
            F -->|"✗ stopword\n✗ proper noun\n✗ digit / punctuation\n✗ length outside 3–20"| SKIP[Discard]
            F -->|"✓ passes all checks"| KEEP["~250k words"]
        end
        subgraph Embeddings
            KEEP -->|"PCA 300d → 50d"| PCA["50d float32"]
            PCA -->|"Unit-normalise · ×127 · round · clip"| Q["50d int8"]
            Q -->|"Row-major bytes → base64"| B64["Base64 string"]
        end
        subgraph Phonetics
            KEEP -->|"phonemizer + espeak-ng\nbatch=2000, jobs=4"| IPA["IPA strings"]
        end
        subgraph Output
            B64 --> JS["data/words_{lang}.js\n~23 MB per language"]
            IPA --> JS
        end

FastText word vectors

FastText vectors are trained on Wikipedia by Meta AI using a skip-gram model. Each word is placed in a 300-dimensional space so that words appearing in similar contexts end up pointing in similar directions. The .vec files are plain text sorted by frequency, which lets us stream just the top N words without downloading the full 2–4 GB file.

PCA dimensionality reduction

Principal Component Analysis reduces each 300-dimensional vector to 50 dimensions while maximising preserved variance. The PCA is fit on the full filtered vocabulary, so the 50 components capture the dominant semantic axes of the language.

$$\mathbb{R}^{300} \xrightarrow{\;\text{PCA}\;} \mathbb{R}^{50}$$

Benefits: 6× smaller file · faster browser computation · slight denoising.
Trade-off: minor loss of semantic precision (negligible for top-N retrieval).
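A minimal sketch of the reduction step, assuming NumPy and fitting PCA via SVD of the centred matrix (the real script may use a library PCA implementation instead):

```python
import numpy as np

def pca_reduce(X: np.ndarray, k: int = 50) -> np.ndarray:
    """Project rows of X (N × 300) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)              # centre each dimension
    # Right singular vectors of the centred matrix are the principal axes,
    # ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                 # N × k projection

# Toy stand-in for the embedding matrix: 100 random "words" in 300d.
X = np.random.default_rng(0).normal(size=(100, 300)).astype(np.float32)
Y = pca_reduce(X, k=50)
```

Fitting on the full filtered vocabulary means the retained components are the directions along which the language's embeddings actually vary most.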

Int8 quantization

After unit-normalising each vector (‖v‖ = 1), values are scaled by 127, rounded, and stored as signed bytes. Cosine similarity in int8 is nearly identical to float32 because the vectors were unit-normalised before scaling — quantisation error per dimension is ≤ 0.004, negligible over 50 dimensions.

$$\mathbf{v}_{\text{float32}} \;\longrightarrow\; \hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|} \;\longrightarrow\; \left\lfloor 127\,\hat{\mathbf{v}} \right\rceil_{\text{int8}}$$

Storage per word: $50$ bytes (vs $200$ bytes for float32) — total for $250\text{k}$ words: $12.5\,\text{MB}$ int8 vectors $+$ words/IPA JSON $+$ base64 overhead $\approx 23\,\text{MB}$ per file.
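The normalise → ×127 → round → clip → base64 chain can be sketched in pure Python for a single vector (the real script does this row-major over the whole matrix; helper names here are illustrative):

```python
import base64
import math

def quantize_int8(vec: list[float]) -> bytes:
    """Unit-normalise, scale by 127, round, clip to the int8 range."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    q = [max(-127, min(127, round(127 * x / norm))) for x in vec]
    # Signed bytes are stored two's-complement in the byte buffer.
    return bytes((x + 256) % 256 for x in q)

vec = [0.3, -0.4, 0.5, 0.0]
raw = quantize_int8(vec)          # 4 bytes, one per dimension
b64 = base64.b64encode(raw).decode("ascii")
```

One byte per dimension is what gives the 50-byte-per-word figure; base64 then adds the usual ~33 % transport overhead inside the JS file.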

espeak-ng IPA

espeak-ng is an open-source speech synthesis engine used as a pronunciation oracle. phonemizer wraps it into a Python API. Words are processed in batches of 2 000 to amortise subprocess startup cost, with 4 parallel workers.

3. Rhyme Score

The rhyme score counts how many phonemes two words share from the end — the classic phonetic definition of a rhyme. It is displayed as an orange number on each chip.

Step 1 — IPA cleaning

Before comparison, both the target and candidate IPA strings are normalised to make rhyme detection accent-tolerant:

1. Strip prosodic markers: ˈ ˌ ː / [ ]
2. Neutralise voiced/unvoiced pairs: z→s, d→t, b→p, ɡ→k, v→f, ð→θ, ʒ→ʃ. This prevents near-homophones from being ranked as non-rhymes (e.g. "bees" / "peace").
3. Strip trailing schwa ə — an unstressed final vowel that varies by accent.
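Assuming the cleaning is plain character-level substitution, the three steps can be sketched as (function and table names are hypothetical):

```python
# Step 1: prosodic markers to strip entirely.
PROSODY = "ˈˌː/[]"
# Step 2: voiced consonants mapped to their unvoiced counterparts.
DEVOICE = str.maketrans("zdbɡvðʒ", "stpkfθʃ")

def clean_ipa(ipa: str) -> str:
    ipa = "".join(ch for ch in ipa if ch not in PROSODY)  # 1. prosody
    ipa = ipa.translate(DEVOICE)                           # 2. devoice
    return ipa.rstrip("ə")                                 # 3. final schwa
```

After cleaning, "bees" /ˈbiːz/ and "peace" /piːs/ both reduce to the same string, so they compare as full rhymes.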

Step 2 — Suffix overlap count

Scan both IPA strings character by character from the right. The score is how many characters match before the first mismatch.

Example — "night" vs "light" (3 sounds in common)

    night   n a ɪ t
    light   l a ɪ t
    ✓ 3 matching sounds: /aɪt/

Example — "morning" vs "warning" (5 sounds in common)

    morning   m ɔ r n ɪ ŋ
    warning   w ɔ r n ɪ ŋ
    ✓ 5 matching sounds: /ɔrnɪŋ/

Example — "love" vs "dove" (devoicing: v → f)

    love   l ʌ f   (v neutralised to f)
    dove   t ʌ f   (d neutralised to t, v to f)
    ✓ 2 matching sounds: /ʌf/
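The right-to-left scan is a few lines of code; this sketch assumes both strings have already been cleaned as in Step 1:

```python
def rhyme_count(a: str, b: str) -> int:
    """Number of matching trailing characters before the first mismatch."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n
```

Each character of a cleaned IPA string stands for one phoneme, so the character count doubles as the shared-sound count shown on the chip.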

Step 3 — Normalisation

$$\text{rhyme}_i = \frac{\text{raw\_rhyme}_i}{\displaystyle\max_j(\text{raw\_rhyme}_j)} \in [0, 1]$$

The best-rhyming candidate always scores 1.0. The displayed orange number is the raw count before normalisation.

4. Meaning Score

The meaning score measures how semantically close a candidate word is to the target, using cosine similarity between their word embedding vectors. It is displayed as a purple percentage on each chip.

Cosine similarity

$$\cos(\theta) = \dfrac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \times \|\mathbf{b}\|}$$
| Symbol | Meaning |
|---|---|
| $\mathbf{a}$ | 50d embedding of the target word |
| $\mathbf{b}$ | 50d embedding of the candidate word |
| $\theta$ | angle between $\mathbf{a}$ and $\mathbf{b}$ |

| Value | Interpretation |
|---|---|
| $+1$ | Same direction — semantically very close |
| $\phantom{+}0$ | Orthogonal — unrelated words |
| $-1$ | Opposite — contrasting contexts |

Geometric intuition

Word embeddings place semantically similar words at small angles from each other in vector space:

[Diagram: the target word "fight" sits at a small angle to "battle" (high similarity) and at a large angle to "apple" (low similarity).]

Normalisation

$$\text{sem}_i = \frac{\max\!\left(0,\; \cos(\mathbf{t},\, \mathbf{b}_i)\right)}{\displaystyle\max_j\!\left(\max(0,\cos(\mathbf{t},\mathbf{b}_j))\right)} \in [0, 1]$$

$\mathbf{t}$ is the target word's vector. Negative similarities are clamped to 0 before normalisation. The displayed purple % is $\text{round}(\text{sem}_i \times 100)$.

Why cosine and not Euclidean distance?

Cosine similarity measures the angle between vectors, ignoring magnitude. Since all vectors are unit-normalised before int8 quantization, their magnitudes are all ≈ 127. Using cosine ensures the score reflects semantic direction, not accidental magnitude differences from rounding.
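A pure-Python sketch of the computation over int8 vectors (the browser presumably does the same arithmetic over an Int8Array; the magnitudes of ≈127 cancel in the ratio, which is why quantisation barely affects the score):

```python
import math

def cosine_int8(a: list[int], b: list[int]) -> float:
    """Cosine similarity between two quantised embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```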

5. Final Ranking

Score combination

$$\text{score}_i = \underbrace{w}_{\text{slider}} \cdot \underbrace{\text{sem}_i}_{\substack{\text{cosine similarity} \\ \in\,[0,1]}} \;+\; \underbrace{(1-w)}_{\text{slider}} \cdot \underbrace{\text{rhyme}_i}_{\substack{\text{suffix overlap} \\ \in\,[0,1]}} \qquad w \in [0,\,1]$$

Slider positions

| Slider | w | Formula | Effect |
|---|---|---|---|
| 0% — full Rhyme | 0.0 | 1.0 × rhyme | Best-rhyming words first |
| 30% (default) | 0.3 | 0.3 × sem + 0.7 × rhyme | Rhyme-leaning blend |
| 50% | 0.5 | 0.5 × sem + 0.5 × rhyme | Equal weight |
| 100% — full Meaning | 1.0 | 1.0 × sem | Semantically closest words first |

Why independent normalisation matters

Both scores are normalised independently — each divided by its own maximum. Without this, whichever score has larger raw values would dominate at any slider position. With it, w = 0.5 means genuinely equal influence, regardless of the raw scale of each score.
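Independent normalisation and the weighted blend fit in a few lines; the candidate words and raw scores below are made up for illustration:

```python
def rank(candidates, raw_sem, raw_rhyme, w):
    """Normalise each score by its own max, blend with weight w, sort."""
    max_s = max(raw_sem.values()) or 1.0
    max_r = max(raw_rhyme.values()) or 1.0
    scored = {
        c: w * (raw_sem[c] / max_s) + (1 - w) * (raw_rhyme[c] / max_r)
        for c in candidates
    }
    return sorted(candidates, key=lambda c: -scored[c])

# Hypothetical raw scores for the target "night":
cands = ["light", "battle", "evening"]
sem   = {"light": 0.30, "battle": 0.10, "evening": 0.60}  # cosine, clamped ≥ 0
rhyme = {"light": 3, "battle": 0, "evening": 1}           # suffix counts
```

With these numbers, w = 0 puts "light" first (best rhyme) and w = 1 puts "evening" first (closest meaning), which is exactly the behaviour the slider table describes.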

6. Complexity

$N = 250\,000$ words, $d = 50$ (embedding dim), $D = 300$ (original FastText dim).

| Step | When | Complexity | Typical execution time |
|---|---|---|---|
| Stream vectors (download ~270 MB from FastText) | One-time, offline | $\mathcal{O}(N \cdot D)$ | ~10 min |
| PCA (SVD on $250\text{k} \times 300$ matrix) | One-time, offline | $\mathcal{O}(N \cdot D^2)$ | ~10 min |
| espeak-ng IPA (125 batches × 4 parallel workers) | One-time, offline | $\mathcal{O}(N)$ | ~10 min |
| Data file load (parse 23 MB JS · atob on 16 MB base64 string · fill 12.5 M-byte Int8Array · insert 250k entries into a Map) | First ✦ click per language | $\mathcal{O}(N)$ | 1–3 s |
| Score all candidates (250k × 50 multiply-adds for cosine + 250k IPA suffix matches) | Target word changes | $\mathcal{O}(N \cdot d)$ | ~300 ms |
| Re-rank (recompute weighted score + sort 250k candidates) | Every slider move | $\mathcal{O}(N \log N)$ | ~30 ms |

Scoring results are cached per target word — moving the slider only triggers a re-sort, not a re-score. The dominant one-time cost is the data file load; all subsequent interactions reuse the same in-memory Int8Array.