Unicode and Ancient Greek

1. Alphabetic Order
2. Problems with Accents
3. Beta Code
4. unicode->betacode-char

1. Alphabetic Order

The letter α is #x03b1 (hex), 945 (dec), so we get the alphabet by adding 1 up to 969, ω. That makes 25 characters, since the range includes the final sigma, ς (962).

(let ((lalala '()))  ; each let binding needs its own pair of parentheses
  (dotimes (x 25) (push (char-to-string (+ 945 x)) lalala))
  lalala)

("ω" "ψ" "χ" "φ" "υ" "τ" "σ" "ς" "ρ" "π" "ο" "ξ" "ν" "μ" "λ" "κ" "ι" "θ" "η" "ζ"
 "ε" "δ" "γ" "β" "α")

Since push adds each item at the beginning, reverse the result.

(setq αλφαβητα (reverse *))
("α" "β" "γ" "δ" "ε" "ζ" "η" "θ" "ι" "κ" "λ" "μ" "ν" "ξ" "ο" "π" "ρ" "ς" "σ" "τ"
 "υ" "φ" "χ" "ψ" "ω")

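The same list can also be built already in order, without push and reverse. This is only a sketch; the variable name greek-sketch-alphabet is made up for the example.

```elisp
;; number-sequence returns the integers 945, 946, ... 969;
;; mapcar turns each of them into a one-character string.
(defvar greek-sketch-alphabet
  (mapcar #'char-to-string (number-sequence 945 969))
  "Hypothetical variable: the lowercase Greek letters, final sigma included.")
```

Because mapcar keeps the order of its input, no reversing is needed.
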
2. Problems with Accents

Vowels with a precomposed acute accent have two codepoints: a lower one with τόνος (tónos), for Modern Greek, and a higher one with ὀξεῖα (oxeîa), for Ancient Greek. Both mean an acute accent, so having both is an error. The default is to use only the lower codepoints but, of course, not everyone does that. Here are the offending characters in their lower codepoint and in their higher equivalent.

(setq greek-accents
      '(("ά" . "ά") ("έ" . "έ") ("ή" . "ή") ("ί" . "ί") ("ό" . "ό") ("ύ" . "ύ")
        ("ώ" . "ώ") ("Ά" . "Ά") ("Έ" . "Έ") ("Ή" . "Ή") ("Ί" . "Ί") ("Ό" . "Ό")
        ("Ύ" . "Ύ") ("Ώ" . "Ώ") ("ΐ" . "ΐ") ("ΰ" . "ΰ")))

To query this association list, just do (assoc the-char greek-accents), and it will return the cons with the-char as the first member.

To get the counterpart of the-char, just use (cdr (assoc the-char greek-accents)).
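As a small illustration of both lookups, here is a sketch with a hypothetical two-entry list, greek-sketch-accents. Its characters are spelled out by code point, so that the lower and the higher forms cannot be confused on screen.

```elisp
;; Hypothetical two-entry list; (string N) builds a one-character string.
(defvar greek-sketch-accents
  (list (cons (string #x03ac) (string #x1f71))   ; alpha:   tonos . oxia
        (cons (string #x03ad) (string #x1f73)))) ; epsilon: tonos . oxia

;; The whole cons, or nil when the character is not in the list.
(assoc (string #x03ac) greek-sketch-accents)
(assoc "α" greek-sketch-accents)

;; Only the higher counterpart, shown as a number.
(string-to-char (cdr (assoc (string #x03ad) greek-sketch-accents)))
```

assoc compares with equal, which is why it also works on strings here.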

To convert a string from using the low codepoints to the higher ones, ’tis best to convert the characters to numbers first. That’s done with string-to-char, which returns the first character of a string as a number: (string-to-char "α") gives 945. Parsing the whole list:

(setq greek-accents
      (let ((greek-numbers '()))  ; each let binding needs its own parentheses
        (dolist (this-cons greek-accents)
          (push (cons (string-to-char (car this-cons))
                      (string-to-char (cdr this-cons)))
                greek-numbers))
        greek-numbers))

((944 . 8163) (912 . 8147) (911 . 8187) (910 . 8171) (908 . 8185) (906 . 8155)
 (905 . 8139) (904 . 8137) (902 . 8123) (974 . 8061) (973 . 8059) (972 . 8057)
 (943 . 8055) (942 . 8053) (941 . 8051) (940 . 8049))

Now, to replace the characters in greek-accents with their higher-codepoint counterparts:

(defun greek-make-higher (input)
  (let ((output '()))
    (dolist (this-char (string-to-list input))
      (if (assoc this-char greek-accents)
          (push (cdr (assoc this-char greek-accents)) output)
        (push this-char output)))
    (reverse output)))

This gives us a very intuitive list of characters, like (955 8051 947 969). To transform that back to a string:

;; but cf. https://stackoverflow.com/questions/18979300
(defun list-to-string (input)
  (let ((output ""))
    (dolist (this-char input)
      ;; concat is the built-in; concatenate needs the obsolete cl library
      (setq output (concat output (char-to-string this-char))))
    output))

Mending greek-make-higher to use this function:

(defun greek-make-higher (input)
  (let ((output '()))
    (dolist (this-char (string-to-list input))
      (if (assoc this-char greek-accents)
          (push (cdr (assoc this-char greek-accents)) output)
        (push this-char output)))
    (list-to-string (reverse output))))

Thus, if we throw λέγω (955 941 947 969) at greek-make-higher, it throws λέγω (955 8051 947 969) back at us.
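The opposite direction needs no table, since the higher codepoints are canonically equivalent to the lower ones and Unicode normalization (NFC) replaces them. The sketch below assumes the ucs-normalize library that ships with Emacs; the function name greek-sketch-make-lower is made up for the example.

```elisp
(require 'ucs-normalize)

(defun greek-sketch-make-lower (input)
  "Return INPUT in NFC, which swaps oxia code points for their tonos twins.
Hypothetical helper; note that NFC also composes any other decomposed
sequences found in INPUT."
  (ucs-normalize-NFC-string input))

;; λέγω spelled with the higher epsilon (8051) comes back with 941.
(string-to-list (greek-sketch-make-lower (string 955 8051 947 969)))
```
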

3. Beta Code

Beta Code is a way of encoding Ancient Greek using ASCII characters. The original encoding used uppercase letters only, and marked the actual majuscules with an asterisk. That suggests an encoding like the one used in punch cards, quite early on the computer timescale. And somehow that has something to do with my printer. How small this world is!

4. unicode->betacode-char

(defun unicode->betacode-char (chr)
  (let ((maybe-replacement (assoc chr greek-tonos->oxeia)))
    (if maybe-replacement (setq chr (cdr maybe-replacement)))
    (let ((maybe-substitute (car (rassoc chr betacode->unicode))))
      (if maybe-substitute
          maybe-substitute
        (char-to-string chr)))))

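If it cannot find a replacement for a character, it spits the character out as it came. The function reads two association lists that are not shown in this section. Judging by how they are used, greek-tonos->oxeia pairs each tonos character with its oxeîa counterpart, and betacode->unicode pairs Beta Code strings with characters.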
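To make the two lookups concrete, here is a self-contained sketch with tiny stand-in tables. Everything named sketch-... is hypothetical, and the shape of the tables (Beta Code string in the car, character in the cdr, with the accented vowels stored under their higher codepoints) is only an assumption read off the function above.

```elisp
;; Hypothetical stand-in tables with just enough entries for one word.
(defvar sketch-tonos->oxeia
  '((#x03ad . #x1f73))                  ; epsilon: tonos . oxia
  "Tonos character to oxia character.")

(defvar sketch-betacode->unicode
  '(("l" . #x03bb) ("e/" . #x1f73) ("g" . #x03b3) ("w" . #x03c9))
  "Beta Code string to Unicode character.")

(defun sketch-unicode->betacode-char (chr)
  "Return the Beta Code string for CHR, or CHR itself as a string."
  (let* ((oxeia (cdr (assoc chr sketch-tonos->oxeia)))
         (chr (or oxeia chr)))
    ;; rassoc searches the alist by its cdr, so the car is the Beta Code.
    (or (car (rassoc chr sketch-betacode->unicode))
        (char-to-string chr))))

;; λέγω followed by a character that has no entry in the table.
(mapconcat #'sketch-unicode->betacode-char
           (string #x03bb #x03ad #x03b3 #x03c9 ?!)
           "")
```

The exclamation mark has no entry, so it passes through untouched.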

(setq teste "Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος")


(mapcar #'unicode->betacode-char (string-to-list teste))
=> ("*m" "h=" "n" "i" "n" " " "a)/" "e" "i" "d" "e" "," " " "q" "e" "a/" "," " "
"*p" "h" "l" "h" "i+" "a/" "d" "e" "w" " " "*)a" "x" "i" "l" "h=" "o" "s2")

;; this is news to me
(mapconcat #'unicode->betacode-char (string-to-list teste) "")
=> "*mh=nin a)/eide, qea/, *phlhi+a/dew *)axilh=os2"

I hope Beta Code will be a handy intermediary representation of Ancient Greek, since the characters are already decomposed: an ἄ, for instance, is represented as a)/, an alpha (a) with smooth breathing ()) and acute accent (/). Thus, if the need arises to convert to either precomposed or decomposed Unicode, the decomposition is already done, and in an easily readable way, quite unlike Unicode itself.

(mapconcat #'unicode->betacode-char (string-to-list "ὕβρις") "")
=> "u(/bris2"
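The claim that Beta Code is already decomposed can be tried out in the other direction. The sketch below maps a few Beta Code letters and diacritics to a base letter plus combining marks, and leaves the precomposing to NFC normalization. The tables are partial and hypothetical, lowercase vowels only, and the ucs-normalize library that ships with Emacs is assumed.

```elisp
(require 'ucs-normalize)

;; Hypothetical partial tables, for illustration only.
(defvar betacode-sketch-letters
  '((?a . #x03b1) (?e . #x03b5) (?h . #x03b7) (?i . #x03b9)
    (?o . #x03bf) (?u . #x03c5) (?w . #x03c9))
  "Beta Code vowel to Greek letter.")

(defvar betacode-sketch-marks
  '((?\) . #x0313) (?\( . #x0314) (?/ . #x0301) (?\\ . #x0300)
    (?= . #x0342) (?+ . #x0308) (?\| . #x0345))
  "Beta Code diacritic to Unicode combining mark.")

(defun betacode-sketch-decomposed (beta)
  "Return BETA with known letters and marks mapped to decomposed Unicode.
Characters found in neither table are kept as they are."
  (concat
   (mapcar (lambda (c)
             (or (cdr (assq c betacode-sketch-letters))
                 (cdr (assq c betacode-sketch-marks))
                 c))
           beta)))

;; Decomposed first, then precomposed by normalization.
(ucs-normalize-NFC-string (betacode-sketch-decomposed "a)/"))
```

The last form should give ἄ as the single character U+1F04, while betacode-sketch-decomposed alone gives the three-character decomposed form.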

Now, to transliterate to anything human-readable, it should be just a matter of pruning out the excesses.

(setq testes "αἰτία βασιλεύς γίγνομαι δῶρον εἶδος Ζεύς ἡδύς θεός ἰδεῖν κέρδος
λαός μοῖρα νοῦς ξένος ὁμιλία πίνω ἐρημία ῥόδον ποίησις τίκτω ὕβρις φίλος
χάρις ψυχή ὠμός. ἄγγελος ἀνάγκη ἄγχω σφίγξ. ἀγορᾷ κεφαλῇ λύκῳ ᾠδῇ")

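;; convert to Beta Code first, so that * (the last result) holds it
(mapconcat #'unicode->betacode-char (string-to-list testes) "")
;; sigmas: drop the digits that mark the sigma variants
(replace-regexp-in-string "[123]" "" *)
;; letters; done before the breathings, lest the h of a rough
;; breathing be taken for an eta
(replace-regexp-in-string "h" "ē" *)
(replace-regexp-in-string "f" "pʰ" *)
(replace-regexp-in-string "q" "tʰ" *)
(replace-regexp-in-string "x" "kʰ" *)
(replace-regexp-in-string "w" "ō" *)
;; rough breathing: an h before the vowel or diphthong
(replace-regexp-in-string "\\([aeēioōu]+\\)(" "h\\1" *)
;; soft breathing:
(replace-regexp-in-string ")" "" *)
;; acute accent:
(replace-regexp-in-string "/" (string 769) *)
;; grave accent (a literal backslash, hence the four):
(replace-regexp-in-string "\\\\" (string 768) *)
;; diaeresis:
(replace-regexp-in-string "[+]" (string 776) *)
;; circumflex:
(replace-regexp-in-string "=" (string 770) *)
;; the result below also has the capitals, upsilon as y, the nasal
;; gamma, the iota subscript and the aspirated rho dealt with; those
;; steps are not listed here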

=> "aitía basileús gígnomai dō̂ron eîdos Zeús hēdýs tʰeós ideîn kérdos laós moîra
noûs xénos homilía pínō erēmía rʰódon poíēsis tíktō hýbris pʰílos kʰáris psykʰḗ
ōmós. ángelos anánkē ánkʰō spʰínx. agorâi kepʰalē̂i lýkōi ōidē̂i"
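For repeated use, the steps can be kept in one ordered list and applied in a loop, instead of being evaluated one by one. This is a sketch under assumptions: the names betacode-sketch-rules and betacode-sketch-romanize are made up, and only the replacements listed above are covered, so capitals, upsilon as y, the nasal gamma and the iota subscript are left alone.

```elisp
;; Hypothetical rule list: (REGEXP . REPLACEMENT), applied top to bottom.
(defvar betacode-sketch-rules
  `(("[123]" . "")                        ; sigma variants
    ("h" . "ē") ("w" . "ō")               ; long vowels, before the breathings
    ("f" . "pʰ") ("q" . "tʰ") ("x" . "kʰ")
    ("\\([aeēioōu]+\\)(" . "h\\1")        ; rough breathing
    (")" . "")                            ; soft breathing
    ("/" . ,(string 769))                 ; acute
    ("\\\\" . ,(string 768))              ; grave
    ("[+]" . ,(string 776))               ; diaeresis
    ("=" . ,(string 770)))                ; circumflex
  "Ordered replacements from Beta Code to a rough transliteration.")

(defun betacode-sketch-romanize (beta)
  "Apply every rule in `betacode-sketch-rules' to BETA, in order."
  (let ((case-fold-search nil))
    (dolist (rule betacode-sketch-rules beta)
      ;; The fourth argument keeps the replacement's case as written.
      (setq beta (replace-regexp-in-string (car rule) (cdr rule) beta t)))))

(betacode-sketch-romanize "u(/bris2")
```

The order of the rules matters: the long vowels have to come before the rough breathing, or the h it inserts would be turned into ē as well.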
