Hearing in Tongues
After creating a Spanish agglomerative insult generator last week and having recently read Neal Stephenson’s Snow Crash, I had language and Babel on the brain.
The other day I was looking at some Chinese money I found in a pocket of an old pair of pants, marveling at the weirdness of the Manchu script that appears in several places beneath all the other supposedly indigenous Chinese languages. It’s relatively easy to identify a language with a peculiar script, but how do you distinguish between superficially similar languages, say Italian and Spanish or Dutch and German?
Well, for starters, you could use my Bayesian filter Java application. It accepts any number of example language files (in my case, selected texts from Project Gutenberg) and the text to be identified, either on the command line or in a file. In a matter of seconds, it determines to the best of its ability the linguistic provenance of said text. Though it overconfidently and erroneously identifies languages it hasn’t been “trained” in, it has yet to misidentify a language it has been given as an argument.
To create the script, I modified Adam Parrish’s example code so that, rather than listing every language and assigning each a score, it names a single language and then displays the word that proved most relevant to the program’s assessment. If you feed the script a language it doesn’t know and it confidently classifies the text as, say, Danish, the word it found most relevant is usually one that is unusual in Danish and much more common in the actual language, which a quick Google search will reveal.
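For the curious, here is a minimal sketch of how a filter like this can work. It is not my actual program, nor Adam Parrish's code: the class name LanguageGuesser, its method names, the word-level tokenizing, the add-one smoothing, and the "most telling word" rule are all assumptions made for illustration. Each language gets a bag of word counts, a text is scored by summing log-probabilities of its words under each language, and the highest score wins.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the original program: a word-level naive Bayes
// language guesser with add-one smoothing and equal priors for every language.
class LanguageGuesser {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    // Lowercase the text and split it on runs of non-letter characters.
    static String[] tokenize(String text) {
        String trimmed = text.toLowerCase(Locale.ROOT).replaceAll("^[^\\p{L}]+", "");
        return trimmed.isEmpty() ? new String[0] : trimmed.split("[^\\p{L}]+");
    }

    // Add the words of one sample text to the counts for a language.
    void train(String language, String sample) {
        Map<String, Integer> wordCounts = counts.computeIfAbsent(language, k -> new HashMap<>());
        for (String word : tokenize(sample)) {
            wordCounts.merge(word, 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Smoothed log-probability of a word in a language; the extra 1 in the
    // denominator leaves room for words never seen during training.
    private double logProb(String language, String word) {
        int count = counts.get(language).getOrDefault(word, 0);
        int total = totals.getOrDefault(language, 0);
        return Math.log((count + 1.0) / (total + vocabulary.size() + 1.0));
    }

    // Return the trained language with the highest total score, or null if
    // nothing has been trained. It always picks one of the trained languages.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String language : counts.keySet()) {
            double score = 0.0;
            for (String word : tokenize(text)) {
                score += logProb(language, word);
            }
            if (score > bestScore) {
                bestScore = score;
                best = language;
            }
        }
        return best;
    }

    // Word in the text that most favours `language` over its strongest rival.
    String mostTellingWord(String text, String language) {
        String bestWord = null;
        double bestMargin = Double.NEGATIVE_INFINITY;
        for (String word : tokenize(text)) {
            double rival = Double.NEGATIVE_INFINITY;
            for (String other : counts.keySet()) {
                if (!other.equals(language)) {
                    rival = Math.max(rival, logProb(other, word));
                }
            }
            double margin = logProb(language, word) - rival;
            if (margin > bestMargin) {
                bestMargin = margin;
                bestWord = word;
            }
        }
        return bestWord;
    }
}
```

To try it, call train once per language with a sample text, then classify on the mystery text, then mostTellingWord with the winner. Because classify can only ever return a language it was trained on, a sketch like this shares the overconfidence described above: hand it an unfamiliar language and it will still pick the closest one it knows.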