Sufficiently Wrong: Linguistics: Comparative Linguistics

This post was started before I realized I needed to split the linguistics-related material into several chunks. Although I've done some re-editing, it's possible it still retains some pre-split things.

What does it Mean when Languages are Related?

Essentially, two languages are related if they both derive - through language change - from a common ancestral form. This of course opens up the possibility of having later, intermediate nodes from which languages derive, such as something along the lines of this graph:

Uralic languages, one proposed tree model.

To posit that two languages are more closely related within a tree than some other languages in the same tree, what we should look for are shared innovations. Shared retentions - if we can tell them apart from shared innovations - are not indicative of any close relation.

Reconstruction works by looking at the earliest attested forms and the dialectal variation of each language, figuring out the situation in as many of them as possible, and then comparing the result to other languages that may be related (for which the same process is carried out, if those languages form a clear group as well, to see whether facts in one group help explain things in the other ...). In the end, this can give a situation where the data from the languages interlock so clearly, and solutions for one so reasonably support solutions for the other, that we must assume there is some historical connection.

Nowadays, it seems the first step in determining how a family of languages is internally related uses lexicostatistics. If a subset of the languages share some innovations, and apart from that also share a significant portion of similar words, they are assigned to a common branch. This is done recursively. An example would be a case where some numerals have been replaced - if the majority of languages in the family have something like

              one   two   three  four  five   six    seven
Language 1    ara   mas   des    nok   kido   wokan  sisen
Language 2    ora   maþ   des    nox   kiro   wogau  sideu
Language 3    ala   wa    des    nuk   kidu   wukam  sitim
...

and two have

Language N-1  ale   me    te     nu    kilu   irme   toru
Language N    ala   ma    di     no    tsilo  irmi   doro,

the new "irme/irmi" and "toru/doro" should suggest these two languages may form a node. Other such evidence increases the likelihood that they indeed are more closely affiliated to each other than to the others.
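The bookkeeping behind this can be sketched in a few lines of Python. This is only a toy under stated assumptions: the cognate classes are assigned by hand (deciding what is cognate with what is the actual linguistic work), the labels "A" and "B" and the function name `shared_innovations` are made up for the illustration, and treating the majority class as the retention is a simplification that a real analysis would have to justify.

```python
from collections import defaultdict
from itertools import combinations

# Hand-assigned cognate classes for the seven toy numerals above.
# "A" = the root found in most of the family, "B" = a replacement root.
# One character per numeral slot (one .. seven).
cognate_class = {
    "L1":   "AAAAAAA",
    "L2":   "AAAAAAA",
    "L3":   "AAAAAAA",
    "LN-1": "AAAAABB",
    "LN":   "AAAAABB",
}

def shared_innovations(classes):
    """For each pair of languages, count the slots where both share a minority class."""
    n_slots = len(next(iter(classes.values())))
    counts = defaultdict(int)
    for slot in range(n_slots):
        # Group the languages by the cognate class they show in this slot.
        by_class = defaultdict(list)
        for lang, cls in classes.items():
            by_class[cls[slot]].append(lang)
        # Simplifying assumption: the largest group shows the retention.
        majority = max(by_class.values(), key=len)
        for langs in by_class.values():
            if langs is majority:
                continue
            for pair in combinations(sorted(langs), 2):
                counts[pair] += 1
    return dict(counts)

result = shared_innovations(cognate_class)
print(result)
```

Pairs with many shared minority classes are candidates for a common node; pairs that never show up share only retentions, which by themselves say nothing about subgrouping.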

Further on, other things can be taken into account: the sound changes likely to have operated on the languages, later loans, etc. In many families, the details of the structure of the tree are not well known: there are even some open questions in Indo-European linguistics, and even more so in Afro-Asiatic, Uralic, ... linguistics.

If two contemporaneously spoken languages are said to be related, it is generally not the case that one of them is older than the other, or that one of them has "come from" the other. They both derive from a previous form, and have changed in different ways from the common starting point. Saying that German is older than English, or that Italian is older than French, would be misleading and wrong. We can, however, assess how conservative a language is, but that is also an iffy business, as our assessment can be misled simply by the criteria we set up to determine age - we easily select criteria that would lead to whichever conclusion we want to draw.

We could probably come up with criteria by which English is older than Swedish (retention of /ð/ and /θ/, retention of person inflection of verbs), criteria by which Swedish is older (retention of reflexive pronouns, retention of gender to some extent, retention of strong adjectives), criteria by which neither is particularly old (definite and indefinite articles), criteria that show Swedish has innovated a lot (verbs inflected for passive, definite article is a suffix, loss of person inflection), criteria by which English has innovated a lot (about half a metric ton more Latin loans), criteria by which both are quite old (retention of some words from PIE), etc. It is not obvious how much any given thing counts towards being older or younger. If we want to, we can definitely show either to be older, though, if we lend any credence to the idea of the age of a language.

Of course, with regard to dead languages or fossilized/standardized written forms of a language, it is possible to speak of one form being older, but in that case we are merely counting from the day the language went extinct, the standard was published, or the fossilized form went out of actual everyday use. A lot of errors along these lines are made in Acharya's books, which assume that one out of two simultaneously spoken languages represents the ancestral form. Language change does not halt, although it does happen at different paces.

Comparative Linguistics

As already pointed out, sound change tends to be fairly regular, but since it leads to information loss over time it is not reversible. (The information loss per se does not lead to any problem though, as each new generation fills the gaps about as fast with new information.) An example of such an information loss would be this:

Imagine a language where syllables are of the form CVC (consonant-vowel-consonant), and there are no restrictions on which consonants can go in the onset and which can go in the coda of syllables. Let's say this language has a set of consonants including /p t k b d g/. Let us say this change occurs:

[d] → [t], /_# (the symbol # marks word boundary)

That would mean d turns into t at the end of words. If this language were an isolate (no related languages or dialects whatsoever), isolating (words do not inflect at all) and unwritten, it is fairly likely that we could never know, after the fact, that such a change had happened.

The fact that sound change can be lossy means we cannot simply apply the change in reverse when trying to reconstruct an earlier stage: [d] → [t], /_# and [t] → [d], /_# are not perfect reversals of each other. In the latter case, instances of t# that preexisted the sound change [d] → [t], /_# are also made into [d]. Since the resulting word form does not necessarily tell us which words have been hit by the change and which already had [t] earlier, we may need to look at other inflected forms, in case they retain the other consonant, in the hope that no other change has happened that muddies the waters further. Evidence may also be obtained from potentially related languages where a different sound change has occurred.
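To make the lossiness concrete, here is a minimal sketch in Python. The four CVC words are invented for the illustration, and the function names `final_devoicing` and `naive_reverse` are just labels for the two rules discussed above.

```python
import re

def final_devoicing(word):
    """[d] -> [t] / _#  : word-final d becomes t."""
    return re.sub(r"d$", "t", word)

def naive_reverse(word):
    """[t] -> [d] / _#  : the attempted 'undo' of the rule above."""
    return re.sub(r"t$", "d", word)

# Made-up words of the earlier stage; "pad" and "pat" are distinct here.
before = ["pad", "pat", "kub", "git"]

# After the change, "pad" and "pat" have merged into one form.
after = [final_devoicing(w) for w in before]

# Reversing the rule over-applies: original final t is turned into d as well.
restored = [naive_reverse(w) for w in after]

print(before)
print(after)
print(restored)
```

The merged form alone no longer says which of the two original words it came from, which is exactly why evidence from outside the bare word form is needed.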

Any evidence - written texts, irregularities in morphology, forms in related languages or in dialects, etc. - can be used. Say we have some inflected form where a word shows an alternation between [t] and [d], where [t] appears whenever it's the last sound of the word, but [d] appears whenever a suffix is applied. This would suggest to the linguist that some sound change - either [t] → [d] /V_V (t turning to d between vowels) or [d] → [t] /_# - has occurred. Looking at other words, he might notice that several voiceless [t] do occur in positions where they should have turned to d if the first sound change had applied, so the latter change can be assumed until counterexamples to it are found. Let us further assume some words are found where [t] occurs both in final position and with suffixes - we can infer that these words have likely not been hit by any sound change. This way, something about the previous stage of the language can be reconstructed. Of course, it is quite possible that stage also had similar remains of earlier changes, and the stage before that probably had them as well (ad infinitum), but at least some information has been obtained.
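The reasoning in this paragraph can be sketched as a small function. It is a toy under loud assumptions: the words and the suffix "-a" are invented, the function name `reconstruct_stem` is hypothetical, and it only knows about the single t/d alternation discussed here, assuming the change [d] → [t] /_# has already been chosen as the better hypothesis.

```python
def reconstruct_stem(bare, suffixed, suffix):
    """Guess the pre-change stem from a bare form and a suffixed form.

    Assumes the only relevant change is [d] -> [t] / _# and that
    `suffix` is a non-empty string ending the suffixed form.
    """
    stem = suffixed[: -len(suffix)]
    # Alternation t (word-final) ~ d (before suffix): the suffixed form
    # was protected from the change, so it preserves the older consonant.
    if bare.endswith("t") and stem.endswith("d") and bare[:-1] == stem[:-1]:
        return stem
    # No alternation: nothing suggests the word was hit by the change.
    return bare

# (bare form, suffixed form) pairs from an invented language
paradigms = [("pat", "pada"), ("git", "gita"), ("kub", "kuba")]
stems = [reconstruct_stem(bare, suffixed, "a") for bare, suffixed in paradigms]
print(stems)
```

For a language with no suffixed forms at all, every word would fall through to the last line, and the earlier [d] would simply be unrecoverable from internal evidence.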

Turns out English does have a similar thing in some words:

leaf - leaves; wife - wives; loaf - loaves

More recent words ending in f do not have this behavior (unless they acquire it through analogy) - chief, chiefs. Dwarf originally pluralized as dwarfs, but it seems fantasy authors have made dwarves a rather popular form. The reason dwarf did not behave like leaf or loaf - despite being an old word ending in -[f] - can actually be understood by looking at the sound changes it has been through. Looking at cognate languages, we find forms like <dvärg>, <dwerg>, <Zwerg>, <Dwarch>, ... We also find that in Old English texts, forms such as <dweorg>, <dweorh> occur - but even if we did not know that, we could reconstruct the situation fairly well by recourse to cognate languages in this case. We can therefore probably assume dwarf did not end in -[f] at the time [f] → [v] /V_V occurred, but instead ended in another fricative. We notice a peculiar thing elsewhere in English, where some /f/ are spelled with <gh>. Positing that a previous velar fricative (like the consonant in Scottish English <loch>) occurred in those positions makes sense both from comparing to related languages and from sound changes we know to have happened elsewhere. In IPA, that sound is generally written [x]. Hence, we can pinpoint a change [x] → [f] (possibly in all contexts) to have occurred after [f] → [v] /V_V took place.
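The point about relative ordering can be checked mechanically. The sketch below uses schematic stems ("lef", "lax") and a schematic plural suffix "-es"; these are stand-ins of my own for the leaf-type and the laugh/dwarf-type words, not attested Old English forms, and the two functions implement only the two rules named above.

```python
import re

def f_voicing(word):
    """[f] -> [v] / V_V : f becomes v between vowels."""
    return re.sub(r"(?<=[aeiou])f(?=[aeiou])", "v", word)

def x_to_f(word):
    """[x] -> [f] in all contexts (the simplest version of the change)."""
    return word.replace("x", "f")

def attested_order(word):
    # Voicing applies first, so an f that arises later from x escapes it.
    return x_to_f(f_voicing(word))

def reversed_order(word):
    # Counterfactual: if x had become f first, voicing would have hit it too.
    return f_voicing(x_to_f(word))

for stem in ("lef", "lax"):
    plural = stem + "es"
    print(stem, attested_order(stem), attested_order(plural), reversed_order(plural))
```

With the attested order the old f-word alternates (f in the bare form, v in the plural) while the old x-word keeps f throughout; with the reversed order both would alternate, which is not what English shows.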

If we have good reason to suspect several languages to be related, we can even take this process a bit further, and combine evidence from several of them to reconstruct the language from which they all stem. Turns out when this is done with the Romance languages, the result is recognizably Latin, albeit with some flaws; some vocabulary items cannot be reliably reconstructed, and some morphology is entirely lost.

Pretty nifty, though - sound changes explain why some nouns in English have an f-v alternation in their morphology and why some other words that are old enough that they should have been hit by that change do not have it, and other things linguists are aware of (sociolinguistics, etc.) explain why dwarves is gaining ground on dwarfs these days. Will a laugh, many laughs follow the same path? Doubtful.

Early on, assumptions were made such as that cases/tenses/aspects/moods/... tend to be lost more often than gained. Such an assumption will tend to lead to reconstructions that have ever more features the further back they go. Nowadays, linguists realize that languages do gain features as well, and the Slavic languages with the largest number of cases, for instance, are considered to have gained cases since Common Slavic, rather than having retained all cases since Proto-Indo-European.

If we have a long written tradition for a language, we can trace what has happened to it along its existence. To some extent, the Romance languages provide an example of this, even though the written form often harked back to Latin until late medieval times.

This is a kind of puzzle that very talented people spent very great amounts of time and effort on in the 19th century, and still to this day - now with computer aid - people are doing it.

Turns out if we apply this kind of process to the languages of Europe and Asia, we find that there are a few bunches of languages that cluster together. The tree diagram of the Uralic languages provides one example of such. The relation between these languages has been known roughly for as long as the relation between the members of another important language family in Europe and Asia, viz. the Indo-European family.

During colonial times, scholars realized that Sanskrit had a remarkable similarity to ancient Greek and Latin, and to some extent even Gothic and Old Irish. The Iranian languages were soon included too, and before long Armenian, Albanian, the other Germanic and Celtic languages, the Slavic languages and so on were incorporated into this tree. Turns out sometimes, looking at evidence from one of these (sub)families can help in figuring out questions in the other families. This is mainly because these divergences have happened by regular changes. An original language, Proto-Indo-European, was posited, and several different but relatively similar reconstructions of it exist.

It was the ambition for a rigorous reconstruction of Proto-Indo-European that inspired the neo-grammarians to posit certain rules for the reconstruction of languages, so as to make it a properly scientific endeavor. Turns out this was rather successful - later on, archaeological records of two language families deriving from Proto-Indo-European have been found, and both can be derived from PIE using relatively few and simple changes. Nearly all cognates remaining between different groups of Indo-European languages can be shown to derive by regular changes, and the few exceptions can very well be genuine exceptions to the sound changes or results of interdialectal loans.

However, as Indo-European was being reconstructed, a problem arose: there seemed to be several sets of words that behaved somewhat irregularly, where the given laws would lead to different results. Ferdinand de Saussure proposed an explanation: all of these words had contained sounds which later had disappeared in all descendant languages, leaving no other trace than the unexplained, irregular changes that bothered linguists at the time. Due to the type of changes that occurred in these positions, and the later disappearance of the sounds, Saussure posited there had been throaty fricatives or somesuch in those positions, hence the name laryngeals. In Indo-European reconstructions they are now labelled h1, h2, h3, and H (for "unknown").

This, of course, sounds fairly unfalsifiable, and it does not really predict anything either. But luck struck! The Hittite civilization was discovered in Anatolia, and in its writings - in a cuneiform script borrowed from the Semitic-speaking Akkadians - there were signs designating throaty fricatives in positions where Saussure had predicted such sounds.

A further archaeological discovery that supports the Indo-European theory is the Tocharian languages from Western China, which could easily be derived from Proto-Indo-European despite having contributed almost nothing to the reconstruction itself.

No such reconstruction is attainable, it would seem, that would fit the Uralic and Indo-European languages together into one unified tree, nor is any such reconstruction available that would connect either of those to any of the other families for which some reconstructive work has been done, such as the Afro-Asiatic family, the Dravidian family, the Kartvelian family, etc. Attempts are being made, though, and it is possible some breakthrough may occur at some point. Some scholars posit that Afro-Asiatic and Indo-European will be proven to be related, some posit Indo-European and Uralic, etc. Fortescue seems to have presented a convincing argument in favor of linking Uralic and Yukaghir that now is widely accepted, and his arguments for linking Uralo-Yukaghir with Chukotko-Kamchatkan and the Eskimo languages are strong [1]. Ket of north Asia and Na-Dene of North America have been shown relatively recently to be related [2].

Another important thing is that we cannot derive English, Greek, Irish, Armenian or Latin from Sanskrit, nor from Greek, nor from Latin. There has been information lost between Proto-Indo-European and Sanskrit that had not been lost in a pre-English, pre-Greek, pre-Latin stage that influenced further sound changes in these languages, and likewise, information has been lost in pre-Greek that was retained up until Sanskrit and Latin, and so on; simply put, if these languages derived from Sanskrit/Greek/Latin, there would be different mergers between words in them than there are now, and a reconstruction would not run into a situation where a large number of words require ad hoc rules to apply to them. Since Occam's razor rules, we assume the hypothesis that requires the least ad hoc jigsaw puzzling.
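The merger argument can be illustrated with a toy check. Regular sound change maps identical forms to identical results, so if a candidate ancestor has one form where the supposed descendant has two different ones, regular descent is ruled out. The word lists below are invented, and the function name `regular_descent_possible` is hypothetical; it tests a necessary condition only, on whole words, not a real reconstruction.

```python
def regular_descent_possible(ancestor, descendant):
    """Return False if two homophones in `ancestor` have distinct reflexes.

    Both arguments map meanings to word forms. Passing this check does
    not prove descent; failing it is enough to make descent by regular
    sound change implausible.
    """
    reflex_of = {}
    for meaning, form in ancestor.items():
        reflex = descendant[meaning]
        # The same ancestral form must always yield the same reflex.
        if reflex_of.setdefault(form, reflex) != reflex:
            return False
    return True

# Invented data: language A has merged final d and t, language B has not.
lang_a = {"foot": "pat", "path": "pat", "fire": "kub"}
lang_b = {"foot": "pad", "path": "pat", "fire": "kub"}

print(regular_descent_possible(lang_a, lang_b))  # can B come from A?
print(regular_descent_possible(lang_b, lang_a))  # can A come from B?
```

Here B keeps a distinction that A has lost, so B cannot come from A; in the Sanskrit/Greek/Latin case each language has lost something the others retain, so the check fails in every direction and a common ancestor distinct from all of them is the economical hypothesis.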

Anyways, since sound change tends to be fairly regular, and different languages often have gone through slightly dissimilar sound changes, or had them in different orders and so on, could we possibly be able to figure out what the languages were like before they started changing? Turns out it is possible to some extent.

However, at every step along the way, the chance that some minor error creeps in grows a bit, and one should probably try to correct the intermediate steps by looking at possible evidence both from reconstructions that are chronologically deeper and less deep, as well as from the modern variants.

This makes reconstruction rather challenging, as it requires data from a multitude of languages as well as language (sub)families.

Onwards through time

We notice that one language over time can split into multiple languages, and this easily leads to the projection that probably all languages originate with one single language.

If all spoken human languages ultimately stem from one original language - which is not established - all languages would indeed be related, although we cannot currently establish what that family tree looks like at such a depth of time. It is also possible that in the early days of mankind, several tribes developed language somewhat independently, and modern languages descend from several such independent beginnings. Some languages may even descend from several of them. Another explanation for why languages spoken by humans could be internally unrelated could be some human tribes having adopted the language of some tribe of some other hominid species, if they had language [4].

However, as said, information loss (as well as information generation) happens during language change. At some point, the noise introduced by this loss of information will make it impossible to look much further back. Some linguists seem to place that line somewhere between 10,000 and 8,000 years ago. At least R.M.W. Dixon argues that areal diffusion (borrowing and other influences from one language to another) under some circumstances can be strong enough to make the very idea of language families almost untenable. His main evidence for this comes from the situation in the Australian languages, where he considers areal diffusion until very recently to have been strong enough not to permit linguistic relations to work along the family-tree model [3]. His view does seem to be held by a minority, but he is well-respected among scholars of Australian languages.

Nevertheless, in mainland Eurasia, the main families accepted by mainstream scholars are Dravidian, Afro-Asiatic, Indo-European, Uralic, Yukaghir (Uralic and Yukaghir very probably forming a family), Turkic, Mongolic, Tungusic, Korean, Japonic (Japanese and a few small, closely related languages spoken on various islands of Japan), Sino-Tibetan, Hmong-Mien, Tai-Kadai, South Caucasian, Northeast Caucasian, Northwest Caucasian, Chukotko-Kamchatkan, Ongan, Yeniseian and Austronesian. The theories that go any further back are not accepted by the mainstream, although some may be greeted with scholarly optimism. (Uralo-Yukaghir, for instance, has had a respected scholar recently propose a close relation to the Eskimo languages, with a relatively large number of systematic correspondences and a reasonably large vocabulary to boot.)

In addition to these, there are a handful of languages that it has not been possible to include in any family: Basque, Burushaski, Nivkh and Ainu, and finally a few extinct languages and families, such as the Hurro-Urartian languages, the Tyrsenian languages (Etruscan and relatives, potentially Indo-European) and Sumerian. For convenience, some of the smaller families in Siberia are sometimes catalogued together as Paleo-Siberian.

Proving any of these to be related has proven difficult - decades of work trying to untangle the tantalizing clues for Indo-European<>Uralic, Indo-European<>Afro-Asiatic, Turkic<>Tungusic<>Mongolian(<>Korean<>Japanese), Basque<>any of the Caucasian families, Burushaski<>Yeniseian, any kind of Caucasian<>Indo-European, any Caucasian<>Turkic, any Caucasian<>any other Caucasian, Uralic<>Turkic, ... have all proven futile thus far. Possibly we lack methods for it, the reconstructions we have thus far are flawed, or the information that would be necessary to demonstrate such a connection genuinely has been lost. Finally, Dixon's proposed explanation could also account for the inability to go any further back.

Requisite Caution

When an amateur sets out to prove that two languages are related, the result will generally be a list of similar-looking words or sentences. However, the similarities will be inconsistent. One rule will be applied for this pair, another rule for that pair. Internal etymologies will be ignored - generally ones in the language the amateur does not know very well. That is, the non-linguist will not look at language-internal explanations for a single word.
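What "consistent" means here can be shown with a crude tally. The sketch below compares only the first letter of each word, which is a drastic simplification of my own; the word pairs are well-known Latin/English examples, where Latin p regularly answers to English f and Latin c to English h, while habere/have is a standard example of a lookalike pair that is not cognate. The function names are hypothetical.

```python
from collections import Counter

# (Latin, English) pairs. The last one looks better than all the others,
# yet it is the one that does not fit the recurring correspondences.
pairs = [
    ("pater", "father"),
    ("pes", "foot"),
    ("piscis", "fish"),
    ("centum", "hundred"),
    ("cornu", "horn"),
    ("cor", "heart"),
    ("habere", "have"),
]

def initial_correspondences(word_pairs):
    """Tally which initial letters are matched with which."""
    return Counter((a[0], b[0]) for a, b in word_pairs)

def unsupported(word_pairs, min_support=2):
    """Pairs whose initial correspondence recurs fewer than min_support times."""
    counts = initial_correspondences(word_pairs)
    return [(a, b) for a, b in word_pairs if counts[(a[0], b[0])] < min_support]

print(initial_correspondences(pairs))
print(unsupported(pairs))
```

The amateur's list tends to consist almost entirely of pairs like the last one: each looks convincing in isolation, but each needs its own rule, so nothing recurs.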

Let's say we want to prove that English muskrat is a Greek word. We can set out and find some Greek etymology for it. We could posit, say, μῦς κράτεος, mus krateos. This is probably malformed Greek, but would mean something like "mouse/rat of strength", where the strength would refer to the strength of its smell. In that case, we have ignored the most likely native etymology, which simply is musk(y) rat, in favour of a theory that is supported neither by evidence nor by history: the Greeks did not have any particularly great influence on the English language at the point when speakers of English first discovered the muskrat. We do, however, know that people in the area where this contact happened had names for it starting with syllables along the lines of musk/mosk - viz. the Algonquins, who called it muscascus, or the Abenaki, who called it mòskwas. This form was borrowed, and then through a likely folk etymology reinterpreted as referring to its musky scent, to which rat was added. The kind of fake etymologizing which creates false explanations like the μῦς κράτεος above is surprisingly simple with just a bunch of dictionaries, some imagination and some lack of critical thinking. Familiarity with Occam's razor is helpful, and willingness to use it necessary, if we are to find any truth in etymological speculation.

As an example of pseudo-scholars who neither realize the use Occam's razor has nor care to take a swing at their hypotheses with it, one Edo Nyland systematically ignores the grammar of both the source and target languages in his theory that all languages (except maybe Chinese) derive from Basque. His musings on German, English and Hebrew should serve as examples of how not to reconstruct the history of a language and the form the language took at some stage.

Many linguistic arguments that try to drive home a political-nationalist point are based on similar bullshit. In Finland, Paula Wilson, a member of our Swedish-speaking minority (just like I am, by the way), distorts any Finnish place-name she comes across to show that the Indo-Europeans settled Finland first. It doesn't bother her if her fake etymology posits Celtic, Hittite or Iranian words here; those still are proof that a tribe of Swedes settled here before the Finns, as Celts, Slavs, Iranians and Romans all are Indo-European just as the Swedes are - as though which group was here first even is relevant somehow. Still, the Swedish-speaking media in Finland bought her arguments wholesale, even after scholars had convincingly refuted them.

The important points:

This wraps up the linguistics-detour for now. This has been a somewhat superficial look at how these things are researched - and way better sources exist. Understanding why some of Acharya's claims about linguistics are wrong does not require reading books on the topic, but it helps. Alas, I cannot give much greater detail on the Indo-European reconstructions and the justifications and problems in them - the details are not things I know by heart. However, some important points are these:

Linguistic reconstruction is precision work and not just looking at words that look similar and going "hey, these languages probably are related". Demonstrating that two languages are related takes painstaking work showing that regular correspondences between them can be established.
There are a lot of people who use misleading linguistic arguments to score points, for a variety of reasons. These reasons include religious, nationalistic and political delusions.
It is easy to mislead using false linguistic arguments, since people in general are not aware of the implications of the nature of sound change, the chance for coincidences[5], and so on.
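To give a feel for the last point, here is a back-of-the-envelope binomial calculation in Python. Both numbers fed into it are assumptions picked purely for illustration (they are not taken from [5] or from any real pair of languages), the model treats every comparison as independent, and the function name `prob_at_least` is made up.

```python
from math import comb

def prob_at_least(k, n, p):
    """Probability of at least k chance matches among n independent comparisons."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))

n_words = 1000    # assumed number of word pairs compared
p_match = 0.002   # assumed chance that one pair looks alike by accident

expected = n_words * p_match
print("expected chance matches:", expected)
for k in (1, 3, 5):
    print("P(at least", k, "matches):", round(prob_at_least(k, n_words, p_match), 3))
```

The looser the criteria for what counts as "similar" in sound and meaning, the larger the per-pair probability gets, and the longer the list of lookalikes that pure chance will deliver.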

[1] Michael Fortescue, Language Relations Across Bering Strait. Not giving an exact page, since this is the thesis the book in its entirety sets out to establish.

[2] Edward Vajda, A Siberian link with Na-Dene languages, see also http://anthropology.net/2008/03/27/more-on-vajdas-siberian-na-dene-language-link/ for a short article on it.

[3] Robert M.W. Dixon, Areal Diffusion and Genetic Inheritance details the problems involved.

[4] An idea presented in Justin B. Rye's Pleistocenese, an essay about paleolinguistic possibilities. This particular claim appears in this bit: http://www.xibalba.demon.co.uk/jbr/pleisto.html#IIII The fact that Europeans have some Neanderthal genes and the Papuans, Australian aborigines and some Asian tribes have some Denisovan genes kind of contributes a bit to the likelihood of this scenario. The essay has not been peer-reviewed, but the important point is rather that such a scenario is imaginable.

[5] Mark Rosenfelder, How likely are chance resemblances between languages. http://zompist.com/chance.htm


Tuesday, November 13, 2012
