Monday, April 22, 2013

Language Complexity, pt I

[This post has been in the drafts folder for about half a year already, with me occasionally adding or removing bits. I decided to complete the first half of it, and split it in two. There will also be a third post which in part answers what this has to do with the more general idea of this blog.]

Language Complexity

One common idea is that primitive people speak primitive languages, or vice versa: that primitive language is a mark of primitive people. This idea is badly mistaken - not only is it mistaken, it is also badly defined.

I found a rather intriguing but specific example of this on a forum recently: the idea that omission of vowels in the writing of Semitic languages is a sign of backwardness and primitiveness. The person opining thus did so in order to present Muslims in particular (and the Quran) as savages from a linguistic point of view. It might indeed appear to the average speaker of a language written using the Latin alphabet that regularly omitting a kind of sound from writing is primitive and rather inefficient.

However, let us look into this idea a bit more closely: is it perchance possible that our writing too omits phonetic information? Almost trivially, this turns out to be true - English marks neither pitch nor timing in its script. Yet both of those can convey crucial information, sometimes essentially negating the meaning of the entire utterance! Meanwile, omiting te ocasional leter and faling to us the rite word and faling t use capitalizaton correcly in english does not make an English text impossible to parse.

It would be interesting to know the Shannon entropy of Quranic Arabic, Biblical Hebrew, Persian in different scripts (including Arabic), English, Runic versions of Germanic and so on. I suspect that, indeed, there is slightly greater density to the information coded per letter in Arabic and Hebrew than in English. This would mean that a lost or damaged letter in a text with greater likelihood alters the meaning or makes the meaning irretrievable than in a language with less density in the writing system. Still, as I said - sometimes a rather significant piece of information may be entirely unmarked in English writing.
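As a rough illustration of what such a measurement could look like, here is a minimal sketch that computes the zeroth-order Shannon entropy (bits per character, from single-character frequencies only) of a string. The function name and the two toy strings are made up for this example. A serious comparison would need large corpora and a model that takes context into account, since single-character frequencies say little about how much a reader can recover from the surrounding text.

```python
import math
from collections import Counter


def entropy_per_char(text: str) -> float:
    """Zeroth-order Shannon entropy of `text`, in bits per character.

    Only single-character frequencies are used, so this is a crude
    estimate that ignores all context between characters.
    """
    total = len(text)
    if total == 0:
        return 0.0
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    # Hypothetical toy strings, far too short to mean anything statistically.
    with_vowels = "the man bought the car"
    without_vowels = "th mn bght th cr"
    print(entropy_per_char(with_vowels))
    print(entropy_per_char(without_vowels))
```

On real data one would run this over the same text in each script and compare the figures, keeping in mind that it measures the script and the sample, not the language as such.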

Where do we draw the line between backwards and too dense or advanced and sufficiently full of redundancy? It is all arbitrary.

But back to the more general issue - language complexity. Until the second half of the 19th century, those linguists who considered the genesis of languages assumed something along these lines: as a civilization emerges and consolidates, its scholars and leaders design and establish its language. Once the civilization reaches a certain level, laziness kicks in, and the language starts slipping into disarray.

These linguists felt this explained why languages such as Ancient Greek, Latin, Sanskrit, Persian and so on were such sophisticated languages, while their early descendants - Vulgar Latin and the early transitional forms leading into the medieval vernaculars of Romance-speaking Europe, various Prakrits, ... - were such haphazard and shoddy things.

How do we measure haphazardness in a language? How do we measure shoddiness? These were of course not the terms they would have used, but it is quite clear they thought of it in such terms. Latin was admirable, Sanskrit was "almost perfect", and so on - all quite subjective and emotional terms.

Meanwhile, they held that primitive peoples had not gone through this language-building stage, and thus spoke languages with imperfect grammar or downright less grammar. We would expect, if this were true, that languages such as English, Russian, Japanese, Swedish, Finnish and so on would have advanced and complicated grammars, whereas languages such as those spoken by Australian Aboriginals should have simple and primitive grammars: you Jane, me Yarrawalluma. Me have boomerang go kill Kangaroo. (Even then, the previous two sentences, although very bad English, could very well have been constructed using complicated grammatical rules, although obviously distinct from those of English. 'go kill', 'have X (verb phrase)', etc, all could be grammaticalized constructions.)

To test this, we would need some means of quantifying the complexity of a grammar. Another question: does complexity imply expressivity? Does expressivity imply complexity? Is it really expressivity we should quantify? - But expressivity is way more culture-specific, and it is very easy for a person who has grown up in one culture not even to realize how many things another culture's language can express.

Compare how Higgins thought Hebrew's perfect and imperfect aspects did not really express anything - they were just a failed attempt at emulating the tenses of more advanced languages. It is clear we cannot just assume that our not understanding a feature is proof that it does not really have any function.

Further, we cannot just count the number of forms and constructions we can see - clearly, one form can have multiple functions, and one function can affect multiple forms. (That is, a given function can be encoded as a change in the form of more than one word, in the rearrangement of morphology, word order, extra particles, intonation, ...) Let us consider one pretty cool example which is rather difficult to detect: hierarchies. To get there, though, we will have to consider the subject and object distinction, and how it is marked:

There are languages that distinguish subject and object by noun case:
mies osti auton (man buy.past.3sg car.acc)
auto paloi (car burn.past.3sg)
karhu raateli miehen. (bear savage.past.3sg man.acc)
In this case, the leftmost nouns are all nominative, which in Finnish is marked with a zero ending in most nouns. The -s in mies only appears in the nominative, so I consider that a nominative suffix for now.
Rearranging the word order may sound awkward for some of these particular sentences - few would ever say auton osti mies, altho' auton mies osti flies a bit lower under my radar for weird syntax. With different subjects and objects, however, both of those orders can work really well. Rearranging may alter the connotations of the sentence. A given order may code for more than one connotation:
miehen karhu raateli:  ~it was a man the bear savaged
karhu miehen raateli: ~it was a bear that savaged the man
miehen raateli karhu: ~the man was savaged by a bear, a bear savaged the man
karhu raateli miehen: ~the/a bear savaged a man

As I do not fully have native intuition for Finnish, I can only tell you these are roughly how a native would understand the connotations of different rearrangements - at least introspection doesn't tell me all the possible interpretations these sentences could have in different situations, and of course there's a double translation issue here, as I am no native speaker of English either. The most common one is subject first, verb in the middle. Case marking need not be in the form of suffixes - particles qualify as well, and we find this in Spanish (a, used with direct objects with certain properties) and Hebrew (et, again, restricted to objects with certain properties and apparently not mandatory).

This method of distinction is familiar to anyone having studied Russian, German, Greek, Latin, Sanskrit, and a whole bunch of other languages. It can be found on every continent.
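To make the mechanism concrete, here is a minimal sketch of how case marking alone can settle who did what to whom, whatever the word order. The mini-lexicon is hypothetical and hand-written for this example: it uses the Finnish forms from the sentences above, plus karhun (the object form of karhu), which I have added. Verbs are simply skipped, and the connotations carried by the different orders are ignored entirely.

```python
# Hypothetical mini-lexicon: surface form -> (English gloss, case).
# "karhun" does not occur in the examples above; it is added for symmetry.
LEXICON = {
    "mies": ("man", "nom"),
    "miehen": ("man", "acc"),
    "karhu": ("bear", "nom"),
    "karhun": ("bear", "acc"),
    "auto": ("car", "nom"),
    "auton": ("car", "acc"),
}


def roles(sentence: str) -> dict:
    """Assign subject and object from case marking alone, ignoring order."""
    result = {}
    for word in sentence.lower().rstrip(".").split():
        if word in LEXICON:  # anything not in the lexicon (verbs) is skipped
            gloss, case = LEXICON[word]
            result["subject" if case == "nom" else "object"] = gloss
    return result


if __name__ == "__main__":
    for s in ("karhu raateli miehen", "miehen karhu raateli", "miehen raateli karhu"):
        print(s, "->", roles(s))
```

All three orders give the same role assignment, which is the point: the information sits on the nouns, so the order is free to do other work.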

A method common in lots of languages in Africa - but also attested elsewhere, including, in some analyses, spoken French! - is to have the verb carry agreement markers for the subject and object. Thus, in most Bantu languages, there is a rigid order of prefixes to the verb, one agreeing in noun class (a lot like gender) with the subject, another with the object, and depending on the language there may be several other prefixes as well. Since the verb tells us which of the nouns is subject and which is object, there is little need to mark the nouns or use any other method. However, what if the nouns are of the same class? We will get to that a bit later. Subject agreement on the verb - which is common in large parts of the world too - does already by itself help a lot of the time, especially if there is gender and number agreement (or comparable) in the morphology. In Russian, neuter nouns do not distinguish accusative and nominative, nor do inanimate masculine nouns. I do not know which method is used to resolve which is the subject and which the object when a noun of one of those types acts on a noun of the other, except that in the past tense, the verb agrees with the subject in gender - which would resolve this situation.

Another common method is word order. This is present in English and Swedish, and partially in many languages that do have the above style of object marking as well. In English, rearranging the order of subject and object is not really permissible at all, although you can front the object at times - but in those cases, you obtain OSV rather than OVS. Swedish requires OVS in such frontings, and seems to be more permissive when it comes to inverting the object and subject, without any grammatical marking involved. Oftentimes, this involves pronouns - which in English and Swedish still carry object case marking - or nouns that are strongly semantically associated with their verbs.

Such association should probably count as part of grammar! Similar things probably apply in the situation where both subject and object in a language marking them on the verb are of the same class. How do we quantify that aspect of a grammar? To really drive that point home, though, let's look at a final class of languages:

There are languages where neither word order, case marking nor verb inflections are used, yet speakers can with great likelihood identify which is subject and which is object. What magic is this?

The explanation for this is the presence of grammatical hierarchies. In these languages, subject and object are resolved by recourse to a hierarchy: the noun higher in the hierarchy is the noun that generally is more likely to be the subject. Is this grammar? Yes. It quite clearly is, yet it is also rather obvious that it might not be easy to spot the existence of such a grammatical detail.
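A minimal sketch of such a resolution rule follows. The post does not say what any actual hierarchy ranks, so the animacy-like ordering below, the noun classes and the function name are all assumptions made purely for illustration, not a description of any particular language.

```python
# Hypothetical ranking: a lower number means higher in the hierarchy.
HIERARCHY = {"human": 0, "animal": 1, "inanimate": 2}

# Hypothetical classification of a few nouns.
NOUN_CLASS = {"man": "human", "bear": "animal", "car": "inanimate"}


def resolve(noun_a: str, noun_b: str):
    """Guess subject and object from hierarchy rank alone.

    Returns None when both nouns sit at the same level, because the
    hierarchy by itself cannot decide that case.
    """
    rank_a = HIERARCHY[NOUN_CLASS[noun_a]]
    rank_b = HIERARCHY[NOUN_CLASS[noun_b]]
    if rank_a == rank_b:
        return None
    if rank_a < rank_b:
        return {"subject": noun_a, "object": noun_b}
    return {"subject": noun_b, "object": noun_a}


if __name__ == "__main__":
    print(resolve("car", "man"))   # the order of mention does not matter
    print(resolve("bear", "man"))
    print(resolve("man", "man"))
```

Note that the rule lives in neither the nouns nor the verb nor the word order: nothing visible in the sentence itself shows that it is being applied, which is exactly why it is hard to spot.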

How does that hierarchy compare, quantitatively, to a case marking system or a verb marking system? Is it even a well-defined question?

The reason we have discovered that grammatical detail, probably, is along these lines: in many languages, the subject-object distinction is rather central. (The existence of subjects in language is not, apparently, universal, though! Objects, however, apparently are.) When linguists have come across languages where there is no immediately apparent way of distinguishing subjects from objects, they have studied the language until they have realized what exactly is going on under the hood.

It should be rather obvious that other distinctions that may be encoded in similar ways, but which we have no idea we should go looking for, may very well exist in languages around the world - even in European languages! For this reason alone, it should be obvious that speaking about one language or another having more or less complex grammar is premature. I find it likely the basic amount of grammar needed to get simple statements and queries understood varies a bit, but I am pretty certain the difference in amount of grammar between languages is not large. However, that of course would be predicated on having a reasonable way of quantifying grammar in the first place - and a way of being sure when we have mapped out all the grammar present in a -lect of some kind (idiolect, sociolect, dialect, ...).

Even now, the assumption among those without any education in linguistics seems to be that the complexity of a language stands in direct relation to the advancedness of a culture or the social standing of the speaker community. This mistaken notion should have been abandoned decades ago.

We know now that complexity in languages is not a result of a cabal of clever people developing the language as civilization emerges, and we know language change is not the result of their meticulous work being abandoned due to laziness in subsequent generations. I will leave the obvious question - where does language complexity come from - to the next post on the issue.


  1. You make some great points here!

    One thing that should be noted with respect to your "vowel-less Semitic is primitive" example is that claims such as this confuse LANGUAGE and SCRIPT, two entirely different things. The same language can be represented by using different scripts and the same script can be used for very different languages. Whether the vowel-less Semitic script is more primitive is a question that can be debated (one could even claim that while Semitic peoples had a writing system, however imperfect or "primitive", the linguistic ancestors of the critics had NONE!), but it has nothing to do with Semitic languages per se.
