
Thursday, November 14, 2013

A review of Strange Linguistics (Mark Newbrook)

A review of Strange Linguistics - a skeptical linguist looks at non-mainstream ideas about language. 


I have for a while tried to write a review of Strange Linguistics (Mark Newbrook with Jane Curtain and Alan Libert). As a relative newbie to writing reviews, I find this somewhat of a challenge, especially as the book is difficult to summarize. It does not set out to prove or discuss any one specific hypothesis - it is rather an overview of a large number of pseudoscientific theories, complete with short explanations of why these theories are pseudoscience in the first place. Thus, it is difficult to conclude whether it provides a sufficient argument in favor of some hypothesis - no such intention is set out. This lack of a single focus does not detract from the work, but it does make the reviewer's life somewhat more difficult.

Newbrook et al. do give the claims, in general, a fair hearing, and proceed to explain why these claims do not cut it. In the introductory chapter, Newbrook dutifully explains how some of these mistaken views are probably entirely harmless, but how others can easily be used to inflame ethnic conflict or just generally trick people - I find the claims made by the likes of David Oates especially likely to ruin people's lives over badly justified accusations:
Oates and his followers have applied the analysis of RS [reverse speech] in various practical domains, some of them involving matters of great sensitivity and potential harm. If RS is not genuine, this work is valueless at best and quite possibly extremely damaging. The areas in question include child psychology, alleged cases of child molestation, other alleged criminal offences (this includes the 'O.J Simpson' case) and the analysis and treatment of sexual and other personal problems more generally. [1, p. 168] 
As for the fairness Newbrook grants, it is well worth noting that he has led a research project into linguistic material provided by alleged alien abductees, with entirely inconclusive results, which he elaborates on in some detail in the chapter on language from mysterious sources. (By 'inconclusive', take this to mean that Occam's razor justifies rejecting the claims of alien origin for these allegedly alien linguistic snippets, which indeed is the conclusion Newbrook draws from his research.)

For some of the claims the authors investigate, there would be some justification for a more detailed explanation of why they are wrong. If the book had overviews of topics such as the statistical likelihood of chance resemblances between languages, the comparative method, and some other relevant parts of linguistics, it could be very useful indeed.

It is definitely a good book if you already have some background in linguistics. It would also be a worthwhile addition to the library of any scholar or journalist who is not well-versed in linguistics but on occasion has to evaluate claims that deal with language - provided they are willing to do some extra research on their own or, alternatively, to accept the claims of a bona fide linguist without looking closer at the evidence in his favor. As for journalists, I would even say the relevant chapters of this book should be read before writing any article on linguistic matters whatsoever. Alas, the lack of clearer elaboration on linguistic methodology might make it a bit too inconclusive for those unfamiliar with the field.

Linguists themselves probably can figure out the problems with various claims such as those presented in this book - and doing so could be a good exercise for a course in skepticism for undergraduate linguists (and even more so students of philology, whose understanding of linguistics sometimes may leave some room for improvement). Ultimately though, the book presents little new for the linguist - except maybe as a convenient source to refer to when there is no time to devote to the proper debunkage of some claim, or as an overview of exactly what kinds of weird beliefs about language are being peddled on the marketplace of ideas (which can be a bit of a shock even to seasoned skeptics).

If the book is ever translated, local crackpot linguistic theories should probably be given a more in-depth treatment. Swedish or Finnish translations should probably include more detailed investigations into both Ior Bock's and Paula Wilson's claims (quite distinct types of claim, even if both are wildly wrong; Ior Bock's claims are described and rejected for the same reasons as any number of other claims, while Paula Wilson is not mentioned at all - for a non-Scandinavian audience an entirely justified omission). Any Indian edition should probably debunk the various notions regarding Sanskrit that are popular there, Hungarian editions would need to elaborate on why it is unlikely that Hungarian is related to the Turkic languages, and so on. How such supplementary chapters would be written and incorporated into the book would probably be a challenge, though.

There is a certain morbid humor to reading it: the endless amount of bullshit that humans have come up with is as fascinating as any good supernatural thriller. Newbrook in a way comes off as the straight man in a comedy, granting much leeway to the strange antics of a weird coterie of peculiar thinkers and crackpots. The amount of leeway he grants may seem excessive at times, but many of these theories are so wrong that even the loosest criteria are enough to debunk them.

There are two chapters whose inclusion may at first seem odd - one on skepticism of mainstream linguistics, which does present some reasonable objections to Chomskyan (and related) linguistics, and another on constructed languages. Some people who construct languages do indeed base their hobby on pseudo-scientific notions of how language works - oddly enough, this is especially prevalent among those who wish their languages to have an actual population of speakers. However, the inclusion of languages that are framed as fiction or as part of fictional worlds would be decidedly odd were it not for the fact that non-practitioners of that particular hobby may misunderstand it. Here, the treatment could have made it clearer that hobbyists often do not see their hobby as any kind of scientific statement or claim, but rather as a work of 'art' or similar. That chapter could have done with somewhat better research, but at the same time it might be the least important chapter, and therefore not investing much in getting a detailed picture of the constructed-languages scene is quite justified.

The main drawback, as far as I can tell, is the lack of an index, making it difficult to find things quickly. An index would improve its usability especially for journalists, who often write with very strict deadlines looming. Some of the particular claims listed could fit in several different chapters according to the classification (and some are, indeed, mentioned in several places, often with a pointer to where the main treatment occurs). I imagine a more lexicon-like layout could have worked, and would have provided an easy way of expanding the book in the future; on the other hand, that would separate the description of individual claims from the description of the types of problems that accompany specific kinds of claims.

In conclusion, it is a book that should probably be consulted by any number of people - especially non-linguists and journalists whose work at times intersects with linguistics - but there is some room for improvement. On the other hand, an edition incorporating the improvements I would suggest might grow unwieldy in size, and thus a complementary volume could perhaps be justified. However, to some extent such a volume would amount to a basic introduction to linguistics, the relevant parts of which should probably be learned before consulting this book anyway.

[1] Mark Newbrook with Jane Curtain and Alan Libert. Strange Linguistics - a skeptical linguist looks at non-mainstream ideas about language. Lincom Europa, 2013.

Sunday, October 20, 2013

The Christ Conspiracy: Chapter 15, pt 1

Chapter 15, The Patriarchs and Saints are the Gods of Other Cultures, at its core presents a thesis that is probably shocking to true believers, but nothing new to those who are somewhat well-read on these topics. Nevertheless, it also exaggerates the thesis to some extent, presenting statements that exceed what can be known as though they were fact. As usual, it also intersperses a fair share of speculation with tendentiously presented facts.

The most fascinatingly weird claim she makes is present in the segment on Noah. It is well known that the Noah myth derives from older myths, but Murdock is not content with that:
Xisuthros or Ziusudra was considered the "10th king," while Noah was the "10th patriarch." Noah's "history" can also be found in India, where there is a "tomb of Nuh" near the river Gagra in the district of Oude or Oudh, which evidently is related to Judea and Judah. The "ark-preserved" Indian Noah was also called "Menu." Noah is also called "Nnu"[SIC] and "Naue," as in "Joshua, son of Nun/Jesus son of Naue," meaning not only fish but also water, as in the waters of heaven. Furthermore, the word Noah, or Noé, is the same as the Greek νους, which means "mind," as in "noetics," as does the word Menu or Menes, as in "mental." In Hebrew, the word for "ark" is THB, as in Thebes, such that the Ark of Noah is equivalent to the Thebes of Menes, the legendary first king of the Egyptians, from whose "history" the biblical account also borrowed.[1, p. 238]
This is so full of mistakes that I figure a list will help keep track of the debunking.

  • Xisuthros/Ziusudra being the tenth king.
  • Oude/Oudh having anything to do with Judea (see 'short words' further down).
  • No source provided regarding the tomb of Nuh, no attempt to establish its age - is it potentially more recent than the arrival of Islam in India?
  • Naue or Nun signifying "water" - which particular language? We cannot all be superlinguists who can identify languages on sight just based on one monosyllabic word! Since no language is mentioned, it is also quite difficult to verify the relevance of the claim.
  • The far-fetched chain of etymologies Noah → νους → mentes → Menes, and Tevah → Thebes.
  • She clearly mentions the significantly more recent form "Noé" to make the apparent similarity to "noetics" (also a much more recent word) greater. Noé is a much more recent rendering of נוח; the comparison should be nous or noos to Noach, not noetics to Noé.

As it happens, the Hebrew word for 'ark' is TBH, תבה, rather than THB. Of course, Thebes is not T+H+B either - it is Θῆβαι; the "TH" bit is a single sound that in some languages - particularly English - is written using a sequence of letters (here, the fact that letters and sounds are separate things is relevant). Its being represented as the sequence TH does not signify any actual resemblance or relation to the actual sequence T+H. (In fact, Hebrew ת was probably sometimes pronounced a lot like English th is, which admittedly does not support my case particularly much.) What we reach here, however, is a phenomenon Mark Newbrook calls 'very short words'[2]. The designation refers to the phenomenon whereby the shorter the words we go looking for, the greater the chance that we find similar stems embedded in words in other languages. Thebes/Θῆβαι and Tevah share rather little - just one single syllable. Further, we have several forms for Thebes: Thebes, Θῆβαι, Ta-Opet (Classical Egyptian), Ta-Pe (Demotic), wꜣs.t (Classical Egyptian, not strictly related to the other words but signifying the same town). There is thus lots of space in which to come up with potential cognates for biblical names, and little material with which to falsify such claims (consider Ta-Opet vs. Tabitha or Tappuah, Ta-Pe vs. Tappuah, Topheth or Tobiah, etc. What makes Tevah more favorable for linking to any of these than the above, equally spurious suggestions?).

What takes this a step beyond Newbrook's label for similar bogus cognates is that there is not even any similarity between the meanings of תבה and Ta-Opet. Nothing makes the connection apparent. Murdock could keep going through the towns and kings of antiquity until she found something to connect Noah to; nothing per se forces Thebes to be the preferred alternative - and this opens up a huge space of unfalsifiable claims. Let us be generous and say there are only twenty relevant names in the Bible (considering how insignificant some of the characters she points to are, the number could easily be argued to be significantly larger). Let us also be generous and say there are only two hundred kings in antiquity. This gives us a whopping four thousand potential pairings - and, analogously to the birthday paradox, it is more than likely that quite a few of them are similar by chance alone.
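The arithmetic above can be made concrete with a few lines of Python. Note that the per-pairing probability of a chance resemblance is an invented illustration, not a measured figure:

```python
# Toy model of the search space: 20 biblical names x 200 ancient kings.
# Assume each pairing independently has some small chance of a
# superficial resemblance; the 1% figure is invented for illustration.
pairings = 20 * 200              # 4000 potential pairings
p_resemblance = 0.01             # assumed chance of a spurious "match"

expected_matches = pairings * p_resemblance
p_at_least_one = 1 - (1 - p_resemblance) ** pairings

print(expected_matches)          # 40.0
print(round(p_at_least_one, 6))  # 1.0 - a match is all but guaranteed
```

Even with a far smaller per-pairing probability, the sheer number of pairings makes a handful of chance "matches" the expected outcome rather than a surprise.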

A significant problem here is also the number of languages she has to use for this reasoning to make sense - does anyone really believe that there is a quadri- or pentalingual pun involved? For real?

Keep in mind, as well, the number of other flood-heroes: Utnapishtim, Δευκαλίων, Noah, ... - would we not expect these names to have some kind of similar connections? We would also probably expect such connections to be more clearly explicit.


Obviously, then, Noah's famous "ark," which misguided souls have sought upon the earth, is a motif found in other myths. As Doane relates, "The image of Osiris of Egypt was by the priests shut up in a sacred ark on the 17th of Athyr (Nov. 13th), the very day and month on which Noah is said to have entered his ark." Noah is, in fact, another solar myth, and the ark represents the sun entering into the "moon-ark," the Egyptian "argha," which is the crescent or arc-shaped lunette or lower quarter of the moon. This "argha of Noah" is the same as Jason's "Argonaut" and "arghanatha" in Sanskrit. Noah's ark and its eight "sailors" are equivalent to the heavens, earth and the seven "planets," i.e., those represented by the days of the week. As to the "real" Noah's ark, it should be noted that it was a custom, in Scotland for one, to create stone "ships" on mounts in emulation of the mythos, such that any number of these "arks" may be found on Earth.[1, p. 238]
Indeed, the ark did not exist, yet Murdock manages to turn even such an almost trivially true claim into pseudoscience. There are two separate words natha in Sanskrit (assuming nāthá, नाथ, is the word Murdock intends - since she does not follow any scholarly transliteration, we cannot know!), one signifying 'lord, master' and a whole complex of similar meanings, the other signifying a refuge or resort. The first is in the masculine gender, the other neuter. The Greek -naut is instead cognate with Greek ναῦς, Latin navis and Sanskrit नौ, नाव (nau, nava), all of these signifying ships. Again, it would have been helpful had Murdock provided some kind of scholarly transliteration of arghanatha. A list may help in keeping track of the problems:
  • argha noah is irrelevant, as ark only appears in the tradition of biblical texts about 700 years after they were originally composed, when Jerome translated them into Latin; until then, the main versions of the text had tevah and the Greek kibotos.
  • argha noah is a theosophic 19th century invention with no evidence in support of it. It is basically just as made up as the religions it is made up to subvert.
  • argha noah is clearly made up to correspond to the English phrase 'ark of Noah' (compare the phrase in some other languages - Noas ark, Nooan arkki, kovček Noya, something like kidobani nois (or maybe nois kidobani) in Georgian - whereas the phrase in Biblical Hebrew would have been Tevat Noah, and in Biblical Greek kibotos tou Noe or something in the vicinity of that; I do not know Biblical Greek well enough to make any promises). Had these fanciful authors spoken Georgian, their "Egyptian" concept would go by a name like kidobanois or somesuch. (The nature, btw, of this 'Egyptian' concept differs significantly from one author to another - one author says it is a yearly festival, another says it is a monthly lunar occurrence, and so on. See, for instance, Jordan Maxwell's The Naked Truth, where he attributes argha noah to something entirely different - suggesting to me that there is no validity to the concept whatsoever.)
Murdock goes on to make improbable claims about the meaning of Noah, Shem, Ham and Japhet:
The sons of Noah, of course, are also not historical, as Shem "was actually a title of Egyptian priests of Ra." The three sons of Noah, in fact, represents the three divisions of the heavens into 120° each. As characters in the celestial mythos, Noah corresponds to the sun and Shem to the moon, appropriate since the Semitic Jews were moon-worshippers.[1, p. 239]
Claiming that a given set of three persons corresponds to a division of something into three requires some kind of supporting evidence. What about these three makes this correspondence obvious? Why does one of the three - a whopping 120 degrees of the heavens - also correspond to the moon? This claim again comes from Hazelrigg the astrologer, rather than from any actual scholarly source.

Also, duly note that the word 'Semite' and related forms, as a designation for the Semitic peoples, only go back to the 18th century. The Biblical Jews did not describe themselves as Semites, and such a connection between moon-worship, Shem-as-the-moon and Semites cannot have occurred to the authors of the biblical narratives.
Abraham also seems to have been related to the Persian evil god, Ahriman, whose name was originally Abriman. Furthermore, Graham states, "The Babylonians also had their Abraham, only they spelt it Abarama. He was a farmer and mythological contemporary with Abraham."[1, p. 240]
No source is provided for Ahriman's name originally being Abriman - the etymology accepted among most scholars has it stemming from Angra Mainyu. As for the quality of Graham's work, I will refer to a source I would usually avoid using, but whose summary of this book seems fairly legit, viz. tektonics.org's review.[3]
Furthermore, Abram's "Ur of the Chaldees" apparently does not originally refer to the Ur in Mesopotamia and to the Middle Eastern Chaldean culture but to an earlier rendition in India, where Higgins, for one, found the proto-Hebraic Chaldee language. [1, p. 240]
Murdock's obsession with moving the Semitic family of languages to India grows absurd at times, and her reliance on Higgins' bumbling amateur linguistics is laughable. Worse, this is not even pure conjecture - it is counterfactual conjecture at best. At the very least, a claim such as this requires significant amounts of supporting data. Murdock provides but one datapoint, and that datapoint - Higgins' finding the Chaldee language in India - is nothing but wrong. Finally, Chaldee is not proto-Hebraic. Chaldee is closer in time to proto-Semitic than Hebrew is, but they are in different branches of Semitic.
In fact, the Greek name for the constellation of Bootes, or Adam, is Ιοσεφ or Joseph.[1, p. 250]
The source given for the claim that Bootes is Adam and is called Iosef in Greek is Karl Anderson (p. 126, Astrology in the Old Testament). Anderson fails to provide any source for this. Modern tools such as perseus.tufts.edu make it possible to search a very large corpus of ancient Greek texts for words such as Ιοσεφ. It turns out that not a single instance of the word occurs in a context where the surface meaning of the text has anything to do with any asterism - nor are there many instances of it in the first place. A similar claim is made in a quoted portion from Hazelrigg - another 19th-century astrologer who, as I keep emphasizing, did not bother with providing sources.

This kind of fabricated linguistics being included in the work of a person who regularly labels herself a linguist is saddening. The chapter itself could have been good - had she decided to go no further than the idea that most Old Testament characters have no historical background. But as it stands, she included far too much in the way of 19th-century theosophy.

[1] D.M. Murdock, The Christ Conspiracy, 1999
[2] Mark Newbrook, Strange Linguistics, 2013. There is an entire chapter devoted to the phenomenon; the opening pages of that chapter describe the relevant problems with this kind of approach to evidence.
[3] http://www.tektonics.org/gk/grahamlloyd01.html

Wednesday, June 26, 2013

Linguistics: Language Complexity and its origins

I previously pointed out some notional difficulties with linguistic complexity - difficulties in exhaustively measuring it, difficulties in defining it, etc. Nevertheless, it is undeniable that languages do have something we reasonably can call 'complexity' - and that this complexity appears on several levels. How does this complexity come about?

We first need to look at the context in which a language normally exists: the brain and the speech community.  The brain, obviously, is what produces linguistic utterances - and other brains parse these utterances. I think we can meaningfully say that grammar (as well as lexicon, as well as stylistics, as well as ...) all boil down to one thing: patterns. 

Side-track 1: Neural networks



Neural networks, as it happens, are pretty good at some things - among them, pattern recognition. Since patterns are the relevant thing here, we should probably start with them. Neural networks are a well-known and well-researched general architecture for pattern recognition, and our brains are themselves instances of neural networks - probably the most complex known examples, in fact.


Most readers probably have not read a lot about them, so I figure a short introduction is called for. I like to go for the excessively abstract when describing things - along the lines of "imagine an arbitrary multiset". I realize this does not work for most readers, so I will try to avoid it.

Imagine a simple sensory organ (oh man, there I go), where there are several sensors that pass on a signal if they are triggered. Different sensors react to different types of stimuli, to stimuli of the same kind but of different qualities, or even to the same kind of stimuli - but by virtue of being at different locations they still convey different information onward. Each sensor, when the right stimulus is present, sends a signal. Let us assume the signal is binary - yes or no.

Let us further imagine that there are a bunch of things - we call them nodes - that receive these signals and sometimes also pass them on, along directed edges (that is, lines from one node to another). These form a huge network, where each node can send (henceforth: fire) to other nodes, and likewise receive from other nodes. A node cannot, though, decide which nodes to fire to - firing always transmits on all outgoing edges.

Every edge is ascribed a weight. The weights of all simultaneously firing edges reaching a node are added up (or multiplied, or have some other function applied to them), and if the sum (or product) exceeds some threshold, the receiving node also fires.
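The firing rule just described can be sketched in a few lines of Python; the weights and the threshold here are invented purely for illustration:

```python
def fires(weights, active, threshold=0.5):
    """A node fires if the summed weights of its simultaneously
    firing incoming edges exceed the threshold."""
    total = sum(w for w, a in zip(weights, active) if a)
    return total > threshold

# Three incoming edges with weights 0.4, 0.3 and 0.2.
weights = [0.4, 0.3, 0.2]
print(fires(weights, [True, True, False]))   # 0.7 > 0.5 -> True
print(fires(weights, [False, False, True]))  # 0.2 alone -> False
```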
Now, this would be a static neural network - whenever the source sends a given signal, exactly the same things transpire down the line in the network. (Well, unless there is some loop where signals cause feedback - but we could consider such loops a type of source as well, so we ignore them or consider them a special case, for now at least. The brain does have loops, though.)

So, we add this mechanism: a firing node increases the weight of whichever incoming edges also carried a firing signal, and decreases the weight of whichever incoming edges did not simultaneously fire. Essentially, if node A fires after receiving signals along edges a, b and d, it will from then on listen more closely to those edges.

The exact function by which the weight is changed affects the neural network's properties - most texts I have read on this use sigmoid functions of varying steepness. In the brain, the function probably also varies with age, diet, time of day, what part of the brain it is happening in, etc. I am not very knowledgeable about the biological mechanisms involved, but I would figure various biochemical components play a role. (Complications can be added, but these do not alter the fundamental computational power of the network - they only change details of the implementation. Certainly such changes affect efficiency on specific problems, but a problem-agnostic architecture should preferably be as simple as possible. An example of such a complication could be adjusting the weight of firing edges downwards when the recipient node is not triggered. Another architecture has a second kind of edge, too - a blocking edge. A firing signal sent down a blocking edge prevents the recipient node from firing.)
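The update mechanism just described - strengthen the incoming edges that co-fired, weaken the rest - might look like this in Python. The step sizes are invented for illustration; a real implementation would use something like the sigmoid-shaped functions mentioned above:

```python
def update_weights(weights, co_fired, step_up=0.05, step_down=0.02):
    """After a node fires: increase the weight of each incoming edge
    that carried a firing signal, decrease the others (never below 0)."""
    return [w + step_up if c else max(0.0, w - step_down)
            for w, c in zip(weights, co_fired)]

weights = [0.4, 0.3, 0.2]
# The node fired; edges a and b carried signals, edge c did not.
weights = update_weights(weights, [True, True, False])
print([round(w, 2) for w in weights])  # [0.45, 0.35, 0.18]
```

From then on, the node "listens more closely" to edges a and b, exactly as in the node-A example above.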

Now, the cleverness of the system described above may not be obvious at first sight - and I know I am not good at explaining these things.

The system above only observes whether signals coöccur often enough to exceed some threshold when added up. If they do, their assigned weights are increased. Some things will coöccur by coincidence, whereas some things will coöccur because of a genuine correlation. E.g. 'twinkle twinkle little' tends to correlate with 'star', because, quite obviously, these are the lyrics of a somewhat popular lullaby. However, the neural network can also be unlucky - and have some non-correlating things appear together often in the input data, by coincidence. "Locally", the nodes do not have the power to reason about whether two signals they have seen coöccurring are a coincidence or a genuine correlation.

There is no flawless approach to weeding out coincidences from correlations. Good updating functions, however, may help a bit. A function that increases the value of incoming edges drastically will obviously start treating many coincidences as though they were bona fide patterns; on the other hand, a function that increases the value very conservatively may not accept even genuine correlations until they have occurred very often - and if the event is infrequent, the network may never adjust itself into recognizing it as a pattern. Meanwhile, the way edge weights are decreased is also important - a false positive that is only recognized by the network for a short while is not a problem in the long run; if the weights of edges that do not co-fire - or, even worse, of incoming edges that fire when the target node does not - are adjusted down quickly, this will probably remove false positives, but it may also remove genuine positives.
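That tradeoff can be illustrated with a toy simulation: track a single edge's weight when its signal co-fires with the node 90% of the time (a genuine correlation) versus 20% of the time (a coincidence). All numbers here are invented for illustration:

```python
import random

def track_weight(p_cofire, step_up=0.05, step_down=0.02,
                 trials=1000, seed=1):
    """Follow one incoming edge's weight over many trials; the edge's
    signal co-fires with the node with probability p_cofire."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(trials):
        if rng.random() < p_cofire:
            w = min(1.0, w + step_up)    # co-fired: strengthen
        else:
            w = max(0.0, w - step_down)  # did not co-fire: weaken
    return w

genuine = track_weight(0.9)      # drifts up toward the cap
coincidence = track_weight(0.2)  # drifts back down toward zero
print(genuine > coincidence)     # True
```

A more aggressive step_down would drive the coincidence to zero faster, but would also make the genuine pattern fragile to a short unlucky run - exactly the dilemma described above.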

The details of how the senses work are somewhat beside the point - by the time we consciously think about things such as language, or about objects according to the class we assign them to (cup, glass, pitcher, beaker, chalice, ... and the various objects that may belong to more than one of these classes), a bunch of layers of neurons have often already acted on the input. When you hear a word, it first passes through several layers of neural networks - one parses pitch content, another parses relative pitch contour, another parses and classifies the acoustic events as phonemes, another parses and classifies these phonemes as morphemes (and does guesswork that corrects possible mishearings), and another parses the result and tries to reconstruct the syntactic structure that generated the sentence in another mind.

Each of these can gainfully be described as pattern recognition: recognizing speech sounds requires recognizing sounds with some vague acoustic similarity, as well as having learned which kinds of variation in these patterns are to be expected. Recognizing a word is recognizing a pattern of sounds; error-correction draws on various other information to suggest that maybe some other word (or at least some actual word, unlike the audio that actually entered the process) is the one heard, as words and extralinguistic facts - things we have seen or know by other means - tend to pattern together.

The unit of recognition is the whole pattern, not the parts of the pattern - if pieces are missing or wrong but the pattern is nevertheless recognized, we are likely to perceive the whole pattern rather than the mistaken details in it. And some neuron activity may contribute to another neuron that also receives already-processed signals stemming from the same sources, so there is a fair share of things, ultimately, that complicate the matter.

However, what "a whole pattern" is depends on the size of the "circuitry" we are discussing - when you hear someone say something, there are parts of your brain that react to patterns in the intonation, parts that react to the acoustics of really short samples, parts that react to the acoustics of a set of the samples (and recognize that yeah, this is Eric's voice), parts that react to the speech sounds (this is a d, this is an ɪ, this is an s, this is some noise I couldn't recognize, this is a z, ...), parts that react to the series of speech sounds (this is dɪs ?z, and because this pattern is similar enough to ðɪs ɪz it will by fortunate accident be identified as this is), and parts that react to the words in the previously identified sentences and parse the grammar. This, in turn, also interacts with other things stored in the neural network, such as things the listener knows about the speaker that may affect what he is saying at the moment, and so on - depending, of course, on whether the relevant parts of the neural network have formed connections between them.

I do acknowledge that the above bit does not explain how and why a neural network also produces linguistic content. A longer essay on mirror neurons, and on the interactions of different other parts of the brain, and the evolutionary pressures that have caused those parts of the brain to trigger certain things would be needed to explain why neural networks also have behaviors, instead of just identifying a pattern and sending out a positive or negative conclusion to a final node.

It is worth noting that linguistics is split on how language ability arises in the brain. Chomskyists hold that the neurons of the brain come, to some extent, preprogrammed. This means it is easy for a human to learn language, because some basic patterns are already embedded in our brains - all we need to do is learn the ways these patterns are implemented in the language around us. One such pattern is supposedly the object, i.e. the fact that verbs often can have an argument that is, in some sense, a primary complement. Objects but not subjects being universal could point to objects being such an embedded notion. (I have seen other sources maintain that only subjects are universal, so do not quote me on either of these.)

However, I will not present arguments on whether such a language organ is present in the brain, or whether the brain more generally just happens to enable language without a portion devoted to the purpose - though I will admit that I find the Chomskyite school more convincing on the topic. Linguists do not often seem to talk explicitly about neural models of language, but this is mostly because the analysis is done at a more abstract level, where more general models of computation suffice. Ultimately, detailed analysis of neural networks is cumbersome, and this may be why their study is not common in linguistics departments. They do have applications in computational linguistics, though.

Side-track 2: The speech community

A language without a speech community is a dead language. The speech community consists of the speakers of some language. For most of the history of mankind, all the linguistic content anyone encountered came rather fresh off another neural network.

This means that what one should look at is the effect of having a lot of neural networks - each with slightly unique architecture: the neurons are probably not perfectly identical in the first place, the information that has entered each network differs, and so on. It should be obvious that different individual networks will have identified different patterns - and will have different false positives as well.

However, as the same patterns that are used to recognize language also are used to produce language, these patterns will be present in the linguistic data that we are exposed to throughout our lives. Thus, there are definite patterns in the language we hear - simply because these patterns appear from other pattern-matching neural architectures.

As previously stated, it is somewhat likely that different individuals have slightly different setups in their brains. Thus, some patterns that exist in parts of the population may not exist in other members of it; meanwhile, some members may have identified the patterns differently.

Consider, for instance, the development of the word beads. Whether or not the meaning shift that occurred was the result of intentional metaphor, it is quite clear that at one point in the history of English nearly everyone understood bead as referring to prayer, whereas at a later point nearly everyone understood it as referring to small, round, solid objects. Over time, a mistaken pattern became so popular that it replaced the previous one: the identification of the word's referent was reinterpreted. Those who counted their prayers did so by counting a kind of round solid object so often that observers took the word to signify the round solid objects themselves, because of a rather obvious coöccurrence.

Neural networks explain both how grammar is passed down the generations, and how grammar changes as it is parsed slightly differently down the lines. How do accidental patterns create grammar though?

If a certain tendency for collocation has appeared - some words or morphemes tend to occur in sequence or near each other under certain circumstances - this is easily understood as expressing circumstances along those lines. If others pick up the pattern, it grammaticalizes, and suddenly there is grammar. All grammar in all languages probably originates with this phenomenon, though later, influence between languages has also added some to the mix.
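As a toy illustration of how such a collocational tendency can be picked out of raw input, here is a sketch that merely counts adjacent word pairs in an invented mini-corpus. The corpus and the "going to" example are my own (cf. the grammaticalization of English "gonna" as a future marker), not anything from a real dataset.

```python
from collections import Counter

# A tiny invented mini-corpus. If "going to" keeps turning up before
# verbs, a learner's network may reanalyse it as a future marker.
corpus = ("i am going to leave . she is going to sing . "
          "he is going to leave . we walk to town .").split()

# Count every adjacent pair of tokens (bigram):
bigrams = Counter(zip(corpus, corpus[1:]))

# The pair ("going", "to") outnumbers other uses of "to":
print(bigrams[("going", "to")])   # 3
print(bigrams[("walk", "to")])    # 1
```

Real collocation detection would use association measures over large corpora, but the principle - a statistical tendency becoming a unit - is the same.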

Of course, some individuals may not have grammaticalized the same patterns, and we do find some variation in how subcommunities of a speech community - and even individual speakers - understand and use constructions. Many dismiss this as speaking sloppily or being ignorant, but those speakers' neural networks have simply put their effort into generalizing other underlying patterns. Can we really say one generalization is better than another? Which one is better - one that is more consistent with other patterns in the language? One that is more elegant? One that is more parsimonious? It turns out the standard language sometimes expects a more parsimonious pattern, and sometimes a less parsimonious one (see, e.g., begs the question, where the standard language expects a very unnatural and often not very obvious interpretation!).

Ultimately, a population of neural networks exchanging messages in a flexible protocol which adjusts for the properties of both the neural network and the medium over which the signal is passed (audio through air, text on various materials, morse beeps over electric lines, and so on) is a sufficient explanation for how grammar appears. The neural networks identify patterns - even unintentional patterns - and generalize them. Sometimes, the identification goes wrong. If many speakers make the same misidentification, it is likely the entire language changes with them - in reality, we should probably think of any specific language as some kind of average of how the population parses and generates it.

Further, lots of grammar ensures some grade of redundancy in the language - and this is useful for ensuring that the language has some persistence over a noisy channel, as the world does happen to be such a noisy channel. Some grammar - verb conjugations, case agreement, noun gender, etc. - probably is the result of elements being repeated that help the listener guess the intended meaning even if noise happens to eat some important syllables; if there are fifteen words in the language that begin with ka-, and you do not hear the rest of the word, but some other word gives away that the word is of, say, neuter gender, the neural network can likely exclude many candidate words, if the context otherwise wasn't enough to exclude all but one.
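The ka-/gender example can be sketched minimally. The lexicon below is entirely invented; the point is only that agreement leaked by another word lets the listener prune the candidate set even when noise has eaten most of the word itself.

```python
# Toy lexicon: word -> gender. All words and genders are invented.
lexicon = {
    "kamina": "neuter", "kalo": "masculine", "kavu": "neuter",
    "karos": "feminine", "kade": "neuter", "talo": "neuter",
}

def candidates(heard_prefix, gender):
    """Words compatible with both the surviving syllable and the
    gender agreement given away by another word in the sentence."""
    return [w for w, g in lexicon.items()
            if w.startswith(heard_prefix) and g == gender]

# Noise ate everything after "ka-", but agreement says neuter:
print(candidates("ka", "neuter"))   # ['kamina', 'kavu', 'kade']
```

Without the gender cue, all five ka- words would remain in play; with it, the field narrows to three before context even enters the picture.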

We do, in fact, mishear a lot more than we think, and our brains use cues along these lines to reconstruct the data. In a language without redundancy, the number of times people in our surroundings would keep going "what?", "excuse me, what'd you say?" would probably cause us to repeat poignant information that helps here - hence, e.g., the widespread use of double negation in languages throughout the world. If such repetition gets turned into a pattern, and this pattern is worn down by sound change, it easily turns into regular morphology. Other reasons probably also underlie the appearance of grammar, though, such as the Chomskyan notion of part of the brain constituting a language organ with certain pre-wired settings conducive to learning language.


Wednesday, April 24, 2013

Linguistics: Measuring the Age of Living Languages

(I found this one in my drafts folder, it's been sitting there for about four months or so, so I figured I may just go ahead and post it after some slight editing - including incomplete sections and all. I may edit it a bit in the future.)

Measuring the Age of Living Languages

It is fairly common to hear - even from fairly educated people - that this or that language is the oldest language. I have recently heard such claims made for various African languages (especially ones with clicks) such as !Kung, and for languages like Basque, Latin or Sanskrit. Sometimes, it is just a dialect of some language that is considered 'the oldest' - this or that dialect of English, Swedish or German is described as "the oldest dialect of X".

What does such a statement even try to say? What determines the age of a dialect that is spoken as a living language in the present? 

Of course, if we took an old text and read it out loud, the language encoded in that text is indeed centuries old - but aspects of our rendition would be newer: it is unlikely the reader would get the pronunciation identical to the way it was pronounced centuries ago, and similar problems go for intonation and possibly - even likely - for how he would understand the meaning contained therein. It will probably be at least slightly misunderstood by the reader or listener: since the written form is a zombified instance of a linguistic utterance, the language we use to parse it no longer is exactly the same language as it was back in the day; its tissue has decayed. Secondary connotations of the words have been lost, as have a whole lot of other things, so we are probably at a loss in trying to understand certain implicit details in such a text - simple things like whether it is sarcasm, dead serious or something else along those lines.

The Old Testament in Masoretic or LXX form probably is in a language more than 2000 years old, and clay tablets from Mesopotamia are in languages even older than that. But this is essentially the only situation in which it makes sense to speak of the age of a language, and as explained above, we then have zombified languages, where information loss already has set in.

However, with a living, spoken dialect, what aspect of the language are we speaking of when we say it is older than another language? Is it something to do with how well it has conserved the meaning of words over time? Is it how well the grammar has been conserved? Is it the conservation of pronunciation? Is it the pragmatics - the ways we use the language to express things - which is an important aspect, but one of the hardest to pin down? How would we go about measuring any of these in an objective manner, even?

We could pick a sort of objective thing - the point in time when it diverged from another language or dialect. In that case, the language could have gone through great changes every generation since it split and still be the oldest language out of two closely related ones!

Scenario: on an island, Island A, in the Pacific, people speak a language. We will call it Islandean. As the population grows, a group sets off to settle another island - Island B - far away, which they have spotted during their frequent fishing expeditions. Contact between the populations on Island A and Island B is infrequent after the initial settlement. Both start out speaking Islandean, but as the amount of contact has been reduced, Islandean at A and Islandean at B diverge, and a while down the line, the descendant versions of Islandean, Islandean A' and Islandean B' (where ' marks "new version"), have diverged enough not to be mutually intelligible. They are now two languages. A time comes, again, when Island A gets crowded, and its population sets off to colonize Island C. The linguistic divergence again sets off - both start speaking Islandean A', but as time goes by, Island A has Islandean A'', and Island C has Islandean C (which too is a derivative of Islandean A'). Going by a family tree model, Islandean B' is the oldest of these languages - it split from its two relatives the earliest:


What happens if Islandean A'' or Islandean C goes extinct? The most recent split that Islandean B' and Islandean A'' have had from each other still remains unchanged at the root of the tree - yet we know a later split happened in Islandean A'/C, a split one of whose branches just happened to terminate. Should we then claim Islandean B' is the older one, since it has been diverging for two generations, while A'' only diverged for one since its most recent sibling - regardless of this sibling having since gone extinct?
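The "oldest = earliest last split" criterion can be made concrete with a small sketch. The years and the exact tree shape below are invented for illustration only.

```python
# The Islandean family tree as nested tuples: (year_of_split, left, right).
# Leaves are language names; the years are invented.
tree = (1200, "Islandean B'", (1500, "Islandean A''", "Islandean C"))

def most_recent_split(node, year=None):
    """Map each leaf language to the year of the last split above it."""
    if isinstance(node, str):
        return {node: year}
    y, left, right = node
    return {**most_recent_split(left, y), **most_recent_split(right, y)}

print(most_recent_split(tree))
# By this criterion Islandean B' (last split 1200) is "oldest" - and it
# stays "oldest" even if Islandean C later goes extinct, since the root
# split is unaffected; that is exactly the arbitrariness discussed above.
```

The sibling's extinction changes nothing in this computation, which is precisely why the criterion gives such counterintuitive answers.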

The time at which mutual intelligibility was lost - and thus distinct languagehood, if we go by some definitions -  might not be entirely trivial to decide, as different speakers probably would have different ability to quickly adapt their linguistic skills in order to understand the other language - and it is possible one of the languages would be more difficult for speakers of the other to understand. 

Let us ignore that kind of tricky question for now, and instead decide that the 'older' language is whichever one is the more 'conservative' among them. As I already pointed out, that's not trivial. Do we count the number of sound changes, and pick the language that has had the fewest of them? The number of semantic changes? The number of grammar changes? Should we assign different significance to different kinds of grammar/sound/semantic changes? 

Even if we roughly can guess what the ancestral language was like, we're still taking a stab in the dark when it comes to measuring these things. There may have been countless changes that haven't altered any structural features of the languages, and there may have been changes we cannot even be sure happened at all, since later changes may have eradicated their results or made further structural changes. Trying to measure the age of a living language is a meaningless task - and this is why real linguists do not talk about which dialect or language is older than another.

Monday, April 22, 2013

Language Complexity, pt I

[This post has been in the drafts folder for about half a year already, with me occasionally adding or removing bits. I decided to complete the first half of it, and split it in two. There will also be a third post which in part answers what this has to do with the more general idea of this blog.]

Language Complexity

One common idea is that primitive people speak primitive languages, or vice versa: that primitive language is a mark of primitive people. This idea is badly mistaken - not only is it mistaken, it is also badly defined.

I found a rather intriguing but specific example of this on a forum recently: the idea that omission of vowels in the writing of Semitic languages is a sign of backwardness and primitiveness. The person opining thus did so in order to present Muslims in particular (and the Quran) as savages from a linguistic point of view. It might appear to the average speaker of a language written using the Latin alphabet that regularly omitting a kind of sound from writing is, indeed, primitive and rather inefficient.

However, let us look into this idea a bit closer: is it perchance possible that our writing too omits phonetic information? Almost trivially, this turns out to be true - English marks neither pitch nor timing in its script. Yet both of those can convey very crucial information, sometimes essentially negating the meaning of the entire utterance! Meanwile, omiting te ocasional leter and faling to us the rite word and faling t use capitalizaton correcly in english does not make an English text impossible to parse.

It would be interesting to know the Shannon entropy of Quranic Arabic, Biblical Hebrew, Persian in different scripts (including Arabic), English, Runic versions of Germanic and so on. I suspect that, indeed, the information coded per letter is slightly denser in Arabic and Hebrew than in English. This would mean that a lost or damaged letter is more likely to alter the meaning of a text, or make it irretrievable, than in a language with a less dense writing system. Still, as I said - sometimes a rather significant piece of information may be entirely unmarked in English writing.
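Lacking those corpora, here is at least a sketch of the measurement itself: first-order Shannon entropy per character, estimated from letter frequencies. This is only a crude first approximation - a real comparison would need large corpora and higher-order models, since single-letter entropy ignores exactly the context that makes vowels so predictable.

```python
import math
from collections import Counter

def unigram_entropy(text):
    """First-order Shannon entropy in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A toy comparison: an English sentence vs. its vowelless rendering.
sample = "the quick brown fox jumps over the lazy dog"
vowelless = "".join(c for c in sample if c not in "aeiou")
print(round(unigram_entropy(sample), 2))
print(round(unigram_entropy(vowelless), 2))
```

Note that dropping vowels also shrinks the alphabet, so the raw unigram numbers can move either way; the density claim in the text is really about how much of the message each surviving letter carries, which needs conditional entropy to capture properly.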

Where do we draw the line between backwards and too dense or advanced and sufficiently full of redundancy? It is all arbitrary.

But back to the more general issue - language complexity. Until the second half of the 19th century, those linguists who considered the genesis of languages assumed something along these lines: as a civilization emerges and consolidates, its scholars and leaders design and establish its language. Once the civilization reaches a certain level, laziness kicks in, and the language starts slipping into disarray.

These linguists felt this explained why languages such as Ancient Greek, Latin, Sanskrit, Persian and so on were such sophisticated languages, while their early descendants - Vulgar Latin and the early transitional forms leading into the medieval vernaculars of Romance-speaking Europe, various Prakrits, ... - were such haphazard and shoddy things.

How do we measure haphazardness in a language? How do we measure shoddiness? These were of course not the terms they would have used, but it is quite clear they thought of it in such terms. Latin was admirable, Sanskrit was "almost perfect", and so on - all quite subjective and emotional terms.

Meanwhile, they held that primitive peoples had not gone through this language-building stage, and thus spoke languages with imperfect grammar or downright less grammar. We would expect, if this were true, that languages such as English, Russian, Japanese, Swedish, Finnish and so on would have advanced and complicated grammars, whereas languages such as those spoken by Australian Aboriginals should have simple and primitive grammars: you Jane, me Yarrawalluma. Me have boomerang go kill Kangaroo. (Even then, the previous two sentences, although very bad English, could very well have been constructed using complicated grammatical rules, although obviously distinct from those of English. 'go kill', 'have X (verb phrase)', etc., all could be grammaticalized constructions.)

To test this, we would need to have some kind of means to quantify complexity of a grammar. Another question: does complexity imply expressivity? Does expressivity imply complexity? Is it really expressivity we should quantify? - But expressivity is way more culture-specific, and it is very easy for a person having grown up in one culture not even to realize the other culture can express a lot of things we do not even realize are being expressed.

Compare how Higgins thought Hebrew's perfect and imperfect aspects did not really express anything - they were just a failed attempt at emulating the tenses of more advanced languages. It is clear we cannot just assume that our not understanding a feature is proof that it does not really have any function.

Further, we cannot just count the number of forms and constructions we can see - clearly, one form can have multiple functions, and one function can affect multiple forms. (That is, a given function can be encoded as a change in the form of more than one word, in the rearrangement of morphology, word order, extra particles, intonation, ...) Let us consider one pretty cool example which is rather difficult to detect: hierarchies. To get there, though, we will have to consider the subject and object distinction, and how it is marked:

There are languages that distinguish subject and object by noun case:
mies osti auton (man buy.past.3sg car.acc)
auto paloi (car burn.past.3sg)
karhu raateli miehen. (bear savage.past.3sg man.acc)
In this case, the leftmost nouns are all nominative, which in Finnish is marked with a zero ending in most nouns. The -s in mies only appears in the nominative, so I consider that a nominative suffix for now.
Rearranging the word order may sound awkward for some of these particular sentences - few would ever say auton osti mies, altho' auton mies osti flies a bit lower under my radar for weird syntax. With different subjects and objects, however, both of those orders can work really well. Rearranging may alter the connotations of the sentence. A given order may code for more than one connotation:
miehen karhu raateli:  ~it was a man the bear savaged
karhu miehen raateli: ~it was a bear that savaged the man
miehen raateli karhu: ~the man was savaged by a bear, a bear savaged the man
karhu raateli miehen: ~the/a bear savaged a man

As I do not fully have native intuition for Finnish, I can only tell you these are roughly how a native would understand the connotations of different rearrangements - at least introspection doesn't tell me all the possible interpretations these sentences could have in different situations, and of course there's a double translation issue here, as I am no native speaker of English either. The most common one is subject first, verb in the middle. Case marking need not be in the form of suffixes - particles qualify as well, and we find this in Spanish (a, used with direct objects with certain properties) and Hebrew (et, again, restricted to objects with certain properties and apparently not mandatory).

This method of distinction is familiar to anyone having studied Russian, German, Greek, Latin, Sanskrit, and a whole bunch of other languages. It can be found on every continent.

A method common in lots of languages in Africa - but also attested elsewhere, including, in some analyses, spoken French! - is to have the verb carry agreement markers for the subject and object. Thus, in most Bantu languages, there is a rigid order of prefixes to the verb, one agreeing in noun class (a lot like gender) with the subject, another with the object, and depending on the language there may be several other prefixes as well. Since the verb tells us which of the nouns is subject and which is object, there is little need to mark the nouns or use any other method. However, what if the nouns are of the same class? We will get to that a bit later. Subject agreement on the verb - which is common in large parts of the world too - does already by itself help a lot of the time, especially if there is gender and number agreement (or comparable) in the morphology. In Russian, neuter nouns do not distinguish accusative and nominative, nor do inanimate masculine nouns. I do not know which method is used to resolve which is the subject and which the object in case one noun of each of those types acts on the other, except that in the past tense, the verb agrees with the subject in gender - which would resolve this situation.

Another common method is word order. This is present in English and Swedish, and partially in many languages that do have the above style of object marking as well. In English, rearranging the order of subject and object is not really permissible at all, although you can front the object at times - but in those cases, you obtain OSV rather than OVS. Swedish requires OVS in such frontings, and seems to be more permissive when it comes to inverting the object and subject, without any grammatical marking involved. Oftentimes, this involves pronouns - which in English and Swedish still carry case marking - or nouns that are strongly semantically associated with their verbs.

Such association should probably count as part of grammar! Similar things probably apply in the situation where both subject and object in a language marking them on the verb are of the same class. How do we quantify that aspect of a grammar? To really drive that point home, though, let's look at a final class of languages:

There are languages where neither word order, case marking nor verb inflection is used, yet speakers can with great likelihood identify which is subject and which is object. What magic is this?

The explanation for this is the presence of grammatical hierarchies. In these languages, subject and object are resolved by recourse to a hierarchy: the noun higher in the hierarchy, is the noun that generally is more likely to be the subject. Is this grammar? Yes. It quite clearly is, yet it is also rather obvious that it might not be easy to spot the existence of such a grammatical detail.
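Such a hierarchy can be sketched as a simple rule. The ranks below form a hypothetical animacy hierarchy invented for illustration; real systems (direct/inverse marking, person hierarchies, and so on) are considerably more intricate.

```python
# A hypothetical animacy hierarchy - ranks and categories are invented.
hierarchy = ["speech-act participant", "human", "animal", "inanimate"]
rank = {category: i for i, category in enumerate(hierarchy)}

def pick_subject(noun_a, noun_b, animacy):
    """Read the higher-ranked noun as subject, the other as object."""
    if rank[animacy[noun_a]] <= rank[animacy[noun_b]]:
        return noun_a, noun_b
    return noun_b, noun_a

animacy = {"man": "human", "bear": "animal", "rock": "inanimate"}
# "man bear savaged", with no case, word-order or agreement cues:
print(pick_subject("man", "bear", animacy))   # ('man', 'bear')
```

The point the text makes holds here too: nothing in the surface string signals this rule, so a fieldworker could easily miss that such a grammatical detail exists at all.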

How does that hierarchy compare, quantitatively, to a case marking system or a verb marking system? Is it even a well-defined question?

The reason we have discovered that grammatical detail, probably, is along these lines: in many languages, the subject-object distinction is rather central. (The existence of subjects in language is not, apparently, universal, though! Objects, however, apparently are.) When linguists have come across languages where there is no immediately apparent way of distinguishing subjects from objects, they have studied the language until they have realized what exactly is going on under the hood.

It should be rather obvious that other distinctions, encoded in similar ways but which we have no idea to go looking for, very well may exist in languages around the world - even in European languages! For this reason alone, it should be obvious that speaking of one language or another having more or less complex grammar is on shaky ground. I find it likely that the basic amount of grammar needed to get simple statements and queries understood varies a bit, but I am pretty certain the difference in the amount of grammar between languages is not large. However, that of course would be predicated on having a reasonable way of quantifying grammar in the first place - and a way of being sure when we have mapped out all the grammar present in a -lect of some kind (idiolect, sociolect, dialect, ...).

Even now, the assumption among those without any education in linguistics seems to be that the complexity of a language stands in direct relation to the advancedness of a culture or the social standing of the speaker community. This mistaken notion should have been abandoned decades ago.

We know now that complexity in languages is not a result of a cabal of clever people developing the language as civilization emerges, and we know language change is not the result of their meticulous work being abandoned due to laziness in subsequent generations. I will leave the obvious question - where does language complexity come from - to the next post on the issue.

Thursday, November 15, 2012

Quality of Sources: Godfrey Higgins, pt 3

Higgins, pt 3: Anacalypsis: Preface, Preliminary Observations


Higgins lived during a pre-neogrammarian age, and hence at the time, comparative linguistics had not reached the point where it could be called a science. 

For a very long time, and during the writing of the greater part of my work, I abstained from the practice of many etymologists, of exchanging one letter for another, that is, the letter of one organ for another of the same organ; such, for instance, as Pada for Vada, or Beda for Veda, in order that I might not give an opportunity to captious objectors to say of me, as they have said of others, that by this means I could make out what I pleased. From a thorough conviction that this has operated as a very great obstacle to the discovery of truth, I have used it rather more freely in the latter part of the work, but by no means so much as the cause of truth required of me. The practice of confining the use of a language while in its infancy to the strict rules to which it became tied when in its maturity, is perfectly absurd, and can only tend to the secreting of truth. The practice of indiscriminately changing ad libitum a letter of one organ for another of the same organ, under the sanction of a grammatical rule, - for instance, that B and V are permutable, cannot be justified. It cannot, however, be denied that they are often so changed; but every case must stand upon its own merits. [1, preface]
At the time of his writing, historical linguistics did indeed operate a bit along these lines - although some basic idea of regular sound change had occurred to the scholars of the time. It was this kind of completely ad hoc approach to historical linguistics that led the neogrammarians to come up with less ad hoc reasoning.

The idea that languages go from infancy to maturity is not generally taken seriously by anyone these days. Certainly, languages change over time, and assuming that the same rules apply now as a thousand years ago in some language would be misguided.

In a somewhat better speculation on how language may have come about, he expounds:
4. After he [early mankind] had arrived at the art of speaking with a tolerable degree of ease and fluency, without being conscious that he was reasoning about it, he would probably begin to turn his thoughts to a mode of recording or perpetuating some few of the observations which he would make on surrounding objects, for the want of which he would find himself put to inconvenience. This I think was the origin of Arithmetic. He would probably very early make an attempt to count a few of the things around him, which interested him the most, perhaps his children; and his ten fingers would be his first reckoners; and thus by them he would be led to the decimal instead of the more useful octagonal calculation which he might have adopted; that is, stopping at 8 instead of 10. ... There is nothing natural in the decimal arithmetic; it is all artificial, and must have arisen from the number of the fingers; which, indeed, supply an easy solution to the whole enigma. Man would begin by taking a few little stones, at first in number five, the number of fingers on one hand. This would produce the first idea of numbers. After a little time he would increase them to ten. ... To these heaps or parcels of stones, and operations by means of them, he would give names; and I suppose that he called each of the stones a calculus, and the operation a calculation. [1, p. 1]
Indeed, it is generally agreed among linguists today that the preponderance of the decimal system is closely related to our ten fingers - however, it is also noted that there is a preponderance of vigesimal (base-20) systems, with sub-bases of 5 or 10. It is unclear whether Higgins genuinely meant that early humans called each of these stones (or pebbles or fingers or whatever) a calculus and the act of counting a calculation, or whether he is just giving names to these things so as to be able to talk of them later on.

5. The ancient Etruscans have been allowed by most writers on the anitquities of nations, to have been among the oldest civilized people of whom we have any information. In my Essay on the Celtic Druids, I have shewn that their language, or that of the Latins, which was in fact their language in a later time, was the same as the Sanscrit of India. This I have proved not merely by the uncertain mode of shewing that their words are similar, but by the construction of the language. The absolute identity of the modes of comparison of the adjective, and of the verb impersonal, which in my proof I have made use of, cannot have been the effect of accident.  [1, p. 2]
As it turns out, precious little is known of the Etruscan language. One entire book is extant - the Liber Linteus Zagrabiensis. Scholars have not been able to read it - despite knowing the alphabet. If Etruscan were (closely) related to Latin, it would undoubtedly be possible to read by recourse to Proto-Indo-European or some such reconstruction. Robert Ellis' The Armenian Origin of the Etruscans listed about 200 words - other sources give about the same number even 150 years later. Some of these words have been obtained through Greek and Latin sources. It is generally agreed that Etruscan is not an Indo-European language [2].

Its grammar is rather different from the Indo-European languages of the time as well [3]. It shares the rather notable feature of Suffixaufnahme with a bunch of other languages, including Old Georgian, Hurrian, Urartian and Lycian - of which only Lycian is Indo-European.

Suffixaufnahme, for those who are interested, is a generalized congruence phenomenon. Normally, a noun (and maybe its adjectives and other determiners) are marked for a case - e.g. Latin meo antiquo malleo - my(dative, masculine) ancient(dative, masculine) hammer(dative, masculine). In languages with suffixaufnahme, such case markings can be stacked, obtaining things like - and this is fake Latin - artificiso antiquo malleo. "the builder(masc,genitive)(masc, dative) ancient(masc,dative) hammer(masc,dative)".

Considering how successful Indo-Europeanists have been with both Hittite and Tocharian, basically showing what regular sound changes and other changes needed to take place to derive those languages, the failure to do so with Etruscan - for which evidence has been available for a much longer time - indicates something about the problem.

Considering how ignorant Higgins has proven to be already about linguistics, I will not ascribe any credibility whatsoever to his claim to have shown Etruscan to be identical to Sanskrit; he seems unaware it is not even related to Latin.

7. A very careful inquiry was made by Dr. Parsons some years ago into the arithmetical systems of the different nations of America, which in these matters might be said to be yet in a state of infancy, and a result was found which confirms my theory in a very remarkable manner. It appears, from his information, that they must either have brought the system with them when they arrived in America from the Old World, or have been led to adopt it by the same natural impulse and process which I have pointed out. [1, p. 2]
Today, we know there are numeral systems of varying complexity in the Americas, with some especially simple ones in Brazil, for instance in the Pirahã language. Other bases than 10 do occur - Higgins does point out 5, but he fails to point out the vigesimal systems. [4]


8. The ten fingers with one nation must have operated the same as with the other. They all, acccording to their several languages, give names to each unit, from one to ten, which is their determinate number, and proceed to add an unit to the ten, thus ten one, ten two, then three, &c., till they amount to two tens, to which sum they give a peculiar name, and so on to three tens, four tens and till it comes to ten times ten, or to any number of tens. This is also practised among the Malays, and indeed all over the East: but to this among the Americans there is one curious exception, and that is, the practice of the Caribbeans.
Exceptions also occur in Papua New Guinea, where various strange systems obtain - including languages for which it does not make much sense to speak of a base at all. Bases of size 4, 5, 6, 8, 10, 12, 20 (subbase 5 or 10), 24 (subbase 6), 32, 60 (subbase 10) and 80 (subbases 20, 10 and 5) are attested - some both in the Americas and the Old World, some exclusively in the Old World, and so on.
They make their determinate period at five, and add one to the name of each of these fives, till they complete ten, and they then add two fives, which bring them to twenty, beyond which they do not go. They have no words to express ten or twenty, but a periphrasis is made use of. From this account of Dr. Parsons', it seems pretty clear that these Americans cannot have brought their figure and system of notation with them from the Old World, but must have invented them; because if they had brought it, they would have all brought the decimal system, and some of them would not have stopped at the quinquennial, as it appears the Caribbees did. If they had come away after the invention of letters, they would have brought letters with them: if after the invention of figures, but before letters, they would all have had the decimal notation.
Although he is probably somewhat right that counting had not developed very far by the time the Americas were first settled, the logic here is still terrible. There could certainly have been tribes with more advanced counting systems in parts of Eurasia at the time the first tribes passed into America, without those numbers being taken along. Ultimately, he presents a theory of convergent evolution for the numeral systems. Today, the evolution seems rather to be one of assimilation into the predominant systems of the big languages of the world - small languages in Latin America adopt the decimal system from Spanish and Portuguese, Aboriginal languages from English, African languages from English, French and Arabic, and so on. Comrie has apparently even noted that unusual numeral bases are going extinct faster than the languages that have them!
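The Carib-style counting Dr. Parsons describes is simply a mixed-radix system - a subbase of 5 inside a base of 20 - and the decomposition is easy to sketch. The following is a toy illustration of my own (the function name and the example number are not from any source):

```python
def quinary_vigesimal(n):
    """Decompose n into scores, fives and units, sketching the kind of
    quinary-vigesimal counting described for the Carib numerals:
    a subbase of 5 within a base of 20."""
    twenties, rest = divmod(n, 20)
    fives, units = divmod(rest, 5)
    return twenties, fives, units

# 37 = 1 twenty + 3 fives + 2 units
print(quinary_vigesimal(37))  # -> (1, 3, 2)
```

A system with "no words to express ten or twenty" would express the fives and twenties by periphrasis, but the underlying arithmetic is the same.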

From that point on, the numbered paragraphs go on for a few pages elaborating on the development of astrological matters, such as the zodiac, the lunar mansions and so on. In paragraph 16 he claims that "the science of the Babylonians and Egyptians was but the débris of former systems, lost at that time by them, as it is known to have been in later times lost by the Hindoos" [1, p. 4]. This rests on an unsupported assumption - he takes them to have been more advanced than they were, when their achievements were entirely achievable with the tools we know them to have used. At the very least, in order to claim that they had lost some knowledge, he should provide some actual argument showing that they knew of things, or had calculated things, that they should not have been capable of with their tools.

22. A traveller of the ancients, of the name of Jambulus, who visited Palibothra, and who resided seven years in one of the oriental islands, supposed to be Sumatra, states, that the inhabitants of it had an alphabet consisting of twenty-eight letters, divided into seven classes, each of four letters. There were seven original characters which, after undergoing four different variations each, constituted these seven classes. I think it is very difficult not to believe that the origin of the Chinese Lunar Zodiac and of these twenty-eight letters was the same, namely, the supposed length of the Lunar revolution. The island of Sumatra was, for many reasons, probably peopled from China.
It took me a while to find anything on Jambulus, since his name is apparently normally written Iambulos. His works are considered absurd and mere fiction [5].
I guess the reason for the probable Chinese origin of the Sumatran population has to do with both the Chinese and the Sumatrans being Asian? Now, the source he quotes to support this entire thing - Asiatick Researches, vol. X - says:
The two alphabets of the Sumatrans consist only, one of twenty-three, and the other of nineteen letters: but it is probably that there were two sorts of them formerly, as in India, and which were originally the same. One was used by the more civilized and learned classes, and at court, the other was current among the lower classes, whose poor and barren dialect had fewer sounds to express. 
Speculation instead of fact. What makes this even harder to evaluate is that over 50 languages are spoken on Sumatra, none of which goes by the name Sumatran. At this point the claim is basically unfalsifiable - sure, some group might very well, entirely by accident, have had 28 letters. Until I know which one, it is impossible to check the reason - maybe they just happened to distinguish 28 sounds.

One interesting thing regarding "fewer sounds" is that until fairly recently - and even to this day in some pseudo-scientific circles - Sanskritists and Hindu nationalists were so certain of the perfection of the Sanskrit language that they denied the existence of speech sounds not distinguished in Sanskrit. Thus, if a language derived from Sanskrit had increased the number of sounds it distinguished, this was denied - the new sounds were really just versions of the same sound, since it was inconceivable that sounds unavailable in Sanskrit even existed. Therefore, many of the orthographies of the languages of India mark fewer distinctions than the spoken languages do, in this charade to make Sanskrit seem superior.

Generally, though, languages' alphabets do not tend to have a specific number of letters due to doctrine or dogma, but due to either tradition (which changes over time, of course), or due to functional concerns: these are the sound-units we need to distinguish, let us make up letters to distinguish them.

29. About the time this was going on [dividing of time into smaller units, development of astrology], it would be found that the Moon made thirteen lunations in a year, of twenty-eight days each, instead of twelve only of thirty: from this they would get their Lunar year much nearer the truth than their Solar one. They would have thirteen months of four weeks each. They would also soon discover that the planetary bodies were seven; and after they had become versed in the science of astrology, they allotted one to each of the days of the week; a practice which we know prevailed over the whole of the Old World.
Except we know it did not. Weeks of every conceivable length are attested, and the spread of the seven-day week is ultimately rather recent, beginning only about 2,000 years ago.

After arguing for the wide spread of "X" as a symbol for ten - with ad hoc explanations for why it is absent in some places (and without honestly mentioning that in most places, X was not used to denote ten at all) - he goes on to:
32. General Vallancey observes, That from the X all nations began a new reckoning, because it is the number of fingers on both hands, which were the original instruments of numbering: hence יד (id) iod in Hebrew means both the hand and the number ten. [1, p. 7]
Except this is not the case! יד adds up to 14, although י by itself does signify ten and bears the name iod. The name of the number ten, however, is עֶשֶׂר, 'eser.
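Vallancey's arithmetic is easy to check: in the standard Hebrew letter-numeral assignment, yod is 10 and dalet is 4, so יד sums to 14, not 10. A minimal sketch (the table below lists only the two letters needed here):

```python
# Standard Hebrew letter-numeral values; only the letters of יד included.
letter_values = {'י': 10, 'ד': 4}

def gematria(word):
    """Sum the numerical values of the letters of a Hebrew word."""
    return sum(letter_values[ch] for ch in word)

print(gematria('יד'))  # yod (10) + dalet (4) -> 14
```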

He further claims to have proven that the alphabets of the Hebrews, Samaritans, Phœnicians, Greeks and so on all had sixteen letters. This sounds like so much bull, but the actual argument is in The Celtic Druids, and I will not read that book just yet.

84. There have been authors who have wasted their time in inquiries into the mode in which the inventor of the alphabet proceeded to divide the letters into dentals, labials and palatines. There surely never was any such proceeding. The invention was the effect of unforeseen circumstance - what we call accident; and when I consider the proofs, so numerous and clear, of the existence of the oldest people of whom we have any records, the Indian Buddhists in Ireland, and that in that country their oldest alphabet has the names of trees, I cannot be shaken in my opinion that the trees first gave names to letters, and that the theory I have pointed out is the most probable. [1, p. 15]
The bolded part is not bolded in the book. I think it needs no further debunking - it quite sufficiently debunks itself. There are no such records. We have no attestation of the Ogham alphabet earlier than a few hundred years CE, and only some internal evidence suggesting it may be a couple of hundred years older than its earliest attestation.

Generally, Higgins seldom provides sources for his claims. His writing is dense, boring and so full of misconceived notions that pointing out the starting point where he first went wrong is near impossible. It seems that the lack of quality research in his time, a wild imagination, and the further lack of any requisite education in how to interpret linguistic evidence conspired to create some very wild claims. I have only reached paragraph #100 at this point, and I know there are a few between #84 and #100 whose errors I ought to explain; I have omitted quite a few earlier ones as well, since they repeat previous assertions or hinge upon ones I have already noted. When he does give sources, the reference is often not to a particular page in a book, but to the book in its entirety.

This must be among the most frustrating stuff I have ever read.

[1] Higgins, Anacalypsis, vol. 1.

Monday, November 12, 2012

Linguistics: Language Change

Language Change

(This post has some slight prerequisites; an idea of phone vs. phoneme is helpful, and the very idea of phoneme is necessary. A very basic introduction to these ideas can be found in this post.)

All linguists agree that languages change over time. There are any number of books on language change - both ones explaining the basics and ones getting into more theoretical concerns. The questions that interest linguists regarding language change include what kinds of changes are likely to happen, how they happen, and how they interact with grammar, with universals of human language, with first-language acquisition, with phonology, with sociolinguistics, and so on.

There has been some development in the approach and the theories of language change over time, and such advances and theories will be mentioned at relevant places.

I have previously written some posts on these things at a forum. These may go into a bit more detail than necessary for the current topic, and I do not provide sources there. It seems many of the same examples of language change, like particular changes in English, are mentioned in many books.


An obsolete view of historical linguistics

Back in the 19th century, until the rise of the neogrammarians, language history was, simplistically, viewed in this manner (closely paralleling Campbell's summary):
Many languages are primitive, and spoken by dumb primitive tribes. As a tribe grows increasingly civilized, its language too evolves into a more advanced thing. After this apex is reached, laziness sets in and the language decays. The neogrammarians realized we need to assume that language works the same everywhere - sound change and analogy are not restricted to a period of decay, and the same forces acting on a language will have similar effects. [2, p. 334]

Types of historical change

Languages change in several ways over time. The main ways are these: the sounds of individual words change, the sound inventory of the language changes, the meanings of words change, and the grammar changes. Change at much higher 'levels' of language than that (such as pragmatics - how we tend to express that which we want to express) has not, as far as I can tell, been very efficiently described, as the formalism for describing things at those levels is not currently well enough developed. (Well, that can probably be discussed as well.)

Sound Change 

As explained earlier, a spoken language has phonemes, and these phonemes are realized as phones. Over time, a phoneme or even a phone can fall out of use - the fricative /h/ sometimes meets this fate throughout a language, sometimes only in certain contexts. This would mean any instance of the sound h (but not necessarily the letter h, as it could conceivably represent other sounds simultaneously) would fall silent. In a language with a conservative orthography, words could still be distinguished in writing by the presence of an h, but once such a loss has happened, had and ad would not be distinguished in speech.

More often, sounds change in some given context triggered by some other sound in its context. (Or by some other fact about its context, such as the sound occurring next to a word boundary or a syllable boundary or after the primary stress or anything like that.) 

Oftentimes, it seems sound changes just reassign a sound in a word to another, similar sound. This may merge words that previously have been distinct.

In a literate society, one mark of such a change, after it has happened, would be increased confusion between the two words: say, their and there would each occur in contexts where the other is expected. We notice a similar thing with some English words that are spelled differently but pronounced identically - for instance they're vs. there vs. their, than vs. then, . . .

Sometimes, sound change leads to a new sound appearing in a language. As far as we can tell, Proto-Germanic did not originally have a separate sh-sound. At some point, this sound appeared in some descendant languages (as we can see from the fact that it is present at least in Modern English, but also in some varieties of Swedish, Norwegian, Danish, ...).

At some point, some combinations of sounds assimilated each other and turned into one new sound; East Swedish "sh" comes from clusters such as -/sj/-, -/sk/-, while East Swedish "tsh" comes from /k/ + front vowel, /tj/, /kj/ and so on. Probably, the sounds in these clusters first acquired allophones specific to those clusters - realizations still identified as the same basic sounds - and later this allophonic variation was reduced; as speakers were left with pairs of words that had earlier been distinguished by an extra sound (sked vs. sed or similar pairs), the distinguishing feature between those words was now a different type of s-sound instead.

Likewise, the appearance of the front vowels y and ö in the Scandinavian languages and German occurred through assimilation to an i in the following syllable under certain circumstances. Traces of a similar change are still present in English (man, men and mouse, mice, for instance), although later sound changes have obscured the relation there a bit.

The neogrammarian stance of the late 19th century was, roughly, that a sound change happens throughout the vocabulary simultaneously - as though a language consisted of a bunch of strings, and sound change were a search-and-replace over all of them. This search-and-replace need not be of the form replace every k by h, but can be something like replace every unstressed back vowel before a stressed front vowel by a front, round vowel of the same opening as the back vowel had.
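The search-and-replace picture can be made quite literal. Here is a toy neogrammarian rule - Latin-style rhotacism, intervocalic s becoming r - applied exceptionlessly across a lexicon (the word list is made up for illustration, not a claim about actual Latin forms):

```python
import re

def rhotacize(word):
    """Neogrammarian-style rule: s between vowels becomes r,
    everywhere in the lexicon, with no exceptions."""
    return re.sub(r'(?<=[aeiou])s(?=[aeiou])', 'r', word)

lexicon = ['flosem', 'flos', 'asa', 'spes']
print([rhotacize(w) for w in lexicon])  # -> ['florem', 'flos', 'ara', 'spes']
```

The point of the neogrammarian hypothesis is precisely that the rule is a function of sound context alone: any two words presenting the same context come out changed the same way.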

I am unaware of any historical linguist having posited a specifically computational restriction on how complicated these rules may be. I have the feeling, though, that Optimality Theory could easily be applied to obtain such a restriction - or perhaps has already compiled enough information about sound change in general that such a restriction could be derived, if someone applied themselves to obtaining that knowledge.

Now, more modern research indicates that sound change is not a complete search-and-replace, but rather somewhat probabilistic: it is more likely to hit common words than less common ones, and sometimes it peters out before having affected all the words. Still, this kind of slightly information-theoretical analogy for linguistic historical development will appear again in this blog on occasion, as it is both a fairly good model for thinking about this stuff and a sufficiently practical one for many modern readers to grasp easily. One can, of course, also program probabilistic models and simulations, but tracing the history of a probabilistic process is, for obvious reasons, less straightforward than tracing that of a predictable mechanical apparatus.
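The probabilistic picture can be sketched as well. The model below is a toy of my own devising, not any particular published account of lexical diffusion: in each "round", every word the change has not yet reached undergoes it with probability equal to its usage frequency, so common words tend to change first and rare words may escape entirely.

```python
import random

def lexical_diffusion(frequencies, rounds=20, seed=0):
    """Toy lexical-diffusion model: per round, each as-yet-unchanged word
    undergoes the change with probability equal to its frequency.
    Returns the set of words the change has reached."""
    rng = random.Random(seed)
    changed = set()
    for _ in range(rounds):
        for word, freq in frequencies.items():
            if word not in changed and rng.random() < freq:
                changed.add(word)
    return changed

# Made-up relative frequencies, purely for illustration.
freqs = {'the': 1.0, 'and': 0.8, 'phoneme': 0.05, 'rhotacism': 0.0}
print(sorted(lexical_diffusion(freqs)))
```

Running this repeatedly with different seeds gives different survivor sets for the rare words - which is exactly why reconstructing the history of such a process is harder than reconstructing a mechanical search-and-replace.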

Analogy further muddies the waters, as do interdialectal loans. Analogy serves to regularize a word where sound change has hit some of its forms but not all. If Latin, once intervocalic s had turned into r, had made flos - florem more regular - either by restoring the s, giving *flosem, or by analogizing the nominative to *flor - this would look like an irregular sound change, though the change that actually happened would have been a grammatical one. Sometimes analogies happen between unrelated words as well, as do dissimilations: some dialects of German have apparently changed zwei to zwo to distinguish it more easily from drei, whereas in some other languages, numbers cause each other to increase in similarity.

In my dialect of Swedish, a "fake sound change" can be observed; the interesting bit is the analogy we apply to new loans, and the function of that analogy is really the point of this paragraph. First, though: final -a in Swedish nouns corresponds to final -u in my dialect, so there are a bunch of nouns such as penna - /pɛn:u/, vecka - /vɪku/, ficka - /fɪk:u/, klocka - /klok:u/, kråka - /kru:ku/, tunna - /tɔn:u/, brygga - /brød͜ʒu/. This is the result of the accusative case having replaced the nominative - a case distinction since lost in both Swedish and my dialect - so the change is not a result of final a turning into u. The relevant bit is that we still apply this analogy when borrowing new nouns from Swedish, such that any noun ending in -a is rendered with -u in the dialect. This analogy does not happen when nouns are borrowed from Finnish, however. A similar analogical extension of a sound change can easily happen in other languages as well, if the result is regular enough that analogy makes it seem reasonable.

What makes dialects differ is in part local developments in the vocabulary, but also sound changes that cover just a region. My dialect of Swedish has a sound change that seems never to have spread beyond one village in Finland, although an identical sound change has occurred in a dialect in the northernmost parts of the Swedish-speaking area of Sweden (in a region that even now is mostly inhabited by speakers of Finnish). Other sound changes seem to have spread through basically the entire Germanic language area (possibly excepting the East Germanic languages). If a dialect has missed a sound change, or undergone one not shared by the rest of the speech community, and a word is loaned from that dialect into the wider speech community, then that word - if it has been hit by the differentiating sound change - will seem to violate the regularity of sound change.

However, once the speakers of the pre-change form are no longer around, that form is essentially lost from the language and has no further impact on it. The invention of writing, of course, makes it possible to store this information to some extent, which weakens the previous claim somewhat.

One indication of historical sound changes is rhymes (or similar devices) that no longer rhyme. Of course, the poet might also simply have been slightly careless.

Semantic Change

(It turns out all the examples I give here can be found in Lyle Campbell's Historical Linguistics: An Introduction - fascinating, as I had not read that book until this post was half finished; it seems all books on historical linguistics for an anglophone audience use the same examples. Pretty much every one of these can be found in the chapter on change of meaning.)

Over time, words change meanings. This generally follows much less regular patterns than the change of sounds. A really weird example in English is that of beads. Originally, it signified prayers, but as rosaries were common and people were said to count their beads while engaged in prayer, this somehow appears to have been reinterpreted as signifying the physical objects that represented the prayers. 

Meanwhile, some more normal examples include meat, which earlier signified food in general. This meaning is still present in some dialectal expressions and in words such as sweetmeat. A cognate is present in the Scandinavian languages, viz. mat, which still signifies food. Deer, likewise, is cognate to Scandinavian dyr/djur and signified any animal at an earlier stage of English - which it still does in Scandinavian. Scandinavian, on the other hand, has extended the meanings of some words whose original meanings English has kept unchanged. [1, p. 256]

Salary, on the other hand, has become less specific: originally (and in the original Latin), it signified a soldier's allotment of salt.

It should be clear here that nothing like the neogrammarian search-and-replace can apply to word meanings, and they change due to different evolutionary pressures and random happenstance. 

However, once a word has changed its meaning, its new meaning is what it means. This fact has escaped some people, and sometimes you run into someone who will say that "technically, anti-semitism is hatred of Semites, whereas you are talking of anti-hebraism/judaism/...". But anti-semitism is nowadays a separate lexical entry, whose meaning has developed separately from the word Semite. We can find a similar divergence between two related roots in decimate and decimal.

Figuring out how and why a meaning has changed is not entirely trivial. One key bit of understanding, as far as I can tell, is that language has not been designed - and even if it had been, its designer could not have known all the uses it would be put to. This means that language is incomplete for very many imaginable and unimaginable situations. But we are flexible beings, and we can realize that hey, the person I am talking to is probably extending (or restricting) the meaning of the word a bit right now, because it seems the meaning cannot be quite the usual meaning of that word, or that combination of words, at this moment.

(Wittgenstein's notion of language games is maybe even a better thing to think about here, since those too force one to realize just how flexible language is. It is obvious this flexibility undermines rigid meanings.)


Grammar Change

The most common kind of grammar change we think of is the loss of inflection that has occurred over wide areas of Europe during the last 2000 years and maybe more. The average reader is probably aware that English back in the day had cases and inflected its verbs for more persons than it currently does; likewise did Old Swedish and Old Norwegian, and Latin had a richer morphology than any of its descendant languages have. It seems the pre-neogrammarian view of language decay is still widely believed by most people who have not studied linguistics to any extent.

(Morphology, for those not in the know, is the manner in which, or the study of the manner in which, a language has its words change forms to express different things with these words, such as jumps, jumping, jumped or I, me, my, mine or ox, oxen, ox's ... or kauppa, kaupan, kauppaa, kauppoja, kaupassa, kaupalla, kaupasta, kaupalta, ... or wide, wider, widest, widen, widened, widening, widens, widenin', width, widths, width's, widths', ...the last series including some derivative morphology as well. )

But this is not the only way languages change over time - and how could it be? If this were the only way grammar changed, we would have to conclude that the first languages man spoke were immensely rich as far as inflections go, and that languages have since then only ever lost inflections. Some languages - Turkish and Finnish among them - have lost less than Spanish and English; others again have lost even more, such as Mandarin. Prior to the neogrammarians, it was assumed that the early stages of civilization brought linguistic advances, which later succumbed to laziness.

Of course, this idea is compelling! It is very easy to believe that the loss of a case or a tense is the result of laziness or bad learning. This is further helped along by observing that immigrants do not learn the grammar properly - and, I would guess, by our willingness to perceive immigrants as lazy.

That is not generally how it occurs, though, but let us ask the other question first: how did these suffixes come about in the first place?

Generally, the process that gives birth to morphology is called grammaticalization. In grammaticalization, a word first loses much of its original meaning through a process called semantic bleaching. Once a word has been semantically bleached, there is no guarantee it will grammaticalize as a suffix - it may also grammaticalize as a particle - and neither does it have to be entirely semantically bleached before being grammaticalized as a particle.

Let us consider a simple example in the Scandinavian verb. The Scandinavian languages all form passive verbs by suffixing -s to the verb. Historically, this stems from a pronoun, sig, meaning (it/him/her)self/themselves. This pronoun, when placed directly after the verb, was - depending on intonation patterns and such - sometimes reduced to just -s. This was then analysed as part of the verb, and the reflexiveness was reinterpreted as passiveness. A similar thing seems to have occurred in Russian, although the -ся, -сь suffix (-sja, -sj) there is more often reflexive in meaning than it is in Scandinavian.

Similar processes have given us essentially all morphological grammar there ever was. But sound change, as mentioned above, can cause irregularities. Say we have a sound change along the lines of: a back vowel, when the next syllable has an /i/, is fronted to a vowel of roughly the same openness. What if this situation only obtains in some forms of a word? We get situations like man - men!

Analogy can undo this, restoring the regular form and making the word conform to the more regular way the language forms plurals (and it seems Afrikaans is the one Germanic language in which that particular noun has been analogized). Being rather good pattern-matching machines, we like to extend patterns, and this easily makes the most common type of inflection or conjugation slowly conquer the entire vocabulary - though other sound changes may undo that process in the meantime. Some verbs have actually become strong in English in recent centuries, and some strong verbs have become weak.

A common kind of sound change is the loss of parts of words. There is a terminology for the different kinds of losses: loss of final sounds (apocope), loss of initial sounds (apheresis), loss of medial sounds (syncope), and loss of unstressed sounds generally. These are strictly sound changes, and so belong under the previous heading. Sounds are most likely to disappear in positions where they are naturally weakened or difficult to pronounce.

We know this kind of change to have occurred in the lead-up to several classical languages as well, so it by no means makes a language less advanced. However, many grammatical affixes occur at the edges of words, and these are the places where sound changes are likely to chip away at them until the case system or the tense forms are reduced.

However, grammar can exist in forms other than affixes and rich declension tables, and I think that deserves a separate post of its own - one explaining how people tend to vastly underestimate the amount of non-morphological grammar in languages, and how the idea that we can, in any reasonable manner, measure how complex languages are is misguided.

As for sources for this post, I haven't actually read any books on linguistic change in several years. Skimming through Lyle Campbell's book confirms I have not forgotten much, and I recommend it wholeheartedly to whomever so wishes to learn more about this topic.

[1] Campbell, Lyle: Historical Linguistics: An Introduction.