I previously pointed out some conceptual difficulties with linguistic complexity - difficulties in defining it, difficulties in measuring it exhaustively, and so on. Nevertheless, it is undeniable that languages do have something we can reasonably call 'complexity' - and that this complexity appears on several levels. How does this complexity come about?
We first need to look at the context in which a language normally exists: the brain and the speech community. The brain, obviously, is what produces linguistic utterances - and other brains parse those utterances. I think we can meaningfully say that grammar (as well as lexicon, stylistics, and so on) boils down to one thing: patterns.
Side-track 1: Neural networks
Neural networks, as it happens, are pretty good at some things - among them pattern recognition. And since patterns are the relevant thing here, we should probably start with them. Neural networks are a well-known and well-researched general architecture for pattern recognition, and our brains are themselves instances of neural networks - probably the most complex known examples, in fact.
Most readers probably have not read a lot about them, so I figure a short introduction is called for. I like to go for the excessively abstract when describing things - along the lines of "imagine an arbitrary multiset". I realize this does not work for most readers, so I will try to avoid it.
Imagine a simple sensory organ (oh man, there I go), consisting of several sensors that pass on a signal when they are triggered. Different sensors react to different types of stimuli, to stimuli of similar kind but different quality, or even to the same kind of stimulus - but by virtue of being at different locations they still convey different information onward. Each sensor, when the right stimulus is present, sends a signal. Let us assume the signal is binary - yes or no.
Let us further imagine that there are a bunch of things - we call them nodes - that receive these signals and sometimes also pass them on, along directed edges (that is, arrows from one node to another). These form a huge network, in which each node can send signals (henceforth: fire) to other nodes, and likewise receive signals from other nodes. A node cannot, though, decide which nodes to fire to - firing always transmits along all outgoing edges.
Every edge is assigned a weight. The weights of all simultaneously firing edges reaching a node are added up (or multiplied, or have some other function applied to them), and if the sum (or product) exceeds some threshold, the receiving node also fires.
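To make the firing rule concrete, here is a minimal sketch in Python - the function name, weights and threshold are all my own invention, not part of any standard model:

    def fires(weights, signals, threshold):
        # A node fires if the summed weights of its firing incoming edges
        # exceed the threshold. (Toy values throughout.)
        total = sum(w for w, firing in zip(weights, signals) if firing)
        return total > threshold

    # Three incoming edges with weights 0.4, 0.3 and 0.2; the first two fire.
    print(fires([0.4, 0.3, 0.2], [True, True, False], threshold=0.5))  # True: 0.7 > 0.5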
Now, this would be a static neural network - whenever the source sends a given signal, exactly the same things transpire down the line in the network. (Well, unless there is some loop where signals cause some feedback - but we could consider those loops a type of source as well, so we ignore those or consider them a special case, for now at least. The brain does have loops, though.)
So, we add this mechanism: a firing node increases the weight of whichever incoming edges also carried a firing signal, and decreases the weight of whichever incoming edges did not simultaneously fire. Essentially, if node A fires after receiving signals along edges a, b and d, it will from then on listen more closely to those edges.
The exact function by which the weight is changed affects the neural network's properties - most texts I have read on this use sigmoid functions of varying steepness. In the brain, the function probably also varies with age, diet, time of day, the part of the brain involved, etc. I am not very knowledgeable about the biological mechanisms, but I would figure various biochemical components are involved. (Complications can be added, but these do not alter the fundamental computational power of the network - they only change details of the implementation. Certainly such changes affect efficiency on specific problems, but a problem-agnostic architecture should preferably be as simple as possible. One example of such a complication is adjusting the weight of firing edges downwards when the recipient node is not triggered. Another architecture has a second kind of edge as well - a blocking edge: a firing signal sent down a blocking edge prevents the recipient node from firing.)
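As a rough sketch of what such an update rule could look like - keeping a running co-activation score per edge and squashing it through a sigmoid of adjustable steepness to get the weight - consider the following; all names and constants here are mine, chosen for illustration rather than taken from any particular model:

    import math

    def sigmoid(x, steepness):
        # Squashes a running score into (0, 1); steeper curves saturate faster.
        return 1.0 / (1.0 + math.exp(-steepness * x))

    def update(score, edge_fired, node_fired, step=1.0):
        # Hebbian-style update for one incoming edge: if the node fired, the
        # edge's score rises when it co-fired and falls when it stayed silent.
        if node_fired:
            score += step if edge_fired else -step
        return score

    # The edge co-fires with the node four times, then misses once.
    score = 0.0
    for edge_fired in [True, True, True, True, False]:
        score = update(score, edge_fired, node_fired=True)
    print(sigmoid(score, steepness=1.0))  # ~0.95: close to a fully trusted edge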
Now, the cleverness of the system described above may not be obvious at first sight - and I know I am not good at explaining these things.
The system above only observes whether signals coöccur often enough to exceed some threshold when added up. If they do, their assigned weights are increased. Some things coöccur by coincidence, whereas others coöccur by correlation. E.g. 'twinkle twinkle little' tends to correlate with 'star', because, quite obviously, these are the lyrics of a somewhat popular lullaby. However, the neural network can also be unlucky - and have some non-correlating things appear together often in the input data, by coincidence. "Locally", the nodes do not have the power to reason about whether two signals they have seen coöccurring are a coincidence or a genuine correlation.
There is no flawless approach to weeding out coincidences from correlations. Good updating functions, however, may help a bit. A function that increases the value of incoming edges drastically will obviously start treating many coincidences as though they were bona fide patterns; on the other hand, a function that increases the value very conservatively may not accept genuine correlations until they have occurred very often - and if the event is infrequent, the network may never adjust itself into recognizing it as a pattern. The way weights are decreased matters too: a false positive that the network recognizes only for a short while is not a problem in the long run. If the weight of edges that do not co-fire - or worse, of incoming edges that fire when the target node does not - is adjusted down quickly, false positives will probably be removed, but genuine positives may be removed along with them.
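The trade-off can be shown with a toy simulation - two event streams, one where an input genuinely correlates with the node and one where it merely coincides, fed to an aggressive and a conservative updating function; all probabilities and step sizes are invented for the demonstration:

    import random

    def accepts(events, up, down, threshold=1.0):
        # Track one edge's weight over (input_fired, node_fired) events and
        # report whether it ends up above the 'treat as a pattern' threshold.
        w = 0.0
        for input_fired, node_fired in events:
            if node_fired:
                w += up if input_fired else -down
        return w > threshold

    random.seed(0)
    correlated   = [(random.random() < 0.9, True) for _ in range(50)]  # genuine
    coincidental = [(random.random() < 0.3, True) for _ in range(50)]  # noise

    for up, down, label in [(0.5, 0.05, "aggressive"), (0.05, 0.05, "conservative")]:
        print(label, accepts(correlated, up, down), accepts(coincidental, up, down))

With these toy numbers, the aggressive learner accepts both streams - a false positive - while the conservative one accepts only the genuine correlation; but had the genuine pattern occurred only a handful of times, the conservative learner would have missed it too.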
The details of how the senses work are somewhat beside the point - by the time we consciously perceive something - language, or objects according to the class we assign them to (cup, glass, pitcher, beaker, chalice, ... and the various objects that may fall in more than one of these classes) - a bunch of layers of neurons have often already acted on it. When you hear a word, it first passes through several layers of neural networks: one parses pitch content, another parses relative pitch contour, another parses and classifies the acoustic events as phonemes, another parses and classifies these phonemes as morphemes - doing guesswork that corrects possible mishearings - and yet another tries to reconstruct the syntactic structure that generated the sentence in another mind.
Each of these can gainfully be described as pattern recognition: recognizing speech sounds requires recognizing sounds with some vague acoustic similarity, as well as having recognized which kinds of variation in these patterns are to be expected. Recognizing a word is recognizing a pattern of sounds; error-correction takes various other information as indicating that some other word - or at least an actual word, unlike the raw audio that entered the process - is the one that was heard, because these words and extralinguistic facts (things we have seen or know by other means) tend to pattern together.
The unit of recognition is the whole pattern, not the parts of the pattern - if pieces are missing or wrong but the pattern is recognized anyway, we are likely to perceive the whole pattern rather than the mistaken details in it. And some neuron activity probably contributes to another neuron that also receives already-processed signals stemming from the same sources, so there is a fair share of things, ultimately, that complicate the matter.
However, what "a whole pattern" is depends on the size of the "circuitry" we are discussing - when you hear someone say something, there are parts of your brain that react to patterns in the intonation, there are parts that react to the acoustics of really short samples, there are parts that react to the acoustics of a set of the samples (and recognize that yeah, this is Eric's voice), there are parts that react to the speech sounds (this is a d, this is an ɪ, this is an s, this is some noise I couldn't recognize, this is an z, ...), there are parts that react to the series of speech sounds (this is dɪs ?z, and because this pattern is similar enough to ðɪs ɪz it will by fortunate accident be identified as this is), there are parts that react to the words in the previously identified sentences, and that parse the grammar. This, in turn, also interacts with other things stored in the neural network, such as things the listener knows about the speaker that may affect what he is saying at the moment, and so on - of course depending on whether the relevant parts of the neural network have been forming connections between them and so on.
I do acknowledge that the above does not explain how and why a neural network also produces linguistic content. A longer essay on mirror neurons, on the interactions of various other parts of the brain, and on the evolutionary pressures that caused those parts to trigger certain things would be needed to explain why neural networks also have behaviors, instead of just identifying a pattern and sending a positive or negative conclusion to some final node.
It is worth noting that linguistics is split on how language is implemented in the brain. Chomskyans hold that the neurons of the brain come, to some extent, preprogrammed. This would make it easy for a human to learn language, because some basic patterns are already embedded in our brains - all we need to learn is which ways these patterns are implemented in the language around us. One such pattern is supposedly the object, i.e. the fact that verbs often can have an argument that is, in some sense, a primary complement. Objects but not subjects being universal could point to objects being such an embedded notion. (I have seen other sources maintain that only subjects are universal, so do not quote me on either of these.)
However, I will not present arguments here for or against whether such a language organ is present in the brain, or whether the brain more generally just happens to enable language without a portion devoted to the purpose - though I will admit that I find the Chomskyan school more convincing on this topic. Linguists do not often seem to talk explicitly about neural models of language - but this is mostly because the analysis operates at a more abstract level, where more general models of computation suffice. Ultimately, detailed analysis of neural networks is cumbersome, and this may be why their study is not common in linguistics departments. They do have applications in computational linguistics, though.
Side-track 2: The speech community
A language without a speech community is a dead language. The speech community consists of the speakers of some language. For most of the history of mankind, all actual linguistic content has come rather fresh off another neural network.
This means that what one should look at is the effect of having a lot of neural networks - each with a slightly different architecture: the neurons are probably not perfectly identical in the first place, the information that has entered each network differs, and so on. It should be obvious that different individual networks will have identified different patterns - and acquired different false positives as well.
However, as the same patterns that are used to recognize language are also used to produce it, these patterns will be present in the linguistic data we are exposed to throughout our lives. Thus, there are definite patterns in the language we hear - simply because these patterns emerge from other pattern-matching neural architectures.
As previously stated, it is somewhat likely that different individuals have slightly different setups in their brains. Thus, some patterns that exist for some members of the population may not exist for others. Meanwhile, some speakers may have identified the same things differently.
Consider, for instance, the development of the word beads. Whether or not the meaning shift that occurred was the result of intentional metaphor, it is quite clear that at one point in the history of English nearly everyone understood bead as referring to prayer, whereas at a later point nearly everyone understood it as referring to small, round, solid objects. Over time, a mistaken pattern became so popular that it replaced the previous one: the identification of the word's referent was reinterpreted. Those who counted their prayers did so by counting a kind of round solid object so often that observers took the word to signify the round solid objects themselves, because of a rather obvious coöccurrence.
Neural networks explain both how grammar is passed down the generations, and how grammar changes as it is parsed slightly differently down the lines. How do accidental patterns create grammar though?
If a certain tendency for collocation has appeared - some words or morphemes tend to occur in sequence or near each other under certain circumstances - this is easily understood as expressing circumstances along those lines. If others pick up the pattern, it grammaticalizes, and suddenly there is grammar. All grammar in all languages probably originates in this phenomenon, though later on, influence between languages has also added some to the mix.
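A toy collocation counter shows how such a tendency might be picked up from raw frequency alone - the corpus and the crude association score are invented, and "going to" stands in for the classic grammaticalization path towards gonna:

    from collections import Counter

    tokens = ("i am going to go . she is going to leave . "
              "we walked to town . i am going to sleep").split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    for (a, b), n in bigrams.most_common(3):
        # How often b follows a, relative to how often a occurs at all.
        print(a, b, n / unigrams[a])

In this tiny corpus, 'going' is followed by 'to' every single time - exactly the kind of statistical regularity a pattern matcher would latch onto.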
Of course, some individuals may not have grammaticalized the same patterns, and we do find some variation in how subcommunities of a speech community - and even individual speakers - understand and use constructions. Many dismiss this as sloppy or ignorant speech, but a lot of neural network effort has gone into generalizing other underlying patterns. Can we really say one generalization is better than another? Which one is better - one that is more consistent with other patterns in the language? One that is more elegant? One that is more parsimonious? It turns out the standard language sometimes expects a more parsimonious pattern, and sometimes a less parsimonious one (see, e.g., begs the question, where the standard language expects a very unnatural and often not very obvious interpretation!).
Ultimately, a population of neural networks exchanging messages in a flexible protocol - one that adjusts for the properties of both the neural networks and the medium over which the signal is passed (audio through air, text on various materials, morse beeps over electric lines, and so on) - is a sufficient explanation for how grammar appears. The neural networks identify patterns - even unintentional patterns - and generalize them. Sometimes the identification goes wrong. If many speakers make the same misidentification, the entire language is likely to change with them - in reality, we should probably think of any specific language as some kind of average of how the population parses and generates it.
Further, lots of grammar ensures some degree of redundancy in the language - and this is useful for giving the language some persistence over a noisy channel, as the world does happen to be such a noisy channel. Some grammar - verb conjugations, case agreement, noun gender, etc. - probably is the result of repeated elements that help in guessing the intended meaning even when noise eats some important syllables: if there are fifteen words in the language that begin with ka-, and you do not hear the rest of the word, but some other word gives away that the word is of, say, neuter gender, the neural network can likely exclude many candidate words - if the context otherwise was not enough to exclude all but one.
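As a sketch of that guessing process - a gender cue elsewhere in the sentence pruning the candidate set for a half-heard word - consider the following; the lexicon, the genders and the fragment are all invented for illustration:

    # Toy lexicon mapping word forms to grammatical gender.
    lexicon = {
        "kamera": "common", "kasse": "common", "kapitel": "neuter",
        "kall": "common", "kaos": "neuter", "kort": "neuter",
    }
    heard_prefix = "ka"      # all the noise let through
    cue_gender = "neuter"    # recovered from, say, an agreeing article

    candidates = [w for w in lexicon if w.startswith(heard_prefix)]
    narrowed = [w for w in candidates if lexicon[w] == cue_gender]
    print(candidates)  # ['kamera', 'kasse', 'kapitel', 'kall', 'kaos']
    print(narrowed)    # ['kapitel', 'kaos']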
We do, in fact, mishear a lot more than we think, and our brains use cues along these lines to reconstruct the data. In a language without redundancy, the number of times people around us would keep going "what?", "excuse me, what'd you say?" would probably cause us to repeat the salient information - hence, e.g., the widespread use of double negation throughout the languages of the world. If such repetition gets turned into a pattern, and this pattern is worn down by sound change, it easily turns into regular morphology. Other factors probably also underlie the appearance of grammar, though, such as the Chomskyan notion of a part of the brain constituting a language organ with certain pre-wired settings conducive to learning language.