The principle of comparative complexity

When reproducing biological neural properties in software, we have the principle of comparative complexity. This is the idea that if a biological system we wish to emulate has complexity on the order of 10^n, then we shouldn’t expect to be able to implement that system in software with complexity significantly less than 10^n. For reference, the human brain has approximately 10^11 neurons and 10^14 synapses.

The assumption of biological efficiency

The above principle, though, depends on the assumption of biological efficiency: that there does not exist a system with essentially identical properties but significantly less complexity. In other words, evolution has not settled on a system with large-scale inefficiency.

The factorization short-cut

It sometimes turns out that complexity can be factored. Instead of building a single system of complexity 10^n, we can decompose it into smaller modules, say of complexity 10^m, and then, without much further work, replicate each module 10^k times, where m + k ~ n.
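
As a toy illustration (the split of n into m and k here is invented for the example, not a claim about the brain), the arithmetic of the short-cut is just exponent addition:

```python
# Illustrative arithmetic for the factorization short-cut; the particular
# split of n into m and k is a made-up example.
n = 11         # target complexity ~10^11 (e.g. neuron count)
m, k = 4, 7    # hypothetical: design one 10^4-complexity module,
               # then replicate it 10^7 times

design_cost = 10**m              # we only have to design a single module
total_complexity = 10**m * 10**k

assert total_complexity == 10**(m + k) == 10**n
print(f"design 10^{m}, replicate 10^{k}, get 10^{m + k}")
```

The design effort is 10^m, many orders of magnitude below the 10^n the full system exhibits.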

Deep learning

Deep learning is the canonical example of the factorization short-cut.

The approximation short-cut

The other short-cut is that of approximation. We build a simpler system by only implementing key features of the original system.

The ideal gas law

In the case of the ideal gas law, a litre of air contains roughly 3 × 10^22 molecules, yet its bulk behaviour can be approximated by 4 variables, 1 constant, and 1 equation:

PV = nRT

P = pressure of the gas
V = volume of the gas
n = number of moles of the gas
R = ideal gas constant
T = absolute temperature of the gas in Kelvin
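
As a minimal sketch (SI units and the function name are my own), the whole litre of air reduces to one line of arithmetic:

```python
R = 8.314  # ideal gas constant, J/(mol*K)

def pressure_pa(n_mol, temp_kelvin, volume_m3):
    """Pressure in pascals from PV = nRT, rearranged as P = nRT/V."""
    return n_mol * R * temp_kelvin / volume_m3

# One litre of air at room temperature: ~0.041 mol, 298 K, 0.001 m^3.
print(pressure_pa(0.041, 298.0, 0.001))  # ~101,600 Pa, about one atmosphere
```

Four numbers in, one number out, versus ~10^22 molecular trajectories: the approximation short-cut at its most dramatic.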

Implications for AI related projects

Given how many neurons, synapses, and other sources of complexity a real brain has, we should not expect to be able to implement an interesting AI project without it having significant complexity in one way or another. Our best hopes are the two short-cuts: factorization and approximation.

Future work

If there is sufficient interest in this brief note, perhaps it could be expanded on.


One thing I’ve found is that in biology, even though these systems have this theoretical 10^n complexity, the resources actually used are only a fraction of that, in almost every respect.

Activity

Sparse neural activations are preferred over dense activations. This means only a fraction of the total neurons fire action potentials at any given time. We don’t need to simulate all 10^n neurons when running an algorithm, just a fraction of them.
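
A minimal sketch of that idea (the top-k scheme and all names are my own, not a model of real neurons): simulate a large population but let only a small fraction fire:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_activate(potentials, k):
    """Return a copy of `potentials` with all but the k largest zeroed."""
    out = np.zeros_like(potentials)
    idx = np.argpartition(potentials, -k)[-k:]  # indices of the top k
    out[idx] = potentials[idx]
    return out

x = rng.normal(size=1000)        # 1000 simulated membrane potentials
active = sparse_activate(x, 20)  # only 2% of the population "fires"
print(np.count_nonzero(active))  # → 20
```

Downstream computation then only needs to touch the 2% of nonzero entries, which is where the resource saving comes from.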

Connectivity

Neurons are not connected to each other in a complete graph, which would create the theoretical 10^{2n} connections. Instead, connectivity tends toward sparsity as well. However, each neuron has at least one connection, so the number of connections would still be greater than 10^n, assuming we’re still determined to simulate 10^n neurons.
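
A quick back-of-envelope comparison, using the brain’s rough numbers from the opening post and an assumed average fan-out of ~10^3 synapses per neuron:

```python
neurons = 10**11          # ~10^11 neurons, as in the opening post
fan_out = 10**3           # assumed average synapses per neuron

dense_synapses = neurons * neurons   # complete graph: 10^22 connections
sparse_synapses = neurons * fan_out  # sparse connectivity: 10^14

assert sparse_synapses == 10**14     # matches the brain's synapse count
print(dense_synapses // sparse_synapses)  # sparsity saves a factor of 10^8
```
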

Schemas

As you said, complexity can be factored, and that seems to be what the brain does by forming schemas: instantiated networks of circuits that perform a particular task. This can be seen as a form of factorization in this framework.

Neuron Features

A fully-featured neuron has a lot of extra machinery that is only used in certain special cases. Neurons are specialized depending on their membership in a circuit or their location in the brain. There are also a number of vestigial functions that are rarely used. Sometimes the brain discards these vestigial functions, and sometimes it doesn’t; either way, our algorithms don’t need to simulate them.

If we can understand the core computational principles of particular neural circuits, we can carefully select the neuron model we use for each task, and thereby throw out a lot of computational burden. I suppose this can be seen as a situational form of approximation that requires really understanding what the computational purpose of a given circuit is.

Prune Principle

So let’s call this the prune principle. To build a neuromorphic algorithm that approximates this theoretical 10^n-complexity system, we need to be aggressive in pruning structures that are inessential to delivering our intended functionality.
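
A minimal sketch of the prune principle (the dict layout and threshold are my own invention, not a proposed data structure):

```python
# Drop any simulated neuron whose activation count falls below a threshold.
def prune(neurons, min_activations):
    """Keep only neurons that have fired at least `min_activations` times."""
    return {name: n for name, n in neurons.items()
            if n["activation_count"] >= min_activations}

population = {
    "n1": {"activation_count": 42},
    "n2": {"activation_count": 0},    # never used: prune it
    "n3": {"activation_count": 7},
}
print(sorted(prune(population, 5)))   # → ['n1', 'n3']
```

The same filter generalizes to pruning synapses, vestigial features, or whole circuits, given some usage metric for each.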


FYI, you can use this notation for MathJax/LaTeX math formatting:

$10^n$

which creates

10^n


@jacobeverist, you are much more of an optimist than I am in this regard! The human brain is biologically very expensive, so I am not convinced that there is wide-scale inefficiency in its implementation. Evolution surely would have evolved something that was cheaper to maintain?

I also disagree about sparse activity and sparse connectivity, since I see them as orthogonal to the idea of complexity. A system can be sparse, but, it seems to me, that does not necessarily imply the system has lower complexity. Indeed, there is an argument that a graph with all nodes connected has significantly lower complexity than a system with sparse connectivity: in the first case it is easy to just link everything to everything, but in the second case you need to keep track of all the individual links between graph nodes, so it is represented using more information/complexity. But yes, I do of course agree that the brain is a sparse system.

As for your prune principle: yes, we should be looking for approximations that result in neuromorphic algorithms close enough to what we want to get done, with hopefully significantly lower complexity. I’m just a pessimist over whether they actually exist, or whether humans are smart enough to invent them. For example, HTM theory is searching for encoders that are relatively cheap to implement in code, but I’m not convinced that the vast complexity of the human visual and auditory cortex can be approximated so cheaply. All of those neurons and synapses must be doing something!

OK. On further thought, I will admit something though! The human brain’s algorithm for doing simple arithmetic is vastly inefficient compared to that implemented using computers.

There is also the well-known analogy: we want to fly using a plane, not a bird. Ditto for neuromorphic code.

There are some nice parameter counts and stats here. In particular:

  • ChatGPT-4 is estimated to have roughly 1.8 trillion parameters
  • GPT-4 is therefore over ten times larger than its predecessor, GPT-3
  • ChatGPT-4o Mini could be as small as 8 billion parameters

So ChatGPT-4 is ~ 10^{12} parameters vs roughly 10^{11} neurons and 10^{14} synapses in a human brain.

Though there is an unknown conversion factor here: we would need a measure of complexity as a function of Large Language Model parameters, versus a measure of complexity as a function of neuron and synapse count. But for this post, assuming ~1:1 will do, in which case ChatGPT is clearly in the realm of interesting AI.
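
Under that (admittedly naive) 1:1 assumption, the comparison is just a log10:

```python
import math

# Naive order-of-magnitude comparison, assuming ~1 LLM parameter per unit
# of biological complexity (the 1:1 conversion is an assumption, as above).
gpt4_params = 1.8e12      # ~10^12, the estimate quoted above
brain_neurons = 1e11      # ~10^11
brain_synapses = 1e14     # ~10^14

print(round(math.log10(gpt4_params), 1))    # → 12.3, between the two
print(round(brain_synapses / gpt4_params))  # → 56, the synapse-count gap
```

So by parameter count, GPT-4 sits above the brain’s neuron count but a factor of ~50 below its synapse count.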

An optimistic assumption is that brain size has a lot to do with its ability to learn.

It (likely) learns in an evolutionary way, in which every column picks random, small samples or “bits” from the current state and attempts to make predictions using this sparse data. With millions of columns doing the same thing in parallel, eventually, whenever a hidden pattern is discoverable, some of the columns say “eureka” and become more “powerful” within that given context.
It’s pretty much an evolutionary process, selecting the most “plausible” from a very large number of random simple hypotheses whenever unexpected events occur.

That’s why it needs to be much more complex (I mean in size) than DL/NNs: it can’t afford long searches over lots of data. It has to speculate useful results from scarce data and get them quickly.

A DL example: Nvidia’s Hover can reasonably control a humanoid body with a puny 1.5M parameter count. Its learning is equivalent to an evolutionary process augmented by a physics simulation running 10,000× faster than real time. Learning is expensive; the result is cheap.

So I consider this assumption as optimistic because:

  • first, an evolutionary search process for learning can be massively parallelized on cheap, off-the-shelf hardware.
  • second, the results of learning have the potential to be massively compressed, since iterative evolution can be driven to find not only a model that works, but a minimal one tuned to the most significant criteria/parameters that lead to a useful conclusion.

OK, this has my attention. In my SynaptiFlux toy model I was looking for a way to auto-learn new patterns, and your comment has some similarity with what I was trying to do. I settled on the idea of partitioning neurons into layers and implementing a delay counter. We then store any pattern that occurs within some time window, in some layer, and does not activate an existing neuron. Further, I implemented activation counts for each neuron (i.e., a measure of how powerful or useful a given neuron is), and later a method to prune all neurons below some activation count (there is no point keeping neurons that are never used).

This method also crossed my mind, but I couldn’t think of an efficient way to do the sampling step, given there are potentially very many neurons, and even more ways to do random sub-sampling. That is why I settled on layers and a delay counter. I would be interested to hear if you have an efficient way to do this sampling.

Yeah, it makes sense.
Also, a global-scale negative/neutral/positive feedback signal should be useful to guide the decision on whether to activate learning or not. Animals somehow know when their recent choices need correction. ML researchers can afford to make a priori assumptions about what is “correct” and “incorrect” and engineer the training process according to those assumptions.

But living things can’t; they need an internal global metric on the quality of the recent past, which can be labelled “as expected” or “surprising”, with surprise polarised between “good” and “bad” surprises.
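
A toy sketch of such a polarized surprise signal (the thresholds and function name are mine):

```python
# Compare a prediction with what actually happened; large errors are
# "surprising", and the sign of the error polarizes the surprise.
def surprise(predicted, observed, tolerance=0.1):
    error = observed - predicted
    if abs(error) <= tolerance:
        return "as expected"
    return "good surprise" if error > 0 else "bad surprise"

print(surprise(1.0, 1.05))  # → as expected
print(surprise(1.0, 1.6))   # → good surprise
print(surprise(1.0, 0.3))   # → bad surprise
```

Such a signal could then gate whether learning is activated at all, per your point above.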

Further, the “value” of a neuron for survival is also driven by its past influence in making correct decisions, even when its activation has been low for a long time.


I personally don’t.
But I’m fascinated by the classical ML case of decision trees vs random forests.

You know, a monolithic decision tree model attempts to account for every training data point and every input feature in order to build up its decision branching.
A random forest is much more powerful than a single decision tree. The obvious explanation is “sure, an ensemble of decision trees voting should be better than a single tree”, but there’s more to it: an RF isn’t only an ensemble of decision trees; each tree is built using a random (aka sparse) subset of both the data points and the input features.

All trees in a forest are weak, yet a simple voting ensemble of weak trees becomes (relatively) very strong.
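
That strength-from-weakness effect can be simulated directly (the 60% accuracy and learner counts are invented numbers, and the learners are assumed fully independent, which real trees are not):

```python
import numpy as np

rng = np.random.default_rng(42)

def ensemble_accuracy(n_learners, p_correct, trials=10_000):
    """Majority vote of independent weak learners, each right with prob p."""
    votes = rng.random((trials, n_learners)) < p_correct  # True = correct vote
    majority_correct = votes.sum(axis=1) > n_learners / 2
    return majority_correct.mean()

print(ensemble_accuracy(1, 0.6))    # a lone weak learner: ~0.60
print(ensemble_accuracy(101, 0.6))  # 101 voting weak learners: ~0.98
```

This is essentially the Condorcet jury theorem; correlation between trees is what keeps real forests below this idealized ceiling.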

And there are a few important lessons/observations about this:

  • First, even with dozens or hundreds of weak trees, far from any attempt to cover all possible combinations, performance increases dramatically over a single large tree.
  • Second, the RF algorithm just picks random sub-trees, without any selective process measuring which data points or which features have the most impact on performance, i.e. without any selective/evolutionary search for the relevant data, for what matters and what is irrelevant.
  • Third, big random forests can become prohibitively expensive, and having a meta-learning layer to predict the 1% of sub-trees most likely to give useful answers in the current context might also unlock scaling.

I know the above is focused on decision trees / random forests, but (IMO) the same rationale/principles should apply to any underlying “learner”: NNs, spiking nets, etc.

I think picking any specific architecture for the weak learners (besides decision trees) would have limited impact on the ensemble/forest.

PS: Here’s a comparison of various learning algorithms on a simple problem; they all end up with similar performance. Picking one should have more to do with its performance on the hardware at hand than with philosophical principles.

Yeah, global feedback on whether to reinforce the current neural state sounds like a desirable feature. I can’t help but think of sports players, tennis players say: when they hit a great shot they have a very over-the-top positive response, and ditto the opposite when they mess up a shot, e.g. fist pumping vs smashing their racket against something. In the context of SF, I don’t know how to implement this idea; it seems to sit far higher up the abstraction level than I am currently working at.

I don’t think that should be a problem in SF. As long as we don’t reset the activation levels of our neurons, then if a neuron was activated above the prune threshold in the past, it should be fine even if it is not activated for a long period. As for the human brain, it certainly seems that neurons are pruned according to some measure of half-life: the longer a neuron goes unused, the more likely it is to be pruned, and the higher its historical activation level, the less likely. But in SF, that is a minor detail to consider later. Perhaps a related concept is long-term potentiation. In SF, I think that should be a feature of our synapse functions, but I’m not yet sure of the cleanest way to implement it.
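
A half-life-style decay is simple to sketch (the formula and numbers are my own, not part of SynaptiFlux):

```python
# A neuron's score decays exponentially with time since its last use, so
# long-unused neurons drift down toward the prune threshold.
def decayed_activation(count, steps_since_use, half_life=100):
    """Activation score halves for every `half_life` steps of disuse."""
    return count * 0.5 ** (steps_since_use / half_life)

print(decayed_activation(80, 0))    # → 80.0 (just used: full strength)
print(decayed_activation(80, 300))  # → 10.0 (three half-lives of disuse)
```

A neuron would then be pruned when its decayed score, rather than its raw count, falls below the threshold, so both historical activation and recency matter.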

Interesting … This reminds me of a documentary I saw a long while back. A mathematician was testing the idea of averaging approximations, so he went to a fish market, bought a random fish, and walked down the street asking people to estimate its weight just from looking at it. Of course, the guesses were all over the place, but at the end of the day he averaged them and the result was accurate to one or two decimal places! Somewhat remarkable, or at least surprising. In this analogy, the weak decision trees are the humans, and the RF is the average of their predictions.
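
The fish-market story is easy to simulate (all numbers here are invented, and the guesses are assumed unbiased, which is the crucial caveat):

```python
import numpy as np

rng = np.random.default_rng(7)

true_weight = 3.2                            # kg, hypothetical fish
guesses = rng.normal(true_weight, 1.0, 500)  # individuals are off by ~1 kg

# Individual error is ~1 kg, but the error of the mean shrinks like
# 1/sqrt(500), i.e. to a few hundredths of a kg.
print(abs(guesses.mean() - true_weight))
```

If the guesses were biased (everyone systematically overestimating), averaging would not help, which is why random forests decorrelate their trees with random subsets.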

I guess your second point is that we should not be too fussy about how we select our sub-patterns when attempting to auto-learn. If we have enough of them, they tend to converge on a correct prediction regardless, perhaps with some filtering to weed out the really terrible patterns.

An ML technique also comes to mind, but it was so long ago I can’t recall all the details! Damn neuron half-lives :slight_smile:. The idea was roughly: you have a collection of academic papers in some field, but no labelled training data. You get researchers in that field to write very simple, short Python scripts or regexes designed to look for certain patterns in the training data (such as the names of genes, say). These were specifically very weak pattern detectors, but that didn’t matter to the process; it was more a feature than a bug. The bit I don’t remember is what they did next (or what the approach was called). There was then some process that combined all this and auto-magically produced labelled training data with high accuracy, which was then used in a standard ML approach on the next set of unlabelled training data in that field of study.

In the context of my SF toy, when defining a trigger pattern for some neuron, there are two approaches to combining multiple patterns: if the patterns are dissimilar enough, then you pool them, but if they have similar structure you can superposition-add them. Let me give some code. First we define a neuron with a single pattern:

N1 = sf.Neuron('neuron example', 0, [1,1,1,1], ['alpha', 'beta', 'gamma', 'epsilon'], sf.trigger_list_min_simm_threshold, {'threshold': 0.98}, sf.pooling_or, {})

Then we superposition add more patterns:

    N1.apply_operator(0, sf.operator_add, {'coeffs1': [1,1], 'labels1': ['alpha', 'gamma']})
    N1.apply_operator(0, sf.operator_add, {'coeffs1': [1,1], 'labels1': ['beta', 'gamma']})
    N1.apply_operator(0, sf.operator_add, {'coeffs1': [1,1,1], 'labels1': ['alpha', 'gamma', 'delta']})

Resulting in this neuron:

Neuron: neuron example
    layer: 0
    activation count: 0
    pooling: <function pooling_or at 0x7f1ee2ed9e18>
    params: {}
    patterns: 1
        0    trigger: <function trigger_list_min_simm_threshold at 0x7f1ee2ed9d90>
        0    params: {'threshold': 0.98}
        0    [3.0, 2.0, 4.0, 1.0, 1.0]
        0    ['alpha', 'beta', 'gamma', 'epsilon', 'delta']

    axon: []

Finally, if we drop-below(threshold) using this code:

N1.apply_operator(0, sf.operator_drop_below, {'threshold': 2.5})

we end up with this final neuron:

Neuron: neuron example
    layer: 0
    activation count: 0
    pooling: <function pooling_or at 0x7f1ee2ed9e18>
    params: {}
    patterns: 1
        0    trigger: <function trigger_list_min_simm_threshold at 0x7f1ee2ed9d90>
        0    params: {'threshold': 0.98}
        0    [3.0, 4.0]
        0    ['alpha', 'gamma']

    axon: []

Anyway, the point I’m trying to make is that this might be an approach for auto-learning too: we either pool or add patterns, and then filter the individual elements of the added patterns based on how frequently they have occurred. More thought is required on this idea.

For reference, my operators are here.


Well, I have no clue. My neurons are quite rusty too, all nice with talking till they have to follow code.