"Grey neurons": statistical mechanics of neural networks
The history of neural networks and artificial intelligence in a nutshell.
The main pioneering steps that traced the "route to neural networks" were taken mainly by mathematicians and biologists rather than theoretical physicists, who "joined the community" relatively late. Along this walk, the milestones that, to me at least, have played a major role can be summarized as follows: on one side, the great advances in pure logic, initially achieved by Russell and Whitehead with their Principia Mathematica, which paved the way to the conceptual and epistemological revolution achieved by Gödel with his Completeness and Incompleteness Theorems (see more here), and, later along this line, the celebrated books by George Boole, "The Mathematical Analysis of Logic" and "An Investigation of the Laws of Thought". This branch of "pure mathematics" has been an incredible seed for other contemporary breakthroughs, such as those achieved by Wiener (the "father of cybernetics") and his PhD student Shannon (the "father of information theory"): it is remarkable that, while they worked on problems such as information transport along a (cable) line, the "flavour" of their works genuinely resembles statistical mechanics, although at that level of resolution it was still a separate discipline of science. It is also worth stressing that Columbia shares online its best PhD theses and, if I may suggest, the 25-page PhD thesis by Shannon (with Wiener as supervisor) is a document that, by far, deserves to be read... the interested reader may find it here. Meanwhile, with these arguments and results safely shared among these pioneers, two other leading figures obtained further crucial results, this time pairing their very impressive mathematical skills with a "practical spirit" and a "concrete attention to thinking machines"; the two names are Alan Turing and John von Neumann.
Their contribution to computing machines (from computers to neural networks) is astonishing: they really built up the first modern calculators, while keeping "roots in the past", as the main ideas were adapted from the Babbage machine and the revolutionary Jacquard loom of the textile industry. Two short popular books by these giants of the past are available, "Mechanical Intelligence" by the former, Turing, and "The Computer and the Brain" by the latter, von Neumann: the interested reader should certainly take them really seriously. On a similar line, the milestones by Rosenblatt and by Minsky & Papert (the authors of "Perceptrons") also appeared in the literature. However, moving to more modern aspects of biological computation and, in particular, to the contribution of theoretical physicists (especially scientists involved in statistical mechanics, as this whole website of mine is built with and around statistical mechanics), the latter finally found a (crucial) role in explaining a distributed working memory as a concrete possibility for an information repository in the cortical areas of the real brain: in a nutshell, the story can be summarized as follows. Thanks to all these pioneers (and to further contributions from the biological branch, such as the "cable theory" by Hodgkin and Huxley or the impressively many works by Tuckwell), it was "understood" how to translate into Boolean terms statements from languages, be they English, Italian or mathematics; thus, in some sense, "computation" was a skill addressed by their know-how. However, another very important problem remained elusive for a long time, namely a robust mechanism to store information. To overcome this challenging question too, the main route was paved by Donald Hebb with his incredibly deep book "The Organization of Behavior", where a concrete proposal for a distributed memory (to be called "associative", "emergent" or "collective" later on by physicists) was put forward.
Indeed, Hebb's work inspired a very influential paper by John Hopfield that, as a matter of fact, gave rise to a new generation of neural networks, a novel paradigm that (within its several variations on the theme) lasts up to the present day. However, the story is not finished yet, as, so far, the implication of statistical mechanics has not emerged. The last point to be elucidated is the role of Giorgio Parisi in offering, via statistical mechanics, an incredibly deep picture of the remarkable (and even "incredible" during the first years one spends studying his works; then it becomes comprehensible) organization of the low-temperature states of spin glasses, which has been one of the deepest and most amazing cultural achievements of the past century. This is a central point because the Hopfield network is a particular example of a spin glass; thus, thanks to Parisi's theory, neural networks (and in particular these novel and very potent ones by Hopfield) could finally be studied via statistical mechanics and, in particular, within the theory of spin glasses. From that point theoretical physicists joined, on the red carpet, the community of researchers involved in brain research, and only a few years after Parisi's theory appeared in the literature, Amit, Gutfreund and Sompolinsky developed a pure and entire statistical-mechanical formulation of neural networks, which is the modern reference frame for this discipline. Apart from Francesco Guerra (who really played a by far central role in my cultural and scientific formation, as he has been, and still is today, my mentor through more than twelve years of continuous scientific interchange), I also studied theoretical physics (and in particular statistical mechanics and neural networks) attending the courses by Giorgio Parisi and Daniel Amit, and it is precisely along the route they paved that my research on neural networks keeps going in these years.
The interest in tackling cognitive systems for a scientist involved in statistical mechanics lies in the highly systemic nature of their performances: a (biological) neural network, as well as an artificial one, is a system (i.e. a brain) formed by a huge number of neurons, whose axons and dendrites (roughly speaking, their output and input connections) are densely connected, the bond linking them (that is, linking the output of a neuron, the axon, to possibly many inputs of other neurons, i.e. dendrites) being the synapse. The latter can be either excitatory (adding to the membrane potential of the receiving cell a positive electric signal contributing to cell polarization) or inhibitory (the contrary): this implies frustration in the network and allows for a direct comparison between neural networks and spin glasses, the latter being the prototype of the most complex system analyzed by statistical mechanics.
Summary
Neural networks are such a fascinating field of science that their development is the result of contributions and efforts from an incredibly large variety of scientists, ranging from engineers (mainly involved in electronics and robotics), theoretical physicists (mainly involved in statistical mechanics and stochastic processes), and mathematicians (mainly working in logic and graph theory) to (neuro)biologists and (cognitive) psychologists. Tracing the genesis and evolution of neural networks is very difficult, probably due to the broad meaning they have acquired over the years. Seminal ideas regarding automation are already present in the works of Lee during the XVIII century, if not back to Descartes, while more modern ideas regarding spontaneous cognition can be attributed to A. Turing and J. von Neumann or to the joint efforts of M. Minsky and S. Papert, just to cite a few; scientists closer to the robotics branch often refer to the W. McCulloch and W.
Pitts model of the perceptron (note that the first "transistor", crucial to switch from analog to digital processing, was developed only in 1948), or the F. Rosenblatt version, while researchers closer to the neurobiology branch adopt D. Hebb's work as a starting point. On the other hand, scientists involved in statistical mechanics, who joined the community in relatively recent times, usually refer to the seminal paper by Hopfield or to the celebrated work by Amit, Gutfreund and Sompolinsky, where the statistical-mechanics analysis of the Hopfield model is effectively carried out. Whatever the reference framework, at least 30 years have elapsed since neural networks entered theoretical-physics research, and in these decades we have understood that toy models for the paramagnetic-ferromagnetic transition (i.e. the Curie-Weiss model) are natural prototypes for the autonomous storage/retrieval of information patterns and play as operational amplifiers do in electronics. Then, by analyzing the capabilities of glassy systems (ensembles of ferromagnets and antiferromagnets) in storing/retrieving extensive numbers of patterns, physicists have been able to recover the Hebb rule for learning in two different ways (the former guided by the ferromagnetic intuition, the latter guided by its glassy counterpart), both far from the original route contained in his milestone (The Organization of Behavior); here we give a prescription to map these glassy systems into ensembles of amplifiers and inverters (thus flip-flops) of the engineering counterpart, so as to offer a concrete bridge between the two communities of theoretical physicists working with complex systems and engineers working with robotics and information processing. In what follows I will deepen the role played by theoretical physics, and in particular by statistical mechanics, in this adventure, in order to try and share the underlying philosophy of our approach.
A BRIEF HISTORICAL SUMMARY OF PHYSICAL APPROACHES IN NEURAL NETWORKS
Hereafter we summarize the fundamental steps that led theoretical physicists towards artificial intelligence; although this parenthesis may look rather distant from neural network scenarios, it actually allows us to outline and historically justify the physicist's perspective. Statistical mechanics arose in the last decades of the XIX century thanks to its founding fathers Ludwig Boltzmann, James Clerk Maxwell and Josiah Willard Gibbs. Its ``sole'' scope (at that time) was to act as a theoretical ground for the already existing empirical thermodynamics, so as to reconcile its noisy and irreversible behavior with a deterministic and time-reversal microscopic dynamics. While condensing statistical mechanics into just a few words is almost meaningless, roughly speaking its functioning may be summarized via toy examples as follows. Let us consider a very simple system, e.g. a perfect gas: its molecules obey a Newton-like microscopic dynamics (without friction, as we are at the molecular level, and thus time-reversal, because the dissipative terms in the differential equations capturing the system's evolution are coupled to odd derivatives, which are absent here) and, instead of focusing on each particular trajectory to characterize the state of the system, we define order parameters (e.g. the density) in terms of the microscopic variables (the particles belonging to the gas). By averaging their evolution over suitable probability measures, and imposing on these averages energy minimization and entropy maximization, it is possible to infer the macroscopic behavior in agreement with thermodynamics, hence reconciling the microscopic deterministic and time-reversal mechanics with the strong macroscopic dictates stemming from the second principle (i.e. the arrow of time coded in the entropy growth). Despite famous attacks on Boltzmann's theorem (e.g.
by Zermelo or Poincaré), statistical mechanics was immediately recognized as a deep and powerful bridge linking the microscopic dynamics of a system's constituents with the (emergent) macroscopic properties shown by the system itself, as exemplified by the equation of state for "perfect gases", obtained by considering a Hamiltonian for a single particle accounting for the kinetic contribution only. One step beyond the perfect gas, Van der Waals and Maxwell in their pioneering works focused on real gases, where particle interactions were finally considered by introducing a nonzero potential in the microscopic Hamiltonian describing the system: this extension implied fifty years of deep changes in the theoretical-physics perspective in order to face new classes of questions. The remarkable reward lies in a theory of phase transitions where the focus is no longer on the details of the system's constituents, but rather on the characteristics of their interactions (an incredible shift in perspective, quite similar to abandoning anthropocentric theories in favor of objective, data-driven approaches). Indeed, phase transitions, namely abrupt changes in the macroscopic state of the whole system, are not due to the particular system considered, but are primarily due to the ability of its constituents to perceive interactions over the thermal noise. For instance, when considering a system made of a large number of water molecules, whatever the level of resolution used to describe the single molecule (ranging from classical to quantum), by properly varying the external tunable parameters (e.g. the temperature), this system eventually changes its state from liquid to vapor (or solid, depending on the parameter values); of course, the same applies to liquids in general.
A curiosity: in the treatment of the Van der Waals real gas, a heuristic criterion, the Maxwell equal-areas rule, useful for evaluating the latent heat during a phase transition, was so clear that the need for a proof of its correctness was not a major task in statistical mechanics. It is funny (for me only, I'd say) that, nevertheless, a mathematical ground for Maxwell's principle was achieved only in recent times, by myself in a joint work with Antonio Moro that appeared in the Annals of Physics in 2015 (publication n. 70 in my publication list, whose title is "Exact solution of the Van der Waals model in the critical region"). The fact that the macroscopic behavior of a system may spontaneously show cooperative, emergent properties, actually hidden in its microscopic description and not directly deducible when looking at its components alone, was a blow to reductionism and was also definitely appealing to neuroscience (as the brain shows remarkable "emergent properties" that are not found when analyzing a neuron alone). In fact, in the '70s neuronal dynamics along axons, from dendrites to synapses, was already rather clear (see e.g. the celebrated book by Tuckwell) and not much more intricate than circuits that may arise from basic human creativity: remarkably simpler than expected and certainly trivial with respect to overall cerebral functionalities like learning or computation; thus the aptness of a thermodynamic formulation of neural interactions to reveal possible emergent capabilities was immediately pointed out, although the route was not yet clear. Interestingly, a big step forward towards this goal was prompted by problems stemming from condensed matter. In fact, theoretical physicists quickly realized that the purely kinetic Hamiltonian introduced for perfect gases (or Hamiltonians with mild potentials allowing for real gases) is no longer suitable for solids, where atoms do not move freely and the main energy contributions come from potentials.
An ensemble of harmonic oscillators (mimicking the oscillations of the atomic nuclei around their rest positions) was the first scenario for understanding condensed matter; however, as experimentally revealed by crystallography, nuclei are arranged according to regular lattices, hence motivating mathematicians to study periodic structures to help physicists in this modeling, but merging statistical mechanics with lattice theories soon resulted in practically intractable models (for instance the famous Ising model, dated 1920 and curiously invented by Lenz, whose properties are known in dimensions one and two, is still waiting for a solution in three dimensions). As a paradigmatic example, let us consider the one-dimensional Ising model, originally introduced to investigate the magnetic properties of matter: the generic nucleus, out of N, labeled as $i$, is schematically represented by a spin $\sigma_i$, which can assume only two values ($\sigma_i=-1$, spin down, and $\sigma_i=+1$, spin up); nearest-neighbor spins interact reciprocally through positive (i.e. ferromagnetic) interactions $J_{i,i+1}>0$, hence the Hamiltonian of this system can be written as $H_N(\sigma) \propto -\sum_i^N J_{i,i+1}\sigma_i \sigma_{i+1} - h \sum_i^N \sigma_i$, where $h$ tunes the external magnetic field and the minus sign in front of each term of the Hamiltonian ensures that spins try to align with the external field and with each other, in order to fulfill the minimum energy principle. Clearly this model can trivially be extended to higher dimensions; however, due to the prohibitive difficulties in facing the topological constraint of considering nearest-neighbor interactions only, shortcuts were soon implemented to circumvent this path. It is precisely due to an effective shortcut, namely the so-called ``mean-field approximation'', that statistical mechanics approached complex systems and, in particular, artificial intelligence.
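As a toy illustration of the Hamiltonian above, the energy of a one-dimensional Ising chain can be evaluated directly. The sketch below is mine (the function name, the free boundary conditions and the uniform choice J=1 are illustrative assumptions, not taken from a specific reference):

```python
import numpy as np

def ising_energy_1d(sigma, J=1.0, h=0.0):
    """Energy of a 1D Ising chain with free boundaries:
    H = -sum_i J * s_i * s_{i+1} - h * sum_i s_i."""
    sigma = np.asarray(sigma)
    return -J * np.sum(sigma[:-1] * sigma[1:]) - h * np.sum(sigma)

# At h=0 the two fully aligned configurations minimize the energy:
aligned = np.ones(10)
print(ising_energy_1d(aligned))              # -9.0 (9 satisfied bonds)
print(ising_energy_1d(-aligned))             # -9.0 (spin-flip partner)
print(ising_energy_1d(np.tile([1, -1], 5)))  # +9.0 (fully antiparallel chain)
```

The two degenerate minima anticipate the spin-flip symmetry discussed below for the mean-field case.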
The route to complexity: the role of mean-field limitations
As anticipated, the ``mean-field approximation'' allows overcoming the prohibitive technical difficulties owing to the underlying lattice structure. It consists in extending the sum over nearest-neighbor couples (which are O(N)) to include all possible couples in the system (which are O(N^2)), properly rescaling the coupling (J \to J/N) in order to keep the thermodynamical observables linearly extensive. If we consider a ferromagnet built of N Ising spins \sigma_i = \pm 1 with i \in (1,...,N), we can then write H_{N}(\sigma|J) = -\frac{1}{N} \sum_{i<j}^{N,N} J_{ij}\sigma_i \sigma_j \sim -\frac{1}{2N}\sum_{i,j}^{N,N}\sigma_i\sigma_j, where in the last term we neglected the diagonal term (i=j) as it is irrelevant for large N. From a topological perspective, the mean-field approximation amounts to abandoning the lattice structure in favor of a complete graph. When the coupling matrix has only positive entries, e.g. P(J_{ij}) = \delta(J_{ij}-J), this model is named the Curie-Weiss model and acts as the simplest microscopic Hamiltonian able to describe the paramagnetic-ferromagnetic transition experienced by materials when the temperature is properly lowered. An external (magnetic) field h can be accounted for by adding to the Hamiltonian an extra term proportional to -h \sum_i^N \sigma_i. According to the principle of minimum energy, the two-body interaction appearing in the Hamiltonian above tends to make the spins parallel with each other and aligned with the external field, if present. However, in the presence of noise (i.e. if the temperature T is strictly positive), the maximization of entropy must also be taken into account. When the noise level is much higher than the average energy (roughly, if T >> J), noise and entropy-driven disorder prevail and the spins are not able to ``feel'' each other; as a result, they flip randomly and the system behaves as a paramagnet.
Conversely, if the noise is not too loud, the spins start to interact, possibly giving rise to a phase transition; as a result the system globally rearranges its structure, orienting all the spins in the same direction, which is the one selected by the external field if present: we then have a ferromagnet. In the early '70s a scission occurred in the statistical mechanics community: on one side, ``pure physicists'' saw the mean-field approximation merely as a bound to be bypassed in order to have a satisfactory picture of the structure of matter, and they succeeded in working out iterative procedures to embed statistical mechanics in (quasi) three-dimensional reticula, leading to the renormalization group developed by Kadanoff and Wilson in America and by Di Castro and Jona-Lasinio in Europe. This proliferative branch then gave rise to superconductivity, superfluidity and many-body problems in condensed matter. Conversely, on the other side, the mean-field approximation acted as a breach in the wall of complex systems: a thermodynamical investigation of phenomena occurring on general structures lacking a Euclidean metric (e.g. Erdős–Rényi graphs, small-world graphs, diluted and weighted graphs) then became possible. In general, as long as the summations run over all the indexes (hence mean-field is retained), rather complex coupling patterns can be solved (see e.g. the striking Parisi picture of mean-field glassy systems), and this paved the way to complex-system analysis by statistical mechanics, whose investigation largely covers neural networks too.
STATISTICAL MECHANICS FOR NEURAL NETWORKS
Hereafter we discuss how to approach neural networks starting from models mimicking the ferromagnetic transition. In particular, we study the Curie-Weiss model and show how it can store one pattern of information, and then we bridge its input-output relation (called self-consistency) with the transfer function of an operational amplifier.
Then we notice that such a stored pattern has a very peculiar structure which is hardly natural, but we overcome this (apparent) flaw by introducing a gauge variant known as the Mattis model. This scenario can be looked at as a primordial neural network and we discuss its connection with biological neurons and operational amplifiers. The next step consists in extending, through elementary arguments, this picture so as to include and store several patterns. In this way we recover, via a first alternative route (w.r.t. the original one by Hebb), both the Hebb rule for synaptic plasticity and, as a corollary, the Hopfield model for neural networks too, which will be further analyzed in terms of flip-flops and information storage.
Storing the first pattern: the Curie-Weiss paradigm.
The statistical-mechanical analysis of the Curie-Weiss model (CW) can be summarized as follows: starting from a microscopic formulation of the system, i.e. N spins labeled as i,j,..., their pairwise couplings J_{ij} = J, and possibly an external field h, we derive an explicit expression for its (macroscopic) free energy A(\beta). The latter is the effective energy, namely the difference between the entropy S and the internal energy U rescaled by the noise level \beta = 1/T, namely A(\beta) = S - \beta U; in fact, S is the ``penalty'' to be paid to the Second Principle for using U at noise level \beta. We can therefore link the macroscopic free energy with the microscopic dynamics via the fundamental relation A(\beta) = \lim_{N \to \infty} \frac1N \ln \sum_{\{ \sigma \}}^{2^N} \exp\left[ -\beta H_N(\sigma|J,h)\right], where the sum is performed over the set \{\sigma\} of all 2^N possible spin configurations, each weighted by the Boltzmann factor \exp[-\beta H_N(\sigma|J,h)] that tests the likelihood of the related configuration.
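For a small system, the Boltzmann-weighted sum over all 2^N configurations can be carried out explicitly by brute force. The sketch below is my own illustration (function names and the choice N=12 are arbitrary); since at h=0 the spin-flip symmetry forces the plain average of m to vanish exactly, we monitor the average of |m| instead, which still distinguishes the two phases:

```python
import itertools
import numpy as np

def cw_energy(sigma, J=1.0):
    """Curie-Weiss energy H = -(J/2N) sum_{i,j} s_i s_j = -(N J / 2) m^2."""
    N = len(sigma)
    m = np.mean(sigma)
    return -0.5 * N * J * m**2

def avg_abs_magnetization(N, beta, J=1.0):
    """<|m|> by exhaustive enumeration of all 2^N spin configurations,
    each weighted by its Boltzmann factor exp(-beta * H)."""
    num, Z = 0.0, 0.0
    for sigma in itertools.product([-1, 1], repeat=N):
        w = np.exp(-beta * cw_energy(sigma, J))
        num += abs(np.mean(sigma)) * w
        Z += w
    return num / Z

# Below the critical noise level the system orders, above it does not:
print(avg_abs_magnetization(N=12, beta=3.0))  # close to 1 (ferromagnet)
print(avg_abs_magnetization(N=12, beta=0.1))  # small (paramagnet)
```

Of course this enumeration scales as 2^N; the saddle-point machinery described next is precisely what replaces it in the thermodynamic limit.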
From an explicit expression of A(\beta) we can derive the whole thermodynamics, and in particular the phase diagrams; that is, we are able to discern regions in the space of tunable parameters (e.g. the temperature/noise level) where the system behaves as a paramagnet or as a ferromagnet. Thermodynamical averages, denoted by the symbol \langle \cdot \rangle, provide for a given observable its expected value, namely the value to be compared with measurements in an experiment. For instance, for the magnetization m(\sigma) = \frac{1}{N}\sum_{i=1}^N \sigma_i we have \langle m (\beta) \rangle = \frac{\sum_{\sigma} m(\sigma) e^{-\beta H_N(\sigma|J)}}{\sum_{\sigma}e^{-\beta H_N(\sigma|J)}}. When \beta \to \infty the system is noiseless (zero temperature), hence the spins feel each other without errors and the system behaves ferromagnetically (\langle m \rangle \to 1), while when \beta \to 0 the system behaves completely randomly (infinite temperature), interactions cannot be felt and the system is a paramagnet (\langle m \rangle \to 0). In between, a phase transition happens. In the Curie-Weiss model the magnetization works as an order parameter: its thermodynamical average is zero when the system is in the paramagnetic (disordered) state (\langle m \rangle = 0), while it is different from zero in the ferromagnetic state (where it can be either positive or negative, depending on the sign of the external field). Dealing with order parameters allows us to avoid managing an extensive number of variables \sigma_i, which is practically impossible and, even more importantly, not strictly necessary. Now, an explicit expression for the free energy in terms of \langle m \rangle can be obtained by carrying out the summations in the equation for A(\beta) and taking the thermodynamic limit N \to \infty, as A(\beta) = \ln 2 + \ln\cosh [ \beta (J \langle m \rangle + h) ] - \frac{\beta J}{2}\langle m \rangle^2. In order to impose the thermodynamical principles, i.e.
energy minimization and entropy maximization, we need to find the extrema of this expression with respect to \langle m \rangle, requesting \partial_{\langle m(\beta) \rangle}A(\beta)=0. The resulting expression is called the \emph{self-consistency} and it reads as \partial_{\langle m \rangle}A(\beta)=0 \Rightarrow \langle m \rangle = \tanh[\beta (J \langle m \rangle + h)]. This expression returns the average behavior of a spin in a magnetic field. In order to see that a phase transition between the paramagnetic and ferromagnetic states actually exists, we can fix h=0 (and pose J=1 for simplicity) and expand the r.h.s. of the equation above to get \langle m \rangle \sim \pm\sqrt{\beta J - 1}. Thus, while the noise level is higher than one (\beta < \beta_c \equiv 1, or T > T_c \equiv 1), the only solution is \langle m \rangle = 0, while, as soon as the noise is lowered below its critical threshold \beta_c, two nonzero branches of solutions appear for the magnetization and the system becomes a ferromagnet. The branch effectively chosen by the system usually depends on the sign of the external field or on boundary fluctuations: \langle m \rangle > 0 for h > 0 and, vice versa, \langle m \rangle < 0 for h < 0. Clearly, the lowest energy minima correspond to the two configurations with all spins aligned, either upwards (\sigma_i = +1, \forall i) or downwards (\sigma_i = -1, \forall i), these configurations being symmetric under the spin-flip \sigma_i \to -\sigma_i. Therefore, the thermodynamics of the Curie-Weiss model is solved: energy minimization tends to align the spins (as the lowest-energy states are the two ordered ones), while entropy maximization tends to randomize them (as the higher the entropy, the more disordered the states, with half the spins up and half down): the interplay between the two principles is driven by the level of noise introduced in the system, and this is in turn ruled by the tunable parameter $\beta \equiv 1/T$, as coded in the definition of the free energy.
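The self-consistency equation has no closed-form solution for general \beta, but its fixed point is easily found by iteration. A minimal sketch (my own code; the solver name, the starting point m0 and the iteration cap are arbitrary choices):

```python
import numpy as np

def solve_self_consistency(beta, J=1.0, h=0.0, m0=0.5, tol=1e-10):
    """Find a fixed point of m = tanh(beta*(J*m + h)) by direct iteration."""
    m = m0
    for _ in range(10000):
        m_new = np.tanh(beta * (J * m + h))
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

# At h=0: above the critical noise level (beta < 1) the only solution is m = 0;
# below it (beta > 1) the iteration lands on a nonzero branch:
print(solve_self_consistency(beta=0.5))  # ~0 (paramagnet)
print(solve_self_consistency(beta=2.0))  # nonzero branch (ferromagnet)
```

Starting from -m0 instead of +m0 selects the mirror branch, consistently with the spin-flip symmetry discussed above.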
A crucial bridge between condensed matter and neural networks can now be sighted: one could think of each spin as a basic neuron, retaining only its ability to spike, such that \sigma_i=+1 and \sigma_i=-1 represent firing and quiescence, respectively, and associate to each equilibrium configuration of this spin system a {\em stored pattern} of information. The reward is that, in this way, the spontaneous (i.e. thermodynamical) tendency of the network to relax onto free-energy minima can be related to the spontaneous retrieval of the stored pattern, such that the cognitive capability emerges as a natural consequence of physical principles: we will deepen this point along the whole paper.
The Curie-Weiss model and the (saturable) operational amplifier
Let us now tackle the problem from another perspective and highlight a structural/mathematical similarity in the world of electronics: the plan is to compare self-consistencies in statistical mechanics and transfer functions in electronics, so as to reach a unified description of these systems. Before proceeding, we recall a few basic concepts. The operational amplifier, namely a solid-state integrated circuit (transistor), uses feedback regulation to set its functions: there are two signal inputs (one received positive (+) and one inverted, thus received negative (-)), two voltage supplies (+V_{sat}, -V_{sat}), where ``sat'' stands for saturable, and an output ($V_{out}$).
An ideal amplifier is the linear approximation of the saturable one (technically, the voltage at the input collectors is taken constant so that no current flows inside the transistor, and the Kirchhoff rules apply straightforwardly): if R_{in} stands for the input resistance while R_f represents the feedback resistance, i_+ = i_- = 0, and assuming R_{in} = 1\Omega without loss of generality, as only the ratio R_{f}/R_{in} matters, the following transfer function is achieved: V_{out} = G V_{in} = (1 + R_{f}) V_{in}, where G = 1 + R_f is called the "gain"; therefore, as far as 0 < R_f < \infty (thus retroaction is present), the device is amplifying. Let us emphasize the deep structural analogies with the Curie-Weiss response to a magnetic field $h$. First, we notice that all these systems saturate: whatever the magnitude of the external field for the CW model, once all the spins become aligned, increasing h further will no longer produce a change in the system; analogously, once the op-amp has reached V_{out} = V_{sat}, larger values of V_{in} will not result in further amplification. Also notice that both the self-consistency and the transfer function are input-output relations (the input being the external field in the former and the input voltage in the latter, the output being the magnetization in the former and the output voltage in the latter) and, once fixed $\beta=1$ for simplicity, expanding \langle m \rangle = \tanh( J\langle m \rangle + h) \sim (1+J)h, we can compare the two expressions term by term: V_{out} = (1 + R_f) V_{in}, \langle m \rangle = (1 + J) h. We see that R_f plays as J and, consistently, if R_f is absent the retroaction is lost in the op-amp and gain is no longer possible; analogously, if J = 0, the spins do not mutually interact and no feedback is allowed to drive the phase transition. Such a bridge is robust, as operational amplifiers perform more than signal amplification; for instance, they can perform as latches, namely analog-to-digital converters.
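The two input-output curves can be put side by side numerically. The sketch below is purely illustrative (the function names and the values R_f=10, V_sat=1 are my own arbitrary choices): a saturable amplifier modeled as a clipped linear gain, against the CW magnetization response obtained by iterating the self-consistency at \beta = J = 1.

```python
import numpy as np

def opamp_output(v_in, R_f=10.0, v_sat=1.0):
    """Saturable non-inverting amplifier: V_out = (1 + R_f) V_in, clipped at +/- V_sat."""
    return np.clip((1.0 + R_f) * v_in, -v_sat, v_sat)

def cw_magnetization(h, beta=1.0, J=1.0):
    """CW response to a field h, iterating m = tanh(beta*(J*m + h)) to its fixed point."""
    m = 0.0
    for _ in range(5000):
        m = np.tanh(beta * (J * m + h))
    return m

# Both input-output relations are linear for small inputs and saturate for large ones:
for x in (0.01, 0.1, 1.0, 10.0):
    print(x, opamp_output(x), cw_magnetization(x))
```

In both columns the response grows with the input and then flattens, which is the saturation analogy made quantitative.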
Latches can be achieved within the Curie-Weiss theory simply by working in the low-noise limit, where the sigmoidal function (\langle m \rangle versus h) of the self-consistency approaches a Heaviside step; hence, while analogically varying the external field h, as soon as it crosses the value h=0 (say, from negative values), the magnetization jumps discontinuously from \langle m \rangle = -1 to \langle m \rangle = +1, hence coding for a digitalization of the input.
The route from Curie-Weiss to Hopfield
Actually, the Curie-Weiss Hamiltonian would encode a rather poor model of a neural network, as it would account for only two stored patterns, corresponding to the two possible minima (which in turn would represent pathological network behavior, with all the neurons simultaneously completely firing or completely silenced); moreover, these ordered patterns, seen as information chains, have the lowest possible entropy and, by the Shannon-McMillan Theorem, in the large-N limit they will never be observed. This criticism can easily be overcome thanks to the Mattis gauge, namely a redefinition of the spins via \sigma_i \to \xi_i^1 \sigma_i, where \xi_i^1 = \pm 1 are random entries extracted with equal probability and kept fixed in the network (in statistical mechanics these are called quenched variables, to stress that they do not contribute to thermalization, a terminology reminiscent of metallurgy). Fixing J=1 for simplicity, the Mattis Hamiltonian reads as H_N^{Mattis}(\sigma|\xi) = -\frac{1}{2N} \sum_{i,j}^{N,N}\xi_i^1\xi_j^1\sigma_i \sigma_j - h\sum_i^N\xi_i^1 \sigma_i. The Mattis magnetization is defined as $m_1 = \frac{1}{N}\sum_i \xi_i^1 \sigma_i$. To inspect its lowest-energy minima, we perform a comparison with the CW model: in terms of the (standard) magnetization, the Curie-Weiss model reads as $H_N^{CW} \sim -(N/2)m^2 - hm$ and, analogously, we can write $H_N^{Mattis}(\sigma|\xi)$ in terms of the Mattis magnetization as $H_N^{Mattis} \sim -(N/2)m_1^2 - h m_1$.
It is then evident that, in the low-noise limit (namely where collective properties may emerge), as the minimum of the free energy is achieved in the Curie-Weiss model for $\langle m \rangle \to \pm 1$, the same holds in the Mattis model for $\langle m_1 \rangle \to \pm 1$. However, this implies that the spins now tend to align parallel (or antiparallel) to the vector $\xi^1$; hence if the latter is, say, $\xi^1 = (+1,-1,-1,-1,+1,+1)$ in a model with $N = 6$, the equilibrium configurations of the network will be $\sigma = (+1,-1,-1,-1,+1,+1)$ and $\sigma = (-1,+1,+1,+1,-1,-1)$, the latter due to the gauge symmetry $\sigma_i \to -\sigma_i$ enjoyed by the Hamiltonian. Thus, the network relaxes autonomously to a state where some of its neurons are firing while others are quiescent, according to the stored pattern $\xi^1$. Note that, as the entries of the vector $\xi$ are chosen randomly as $\pm 1$ with equal probability, the retrieved free-energy minimum now corresponds to a spin configuration which is also the most entropic by the Shannon-McMillan argument, thus both the most likely and the most difficult to handle (as its information can no longer be compressed). Two remarks are now in order. On the one side, according to the self-consistency equation, $\langle m \rangle$ versus $h$ displays the typical graded/sigmoidal response of a charging neuron, and one would be tempted to call the spins $\sigma$ "neurons". On the other side, it is definitely inconvenient to build a network of $N$ spins/neurons, with $N$ meant to diverge (i.e., $N \to \infty$), in order to handle just one stored pattern of information.
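This can be checked by brute force on a small instance (a sketch under my own conventions, not code from the original work): enumerating all $2^N$ configurations of a tiny Mattis network at $h = 0$ confirms that the only ground states are $+\xi^1$ and its gauge-flipped twin $-\xi^1$:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N = 6
xi = rng.choice([-1, 1], size=N)          # one quenched random pattern

def H_mattis(sigma, xi):
    """Mattis Hamiltonian at h = 0, written via the Mattis magnetization."""
    m1 = np.dot(xi, sigma) / len(sigma)
    return -0.5 * len(sigma) * m1 ** 2

configs = [np.array(c) for c in product([-1, 1], repeat=N)]
energies = [H_mattis(s, xi) for s in configs]
ground = [s for s, e in zip(configs, energies) if np.isclose(e, min(energies))]

for s in ground:
    assert np.array_equal(s, xi) or np.array_equal(s, -xi)
print(len(ground))   # 2: the pattern and its gauge-flipped mirror
```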
Along the theoretical-physics route, overcoming this limitation is quite natural (and provides the first derivation of the Hebbian prescription in this paper): if we want a network able to cope with $P$ patterns, the starting Hamiltonian should simply sum over these $P$ previously stored patterns, namely $H_N(\sigma|\xi) = -\frac{1}{2N} \sum_{i,j=1}^{N,N} \left( \sum_{\mu=1}^P \xi_i^{\mu} \xi_j^{\mu} \right) \sigma_i \sigma_j$, where we neglect the external field ($h = 0$) for simplicity. This Hamiltonian constitutes the Hopfield model, namely the harmonic oscillator of neural networks, whose coupling matrix is called the Hebb matrix, as it encodes the Hebb prescription for neural organization.
MULTITASKING NEURAL NETWORKS
Neural networks rapidly became the "harmonic oscillators" of parallel processing: neurons, thought of as "binary nodes" (Ising spins) of a network, behave collectively to retrieve information, the latter being spread over the synapses, thought of as the interconnections among nodes. However, the common intuition of parallel processing is not only the underlying parallel work performed by neurons to retrieve, say, an image in a book, but rather, for instance, to retrieve the image and, while keeping the book securely in hand, to notice beyond its edges the room where we are reading, still maintaining available resources for further retrievals as a safety mechanism. Standard Hopfield networks are not able to accomplish this kind of parallel processing. Indeed, spurious states, conveying corrupted information, cannot be regarded as the simultaneous retrieval of several patterns; they are rather an unwanted outcome, yielding a glassy blackout. This limitation of Hopfield networks can be understood by focusing on the deep connection (in both the direct and the inverse approach) with restricted Boltzmann machines (RBMs).
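Before moving on, the retrieval mechanics of the Hopfield Hamiltonian above can be sketched numerically (the sizes, seed and zero-noise asynchronous dynamics below are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 200, 5
xi = rng.choice([-1, 1], size=(P, N))     # P random binary patterns

J = (xi.T @ xi) / N                       # Hebb matrix
np.fill_diagonal(J, 0.0)                  # no self-interaction

sigma = xi[0].copy()                      # cue: pattern 0 ...
flip = rng.choice(N, size=N // 10, replace=False)
sigma[flip] *= -1                         # ... with 10% of its bits corrupted

for _ in range(5):                        # zero-noise asynchronous sweeps
    for i in rng.permutation(N):
        sigma[i] = 1 if J[i] @ sigma >= 0 else -1

print(xi[0] @ sigma / N)                  # Mattis overlap, close to 1
```

At this storage load ($P/N = 0.025$, well below the critical capacity $\alpha_c \simeq 0.14$) the corrupted cue is driven back to the stored pattern.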
The connection works as follows: given a machine with its set of visible units (the neurons) and hidden units (accounting for the training data), one finds, upon marginalization over the latter, that the thermodynamic evolution of the visible layer is equivalent to that of a Hopfield network. It follows that an underlying fully connected bipartite RBM necessarily leads to bit strings of length equal to the system size, whose retrieval requires an orchestrated arrangement of the whole set of spins. This implies that no resources are left for further tasks, which is, from a biological point of view, too strong a simplification. One of our research lines in neural networks is to relax this constraint so as to extend standard neural networks toward multitasking capabilities, whose interest goes far beyond the artificial-intelligence framework. In particular, starting from an RBM, we dilute its links in such a way that nodes in the external layer are connected to only a fraction of the nodes in the inner layer. As we show, this leads to an associative network which, for non-extreme dilutions, is still embedded in a fully connected topology, but the bit strings encoding information are sparse (i.e., their entries are +1, −1 as well as 0) (fig. 1, right); for suitable degrees of dilution, this ultimately makes the network able to retrieve in parallel without falling into spurious states. In summary, the structural equivalence between associative networks and RBMs allows significant developments, both practical and theoretical. For instance, one can simulate the dynamics of these networks by updating N+P spins and storing only NP synapses, instead of updating N spins and storing ∼ N^2 synapses. Moreover, the equivalence suggests that traditional associative networks, where the whole set of neurons needs to be properly arranged in order to achieve retrieval, are not optimal. We overcome this constraint by diluting the links of the RBM, which translates into partially blank patterns.
Interestingly, the resulting associative network is not only still able to perform retrieval, but it can actually retrieve several patterns simultaneously, without falling into spurious states. This is an important step toward real autonomous parallel processing and may find applications not only in artificial intelligence, but also in biological contexts.
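The RBM-Hopfield equivalence invoked above can be verified in one line of linear algebra (a sketch under my own assumptions: Gaussian hidden units, weight columns proportional to the patterns): the effective visible-visible coupling obtained by marginalizing the hidden layer coincides with the Hebb matrix, while the RBM stores only $NP$ weights against the $\sim N^2$ Hebbian synapses:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 100, 4
xi = rng.choice([-1, 1], size=(P, N))

W = xi.T / np.sqrt(N)        # bipartite RBM weights: one column per pattern
J_from_rbm = W @ W.T         # coupling after marginalizing Gaussian hiddens
J_hebb = (xi.T @ xi) / N     # direct Hebbian prescription

assert np.allclose(J_from_rbm, J_hebb)
print(W.size, N * N)         # 400 RBM weights vs 10000 Hebbian couplings
```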

HIERARCHICAL NEURAL NETWORKS
In the last decade, extensive research on complexity in networks has evidenced the widespread presence of modular structures and the importance of quasi-independent communities in many research areas, such as neuroscience, biochemistry and genetics, just to cite a few.
In particular, the modular, hierarchical architecture of cortical neural networks has nowadays been analyzed in depth, yet the beauty revealed by this investigation is not captured by the statistical mechanics of neural networks, neither by standard ones (i.e., performing serial processing) nor by multitasking ones (i.e., performing parallel processing). In fact, these models are intrinsically mean-field, thus lacking a proper definition of metric distance among neurons. Hierarchical structures, such as the Dyson hierarchical model (DHM), have been proposed in the past as (relatively) simple models for ferromagnetic transitions beyond the mean-field scenario, and are currently experiencing renewed interest for understanding glass transitions in finite dimension.
Therefore, times are finally ripe for approaching neural networks embedded in a non-mean-field architecture, and this letter summarizes our
findings on associative neural networks where the Hebbian kernel is coupled with the Dyson topology.
First, we study the DHM by mixing the Amit-Gutfreund-Sompolinsky ansatz approach (to select candidate retrievable states) with the interpolation technique (to check their thermodynamic stability), and we show that, as soon as ergodicity is broken, beyond the
ferromagnetic/pure state (largely discussed in the past), a number of metastable states suddenly
appear and become stable in the thermodynamic limit. The emergence of such states implies the breakdown
of classical (mean-field) self-averaging and stems from the weak ties connecting distant neurons, which, in the thermodynamic limit, effectively get split into detached communities. As a result, if the latter are initialized with opposite magnetizations, they remain
stable.
This is a crucial point because, once the Hebbian prescription is implemented to account for multiple-pattern storage, it allows proving that the system not only executes extensive serial processing à la Hopfield, but its communities also perform autonomously, hence making parallel
retrieval feasible too. We stress that this feature is essentially due to the notion of metric the system is endowed with, unlike the parallel retrieval performed by the mean-field multitasking networks (which we developed in the past three years), which requires blank pattern entries.
Therefore, the hierarchical neural network is able to perform both as a serial processor and as a parallel processor.
We corroborate this scenario by merging results from statistical mechanics, graph theory, the signal-to-noise technique
and extensive numerical simulations.
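To make the non-mean-field structure concrete, here is a sketch of how Dyson-like hierarchical couplings can be built (the decay form $4^{-\sigma l}$ over levels $l$ and all parameter values are my own illustrative conventions, not the exact couplings of the paper): spins in the same small block interact strongly, while spins in distant communities are tied only weakly, which is precisely the mechanism allowing communities to behave autonomously.

```python
import numpy as np

def dyson_couplings(k, sigma_exp):
    """Hierarchical couplings for 2**k spins: each pair of spins gains a
    contribution 4**(-sigma_exp * l) from every level-l block containing both."""
    N = 2 ** k
    J = np.zeros((N, N))
    for l in range(1, k + 1):
        block = np.kron(np.eye(N // 2 ** l), np.ones((2 ** l, 2 ** l)))
        J += 4.0 ** (-sigma_exp * l) * block
    np.fill_diagonal(J, 0.0)
    return J

J = dyson_couplings(3, 0.75)
# nearby leaves are tied far more strongly than spins in distant communities
print(J[0, 1], J[0, 7])
```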
Panels a and b: Magnetizations obtained via MC simulations of the Dyson Hierarchical Model for different sizes (main figure) and
comparison with theoretical curves obtained by our research group (insets).
Notice that the spontaneous switch between serial and parallel state in panel b is a finitesize effect.
Lower panels: Mattis magnetizations obtained via MC simulations of the HHM (main figures) and comparison with theoretical predictions (insets) for p = 2 (panel c) and p = 4 (panel d).
The noise level in analytical results was rescaled to collapse on the one numerically estimated via Binder cumulants.
SMALL WORLDS
One of the most striking features of these networks is their natural ability to cluster and behave as small worlds. In the plots on the left we show the clustering of the networks by varying the tunable parameters, network size versus network dilution: in the "red regions" (that occupy a huge area of the whole plot) small-world behavior is preserved.
It should be noticed that no rewiring is performed to obtain this feature; rather, a topological phase transition, splitting the network behavior from Erdős-Rényi-like to Watts-Strogatz-like, comes into play.
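To illustrate what distinguishes the two regimes (a self-contained sketch with my own parameter choices, not the actual networks of the plots), one can compare the mean clustering coefficient of a ring lattice, the un-rewired backbone of a Watts-Strogatz graph, with that of an Erdős-Rényi graph of equal mean degree:

```python
import numpy as np

def clustering(A):
    """Mean local clustering coefficient of an adjacency matrix A."""
    cs = []
    for i in range(len(A)):
        nbrs = np.flatnonzero(A[i])
        k = len(nbrs)
        if k < 2:
            continue
        links = A[np.ix_(nbrs, nbrs)].sum() / 2   # edges among i's neighbours
        cs.append(links / (k * (k - 1) / 2))
    return np.mean(cs)

N, k = 200, 6
ring = np.zeros((N, N), dtype=int)                # ring lattice, degree k
for i in range(N):
    for d in range(1, k // 2 + 1):
        ring[i, (i + d) % N] = ring[(i + d) % N, i] = 1

rng = np.random.default_rng(3)                    # Erdos-Renyi, same mean degree
er = np.triu((rng.random((N, N)) < k / N).astype(int), 1)
er = er + er.T

print(clustering(ring), clustering(er))           # ring lattice >> random graph
```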
The plots on the right show the network's fringes. (This point is to be explored in more depth.)
The figure on the left shows the field acting on a neuron in two different types of diluted neural networks: on the left we study the Sompolinsky dilution (roughly speaking, a Hopfield network on an Erdős-Rényi random graph), while on the right we show the fields of our multitasking associative networks (roughly speaking, a Hopfield network with diluted patterns, obtained by marginalizing a diluted Boltzmann machine). Here d stands for "dilution", such that for d = 0 both extensions recover the Hopfield model, while as d increases they start to differ.
The difference between the field distributions in the two networks is remarkable: in the former, dilution roughly smooths the fields and drives them to zero, while in the latter, dilution (above a critical value) broadens the fields in a non-trivial way and allows for parallel processing.
As a result of our dilution, different Mattis overlaps become strictly positive without the system collapsing into a spurious state (typical of the underlying glassiness the network is built with). These two plots show the Mattis overlaps (different colours mimicking the retrieval of different patterns) at various levels of dilution.
Again, for d = 0 the system behaves as a standard neural network (the Hopfield model), while for d → 1 the system becomes a paramagnet; in between, it behaves as a multitasking associative network, that is, the system is able to simultaneously handle multiple patterns, thus accomplishing parallel retrieval.
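A minimal simulation of this parallel-retrieval regime (the sizes, the dilution level and the initialization below are my own illustrative choices) stores two diluted patterns and checks that zero-noise dynamics sustains two positive Mattis overlaps at once:

```python
import numpy as np

rng = np.random.default_rng(4)
N, P, d = 400, 2, 0.5
xi = rng.choice([-1, 1], size=(P, N))
for mu in range(P):
    xi[mu, rng.random(N) < d] = 0         # blank a fraction d of each pattern

J = (xi.T @ xi) / N                       # Hebb matrix on diluted patterns
np.fill_diagonal(J, 0.0)

# start from pattern 0, filling its blanks with pattern 1 where available
sigma = np.where(xi[0] != 0, xi[0], np.where(xi[1] != 0, xi[1], 1))

for _ in range(10):                       # zero-noise asynchronous sweeps
    for i in rng.permutation(N):
        sigma[i] = 1 if J[i] @ sigma >= 0 else -1

m = xi @ sigma / N
print(m)          # both Mattis overlaps stay well above zero
```

The dominant pattern keeps an overlap near $1 - d$, while the second is retrieved on the sites where the first is blank, the hallmark of parallel retrieval.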
The picture on the left shows other examples of parallel retrieval, this time in the presence of fast noise: the clouds represent noise-affected trajectories resulting from simulations performed with Glauber dynamics (and not deterministic steepest descent or similar minimization procedures).