Implementing a Hierarchical Bayesian Visual Cortex Model on Multi-core Processors

Pavan Yalamanchili, Sumod Mohan, and Tarek Taha
Electrical and Computer Engineering, Clemson University, Clemson, SC 29634
[email protected], [email protected], [email protected]

ABSTRACT
Recent scientific studies of the brain have led to new models of information processing. Some of these models are based on hierarchical Bayesian networks and have several benefits over traditional neural networks. Large scale implementations of brain models have the potential for strong inference capabilities, and hierarchical Bayesian models lend themselves well to large scales. Multi-core processors are currently the standard architectural approach for high performance computing platforms. In this paper we examine the parallelization and optimization of Dean's hierarchical Bayesian model on two multi-core architectures: the nine-core IBM Cell and the quad-core Intel Xeon processors. This is the first study of the parallelization of this class of models onto multi-core processors. We evaluate two parallelization strategies and examine the performance of the model as it is scaled. Our results indicate that the Cell processor can provide speedups of up to 108 times over a serial implementation of the model for the network sizes examined. The quad-core Intel Xeon processor provided a speedup of 36 times for the same model configuration.

Categories and Subject Descriptors
C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors) – Multiple-instruction-stream, multiple-data-stream processors (MIMD); C.1.3 [Processor Architectures]: Other Architecture Styles – Heterogeneous (hybrid) systems; I.5.5 [Pattern Recognition]: Implementation – Special architectures

General Terms
Algorithms, Management.

Keywords
Performance, Measurement, Experimentation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACMSE’09, March 19–21, 2009, Clemson, SC, USA. Copyright 2009 ACM 1-58113-000-0/00/0004…$5.00.

1. INTRODUCTION
Recent scientific studies of the primate brain have led to new neuromorphic computational models of the information processing taking place in the cortex. These cortical models provide insights into the workings of the brain and agree well with experimental results. They differ significantly from traditional neural networks in that they are generally at a higher level of abstraction and incorporate several new biological details about information processing in the cortex. Some of these newer cortical models [4][7] are based on hierarchical Bayesian networks and incorporate several of the recently suggested properties of the neocortex [5]. These include a hierarchical structure of uniform computational elements, invariant representation and retrieval of patterns, auto-associative recall, and sequence prediction through both feed-forward and feedback inference between layers in the hierarchy.

Neuro-anatomists have identified that collections of about 80 to 100 neurons form regular patterns of local cells running perpendicular to the cortical plane [9]. These collections of neurons are called mini-columns. Mountcastle [14] states that the basic unit of cortical operation is the mini-column and that a collection of mini-columns is grouped into a cortical column. He also states that the mini-columns within a cortical column are bound together by a common set of inputs and short-range horizontal connections.

Hierarchical Bayesian network based cortical models have a significant computational advantage over traditional neural networks. Each node in the former models a cortical mini-column or a cortical column, while in the latter each node models only a single neuron. Thus, to model a large collection of neurons, a hierarchical Bayesian network model requires far fewer nodes than a traditional neural network model. Additionally, the number of node-to-node connections is greatly reduced, since anatomical evidence suggests that most neural connections in the cortex are within a column rather than between columns [9].

The brain utilizes a large collection of slow neurons operating in parallel to achieve very powerful cognitive capabilities. There has been strong interest amongst researchers in developing large parallel implementations of cortical models in order to realize stronger inference capabilities than current computing algorithms [3]. A large domain of applications would benefit from stronger inference capabilities, including speech recognition, computer vision, textual and image content recognition, robotic control, and making sense of massive quantities of data. Several research groups are examining large scale implementations of neuron based models [1][13] and cortical column based models [10][16]. Such large scale implementations require high performance resources to run the models at reasonable speeds. IBM is utilizing a 32,768 processor Blue Gene system to simulate a spiking network based model [1], while EPFL and IBM are utilizing an 8,192 processor Blue Gene system to simulate a sub-neuron based cortical model [13].

In this paper, we examine optimizations and parallel implementations of a recent hierarchical Bayesian network cortical model, Dean's hierarchical Bayesian model, on multi-core processors. Lansner and Johansson [10] have shown that mouse sized cortical models developed on a cluster of commodity computers are computationally bound rather than communication bound. Therefore, the acceleration of these models on multi-core architectures can provide significant performance gains for large scale implementations. With the limited scaling in processor clock frequencies, multi-core processors have become the standard industrial approach to improving processor performance. However, we are not aware of any studies examining the implementation or performance of hierarchical Bayesian cortical models on multi-core processors.

In this study we examine the parallelization of Dean's hierarchical Bayesian model onto two multi-core processors: the IBM/Sony/Toshiba Cell Broadband Engine [8] and the quad-core 2.33 GHz Intel Xeon E5345 processor. The Cell processor has attracted significant attention recently because of its large number of cores and the significant performance benefits it can provide. The fastest supercomputer at present, the IBM Roadrunner supercomputer [15] installed at Los Alamos National Lab, utilizes 12,240 Cell processors and 6,912 AMD Opteron processors. The PetaVision project announced at Los Alamos National Lab in June 2008 is utilizing the Roadrunner supercomputer to model “1 billion visual neurons and trillions of synapses” [2]. Details of the project are not publicly available yet.

In this work we examine two approaches to parallelizing Dean's hierarchical Bayesian model. The model and its related Bayesian network libraries were implemented in C. These codes were then optimized to use multiple cores and the vector processing units on the cores. Several scaled versions of the model were developed to examine the impact of network scaling. Our results indicate that optimized parallel implementations of the model can provide significant speedups on multi-core architectures. The Cell processor provided a speedup of 108 times over a serial implementation of the model for the largest network size examined. The quad-core Intel Xeon processor provided a speedup of 36 times for the same model configuration.

Section 2 of this paper examines Dean's model and describes the Cell processor. Section 3 discusses approaches to parallelize the model. Section 4 details the experimental setup, while Section 5 provides the results of the work. Section 6 concludes the paper.

2. BACKGROUND

2.1 Dean model
Cortical models based on hierarchical Bayesian networks include Hierarchical Temporal Memories (HTM) being developed by Numenta, Inc. [7] and a hierarchical Bayesian network model being developed at Brown University by Thomas Dean [4]. These models have multiple layers of processing nodes, arranged hierarchically in an inverted tree structure (as shown in Figure 1). The computations at all the nodes are identical, thus preserving the uniform computation characteristic of the neocortex.

2.1.1 Overview
Thomas Dean proposed a new hierarchical Bayesian model [4] of the visual cortex. This model consists of a layered collection of nodes as shown in Figure 1. Input data is presented to the bottom layer of nodes (generally after some preprocessing) and a final inference based on this input is produced by the top layer node. All the nodes in the network carry out the same set of computations and can be considered to be the functional equivalent of cortical columns. The model is trained in a supervised manner by presenting a set of training data to the bottom layer of nodes multiple times.

[Figure 1: a three-layer network (Layer 1, Layer 2, Layer 3) whose nodes are grouped into overlapping subnets; each node is labeled with the subnets (1, 2, 3) it belongs to.]

Figure 1: A simple example of Thomas Dean’s hierarchical Bayesian network model. This example can be divided into three subnets as shown. The nodes are numbered with the subnets they belong to.

In the example implementation of the model presented by Dean in [4], the model performs handwritten character recognition on 28×28 pixel images. This example network consists of three layers of nodes connected in a pyramidal form, with the bottom layer consisting of 49 nodes (in a 7×7 layout), the middle layer of 9 nodes (in a 3×3 layout), and the top layer of 1 node. Each layer 2 node has nine layer 1 children (arranged in a 3×3 layout), forming a pyramidal collection. The field of view of each layer 2 node overlaps with its neighbors’ by an edge that is one node thick. Thus, each layer 1 node can have up to four layer 2 parents. The input image is preprocessed by a preprocessing layer before being fed to the layer 1 nodes. Each layer 1 node has a 4×4 patch of pixels corresponding to it. In the preprocessing layer, the 4×4 patch of pixels is transformed into a mixture of Gaussians and this mixture is matched against 16 predefined classes of mixtures of Gaussians. Thus each 4×4 pixel region is represented by a number between 1 and 16, and this number is fed to the corresponding layer 1 node by the preprocessing layer.
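As an illustration of this quantization step, the listing below maps a 4×4 patch to one of 16 class indices. It is our simplification, not Dean's code: a nearest-prototype match stands in for the actual mixture-of-Gaussians comparison, and the names (patch_to_class, prototypes) are hypothetical.

#include <float.h>

/* Hypothetical prototypes: one 16-element feature vector per class,
 * learned offline; a stand-in for the 16 predefined Gaussian mixtures. */
extern float prototypes[16][16];

/* Return a class index in 1..16 for a 4x4 patch (row-major, 16 pixels). */
int patch_to_class(const float patch[16]) {
    int best = 0;
    float best_d = FLT_MAX;
    for (int c = 0; c < 16; c++) {
        float d = 0.0f;
        for (int i = 0; i < 16; i++) {
            float diff = patch[i] - prototypes[c][i];
            d += diff * diff;   /* squared Euclidean distance */
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    return best + 1;   /* classes are numbered 1 to 16 */
}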

2.1.2 Processing
The network can be divided into a set of modular component subnets (as shown in Figure 1). Each subnet has two layers of nodes. A subnet can be defined as a node, its parents, and all the children of those parents in the same level as the original node [4].

The function of each subnet is to produce an abstract representation of the features seen by the lower level subnets feeding into it. Neighboring subnets have overlaps in their receptive fields to enable the network to more robustly recognize invariant features. Hence a node can belong to multiple subnets (as shown in Figure 1). The subnets are identified during the training process and only the largest subnets (those that are not a subset of another subnet) are utilized. For any given input image, the network is processed through multiple bottom-to-top-to-bottom passes. In each pass, all the subnets for a certain layer are processed before moving to the next layer.

[Figure 2: (a) a seven-node subnet with nodes numbered 1 through 7; (b) the equivalent junction tree, with cliques {1,3}, {1,4}, {1,2,5}, {2,6}, and {2,7}.]

Figure 2: A simple example of a junction tree derived from one of the lower level subnets shown in Figure 1. Part (a) shows the subnet with the nodes numbered 1 through 7. Part (b) shows the junction tree equivalent to this subnet. Each clique in part (b) is labeled with the corresponding nodes from the subnet that are used to build the clique.

In order to process a subnet, it is first converted to its equivalent junction-tree representation. The junction tree consists of a set of nodes called cliques, where each clique is a collection of nodes in the original subnet. The connection between two cliques is called a separator and is labeled by the nodes the two cliques have in common. Figure 2 shows a simple subnet and its equivalent junction tree decomposition. Although this junction tree has only 5 cliques (with two that can be evaluated in parallel), the junction trees in the networks examined have up to 25 cliques (with up to 21 that can be evaluated in parallel). The subnet to junction tree mapping is carried out during training and does not have to be redone during inference (as the mapping is reused). The junction tree for a subnet is evaluated in a single bottom-to-top and then top-to-bottom pass. Lauritzen and Spiegelhalter's junction-tree algorithm [11] is utilized for exact inference in the tree. The operations consist primarily of element-by-element multi-dimensional matrix additions, multiplications, and divisions.
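To make the dominant operations concrete, below is a minimal sketch (our illustration, not the authors' library code) of the two kinds of kernels involved: an element-by-element multiply of two potential tables stored as flat arrays, and a dimension reduction that sums a table along one dimension. The function names are hypothetical.

#include <stddef.h>

/* Element-by-element multiply of two potential tables: out[i] = a[i] * b[i].
 * A 5-D table with dimension widths up to 16 is just a flat array whose
 * length n is the product of its widths. */
void table_mul(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}

/* Dimension reduction: sum a table of shape [outer][width][inner] along
 * its middle dimension, producing a table of shape [outer][inner]. */
void table_sum_dim(float *out, const float *in,
                   size_t outer, size_t width, size_t inner) {
    for (size_t o = 0; o < outer; o++)
        for (size_t i = 0; i < inner; i++) {
            float s = 0.0f;
            for (size_t w = 0; w < width; w++)
                s += in[(o * width + w) * inner + i];
            out[o * inner + i] = s;
        }
}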

2.2 Cell processor
The Cell Broadband Engine developed by IBM, Sony, and Toshiba [8] has attracted significant attention recently for its high performance capabilities. It is a multi-core processor that heavily exploits vector parallelism. The current generation of the IBM Cell processor consists of nine processing cores: a PowerPC based Power Processor Unit (PPU) and eight independent Synergistic Processing Units (SPUs). The processor operates at 3.2 GHz. The PPU is primarily used for administrative functions, while the SPUs provide high performance through vector operations. Each SPU can issue two instructions per cycle and operates on 128-bit vectors, allowing up to eight single-precision operations per cycle.

One of the key design philosophies of the Cell processor is to transfer complexity from the hardware to the software. For example, unlike most high performance processors, the processing cores in the Cell utilize in-order execution with no branch prediction. This simplified hardware design enables low power consumption and small core areas while still maintaining high performance, and makes it easy to scale the number of processing cores in future versions of the chip. The trade-off in hardware complexity means that several software level optimizations are necessary to achieve high performance on the SPUs; these are generally not needed on traditional processors such as the Intel Xeon. The optimizations include vectorization, reducing the frequency of branch instructions through loop unrolling and function in-lining, and explicit memory optimizations (note that the Intel Xeon also includes vector instructions). Instead of a processor controlled data cache, each SPU contains a programmer controlled local store used to explicitly optimize memory operations. This enables several memory level optimizations not possible on most high performance processors. Since high compute-to-I/O ratios are needed to achieve the full potential of the Cell processor, the programmer controlled local stores are especially important.

There are several computing platforms available that utilize the Cell processor. The Sony Playstation 3 is an economical gaming platform that currently ships with one Cell processor (containing six available SPUs). It is possible to install Linux on these machines and build a large computing cluster with them. The IBM BladeCenter QS20 is a more powerful platform containing two Cell processors per blade, each with eight available SPUs.
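A common way to realize this explicit local store management is double-buffered DMA: while the SPU computes on one buffer, the DMA engine fills the other. The sketch below is our illustration, not the authors' code; the chunk size and function names are assumptions, but the mfc_get and tag-status calls are the standard interface from the Cell SDK's spu_mfcio.h.

#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA transfer (hypothetical size) */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

void process(char *data, int n);   /* compute kernel (placeholder) */

void stream(unsigned long long ea, int nchunks) {
    int cur = 0;
    /* Prime the pipeline: start the first transfer (DMA tag = buffer id). */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);
    for (int i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        /* Kick off the next transfer before touching the current buffer. */
        if (i + 1 < nchunks)
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        /* Wait only on the current buffer's tag, then compute on it. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process((char *)buf[cur], CHUNK);
        cur = nxt;
    }
}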

3. OPTIMIZATIONS

3.1 Network parallelization
As described in Section 2.1, the nodes in a network in Dean's model can be grouped into subnets, and the model has to be evaluated subnet by subnet rather than node by node. One approach to parallelization is to evaluate each subnet on a separate processing core. In this approach, all the subnets within a layer can be evaluated in parallel before moving to the subnets in the next layer. A second approach is to consider the multiple cliques that each subnet is composed of and evaluate each clique on a separate processing core. This latter approach yields a higher level of parallelism, as more cliques than subnets can be evaluated in parallel, although dependencies between the cliques may limit the number that can be evaluated in parallel at any given level within a junction tree. In this study we evaluated both approaches and found that, for the networks examined, the clique based approach had better utilization of the available processing cores. In both approaches, the order in which the subnets or cliques are evaluated is predetermined and does not vary with the network inputs; a sketch of the clique based scheme is shown below.
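The following is a minimal pthreads sketch of this clique based scheme, assuming a schedule (computed once during training) that groups cliques into dependency levels; the names, structures, and the static round-robin split are our assumptions, not the authors' implementation.

#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4   /* an SPU version on the PS3 would use 6 workers */

typedef struct clique clique_t;   /* potential tables, separators, ... */

typedef struct {
    clique_t **cliques[32];  /* cliques[l] = independent cliques at level l */
    int        counts[32];   /* number of cliques at level l */
    int        nlevels;
} schedule_t;

static schedule_t        sched;   /* built once during training */
static pthread_barrier_t barrier;

void eval_clique(clique_t *c);    /* junction-tree message computation */

static void *worker(void *arg) {
    long tid = (long)arg;
    for (int l = 0; l < sched.nlevels; l++) {
        /* Static round-robin split of the independent cliques at level l. */
        for (int i = (int)tid; i < sched.counts[l]; i += NTHREADS)
            eval_clique(sched.cliques[l][i]);
        pthread_barrier_wait(&barrier);   /* wait until level l is done */
    }
    return NULL;
}

void run_pass(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
}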

3.2 Vectorization
There are at least two approaches to vectorization for this model: vectorizing the operations for a single image, or vectorizing to evaluate multiple images simultaneously. In the former case, the matrix operations would have to be vectorized, as a large portion of the junction tree evaluation consists of multi-dimensional matrix operations. In the networks examined, these matrices had up to five dimensions, with each dimension being up to 16 elements wide. The matrix operations included element-by-element matrix multiplies and divides. There were also matrix dimension reduction operations, which are essentially summations along a given dimension of a matrix. Not all of these operations can be vectorized efficiently, particularly as the matrix dimensions have small widths (that are not always multiples of the vectorization factor). Since the model evaluates any input data in precisely the same way, multiple inputs can instead be evaluated in parallel through vectorization. With a vectorization width of four, there are four versions of each matrix (one for each image), and the same set of operations is carried out on all four versions. In this case vectorization can be applied to almost 100% of the operations. Therefore, this vectorization approach was utilized in this study.
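A minimal sketch of this multi-image layout (our illustration, with an assumed data layout): each table element holds four floats, one per image, so an element-by-element multiply maps directly onto a single SIMD multiply. It is shown here with SSE intrinsics; an SPU version would use spu_mul on vector float in the same way.

#include <stddef.h>
#include <xmmintrin.h>

/* a, b, out: n table elements x 4 images each, 16-byte aligned,
 * stored image-interleaved: element i occupies floats [4i .. 4i+3]. */
void table_mul4(float *out, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        __m128 va = _mm_load_ps(a + 4 * i);   /* 4 images of element i */
        __m128 vb = _mm_load_ps(b + 4 * i);
        _mm_store_ps(out + 4 * i, _mm_mul_ps(va, vb));
    }
}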

4. EXPERIMENTAL SETUP
Four networks with varying input image sizes were developed to examine the acceleration of Dean's model on the multi-core platforms. As shown in Table 1, all the networks had three layers. The smallest network was identical to the example presented by Dean in [4]. Dean utilized 10,000 images from Yann LeCun's MNIST database [12] for training and testing of his model: one thousand versions of each of the ten handwritten numerals 0 through 9. The images are 28×28 pixels in dimension (Figure 3 shows some samples from this database). The smallest network was trained with the 28×28 images in the database, while the larger networks were trained with zero padded versions of these images.

Table 1: Configuration of the networks examined.

  Network input size     28×28   36×36   40×40   52×52
  Nodes    Total            59      98     110     186
           Layer 3           1       1       1       1
           Layer 2           9      16       9      16
           Layer 1          49      81     100     169
  Subnets  Total             6      11       6      11
           Layer 3           1       1       1       1
           Layer 2           1       1       1       1
           Layer 1           4       9       4       9

Dean’s implementation of the model was in Matlab and utilized Kevin Murphy’s Bayesian Network Toolbox (also written in Matlab). We developed a C implementation of the model along with the relevant parts of the Bayesian Network Toolbox. Although C++ Bayesian network libraries are available, they would have needed significant modifications to be used in our study: parallelizing the code to run on multiple cores, vectorizing it using Cell SPU SIMD intrinsics, and handling the DMA data transfers needed for the explicit memory management of the SPU local stores. The model and the relevant Bayesian network libraries were optimized separately for the Cell and the Intel architectures. In the Cell version, the PPU assigned each SPU a set of subnets or cliques to process. The SPU code was optimized through loop unrolling, function in-lining, double buffering, and vectorization. The Intel version was parallelized using POSIX threads and utilized SSE3 vectorization.

Figure 3: Sample images from the MNIST library used for training and testing.

Three hardware platforms were utilized in this study: one Intel Xeon based and two IBM Cell based. The Intel platform was a blade on the Palmetto Cluster at Clemson University. Each blade on the system contains two quad-core Intel Xeon processors running at 2.33 GHz (model E5345), has 12 GB of DRAM, and runs the CentOS 5 operating system. The Cell platforms were a Sony Playstation 3 and an IBM QS20 cluster at Georgia Tech. The Playstation 3 has one Cell processor on which six of the eight SPUs are available for use; it was running Fedora Core 6 with IBM Cell SDK 2.1. The QS20 blade utilized has two Cell processors, each with all eight SPUs available, and also uses IBM Cell SDK 2.1. All the programs were compiled with gcc using -O3 optimizations.

5. RESULTS
For each platform, we examined the performance of the model using the maximum number of cores per processor and per blade. Thus the following parallel configurations were tested:

1. Intel blade with 4 and 8 threads
2. Playstation 3 with 6 SPU threads
3. QS20 with 6, 8, and 16 SPU threads

A six SPU thread implementation on the QS20 was examined to compare it against the Playstation 3 performance. A serial version of the program was developed and tested on the Sony Playstation 3’s Cell PPU. Figure 4 presents the speedup of each of the parallel implementations over the serial PPU implementation. Several observations can be made from this figure:

1. The parallel implementations provide a significant performance gain over the serial implementation. This is mainly due to the use of multiple cores and vectorization on each core.
2. The larger versions of the models examined provide higher speedups. This is primarily because the larger models have more nodes and thus greater parallelism.
3. For each of the platforms, more cores provide higher speedups.
4. The Cell processor on the QS20, with 8 SPU cores available, outperforms the Intel Xeon processor (with 4 cores) by about 3 times.
5. The Playstation 3, with 6 available SPU cores, outperforms the Intel Xeon processor (with 4 cores) by about 2.4 times.
6. Utilizing both Cell processors on the QS20 (16 threads) provides only a 22% performance improvement over one Cell processor (8 threads). We have not investigated the exact cause of this low performance gain. It is possible that the model sizes examined do not have enough parallelism to support 16 cores.

[Figure 4: bar chart of speedup over the serial PPU implementation for each network size (59, 98, 110, and 186 nodes, plus the average); legend: Xeon (4), Xeon (8), PS3 (6), QS20 (6), QS20 (8), QS20 (16).]

Figure 4: Speedup of parallel implementations of the model over a serial implementation on the Cell PPU. The number in parentheses next to each platform in the legend is the number of threads utilized.

Figure 5 compares the two parallelization approaches examined: clique based and subnet based. All the subnets in a layer can be evaluated in parallel. The networks with 59 and 110 nodes had fewer subnets in level 1 than the 98 and 186 node networks; thus the former set of networks provided lower speedups than the latter when parallelized by subnets. For all the networks, more cliques than subnets could be evaluated in parallel (since each subnet could be decomposed into multiple cliques). Thus the clique based parallelization approach provided higher speedups for all the network sizes evaluated.

[Figure 5: bar chart of speedup over the PPU on the Playstation 3 for the 59, 98, 110, and 186 node networks, comparing clique based and subnet based parallelization.]

Figure 5: Comparison of speedup on the Playstation 3 for clique based parallelization vs. subnet based parallelization.

Figure 6 shows a breakdown of the overall runtime for the 186 node network on the Playstation 3. It shows that 75% of the time is spent on computations. Data transfers through DMA are overlapped with computations; however, as shown in the figure, 25% of the runtime is spent on DMA transfers that could not be overlapped with computation. Preprocessing the input data required only 11% of the total runtime.

[Figure 6: pie chart of the runtime breakdown: Get Evidence 48%, DMA 25%, PreProcess 11%, Compute Lambda 7%, Compute Pi 6%, Query 3%.]

Figure 6: Runtime breakdown for the 186 node network on the Playstation 3.

Programming the Cell can be very time consuming. To get the optimized version of the code running on the Playstation 3, several code optimizations had to be incorporated. Table 2 lists these optimizations and shows the model runtime after each (a hypothetical sketch of the loop unrolling and branch reduction step follows the table). All the times in the table are for processing four images. Before vectorization, the four images had to be evaluated separately; after vectorization, they are evaluated simultaneously.

Table 2: Runtimes after different optimizations.

  Optimization                          Run Time (ms)
  Serial code, PPU                          734
  Basic parallel, SPUs                      101.7
  In-lining functions                        75.74
  Data type conversion                       40.67
  Vectorization                              12.1
  Loop unrolling, branch reduction           10.66
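The following before/after pair is a hypothetical illustration (not the authors' code) of the final optimization row: loop unrolling plus branch reduction, which matters on the SPUs because they have no hardware branch prediction.

/* Before: one data-dependent branch per element plus a loop branch
 * per iteration. */
void clamp_ref(float *x, int n) {
    for (int i = 0; i < n; i++)
        if (x[i] < 0.0f)
            x[i] = 0.0f;
}

/* After: 4x unrolled and branch-free; the compare-select idiom maps onto
 * the SPU's select instruction (and min/max instructions under SSE on the
 * Xeon).  In this sketch, n must be a multiple of 4. */
void clamp_unrolled(float *x, int n) {
    for (int i = 0; i < n; i += 4) {
        x[i]     = x[i]     > 0.0f ? x[i]     : 0.0f;
        x[i + 1] = x[i + 1] > 0.0f ? x[i + 1] : 0.0f;
        x[i + 2] = x[i + 2] > 0.0f ? x[i + 2] : 0.0f;
        x[i + 3] = x[i + 3] > 0.0f ? x[i + 3] : 0.0f;
    }
}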

6. CONCLUSION
There is significant interest in the research community in developing large scale, high performance implementations of cortical models. These have the potential to provide significantly stronger information processing capabilities than current computing algorithms. Hierarchical Bayesian cortical models are a relatively new class of cortical models that make it easier to develop larger scale models than traditional neural networks. Collections of neurons form cortical columns, which are thought to be the basic units of computation in the brain. Since hierarchical Bayesian cortical models are based on cortical columns as opposed to individual neurons, they have a significant computational advantage over neuron based models: fewer nodes need to be modeled, along with fewer node-to-node connections. Given that large scale cortical models can offer strong information processing capabilities, hierarchical Bayesian models are an attractive candidate for scaling. At present, multi-core processors are the standard approach for achieving high performance. Thus, a study of the parallelization of hierarchical Bayesian models and their implementation on multi-core architectures is important.


In this paper we examined the parallelization and implementation of Dean's hierarchical Bayesian model on two multi-core processors: the IBM Cell processor and the Intel Xeon processor. The former has nine cores, eight of which provide high performance, while the latter has four cores. Both the model and its relevant libraries were implemented in C, parallelized, and vectorized. This is the first study of the acceleration of this class of models on multi-core architectures.


We show that the model can be parallelized based on the subnets that it contains or based on the cliques contained in the junction trees into which the subnets can be decomposed. Our results indicate that the latter approach provides slightly higher speedups, as it exposes more parallelism. We also examine the vectorization of the model and show that it is easier to vectorize by processing multiple inputs simultaneously.


Our results indicate that the model parallelizes well and that multi-core architectures can provide significant performance improvements for it. In particular, the Cell processor provided a speedup of 108 times over a serial implementation of the model for the largest network size examined, while the quad-core Intel Xeon processor provided a speedup of 36 times for the same model configuration. The results of this work can be applied to other multi-core processors. As future work, we plan to examine the performance of much larger models on clusters of multi-core processors. We also plan to examine the parallelization of other hierarchical Bayesian models, including Dean's newer model [6], which incorporates temporal invariance in addition to spatial invariance.


7. ACKNOWLEDGEMENTS
This work is supported by an NSF CAREER Award and grants from the US Air Force.

8. REFERENCES
[1] Ananthanarayanan, R. and Modha, D., “Anatomy of a cortical simulator,” Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007.
[2] Barker, K. J., Davis, K., Hoisie, A., Kerbyson, D. J., Lang, M., Pakin, S., and Sancho, J. C., “Entering the petaflop era: the architecture and performance of Roadrunner,” Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.
[3] Dean, T., “A Computational Model of the Cerebral Cortex,” Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), 2005.
[4] Dean, T., “Learning invariant features using inertial priors,” Annals of Mathematics and Artificial Intelligence, 47(3-4):223–250, Aug. 2006.
[5] Dean, T., “Scalable inference in hierarchical generative models,” Ninth International Symposium on Artificial Intelligence and Mathematics, 2006.
[6] Dean, T., Carroll, G., and Washington, R., “On the prospects for building a working model of the visual cortex,” Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI-07), 2007.
[7] George, D. and Hawkins, J., “A Hierarchical Bayesian Model of Invariant Pattern Recognition in the Visual Cortex,” International Joint Conference on Neural Networks, 2005.
[8] Gschwind, M., Hofstee, H. P., Flachs, B., Hopkins, M., Watanabe, Y., and Yamazaki, T., “Synergistic Processing in Cell’s Multicore Architecture,” IEEE Micro, 26(2):10–24, Mar. 2006.
[9] Hawkins, J. and Blakeslee, S., On Intelligence, Times Books, Henry Holt and Company, New York, NY, Sept. 2004.
[10] Johansson, C. and Lansner, A., “Towards Cortex Sized Artificial Neural Systems,” Neural Networks, 20(1):48–61, 2007.
[11] Lauritzen, S. L. and Spiegelhalter, D. J., “Local computations with probabilities on graphical structures and their application to expert systems,” Journal of the Royal Statistical Society, 50(2):157–194, 1988.
[12] LeCun, Y. and Cortes, C., “The MNIST Database of handwritten images,” http://yann.lecun.com/exdb/mnist/
[13] Markram, H., “The Blue Brain Project,” Nature Reviews Neuroscience, 7:153–160, 2006.
[14] Mountcastle, V. B., “Introduction to the special issue on computation in cortical columns,” Cerebral Cortex, 13(1):2–4, 2003.
[15] Rickman, J., “Roadrunner supercomputer puts research at a new scale,” http://www.lanl.gov/news/index.php/fuseaction/home.story/story_id/13602
[16] Wu, Q., Mukre, P., Linderman, R., Renz, T., Burns, D., Moore, M., and Qiu, Q., “Performance Optimization for Pattern Recognition using Associative Neural Memory,” Proceedings of the 2008 IEEE International Conference on Multimedia & Expo, 2008.
