An Application Development Platform for Neuromorphic Computing


Mark E. Dean, Jason Chan, Christopher Daffron, Adam Disney, John Reynolds, Garrett Rose, James S. Plank, J. Douglas Birdwell (University of Tennessee); Catherine D. Schuman (Oak Ridge National Laboratory)


Appearing in IJCNN: International Joint Conference on Neural Networks. Part of the IEEE World Congress on Computational Intelligence, Vancouver, Canada, July, 2016. This is the PDF of the accepted submission of the paper. The final published version will be hosted by the IEEE, and there will be a pointer to that here along with the DOI.

© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

The online home for this paper is: http://neuromorphic.eecs.utk.edu/publications/2016-07-01-an-application-developmentplatform-for-neuromorphic-computing/

If that URL is too long, simply go to http://neuromorphic.eecs.utk.edu and follow the publications link.

Citation Information (BibTex).

@INPROCEEDINGS{dcd:16:aaf,
  author = "M. E. Dean and J. Chan and C. Daffron and A. Disney and J. Reynolds and G. S. Rose and J. S. Plank and J. D. Birdwell and C. D. Schuman",
  title = "An Application Development Platform for Neuromorphic Computing",
  booktitle = "International Joint Conference on Neural Networks",
  where = "http://www.wcci2016.org/",
  year = "2016",
  month = "July",
  address = "Vancouver"
}

An Application Development Platform for Neuromorphic Computing Mark E. Dean, Jason Chan, Christopher Daffron, Adam Disney, John Reynolds, Garrett Rose, James S. Plank, J. Douglas Birdwell Electrical Engineering and Computer Science University of Tennessee Knoxville, TN

Abstract—Dynamic Adaptive Neural Network Arrays (DANNAs) are neuromorphic computing systems developed as a hardware-based approach to the implementation of neural networks. They feature highly adaptive and programmable structural elements, which model artificial neural networks with spiking behavior. We design them to solve problems using evolutionary optimization. In this paper, we highlight the current hardware and software implementations of DANNA, including their features, functionalities and performance. We then describe the development of an Application Development Platform (ADP) to support efficient application implementation and testing of DANNA-based solutions. We conclude with future directions.

Keywords—Neuromorphic Computing; Bio-inspired Computer Architectures; SDK; Neural Networks

I. INTRODUCTION

The field of neuromorphic computing has grown and holds promise as a computing structure for applications in controls, classification and anomaly detection [1][2]. To date, many research groups have tried to implement neuromorphic computing systems using different forms of hardware and software abstractions. There are a handful of hardware platforms hosting neural networks, but these platforms have limited tools for application development and primarily support hand-tooled static networks. One key to the adoption of neuromorphic computing is an architecture that supports a viable application development environment.

Neuroscience-Inspired Dynamic Architecture (NIDA) was developed in 2014 as a software abstraction for neuromorphic computing [2]. By utilizing simple neuron and synapse structures to create large 3D array configurations, NIDA demonstrated its potential as a neuromorphic computing architecture suitable for solving complex classification, controls and real-time data-intensive applications.

Dynamic Adaptive Neural Network Arrays (DANNAs) are 2D representations of the NIDA neuromorphic computing architecture that are suitable for direct hardware implementation. A DANNA consists of an interconnected array of programmable neuromorphic elements that can be configured into a network of neurons or synapses. Previous work with DANNA networks has showcased their functionality on a Field-Programmable Gate Array (FPGA) [3].

Catherine D. Schuman Computational Data Analytics Oak Ridge National Laboratory Oak Ridge, TN

An application development platform (ADP) is highly desirable to support application development, testing and deployment of DANNA neuromorphic systems. An effective standalone ADP should be low-cost, easy to use and scalable, and it should provide an efficient and effective environment for rapid application development and testing. At present, one of the major challenges to adoption of neuromorphic computing is the limited availability of ADPs for the target architectures.

The DANNA implementation used for the platform described in this paper improves upon the original implementation while retaining the same architectural features. It implements an increased number of connections per element, an enhanced real-time monitoring capability, and a more efficient communications interface. Issues in the original design that limited implementation of large array structures have been resolved in the new platform. Testing the new design confirms its behavior and has led to other enhancements. While the platform has been confirmed to work, more enhancements are being developed to support even larger array structures, additional element functions, a more efficient host interface, and a higher performance simulation environment.

The following sections of this paper detail the research work completed and targeted to support the development and testing of an ADP for a DANNA system. We address issues with the original DANNA system design and provide details of implementation enhancements to support an effective application development environment. Finally, we explore possible future enhancements to the ADP.

II. RELATED WORK

There have been many implementations of neuromorphic computing in software, through modeling and simulation, and in hardware, through analog and digital constructs [4]. Each research group has adopted its own approach to neuromorphic computing; some groups utilize traditional computing architectures while others have invented unique hardware devices and structures. Some notable hardware examples include Neurogrid by Stanford University [8], SpiNNaker by the University of Manchester [7], the TrueNorth Computer by IBM [10], and BrainScaleS by the Human Brain Project [9].

Using traditional von Neumann computing systems to simulate neural networks has been the most widely used approach to date for neuromorphic computing applications. However, large-scale von Neumann neural network simulations, such as IBM’s cat-sized cortical simulation [5], are limited by processor and memory bandwidth, a constraint amplified by the large amount of memory required to support the simulations [6]. In an attempt to increase speeds, some research groups have organized von Neumann system components in a way to enhance performance and scale of the neural network application.

SpiNNaker is a biologically inspired, massively parallel computing engine designed to model and simulate spiking neural networks. SpiNNaker uses an array of chip multiprocessors (CMP), each containing 18 ARM9 processor cores and a packet router. It leverages a customized memory structure and interconnect to further enhance its ability to simulate neural network structures.

Neurogrid, by Stanford University, is a neuromorphic multichip system for simulating large-scale neural models in real-time. Its custom Neurocore chips use analog and digital circuits to model complex neuron structures and synapse connections. Neurogrid uses specialized software, a programmable logic device (CPLD), and a USB interface to map and transfer neural networks onto the hardware.

TrueNorth, developed by IBM Research, is a brain-inspired neuro-synaptic processor chip. It is composed of 4096 neurosynaptic cores tiled in a 2-D array structure, containing an aggregate of 1 million neurons and 256 million synapses. Each core is composed of tightly coupled neurons for computation, synapses for memory, and axons and dendrites for communication. The processor is driven by a parallel, event-driven architecture that incorporates highly parameterized and programmable neuron structures and a connections matrix to implement neural network systems.

The Human Brain Project (HBP) by the European Commission conducted several projects in an effort to develop its own neuromorphic computing platform. BrainScaleS, an HBP project, uses specialized wafer scale integrated VLSI chips, each chip containing 4×10^7 analog synapses and up to 180,000 neurons. To support configurability and monitoring of the system, BrainScaleS leverages a dedicated communication network through the use of application specific integrated circuits (ASICs) and FPGAs. Testing shows that the system is capable of accelerated emulation of spiking neural networks.

These projects, and others, show the promise of implementing practical neuromorphic systems, but there are still challenges. Implementations using “hard-coded” neurosynaptic structures affix their neural network components at specific locations in hardware. This fixed placement of neural components can limit the flexibility and dynamic learning of a neuromorphic computing system, restricting adaptability to changing operating parameters and/or inputs. Furthermore, many implementations referenced are limited by the use of components found in von Neumann based systems such as RAM, buses, processors, and sequential instruction processing. This can have undesirable consequences such as data flow bottlenecks due to bandwidth constraints leading to lower clock rates. Finally, one of the biggest challenges with most of the neuromorphic computer systems under development is their lack of an application development environment that provides a low-cost platform, easy-to-use interface, a simulation environment, and/or a means to monitor, evaluate and dynamically adapt a network configuration.

The challenges of current neuromorphic computing platforms have inspired our development of a dynamic, adaptable neuromorphic computing architecture, hardware implementation, and application development platform to support rapid implementation and testing of neuromorphic computing solutions. The primary goal of DANNA and the design of its ADP is to provide an effective but simple neuromorphic computing environment that will support the efficient implementation, testing and deployment of neuromorphic computing solutions.

III. A NEUROMORPHIC COMPUTING STRUCTURE

Because most neuromorphic computing initiatives have sought to emulate living neural processes in sufficient detail to support neuroscience research, they have implemented highly parameterized neuromorphic computing structures featuring very complex neuron models and very large synaptic interconnect structures. In our view, this complexity limits performance and reduces efficiencies in training or adaptability of the network. Also, highly parameterized systems are difficult to program, simulate and monitor in real-time, making application development and testing a more complex process. Finally, most existing systems allow connectivity of every neuron to every other neuron in the system, which can affect chip area utilization and/or performance. Either chip area devoted to the interconnect grows with the square of the number of neurons, dominating the area of a chip level implementation and limiting system level scalability, or crossbar switching is used, limiting the performance of the system.

One goal of the neuromorphic computing architecture used in our research, the Neuroscience Inspired Dynamic Architecture (NIDA), has been to leverage a simpler model of neural-synaptic elements and support programmable neural network structures that enable scalable and efficient computation. A second goal is to enable not only dynamic, but also adaptive, neural network structures. NIDA’s simpler representation of neurons, synapses and their communications with each other has proven to be sufficient to support executable models of dynamic behavior and complex applications [2].

The structure and simplicity of this model enables an efficient digital component representation to be created and implemented in a single programmable neuromorphic element that can be configured to represent a neuron, a synapse or a fan-out/fan-in function. This neuromorphic element can then be replicated across a chip to enable an adaptive and programmable array, or matrix, of elements. This array of programmable elements creates a Dynamic Adaptive Neural Network Array (DANNA), enabling the creation of any neural network sized to fit within the capacity provided by the array. This provides one of the first physical implementations of a programmable adaptive and dynamic array of neuromorphic elements not using traditional digital processing mechanisms (i.e. fetch-and-execute microprocessors, DSPs, FPUs, memory systems, caches, etc.).

Details of the basic structure of a DANNA are provided by Dean, Schuman and Birdwell [3]. Each neuromorphic element in the array can be programmed as:

• A neuron with a programmable threshold and an “accumulate and fire” charge function.
• A synapse with a programmable weight, distance, and refractory period, as well as a long-term potentiation (LTP) and long-term depression (LTD) function for dynamic adaptation of its weight value.
• A fan-in/fan-out element to expand or increase the connectivity of elements beyond their nearest neighbors.

Each element is connected to its 16 nearest neighbor elements as depicted in Fig. 1, an enhancement of the previous design, which supported only 8 connections. Each connection can be defined as an input or an output on the element. Note that each connection is implemented using uni-directional ports for inputs and outputs, allowing the connections to be used as feedback to support LTP/LTD functions when programmed as a synapse. With this connection matrix and the programmability of an element, we have demonstrated the ability to configure complex neural networks across a moderate size array of elements (75x75). We are presently working on adapting the evolutionary optimization algorithms of NIDA [1] to the DANNA 2D system structure, to support rapid development of neural networks for specific applications as well as the integration of real-time dynamic adaptability.

Fig. 1. Structure of an element with its hierarchical connections. The dark gray element connects to all of the light gray elements. Every element in the array follows the red element’s connection scheme.

During each network cycle, an element’s input ports are sampled in sequence, acquiring the incoming fire conditions. If the element is programmed as a neuron, the input fire information (weights) is sent to the accumulator for charge processing and the resulting charge state is compared with the neuron’s programmable threshold. If the resulting charge state is equal to or greater than the threshold, the neuron fires its output and resets its accumulator to a bias value. If the element is programmed as a synapse, its programmable “weight” value is initially stored in the accumulator. A single input port is enabled and that input’s fire-state is entered into its synapse distance register (implemented as a programmable output FIFO). Once the programmed number of network clock cycles has completed, the synapse distance register sends the stored fire state and its weight value to its output ports. The weight value stored in the accumulator will increase (potentiate) or decrease (depress) depending on the impact the fire had on the fire-state of the neuron connected to its output. For example, if the firing of Synapse A causes its output neuron to fire, its weight value is potentiated. If the corresponding output neuron fires at the same time Synapse A is firing, Synapse A’s weight value is depressed.

The input port sampling is randomized on each network clock cycle through the use of a Global Input Select (GIS) addressing function that is initialized at the beginning of each network clock cycle by a pseudo-random number generator (PRNG). The randomization prevents preferential selection of an input by the evolutionary optimization process and possible exclusion of the effects of other inputs. Without randomization of input sampling, inputs sampled early in a neural network cycle would dominate the effects on the neuron’s charge and thus its firing state. The PRNG is implemented using a linear feedback shift register (LFSR). After the initial input select address is generated, the inputs are sampled in sequential order.

IV. DANNA COMMAND AND STATUS INTERFACE

The DANNA command interface is specifically designed to configure and operate the array of elements. It parses commands received from the host computer and acts according to the operation specified by the command. It also receives data from the configured array and returns operational status, element state and output information back to the host computer. Commands supported by the programming interface include: run, halt, run-for a specified number of network cycles, reset, configure an element, send external fires to input elements, capture element state, read element status and null/no-op. Command operations sent to the DANNA array are 36-byte packets with a one-byte op-code and a 35-byte payload, structured as shown in Fig. 2.

Fig. 2. DANNA Command Structure
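As a minimal sketch of the packet layout described above (a one-byte op-code followed by a 35-byte payload), the following Python helpers build and parse a 36-byte command. The command names follow the list in the text, but the numeric op-code values, the `Op` enumeration, the helper names, and the zero-padding of short payloads are illustrative assumptions; the actual field encodings are those of Fig. 2, which are not reproduced here.

```python
from enum import IntEnum

PACKET_SIZE = 36   # bytes per command packet (from the text)
PAYLOAD_SIZE = 35  # bytes that follow the one-byte op-code (from the text)


class Op(IntEnum):
    """Command names from the text; the numeric values are hypothetical."""
    NULL = 0
    RUN = 1
    HALT = 2
    RUN_FOR = 3
    RESET = 4
    CONFIGURE = 5
    EXTERNAL_FIRE = 6
    CAPTURE = 7
    SHIFT = 8  # "read element status" in the command list


def build_command(op: Op, payload: bytes = b"") -> bytes:
    """Return a 36-byte packet: op-code byte plus zero-padded payload."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload exceeds 35 bytes")
    return bytes([op]) + payload.ljust(PAYLOAD_SIZE, b"\x00")


def parse_command(packet: bytes) -> tuple[Op, bytes]:
    """Split a 36-byte packet back into (op-code, payload)."""
    if len(packet) != PACKET_SIZE:
        raise ValueError("command packets are exactly 36 bytes")
    return Op(packet[0]), packet[1:]
```

For example, `build_command(Op.RUN_FOR, (128).to_bytes(4, "little"))` would produce a run-for packet under the assumed (hypothetical) convention that the cycle count occupies the first four payload bytes in little-endian order.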

The DANNA array returns response packets under three conditions: when the array is sending an external fire event (via an output element), when the array has been halted through the use of a halt or run-for command, or when the array receives a shift command to shift element status information off of the array. The response packets have a fixed length of 64 bytes and carry status information about the DANNA array. Fig. 3 shows the structure of a response packet. The Shift Output data are explained in the next section.

The programming interface receives and sends command and response packets, respectively, through FIFOs (one for each direction) and operates at the same clock rate as the array structure. The FIFO logic module controls the read and write operations between the programming interface and FIFOs. Commands sent to the DANNA array are stored in the “FIFO from Host” or Command FIFO. The FIFO logic module reads these commands and sends them to the programming interface. When the DANNA Array has a response packet to send to the host, it sends the packet via the programming interface to the “FIFO to Host” or Response FIFO. These FIFOs are 2048 bytes in size and provide an asynchronous means of communications between the host communications interface and the DANNA Array Structure. In our implementations to date, the DANNA Array Structure can receive a command or forward a response every global network clock cycle (0.5MHz). The host interface is 4 bytes wide and operates at 100MHz. Fig. 4 provides a high-level block diagram of a DANNA chip.
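The figures quoted above can be combined into a rough sizing check. The sketch below is only back-of-the-envelope arithmetic: it assumes the FIFOs hold whole packets with no framing overhead and that a packet moves on every network cycle, neither of which is stated in the text, and the function names are illustrative.

```python
COMMAND_BYTES = 36           # command packet size
RESPONSE_BYTES = 64          # response packet size
FIFO_BYTES = 2048            # capacity of each FIFO
NETWORK_CLOCK_HZ = 500_000   # at most one packet per network cycle
HOST_BUS_BYTES = 4           # host interface width
HOST_BUS_HZ = 100_000_000    # host interface clock


def packets_per_fifo(packet_bytes: int) -> int:
    """Whole packets that fit in one FIFO, assuming no framing overhead."""
    return FIFO_BYTES // packet_bytes


def peak_rate_bytes_per_s(packet_bytes: int) -> int:
    """Peak array-side traffic if a packet moves on every network cycle."""
    return packet_bytes * NETWORK_CLOCK_HZ


host_bandwidth = HOST_BUS_BYTES * HOST_BUS_HZ  # 400000000 bytes/s

print(packets_per_fifo(COMMAND_BYTES))        # 56
print(packets_per_fifo(RESPONSE_BYTES))       # 32
print(peak_rate_bytes_per_s(COMMAND_BYTES))   # 18000000
print(peak_rate_bytes_per_s(RESPONSE_BYTES))  # 32000000
```

Under these assumptions the host interface has more than tenfold headroom over the array's peak packet traffic in either direction.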

Fig. 3. DANNA Response Packet Structure

Fig. 4. DANNA Chip Block Diagram

V. DANNA REAL-TIME MONITORING

Real-time monitoring of a DANNA array allows cycle-accurate insight into the status of the array elements and their firing patterns. Real-time cycle-accurate monitoring of a neural network system operating at megahertz clock rates poses a significant design challenge given the number of components and parameters present in the system, but this level of system monitoring is crucial in application development and testing, and in real-time performance analysis and evolution of the neural network structure. We implement array monitoring using “parallel-capture and serial-shift-out registers” on each element to capture real-time status of key element parameters. These registers are serially connected, end-to-end up the columns of the array, allowing the information to be shifted out in a pseudo-JTAG-like fashion via response packets through the programming interface. A diagram of the monitoring structure is shown in Fig. 5.

The host interface utilizes two commands, capture and shift, to consolidate and retrieve the element status information. The capture command loads three pieces of data into each element’s capture-and-shift register: an element’s current accumulator value, the number of times an element has fired since the last capture command (implemented using a 16-bit fire-event counter), and the number of fires stored in its synapse distance register (each synapse distance register can hold 128 fire-events). The accumulator value will either represent the current neuron charge or synapse weight of the element, depending on the element type configured. Note that reading the synapse current weight value from the accumulator allows tracking of the effects of long term potentiation and depression.

Fig. 5. DANNA Array Real-Time Monitoring Structure

Once the values have been loaded into the capture-and-shift register, the shift command shifts out the concatenated values, one bit per column per shift command, to the next element above in each column of the array. At the top of each column, the bits are shifted into the programming interface, which sends the bits of data to the interface board in the Shift Output field of a response packet. Capturing and shifting data are global, so when the commands are sent, all elements capture and shift their data, even if they are not programmed. The primary benefits of this approach are: real-time cycle-accurate monitoring, no global bus required for data transfer, and N versus N^2 scaling of data transfer size (as long as the number of bit-columns fits in the Shift Output field of the response packet, e.g. 128).

While the monitoring implementation is a useful feature to get the sense of the array at a given network cycle, there are some limitations. One noticeable drawback is the amount of time it takes to get the data. Capturing and shifting data off-chip takes one network cycle for every bit-row of data. Given a 75×75 element array, it takes approximately 2400 network cycles (5 ms) to shift all of the data off-chip. In that time frame, the array can continue to operate and each element can continue to experience fire-events. While the number of fire-events per element will still be captured, the relative nature of the internal fire-events will be missed. This is mitigated by supporting capturing and shifting of data when the array is halted. Thus the array can be “stepped”, and element states can be captured after each network clock cycle. The present programming interface design allocates 16 bytes to element status information, limiting the array size per chip to 128 columns.
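The shift-out cost quoted above follows from simple arithmetic, sketched below. The 32 bits per element is not stated in the text; it is an assumption inferred from the reported 2400 cycles for 75 rows, and the function names are illustrative.

```python
NETWORK_CLOCK_HZ = 500_000   # global network clock (0.5MHz)
SHIFT_OUTPUT_BYTES = 16      # element status field of a response packet
BITS_PER_ELEMENT = 32        # assumption inferred from 2400 cycles / 75 rows


def shift_cycles(rows: int, bits_per_element: int = BITS_PER_ELEMENT) -> int:
    """One bit per column leaves the array on each shift command, so a
    column of `rows` elements needs rows * bits_per_element shifts."""
    return rows * bits_per_element


def shift_time_ms(rows: int, bits_per_element: int = BITS_PER_ELEMENT) -> float:
    """Time to empty the array at one shift per network cycle."""
    return shift_cycles(rows, bits_per_element) * 1000 / NETWORK_CLOCK_HZ


def max_columns() -> int:
    """Each response packet carries one Shift Output bit per column."""
    return SHIFT_OUTPUT_BYTES * 8


print(shift_cycles(75))   # 2400
print(shift_time_ms(75))  # 4.8
print(max_columns())      # 128
```

Because every column shifts in parallel, the cycle count depends only on the number of rows, which is the N versus N^2 scaling noted in the text.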

VI. DANNA APPLICATION DEVELOPMENT PLATFORM (ADP)

The DANNA Application Development Platform (ADP) is a system platform that allows researchers and application developers to study and build DANNA-based neuromorphic computing systems. The ADP is a self-contained system, housed in a 10in x 10in x 4in enclosure, and requires no outside drivers or software. A block diagram of the ADP digital system is shown in Fig. 6.

The DANNA ADP is composed of a host computer module (Host), a DANNA Array Module (FPGA), and a communications module (FX3). The host computer and the communications module communicate through a Universal Serial Bus (USB), while the communications module and the DANNA Array Module communicate via a 32-bit, 100MHz bidirectional synchronous interface implemented across an FPGA Mezzanine Card (FMC) interconnect.

To transfer a configuration or command operation, the host computer sends the DANNA command packet through its USB port to the communications module (FX3). The FX3 buffers the received commands and indicates to the DANNA Array Module (FPGA) that there is data to be read. A finite state machine (FSM) on the FPGA reads the available data and stores it in the Command FIFO to be read by the DANNA programming interface. Similarly, when a DANNA response packet is to be sent out, the FIFO logic module stores it in the Response FIFO, which is read by the FSM and written to the FX3 so that the host computer can read the data via its USB port.

Fig. 6. DANNA ADP Block Diagram

A. The Host Computer Module

The host computer is a 4-core ARM based single-board computer and provides an operating system environment (Linux) and user interface to the ADP. The host computer supports the transfer of command and response packets to/from the DANNA array. Besides its primary function of sending configuration and command files to, and receiving responses from, the DANNA Array Module, it ultimately will be able to run the evolutionary optimization engine to support real-time optimization of DANNA based neural networks. Several single-board computers were considered for use in the platform, with priority being given to available memory, processing power, the operating system, physical size, and cost. Some examples of computers considered for the platform included the UDOO Quad, the Wandboard Quad, the CuBox-i4, and the ODROID XU3. We chose the Wandboard Quad due to its higher memory capacity of 2 GB, its processing power with its Freescale i.MX6 Quad-Core ARM-based processor, and its relatively low price. The development platform utilizes Ubuntu 14.04 as the operating system. Other key features of the Host Computer Module include multiple USB interfaces (keyboard, mouse, communications interface), a SATA hard disk interface, an HDMI display interface, and a Gigabit Ethernet port.

B. The Communications Module

While we considered a custom design for the communications module, interfacing the Host Computer Module with the DANNA Array Module, we decided that the Cypress EZ-USB FX3 SuperSpeed Explorer Kit provided the best available solution for cost, size and function. The Cypress FX3 utilizes the Cypress EZ-USB FX3 USB 3.0 peripheral controller, which adds USB 3.0 device functionality to any external processor [11]. The FX3 controller contains an ARM9 processor and General Programmable Interface (GPIF), I2C, I2S, and UART interfaces. Through use of DMA channels and the GPIF interface, the FX3 can provide communication between a USB host and an external processing unit, which, for the ADP, is the DANNA Array Module.

C. The DANNA Array Module

The DANNA Array Modules hold the neural network arrays and are presently built using off-the-shelf Xilinx FPGA-based boards. The FPGAs are required to contain enough logic cells to support neural networks large enough for practical application. A JTAG port on the DANNA Array Module is used to establish the default DANNA structure (e.g. a matrix of unconfigured DANNA elements). The primary FPGAs used for our initial research include the Xilinx Virtex-7 XC7V690T and XC7V2000T.
For the development platform, the module hosting the FPGA is constrained by the physical dimensions of the enclosure and the FMC-based interface to the communications module. We chose the Virtex-7 FPGA-FMC Modules by Hitech Global for the platform. They contain a Xilinx Virtex-7 FPGA (690T or 2000T), three FMC connectors, and a DDR3 memory socket capable of supporting up to 8 GB. Their compact size and basic functionality allow the use of a smaller enclosure without sacrificing performance or function.

VII. TESTING AND OPTIMIZATIONS

A. USB Overhead and Cycle Accurate Command Sequencing

During testing we discovered notable latency between USB packet transfers, between 50 and 60 µs. To mitigate this latency, we first enable hosts to batch input firing events, with inter-fire timing specifications, into single USB transfers. We also allow hosts to receive aggregated output packets. Second, we have programmed the FX3 firmware to optimize the use of its DMA resources. Specifically, we configured the DMA channel for host-to-FX3 transfers to have two consumer sockets, while the channel for FX3-to-host transfers has one.

The paired sockets allow the FX3 to have additional buffering and accommodate a constant stream of input events from the host. To implement this functionality, we created a second set of DMA flags and reconfigured the GPIF interface to reflect the new socket. Fig. 7 shows a block diagram of this structure. Our testing with the second socket shows significant performance improvement during data transfers, with sub-microsecond delay for switching between buffers.
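The benefit of batching described above can be illustrated with a small sketch that packs 36-byte commands into larger USB transfers and estimates the inter-transfer latency paid per command. The 512-byte transfer size, the 55 µs midpoint latency, and the function names are assumptions chosen for illustration; the sketch does not model the inter-fire timing fields or the actual host software.

```python
COMMAND_BYTES = 36  # size of one DANNA command packet


def batch_commands(commands: list[bytes], transfer_bytes: int = 512) -> list[bytes]:
    """Group whole command packets into transfers of at most transfer_bytes."""
    per_transfer = transfer_bytes // COMMAND_BYTES
    if per_transfer == 0:
        raise ValueError("transfer size is smaller than one command")
    return [
        b"".join(commands[i:i + per_transfer])
        for i in range(0, len(commands), per_transfer)
    ]


def amortized_gap_us(n_commands: int, transfer_bytes: int = 512,
                     gap_us: float = 55.0) -> float:
    """Average inter-transfer latency paid per command when batching."""
    per_transfer = transfer_bytes // COMMAND_BYTES
    transfers = -(-n_commands // per_transfer)  # ceiling division
    return transfers * gap_us / n_commands


commands = [bytes(COMMAND_BYTES) for _ in range(100)]
batches = batch_commands(commands)
print(len(batches))           # 8
print(amortized_gap_us(100))  # 4.4
```

With these assumed numbers, 100 commands travel in 8 transfers instead of 100, so the latency attributable to each command drops from about 55 µs to about 4.4 µs.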

Finally, to avoid synapse fires that may be stored in the synapse distance registers from previous runs of the network, a “Run-For” command of 128 cycles must be included in the configuration command sequence of the DANNA array to clock out any residual fires in the synapse distance registers. This Run-For command must occur after a Reset Command and before the configuration commands to establish the target neural network.

It is important to utilize the available memory on the FX3 to establish the right balance between the number of DMA buffers per transfer type and their size. As a result of our testing, we use 170 512-byte buffers for host-to-array command transfers and 100 512-byte buffers for array-to-host response transfers.

VIII. AVAILABLE DANNA CONFIGURATIONS

Fig. 7. Communications Module to DANNA Array Module (FX3-to-FPGA) Socket Switching Implementation.

B. Reseting of DANNA Components To support repeatable results during application testing, it is important to guarantee the initial state of the DANNA array, the initial neural network configuration, and that the DANNA interface components are clear of any residual states or information left from a previous run of the system. This includes select registers, FIFOs, and Global Input Select addressing. To support initialization of key functions and registers to a known state, we leverage a USB command to create a “soft-reset” function. We leverage a function in the FX3 ﬁrmware that handles USB control transfers via USB vendor requests, allowing control of the available GPIO pins. Whenever the host application performs a USB control transfer indicating a vendor request with a speciﬁc request value, the ﬁrmware interprets the control transfer and drives the speciﬁed GPIO pin for two microseconds. The FPGA treats the signal as a reset signal and resets both Command and Response FIFOs on the FPGA. We also use this signal to reset the FSM on the FPGA to a default state and ensure that it is conﬁgured correctly during a new run. To avoid variations in fire-responses from run-to-run of a specific set of fire-commands on the same neural network configuration, the GIS address needs to be reset to the same initial seed address before each run. Given the GIS uses a LFSR to produce a random address starting point for input selection at each network clock cycle, we used flip-flops to build the LFSR shift registers (instead of the shift-register IP block provide by the Xilinx FPGA design tool, Vivado). We were than able to use the Reset Command of the DANNA array command set to reset the LFSR to an initial state.

A. The 690T Based DANNA Array Module

During development, our results showed that the maximum number of elements that the 690T FPGA can support for the current implementation is 2,209, or an array size of 47x47. Based on these results, we have created DANNA configurations of 10x10, 25x25, 32x32 and 47x47 using the 690T based DANNA array module. According to Vivado’s post-implementation reports, a 47x47 array utilizes 79% of the available look-up tables, 31% of the available flip-flops, and 44% of the available buffers on the FPGA. With a global network clock rate of 0.5 MHz, the total on-chip power consumption is approximately 1.152 watts under normal operating conditions.

B. The 2000T Based DANNA Array Module

The structure of the 2000T is vastly different from that of the 690T: the 2000T contains four super-logic regions, whereas the 690T has a single super-logic region [12]. Because of this structure, we employ a clocking tree to ensure that each DANNA element receives clocks with delays that track signals across the array. This involves partitioning the array into row-like structures, where each row structure contains clock buffers for all the DANNA array clocks as well as for the “global input selects” and Reset signals. This structure allows for a balanced distribution of global signals across the array of elements. The 2000T is able to support up to 5,625 elements in a 75x75 array. Configurations that we have tested for the 2000T based DANNA ADPs include 10x10, 25x25, 50x50 and 75x75. The 75x75 array utilizes 72% of the available look-up tables, 28% of the available flip-flops, and 20% of the available buffers on the FPGA. The 2000T implementations are limited by the available wiring channels rather than by the available FPGA logic elements. The total on-chip power consumption is approximately 2.522 watts with a global network clock of 0.5 MHz.

IX. DANNA SIMULATOR

We have implemented a software simulator to complement the FPGA implementation. The simulator implements the same DANNA model as the hardware, including the same random-number generator that defines the sampling order of the element inputs. With the simulator, we have verified that the FPGA runs as anticipated, cycle for cycle. A suite of test cases was developed and run against both the simulator and the hardware, with system state monitoring, to verify cycle accuracy. The simulator also exports the same programming interface as the hardware. While this complicates using the simulator in some respects, it means that the same systems software that drives the FPGA also drives the simulator, helping to verify the correctness of the systems software and its communication with the hardware.
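The value of a shared programming interface can be sketched as follows. This is an illustration under stated assumptions, not the DANNA systems software: the `DannaBackend` protocol, its `send` and `monitor` methods, and the `ToyBackend` stand-in model are all hypothetical. The sketch drives two backends with the same packets, compares their monitored state after every cycle, and reports the first cycle at which they diverge.

```python
from typing import Iterable, Optional, Protocol, Tuple


class DannaBackend(Protocol):
    """Hypothetical interface shared by the simulator and the hardware driver."""

    def send(self, packet: bytes) -> None: ...

    def monitor(self) -> Tuple[int, int]: ...


class ToyBackend:
    """Stand-in model: one packet advances one cycle and updates a single charge value."""

    def __init__(self, fault_at: Optional[int] = None) -> None:
        self.cycle = 0
        self.charge = 0
        self.fault_at = fault_at  # cycle at which to inject an error, if any

    def send(self, packet: bytes) -> None:
        self.cycle += 1
        self.charge = (self.charge + sum(packet)) % 256
        if self.fault_at == self.cycle:
            self.charge ^= 1  # deliberately corrupt the state for the demonstration

    def monitor(self) -> Tuple[int, int]:
        return (self.cycle, self.charge)


def first_divergence(reference: DannaBackend, candidate: DannaBackend,
                     packets: Iterable[bytes]) -> Optional[int]:
    """Return the index of the first packet after which the states differ, else None."""
    for index, packet in enumerate(packets):
        reference.send(packet)
        candidate.send(packet)
        if reference.monitor() != candidate.monitor():
            return index
    return None


packets = [bytes([i]) for i in range(10)]
print(first_divergence(ToyBackend(), ToyBackend(), packets))            # None
print(first_divergence(ToyBackend(), ToyBackend(fault_at=4), packets))  # 3
```

Because both backends satisfy the same interface, the comparison harness needs no knowledge of which one is the simulator and which one is the hardware.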

Additionally, the simulator may be employed as an alternative DANNA implementation that does not require an FPGA. In designing the simulator, the Command FIFO on the DANNA Array Module (the FPGA) was an issue. To maintain cycle-accurate deployment of commands and input fire events, the Command FIFO on the hardware must be kept full; if the FIFO drains while the array is running, cycle-accurate sequencing of input fire events is lost. To avoid the complications of simulating the FIFO, the simulator simply waits for input at every cycle. In other words, the command packets drive the clock. Thus, with the simulator, one must generate null command packets for every clock cycle that carries no other command, whereas the hardware does not require these packets for sustained operation. Fortunately, generating and processing these packets has a negligible impact on the performance of the simulator. Currently the memory footprint of a single element in the simulator is 32 bytes, so the simulator scales well: most high-end CPUs currently have at least a 15 MB L3 cache, which at 32 bytes per element allows for a simulation of approximately 500K elements in cache alone.

X. DANNA TRAINING

We use an evolutionary optimization (EO) method for training DANNA networks to complete tasks, and we have successfully applied this method to train DANNA networks for a classification task [17]. For the EO method, we begin with an initial population of DANNA networks, where each network in the population has the same inputs and outputs, and the internal neurons and synapses of each network are randomly generated. We then evaluate each of the networks using a fitness function. This evaluation usually entails simulating the network’s activity on training inputs and measuring how well the outputs of the network match the desired outputs. We make use of the DANNA simulator while training DANNA networks, although in the future we may also use the hardware implementation.

In this work, we apply DANNA networks to the cart-pole task described in [18] [19] [20]. In this task, the network takes as input periodic measurements of the position and velocity of the pole and the cart. At each interval, the network outputs a direction for a constant force value to be applied to the cart. The goal is to keep the pole from falling, while the cart stays within a certain positional range, for five simulated minutes. We trained a 30x30 DANNA network on this problem from six starting conditions (x = -1.2, 0, or 1.2 m, and θ = -0.15 or 0.15 radians), and the EO converged on a solution in 70 generations. The final array is composed of 36 neurons and 180 synapses, resulting in utilization of 24% of the elements in the array. We evaluate the generalization ability of the resulting network by testing its ability to balance the pole for five minutes from 1440 different starting positions of the cart and pole (Fig. 8). On average, the network is able to balance the pole for 238.822 seconds, or about 4 minutes. The network performs poorly when the cart is on the left of the track and the pole is leaning far left (x < 0.8 m and θ < -0.16 radians), when the pole begins very near to the center, and when the pole is leaning far to the right (θ > 0.17 radians). The poor performance at the left and right is not unexpected, since the cart needs to be accelerated in a direction that would run out of track to correct these situations. The poor performance at the center is more problematic and is a subject for further investigation. Overall, the network did well in balancing the pole from starting conditions it had not seen over the course of training.
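The general shape of an evolutionary optimization loop can be sketched as below. This is a generic illustration, not the EO implementation used for DANNA: the text describes only the random initial population and the fitness evaluation, so the truncation selection, one-point crossover, Gaussian mutation, and elitism used here are assumptions. The genome is a plain list of numbers and `toy_fitness` is a stand-in for simulating a network on its training inputs.

```python
import random
from typing import Callable, List, Tuple

Genome = List[float]
TARGET = [0.5, -0.25, 0.75, 0.0]  # hypothetical optimum for the toy problem


def toy_fitness(genome: Genome) -> float:
    """Stand-in for scoring a simulated network; higher is better."""
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))


def evolve(fitness: Callable[[Genome], float], genome_len: int, pop_size: int = 20,
           generations: int = 30, seed: int = 0) -> Tuple[Genome, List[float]]:
    """Run a small evolutionary loop and return the best genome and per-generation best fitness."""
    rng = random.Random(seed)
    # Initial population: randomly generated individuals of identical shape.
    population = [[rng.uniform(-1.0, 1.0) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]   # truncation selection (assumption)
        children = [ranked[0][:]]           # elitism: carry the best forward unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)           # one-point crossover
            child = mother[:cut] + father[cut:]
            child[rng.randrange(genome_len)] += rng.gauss(0.0, 0.1)  # mutation
            children.append(child)
        population = children
    return max(population, key=fitness), history


best, history = evolve(toy_fitness, genome_len=len(TARGET))
print(history[0], history[-1], toy_fitness(best))
```

Because the best individual is always carried into the next generation, the best fitness in this sketch never decreases from one generation to the next.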

Fig. 8. Generalization Results for DANNA Network Trained to Balance a Pole from Six Starting Conditions for 5 Minutes (darker areas indicate poorer performance).

We are encouraged by these results, and by the results of DANNA on simple classification tasks [17]. As we continue to improve the EO method for DANNA and optimize the DANNA simulator, we plan to pursue more complex applications and tasks for DANNA.

XI. FUTURE ENHANCEMENTS

We have completed a preliminary design of a VLSI based DANNA array module in the IBM-8RF 130nm silicon process. This design exercise targets a 390 mm² chip and a 54% chip utilization, resulting in a base array configuration of 75x75. We estimate that an array configuration of 1000x1000 (i.e., 1 million) elements is possible using this VLSI design in a 14nm process at 90% chip utilization. Chip-level simulation results also show a 10x improvement in network cycle performance (up to 5 MHz) and more than a 50% reduction in power consumption.

Another emerging technology to be considered is memristors. The memristor has been explored by several groups for implementing nanoscale artificial synapses within a neuromorphic system [13][14]. Our long-term plans include a memristor and analog circuit based realization of DANNA where memristors are used to implement synaptic weights with LTP/LTD, the neuron charge accumulator, the synapse distance register, and storage elements for programmable thresholds. For example, memristor-based synaptic weights will lead to an analog representation of weighted inputs that can more easily be summed and integrated with analog circuitry, thus improving performance and reducing the area utilization of the circuit. We expect that a memristor-based DANNA can be implemented with 10x more computational elements than a CMOS-based VLSI system of a similar size.

One major improvement to the communications interface will be to increase the transfer rate between the host computer and the DANNA Array Module. Despite the viability shown with the USB 2.0 implementation, there is still a significant latency associated with command and response transfers. We have begun to test USB 3.0, which shows a significant improvement in data transfer efficiency, including a marked reduction in latency between packet transfers. Using USB 3.0 will require changing the host computer board to one that can support USB 3.0. However, there are very few single board computers with USB 3.0 and the form factor, performance and functionality needed for our ADP. We will be testing the ODROID XU4 [15] in future ADP configurations. Additionally, the Cypress USB 3.0 peripheral controller can be modified to work with the enhanced protocol, needing little modification to the FSM on the FPGA.

We are also considering adding a central pattern generator (CPG) function to the DANNA element design. Central pattern generators are neuronal circuits that produce rhythmic motor patterns without inputs that carry timing information [16]. These patterns have been shown to be associated with behaviors such as walking, breathing, and crawling. By incorporating this structure in DANNA, the implementation should be able to model involuntary actions found in neural systems with fewer components.

As we start to develop larger neural network applications, we will need to improve the performance of the neural network application development tools, including the simulator and the evolutionary optimization implementation. The simulation is presently clock-based because that was the easiest way to match the hardware initially. Possible improvements for the simulator include the use of CPU vectorization and GPUs, plus a shift to a more event-based model, perhaps using optimistic simulation techniques to allow individual elements to be simulated independently [21]. Also, the use of multi-threading and parallel programming methods will significantly improve the evolutionary optimization process.
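The difference between a clock-based and an event-based simulator can be illustrated with a minimal sequential event queue. This sketch is not the DANNA simulator and is not an optimistic (rollback-based) technique; it is a plain discrete-event loop over a simplified integrate-and-fire model, and the function name, the synapse table layout, and the threshold rule are all assumptions. Work is done only when a fire event arrives, rather than at every clock cycle.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical layout: source neuron -> list of (destination, delay in cycles, weight)
SynapseTable = Dict[int, List[Tuple[int, int, int]]]


def run_event_driven(synapses: SynapseTable, threshold: int,
                     initial_events: List[Tuple[int, int, int]],
                     end_time: int) -> List[Tuple[int, int]]:
    """Process (time, neuron, weight) events in time order and return (time, neuron) fires."""
    charge: Dict[int, int] = {}
    fires: List[Tuple[int, int]] = []
    queue = list(initial_events)
    heapq.heapify(queue)
    while queue:
        time, neuron, weight = heapq.heappop(queue)
        if time > end_time:
            break
        charge[neuron] = charge.get(neuron, 0) + weight
        if charge[neuron] >= threshold:
            charge[neuron] = 0  # reset the accumulator after firing
            fires.append((time, neuron))
            for destination, delay, w in synapses.get(neuron, []):
                heapq.heappush(queue, (time + delay, destination, w))
    return fires


# A three-neuron chain: neuron 0 feeds neuron 1 after 3 cycles, which feeds neuron 2 after 5.
chain = {0: [(1, 3, 1)], 1: [(2, 5, 1)]}
print(run_event_driven(chain, threshold=1, initial_events=[(0, 0, 1)], end_time=100))
# Prints: [(0, 0), (3, 1), (8, 2)]
```

In this example the loop handles three events to cover a 100-cycle window, whereas a clock-based loop would visit every element on each of the 100 cycles.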

XII. CONCLUSION

We have built and tested a new neuromorphic computing system, DANNA, and an application development platform (ADP), to simplify the design, implementation, and testing of neuromorphic computing solutions. The key characteristics of DANNA and its ADP are: 1) a simple, flexible and programmable neural-synaptic structure, 2) a means of “compiling” neural networks using evolutionary optimization, 3) a cycle-accurate simulation environment, 4) an easy-to-use interface, 5) a low-cost development platform, and 6) a scalable power-efficient system structure. We believe DANNA represents a viable neuromorphic computing system for real-time data-centric applications.

ACKNOWLEDGMENT

Prepared in part by Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6285; managed by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 for the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

REFERENCES

[1] Catherine D. Schuman. Neuroscience-Inspired Dynamic Architectures, PhD Thesis, University of Tennessee, November 2014.
[2] C. D. Schuman, J. D. Birdwell, and M. Dean, “Neuroscience-inspired dynamic architectures,” in Biomedical Science and Engineering Center Conference (BSEC), 2014 Annual Oak Ridge National Laboratory, May 2014, pp. 1–4.
[3] M. E. Dean, C. D. Schuman, and J. D. Birdwell, “Dynamic adaptive neural network array,” in Unconventional Computation and Natural Computation, ser. Lecture Notes in Computer Science, O. H. Ibarra, L. Kari, and S. Kopechi, Eds. Springer International Publishing, 2014, pp. 129–141. http://dx.doi.org/10.1007/978-3-319-08123-6_11
[4] C. P. Daffron, “DANNA, A Neuromorphic Computing VLSI Chip,” Masters Thesis, University of Tennessee, August 2015.
[5] R. Ananthanarayanan, S. Esser, H. Simon, and D. Modha, “The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13 synapses,” in High Performance Computing Networking, Storage and Analysis, Proceedings of the Conference on, Nov 2009, pp. 1–12.
[6] S. Kunkel, T. C. Potjans, J. M. Eppler, H. E. E. Plesser, A. Morrison, and M. Diesmann, “Meeting the memory challenges of brain-scale network simulation,” Frontiers in Neuroinformatics, vol. 5, no. 35, 2012.
[7] S. Furber, D. Lester, L. Plana, J. Garside, E. Painkras, S. Temple, and A. Brown, “Overview of the SpiNNaker system architecture,” Computers, IEEE Transactions on, vol. 62, no. 12, pp. 2454–2467, Dec 2013.
[8] B.V. Benjamin, Peiran Gao, E. McQuinn, S. Choudhary, A.R. Chandrasekaran, J.M. Bussat, R. Alvarez-Icaza, J.V. Arthur, P.A. Merolla, and K. Boahen, “Neurogrid: A Mixed Analog-Digital Multichip System for Large-Scale Neural Simulations,” Proceedings of the IEEE, 102(5):699-716, May 2014.
[9] J. Schemmel, D. Bruderle, A. Grubl, M. Hock, K. Meier, and S. Millner, “A Wafer-Scale Neuromorphic Hardware System for Large-Scale Neural Modeling,” Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium, pages 1947–1950, May 2010.
[10] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha, “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014. http://www.sciencemag.org/content/345/6197/668.abstract
[11] “Cypress EZ-USB® FX3 SuperSpeed Explorer Kit,” Technical Report, 2015.
[12] “Xilinx Large FPGA Methodology Guide,” Technical Report, 2012.
[13] M. Hu, H. Li, Q. Wu, and G. Rose, “Hardware realization of bsb recall function using memristor crossbar arrays,” in Design Automation Conference, 2012 49th ACM/EDAC/IEEE, June 2012, pp. 498–503.
[14] M. Soltiz, D. Kudithipudi, C. Merkel, G. S. Rose, and R. E. Pino, “Memristor-based neural logic blocks for nonlinearly separable functions,” IEEE Trans. Comput., vol. 62, no. 8, pp. 1597–1606, Aug. 2013. http://dx.doi.org/10.1109/TC.2013.75
[15] http://www.hardkernel.com/main/products/prdt_info.php
[16] Wikipedia, https://en.wikipedia.org/wiki/Central_pattern_generator
[17] Schuman, Catherine D., Adam Disney, and John Reynolds. "Dynamic adaptive neural network arrays: a neuromorphic architecture." Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. ACM, 2015.
[18] Anderson, Charles W. "Learning to control an inverted pendulum using neural networks." Control Systems Magazine, IEEE 9.3 (1989): 31-37.
[19] Florian, Razvan V. "Correct equations for the dynamics of the cart-pole system." Center for Cognitive and Neural Studies (Coneural), Romania (2007).
[20] Schuman, Catherine D., and J. Douglas Birdwell. "Dynamic artificial neural networks with affective systems." PloS one 8.11 (2013): e80455.
[21] Fujimoto, R. Parallel Discrete Event Simulation, Communications of the ACM, 33 (10), 1990, pp. 30-53.