Bachelor Thesis: Develop an Automated System for EEG Artifacts Identification


Mälardalens Högskola, Akademin för Innovation, Design och Teknik
Niklas Sjöqvist
Bachelor's thesis in Computer Science, 2016-09-04
Examiner: Shahina Begum
Supervisor: Mobyen Uddin Ahmed

Abstract

The electroencephalogram (EEG) is one of many important clinical diagnostic tools used when a brain disorder or condition needs to be confirmed or ruled out. When measuring EEG, interference will be noticeable. It can originate from ocular movement, muscular tension or sources outside the body. These interferences have a higher amplitude than the EEG data, which makes interpretation of the EEG difficult. A manual process can be used to attempt to remove these artifacts from the EEG. This process requires experience and vast knowledge in the field; in some cases it is even referred to as an art rather than a science. To solve this, automatic methods such as FORCe and FASTER have been proposed, but their performance has so far been insufficient: FASTER has a specificity of >60%, and FORCe was able to remove 58.3% of all known artifacts during testing by its developers [14][15]. This thesis attempts to solve the problem by proposing a generalized method for detecting artifacts, under the limitation of only having access to a few labeled samples of data. An evolutionary approach was chosen and implemented as a genetic algorithm. The algorithm uses a self-maintained database of two-second-long independent components, found by applying independent component analysis to unlabeled data. The genetic algorithm is used to label the independent components as either artifact or clean EEG; the labeled samples are used as training data for this purpose. The database is then used to evaluate unseen independent components by comparing the mean distance between the signals, and the best match's label determines whether the unseen independent component describes an artifact or clean EEG. The algorithm is validated on 248 previously unseen EEG epochs, of which 80 were clean EEG and 168 were artifacts.

The suggested method achieved an average accuracy of 82.74%, an average specificity of 91.25% and an average true detection rate of 86.99%. The result is superior to that achieved by the developers of FASTER when comparing specificity and true detection rate, but not when comparing accuracy and run time. The new method shows potential but requires further optimization before it can be put to practical use in the field. The use of a genetic algorithm is debatable, but it should be considered as a possible complement to other methods, which normally have a low average specificity.


Table of Contents
1. Introduction ... 6
   1.1 Problem Definition ... 6
2. Background ... 7
   2.1 State of the Art ... 10
3. Motivation ... 12
4. Method and Materials ... 13
5. Implementation ... 14
   5.1 Data ... 15
   5.2 Pre-Processing ... 15
   5.3 Learning Algorithm ... 16
6. Result ... 17
   6.1 Comparison with State-of-the-Art Methods ... 21
7. Discussion ... 22
   7.1 Possible Improvement of the Pre-Process ... 24
8. Conclusion ... 24
   8.1 Limitations ... 25
   8.2 Societal Consideration ... 25
9. References ... 26
   9.1 Media References ... 29
10. Appendix ... 29
   10.1 Settings for the Optimization Tool in Matlab ... 29
   10.2 Fitness Function for the Optimization Tool in Matlab ... 32
   10.3 Genetic Build-up of the Highest Scoring Population Member ... 36
   10.4 Hardware and Operative System ... 37
   10.5 Manual Artifact Removal Process ... 38
   10.6 Results of True Accuracy Tests Run on the Individuals Produced by the GA ... 39


Table of Figures
Fig. 1 - 10-20 system ... 7
Fig. 2 - Brain waveform Classification ... 8
Fig. 3 - Clean EEG ... 9
Fig. 4 - EOG ... 9
Fig. 5 - EMG ... 10
Fig. 6 - Learning Algorithm ... 16
Fig. 7 - Attempt 1 ... 18
Fig. 8 - Attempt 3 ... 19
Fig. 9 - Attempt 6 ... 19
Fig. 10 - Attempt 7 ... 20
Fig. 11 - Comparison with FASTER ... 21
Fig. 12 - Graph of observed run times ... 22


Abbreviations
AI - Artificial Intelligence
BCI - Brain Computer Interface
BSS - Blind Source Separation
CT - Computed Tomography
EA - Evolutionary Algorithm
EEG - Electroencephalography
E-INFOMAX - Extended Information-Maximization
EMG - Electromyography
EOG - Electrooculography
FASTER - Fully Automated Statistical Thresholding for EEG artifact Rejection
FORCe - Fully Automated Online Artifact Removal
Hz - Hertz
IC - Independent Component
ICA - Independent Component Analysis
INFOMAX - Information-Maximization
LAMIC - Auto-Mutual Information Clustering
MRI - Magnetic Resonance Imaging
ms - Millisecond
NBT - Neural Biosig Toolbox
PC - Principal Component
PCA - Principal Component Analysis
RUNICA - The logistic INFOMAX ICA algorithm of Bell & Sejnowski (1995) with the natural gradient feature of Amari, Cichocki & Yang
s - Second
WT - Wavelet Transform


1. Introduction

An electroencephalogram (EEG) is a measurement of the electrical activity of the human brain [1]. When measuring EEG, two types of artifacts can often be observed: ocular and muscular artifacts [5]. Ocular artifacts occur when the subject performs any form of eye movement. Eye movements can be measured with a so-called electrooculography (EOG), which measures the electrical potential over the eyes [2]. Muscular artifacts appear when the subject moves body parts such as an arm or a leg. These movements can be measured with a so-called electromyography (EMG), which measures electrical potential throughout the human body. Such movements are large compared to ocular movements and therefore often appear with a higher amplitude in the EEG [3]. Ocular artifacts generally have a lower amplitude than muscular artifacts, since they stem from relatively small physical movements, but they still appear in an EEG; one reason is the eyes' close proximity to the human brain. EOG and EMG artifacts are a problem when performing an EEG because they hold a higher electrical potential than the normal activity within the human brain. This causes periods of corrupted signal data within the EEG, which in turn makes interpreting the result of an EEG troublesome; sufficient methods for removing these artifacts are therefore desired. The way EEG is cleaned from these artifacts today can be broken down into a number of steps, shown in appendix 10.5. The manual process is slow compared to the automatic methods, which are several times faster; for example, the FASTER method has an average run time of 0.3 seconds [15]. The manual method also requires experience in the field to produce an accurate result [44]. The scientific field is still missing generalized laws describing an artifact and the frequency of its occurrence. That is one of the reasons why the process is sometimes considered more of an art form than a science [45]. This is why artificial intelligence (AI) is well suited for this type of problem, where some of the fundamental laws behind the distribution and characteristics of the artifacts are unknown.

1.1 Problem Definition

The aim of this bachelor thesis is to identify and analyze the best methods for detecting artifacts created during an EEG measurement and to propose a new, improved method for doing so. This has been attempted multiple times before, but a generalized automatic method has never been achieved [7][8]. The primary research question is therefore:



Propose a new method for detecting artifacts in EEG: how does it compare to the known detection methods, and which one yields the highest detection rate?

Three different methods will be considered for testing: FORCe, FASTER and the new proposed method. The variables used for comparison will be the ones reported by their inventors. A limitation of this project is that only a few samples of labeled EEG exist, while unlabeled samples exist in plenty. The mathematics behind the methods discussed in this report will not be covered; the methods are discussed from a strictly programmatic view.

2. Background

The electroencephalogram (EEG) is one of many important clinical diagnostic tools. It is one of the more common ones, mostly because of its affordability and mobility in contrast to other techniques such as magnetic resonance imaging (MRI) or computed tomography (CT) [32][33]. Both of these techniques require stationary equipment, but they do offer higher resolution than EEG. EEG is primarily used when a brain disorder or condition needs to be confirmed or ruled out. Such conditions include seizure disorders, head injuries, brain tumors, encephalopathy, sleep disorders, dementia, stroke and memory problems [34].

Fig. 1 - Electrode locations of the international 10-20 system for EEG (electroencephalography) recording [24].

EEG measures the electrical activity of the human brain by recording the output of electrodes placed on the scalp [1]. An EEG session lasts 20 to 50 minutes, with a sample rate of 128 to 1024 Hz [9] and 16 to 256 electrodes. An electrode pair forms what is referred to as a channel. The standardized electrode placement pattern known as the 10-20 system can be seen in Figure 1.


Fig. 2 - Brainwave classification overview (Delta, Theta, Alpha, Beta, Gamma and Mu waves) [23].

Valuable EEG data normally lies within the range of 4 to 30 Hz, but frequencies span far outside that range. EEG is divided into six sub-categories: Delta, Theta, Alpha, Beta, Gamma and Mu waves. Delta waves have a frequency of up to 4 Hz. They tend to have a relatively high amplitude compared to other frequency ranges and are most commonly observed while the subject is in a state of deep sleep. Theta waves lie within the frequency range of 4 to 7 Hz and are observed (but not exclusively) in adults experiencing drowsiness or relaxation [10]. Alpha waves appear during relaxation or closing of the eyes and attenuate when the eyes open or during mental exertion. They are observed within the range of 7 to 14 Hz [46]. Mu waves overlap with this frequency range but originate from different types of stimuli; they reflect the synchronous firing of motor neurons in the rest state [9]. Beta waves range from 15 to approximately 30 Hz [48]. They are closely linked to motor movement and physical activity; beta waves of varying frequencies are associated with busy or anxious thinking and high levels of concentration. Gamma waves have a frequency of about 25 to 100 Hz [49]. They are assumed to occur when different groups of neurons are bound together to perform cognitive or motor functions, though this has not yet been proven. All of the above-mentioned brainwave categories are shown in Figure 2.

While measuring EEG, two forms of interference are commonly observed, EOG and EMG, but a third type is also worth mentioning: external interference. Examples are movement of the measuring electrodes or drippings from the conductive liquid used to ensure the electrodes' contact with the scalp, but also electrical interference such as power spikes within the system. An example of a clean EEG signal can be viewed in Figure 3. An EEG signal is determined to be clean when all of its components can be classified as one of the six known brainwave types mentioned above.
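The band boundaries described above can be summarized as a small lookup function. This is an illustrative sketch, not code from the thesis; the function name is invented here, and resolving the Mu/Alpha and Gamma/Beta overlaps in favor of Alpha and Beta is a simplifying assumption, since the real bands are distinguished by origin as well as frequency.

```python
def classify_band(freq_hz: float) -> str:
    """Map a dominant frequency to a brainwave band, per the ranges in the text."""
    if freq_hz < 4:
        return "Delta"
    if freq_hz < 7:
        return "Theta"
    if freq_hz < 14:
        return "Alpha"   # Mu waves share this range but differ in origin
    if freq_hz <= 30:
        return "Beta"
    return "Gamma"       # roughly 25-100 Hz; overlaps the upper Beta range

print(classify_band(2))    # Delta
print(classify_band(10))   # Alpha
print(classify_band(20))   # Beta
```

Note that such a rule only covers the 4-30 Hz range of primary interest plus its edges; it cannot separate Mu from Alpha, which requires knowing the signal's origin.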

Fig. 3 - EEG signal free from artifacts that can be distinguished through inspection alone [25].

Fig. 4 - EEG with eye blink (EOG artifact), marked with red [26].

EOG is a technique which measures the corneo-retinal standing potential between the front and the back of the human eye [2]. The resulting signal is called the electrooculogram, and it is what appears as interference when measuring EEG, as can be seen in Figure 4. These artifacts are often, but not exclusively, generated by eye movements and blinking. EOG artifacts can in some cases easily be distinguished from clean EEG, but roving eye movement artifacts, with their shallow slopes, are especially easy to overlook during a visual inspection [6]; that is why automated detection methods are desired for these types of artifacts.


Fig. 5 - EMG artifacts [27].

EMG is performed to produce a record called an electromyogram. It detects the electrical potential generated by skeletal muscles [3]. When these muscle cells are electrically or neurologically activated, the resulting signals can be seen as interference in an EEG [4], as shown in Figure 5.

2.1 State of the Art

To detect and separate these artifacts from the EEG, independent component analysis (ICA) can be used. ICA is a computational method used to separate a multivariate signal into additive sub-components (ICs). This requires the assumption that the sub-components are non-Gaussian signals [5]; they are also assumed to be statistically independent from each other. ICA is an excellent solution to the famous cocktail party problem [35], since it can find a number of ICs equal to the number of reference points; in this case the EEG channels serve as reference points. "Independent Component Analysis (ICA) and Blind Source Separation (BSS) represent a wide class of statistical models and algorithms that have one goal in common: to retrieve unknown statistically independent signals from their mixtures." [11]

Another commonly used tool for artifact detection is the statistical procedure known as principal component analysis (PCA) [36]. PCA uses an orthogonal transformation to perform a dimensionality reduction, converting data points into a set of linearly uncorrelated variables (called principal components) in a lower dimension [36]. The number of principal components is less than or equal to the number of original variables. The FASTER method uses PCA to reduce the channels down to 30 principal components [15].

"The wavelet transform (WT) [37] is one of the leading techniques for processing non-stationary signals, where "non-stationary" should be understood here loosely as meaning "with time-varying frequency content" (and not according to the meaning that this qualifier has in statistical signal processing). The WT is thus well suited for EEG signals, since these are non-stationary. The major feature of the WT is its capacity of decomposing a signal into components that are well localized in scale (which is essentially the inverse of frequency) and time." [12][13] The FORCe method for detection and removal of EEG artifacts implements the WT together with soft thresholds to detect and remove artifacts [14]. Soft thresholding is the method FORCe uses to detect artifacts: predetermined mathematical measures (such as curve skewness [38]) are computed, with thresholds specified by a learning AI. If enough of these detection measures return values above their thresholds, the section of the EEG being tested is declared an artifact [14].

A different method used for interpreting data is the evolutionary algorithm (EA) [39], often also named genetic algorithm (GA) [40]. It is based on the principles of Darwin's theory [40]. "EAs are computer programs that attempt to solve complex problems by mimicking the processes of Darwinian evolution. In an EA a number of artificial creatures search over the space of the problem. They compete continually with each other to discover optimal areas of the search space. It is hoped that over time the most successful of these creatures will evolve to discover the optimal solution" [17]. Elements of selection, mutation and breeding are simulated to search the problem space. The problem is broken down into simple variables, which in turn become the genes of the artificial specimens.

For detecting artifacts, both FORCe and FASTER use some form of ICA, and a comparison between 16 previously known and 11 new ICA methods has been performed before [12]. That report concludes that the most powerful ICA methods for finding EOG artifacts are INFOMAX, E-INFOMAX and EFICA. Soft thresholding combined with the wavelet transform also proved promising, but this approach will not be considered since it is the same as the FORCe method.
The advantage of ICA is that it is not limited to finding artifacts; it also isolates them as additive components of the EEG signal. This makes it simple to remove the artifacts through subtraction, which should have minimal effect on the data, although some distortion should still be expected. PCA is used by FASTER [15] in combination with ICA, but only to reduce the complexity of the input given to the AI. This approach was avoided since it modifies the ICs, which in turn makes comparison hard to perform, especially when the training data contains samples with different numbers of channels. The proposed method shares similarities with FASTER and FORCe: they all use ICA as a data processing step before performing a data evaluation [14][15]. Where FASTER and FORCe both use a fuzzy system [41] (the developers of FORCe specifically use soft thresholding [14]) with five and seven features respectively to evaluate a given data sample, the proposed method uses only one feature and no form of fuzzy system. The proposed method instead searches a set of preemptively evaluated data samples for the closest match to the given EEG epoch; the result of the search is then used to evaluate the epoch. A fuzzy system is a control system which uses fuzzy logic. Fuzzy logic is a so-called many-valued logic: truth values can be any real number between 0 and 1 [41].
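The closest-match evaluation described above can be sketched as a nearest-neighbor lookup. The mean distance between signals follows the description in the abstract; the function name and the toy database contents (a sine as "clean", a Gaussian bump as a blink-shaped "artifact") are illustrative assumptions, not data from the thesis:

```python
import numpy as np

def label_by_nearest(ic: np.ndarray, database: list) -> str:
    """Return the label of the database component with the smallest mean distance."""
    best_label, best_dist = None, np.inf
    for signal, label in database:
        dist = np.mean(np.abs(ic - signal))   # mean sample-wise distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

t = np.linspace(0, 2, 320)                       # a two-second epoch at 160 Hz
db = [
    (np.sin(10 * t), "clean"),                   # hypothetical clean IC
    (np.exp(-((t - 1) ** 2) * 50), "artifact"),  # hypothetical blink-shaped IC
]
unseen = np.exp(-((t - 1) ** 2) * 50) + 0.05 * np.sin(40 * t)
print(label_by_nearest(unseen, db))   # artifact
```

Because only one feature (the distance itself) is used, the quality of this scheme rests entirely on how well the database covers the space of possible ICs, which is what the genetic algorithm is meant to build up.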

3. Motivation

There currently exist automated and semi-automated methods for artifact removal, but they share a common problem: they lack specificity when detecting artifacts. Two of the most successful methods are FASTER and FORCe [14]. During tests, FORCe on average completely removed all traces of 58.3% of all known artifacts, while FASTER removed all traces of 41.7% [14]. These results were achieved by the creators of FORCe. The results are far from satisfying; this thesis will attempt to provide an improved method which outperforms the above-mentioned methods.

The current methods also have other limitations. FORCe has only been thoroughly tested with 16-channel EEG, and its processing time is expected to scale linearly with the number of channels [14]. FASTER is limited by its requirement of at least 30 channels to effectively perform its PCA [15]. This thesis will attempt to generalize its solution in this respect as well. However, the biggest limitation of the traditional methods lies in their fundamentals. Both FASTER and FORCe are limited to describing ICs in terms of predetermined features, which in turn specify the rules for the soft thresholding/fuzzy system. A set of strict pre-programmed rules without any exceptions is unlikely to accurately describe the full range of possible signals measured during an EEG session; the chosen approach leaves little room for exceptions to the known rules of EEG, and this might cause problems as the system is limited by it. The proposed approach tries to overcome this limitation by implementing the EEG epoch evaluation as a GA. This way, the system itself can determine what is an artifact and what is not, without being limited to describing artifacts as a set of selected signal characteristics. The GA approach enables the system to find previously unknown artifact signal patterns. These findings can in turn be used to learn more about EEG in general. The GA approach thus has the potential to contribute new findings to the field of EEG, whereas a soft thresholding/fuzzy system is limited to operating within the known. A GA approach was therefore chosen due to this fundamental advantage over the traditional approaches. The work of this thesis has practical applications in the research field of brain computer interfaces (BCI) [16].

4. Method and Materials

FASTER implements a high/low-pass filter (removing delta and gamma waves), also known as an equiripple filter, together with a notch filter (removing the power supply frequency) to remove uninteresting frequencies, before performing a PCA and finally an ICA which detects and removes artifacts [15]; FORCe also implements ICA [14]. With this in mind, a new method is proposed which builds on the proven success of its predecessors. The new method will use ICA to find ICs in a large set of unlabeled data and store the unique ICs in a database. A population will then be generated, using clean EEG and the different artifact types as labels in conjunction with the database as genes. A genetic algorithm will then attempt to correctly label the ICs, using previously unseen labeled data as testing data. This is done by calculating the ICs of the labeled EEG and comparing these newfound ICs with the ICs in the database. When a member of the population has reached a satisfactory accuracy, the genetic algorithm finishes and the individual's genes are stored in the database. To then detect and remove artifacts, ICA is applied and the resulting ICs are compared with the ICs in the database; matches with ICs labeled as artifacts are simply subtracted from the EEG. This should result in an EEG signal where only the artifacts have been removed and the rest of the data is left intact. As a final step, completely unseen labeled data will be used to evaluate the final performance of the proposed algorithm. The method will be implemented in MATLAB, chosen for its many extension libraries which contain functionality needed in this project. Evolutionary algorithms have never been used to detect artifacts in EEG, and it will therefore be interesting to see the results this type of algorithm can achieve.
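The kind of pre-filtering attributed to FASTER above (band-pass plus mains notch) can be sketched as follows. This is a minimal illustration in Python with SciPy, not the equiripple implementation from the FASTER paper; the 4-30 Hz cut-offs and the 50 Hz mains frequency are assumptions chosen to match the frequency ranges discussed in the Background section:

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

fs = 160  # Hz, matching the unlabeled data set used in this thesis

def prefilter(x: np.ndarray) -> np.ndarray:
    # Band-pass: suppress the delta (<4 Hz) and gamma (>30 Hz) ranges.
    b, a = butter(4, [4, 30], btype="bandpass", fs=fs)
    x = filtfilt(b, a, x)
    # Notch: suppress mains interference (assumed 50 Hz here).
    b, a = iirnotch(50, Q=30, fs=fs)
    return filtfilt(b, a, x)

t = np.arange(0, 2, 1 / fs)
raw = np.sin(2 * np.pi * 10 * t) + 0.8 * np.sin(2 * np.pi * 50 * t)  # alpha + mains
clean = prefilter(raw)

# The 50 Hz component should be strongly attenuated while 10 Hz is kept.
spectrum = np.abs(np.fft.rfft(clean))
freqs = np.fft.rfftfreq(len(clean), 1 / fs)
print(spectrum[np.argmin(np.abs(freqs - 50))]
      < 0.1 * spectrum[np.argmin(np.abs(freqs - 10))])
```

Zero-phase filtering (`filtfilt`) is used so the filter does not shift artifact timestamps, which matters when segments are later cut by logged times.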
The unlabeled EEG data used in this thesis was retrieved from the developers of the BCI2000 instrumentation system, which they used in making these recordings. The recordings consist of "over 1500 one- and two-minute EEG recordings, obtained from 109 volunteers" [18][29]. The volunteers performed tasks such as closing and opening of a single eye and opening and closing of both fists, as well as imagining performing these actions. The following description of the data set can be retrieved from their publication on physionet.org: "This dataset was created and contributed to PhysioNet by the developers of the BCI2000 instrumentation system, which they used in making these recordings. The system is described in: Schalk, G., McFarland, D.J., Hinterberger, T., Birbaumer, N., Wolpaw, J.R. BCI2000: A General-Purpose Brain-Computer Interface (BCI) System. IEEE Transactions on Biomedical Engineering 51(6):1034-1043, 2004. [In 2008, this paper received the Best Paper Award from IEEE TBME.]" [29]. The data set consists of 109 subjects with 14 recorded sessions per subject, each session one to two minutes long, collected at 160 Hz with 64 channels.

The labeled data used in this thesis was recorded by Elain Åstrand at the faculty of Research in Innovation, Design and Engineering. This data was collected from one single subject over five three-minute sessions. The subject was presented with a screen on which tasks to be performed would pop up at five-second intervals. The tasks were blink, clench jaw, raise arm, look left, look right and yawn; together they generate all types of artifacts which are desirable to detect except for external interference. External interference does occur naturally, but in this case it was not labeled and the AI will therefore not be able to detect that type of artifact. Blink, look left and look right were used to generate EOG artifacts. Clench jaw and yawn were used to generate smaller-scale EMG artifacts, and finally raise arm was used to generate large-scale EMG artifacts, as the subject would move their entire arm. The EEG sessions were recorded at 1000 Hz with 64 channels using the biolab extension library for MATLAB. The labeling of the artifacts was done by logging timestamps of when an image was presented. To adjust for the delay between the computer and the EEG equipment, the offset was measured at the beginning and end of each session; the average offset was 2070 ms. The measurement at the end of a session was used to adjust for possible drift of the delay; this turned out to have a minimal impact but was still noticeable, with an average drift of 400 ms.

5. Implementation

The data used as unlabeled test data has a sample rate of 160 Hz and a length of 60 seconds. In total, 108 unlabeled subjects are used in this project, each represented by 14 individual measuring sessions recorded on 64 channels. The ICA method used to build the database of ICs was INFOMAX, specifically the RUNICA implementation found in the EEGlab toolbox [22]. Two databases were built using this method, with component lengths of two seconds and sixty seconds. Both managed to find components in all samples and the ICs look promising, but the sixty-second ICs appear to be unusable for learning: during visual inspection almost all channels appeared to contain several different types of artifacts, and matching long ICs with each other does not appear to be an effective strategy. The longer ICs were therefore discarded.


5.1 Data

The genetic algorithm requires two distinct groups of data. The first group contains randomly selected recordings with possible artifacts of unknown signal origin; this is the genetic algorithm's learning data. The second group needs to contain signals known to either be an artifact or be free from artifacts. These need to be manually checked and generated in a controlled environment. This data is used to calculate fitness and is the evaluation data for the GA. A small part of it is kept separate to evaluate the final detection accuracy. The learning data for the genetic algorithm was retrieved from the developers of the BCI2000 instrumentation system [29]. The volunteers performed tasks such as closing and opening of a single eye and opening and closing of both fists, as well as imagining performing these actions. These actions generate some of the artifacts that the AI should be able to detect. Head and jaw movements are however missing; since these were not generated intentionally, the amount of reliable data for these specific artifacts might be severely reduced. Through visual inspection it became apparent that the recordings from subject number 106 were corrupt, and that sample was therefore discarded. The final piece of data required for the genetic algorithm is data labeled as void of all known artifacts, referred to as clean data. This type of data turned out to be impractical to generate with conscious subjects, as humans tend to blink every two to ten seconds [50]. The clean data was instead retrieved from the default library files of the Neurophysiological Biomarker Toolbox (NBT) [20]. These are known to be free of artifacts, and this was confirmed through visual inspection of all the data.

5.2 Pre-Processing

The data files were serially opened and split into overlapping two-second-long segments (the shortest length the ICA method could handle; floats and doubles were not allowed as the start index of a segment). The labeled data containing artifacts was segmented according to its timestamps, using a simple function to read the logged times from a text file. The clean labeled data was however segmented manually through visual inspection, to ensure clean samples. The runica implementation of the INFOMAX ICA method [19] was run on each individual segment. A total of 27,095,042 unlabeled ICs were retrieved from 108 different subjects as learning data. A total of 1,032 artifact-free ICs were retrieved from eight different subjects, and a total of 8,000 labeled ICs containing artifacts were retrieved from one subject, as evaluation data. The segments were then saved and named according to subject number, session number, and segment start and end time. Segmentation and ICA method code, without the encapsulating for-loop:

EEG = pop_fileio(filepath);
EEG = pop_select( EEG, 'time', [startTime endTime]);
EEG = eeg_checkset( EEG );
EEG = pop_runica(EEG, 'extended', 1, 'interrupt', 'on');
EEG = eeg_checkset( EEG );
EEG = pop_saveset( EEG, 'filename', filename, 'filepath', filepath);
EEG = eeg_checkset( EEG );

Different segment lengths were experimented with in an attempt to determine the length of an artifact; it could not be determined. The length of an artifact was assumed to be approximately 0.5 to 3 seconds, fast eye blinks being the shortest and jaw clenching the longest. This was once again assessed through visual inspection. Two second long segments were favored in an attempt to isolate individual artifacts: if a component contains a combination of artifacts instead of a single one, the amount of labeled data required to describe all possible combinations increases immensely. This was an interesting finding, as it suggests there should exist an optimal component length for describing artifacts. The optimal length unfortunately appears to be less than two seconds, but more testing needs to be performed for this to be anything beyond speculation.
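The overlapping two-second segmentation described above can be sketched in Python. This is an illustrative sketch, not the Matlab implementation; the function name and the one-second step between segment starts are assumptions.

```python
def overlapping_segments(n_samples, fs, seg_seconds=2, step_seconds=1):
    """Return (start, end) sample indices for overlapping segments.

    Integer sample indices are used throughout, since fractional start
    indices were not accepted by the ICA step. With a two-second window
    and a one-second step, consecutive segments overlap by half.
    """
    seg = int(seg_seconds * fs)
    step = int(step_seconds * fs)
    segments = []
    start = 0
    while start + seg <= n_samples:
        segments.append((start, start + seg))
        start += step
    return segments
```

For example, four seconds of 160 Hz data yields three overlapping two-second windows.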

5.3 Learning Algorithm

The genetic algorithm is implemented using the built-in Optimization Tool in Matlab. The "ga - Genetic Algorithm" option was selected, and the lower and upper bounds were set to zero and one [42]. The general outline of the evolutionary algorithm can be seen below in Fig.6. The population of the genetic algorithm is made up of bit strings. The length of the bit strings is equal to the number of unlabeled ICs. A zero represents false: the IC at that index is an artifact. A one represents true: the IC is clean and therefore contains no artifacts. The initial population was generated using the tool's default settings, which results in 20 randomly generated population members.
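The bit-string encoding and random initial population can be sketched as follows. This is an illustrative Python sketch, not the Matlab tool's own creation function; the function name and seed are assumptions.

```python
import random

def initial_population(pop_size=20, n_unlabeled_ics=60, seed=1):
    """Create random bit strings; bit i is the label of unlabeled IC i.

    0 = artifact, 1 = clean EEG, matching the encoding described above.
    The defaults mirror the 20-member population used by the tool.
    """
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(n_unlabeled_ics)]
            for _ in range(pop_size)]
```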

Fig.6 - Program flow of the learning algorithm.

The breeding, or so-called crossing, of population members is done using a basic scattered function [21]: a random bit string is generated, and all true bits are inherited from the first parent while all false bits are inherited from the second parent. When choosing which population members to use for breeding, a selection function is used; a roulette function was chosen for this [21]. It assigns a number to each member of the population based on the fitness scale and then randomizes a number between one and the total of these numbers to select an individual. This implementation uses a rank based scale [21]: individuals are sorted by fitness and ranked accordingly. Rank scaling removes the effect of fitness value spread, which would otherwise cause the highest evaluated individual to be selected far more often than the rest.

The fitness function [21] used to evaluate each individual is surprisingly simple: it iterates over all unlabeled ICs and compares them to the labeled ones. If the closest match and the labeled IC have equal bit values, the score of the individual is increased by one. The fitness is then calculated by subtracting the score from the total number of labeled ICs; a fitness of zero is therefore a perfect score, equal to zero misinterpreted ICs. ICs are compared by calculating the mean distance between the two components; the lower the value, the closer the match. Other methods could be tested, such as the area formed between the two ICs, or possibly a clustering function [51] to figure out which ICs share common characteristics, but the latter would require feature extraction [47]. The exact implementation of the fitness function can be found in Appendix 10.2. To ensure that the GA does not stagnate on a local maximum, a mutation function [21] is used: every generation, each bit has the same one percent chance of being replaced by a new random bit. The top ten percent of each generation is carried over to the next generation, eighty percent of the new generation is created through breeding [21], and the final ten percent is made up of new individuals with randomly generated bits. The learning algorithm uses forward migration [21]: every 20th generation the bottom 20% is discarded and replaced by the top 20% of the sub-population from twenty generations earlier. The GA is set to stop its learning process when one of three scenarios occurs: no improvement is made within fifty generations, one hundred generations pass, or one member of the population reaches a fitness of zero.
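The nearest-match fitness idea described above can be sketched in Python. This is an illustrative sketch, not the Matlab implementation in Appendix 10.2; the function names are hypothetical, and `labeled_ics` is assumed to be a list of (component, label) pairs with 1 = clean and 0 = artifact.

```python
def mean_distance(ic_a, ic_b):
    """Mean absolute point-wise distance between two equal-length components."""
    return sum(abs(a - b) for a, b in zip(ic_a, ic_b)) / len(ic_a)

def fitness(bits, unlabeled_ics, labeled_ics):
    """Count mislabeled components.

    For each labeled IC, find the closest unlabeled IC by mean distance
    and compare that bit in the individual against the known label.
    Zero is a perfect score, as in the thesis.
    """
    correct = 0
    for component, label in labeled_ics:
        best = min(range(len(unlabeled_ics)),
                   key=lambda i: mean_distance(component, unlabeled_ics[i]))
        if bits[best] == label:
            correct += 1
    return len(labeled_ics) - correct
```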
The complete list of the settings used with the in-built optimization tool in Matlab can be found in: Appendix 10.1.
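The scattered crossover and rank-scaled roulette selection described above can be sketched as below. This is an illustrative Python sketch of the standard operators, not Matlab's internal implementation; the rank weights are an assumption, and note that rank scaling as normally defined still weights better-ranked individuals more heavily, it only removes the influence of raw fitness spread.

```python
import random

def scattered_crossover(parent1, parent2, rng):
    """Scattered crossover: a random mask picks each bit from parent 1
    (mask bit 1) or parent 2 (mask bit 0)."""
    mask = [rng.randint(0, 1) for _ in parent1]
    return [p1 if m else p2 for m, p1, p2 in zip(mask, parent1, parent2)]

def rank_roulette_select(population, fitnesses, rng):
    """Rank-scaled roulette selection.

    Individuals are sorted by fitness (lower is better here, zero being
    perfect) and weighted by rank rather than by raw fitness value.
    """
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    n = len(population)
    weights = [n - rank for rank in range(n)]  # best rank gets weight n
    pick = rng.choices(order, weights=weights, k=1)[0]
    return population[pick]
```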

6. Result

The final algorithm implemented was not able to distinguish between different artifact types, as there was not enough time to implement this functionality. The run time necessary for the algorithm to distinguish between clean EEG and artifacts was already a couple of hours, even when run in parallel. Increasing the difficulty by having the AI learn to distinguish artifact origin was not possible within a reasonable time frame; it was expected to take several days at a time to compute even for smaller sets of data. Seven training sessions were performed in total, testing different numbers of ICs to label: 20, 60 and 100. A session using 20 ICs was used during evaluation, as the ones with larger numbers of ICs were too computationally heavy. The first attempts at running the learning algorithm gave discouraging results, as can be seen in Fig.7. The first attempt used an equal amount of clean and unclean ICs as evaluation data for the fitness function. The likelihood of mutation was however far too high (20%), and the method had a hard time finding and maintaining a solution even when the amount of learning data was limited to 20 samples.

Fig. 7 - First attempt using a mix of artifact types and clean EEG.

The second attempt had the same issues as the first, but as of the third attempt the chance of mutation was reduced to 1%. This caused the algorithm to find an almost optimal solution within six generations, as seen in Fig.8. A dataset limited to EOG artifacts and clean EEG was used to speed up the learning process, and the solution presented below was found in less than an hour. The low number of generations required to find a solution hints at possible problems in the future, as the method is more likely to get stuck on a local maximum with such a quick convergence of the population.


Fig. 8 - Third attempt using only EOG artifacts and clean EEG.

The fourth attempt was a failure: clean EEG and artifacts both turned out to match the same unlabeled IC, because no other true match existed. This appears to be one of the method's greatest weaknesses. This was further tested in attempt five, which followed in its predecessor's footsteps. The preferable ratio of labeled to unlabeled ICs could however not be found. A possible explanation is that the ratio is heavily dependent on the spread of the ICs' characteristics. Attempt six finally succeeded: 60 random unlabeled ICs were used, and 20 labeled ICs were evenly split between clean EEG and randomly selected artifacts. The learning process took approximately two hours to complete, and as can be seen in Fig.9 a perfect solution was found in 10 generations. This might however have been a stroke of luck, as the first generation began with a specimen able to correctly label 14 out of the 20 labeled ICs.


Fig. 9 - Sixth attempt using a mixture of artifacts and clean EEG.

The seventh and last attempt was a stress test, with the number of unlabeled ICs increased to 100. It took over 8 hours to reach the fourteenth generation, which finally reached a fitness value of zero.

Fig.10 - Seventh attempt using a larger amount of mixed ICs.

Members of the resulting populations were then tested by running the fitness function with previously unseen EEG epochs as labeled data. A total of four tests were performed, with 20, 40, 80 and 248 ICs to be labeled. The population with a bit string length of 20 was used to reduce the computation time. The hardware and operating system used for all these tests can be seen in Appendix 10.4. The average true accuracy achieved by population members with a fitness of zero was 76%. 20 ICs were used for the first test. The highest true accuracy achieved by a single member on this limited set of data was 100%, achieved by population member number five; its genetic definition can be seen in Appendix 10.3. The complete results for all individuals can be seen in Appendix 10.6. The second test was performed using 40 ICs, on which population members achieved a true accuracy of 80-90%. For comparison, FASTER reports an accuracy of >90% when detecting artifacts in 128- and 64-channel EEG [15]. Accuracy is the ability to label artifacts as artifacts. Specificity is the ability to label clean EEG as clean EEG. FASTER has a significantly lower specificity, >60%, as stated in [15], which is one of the motivations behind the development of the FORCe method. The FORCe method has been seen to outperform FASTER, as stated by [14]: "The FORCe method is compared to state-of-the-art methods, LAMIC and FASTER, and is seen to produce significantly better performance as measured by all the metrics employed. As both LAMIC and FASTER have, in previous studies, been compared to other artifact removal methods (Blind source separation, Wavelet based methods etc.), and being demonstrated to exhibit superior performances, we may conclude that the method proposed and described in this work is also able to perform better than these alternative artifact removal methods."
Therefore the most scientifically interesting comparison would be with the FORCe method, as it is the leading method at the time of writing. However, the developers of FORCe never included test results regarding the method's accuracy or specificity; the number of removed artifacts was used in their report instead. The accuracy of the suggested method will therefore be compared to the accuracy of FASTER, since the FORCe method never provided any information regarding improvements of the detection rate.

Aspect                 FASTER   Suggested Method
Accuracy               >90%     82,74%
Specificity            >60%     91,25%
True Detection Rate    >75%*    86,99%

Fig.11 - Comparison of FASTER and the suggested method. True Detection Rate is the ability to correctly label both clean EEG and artifacts. *10% missed artifacts and 40% of clean EEG assumed to be artifacts give 50% inaccurate labels out of a maximum of 200%, since a signal may only contain clean EEG or artifacts. Thereby: (200-10-40)/2 = 75. The division by 2 is performed to scale the true detection rate to a maximum of 100%.

The accuracy of FASTER and the suggested method is similar, but FASTER still holds a clear edge over the suggested method, with a minimum lead of 7,5 percentage points. FASTER (and possibly FORCe) is clearly the strongest method when comparing accuracy. FASTER did however see an unexpected performance drop, down to an accuracy of 5.88%, when 32-channel EEG was used [15]. When it comes to specificity, the suggested method performs better than FASTER in the tests performed, with a lead of approximately 31 percentage points. The suggested method clearly makes fewer mistakes when it comes to inaccurately labeling clean EEG as artifacts. In the tests performed, the suggested method on average gave a 12 percentage point more accurate labeling of the content of an EEG session compared to the results achieved by the creators of FASTER. When comparing run time, FASTER massively outperforms the suggested method: FASTER on average completes a run in 0.3 seconds [15]. The suggested method is far from being able to compete with FASTER or FORCe; its run time can be seen in the graph below. The run time appears almost linear but grows exponentially with the number of ICs to label, most likely due to overhead when opening .set files. FASTER and FORCe are several hundred times faster than the suggested method.
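The three metrics compared above can be written as a small computation over detection counts. This is an illustrative Python sketch under the definitions used in the text; the function and parameter names are hypothetical. Note that the true detection rate is the mean of accuracy and specificity, which is exactly the (200 - missed - false)/2 scaling in the footnote.

```python
def detection_metrics(true_artifacts, detected_artifacts,
                      true_clean, detected_clean):
    """Accuracy = artifacts labeled as artifacts; specificity = clean EEG
    labeled as clean EEG; true detection rate = mean of the two."""
    accuracy = detected_artifacts / true_artifacts
    specificity = detected_clean / true_clean
    true_detection_rate = (accuracy + specificity) / 2
    return accuracy, specificity, true_detection_rate
```

With FASTER's quoted figures (10% missed artifacts, 40% of clean EEG mislabeled), this reproduces the >75% true detection rate from the footnote.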


Figure 12. - Graph of observed run times when testing the accuracy, specificity and true accuracy of the suggested method.

From the graph it can be seen that 64 ICs (equal to 2 seconds worth of 64-channel EEG) would take approximately 780 seconds (13 minutes) to label. Scaling this up to account for 1 minute worth of 64-channel EEG gives a run time of 754 minutes (assuming 58 overlapping epochs), approximately 12,5 hours of computation time, without considering the added overhead. The suggested method is therefore too slow to be put to practical use in the field. The suggested method does however have its own unique strengths. It is able to detect artifacts with less known information: subsections of channels can be run in parallel without any information about signal origin on the scalp of the subject, no information is required about data in other channels, and no information about previous or future signal content is needed. The suggested method performs no feature extraction and uses no filters. With this in mind, the performance of the suggested method is impressive, given its ability to operate in an environment with so many uncertainties.
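The extrapolation above is a back-of-envelope calculation, sketched below under the assumptions stated in the text (13 minutes per two-second epoch, 58 overlapping epochs per minute of 64-channel EEG).

```python
# Assumed inputs, taken from the run-time discussion above.
minutes_per_epoch = 13          # observed labeling time for one 2 s epoch
epochs_per_minute_of_eeg = 58   # overlapping epochs in 1 min of recording

total_minutes = minutes_per_epoch * epochs_per_minute_of_eeg  # 754 minutes
total_hours = total_minutes / 60                              # ~12.6 hours
```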

7. Discussion

The proposed method has two major points of concern. The first is that the genetic algorithm requires every type of artifact that the user wishes to be detectable to be labeled at least once within the testing data. The second is that individual ICs might be too unique to find a match in the database; the results during the training of the GA also hint at this type of problem. This problem can however most likely be avoided by implementing a comparison method which looks for several good matches instead of only the closest match. Clustering [51] might be a well suited method for further development of the method in this respect. An advantage of this method lies in its potentially low run time when optimized: if the labeled ICs could be sorted into a hash table, the search time when comparing ICs would be massively reduced [43]. Unfortunately there currently exists no known way to implement this type of storage structure for this data. One solution could be to adapt music recognition software to this problem [30]; if mobile apps such as Shazam! can be re-purposed for use with EEG signals, this problem would be solved.

The research question posed was: propose a new method for detecting EOG and EMG artifacts in EEG; how does it compare to the known detection methods, and which one yields the highest detection rate? In the tests performed, the suggested method achieved results which outperform FASTER in specificity and true detection rate, but it falls short when it comes to accuracy and especially run time. FASTER will find more artifacts than the suggested method, but while doing so it will on average label 40% of the clean EEG as artifacts.
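The suggested improvement of looking for several good matches instead of only the closest one can be sketched as a k-nearest-neighbor vote. This is an illustrative Python sketch of that idea, not part of the thesis implementation; the function name, the choice of k, and the (component, label) database format are assumptions.

```python
def knn_label(component, database, k=3):
    """Label an unseen IC by majority vote over its k closest database
    entries, instead of trusting the single best match.

    `database` is a list of (component, label) pairs, 1 = clean EEG and
    0 = artifact, using the same mean-distance comparison as before.
    """
    def mean_distance(a, b):
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    nearest = sorted(database,
                     key=lambda entry: mean_distance(component, entry[0]))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0
```

A single mislabeled or overly popular database entry can then be outvoted by its neighbors, addressing the failure seen in the fourth attempt.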


7.1 Possible improvement of the pre-processing

The ICs are unfortunately not saved as part of the struct containing the signal segment (eeglab does not support this) and therefore need to be calculated when needed. The ICA method does not, however, need to be re-run: the unmixing matrix, used to calculate the ICs from the raw EEG data, can be calculated from two matrices stored in the same struct, the weight matrix and the sphere matrix. Moving this calculation from run time to pre-processing would improve the GA's run time by reducing the load on the fitness function [19]. The normalization of the data is also done during run time, as the normalized data cannot be saved locally (eeglab does not support this either). If the ICs were scaled appropriately as part of the pre-processing, the fitness function's workload would be reduced, resulting in a faster run time. Finally, the ICs vary in sample rate from 160 Hz to 1000 Hz; all samples were therefore converted to a common sample rate of 200 Hz during run time, which caused a loss of information in many of the ICs. This will have had a negative impact on the accuracy, but on the other hand it reduces the calculation time of the mean distance. Available disk space was starting to become a problem, so this pre-processing step was avoided, which also saved development time. If the memory space and time can be spared, the conversion should be done with the rest of the pre-processing. The sampling rate could also be lowered further to speed up the algorithm.

It is surprising how well a simple GA implementation performed when comparing detection rates. From 2010 and onward, new artifact detection algorithms have been limited to using filtering and feature extraction [14][15]. An important question to ask here is why the simple approaches were discarded and not researched. Was it because of a disbelief in the simple methods' expected performance? Or were previous developers unwilling to explore a solution which did not follow the steps used for manual detection of artifacts? Was that because they were not able to interpret the problem for an AI efficiently? The questions are many, and the scope of this thesis is unfortunately not able to answer them. The history of previous solutions to artifact detection should be studied in depth to answer these questions.
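The pre-computation suggested above (unmixing via the stored weight and sphere matrices, ICs = weights x sphere x data in the EEGLAB convention, followed by scaling) can be sketched in Python with NumPy. This is an illustrative sketch, not the thesis code; the function name is hypothetical, and the scaling shown is a simplified midpoint-centred version of the normalization in Appendix 10.2.

```python
import numpy as np

def precompute_components(weights, sphere, data):
    """Compute ICs once from the stored weight and sphere matrices and
    scale each component into [-1, 1], so that neither step has to be
    repeated inside the fitness function at run time."""
    components = weights @ sphere @ data
    mins = components.min(axis=1, keepdims=True)
    maxs = components.max(axis=1, keepdims=True)
    mid = (maxs + mins) / 2
    half_range = np.maximum((maxs - mins) / 2, 1e-12)  # avoid divide-by-zero
    return (components - mid) / half_range
```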

8. Conclusion

The suggested method shows potential but is not yet ready for use in the field. The results suggest that the method could be reliable, but the exponential relationship between the amount of data to process and the time required to do so limits the method heavily. The method also appears to overtrain in many scenarios, which implies that it requires vast amounts of data to work with, or a larger population size, to counteract this behavior. This does however further increase the time required for the learning phase, and the method is therefore in its current state unlikely to ever see practical use.

The pre-processing took a lot more time than expected, over 30 days of pure data handling. This delay caused problems, as certain elements of the implementation had to be cut to make up for the loss of time. Functionality for specifying artifact origin was planned but consequently never implemented. Conclusion: proper time management is key.

The choice of using a GA does not appear to be optimal. It should however still be considered as a possible complement to other methods that lack in specificity, as this has proven to be the method's strongest aspect.

8.1 Limitations

The project has experienced several limitations; time and hardware have had the biggest impact, since they limit the complexity of the suggested method. It would have been interesting to use more labeled ICs for the training. The lack of labeled ICs also caused a problem, as it became a time consuming process to label ICs manually. The acquisition of a Matlab license was also problematic, as MDH was unable to provide one. Finally, I personally had no prior knowledge of Matlab, EEG and ICA, so a lot of time was dedicated to installing, learning and testing the different methods and environments.

8.2 Societal Consideration

EEG data is considered medical data and is therefore subject to European protective laws. "Due to their sensitive nature, health data require a high level of protection in the European Union. Against that background, the Directive defines as a principle that data which are capable by their nature of infringing fundamental freedoms or privacy should not be processed unless the data subject gives his explicit consent." [31]. The EEG recordings used in this thesis are therefore not included in this document.


9. References
[1] E. Niedermeyer, F.L. da Silva, "Electroencephalography: Basic Principles, Clinical Applications, and Related Fields", 2004.
[2] M. Brown, M. Marmor, Vaegan, E. Zrenner, M. Brigell, M. Bach, "ISCEV Standard for Clinical Electro-oculography (EOG)", Doc Ophthalmol, Vol. 113, pp. 205-212, 2006.
[3] G. Kamen, "Electromyographic Kinesiology", in D.G.E. Robertson et al., Research Methods in Biomechanics, 2004.
[4] "Electromyography", US National Library of Medicine Medical Subject Headings (MeSH), 1999.
[5] J. Himberg, A. Hyvärinen, "Independent Component Analysis for Binary Data: An Experimental Study", 2001.
[6] J. M. Stern, "Figure 4-36 Slow Roving Eye Artifact", Atlas of EEG Patterns, 2013.
[7] M. Fatourechi, A. Bashashati, R. K. Ward, G. E. Birch, "EMG and EOG artifacts in brain computer interface systems: A survey", Clinical Neurophysiology, Vol. 118, pp. 480-494.
[8] J.C. Woestenburg, M.N. Verbaten, J.L. Slangen, "The removal of the eye-movement artifact from the EEG by regression analysis in the frequency domain", Clinical Neurophysiology, Vol. 16, pp. 127-147.
[9] M. Teplan, "Fundamentals of EEG Measurement", Measurement Science Review, Vol. 2, pp. 1-11, 2002.
[10] B. Rael Cahn, J. Polich, "Meditation States and Traits: EEG, ERP, and Neuroimaging Studies", Vol. 132, pp. 180-211, 2006.
[11] P. Tichavský, Z. Koldovský, "Kybernetika", Vol. 47, No. 3, pp. 426-438, 2011.
[12] M. Kirkove, C. François, J. Verly, "Comparative evaluation of existing and new methods for correcting ocular artifacts in electroencephalographic recordings", Signal Processing, Vol. 98, pp. 102-120, 2014.
[13] A. Kandaswamy, V. Krishnaveni, S. Jayaraman, N. Malmurugan, K. Ramadoss, "Removal of Ocular Artifacts from EEG - A Survey", IETE Journal of Research, Vol. 51, pp. 121-130, 2005.
[14] I. Daly, R. Scherer, M. Billinger, G. Müller-Putz, "FORCe: Fully Online and automated artifact Removal for brain-Computer interfacing", IEEE Transactions on Neural Systems and Rehabilitation Engineering, pp. 1-13, 2014.

[15] H. Nolan, R. Whelan, R.B. Reilly, "FASTER: Fully Automated Statistical Thresholding for EEG artifact Rejection", Journal of Neuroscience Methods, Vol. 192, pp. 152-162, 2010.
[16] J. J. Vidal, "Toward Direct Brain-Computer Communication", pp. 157-180, 1973.
[17] G. Jones, "Genetic and Evolutionary Algorithms".
[18] G. Schalk, D.J. McFarland, T. Hinterberger, N. Birbaumer, J.R. Wolpaw, "BCI2000: A General-Purpose Brain-Computer Interface (BCI) System", IEEE Transactions on Biomedical Engineering, 2004.
[19] Swartz Center for Computational Neuroscience, "runica.m", URL: https://sccn.ucsd.edu/svn/software/eeglab/functions/sigprocfunc/runica.m, [Retrieved: 2016-08-21].
[20] University of Amsterdam, "NBT - The Neurophysiological Biomarker Toolbox", URL: https://www.nbtwiki.net/, [Retrieved: 2016-01-13].
[21] MathWorks, "Global Optimization Toolbox, Genetic Algorithm", Documentation, 2013.
[22] Swartz Center for Computational Neuroscience, "EEGLAB", URL: http://sccn.ucsd.edu/eeglab/, [Retrieved: 2016-01-13].
[29] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. Ch. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C. K. Peng, H. E. Stanley, "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals", Circulation, Vol. 101, issue 23, pp. e215-e220, 2000.
[30] J. Jovanovic, "Shazam It! Music Recognition Algorithms, Fingerprinting, and Processing", 2015. URL: https://commons.wikimedia.org/wiki/User:Hgamboa/gallery, [Retrieved: 2016-05-18].
[31] European Commission, "Data protection in the EU", URL: http://ec.europa.eu/health/data_collection/data_protection/in_eu/index_en.htm#fragment3, [Retrieved: 2016-05-18].
[32] T. Duggan-Jahns, "The Evolution of Magnetic Resonance Imaging: 3T MRI in Clinical Applications", European Magnetic Resonance Forum, 2008.
[33] G. T. Herman, "Fundamentals of Computerized Tomography: Image Reconstruction from Projections", 2nd edition, 2009.
[34] K. Blocka, "EEG (Electroencephalogram)", Healthline, URL: http://www.healthline.com/health/eeg#Uses2, 2015, [Retrieved: 2015-05-18].

[35] A. W. Bronkhorst, "The Cocktail Party Phenomenon: A Review on Speech Intelligibility in Multiple-Talker Conditions", Acta Acustica united with Acustica, Vol. 86, pp. 117-128, 2000.
[36] I. T. Jolliffe, "Principal Component Analysis", Springer Series in Statistics, 2nd edition, 2002.
[37] A. N. Akansu, R. A. Haddad, "Multiresolution Signal Decomposition: Transforms, Subbands, Wavelets", San Diego: Academic Press, 1992.
[38] J. F. Kenney, "Mathematics of Statistics", Pt. 1, 3rd ed., 1962.
[39] D. Ashlock, "Evolutionary Computation for Modeling and Optimization", 2006.
[40] W. Banzhaf, P. Nordin, R. Keller, F. Francone, "Genetic Programming - An Introduction", 1998.
[41] W. Pedrycz, "Fuzzy Control and Fuzzy Systems", 2nd ed., Research Studies Press Ltd., 1993.
[42] MathWorks, "Optimization Toolbox", URL: http://se.mathworks.com/help/optim/index.html, [Retrieved: 2015-05-18].
[43] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, "Introduction to Algorithms", 3rd ed., pp. 253-280, 2009.
[44] Unknown, "How to Become a Registered EEG Technician", URL: http://study.com/articles/How_to_Become_a_Registered_EEG_Technician.html, [Retrieved: 2016-08-21].
[45] Edward Claro Madder, "How do I read an EEG?", URL: https://www.quora.com/How-do-I-read-an-EEG, [Retrieved: 2016-08-21].
[46] W. Klimesch, "EEG alpha and theta oscillations reflect cognitive and memory performance: a review and analysis", Brain Research Reviews, Vol. 29, pp. 169-195, 1999.
[47] S. Danielsson, "Investigation of Feature Selection Optimization for EEG Signal Analysis for Monitoring a Driver", pp. 9-10, 2015.
[48] M. Rangaswamy, B. Porjesz, D. B. Chorlian, K. Wang, K. A. Jones, L. O. Bauer, J. Rohrbaugh, S. J. O'Connor, S. Kuperman, T. Reich, H. Begleiter, "Beta power in the EEG of alcoholics", Biological Psychiatry, Vol. 52, pp. 831-842, 2002.
[49] J. R. Hughes, "Gamma, fast, and ultrafast waves of the brain: their relationships with epilepsy and behavior", Epilepsy & Behavior, Vol. 13, pp. 25-31, 2008.

[50] A. R. Bentivoglio, S. B. Bressman, E. Cassetta, D. Carretta, P. Tonali, A. Albanese, "Analysis of blink rate patterns in normal subjects", Movement Disorders, Vol. 12, pp. 1028-1034, 1997.
[51] K. Bailey, "Numerical Taxonomy and Cluster Analysis", Typologies and Taxonomies, Vol. 102, p. 34, 1994.

9.1 Media References
[23] H. Gamboa, 1 December 2005. URL: https://commons.wikimedia.org/wiki/User:Hgamboa/gallery, [Retrieved: 2016-01-13].
[24] Asanagi, 30 May 2010. URL: https://en.wikipedia.org/wiki/10-20_system_(EEG)#/media/File:21_electrodes_of_International_10-20_system_for_EEG.svg, [Retrieved: 2016-01-13].
[25] Unknown, 2012. URL: https://apotential.files.wordpress.com/2012/07/normaleeg-coma-alpha.jpg, [Retrieved: 2016-01-13].
[26] NascarEd, "Wikipedia", 7 February 2013. URL: https://en.wikipedia.org/wiki/Electrooculography#/media/File:Sleep_Stage_REM.png, [Retrieved: 2016-01-13].
[27] Unknown, URL: http://www.jssm.org/vol9/n4/12/fig1.jpg, [Retrieved: 2016-01-13].
[28] Neurophysiological Biomarker Toolbox (NBT), 2014, URL: https://www.youtube.com/watch?v=A3OjdSkQqwk, [Retrieved: 2016-04-29].


10. Appendix

10.1 Settings for the Optimization Tool in Matlab

Problem Setup and Results
  Solver: ga - Genetic Algorithm
  Fitness function: @Fitness
  Number of variables: 462000
  Constraints:
    Linear inequalities, A: (left blank)
    Linear inequalities, b: (left blank)
    Linear equalities, Aeq: (left blank)
    Linear equalities, beq: (left blank)
    Bounds, Lower: 0
    Bounds, Upper: 1
    Nonlinear constraint function: (left blank)
    Integer variable indices: (left blank)
  Use random states from previous run: Unchecked

Population
  Population type: Bit string
  Population size: Use default: 20
  Creation function: Uniform
  Initial population: Use default: []
  Initial scores: Use default: []
  Initial range: Use default: [0;1]

Fitness scaling
  Scaling function: Rank

Selection
  Selection function: Roulette

Reproduction
  Elite count: Use default: 2
  Crossover fraction: Use default: 0.8

Mutation
  Mutation function: Uniform
  Rate: Use default: 0.01

Crossover
  Crossover function: Scattered

Migration
  Direction: Forward
  Fraction: Use default: 0.2
  Interval: Use default: 20

Stopping criteria
  Generations: Use default: 100
  Time limit: Use default: Inf
  Fitness limit: Specify: 0
  Stall generations: Use default: 50
  Stall time limit: Use default: Inf
  Function tolerance: Use default: 1e-6

Plot functions
  Plot interval: 1
  Best fitness: Checked
  Best individual: Checked
  Distance: Unchecked
  Expectation: Unchecked
  Genealogy: Unchecked
  Range: Unchecked
  Score diversity: Unchecked
  Scores: Checked
  Selection: Unchecked
  Stopping: Checked
  Max constraint: Unchecked
  Custom function: Unchecked

Output function
  Custom function: Unchecked

Display to command window
  Level of display: iterative

User function evaluation
  Evaluate fitness and constraint functions: in parallel

10.2 Fitness function for the Optimization Tool in Matlab

function scores = Fitness(x)
fitness = 0;
cleanEEGEpochs = 8;
uncleanEEGEpochs = 115;
numberOfICsPerCleanEpoch = 128;
numberOfICsPerUncleanEpoch = 3;
numberOfUnlabeledSubjects = 100;
numberOfSessionsPerUnlabeledSubject = 14;
epochLength = 2;
numberOfOverlappingEpochsPerUnlabeledSession = 55;
numberOfICsPerUnlabeledEpoch = 60;

% Use the labeled components to find each one's best match and check
% if it should be true or false. Fitness = numberOfCorrectlyLabeledICs.
for e = 1:cleanEEGEpochs
    EEG_Labeled = pop_loadset('filename', strcat('Sample', int2str(e), '.set'), ...
        'filepath', 'F:\\Labeled Components 2s Clean\\');
    EEG_Labeled = pop_resample(EEG_Labeled, 200);
    LabeledComponents = EEG_Labeled.icaweights * EEG_Labeled.icasphere * EEG_Labeled.data;
    bestMatch = 1;
    for t = 1:numberOfICsPerCleanEpoch
        LowestDifference = -1;
        for y = 1:numberOfUnlabeledSubjects
            for i = 1:numberOfSessionsPerUnlabeledSubject
                z = epochLength;
                k = 0:1;
                g = numberOfOverlappingEpochsPerUnlabeledSession;
                for u = 0:(g-2) % go through 54 overlapping epochs per file
                    k(1) = 1 + u*z;
                    k(2) = 1 + z + u*z;
                    EEG_Unlabeled = pop_loadset('filename', ...
                        strcat(sprintf('%03d', y), '-R', sprintf('%02d', i), 'T', int2str(z), ...
                        'P', num2str(k(1)), '-', num2str(k(2)), '.set'), ...
                        'filepath', strcat('C:\\Users\\Niklas\\Documents\\EEG\\Unlabeled ICA components overlap\\sample', int2str(y), '\\'));
                    EEG_Unlabeled = pop_resample(EEG_Unlabeled, 200);
                    UnlabeledComponents = EEG_Unlabeled.icaweights * EEG_Unlabeled.icasphere * EEG_Unlabeled.data;
                    for f = 1:numberOfICsPerUnlabeledEpoch
                        % Compare the ICA components here
                        Difference = 0;
                        LMaxPnt = 1; LMinPnt = 1;
                        UMaxPnt = 1; UMinPnt = 1;
                        for d = 1:length(LabeledComponents)-1 % once for each data point
                            % Find min and max of both components for normalization
                            if (LMaxPnt < LabeledComponents(t,d))
                                LMaxPnt = LabeledComponents(t,d);
                            end
                            if (LMinPnt > LabeledComponents(t,d))
                                LMinPnt = LabeledComponents(t,d);
                            end
                            if (UMaxPnt < UnlabeledComponents(f,d))
                                UMaxPnt = UnlabeledComponents(f,d);
                            end
                            if (UMinPnt > UnlabeledComponents(f,d))
                                UMinPnt = UnlabeledComponents(f,d);
                            end
                        end
                        % Calculate the mid point of each component and store
                        % the inverse of it in diffToZero
                        LAvrPnt = (LMaxPnt + LMinPnt)/2;
                        LDiffToZero = 0 - LAvrPnt;
                        UAvrPnt = (UMaxPnt + UMinPnt)/2;
                        UDiffToZero = 0 - UAvrPnt;
                        UAbsolute = UMaxPnt;
                        LAbsolute = LMaxPnt;
                        if (abs(UMinPnt) > abs(UMaxPnt))
                            UAbsolute = abs(UMinPnt);
                        end
                        if (abs(LMinPnt) > abs(LMaxPnt))
                            LAbsolute = abs(LMinPnt);
                        end
                        for d = 1:length(LabeledComponents)-1 % once for each data point
                            % Scale both components to the range -1 to 1 and
                            % accumulate the absolute point-wise difference
                            UnlabeledComponents(f,d) = (UnlabeledComponents(f,d) + UDiffToZero)/(UAbsolute + UDiffToZero);
                            LabeledComponents(t,d) = (LabeledComponents(t,d) + LDiffToZero)/(LAbsolute + LDiffToZero);
                            Difference = Difference + abs(LabeledComponents(t,d) - UnlabeledComponents(f,d));
                        end
                        if (LowestDifference > Difference || LowestDifference == -1) % find the best match
                            LowestDifference = Difference;
                            % Store the index of the best match
                            bestMatch = (y-1)*numberOfUnlabeledSubjects + (i-1)*1 + u*2 + f;
                        end
                    end
                end
            end
        end
        %disp(bestMatch);
        if (x(bestMatch) == 1) % a clean IC should be labeled 1 in the genome
            fitness = fitness + 1;
        end
    end
end

for e = 1:uncleanEEGEpochs
    EEG_Labeled = pop_loadset('filename', strcat('sample', int2str(e), '_1.set'), ...
        'filepath', 'F:\\Labled Components 2s Unclean\\');
    EEG_Labeled = pop_resample(EEG_Labeled, 200);
    LabeledComponents = EEG_Labeled.icaweights * EEG_Labeled.icasphere * EEG_Labeled.data;
    bestMatch = 1;
    for t = 1:numberOfICsPerUncleanEpoch
        LowestDifference = -1;
        for y = 1:numberOfUnlabeledSubjects
            for i = 1:numberOfSessionsPerUnlabeledSubject
                z = epochLength;
                k = 0:1;
                g = numberOfOverlappingEpochsPerUnlabeledSession;
                for u = 0:(g-2) % go through 54 overlapping epochs per file
                    k(1) = 1 + u*z;
                    k(2) = 1 + z + u*z;
                    EEG_Unlabeled = pop_loadset('filename', ...
                        strcat(sprintf('%03d', y), '-R', sprintf('%02d', i), 'T', int2str(z), ...
                        'P', num2str(k(1)), '-', num2str(k(2)), '.set'), ...
                        'filepath', strcat('C:\\Users\\Niklas\\Documents\\EEG\\Unlabeled ICA components overlap\\sample', int2str(y), '\\'));
                    EEG_Unlabeled = pop_resample(EEG_Unlabeled, 200);
                    UnlabeledComponents = EEG_Unlabeled.icaweights * EEG_Unlabeled.icasphere * EEG_Unlabeled.data;
                    for f = 1:numberOfICsPerUnlabeledEpoch
                        % Compare the ICA components here
                        Difference = 0;
                        LMaxPnt = 1; LMinPnt = 1;
                        UMaxPnt = 1; UMinPnt = 1;
                        for d = 1:length(LabeledComponents)-1 % once for each data point
                            % Find min and max of both components for normalization
                            if (LMaxPnt < LabeledComponents(t,d))
                                LMaxPnt = LabeledComponents(t,d);
                            end
                            if (LMinPnt > LabeledComponents(t,d))
                                LMinPnt = LabeledComponents(t,d);
                            end
                            if (UMaxPnt < UnlabeledComponents(f,d))
                                UMaxPnt = UnlabeledComponents(f,d);
                            end
                            if (UMinPnt > UnlabeledComponents(f,d))
                                UMinPnt = UnlabeledComponents(f,d);
                            end
                        end
                        % Calculate the mid point of each component and store
                        % the inverse of it in diffToZero
                        LAvrPnt = (LMaxPnt + LMinPnt)/2;
                        LDiffToZero = 0 - LAvrPnt;
                        UAvrPnt = (UMaxPnt + UMinPnt)/2;
                        UDiffToZero = 0 - UAvrPnt;
                        UAbsolute = UMaxPnt;
                        LAbsolute = LMaxPnt;
                        if (abs(UMinPnt) > abs(UMaxPnt))
                            UAbsolute = abs(UMinPnt);
                        end
                        if (abs(LMinPnt) > abs(LMaxPnt))
                            LAbsolute = abs(LMinPnt);
                        end
                        for d = 1:length(LabeledComponents)-1 % once for each data point
                            % Scale both components to the range -1 to 1 and
                            % accumulate the absolute point-wise difference
                            UnlabeledComponents(f,d) = (UnlabeledComponents(f,d) + UDiffToZero)/(UAbsolute + UDiffToZero);
                            LabeledComponents(t,d) = (LabeledComponents(t,d) + LDiffToZero)/(LAbsolute + LDiffToZero);
                            Difference = Difference + abs(LabeledComponents(t,d) - UnlabeledComponents(f,d));
                        end
                        if (LowestDifference > Difference || LowestDifference == -1) % find the best match
                            LowestDifference = Difference;
                            % Store the index of the best match
                            bestMatch = (y-1)*numberOfUnlabeledSubjects + (i-1)*1 + u*2 + f;
                        end
                    end
                end
            end
        end
        %disp(bestMatch);
        if (x(bestMatch) == 0) % an artifact IC should be labeled 0 in the genome
            fitness = fitness + 1;
        end
    end
end

scores = cleanEEGEpochs*numberOfICsPerCleanEpoch + ...
         uncleanEEGEpochs*numberOfICsPerUncleanEpoch - fitness;
end
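The core of the fitness function above is a normalization of each two second component followed by a summed point-wise absolute difference against every database component, with the best match's gene deciding the label. The following Python sketch illustrates that idea on plain lists; the function names are mine, and the normalization is a slightly simplified variant that centers each signal on its min/max midpoint and scales the largest excursion to magnitude 1.

```python
def scale_to_unit(signal):
    """Shift a signal so its min/max midpoint sits at zero, then scale
    the largest excursion to magnitude 1 (a simplified analogue of the
    diffToZero / Absolute normalization in the Matlab code)."""
    lo, hi = min(signal), max(signal)
    mid = (lo + hi) / 2.0
    biggest = max(abs(v - mid) for v in signal) or 1.0  # guard flat signals
    return [(v - mid) / biggest for v in signal]

def total_abs_difference(a, b):
    """Sum of point-wise absolute differences between two scaled signals."""
    return sum(abs(x - y) for x, y in zip(a, b))

def best_match_label(component, database):
    """Return the label of the database component with the lowest
    total absolute difference to the given component."""
    target = scale_to_unit(component)
    best_label, lowest = None, None
    for candidate, label in database:
        diff = total_abs_difference(target, scale_to_unit(candidate))
        if lowest is None or diff < lowest:
            lowest, best_label = diff, label
    return best_label

# Toy database: one "clean" and one "artifact" reference component.
db = [([0.1, 0.3, -0.2, 0.0], 'clean'),
      ([5.0, -5.0, 5.0, -5.0], 'artifact')]
print(best_match_label([4.0, -4.1, 3.9, -4.0], db))  # artifact
```

Because both signals are rescaled before comparison, the match depends on the shape of the waveform rather than its raw amplitude, which is what makes high-amplitude artifact components comparable to low-amplitude clean EEG.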

35

10.3 Genetic build-up of the Highest Scoring Population Member

Gene Type:    Bit String
Gene Content: 10011001011100111010
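Each bit of this genome labels one independent component in the database. Since the fitness function rewards a 1 at a clean component's best-match index and a 0 at an artifact's, a set bit can be read as "clean EEG" and a cleared bit as "artifact". A small illustrative decoding in Python (the label strings are mine):

```python
# Each bit in the genome labels one database component:
# 1 -> clean EEG, 0 -> artifact, following the fitness function's scoring.
genome = "10011001011100111010"
labels = ['clean' if bit == '1' else 'artifact' for bit in genome]
print(labels.count('clean'), labels.count('artifact'))  # 11 9
```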

10.4 Hardware and Operating System

Processor:        Intel(R) Core(TM) i5-2500K @ 3.30GHz
RAM:              16.0 GB, 1600MHz
Disk:             1TB, 5400RPM hard drive
Operating System: Windows 8.1, 64-bit


10.5 Manual Artifact Removal Process

1. Copy the original raw data signal.
2. Filter the copy of the raw signal using a notch filter set to the range 45Hz to 55Hz. This removes possible distortions caused by the power supply.
3. Re-reference the copy of the signal to the zero potential. This scales the amplitudes of the channels so they can be compared.
4. Automatically select contaminated channels. This method is flawed, so visual inspection is required to confirm the selections made by the automatic method.
5. The operator then adds bad channels to, and removes good channels from, the group of selected channels. This is done through visual inspection and requires vast knowledge as well as experience.
6. The bad channels are then removed from the original signal, but only if they contain large or frequently occurring artifacts; the operator makes this decision.
7. An Independent Component Analysis (ICA) is then performed on the copy of the original signal. The resulting independent components are cross-compared with the copy of the original signal to determine which components describe the artifacts seen in the copy.
8. The operator selects the components he or she believes describe the artifacts and removes them from the original signal. To be on the safe side, more of the signal than what is visually seen to contain artifacts is removed. This is why the process is sometimes considered an art rather than a step-by-step scientific procedure.
9. This process of rejecting channels and using ICA to reject independent components is repeated until the signal is considered clean [28].
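Step 3 of the process re-references the copied signal so that channel amplitudes become comparable. One common way to realize a zero-potential reference is the common average reference, sketched below in Python on plain lists. This is an illustrative assumption about the referencing scheme, not necessarily the exact method used by the clinical software.

```python
def rereference_to_average(channels):
    """Subtract the instantaneous mean across channels from every channel,
    so the average over all channels is zero at each time point."""
    n = len(channels)
    samples = len(channels[0])
    avg = [sum(ch[t] for ch in channels) / n for t in range(samples)]
    return [[ch[t] - avg[t] for t in range(samples)] for ch in channels]

data = [[1.0, 2.0], [3.0, 4.0]]  # two channels, two samples each
referenced = rereference_to_average(data)
print(referenced)  # [[-1.0, -1.0], [1.0, 1.0]]
```

After this step the channels share a common zero level, so a large deflection on one channel can be judged against the others on the same scale.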


10.6 Results of True Accuracy Tests Run on the Individuals Produced by the GA

True Detection Accuracy
ID   Evaluation data   20 ICs    40 ICs    80 ICs    248 ICs
1    100,00%           90,00%    75,00%    65,00%    -
4    100,00%           95,00%    87,50%    85,00%    -
5    100,00%           100,00%   100,00%   90,00%    86,99%
2    100,00%           85,00%    42,50%    -         -
3    100,00%           90,00%    65,00%    -         -
6    100,00%           85,00%    60,00%    -         -
9    100,00%           85,00%    45,00%    -         -
7    100,00%           45,00%    -         -         -
8    100,00%           55,00%    -         -         -
12   100,00%           30,00%    -         -         -
10   0,00%             -         -         -         -
11   42,50%            -         -         -         -
13   77,50%            -         -         -         -
14   80,00%            -         -         -         -
15   0,00%             -         -         -         -
16   57,25%            -         -         -         -
17   32,50%            -         -         -         -
18   0,00%             -         -         -         -
19   60,00%            -         -         -         -
20   87,50%            -         -         -         -
