Blind source separation


Dear all,

 

Below please find my email discussion with Prof. Jaggi. I'm starting this page so as not to mess up the "ideas" page.

 

I'm reading about an algorithm called infomax-ICA (independent component analysis), whose goal is "blind source separation". The paper will be uploaded shortly. As far as I can understand now, the idea is to recover the source signals from some N observed signals by finding a transformation such that the transformed signals are maximally independent. Independence is obtained by maximizing the (joint) entropy of the transformed outputs, hence the name "infomax".
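To make sure I understand the update rule, I wrote a minimal sketch of the natural-gradient form of the Bell & Sejnowski rule with a logistic nonlinearity, tried on a toy mixture I made up (the learning rate, batch size, and Laplacian toy sources are my own arbitrary choices, not taken from the paper or from real EEG):

import numpy as np

def infomax_ica(X, lr=0.005, n_iter=100, batch_size=64, seed=0):
    """X: (n_channels, n_samples), zero-mean (and ideally whitened) mixtures.
    Returns an unmixing matrix W such that W @ X approximates the sources
    (up to scaling and permutation)."""
    rng = np.random.default_rng(seed)
    n, T = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        order = rng.permutation(T)
        for start in range(0, T, batch_size):
            xb = X[:, order[start:start + batch_size]]
            u = W @ xb                        # current source estimates
            y = 1.0 / (1.0 + np.exp(-u))      # logistic nonlinearity
            # natural-gradient infomax update: dW = (I + (1 - 2y) u^T) W
            dW = (np.eye(n) + (1.0 - 2.0 * y) @ u.T / xb.shape[1]) @ W
            W += lr * dW
    return W

# Toy demonstration with two super-Gaussian (Laplacian) sources, which is the
# case the logistic nonlinearity is suited to.
rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 5000))               # hidden sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                    # unknown mixing matrix
X = A @ S
X -= X.mean(axis=1, keepdims=True)            # center
d, E = np.linalg.eigh(np.cov(X))              # whiten (standard ICA preprocessing)
Xw = (E / np.sqrt(d)) @ E.T @ X
W = infomax_ica(Xw)
S_hat = W @ Xw                                # recovered sources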

 

Manson

 

=====

 

For the poster presentation, I would like to ask if I could explore the experimental side of information theory. For my own research, I have two general directions: (1) to study how the brain processes linguistic materials through brain waves (or multi-channel EEG recordings); (2) brain-computer interfacing with EEG. Perhaps you might agree that both directions require a good deal of knowledge in signal processing & classification, and my gut feeling is that information theory would be very useful here to "decode" the brain.

I've just started some pilot experiments regarding direction (2), in which I want to classify EEG data according to whether the subject is doing one task or the other. To be specific, the subject will view a number of choices on the monitor, and if the choice he/she wants to select flashes, he/she is required to pronounce it silently. Therefore, every portion of data can be tagged according to whether the flashed choice is a target or a non-target.

Returning to the presentation, there are two potential directions I'd like to pursue:

(a) understand the principles of independent component analysis (ICA), as well as a very popular ICA algorithm (INFOMAX) proposed by Bell & Sejnowski (1995) (I believe this is a classic paper in blind source separation?);

(b) to use, in some way, the mutual information between the recorded multi-channel data (X^n_i) & the task variable (target/non-target) for feature extraction/selection. The Date (2001) study below seems to be about this direction.

The detailed references of both studies (as attached) are as follows:

A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, no. 6, pp. 1129-1159, Nov. 1995.

A. Date, "An information theoretic analysis of 256-channel EEG recording: Mutual information and measurement selection problem," in Proc. ICA, pp. 185-188, 2001.

Roughly, I would like to compare INFOMAX-ICA with other decomposition methods, such as the singular value decomposition (SVD), for dimensionality reduction; or, regarding the feature selection step, to compare the use of mutual information with simpler measures like the R2 statistic. If you find these general directions "reasonably interesting", I will try to present my analysis results for the poster presentation, while highlighting the principles of these information theory-derived techniques. What do you think?
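To sketch what I mean by this comparison, here is roughly what I have in mind, on made-up simulated trials rather than my real recordings (the signal placement, effect size, and number of components are arbitrary):

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_trials, n_features = 400, 50
y = (rng.random(n_trials) < 0.25).astype(int)     # target vs non-target, P(T) = 0.25
X = rng.normal(size=(n_trials, n_features))
X[:, :5] += 0.8 * y[:, None]                      # only the first 5 features carry signal

# (a) mutual information between each feature and the class label
mi = mutual_info_classif(X, y, random_state=0)

# (b) a simple R2-type statistic: squared point-biserial correlation per feature
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
r2 = r ** 2

# (c) SVD for dimensionality reduction: project onto the top-k right singular vectors
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 10
X_reduced = Xc @ Vt[:k].T

print("top features by MI :", np.argsort(mi)[::-1][:5])
print("top features by R2 :", np.argsort(r2)[::-1][:5])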

 

=====

 

thanks for the carefully considered email.


in general, i'm supportive of presentations where the person is enthusiastic about the idea, rather than forcing you to do something you're uncomfortable with.


however, a word of caution -- information theory is often used to justify claims that are beyond its ability to prove. for instance, in the papers you've just attached, i'd need to think more carefully, but i'd be wary of the implicit assumption that "mutual information equals communication rate". what we proved in class is that if someone designs an optimal code, THEN the rate of this code is (at most) the mutual information of the channel -- however, it's easy to design codes that do much worse (and hard to design optimal codes). also, another common trap that people fall into when trying to use information theory is to confuse single-letter expressions with n-letter expressions. the former correspond to single channel uses. however, it's not hard to see that a single channel use usually gives very noisy information (think of a single BSC(p) -- the probability of being incorrect is p). yet, the expressions computed in the papers seem to be based on single-letter expressions.
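to make this concrete, here's a quick numeric check (a rough python sketch for a BSC(0.1); the 3-repetition code is just one arbitrary example of a suboptimal code):

import numpy as np

def h2(p):
    # binary entropy in bits
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = 0.1
capacity = 1 - h2(p)                    # mutual information with a uniform input: ~0.531 bits/use
single_use_error = p                    # one channel use: wrong with probability 0.1
rep3_rate = 1 / 3                       # 3-repetition code with majority decoding
rep3_error = 3 * p**2 * (1 - p) + p**3  # ~0.028: reliable, but the rate is far below capacity

print(capacity, single_use_error, rep3_rate, rep3_error)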


again, these are cursory thoughts, and i'd need to think more, and more deeply, before giving definitive answers. don't let these discourage you from doing your project. however, i'd encourage you to think deeply about the applicability of information theory to the problems you're interested in, before jumping ahead and believing everything written in those papers.

 

=====

 

Thanks for your detailed reply and for sharing your thoughts about the two papers. I would say it is very inspiring -- I had never imagined that classifying experimental data, for the problem I'm working on, could be viewed as designing a code.

What I gather from your discussion is that a single-letter expression is not enough to predict the underlying state; in other words, seeing how much information the observed data carry about the underlying state would require an n-letter expression. Moreover, even if we could quantify this mutual information, it would be difficult for a classification algorithm (the code) to achieve this quantity.

But in general, I'd say that in an experimental setting, there would not be enough training data to estimate the probability distribution over n timepoints. Would that be one of the reasons why people have to settle for using first-order or second-order statistics?

Also, since the "mutual information" of three random variables can be negative, would it be problematic, in my case, to use I(X1(t); X2(t); state) for "feature selection"? Here, X1(t) & X2(t) are the potentials measured at two electrodes at the same time, while the "state" refers to which mental task the subject is doing.
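To make my worry concrete, here is a toy check that this three-way quantity can indeed be negative (a rough Python sketch using Y = X1 XOR X2 as an artificial example, not my actual data):

import numpy as np

# Joint distribution of (X1, X2, Y) with X1, X2 independent fair bits and Y = X1 XOR X2.
p = np.zeros((2, 2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        p[x1, x2, x1 ^ x2] = 0.25

def H(P):
    P = P[P > 0]
    return -(P * np.log2(P)).sum()

H1  = H(p.sum(axis=(1, 2)))   # H(X1)      = 1
H2  = H(p.sum(axis=(0, 2)))   # H(X2)      = 1
Hy  = H(p.sum(axis=(0, 1)))   # H(Y)       = 1
H12 = H(p.sum(axis=2))        # H(X1,X2)   = 2
H1y = H(p.sum(axis=1))        # H(X1,Y)    = 2
H2y = H(p.sum(axis=0))        # H(X2,Y)    = 2
H123 = H(p)                   # H(X1,X2,Y) = 2

I_x1_x2     = H1 + H2 - H12                  # I(X1;X2)   = 0
I_x1_x2_gy  = H1y + H2y - Hy - H123          # I(X1;X2|Y) = 1
interaction = I_x1_x2 - I_x1_x2_gy           # I(X1;X2;Y) = -1 bit
print(interaction)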

Anyway, I will try to think more deeply, before any actual implementation. Perhaps I could consult you along the way?

 

=====

 

well, based on your discussion, it seems to me that you're using mutual information as a proxy for covariance (high mutual information means high probability of two events happening simultaneously, etc).


what would be nice to see is


1. what is your model of the sequence of events... (if i understand you correctly, it's something like p(state, x_1, x_2) = p(state) p(x_1, x_2 | state) ?)
2. what is the goal of the experiment? (again, if i understand you correctly, it is to estimate state based on x_1,x_2 ?)
3. what's the model for how the state changes?


based on the answers to questions like this, perhaps the quantity you're looking for is a "best estimator" based on (perhaps) a parametric model
(see, for instance, http://en.wikipedia.org/wiki/Estimator )
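as a bare-bones illustration of what i mean (purely a sketch, under an assumed gaussian class-conditional model with a shared covariance -- that model is my assumption, not something derived from your data):

import numpy as np

def fit_gaussian_map(X_train, y_train):
    # X_train: (n_trials, n_features); y_train: integer states.
    # fit p(state) and a gaussian p(x | state) with a shared (pooled) covariance.
    states = np.unique(y_train)
    priors = np.array([(y_train == s).mean() for s in states])
    means = np.array([X_train[y_train == s].mean(axis=0) for s in states])
    resid = np.vstack([X_train[y_train == s] - means[i] for i, s in enumerate(states)])
    cov = np.cov(resid, rowvar=False) + 1e-6 * np.eye(X_train.shape[1])  # small ridge for stability
    return states, priors, means, cov

def predict_map(x, states, priors, means, cov):
    # MAP estimate of the state for one feature vector x,
    # dropping terms that don't depend on the state.
    inv = np.linalg.inv(cov)
    scores = [np.log(pr) - 0.5 * (x - mu) @ inv @ (x - mu) for pr, mu in zip(priors, means)]
    return states[int(np.argmax(scores))]

note that with a shared gaussian covariance this is just linear discriminant analysis.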


mutual information is merely one quantity that measures correlation between random variables. it is by no means the appropriate quantity for all situations. you need to clearly specify what your problem setup is (what things you can control in your experiments, what are fixed), what the sources of randomness are, whether they fall within certain parametric classes (correlated/independent across time/measurements, markov, ergodic, stationary, ....), and also, what the goal of the experiment is.

i'd be happy to chat more along the way.


vis-a-vis your presentation topic, i'd suggest that you consider asking the questions we've just been discussing with regard to some of the papers you mentioned. i suspect you'll find, if you do it carefully, that many of the papers in the "applied" literature make very loose use of mutual information and entropy quantities, even when much simpler statistical quantities are (provably!) what they should be using. (just because something is more complicated doesn't necessarily mean it's "more correct/useful" :)


a presentation that even points out the above would be very interesting, i think.

 

=====

 

Sorry for my late response, and thanks for raising these questions for me. Let me try to answer some of them -- although it is probably more effective discussing them face-to-face later, when you are back and when I think through the problems more carefully.

First, about the goals of the experiment, there are two: (a) to study the effect of a certain adjustment of an EEG-based BCI (Brain-Computer Interface) system; (b) to compare different methods of signal processing. In short, the BCI system works by RANDOMLY flashing a number of cells on-screen (something like this: http://www.youtube.com/watch?v=wKDimrzvwYA&feature=related, but I use only 4 cells), and the EEG responses elicited by the target cell will be stronger than when the same cell serves as a non-target. Typically, after the cells are flashed 5-10 times, the target cell can be identified from the EEG responses. My experiment asks "what if the order of flashing is FIXED?" and "how would the features useful for classification change?" To do this, I have run the experiment with both randomized sequences and fixed sequences.

While the data analysis is still ongoing, I've already found that the data from the fixed-sequence condition seem to be noisier. Still, if the BCI system could work when the sequence is fixed, this would probably make the system slightly more user-friendly.

A possible abstraction for the experiment is as follows:
1. The experimental condition alternates between two states, T ("target") and N ("non-target"). Each state lasts about 350 ms, or about 90 sampling points. Within a SEQUENCE of 4 states, there must be one "T" and three "N"s, so P(T) = 0.25 and P(N) = 0.75. For each target selection, the four cells are flashed for 10 sequences in succession.
2. The data consist of 32 time-series, X_n(t), n = 1, ..., 32, recorded by 32 electrodes.
3. The classification step can be viewed as estimating p(state|X_n(t)) from some training data, which give p(X_n(t)|state).

For some i and j, the correlations between X_i(t) and X_j(t) are quite high, e.g., up to 0.6. So the 32 electrode channels are not independent. I am still unsure what would be a good parametric model for the data, but this is clearly an important question. For now, I am trying SVD for dimensionality reduction, as suggested by my supervisor. I have also implemented an LDA classifier, but this approach is rather blind, so I'd indeed like to find a suitable model for the data.
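For concreteness, here is a rough sketch of that pipeline on simulated placeholder data -- the epoch shape (32 channels x 90 samples) follows the abstraction above, but the number of epochs, the injected "target" effect, and the number of components kept are made up for illustration:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_epochs, n_channels, n_samples = 400, 32, 90
y = (rng.random(n_epochs) < 0.25).astype(int)        # 1 = target, 0 = non-target
epochs = rng.normal(size=(n_epochs, n_channels, n_samples))
epochs[y == 1, :8, 30:60] += 0.5                      # crude "stronger response" on some channels

X = epochs.reshape(n_epochs, -1)                      # flatten each epoch to a feature vector
Xc = X - X.mean(axis=0)

# SVD-based dimensionality reduction (keep the top-k components)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 20
Z = Xc @ Vt[:k].T

# LDA on the reduced features, with a simple train/test split
split = n_epochs // 2
clf = LinearDiscriminantAnalysis().fit(Z[:split], y[:split])
print("held-out accuracy:", clf.score(Z[split:], y[split:]))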

Since I have only two states (or four, depending on the abstraction), covariance between X_n & the states might not be directly applicable (but let me know if this statement is wrong). To measure the extent to which the states can be predicted from single datapoints, I've been using the R2 value (as have other studies), which roughly measures the ratio of between-class variance to within-class variance.
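Concretely, what I compute per (channel, timepoint) feature is something like the following rough sketch, assuming flattened epochs X of shape (n_epochs, n_features) and labels y in {0, 1}:

import numpy as np

def between_within_ratio(X, y):
    # X: (n_epochs, n_features); y: labels in {0, 1}.
    # Per-feature ratio of between-class variance to within-class variance
    # (closely related to the R2 value, since R2 = SS_between / SS_total).
    classes = np.unique(y)
    grand = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - grand) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return ss_between / ss_within

# e.g., scores = between_within_ratio(X, y); best = np.argsort(scores)[::-1][:20]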

I'm still exploring the use of SVD and other types of transforms, but the classification accuracy reaches at most 50% for now, while the chance level is 25%. So, clearly, there is room for improvement, and it has just occurred to me that an information-theoretic analysis would be cool (and useful) to do, if done appropriately :)

Anyway, as you suggest, I will try to read the "applied" literature with more caution, and to understand it better before blindly implementing anything. Thank you.
