Two-sample test for equality of distributions

August 13, 2019 19:19 PM

I have a large sample from an unknown continuous 2-dimensional distribution. From that sample I can build whatever data structure I need to hold an approximation of the sample distribution. It will be constructed only once, so it does not matter if that takes a long time.

What I want is to compare it with other samples and see whether there is evidence that they do not come from the same distribution. The trouble is that I have a lot of such samples to compare against, and they are large, so I cannot afford to hold them in memory. Since the data for each sample arrives as a stream, I need an online procedure. One idea is to bin both samples on a common grid and perform a chi-squared test. But I am afraid that I will get many empty cells if the grid is dense, or lose too much power if the grid is coarse. It would also take a lot of trial and error to find the proper grid density.
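To make the binning idea concrete, here is a minimal sketch of what I have in mind, using only the Python standard library. The names GridCounter, chi2_two_sample and stream, the grid bounds, and the bin count are all hypothetical choices for illustration, and the simulated normal data only stands in for my real streams. Each sample is reduced to per-cell counts one point at a time, so only the occupied cells are kept in memory, and the statistic is the usual two-sample chi-squared statistic for binned data with unequal sample sizes.

```python
import math
import random
from collections import Counter


class GridCounter:
    """Streaming 2-D histogram: memory grows with occupied cells, not points."""

    def __init__(self, x_min, x_max, y_min, y_max, bins):
        self.x_min, self.y_min = x_min, y_min
        self.x_width = (x_max - x_min) / bins
        self.y_width = (y_max - y_min) / bins
        self.bins = bins
        self.counts = Counter()
        self.n = 0

    def _index(self, value, low, width):
        # Points outside the grid are clamped into the edge cells.
        i = int((value - low) / width)
        return min(max(i, 0), self.bins - 1)

    def update(self, x, y):
        """Consume one point from the stream."""
        cell = (self._index(x, self.x_min, self.x_width),
                self._index(y, self.y_min, self.y_width))
        self.counts[cell] += 1
        self.n += 1


def chi2_two_sample(ref, other):
    """Return (statistic, degrees of freedom) for two binned samples.

    Only cells occupied by at least one of the samples contribute, so the
    denominator a + b is never zero. Both samples must be non-empty.
    """
    cells = set(ref.counts) | set(other.counts)
    k_ref = math.sqrt(other.n / ref.n)
    k_other = math.sqrt(ref.n / other.n)
    stat = 0.0
    for cell in cells:
        a, b = ref.counts[cell], other.counts[cell]
        stat += (k_ref * a - k_other * b) ** 2 / (a + b)
    return stat, len(cells) - 1


def stream(rng, n, shift=0.0):
    """Hypothetical stand-in for a data stream: n bivariate normal points."""
    for _ in range(n):
        yield rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)


rng = random.Random(0)
bounds = (-4.0, 4.0, -4.0, 4.0)  # assumed grid range, chosen by hand

reference = GridCounter(*bounds, bins=8)
for x, y in stream(rng, 20000):
    reference.update(x, y)

same = GridCounter(*bounds, bins=8)
for x, y in stream(rng, 5000):
    same.update(x, y)

shifted = GridCounter(*bounds, bins=8)
for x, y in stream(rng, 5000, shift=0.5):
    shifted.update(x, y)

stat_same, dof_same = chi2_two_sample(reference, same)
stat_shifted, dof_shifted = chi2_two_sample(reference, shifted)
```

The statistic would then be compared against a chi-squared distribution with the returned degrees of freedom, for example with scipy.stats.chi2.sf(stat, dof) if SciPy is available. This sketch does nothing about my actual worry: cells with very small counts make the chi-squared approximation unreliable, and the result still depends on the bin count I picked.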

Do you have any ideas? I am considering some smart, compact KDE approximation with a Bayesian update, but I cannot figure it out.

