Thursday, June 6, 2013

a simple & stupid probabilistic correlation analysis method

I. Problem definition

Assume you are in charge of a private tracker. A recommendation feature could come in handy sometimes, especially when a visitor wants to try out something new.

So I am going to work out a way to achieve this.


II. Theory

Consider resources A and B. Let N(A) be the number of users who downloaded resource A.
Taking the N(A ∪ B) users who downloaded A or B as the sample space, P(A) = N(A) / N(A ∪ B).
Similarly, P(B) = N(B) / N(A ∪ B).

Suppose we want to present a list of recommendations based on resource A. There are two obvious ways to do this:

a) For every other resource B, use P(B | A) as the rank, where P(B | A) = P(AB) / P(A). The result is N(A ∩ B) / N(A).

b) For every other resource B, use P(AB) as the rank. The result is N(A ∩ B) / N(A ∪ B) (the Jaccard index of the two sets of downloaders).

The first idea considers what fraction of the users who downloaded A also downloaded B.

The second idea considers what fraction of the users who downloaded A or B downloaded both.

These are NOT THE SAME.
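
To make the two ranks concrete, here is a minimal Python sketch (an illustration of mine, assuming each resource's downloaders are available as a set of user IDs):

```python
def rank_conditional(users_a: set, users_b: set) -> float:
    """Idea a): P(B | A) = N(A ∩ B) / N(A)."""
    return len(users_a & users_b) / len(users_a)


def rank_joint(users_a: set, users_b: set) -> float:
    """Idea b): P(AB) = N(A ∩ B) / N(A ∪ B)."""
    return len(users_a & users_b) / len(users_a | users_b)
```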

Consider N(A) = 50, N(B) = 40, N(A ∪ B) = 80, N(A ∩ B) = 10.
N(C) = 300, N(A ∪ C) = 320, N(A ∩ C) = 30.

Doing the math, idea a) gives the following ranks:
B: 0.20   C: 0.60

while idea b):
B: 0.13  C: 0.09

The difference is obvious.
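
As a quick sanity check, the counts above reproduce these ranks (a tiny snippet of mine, not part of the original program):

```python
n_a, n_b, n_c = 50, 40, 300
n_ab, n_ac = 10, 30
n_a_or_b = n_a + n_b - n_ab  # 80
n_a_or_c = n_a + n_c - n_ac  # 320

print(n_ab / n_a, n_ac / n_a)            # idea a): 0.2, 0.6
print(n_ab / n_a_or_b, n_ac / n_a_or_c)  # idea b): 0.125, 0.09375
```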

In fact, resource C is a hot resource which almost every user downloaded.
So C gets a high rank under idea a). But idea b) also accounts for how many of the users who downloaded C downloaded A, so the ratio stays reasonable.

After a series of tests, it turned out that with the first method, hot resources often got a high rank against whatever resource was being considered, while the second method performed well because it considers the linkage in both directions.


III. Coding phase

I wrote a tiny, simple program to get this job done.

You can download it here: 


The idea: for every resource pair <A, B>, calculate N(A ∩ B) / N(A ∪ B) * 100 as the correlation ratio of <A, B>, and by symmetry of <B, A> as well.

To build the recommendation list for a resource, sort the other resources by this ratio and pick the top N items.
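
Here is a minimal sketch of that computation in Python. This is not the actual program, just a reconstruction; it assumes the download history is available as a mapping from resource ID to the set of user IDs that downloaded it:

```python
from collections import defaultdict
from itertools import combinations


def correlation_ratios(downloads: dict[int, set[int]]) -> dict[int, dict[int, float]]:
    """For every resource pair <A, B>, store N(A ∩ B) / N(A ∪ B) * 100
    as the correlation ratio of both <A, B> and <B, A>."""
    ratios = defaultdict(dict)
    for a, b in combinations(downloads, 2):
        inter = len(downloads[a] & downloads[b])
        if inter == 0:
            continue  # no shared downloaders, skip the pair
        ratio = inter / len(downloads[a] | downloads[b]) * 100
        ratios[a][b] = ratio
        ratios[b][a] = ratio
    return ratios


def recommend(ratios, resource, n=10):
    """Pick the top N resources most correlated with the given one."""
    ranked = sorted(ratios[resource].items(), key=lambda kv: kv[1], reverse=True)
    return [res for res, _ in ranked[:n]]
```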


IV. Next step

This is just a tiny experiment I did on my own, out of curiosity about how well and how fast these methods could perform, and what problems I might run into next.

Right now the time complexity of the algorithm is O(R^2 * N), where R is the number of resources and N is the average number of users who downloaded a resource.

It takes a lot of time to finish the calculation.

There are obviously better ways to do this.
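
One obvious direction (a suggestion of mine, not something the experiment covers) is to iterate over users instead of resource pairs: each user only contributes intersection counts for the resources they actually downloaded, so pairs that share no downloader are never visited at all.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_counts(user_downloads: dict[int, set[int]]) -> Counter:
    """user_downloads maps a user ID to the set of resources they downloaded.
    Counts N(A ∩ B) only for pairs that actually co-occur, costing
    O(sum of d_u^2 over users) rather than O(R^2 * N)."""
    counts = Counter()
    for resources in user_downloads.values():
        for a, b in combinations(sorted(resources), 2):
            counts[(a, b)] += 1  # one more shared downloader for <A, B>
    return counts
```

With per-resource totals on hand, N(A ∪ B) = N(A) + N(B) - N(A ∩ B), so the ratio only needs to be computed for pairs that actually appear in the counter.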
