Assume you are in charge of a private tracker. Recommendations can come in handy sometimes, especially when a visitor wants to try out something new.
So I am going to work out a way to achieve this.
II. Theory
Consider resources A and B. N(A) represents the number of users who downloaded resource A.
Restricting attention to the N(A ∪ B) users who downloaded at least one of them, P(A) = N(A) / N(A ∪ B).
Similarly, P(B) = N(B) / N(A ∪ B).
Suppose we want to present a list of recommendations based on resource A. There are two obvious ways to do this:
a) For every other resource B, use P(B | A) as the rank, with P(B | A) = P(AB) / P(A). The result is N(A ∩ B) / N(A).
b) For every other resource B, use P(AB) as the rank. The result is N(A ∩ B) / N(A ∪ B).
The first idea measures what fraction of the users who downloaded A also downloaded B.
The second idea measures what fraction of the users who downloaded A or B downloaded both.
These are NOT THE SAME.
Consider N(A) = 50, N(B) = 40, N(A ∪ B) = 80, N(A ∩ B) = 10,
and N(C) = 300, N(A ∪ C) = 320, N(A ∩ C) = 30.
Doing the math, idea a) produces the following ranks:
B: 0.20 C: 0.60
while idea b) produces:
B: 0.13 C: 0.09
The difference is obvious.
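Both scores can be computed directly from the sets of downloader IDs. A minimal sketch (the user IDs and set sizes here are synthetic, chosen so that C is a "hot" resource, not the exact counts above):

```python
def rank_conditional(a, b):
    """Idea a): N(A ∩ B) / N(A) -- fraction of A's downloaders who also took B."""
    return len(a & b) / len(a)

def rank_jaccard(a, b):
    """Idea b): N(A ∩ B) / N(A ∪ B) -- fraction of A-or-B downloaders who took both."""
    return len(a & b) / len(a | b)

# Hypothetical downloader-ID sets; every downloader of A also has the hot resource C.
a = {1, 2, 3, 4}
b = {3, 4, 5}
c = set(range(1, 21))  # hot: 20 downloaders

print(rank_conditional(a, b), rank_conditional(a, c))  # 0.5 1.0 -- idea a) prefers hot C
print(rank_jaccard(a, b), rank_jaccard(a, c))          # 0.4 0.2 -- idea b) prefers B
```

Under idea a) the hot resource C dominates, while idea b) penalizes it for its large union, matching the behaviour described above.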
In fact, resource C is a hot resource that almost every user has downloaded,
so C gets a high rank under idea a). But idea b) also accounts for how many of C's downloaders took A, so the ratio stays reasonable.
After a series of tests, it turns out that hot resources often get a high rank against whatever resource you start from under the first method, while the second method performs well because it considers a two-way linkage.
III. Coding phase
I wrote a tiny program to get this job done.
You can download it here:
The idea: for every resource pair <A, B>, calculate N(A ∩ B) / N(A ∪ B) * 100 as the correlation ratio of <A, B> as well as <B, A>.
To build the recommendation list, sort the resources by that ratio and pick the top N items.
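The pairwise loop can be sketched as follows. This is not the original program; it assumes the download log is available as a dict mapping each resource ID to the set of user IDs who downloaded it (all names are illustrative):

```python
from itertools import combinations

def correlation_table(downloads):
    """For every resource pair <A, B>, store N(A ∩ B) / N(A ∪ B) * 100,
    recorded symmetrically for <A, B> and <B, A>."""
    corr = {}
    for a, b in combinations(downloads, 2):
        inter = len(downloads[a] & downloads[b])
        if inter == 0:
            continue  # no common downloader: skip the pair entirely
        ratio = inter / len(downloads[a] | downloads[b]) * 100
        corr[(a, b)] = corr[(b, a)] = ratio
    return corr

def recommend(corr, resource, top_n=5):
    """Sort the other resources by their ratio with `resource`, pick the top N."""
    scored = [(other, r) for (res, other), r in corr.items() if res == resource]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_n]

downloads = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": {1, 2, 3, 4, 5}}
corr = correlation_table(downloads)
print(recommend(corr, "A", top_n=2))  # C (60.0) ranks above B (50.0)
```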
IV. Next step
This is just a tiny experiment I did on my own, out of curiosity about how well and how fast these methods could perform, and what problems I might run into next.
The time complexity of this algorithm is O(R^2 * N), where R is the number of resources and N is the average number of users who downloaded a resource.
It takes a lot of time to finish the calculation.
Apparently there are much better ways to improve this.
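One common improvement (not what my program does) is to iterate over users instead of resource pairs: each user only contributes to the intersection counts of the pairs they actually downloaded, so pairs with no common downloader are never touched. A sketch, assuming a user → resources mapping and a separate table of per-resource download counts:

```python
from collections import Counter
from itertools import combinations

def intersections(user_downloads):
    """Count N(A ∩ B) only for pairs that co-occur in some user's history.
    Roughly O(U * d^2), with d the average downloads per user, instead of
    O(R^2 * N) over all resource pairs."""
    inter = Counter()
    for resources in user_downloads.values():
        for a, b in combinations(sorted(resources), 2):
            inter[(a, b)] += 1
    return inter

def jaccard_from_counts(inter, sizes):
    """Recover N(A ∩ B) / N(A ∪ B) * 100 via N(A ∪ B) = N(A) + N(B) - N(A ∩ B)."""
    return {pair: n / (sizes[pair[0]] + sizes[pair[1]] - n) * 100
            for pair, n in inter.items()}

user_downloads = {1: {"A", "B"}, 2: {"A", "B", "C"}, 3: {"C"}}
sizes = {"A": 2, "B": 2, "C": 2}
print(jaccard_from_counts(intersections(user_downloads), sizes))
```

Since most users download far fewer resources than exist on the tracker, this touches a small fraction of the R^2 pairs.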