Sunday, October 27, 2013

generating binary file for simplified chinese via word2vec

background
word2vec could come in handy in nlp.
i have worked out one way to generate word vectors for simplified chinese.

training data
a) wiki dump: http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
(reference: http://licstar.net/archives/262)
use Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to extract plain text from the dump with the following command:

bzcat zhwiki-latest-pages-articles.xml.bz2 | python WikiExtractor.py -b1000M -o extracted >output.txt

b) socialysis: ftp://ftp.socialysis.org
you can find a lot of raw text there.

segmentation
before training, we need to segment the raw text into terms.
in this case, i am using ansj for segmentation.
i wrote a demo class to turn ansj into a command-line tool:
package org.ansj.demo;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;

import org.ansj.domain.Term;
import org.ansj.recognition.NatureRecognition;
import org.ansj.splitWord.analysis.ToAnalysis;

public class SimpleIODemo {

    public static void main(String[] args) throws IOException {
        // read raw text from stdin line by line
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String line = null;
        while ((line = br.readLine()) != null) {
            // skip the <doc ...> markup lines produced by WikiExtractor
            if (line.startsWith("<"))
                continue;
            // segment the line, then tag each term with its nature (part of speech)
            List<Term> parse = ToAnalysis.parse(line);
            new NatureRecognition(parse).recognition();
            for (Term term : parse) {
                // getNatrue() is the spelling used by this version of the ansj API
                System.out.print(term.getName() + "/"
                        + term.getNatrue().natureStr + " ");
            }
            System.out.println();
        }
    }

}

in this case, we append the nature (part of speech) of each term to disambiguate terms that share the same surface form.
use this command to generate segmented text:

mvn exec:java -Dexec.mainClass="org.ansj.demo.SimpleIODemo" < ~/work/extracted_text.txt > ~/work/segmented_text.txt

training
use this command:
./word2vec -train ~/work/segmented_text.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 16 -binary 1
after a period of time, a binary file named vectors.bin will be generated.
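
if you want to consume vectors.bin from Java later (say, alongside ansj), here is a minimal sketch of a reader; it assumes the standard word2vec binary layout (an ASCII "vocabSize vectorSize" header line, then per entry the word, a space, raw float32 values and a newline), produced on a little-endian machine with UTF-8 words. this is just a convenience on my part, not part of the original toolchain:

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

public class VectorsBinReader {

    public static Map<String, float[]> load(String path) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream(path));
        try {
            int vocabSize = Integer.parseInt(readToken(in));
            int vectorSize = Integer.parseInt(readToken(in));
            Map<String, float[]> vectors = new HashMap<String, float[]>(vocabSize);
            byte[] raw = new byte[vectorSize * 4];
            for (int i = 0; i < vocabSize; i++) {
                String word = readToken(in);
                readFully(in, raw);
                float[] vec = new float[vectorSize];
                ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(vec);
                vectors.put(word, vec);
            }
            return vectors;
        } finally {
            in.close();
        }
    }

    // reads bytes up to the next space or newline, skipping leading whitespace
    private static String readToken(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b = in.read();
        while (b == ' ' || b == '\n') b = in.read();
        while (b != -1 && b != ' ' && b != '\n') {
            buf.write(b);
            b = in.read();
        }
        return new String(buf.toByteArray(), Charset.forName("UTF-8"));
    }

    private static void readFully(InputStream in, byte[] dst) throws IOException {
        int off = 0;
        while (off < dst.length) {
            int n = in.read(dst, off, dst.length - off);
            if (n < 0) throw new IOException("unexpected end of file");
            off += n;
        }
    }
}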

verifying
use this command to test trained vectors:
./distance vectors.bin
i generated vectors from the wiki dump alone and got this:
Enter word or sentence (EXIT to break): 人工智能/n

Word: 人工智能/n  Position in vocabulary: 18882

                                              Word       Cosine distance
------------------------------------------------------------------------
                                       计算机/n 0.758043
                                    认知科学/n 0.659870
                                    机器人学/n 0.636466
                                       运筹学/n 0.628714
                                       控制论/n 0.626604
                                      自动化/vn 0.612964
                                       博弈论/n 0.608870
                                          科学/n 0.595060
                                    系统工程/l 0.593820
                                    微电子学/n 0.592527
                                            nlp/en 0.590136
                                          仿真/v 0.589741
                                          领域/n 0.588424
                                       知识库/n 0.588246
                                       分布式/b 0.586032
                                       信息论/n 0.584697
                                 计量经济学/n 0.582200
                                       计量学/n 0.580011
                                         分析/vn 0.579240
                                       生物学/n 0.578400
                                    机器翻译/l 0.578206
                                       自动化/v 0.577689
                                         应用/vn 0.573138
                                          技术/n 0.571564
                                          数学/n 0.571543
                                         模拟/vn 0.570714
                                          人机/n 0.570010
                                          编程/v 0.569065
                                    空间科学/n 0.566234
                                       系统论/n 0.566088
                                    基础理论/l 0.564778
                                           abap/en 0.563862
                                       本体论/n 0.563624
                                       跨学科/b 0.560602
                                            cae/en 0.560012
                                            gis/en 0.559896
                                 分子生物学/n 0.559691
                                         仿真/vn 0.558837
                                       信息学/n 0.558737
                                 社会心理学/n 0.555530

Saturday, October 26, 2013

Solution: Java Runtime.exec hangs abnormally

background
i was debugging a J2EE project that has a method invoking an external program.
in most scenarios it works fine, but i spotted some hanging external processes after a period of running at full load.

problem
this is the code that caused the trouble:
private void execute(String command) throws IOException {
    Runtime runTime = Runtime.getRuntime();
    LOG.info("executing: " + command);
    String[] args = new String[] {
        "/bin/sh", "-c", command
    };
    Process proc = runTime.exec(args);
    try {
        // waitFor() blocks here while nothing ever reads the subprocess's stdout/stderr
        if (proc.waitFor() != 0) {
            throw new IOException("subprocess exited with non-zero code");
        }
    } catch (InterruptedException e) {
        throw new IOException("interrupted");
    }
}

analyzing
it turns out that every hanging process had produced a very long output on stdout.
It is documented here: http://docs.oracle.com/javase/6/docs/api/java/lang/Process.html
"Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock."
solution A
if the output can be ignored, just append "> /dev/null" to the command, and it should work fine.
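
for example, a minimal sketch reusing the execute() method above (the extra 2>&1 is my addition, to silence stderr as well):

// discard both stdout and stderr so the subprocess can never fill its pipe buffers
execute(command + " > /dev/null 2>&1");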

solution B
if the output is needed, start reading the subprocess's stdout and stderr instead of just waiting for the subprocess to finish.
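
here is a minimal sketch of solution B, keeping the shape of the execute() method above (consume() is a helper i am introducing just for illustration; it reuses LOG and needs the usual java.io imports):

private void executeAndRead(String command) throws IOException {
    String[] args = new String[] { "/bin/sh", "-c", command };
    Process proc = Runtime.getRuntime().exec(args);

    // drain stdout and stderr on their own threads so the subprocess
    // can never block on a full pipe buffer
    Thread out = consume(proc.getInputStream());
    Thread err = consume(proc.getErrorStream());

    try {
        int code = proc.waitFor();
        out.join();
        err.join();
        if (code != 0) {
            throw new IOException("subprocess exited with non-zero code: " + code);
        }
    } catch (InterruptedException e) {
        throw new IOException("interrupted");
    }
}

private Thread consume(final InputStream in) {
    Thread t = new Thread(new Runnable() {
        public void run() {
            try {
                BufferedReader r = new BufferedReader(new InputStreamReader(in));
                String line;
                while ((line = r.readLine()) != null) {
                    LOG.info(line); // keep, parse or discard the output as needed
                }
            } catch (IOException ignored) {
                // the stream closes when the process exits
            }
        }
    });
    t.setDaemon(true);
    t.start();
    return t;
}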

[For beginners] How to deploy J2EE website on your own server

background
you may want to deploy your own J2EE website so that people can access it.
this article will show you how to do it.

essentials
a running server (or VPS) that you can deploy tomcat or jetty on
a JDBC-compatible database (mysql, sql server, postgresql or oracle)
the WAR package of your website
(optional) a top-level domain

how to
I. deploy tomcat/jetty on your server
set up a jre:
[jre7] http://www.oracle.com/technetwork/java/javase/downloads/java-se-jre-7-download-432155.html

get tomcat/jetty package here:
[jetty] http://download.eclipse.org/jetty/stable-9/dist/
[tomcat] http://tomcat.apache.org/download-80.cgi

II. deploy webapp (WAR package)
put your WAR package in the webapps directory of jetty/tomcat.

III. get it running
tomcat:
sh tomcat/bin/catalina.sh start

jetty:
sh jetty/bin/jetty.sh start

it works.

implementation of dependency click model

background

sometimes we are just not satisfied with the ranking when implementing an IR system, even if it is based on a sophisticated ranking strategy.
therefore, user feedback is an important part of a ranking system.
implementing a click model is one way to achieve this.
this article is based on the paper http://research.microsoft.com/pubs/73115/multiple-click-model-wsdm09.pdf, presented by Microsoft Research.
it is called the Dependency Click Model (DCM) because it models how clicks depend on position.


hypothesis

considering a list of query results, we simply assume that the user examines them strictly in order.
the user clicks the results that seem relevant, may continue examining, and then stops somewhere.
DCM is built on what this basic hypothesis implies.
for details on how it works, please read: http://research.microsoft.com/pubs/73115/multiple-click-model-wsdm09.pdf

engineering implementation

to implement this feature, we decompose the model into 4 processes.

a) model data storage
we propose an array-of-counter-pairs structure: an array of counter pairs, each of which contains two counters, so it looks like this:
[(alpha, beta), (alpha, beta), (alpha, beta), (alpha, beta), ...]
the element index indicates the corresponding result position.

we need to store a global array-of-counter-pairs and one array-of-counter-pairs per keyword.
this could be done with a database or a simple binary file.
conceptually it is a hash table (key: keyword, value: array-of-counter-pairs), plus the global array.
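
a minimal sketch of this storage in Java (class and field names are my own, just to make the structure concrete; a real system would persist it as described above):

import java.util.HashMap;
import java.util.Map;

public class DcmStore {

    // one (alpha, beta) pair per result position
    public static class CounterPair {
        public long alpha;
        public long beta;
    }

    public static CounterPair[] newCounters(int positions) {
        CounterPair[] counters = new CounterPair[positions];
        for (int i = 0; i < positions; i++) {
            counters[i] = new CounterPair();
        }
        return counters;
    }

    // global array-of-counter-pairs, indexed by position (10 positions assumed here)
    public final CounterPair[] global = newCounters(10);

    // per-keyword array-of-counter-pairs
    public final Map<String, CounterPair[]> byKeyword = new HashMap<String, CounterPair[]>();
}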

b) data recording and collection
we need to record exactly which results were shown and which results the user clicked for each query with a given keyword.
therefore, for a particular query, we may have a record (session) like this:
USERID, KEYWORD, SHOWN RESULTS, CLICKED DOCS

c) log analysis
we need to periodically analyze the recorded data to update the counter pairs.
for the global array-of-counter-pairs, each alpha counts how many times the last click of a session occurred at the corresponding position, while beta counts how many clicks occurred at that position.
for a keyword's array-of-counter-pairs, each alpha counts how many clicks occurred at the corresponding position, while beta counts how many impressions occurred there (a result shown at or before the last clicked position counts as impressed, per the hypothesis).

first update the global counters for each recorded session, then update the counters for the corresponding keyword.
the relevance of an individual document is alpha / beta.
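
a minimal sketch of the per-session counter update following the counting rules above; it reuses the DcmStore/CounterPair names from the sketch in a), and clicked[] marks which shown positions were clicked:

public static void updateSession(DcmStore store, String keyword, boolean[] clicked) {
    // find the position of the last click in this session
    int lastClicked = -1;
    for (int i = 0; i < clicked.length; i++) {
        if (clicked[i]) lastClicked = i;
    }
    if (lastClicked < 0) return; // sessions without clicks are ignored in this sketch

    // global counters: alpha = last click at this position, beta = any click at this position
    for (int i = 0; i < clicked.length && i < store.global.length; i++) {
        if (i == lastClicked) store.global[i].alpha++;
        if (clicked[i]) store.global[i].beta++;
    }

    // keyword counters: alpha = clicks at this position, beta = impressions at this position
    DcmStore.CounterPair[] counters = store.byKeyword.get(keyword);
    if (counters == null) {
        counters = DcmStore.newCounters(clicked.length);
        store.byKeyword.put(keyword, counters);
    }
    for (int i = 0; i < clicked.length && i < counters.length; i++) {
        if (clicked[i]) counters[i].alpha++;
        if (i <= lastClicked) counters[i].beta++; // shown at or before the last click => impressed
    }
}
// relevance of a document at position i for this keyword: counters[i].alpha / (double) counters[i].beta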

d) rank adjustment
we can simply re-sort the top-N documents by the score:
score = 0.3 * relevance + 0.7 * positionScore,
where positionScore is a constant corresponding to the document's original position in the list.
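
a minimal sketch of this re-ranking step (the Doc class and the positionScore array are assumptions of mine; topN is expected in its original order so positionScore[i] matches the original position):

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class Reranker {

    public static class Doc {
        public String id;
        public double relevance;  // alpha / beta from the click model
        public double score;
    }

    public static void rerankTopN(List<Doc> topN, double[] positionScore) {
        for (int i = 0; i < topN.size(); i++) {
            Doc d = topN.get(i);
            d.score = 0.3 * d.relevance + 0.7 * positionScore[i];
        }
        Collections.sort(topN, new Comparator<Doc>() {
            public int compare(Doc a, Doc b) {
                return Double.compare(b.score, a.score); // higher score first
            }
        });
    }
}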


Friday, June 7, 2013

a small trick: automatically sign in SRUN3000 with OpenWRT

I. Background

Tianjin University has just deployed the SRUN3000 authentication system, which blocks internet access until you sign in.

It is very inconvenient to sign in every single time, so i decided to do something about it.

In this case, I have a router with OpenWRT running.

(btw: most tp-link wireless routers can be flashed with openwrt firmware if you are willing to try this out)


II. Solution

OpenWRT provides the hotplug2 mechanism, which triggers scripts when an interface's state changes.

Save this script to /etc/hotplug.d/iface/30-srun:

#!/bin/sh
[ "$ACTION" = ifup ] || exit 0
[ "$INTERFACE" = wan ] || exit 0

# delay for 2 seconds
sleep 2

# use nc to send the http post request
nc g.tju.edu.cn 80 < /tmp/post_login

Then chmod +x it.

Next, save this request dump to /tmp/post_login:

POST /cgi-bin/do_login HTTP/1.1
Host: g.tju.edu.cn
Content-Type: application/x-www-form-urlencoded
Content-Length: 77

username=ID&password=PASSWORD&drop=0&type=1&n=100&force=true

replace ID with your account and PASSWORD with Substring(8, 16) of the md5 hash of your password, e.g. for 123456 that is 49ba59abbe56e057f.

And 77 should be the actual length of the parameter line (username=ID&password=PASSWORD&drop=0&type=1&n=100&force=true) after you fill in your own values.


--- EOF ---

Thursday, June 6, 2013

a simple & stupid probabilistic correlation analysis method

I. Problem definition

assume you are in charge of a private tracker; recommendations could come in handy, especially when a visitor wants to try out something new.

so i am going to work out a way to achieve this.


II. Theory

Consider resources A and B. N(A) represents the number of users who downloaded resource A.
Presuming we have N(A ∪ B) users in total, P(A) = N(A) / N(A ∪ B).
Similarly, P(B) = N(B) / N(A ∪ B).

Suppose we want to present a list of recommendations based on resource A. We have two obvious ways to do this:

a) For every other resource B, use P(B | A) as rank, with P(B | A) = P(AB) / P(A). The result is N(A ∩ B) / N(A).

b) For every other resource B, use P(AB) as rank. The result is N(A ∩ B) / N(A ∪ B).

The first idea considers what percentage of the users who downloaded A also downloaded B.

The second idea considers what percentage of the users who downloaded A or B downloaded both.

These are NOT THE SAME.

Consider N(A) = 50, N(B) = 40, N(A ∪ B) = 80, N(A ∩ B) = 10.
N(C) = 300, N(A ∪ C) = 320, N(A ∩ C) = 30.

Doing the math, idea a) gives the following ranks:
B: 0.20   C: 0.60

while idea b) gives:
B: 0.13  C: 0.09

The difference is obvious.

In fact, resource C is a hot resource that almost every user downloaded.
So C gets a high rank under idea a). But idea b) also takes into account how many of the users who downloaded C also downloaded A, so the ratio stays reasonable.

After a series of tests, it turned out that hot resources often got a high rank against whatever resource was queried with the first method, while the second method performed well because it considers a two-way linkage.


III. Coding phase

I wrote a tiny simple program to get this job done.

You can download it here: 


The idea is: for every resource pair <A, B>, calculate N(A ∩ B) / N(A ∪ B) * 100 as the correlation ratio of <A, B> as well as <B, A>.

When choosing the recommendation list, sort the candidate resources by this ratio and pick the top N items.
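
a minimal sketch of the pairwise calculation (representing each resource's downloaders as a set of user ids is an assumption of mine):

import java.util.HashSet;
import java.util.Set;

public class Correlation {

    // usersA / usersB: ids of the users who downloaded resource A / B
    public static double ratio(Set<Long> usersA, Set<Long> usersB) {
        Set<Long> both = new HashSet<Long>(usersA);
        both.retainAll(usersB);                                   // N(A ∩ B)
        int union = usersA.size() + usersB.size() - both.size();  // N(A ∪ B)
        if (union == 0) return 0.0;
        return 100.0 * both.size() / union;                       // same value for <A, B> and <B, A>
    }
}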


IV. Next step

This is just a tiny experiment done all by myself, because I am curious about how well and how fast these methods perform, and what problems I will encounter next.

Right now the time complexity of the algorithm is O(R^2 * N), where R is the number of resources and N is the average number of users who downloaded a resource.

It takes a lot of time to finish the calculation.

Apparently there are much better ways to improve this.

access the Internet through blocks in cernet

I. What is the problem

in this case, i am going to solve two problems at once:

a) as far as i know, overseas access via CERNET averages about 8kb/s of bandwidth, which drives you crazy

b) the GFW blocks almost every popular website around the world


II. The idea

it turns out that:

a) accessing overseas resources via a proxy set up in the nearest city is usually faster

b) GFW will not block secure connections

so, supposing i am in Tianjin, i am gonna use two servers (A set up in Beijing and B set up in the US) plus a maintained website list to solve the problem.

when:

a) for mainland websites, just go directly

b) for overseas non-blocked websites, use A as a proxy, which makes them faster

c) for overseas blocked websites, first reach B via proxy A, then access the target website via B, which gets through the blocks at an impressive speed.


III. Tools

use proxifier (http://www.proxifier.com/) to redirect connections made on localhost.

use plink (http://the.earth.li/~sgtatham/putty/latest/x86/plink.exe) to set up a secure tunnel from localhost to server A.

use autossh (http://www.harding.motd.ca/autossh/) to maintain a persistent secure tunnel from server A to server B; otherwise we would not be able to get through the blocks reliably.


IV. How to

a) configure ssh keys for A and B so that A can access B without typing a password

b) add the following command to a startup script / daemon on server A:
autossh -f -M 5678 -CfNg -D portAB serverB

c) start a plink process on localhost:
plink.exe serverA -N -ssh -2 -D portA

d) create proxy servers in proxifier: 127.0.0.1:portA and 127.0.0.1:portAB

e) combine these two proxies into a proxy chain named Tunnel-US

f) set up rules:
in this case i redirect all connections to US & CA websites to 127.0.0.1:portA,
and all connections to blocked websites to the chain Tunnel-US

g) enjoy twitter


V. Next step

so far i have to build these rules manually, since there is no list that tells which websites are overseas or which are blocked.

there is a page that provides monitored domains that are blocked: https://en.greatfire.org/search/domains

that would make it possible to fill in the rules automatically.

--- EOF ---