Sunday, October 27, 2013

generating binary file for simplified chinese via word2vec

background
word2vec can come in handy in NLP.
here is one way to generate word vectors for simplified chinese.

training data
a) wiki dump: http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
(reference: http://licstar.net/archives/262)
use Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to extract plain text from the dump with the following command:

bzcat zhwiki-latest-pages-articles.xml.bz2 | python WikiExtractor.py -b1000M -o extracted >output.txt

b) socialysis: ftp://ftp.socialysis.org
you can find plenty of raw text there.

segmentation
before training, we need to segment this raw text into terms.
in this case, i am using ansj for segmentation.
i wrote a demo class that turns ansj into a command line tool:
package org.ansj.demo;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;

import org.ansj.domain.Term;
import org.ansj.recognition.NatureRecognition;
import org.ansj.splitWord.analysis.ToAnalysis;

public class SimpleIODemo {

    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String line = null;
        while ((line = br.readLine()) != null) {
            // skip the <doc ...> / </doc> markers emitted by Wikipedia Extractor
            if (line.startsWith("<"))
                continue;
            // segment the line into terms, then tag each term with its nature
            List<Term> parse = ToAnalysis.parse(line);
            new NatureRecognition(parse).recognition();
            for (Term term : parse) {
                // getNatrue() is the actual (misspelled) method name in the ansj API
                System.out.print(term.getName() + "/"
                        + term.getNatrue().natureStr + " ");
            }
            System.out.println();
        }
    }

}

in this case, we append the nature (part-of-speech tag) of each term to disambiguate terms that share the same surface form.
use this command to generate the segmented text:

mvn exec:java -Dexec.mainClass="org.ansj.demo.SimpleIODemo" < ~/work/extracted_text.txt > ~/work/segmented_text.txt

training
use this command (note that -binary takes a value; -binary 1 selects the binary output format):
./word2vec -train ~/work/segmented_text.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 16 -binary 1
after a while, a binary file named vectors.bin will be generated.

verifying
use this command to test the trained vectors:
./distance vectors.bin
i trained vectors on the wiki dump alone and got this:
Enter word or sentence (EXIT to break): 人工智能/n

Word: 人工智能/n  Position in vocabulary: 18882

                                              Word       Cosine distance
------------------------------------------------------------------------
                                       计算机/n 0.758043
                                    认知科学/n 0.659870
                                    机器人学/n 0.636466
                                       运筹学/n 0.628714
                                       控制论/n 0.626604
                                      自动化/vn 0.612964
                                       博弈论/n 0.608870
                                          科学/n 0.595060
                                    系统工程/l 0.593820
                                    微电子学/n 0.592527
                                            nlp/en 0.590136
                                          仿真/v 0.589741
                                          领域/n 0.588424
                                       知识库/n 0.588246
                                       分布式/b 0.586032
                                       信息论/n 0.584697
                                 计量经济学/n 0.582200
                                       计量学/n 0.580011
                                         分析/vn 0.579240
                                       生物学/n 0.578400
                                    机器翻译/l 0.578206
                                       自动化/v 0.577689
                                         应用/vn 0.573138
                                          技术/n 0.571564
                                          数学/n 0.571543
                                         模拟/vn 0.570714
                                          人机/n 0.570010
                                          编程/v 0.569065
                                    空间科学/n 0.566234
                                       系统论/n 0.566088
                                    基础理论/l 0.564778
                                           abap/en 0.563862
                                       本体论/n 0.563624
                                       跨学科/b 0.560602
                                            cae/en 0.560012
                                            gis/en 0.559896
                                 分子生物学/n 0.559691
                                         仿真/vn 0.558837
                                       信息学/n 0.558737
                                 社会心理学/n 0.555530
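since the rest of this pipeline is Java, here is a minimal sketch of a loader for vectors.bin (the class name is mine). with -binary 1, word2vec writes a text header "vocab_size dim", then for each word: the word terminated by a space, dim raw floats (little-endian on x86), and a trailing newline. cosine() reproduces the "Cosine distance" column above:

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;

public class VectorsBinLoader {

    public static Map<String, float[]> load(String path) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)));
        try {
            // header: "<vocab_size> <dim>\n"
            String[] header = readToken(in, '\n').split("\\s+");
            int vocabSize = Integer.parseInt(header[0]);
            int dim = Integer.parseInt(header[1]);
            Map<String, float[]> vectors = new HashMap<String, float[]>(vocabSize * 2);
            byte[] raw = new byte[dim * 4];
            for (int i = 0; i < vocabSize; i++) {
                // each entry: word terminated by ' ', then dim raw floats, then '\n'
                String word = readToken(in, ' ');
                in.readFully(raw);
                float[] vec = new float[dim];
                ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN)
                        .asFloatBuffer().get(vec);
                vectors.put(word, vec);
            }
            return vectors;
        } finally {
            in.close();
        }
    }

    // reads bytes until the terminator, decoding as UTF-8;
    // also swallows the '\n' left over from the previous entry
    private static String readToken(DataInputStream in, char terminator) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != terminator) {
            if (b != '\n') buf.write(b);
        }
        return new String(buf.toByteArray(), "UTF-8");
    }

    // cosine similarity between two vectors, as printed by ./distance
    public static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}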

Saturday, October 26, 2013

Solution: Java Runtime.exec hangs abnormally

background
i was debugging a J2EE project with a method that invokes an external program.
in most scenarios it works fine, but i spotted some hanging external processes after a period of running at full load.

problem
this is the code where things went wrong:
private void execute(String command) throws IOException {
    Runtime runTime = Runtime.getRuntime();
    LOG.info("executing: " + command);
    String[] args = new String[] {
        "/bin/sh", "-c", command
    };
    Process proc = runTime.exec(args);
    try {
        // waits for the subprocess to exit; note that nothing ever
        // reads the subprocess's stdout or stderr
        if (proc.waitFor() != 0) {
            throw new IOException("subprocess exited with non-zero code");
        }
    } catch (InterruptedException e) {
        throw new IOException("interrupted");
    }
}

analyzing
it turns out that every hanging process had produced very long output on stdout.
the behavior is documented here: http://docs.oracle.com/javase/6/docs/api/java/lang/Process.html
"Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock."
solution A
if the output can be ignored, just append "> /dev/null" to the command (and "2>&1" as well if the command also writes a lot to stderr), and it should work fine.
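for example, with the execute() method above (mycommand is a placeholder):

execute("mycommand > /dev/null 2>&1");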

solution B
if the output is needed, read the subprocess's stdout and stderr (proc.getInputStream() and proc.getErrorStream()) instead of just waiting for it to finish.
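a minimal sketch of solution B (method names are mine, assuming the same class as execute() above); the key is to start consuming both streams before blocking in waitFor():

import java.io.IOException;
import java.io.InputStream;

private void executeAndDrain(String command) throws IOException {
    String[] args = new String[] { "/bin/sh", "-c", command };
    Process proc = Runtime.getRuntime().exec(args);
    // consume both streams so the subprocess can never fill a pipe buffer and block
    Thread out = drain(proc.getInputStream());  // the subprocess's stdout
    Thread err = drain(proc.getErrorStream());  // the subprocess's stderr
    try {
        int exitCode = proc.waitFor();
        out.join();
        err.join();
        if (exitCode != 0) {
            throw new IOException("subprocess exited with non-zero code");
        }
    } catch (InterruptedException e) {
        throw new IOException("interrupted");
    }
}

private Thread drain(final InputStream in) {
    Thread t = new Thread(new Runnable() {
        public void run() {
            byte[] buf = new byte[4096];
            try {
                // discard everything; collect into a buffer here if the output is needed
                while (in.read(buf) != -1) { }
            } catch (IOException ignored) {
            }
        }
    });
    t.start();
    return t;
}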

[For beginners] How to deploy J2EE website on your own server

background
you may want to deploy your own J2EE website so that people can access it.
this article shows you how to do it.

essentials
a running server (or VPS) that you can deploy tomcat or jetty on
a JDBC-compatible database (mysql, sql server, postgresql or oracle)
WAR package of your website
(optional) a top-level domain

how to
I. deploy tomcat/jetty on your server
setup jre:
[jre7] http://www.oracle.com/technetwork/java/javase/downloads/java-se-jre-7-download-432155.html

get tomcat/jetty package here:
[jetty] http://download.eclipse.org/jetty/stable-9/dist/
[tomcat] http://tomcat.apache.org/download-80.cgi

II. deploy webapp (WAR package)
put your WAR package in the webapps directory of jetty/tomcat.
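for example, if your package were named mysite.war (a hypothetical name):

cp ~/mysite.war tomcat/webapps/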

III. get it running
tomcat:
sh tomcat/bin/catalina.sh start

jetty:
sh jetty/bin/jetty.sh start

it works: by default the container listens on port 8080, so your webapp should be reachable at http://yourserver:8080/<war-name>/.

implementation of dependency click model

background

sometimes we are just not satisfied with the ranking produced by an IR system, even when it is based on a sophisticated ranking strategy.
user feedback is therefore an important part of a ranking system, and implementing a click model is one way to incorporate it.
this article is based on the paper http://research.microsoft.com/pubs/73115/multiple-click-model-wsdm09.pdf from Microsoft Research.
it is called the Dependency Click Model (DCM) because it models how clicks depend on position.


hypothesis

consider a list of query results: we simply assume that the user examines them strictly in order, clicks the results that seem relevant, possibly continues examining, and then stops somewhere.
DCM is built on what this basic hypothesis implies.
for details on how it works, please read: http://research.microsoft.com/pubs/73115/multiple-click-model-wsdm09.pdf

engineering implementation

to implement such a feature, we decompose the model into 4 processes.

a) model data storage
we propose an array-of-counter-pairs structure: an array whose elements are counter-pairs, each holding two counters, so it looks like this:
[(alpha, beta), (alpha, beta), (alpha, beta), (alpha, beta), ...]
the element index indicates the corresponding result position.

we need to store one global array-of-counter-pairs plus one array-of-counter-pairs per keyword.
this could be done with a database or a simple binary file.
either way, it boils down to a hash table (key: keyword, value: array-of-counter-pairs) alongside the global array, as sketched below.
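a minimal in-memory sketch of this storage (class and field names are mine; a real system would persist it as described above):

import java.util.HashMap;
import java.util.Map;

public class ClickModelStore {

    public static class CounterPair {
        long alpha;
        long beta;
    }

    // assume we only model the top 10 result positions
    static final int MAX_POSITION = 10;

    // one counter-pair per result position, shared by all keywords
    final CounterPair[] global = newPairs();

    // hash table: keyword -> array-of-counter-pairs
    final Map<String, CounterPair[]> perKeyword = new HashMap<String, CounterPair[]>();

    static CounterPair[] newPairs() {
        CounterPair[] pairs = new CounterPair[MAX_POSITION];
        for (int i = 0; i < MAX_POSITION; i++) pairs[i] = new CounterPair();
        return pairs;
    }

    CounterPair[] pairsFor(String keyword) {
        CounterPair[] pairs = perKeyword.get(keyword);
        if (pairs == null) {
            pairs = newPairs();
            perKeyword.put(keyword, pairs);
        }
        return pairs;
    }
}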

b) data recording and collection
we need to record exactly which results the user clicked for a query with a given keyword, and which results were shown.
therefore, for a particular query, we may have a record like this (call it a session):
USERID, KEYWORD, SHOWN RESULTS, CLICKED DOCS
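a minimal sketch of such a record as a Java class (names are mine):

import java.util.List;
import java.util.Set;

public class SessionRecord {
    String userId;
    String keyword;
    List<String> shownResults; // doc ids, in displayed order
    Set<String> clickedDocs;   // the subset of shownResults that was clicked
}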

c) log analysis
we need to periodically analyze the recorded data to update the counter-pairs.
for the global array-of-counter-pairs, each alpha counts how many sessions had their last click at the corresponding position, while each beta counts how many clicks occurred at that position.
for a keyword's array-of-counter-pairs, each alpha counts how many clicks occurred at the corresponding position, while each beta counts how many impressions occurred there (under the hypothesis, a result shown at or before the last clicked position counts as impressed).

first update the global counters for each recorded session, then update the counters of the session's keyword.
the relevance of an individual document is then alpha / beta.
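a minimal sketch of the per-session update, reusing the ClickModelStore and SessionRecord sketches above and following the counting rules just described:

void updateCounters(ClickModelStore store, SessionRecord session) {
    // find the position of the last click (0-based)
    int lastClicked = -1;
    int n = Math.min(session.shownResults.size(), ClickModelStore.MAX_POSITION);
    for (int pos = 0; pos < n; pos++) {
        if (session.clickedDocs.contains(session.shownResults.get(pos))) {
            lastClicked = pos;
        }
    }
    if (lastClicked < 0) {
        return; // no clicks: nothing to count in this scheme
    }
    // first the global counters...
    ClickModelStore.CounterPair[] global = store.global;
    for (int pos = 0; pos <= lastClicked; pos++) {
        if (session.clickedDocs.contains(session.shownResults.get(pos))) {
            global[pos].beta++;       // a click at this position
        }
    }
    global[lastClicked].alpha++;      // the session's last click
    // ...then the keyword's counters
    ClickModelStore.CounterPair[] kw = store.pairsFor(session.keyword);
    for (int pos = 0; pos <= lastClicked; pos++) {
        if (session.clickedDocs.contains(session.shownResults.get(pos))) {
            kw[pos].alpha++;          // a click at this position
        }
        kw[pos].beta++;               // impressed: shown at or before the last click
    }
}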

d) rank adjustment
we can simply re-sort the top-N documents by the combined score:
score = 0.3 * relevance + 0.7 * positionScore,
where positionScore is a constant determined by the document's original position in the list.
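a minimal sketch of the re-sort (RankedDoc and the positionScore table are mine, with illustrative constants):

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class RankAdjuster {
    // constant score for each original position (assumes top 10)
    static final double[] POSITION_SCORE = { 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1 };

    public static class RankedDoc {
        String docId;
        int originalPosition; // 0-based position from the base ranker
        double relevance;     // alpha / beta from the click model
    }

    // sort top-N documents by 0.3 * relevance + 0.7 * positionScore, descending
    static void adjust(List<RankedDoc> topN) {
        Collections.sort(topN, new Comparator<RankedDoc>() {
            public int compare(RankedDoc a, RankedDoc b) {
                return Double.compare(score(b), score(a));
            }
        });
    }

    static double score(RankedDoc d) {
        return 0.3 * d.relevance + 0.7 * POSITION_SCORE[d.originalPosition];
    }
}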