Sunday, October 27, 2013

generating binary file for simplified chinese via word2vec

background
word2vec can come in handy in NLP.
here is one way to generate word vectors for simplified chinese.

training data
a) wiki dump: http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
(reference: http://licstar.net/archives/262)
use Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to extract plain text from the dump with the following command:

bzcat zhwiki-latest-pages-articles.xml.bz2 | python WikiExtractor.py -b1000M -o extracted >output.txt

b) socialysis: ftp://ftp.socialysis.org
you can find plenty of raw text there.

segmentation
before training, we need to segment this raw text into terms.
in this case, i am using ansj for segmentation.
i wrote a demo class that turns ansj into a command line tool:
package org.ansj.demo;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;

import org.ansj.domain.Term;
import org.ansj.recognition.NatureRecognition;
import org.ansj.splitWord.analysis.ToAnalysis;

public class SimpleIODemo {

    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String line = null;
        while ((line = br.readLine()) != null) {
            // skip the <doc ...> / </doc> markers emitted by Wikipedia Extractor
            if (line.startsWith("<"))
                continue;
            // segment the line into terms, then tag each term with its nature
            List<Term> parse = ToAnalysis.parse(line);
            new NatureRecognition(parse).recognition();
            for (Term term : parse) {
                // getNatrue() is the actual (misspelled) method name in the ansj API
                System.out.print(term.getName() + "/"
                        + term.getNatrue().natureStr + " ");
            }
            System.out.println();
        }
    }

}

in this case, we append the nature (part-of-speech tag) of each term to disambiguate terms that share the same surface form.
use this command to generate the segmented text:

mvn exec:java -Dexec.mainClass="org.ansj.demo.SimpleIODemo" < ~/work/extracted_text.txt > ~/work/segmented_text.txt

training
use this command (note that -binary takes a value; -binary 1 selects the binary output format):
./word2vec -train ~/work/segmented_text.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 16 -binary 1
after a while, a binary file named vectors.bin will be generated.

verifying
use this command to test the trained vectors:
./distance vectors.bin
i trained vectors on the wiki dump alone and got this:
Enter word or sentence (EXIT to break): 人工智能/n

Word: 人工智能/n  Position in vocabulary: 18882

                                              Word       Cosine distance
------------------------------------------------------------------------
                                       计算机/n 0.758043
                                    认知科学/n 0.659870
                                    机器人学/n 0.636466
                                       运筹学/n 0.628714
                                       控制论/n 0.626604
                                      自动化/vn 0.612964
                                       博弈论/n 0.608870
                                          科学/n 0.595060
                                    系统工程/l 0.593820
                                    微电子学/n 0.592527
                                            nlp/en 0.590136
                                          仿真/v 0.589741
                                          领域/n 0.588424
                                       知识库/n 0.588246
                                       分布式/b 0.586032
                                       信息论/n 0.584697
                                 计量经济学/n 0.582200
                                       计量学/n 0.580011
                                         分析/vn 0.579240
                                       生物学/n 0.578400
                                    机器翻译/l 0.578206
                                       自动化/v 0.577689
                                         应用/vn 0.573138
                                          技术/n 0.571564
                                          数学/n 0.571543
                                         模拟/vn 0.570714
                                          人机/n 0.570010
                                          编程/v 0.569065
                                    空间科学/n 0.566234
                                       系统论/n 0.566088
                                    基础理论/l 0.564778
                                           abap/en 0.563862
                                       本体论/n 0.563624
                                       跨学科/b 0.560602
                                            cae/en 0.560012
                                            gis/en 0.559896
                                 分子生物学/n 0.559691
                                         仿真/vn 0.558837
                                       信息学/n 0.558737
                                 社会心理学/n 0.555530
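since the rest of this pipeline is Java, here is a minimal sketch of a loader for vectors.bin (the class name is mine). with -binary 1, word2vec writes a text header "vocab_size dim", then for each word: the word terminated by a space, dim raw floats (little-endian on x86), and a trailing newline. cosine() reproduces the "Cosine distance" column above:

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;

public class VectorsBinLoader {

    public static Map<String, float[]> load(String path) throws IOException {
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)));
        try {
            // header: "<vocab_size> <dim>\n"
            String[] header = readToken(in, '\n').split("\\s+");
            int vocabSize = Integer.parseInt(header[0]);
            int dim = Integer.parseInt(header[1]);
            Map<String, float[]> vectors = new HashMap<String, float[]>(vocabSize * 2);
            byte[] raw = new byte[dim * 4];
            for (int i = 0; i < vocabSize; i++) {
                // each entry: word terminated by ' ', then dim raw floats, then '\n'
                String word = readToken(in, ' ');
                in.readFully(raw);
                float[] vec = new float[dim];
                ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN)
                        .asFloatBuffer().get(vec);
                vectors.put(word, vec);
            }
            return vectors;
        } finally {
            in.close();
        }
    }

    // reads bytes until the terminator, decoding as UTF-8;
    // also swallows the '\n' left over from the previous entry
    private static String readToken(DataInputStream in, char terminator) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1 && b != terminator) {
            if (b != '\n') buf.write(b);
        }
        return new String(buf.toByteArray(), "UTF-8");
    }

    // cosine similarity between two vectors, as printed by ./distance
    public static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}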

Saturday, October 26, 2013

Solution: Java Runtime.exec hangs abnormally

background
i was debugging a J2EE project with a method that invokes an external program.
in most scenarios it works fine, but i spotted some hanging external processes after a period of running at full load.

problem
this is the code where things went wrong:
private void execute(String command) throws IOException {
    Runtime runTime = Runtime.getRuntime();
    LOG.info("executing: " + command);
    String[] args = new String[] {
        "/bin/sh", "-c", command
    };
    Process proc = runTime.exec(args);
    try {
        // waits for the subprocess to exit; note that nothing ever
        // reads the subprocess's stdout or stderr
        if (proc.waitFor() != 0) {
            throw new IOException("subprocess exited with non-zero code");
        }
    } catch (InterruptedException e) {
        throw new IOException("interrupted");
    }
}

analyzing
it turns out that every hanging process had produced very long output on stdout.
the behavior is documented here: http://docs.oracle.com/javase/6/docs/api/java/lang/Process.html
"Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock."
solution A
if the output can be ignored, just append "> /dev/null" to the command (and "2>&1" as well if the command also writes a lot to stderr), and it should work fine.
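for example, with the execute() method above (mycommand is a placeholder):

execute("mycommand > /dev/null 2>&1");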

solution B
if the output is needed, read the subprocess's stdout and stderr (proc.getInputStream() and proc.getErrorStream()) instead of just waiting for it to finish.
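a minimal sketch of solution B (method names are mine, assuming the same class as execute() above); the key is to start consuming both streams before blocking in waitFor():

import java.io.IOException;
import java.io.InputStream;

private void executeAndDrain(String command) throws IOException {
    String[] args = new String[] { "/bin/sh", "-c", command };
    Process proc = Runtime.getRuntime().exec(args);
    // consume both streams so the subprocess can never fill a pipe buffer and block
    Thread out = drain(proc.getInputStream());  // the subprocess's stdout
    Thread err = drain(proc.getErrorStream());  // the subprocess's stderr
    try {
        int exitCode = proc.waitFor();
        out.join();
        err.join();
        if (exitCode != 0) {
            throw new IOException("subprocess exited with non-zero code");
        }
    } catch (InterruptedException e) {
        throw new IOException("interrupted");
    }
}

private Thread drain(final InputStream in) {
    Thread t = new Thread(new Runnable() {
        public void run() {
            byte[] buf = new byte[4096];
            try {
                // discard everything; collect into a buffer here if the output is needed
                while (in.read(buf) != -1) { }
            } catch (IOException ignored) {
            }
        }
    });
    t.start();
    return t;
}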

[For beginners] How to deploy J2EE website on your own server

background
you may want to deploy your own J2EE website so that people can access it.
this article shows you how to do it.

essentials
a running server (or VPS) that you can deploy tomcat or jetty on
a JDBC-compatible database (mysql, sql server, postgresql or oracle)
WAR package of your website
(optional) a top-level domain

how to
I. deploy tomcat/jetty on your server
setup jre:
[jre7] http://www.oracle.com/technetwork/java/javase/downloads/java-se-jre-7-download-432155.html

get tomcat/jetty package here:
[jetty] http://download.eclipse.org/jetty/stable-9/dist/
[tomcat] http://tomcat.apache.org/download-80.cgi

II. deploy webapp (WAR package)
put your WAR package in the webapps directory of jetty/tomcat.
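for example, if your package were named mysite.war (a hypothetical name):

cp ~/mysite.war tomcat/webapps/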

III. get it running
tomcat:
sh tomcat/bin/catalina.sh start

jetty:
sh jetty/bin/jetty.sh start

it works: by default the container listens on port 8080, so your webapp should be reachable at http://yourserver:8080/<war-name>/.

implementation of dependency click model

background

sometimes we are just not satisfied with the ranking produced by an IR system, even when it is based on a sophisticated ranking strategy.
user feedback is therefore an important part of a ranking system, and implementing a click model is one way to incorporate it.
this article is based on the paper http://research.microsoft.com/pubs/73115/multiple-click-model-wsdm09.pdf from Microsoft Research.
it is called the Dependency Click Model (DCM) because it models how clicks depend on position.


hypothesis

consider a list of query results: we simply assume that the user examines them strictly in order, clicks the results that seem relevant, possibly continues examining, and then stops somewhere.
DCM is built on what this basic hypothesis implies.
for details on how it works, please read: http://research.microsoft.com/pubs/73115/multiple-click-model-wsdm09.pdf

engineering implementation

to implement such a feature, we decompose the model into 4 processes.

a) model data storage
we propose an array-of-counter-pairs structure: an array whose elements are counter-pairs, each holding two counters, so it looks like this:
[(alpha, beta), (alpha, beta), (alpha, beta), (alpha, beta), ...]
the element index indicates the corresponding result position.

we need to store one global array-of-counter-pairs plus one array-of-counter-pairs per keyword.
this could be done with a database or a simple binary file.
either way, it boils down to a hash table (key: keyword, value: array-of-counter-pairs) alongside the global array, as sketched below.
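a minimal in-memory sketch of this storage (class and field names are mine; a real system would persist it as described above):

import java.util.HashMap;
import java.util.Map;

public class ClickModelStore {

    public static class CounterPair {
        long alpha;
        long beta;
    }

    // assume we only model the top 10 result positions
    static final int MAX_POSITION = 10;

    // one counter-pair per result position, shared by all keywords
    final CounterPair[] global = newPairs();

    // hash table: keyword -> array-of-counter-pairs
    final Map<String, CounterPair[]> perKeyword = new HashMap<String, CounterPair[]>();

    static CounterPair[] newPairs() {
        CounterPair[] pairs = new CounterPair[MAX_POSITION];
        for (int i = 0; i < MAX_POSITION; i++) pairs[i] = new CounterPair();
        return pairs;
    }

    CounterPair[] pairsFor(String keyword) {
        CounterPair[] pairs = perKeyword.get(keyword);
        if (pairs == null) {
            pairs = newPairs();
            perKeyword.put(keyword, pairs);
        }
        return pairs;
    }
}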

b) data recording and collection
we need to record exactly which results the user clicked for a query with a given keyword, and which results were shown.
therefore, for a particular query, we may have a record like this (call it a session):
USERID, KEYWORD, SHOWN RESULTS, CLICKED DOCS
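a minimal sketch of such a record as a Java class (names are mine):

import java.util.List;
import java.util.Set;

public class SessionRecord {
    String userId;
    String keyword;
    List<String> shownResults; // doc ids, in displayed order
    Set<String> clickedDocs;   // the subset of shownResults that was clicked
}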

c) log analysis
we need to periodically analyze the recorded data to update the counter-pairs.
for the global array-of-counter-pairs, each alpha counts how many sessions had their last click at the corresponding position, while each beta counts how many clicks occurred at that position.
for a keyword's array-of-counter-pairs, each alpha counts how many clicks occurred at the corresponding position, while each beta counts how many impressions occurred there (under the hypothesis, a result shown at or before the last clicked position counts as impressed).

first update the global counters for each recorded session, then update the counters of the session's keyword.
the relevance of an individual document is then alpha / beta.
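a minimal sketch of the per-session update, reusing the ClickModelStore and SessionRecord sketches above and following the counting rules just described:

void updateCounters(ClickModelStore store, SessionRecord session) {
    // find the position of the last click (0-based)
    int lastClicked = -1;
    int n = Math.min(session.shownResults.size(), ClickModelStore.MAX_POSITION);
    for (int pos = 0; pos < n; pos++) {
        if (session.clickedDocs.contains(session.shownResults.get(pos))) {
            lastClicked = pos;
        }
    }
    if (lastClicked < 0) {
        return; // no clicks: nothing to count in this scheme
    }
    // first the global counters...
    ClickModelStore.CounterPair[] global = store.global;
    for (int pos = 0; pos <= lastClicked; pos++) {
        if (session.clickedDocs.contains(session.shownResults.get(pos))) {
            global[pos].beta++;       // a click at this position
        }
    }
    global[lastClicked].alpha++;      // the session's last click
    // ...then the keyword's counters
    ClickModelStore.CounterPair[] kw = store.pairsFor(session.keyword);
    for (int pos = 0; pos <= lastClicked; pos++) {
        if (session.clickedDocs.contains(session.shownResults.get(pos))) {
            kw[pos].alpha++;          // a click at this position
        }
        kw[pos].beta++;               // impressed: shown at or before the last click
    }
}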

d) rank adjustment
we can simply re-sort the top-N documents by the combined score:
score = 0.3 * relevance + 0.7 * positionScore,
where positionScore is a constant determined by the document's original position in the list.
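a minimal sketch of the re-sort (RankedDoc and the positionScore table are mine, with illustrative constants):

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class RankAdjuster {
    // constant score for each original position (assumes top 10)
    static final double[] POSITION_SCORE = { 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1 };

    public static class RankedDoc {
        String docId;
        int originalPosition; // 0-based position from the base ranker
        double relevance;     // alpha / beta from the click model
    }

    // sort top-N documents by 0.3 * relevance + 0.7 * positionScore, descending
    static void adjust(List<RankedDoc> topN) {
        Collections.sort(topN, new Comparator<RankedDoc>() {
            public int compare(RankedDoc a, RankedDoc b) {
                return Double.compare(score(b), score(a));
            }
        });
    }

    static double score(RankedDoc d) {
        return 0.3 * d.relevance + 0.7 * POSITION_SCORE[d.originalPosition];
    }
}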