tess3ract

Sunday, October 27, 2013

generating binary file for simplified chinese via word2vec

background
word2vec could come in handy in nlp.
i have discovered one way to generate vectors for simplified chinese.

training data
a) wiki dump: http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
(reference: http://licstar.net/archives/262)
use Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to extract text from the dump with the following command:

bzcat zhwiki-latest-pages-articles.xml.bz2 | python WikiExtractor.py -b1000M -o extracted >output.txt

b) socialysis: ftp://ftp.socialysis.org
you can find plenty of raw text here.

segmentation
before training, we need to segment the raw text into terms.
in this case, i am using ansj for segmentation.
i wrote a demo class to turn ansj into a command line tool:

package org.ansj.demo;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;

import org.ansj.domain.Term;
import org.ansj.recognition.NatureRecognition;
import org.ansj.splitWord.analysis.ToAnalysis;

public class SimpleIODemo {

    public static void main(String[] args) throws IOException {
        // read raw text from stdin, one line at a time
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String line = null;
        while ((line = br.readLine()) != null) {
            // skip markup lines such as the <doc ...> markers written by Wikipedia Extractor
            if (line.startsWith("<"))
                continue;
            // segment the line, then tag each term with its nature
            List<Term> parse = ToAnalysis.parse(line);
            new NatureRecognition(parse).recognition();
            // print the line as space-separated "term/nature" tokens
            for (Term term : parse) {
                System.out.print(term.getName() + "/"
                        + term.getNatrue().natureStr + " ");
            }
            System.out.println();
        }
    }
}

here we append the nature (part-of-speech tag) of each term to the term itself, so the same word used with different natures is not merged into one ambiguous term.
use this command to generate segmented text:

mvn exec:java -Dexec.mainClass="org.ansj.demo.SimpleIODemo" < ~/work/extracted_text.txt > ~/work/segmented_text.txt

training

use this command:

./word2vec -train ~/work/segmented_text.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 16 -binary 1

after a period of time, a binary file named vectors.bin will be generated.

verifying

use this command to test trained vectors:

./distance vectors.bin

i trained vectors on the wiki dump alone and got this:

Enter word or sentence (EXIT to break): 人工智能/n

Word: 人工智能/n Position in vocabulary: 18882

Word Cosine distance
------------------------------------------------------------------------
计算机/n 0.758043
认知科学/n 0.659870
机器人学/n 0.636466
运筹学/n 0.628714
控制论/n 0.626604
自动化/vn 0.612964
博弈论/n 0.608870
科学/n 0.595060
系统工程/l 0.593820
微电子学/n 0.592527
nlp/en 0.590136
仿真/v 0.589741
领域/n 0.588424
知识库/n 0.588246
分布式/b 0.586032
信息论/n 0.584697
计量经济学/n 0.582200
计量学/n 0.580011
分析/vn 0.579240
生物学/n 0.578400
机器翻译/l 0.578206
自动化/v 0.577689
应用/vn 0.573138
技术/n 0.571564
数学/n 0.571543
模拟/vn 0.570714
人机/n 0.570010
编程/v 0.569065
空间科学/n 0.566234
系统论/n 0.566088
基础理论/l 0.564778
abap/en 0.563862
本体论/n 0.563624
跨学科/b 0.560602
cae/en 0.560012
gis/en 0.559896
分子生物学/n 0.559691
仿真/vn 0.558837
信息学/n 0.558737
社会心理学/n 0.555530
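
the second column in the list above is the cosine of the angle between the query vector and each candidate vector, so a higher value means a closer word. a minimal sketch of that measure follows; the class and method names are made up for illustration and are not part of word2vec, and returning 0 for a zero vector is an assumption of this sketch.

```java
/** Cosine similarity between two equal-length vectors (illustrative helper, not part of word2vec). */
final class CosineSimilarity {

    static double of(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vector lengths differ");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        // assumption of this sketch: a zero vector is similar to nothing
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }
}
```

vectors that point in the same direction score 1, and orthogonal vectors score 0.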

Saturday, October 26, 2013

Solution: Java Runtime.exec hangs abnormally

background
i was debugging a J2EE project which has a method that invokes an external program.
in most scenarios it works fine, but after a period of running under full load i spotted some external processes that were hanging.

problem
this is the code when things went wrong:
private void execute(String command) throws IOException {
    Runtime runTime = Runtime.getRuntime();
    LOG.info("executing: " + command);
    String[] args = new String[] {
        "/bin/sh", "-c", command
    };
    Process proc = runTime.exec(args);
    try {
        if (proc.waitFor() != 0) {
            throw new IOException("subprocess exited with non-zero code");
        }
    } catch (InterruptedException e) {
        throw new IOException("interrupted");
    }
}

analyzing

it turns out that there is a very long output (through stdout) in every hanging process.

It is documented here: http://docs.oracle.com/javase/6/docs/api/java/lang/Process.html

"Because some native platforms only provide limited buffer size for standard input and output streams, failure to promptly write the input stream or read the output stream of the subprocess may cause the subprocess to block, and even deadlock."

solution A

if the output can be ignored, just append "> /dev/null" to the command (or "> /dev/null 2>&1" to discard stderr as well), and it should work fine.

solution B

if the output is needed, start reading stdout and stderr of the subprocess, instead of just waiting for it to finish.
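
a minimal sketch of solution B, assuming it is acceptable to merge stderr into stdout and to collect the whole output in memory. the class name DrainingExec is made up for this example; it uses ProcessBuilder instead of Runtime.exec so that a single reader can drain both streams.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** Runs a shell command and drains its output before waiting for it (sketch). */
final class DrainingExec {

    static String execute(String command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("/bin/sh", "-c", command);
        // merge stderr into stdout so one loop drains both pipes
        pb.redirectErrorStream(true);
        Process proc = pb.start();
        // nothing is sent to the subprocess, so close its stdin right away
        proc.getOutputStream().close();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(proc.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            // keep reading until the subprocess closes its output
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }

        int code = proc.waitFor();
        if (code != 0) {
            throw new IOException("subprocess exited with code " + code);
        }
        return output.toString();
    }
}
```

because the pipe is emptied while the subprocess is still running, the subprocess never blocks on a full buffer. if stdout and stderr must stay separate, each stream needs its own reader thread.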

[For beginners] How to deploy J2EE website on your own server

background
you may want to deploy your own website using J2EE, so people could have access to it.
in this case, this article will show you how to do it.

essentials
a running server (or a VPS) on which you can deploy tomcat or jetty
a JDBC compatible database (mysql, sql server, postgresql or oracle)
the WAR package of your website
(optional) a domain name

how to
I. deploy tomcat/jetty on your server
setup jre:
[jre7] http://www.oracle.com/technetwork/java/javase/downloads/java-se-jre-7-download-432155.html

get tomcat/jetty package here:
[jetty] http://download.eclipse.org/jetty/stable-9/dist/
[tomcat] http://tomcat.apache.org/download-80.cgi

II. deploy webapp (WAR package)
put your war package in the webapps directory of jetty/tomcat.

III. get it running
tomcat:
sh tomcat/bin/catalina.sh start

jetty:
sh jetty/bin/jetty.sh start

it works.

implementation of dependency click model

background

sometimes we are just not satisfied with the ranking of an IR system, even if it is based on a sophisticated ranking strategy.
therefore, user feedback is an important part of a ranking system.
implementing a click model is one way to make use of it.
this article is based on the paper: http://research.microsoft.com/pubs/73115/multiple-click-model-wsdm09.pdf, which was published by Microsoft Research.
it is called Dependency Click Model because it considers the dependency on position.

hypothesis

given a list of query results, we simply assume that the user examines them strictly in order.

the user clicks the results that seem relevant, may continue examining after each click, and stops somewhere.

DCM is built on what this basic hypothesis implies.

for details on how it works, please read: http://research.microsoft.com/pubs/73115/multiple-click-model-wsdm09.pdf

engineering implementation

to implement such a feature, we decompose the model into 4 processes.

a) model data storage

we propose an array-of-counter-pairs structure. each array-of-counter-pairs is an array of counter-pairs, and each counter-pair holds two counters, so it may look like this:

[(alpha, beta), (alpha, beta), (alpha, beta), (alpha, beta), ...]

element index indicates corresponding position.

we need to store one global array-of-counter-pairs and one array-of-counter-pairs for each keyword.

this could be done with a database or a simple binary file.

it may contain a hash table (key: keyword, value: array-of-counter-pairs).

b) data recording and collection

we need to record exactly which results the user clicked for a query with a given keyword, and which results were in the list.

therefore, for a particular query, we may have a record like this (call it a session):

USERID, KEYWORD, SHOWN RESULTS, CLICKED DOCS

c) log analysis

we need to analyze the recorded sessions periodically to update the counter-pairs.

for the global array-of-counter-pairs, each alpha counts how many sessions had their last click at the corresponding position, while beta counts how many clicks occurred at that position.

for a keyword's array-of-counter-pairs, each alpha counts how many clicks occurred at the corresponding position, while beta counts how many impressions occurred at that position (under the hypothesis, a result shown before the last clicked position counts as impressed).

for each recorded session, first update the global counters, and then update those of the session's keyword.

the relevance of an individual document is alpha / beta, taken from the keyword's counters.
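
the keyword update described above could be sketched like this. the class and method names are made up for illustration. the sketch assumes 0-based, distinct click positions, counts the last clicked position itself as impressed, and skips sessions without any click; none of these details are spelled out in the text above.

```java
import java.util.Collections;
import java.util.List;

/** Counter-pairs of one keyword, indexed by result position (sketch). */
final class KeywordCounters {
    final long[] alpha; // clicks observed at each position
    final long[] beta;  // impressions observed at each position

    KeywordCounters(int positions) {
        alpha = new long[positions];
        beta = new long[positions];
    }

    /** Updates the counters from one session, given the clicked positions. */
    void recordSession(List<Integer> clickedPositions) {
        if (clickedPositions.isEmpty()) {
            return; // assumption: sessions without clicks are ignored
        }
        int last = Collections.max(clickedPositions);
        // every position up to the last click counts as impressed
        for (int i = 0; i <= last && i < beta.length; i++) {
            beta[i]++;
        }
        for (int p : clickedPositions) {
            if (p < alpha.length) {
                alpha[p]++;
            }
        }
    }

    /** alpha / beta, or 0 when the position has never been impressed. */
    double relevance(int position) {
        return beta[position] == 0 ? 0.0 : (double) alpha[position] / beta[position];
    }
}
```

for example, after one session with clicks at positions 0 and 2 and another with a click at position 2 only, position 0 has relevance 1/2, position 1 has 0/2 and position 2 has 2/2.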

d) rank adjustment

we can simply sort the Top-N documents by the computed score:

score = 0.3 * relevance + 0.7 * positionScore,

where positionScore is a constant corresponding to original position in list.
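
a small sketch of this re-ranking step. the text does not say which constants positionScore uses, so 1 / (position + 1) below is a hypothetical choice made only for this example, as are the class and method names.

```java
import java.util.Arrays;

/** Re-ranks a result list by 0.3 * relevance + 0.7 * positionScore (sketch). */
final class RankAdjuster {

    /** Hypothetical position constant: 1, 1/2, 1/3, ... for positions 0, 1, 2, ... */
    static double positionScore(int position) {
        return 1.0 / (position + 1);
    }

    static double score(double relevance, int position) {
        return 0.3 * relevance + 0.7 * positionScore(position);
    }

    /** Returns the original positions ordered by descending score. */
    static Integer[] rerank(final double[] relevance) {
        Integer[] order = new Integer[relevance.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // the sort is stable, so equal scores keep their original order
        Arrays.sort(order, (x, y) ->
                Double.compare(score(relevance[y], y), score(relevance[x], x)));
        return order;
    }
}
```

with relevances 0.5, 0.1 and 0.9 for the first three positions, the third document overtakes the second one while the first stays on top.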

Friday, June 7, 2013

a small trick: automatically sign in SRUN3000 with OpenWRT

I. Background

Tianjin University has just deployed the SRUN3000 authentication system, which blocks internet access until you sign in.

It is very inconvenient to sign in every single time, so I decided to do something about it.

In this case, I have a router running OpenWRT.

(btw: most TP-Link wireless routers can be flashed with OpenWRT firmware if you are willing to try this out)

II. Solution

OpenWRT provides the hotplug2 feature, which triggers scripts when the state of an interface changes.

Save this script to /etc/hotplug.d/iface/30-srun:

#!/bin/sh
[ "$ACTION" = ifup ] || exit 0
[ "$INTERFACE" = wan ] || exit 0

# delay for 2 seconds
sleep 2

# use nc to send the http post request
nc g.tju.edu.cn 80 < /tmp/post_login

Then make it executable (chmod +x).

Next, save this request to /tmp/post_login:

POST /cgi-bin/do_login HTTP/1.1
Host: g.tju.edu.cn
Content-Type: application/x-www-form-urlencoded
Content-Length: 77

username=ID&password=PASSWORD&drop=0&type=1&n=100&force=true

replace ID with your account and PASSWORD with Substring(8, 16) of the md5 hash of your password, that is, the 16 hex characters starting at index 8. e.g. for 123456 it is 49ba59abbe56e057.

And 77 should be replaced with the total length of the parameter line (username=ID&password=PASSWORD&drop=0&type=1&n=100&force=true) after these substitutions.
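
A small helper can produce the password field and the Content-Length value. This is only a sketch: the class and method names are made up, it reads Substring(8, 16) as "start at index 8, take 16 characters" of the lowercase hex digest, and it assumes the account name needs no URL encoding.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Builds the form body for the login request (sketch). */
final class SrunLoginBody {

    /** The 16 hex characters starting at index 8 of the md5 digest of the password. */
    static String passwordField(String plainPassword) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(plainPassword.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.substring(8, 24);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** The parameter line with ID and PASSWORD filled in. */
    static String body(String id, String plainPassword) {
        return "username=" + id + "&password=" + passwordField(plainPassword)
                + "&drop=0&type=1&n=100&force=true";
    }

    /** The value to put into the Content-Length header. */
    static int contentLength(String body) {
        return body.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Run it once on your PC, then paste the resulting parameter line and its length into /tmp/post_login.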

--- EOF --

Thursday, June 6, 2013

a simple & stupid probabilistic correlation analysis method

I. Problem definition

assume you are in charge of a private tracker. recommendations could come in handy there, especially when a visitor wants to try out something new.

so i am going to work out a way to produce them.

II. Theory

Consider resources A and B. N(A) represents the number of users who downloaded resource A.

In that case, presuming we have N(A ∪ B) users, P(A) = N(A) / N(A ∪ B).

Similarly, P(B) = N(B) / N(A ∪ B).

Suppose we want to present a list of recommendations based on resource A. There are two obvious ways to do this:

a) For every other resource B, use P(B | A) as rank, with P(B | A) = P(AB) / P(A). The result is N(A ∩ B) / N(A).

b) For every other resource B, use P(AB) as rank. The result is N(A ∩ B) / N(A ∪ B).

The first idea measures what percentage of the users who downloaded A also downloaded B.

The second idea measures what percentage of the users who downloaded A or B downloaded both.

These are NOT THE SAME.

Consider N(A) = 50, N(B) = 40, N(A ∪ B) = 80, N(A ∩ B) = 10.

N(C) = 300, N(A ∪ C) = 320, N(A ∩ C) = 30.

Do the math, idea a) will get the following rank:

B: 0.20 C: 0.60

while idea b):

B: 0.13 C: 0.09

The difference is obvious.

In fact, resource C is a hot resource which almost every user has downloaded.

So C gets a high rank in idea a). But idea b) also takes into account how many of the users who downloaded C downloaded A, so the ratio stays reasonable.

After a series of tests, it turns out that with the first method hot resources often get a high rank whatever resource they are correlated with, while the second method performs well because it considers a two-way linkage.
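
The two ranks can be written down as two tiny functions. This is a sketch with made-up names; N(A ∪ B) is derived from N(A) + N(B) - N(A ∩ B), and the second rank is what is usually called the Jaccard index. The hot resource H in the comments is a hypothetical example, not data from the tracker.

```java
/** The two candidate ranks, computed from download counts (sketch). */
final class RankFormulas {

    /** Idea a): P(B | A) = N(A ∩ B) / N(A). */
    static double conditional(int nA, int nBoth) {
        return (double) nBoth / nA;
    }

    /** Idea b): N(A ∩ B) / N(A ∪ B), with the union derived by inclusion-exclusion. */
    static double jaccard(int nA, int nB, int nBoth) {
        return (double) nBoth / (nA + nB - nBoth);
    }

    public static void main(String[] args) {
        // B as in the text: N(A) = 50, N(B) = 40, N(A ∩ B) = 10
        System.out.println(conditional(50, 10));
        System.out.println(jaccard(50, 40, 10));
        // hypothetical hot resource H: N(H) = 1000, N(A ∩ H) = 45
        System.out.println(conditional(50, 45));
        System.out.println(jaccard(50, 1000, 45));
    }
}
```

For the hypothetical H, idea a) puts it far above B just because nearly everyone has downloaded it, while idea b) puts it below B.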

III. Coding phase

I wrote a tiny simple program to get this job done.

You can download it here:

https://www.dropbox.com/s/67iblrb5n2mtfm8/corelation.cpp

The idea is, for every resource pair <A, B>, to calculate (N(A ∩ B) / N(A ∪ B) * 100) as the correlation ratio of <A, B> as well as of <B, A>.

When building a recommendation list, sort the resource list by ratio and pick the top N items.
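
The linked program is C++ and is not reproduced here. The following is an independent Java sketch of the same idea, assuming the downloads are available as a map from resource name to the set of user ids that downloaded it. All names in it are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Picks the top N resources by N(A ∩ B) / N(A ∪ B) * 100 (sketch). */
final class Recommender {

    static double ratio(Set<Integer> a, Set<Integer> b) {
        // iterate over the smaller set and probe the larger one
        Set<Integer> small = a.size() <= b.size() ? a : b;
        Set<Integer> large = (small == a) ? b : a;
        int both = 0;
        for (Integer user : small) {
            if (large.contains(user)) {
                both++;
            }
        }
        int either = a.size() + b.size() - both;
        return either == 0 ? 0.0 : 100.0 * both / either;
    }

    static List<String> recommend(Map<String, Set<Integer>> downloads, String target, int topN) {
        Set<Integer> base = downloads.get(target);
        List<String> others = new ArrayList<>();
        for (String resource : downloads.keySet()) {
            if (!resource.equals(target)) {
                others.add(resource);
            }
        }
        // highest ratio first
        others.sort((x, y) -> Double.compare(
                ratio(base, downloads.get(y)), ratio(base, downloads.get(x))));
        return new ArrayList<>(others.subList(0, Math.min(topN, others.size())));
    }
}
```

This version recomputes the ratios inside the comparator, which keeps it short. A real run over all pairs would compute each ratio once and store it.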

IV. Next step

This is just a tiny experiment I did on my own, because I was curious about how well and how fast these methods perform, and what problems I would run into next.

Right now the time complexity of the algorithm is O( (R^2) * N ), where R is the number of resources and N is the average number of users that downloaded a resource.

It takes a lot of time to finish the calculation.

Apparently there are much better ways to do this.

access the Internet past the blocks on CERNET

I. What is the problem

in this case, i am going to solve two problems at once:

a) as far as i know, overseas access via CERNET averages a bandwidth of about 8kb/s, which drives you crazy

b) the GFW blocks many popular websites around the world

II. The idea

it turns out that:

a) accessing overseas resources via a proxy set up in the nearest city is usually faster

b) GFW will not block secure connections

so suppose i am in Tianjin: i am going to use two servers, A set up in Beijing and B set up in the US, plus a maintained list of websites, to solve the problem.

when:

a) accessing mainland websites, just go directly

b) accessing overseas websites that are not blocked, use A as a proxy, which makes it faster

c) accessing overseas websites that are blocked, first reach B via proxy A, then reach the target website via B, which gets through the blocks at an impressive speed.
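
the three cases above boil down to a lookup in the maintained website list. a minimal sketch, assuming the list is kept as two sets of host names; the class, the enum and the host names in the comments are made up for illustration.

```java
import java.util.Set;

/** Picks the route for a host according to the three cases above (sketch). */
final class RouteChooser {

    enum Route { DIRECT, VIA_A, VIA_A_THEN_B }

    static Route choose(String host, Set<String> overseas, Set<String> blocked) {
        if (blocked.contains(host)) {
            return Route.VIA_A_THEN_B; // case c): chain through A, then B
        }
        if (overseas.contains(host)) {
            return Route.VIA_A;        // case b): proxy A only
        }
        return Route.DIRECT;           // case a): mainland website
    }
}
```

a blocked host wins over the overseas check, so a host that appears in both sets still goes through the chain.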

III. Tools

use proxifier (http://www.proxifier.com/) to redirect connections set up by localhost.

use plink (http://the.earth.li/~sgtatham/putty/latest/x86/plink.exe) to set up secure tunnel from localhost to server A.

use autossh (http://www.harding.motd.ca/autossh/) to maintain a secure tunnel from server A to server B, otherwise we will not be able to get through the blocks.

IV. How to

a) configure ssh-key for A and B in order to give A free access to B without typing password

b) add the following command line to daemon on server A:
autossh -f -M 5678 -CfNg -D portAB serverB

c) set up a plink process on localhost:
plink.exe serverA -N -ssh -2 -D portA

d) create proxy servers in proxifier: 127.0.0.1:portA and 127.0.0.1:portAB

e) combine these two proxies into a proxy chain named Tunnel-US

f) set up rules:
in this case i redirect all connections to US & CA to 127.0.0.1:portA
and all blocked connections to Chain Tunnel-US

g) enjoy twitter

V. Next step

so far i have to build these rules manually, because there is no list that tells which websites are overseas or which websites are blocked.

there is a page that lists monitored domains that are blocked: https://en.greatfire.org/search/domains

that makes it possible to fill in the rules automatically.

--- EOF---