word2vec comes in handy for many nlp tasks.
here is one way i found to generate word vectors for simplified chinese.
training data
a) wiki dump: http://download.wikipedia.com/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
(reference: http://licstar.net/archives/262)
use Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to extract plain text from the dump with the following command:
bzcat zhwiki-latest-pages-articles.xml.bz2 | python WikiExtractor.py -b1000M -o extracted >output.txt
the extracted raw text is written to files under the extracted directory (the -o argument).
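the segmentation step below reads one text file, while the extractor writes many. a minimal python sketch for merging them, assuming the output consists of files in subdirectories of extracted with each article wrapped in <doc ...> and </doc> lines. collect_text and the paths in the usage comment are hypothetical names, not part of any tool.

```python
import os


def collect_text(extracted_dir, out_path):
    """Merge every file under extracted_dir into out_path.

    Assumption: markup lines (such as <doc ...> and </doc>) start with "<".
    Returns the number of text lines written.
    """
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for root, dirs, files in os.walk(extracted_dir):
            dirs.sort()  # visit subdirectories in a stable order
            for name in sorted(files):
                with open(os.path.join(root, name), encoding="utf-8") as f:
                    for line in f:
                        line = line.strip()
                        # drop blank lines and markup lines
                        if not line or line.startswith("<"):
                            continue
                        out.write(line + "\n")
                        kept += 1
    return kept


# hypothetical usage (keep out_path outside extracted_dir):
# collect_text("extracted", os.path.expanduser("~/work/extracted_text.txt"))
```

keep the output file outside the extracted directory, otherwise the walk would pick it up as input.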
segmentation
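before training, the raw text has to be segmented into terms, because chinese is written without spaces between words and word2vec expects whitespace-separated tokens.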
in this case, i am using ansj for segmentation.
i wrote a demo class to turn ansj into a command line tool:
package org.ansj.demo;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;

import org.ansj.domain.Term;
import org.ansj.recognition.NatureRecognition;
import org.ansj.splitWord.analysis.ToAnalysis;

public class SimpleIODemo {
    // reads raw text from stdin and prints one segmented line per input line
    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String line = null;
        while ((line = br.readLine()) != null) {
            // skip lines that start with "<" (markup, not article text)
            if (line.startsWith("<"))
                continue;
            // split the line into terms, then tag each term with its nature
            List<Term> parse = ToAnalysis.parse(line);
            new NatureRecognition(parse).recognition();
            for (Term term : parse) {
                // print "term/nature " for every term
                System.out.print(term.getName() + "/"
                        + term.getNatrue().natureStr + " ");
            }
            System.out.println();
        }
    }
}
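in this case, the nature (part-of-speech tag) of each term is appended to it, so that terms written the same way but used with different natures stay distinct (for example 自动化/vn and 自动化/v).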
use this command to generate segmented text:
mvn exec:java -Dexec.mainClass="org.ansj.demo.SimpleIODemo" < ~/work/extracted_text.txt > ~/work/segmented_text.txt
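each line of the segmented file is a sequence of term/nature tokens separated by spaces. a small python sketch for splitting such a line back into pairs, for example to inspect the output before training. parse_segmented_line is a hypothetical helper name.

```python
def parse_segmented_line(line):
    """Split a segmented line into (term, nature) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so a term that itself
        # contains "/" keeps its slash
        term, sep, nature = token.rpartition("/")
        if sep:  # ignore tokens that carry no nature suffix
            pairs.append((term, nature))
    return pairs


if __name__ == "__main__":
    print(parse_segmented_line("人工智能/n 自动化/vn 自动化/v "))
```

word2vec itself never splits the tokens: it treats 自动化/vn and 自动化/v as two different vocabulary entries, which is the point of appending the nature.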
training
use this command:
./word2vec -train ~/work/segmented_text.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 16 -binary 1
after a period of time, a binary file named vectors.bin will be generated.
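to use the vectors outside the bundled tools, the binary file can be read directly. a minimal python sketch, assuming the layout written by the original C tool: a text header "vocab_size dim", then for each word its bytes, one space, dim 32-bit floats, and a newline. little-endian floats are an assumption (true when the file was written on x86), and load_word2vec_binary is a hypothetical name.

```python
import struct


def load_word2vec_binary(path):
    """Read a word2vec binary file into a dict of word -> tuple of floats."""
    vectors = {}
    with open(path, "rb") as f:
        # header line: vocabulary size and vector dimension
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            word = bytearray()
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline ending the previous entry
                    word += ch
            # assumption: little-endian float32 values
            vec = struct.unpack("<%df" % dim, f.read(4 * dim))
            vectors[word.decode("utf-8", errors="replace")] = vec
    return vectors


# hypothetical usage:
# vectors = load_word2vec_binary("vectors.bin")
# print(len(vectors["人工智能/n"]))
```

the keys keep the nature suffix, so lookups use the same term/nature form as the training text.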
verifying
use this command to test trained vectors:
./distance vectors.bin
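i trained vectors on the wiki dump alone and got this (the column labeled "Cosine distance" is a cosine similarity, so a higher value means a closer term):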
Enter word or sentence (EXIT to break): 人工智能/n
Word: 人工智能/n Position in vocabulary: 18882
Word Cosine distance
------------------------------------------------------------------------
计算机/n 0.758043
认知科学/n 0.659870
机器人学/n 0.636466
运筹学/n 0.628714
控制论/n 0.626604
自动化/vn 0.612964
博弈论/n 0.608870
科学/n 0.595060
系统工程/l 0.593820
微电子学/n 0.592527
nlp/en 0.590136
仿真/v 0.589741
领域/n 0.588424
知识库/n 0.588246
分布式/b 0.586032
信息论/n 0.584697
计量经济学/n 0.582200
计量学/n 0.580011
分析/vn 0.579240
生物学/n 0.578400
机器翻译/l 0.578206
自动化/v 0.577689
应用/vn 0.573138
技术/n 0.571564
数学/n 0.571543
模拟/vn 0.570714
人机/n 0.570010
编程/v 0.569065
空间科学/n 0.566234
系统论/n 0.566088
基础理论/l 0.564778
abap/en 0.563862
本体论/n 0.563624
跨学科/b 0.560602
cae/en 0.560012
gis/en 0.559896
分子生物学/n 0.559691
仿真/vn 0.558837
信息学/n 0.558737
社会心理学/n 0.555530
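a ranking like the one above can be reproduced in a few lines. a minimal python sketch, assuming a dict that maps each term to its vector (however it was loaded). cosine and nearest are hypothetical names, and the default of 40 simply matches the number of rows listed above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, word, topn=40):
    """Return the topn (term, similarity) pairs most similar to word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    # highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# hypothetical usage:
# for term, score in nearest(vectors, "人工智能/n"):
#     print(term, round(score, 6))
```

this scans the whole vocabulary for every query, which is fine for spot checks but slow for bulk lookups.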