2007-09-27 10:50 Author: Anonymous Source: compiled from forum posts Editor: 幽灵
The following is a quoted snippet:
using System;
using System.Collections.Generic;
using System.Text;
using Analyzer = Lucene.Net.Analysis.Analyzer;
using SimpleAnalyzer = Lucene.Net.Analysis.SimpleAnalyzer;
using StandardAnalyzer = Lucene.Net.Analysis.Standard.StandardAnalyzer;
using Token = Lucene.Net.Analysis.Token;
using TokenStream = Lucene.Net.Analysis.TokenStream;
namespace MyLuceneTest
{
class Program
{
[STAThread]
public static void Main(System.String[] args)
{
try
{
Test("中华人民共和国在1949年建立,从此开始了新中国的伟大篇章。长春市长春节致词", true);
}
catch (System.Exception e)
{
System.Console.Out.WriteLine(" caught a " + e.GetType() + "\n with message: " + e.Message + e.ToString());
}
}
internal static void Test(System.String text, bool verbose)
{
System.Console.Out.WriteLine(" Tokenizing string: " + text);
Test(new System.IO.StringReader(text), verbose, text.Length);
}
internal static void Test(System.IO.TextReader reader, bool verbose, long bytes)
{
//Analyzer analyzer = new StandardAnalyzer();
Analyzer analyzer = new Lucene.Fanswo.ChineseAnalyzer();
TokenStream stream = analyzer.TokenStream(null, reader);
System.DateTime start = System.DateTime.Now;
int count = 0;
for (Token t = stream.Next(); t != null; t = stream.Next())
{
if (verbose)
{
System.Console.Out.WriteLine("Token=" + t.ToString());
}
count++;
}
System.DateTime end = System.DateTime.Now;
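// Ticks are 100-nanosecond units, so convert the elapsed time to milliseconds before reporting it.
double time = (end - start).TotalMilliseconds;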
System.Console.Out.WriteLine(time + " milliseconds to extract " + count + " tokens");
System.Console.Out.WriteLine((time * 1000.0) / count + " microseconds/token");
System.Console.Out.WriteLine((bytes * 1000.0 * 60.0 * 60.0) / (time * 1000000.0) + " megabytes/hour");
}
}
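}
Test results:
Done!
The segmentation efficiency still needs to be improved at the algorithm level. Chinese punctuation marks are also not handled yet; I will keep refining this.
I am not much of a writer and cannot produce a lot of text, so I let the code speak for me. Brothers and sisters, please give me some feedback. Thanks.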
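As one possible stopgap for the unhandled punctuation, the test loop could skip tokens that contain nothing but punctuation. The sketch below is only an illustration: `PunctuationHelper` and `IsPunctuationOnly` are hypothetical names, not part of Lucene.Net or Lucene.Fanswo, and it relies on the standard .NET `char.IsPunctuation`, `char.IsSymbol`, and `char.IsWhiteSpace` checks, which also cover full-width marks such as "。" and ",".

```csharp
using System;

// Hypothetical helper for illustration only; it is not part of
// Lucene.Net or Lucene.Fanswo.
public static class PunctuationHelper
{
    // Returns true when the text is non-empty and every character is
    // punctuation, a symbol, or white space (ASCII or full-width).
    public static bool IsPunctuationOnly(string termText)
    {
        if (string.IsNullOrEmpty(termText))
        {
            return false;
        }
        foreach (char c in termText)
        {
            if (!char.IsPunctuation(c) && !char.IsSymbol(c) && !char.IsWhiteSpace(c))
            {
                return false;
            }
        }
        return true;
    }
}
```

Inside the token loop of `Test`, a call such as `if (PunctuationHelper.IsPunctuationOnly(t.TermText())) continue;` would then leave those tokens out of the printout and the count, assuming the `Token` class in the Lucene.Net version being used exposes `TermText()`. A proper fix would still belong inside the analyzer itself.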