This article compiles typical usage examples of the Java class com.aliasi.tokenizer.Tokenizer. If you are unsure what the Tokenizer class does, how to use it, or want working examples, the curated code samples below should help.
The Tokenizer class belongs to the com.aliasi.tokenizer package. Eleven code examples are presented below, sorted by popularity by default. You can upvote the examples you like or find useful; your feedback helps the system recommend better Java code samples.
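Every LingPipe example below follows the same basic pattern: obtain a Tokenizer from a TokenizerFactory, then either iterate over the tokens or dump tokens and the whitespace between them into parallel lists. Here is a minimal self-contained sketch of that pattern; it assumes the stock IndoEuropeanTokenizerFactory, since most snippets below do not show which factory they are configured with.

import java.util.ArrayList;
import java.util.List;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

public class TokenizerBasics {
    public static void main(String[] args) {
        String text = "LingPipe tokenizers come from factories.";
        char[] cs = text.toCharArray();
        // A factory produces a fresh Tokenizer per input.
        TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
        Tokenizer tokenizer = factory.tokenizer(cs, 0, cs.length);
        // Collect tokens and inter-token whitespace into parallel lists;
        // the whitespace list ends up one element longer than the token list.
        List<String> tokens = new ArrayList<String>();
        List<String> whites = new ArrayList<String>();
        tokenizer.tokenize(tokens, whites);
        System.out.println("tokens: " + tokens);
    }
}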
Example 1: tokenizeSentencesOPENNLP
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
public static void tokenizeSentencesOPENNLP(List<String> sentences)
{
    ArrayList<String> tokenizedSentences = new ArrayList<String>();
    int size = sentences.size();
    try {
        opennlp.tools.lang.english.Tokenizer tokenizer =
                new opennlp.tools.lang.english.Tokenizer("data/EnglishTok.bin.gz");
        for (int i = 0; i < size; i++)
        {
            String[] tokens = tokenizer.tokenize(sentences.get(i).trim());
            String tokenized = "";
            for (int j = 0; j < tokens.length; j++)
            {
                tokenized += tokens[j] + " ";
            }
            tokenized = tokenized.trim();
            tokenizedSentences.add(tokenized);
            System.out.println(tokenized);
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    writeSentencesToTempFile("data/allSentencesTokenizedOPENNLP.txt", tokenizedSentences);
}
Author: Noahs-ARK, Project: semafor-semantic-parser, Lines: 25, Source: ParsePreparation.java
Example 2: tokenizeSentencesOPENNLP
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
public static void tokenizeSentencesOPENNLP(ArrayList<String> sentences)
{
    ArrayList<String> tokenizedSentences = new ArrayList<String>();
    int size = sentences.size();
    try {
        opennlp.tools.lang.english.Tokenizer tokenizer =
                new opennlp.tools.lang.english.Tokenizer("data/EnglishTok.bin.gz");
        for (int i = 0; i < size; i++)
        {
            String[] tokens = tokenizer.tokenize(sentences.get(i).trim());
            String tokenized = "";
            for (int j = 0; j < tokens.length; j++)
            {
                tokenized += tokens[j] + " ";
            }
            tokenized = tokenized.trim();
            tokenizedSentences.add(tokenized);
            System.out.println(tokenized);
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    writeSentencesToTempFile("data/allSentencesTokenizedOPENNLP.txt", tokenizedSentences);
}
Author: Noahs-ARK, Project: semafor-semantic-parser, Lines: 25, Source: BasicIO.java
Example 3: tokenize
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
@Override
public List<Token> tokenize(JCas jcas) {
    char[] cs = jcas.getDocumentText().toCharArray();
    Tokenizer tokenizer = tokenizerFactory.tokenizer(cs, 0, cs.length);
    return StreamSupport
            .stream(tokenizer.spliterator(), false)
            .map(token -> TypeFactory.createToken(jcas,
                    tokenizer.lastTokenStartPosition(), tokenizer.lastTokenEndPosition()))
            .collect(toList());
}
Author: oaqa, Project: bioasq, Lines: 10, Source: LingPipeParserProvider.java
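The stream in Example 3 works because Tokenizer implements Iterable<String>: each element the spliterator yields advances the tokenizer, so lastTokenStartPosition() and lastTokenEndPosition() refer to the token just produced. The same offsets can be read with a plain loop; a short sketch, reusing the factory assumption from the introductory example:

import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;

public class TokenOffsets {
    public static void main(String[] args) {
        char[] cs = "Offsets follow iteration.".toCharArray();
        Tokenizer t = IndoEuropeanTokenizerFactory.INSTANCE.tokenizer(cs, 0, cs.length);
        // The last-position methods describe the most recently returned
        // token, so they must be read inside the loop body.
        for (String tok : t) {
            System.out.println(tok + " [" + t.lastTokenStartPosition()
                    + ", " + t.lastTokenEndPosition() + ")");
        }
    }
}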
Example 4: getSentences
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
public String[] getSentences(String text) {
    ArrayList<String> tokenList = new ArrayList<>();
    ArrayList<String> whiteList = new ArrayList<>();
    Tokenizer tokenizer = tokenizerFactory.tokenizer(text.toCharArray(), 0, text.length());
    tokenizer.tokenize(tokenList, whiteList);
    String[] tokens = new String[tokenList.size()];
    String[] whites = new String[whiteList.size()];
    tokenList.toArray(tokens);
    whiteList.toArray(whites);
    int[] sentenceBoundaries = sentenceModel.boundaryIndices(tokens, whites);
    if (sentenceBoundaries.length < 1) {
        return new String[0];
    }
    String[] result = new String[sentenceBoundaries.length];
    int sentStartTok = 0;
    int sentEndTok;
    for (int i = 0; i < sentenceBoundaries.length; ++i) {
        sentEndTok = sentenceBoundaries[i];
        StringBuilder sb = new StringBuilder();
        for (int j = sentStartTok; j <= sentEndTok; j++) {
            sb.append(tokens[j]).append(whites[j + 1]);
        }
        result[i] = sb.toString();
        sentStartTok = sentEndTok + 1;
    }
    return result;
}
Author: firm1, Project: zest-writer, Lines: 33, Source: SentenceExtractor.java
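Example 4 relies on two fields, tokenizerFactory and sentenceModel, that the snippet never initializes. A plausible setup (an assumption, not taken from the zest-writer source) uses the stock Indo-European models:

import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

public class SentenceExtractorSetup {
    // Hypothetical field initialization for the class above; the original
    // class may load different models.
    private final TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
    private final SentenceModel sentenceModel = new IndoEuropeanSentenceModel();
}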
Example 5: splitSentences
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
public static ArrayList<String> splitSentences(String paragraph)
{
    ArrayList<String> tokenList = new ArrayList<String>();
    ArrayList<String> whiteList = new ArrayList<String>();
    paragraph = paragraph.trim();
    paragraph = paragraph.replace("\r\n", " ");
    paragraph = paragraph.replace("\n", " ");
    paragraph = paragraph.replace("\r", " ");
    Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(paragraph.toCharArray(), 0, paragraph.length());
    tokenizer.tokenize(tokenList, whiteList);
    String[] tokens = new String[tokenList.size()];
    String[] whites = new String[whiteList.size()];
    tokenList.toArray(tokens);
    whiteList.toArray(whites);
    int[] sentenceBoundaries = SENTENCE_MODEL.boundaryIndices(tokens, whites);
    ArrayList<String> sentences = new ArrayList<String>();
    if (sentenceBoundaries.length < 1) {
        System.out.println("No sentence boundaries found.");
        sentences.add(paragraph);
    }
    int sentStartTok = 0;
    int sentEndTok = 0;
    for (int i = 0; i < sentenceBoundaries.length; ++i)
    {
        sentEndTok = sentenceBoundaries[i];
        String sentence = "";
        for (int j = sentStartTok; j <= sentEndTok; j++)
        {
            sentence += tokens[j] + whites[j + 1];
        }
        sentences.add(sentence.trim());
        sentStartTok = sentEndTok + 1;
    }
    return sentences;
}
Author: Noahs-ARK, Project: semafor-semantic-parser, Lines: 40, Source: ParsePreparation.java
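The whites[j + 1] indexing in Examples 4 and 5 follows from the tokenize(List, List) contract: the whitespace list starts with the whitespace preceding the first token, so it has exactly one more element than the token list, and the whitespace following tokens[j] is whites[j + 1]. A comment-only illustration:

// For the input "Hi there." the parallel arrays look like:
//   tokens: ["Hi", "there", "."]
//   whites: ["", " ", "", ""]        // whites.length == tokens.length + 1
// whites[0] precedes tokens[0]; whites[j + 1] follows tokens[j].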
Example 6: tokenize
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
/**
 * Tokenizes a text.
 *
 * @param text text to tokenize
 * @return array of tokens, or <code>null</code> if the tokenizer is not
 *         initialized
 */
public static String[] tokenize(String text) {
    if (tokenizerFactory == null) return null;
    ArrayList<String> tokenList = new ArrayList<String>();
    ArrayList<String> whiteList = new ArrayList<String>();
    Tokenizer tokenizer =
        tokenizerFactory.tokenizer(text.toCharArray(), 0, text.length());
    tokenizer.tokenize(tokenList, whiteList);
    return tokenList.toArray(new String[tokenList.size()]);
}
Author: claritylab, Project: lucida, Lines: 19, Source: LingPipe.java
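A hypothetical call to the method above, assuming the class's tokenizerFactory has already been initialized elsewhere in LingPipe.java:

public class TokenizeDemo {
    public static void main(String[] args) {
        // Returns null until the factory is set up.
        String[] toks = LingPipe.tokenize("Tokenize this short sentence.");
        if (toks != null)
            System.out.println(java.util.Arrays.toString(toks));
    }
}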
Example 7: sentDetect
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
/**
 * Splits a text into sentences.
 *
 * @param text sequence of sentences
 * @return array of sentences in the text, or <code>null</code> if the
 *         sentence detector is not initialized
 */
public static String[] sentDetect(String text) {
    if (sentenceModel == null) return null;
    // tokenize text
    ArrayList<String> tokenList = new ArrayList<String>();
    ArrayList<String> whiteList = new ArrayList<String>();
    Tokenizer tokenizer =
        tokenizerFactory.tokenizer(text.toCharArray(), 0, text.length());
    tokenizer.tokenize(tokenList, whiteList);
    String[] tokens = tokenList.toArray(new String[tokenList.size()]);
    String[] whites = whiteList.toArray(new String[whiteList.size()]);
    // detect sentences
    int[] sentenceBoundaries =
        sentenceModel.boundaryIndices(tokens, whites);
    int sentStartTok = 0;
    int sentEndTok = 0;
    String[] sentences = new String[sentenceBoundaries.length];
    for (int i = 0; i < sentenceBoundaries.length; i++) {
        sentEndTok = sentenceBoundaries[i];
        StringBuilder sb = new StringBuilder();
        for (int j = sentStartTok; j <= sentEndTok; j++) {
            sb.append(tokens[j]);
            if (whites[j + 1].length() > 0 && j < sentEndTok)
                sb.append(" ");
        }
        sentences[i] = sb.toString();
        sentStartTok = sentEndTok + 1;
    }
    return sentences;
}
Author: claritylab, Project: lucida, Lines: 45, Source: LingPipe.java
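Note a design choice in Example 7: unlike Examples 4 and 5, it does not copy the original whitespace back into the output. Tokens are joined with a single space only where the input had whitespace, and whitespace after each sentence-final token is dropped, so the returned sentences are whitespace-normalized rather than exact substrings of the input. A hypothetical call:

// Returns null until the sentence model is initialized.
String[] sents = LingPipe.sentDetect("First sentence. And a second one.");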
Example 8: wordSpliter
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
public static List<String>[] wordSpliter(String txt) {
    List<String> ls[] = new ArrayList[2];
    ls[0] = new ArrayList<String>();
    ls[1] = new ArrayList<String>();
    char cc[] = txt.toCharArray();
    Tokenizer tk = TOKENIZER.tokenizer(cc, 0, cc.length);
    tk.tokenize(ls[0], ls[1]);
    return ls;
}
Author: dbmi-pitt, Project: pk-ddi-role-identifier, Lines: 10, Source: SentenceSplitter.java
Example 9: tokenize
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
public static String[] tokenize(String txt) {
    char cc[] = txt.toCharArray();
    Tokenizer tk = TOKENIZER.tokenizer(cc, 0, cc.length);
    return tk.tokenize();
}
Author: dbmi-pitt, Project: pk-ddi-role-identifier, Lines: 7, Source: SentenceSplitter.java
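Examples 8 and 9 share a TOKENIZER constant that the snippets never define. A plausible definition (an assumption, not taken from the pk-ddi-role-identifier source):

import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

public class SentenceSplitterSetup {
    // Hypothetical constant backing wordSpliter() and tokenize() above.
    static final TokenizerFactory TOKENIZER = IndoEuropeanTokenizerFactory.INSTANCE;
}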
Example 10: execute
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
/**
 * execute method. Makes LingPipe API calls to tokenize the document.
 * It takes the document's string and passes it to LingPipe for
 * tokenization. It also generates space tokens.
 */
public void execute() throws ExecutionException {
    if (document == null) {
        throw new ExecutionException("There is no loaded document");
    }
    super.fireProgressChanged(0);
    long startOffset = 0, endOffset = 0;
    AnnotationSet as = null;
    if (outputASName == null || outputASName.trim().length() == 0)
        as = document.getAnnotations();
    else
        as = document.getAnnotations(outputASName);
    String docContent = document.getContent().toString();
    List<String> tokenList = new ArrayList<String>();
    List<String> whiteList = new ArrayList<String>();
    Tokenizer tokenizer = tf.tokenizer(docContent.toCharArray(), 0, docContent.length());
    tokenizer.tokenize(tokenList, whiteList);
    for (int i = 0; i < whiteList.size(); i++) {
        try {
            startOffset = endOffset;
            endOffset = startOffset + whiteList.get(i).length();
            if ((endOffset - startOffset) != 0) {
                FeatureMap fmSpaces = Factory.newFeatureMap();
                fmSpaces.put("length", "" + (endOffset - startOffset));
                as.add(new Long(startOffset), new Long(endOffset), "SpaceToken", fmSpaces);
            }
            if (i < tokenList.size()) {
                startOffset = endOffset;
                endOffset = startOffset + tokenList.get(i).length();
                FeatureMap fmTokens = Factory.newFeatureMap();
                fmTokens.put("length", "" + (endOffset - startOffset));
                as.add(new Long(startOffset), new Long(endOffset), "Token", fmTokens);
            }
        }
        catch (InvalidOffsetException e) {
            throw new ExecutionException(e);
        }
    }
}
Author: Network-of-BioThings, Project: GettinCRAFTy, Lines: 53, Source: TokenizerPR.java
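The offset arithmetic in Example 10 depends on the whitespace and token strings concatenating back to the original document text. That holds for LingPipe's standard character-preserving tokenizers, but not for tokenizer filters that modify tokens, so it is worth checking for an arbitrary tf. A sanity-check sketch reusing the variable names from the method above, placed after the tokenize(...) call:

// Reconstruct the document from the interleaved lists and compare.
StringBuilder sb = new StringBuilder();
for (int i = 0; i < tokenList.size(); i++) {
    sb.append(whiteList.get(i)).append(tokenList.get(i));
}
sb.append(whiteList.get(whiteList.size() - 1));   // trailing whitespace
boolean roundTrips = sb.toString().equals(docContent);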
Example 11: tokenizer
import com.aliasi.tokenizer.Tokenizer; // import the required package/class
public Tokenizer tokenizer(char[] content, int start, int length) {
    String str = new String(content, start, length);
    return new TCCLingPipeTokenizer(str);
}
Author: wittawatj, Project: ctwt, Lines: 5, Source: TCCTokenizerFactory.java
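Example 11 wraps a project-specific tokenizer behind the TokenizerFactory interface. Writing such a Tokenizer is straightforward because com.aliasi.tokenizer.Tokenizer leaves only nextToken() abstract. Here is a toy subclass that emits one character per token; it is purely illustrative and has nothing to do with the real TCCLingPipeTokenizer:

import com.aliasi.tokenizer.Tokenizer;

class OneCharTokenizer extends Tokenizer {
    private final String text;
    private int pos = 0;
    OneCharTokenizer(String text) { this.text = text; }
    @Override
    public String nextToken() {
        // Contract: return the next token, or null when exhausted.
        return pos < text.length() ? String.valueOf(text.charAt(pos++)) : null;
    }
}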
Note: the com.aliasi.tokenizer.Tokenizer class examples in this article are compiled from source code and documentation platforms such as GitHub and MSDocs; the code snippets were contributed by the authors of open-source projects. Copyright of the source code remains with the original authors; refer to each project's License before distributing or using the code. Do not repost without permission.