Python reuters.fileids Function Code Examples


This article collects typical usage examples of the Python function nltk.corpus.reuters.fileids. If you are wondering what fileids does, how to call it, or what it looks like in practice, the hand-picked examples below should help.



A total of 20 code examples of the fileids function are shown below, ordered by popularity by default.
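For orientation before the examples, here is a minimal sketch of what reuters.fileids returns (it assumes the corpus has already been downloaded with nltk.download('reuters')):

from nltk.corpus import reuters  # requires: nltk.download('reuters')

# All document ids; each looks like "training/9865" or "test/14826"
documents = reuters.fileids()

# The id prefix encodes the standard train/test split
train_docs = [d for d in documents if d.startswith('training/')]
test_docs = [d for d in documents if d.startswith('test/')]

# Passing one or more categories restricts the result to those topics
barley_docs = reuters.fileids('barley')
grain_docs = reuters.fileids(['barley', 'corn'])

print(len(documents), len(train_docs), len(test_docs), len(barley_docs))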

Example 1: tfidf

import math
from nltk.corpus import reuters

def tfidf(word, wordCount):

    docCount = len(reuters.fileids())

    # Document frequency: how many documents contain the word at least once
    # (only the first 200 documents are sampled)
    wordCountCorpus = 0
    count = 0
    for doc in reuters.fileids():
        count = count + 1
        present = 0
        for word2 in reuters.words(doc):
            if word.lower() == word2.lower():
                present = 1
                break

        if present == 1:
            wordCountCorpus = wordCountCorpus + 1

        if count == 200:
            break

    tf = wordCount
    idf = math.log(docCount / (1 + wordCountCorpus))

    tfidf = tf * idf

    return tfidf
Author: davidjamesmcclure, Project: CodeSamples, Lines: 26, Source: parsingFunctions.py
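As a quick illustration of how this function might be called (a sketch; the wordCount argument is assumed to be the term frequency of the word in a single document):

from nltk.corpus import reuters

doc_id = reuters.fileids()[0]
word = 'oil'
# Term frequency of the word within one document
word_count = sum(1 for w in reuters.words(doc_id) if w.lower() == word)
print(word, tfidf(word, word_count))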


Example 2: import_reuters_files

import sys
from nltk.corpus import reuters

def import_reuters_files(ds, silent=False, log=sys.stdout):
    """
    Import the Reuters corpus into `ds`. E.g.

    >>> from nathan.core import Dataspace
    >>> ds = Dataspace()
    >>> import_reuters_files(ds, silent=True)
    """
    if not silent:
        total = len(reuters.fileids())
        counter = 0
    root_handle = ds.insert("#reuters")
    for fileid in reuters.fileids():
        # Tag each document with its Reuters categories
        tags = ["@%s" % category for category in reuters.categories(fileid)]
        file_handle = ds.insert(["#%s" % fileid] + tags)
        ds.link(root_handle, file_handle)
        # Insert every sentence (lower-cased) and link it to its document
        for sent in reuters.sents(fileid):
            norm = [word.lower() for word in sent]
            sen_handle = ds.insert(norm)
            ds.link(file_handle, sen_handle)
        if not silent:
            counter += 1
            if (counter % 10 == 0):
                print("importing %s of %s files..." % (counter, total),
                      file=log)
Author: tdiggelm, Project: nltk-playground, Lines: 27, Source: train.py


Example 3: collection_stats

from nltk.corpus import reuters

def collection_stats():
    # List of documents
    documents = reuters.fileids()
    print(str(len(documents)) + " documents")

    train_docs = list(filter(lambda doc: doc.startswith("train"), documents))
    print(str(len(train_docs)) + " total train documents")

    test_docs = list(filter(lambda doc: doc.startswith("test"), documents))
    print(str(len(test_docs)) + " total test documents")

    # List of categories
    categories = reuters.categories()
    print(str(len(categories)) + " categories")

    # Documents in a category
    category_docs = reuters.fileids("acq")

    # Words for a document
    document_id = category_docs[0]
    document_words = reuters.words(document_id)
    print(document_words)

    # Raw document
    print(reuters.raw(document_id))
Author: BugliL, Project: SVNexercise, Lines: 25, Source: test2.py


Example 4: __init__

    def __init__(self, categories=None, lower=True):
        if categories is None or len(categories) == 1:
            self.fileids = reuters.fileids()
        else:
            self.fileids = reuters.fileids(categories)
        self.categories = categories
        self.lower = lower
Author: verasazonova, Project: textsim, Lines: 8, Source: reuters.py


Example 5: print_reuters

def print_reuters():
    from nltk.corpus import reuters
    # print reuters.fileids()
    # print reuters.categories()
    print reuters.categories('training/9865')
    print reuters.categories(['training/9865','training/9880'])
    print reuters.fileids('barley')
    print reuters.fileids(['barley','corn'])
Author: Paul-Lin, Project: misc, Lines: 8, Source: toturial.py


Example 6: explore_categories

def explore_categories(max_len=5000, min_len=100, percentage=0.3):
    # Look for pairs of categories whose document sets do not overlap
    for cat in reuters.categories():
        for cat2 in reuters.categories():
            if cat2 > cat:
                if len(set(reuters.fileids(cat)) & set(reuters.fileids(cat2))) == 0:
                    l1 = len(reuters.fileids(cat))
                    l2 = len(reuters.fileids(cat2))
                    # Keep pairs of reasonable combined size that are not too unbalanced
                    if min_len < (l1 + l2) < max_len and min(l1, l2) / float(l1 + l2) > percentage:
                        print cat, cat2, l1 + l2, min(l1, l2) / float(l1 + l2)
Author: verasazonova, Project: textsim, Lines: 9, Source: reuters.py


Example 7: generateTextList

def generateTextList( category, size, normalize = False ):
	i = 0
	text = []
	
	while i < size and i < len( reuters.fileids( category ) ):
		if not normalize:
			text.insert( i, reuters.words( reuters.fileids( category )[i] ) )
		else:
			text.insert( i, 
			getNormalizedText( reuters.words( reuters.fileids( category )[i] ) ) )
		i += 1
	
	return text
Author: htaunay, Project: TextComparison, Lines: 13, Source: Exercise02_Stems.py


Example 8: create_token_stream

def create_token_stream():
    """
    A function that creates a token stream based on the nltk reuters corpus.
    A token stream is a list of (termID, docID) tuples.
    """
    token_stream = []
    docID = 1
    termID = 1
    print "Creating token stream..."
    for fileid in reuters.fileids('barley'):
        for term in reuters.words(fileid):
            # Strip punctuation from the word and make lower case
            term = remove_punct_from_word(term).lower()
            # Check to make sure word is not "" and term is not a number
            if len(term) > 0 and not is_number(term):
                # `stem` is the stemmer class imported elsewhere in the module;
                # `terms` is a module-level dict mapping stemmed term -> termID
                stemmed_term = stem().stem_word(term)
                if stemmed_term not in terms:
                    terms[stemmed_term] = termID
                    termID += 1
                new_token = (terms[stemmed_term], docID)
                token_stream.append(new_token)
        # Add to docs dictionary mapping docID to file
        docs[docID] = fileid
        docID += 1
    return token_stream
Author: btawfic, Project: COMP90024-Project1, Lines: 25, Source: btawfic_project1_compressed_index.py
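The token stream above is the raw material for an inverted index. A minimal sketch of the usual next step, sorting the stream and grouping postings by term (the helper name below is hypothetical), might look like this:

from collections import defaultdict

def token_stream_to_index(token_stream):
    # Sort by (termID, docID) and collapse duplicates into postings lists
    index = defaultdict(list)
    for term_id, doc_id in sorted(token_stream):
        if not index[term_id] or index[term_id][-1] != doc_id:
            index[term_id].append(doc_id)
    return index

# index = token_stream_to_index(create_token_stream())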


Example 9: load_data

from nltk.corpus import reuters, stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

def load_data(config={}):
    """
    Load the Reuters dataset.

    Returns
    -------
    data : dict
        with keys 'x_train', 'x_test', 'y_train', 'y_test', 'labels'
    """
    stop_words = stopwords.words("english")
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    mlb = MultiLabelBinarizer()

    documents = reuters.fileids()
    test = [d for d in documents if d.startswith('test/')]
    train = [d for d in documents if d.startswith('training/')]

    # TF-IDF features over the raw article text
    docs = {}
    docs['train'] = [reuters.raw(doc_id) for doc_id in train]
    docs['test'] = [reuters.raw(doc_id) for doc_id in test]
    xs = {'train': [], 'test': []}
    xs['train'] = vectorizer.fit_transform(docs['train']).toarray()
    xs['test'] = vectorizer.transform(docs['test']).toarray()
    # Multi-hot label matrices (each article can belong to several categories)
    ys = {'train': [], 'test': []}
    ys['train'] = mlb.fit_transform([reuters.categories(doc_id)
                                     for doc_id in train])
    ys['test'] = mlb.transform([reuters.categories(doc_id)
                                for doc_id in test])
    data = {'x_train': xs['train'], 'y_train': ys['train'],
            'x_test': xs['test'], 'y_test': ys['test'],
            'labels': globals()["labels"]}  # `labels` is defined elsewhere in the module
    return data
Author: MartinThoma, Project: algorithms, Lines: 32, Source: reuters.py
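A short sketch of how the returned dictionary might be consumed, assuming scikit-learn is available and the module-level labels variable exists; the classifier choice here is purely illustrative:

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

data = load_data()
# One binary classifier per Reuters category (multi-label setting)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(data['x_train'], data['y_train'])
print("subset accuracy:", clf.score(data['x_test'], data['y_test']))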


Example 10: run

import random
from gensim import corpora, models  # Dictionary and LdaModel come from gensim

def run():

    """Import the Reuters Corpus which contains 10,788 news articles"""

    from nltk.corpus import reuters
    raw_docs = [reuters.raw(fileid) for fileid in reuters.fileids()]

    # Select 100 documents randomly
    rand_idx = random.sample(range(len(raw_docs)), 100)
    raw_docs = [raw_docs[i] for i in rand_idx]

    # Preprocess documents (ie_preprocess is defined elsewhere in the module)
    tokenized_docs = [ie_preprocess(doc) for doc in raw_docs]

    # Remove single-occurrence words
    docs = remove_infrequent_words(tokenized_docs)

    # Create dictionary and corpus
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    # Build LDA model
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10)
    for topic in lda.show_topics():
        print topic
Author: alexandercrosson, Project: nlp, Lines: 25, Source: topic_modeling.py


Example 11: get_testset_trainset_nltk_reuters

def get_testset_trainset_nltk_reuters():
    from nltk.corpus import reuters
    global categories_file_name_dict
    global cat_num_docs
    clean_files = [f for f in reuters.fileids() if len(reuters.categories(fileids=f))==1]    
    testset = [f for f in clean_files if f[:5]=='test/']
    trainset = [f for f in clean_files if f[:9]=='training/']
    for cat in reuters.categories():
        li=[f for f in reuters.fileids(categories=cat) if f in trainset]
        li_te = [f for f in reuters.fileids(categories=cat) if f in testset]
        if len(li)>20 and len(li_te)>20:
            cat_num_docs[cat]=len(li)
            li.extend(li_te)
            categories_file_name_dict[cat]=li
    return [[ f for f in trainset if f2c('reuters',f) in categories_file_name_dict],
            [ f for f in testset if f2c('reuters',f) in categories_file_name_dict]]            
Author: genf, Project: Naive-Bayes-Document-Classifier, Lines: 16, Source: Preprocessor.py


Example 12: create_dictionary_index_reuters

def create_dictionary_index_reuters():
    """
    A function that creates a dictionary with terms as keys
    and postings lists as values for the nltk reuters corpus
    """

    idx = {}
    docs = {}
    docID = 1
    for fileid in reuters.fileids():
        for word in reuters.words(fileid):
            if not is_number(word):
                # Strip punctuation from the word and make lower case
                word = remove_punct_from_word(word).lower()
                # Check to make sure word is not ""
                if len(word) > 0:
                    # Check to see if word already is in index
                    if word in idx:
                        # Check to see if docID is not already present for word
                        if docID not in idx[word]:
                            idx[word].append(docID)
                    # Otherwise add word and docID in array to index
                    else:
                        idx[word] = []
                        idx[word].append(docID)

        # Add to docs dictionary mapping docID to file
        docs[docID] = fileid
        docID += 1
    size = 0
    for k in idx.iterkeys():
        size += os.sys.getsizeof(k)
        size += os.sys.getsizeof(idx[k])
    #print "size of original dictionary is:", size
    return idx
Author: btawfic, Project: COMP90024-Project1, Lines: 35, Source: btawfic_project1_compressed_index.py
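Because each value in the returned dictionary is a postings list of document ids, simple Boolean retrieval can be layered on top. A minimal sketch (the helper name is hypothetical):

def boolean_and(idx, term1, term2):
    # Documents containing both terms: intersect the two postings lists
    postings1 = set(idx.get(term1, []))
    postings2 = set(idx.get(term2, []))
    return sorted(postings1 & postings2)

# idx = create_dictionary_index_reuters()
# print(boolean_and(idx, 'barley', 'export'))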


Example 13: get_reuters_ids_cnt

def get_reuters_ids_cnt(num_doc=100, max_voca=10000, remove_top_n=5):
    """To get test data for training a model
    reuters, stopwords, english words corpora should be installed in nltk_data: nltk.download()

    Parameters
    ----------
    num_doc: int
        number of documents to be returned
    max_voca: int
        maximum number of vocabulary size for the returned corpus
    remove_top_n: int
        remove top n frequently used words

    Returns
    -------
    voca_list: ndarray
        list of vocabulary used to construct a corpus
    doc_ids: list
        list of list of word id for each document
    doc_cnt: list
        list of list of word count for each document
    """
    file_list = reuters.fileids()
    corpus = [reuters.words(file_list[i]) for i in xrange(num_doc)]

    return get_ids_cnt(corpus, max_voca, remove_top_n)
Author: Assios, Project: python-topic-model, Lines: 26, Source: nltk_corpus.py


Example 14: get_reuters_cnt_ids

def get_reuters_cnt_ids(num_doc=100, max_voca=10000):
    ''' to get test data for training a model
    reuters corpus should be installed in nltk_data: nltk.download()
    '''
    file_list = reuters.fileids()
    
    docs = list()
    freq = Counter()

    for i in range(num_doc):
        doc = reuters.words(file_list[i])
        freq.update(doc)
        docs.append(doc)

    voca = [key for key,val in freq.most_common(max_voca)]

    voca_dic = dict()
    voca_list = list()
    for word in voca:
        voca_dic[word] = len(voca_dic)
        voca_list.append(word)

    doc_ids = list()
    doc_cnt = list()

    for doc in docs:
        words = set(doc)
        ids = np.array([int(voca_dic[word]) for word in words if voca_dic.has_key(word)])
        cnt = np.array([int(doc.count(word)) for word in words if voca_dic.has_key(word)])

        doc_ids.append(ids)
        doc_cnt.append(cnt)

    return np.array(voca_list), doc_ids, doc_cnt
Author: judaschrist, Project: python-topic-model, Lines: 34, Source: nltk_corpus.py


Example 15: preProcess

def preProcess():
    print 'PreProcess Reuters Corpus'
    start_time = time.time()
    docs = 0
    bad = 0
    tokenizer = Tokenizer()

    if not os.path.isdir(Paths.base):
        os.makedirs(Paths.base)

    with open(Paths.text_index, 'w') as fileid_out:
      with codecs.open(Paths.texts_clean, 'w', 'utf-8-sig') as out:
          with codecs.open(Paths.reuter_test, 'w', 'utf-8-sig') as test:
              for f in reuters.fileids():
                  contents = reuters.open(f).read()
                  try:
                      tokens = tokenizer.tokenize(contents)
                      docs += 1
                      if docs % 1000 == 0:
                          print "Normalised %d documents" % (docs)

                      out.write(' '.join(tokens) + "\n")
                      # if f.startswith("train"):
                      #
                      # else:
                      #     test.write(' '.join(tokens) + "\n")
                      fileid_out.write(f + "\n")

                  except UnicodeDecodeError:
                      bad += 1
    print "Normalised %d documents" % (docs)
    print "Skipped %d bad documents" % (bad)
    print 'Finished building train file ' + Paths.texts_clean
    end_time = time.time()
    print '(Time to preprocess Reuters Corpus: %s)' % (end_time - start_time)
Author: sirty, Project: TopicModelingDemo, Lines: 35, Source: main.py


Example 16: train

 def train(self):
     training_set = []
     print self.activeCategories
     for category in self.activeCategories:
         files = reuters.fileids(category)
         for fi in files:
             training_set.append((self.getFileFeatures(fi), category))
     self.classifier = nltk.NaiveBayesClassifier.train(training_set)
Author: maxim-popkov, Project: graph-term, Lines: 8, Source: Reader.py


Example 17: __init__

 def __init__(self):
     # Generate training set from sample of Reuters corpus
     train_docs = [(self.bag_of_words(reuters.words(fileid)), category)
                   for category in reuters.categories()
                   for fileid in reuters.fileids(category) if
                   fileid.startswith("train")]
     # Create a classifier from the training data
     self.classifier = NaiveBayesClassifier.train(train_docs)
Author: nathanjordan, Project: bernstein, Lines: 8, Source: classifier.py
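A minimal sketch of how this classifier might be applied to a held-out Reuters document. It assumes bag_of_words maps a word list to a feature dict, as the constructor implies, and the class name below is hypothetical:

from nltk.corpus import reuters

clf = ReutersClassifier()  # hypothetical name for the class that owns this __init__
test_id = next(f for f in reuters.fileids() if f.startswith("test"))
features = clf.bag_of_words(reuters.words(test_id))
print(clf.classifier.classify(features))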


Example 18: __init__

 def __init__(self, dataset=''):
     """
         Docs in reuters corpus are identified by ids like "training|test/xxxx".
     :param dataset: filter for ids
     """
     self.dataset = dataset # filter docs
     self.categories = {c: n for n, c in enumerate(reuters.categories())} # map class with int
     self.docs = {d: n for n, d in enumerate(reuters.fileids())}  # map docs with int
     self.category_mask = [] # mask nth doc with its ith class
Author: lum4chi, Project: IR, Lines: 9, Source: reuterscorpus.py


Example 19: loadReutersCorpus

 def loadReutersCorpus(self):
     '''  
     Load the Reuters datasets.
     Note: need to install nltk data beforehand, using nltk.download() for the first run
     '''       
     self.fileIds = reuters.fileids()
     # Use only the training set corpus - documents located in the reuters/training folder.
     self.fileIds = [x for x in self.fileIds if x.startswith( 'training' )]
     return
Author: nancyya, Project: searchEng, Lines: 9, Source: search_engine.py


Example 20: get_list_fileids

def get_list_fileids(corpus):
    li=[]
    if corpus=='mr':
        from nltk.corpus import movie_reviews
        li = movie_reviews.fileids()
    else:
        from nltk.corpus import reuters
        li = reuters.fileids()
    return li
Author: parthasm, Project: Search-Engine-TF-IDF, Lines: 9, Source: Tokenizer.py



Note: The nltk.corpus.reuters.fileids examples in this article were compiled by 纯净天空 from GitHub, MSDocs, and other source-code hosting and documentation platforms. The snippets were selected from open-source projects contributed by various developers; copyright remains with the original authors, and distribution and use should follow each project's license. Please do not republish without permission.

