Python textract.process Function Code Examples

This article collects typical usage examples of the textract.process function in Python. If you are wondering how process is used in practice, how to call it, or what working examples look like, the selected examples here may help.

A total of 20 code examples of the process function are shown below, sorted by popularity by default.

Example 1: test_missing_filename_python

 def test_missing_filename_python(self):
     """Make sure missing files raise the correct error"""
     filename = self.get_temp_filename()
     os.remove(filename)
     import textract
     from textract.exceptions import MissingFileError
     with self.assertRaises(MissingFileError):
         textract.process(filename)

Developer ID: deanmalmgren, Project: textract, Lines of code: 8, Source file: test_exceptions.py

Example 2: test_unsupported_extension_python

 def test_unsupported_extension_python(self):
     """Make sure unsupported extension raises the correct error"""
     filename = self.get_temp_filename(extension="extension")
     import textract
     from textract.exceptions import ExtensionNotSupported
     with self.assertRaises(ExtensionNotSupported):
         textract.process(filename)
     os.remove(filename)

Developer ID: deanmalmgren, Project: textract, Lines of code: 8, Source file: test_exceptions.py
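
The two tests above show that textract.process raises dedicated exceptions for missing files and unsupported extensions. A minimal sketch of how calling code might guard against such errors follows. The helper name `extract_or_none` and its parameters are hypothetical and not part of textract; the extractor is passed in as a callable so the sketch does not depend on any particular library version.

```python
import logging


def extract_or_none(path, extractor, expected_errors=(Exception,)):
    """Return the extractor's result, or None if an expected error is raised.

    `extractor` is any callable taking a path (for example textract.process).
    `expected_errors` is a tuple of exception classes to treat as
    recoverable; anything else propagates to the caller.
    """
    try:
        return extractor(path)
    except expected_errors as error:
        logging.warning("Could not extract %s: %s", path, error)
        return None
```

With textract installed, one plausible call is `extract_or_none(path, textract.process, (MissingFileError, ExtensionNotSupported))`, using the exception classes imported in the tests above.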

Example 3: annotate_doc

def annotate_doc(pdf_file_path, ontologies):
    if pdf_file_path.endswith('pdf') or pdf_file_path.endswith('PDF'):
        text = textract.process(pdf_file_path, method="pdfminer")
    elif pdf_file_path.endswith('html') or pdf_file_path.endswith('htm'):
        text = textract.process(pdf_file_path, method="beautifulsoup4")
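    elif pdf_file_path.endswith('txt'):
        # Read bytes so the later decode('utf8') works like the textract branches
        with open(pdf_file_path, 'rb') as file:
            text = file.read()
    else:
        # Unsupported extension: there is no text to annotate
        return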
    db = DBConnect()
    if text.isspace():
        log = {
            'file_name': pdf_file_path.encode('utf-8'),
            'error': 'Failed PDF to text transformation in annotation process',
            'exception': '',
            'data': ''
        }
        db.insert_log(log)
        return
    ontologies = ",".join(ontologies)
    annotations = []
    text = unidecode(text.decode('utf8'))
    text = ' '.join(text.split())
    # post_data = dict(apikey=settings.BIOPORTAL_API_KEY, text=text,
    #                  display_links='true', display_context='false', minimum_match_length='3',
    #                  exclude_numbers='true', longest_only='true', ontologies=ontologies, exclude_synonyms='true')
    post_data = dict(apikey=settings.BIOPORTAL_API_KEY, text=text,
                     display_links='true', display_context='false', minimum_match_length='3',
                     exclude_numbers='true', longest_only='true', ontologies=ontologies, exclude_synonyms='true')
    try:
        response = requests.post(settings.ANNOTATOR_URL, post_data)
        json_results = json.loads(response.text)
        for result in json_results:
            for annotation in result['annotations']:
                context_begin = annotation['from']  if annotation['from'] - 40 < 1 else annotation['from'] - 40
                context_end = annotation['to'] if annotation['to'] + 40 > len(text) else annotation['to'] + 40
                record = {
                    'file_name': pdf_file_path.encode('utf-8'),
                    'bio_class_id': result['annotatedClass']['@id'],
                    'bio_ontology_id': result['annotatedClass']['links']['ontology'],
                    'text': annotation['text'],
                    'match_type': annotation['matchType'],
                    'context': u''+text[context_begin:context_end]
                }
                annotations.append(record)
        db.insert_annotations(annotations)
        return
    except (ValueError, IndexError, KeyError) as e:
        print(e)
        log = {
            'file_name': pdf_file_path.encode('utf-8'),
            'error': 'Bad response from Bioportal Annotator',
            'exception': str(e),
            'data': ''
        }
        db.insert_log(log)
        return

Developer ID: ficolo, Project: corpora-char-cli, Lines of code: 56, Source file: bioportal_annotator.py
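
The context lines in this example take up to 40 characters on each side of an annotation. A minimal sketch of that idea, clamped to the text boundaries, is shown here. The function name `context_window` is hypothetical, and `start` and `end` are treated as plain zero-based slice offsets, which is an assumption: the source does not say how the annotator counts its `from` and `to` positions.

```python
def context_window(text, start, end, margin=40):
    """Return text[start:end] widened by `margin` characters on each side.

    The window is clamped so it never runs past either end of `text`.
    """
    begin = max(0, start - margin)
    finish = min(len(text), end + margin)
    return text[begin:finish]
```

Clamping with `max` and `min` keeps the surrounding context even when the match sits near the start or end of the document.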

Example 4: pdftotext_any

def pdftotext_any(myfile):
    # Todo: use tempfile instead
    path = '/tmp/infile.pdf'
    with open(path, 'wb') as f:
    #with tempfile.NamedTemporaryFile() as f:
    #    path = f.name
        f.write(myfile)
    text = textract.process(path, method='pdftotext')
    if len(text) < 5:  # No text found, so it is probably an image scan and needs OCR
        text = textract.process(path, method='tesseract')
    return text

Developer ID: rsandstroem, Project: DocumentStore, Lines of code: 11, Source file: docstore.py
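
The example above carries a to-do note about using tempfile instead of a fixed path. One possible way to do that is sketched here, together with the same "try a fast method, then fall back to OCR" logic. The name `extract_with_fallback` and its parameters are hypothetical; the extractor is passed in as a callable that accepts a path and a `method` keyword, as textract.process does in the example.

```python
import os
import tempfile


def extract_with_fallback(data, extractor,
                          methods=("pdftotext", "tesseract"), min_length=5):
    """Write `data` to a private temp file and try each method in turn.

    Stops at the first method whose output has at least `min_length` bytes
    and returns that output (or the last attempt if none is long enough).
    """
    # delete=False lets the extractor reopen the file by name on any platform
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
        handle.write(data)
        path = handle.name
    try:
        text = b""
        for method in methods:
            text = extractor(path, method=method)
            if len(text) >= min_length:
                break
        return text
    finally:
        # Always clean up the temp file, even if extraction raises
        os.remove(path)
```

A unique temporary file avoids two concurrent calls overwriting each other's input, which can happen with a shared path such as /tmp/infile.pdf.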

Example 5: process_text_file

def process_text_file(file_path):
    file_name, extension = os.path.splitext(file_path)
    print(file_name, extension)
    if extension == ".txt":
        return file_path
    elif extension == '.epub':
        print("Trying epub")
        try:
            text = textract.process(file_path)
            print("Processed epub: ", file_path)
            output_path = file_name + '.txt'
            # textract.process returns bytes, so write in binary mode
            with open(output_path, 'wb') as output_file:
                output_file.write(text)
            print("Converted epub: ", output_path)
            return output_path
        except Exception as error:
            # TODO: textract raises own error so none isn't returned on try failure
            print(error)
            print('Failed to convert epub: ', file_path)
            return None
    elif extension == "":
        text_content = None
        try:
            with open(file_path) as input_file:
                text_content = input_file.read()
                if text_content:
                    print("Managed to read file: ", file_path)
                    return file_path
        except IOError:
            print("Failed to read file: ", file_path)
            return None
    else:
        print('Unsupported file type: ', file_path)
        return None

Developer ID: EilidhHendry, Project: author-similarity, Lines of code: 34, Source file: util.py

Example 6: get_text_from_files

def get_text_from_files(files_to_process):
    """Extracts text from each file given a list of file_names"""
    file_text_dict = {}
    for file_name in files_to_process:
        extracted_text = textract.process(file_name)
        file_text_dict[file_name] = extracted_text
    return file_text_dict

Developer ID: aggerdom, Project: textprocessingfuncs, Lines of code: 7, Source file: dumptext_dialog.py

Example 7: get_text_from_file

	def get_text_from_file(self, file):
		filename = file['id'] + '.pdf'

		self._download_file(file, filename)
		try:
			text = textract.process(filename)
		finally:
			# Remove the downloaded copy even if extraction fails
			os.remove(filename)
		return text

Developer ID: mlchow, Project: duethat, Lines of code: 7, Source file: GoogleDoc.py

Example 8: extract_text_from_lectureDocuments

    def extract_text_from_lectureDocuments(self):
        # pull files from database
        lectureDocumentsObjects = lectureDocuments.objects.filter(extracted=False)

        # loop through modules and pull all text
        for lectureDocumentsObject in lectureDocumentsObjects:
            if lectureDocumentsObject.document:
                print(lectureDocumentsObject.document)
                path_to_file = MEDIA_ROOT + '/' + str(lectureDocumentsObject.document)
                # Decode the ASCII bytes returned by textract so TextBlob receives text
                document_contents = textract.process(path_to_file, encoding='ascii').decode('ascii')

                # create tags from noun_phrases
                # only add tags if none exist
                blobbed = TextBlob(document_contents)
                np = blobbed.noun_phrases
                np = list(set(np))
                np = [s for s in np if s]
                lectureDocumentsObject.tags.clear()
                for item in np:
                    s = ''.join(ch for ch in item if ch not in exclude)
                    print(s)
                    lectureDocumentsObject.tags.add(s)

                # save this string
                lectureDocumentsObject.document_contents = document_contents
                lectureDocumentsObject.extracted = True
                lectureDocumentsObject.save()

Developer ID: NiJeLorg, Project: GAHTC, Lines of code: 27, Source file: import_text_docs.py

Example 9: save

 def save(self, *args, **kwargs):
     super(Document, self).save(*args, **kwargs)
     text = textract.process(self.source_file.url)
     filtered_stems = self.get_filtered_stems(text)
     self.total_word_count = len(filtered_stems)
     self.count_target_words(filtered_stems)
     super(Document, self).save(*args, **kwargs)

Developer ID: pixelrust, Project: devjargon, Lines of code: 7, Source file: models.py

Example 10: extract

def extract(path):
    '''
    Extract full text from PDFs

    :param path: [String] Path to a pdf file downloaded via {fetch}, or another way.

    :return: [str] a string of text

    Usage::

        from pyminer import miner

        # a pdf
        url = "http://www.banglajol.info/index.php/AJMBR/article/viewFile/25509/17126"
        out = miner.fetch(url)
        out.parse()

        # search first, then pass links to fetch
        res = miner.search(filter = {'has_full_text': True, 'license_url': "http://creativecommons.org/licenses/by/4.0"})
        # url = res.links_pdf()[0]
        url = 'http://www.nepjol.info/index.php/JSAN/article/viewFile/13527/10928'
        x = miner.fetch(url)
        miner.extract(x.path)
    '''
    text = textract.process(path)
    return text

Developer ID: sckott, Project: pyminer, Lines of code: 26, Source file: extract.py

Example 11: build_indexes

def build_indexes(files_list, index_file):
    toolbar_width = len(files_list)
    print(toolbar_width)
    sys.stdout.write("[%s]" % (" " * toolbar_width))
    sys.stdout.flush()
    sys.stdout.write("\b" * (toolbar_width+1)) # return to start of line, after '['
    hash_index = {}
    for item in files_list:
        # textract.process returns bytes; decode before splitting into tokens
        text = textract.process(item).decode('utf-8')
        details = re.split(r"[, \t\n:;]", text)
        for i in details:
            if i == "":
                continue
            # Count how often each token occurs in each file
            counts = hash_index.setdefault(i, {})
            counts[item] = counts.get(item, 0) + 1


        # update the bar
        sys.stdout.write("-")
        sys.stdout.flush()

    sys.stdout.write("\n")
    with open(index_file, "w") as fp:
        json.dump(hash_index, fp)

Developer ID: saadasad, Project: highlight, Lines of code: 33, Source file: main.py
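
The index built in this example maps each token to the files it appears in, with a count per file. A compact sketch of the same data structure using the standard library is given here. The name `build_index` and the `documents` argument (a mapping of document name to already extracted text) are assumptions made for illustration, so the sketch can run without textract.

```python
import re
from collections import Counter, defaultdict

# Same separators as the example: comma, space, tab, newline, colon, semicolon
TOKEN_SPLIT = re.compile(r"[, \t\n:;]+")


def build_index(documents):
    """Map each token to a dict of {document name: occurrence count}."""
    index = defaultdict(Counter)
    for name, raw in documents.items():
        # Accept bytes (as returned by an extractor) or already decoded text
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        for token in TOKEN_SPLIT.split(text):
            if token:
                index[token][name] += 1
    # Convert to plain dicts so the result can be passed to json.dump
    return {token: dict(counts) for token, counts in index.items()}
```

`defaultdict(Counter)` removes the need for explicit membership checks when a token or file is seen for the first time.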

Example 12: get_path_details

 def get_path_details(cls, temp_path, image_path):
     """Return the byte sequence and the full text for a given path."""
     byte_sequence = ByteSequence.from_path(temp_path)
     extension = map_mime_to_ext(byte_sequence.mime_type, cls.mime_map)
     logging.debug("Assessing MIME: %s EXTENSION %s SHA1:%s", byte_sequence.mime_type,
                   extension, byte_sequence.sha1)
     full_text = ""
     if extension is not None:
         try:
             logging.debug("Textract for SHA1 %s, extension map val %s",
                           byte_sequence.sha1, extension)
             full_text = process(temp_path, extension=extension, encoding='ascii',
                                 preserveLineBreaks=True)
         except ExtensionNotSupported as _:
             logging.exception("Textract extension not supported for ext %s", extension)
             logging.debug("Image path for file is %s, temp file at %s", image_path, temp_path)
             full_text = "N/A"
         except LookupError as _:
             logging.exception("Lookup error for encoding.")
             logging.debug("Image path for file is %s, temp file at %s", image_path, temp_path)
             full_text = "N/A"
         except UnicodeDecodeError as _:
             logging.exception("UnicodeDecodeError, problem with file encoding")
             logging.debug("Image path for file is %s, temp file at %s", image_path, temp_path)
             full_text = "N/A"
         except Exception:
             logging.exception("Textract UNEXPECTEDLY failed for temp_file.")
             logging.debug("Image path for file is %s, temp file at %s", image_path, temp_path)
             full_text = "N/A"
     return byte_sequence, full_text

Developer ID: BitCurator, Project: bca-webtools, Lines of code: 30, Source file: text_indexer.py

Example 13: indexing

def indexing():
    ana = analysis.StemmingAnalyzer()
    schema = Schema(title=TEXT(analyzer=ana, spelling=True), path=ID(stored=True), content=TEXT)
    ix = create_in("data/pdf_data", schema)
    writer = ix.writer()
    count = 0

    with open('Final_Links/doc_links.txt') as fp, open('data/pdf_data/mytemp/doc_content.txt', 'w+') as f:
        for line in fp:
            count += 1
            # Drop the trailing newline before using the line as a URL
            url = line.strip()
            doc_name = re.search('.*/(.*)', url).group(1)

            try:
                # urllib.request is the Python 3 replacement for urllib2
                response = urllib.request.urlopen(url, timeout=3)
                if int(response.headers['content-length']) > 2475248:
                    continue
                # Save the download in binary mode so the bytes are written unchanged
                with open("data/pdf_data/mytemp/" + doc_name, 'wb') as fil:
                    fil.write(response.read())

                content_text = textract.process('data/pdf_data/mytemp/' + doc_name, encoding='ascii').decode('ascii')
                f.write(content_text)
                writer.add_document(title=url, path=url, content=content_text)
                writer.add_document(title=url, path=url, content=url)
            except Exception as e:
                print("Caught exception e at " + str(e))
                continue
            print(str(count) + " in " + " URL:" + url)

    writer.commit()
    print("Indexing Completed !")

Developer ID: theawless, Project: IITG-Search, Lines of code: 34, Source file: doc_indexing_main_thread.py

Example 14: parse_sentences

def parse_sentences(pdf):
	# textract.process returns bytes; decode so the str pattern below can be applied
	text = textract.process(pdf).decode('utf-8')

	reg = "[.?!]"

	sentences = re.split(reg, text)

	return [s for s in sentences if "\\x" not in s]

Developer ID: denalirao, Project: thesis-Visualizations, Lines of code: 8, Source file: thesis_parser.py
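
The sentence splitting in this example can be tried out without a PDF by working on bytes directly. The following is a minimal sketch; the name `split_sentences` is hypothetical, and it additionally collapses whitespace and drops empty pieces, which the original function does not do.

```python
import re


def split_sentences(raw, encoding="utf-8"):
    """Decode extractor output and split it on '.', '?' and '!'."""
    # errors="replace" keeps going if the bytes are not valid in `encoding`
    text = raw.decode(encoding, errors="replace")
    parts = re.split(r"[.?!]", text)
    # Collapse runs of whitespace (including line breaks) and skip empty parts
    return [" ".join(part.split()) for part in parts if part.strip()]
```

Splitting on punctuation alone is a rough heuristic: abbreviations and decimal numbers will also be split.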

Example 15: compare_python_output

    def compare_python_output(self, filename, expected_filename=None, **kwargs):
        if expected_filename is None:
            expected_filename = self.get_expected_filename(filename, **kwargs)

        import textract
        result = textract.process(filename, **kwargs)
        # Read the expected output as bytes so it compares equal to process()'s bytes result
        with open(expected_filename, 'rb') as stream:
            self.assertEqual(result, stream.read())

Developer ID: AlinaKay, Project: textract, Lines of code: 8, Source file: base.py

Example 16: test_standardized_text_python

 def test_standardized_text_python(self):
     """Make sure standardized text matches from python"""
     import textract
     result = textract.process(self.standardized_text_filename)
     self.assertEqual(
         b''.join(result.split()),
         self.get_standardized_text(),
     )

Developer ID: Elthas17, Project: textract, Lines of code: 8, Source file: base.py

Example 17: detectar

def detectar(f):
    texto = textract.process(f)
    texto = texto.decode('utf-8')
    _texto = textblob.TextBlob(texto)
    try:
        lang = _texto.detect_language()
        return lang
    except TranslatorError:
        return None

Developer ID: lmorillas, Project: detectar-idioma-materiales-arasaac, Lines of code: 9, Source file: detectar.py

Example 18: test_standardized_text_python

 def test_standardized_text_python(self):
     """Make sure standardized text matches from python"""
     import textract
     result = textract.process(self.standardized_text_filename)
     self.assertEqual(
         six.b('').join(result.split()),
         self.get_standardized_text(),
         "standardized text fails for %s" % self.extension,
     )

Developer ID: barrust, Project: textract, Lines of code: 9, Source file: base.py

Example 19: extract_all

 def extract_all(self, src, maxpages=0):
     if '.pdf' in src:
         try:
             start = time()
             text = self.extract(src, maxpages=maxpages)
             print("case 1 elapsed_time {}s".format(time() - start))
         except Exception:
             start = time()
             text = textract.process(src)
             print("case 2 elapsed_time {}s".format(time() - start))

Developer ID: webeng, Project: feature_engineering, Lines of code: 10, Source file: file2text.py

Example 20: get_recommendations_file

def get_recommendations_file(pdf_file_path):
        if pdf_file_path.endswith('pdf') or pdf_file_path.endswith('PDF'):
            text = textract.process(pdf_file_path, method="pdfminer")
        elif pdf_file_path.endswith('html') or pdf_file_path.endswith('htm'):
            text = textract.process(pdf_file_path, method="beautifulsoup4")
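        elif pdf_file_path.endswith('txt'):
            # Read bytes so the later decode('utf8') works like the textract branches
            with open(pdf_file_path, 'rb') as file:
                text = file.read()
        else:
            # Unsupported extension: no text to send to the recommender
            return []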
        if text.isspace():
            log = {
                'file_name': pdf_file_path.encode('utf-8'),
                'error': 'Failed PDF to text transformation in recommendation process',
                'exception': '',
                'data': ''
            }
            db = DBConnect()
            db.insert_log(log)
            return []
        text = unidecode(text.decode('utf8'))
        text = ' '.join(text.split())
        # Start the excerpt at the first "abstract" in any letter case, or at 0 if absent
        abstract_index = text.lower().find('abstract')
        abstract_index = 0 if abstract_index < 0 else abstract_index
        text = text[abstract_index:abstract_index+500] if len(text) > 500 else text
        post_data = dict(apikey=settings.BIOPORTAL_API_KEY, input=text, include='ontologies',
                         display_links='false', output_type='2', display_context='false',
                         wc='0.15', ws='1.0', wa='1.0', wd='0.5')
        try:
            response = requests.post(settings.RECOMMENDER_URL, post_data)
            json_results = json.loads(response.text)
            best_ontology_set = json_results[0]['ontologies'] if len(json_results) > 0 else []
            return [{'acronym': ontology['acronym'], 'id': ontology['@id']} for ontology in best_ontology_set]
        except (ValueError, IndexError, KeyError) as e:
            log = {
                'file_name': '',
                'error': 'Bad response from Bioportal Recommender:',
                'exception': str(e),
                'data': ''
            }
            db = DBConnect()
            db.insert_log(log)
            return []

Developer ID: ficolo, Project: corpora-char-cli, Lines of code: 43, Source file: bioportal_annotator.py
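
The excerpt step in this example (find the abstract, then keep about 500 characters) can be isolated into a small helper. This is a minimal sketch; the name `abstract_excerpt` and its parameters are hypothetical, and it assumes the text is already decoded and ASCII-like (as it is after unidecode), so that lowercasing does not change character positions.

```python
def abstract_excerpt(text, length=500, marker="abstract"):
    """Return up to `length` characters starting at the first `marker`.

    The search ignores letter case. If the marker is absent, the excerpt
    starts at the beginning of the text.
    """
    # Normalize whitespace first so the index refers to the final string
    normalized = " ".join(text.split())
    start = normalized.lower().find(marker.lower())
    if start < 0:
        start = 0
    return normalized[start:start + length]
```

Note that `str.find` returns -1 when nothing is found, so the results of several searches cannot simply be added together; a single case-insensitive search avoids that pitfall.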

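Note: The textract.process examples in this article were compiled by 纯净天空 from source-code and documentation platforms such as GitHub and MSDocs. The snippets were selected from open-source projects, and copyright of the source code belongs to the original authors. For distribution and use, refer to the License of the corresponding project. Do not reproduce without permission.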
