COVID-19 Open Research Dataset

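This dataset is intended to mobilize researchers to apply recent advances in natural language processing and generate new insights in support of the fight against this infectious disease.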

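Note

Microsoft provides Azure Open Datasets on an "as is" basis. Microsoft makes no warranties (express or implied), guarantees, or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses resulting from your use of the datasets, including direct, consequential, special, indirect, incidental, or punitive damages or losses.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.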

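When including CORD-19 data in a publication or redistribution, please cite the dataset as follows:

In the bibliography:

In text: (CORD-19, 2020)

If you have any questions about this dataset, contact partnerships@allenai.org.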

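This notebook has two goals:

Dependencies: this notebook requires the following libraries:

The CORD-19 data is stored in the covid19temp container. Below is the file structure within the container, along with an example file.

The CORD-19 dataset ships with a metadata.csv file that records basic information about every paper available in the dataset. It is a good place to start exploring!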

    # container housing CORD-19 data
    container_name = "covid19temp"

    # download metadata.csv
    # (blob_service is assumed to be an already-configured blob client;
    # its setup is not shown in this section)
    metadata_filename = 'metadata.csv'
    blob_service.get_blob_to_path(
        container_name=container_name,
        blob_name=metadata_filename,
        file_path=metadata_filename
    )

    import pandas as pd

    # read metadata.csv into a dataframe
    metadata = pd.read_csv(metadata_filename)
    metadata.head(3)

That is a lot to take in at first glance, so let's trim it down a bit.

    # choose a random example with pdf parse available
    metadata_with_pdf_parse = metadata[metadata['has_pdf_parse']]
    example_entry = metadata_with_pdf_parse.iloc[42]

    # construct path to blob containing full text
    blob_name = '{0}/pdf_json/{1}.json'.format(
        example_entry['full_text_file'], example_entry['sha']
    )  # note the repetition in the path
    print("Full text blob for this entry:")
    print(blob_name)

We can now read the JSON content associated with this blob, as follows.

    import json

    blob_as_json_string = blob_service.get_blob_to_text(
        container_name=container_name, blob_name=blob_name
    )
    data = json.loads(blob_as_json_string.content)

    # in addition to the body text, the metadata is also stored
    # within the individual json files
    print("Keys within data:", ', '.join(data.keys()))

For this example, we are interested in body_text, which stores the text data as follows:
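As a rough illustration of that layout, the sketch below builds a small hand-written record and pulls the paragraph text out of it. Only the `body_text` list and each paragraph's `text` key are confirmed by the code in this notebook; the `section` key and all sample values are assumptions made for illustration, not real data from the dataset.

```python
import json

# Hand-written stand-in for one parsed paper (hypothetical values).
sample_json = """
{
  "paper_id": "example-id",
  "body_text": [
    {"section": "Introduction", "text": "First paragraph. It has two sentences."},
    {"section": "Methods", "text": "Second paragraph."}
  ]
}
"""

sample = json.loads(sample_json)

# body_text is a list of paragraph dicts; the prose lives under 'text'
paragraph_texts = [paragraph["text"] for paragraph in sample["body_text"]]
full_text = "\n".join(paragraph_texts)

print(len(paragraph_texts), "paragraphs")
print(full_text)
```

Keeping paragraphs as separate strings, rather than joining them immediately, makes it easier to run a sentence tokenizer over each one in turn.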

    from nltk.tokenize import sent_tokenize

    # the text itself lives under 'body_text'
    text = data['body_text']

    # many NLP tasks play nicely with a list of sentences
    sentences = []
    for paragraph in text:
        sentences.extend(sent_tokenize(paragraph['text']))

    print("An example sentence:", sentences[0])

PDF vs. PMC XML parse

In the example above, we looked at a case with has_pdf_parse == True. There, the blob file path took the following form:

    '<full_text_file>/pdf_json/<sha>.json'

Alternatively, for cases with has_pmc_xml_parse == True, the following form is used:

    '<full_text_file>/pmc_json/<pmcid>.xml.json'

For example:
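The two path formats can be captured in one small helper. This is only a sketch: `full_text_blob_name` is a hypothetical function name and the sample entry values are made up. The key names (`full_text_file`, `sha`, `pmcid`, `has_pdf_parse`, `has_pmc_xml_parse`) are the ones used elsewhere in this notebook.

```python
def full_text_blob_name(entry, parse="pdf"):
    """Return the blob name for the requested parse, or None if unavailable.

    `entry` can be any mapping with the metadata keys used in this notebook,
    for example a plain dict or a row of the metadata dataframe.
    """
    folder = entry["full_text_file"]
    if parse == "pdf" and entry["has_pdf_parse"]:
        return "{0}/pdf_json/{1}.json".format(folder, entry["sha"])
    if parse == "pmc" and entry["has_pmc_xml_parse"]:
        return "{0}/pmc_json/{1}.xml.json".format(folder, entry["pmcid"])
    return None


# Made-up entry for illustration only.
example = {
    "full_text_file": "comm_use_subset",
    "sha": "0123abcd",
    "pmcid": "PMC0000000",
    "has_pdf_parse": True,
    "has_pmc_xml_parse": False,
}

print(full_text_blob_name(example, "pdf"))
print(full_text_blob_name(example, "pmc"))
```

Returning None when a parse is unavailable lets the caller decide whether to fall back to the other format or skip the entry.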

    # get and sort list of available blobs
    blobs = blob_service.list_blobs(container_name)
    sorted_blobs = sorted(list(blobs), key=lambda e: e.name, reverse=True)

Now we can iterate over the blobs directly. For example, let's count the number of JSON files available.

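    # we can now iterate directly through the blobs
    count = 0
    for blob in sorted_blobs:
        if blob.name.endswith(".json"):
            count += 1
    print("There are {} many json files".format(count))

    There are 59784 many json files

Appendix

Data quality issues

This is a large dataset that, for obvious reasons, was put together in a hurry! Below are some data quality issues we have observed.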

We have observed that, in some cases, a given entry has multiple shas.
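One possible way to work with such entries is to split the field into individual hashes. The sketch below assumes the hashes are joined with a semicolon, which is an assumption about the file's convention and not something this notebook states. The hash values are made up, and `split_shas` is a hypothetical helper.

```python
def split_shas(sha_field):
    """Split a 'sha' field into a list of individual hashes."""
    if not isinstance(sha_field, str):
        # missing values (for example NaN read by pandas) yield no hashes
        return []
    # ASSUMPTION: multiple hashes are separated by ';'
    return [part.strip() for part in sha_field.split(";") if part.strip()]


single = "a" * 40
multiple = "a" * 40 + "; " + "b" * 40

print(len(split_shas(single)))
print(len(split_shas(multiple)))
print(split_shas(float("nan")))
```

A SHA-1 hash is 40 hexadecimal characters long, which is why a field longer than 40 characters signals more than one hash.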

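    metadata_multiple_shas = metadata[metadata['sha'].str.len() > 40]
    print("There are {} many entries with multiple shas".format(len(metadata_multiple_shas)))
    metadata_multiple_shas.head(3)

    There are 1999 many entries with multiple shas

Layout of the container

Here we use a simple regular expression to explore the file structure of the container, in case it is updated in the future.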
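A regex-based summary of this kind could look like the sketch below. It replaces the file stem in each blob name with a placeholder so that blobs sharing a layout collapse into one pattern. The blob names are made-up examples and `summarize_layout` is a hypothetical helper, not the notebook's original code.

```python
import re
from collections import Counter

# Matches the stem of the final path component: a run of characters
# without '/' or '.', followed by an extension that reaches the end
# of the name without crossing another '/'.
FILE_STEM = re.compile(r"[^/.]+(?=\.[^/]*$)")


def summarize_layout(blob_names):
    """Count blobs per path pattern, replacing each file stem with '*'."""
    patterns = Counter()
    for name in blob_names:
        patterns[FILE_STEM.sub("*", name, count=1)] += 1
    return patterns


# Made-up blob names for illustration only.
blob_names = [
    "metadata.csv",
    "comm_use_subset/pdf_json/aaaa.json",
    "comm_use_subset/pdf_json/bbbb.json",
    "comm_use_subset/pmc_json/PMC0000001.xml.json",
]

layout = summarize_layout(blob_names)
for pattern, count in sorted(layout.items()):
    print(count, pattern)
```

With real data, the names would come from the `name` attribute of each listed blob instead of a hard-coded list.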

If NLTK does not have the punkt package, you will need to run:

Use mount.start() and mount.stop(), or alternatively use with mount(): to manage the context.

    import os

    COVID_DIR = '/covid19temp'
    path = mount.mount_point + COVID_DIR

    with mount:
        print(os.listdir(path))

    ['antiviral_with_properties_compressed.sdf', 'biorxiv_medrxiv', 'biorxiv_medrxiv_compressed.tar.gz', 'comm_use_subset', 'comm_use_subset_compressed.tar.gz', 'custom_license', 'custom_license_compressed.tar.gz', 'metadata.csv', 'noncomm_use_subset', 'noncomm_use_subset_compressed.tar.gz']

Below is the file structure within the CORD-19 dataset, along with an example file.

    # choose a random example with pdf parse available
    metadata_with_pdf_parse = metadata[metadata['has_pdf_parse']]
    example_entry = metadata_with_pdf_parse.iloc[42]

    # construct path to the file containing full text
    filepath = '{0}/{1}/pdf_json/{2}.json'.format(
        path, example_entry['full_text_file'], example_entry['sha']
    )
    print("Full text filepath:")
    print(filepath)

We can now read the JSON content associated with this file, as follows.

    import json

    try:
        with open(filepath, 'r') as f:
            data = json.load(f)
    except FileNotFoundError as e:
        # in case the mount context has been closed
        mount.start()
        with open(filepath, 'r') as f:
            data = json.load(f)

    # in addition to the body text, the metadata is also stored
    # within the individual json files
    print("Keys within data:", ', '.join(data.keys()))

    Keys within data: paper_id, metadata, abstract, body_text, bib_entries, ref_entries, back_matter

For this example, we are interested in body_text, which stores the text data as follows:

    from nltk.tokenize import sent_tokenize

    # choose a random example with pmc parse available
    metadata_with_pmc_parse = metadata[metadata['has_pmc_xml_parse']]
    example_entry = metadata_with_pmc_parse.iloc[42]

    # construct path to the file containing full text
    filename = '{0}/pmc_json/{1}.xml.json'.format(
        example_entry['full_text_file'], example_entry['pmcid']
    )  # note the repetition in the path
    print("Path to file: {}\n".format(filename))

    # path already includes COVID_DIR, so join it with the relative filename
    with open(os.path.join(path, filename), 'r') as f:
        data = json.load(f)

    # the text itself lives under 'body_text'
    text = data['body_text']

    # many NLP tasks play nicely with a list of sentences
    sentences = []
    for paragraph in text:
        sentences.extend(sent_tokenize(paragraph['text']))

    print("An example sentence:", sentences[0])
