KOSHIK: A large-scale distributed computing framework for NLP

Authors

Summary, in English

In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We used the Hadoop distributed computing infrastructure to build this framework as it enables KOSHIK to easily scale by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to directly query the documents. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and the parsing results on the English and Swedish editions of Wikipedia.
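
The layered annotation idea in the summary can be illustrated with a minimal sketch. It is not KOSHIK's actual API or schema: the names `Annotation`, `Document`, `add_layer`, and `tokenize` are hypothetical, and plain in-memory Python objects stand in for the Avro-serialized records. The sketch assumes stand-off annotations, where each layer stores character offsets into the text, so new layers can be added while the original text stays untouched.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: offsets into the text plus a label (hypothetical model)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


@dataclass
class Document:
    """Holds the original text and any number of named annotation layers."""
    text: str
    layers: Dict[str, List[Annotation]] = field(default_factory=dict)

    def add_layer(self, name: str, annotations: List[Annotation]) -> None:
        # Layers are only ever added; the text itself is never rewritten.
        if name in self.layers:
            raise ValueError(f"layer {name!r} already exists")
        self.layers[name] = list(annotations)

    def covered_text(self, ann: Annotation) -> str:
        # Recover the annotated span from the unmodified original text.
        return self.text[ann.start:ann.end]


def tokenize(doc: Document) -> List[Annotation]:
    """Toy tokenizer (an assumption for illustration): one annotation per word."""
    return [Annotation(m.start(), m.end(), "token")
            for m in re.finditer(r"\w+", doc.text)]
```

A later processing step, such as a tagger or parser, would call `add_layer` with its own layer name and reuse the same offsets, which is what lets independent algorithms enrich a document incrementally.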

Publication year

2014

Language

English

Pages

464-470

Publication/Journal/Series

3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014)

Document type

Conference paper

Publisher

SciTePress

Subject

  • Computer Science

Conference name

3rd International Conference on Pattern Recognition Applications and Methods (ICPRAM 2014)

Conference date

2014-03-06 - 2014-03-08

Conference place

Angers, France

Status

Published

ISBN/ISSN/Other

  • ISBN: 978-989-758-018-5