Ingestum: A libre NLP document ingestion library
Many NLP projects that depend upon the analysis of documents are impaired by the difficulty of transforming source material into a computer-readable format. For example, PDF files are designed for human consumption but can look like a bag full of words to a computer. To address this problem, engineers at Sorcero developed Ingestum, a library that "devours" content sources and outputs a format that can be used for additional processing.
Original story by LibrePlanet 2021 and the Free Software Foundation. Published 2021-03-21.
This work is available under the Creative Commons Attribution-ShareAlike license.
Ingestum has four main concepts:
- Sources - common content sources that feed the ingestion process, e.g. PDF, HTML, PNG, WAV, Twitter, and email.
- Documents - the intermediary and final states of a source during the ingestion process.
- Transformers - transformation functions that can be applied to a document, e.g. removing hyphens from a text document.
- Conditionals - logical conditions that can be used to modify the behavior of a transformer.
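The relationship between these four concepts can be illustrated with a minimal Python sketch. This is not Ingestum's actual API: every name below (`TextDocument`, `load_text_source`, `remove_hyphens`, `contains`, `only_if`, `run_pipeline`) is hypothetical and chosen only for illustration. It assumes a plain-text source and models a conditional as a predicate that decides whether a transformer runs.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class TextDocument:
    """Hypothetical document: an intermediate or final state of a source."""
    content: str


# Hypothetical type aliases, not Ingestum types.
Transformer = Callable[[TextDocument], TextDocument]
Conditional = Callable[[TextDocument], bool]


def load_text_source(raw: str) -> TextDocument:
    """Stand-in for a source: wrap raw text in an initial document."""
    return TextDocument(content=raw)


def remove_hyphens(doc: TextDocument) -> TextDocument:
    """Transformer: rejoin words that were hyphenated across a line break."""
    return TextDocument(content=re.sub(r"(\w)-\n(\w)", r"\1\2", doc.content))


def collapse_whitespace(doc: TextDocument) -> TextDocument:
    """Transformer: replace every run of whitespace with a single space."""
    return TextDocument(content=" ".join(doc.content.split()))


def contains(text: str) -> Conditional:
    """Conditional: true when the document content includes `text`."""
    return lambda doc: text in doc.content


def only_if(condition: Conditional, transformer: Transformer) -> Transformer:
    """Modify a transformer so it runs only when the conditional holds."""
    def guarded(doc: TextDocument) -> TextDocument:
        return transformer(doc) if condition(doc) else doc
    return guarded


def run_pipeline(doc: TextDocument, transformers: Iterable[Transformer]) -> TextDocument:
    """Apply each transformer in order, passing documents from one to the next."""
    for transformer in transformers:
        doc = transformer(doc)
    return doc


source = load_text_source("Docu-\nment inges-\ntion  made simple")
pipeline = [only_if(contains("-\n"), remove_hyphens), collapse_whitespace]
result = run_pipeline(source, pipeline)
print(result.content)  # prints "Document ingestion made simple"
```

In this sketch each transformer returns a new document rather than mutating its input, so every intermediate state of the source remains available for inspection.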
"Walter Bender is founder of Sugar Labs and maintainer of Music Blocks, free software projects in support of education.
Over the past ten years, Martín Abente Lahaya has contributed to projects that involve free software in areas such as the GNU/Linux desktop and education.
Juan Pablo Ugarte is a Glade developer/maintainer and a GNOME contributor and foundation member since 2005."