Goal
Our goal is to create an open-source framework for mining very large collections of scientific publications. The Content Analysis System (CoAnSys) is designed to handle tens of millions of bibliographic records on a modest Hadoop cluster.
Under the Hood
We employ state-of-the-art machine learning techniques for document deduplication, author name disambiguation, citation matching, keyword extraction, and document analysis (similarity and classification). All of these components run on Apache Hadoop and are implemented with Java, Scoobi, Pig, and Oozie. CoAnSys uses HBase and HDFS for data storage.
Software
The source code of CoAnSys is available at https://github.com/CeON/CoAnSys. Supporting projects are available from the CeON repository on GitHub. The software is released under the terms of the GNU Affero General Public License.
Publications
- P. J. Dendek, A. Czeczko, M. Fedoryszak, A. Kawa, P. Wendykier, and Ł. Bolikowski, “Chrum - the Tool for Convenient Generation of Apache Oozie Workflows,” to appear in Intelligent Tools for Building a Scientific Information Platform, R. Bembenik, L. Skonieczny, H. Rybinski, M. Kryszkiewicz, and M. Niezgodka, Eds. Springer, 2014.
- P. J. Dendek, A. Czeczko, M. Fedoryszak, A. Kawa, P. Wendykier, and Ł. Bolikowski, “Content Analysis of Scientific Articles in Apache Hadoop Ecosystem,” to appear in Intelligent Tools for Building a Scientific Information Platform, R. Bembenik, L. Skonieczny, H. Rybinski, M. Kryszkiewicz, and M. Niezgodka, Eds. Springer, 2014.
- M. Fedoryszak, D. Tkaczyk, and Ł. Bolikowski, “Large Scale Citation Matching Using Apache Hadoop,” in Research and Advanced Technology for Digital Libraries, Springer Berlin Heidelberg, 2013, vol. 8092, pp. 362–365.
- P. Wendykier, “Deduplication of Metadata Harvested from Open Archives Initiative Repositories,” in Mining the Digital Information Networks, IOS Press, 2013, pp. 57–66.
- P. J. Dendek, M. Wojewódzki, and Ł. Bolikowski, “Author disambiguation in the YADDA2 software platform,” in Intelligent Tools for Building a Scientific Information Platform, R. Bembenik, L. Skonieczny, H. Rybinski, M. Kryszkiewicz, and M. Niezgodka, Eds. Springer, 2013, pp. 131–143.
- M. Fedoryszak, Ł. Bolikowski, D. Tkaczyk, and K. Wojciechowski, “Methodology for evaluating citation parsing and matching,” in Intelligent Tools for Building a Scientific Information Platform, R. Bembenik, L. Skonieczny, H. Rybinski, M. Kryszkiewicz, and M. Niezgodka, Eds. Springer, 2013, pp. 145–154.
- A. Kawa, Ł. Bolikowski, A. Czeczko, P. J. Dendek, and D. Tkaczyk, “Data model for analysis of scholarly documents in the MapReduce paradigm,” in Intelligent Tools for Building a Scientific Information Platform, R. Bembenik, L. Skonieczny, H. Rybinski, M. Kryszkiewicz, and M. Niezgodka, Eds. Springer, 2013, pp. 155–169.
- M. Łukasik, T. Kuśmierczyk, Ł. Bolikowski, and H. S. Nguyen, “Hierarchical, multi-label classification of scholarly publications: modifications of ML-KNN algorithm,” in Intelligent Tools for Building a Scientific Information Platform, R. Bembenik, L. Skonieczny, H. Rybinski, M. Kryszkiewicz, and M. Niezgodka, Eds. Springer, 2013, pp. 343–363.
- P. J. Dendek, Ł. Bolikowski, and M. Łukasik, “Evaluation of Features for Author Name Disambiguation Using Linear Support Vector Machines,” in Proceedings of the 10th IAPR International Workshop on Document Analysis Systems, 2012, pp. 440–444.
- D. Tkaczyk, Ł. Bolikowski, A. Czeczko, and K. Rusek, “A modular metadata extraction system for born-digital articles,” in Proceedings of the 10th IAPR International Workshop on Document Analysis Systems, 2012, pp. 11–16.
- Ł. Bolikowski and P. J. Dendek, “Towards a Flexible Author Name Disambiguation Framework,” in Towards a Digital Mathematics Library, 2011, pp. 27–37.
- D. Tkaczyk and Ł. Bolikowski, “Workflow of metadata extraction from retro-born-digital documents,” in Towards a Digital Mathematics Library, 2011, pp. 39–44.