Content Analysis System

Goal

Our goal is to create an open-source framework for mining very large collections of scientific publications. The Content Analysis System (CoAnSys) will handle tens of millions of bibliographic records on a modest Hadoop cluster.

Under the Hood

We employ state-of-the-art machine learning techniques for document deduplication, author name disambiguation, citation matching, keyword extraction, and document analysis (similarity and classification). All of this runs on Apache Hadoop and is implemented with Java, Scoobi, Pig, and Oozie. CoAnSys uses HBase and HDFS for data storage.

Software

The source code of CoAnSys is available at https://github.com/CeON/CoAnSys. Supporting projects are available from the CeON repository on GitHub. The software is released under the terms of the GNU Affero General Public License.

Publications