Open Source Projects for Extracting Data and Metadata from Files & the Web

I’ve been looking around for open-source libraries (preferably in Java, but not required) for extracting data and metadata from common file formats and Web formats. One project that looks very promising is Aperture. Do you know of any others that are ready or almost ready for prime-time use? Please let me know in the comments! Thanks.

5 Responses to Open Source Projects for Extracting Data and Metadata from Files & the Web

  1. Nova,
    The Sponger [1] component of the Open Source edition of Virtuoso is a built-in middleware layer for extracting metadata and re-purposing it as RDF instance data for appropriately matched ontologies and schemas. This is how we make almost any Web Information Resource a bona fide RDF Data Source (on the fly). The architecture of the Sponger is such that you can slot in 3rd-party extractors. Thus, in our case, we are able to integrate the likes of Aperture.
    To see all of this in action simply look at:
    1. http://www.openlinksw.com/blog/~kidehen/?id=1172
    2. http://demo.openlinksw.com/DAV/JS/rdfbrowser/index.html (use this blog post as the Data Source URI for instance)
    3. http://demo.openlinksw.com/isparql – visual SPARQL Query Tools (just put this blog post as the Data Source URI as per item 1)
    The examples above are interacting with the Sponger’s REST interface.

  2. An excellent list of RDFizers is at:
    http://simile.mit.edu/wiki/RDFizers
    Another list (with some repeats):
    http://esw.w3.org/topic/ConverterToRdf
    Aperture has a number of these built in, and it is easy to add additional types.
    I have also looked for other frameworks but have found none other than Aperture. It looks to be the only major one.

  3. bill says:

    Take a look at http://hul.harvard.edu/jhove/
    Mainly designed for file format validation, but also does a fair bit of metadata extraction.
    It’s LGPL

  4. Mike Pittaro says:

    SnapLogic (http://www.snaplogic.org) is an Open Source data integration framework implemented in Python.
    We combine data access and data transformation using a pipeline approach.
    Databases, files, and RSS read/write are currently available; other sources and targets can easily be added.

  5. Prateek says:

    For extracting data from Web sites:
    http://simile.mit.edu/wiki/Piggy_Bank