pdf-gold-digger

PDF information extraction library based on pdf.js and Node.js.

Install

npm install -g szczepano/pdf-gold-digger

Usage

pdfdig -i some_file.pdf

or test it by cloning the repository:

git clone https://github.com/vane/pdf-gold-digger

then run

cd pdf-gold-digger
sh demo.sh

and see the results in the out directory.

Work in progress

Supports:

  • extract text
    • separate each page
    • separate each line
    • separate font information
    • bounding box position (probably buggy now)
  • output to text: -o text (default)
  • output to json: -o json
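
The flags listed above can presumably be combined with -i in a single call. The sketch below is only an illustration: dig_pdf is a hypothetical helper name, not part of pdf-gold-digger, and it assumes that -i and -o may be passed together. It uses only the two documented formats and rejects anything else.

```sh
# Hypothetical wrapper around pdfdig (the name dig_pdf is an assumption).
# Uses only the -i and -o flags documented in this README.
dig_pdf() {
  input="$1"
  format="${2:-text}"   # text is the documented default
  case "$format" in
    text|json) ;;       # the two formats listed under "Supports"
    *) echo "unsupported format: $format" >&2; return 2 ;;
  esac
  pdfdig -i "$input" -o "$format"
}
```

For example, dig_pdf some_file.pdf json would run pdfdig -i some_file.pdf -o json, while dig_pdf some_file.pdf falls back to text output.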

TODO:

  • specify output directory
  • output to xml format
  • output to json format
  • extract images to files
  • extract font
  • extract tables
  • advanced font information
  • extract forms
  • extract drawings