pdf-gold-digger

PDF information extraction library based on pdf.js and Node.js.

Install

npm install -g szczepano/pdf-gold-digger

Usage

pdfdig -i some_file.pdf
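
The output format can be selected with the -f option listed under Supports. The following is a minimal sketch, not part of the project: the helper name dig_pdf and its exit codes are hypothetical, and it relies only on the -i and -f flags mentioned in this README.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with pdf-gold-digger): runs pdfdig on one
# file in the requested format, using only the -i and -f flags from this README.
dig_pdf() {
  input="$1"
  format="${2:-text}"   # text is the documented default

  # Only the two formats listed under Supports are accepted.
  case "$format" in
    text|json) ;;
    *) echo "unsupported format: $format" >&2; return 2 ;;
  esac

  # Fail early with a clear message instead of passing a bad path to pdfdig.
  if [ ! -f "$input" ]; then
    echo "no such file: $input" >&2
    return 1
  fi

  pdfdig -i "$input" -f "$format"
}

# Usage:
#   dig_pdf some_file.pdf         # text output
#   dig_pdf some_file.pdf json    # JSON output
```

Validating the format before calling pdfdig keeps typos such as "-f jsno" from reaching the tool, whose behavior for unknown formats is not documented here.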

Documentation URL

pdf-gold-digger

or test by cloning the repository

git clone https://github.com/vane/pdf-gold-digger
then run
sh demo.sh
and see the results in the out directory

Work in progress

Supports:

  • extract text
    • separate each page
    • separate each line
    • separate font information
    • bounding box position (probably buggy now)
  • extract images
  • output to text with -f text (default)
  • output to json with -f json

TODO:

  • specify output directory
  • output to xml format
  • output to json format
  • extract images to files
  • extract font
  • extract tables
  • advanced font information
  • extract forms
  • extract drawings