pdf-gold-digger

PDF information extraction library based on pdf.js and Node.js.

Work in progress

Usage

git clone https://github.com/vane/pdf-gold-digger
cd pdf-gold-digger
node gd.js -f some_file.pdf

Supports:

  • extract text
    • separate each page
    • separate each line
    • separate font information
    • bounding box position
  • output to text with -o text (default)
  • output to json with -o json
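
The options above can also be driven from another Node.js script. The following is only a rough sketch: it uses the `-f` and `-o` flags shown in this README, but the helper names `buildArgs` and `runGoldDigger` are hypothetical, and it assumes the script is run from the repository root and that `gd.js` prints its result to standard output.

```javascript
const { spawnSync } = require('node:child_process');

// Output formats listed under "Supports" in this README.
const FORMATS = ['text', 'json'];

// Build the argument list for `node gd.js -f <file> -o <format>`.
// Hypothetical helper, not part of pdf-gold-digger itself.
function buildArgs(pdfPath, format = 'text') {
  if (!FORMATS.includes(format)) {
    throw new Error(`unsupported output format: ${format}`);
  }
  return ['gd.js', '-f', pdfPath, '-o', format];
}

// Run the CLI and return whatever it wrote to stdout.
// Assumption: the current working directory is the cloned repository
// and the extracted data is printed to stdout.
function runGoldDigger(pdfPath, format = 'text') {
  const result = spawnSync('node', buildArgs(pdfPath, format), {
    encoding: 'utf8',
  });
  if (result.error) {
    throw result.error;
  }
  if (result.status !== 0) {
    throw new Error(`gd.js exited with status ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}
```

For example, `runGoldDigger('some_file.pdf', 'json')` would run `node gd.js -f some_file.pdf -o json` and return the printed output as a string, which the caller could then pass to `JSON.parse` if the output is indeed a single JSON document.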

TODO:

  • specify output format and output directory
  • output to xml format
  • output to json format
  • extract images to files
  • extract font
  • extract tables
  • advanced font information
  • extract forms
  • extract drawings