Extract data from PDF
2019-07-23 05:00:53 +02:00
bin Fix missing pdfdig shebang 2019-07-23 00:26:04 +02:00
src Change output directory structure / Closes #2 2019-07-23 05:00:53 +02:00
.esdoc.json Documentation generate scripts 2019-07-23 01:19:17 +02:00
.gitignore Add demo.sh and test.sh for automating stuff 2019-07-23 01:59:55 +02:00
demo.sh Extract images with VisitorImage #2 2019-07-23 04:53:17 +02:00
gd.js Change output directory structure / Closes #2 2019-07-23 05:00:53 +02:00
LICENSE Add LICENSE, README update package.json with valid repository url 2019-07-22 20:20:37 +02:00
package.json Documentation generate scripts 2019-07-23 01:19:17 +02:00
README.md Extract images with VisitorImage #2 2019-07-23 04:53:17 +02:00
test.sh Extract images with VisitorImage #2 2019-07-23 04:53:17 +02:00

pdf-gold-digger

PDF information extraction library based on pdf.js and Node.js.

Install

npm install -g szczepano/pdf-gold-digger

Usage

pdfdig -i some_file.pdf

or test by cloning the repository:

git clone https://github.com/vane/pdf-gold-digger
then, from the cloned directory, run
sh demo.sh
and see the results in the out directory

Work in progress

Supports:

  • extract text
    • separate each page
    • separate each line
    • separate font information
    • bounding box position (probably buggy for now)
  • output to text: -o text (default)
  • output to json: -o json

TODO:

  • specify output directory
  • output to xml format
  • output to json format
  • extract images to files
  • extract font
  • extract tables
  • advanced font information
  • extract forms
  • extract drawings