
pdf-gold-digger

PDF information extraction library built on pdf.js and Node.js, with text, JSON, XML and HTML output formats.


Install

npm install -g pdf-gold-digger

Usage

pdfdig -i some_file.pdf

Available commands

pdfdig -h
ex. pdfdig -i input-file -o output_directory -f json
  
  --input  or  -i   pdf file location (required)
  --output or  -o   output directory (optional - default "out")
  --debug  or  -d   show debug information (optional - default "false")
  --format or  -f   format (optional - default "text") - ("text,json,xml,html") 
  --font   or  -t   extract fonts as ttf files (optional)
  --help   or  -h   display this help message
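
The options above can be combined to produce every output format for the same input. The following is a minimal sketch, not part of the project: `pdfdig_cmd` is a hypothetical helper, and the `out-<format>` directory naming is an assumption made for illustration. It only prints the commands, so it can be checked without pdfdig installed.

```shell
#!/bin/sh
# Hypothetical helper: prints the pdfdig command line for one input and format.
# The "out-<format>" directory naming is an assumption, not a project convention.
pdfdig_cmd() {
  printf 'pdfdig -i %s -o out-%s -f %s\n' "$1" "$2" "$2"
}

# Print one command per format listed in the help text.
for format in text json xml html; do
  pdfdig_cmd some_file.pdf "$format"
done
```

To run the commands instead of printing them, the loop output can be piped to `sh`, provided the file name contains no spaces or shell metacharacters.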

Advanced usage

git clone https://github.com/vane/pdf-gold-digger
sh demo.sh

Then check the results in the out directory.
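
A quick way to inspect what the demo produced is to list the files under the output directory. This is a small sketch under the assumption that results land in `out`, the default named in the help text; `list_results` is a hypothetical helper, and the exact files written by demo.sh are not documented here.

```shell
#!/bin/sh
# Hypothetical helper: list every file under a results directory, sorted.
# Defaults to "out", the default output directory named in the help text.
list_results() {
  dir="${1:-out}"
  if [ ! -d "$dir" ]; then
    echo "no results directory: $dir" >&2
    return 1
  fi
  find "$dir" -type f | sort
}

# Do not fail if the demo has not been run yet.
list_results out || true
```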

Documentation

pdf-gold-digger

Features:

  • extract text
    • separate each page
    • separate each line
    • separate font information
  • extract images
  • output formats
    • text: -f text (default)
    • json: -f json
    • xml: -f xml
    • html: -f html
  • specify output directory

TODO:

  • load pdf from remote location
    • from url
  • output to markdown format
  • output to zip
  • extract tables
  • extract forms
  • extract drawings