Extract data from PDF

extract-data-from-pdf nodejs pdf pdf-converter pdf-gold-digger pdf-text-extract pdf-to-html pdfjs


Michal Szczepanski ab4e0ffdae Update README.md		2019-07-23 01:22:16 +02:00
bin	Fix missing pdfdig shebang	2019-07-23 00:26:04 +02:00
src	Documentation generate scripts	2019-07-23 01:19:17 +02:00
.esdoc.json	Documentation generate scripts	2019-07-23 01:19:17 +02:00
.gitignore	Documentation generate scripts	2019-07-23 01:19:17 +02:00
LICENSE	Add LICENSE, README update package.json with valid repository url	2019-07-22 20:20:37 +02:00
README.md	Update README.md	2019-07-23 01:22:16 +02:00
gd.js	Fix after move lib to src	2019-07-23 01:05:11 +02:00
package.json	Documentation generate scripts	2019-07-23 01:19:17 +02:00

README.md

pdf-gold-digger

PDF information extraction library based on pdf.js and Node.js.

Install

npm install -g szczepano/pdf-gold-digger

Usage

pdfdig some_file.pdf

or by cloning the repository:

git clone https://github.com/vane/pdf-gold-digger
cd pdf-gold-digger
npm install
node gd.js -f some_file.pdf
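
The command above can also be driven from a Node.js script. The following is a minimal sketch, not part of the project: `buildArgs` and `runGoldDigger` are hypothetical helper names, only the `-f` and `-o` flags mentioned in this README are used, and it assumes the tool prints its result to standard output and is run from the cloned repository directory.

```javascript
const { execFile } = require('child_process');

// Build the argument list for `node gd.js` using only the flags
// this README documents: -f <file> and -o text|json.
function buildArgs(pdfPath, format = 'text') {
  if (format !== 'text' && format !== 'json') {
    throw new Error(`unsupported output format: ${format}`);
  }
  return ['gd.js', '-f', pdfPath, '-o', format];
}

// Hypothetical helper: run the tool and resolve with whatever it
// writes to stdout (assumption: results are printed to stdout).
function runGoldDigger(pdfPath, format = 'text') {
  return new Promise((resolve, reject) => {
    execFile('node', buildArgs(pdfPath, format), (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout);
    });
  });
}
```

A call such as `runGoldDigger('some_file.pdf', 'json').then(console.log)` would then print the raw output. The structure of the JSON output is not documented here, so the sketch does not try to parse it.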

Work in progress

Supports:

- extract text
  - separate each page
  - separate each line
  - separate font information
  - bounding box position (probably still buggy)
- output to text: -o text (default)
- output to JSON: -o json
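
To illustrate what separating text into lines with a bounding box can involve, here is a minimal sketch. It is not pdf-gold-digger's implementation. It assumes text items shaped like those pdf.js returns from `page.getTextContent()` (`str`, `transform`, `width`, `height`), unrotated text, and a hypothetical `tolerance` parameter for deciding when two items share a baseline.

```javascript
// Group pdf.js-style text items into lines and compute one bounding
// box per line. transform[4] and transform[5] hold the x/y origin.
function groupIntoLines(items, tolerance = 2) {
  // PDF y grows upward, so sort top to bottom, then left to right.
  const sorted = [...items].sort(
    (a, b) => b.transform[5] - a.transform[5] || a.transform[4] - b.transform[4]
  );

  const lines = [];
  for (const item of sorted) {
    const y = item.transform[5];
    let line = lines[lines.length - 1];
    // Start a new line when the baseline moves by more than `tolerance`.
    if (!line || Math.abs(line.y - y) > tolerance) {
      line = { y, items: [] };
      lines.push(line);
    }
    line.items.push(item);
  }

  return lines.map((line) => {
    // Re-sort by x, since items within tolerance may differ slightly in y.
    const parts = [...line.items].sort((a, b) => a.transform[4] - b.transform[4]);
    const x0 = Math.min(...parts.map((i) => i.transform[4]));
    const y0 = Math.min(...parts.map((i) => i.transform[5]));
    const x1 = Math.max(...parts.map((i) => i.transform[4] + i.width));
    const y1 = Math.max(...parts.map((i) => i.transform[5] + i.height));
    return {
      // Joining with a space is a simplification for this sketch.
      text: parts.map((i) => i.str).join(' '),
      bbox: { x: x0, y: y0, width: x1 - x0, height: y1 - y0 },
    };
  });
}
```

Rotated or skewed text would need the full transform matrix rather than just the origin, which is one reason bounding boxes are easy to get wrong.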

TODO:

- specify output directory
- output to xml format
- ~~output to json format~~
- extract images to files
- extract font
- extract tables
- advanced font information
- extract forms
- extract drawings