Extract data from PDF

Michal Szczepanski cc2002078b Extract images with VisitorImage #2 - output to file based on format data.{format} - more documentation		2019-07-23 04:53:17 +02:00
bin	Fix missing pdfdig shebang	2019-07-23 00:26:04 +02:00
src	Extract images with VisitorImage #2	2019-07-23 04:53:17 +02:00
.esdoc.json	Documentation generate scripts	2019-07-23 01:19:17 +02:00
.gitignore	Add demo.sh and test.sh for automating stuff	2019-07-23 01:59:55 +02:00
LICENSE	Add LICENSE, README update package.json with valid repository url	2019-07-22 20:20:37 +02:00
README.md	Extract images with VisitorImage #2	2019-07-23 04:53:17 +02:00
demo.sh	Extract images with VisitorImage #2	2019-07-23 04:53:17 +02:00
gd.js	Extract images with VisitorImage #2	2019-07-23 04:53:17 +02:00
package.json	Documentation generate scripts	2019-07-23 01:19:17 +02:00
test.sh	Extract images with VisitorImage #2	2019-07-23 04:53:17 +02:00

pdf-gold-digger

PDF information extraction library based on pdf.js and Node.js.

npm install -g szczepano/pdf-gold-digger

pdfdig -i some_file.pdf
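
The same command can be driven from a Node.js script. The following is a minimal sketch, not part of the library: `buildPdfdigArgs` and `runPdfdig` are hypothetical helper names, and only the `-i` and `-o` flags mentioned in this README are used. Where pdfdig writes its results (standard output or a file) is not stated here, so the sketch only reports the exit code.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: build the argument list from the two flags
// documented in this README (-i for input, -o for output format).
function buildPdfdigArgs(inputPath, format = 'text') {
  if (!['text', 'json'].includes(format)) {
    throw new Error(`unsupported format: ${format}`);
  }
  return ['-i', inputPath, '-o', format];
}

// Hypothetical helper: run the globally installed pdfdig binary and
// resolve with its exit code. Assumes pdfdig is on the PATH.
function runPdfdig(inputPath, format) {
  return new Promise((resolve, reject) => {
    const child = spawn('pdfdig', buildPdfdigArgs(inputPath, format), {
      stdio: 'inherit', // forward whatever pdfdig prints to this terminal
    });
    child.on('error', reject);
    child.on('close', resolve);
  });
}
```

Calling `runPdfdig('some_file.pdf', 'json')` would then be equivalent to typing `pdfdig -i some_file.pdf -o json` in a shell.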

To try the demo, clone the repository:
git clone https://github.com/vane/pdf-gold-digger
then run
sh demo.sh
and see the results in the out directory.

Work in progress

extract text
- separate each page
- separate each line
- separate font information
- bounding box position (probably still buggy)
output to text: -o text (default)
output to JSON: -o json
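
To illustrate what separating lines, font information and bounding boxes can involve, here is a rough sketch. It is not this library's implementation. It assumes text items shaped like those returned by pdf.js `getTextContent()` (`str`, `fontName`, `width`, `height`, and a six-number `transform` whose last two entries are the x and y offsets). The names `boundingBox` and `groupIntoLines` and the `tolerance` parameter are hypothetical, and the box ignores rotation and scaling.

```javascript
// Approximate bounding box of one text item (assumption: unrotated text,
// with transform[4] and transform[5] as the x and y origin of the item).
function boundingBox(item) {
  const x = item.transform[4];
  const y = item.transform[5];
  return { x, y, width: item.width, height: item.height };
}

// Group items into lines: items whose y offsets differ by at most
// `tolerance` from the first item of a line are treated as the same line.
function groupIntoLines(items, tolerance = 2) {
  // PDF y coordinates grow upward, so sort descending to read top to bottom.
  const sorted = [...items].sort((a, b) => b.transform[5] - a.transform[5]);
  const groups = [];
  for (const item of sorted) {
    const y = item.transform[5];
    const last = groups[groups.length - 1];
    if (last && Math.abs(last.y - y) <= tolerance) {
      last.items.push(item);
    } else {
      groups.push({ y, items: [item] });
    }
  }
  return groups.map((group) => {
    // Within a line, order items left to right.
    const lineItems = group.items.sort((a, b) => a.transform[4] - b.transform[4]);
    return {
      text: lineItems.map((i) => i.str).join(''),
      fonts: [...new Set(lineItems.map((i) => i.fontName))],
      boxes: lineItems.map(boundingBox),
    };
  });
}
```

Each returned line carries its text, the distinct font names used on it, and one box per item, which mirrors the kinds of information listed above.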