Extract data from pdf

Michal Szczepanski 2e17443c5f Version 0.1.1		2020-11-20 21:46:38 +01:00
bin	Fix missing pdfdig shebang	2019-07-23 00:26:04 +02:00
src	Add ability to provide password as command line argument closes #16	2020-11-20 21:39:09 +01:00
.esdoc.json	Documentation generate scripts	2019-07-23 01:19:17 +02:00
.eslintrc.json	Add eslint standard with small modifications	2019-07-28 09:40:39 +02:00
.gitignore	Add demo.sh and test.sh for automating stuff	2019-07-23 01:59:55 +02:00
demo.sh	Add html output to demo.sh	2019-07-28 22:08:22 +02:00
gd.js	Add ability to provide password as command line argument closes #16	2020-11-20 21:39:09 +01:00
LICENSE	Update LICENSE	2019-07-28 17:44:45 +02:00
package.json	Version 0.1.1	2020-11-20 21:46:38 +01:00
README.md	Version 0.1.1	2020-11-20 21:46:38 +01:00
test.sh	Extract images with VisitorImage #2	2019-07-23 04:53:17 +02:00

README.md

pdf-gold-digger

PDF information extraction library built on pdf.js and Node.js, with several output formats.

Install

npm install -g pdf-gold-digger

Usage

pdfdig -i some_file.pdf

Available commands

pdfdig -h
ex. pdfdig -i input-file -o output_directory -f json
  
  --input    or  -i   pdf file location (required)
  --output   or  -o   output directory (optional - default "out")
  --debug    or  -d   show debug information (optional - default "false")
  --format   or  -f   output format (optional - default "text") - one of "text", "json", "xml", "html"
  --font     or  -t   extract fonts as ttf files (optional)
  --password or  -p   password for the pdf file (optional)
  --help     or  -h   display this help message
  --version  or  -v   display version information
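
When pdfdig is called from a script, it can help to build the argument list in one place. The following is a minimal sketch, not part of pdf-gold-digger: `buildPdfdigArgs` and `runPdfdig` are hypothetical helper names, and it assumes `pdfdig` is installed globally and available on PATH. It only uses `-i`, `-o`, `-f` and `-p`; `-t` and `-d` are left out because the help text does not say whether they take a value.

```javascript
const { spawnSync } = require('node:child_process');

// Formats listed in the help text above.
const FORMATS = ['text', 'json', 'xml', 'html'];

// Hypothetical helper: turns an options object into a pdfdig argument array.
function buildPdfdigArgs({ input, output, format, password } = {}) {
  if (!input) {
    throw new Error('input is required (-i)');
  }
  if (format !== undefined && !FORMATS.includes(format)) {
    throw new Error(`unsupported format "${format}", expected one of: ${FORMATS.join(', ')}`);
  }
  const args = ['-i', input];
  if (output) args.push('-o', output);
  if (format) args.push('-f', format);
  if (password) args.push('-p', password);
  return args;
}

// Hypothetical helper: runs pdfdig and returns its exit status.
// Assumption: the pdfdig binary is on PATH (npm install -g pdf-gold-digger).
function runPdfdig(options) {
  const result = spawnSync('pdfdig', buildPdfdigArgs(options), { stdio: 'inherit' });
  if (result.error) throw result.error;
  return result.status;
}

// Example call (commented out so that loading this file does not start pdfdig):
// runPdfdig({ input: 'some_file.pdf', output: 'out', format: 'json' });
```

Passing the arguments as an array avoids shell quoting problems with file names or passwords that contain spaces.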

Advanced usage

git clone https://github.com/vane/pdf-gold-digger
sh demo.sh

The results are written to the out directory.

Documentation

pdf-gold-digger

Features:

extract text
- separate each page
- separate each line
- separate font information
extract images
output formats
- text: -f text (default)
- json: -f json
- xml: -f xml
- html: -f html
specify output directory
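
The layout of the output directory is not described here, so a script that consumes the results may prefer to inspect whatever was written instead of assuming file names. The sketch below is an illustration under that assumption: `groupOutputFiles` is a hypothetical helper, not part of pdf-gold-digger, and it simply walks a directory and groups the files it finds by extension.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: recursively collects files under dir and groups their
// paths (relative to dir) by lowercase extension. Files without an extension
// are grouped under "(none)".
function groupOutputFiles(dir) {
  const groups = {};
  const walk = (current) => {
    for (const entry of fs.readdirSync(current, { withFileTypes: true })) {
      const full = path.join(current, entry.name);
      if (entry.isDirectory()) {
        walk(full);
      } else {
        const ext = path.extname(entry.name).toLowerCase() || '(none)';
        (groups[ext] = groups[ext] || []).push(path.relative(dir, full));
      }
    }
  };
  walk(dir);
  return groups;
}

// Example call, assuming pdfdig was run with the default output directory:
// console.log(groupOutputFiles('out'));
```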

TODO:

load pdf from remote location
- from url
output to markdown format
pack output to zip
extract tables
extract forms
extract drawings
extract text from glyphs
- ability to provide input file for glyph path to letter
- detect when unicode is not provided or mangled
- get bounding box from text and draw it on canvas
- use tesseract.js as optional fallback