pdf-gold-digger
PDF information extraction library based on pdf.js and Node.js.
Install
npm install -g szczepano/pdf-gold-digger
Usage
pdfdig -i some_file.pdf
or test by cloning the repository
git clone https://github.com/vane/pdf-gold-digger
then run
sh demo.sh
and see the results in the out directory
Work in progress
Supports:
- extract text
- separate each page
- separate each line
- separate font information
- bounding box position (probably buggy now)
- output to text
-o text (default)
- output to json
-o json
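To run the tool over several files, here is a minimal sketch of a batch wrapper. It uses only the -i and -o flags listed above. The dig_all function and the PDFDIG environment variable are hypothetical names introduced for this example and are not part of pdf-gold-digger.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdf-gold-digger): run pdfdig on every
# PDF in a directory, using only the documented -i and -o flags.
# PDFDIG is an assumed override for the executable name, handy for dry runs.

dig_all() {
    dir="$1"
    format="${2:-text}"   # text is the documented default output format
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # skip when the glob matches nothing
        "${PDFDIG:-pdfdig}" -i "$pdf" -o "$format"
    done
}
```

For example, dig_all ./pdfs json calls pdfdig -i FILE -o json once per PDF in ./pdfs. Setting PDFDIG=echo prints the arguments instead of running the tool.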
TODO:
- specify output directory
- output to xml format
- extract images to files
- extract font
- extract tables
- advanced font information
- extract forms
- extract drawings