2019-07-22 20:20:37 +02:00

pdf-gold-digger
====

PDF information extraction library based on [pdf.js](https://mozilla.github.io/pdf.js/) and [node.js](https://nodejs.org).

## Work in progress

### Usage

``git clone https://github.com/vane/pdf-gold-digger``

``gd -f some.pdf``

### Supports:

- extract text
- separate each page
- separate each line
- separate font information
- bounding box position

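The line, font, and bounding box features listed above can be illustrated with a small sketch. It is not the project's code: `groupIntoLines` and its `tolerance` parameter are hypothetical, and the input objects only mimic the shape of pdf.js text items (`str`, `transform`, `width`, `height`, `fontName`). Joining fragments with a single space is also a simplifying assumption.

```javascript
// Hypothetical helper, not part of pdf-gold-digger.
// Each item mimics a pdf.js text item: { str, transform, width, height, fontName }.
// transform is [a, b, c, d, e, f]; e and f are the x and y of the text origin.
function groupIntoLines(items, tolerance = 2) {
  const lines = [];
  for (const item of items) {
    const x = item.transform[4];
    const y = item.transform[5];
    // Reuse an existing line whose baseline is within the tolerance.
    let line = lines.find((l) => Math.abs(l.y - y) <= tolerance);
    if (!line) {
      line = { y, items: [] };
      lines.push(line);
    }
    line.items.push({ ...item, x });
  }
  // PDF user space has y growing upwards, so descending y reads top to bottom.
  lines.sort((a, b) => b.y - a.y);
  return lines.map((line) => {
    // Order the fragments of one line from left to right.
    const sorted = line.items.sort((a, b) => a.x - b.x);
    const left = Math.min(...sorted.map((i) => i.x));
    const right = Math.max(...sorted.map((i) => i.x + i.width));
    const top = Math.max(...sorted.map((i) => line.y + i.height));
    return {
      text: sorted.map((i) => i.str).join(' '),
      fonts: [...new Set(sorted.map((i) => i.fontName))],
      bbox: { x: left, y: line.y, width: right - left, height: top - line.y },
    };
  });
}
```

Items whose baselines differ by less than `tolerance` end up on the same line, so small vertical jitter between fragments does not split a line in two.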
### TODO:

- specify output format and output directory
- output to XML format
- output to JSON format
- extract images to files
- extract fonts
- extract tables
- advanced font information
- extract forms
- extract drawings