pdf-gold-digger
====

PDF information extraction library based on [pdf.js](https://mozilla.github.io/pdf.js/)
and [node.js](https://nodejs.org).

![GitHub](https://img.shields.io/github/license/vane/pdf-gold-digger)
![npm](https://img.shields.io/npm/v/pdf-gold-digger)
![doc](https://vane.pl/pdf-gold-digger/badge.svg)

## Install
```npm install -g pdf-gold-digger```

## Usage
```bash
pdfdig -i some_file.pdf
```

## Available commands
```bash
pdfdig -h
ex. pdfdig -i input-file -o output_directory -f json

--input or -i pdf file location (required)
--output or -o output directory location (optional - default "out")
--debug or -d show debug information (optional - default "false")
--format or -f format (optional - default "text") - ("text,json,xml"):
--help or -h display this help message
```
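
The flags listed above can be combined to process several files in one go. The following is only a minimal sketch and not part of pdf-gold-digger: `dig_all` is a hypothetical helper name, it assumes `pdfdig` is installed and on the `PATH`, and the choice of one output sub-directory per PDF is an assumption made for illustration.

```bash
# Hypothetical helper (not shipped with pdf-gold-digger): run pdfdig on every
# PDF in a directory, using only the -i, -o and -f flags documented above.
dig_all() {
  src_dir="$1"            # directory containing *.pdf files
  out_dir="${2:-out}"     # base output directory (mirrors pdfdig's default "out")
  format="${3:-text}"     # one of: text, json, xml (mirrors pdfdig's default "text")

  for pdf in "$src_dir"/*.pdf; do
    # Skip the unexpanded pattern when the directory holds no PDFs.
    [ -e "$pdf" ] || continue
    name="$(basename "$pdf" .pdf)"
    # Assumption: one sub-directory per input file keeps results separate.
    pdfdig -i "$pdf" -o "$out_dir/$name" -f "$format"
  done
}
```

For example, `dig_all ./invoices results xml` would call `pdfdig` once for each PDF in `./invoices` and pass `results/<file name>` as the output directory.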

## Advanced usage
```bash
git clone https://github.com/vane/pdf-gold-digger
# demo.sh lives in the cloned repository, so change into it first
cd pdf-gold-digger
sh demo.sh
```
Then see the results in the ```out``` directory.

## Documentation
[pdf-gold-digger](https://vane.pl/pdf-gold-digger/)

## Features:
- extract text
  - separate each page
  - separate each line
  - separate font information
- extract images
- output formats
  - text ```-f text (default)```
  - json ```-f json```
  - xml ```-f xml```
- specify output directory

## TODO:
- load PDF from remote location
  - from URL
- output to html format
- output to markdown format
- output to zip
- extract font
- extract tables
- extract forms
- extract drawings