pdf-gold-digger
====
PDF information extraction library based on [pdf.js](https://mozilla.github.io/pdf.js/)
and [Node.js](https://nodejs.org).
## Install
```npm install -g pdf-gold-digger```
## Usage
```bash
pdfdig -i some_file.pdf
```
## Available commands
```bash
pdfdig -h
ex. pdfdig -i input-file -o output_directory -f json
--input or -i   pdf file location (required)
--output or -o  output directory location (optional - default "out")
--debug or -d   show debug information (optional - default "false")
--format or -f  format (optional - default "text") - ("text,json")
--help or -h    display this help message
```
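The flags listed above can be combined in a small wrapper script to process several files in one go. The following is a minimal sketch, not part of the project: the function name `dig_all` and the directory layout are hypothetical, and it assumes that `pdfdig` accepts one input file per call and writes its results into the directory given with `-o`. The output directory is created up front as a precaution, since the help text does not say whether `pdfdig` creates it.

```bash
# Hypothetical helper: run pdfdig on every PDF in a directory.
# Usage: dig_all <source_dir> <output_root>
dig_all() {
  src_dir="$1"
  out_root="$2"
  for pdf in "$src_dir"/*.pdf; do
    # Skip the unexpanded pattern when the directory holds no PDFs.
    [ -e "$pdf" ] || continue
    name="$(basename "$pdf" .pdf)"
    # One output directory per input file, so results do not mix.
    mkdir -p "$out_root/$name"
    pdfdig -i "$pdf" -o "$out_root/$name" -f json
  done
}

# Example: dig_all ./pdfs ./out
```

Giving each PDF its own output directory is a design choice for this sketch; passing the same `-o` value for every file may work equally well if the output file names do not collide.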
## Advanced usage
```bash
git clone https://github.com/vane/pdf-gold-digger
cd pdf-gold-digger
sh demo.sh
```
Then see the results in the ```out``` directory.
## Documentation
[pdf-gold-digger](https://vane.pl/pdf-gold-digger/)
## Features:
- extract text
  - separate each page
  - separate each line
  - separate font information
- extract images
- output formats
  - text ```-f text (default)```
  - json ```-f json```
- specify output directory
## TODO:
- extract text
  - bounding box position
- load pdf from remote location
  - from url
- output to xml format
- output to html format
- output to markdown format
- output to zip
- extract font
- extract tables
- advanced font information
- extract forms
- extract drawings