pdf-gold-digger
====

PDF information extraction library based on [pdf.js](https://mozilla.github.io/pdf.js/)
and [Node.js](https://nodejs.org), with various output formats.

[![GitHub](https://img.shields.io/github/license/vane/pdf-gold-digger)](https://github.com/vane/pdf-gold-digger/blob/master/LICENSE)
[![npm](https://img.shields.io/npm/v/pdf-gold-digger)](https://npmjs.com/package/pdf-gold-digger)
[![GitHub commits since tagged version](https://img.shields.io/github/commits-since/vane/pdf-gold-digger/0.1.1)](https://github.com/vane/pdf-gold-digger/compare/0.1.1...master)
[![GitHub last commit](https://img.shields.io/github/last-commit/vane/pdf-gold-digger)](https://github.com/vane/pdf-gold-digger)
[![doc](https://vane.pl/pdf-gold-digger/badge.svg)](https://vane.pl/pdf-gold-digger/)

## Install

```npm install -g pdf-gold-digger```

## Usage

```bash
pdfdig -i some_file.pdf
```

## Available commands

```bash
pdfdig -h
ex. pdfdig -i input-file -o output_directory -f json

--input or -i pdf file location (required)
--output or -o output directory (optional - default "out")
--debug or -d show debug information (optional - default "false")
--format or -f format (optional - default "text") - ("text,json,xml,html")
--font or -t extract fonts as ttf files (optional)
--password or -p password
--help or -h display this help message
--version or -v display version information
```

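A minimal sketch of batch conversion, built only from the options listed above (`-i`, `-o`, `-f`). The function name `dig_all` and the `PDFDIG` override variable are hypothetical and not part of pdf-gold-digger. The sketch creates each output directory up front, on the assumption that `pdfdig` may not create it.

```bash
# Command to run; override for a dry run (hypothetical variable, not a pdfdig feature).
PDFDIG="${PDFDIG:-pdfdig}"

# dig_all SRC_DIR OUT_ROOT [FORMAT]
# Runs pdfdig once per *.pdf in SRC_DIR, writing to OUT_ROOT/<file name without .pdf>.
dig_all() {
  local src_dir="$1" out_root="$2" format="${3:-json}"
  local pdf name
  for pdf in "$src_dir"/*.pdf; do
    [ -e "$pdf" ] || continue              # glob matched nothing
    name=$(basename "$pdf" .pdf)
    mkdir -p "$out_root/$name"             # assumption: directory may need to exist
    "$PDFDIG" -i "$pdf" -o "$out_root/$name" -f "$format" \
      || echo "failed: $pdf" >&2           # report the failure and keep going
  done
}
```

For example, `dig_all ./invoices ./out xml` would produce one directory per PDF under `./out`. Setting `PDFDIG=echo` first prints the arguments that would be passed instead of running `pdfdig`.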
## Advanced usage

```bash
git clone https://github.com/vane/pdf-gold-digger
cd pdf-gold-digger
sh demo.sh
```
and see the results in the ```out``` directory

## Documentation

[pdf-gold-digger](https://vane.pl/pdf-gold-digger/)

## Features:

- extract text
  - separate each page
  - separate each line
  - separate font information
- extract images
- output formats
  - text ```-f text (default)```
  - json ```-f json```
  - xml ```-f xml```
  - html ```-f html```
- specify output directory

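To compare the output formats listed above, one approach is to run the same file through each of them. This is a sketch that assumes `pdfdig` is on `PATH`; `dig_formats` and `PDFDIG` are made-up names for illustration, and creating the directories beforehand is a precaution rather than a documented requirement.

```bash
# Command to run; can be swapped out for testing (hypothetical variable).
PDFDIG="${PDFDIG:-pdfdig}"

# dig_formats INPUT_PDF [OUT_ROOT]
# Extracts INPUT_PDF once per supported format into OUT_ROOT/<format>.
dig_formats() {
  local input="$1" out_root="${2:-out}"
  local format
  for format in text json xml html; do
    mkdir -p "$out_root/$format"           # one directory per format
    "$PDFDIG" -i "$input" -o "$out_root/$format" -f "$format"
  done
}
```

Calling `dig_formats some_file.pdf` would leave the results in `out/text`, `out/json`, `out/xml` and `out/html`.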
## TODO:

- load pdf from remote location
  - from url
- output to markdown format
- pack output to zip
- extract tables
- extract forms
- extract drawings
- extract text from glyphs
  - ability to provide input file for glyph path to letter
  - detect when unicode is not provided or mangled
  - get bounding box from text and draw it on canvas
  - use tesseract.js as optional fallback