pdf-gold-digger/README.md
2020-11-20 21:46:38 +01:00

73 lines
2.3 KiB
Markdown

pdf-gold-digger
====
Pdf information extraction library based on [pdf.js](https://mozilla.github.io/pdf.js/)
and [node.js](https://nodejs.org) with various output formats.
[![GitHub](https://img.shields.io/github/license/vane/pdf-gold-digger)](https://github.com/vane/pdf-gold-digger/blob/master/LICENSE)
[![npm](https://img.shields.io/npm/v/pdf-gold-digger)](https://npmjs.com/package/pdf-gold-digger)
[![GitHub commits since tagged version](https://img.shields.io/github/commits-since/vane/pdf-gold-digger/0.1.1)](https://github.com/vane/pdf-gold-digger/compare/0.1.1...master)
[![GitHub last commit](https://img.shields.io/github/last-commit/vane/pdf-gold-digger)](https://github.com/vane/pdf-gold-digger)
[![doc](https://vane.pl/pdf-gold-digger/badge.svg)](https://vane.pl/pdf-gold-digger/)
## Install
```npm install -g pdf-gold-digger```
## Usage
```bash
pdfdig -i some_file.pdf
```
## Avaliable commands
```bash
pdfdig -h
ex. pdfdig -i input-file -o output_directory -f json
--input or -i pdf file location (required)
--output or -o pdf file location (optional default "out")
--debug or -d show debug information (optional - default "false")
--format or -f format (optional - default "text") - ("text,json,xml,html")
--font or -t extract fonts as ttf files (optional)
--password or -p password
--help or -h display this help message
--version or -v display version information
```
## Advanced usage
```bash
git clone https://github.com/vane/pdf-gold-digger
sh demo.sh
```
and see results in ```out``` directory
## Documentation
[pdf-gold-digger](https://vane.pl/pdf-gold-digger/)
## Features:
- extract text
- separate each page
- separate each line
- separate font information
- extract images
- output formats
- text ```-f text (default)```
- json ```-f json```
- xml ```-f xml```
- html ```-f html```
- specify output directory
## TODO:
- load pdf from remote location
- from url
- output to markdown format
- pack output to zip
- extract tables
- extract forms
- extract drawings
- extract text from glyphs
- ability to provide input file for glyph path to letter
- detect when unicode is not provided or mangled
- get bounding box from text and draw it on canvas
- use tesseract.js as optional fallback