Spico197/CatalogExtraction

🌳 CED: Catalog Extraction from Documents

Rebuild document catalog tree structures from plain texts.

✈️ Abilities

  • Concatenate OCR text pieces
  • Compose paragraphs from sequences
  • Extract document catalogs from plain texts
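To make the target output concrete, here is a toy illustration of a catalog tree. It is not the model or algorithm used in this repository: it assumes each heading already comes with a known depth, whereas CED predicts the structure from plain text. The names `Node` and `build_tree` are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One catalog entry and its sub-entries (hypothetical structure)."""
    text: str
    children: List["Node"] = field(default_factory=list)


def build_tree(items: List[Tuple[int, str]]) -> Node:
    """Build a tree from (depth, text) pairs, where depth starts at 1."""
    root = Node("ROOT")
    stack = [(0, root)]  # path from the root to the most recent node
    for depth, text in items:
        if depth < 1:
            raise ValueError("depth must be >= 1")
        node = Node(text)
        # Pop until the top of the stack is shallower than the new node,
        # which makes it the new node's parent.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1].children.append(node)
        stack.append((depth, node))
    return root


tree = build_tree([
    (1, "Chapter 1"),
    (2, "Section 1.1"),
    (2, "Section 1.2"),
    (1, "Chapter 2"),
])
```

In this sketch, `tree.children` holds the two chapters, and the first chapter holds the two sections as its children.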

⚙️ Installation

Make sure the following requirements are met before installing:

  • Python>=3.7
  • torch>=1.9.1
# create a fresh environment to avoid package version mismatches
$ conda create -n doctree python=3.7
$ conda activate doctree
# install pytorch
$ echo 'install pytorch from https://pytorch.org/'

# install basics
$ pip install -e .
# for testing and development, use this command instead
$ pip install -e .[dev]
# to deploy the demo on your local machine, use this
$ pip install -e .[demo]
# for both development and demo deployment, use this
$ pip install -e .[all]

🚀 Quick Start

All task examples are in examples/.

Concatenate text segments

  • Train
# edit the settings in `examples/text_concat/train/config.yaml`
$ bash examples/text_concat/train/run.sh
  • Inference
# check `examples/text_concat/inference/run.sh` and change the task directory
$ bash examples/text_concat/inference/run.sh

Concat & split paragraphs

The paragraph split and text concatenation models are trained with the same task class.

  • Train
# edit the settings in `examples/text_concat/train/config.yaml`
$ bash examples/text_concat/train/run.sh
  • Inference
# check `examples/text_concat/inference/run.sh` and change the task directory
$ bash examples/text_concat/inference/run.sh

Extract catalog structures

We use the hierarchical config mechanism in REx: settings in credit_eval.yaml override those in base_config.yaml. Add or change options in credit_eval.yaml to fit your setup.
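The override behaviour can be pictured with a minimal sketch. This is not REx's implementation, only an illustration of the general idea, and it assumes nested sections are merged key by key. The function `merge_config` and every key and value below are hypothetical, not taken from the repository's YAML files.

```python
from copy import deepcopy
from typing import Any, Dict


def merge_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new config where values in `override` win over `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are nested sections: merge them recursively.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars, lists, and new keys simply replace the base value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical stand-ins for base_config.yaml and credit_eval.yaml.
base_config = {"task_dir": "outputs/base", "train": {"batch_size": 32, "lr": 1e-3}}
credit_eval = {"task_dir": "outputs/credit_eval", "train": {"batch_size": 8}}

config = merge_config(base_config, credit_eval)
```

Under these assumptions, `config` takes `task_dir` and `batch_size` from the overriding file and keeps `lr` from the base file, while `base_config` itself is left unchanged.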

  • Train
# edit the settings in `examples/doc_tree_construction/train/credit_eval.yaml`
$ bash examples/doc_tree_construction/train/run.sh
  • Inference
# change `task_dir` in `examples/doc_tree_construction/inference/run_simple.py`
$ python examples/doc_tree_construction/inference/run_simple.py

📃 Documentation

Check docs/ for more detailed explanations.

📜 Citation

If you find this paper or repository useful, please cite it:

@article{zhu2023ced,
  title={CED: Catalog Extraction from Documents},
  author={Zhu, Tong and Zhang, Guoliang and Li, Zechang and Yu, Zijian and Ren, Junfei and Wu, Mengsong and Wang, Zhefeng and Huai, Baoxing and Chao, Pingfu and Chen, Wenliang},
  journal={arXiv preprint arXiv:2304.14662},
  year={2023}
}