Merge pull request #20 from pgolo/dev

Version 0.0.1
pgolo · Oct 9, 2020 · cc36faf · cc36faf
2 parents 1222765 + 83326c9
commit cc36faf
Showing 42 changed files with 3,098 additions and 3 deletions.
diff --git a/.gitignore b/.gitignore
@@ -2,7 +2,8 @@
 **/__pycache__/*
 build/*
 cythonized/*
-dist/*
+bin/*
 *.spec
+pilsner.egg-info/*
 !**/.gitkeep
-!**/.gitignore
+!**/.gitignore
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,12 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.0.1] - 2020-10-08
+
+### Added
+
+- Alpha version of the utility
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,2 @@
+include README.md
+include LICENSE
diff --git a/README.md b/README.md
@@ -1,2 +1,262 @@
 # pilsner
-Utility for dictionary-based named entity recognition
+
+Python library for dictionary-based named entity recognition
+
+[![pypi][pypi-img]][pypi-url]
+
+[pypi-img]: https://img.shields.io/pypi/v/pilsner?style=plastic
+[pypi-url]: https://pypi.org/project/pilsner/
+
+## 1. Purpose
+
+This library is a Python implementation of a toolkit for dictionary-based
+named entity recognition. It is intended to store any thesaurus in a trie-like
+structure and identify any of the stored synonyms in a string.
+
+## 2. Installation and dependencies
+
+```bash
+pip install pilsner
+```
+
+`pilsner` is tested in Python 3.6, 3.7, and 3.8.
+
+The only dependency is the `sic` package. While it can be installed
+automatically at the time of `pilsner` installation, installing `sic` manually
+beforehand might also be considered (see the benchmark of cythonized vs pure
+Python implementation in the `sic` documentation,
+[https://pypi.org/project/sic/](https://pypi.org/project/sic/)).
+
+## 3. Diagram
+
+`pilsner` consists of two major components: `Model` and `Utility`. `Model`
+class provides storage for the dictionary and string normalization rules, as
+well as low-level methods for populating this storage. `Utility` class provides
+high-level methods for storing and retrieving data to/from `Model` instance.
+
+![Diagram](misc/pilsner-diagram.svg)
+
+## 4. Usage
+
+```python
+import pilsner
+```
+
+### 4.1. Initialize model
+
+- To initialize empty model:
+
+```python
+m = pilsner.Model()
+```
+
+- To specify path to temporary database for empty model:
+
+```python
+m = pilsner.Model(storage_location='path/to/database.file')
+```
+
+- To create empty model that uses database created in memory rather than on
+disk:
+
+```python
+m = pilsner.Model(storage_location=':memory:')
+```
+
+> If database is created in memory, the model cannot be later saved on disk
+(can only be used instantly).
+
+- To load model from disk:
+
+```python
+m = pilsner.Model(filename='path/to/model')
+```
+
+> For more on how a model is saved to and loaded from disk, see
+[4.6. Save model](#46-save-model) and [4.7. Load model](#47-load-model).
+
+### 4.2. Add string normalization units
+
+- Depending on the dictionary and nature of the text supposed to be parsed,
+string normalization might not be required at all, and nothing specific is to
+be done here in such case.
+- Without string normalization, synonyms from the dictionary will be stored as
+they are and looked up by recognizer case-sensitively.
+- To add a single normalization unit:
+
+```python
+# Assuming m is pilsner.Model instance:
+m.add_normalizer(
+    normalizer_name='normalizer_tag',
+    filename='path/to/normalizer_config.xml'
+)
+```
+
+> String normalization is technically done by `sic` component. See
+> documentation for `sic` at
+> [https://pypi.org/project/sic/](https://pypi.org/project/sic/) to learn how
+> to design normalizer config.
+
+- Model can embed more than one normalization unit.
+- Default normalization unit for the model is the one added first or the last
+one added with parameter `default` set to `True`.
+- Having multiple normalization units in one model makes perfect sense when the
+stored dictionary contains synonyms of different nature that should be
+normalized in different ways (for example, abbreviations probably should not
+get normalized at all, while other synonyms might include tokens or punctuation
+marks that should not affect entity recognition). For that purpose, Model class
+includes `normalizer_map` dict that is supposed to map names of added
+normalization units to values in specific field in a dictionary designating the
+way a synonym should be normalized (tokenizer field, or tokenizer column):
+
+```python
+# Assuming m is pilsner.Model instance:
+m.normalizer_map = {
+    'synonym_type_1': 'normalizer_1',
+    'synonym_type_2': 'normalizer_2'
+}
+```
+
+> The snippet above instructs `pilsner` to normalize synonyms that have
+> `synonym_type_1` value in `tokenizer` column with `normalizer_1`
+> normalization unit, and normalize synonyms that have `synonym_type_2` value
+> in `tokenizer` column with `normalizer_2` normalization unit. For more about
+> fields in a dictionary, see [4.4. Define dictionary](#44-define-dictionary).
+
+### 4.3. Initialize utility
+
+- To load dictionary into `Model` instance, as well as to parse text, the
+`Utility` instance is required:
+
+```python
+r = pilsner.Utility()
+```
+
+### 4.4. Define dictionary
+
+- Source dictionary for `pilsner` must be delimited text file.
+- Along with the source dictionary, specifications of the columns (fields) must
+be provided as list where each item corresponds to a column (from left to
+right). Each item in this list must be a dict object with string keys `name`,
+`include`, `delimiter`, `id_flag`, `normalizer_flag`, and `value_flag`, so
+that:
+  - `field['name']` is a string for column title;
+  - `field['include']` is a boolean that must be set to `True` for the column
+  to be included in the model, otherwise `False`;
+  - `field['delimiter']` is a string that is supposed to split single cell into
+  list of values if the column holds concatenated lists rather than individual
+  values;
+  - `field['id_flag']` is a boolean that must be set to `True` if the column is
+  supposed to be used for grouping synonyms (generally, entity ID is such
+  column), otherwise `False`;
+  - `field['normalizer_flag']` is a boolean that must be set to `True` if the
+  column holds indication on what normalization unit must be applied to this
+  particular synonym, otherwise `False`;
+  - `field['value_flag']` is a boolean that must be set to `True` if the column
+  holds synonyms that are supposed to be looked up when parsing a text,
+  otherwise `False`.
+
+> If the dictionary has a column flagged with `normalizer_flag`, the synonym in
+> each row will be normalized with the string normalization unit whose name is
+> mapped to the value in this column by the `pilsner.Model.normalizer_map`
+> dict. If the value is not among the `pilsner.Model.normalizer_map` keys, the
+> default normalization unit will be used.
+
+### 4.5. Compile model
+
+- To store dictionary in `Model` instance, method `compile_model` of `Utility`
+instance must be called with the following required parameters:
+  - `model`: pointer to initialized `Model` instance;
+  - `filename`: string with path and filename of source dictionary;
+  - `fields`: list of dict objects with definitions of columns (see
+  [4.4. Define dictionary](#44-define-dictionary));
+  - `word_separator`: string defining what is to be considered word separator
+  (generally, it should be whitespace);
+  - `column_separator`: string defining what is to be considered column
+  separator (e.g. `\t` for tab-delimited file);
+  - `column_enclosure`: string defining what is to be stripped away from cell
+  after row has been split into columns (typically, it should be `\n` for new
+  line character to be trimmed from the rightmost column).
+
+```python
+# Assuming m is pilsner.Model instance and r is pilsner.Utility instance:
+r.compile_model(
+    model=m,
+    filename='path/to/dictionary_in_a_text_file.txt',
+    fields=fields,
+    word_separator=' ',
+    column_separator='\t',
+    column_enclosure='\n'
+)
+```
+
+- To review optional parameters, see comments in the code.
+
+### 4.6. Save model
+
+- If `Model` instance has compiled dictionary, and if database location for the
+`Model` instance is not explicitly set to `':memory:'`, the data such instance
+is holding can be saved to disk:
+
+```python
+# Assuming m is pilsner.Model instance
+m.save('path/to/model_name')
+```
+
+- The snippet above will write the following files:
+  - `path/to/model_name.attributes`: database with attributes (fields from the
+  dictionary that are not synonyms);
+  - `path/to/model_name.keywords`: keywords used for disambiguation;
+  - `path/to/model_name.normalizers`: string normalization units;
+  - `path/to/model_name.0.dictionary`: trie with synonyms;
+  - `path/to/model_name.<N>.dictionary`: additional tries with synonyms (`<N>`
+  being integer number of a trie) in case more than one trie was created (see
+  comments in the code - `pilsner.Utility.compile_model` method, `item_limit`
+  parameter).
+
+### 4.7. Load model
+
+- To initialize new `Model` instance using previously saved data:
+
+```python
+m = pilsner.Model(filename='path/to/model_name')
+```
+
+- Alternatively, data can be loaded to previously initialized `Model` instance:
+
+```python
+m = pilsner.Model()
+m.load('path/to/model_name')
+```
+
+- In both cases, the program will look for the following files:
+  - `path/to/model_name.attributes`: database with attributes (fields from the dictionary that are not synonyms);
+  - `path/to/model_name.keywords`: keywords used for disambiguation;
+  - `path/to/model_name.normalizers`: string normalization units;
+  - `path/to/model_name.<N>.dictionary`: tries with synonyms (`<N>` being
+  integer).
+
+### 4.8. Parse string
+
+- To parse a string without filtering out any synonyms and output all
+attributes of spotted entities:
+
+```python
+# Assuming m is pilsner.Model instance, r is pilsner.Utility instance,
+# and text_to_parse is string to parse
+parsed = r.parse(
+    model=m,
+    source_string=text_to_parse
+)
+```
+
+- The output will be dict object where keys are tuples for location of spotted
+entity in a string (begin, end) and values are dicts for attributes that are
+associated with identified entity (`{'attribute_name': {attribute_values}}`).
+- For details about optional parameters, see comments in the code
+(`pilsner.Utility.parse` function).
+
+## 5. Example
+
+Everything written above is put together in example code,
+see **/misc/example/** directory in the project's repository.
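Note on README section 4.4: the column specification is a plain list of dicts with six required keys, so a misspelled key or a non-boolean flag can go unnoticed until compilation. The sketch below shows one way such a list might be checked up front. `validate_fields` is a hypothetical helper that is not part of `pilsner`, and the type checks (string or `None` for `delimiter`, strict booleans for the flags) are assumptions drawn from the README text and the example script, not from the library's code.

```python
# Hypothetical pre-flight check for the `fields` list described in README 4.4.
# Not part of pilsner; the accepted types are assumptions based on the README.
REQUIRED_KEYS = ('name', 'include', 'delimiter',
                 'id_flag', 'normalizer_flag', 'value_flag')
BOOLEAN_KEYS = ('include', 'id_flag', 'normalizer_flag', 'value_flag')


def validate_fields(fields):
    """Return a list of human-readable problems; empty list means no problems."""
    if not isinstance(fields, list):
        return ['fields must be a list, got %s' % type(fields).__name__]
    problems = []
    for position, field in enumerate(fields):
        if not isinstance(field, dict):
            problems.append('column %d: expected dict, got %s'
                            % (position, type(field).__name__))
            continue
        # Every column must define all six keys
        for key in REQUIRED_KEYS:
            if key not in field:
                problems.append('column %d: missing key %r' % (position, key))
        if 'name' in field and not isinstance(field['name'], str):
            problems.append('column %d: name must be a string' % position)
        # Flags are expected to be real booleans, not 0/1 or strings
        for key in BOOLEAN_KEYS:
            if key in field and not isinstance(field[key], bool):
                problems.append('column %d: %r must be True or False'
                                % (position, key))
        delimiter = field.get('delimiter')
        if delimiter is not None and not isinstance(delimiter, str):
            problems.append('column %d: delimiter must be a string or None'
                            % position)
    return problems
```

Calling `validate_fields(fields)` before `r.compile_model(...)` and stopping when the returned list is non-empty would surface a typo such as `'id_flag]'` as a missing `'id_flag'` key.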
diff --git a/dist/pilsner-0.0.1-cp36-cp36m-win_amd64.whl b/dist/pilsner-0.0.1-cp36-cp36m-win_amd64.whl
diff --git a/dist/pilsner-0.0.1-cp37-cp37m-win_amd64.whl b/dist/pilsner-0.0.1-cp37-cp37m-win_amd64.whl
diff --git a/dist/pilsner-0.0.1-cp38-cp38-win_amd64.whl b/dist/pilsner-0.0.1-cp38-cp38-win_amd64.whl
diff --git a/dist/pilsner-0.0.1.tar.gz b/dist/pilsner-0.0.1.tar.gz
diff --git a/misc/example/custom_normalizer.xml b/misc/example/custom_normalizer.xml
@@ -0,0 +1,5 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<tokenizer name="plants">
+  <setting name="cs" value="0" />
+  <token to="lilies" from="lily" />
+</tokenizer>
diff --git a/misc/example/default_normalizer.xml b/misc/example/default_normalizer.xml
@@ -0,0 +1,6 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<tokenizer name="animals">
+  <setting name="cs" value="0" />
+  <split where="r" value="rerio" />
+  <token to="" from="scary" />
+</tokenizer>
diff --git a/misc/example/example.py b/misc/example/example.py
@@ -0,0 +1,93 @@
+# Either install pilsner package to the environment first,
+# or run this from project's root
+
+import sys; sys.path.insert(0, '')
+
+# Import pilsner
+import pilsner
+
+# Initialize Model class
+m = pilsner.Model()
+
+# Add normalization units
+m.add_normalizer('default', 'misc/example/default_normalizer.xml')
+m.add_normalizer('custom', 'misc/example/custom_normalizer.xml')
+
+# Map names of normalization units to some string values
+m.normalizer_map = {
+    'animal': 'default',
+    'plant': 'custom'
+}
+
+# Initialize Utility class
+r = pilsner.Utility()
+
+# Provide table definition for misc/example/living_things.txt file
+fields = [
+    {
+        'name': 'type',             # attribute name is 'type'
+        'include': True,            # include this column
+        'delimiter': None,          # no delimiter (single value per row)
+        'id_flag': False,           # entity IDs are not in this column
+        'normalizer_flag': True,    # tags for normalization units are in this column
+        'value_flag': False         # string labels (synonyms) are not in this column
+    },
+    {
+        'name': 'id',               # attribute name is 'id'
+        'include': True,
+        'delimiter': None,
+        'id_flag': True,            # entity IDs are in this column
+        'normalizer_flag': False,
+        'value_flag': False
+    },
+    {
+        'name': 'label',            # attribute name is 'label'
+        'include': True,
+        'delimiter': None,
+        'id_flag': False,
+        'normalizer_flag': False,
+        'value_flag': True          # string labels (synonyms) are in this column
+    },
+    {
+        'name': 'habitat',          # attribute name is 'habitat'
+        'include': True,
+        'delimiter': ',',           # multiple values delimited with ',' can be stored in a single row
+        'id_flag': False,
+        'normalizer_flag': False,
+        'value_flag': False
+    }
+]
+
+# Populate Model instance with data from misc/example/living_things.txt file
+r.compile_model(
+    model=m,
+    filename='misc/example/living_things.txt',
+    fields=fields,
+    word_separator=' ',
+    column_separator='\t',
+    column_enclosure='\n',
+    include_keywords=True
+)
+
+# Save Model instance to disk
+m.save('misc/example/living_things')
+
+# Load Model instance from disk
+m = pilsner.Model(filename='misc/example/living_things')
+
+# Parse string
+text_to_parse = '''
+Little mouse is not recognized and is not frightened by big scary eagle.
+Daniorerio also does not care much about water lilies, though both are recognized.
+'''
+parsed = r.parse(
+    model=m,
+    source_string=text_to_parse,
+    attrs_where={
+        '+': {'habitat': {'air', 'ocean'}} # only consider items with these values in 'habitat' column
+    },
+    attrs_out=['type'] # for each spotted entity, only output 'type' attribute
+)
+
+# Print out the result: recognized are 'big eagle', 'danio rerio', 'water lily'.
+print(parsed)
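Note on the parse result: README section 4.8 describes it as a dict keyed by `(begin, end)` tuples whose values map attribute names to collections of values. Tuple keys and sets cannot be passed to `json.dumps` directly, so a small conversion step may be useful when the result has to be logged or stored. The sketch below assumes that documented shape only; `parsed_to_records` is a hypothetical helper, and the sample dict is made-up data, not actual `pilsner` output.

```python
import json


def parsed_to_records(parsed):
    """Flatten {(begin, end): {attribute: values}} into a position-sorted list."""
    records = []
    for (begin, end), attributes in sorted(parsed.items(),
                                           key=lambda item: item[0]):
        records.append({
            'begin': begin,
            'end': end,
            # Sets are not JSON-serializable, so store values as sorted lists
            'attributes': {name: sorted(values)
                           for name, values in attributes.items()}
        })
    return records


# Made-up data in the documented shape, not actual pilsner output
sample = {
    (40, 49): {'type': {'plant'}},
    (5, 14): {'type': {'animal'}}
}
print(json.dumps(parsed_to_records(sample), indent=2))
```

The helper makes no assumption about whether `end` is inclusive or exclusive; it only reorders and re-types what the parser returned.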