Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OcrdPage: propagate inherited attributes and TextStyle #698

Closed
bertsky opened this issue Jun 24, 2021 · 4 comments · May be fixed by #700
Closed

OcrdPage: propagate inherited attributes and TextStyle #698

bertsky opened this issue Jun 24, 2021 · 4 comments · May be fixed by #700

Comments

@bertsky
Copy link
Collaborator

bertsky commented Jun 24, 2021

PAGE-XML features an implicit inheritance relation between various elements of the hierarchy:

  • Page/TextStyle → TextRegion*/TextStyle → TextLine/TextStyle → Word/TextStyle → Glyph/TextStyle
  • TextRegion*/@production → TextLine/@production → Word/@production → Glyph/@production
  • Page/@primaryScript → TextRegion*/@primaryScript → TextLine/@primaryScript → Word/@primaryScript → Glyph/@script
  • Page/@secondaryScript → TextRegion*/@secondaryScript → TextLine/@secondaryScript → Word/@secondaryScript → Glyph/@script
  • Page/@primaryLanguage → TextRegion*/@primaryLanguage → TextLine/@primaryLanguage → Word/@language
  • Page/@secondaryLanguage → TextRegion*/@secondaryLanguage → TextLine/@secondaryLanguage → Word/@language
  • Page/@readingDirection → TextRegion*/@readingDirection → TextLine/@readingDirection → Word/@readingDirection
  • Page/@textLineOrder → TextRegion*/@textLineOrder

These relations are only documented and cannot be automatically implemented in a generated DOM. But their semantics are important, and it would make writing processors much easier if they would be implemented.

For example, if I want to know if the current segment belongs to a certain script, I'd currently have to:

  1. check the element type, what kind of attribute name applies (@script or @primaryScript / @secondaryScript)
  2. check if that is set locally
  3. otherwise check the parent element's @primaryScript etc

This is very hard to achieve with XPath (because disjunction/unions are only possible on nodesets, not on predicates). And with the DOM it requires a lot of code each time.

But we could facilitate this by simply propagating all inherited features during .build() – in a patched ocrd_page_generateds. We already have the user methods mechanism for patching, and we could simply use buildChildren to propagate all of the above attributes (as a bottom up post-hook), because attributes of parents are built before those of children.

But for TextStyle, it's more complicated: on all hierarchy levels except the Page level, TextStyle sorts after the logical children and thus is only built after they are built. Also, one would need to unify style attributes between levels (we usually have True, False and None; so true/false from parents replaces none in children).

@bertsky bertsky added discussion Diskussion/ Input aus der Gruppe erforderlich enhancement labels Jun 24, 2021
@bertsky
Copy link
Collaborator Author

bertsky commented Jun 30, 2021

Trying to find a way to implement this in a clean way, I have difficulty with our ocrd_page_user_methods mechanism: It's relatively easy to add new methods to classes, but in our case we want to patch the existing buildAttributes() (and buildChildren()) like so:

--- a/ocrd_models/ocrd_models/ocrd_page_generateds.py
+++ b/ocrd_models/ocrd_models/ocrd_page_generateds.py
@@ -14387,7 +14387,7 @@ class TextRegionType(RegionType):
             already_processed.add('secondaryLanguage')
             self.secondaryLanguage = value
             self.validate_LanguageSimpleType(self.secondaryLanguage)    # validate type LanguageSimpleType
-        value = find_attr_value_('primaryScript', node)
+        value = find_attr_value_('primaryScript', node) or self.parent_object_.primaryScript
         if value is not None and 'primaryScript' not in already_processed:
             already_processed.add('primaryScript')
             self.primaryScript = value

So instead of copying the existing build implementation into a file, applying our patches there, and then passing that file as user method (to replace the original code), perhaps in this case it is better to apply the patches after generateDS ran, i.e. directly on ocrd_page_generateds.py? We already use sed there for some workarounds. Why not include a patch line?

@kba
Copy link
Member

kba commented Jun 30, 2021

Trying to find a way to implement this in a clean way, I have difficulty with our ocrd_page_user_methods mechanism: It's relatively easy to add new methods to classes, but in our case we want to patch the existing buildAttributes() (and buildChildren()) like so:

--- a/ocrd_models/ocrd_models/ocrd_page_generateds.py
+++ b/ocrd_models/ocrd_models/ocrd_page_generateds.py
@@ -14387,7 +14387,7 @@ class TextRegionType(RegionType):
             already_processed.add('secondaryLanguage')
             self.secondaryLanguage = value
             self.validate_LanguageSimpleType(self.secondaryLanguage)    # validate type LanguageSimpleType
-        value = find_attr_value_('primaryScript', node)
+        value = find_attr_value_('primaryScript', node) or self.parent_object_.primaryScript
         if value is not None and 'primaryScript' not in already_processed:
             already_processed.add('primaryScript')
             self.primaryScript = value

So instead of copying the existing build implementation into a file, applying our patches there, and then passing that file as user method (to replace the original code), perhaps in this case it is better to apply the patches after generateDS ran, i.e. directly on ocrd_page_generateds.py? We already use sed there for some workarounds. Why not include a patch line?

If you can boil this down to a single sed call and perhaps add a comment as well, that's okay with me. But if it's more complex, a copied/patched build user method would be better.

@bertsky
Copy link
Collaborator Author

bertsky commented Jun 30, 2021

It can be a single patch call, based on a git-controlled patch file. The patch file will be much more robust (and interpretable/readable) than a sed expression IMO.

The problem with copying the whole generated build code is that whenever something changes upstream in the generator, our code probably won't run anymore, and we won't notice immediately. A patch can also fail to apply, but we get to see that right away on make generate-page. Also, it's much harder to update/migrate the changes if you just add the copied and then edited result.

@kba
Copy link
Member

kba commented Jun 30, 2021

It can be a single patch call, based on a git-controlled patch file. The patch file will be much more robust (and interpretable/readable) than a sed expression IMO.

Gotcha, yes, a patch(1) is also okay.

and we won't notice immediately

I tend to check the git diff before committing a new ocrd_page_generateds.py, so I hope I would notice it right away :) But that's just fyi, patch solution is 👍

@OCR-D OCR-D locked and limited conversation to collaborators Dec 20, 2021
@lena-hinrichsen lena-hinrichsen converted this issue into discussion #775 Dec 20, 2021
@lena-hinrichsen lena-hinrichsen removed the discussion Diskussion/ Input aus der Gruppe erforderlich label Dec 21, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants