Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WorkspaceBagger: Use, in order of preference, f.basename, f.contentids and f.ID for filenames #1157

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion ocrd/ocrd/workspace_bagger.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
unzip_file_to_dir,
DEFAULT_METS_BASENAME,
MIMETYPE_PAGE,
safe_filename,
VERSION,
)
from ocrd_validators.constants import BAGIT_TXT, TMP_BAGIT_PREFIX, OCRD_BAGIT_PROFILE_URL
Expand Down Expand Up @@ -80,7 +81,10 @@ def _bag_mets_files(
file_grp_dir.mkdir()

attr = 'local_filename' if f.local_filename else 'url'
basename = f.basename if f.basename else f"{f.ID}{MIME_TO_EXT.get(f.mimetype, '.xml')}"
if f.local_filename and f.basename:
basename = f.basename
else:
basename = safe_filename(f.contentids if f.contentids else f.ID) + MIME_TO_EXT.get(f.mimetype, '.xml')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would advise against direct use of @CONTENTIDS as file name. The URL prefix almost always is not what you want. How about stripping the host-name part (if in fact it is a URL), and then using makedirs for all remaining path prefixes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair points. Agree that representing the URL path as directory is a neat way to do it, though it deviates from our general flat directory structure below the fileGrp dirs. Removing the host is also prettier.

But I'm wondering how that would work for @M3ssman's use case - how much info do you need to still be able to debug your workflows?

Copy link
Contributor

@M3ssman M3ssman Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not intended to be used as-it-is, simply because an URN/URI contains chars that are not valid as part in local filenames. Therefore - at least in ULB and semantics workflows - colon is exchanged with a plus sign.

Consider a page container like this

<mets:div ID="phys1278993" TYPE="page" CONTENTIDS="urn:nbn:de:gbv:3:1-113129-p0007-8" ORDER="1">
        <mets:fptr FILEID="IMG_MAX_1278993"/>
        <mets:fptr FILEID="OCR-D-GT-FULLTEXT-1"/>
</mets:div>

with GT linked

<mets:file ID="OCR-D-GT-FULLTEXT-1" MIMETYPE="application/vnd.prima.page+xml">
	<mets:FLocat xlink:href="GT-PAGE/urn+nbn+de+gbv+3+1-113129-p0007-8_ger.gt.xml" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
</mets:file>

but the actual imsge via URL like this

<mets:file MIMETYPE="image/jpeg" ID="IMG_MAX_1278993">
	<mets:FLocat xlink:href="https://opendata.uni-halle.de/retrieve/eeeee05d-c7cd-4e89-9607-5d1ac175afa1/00000007.jpg" LOCTYPE="URL"/>
</mets:file>

The goal is to match both files, GT and Image, by their names alone without any extensions.
If this can be achieved, both can be used out-of-the-box for further GT-works in Tools like Transkribus or Larex.
(I have to handle 1.600 GT-ODEM-files, nearly 100 newspaper-GT-files, 101 GT arabic and about 400 pages GT persian. And if our next digi-project will be granted, the GT will increase further.)

Proposal: Instead of using the CONTENTIDS attribute it'd be sufficient to rename the images locally like the corresponding GT-file, whatever it's name was. My main concern is therefore to have equal names, not to name something like some attribute.

Maybe this way one hopefully avoids additional processing?
This would also avoid additional problems which may occur since even for 2 units (SBB, ULB) there are yet 2 different interpretations for this attribute, and who knows what else lurks out there.

(c.f. https://github.com/M3ssman/gt-test/blob/a370f3a691506f4eab1b91226a76a7c1f461ba10/data/corpus-odem-ger-256/mets.xml#L8905)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not intended to be used as-it-is, simply because an URN/URI contains chars that are not valid as part in local filenames. Therefore - at least in ULB and semantics workflows - colon is exchanged with a plus sign.

It is not used as-is but passed through safe_filename which does

def safe_filename(url):                                              
    """                                                              
    Sanitize input to be safely used as the basename of a local file.
    """                                                              
    ret = re.sub(r'[^\w]+', '_', url)                                
    ret = re.sub(r'^\.*', '', ret)                                   
    ret = re.sub(r'\.\.*', '.', ret)                                 
    #  print('safe filename: %s -> %s' % (url, ret))                 
    return ret                                                       

Adapting this to make the replacement (_) configurable as + is not an issue. However

Proposal: Instead of using the CONTENTIDS attribute it'd be sufficient to rename the images locally like the corresponding GT-file, whatever it's name was. My main concern is therefore to have equal names, not to name something like some attribute.

I still don't understand how to achieve that. For your example

<mets:file ID="OCR-D-GT-FULLTEXT-1" MIMETYPE="application/vnd.prima.page+xml">
	<mets:FLocat xlink:href="GT-PAGE/urn+nbn+de+gbv+3+1-113129-p0007-8_ger.gt.xml" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/>
</mets:file>
...
<mets:file MIMETYPE="image/jpeg" ID="IMG_MAX_1278993">
	<mets:FLocat xlink:href="https://opendata.uni-halle.de/retrieve/eeeee05d-c7cd-4e89-9607-5d1ac175afa1/00000007.jpg" LOCTYPE="URL"/>
</mets:file>
...
<mets:div ID="phys1278993" TYPE="page" CONTENTIDS="urn:nbn:de:gbv:3:1-113129-p0007-8" ORDER="1">
        <mets:fptr FILEID="IMG_MAX_1278993"/>
        <mets:fptr FILEID="OCR-D-GT-FULLTEXT-1"/>
</mets:div>

What should the bagger write as the filenames of IMG_MAX_1278993 and OCR-D-GT-FULLTEXT-1?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd wish to save urn+nbn+de+gbv+3+1-113129-p0007-8_ger.gt.xml as filename for the GT, since this is the file which OCR-D-GT-FULLTEXT-1 points to, and for the image where container IMG_MAX_1278993 refers to, a corresponding name like urn+nbn+de+gbv+3+1-113129-p0007-8_ger.jpg , if it's possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point at least in my understanding - ALIMU- concerning ULB is, that the GT-files shall be published (and probably edited further afterwards) but will be tied to the GT-repository. There is nothing about to change with the images, they only need to be referenced and resolvable, for example, for later generation of training data using the GT-files.

_relpath = join(f.fileGrp, basename)
self.resolver.download_to_directory(file_grp_dir, getattr(f, attr), basename=basename)
changed_local_filenames[str(getattr(f, attr))] = _relpath
Expand Down
13 changes: 12 additions & 1 deletion ocrd_models/ocrd_models/ocrd_file.py
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ def __init__(self, el, mimetype=None, pageId=None, local_filename=None, mets=Non
local_filename (Path): ``@xlink:href`` pointing to the locally cached version of the file in the workspace
ID (string): ``@ID`` of this ``mets:file``
loctype (string): DEPRECATED do not use
contentids (string): ``@CONTENTIDS`` of the ``mets:div`` in the ``mets:structMap[@TYPE="PHYSICAL]`` this file manifests
"""
if el is None:
raise ValueError("Must provide mets:file element this OcrdFile represents")
Expand Down Expand Up @@ -135,6 +136,15 @@ def pageId(self, pageId):
raise Exception("OcrdFile %s has no member 'mets' pointing to parent OcrdMets" % self)
self.mets.set_physical_page_for_file(pageId, self)

@property
def contentids(self):
"""
Get the ``@CONTENTIDS`` of the physical ``mets:structMap`` entry corresponding to this ``mets:file``.
"""
if self.mets is None:
raise Exception("OcrdFile %s has no member 'mets' pointing to parent OcrdMets" % self)
return self.mets.get_contentids_for_file(self)

@property
def loctypes(self):
"""
Expand Down Expand Up @@ -226,7 +236,7 @@ class ClientSideOcrdFile:
this represents the response of the :py:class:`ocrd.mets_server.OcrdMetsServer`.
"""

def __init__(self, el, mimetype=None, pageId=None, loctype='OTHER', local_filename=None, mets=None, url=None, ID=None, fileGrp=None):
def __init__(self, el, mimetype=None, pageId=None, loctype='OTHER', local_filename=None, mets=None, url=None, ID=None, fileGrp=None, contentids=None):
"""
Args:
el (): ignored
Expand All @@ -238,6 +248,7 @@ def __init__(self, el, mimetype=None, pageId=None, loctype='OTHER', local_filena
url (string): ignored XXX the remote/original file once we have proper mets:FLocat bookkeeping
local_filename (): ``@xlink:href`` of this ``mets:file`` - XXX the local file once we have proper mets:FLocat bookkeeping
ID (string): ``@ID`` of this ``mets:file``
fileGrp (string): ``@USE`` of the ``mets:fileGrp`` this file belongs to
"""
self.ID = ID
self.mimetype = mimetype
Expand Down
19 changes: 19 additions & 0 deletions ocrd_models/ocrd_models/ocrd_mets.py
Original file line number Diff line number Diff line change
Expand Up @@ -722,6 +722,25 @@ def get_physical_page_for_file(self, ocrd_file):
if len(ret):
return ret[0]

def get_contentids_for_file(self, ocrd_file):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #1063, I added a more general solution (sans @CONTENTIDS but plus @ORDER, @ORDERLABEL and @LABEL): add another kwarg return_divs=True to get_physical_pages to get the full divs (which can further be queried). There's also an extra physical_pages_labels (I don't remember why this is independent though). Both extensions should be thrown together IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a much better solution, I agree. As I said, this is just a proof-of-concept so we have a discussion basis for the naming. Implementation can be thrown away and replaced with a fine-tuned #1063.

It might be sensible to have an OcrdMetsPage(s) class similar to OcrdFile to provide a uniform interface. But that's best discussed in #1063.

"""
Get the ``@CONTENTIDS` attribute of the physical page (``@CONTENTIDS`` of the ``mets:structMap[@TYPE="PHYSICAL"]//mets:div[@TYPE="PAGE"]`` entry)
corresponding to the ``mets:file`` :py:attr:`ocrd_file`.
"""
ret = []
if self._cache_flag:
for pageId in self._fptr_cache.keys():
if ocrd_file.ID in self._fptr_cache[pageId].keys():
ret.append(self._page_cache[pageId].get('CONTENTIDS'))
else:
ret = self._tree.getroot().xpath(
'/mets:mets/mets:structMap[@TYPE="PHYSICAL"]/mets:div[@TYPE="physSequence"]/mets:div[@TYPE="page"][./mets:fptr[@FILEID="%s"]]/@CONTENTIDS' %
ocrd_file.ID, namespaces=NS)

# To get rid of the python's FutureWarning
if len(ret):
return ret[0]

def remove_physical_page(self, ID):
"""
Delete page (physical ``mets:structMap`` ``mets:div`` entry ``@ID``) :py:attr:`ID`.
Expand Down
5 changes: 5 additions & 0 deletions tests/model/test_ocrd_file.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@

from tests.base import (
main,
assets,
create_ocrd_file,
create_ocrd_file_with_defaults
)
Expand Down Expand Up @@ -121,6 +122,10 @@ def test_fptr_changed_for_change_id():
assert mets.get_physical_pages(for_fileIds=['FOO_1']) == [None]
assert mets.get_physical_pages(for_fileIds=['BAZ_1']) == ['p0001']

def test_get_contentids():
mets = OcrdMets(filename=assets.url_of('pembroke_werke_1766/data/mets.xml'))
ocrd_file = next(mets.find_files(pageId='PHYS_0009'))
assert ocrd_file.contentids == 'http://resolver.staatsbibliothek-berlin.de/SBB0001CA7900000009'

if __name__ == '__main__':
main(__file__)
3 changes: 3 additions & 0 deletions tests/model/test_ocrd_mets.py
Original file line number Diff line number Diff line change
Expand Up @@ -396,6 +396,9 @@ def test_update_physical_page_attributes(sbb_directory_ocrd_mets):
assert b'ORDER' in m.to_xml()
assert b'ORDERLABEL' in m.to_xml()

def test_get_contentids():
mets = OcrdMets(filename=assets.url_of('pembroke_werke_1766/data/mets.xml'))
assert mets.get_contentids_for_file(next(mets.find_files(pageId='PHYS_0009'))) == 'http://resolver.staatsbibliothek-berlin.de/SBB0001CA7900000009'

if __name__ == '__main__':
main(__file__)