
Bookmarks

Louise-Seamster edited this page Mar 25, 2020 · 1 revision

Through experimentation, Haley Boles found that opening the PDFs in Adobe Acrobat Pro reveals “bookmarks” that separate most of the emails within each larger PDF, probably a by-product of compiling the PDF directly from an email client. Each bookmark is usually titled with the subject line of the email (although we have not used this feature yet). Haley used the bookmarks to split the larger PDFs into smaller sections that roughly correspond to individual emails (or email chains, if they are threaded). While the emails were primarily being used on Google Drive, this made analysis easier than wrangling one long PDF. On Google Drive, the naming convention, which includes the bookmark number, is “deq #_Part####”.
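As a rough illustration of how the Drive naming convention could be read back programmatically, the Python sketch below splits a name into its PDF number and bookmark number. It assumes that each `#` in “deq #_Part####” stands for a digit (for example `deq 3_Part0012`) and that a `.pdf` extension may or may not be present. The function name `parse_drive_name` and the sample names are hypothetical and not part of the project.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: "deq <PDF number>_Part<bookmark number>", optionally ending in ".pdf".
DRIVE_NAME = re.compile(r"^deq (\d+)_Part(\d+)(?:\.pdf)?$", re.IGNORECASE)


class DriveName(NamedTuple):
    pdf_number: int
    bookmark_number: int


def parse_drive_name(name: str) -> Optional[DriveName]:
    """Return the PDF and bookmark numbers, or None if the name does not match."""
    match = DRIVE_NAME.match(name.strip())
    if match is None:
        return None
    return DriveName(int(match.group(1)), int(match.group(2)))
```

Returning `None` for non-matching names, rather than raising an error, makes it easy to skip unrelated files when scanning a folder.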

When creating an individual text file for each PDF page, we retained the bookmark level in the file naming convention, because it is helpful to know the PDF name, the page number, and the bookmark number. If and when we have to train a machine learning model to find distinct emails, the bookmark information can serve as one of its inputs.
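One way the bookmark number in each file name could help locate distinct emails is to group the page-level text files by PDF and bookmark, so that each group is a candidate email or thread. The sketch below is only an illustration: the exact page-file naming pattern is not given here, so the pattern `<PDF name>_Part<bookmark>_page<page>.txt` and the function name `group_pages_by_bookmark` are assumptions.

```python
import re
from collections import defaultdict

# Hypothetical page-file pattern: "<PDF name>_Part<bookmark>_page<page>.txt".
PAGE_FILE = re.compile(r"^(?P<pdf>.+)_Part(?P<bookmark>\d+)_page(?P<page>\d+)\.txt$")


def group_pages_by_bookmark(filenames):
    """Map (PDF name, bookmark number) to the sorted page numbers in that section."""
    sections = defaultdict(list)
    for filename in filenames:
        match = PAGE_FILE.match(filename)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        key = (match.group("pdf"), int(match.group("bookmark")))
        sections[key].append(int(match.group("page")))
    return {key: sorted(pages) for key, pages in sections.items()}
```

Each resulting group could then be concatenated into one document, or the bookmark number could be passed along as a feature if a model is later trained to detect email boundaries.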
