Timestamps need fixing #21
I've noticed a lot of OCR errors myself, including a lot of dates with double spaces, weird special characters, and odd spellings of months (e.g. "Jau" instead of "Jan"). I've been going through and cleaning them on the tiny subsets I'm using for visualization experiments, but we should definitely devise a system for cleaning these.
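A minimal sketch of what such a cleaning pass could look like, assuming the error types listed above. `MONTH_FIXES` and `clean_timestamp` are illustrative names, not part of the actual fixTS module:

```python
import re

# Hypothetical cleanup pass for the OCR error types mentioned above.
MONTH_FIXES = {"Jau": "Jan"}  # extend as more OCR misreads turn up

def clean_timestamp(ts):
    ts = re.sub(r" {2,}", " ", ts)          # collapse double spaces
    ts = re.sub(r"[^\x20-\x7E]", "", ts)    # drop weird special characters
    for bad, good in MONTH_FIXES.items():   # fix known month misspellings
        ts = ts.replace(bad, good)
    return ts
```

A dictionary of known substitutions keeps the rules easy to extend as new misreads show up in the data.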
@akrinaldi Yes! I actually started doing that, and about 75% of the initial OCR errors in the timestamps have been fixed, but I need some help figuring out how to fix the remaining ones. The code I was working on is in this module; if you go down to the function fixTS, you can see what I have so far...
@akrinaldi These are the 182 timestamps that still need to be fixed.
What rules have you been going by for correcting some of these, especially for the ones where the original number is unclear? |
If you go into this module and go down to the function fixTS (search "def fixTS(listofTS):" in the search bar), you can see what I have so far. It's been more of a bottom-up process: looking for the common errors and putting together a sort of regex function to fix each one. For example, fixTS.spacer fixes timestamps without a space between the year and the time, which was pretty common (e.g. "20159:30 AM" --> "2015 9:30 AM"). I did this by literally splitting the string into the year and the time and then returning year + " " + time into a new list. There's probably a better way to do this, but I'm using my very elementary Python skills lol.

The ones I'm struggling to come up with rules for are these. Some of them are kind of unfixable, or at least need a more qualitative lens, like the first item in the list. For a lot of them, though, I can probably just apply what I've been doing to the specific cases; I just have yet to do that.
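For what it's worth, the spacer idea described above could also be done in one regex substitution instead of splitting and rejoining the string. This is just a sketch, not the actual fixTS.spacer code:

```python
import re

def spacer(ts):
    # Insert the missing space between a four-digit year and the time:
    # "20159:30 AM" -> "2015 9:30 AM". Already-spaced timestamps pass
    # through unchanged, since the time must directly follow the year.
    return re.sub(r"\b((?:19|20)\d\d)(\d{1,2}:\d\d)", r"\1 \2", ts)
```

The substitution only fires when digits of a time immediately follow a plausible year, so it's safe to run over the whole list.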
We may have another issue on our hands - when using the unix stamps you gave me, I noticed my final visualized calendar had two extra months, for the years 2614 and 2615. It looks like it went ahead and made us some unix timestamps for the future based on some incorrect OCR data.
Oh yes! Good eye. I actually did add the code to fix that to fixTS, but it was only a few days ago so I haven't rerun everything with that applied yet. Let me know if there are any other issues!
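For reference, one way a sanity check like that could look. This is a sketch, not the actual fixTS code, and the year bounds are assumptions about what range the archive can plausibly cover:

```python
from datetime import datetime, timezone

def plausible_unix(ts, min_year=1990, max_year=2020):
    # Flag unix timestamps whose year falls outside the expected range,
    # so OCR-corrupted dates that parsed as 2614/2615 get caught
    # before they end up on the visualized calendar.
    year = datetime.fromtimestamp(ts, tz=timezone.utc).year
    return min_year <= year <= max_year
```

Filtering with a check like this before visualization keeps stray future-dated stamps out of the calendar even if a new OCR error slips past the cleanup rules.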
A lot of the timestamps on the list could be fixed by hand, so I did so. However, there are still a handful we have to come up with a way to interpret and a handful we're probably not going to be able to fix. I can make those into separate lists and post them here so we can deal with them as needed.
@akrinaldi and @hzadeh17: should we set up a single place/procedure for listing the remaining errors? What would work best for you? It might call for different solutions/places, though, for flagging timestamp errors versus spelling mistakes in names. I think Hannah was keeping a list of name appearances?
@akrinaldi @Louise-Seamster I have to go back and check, but I'm pretty sure I ended up fixing most of the errors. I'll check in about this again soon.
Currently there are about 700 timestamps (from all deqs) that are not being converted into unix times, which means they are either false positives or contain an OCR error (e.g. "Fr!day" or "4 :30 PM" aren't read by the Unix converter). Looking at the list, there are quite a lot of OCR errors, so I'm going to work on something to run through the list and fix at least the common ones...
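One way to collect the non-converting timestamps would be to attempt a parse and keep whatever fails. A sketch, assuming a format like "Jan 5, 2015 9:30 AM"; the format string is a guess at what the converter expects, not taken from the actual code:

```python
from datetime import datetime

def failed_conversions(timestamps, fmt="%b %d, %Y %I:%M %p"):
    # Try the expected format on each string and keep anything that
    # fails to parse, so OCR errors like "Fr!day" or "4 :30 PM"
    # end up on a fix-it list.
    bad = []
    for ts in timestamps:
        try:
            datetime.strptime(ts, fmt)
        except ValueError:
            bad.append(ts)
    return bad
```

Running this after each round of cleanup rules would show how many of the ~700 failures remain.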