Timestamps need fixing #21

Open
hzadeh17 opened this issue May 26, 2020 · 10 comments
Labels: timestamps (this tag is for issues related to timestamps)

Comments

@hzadeh17
Contributor

Currently about 700 timestamps (from all deqs) are not being converted into Unix times, which means each one is either a false positive or contains an OCR error (e.g. "Fr!day" or "4 :30 PM" isn't read by the Unix converter). Looking at the list, there are quite a lot of OCR errors, so I'm going to work on something to run through the list and fix at least the common ones...
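
A rough sketch of how the unconverted timestamps could be separated from the ones that parse. This is only an illustration: the format string `TS_FORMAT`, the helper name `split_parsable`, and the choice to treat times as UTC are all assumptions, not what the project's Unix converter actually does.

```python
from datetime import datetime, timezone

# Assumed layout, e.g. "Friday, May 29, 2015 4:30 PM" (hypothetical; adjust to the real data).
TS_FORMAT = "%A, %B %d, %Y %I:%M %p"


def split_parsable(timestamps):
    """Return (converted, failed): Unix times for strings that parse, raw strings that don't."""
    converted, failed = {}, []
    for raw in timestamps:
        try:
            parsed = datetime.strptime(raw, TS_FORMAT)
        except ValueError:
            # Either a false positive or an OCR error; keep it for later inspection.
            failed.append(raw)
            continue
        # Treat the timestamp as UTC (an assumption) so the result doesn't depend on the local zone.
        converted[raw] = int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return converted, failed


converted, failed = split_parsable([
    "Friday, May 29, 2015 4:30 PM",
    "Fr!day, May 29, 2015 4:30 PM",
    "Friday, May 29, 2015 4 :30 PM",
])
print(failed)  # the two OCR-damaged strings end up here
```

Keeping the failed strings in their own list makes it easy to eyeball which errors are common enough to deserve a rule.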

hzadeh17 self-assigned this May 26, 2020
hzadeh17 added the timestamps label (for issues related to timestamps) May 26, 2020
@akrinaldi
Contributor

I've noticed a lot of OCR errors myself, including many dates with double spaces, weird special characters, and odd spellings of months (e.g. "Jau" instead of "Jan"). I've been going through and cleaning them on the tiny subsets I'm using for visualization experiments, but we should definitely devise a system for cleaning these.
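
One possible shape for such a cleaning system, sketched under assumptions: the function name `clean_timestamp` and the `MONTH_FIXES` table are hypothetical, and reading "!" between letters as "i" is a guess based on the "Fr!day" example rather than a rule that has been checked against the data.

```python
import re

# Hypothetical lookup of OCR misspellings seen in the data; extend as new ones turn up.
MONTH_FIXES = {"Jau": "Jan"}


def clean_timestamp(raw):
    """Apply a few conservative, rule-based fixes to one OCR'd timestamp string."""
    text = re.sub(r"\s+", " ", raw).strip()                  # collapse double spaces
    text = re.sub(r"(?<=[A-Za-z])!(?=[A-Za-z])", "i", text)  # "Fr!day" -> "Friday" (assumed rule)
    text = re.sub(r"(\d) ?: ?(\d)", r"\1:\2", text)          # "4 :30" -> "4:30"
    for wrong, right in MONTH_FIXES.items():
        # Whole-word match only, so correctly spelled months are left alone.
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text


print(clean_timestamp("Fr!day,  Jau 9, 2015 4 :30 PM"))
```

An explicit lookup table is used instead of fuzzy matching because a string like "Jau" is equally close to "Jan" and "Jun", so a person should decide which correction is right.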

akrinaldi self-assigned this May 29, 2020
@hzadeh17
Contributor Author

@akrinaldi Yes! I actually started doing that, and about 75% of the initial OCR errors in the timestamps have been fixed, but I need some help figuring out how to fix the remaining ones. The code I was working on is in this module; if you go down to the function fixTS, you can see what I have so far...

@hzadeh17
Contributor Author

@akrinaldi These are the 182 timestamps that still need to be fixed.

@akrinaldi
Contributor

What rules have you been going by for correcting some of these, especially for the ones where the original number is unclear?

@hzadeh17
Contributor Author

hzadeh17 commented May 29, 2020

If you go into this module and go down to the function fixTS (search "def fixTS(listofTS):" in the search bar), you can see what I have so far. It's been more of a bottom-up process, just looking for the common errors and putting together some sort-of regex function to fix each one.

For example, fixTS.spacer fixes timestamps without a space between the year and the time, which was pretty common (e.g. "20159:30 AM" --> "2015 9:30 AM"). I did this by literally splitting the string into the year and the time and then returning year + " " + time into a new list. Probably a better way to do this, but I'm using my very elementary python skills lol
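
For comparison, the same year/time fix can be written as a single regex substitution. This is a sketch, not the code in fixTS: the names `YEAR_TIME` and `add_year_time_space` are made up, and it assumes years always start with 19 or 20.

```python
import re

# A four-digit year (19xx or 20xx) immediately followed by a clock time such as 9:30.
YEAR_TIME = re.compile(r"\b((?:19|20)\d{2})(\d{1,2}:\d{2})")


def add_year_time_space(timestamps):
    """Insert the missing space between year and time; other strings pass through unchanged."""
    return [YEAR_TIME.sub(r"\1 \2", ts) for ts in timestamps]


print(add_year_time_space(["May 29, 20159:30 AM", "May 29, 2015 9:30 AM"]))
# -> ['May 29, 2015 9:30 AM', 'May 29, 2015 9:30 AM']
```

Because the pattern only matches when a digit follows the year directly, timestamps that already have the space are returned as-is, so the function is safe to run over the whole list.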

The ones I am struggling to come up with rules to fix are these. Some of them are kind of unfixable, or at least need a more qualitative lens, like the first item in the list...

For a lot of them, though, I can probably just apply what I've been doing to the specific cases; I just have yet to do that.

@akrinaldi
Contributor

We may have another issue on our hands: when using the Unix stamps you gave me, I noticed my final visualized calendar had two extra months, for the years 2614 and 2615. It looks like the converter went ahead and made us some Unix timestamps for the future based on some incorrect OCR data.
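
A simple guard against this kind of thing is a year-range check on the converted values. The sketch below is an assumption-laden illustration: the bounds `MIN_YEAR`/`MAX_YEAR` and the helper names are hypothetical, and it only flags suspect values rather than guessing what the year should have been.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Hypothetical bounds; set these to the years the documents can actually cover.
MIN_YEAR, MAX_YEAR = 2000, 2020


def year_of(unix_time):
    """Year (UTC) of a Unix timestamp, computed via timedelta to avoid platform limits."""
    return (EPOCH + timedelta(seconds=unix_time)).year


def split_by_plausibility(unix_times):
    """Return (plausible, suspect) lists based on the assumed year range."""
    plausible, suspect = [], []
    for ts in unix_times:
        (plausible if MIN_YEAR <= year_of(ts) <= MAX_YEAR else suspect).append(ts)
    return plausible, suspect


plausible, suspect = split_by_plausibility([1432917000, 20335000000])
print(suspect)  # the second value falls in the year 2614
```

Running a check like this before visualizing would surface far-future dates without silently dropping them.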

@hzadeh17
Contributor Author

Oh yes! Good eye. I actually did add the code to fix that to fixTS, but it was only a few days ago so I haven't rerun everything with that applied yet. Let me know if there are any other issues!

@akrinaldi
Contributor

A lot of the timestamps on the list could be fixed by hand, so I did so. However, there are still a handful we have to come up with a way to interpret and a handful we're probably not going to be able to fix. I can make those into separate lists and post them here so we can deal with them as needed.

@Louise-Seamster
Contributor

Louise-Seamster commented Jul 29, 2020

@akrinaldi and @hzadeh17: Should we set up a single place/procedure for listing remaining errors? What would work best for you? It might need different solutions/places, though, for flagging timestamp errors and spelling mistakes in names. I think Hannah was keeping a list of name appearances?

@hzadeh17
Contributor Author

@akrinaldi @Louise-Seamster I have to go back and check, but I'm pretty sure I ended up fixing most of the errors. I will check in about this again here soon.

3 participants