Timestamps need fixing #21
I've noticed a lot of OCR errors myself, including a lot of dates with double spaces, weird special characters, and odd spellings of months (e.g. "Jau" instead of "Jan"). I've been going through and cleaning them on the tiny subsets I'm using for visualization experiments, but we should definitely devise a system for cleaning these.
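A minimal sketch of what such a cleaning pass could look like, assuming the error types listed above. `MONTH_FIXES` and `clean_timestamp` are illustrative names, not part of the actual fixTS module:

```python
import re

# Hypothetical cleanup pass for the OCR error types mentioned above.
MONTH_FIXES = {"Jau": "Jan"}  # extend as more OCR misreads turn up

def clean_timestamp(ts):
    ts = re.sub(r" {2,}", " ", ts)          # collapse double spaces
    ts = re.sub(r"[^\x20-\x7E]", "", ts)    # drop weird special characters
    for bad, good in MONTH_FIXES.items():   # fix known month misspellings
        ts = ts.replace(bad, good)
    return ts
```

A dictionary of known substitutions keeps the rules easy to extend as new misreads show up in the data.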
@akrinaldi Yes! I actually started doing that, and about 75% of the initial OCR errors in the timestamps have been fixed, but I need some help figuring out how to fix the remaining ones. The code I was working on is in this module; if you go down to the function fixTS, you can see what I have so far...
@akrinaldi These are the 182 timestamps that still need to be fixed.
What rules have you been going by for correcting some of these, especially for the ones where the original number is unclear? |
If you go into this module and go down to the function fixTS (search "def fixTS(listofTS):" in the search bar), you can see what I have so far. It's been more of a bottom-up process: looking for the common errors and putting together a sort of regex function to fix each one. For example, fixTS.spacer fixes timestamps without a space between the year and the time, which was pretty common (e.g. "20159:30 AM" --> "2015 9:30 AM"). I did this by literally splitting the string into the year and the time and then returning year + " " + time into a new list. There's probably a better way to do this, but I'm using my very elementary Python skills lol.

The ones I'm struggling to come up with rules for are these. Some of them are kind of unfixable, or at least need a more qualitative lens, like the first item in the list. For a lot of them, though, I can probably just apply what I've been doing to the specific cases; I just have yet to do that.
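For what it's worth, the spacer idea described above could also be done in one regex substitution instead of splitting and rejoining the string. This is just a sketch, not the actual fixTS.spacer code:

```python
import re

def spacer(ts):
    # Insert the missing space between a four-digit year and the time:
    # "20159:30 AM" -> "2015 9:30 AM". Already-spaced timestamps pass
    # through unchanged, since the time must directly follow the year.
    return re.sub(r"\b((?:19|20)\d\d)(\d{1,2}:\d\d)", r"\1 \2", ts)
```

The substitution only fires when digits of a time immediately follow a plausible year, so it's safe to run over the whole list.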
We may have another issue on our hands - when using the unix stamps you gave me, I noticed my final visualized calendar had two extra months, for the years 2614 and 2615. It looks like it went ahead and made us some unix timestamps for the future based on some incorrect OCR data.
Oh yes! Good eye. I actually did add the code to fix that to fixTS, but it was only a few days ago so I haven't rerun everything with that applied yet. Let me know if there are any other issues!
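For reference, one way a sanity check like that could look. This is a sketch, not the actual fixTS code, and the year bounds are assumptions about what range the archive can plausibly cover:

```python
from datetime import datetime, timezone

def plausible_unix(ts, min_year=1990, max_year=2020):
    # Flag unix timestamps whose year falls outside the expected range,
    # so OCR-corrupted dates that parsed as 2614/2615 get caught
    # before they end up on the visualized calendar.
    year = datetime.fromtimestamp(ts, tz=timezone.utc).year
    return min_year <= year <= max_year
```

Filtering with a check like this before visualization keeps stray future-dated stamps out of the calendar even if a new OCR error slips past the cleanup rules.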
A lot of the timestamps on the list could be fixed by hand, so I did so. However, there are still a handful we have to come up with a way to interpret and a handful we're probably not going to be able to fix. I can make those into separate lists and post them here so we can deal with them as needed.
@akrinaldi and @hzadeh17: should we set up a single place/procedure for listing the remaining errors? What would work best for you? It might call for different solutions/places, though, for flagging timestamp errors versus spelling mistakes in names. I think Hannah was keeping a list of name appearances?
@akrinaldi @Louise-Seamster I have to go back and check, but I'm pretty sure I ended up fixing most of the errors. I'll check in about this again soon.
Currently there are about 700 timestamps (from all deqs) that are not being converted into unix times, which means they are either false positives or contain an OCR error (e.g. "Fr!day" or "4 :30 PM" aren't read by the Unix converter). Looking at the list, there are quite a lot of OCR errors, so I'm going to work on something to run through the list and fix at least the common ones...
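One way to collect the non-converting timestamps would be to attempt a parse and keep whatever fails. A sketch, assuming a format like "Jan 5, 2015 9:30 AM"; the format string is a guess at what the converter expects, not taken from the actual code:

```python
from datetime import datetime

def failed_conversions(timestamps, fmt="%b %d, %Y %I:%M %p"):
    # Try the expected format on each string and keep anything that
    # fails to parse, so OCR errors like "Fr!day" or "4 :30 PM"
    # end up on a fix-it list.
    bad = []
    for ts in timestamps:
        try:
            datetime.strptime(ts, fmt)
        except ValueError:
            bad.append(ts)
    return bad
```

Running this after each round of cleanup rules would show how many of the ~700 failures remain.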