
Processing text with tds.Hash is very slow #26

Open
tastyminerals opened this issue Sep 11, 2016 · 0 comments

I have created a few scripts to preprocess a text corpus of ~6 MB. To preserve the text formatting I need to iterate over each line and do some text manipulations on it. With plain Lua tables this produces `PANIC: unprotected error in call to Lua API (not enough memory)`, so I decided to try tds.Hash to hold my corpus table.

Here is the code I am using:

  text_arr = tokenize(text)
  text_arr = tds.Hash(text_arr)
  -- replace rare tokens with <unk>
  -- text_arr is a {idx: {tokens arr}}
  for l = 1, #text_arr do        -- iterate over lines
    for t = 1, #text_arr[l] do   -- iterate over tokens in a line
      -- rare is an array of rare words
      for r = 1, #rare do
        if text_arr[l][t] == rare[r] then text_arr[l][t] = "<unk>" end
      end
    end
  end

text_arr is a table with 2900 entries, and this triple-loop operation becomes really slow when using tds.Hash.
I am by no means a Lua expert, but am I doing something wrong?
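For reference, here is a sketch of the same replacement with the inner scan over `rare` turned into a single set lookup (an ordinary Lua table used as a set; `rare_set` is a name I introduce here, not something from my scripts), which avoids one of the three loops:

  -- Build a set once, so each token check is a single hash lookup
  -- instead of a scan over the whole rare-word array.
  local rare_set = {}
  for r = 1, #rare do
    rare_set[rare[r]] = true
  end

  for l = 1, #text_arr do
    for t = 1, #text_arr[l] do
      if rare_set[text_arr[l][t]] then text_arr[l][t] = "<unk>" end
    end
  end

This does not explain why the tds.Hash indexing itself is slow, but it at least reduces the number of `text_arr[l][t]` accesses per token to at most two.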
