

Yeah, it's ridiculous how much is in there. I'm pulling their current repo to see how they're building their DB, so if they don't get back to me I can at least combine the two databases.
And if anyone reading this wants a copy of what I've processed so far, I'm more than happy to share.
But it looks to me like they dropped a couple hundred dollars on just processing those text files. It would be north of $2.5k more to process the data I'm creating.
That being said, mine only goes as far as extracting the contents and creating a sha256 hash to keep track of the documents themselves / detect document tampering. It doesn't take the next step of extracting names, locations, dates, etc.
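In case it's useful, the hashing piece is basically just this, a minimal sketch assuming plain files on disk (the "documents" folder and function name are placeholders, not my actual layout):

```python
import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs don't need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Record one hash per document; re-hashing later and comparing
# flags any file that has changed since it was first processed.
hashes = {p.name: sha256_of_file(p) for p in Path("documents").glob("*.pdf")}
```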
I'm working that out now, but it seems like the right way to do it is to make the output fit into their DB seamlessly.
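For the names/locations/dates step, one option I'm looking at is off-the-shelf NER. This is just a sketch with spaCy, not settled, and the output shape is my guess rather than what their DB actually expects:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_entities(text: str) -> list[dict]:
    """Pull people, places, orgs, and dates out of the extracted document text."""
    doc = nlp(text)
    wanted = {"PERSON", "GPE", "LOC", "ORG", "DATE"}
    return [
        {"text": ent.text, "label": ent.label_,
         "start": ent.start_char, "end": ent.end_char}
        for ent in doc.ents
        if ent.label_ in wanted
    ]
```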

It's phenomenal. I have found a few places where it falls down, and it's usually when the text is incredibly small. You can see it's being downsampled before it gets handed off to the model. One example I found where it falls down is some bank disclosure documentation from Bank of America:
It just came out as all I’s and o’s.
For the emails, book text, letters, etc., I genuinely haven't found a place it didn't work correctly as I've been spot-checking the output.
If you have Colab you can just try the script I put up. All you need to do to have it run is bookmark the House Oversight Committee Google Drive folder to your own Google Drive.
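The Colab side of that is just the standard Drive mount; once the shortcut is added ("Add shortcut to Drive" on the shared folder), it shows up under MyDrive like any other folder. Rough sketch, with the folder name below as a placeholder:

```python
from google.colab import drive
from pathlib import Path

# Mount your Google Drive inside the Colab VM (prompts for authorization).
drive.mount("/content/drive")

# The bookmarked committee folder appears under MyDrive; the name here is a
# placeholder, use whatever the shortcut is called in your Drive.
docs = Path("/content/drive/MyDrive/oversight-committee-files")
print(sorted(p.name for p in docs.iterdir())[:10])
```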