Markdown for RAG - Can't Get Tables for Core Filings Right #407
NamedIdentity asked this question in Q&A
I've been working with GPT 5 to write scripts. I can download the filings I want, for the companies I want, covering 2014-2025, but converting them to markdown is where the problems start.
At first it was hard to get anything other than the default primary report. Eventually I got the exhibits I needed converted to markdown too. I even got the XML for Forms 3, 4, and 5 converted to markdown alongside the HTML-based filings, all in the same script.
But no matter what I try, I can't get the tables in core filings to output in a structured manner. They're always jumbled. The more I ask GPT 5 to fix it, the more I go around in circles and find new ways to break things (at one point, table data was displayed as "nan").
The kicker: if I open the HTML of a 10-K, print it to PDF, and upload the PDF to an Abacus.ai database, it extracts into markdown with properly structured tables. So it can be done. I just have no clue how.
I keep thinking that someone, somewhere, has already made the script I need: one that pulls specific forms, then cleans, parses, and converts them to markdown in preparation for RAG.
Which leads to my request for guidance: what do I need to point GPT 5 to so it can figure out how to salvage this script?
Does anyone have a script they could share that I can give to GPT as an example of 'what to do'?
I'm 'almost there' on a proper workflow for this. My backup is to just pull the filings, convert them to PDF, and hope and pray that doesn't create new problems.
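For concreteness, here is a minimal sketch of the table step described above: turning one HTML table into a markdown pipe table with only Python's standard library. It is an illustration under assumptions, not the script from this post. The names `TableGrabber` and `table_to_markdown` are hypothetical, the sample HTML is made up, and the sketch assumes simple, non-nested tables where the first row is the header. It expands `colspan` into blank cells and drops columns that are empty in every row, since spacer columns are one plausible cause of jumbled output. The "nan" values are consistent with empty cells being read as missing values by a DataFrame-based parser, though that is a guess about what the script does.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings.

    Hypothetical helper for illustration; nested tables and rowspan are NOT handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables
        self._rows = None  # rows of the table currently being read
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read
        self._span = 1     # colspan of the current cell

    def _close_cell(self):
        # Normalise whitespace, then pad with blanks so colspan keeps columns aligned.
        if self._cell is not None and self._row is not None:
            text = " ".join("".join(self._cell).split())
            self._row.append(text)
            self._row.extend([""] * (self._span - 1))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []
            span = dict(attrs).get("colspan") or "1"
            self._span = int(span) if span.isdigit() else 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            if any(self._row):  # skip rows that are entirely blank
                self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def table_to_markdown(rows):
    """Render rows as a markdown pipe table, treating the first row as the header."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)
    grid = [r + [""] * (width - len(r)) for r in rows]  # pad ragged rows
    keep = [i for i in range(width) if any(r[i] for r in grid)]  # drop spacer columns
    grid = [[r[i].replace("|", "\\|") for i in keep] for r in grid]
    lines = ["| " + " | ".join(grid[0]) + " |",
             "| " + " | ".join("---" for _ in keep) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in grid[1:]]
    return "\n".join(lines)


# Made-up sample resembling a filing table: a spacer column and a split "$" cell.
html = """
<table>
  <tr><th>Item</th><th></th><th colspan="2">2024</th></tr>
  <tr><td>Revenue</td><td></td><td>$</td><td>1,200</td></tr>
  <tr><td>Net income</td><td></td><td>$</td><td>300</td></tr>
</table>
"""
parser = TableGrabber()
parser.feed(html)
markdown = table_to_markdown(parser.tables[0])
print(markdown)
```

Real filings are messier than this sample (nested layout tables, `rowspan`, footnote markers, multi-row headers), so a sketch like this is only a starting point for describing the desired output to GPT, not a drop-in fix.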