Datamule

Scaling data processing, starting with the SEC corpus.

Jan 15, 2026
Thank you to nano banana for upscaling my primitive logo.

Goal

To understand how something works we need information. That information is often stored in a format that is hard to access. If we can make it easy to convert from hard to access formats to easy to access formats, we make it easy to understand how things work.

What that actually means

  • I’ve figured out how to clean, standardize and enrich information at scale.
  • I’m applying this technology to the SEC corpus. You can see existing products here or suggest new products here. Products are integrated into the datamule open source python package here.
  • I will later make a mature version of the underlying technology available via API, or to specific partners.

Motivation

From age 16 to 25, I was involved in economic research at Berkeley, UCLA, and MIT. The bottleneck was quality data. Getting it took the majority of the time, and many other time expenditures stemmed from having to work around low quality data.

After LLMs went viral in 2022, it became clear that they would shortly be able to plot graphs, run regressions, and analyze the results.

However, they have not made much progress in data cleaning and standardization. This is because an LLM-based approach to making data nice requires either:

  1. Ingesting the data into its context window, which is expensive, among other faults.
  2. Sampling the data, writing code to clean it, then iterating, which is a much harder problem.

LLMs have made data analysis much easier, and data cleaning and standardization slightly easier. This makes the latter even more valuable!

Some CS History

Computers are much better today than 20 years ago. Some of that is better hardware. But a lot of it is that we humans have gotten very good at storing data efficiently.

It is easy to use a Parquet file (2013) with zstd compression (2016) to generate nice graphs out of the flattened SEC XBRL corpus (~7 GB of data). Doing that with modern hardware but 2005 technology is very hard. For one, the CSV representation is ~110 GB. For another, CSVs do not encode data types well. Loading such a large dataset with many strings will likely lead to a parsing error.
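A minimal sketch of the data type problem, using only the Python standard library. The rows, column names, and the `naive_infer` helper are made up for illustration and are not taken from the real XBRL data. A CSV stores only text, so any loader has to guess types, whereas a Parquet file carries its schema inside the file.

```python
import csv
import io

# Hypothetical rows for illustration only; not real XBRL data.
rows = [
    {"cik": "0000012345", "fact": "Revenues", "value": "1000000"},
    {"cik": "0000067890", "fact": "Revenues", "value": "N/A"},
]


def naive_infer(text):
    """Guess a type the way a schema-less loader might: int, then float, then str."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


# Write the rows to an in-memory CSV, then read them back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["cik", "fact", "value"])
writer.writeheader()
writer.writerows(rows)
buffer.seek(0)

loaded = [
    {key: naive_infer(text) for key, text in row.items()}
    for row in csv.DictReader(buffer)
]

# The identifier lost its leading zeros, and the value column now mixes types.
print(loaded[0]["cik"])
print(sorted({type(row["value"]).__name__ for row in loaded}))
```

Here the zero-padded identifier comes back as a plain integer and the `value` column ends up holding both integers and strings. This is the kind of ambiguity that tends to surface as a parsing error at scale.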

To illustrate why this is important in the age of LLMs, consider the following naive example:

  1. User instructs LLM to research company EBITDA over time.
  2. LLM downloads xbrl.csv.
  3. LLM tries to load xbrl.csv and sample ten columns.
  4. Parsing error.
  5. LLM iterates multiple times to address the parsing error with different solutions.
  6. LLM samples data and performs various regressions to understand EBITDA. LLM’s context window is now full.
  7. LLM saves progress in a text file, and calls new fresh instance of itself.
  8. New instance forgets about parsing error.
  9. New instance iterates addressing parsing error with different solutions.
  10. New instance begins research phase with a fairly full context window.
  11. etc.

Now consider the alternative:

  1. User instructs LLM to research company EBITDA over time.
  2. LLM downloads xbrl.parquet.
  3. LLM loads xbrl.parquet and samples ten columns.
  4. LLM samples data and performs various regressions to understand EBITDA. LLM’s context window is still fresh.
  5. LLM completes research phase and outputs results to user.

Again, this is a naive example. But the fundamental insight is that by making data nice, we reduce possibilities for error, reduce LLM cost, and increase quality of output.

Sense of Scale

In 2025, Hebbia and Reducto announced that they had each processed 1 billion pages to produce data that was ‘LLM ready’. This is a lot given their vision, and multimodal approach. Ballpark tens of millions in compute.

Using an algorithmic approach, I regularly process tens of millions of pages to produce ‘LLM ready’ data on my personal laptop with 16 GB of RAM. An apples to oranges comparison, but it illustrates why algorithms are great. They scale well.

Why the SEC Corpus?

The SEC corpus is reasonably large (16 TB), and contains a diversity of file types: pdf, html, text, sgml, jpg, and more. It is also extremely valuable.

More than that, it is just the right level of messy data. Due to its legal and regulatory nature, information even in unstructured form such as pdf or html has structural constraints that we can exploit.

More than that, the formats have changed over time. Text files became html, html became xml, and so on. Some filings for domestic entities must be filed in html, while filings that must have the same information in them can be filed in pdf for foreign entities.

Once we figure out how to convert the different human readable files to machine readable form, we have about 100 million data points from which to learn the conversion from one file format to another, under the constraint that the output must contain the same information and remain human readable.

I’m not explaining this well, but basically, we might be able to capture an interesting latent space. If we can capture this latent space, we possibly crack human readable file conversion for everything, not just the SEC.

Initial work

  1. Made a package to work with sec.gov endpoints.
  2. Figured out how to host the SEC archive.
  3. Wrote generalized algorithmic html and pdf parsers that parse thousands of pages per second, running locally.
  4. Figured out how the SEC stores information: secsgml, secxbrl, fundamentals.
  5. Set up distributed computing. Don’t worry, it got better than this.
  6. Figured out the fastest way to get SEC filings.

Problems that I will solve over the next few months

  1. Convert every SEC XML file into columnar form.
  2. Standardize tables within HTML files across all SEC filings.

Confident I can solve (1). Companies use filing software to convert CSV files into XML to submit to EDGAR, so we know that SEC XML to CSV is possible. Solving this will give me insight into generalized approaches that apply to non-SEC XML.
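To make "columnar form" concrete, here is a rough sketch using the standard library’s `xml.etree.ElementTree`. The XML snippet, its tag names, and the `flatten` and `to_columns` helpers are hypothetical and do not reflect a real SEC schema or how datamule does this. The idea is simply that each repeated record becomes a row, nested tags become dotted column names, and missing fields become `None`.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML for illustration; not a real SEC schema.
SAMPLE_XML = """
<holdings>
  <holding>
    <issuer>ACME CORP</issuer>
    <value>1200</value>
    <shares><amount>300</amount><type>SH</type></shares>
  </holding>
  <holding>
    <issuer>EXAMPLE INC</issuer>
    <value>450</value>
  </holding>
</holdings>
"""


def flatten(element, prefix=""):
    """Map each leaf element under `element` to a dotted-path -> text pair."""
    children = list(element)
    if not children:
        return {prefix: (element.text or "").strip()}
    flat = {}
    for child in children:
        path = f"{prefix}.{child.tag}" if prefix else child.tag
        flat.update(flatten(child, path))
    return flat


def to_columns(xml_text, row_tag):
    """Turn every <row_tag> element into a row, then pivot the rows into columns."""
    root = ET.fromstring(xml_text.strip())
    rows = [flatten(node) for node in root.iter(row_tag)]
    # Keep column names in first-seen order across all rows.
    names = list(dict.fromkeys(name for row in rows for name in row))
    # Missing fields become None so every column has the same length.
    return {name: [row.get(name) for row in rows] for name in names}


columns = to_columns(SAMPLE_XML, "holding")
for name, values in columns.items():
    print(name, values)
```

This sketch ignores attributes and overwrites repeated tags within a record, both of which a real converter would need to handle.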

Fairly confident I can solve (2). By law, certain tables contain certain information. This must persist across all filings of that type. So column names, and the data within them, must be somewhat standardized. One approach would be to use LLMs on the tables, but that would cost millions. I think a naive approach using unsupervised machine learning, string similarity, and a dash of LLMs for certain difficult parts should work well. Pretty sure I can mostly solve this with ~$100 in compute.
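As a rough illustration of the string similarity piece only, the sketch below maps messy table headers onto a canonical set with `difflib.SequenceMatcher` from the standard library. The canonical names, the raw headers, the `normalize` and `match_header` helpers, and the 0.6 threshold are all assumptions chosen for the example. It does not cover the unsupervised learning or LLM parts.

```python
from difflib import SequenceMatcher

# Hypothetical canonical names; real filings vary far more than this.
CANONICAL = ["name of issuer", "title of class", "number of shares"]


def normalize(header):
    """Lowercase the header and collapse punctuation and whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in header.lower())
    return " ".join(cleaned.split())


def match_header(header, canonical=CANONICAL, threshold=0.6):
    """Return the closest canonical name, or None if nothing is similar enough."""
    target = normalize(header)
    scored = [
        (SequenceMatcher(None, target, name).ratio(), name) for name in canonical
    ]
    score, best = max(scored)
    return best if score >= threshold else None


raw_headers = ["Name of Issuer:", "No. of Shares", "Title  of  Class", "Footnotes"]
mapping = {header: match_header(header) for header in raw_headers}
for header, standard in mapping.items():
    print(repr(header), "->", standard)
```

Headers that match nothing, such as "Footnotes" here, come back as `None`. Those leftovers are the natural place to spend a small LLM budget.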

Problems on the horizon

  1. Generalization of techniques to other corpuses
  2. Latent Space
  3. Enrichment — e.g. cheap entity detection

Datamule’s name

The name Datamule came from my love of Isaac Asimov, StarCraft, and because I spent my teenage years in Berkeley’s CS department.

For those who don’t know, Databricks grew out of a memory based open source project to scale distributed computing (Apache Spark) at Berkeley’s AMPLab. This largely replaced Hadoop, a disk based distributed computing framework.

Like Databricks, Datamule grew out of an open source project, so I felt a certain kinship. This is why when the domain ‘mule’ was taken, I added ‘data’ in front of it.

Note: a data mule is a courier that transports data storage to remote locations. One of my many crimes against the English language.

Why open source?

Most of my code is open source, except for my AWS and Cloudflare infrastructure. I made this choice because:

  1. I think a public version of this technology should exist.
  2. I feel as if open source allows me to iterate faster.

Monetization

I believe the value from advancing the underlying technology currently outweighs short term monetization.

Funding

I am currently bootstrapping, funded by revenue and contract work. I currently have ~$100k in compute. Notably from:

  • AWS Activate (25k/year, I have 18 months left)
  • Cloudflare for Startups (5k for one year, I have 6 months left)

I would be happy to take additional compute or grants. Grants would be useful, as they would allow me to spend less time contracting and more time on the core technology.

Future

I will either raise to fund a small research group or join a larger effort. The former appeals to me because I enjoy the challenge of responsibility; the latter appeals to me because I am young and inexperienced.

What’s important to me is creative autonomy, and that my open source code remains open source.

I have received and will continue to turn down offers to close source it.

Contact

I am happy to chat. My email is johnfriedman@datamule.xyz. I prefer email to video calls.

Acknowledgements

  • Thank you to Daniel Shaar (Modal), Arsen Vasilyan (MIT, Simons), and Yosef Mihretie (Porter) for excellent technical advice.
  • Thank you to Evan Tana (SPC), Jack McClelland (Afore), Ryan Sells (PearX), and Jakob Diepenbrock (Disciplus Ventures) for acting as a sounding board.
  • Thank you to AR (UCLA), Emerson and Morris Hsieh (Primodium), Whitney Zhang (MIT), Daniel Lovera, Rob Ferguson (Microsoft), JF, KZ, Kate Hu, CT, KP, and several anonymous mentors for guidance and support. Also, many open source contributors. I very much appreciate you.

Written by John Friedman

I recently left my PhD to work on information processing.