Show HN: Large Scale Article Extract of Newspapers 1730s-1960s

snewpapers.com

46 points by brettnbutter 1 day ago

Hello HN, over the past 7 months I've spent nearly 3,000 hours building SNEWPAPERS, the first historical newspaper archive with full-text extractions, near-perfect OCR, a vast categorization taxonomy, and of course semantic and agentic search capabilities.

Problem: I wanted to search through newspaper archives, but every service I tried only lets you search by keyword and date, and gives you back raw images of the papers, too many of them and with no context. A sea of noise.

Solution: I taught machines how to read the newspapers, and so far I've extracted the content from more than 600k pages (about 5TB) from the Chronicling America collection. Problems I had to deal with included an infinite variety of layouts, font sizes, scan qualities, resolutions, and aspect ratios, plus navigating around the images on each page. I also had to figure out how to get the OCR nearly perfect so people wouldn't hate reading the extracts. I stitched together a multi-model pipeline (layout tech, OCR tech, LLM, vLLM) with heuristics to go from layout -> segmentation -> classification. I put it all in OpenSearch / Postgres, made it semantically searchable, and added an agentic search tool on top that knows how to use the API really well and helps you write queries to find what you're looking for. Happy to discuss the AWS architecture and scaling as well, that was tough!
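Roughly, the shape of that layout -> segmentation -> classification flow, as a minimal sketch (the region format, stubs, and names here are hypothetical, not the actual code):

```python
from dataclasses import dataclass

@dataclass
class Block:
    bbox: tuple          # (x0, y0, x1, y1) region on the page image
    kind: str            # e.g. "headline", "body", "masthead", "image"
    text: str = ""

def process_page(regions, ocr, cleanup):
    """Typed layout regions in, OCR per text region, then a
    cleanup pass that stitches blocks into readable articles."""
    blocks = [Block(bbox=r["bbox"], kind=r["kind"]) for r in regions]
    for blk in blocks:
        if blk.kind != "image":           # skip pure-image regions
            blk.text = ocr(blk.bbox)      # OCR the cropped region
    return cleanup(blocks)                # LLM tidy-up / stitching
```

The point is just that each model handles one narrow job, and heuristics glue the stages together.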

If you have five minutes and you just want to jump in and have your own personalized experience, what I would suggest is:

Before searching for anything, go to the Sleuth page. Ask it about anything from 1736 to 1963, maybe with 1 or 2 follow-up questions. Then go to the search page so you can see the queries it wrote for you (bottom left, "saved queries") and uncover more info on whatever it is you're interested in.

If you think it's cool and you want to learn more, there's about 10 minutes of video guides on the various capabilities under "Guide" in the nav bar.

Some other people have also taken a crack at this, notably:

https://dell-research-harvard.github.io/resources/americanst... (very good attempt)

https://labs.loc.gov/work/experiments/newspaper-navigator/ (focused on images)

benwills 1 day ago

As someone who has done a lot of downloading/parsing, this is so awesome and impressive to see.

One thing to think about, which I also struggle with when it comes to large and complicated datasets, is the UI. Even being in the search industry for a long time, it's difficult for me to concretely see how I would use this.

I'd suggest taking a small sample of the dataset that might be reflective of how people would use it, then make that segment public and immediately searchable without registering. eg: One year of articles related to the Olympics.

What I've found is that it's hard for a lot of people to imagine how they would use something without actually using it. So giving people the actual experience of searching the archive and interacting with the results would go a long way.

Again, congrats. This is really impressive work.

  • brettnbutter 1 day ago

    Thank you, I really appreciate it. I will see if I can figure out how to do that, or something like "if you're authed, you can try the Sleuth or get x free searches a month"? The balance is doing that without (potentially) overwhelming the databases, more than intentionally trying to gate people out of anything. I'll figure it out!

    I don't know if you looked at the "Label Specific" search, but I think I could fairly easily isolate that to a particular label and sub-type for people to search within without much risk to the backend. Any thoughts on a good category?

longplay 2 hours ago

How well do you think your OCR solution would work on magazines? I found OCR very hit and miss with magazines, especially ones with text set over background pictures, etc.

  • brettnbutter 36 minutes ago

    I'm not sure, but in theory I would expect the layouts to be less variable with magazines due to the smaller page space. And vLLMs seem to be good at odd text inside images.

brettnbutter 1 day ago

A few examples you can click on without having to authenticate or start the free trial (no CC if you do, though, and I won't bother you or chase you with spam, etc...)

https://snewpapers.com/components/b2d40c08-db63-40e8-890f-09...

https://snewpapers.com/components/0fabc8e4-a60b-4f31-9ad1-b0...

https://snewpapers.com/components/cdde790f-4e97-4f2d-a2c2-95...

  • StilesCrisis 1 day ago

    I see an obvious typo in the first one: "wickked deeds of witchecran" (should be craft)

    I can see why the OCR is a challenge here, and spellcheck is a lost cause, but I'm surprised an LLM cleanup pass didn't detect this?

    • brettnbutter 1 day ago

      In hindsight it was probably a terrible example to use, because people will think the OCR is off, but if you click on the clipping (or download the PDF from the link at the top) and zoom in, you'll see that it's quoting some ancient text verbatim, which uses a lot of old-timey spelling ("wickked", e.g., is actually spelled that way in the article), so I'm pretty happy with the quality it managed to eke out on that!

      Check out the other examples for a more representative quality :-)

    • brettnbutter 23 hours ago

      I see what you mean now with the "witchecran". I do instruct it not to correct old-time spellings, and only to tidy up obvious mistakes. I guess it got confused on this one.

zzleeper 1 day ago

Looks cool, congrats!

I've also worked with this data, but only for research purposes:

https://www.finhist.com/bank-runs/episodes/13895.html

https://www.finhist.com/bank-runs/index.html

Surprisingly, I found out that layout was the trickiest thing, as newspaper articles often had multiple layers of headers, spanned multiple columns, etc.

Do you have a preferred solution on that?

  • brettnbutter 1 day ago

    Nice collection you have there.

    Just asked the Sleuth for some examples of that, and here's one to add to your Unional National one: https://www.finhist.com/bank-runs/episodes/19827.html

    https://snewpapers.com/components/0b22f0ca-60d2-4d63-be99-74...

    Yes, I agree the layouts are the trickiest part. I tried a few options and ended up using some of the PaddlePaddle models for document layout analysis, orientation, and such, which give bounding boxes and a predicted reading order. But the reading orders aren't great even with the most recent SOTA models on complex layouts, or even on simple layouts when you have mastheads, images, or other artifacts to work around. It's still valuable information, though, that can be combined with heuristics to stitch together a more accurate reading order as the starting point of a pipeline.
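    For a rough sense, a toy version of a column-first heuristic over those bounding boxes might look like this (illustrative only, not my actual code, and the tolerance is made up):

```python
def reading_order(boxes, col_tolerance=50):
    """boxes: list of (x0, y0, x1, y1). Returns boxes ordered
    column-by-column, top-to-bottom, as in a newspaper."""
    # Group boxes into columns by their left edge.
    columns = []
    for box in sorted(boxes, key=lambda b: b[0]):
        for col in columns:
            if abs(col[0][0] - box[0]) <= col_tolerance:
                col.append(box)
                break
        else:
            columns.append([box])
    # Read each column top-to-bottom, columns left-to-right.
    ordered = []
    for col in sorted(columns, key=lambda c: c[0][0]):
        ordered.extend(sorted(col, key=lambda b: b[1]))
    return ordered
```

    The model's predicted order can then be reconciled against a heuristic ordering like this, which is where mastheads and images need special-casing.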

    • zzleeper 1 day ago

      Great! I was thinking about PP, but because I ran an order of magnitude fewer articles (under 1M pages, by piggybacking on Dell's OCR), I relied on Arcanum ( https://www.arcanum.com/en/newspaper-segmentation/about/ ), which was cheap enough (but I think not cheap enough at your scale).

      Cheers!

      • brettnbutter 1 day ago

        Hmm, I just tried to upload the JPGs of some of today's samples to Arcanum via https://www.arcanum.com/en/newspaper-segmentation/try-it/ and it didn't work. I'll try again later, but based on a cursory look it seems it wouldn't return the info I'd need to correct the output if I didn't like it, and that I'd still have to stitch the individual pages back together myself?

        Probably much cheaper than my process though...

brettnbutter 22 hours ago

I'm opening up https://snewpapers.com/today-in-history to the world right now. Per @benwills's advice below, I will figure out how to make a section of the data searchable for free as well, but this is the best I can do today!

Thank you everyone for taking the time to look at snewpapers today, enjoy!

  • whythismatters 20 hours ago

    >for free

    I took a quick look but could not find any pricing info on the website, could you clarify a bit?

    • brettnbutter 20 hours ago

      When you're anonymous or Google/email authed, you can now see the today-in-history stuff. There should be a gentle nudge to sign up when you view a component or go to pages that require activating the free trial (no CC required), which opens up search / Sleuth / custom collections. That takes you to /subscribe with all the options.

      • seanb 19 hours ago

        I'm not going to sign up for a free trial unless I can see the pricing information up front.

        • brettnbutter 19 hours ago

          I think I see the confusing part now. The /subscribe page is only visible once you Google-auth or enter and confirm your email. It's $9.99/month for the default selection, though. In Stripe it has the no-CC option, so if you click the button it will just auto-cancel after a week anyway.

    • brettnbutter 20 hours ago

      I'm going to try to figure out how to make a subset of the data searchable for free (for anyone, anonymous or not, without having to start the free trial) in the next few days as well, so people can play with that too before committing to anything.

nastrofa 1 day ago

It would be really cool to create different analyses across time:

- Each month's / year's top news headline

- Left / Right swings of publishers

  • brettnbutter 1 day ago

    Great idea. This should be fairly easy to do with the embedding vectors I have for the semantic search, using some clustering tools. Adding it to my backlog now!
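    As a rough sketch of the kind of thing I mean (illustrative, not the planned implementation): comparing per-year centroids of the stored embeddings would already show coverage drifting over time.

```python
import math

def centroid(vectors):
    """Componentwise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def yearly_drift(embeddings_by_year):
    """embeddings_by_year: {year: [embedding vectors]} ->
    {year: cosine similarity with the previous year's centroid}.
    Low similarity flags years where coverage shifted."""
    years = sorted(embeddings_by_year)
    cents = {y: centroid(embeddings_by_year[y]) for y in years}
    return {cur: cosine(cents[prev], cents[cur])
            for prev, cur in zip(years, years[1:])}
```

    Proper clustering (k-means over the vectors, bucketed by year or publisher) would be the next step beyond this.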