Populus:DezInV/Notes
archive
- Discussion about compression with somebody working on Russian web archive
crawling
- httparchive publishes an almanac with statistics about websites
index
scraping
search
from org-mode/thk
Decentralized search engine
- https://yacy.net
- https://github.com/nvasilakis/yippee
- Java
- last commit 2012
- https://www.techdirt.com/articles/20140701/03143327738/distributed-search-engines-why-we-need-them-post-snowden-world.shtml
- https://metaphacts.com/diesel
- non-free enterprise search engine
- https://fourweekmba.com/distributed-search-engines-vs-google/
- presearch.{io|org}
- non-free, blockchain, blah blah
- https://github.com/kearch/kearch
- last commit 2019
- Python
- https://hackernoon.com/is-the-concept-of-a-distributed-search-engine-potent-enough-to-challenge-googles-dominance-l1s44t2
- https://blog.florence.chat/a-distributed-search-engine-for-the-distributed-web-39c377dc700e
- https://wiki.p2pfoundation.net/Distributed_Search_Engines
- https://en.wikipedia.org/wiki/Distributed_search_engine
- https://searx.github.io/searx/
- https://www.astridmager.net researcher; alternative, ethical search
- Why one should also keep a local archive: https://www.medialens.org/2023/is-that-orwellian-or-kafkaesque-enough-for-you-the-guardian-removes-bin-ladens-letter-to-america
Internet Archive
- https://netpreserve.org/about-us/
- https://github.com/iipc/awesome-web-archiving
- https://github.com/internetarchive
- https://github.com/iipc/openwayback
- http://timetravel.mementoweb.org/
- https://github.com/webrecorder
Crawling, Crawler
ArchiveBox
Python, active
Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…
grab-site
Python, active https://github.com/ArchiveTeam/grab-site
WASP
- Paper: WASP: Web Archiving and Search Personalized via https://github.com/webis-de/wasp
- Java, inactive?
- Related paper: Extending WASP: providing context to a personal web archive
StormCrawler
- https://en.wikipedia.org/wiki/StormCrawler
- Based on Apache Storm (distributed stream processing)
HTTrack
- https://www.httrack.com
- Written in C, last release 2017, but active forum and GitHub
- https://github.com/xroche/httrack/tags
Grub.org
Decentralized crawler of the Wikia project, in C# and Python
HCE – Hierarchical Cluster Engine
- http://hierarchical-cluster-engine.com
- https://en.wikipedia.org/wiki/Hierarchical_Cluster_Engine_Project
- Ukraine, active until around 2014?
Heritrix
- https://en.wikipedia.org/wiki/Heritrix
- Crawler of archive.org
Haskell
- https://shimweasel.com/2017/07/13/a-modest-scraping-proposal
- https://github.com/jordanspooner/haskell-web-crawler
- dead since 2017, seems like student assignment
- https://hackage.haskell.org/package/scalpel
- A high level web scraping library for Haskell.
- https://hackage.haskell.org/package/hScraper
- a single version, 0.1.0.0, from 2015
- https://hackage.haskell.org/package/hs-scrape
- a single version from 2014, but a git commit from 2020
- https://hackage.haskell.org/package/http-conduit-downloader
- HTTP/HTTPS downloader built on top of http-client and used in https://bazqux.com crawler.
- https://github.com/hunt-framework/hunt
- A flexible, lightweight search platform
- Predecessor: https://github.com/fortytools/holumbus/
- lower level libs
Search
openwebsearch.eu
lemurproject.org
lucindri
https://lemurproject.org/lucindri.php
Lucindri is an open-source implementation of Indri search logic and structured query language using the Lucene Search Engine. Lucindri consists of two components: the indexer and the searcher.
Galago
toolkit for experimenting with text search
Wikia
- https://en.wikipedia.org/wiki/Wikia_Search
- by Jimmy Wales
- Implementation
- Grub, decentralized crawler
- Apache Nutch
Research 2023-11-26
- https://en.wikipedia.org/wiki/Category:Web_scraping
- https://en.wikipedia.org/wiki/Category:Web_archiving
- https://en.wikipedia.org/wiki/Category:Free_search_engine_software
- https://en.wikipedia.org/wiki/Category:Internet_search_engines
- https://en.wikipedia.org/wiki/Category:Free_web_crawlers