Commit Graph

190 Commits

Mickaël Serneels
c8d594fb73 add url extraction
urls get extracted from the webpage when the page contains proxies

this allows "learning" as many links as possible from a working website
2019-05-01 22:58:23 +02:00
rofl0r
866f308322 proxywatchd: remove bogus blanket exception handler
this would catch *any* exception, including typos
2019-05-01 20:05:57 +01:00
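The "bogus blanket exception handler" point can be illustrated outside the project's code: a bare `except:` also swallows the `NameError` a simple typo raises, so the bug never surfaces. A minimal sketch (the function names and the `lenght` typo are invented for illustration):

```python
def risky_op():
    # A typo: `lenght` is undefined, so Python raises NameError.
    return lenght([1, 2, 3])

def with_blanket_handler():
    try:
        return risky_op()
    except:            # the removed "bogus" pattern: catches *anything* ...
        return None    # ... so the typo's NameError vanishes silently

def with_narrow_handler():
    try:
        return risky_op()
    except OSError:    # only the failures we actually expect
        return None

print(with_blanket_handler())   # prints None: the bug is hidden

try:
    with_narrow_handler()
except NameError as e:
    print("typo surfaced:", e)  # the narrow handler lets the typo crash loudly
```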
rofl0r
01435671c1 add latest rocksock 2019-05-01 20:04:30 +01:00
Mickaël Serneels
0fb706eeae clean code 2019-05-01 17:43:29 +02:00
Mickaël Serneels
9a624819d3 check content type 2019-05-01 17:43:29 +02:00
Mickaël Serneels
0962019386 add own searx instance 2019-05-01 17:43:29 +02:00
Mickaël Serneels
70b6285394 scraper: more changes 2019-05-01 17:43:29 +02:00
Mickaël Serneels
482cf79676 scraper: make query configurable (Proxies, Websites, Search)
--scraper.query = 'pws'
2019-05-01 17:43:28 +02:00
Mickaël Serneels
15fc29abc4 externalize searx instances into new file "searx.instances" 2019-05-01 17:43:28 +02:00
Mickaël Serneels
c194d5cfc7 scraper: add debug option 2019-05-01 17:43:28 +02:00
Mickaël Serneels
0155c6f2ad ppf: check content-type (once) before trying to download/extract proxies
avoid trying to extract stuff from pdf and such (only accept text/*)

REQUIRES:
sqlite3 websites.sqlite "alter table uris add content_type text"

Don't test known uris:
sqlite3 websites.sqlite "update uris set content_type='text/manual' WHERE error=0"
2019-05-01 17:43:28 +02:00
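The "only accept text/*" rule above can be sketched as a small predicate on the Content-Type header. `acceptable_content_type` is a hypothetical name, not necessarily the function ppf actually uses; the real check stores its result in the `uris.content_type` column added by the migration:

```python
def acceptable_content_type(content_type):
    """Accept only text/* types, per the commit's rule (hypothetical
    helper; parameters such as '; charset=utf-8' are ignored)."""
    if not content_type:
        return False
    # drop parameters and normalize case before comparing the main type
    main_type = content_type.split(";", 1)[0].strip().lower()
    return main_type.startswith("text/")

print(acceptable_content_type("text/html; charset=utf-8"))  # True
print(acceptable_content_type("application/pdf"))           # False
```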
Mickaël Serneels
e19c473514 update imports.txt 2019-05-01 17:43:28 +02:00
Mickaël Serneels
75318209ab oldies_multi: change default value from 100 to 10 2019-05-01 17:43:28 +02:00
Mickaël Serneels
d09244d04d proxywatchd: fix Exception error
Exception in thread Thread-9:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "proxywatchd.py", line 200, in workloop
    job.run()
  File "proxywatchd.py", line 123, in run
    sock, proto, duration, tor, srv, failinc = self.connect_socket()
ValueError: need more than 5 values to unpack
2019-05-01 17:43:28 +02:00
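The traceback is a classic tuple-unpacking mismatch: `connect_socket()` returned five values while the caller unpacked six. A minimal illustration with placeholder values (the real return objects are not shown in the log):

```python
def connect_socket_old():
    # returned only five values (placeholders stand in for the real objects)
    return ("sock", "socks5", 0.42, False, "server")

def connect_socket_fixed():
    # the fix: also return the sixth value the caller expects
    return ("sock", "socks5", 0.42, False, "server", True)

try:
    sock, proto, duration, tor, srv, failinc = connect_socket_old()
except ValueError as e:
    print("old:", e)  # Python 2 worded this "need more than 5 values to unpack"

sock, proto, duration, tor, srv, failinc = connect_socket_fixed()
print("fixed, failinc =", failinc)
```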
Mickaël Serneels
7aea9a3e53 irc: minimize possible response code 2019-05-01 17:43:28 +02:00
Mickaël Serneels
7b9f8b2e00 create socks4_resolve()
moves socks4 resolution out of socket_connect block
2019-05-01 17:43:28 +02:00
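The log does not show the body of `socks4_resolve()`, but plain SOCKS4 carries a raw 4-byte IPv4 address in its CONNECT request, so the client must resolve the hostname itself before connecting. A sketch of that request layout, with a hypothetical builder function (not the project's code):

```python
import socket
import struct

def socks4_connect_request(ipv4, port, userid=b""):
    """Build a plain SOCKS4 CONNECT request (hypothetical sketch):
    version 4, command 1 (CONNECT), dest port, dest IPv4, userid, NUL."""
    return (struct.pack(">BBH", 4, 1, port)
            + socket.inet_aton(ipv4)
            + userid + b"\x00")

# the IP must already be resolved, e.g. via socket.gethostbyname(host)
req = socks4_connect_request("127.0.0.1", 8080)
print(req)
```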
Mickaël Serneels
bad4d25bcf make watchd.tor_safeguard a configurable option (default: True) 2019-05-01 17:43:28 +02:00
Mickaël Serneels
59eea18bca update urignore 2019-05-01 17:43:28 +02:00
Mickaël Serneels
6427d4a645 remove that specific blogspot url 2019-05-01 17:43:28 +02:00
Mickaël Serneels
475f10560e search: more changes 2019-05-01 17:43:28 +02:00
Mickaël Serneels
8900153871 set default error value to 1 for new urls 2019-05-01 17:43:28 +02:00
Mickaël Serneels
fdd486f73c remove '-intitle:pdf' from default search 2019-05-01 17:43:28 +02:00
Mickaël Serneels
a2783bdfcf don't loop over every searx instance
randomly pick one per search instead
2019-05-01 17:43:28 +02:00
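The pick-one-at-random change boils down to a single `random.choice` per search. A trivial sketch with placeholder URLs (the real list lives in the "searx.instances" file introduced earlier):

```python
import random

# placeholder instance URLs, not real searx deployments
instances = ["https://searx.one.example",
             "https://searx.two.example",
             "https://searx.three.example"]

instance = random.choice(instances)  # one instance per search, not a loop over all
print(instance)
```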
Mickaël Serneels
67aec84320 fix Exception error
Exception in thread Thread-8:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "proxywatchd.py", line 191, in workloop
    job.run()
  File "proxywatchd.py", line 114, in run
    sock, proto, duration, tor, srv, failinc = self.connect_socket()
  File "proxywatchd.py", line 76, in connect_socket
    sock.send('%s\n' % random.choice(['NICK', 'USER', 'JOIN', 'MODE', 'PART', 'INVITE', 'KNOCK', 'WHOIS', 'WHO', 'NOTICE', 'PRIVMSG', 'PING', 'QUIT']))
  File "rocksock.py", line 279, in send
    return self.sock.sendall(buf)
  File "/usr/lib/python2.7/ssl.py", line 741, in sendall
    v = self.send(data[count:])
  File "/usr/lib/python2.7/ssl.py", line 707, in send
    v = self._sslobj.write(data)
error: [Errno 32] Broken pipe
2019-05-01 17:43:28 +02:00
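The traceback shows the IRC probe writing to a socket whose peer had already gone away, so the resulting EPIPE escaped as an unhandled exception. A socketpair with one end closed reproduces the same failure mode; the actual fix in proxywatchd.py is not shown in the log, but a narrow handler like the one below would contain it:

```python
import socket

a, b = socket.socketpair()
b.close()  # simulate the remote side disappearing

failed = False
try:
    a.sendall(b"PING\n")   # peer is gone: raises BrokenPipeError (EPIPE)
    a.sendall(b"PING\n")
except (BrokenPipeError, ConnectionResetError) as e:
    failed = True          # treat it as a probe failure instead of crashing
    print("send failed:", e)
finally:
    a.close()
```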
Mickaël Serneels
003a9074d2 make server file configurable 2019-05-01 17:43:28 +02:00
Mickaël Serneels
c729bf666e searx: use sample instances
don't loop over *all* instances
2019-05-01 17:43:28 +02:00
rofl0r
207574c815 import.txt: add chinese site 2019-05-01 17:43:28 +02:00
rofl0r
bf7ec03fbf fetch.py: factor out twice used var 2019-05-01 17:43:28 +02:00
rofl0r
096ee21286 urignore: add some rules suppressing SEO spam 2019-05-01 17:43:28 +02:00
mickael
310b01140a irc: implement use_ssl = 2
0: disabled, 1: enabled, 2: maybe
default is 0
2019-05-01 17:43:28 +02:00
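The tri-state `use_ssl` option can be read as "which connection attempts to make, in which order". A hypothetical helper sketching that interpretation (the log only states the three values and the default, not how the code branches on them):

```python
def connection_attempts(use_ssl):
    """Map the commit's tri-state to an ordered list of attempts:
    0: plaintext only, 1: SSL only, 2: 'maybe' -- try SSL, then
    fall back to plaintext. Hypothetical helper, not the project's code."""
    if use_ssl == 0:
        return [False]
    if use_ssl == 1:
        return [True]
    if use_ssl == 2:
        return [True, False]
    raise ValueError("use_ssl must be 0, 1 or 2")

print(connection_attempts(2))  # [True, False]
```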
mickael
0eebe4daff populate import.txt 2019-05-01 17:43:28 +02:00
mickael
61c3ae6130 fix: define retrievals on import 2019-05-01 17:43:28 +02:00
mickael
0d1316052c add servers.txt.sample 2019-03-05 22:29:16 +00:00
mickael
ceb840b00f remove nonexistent server 2019-03-05 22:29:16 +00:00
mickael
1ad5ca53e5 take care of old proxies
test old proxies during free time
2019-03-05 22:29:16 +00:00
rofl0r
2bacf77c8c split ppf into two programs, ppf/scraper 2019-01-18 22:53:35 +00:00
rofl0r
8400eab7ee insert_proxies: remove 500-at-a-time logic
it's now done by mysqlite.py executemany.
2019-01-18 21:50:48 +00:00
rofl0r
8be5ab1567 ppf: move insert function into dbs.py 2019-01-18 21:43:17 +00:00
rofl0r
aba74c8eab mysqlite.py: improve
1) use a common try/except block for all ops
2) do not display query and args when DB is locked (could be several
   hundreds rows)
3) re-raise non locking-related exceptions (e.g. a wrong sql statement)
4) split executemany rows into chunks of 500 (so the caller doesn't have
   to do it)
2019-01-18 20:42:15 +00:00
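Point 4 above — splitting `executemany` rows into chunks of 500 so callers don't have to — can be sketched against an in-memory SQLite database. The function name and table are invented for illustration; only the chunk size comes from the commit message:

```python
import sqlite3

CHUNK = 500  # the chunk size the commit message mentions

def executemany_chunked(conn, query, rows, chunk=CHUNK):
    """Run executemany() in fixed-size chunks, committing per chunk so
    lock hold times stay short (hypothetical sketch of mysqlite.py's role)."""
    for i in range(0, len(rows), chunk):
        conn.executemany(query, rows[i:i + chunk])
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proxies (ip TEXT, port INT)")
rows = [("10.0.0.%d" % (i % 256), 8080) for i in range(1234)]
executemany_chunked(conn, "INSERT INTO proxies VALUES (?, ?)", rows)
print(conn.execute("SELECT count(*) FROM proxies").fetchone()[0])  # 1234
```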
rofl0r
5fd693a4a2 ppf: remove more unneeded stuff 2019-01-18 19:55:54 +00:00
rofl0r
d926e66092 ppf: remove unneeded stuff 2019-01-18 19:53:55 +00:00
rofl0r
b0f92fcdcd ppf.py: improve urignore code readability 2019-01-18 19:52:15 +00:00
rofl0r
b99f83a991 fetch.py: improve readability of extract_urls 2019-01-18 19:32:37 +00:00
rofl0r
4a41796b19 factor out http related code from ppf.py 2019-01-18 19:30:42 +00:00
rofl0r
0dad0176f3 ppf: add new field proxies_added to be able to rate sites
sqlite3 urls.sqlite "alter table uris add proxies_added INT"
sqlite3 urls.sqlite "update uris set proxies_added=0"
2019-01-18 15:44:09 +00:00
rofl0r
0734635e30 watchd main thread: be less nervous 2019-01-18 15:35:19 +00:00
rofl0r
ddee92d20f watchd: introduce configurable 'outage_threshold' 2019-01-18 15:34:49 +00:00
mickael
aaac14d34e worker: add threading lock
add lock to avoid the same proxy being scanned multiple times when
a small number of jobs is handed to the worker
2019-01-13 16:50:54 +00:00
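The race the lock prevents is between checking for a job and claiming it: without mutual exclusion, two worker threads can grab the same proxy. A self-contained sketch of the claim step (class and job strings are invented; only the lock-around-the-claim idea comes from the commit):

```python
import threading

class JobQueue:
    """Hypothetical job queue: the lock makes check-and-pop atomic,
    so each proxy is handed to exactly one worker."""
    def __init__(self, jobs):
        self._jobs = list(jobs)
        self._lock = threading.Lock()

    def claim(self):
        with self._lock:      # without this, two threads could race
            if self._jobs:    # between the check and the pop
                return self._jobs.pop()
            return None

q = JobQueue(["1.2.3.4:80", "5.6.7.8:3128"])
seen = []

def worker():
    while True:
        job = q.claim()
        if job is None:
            break
        seen.append(job)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(seen))  # each job appears exactly once
```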
mickael
f489f0c4dd set retrievals to 0 for new uris 2019-01-13 16:50:54 +00:00
rofl0r
69d366f7eb ppf: add retrievals field so we know whether a url is new
use

sqlite3 urls.sqlite "alter table uris add retrievals INT"
sqlite3 urls.sqlite "update uris set retrievals=0"
2019-01-13 16:40:12 +00:00