I want to fork the repo to add domain-ranking support. Many malicious domains have a very low common-crawl.org ranking, and Common Crawl publishes a huge list of ranked domains (about 500 million).
As a domain analytics nerd, and having read about the latest attack against security researchers ("North Korea hackers use social media to target security researchers", Ars Technica)..
..the goal is to show the user which domains are really low-value / low-popularity, newly registered, etc. The SolarWinds hack, the Ukraine power grid hack, and thousands of others could have been caught with this. Note this is detection only, not blocking.
These are the main changes I see as needed:
add a rank column to the queries CREATE TABLE statement
modify the "insert into queries" statement in query-table.c to support the new column; it starts as NULL
create a reverse index on the domain column
create a cron job to run update statements
update queries set rank = 1 where domain = 'google.com' or domain like '%.google.com'
update queries set rank = (some high max number) where domain = 'newevil.com' or domain like '%.newevil.com'
add a PHP page similar to "Top Domains" that displays info ordered by rank desc
add a link to the new PHP page in the UI
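The schema change, the index, and the rank updates above can be sketched like this, using Python's sqlite3 against an in-memory stand-in (the real queries table in pihole-FTL.db has more columns; table and column names here are simplified):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Simplified stand-in for Pi-hole's queries table; the real schema has more columns.
cur.execute("""
    CREATE TABLE queries (
        id INTEGER PRIMARY KEY,
        domain TEXT NOT NULL,
        rank INTEGER  -- new column; NULL until the cron job fills it in
    )
""")
# Index on domain so the bulk UPDATEs can finish quickly.
cur.execute("CREATE INDEX idx_queries_domain ON queries (domain)")

cur.executemany("INSERT INTO queries (domain) VALUES (?)",
                [("google.com",), ("mail.google.com",), ("newevil.com",)])

# Rank updates as the cron job would issue them: an exact match plus a
# suffix match for subdomains, so 'evilgoogle.com' is NOT accidentally caught.
cur.execute("""
    UPDATE queries SET rank = 1
    WHERE domain = 'google.com' OR domain LIKE '%.google.com'
""")
cur.execute("""
    UPDATE queries SET rank = 50000000
    WHERE domain = 'newevil.com' OR domain LIKE '%.newevil.com'
""")

print(cur.execute("SELECT domain, rank FROM queries ORDER BY rank DESC LIMIT 1").fetchone())
# -> ('newevil.com', 50000000)
```

One caveat: a LIKE pattern with a leading wildcard ('%.google.com') cannot use a normal B-tree index, which is presumably why a reverse index (an index over the reversed domain string, turning suffix matches into prefix matches) is on the list above.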
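For the cron job, a hypothetical crontab entry might look like this (update_ranks.sh is an assumed wrapper script that feeds the UPDATE statements to sqlite3; the path, name, and schedule are all placeholders):

```shell
# Refresh ranks nightly at 02:30; the script name and path are assumptions.
30 2 * * *  /usr/local/bin/update_ranks.sh >> /var/log/pihole-rank.log 2>&1
```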
The only 100% secure computer is an unplugged computer, and blocklists for malware will never have 100% coverage. There will also be false positives: safe domains that Common Crawl missed because they are internet-plumbing domains rather than website domains will need to be added to the Common Crawl list.
Interested to hear from PHP devs, C devs, or others: what changes will I regret later? I need the index on the domain column in order to get the update statements to finish quickly. The cron job seems like the least intrusive way to make the updates. The size of the Common Crawl data can be reduced by capping it at 50 million entries and using a succinct data structure (GitHub - QratorLabs/pysdsl: Python bindings to Succinct Data Structure Library 2.0).
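As a rough sketch of how the capped list could be queried (a plain sorted array standing in for a real succinct structure like pysdsl's): storing reversed domains in sorted order turns the subdomain suffix match into an exact/prefix lookup, which is the same idea as the reverse index above. All names and ranks here are illustrative:

```python
from bisect import bisect_left

# Hypothetical stand-in for the capped Common Crawl list: domain -> rank.
ranked = {"google.com": 1, "wikipedia.org": 2, "newevil.com": 49999999}

# Store reversed domains sorted, so suffix lookups become prefix lookups -
# the same trick a "reverse index" on the domain column would use.
table = sorted((d[::-1], r) for d, r in ranked.items())
keys = [k for k, _ in table]

def lookup(domain, default=None):
    """Return the rank for domain or its nearest registered parent domain."""
    parts = domain.split(".")
    # Walk suffixes: 'mail.google.com' -> try itself, then 'google.com'
    # (the bare TLD is never tried).
    for i in range(len(parts) - 1):
        cand = ".".join(parts[i:])[::-1]
        j = bisect_left(keys, cand)
        if j < len(keys) and keys[j] == cand:
            return table[j][1]
    return default

print(lookup("mail.google.com"))  # -> 1 (falls back to google.com's rank)
```

A real deployment would swap the sorted list for a succinct dictionary to fit 50 million entries in memory, but the lookup pattern stays the same.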