Sorry for the delay; I have been busy moving, finishing schoolwork, and a lot of other things.
So, I think I found something.
Host-file lines that contain a trailing comment are ignored entirely.
Example:
0.0.0.0 blocked.host.net #Comment Here
I have a file with 15+ million lines like that, and gravity counted its hosts as "0".
If I remove the comments, it works fine.
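A common workaround is to strip trailing comments before handing the list to gravity. Here is a minimal sketch with standard sed (the file name `hosts.txt` is hypothetical):

```shell
# Strip everything from "#" to end of line, then drop lines that
# are now empty (i.e. lines that were comments in their entirety).
sed -e 's/[[:space:]]*#.*$//' -e '/^[[:space:]]*$/d' hosts.txt
```

Note that the second expression is needed because full-line comments become blank lines after the first substitution.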
jfb
April 12, 2020, 4:28am
2
We are looking at potential changes to the gravity update routine.
Okay.
Just a thought: I use Perl to parse multiple source files and dump them into an SQL database in order to create a single host file for export.
I use a regex to identify the host names; I don't know whether it will be of use for anything, but I'll share it here:
(([\w\d]+[-.])+\w{2,3})
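For illustration, that regex can be exercised from the shell with grep's PCRE mode (`-P` is a GNU extension, so this assumes GNU grep; the sample line is made up):

```shell
# Extract host names with the Perl-style regex above.
# Note: \w{2,3} only matches a final label (TLD) of 2-3 characters,
# and the bare IP "0.0.0.0" is not matched because no run of
# word characters after a dot is 2-3 characters long.
printf '0.0.0.0 blocked.host.net #Comment Here\n' \
  | grep -Po '(([\w\d]+[-.])+\w{2,3})'
# prints "blocked.host.net"
```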
The reason is that in v5.0, gravity.sh removes every line containing an invalid character (which includes #), no matter where in the line it appears.
gravity_Blackbody=true
}
total_num=0
parseList() {
local adlistID="${1}" src="${2}" target="${3}" incorrect_lines
# This sed does the following things:
# 1. Remove all domains containing invalid characters. Valid are: a-z, A-Z, 0-9, dot (.), minus (-), underscore (_)
# 2. Append ,adlistID to every line
# 3. Ensures there is a newline on the last line
sed -e "/[^a-zA-Z0-9.\_-]/d;s/$/,${adlistID}/;/.$/a\\" "${src}" >> "${target}"
# Find (up to) five domains containing invalid characters (see above)
incorrect_lines="$(sed -e "/[^a-zA-Z0-9.\_-]/!d" "${src}" | head -n 5)"
local num_lines num_target_lines num_correct_lines num_invalid
# Get number of lines in source file
num_lines="$(grep -c "^" "${src}")"
# Get number of lines in destination file
num_target_lines="$(grep -c "^" "${target}")"
num_correct_lines="$(( num_target_lines-total_num ))"
total_num="$num_target_lines"
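To see why a hosts file with trailing comments counts as zero, here is a minimal sketch of the v5 filter applied to two sample lines (sample data assumed):

```shell
# Any line containing a character outside [a-zA-Z0-9._-] is deleted,
# so a hosts-format line with an IP, spaces, and a "#comment" is
# dropped entirely; only bare domains survive.
printf '%s\n' 'blocked.host.net' '0.0.0.0 blocked.host.net #Comment Here' \
  | sed -e '/[^a-zA-Z0-9.\_-]/d'
# prints only "blocked.host.net"
```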
In v4, gravity.sh handled this differently:
# Parse source files into domains format
gravity_ParseFileIntoDomains() {
local source="${1}" destination="${2}" firstLine abpFilter
# Determine if we are parsing a consolidated list
if [[ "${source}" == "${piholeDir}/${matterAndLight}" ]]; then
# Remove comments and print only the domain name
# Most of the lists downloaded are already in hosts file format, but the spacing/formatting is not consistent
# This helps with that and makes it easier to read
# It also helps with debugging so each stage of the script can be researched more in depth
# Awk -F splits on the given field separator; we keep the left-hand side (chopping trailing #comments and /'s) to grab the domain only.
# Last awk command takes non-commented lines and if they have 2 fields, take the right field (the domain) and leave
# the left (IP address), otherwise grab the single field.
< "${source}" awk -F '#' '{print $1}' | \
awk -F '/' '{print $1}' | \
awk '($1 !~ /^#/) { if (NF>1) {print $2} else {print $1}}' | \
sed -nr -e 's/\.{2,}/./g' -e '/\./p' > "${destination}"
return 0
fi
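For comparison, feeding the same kind of commented line through the v4 pipeline above keeps the domain instead of discarding the line (a minimal sketch with sample data):

```shell
# One hosts-format line with a trailing comment through the v4 stages:
# 1. cut at "#" (drops the comment), 2. cut at "/",
# 3. take the second field (the domain) when an IP precedes it,
# 4. collapse repeated dots and print only lines containing a dot.
printf '0.0.0.0 blocked.host.net #Comment Here\n' \
  | awk -F '#' '{print $1}' \
  | awk -F '/' '{print $1}' \
  | awk '($1 !~ /^#/) { if (NF>1) {print $2} else {print $1}}' \
  | sed -nr -e 's/\.{2,}/./g' -e '/\./p'
# prints "blocked.host.net"
```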
The best and easiest solution would be to remove the trailing comments yourself, as you already parse your lists anyway.
Yes, that is a simple solution for my list. However, many other public and private lists I use are not currently part of it, and many of those are commented as well. This will work for my list, but not the others, and it may be months before I get everything combined into my DB.
DL6ER
April 13, 2020, 8:03am
6
Whoops, I already wrote a fix for this two weeks ago, but, apparently, I have forgotten to open a PR for it. My bad. Thanks for reminding me (indirectly).
https://github.com/pi-hole/pi-hole/pull/3269