A Perl script that processes Squid proxy logs and categorizes web traffic using domain-based blacklists. Generates hourly/daily statistics for network monitoring and creates alerts for suspicious activity.
Core Functionality
Blacklist Loading
Scans /var/ipfire/urlfilter/blacklists/ directory at startup.
Each subdirectory = one category (e.g., adult, malware, phishing).
Loads domains file from each category into memory hash tables.
Current implementation: Category name = directory name (automatic detection).
Loads ~500k domains across 78 categories into RAM.
Uses caching to avoid repeated lookups (100% cache hit ratio achieved).
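As a rough illustration (not the addon's actual code; paths, variable names, and helpers are assumptions), the loading and memoized lookup could look like this Perl sketch:

#!/usr/bin/perl
# Minimal sketch of the blacklist-loading step. Assumes the UT1-style
# layout: one subdirectory per category, each containing a "domains"
# file with one domain per line.
use strict;
use warnings;

my $blacklist_dir = '/var/ipfire/urlfilter/blacklists';
my %domains;        # domain => category
my %lookup_cache;   # memoization cache for repeated lookups

opendir(my $dh, $blacklist_dir) or die "Cannot open $blacklist_dir: $!";
for my $category (grep { -d "$blacklist_dir/$_" && !/^\./ } readdir $dh) {
    my $file = "$blacklist_dir/$category/domains";
    next unless -f $file;
    open(my $fh, '<', $file) or next;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '' || $line =~ /^#/;
        $domains{lc $line} = $category;   # category name = directory name
    }
    close $fh;
}
closedir $dh;

# Memoized lookup: repeated hosts hit the cache instead of re-walking labels.
sub categorize {
    my ($host) = @_;
    return $lookup_cache{$host} //= do {
        my @labels = split /\./, lc $host;
        my $cat = 'uncategorized';
        # Walk from the full FQDN down to the registrable domain, so an
        # entry for "example.com" also matches "www.example.com".
        while (@labels >= 2) {
            my $candidate = join '.', @labels;
            if (exists $domains{$candidate}) { $cat = $domains{$candidate}; last }
            shift @labels;
        }
        $cat;
    };
}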
Squid Log Processing
Input: /var/log/squid/access.log (standard Squid format)
Outputs (a log-parsing sketch follows this list):
Hourly JSON files: Request counts, bandwidth, cache efficiency, top domains/IPs/categories.
Metrics index: Lightweight index for anomaly detection.
Alerts: Generated for sensitive categories, off-hours access, traffic spikes.
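For reference, a minimal Perl sketch of parsing one line of Squid's native access.log format (field handling assumed from the standard format; the function and field names are illustrative, not the addon's):

# Native Squid format: timestamp, elapsed ms, client IP, result/status,
# bytes, method, URL, ident, hierarchy/peer, content-type.
sub parse_squid_line {
    my ($line) = @_;
    my @f = split /\s+/, $line;
    return unless @f >= 7;
    my ($ts, $elapsed, $client, $result, $bytes, $method, $url) = @f[0..6];
    my ($action, $status) = split m{/}, $result, 2;
    # Extract the host part of the URL for categorization (handles both
    # full URLs and the host:port form used by CONNECT requests).
    my ($host) = $url =~ m{^(?:[a-z]+://)?([^/:]+)}i;
    return {
        time    => $ts,
        client  => $client,
        action  => $action,      # e.g. TCP_HIT, TCP_MISS
        status  => $status,      # HTTP status code
        bytes   => $bytes,
        method  => $method,
        host    => $host // '',
    };
}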
Current Issue
Problem: The domain xvideos.com appears as "uncategorized" despite hosting adult content.
Root cause: The blacklist contains thousands of xvideos subdomains (xvideoss.com, xvideosporn.com) but NOT the main domain xvideos.com.
Why it matters: This suggests the blacklist may also be missing the primary domains of other major adult sites.
Questions for IPFire Community
URLFilter integration: How does IPFire's URLFilter consume these blacklists? Does it use Squid's url_rewrite_program or SquidGuard?
Custom blacklists: What's the recommended way to add missing domains that persists across blacklist updates? Is /var/ipfire/urlfilter/blacklists/custom/ the correct location?
Blacklist updates: Do IPFire blacklist updates preserve entries in custom/ subdirectories?
Domain matching: Does URLFilter perform subdomain matching (e.g., blocking *.xvideos.com if xvideos.com is listed)?
Recommended blacklist source: Are the UT1 Toulouse blacklists the recommended source, or are there better-maintained alternatives compatible with IPFire?
Performance: For a proxy handling 500k blacklist domains, are there recommended optimizations for Squid/SquidGuard?
Technical Environment
IPFire version: [current version]
Squid log format: Standard access.log
Blacklist structure: UT1 Toulouse format (one domain per line)
Processing: Incremental (tracks last position; see the sketch below)
Optimization: Memoization, regex precompilation, object pooling
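The incremental processing noted above could be as simple as persisting the last byte offset between runs; a minimal Perl sketch, with a hypothetical state-file path:

use strict;
use warnings;

my $log   = '/var/log/squid/access.log';
my $state = '/var/tmp/categorize.offset';

# Read the previously stored offset, defaulting to 0.
my $offset = 0;
if (open(my $sfh, '<', $state)) { $offset = <$sfh> // 0; chomp $offset; }

open(my $fh, '<', $log) or die "Cannot open $log: $!";
# If the log was rotated (file shrank), start from the beginning again.
$offset = 0 if $offset > -s $log;
seek($fh, $offset, 0);

while (my $line = <$fh>) {
    # ... parse and aggregate the line here ...
}

# Persist the new position for the next run.
open(my $out, '>', $state) or die "Cannot write $state: $!";
print $out tell($fh);
close $out;
close $fh;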
The engine works correctly: it categorizes based on what's in the blacklists. The issue is blacklist data quality, specifically missing primary domains for major sites.
The assumptions about the location of the lists are right.
Squid uses SquidGuard as rewriter.
SquidGuard processes the defined lists. The full set consists of the configured external lists plus the local custom lists; the update process only overwrites the copies of the external lists, so entries in custom/ are preserved.
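For illustration, a minimal squidGuard.conf sketch of that split between external and custom lists. The paths, category names, and redirect target are assumptions based on the default IPFire layout, not taken from an actual installation:

dbhome /var/ipfire/urlfilter/blacklists
logdir /var/log/squidGuard

# External category, refreshed by the blacklist update process
dest adult {
    domainlist adult/domains
    urllist    adult/urls
}

# Local additions, untouched by updates
dest custom {
    domainlist custom/blocked/domains
}

acl {
    default {
        pass !adult !custom all
        redirect http://192.168.0.1/blocked.html
    }
}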
Your example:
"xvideos.com" isn't the main domain of those entries: according to FQDN syntax, xvideoss.com and xvideosporn.com are separate domain names. The hierarchy is built with the "." character.
So, if xvideos.com is defined as a domain to be blocked, all of its subdomains (those matched by the regex ".*\.xvideos\.com") are blocked as well.
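A small standalone Perl illustration of that rule (not part of the addon): an entry for xvideos.com matches its own subdomains but not lookalike registrations.

use strict;
use warnings;

sub is_subdomain_of {
    my ($host, $blocked) = @_;
    # Exact match, or host ends with ".<blocked domain>".
    return $host eq $blocked || $host =~ /\.\Q$blocked\E$/;
}

print is_subdomain_of('www.xvideos.com', 'xvideos.com') ? "blocked\n" : "allowed\n"; # blocked
print is_subdomain_of('xvideos.com',     'xvideos.com') ? "blocked\n" : "allowed\n"; # blocked
print is_subdomain_of('xvideoss.com',    'xvideos.com') ? "blocked\n" : "allowed\n"; # allowed
print is_subdomain_of('xvideosporn.com', 'xvideos.com') ? "blocked\n" : "allowed\n"; # allowed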
That would require a big handful of volunteers to identify web sites and then categorize each site. And then the head(s) of that group would need to review and approve.
If you are interested, please speak up. A forward moving effort starts with just one person leading the effort!
Hi,
I understand the work and the people involved, but it could be a service that the IPFire community could offer by subscription to support itself.
But as I said, it's just a thought.
If they decide to do it, I'd be available.
I had thought of a pseudo-automatic system, not quite automatic. For pages that don't appear as categorized (uncategorized) at the University of Toulouse, a button would be enabled so I could send a recategorization request (an anonymous email) to an email address. The recipient would categorize the site and update a "custom" repository. All IPFire installations that have the addon would update their "custom" files from that repository and thus, little by little, build a good database.
I'm slowly designing the interface. When I have something, I'll add it (without the pseudo-categorization module, since that needs careful thought).
Copy the file "categorize.ipfire.tgz" to /opt/pakfire/tmp.
Unpack with:
tar xvf categorize.ipfire.tgz
To install:
./install.sh
To uninstall:
./uninstall.sh
You'll see the menu inside "IPFire".
There's a feature called "Upload Logs", which I've leveraged to upload and create reports generated by Squid. To do this, go to "Logs → Proxy Logs" and regenerate whatever you want. Then save it from your browser as a .dat file, and you can create the report.
Any bug reports you may have will be appreciated.
Give me the budget and I will do it. We are looking at a full-time job for several people here…
The things that are sometimes being asked for on this forum cost a lot of money. And indeed the features that we are bringing you also cost a lot of money. These things cannot be financed by just donating $5 once a year…
I don't know what the budget might be, but I think it could be a project sold as a service.
I know that other companies, even well-known ones, do it.
Yes, you're absolutely right. But first I need to figure out how to implement what you're talking about in the addon I've made, and then add these manually cataloged lists to a parallel "custom" list so that all the installed addons can feed back. I was thinking about implementing something like an "uncategorized" button that would anonymously (obviously) send the request so that I could categorize the site and upload it to a "custom" repository, from which the addon would download updates every so often. I'd have to think about it.
I've created an addon (and improved it) so that pages that appear as "Uncategorized" now have a "Report" button. This sends an anonymous email to reports@northsecure.es so I can update a repository, and the installed addons will be updated from it. I've created a short document showing how it works.
Hi,
I haven't tried the component yet, but I will as soon as possible; from what you've described, it seems really interesting.
As I've already said, I'm available to catalog the various sites.
Hi,
I tried the system on a test machine, and I'd say it works pretty well.
Allow me to offer a suggestion: I'd also provide a zip file with the remote server structure.
Another note: the strings to change for your own customization are not documented.