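This article describes how to scrape proxy IPs from proxy-list websites. The list of sites and part of the code are adapted from KyxRecon/proxy-scraper.sh. Proxies fall into two protocol families, HTTP and SOCKS. HTTP proxies are subdivided into HTTP and HTTPS, and by anonymity level into transparent, anonymous and high-anonymous. SOCKS proxies are subdivided into SOCKS4 and SOCKS5. For hiding the real IP address, high-anonymous HTTP proxies and SOCKS5 proxies are the best choice.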

Proxy Site Lists

The proxy sites are listed below.

No  Site            CN  socks4  socks5  transparent  anonymous  high-anonymous(elite)
1   SamAir          1   1       1       1            1          1
2   Nntime          1   0       0       1            1          1
3   PROXYS™         1   0       0       1            0          1
4   Proxz           0   0       0       0            0          1
5   AliveProxy      0   0       1       0            1          1
6   ProxyNova       0   0       0       1            0          1
7   Daily Proxy     0   0       0       1            0          1
8   HideMyAss       0   0       0       1            1          1
9   freeproxylists  0   0       0       1            1          1

IP Extraction

The required data is extracted from each site's HTML with commands such as seq, parallel, awk and sed.
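
The general pattern can be sketched on a made-up table row rather than a live page. The markup, the address 192.0.2.10 and the function name extract_rows are assumptions for illustration only. The idea is to turn each closing </td> into a | separator, drop the remaining tags, and let awk reorder the fields.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert simple <tr><td>..</td>..</tr> rows read from
# stdin into IP:Port|AnonymityLevel|Country lines.
extract_rows() {
    sed -E 's@</td>@|@g; s@<[^>]*>@@g' \
        | sed '/^[[:space:]]*$/d' \
        | awk -F'|' '{printf("%s:%s|%s|%s\n", $1, $2, tolower($4), $3)}'
}

# Made-up sample row with the columns IP, port, country, anonymity level.
sample='<tr><td>192.0.2.10</td><td>8080</td><td>Italy</td><td>Elite</td></tr>'
printf '%s\n' "$sample" | extract_rows
```

Real pages need site-specific sed expressions, as the per-site commands in this section show; only the separator-then-reorder idea carries over.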

SamAir

SamAir provides both HTTP and SOCKS proxies.

HTTP Proxy

The URLs are as follows:

https://premproxy.com/list/

https://premproxy.com/list/01.htm
...
https://premproxy.com/list/20.htm

Output format:

IP:Port|AnonymityLevel|Country|City|ISP

Code:

page_no=$(curl -fsL https://premproxy.com/list/01.htm | sed -r -n '/ptabletitle/{s@(<[^>]*>|\(|\))@@g;s@.*of (.*)@\1@p}')

if [[ "${page_no}" -lt 10 ]]; then
    seq -f 0%g 1 "${page_no}" | parallel -k -j 0 -X curl -fsL https://premproxy.com/list/{}.htm 2> /dev/null | sed -r -n '/ptabletitle/,/pageinfo/{/tr class/{s@<\/?(tr)[[:space:]]*[^>]*>@@g;s@<td>@@g;s@[[:space:]]*<\/td>@|@g;s@>.*@@g;s@(<dfn title="|")@@g;p}}' | sed '/^[[:space:]]*$/d' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$6)}'
else
    seq -f 0%g 1 9 | parallel -k -j 0 -X curl -fsL https://premproxy.com/list/{}.htm 2> /dev/null | sed -r -n '/ptabletitle/,/pageinfo/{/tr class/{s@<\/?(tr)[[:space:]]*[^>]*>@@g;s@<td>@@g;s@[[:space:]]*<\/td>@|@g;s@>.*@@g;s@(<dfn title="|")@@g;p}}' | sed '/^[[:space:]]*$/d' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$6)}'

    seq 10 "${page_no}" | parallel -k -j 0 -X curl -fsL https://premproxy.com/list/{}.htm 2> /dev/null | sed -r -n '/ptabletitle/,/pageinfo/{/tr class/{s@<\/?(tr)[[:space:]]*[^>]*>@@g;s@<td>@@g;s@[[:space:]]*<\/td>@|@g;s@>.*@@g;s@(<dfn title="|")@@g;p}}' | sed '/^[[:space:]]*$/d' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$6)}'
fi
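
The two branches above exist only because pages 1 to 9 need a leading zero. As a hedged alternative (not part of the original script), a two-digit format pads every page number in one pass. The function name page_urls is hypothetical; the URL layout is the one listed above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the URL of every list page from 01 up to $2.
page_urls() {
    local base="$1" last="$2" i
    for ((i = 1; i <= last; i++)); do
        # %02d pads 1..9 to 01..09 and leaves 10 and above unchanged
        printf '%s%02d.htm\n' "$base" "$i"
    done
}

page_urls 'https://premproxy.com/list/' 12
```

The resulting list could be piped into parallel in place of the separate seq calls, for example `page_urls "$base" "$page_no" | parallel -k -j 0 curl -fsL {}`.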

SOCKS

URL: https://premproxy.com/socks-list/

Output format:

IP:Port|AnonymityLevel|Country|City|ISP

Code:

page_no=$(curl -fsL https://premproxy.com/socks-list/01.htm | sed -r -n '/next/{s@<[^>]*>@@gp}' | awk '{print $(NF-1)}')    # 5

seq -f 0%g 1 "${page_no}" | parallel -k -j 0 -X curl -fsL https://premproxy.com/socks-list/{}.htm 2> /dev/null | sed -r -n '/^<tr><td>/{{s@<\/?(tr)[[:space:]]*[^>]*>@@g;s@<td>@@g;s@[[:space:]]*<\/td>@|@g;s@>.*@@g;s@(<dfn title="|")@@g;p}}' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,tolower($2),$4,$5,$6)}'

Nntime

Nntime provides HTTP proxies.

The URLs are as follows:

http://nntime.com/proxy-list-01.htm
...
http://nntime.com/proxy-list-18.htm

The port number is emitted as document.write(":"+z+v), and the mapping between letters and digits differs on every page.

Output format:

IP:Port|AnonymityLevel|Country|City|ISP

Code:

page_no=$(curl -fsL http://nntime.com/proxy-list-01.htm | sed -r -n '/navigation/{{s@(<[^>]*>|\(|\)|next)@@g;p}}' | awk '{print $NF}')

if [[ "${page_no}" -lt 10 ]]; then
    seq -f 0%g 1 "${page_no}" | parallel -k -j 0 -X curl -fsL http://nntime.com/proxy-list-{}.htm 2> /dev/null | sed -r -n '/<\/thead>/,/<\/table>/{s@<\/?(thead|table|dfn|script)[[:space:]]*[^>]*>@@g;s@<(td|tr)[[:space:]]*[^>]*>@@g;s@<input.*value=\"(.*)\" onclick.*\/>@\1@g;s@(\"|\:|\+)@@g;s@document.write\((.*)\)@|\1@g;p}' | sed -r -n '/^[[:blank:]]*$/d;s@(<\/td>)@|@g;s@\)@@g;p' | awk '{if($0!~/^<\/tr>/){ORS=" ";print $0}else{printf "\n"}}' | sed -r 's@[[:space:]]*(\|)[[:space:]]*@\1@g' | awk -F\| '{str_start_pos=(length($1)-length($3)+1);port=substr($1,str_start_pos); sub(/[[:space:]]*proxy/,"",$4); printf("%s:%s|%s|%s|%s\n",$2,port,$4,$7,$6)}' | sed -r 's@[[:space:]]*\(@|@g' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$3)}'
else
    seq -f 0%g 1 9 | parallel -k -j 0 -X curl -fsL http://nntime.com/proxy-list-{}.htm 2> /dev/null | sed -r -n '/<\/thead>/,/<\/table>/{s@<\/?(thead|table|dfn|script)[[:space:]]*[^>]*>@@g;s@<(td|tr)[[:space:]]*[^>]*>@@g;s@<input.*value=\"(.*)\" onclick.*\/>@\1@g;s@(\"|\:|\+)@@g;s@document.write\((.*)\)@|\1@g;p}' | sed -r -n '/^[[:blank:]]*$/d;s@(<\/td>)@|@g;s@\)@@g;p' | awk '{if($0!~/^<\/tr>/){ORS=" ";print $0}else{printf "\n"}}' | sed -r 's@[[:space:]]*(\|)[[:space:]]*@\1@g' | awk -F\| '{str_start_pos=(length($1)-length($3)+1);port=substr($1,str_start_pos); sub(/[[:space:]]*proxy/,"",$4); printf("%s:%s|%s|%s|%s\n",$2,port,$4,$7,$6)}' | sed -r 's@[[:space:]]*\(@|@g' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$3)}'

    seq 10 "${page_no}" | parallel -k -j 0 -X curl -fsL http://nntime.com/proxy-list-{}.htm 2> /dev/null | sed -r -n '/<\/thead>/,/<\/table>/{s@<\/?(thead|table|dfn|script)[[:space:]]*[^>]*>@@g;s@<(td|tr)[[:space:]]*[^>]*>@@g;s@<input.*value=\"(.*)\" onclick.*\/>@\1@g;s@(\"|\:|\+)@@g;s@document.write\((.*)\)@|\1@g;p}' | sed -r -n '/^[[:blank:]]*$/d;s@(<\/td>)@|@g;s@\)@@g;p' | awk '{if($0!~/^<\/tr>/){ORS=" ";print $0}else{printf "\n"}}' | sed -r 's@[[:space:]]*(\|)[[:space:]]*@\1@g' | awk -F\| '{str_start_pos=(length($1)-length($3)+1);port=substr($1,str_start_pos); sub(/[[:space:]]*proxy/,"",$4); printf("%s:%s|%s|%s|%s\n",$2,port,$4,$7,$6)}' | sed -r 's@[[:space:]]*\(@|@g' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$3)}'
fi

PROXYS™

PROXYS™ provides HTTP proxies of two kinds: transparente and elite.

URL: http://www.proxys.com.ar

Output format:

IP:Port|AnonymityLevel|Country

Code:

curl -fsL http://www.proxys.com.ar/ | sed -r -n '/st-tables-page/{s@<\/?(ins|script|a|thead|tbody)[[:space:]]*[^>]*>?@@g;s@<\/tr>@\n@g;s@<(tr|td)>@@g;p}' | sed -r -n '/^[[:digit:]]+/{s@<\/td>@|@g;p}' | awk -F\| '{printf("%s:%s|%s|%s\n",$1,$2,tolower($4),$3)}'

Proxz

Proxz provides HTTP proxies.

The URLs are as follows:

http://www.proxz.com/proxy_list_high_anonymous_0.html
http://www.proxz.com/proxy_list_high_anonymous_0_ext.html
...
http://www.proxz.com/proxy_list_high_anonymous_7_ext.html

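Note: when fetching with curl or wget, a user-agent must be specified, otherwise the HTML page is not returned. The IP addresses in the page are percent-encoded strings that JavaScript decodes with unescape, so they must be decoded the same way to recover the IP.

Output format:

IP:Port|AnonymityLevel|Country

Code: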

user_agent=${user_agent:-"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6.4) AppleWebKit/537.29.20 (KHTML, like Gecko) Chrome/60.0.3030.92 Safari/537.29.20"}

page_no=$(curl -fsL --user-agent "${user_agent}" http://www.proxz.com/proxy_list_high_anonymous_0_ext.html | sed -r -n '/^<\/td><\/tr><\/table>/{s@(<[^>]*>|::..)@@g;p}' | awk -F: '{print $NF}')

urldecode() { : "${*}" ; echo -e "${_}" | sed 's/%\([0-9A-F][0-9A-F]\)/\\\\\x\1/g' | xargs echo -e | sed -r -n 's@.*\("(.*)"\).*@\1@g;s@%2e@.@g;p'; }

seq 0 "${page_no}" | parallel -k -j 0 -X curl -fsL --user-agent "\"${user_agent}\"" http://www.proxz.com/proxy_list_high_anonymous_{}_ext.html 2> /dev/null | sed -r -n "/eval\(unescape/{s@<\/td><\/tr>@@;s@<\/tr>@\n@g;s@<noscript>Please enable javascript<\/noscript>@@g;s@<\/?(tr|a|script)[[:space:]]*[^>]*>@@g;s@(<td>|\(|\)|;)@@g;s@evalunescape@@g;s@'@@g;s@<\/td>@|@g;s@<td[[:space:]]*[^>]*>@@g;p}" | sed '/^$/d' | while IFS="|" read -r a b c d e f;do ip=$(urldecode $a); echo "$ip:$b|${c,,}|$d"; done

AliveProxy

AliveProxy provides HTTP (anonymous, high-anonymous) and SOCKS5 proxies.

The URLs are as follows:

<!-- Free Proxy List: High anonymity Proxies. -->
http://www.aliveproxy.com/high-anonymity-proxy-list/

<!-- Free Proxy List: Anonymous Proxies. -->
http://www.aliveproxy.com/anonymous-proxy-list/

<!-- Free Socks 5 Proxy Lists. -->
http://aliveproxy.com/socks5-list/

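Output format (the HTTP commands append the anonymity level; the SOCKS5 command prints IP:Port only):

IP:Port|AnonymityLevel

HTTP Proxy

Code: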

# High Anonymous Proxies
curl -fsL http://www.aliveproxy.com/high-anonymity-proxy-list/ | sed -r -n '/^<TABLE class/{s@(.*)@\L\1@g;s@<\/tr>@\n@g;s@<\/?(tr|td|table|center|a|br)[[:space:]]*[^>]*>@@g;p}' | sed -r -n '/^[[:digit:].]+/{s@(.*)--.*@\1@gp}' | awk '{printf("%s|%s\n",$1,"high-anonymous")}'

# Anonymous Proxies
curl -fsL http://www.aliveproxy.com/anonymous-proxy-list/ | sed -r -n '/^<TABLE class/{s@(.*)@\L\1@g;s@<\/tr>@\n@g;s@<\/?(tr|td|table|center|a|br)[[:space:]]*[^>]*>@@g;p}' | sed -r -n '/^[[:digit:].]+/{s@(.*)--.*@\1@gp}' | awk '{printf("%s|%s\n",$1,"anonymous")}'

SOCKS5

Code:

# Socks 5 proxies (these IPs are mostly unusable)
curl -fsL http://aliveproxy.com/socks5-list/ | sed -r -n '/^<TABLE class/{s@(.*)@\L\1@g;s@<\/tr>@\n@g;s@<\/?(tr|td|table|center|a|br)[[:space:]]*[^>]*>@@g;p}' | sed -r -n '/^[[:digit:].]+/{s@(.*)--.*@\1@gp}'

ProxyNova

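ProxyNova provides HTTP proxies of two kinds: transparent and elite.

URL: https://www.proxynova.com/proxy-server-list/

Note: the IP address is obfuscated in the form document.write('2331.16'.substr(2) + '0.4.90'). Join the strings in single quotes and drop the first 2 characters to obtain the target IP (31.160.4.90 here).

Output format:

IP:Port|AnonymityLevel|Country|City

Code: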

curl -fsL https://www.proxynova.com/proxy-server-list/ | sed -r -n '/<center>/,/<\/center>/d;/<tbody>/,/<\/tbody>/{s@<\/?(tbody|images|script|a|time|img|div|ins)[[:space:]]*[^>]*>@@g;s@<(td|span)[[:space:]]*[^>]*>@@g;s@^[[:blank:]]*@@g;s@<tr>@@g;p}' | sed -r '/^$/d' | awk '{if($0!~/<\/tr>/){ORS=" ";print $0}else{printf "\n"}}' | sed -r -n "s@<\/span>@@g;s@(document.write|substr\(2\)|\(|\)|'|;|[[:space:]]*\+[[:space:]]*)@@g;s@(<\/td>)@|@g;s@\.{1,}@\.@g;s@^23@@g;p" | awk -F\| '{printf("%s:%s|%s|%s\n",$1,$2,tolower($7),$6)}' | sed -r -n 's@-@|@g;s@[[:space:]]+(|)[[:space:]]+@\1@g;s@: @:@g;p' | sed -r "/^[^[:digit:]]/d;s@(|)[[:space:]]*@\1@g"

Daily Proxy

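Daily Proxy provides HTTP proxies of two kinds: transparent and high-anonymous.

URL: http://www.dailyproxylists.com/index.php/proxy-lists

Note: the site encodes the key part of its markup with document.write(unescape(...)); it must be decoded back into HTML tags before the data can be extracted.

Output format:

IP:Port|AnonymityLevel|Country

Code: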

curl -fsL http://www.dailyproxylists.com/index.php/proxy-lists | sed -r -n '/document.write/{s@<[^>]*>@@g;s@(document.write|unescape|\(|\)|\")@@g;s@^[[:space:]]*@@g;p}' | sed -r -n 's@^[[:blank:]]*@@g;s@[[:blank:]]$@@g;p' | sed 's@\\@\\\\@g;s@\(%\)\([0-9a-fA-F][0-9a-fA-F]\)@\\x\2@g' | printf $(cat -) | sed -r -n 's@<\/?tr>@\n@g;s@<(td)[[:space:]]*[^>]*>@@g;p' | sed -r -n '/^[^[:digit:]]+/d;/^$/d;s@<[^>]*>@|@g;p' |  awk -F\| '{printf("%s:%s|%s|%s\n",$1,$2,tolower($4),$3)}'

HideMyAss

For the HideMyAss extraction method, see my blog post "Extract Free IP:PORT Proxy Lists From HIDEMYASS Via SED & AWK".

URL: http://proxylist.hidemyass.com/

freeproxylists.net

Fetching the freeproxylists HTML with curl has not been implemented yet.

Shell Script

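Everything above has been implemented as a shell script. It uses the parallel command to run the requests concurrently, which cuts the running time by more than 90%.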

#!/usr/bin/env bash
set -u  # Treat unset variables as an error
set -o pipefail  # A pipeline fails if any command in it fails
# IFS=$'\n\t' #used in loop,  Internal Field Separator

#Target: Extract Proxy IP From Proxy Site On GNU/Linux

#########  0-1. Signal Setting  #########
# trap '' HUP	#overlook SIGHUP when internet interrupted or terminal shell closed
# trap '' INT   #overlook SIGINT when enter Ctrl+C, QUIT is triggered by Ctrl+\
trap funcTrapINTQUIT INT QUIT

funcTrapINTQUIT(){
    rm -rf /tmp/temp*.txt
    printf "Detect $(tput setaf 1)%s$(tput sgr0) or $(tput setaf 1)%s$(tput sgr0), begin to exit shell\n" "CTRL+C" "CTRL+\\"
    exit
}

#########  0-2. Variables Setting  #########
# term_cols=$(tput cols)   # term_lines=$(tput lines)
readonly c_bold="$(tput bold)"
readonly c_normal="$(tput sgr0)"     # c_normal='\e[0m'
# black 0, red 1, green 2, yellow 3, blue 4, magenta 5, cyan 6, gray 7
readonly c_red="${c_bold}$(tput setaf 1)"     # c_red='\e[31;1m'
readonly c_blue="$(tput setaf 4)"    # c_blue='\e[34m'

list_proxy_sites=${list_proxy_sites:-0}
proxy_site_specify=${proxy_site_specify:-}
include_country=${include_country:-}
exclude_country=${exclude_country:-}
protocol_type=${protocol_type:-}
anonymity_type=${anonymity_type:-}
proxy_server=${proxy_server:-}
use_proxy=${use_proxy:-0}
user_agent=${user_agent:-'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6.4) AppleWebKit/537.29.20 (KHTML, like Gecko) Chrome/60.0.3030.92 Safari/537.29.20'}
real_country=${real_country:-}


#########  1-1 Initialization Preparation  #########
funcHelpInfo(){
cat <<EOF
${c_blue}Usage:
    script [options] ...
    script | sudo bash -s -- [options] ...
Extracting Proxy IP (HTTP/SOCKS) From Proxy Sites On GNU/Linux!

[available option]
    -h    --help, show help info
    -l    --list all supported proxy sites
    -s site    --specify proxy site No. listed in '-l'
    -t protocol  --protocol type (http|socks4|socks5), default is 'socks5'
    -a anonymity --anonymity level for http (low|medium|high), default is 'high'
    -p [protocol:]ip:port    --proxy host (http|https|socks4|socks5), default protocol is http
${c_normal}
EOF
# -i country   --just include specified country
# -e country   --exclude specified country
}

funcExitStatement(){
    local str="$*"
    [[ -n "$str" ]] && printf "%s\n" "$str" && exit
}

funcCommandExistCheck(){
    # return status: 0 if found, 1 if not found
    local name="$1"
    if [[ -n "$name" ]]; then
        executing_path=$(which "$name" 2> /dev/null || command -v "$name" 2> /dev/null)
        [[ -n "${executing_path}" ]] && return 0 || return 1
    else
        return 1
    fi
}

funcInitializationCheck(){
    # 1 - Check root or sudo privilege
    # [[ "$UID" -ne 0 ]] && funcExitStatement "${c_red}Sorry${c_normal}, this script requires superuser privileges (eg. root, su)."
    # 2 - OS support check
    [[ -f /etc/os-release || -f /etc/SuSE-release || -f /etc/redhat-release || (-f /etc/debian_version && -f /etc/issue.net) ]] || funcExitStatement "${c_red}Sorry${c_normal}, this script doesn't support your system!"

    # 3 - bash version check  ${BASH_VERSINFO[@]} ${BASH_VERSION}
    # bash --version | sed -r -n '1s@[^[:digit:]]*([[:digit:].]*).*@\1@p'
    [[ "${BASH_VERSINFO[0]}" -lt 4 ]] && funcExitStatement "${c_red}Sorry${c_normal}, this script needs BASH version 4+, your current version is ${c_blue}${BASH_VERSION%%-*}${c_normal}."

    if ! funcCommandExistCheck 'seq'; then
        funcExitStatement "${c_red}Error${c_normal}, No ${c_blue}seq${c_normal} command found!"
    fi

    if ! funcCommandExistCheck 'parallel'; then
        funcExitStatement "${c_red}Error${c_normal}, No ${c_blue}parallel${c_normal} command found!"
    fi
}

funcInternetConnectionCheck(){
    # CentOS: iproute Debian/OpenSUSE: iproute2
    if funcCommandExistCheck 'ip'; then
        gateway_ip=$(ip route | awk 'match($1,/^default/){print $3}')
    elif funcCommandExistCheck 'netstat'; then
        gateway_ip=$(netstat -rn | awk 'match($1,/^Destination/){getline;print $2;exit}')
    else
        funcExitStatement "${c_red}Error${c_normal}: No ${c_blue}ip${c_normal} or ${c_blue}netstat${c_normal} command found, please install it!"
    fi

    ! ping -q -w 1 -c 1 "$gateway_ip" &> /dev/null && funcExitStatement "${c_red}Error${c_normal}: No Internet connection, please check it!"   # Check Internet Connection
}

funcDownloadToolCheck(){
    local proxy_pattern="^((http|https|socks4|socks5):)?[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}:[0-9]{1,5}$"
    proxy_server=${proxy_server:-}
    if [[ -n "${proxy_server}" ]]; then
        if [[ "${proxy_server}" =~ $proxy_pattern ]]; then
            use_proxy=1
            local proxy_proto_pattern="^((http|https|socks4|socks5):)"
            if [[ "${proxy_server}" =~ $proxy_proto_pattern ]]; then
                local p_proto="${proxy_server%%:*}"
                local p_host="${proxy_server#*:}"
            else
                local p_proto='http'
                local p_host="${proxy_server}"
            fi
        else
            funcExitStatement "${c_red}Error${c_normal}: please specify right proxy host addr like ${c_blue}[protocol:]ip:port${c_normal}!"
        fi
    fi

    local retry_times=${retry_times:-5}
    local retry_delay_time=${retry_delay_time:-1}
    local connect_timeout_time=${connect_timeout_time:-2}
    local referrer_page=${referrer_page:-'https://duckduckgo.com/?q=github'}

    if funcCommandExistCheck 'curl'; then
        download_tool_origin="curl -fsL --retry ${retry_times} --retry-delay ${retry_delay_time} --connect-timeout ${connect_timeout_time} --no-keepalive"

        if [[ -n "${proxy_server}" ]]; then
            # curl version > 7.21.7
            case "${p_proto}" in
                # https ) export HTTPS_PROXY="${p_host}" ;;
                socks4 ) download_tool_proxy="${download_tool_origin} -x ${p_proto}a://${p_host}";;
                socks5 ) download_tool_proxy="${download_tool_origin} -x ${p_proto}h://${p_host}";;
                http|* ) download_tool_proxy="${download_tool_origin} -x ${p_host}";;
            esac
        fi
    else
        funcExitStatement "${c_red}Error${c_normal}: can't find command ${c_blue}curl${c_normal}!"
    fi

    if [[ "${use_proxy}" -eq 1 ]]; then
        download_tool="${download_tool_proxy}"
    else
        download_tool="${download_tool_origin}"
    fi

}

#########  1-2 Initialization Operation  #########
# start_time=$(date +'%s')    # processing start time

while getopts "hls:i:e:t:a:p:" option "$@"; do
    case "$option" in
        l ) list_proxy_sites=1 ;;
        s ) proxy_site_specify="$OPTARG" ;;
        i ) include_country="$OPTARG" ;;
        e ) exclude_country="$OPTARG" ;;
        t ) protocol_type="$OPTARG" ;;
        a ) anonymity_type="$OPTARG" ;;
        p ) proxy_server="$OPTARG" ;;
        h|\? ) funcHelpInfo && exit ;;
    esac
done


proxy_site_info=$(mktemp -t tempXXXXXX.txt)
cat > "${proxy_site_info}" <<EOF
No|Site|CN|socks4|socks5|transparent|anonymous|high-anonymous(elite)|Site
1|SamAir|1|1|1|1|1|1|https://premproxy.com
2|Nntime|1|0|0|1|1|1|http://nntime.com
3|PROXYS™|1|0|0|1|0|1|http://www.proxys.com.ar
4|Proxz|0|0|0|0|0|1|http://www.proxz.com
5|AliveProxy|0|0|1|0|1|1|http://www.aliveproxy.com
6|ProxyNova|0|0|0|1|0|1|https://www.proxynova.com
7|Daily Proxy|0|0|0|1|0|1|http://www.dailyproxylists.com
8|HideMyAss|0|0|0|1|1|1|http://proxylist.hidemyass.com
# 9|freeproxylists.net|0|0|0|1|1|1|http://freeproxylists.net/ fetching via curl not implemented yet
EOF


#########  2-1. List Proxy Sites  #########
funcListProxySites(){
    awk -F\| 'BEGIN{printf("%-3s %-12s %10s\n","No","Site","URL")}match($1,/^[[:digit:]]/){printf("%-3s %-12s %-20s\n",$1,$2,$NF)}' "${proxy_site_info}"

    # awk -F\| 'BEGIN{printf("%-3s %-12s %4s %8s %8s %8s %8s %6s\n","No","Site","CN","socks4","socks5","transparent","anonymous","elite")}match($1,/^[[:digit:]]/){printf("%-3s %-12s %4s %6s %6s %8s %12s %8s\n",$1,$2,$3,$4,$5,$6,$7,$8)}' "${proxy_site_info}"
    exit
}


#########  3 Extract Proxy IP From HTML Page  #########
proxy_ip_extracted=$(mktemp -t tempXXXXXX.txt)

#########  3-1. SamAir (https://premproxy.com)  #########
# HTTP: transparent, anonymous, high-anonymous
funcProxySite_1(){
    # IP:Port|AnonymityLevel|Country|City|ISP
    local page_url='https://premproxy.com/list/'
    page_no=$($download_tool "${page_url}" | sed -r -n '/ptabletitle/{s@(<[^>]*>|\(|\))@@g;s@.*of (.*)@\1@p}')

    if [[ "${page_no}" -lt 10 ]]; then
        seq -f 0%g 1 "${page_no}" | parallel -k -j 0 -X $download_tool "${page_url}"{}.htm 2> /dev/null | sed -r -n '/ptabletitle/,/pageinfo/{/tr class/{s@<\/?(tr)[[:space:]]*[^>]*>@@g;s@<td>@@g;s@[[:space:]]*<\/td>@|@g;s@>.*@@g;s@(<dfn title="|")@@g;p}}' | sed '/^[[:space:]]*$/d' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$6)}' >> "${proxy_ip_extracted}"
    else
        seq -f 0%g 1 9 | parallel -k -j 0 -X $download_tool "${page_url}"{}.htm 2> /dev/null | sed -r -n '/ptabletitle/,/pageinfo/{/tr class/{s@<\/?(tr)[[:space:]]*[^>]*>@@g;s@<td>@@g;s@[[:space:]]*<\/td>@|@g;s@>.*@@g;s@(<dfn title="|")@@g;p}}' | sed '/^[[:space:]]*$/d' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$6)}' >> "${proxy_ip_extracted}"

        seq 10 "${page_no}" | parallel -k -j 0 -X $download_tool "${page_url}"{}.htm 2> /dev/null | sed -r -n '/ptabletitle/,/pageinfo/{/tr class/{s@<\/?(tr)[[:space:]]*[^>]*>@@g;s@<td>@@g;s@[[:space:]]*<\/td>@|@g;s@>.*@@g;s@(<dfn title="|")@@g;p}}' | sed '/^[[:space:]]*$/d' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$6)}' >> "${proxy_ip_extracted}"
    fi
}

# SOCKS: socks4, socks5
funcProxySite_1_socks(){
    # IP:Port|AnonymityLevel|Country|City|ISP
    local page_url='https://premproxy.com/socks-list/'
    page_no=$($download_tool "${page_url}" | sed -r -n '/next/{s@<[^>]*>@@gp}' | awk '{print $(NF-1)}')

    seq -f 0%g 1 "${page_no}" | parallel -k -j 0 -X $download_tool "${page_url}"{}.htm 2> /dev/null | sed -r -n '/^<tr><td>/{{s@<\/?(tr)[[:space:]]*[^>]*>@@g;s@<td>@@g;s@[[:space:]]*<\/td>@|@g;s@>.*@@g;s@(<dfn title="|")@@g;p}}' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,tolower($2),$4,$5,$6)}' >> "${proxy_ip_extracted}"
}

#########  3-2. Nntime (http://nntime.com)  #########
# HTTP: transparent, anonymous, high-anonymous
funcProxySite_2(){
    # IP:Port|AnonymityLevel|Country|City|ISP
    local page_url='http://nntime.com/'
    page_no=$($download_tool "${page_url}" | sed -r -n '/navigation/{{s@(<[^>]*>|\(|\)|next)@@g;p}}' | awk '{print $NF}')

    if [[ "${page_no}" -lt 10 ]]; then
        seq -f 0%g 1 "${page_no}" | parallel -k -j 0 -X $download_tool "${page_url}"proxy-list-{}.htm 2> /dev/null | sed -r -n '/<\/thead>/,/<\/table>/{s@<\/?(thead|table|dfn|script)[[:space:]]*[^>]*>@@g;s@<(td|tr)[[:space:]]*[^>]*>@@g;s@<input.*value=\"(.*)\" onclick.*\/>@\1@g;s@(\"|\:|\+)@@g;s@document.write\((.*)\)@|\1@g;p}' | sed -r -n '/^[[:blank:]]*$/d;s@(<\/td>)@|@g;s@\)@@g;p' | awk '{if($0!~/^<\/tr>/){ORS=" ";print $0}else{printf "\n"}}' | sed -r 's@[[:space:]]*(\|)[[:space:]]*@\1@g' | awk -F\| '{str_start_pos=(length($1)-length($3)+1);port=substr($1,str_start_pos); sub(/[[:space:]]*proxy/,"",$4); printf("%s:%s|%s|%s|%s\n",$2,port,$4,$7,$6)}' | sed -r 's@[[:space:]]*\(@|@g' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$3)}' >> "${proxy_ip_extracted}"
    else
        seq -f 0%g 1 9 | parallel -k -j 0 -X $download_tool "${page_url}"proxy-list-{}.htm 2> /dev/null | sed -r -n '/<\/thead>/,/<\/table>/{s@<\/?(thead|table|dfn|script)[[:space:]]*[^>]*>@@g;s@<(td|tr)[[:space:]]*[^>]*>@@g;s@<input.*value=\"(.*)\" onclick.*\/>@\1@g;s@(\"|\:|\+)@@g;s@document.write\((.*)\)@|\1@g;p}' | sed -r -n '/^[[:blank:]]*$/d;s@(<\/td>)@|@g;s@\)@@g;p' | awk '{if($0!~/^<\/tr>/){ORS=" ";print $0}else{printf "\n"}}' | sed -r 's@[[:space:]]*(\|)[[:space:]]*@\1@g' | awk -F\| '{str_start_pos=(length($1)-length($3)+1);port=substr($1,str_start_pos); sub(/[[:space:]]*proxy/,"",$4); printf("%s:%s|%s|%s|%s\n",$2,port,$4,$7,$6)}' | sed -r 's@[[:space:]]*\(@|@g' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$3)}' >> "${proxy_ip_extracted}"

        seq 10 "${page_no}" | parallel -k -j 0 -X $download_tool "${page_url}"proxy-list-{}.htm 2> /dev/null | sed -r -n '/<\/thead>/,/<\/table>/{s@<\/?(thead|table|dfn|script)[[:space:]]*[^>]*>@@g;s@<(td|tr)[[:space:]]*[^>]*>@@g;s@<input.*value=\"(.*)\" onclick.*\/>@\1@g;s@(\"|\:|\+)@@g;s@document.write\((.*)\)@|\1@g;p}' | sed -r -n '/^[[:blank:]]*$/d;s@(<\/td>)@|@g;s@\)@@g;p' | awk '{if($0!~/^<\/tr>/){ORS=" ";print $0}else{printf "\n"}}' | sed -r 's@[[:space:]]*(\|)[[:space:]]*@\1@g' | awk -F\| '{str_start_pos=(length($1)-length($3)+1);port=substr($1,str_start_pos); sub(/[[:space:]]*proxy/,"",$4); printf("%s:%s|%s|%s|%s\n",$2,port,$4,$7,$6)}' | sed -r 's@[[:space:]]*\(@|@g' | awk -F\| '{printf("%s|%s|%s|%s|%s\n",$1,$2,$4,$5,$3)}' >> "${proxy_ip_extracted}"
    fi
}

#########  3-3. PROXYS™ (http://www.proxys.com.ar)  #########
# HTTP: transparente, elite
funcProxySite_3(){
    # IP:Port|AnonymityLevel|Country
    local page_url='http://www.proxys.com.ar/'
    $download_tool "${page_url}" | sed -r -n '/st-tables-page/{s@<\/?(ins|script|a|thead|tbody)[[:space:]]*[^>]*>?@@g;s@<\/tr>@\n@g;s@<(tr|td)>@@g;p}' | sed -r -n '/^[[:digit:]]+/{s@<\/td>@|@g;p}' | awk -F\| '{printf("%s:%s|%s|%s\n",$1,$2,tolower($4),$3)}' >> "${proxy_ip_extracted}"
}


#########  3-4. Proxz (http://www.proxz.com)  #########
funcProxySite_4(){
    # IP:Port|AnonymityLevel|Country
    local page_url='http://www.proxz.com/'
    page_no=$($download_tool --user-agent "${user_agent}" "${page_url}"proxy_list_high_anonymous_0_ext.html | sed -r -n '/^<\/td><\/tr><\/table>/{s@(<[^>]*>|::..)@@g;p}' | awk -F: '{print $NF}')

    urldecode() { : "${*}" ; echo -e "${_}" | sed 's/%\([0-9A-F][0-9A-F]\)/\\\\\x\1/g' | xargs echo -e | sed -r -n 's@.*\("(.*)"\).*@\1@g;s@%2e@.@g;p'; }

    seq 0 "${page_no}" | parallel -k -j 0 -X $download_tool --user-agent "\"${user_agent}\"" "${page_url}"proxy_list_high_anonymous_{}_ext.html 2> /dev/null | sed -r -n "/eval\(unescape/{s@<\/td><\/tr>@@;s@<\/tr>@\n@g;s@<noscript>Please enable javascript<\/noscript>@@g;s@<\/?(tr|a|script)[[:space:]]*[^>]*>@@g;s@(<td>|\(|\)|;)@@g;s@evalunescape@@g;s@'@@g;s@<\/td>@|@g;s@<td[[:space:]]*[^>]*>@@g;p}" | sed '/^$/d' | while IFS="|" read -r a b c d e f;do ip=$(urldecode "$a"); echo "$ip:$b|${c,,}|$d" >> "${proxy_ip_extracted}"; done
}

#########  3-5. AliveProxy (http://www.aliveproxy.com)  #########
# HTTP: anonymous, high-anonymous
funcProxySite_5(){
    local page_url='http://aliveproxy.com/'
    # High Anonymous Proxies
    $download_tool "${page_url}high-anonymity-proxy-list/" | sed -r -n '/^<TABLE class/{s@(.*)@\L\1@g;s@<\/tr>@\n@g;s@<\/?(tr|td|table|center|a|br)[[:space:]]*[^>]*>@@g;p}' | sed -r -n '/^[[:digit:].]+/{s@(.*)--.*@\1@gp}' | awk '{printf("%s|%s\n",$1,"high-anonymous")}' >> "${proxy_ip_extracted}"

    # Anonymous Proxies
    $download_tool "${page_url}anonymous-proxy-list/" | sed -r -n '/^<TABLE class/{s@(.*)@\L\1@g;s@<\/tr>@\n@g;s@<\/?(tr|td|table|center|a|br)[[:space:]]*[^>]*>@@g;p}' | sed -r -n '/^[[:digit:].]+/{s@(.*)--.*@\1@gp}' | awk '{printf("%s|%s\n",$1,"anonymous")}' >> "${proxy_ip_extracted}"
}

funcProxySite_5_socks(){
    # Socks 5 proxies, mostly unusable
    local page_url='http://aliveproxy.com/socks5-list/'
    $download_tool "${page_url}" | sed -r -n '/^<TABLE class/{s@(.*)@\L\1@g;s@<\/tr>@\n@g;s@<\/?(tr|td|table|center|a|br)[[:space:]]*[^>]*>@@g;p}' | sed -r -n '/^[[:digit:].]+/{s@(.*)--.*@\1@gp}' >> "${proxy_ip_extracted}"
}

#########  3-6. ProxyNova (https://www.proxynova.com)  #########
# HTTP: transparent, elite
funcProxySite_6(){
    # IP:Port|AnonymityLevel|Country|City
    local page_url='https://www.proxynova.com/proxy-server-list/'
    $download_tool "${page_url}"| sed -r -n '/<center>/,/<\/center>/d;/<tbody>/,/<\/tbody>/{s@<\/?(tbody|images|script|a|time|img|div|ins)[[:space:]]*[^>]*>@@g;s@<(td|span)[[:space:]]*[^>]*>@@g;s@^[[:blank:]]*@@g;s@<tr>@@g;p}' | sed -r '/^$/d' | awk '{if($0!~/<\/tr>/){ORS=" ";print $0}else{printf "\n"}}' | sed -r -n "s@<\/span>@@g;s@(document.write|substr\(2\)|\(|\)|'|;|[[:space:]]*\+[[:space:]]*)@@g;s@(<\/td>)@|@g;s@\.{1,}@\.@g;s@^23@@g;p" | awk -F\| '{printf("%s:%s|%s|%s\n",$1,$2,tolower($7),$6)}' | sed -r -n 's@-@|@g;s@[[:space:]]+(|)[[:space:]]+@\1@g;s@: @:@g;p' | sed -r "/^[^[:digit:]]/d;s@(|)[[:space:]]*@\1@g" >> "${proxy_ip_extracted}"
}

#########  3-7. Daily Proxy (http://www.dailyproxylists.com)  #########
# HTTP: transparent, high-anonymous
funcProxySite_7(){
    # IP:Port|AnonymityLevel|Country
    local page_url='http://www.dailyproxylists.com/index.php/proxy-lists'
    $download_tool "${page_url}" | sed -r -n '/document.write/{s@<[^>]*>@@g;s@(document.write|unescape|\(|\)|\")@@g;s@^[[:space:]]*@@g;p}' | sed -r -n 's@^[[:blank:]]*@@g;s@[[:blank:]]$@@g;p' | sed 's@\\@\\\\@g;s@\(%\)\([0-9a-fA-F][0-9a-fA-F]\)@\\x\2@g' | printf $(cat -) | sed -r -n 's@<\/?tr>@\n@g;s@<(td)[[:space:]]*[^>]*>@@g;p' | sed -r -n '/^[^[:digit:]]+/d;/^$/d;s@<[^>]*>@|@g;p' |  awk -F\| '{printf("%s:%s|%s|%s\n",$1,$2,tolower($4),$3)}' >> "${proxy_ip_extracted}"
}

#########  3-8. HideMyAss (http://proxylist.hidemyass.com)  #########
# HTTP: high-anonymous
funcProxySite_8(){
    local page_url='http://proxylist.hidemyass.com/search-1303043#listable'
    local start=1
    proxy_list_html=$(mktemp -t tempXXXXX.txt)
    tempfile_perip=$(mktemp -t tempXXXXX.txt)

    $download_tool "${page_url}" | sed -r -n '/table section/,/table section end/{/^$/d;/indicator/d;s@^[[:space:]]*@@;/^<[\/]?(td|div|span)>$/d;p}' | sed -r -n '/leftborder/,/<\/tr>/{p}' > "${proxy_list_html}"

    sed -n '/<\/tr>/=' "${proxy_list_html}" | while read -r line;do
        # echo "start $start, end $line";
        sed -r -n ''"${start},${line}"'p' "${proxy_list_html}" > "${tempfile_perip}"
        country=$(sed -r -n '/img src=/{n;s@<[^>]*>@@p}' "${tempfile_perip}" | sed -r -n 's@^[[:space:]]*@@g;s@[[:space:]]*$@@g;p')
        port=$(sed -r -n '/class=\"country\"/{x;s@<[^>]*>@@p};h' "${tempfile_perip}" | sed -r -n 's@^[[:space:]]*@@g;s@[[:space:]]*$@@g;p')
        class_none_list=$(sed -r -n '/^\..*none/s@.(.*)\{.*@\1@p' "${tempfile_perip}" | awk 'BEGIN{RS=EOF}{gsub(/\n/,"|");print}')
        ip=$(sed -r -n '/^<\/style/{s@<\/[^>]*>@\n@g;p}' "${tempfile_perip}" | sed -r 's@\.@@g' | sed -r -n 's@^([[:digit:]]+)(<.*)$@\1\n\2@;p' | sed -r -n '/^$/d;/(none|\.)/!p' | sed -r -n '/('"${class_none_list}"')/d;s@<[^>]*>@@;/^$/d;p' | awk 'BEGIN{RS=EOF}{gsub(/\n/," ");print}' | awk '{printf("%s.%s.%s.%s",$1,$2,$3,$4)}')

        echo "$ip:$port|high-anonymous|$country" >> "${proxy_ip_extracted}"

        start=$((line+1));
    done

    [[ -f "${proxy_list_html:-}" ]] && rm -f "${proxy_list_html}"
    [[ -f "${tempfile_perip:-}" ]] && rm -f "${tempfile_perip}"
}


#########  3-9. Proxy IP Testing And Filtering  #########
funcSpecificProxyIPTesting(){
    line="$1"
    ip_addr=$(echo "${line}" | awk -F\| '{print $1}')
    anonymity=$(echo "${line}" | awk -F\| '{print $2}')
    country=$(echo "${line}" | awk -F\| '{print $3}')
    city=$(echo "${line}" | awk -F\| '{print $4}')
    isp=$(echo "${line}" | awk -F\| '{print $5}')

    local curl_speed_time=${curl_speed_time:-1}     #time second -y, --speed-time <time>
    local curl_speed_limit=${curl_speed_limit:-3}    # speed byte -Y, --speed-limit <speed>
    local curl_max_time=${curl_max_time:-1.5}         #time second -m, --max-time <seconds>

    case "${anonymity,,}" in
        socks5 ) protocol_str="socks5h://" ;;
        socks4 ) protocol_str="socks4a://" ;;
        * ) protocol_str="" ;;
    esac

    if [[ -n $(curl -fsL --speed-time "${curl_speed_time}" --speed-limit "${curl_speed_limit}" --max-time "${curl_max_time}" -x "${protocol_str}${ip_addr}" ipinfo.io/country 2> /dev/null) ]]; then
        echo "$ip_addr|$country|$city|$isp"
    fi
}

funcProxyIPExtraction(){
    echo "IP testing process will cost some time, just be patient!"
    case "${protocol_type,,}" in
        h|http|https ) protocol_type='http' ;;
        socks4 ) protocol_type='socks4' ;;
        socks5 ) protocol_type='socks5' ;;
        * ) protocol_type='socks5' ;;
    esac
    real_country=$($download_tool_origin ipinfo.io/country)

    if [[ "${real_country}" == 'CN' ]]; then
        if [[ "${protocol_type}" =~ ^socks ]]; then
            funcProxySite_1_socks
        else
            if [[ "${proxy_site_specify}" -gt 0 && "${proxy_site_specify}" -le 3 ]]; then
                funcProxySite_"${proxy_site_specify}"
            else
                funcProxySite_1
                funcProxySite_2
                funcProxySite_3
            fi

        fi
    else
        if [[ "${protocol_type}" =~ ^socks ]]; then
            funcProxySite_1_socks
            funcProxySite_5_socks
        else
            if [[ "${proxy_site_specify}" -gt 0 && "${proxy_site_specify}" -le 8 ]]; then
                funcProxySite_"${proxy_site_specify}"
            else
                funcProxySite_1
            fi
        fi
    fi

    if [[ -f "${proxy_ip_extracted}" ]]; then
        if [[ "${protocol_type}" =~ ^s ]]; then
            filter_str="${protocol_type}"
        else
            case "${anonymity_type,,}" in
                l|low ) filter_str='transparent|transparente' ;;
                m|medium ) filter_str='anonymous' ;;
                h|high|* ) filter_str='high|high-anonymous|elite' ;;
            esac
        fi

        printf "Protocol type is ${c_red}%s${c_normal}.\n\n" "${protocol_type^^}"

        export -f funcSpecificProxyIPTesting
        awk -F\| 'match($2,/^('"${filter_str}"')/){print $0}' "${proxy_ip_extracted}" | parallel -k -j 0 funcSpecificProxyIPTesting 2> /dev/null
    fi
}

#########  4. EXIT Signal Processing  #########
# Register the cleanup trap before the main flow so that every exit path,
# including the early exits in funcExitStatement and funcListProxySites,
# removes the temporary files.
funcTrapEXIT(){
    [[ -f "${proxy_site_info:-}" ]] && rm -f "${proxy_site_info}"
    [[ -f "${proxy_ip_extracted:-}" ]] && rm -f "${proxy_ip_extracted}"
}

trap funcTrapEXIT EXIT

#########  5. Executing Process  #########
funcInitializationCheck
funcInternetConnectionCheck
funcDownloadToolCheck
[[ "${list_proxy_sites}" -eq 1 ]] && funcListProxySites
funcProxyIPExtraction

# Script End
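
The -p option takes a value of the form [protocol:]ip:port. As a standalone sketch of how such a value maps onto the argument of curl -x, following the same case logic as funcDownloadToolCheck, consider the helper below. The function name proxy_to_curl_arg is hypothetical and the addresses are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn "[protocol:]ip:port" into the value passed to curl -x.
proxy_to_curl_arg() {
    local spec="$1" proto host
    local re='^((http|https|socks4|socks5):)?([0-9]{1,3}(\.[0-9]{1,3}){3}:[0-9]{1,5})$'
    [[ "$spec" =~ $re ]] || return 1      # reject anything that is not [protocol:]ip:port
    proto="${BASH_REMATCH[2]:-http}"      # protocol defaults to http when omitted
    host="${BASH_REMATCH[3]}"
    case "$proto" in
        socks4) printf 'socks4a://%s\n' "$host" ;;   # socks4a: the proxy resolves host names
        socks5) printf 'socks5h://%s\n' "$host" ;;   # socks5h: the proxy resolves host names
        *)      printf '%s\n' "$host" ;;
    esac
}

proxy_to_curl_arg 'socks5:127.0.0.1:1080'
proxy_to_curl_arg '127.0.0.1:8080'
```

Returning a non-zero status for malformed input lets the caller decide how to report the error.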

Script Execution

A sample run:

# time bash /tmp/proxy.sh -t h
IP testing process will cost some time, just be patient!
Protocol type is HTTP.

5.249.148.50:3128|Italy|Arezzo|Aruba S.p.A.
181.65.239.105:3128|Peru||Telefonica del Peru
36.67.85.242:53281|Indonesia||PT Telkom Indonesia
207.154.225.175:8080|Germany|Frankfurt|Digital Ocean
103.248.233.156:80|India|Ahmedabad|Ishan Infotech Limited
186.42.253.246:8080|Ecuador|Quito|Corporacion Nacional De Telecomunicaciones Cnt S.A
185.82.212.95:8080|Czech Republic||Whois protection s.r.o.

real	0m10.559s
user	0m2.352s
sys	0m0.872s
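
The script parses -i and -e into include_country and exclude_country but does not use them yet. Because every output line is pipe-delimited with the country in the second field, such a filter could be sketched as follows. The function name filter_by_country and its argument order are assumptions, not part of the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep IP:Port|Country|City|ISP lines from stdin whose
# country equals $1 (when non-empty) and differs from $2 (when non-empty).
filter_by_country() {
    local include="${1:-}" exclude="${2:-}"
    awk -F'|' -v inc="$include" -v exc="$exclude" '
        (inc == "" || $2 == inc) && (exc == "" || $2 != exc)
    '
}

# Drop the Peruvian entry from three lines of the sample run.
printf '%s\n' \
    '5.249.148.50:3128|Italy|Arezzo|Aruba S.p.A.' \
    '181.65.239.105:3128|Peru||Telefonica del Peru' \
    '207.154.225.175:8080|Germany|Frankfurt|Digital Ocean' \
    | filter_by_country '' 'Peru'
```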

Problems Encountered

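In practice, some proxy sites use tricks that make scraping harder. The details follow.

document.write(":"+z+v)

Nntime emits the port number as document.write(":"+z+v), and the mapping between letters and digits differs on every page. The HTML looks like this: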

<!-- 51.254.214.236:24631 -->
<tr class="odd"><td><input type="checkbox" name="c15" id="row15"  value="52576131.254.214.236754230924631" onclick="choice()" /></td><td>51.254.214.236<script type="text/javascript">document.write(":"+i+x+l+y+j)</script></td>

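The number of digits in the port equals the number of variables in document.write; in this example ixlyj has 5. Taking the last 5 characters of the input element's value gives 24631, which is the port number.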
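
The rule can be sketched in isolation. This is a minimal illustration that assumes the checkbox value and the document.write call have already been cut out of the row; the function name nntime_port is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: $1 is the checkbox value, $2 the document.write(...) call.
nntime_port() {
    local value="$1" js="$2" vars n
    local re='document\.write\(":"\+([a-z+]+)\)'
    [[ "$js" =~ $re ]] || return 1
    vars="${BASH_REMATCH[1]//+/}"          # i+x+l+y+j -> ixlyj
    n=${#vars}                             # one variable per port digit
    printf '%s\n' "${value:${#value}-n}"   # last n characters of the value
}

value='52576131.254.214.236754230924631'
js='document.write(":"+i+x+l+y+j)'
nntime_port "$value" "$js"    # prints 24631
```

Counting the variables avoids having to reconstruct the per-page letter-to-digit mapping.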

javascript unescape

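Proxz only returns the HTML page when a user-agent is specified. The IP addresses in the page are percent-encoded strings that JavaScript decodes with unescape, so they must be decoded the same way to recover the IP.

An IP can be recovered as follows: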

# 111.8.22.204
urldecode() { : "${*}" ; echo -e "${_}" | sed 's/%\([0-9A-F][0-9A-F]\)/\\\\\x\1/g' | xargs echo -e | sed -r -n 's@.*\("(.*)"\).*@\1@g;s@%2e@.@g;p'; }
text="%73%65%6c%66%2e%64%6f%63%75%6d%65%6e%74%2e%77%72%69%74%65%6c%6e%28%22%31%31%31%2e%38%2e%32%32%2e%32%30%34%22%29%3b"
echo $(urldecode "$text")
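
As a hedged alternative to the sed and xargs pipeline (a sketch, not the function the script uses), Bash's printf %b can decode the %XX sequences directly, which also covers lowercase hex digits such as %2e. The names urldecode_simple and proxz_ip are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decode %XX sequences by rewriting them as \xXX escapes.
# Note that %b also expands any other backslash escapes present in the input.
urldecode_simple() {
    printf '%b\n' "${1//%/\\x}"
}

# Hypothetical helper: decode the string and pull the quoted IP out of ("...").
proxz_ip() {
    local decoded re='\("([0-9.]+)"\)'
    decoded=$(urldecode_simple "$1")
    [[ "$decoded" =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

text="%73%65%6c%66%2e%64%6f%63%75%6d%65%6e%74%2e%77%72%69%74%65%6c%6e%28%22%31%31%31%2e%38%2e%32%32%2e%32%30%34%22%29%3b"
urldecode_simple "$text"    # prints self.document.writeln("111.8.22.204");
proxz_ip "$text"            # prints 111.8.22.204
```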

document.write substr

ProxyNova obfuscates IP addresses in the form document.write('2331.16'.substr(2) + '0.4.90'). To obtain the target IP, join the strings in single quotes and then drop the first 2 characters (31.160.4.90 here).
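
The sed pipeline shown earlier strips a literal leading 23. A slightly more general sketch reads the offset from the substr argument instead. It assumes the call always has the two-string shape shown above, and the function name proxynova_ip is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rebuild the IP from 'AAAA'.substr(N) + 'BBBB'.
proxynova_ip() {
    local js="$1" head skip tail
    local re="'([^']*)'\.substr\(([0-9]+)\)[[:space:]]*\+[[:space:]]*'([^']*)'"
    [[ "$js" =~ $re ]] || return 1
    head="${BASH_REMATCH[1]}"    # first quoted string, e.g. 2331.16
    skip="${BASH_REMATCH[2]}"    # substr offset, e.g. 2
    tail="${BASH_REMATCH[3]}"    # second quoted string, e.g. 0.4.90
    printf '%s%s\n' "${head:skip}" "$tail"
}

proxynova_ip "document.write('2331.16'.substr(2) + '0.4.90')"    # prints 31.160.4.90
```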

document.write unescape

Daily Proxy encodes the key part of its markup with document.write(unescape(...)); it must be decoded back into HTML tags before the data can be extracted.

Change Logs

  • 2017.06.14 21:43 Wed Asia/Shanghai
    • First draft completed
  • 2017.06.23 11:28 Fri Asia/Shanghai
    • Added the shell script