PhishTank is operated by OpenDNS, a free service that makes your Internet safer, faster, and smarter. Get started today!

'XML' Posts

Data about phishers at the right cost (free)

posted by John Roberts on November 14th, 2006 in API, Community, Data, PhishTank, PhishTank in the news, XML

I read the SecurityProNews article “Sites Want To Hook And Gut Phishers” with interest this morning. The article’s summary:

A trio of websites offer people the opportunity to report the phish emails they receive in order to thwart the various scams and their perpetrators.

Three different sites are included in the round-up: PhishTank, CastleCops, and Symantec’s Phish Report Network.

At OpenDNS (operators of PhishTank), we’re fans of CastleCops. Their work is excellent, and their efforts in the broader anti-abuse community are notable. We shared our gratitude in July.

However, I don’t think the Phish Report Network site belongs in the same category, for two key reasons: the lack of information about submissions and the hefty price of their data.

Submitting to a black hole

Submitting phish to the Phish Report Network is dumping your submissions into a black hole. (And they didn’t even accept submissions from individuals until October 2006… wonder if PhishTank’s launch had something to do with that?)

I just took a live phish site from PhishTank and submitted it, after agreeing to a license and filling out a Captcha. Those hoops are not necessarily a bad idea to weed out spurious submissions, but here’s all I was told after the submission was received.

CONFIRMATION

Your submission has been sent Tue Nov 14 09:46:06 PST 2006. To make another submission, click here.

Sincerely,

Symantec Security Response

Couldn’t the page at least say thanks?

Outside of the lack of human touch, there is no insight into what the final judgment might be, when such judgment will be rendered, and by whom. There is literally no way to follow up.

PhishTank shows you your activity, and gives you email updates (if you want them) and an RSS feed to track your submissions. Go to your My Account page to learn how your contributions are being judged.

The price of data

The data gathered and verified by Symantec’s site is only available if you pay for it. How much? US$50,000 per year.

On behalf of OpenDNS, I inquired about a license to the data on July 12, 2006. On August 8, 2006, I got an apologetic response for the delay. On August 9, 2006, I got a copy of the contract, with its US$50,000 price tag for the year. I declined to go any further.

I have nothing against businesses charging for a service, and perhaps Symantec is finding customers who find this a valuable source of data. It’s hard to know, since they give out little information about who’s using the data and how much data there is. PhishTank statistics are wide open.

PhishTank was set up to help the Internet at large and solve a business problem for OpenDNS (the common need for better data about phishing sites). The reason PhishTank works is because the data is freely available to all, from the free, open API to the XML data file or the lightweight method.

My suggestion to Symantec? Add data from PhishTank to your Phish Report Network. It’s free. And if you’d like to share your submissions with PhishTank, we’re happy to help make it work.

Mozilla found the data worth testing with, at least.

XML data file of online, valid phishes from PhishTank

posted by John Roberts on October 17th, 2006 in API, Data, PhishTank, XML

The best judgment of the PhishTank community is represented by the ever-changing list of suspected phishes that are both online and valid, meaning “verified as a phish” by the members of PhishTank. You can page through this list on the site, in reverse chronological order (by submission time). The PhishTank API is more powerful and more granular; it does not offer a way to get a bulk list.

However, the PhishTank data also is effective when distributed and available for use in local applications, whether local is your personal router, the gateway of your ISP, a corporate firewall or elsewhere. Now the data is available as a regularly updated XML data file.

Basic file details

  • Format: XML
  • Update frequency: Hourly.
    I encourage you to fetch it no more often than once an hour.
  • File size: Varies. Edited: April 11, 2007: Can be as large as 10MB, so it may not open easily in a browser. Right now, with 1125 verified, online phishes, the file is a bit over 600Kb.

File location

http://data.phishtank.com/data/online-valid/

Considerations

Phish sites go up and down at various times. Usually, a single phish URL doesn’t stay online for very long, so it’s important to consider not only the timestamp of the data file, but the time elapsed since both the submission of the phish to PhishTank and its verification by the PhishTank community. There are exceptions, but if a phish URL is more than a week or two old, then the host where it’s living is not paying attention. Over time, PhishTank will start to provide more data about the hosts, so you can see which hosts tend to allow this kind of activity to continue.

The file has an ETag header and a Last-Modified header. Please respect these when fetching the file. We may support gzip in the future, to further reduce bandwidth for all parties.

Field definitions

To help you use the data in this file, I’ve described each of the fields below.

meta is the wrapper for information about the file itself.

generated_at is the time the file was last generated as an ISO 8601 date string. The ISO standard incorporates the timezone; PhishTank uses UTC.
Sample value: 2006-10-17T00:17:02+00:00

total_entries is the count of how many valid phish URLs are in the file at that time. This will always be a positive integer.
Sample value: 1125

entries is the overall container for all the individual phish records as a collection.

entry is the container for data about each individual phish.

url is the phish URL. The value (a URL) is presented as CDATA because phishers are not polite folks, and occasionally use non-valid characters in their URLs. Some browsers are more forgiving about the standards, and interpret (or ignore) the non-valid characters, so the URL is a phish, even though it might fail in other browsers.
Sample value: <![CDATA[http://www.firstgenericbank.account-updateinfo.com]]>
Note: This URL is an example only. The domain is owned by OpenDNS, operators of PhishTank, for demonstration purposes.

phish_id is the PhishTank ID for the phish URL. All data in PhishTank is tied to this ID. You may or may not need this piece of information, but it’s useful for us. This will always be a positive integer.
Sample value: 19845

phish_detail_url is the PhishTank detail page for the phish URL, where you can view data about the phish, including a screenshot and the community votes. More data will be added to this page over time.
Sample value: <![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]>

submission is a container for submission_time currently, and may contain additional fields in the future.

submission_time is the time the phish was submitted to PhishTank, in UTC. Same timestamp format as generated_at.
Sample value: 2006-10-17T19:21:30+00:00

verification is a container for information about a phish URL’s verification, including verified and verification_time currently. This container may have additional fields in the future.

verified indicates whether or not a suspected phish has been judged by the PhishTank community. In this data file, of all online, valid phishes, the value will always be yes.
Sample value: yes

verification_time is the time the phish was judged by the PhishTank community, in UTC. In this file, it’s the time the phish was verified as a phish. Same timestamp format as generated_at. It may be interesting to compare verification_time and submission_time.
Sample value: 2006-10-17T23:06:28+00:00

status is the container for online, currently. This container may have additional fields in the future.

online notes whether a phish URL is live and responding. In this data file, of all online, valid phishes, the value will always be yes.
Sample value: yes

Attribution and usage

This data is free. It may be used in commercial products or non-commercial products, by organizations or individuals.

If you use the data, we would appreciate public attribution for the data to PhishTank, preferably with a link to the PhishTank home page. We will soon publish a page with some guidelines about how to use the PhishTank logo (if you want to) and otherwise attribute the data to PhishTank. For now, contact us if you have anything special… kind words and a link are the general goals! ;-)

We’re curious to learn how this data gets used, so please let us know, either in the comments or via the contact form.

Example XML

<?xml version="1.0" encoding="utf-8"?>
<output>
<meta>
<generated_at>2006-10-17T18:17:01+00:00</generated_at>
<total_entries>1</total_entries>
</meta>
<entries>
<entry>
<url><![CDATA[http://www.firstgenericbank.account-updateinfo.com]]></url>
<phish_id>19845</phish_id>
<phish_detail_url><![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]></phish_detail_url>
<submission>
<submission_time>2006-10-17T03:00:18+00:00</submission_time>
</submission>
<verification>
<verified>yes</verified>
<verification_time>2006-10-17T13:13:37+00:00</verification_time>
</verification>
<status>
<online>yes</online>
</status>
</entry>
</entries>
</output>
Server: pt5.phishtank.com