The best judgment of the PhishTank community is represented by the ever-changing list of suspected phishes that are both online and valid, where valid means “verified as a phish” by the members of PhishTank. You can page through this list on the site, in reverse chronological order (by submission time). The PhishTank API is more powerful and more granular, but it does not offer a way to get a bulk list.
However, the PhishTank data is also useful when distributed and available for use in local applications, whether local means your personal router, the gateway of your ISP, a corporate firewall or elsewhere. The data is now available as a regularly updated XML data file.
Basic file details
- Format: XML
- Update frequency: Hourly.
I encourage you to fetch it no more often than once an hour.
- File size: Varies. Edited: April 11, 2007: Can be as large as 10MB, so it may not open easily in a browser.
Right now, with 1125 verified, online phishes, the file is a bit over 600 KB.
File location
http://data.phishtank.com/data/online-valid/
Edited: January 25, 2007: If your script requires a filename, the filename is index.php in that directory. Otherwise, please do not use the filename; it interferes with our mirroring of the data file in multiple locations.
Considerations
Phish sites go up and down at various times. Usually, a single phish URL doesn’t stay online for very long, so it’s important to consider not only the timestamp of the data file, but the time elapsed since both the submission of the phish to PhishTank and its verification by the PhishTank community. There are exceptions, but if a phish URL is more than a week or two old, then the host where it’s living is not paying attention. Over time, PhishTank will start to provide more data about the hosts, so you can see which hosts tend to allow this kind of activity to continue.
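As a rough illustration of the point above, here is a minimal Python sketch that works out how long ago an entry was submitted and verified. The function names and the 14-day cutoff are my own assumptions (the cutoff is taken from the "week or two" rule of thumb), not anything PhishTank defines. The timestamps are the ISO 8601 strings found in the data file.

```python
from datetime import datetime, timedelta, timezone

def entry_ages(submission_time, verification_time, now=None):
    """Return (time since submission, time since verification)."""
    now = now or datetime.now(timezone.utc)
    submitted = datetime.fromisoformat(submission_time)
    verified = datetime.fromisoformat(verification_time)
    return now - submitted, now - verified

def is_probably_stale(submission_time, verification_time, now=None,
                      max_age=timedelta(days=14)):
    """True if the phish was submitted longer ago than max_age.

    max_age is a hypothetical threshold; tune it for your own use.
    """
    since_submission, _ = entry_ages(submission_time, verification_time, now)
    return since_submission > max_age
```

Treat the result as a hint rather than a verdict: an old entry that is still online may simply be one of the exceptions.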
The file has an ETag header and a Last-Modified header. Please respect these when fetching the file. We may support gzip in the future, to further reduce bandwidth for all parties.
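One way to respect those headers is a conditional GET: send back the ETag and Last-Modified values from your previous fetch, and skip the download when the server answers 304 Not Modified. The sketch below uses only Python's standard library. The function names are hypothetical, and whether the server honors both request headers is an assumption I have not verified.

```python
import urllib.error
import urllib.request

FEED_URL = "http://data.phishtank.com/data/online-valid/"

def build_conditional_headers(etag=None, last_modified=None):
    """Build request headers from the values saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers

def fetch_if_changed(url=FEED_URL, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified); body is None if nothing changed."""
    request = urllib.request.Request(
        url, headers=build_conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep using the cached copy
            return None, etag, last_modified
        raise
```

Store the returned ETag and Last-Modified values next to your cached copy and pass them in on the next hourly run.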
Field definitions
To help you use the data in this file, I’ve described each of the fields below.
meta is the wrapper for information about the file itself.
generated_at is the time the file was last generated as an ISO 8601 date string. The ISO standard incorporates the timezone; PhishTank uses UTC.
Sample value: 2006-10-17T00:17:02+00:00
total_entries is the count of how many valid phish URLs are in the file at that time. This will always be a positive integer.
Sample value: 1125
entries is the overall container for all the individual phish records as a collection.
entry is the container for data about each individual phish.
url is the phish URL. The value (a URL) is presented as CDATA because phishers are not polite folks, and occasionally use invalid characters in their URLs. Some browsers are more forgiving about the standards, and interpret (or ignore) the invalid characters, so the URL still works as a phish in those browsers, even though it might fail in others.
Sample value: <![CDATA[http://www.firstgenericbank.account-updateinfo.com]]>
Note: This URL is an example only. The domain is owned by OpenDNS, operators of PhishTank, for demonstration purposes.
phish_id is the PhishTank ID for the phish URL. All data in PhishTank is tied to this ID. You may or may not need this piece of information, but it’s useful for us. This will always be a positive integer.
Sample value: 19845
phish_detail_url is the PhishTank detail page for the phish URL, where you can view data about the phish, including a screenshot and the community votes. More data will be added to this page over time.
Sample value: <![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]>
submission is a container for submission_time currently, and may contain additional fields in the future.
submission_time is the time the phish was submitted to PhishTank, in UTC. Same timestamp format as generated_at.
Sample value: 2006-10-17T19:21:30+00:00
verification is a container for information about a phish URL’s verification, including verified and verification_time currently. This container may have additional fields in the future.
verified indicates whether or not a suspected phish has been judged by the PhishTank community. Because this data file contains only online, valid phishes, the value will always be yes.
Sample value: yes
verification_time is the time the phish was judged by the PhishTank community, in UTC. In this file, it’s the time the phish was verified as a phish. Same timestamp format as generated_at. It may be interesting to compare verification_time and submission_time.
Sample value: 2006-10-17T23:06:28+00:00
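For instance, the gap between the two timestamps shows how long the community took to verify a submission. A minimal Python sketch, using the two sample values above (the function name is hypothetical):

```python
from datetime import datetime

def verification_lag(submission_time, verification_time):
    """Time between submission and verification, as a timedelta."""
    submitted = datetime.fromisoformat(submission_time)
    verified = datetime.fromisoformat(verification_time)
    return verified - submitted

lag = verification_lag("2006-10-17T19:21:30+00:00",
                       "2006-10-17T23:06:28+00:00")
print(lag)  # 3:44:58
```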
status is the container for online, currently. This container may have additional fields in the future.
online notes whether a phish URL is live and responding. Because this data file contains only online, valid phishes, the value will always be yes.
Sample value: yes
Attribution and usage
This data is free. It may be used in commercial products or non-commercial products, by organizations or individuals.
If you use the data, we would appreciate public attribution for the data to PhishTank, preferably with a link to the PhishTank home page. We will soon publish a page with some guidelines about how to use the PhishTank logo (if you want to) and otherwise attribute the data to PhishTank. For now, contact us if you have anything special in mind. Kind words and a link are the general goals!
We’re curious to learn how this data gets used, so please let us know, either in the comments or via the contact form.
Example XML
<?xml version="1.0" encoding="utf-8"?>
<output>
  <meta>
    <generated_at>2006-10-17T18:17:01+00:00</generated_at>
    <total_entries>1</total_entries>
  </meta>
  <entries>
    <entry>
      <url><![CDATA[http://www.firstgenericbank.account-updateinfo.com]]></url>
      <phish_id>19845</phish_id>
      <phish_detail_url><![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]></phish_detail_url>
      <submission>
        <submission_time>2006-10-17T03:00:18+00:00</submission_time>
      </submission>
      <verification>
        <verified>yes</verified>
        <verification_time>2006-10-17T13:13:37+00:00</verification_time>
      </verification>
      <status>
        <online>yes</online>
      </status>
    </entry>
  </entries>
</output>
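A file with this structure can be read with any XML parser. Here is a minimal sketch using Python's standard xml.etree.ElementTree. The parse_feed function and the dictionary keys are my own naming, and the embedded sample is a compact copy of the example above. If you parse the live file, keep in mind that the standard library parser is not hardened against maliciously constructed XML.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Compact copy of the example file, kept as bytes so the
# encoding declaration is handled by the parser.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<output><meta>
<generated_at>2006-10-17T18:17:01+00:00</generated_at>
<total_entries>1</total_entries>
</meta><entries><entry>
<url><![CDATA[http://www.firstgenericbank.account-updateinfo.com]]></url>
<phish_id>19845</phish_id>
<phish_detail_url><![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]></phish_detail_url>
<submission><submission_time>2006-10-17T03:00:18+00:00</submission_time></submission>
<verification><verified>yes</verified>
<verification_time>2006-10-17T13:13:37+00:00</verification_time></verification>
<status><online>yes</online></status>
</entry></entries></output>"""

def parse_feed(xml_bytes):
    """Return (meta, entries) parsed from the data file contents."""
    root = ET.fromstring(xml_bytes)
    meta = {
        "generated_at": datetime.fromisoformat(root.findtext("meta/generated_at")),
        "total_entries": int(root.findtext("meta/total_entries")),
    }
    entries = []
    for entry in root.iterfind("entries/entry"):
        entries.append({
            "url": entry.findtext("url"),  # CDATA arrives as plain text
            "phish_id": int(entry.findtext("phish_id")),
            "phish_detail_url": entry.findtext("phish_detail_url"),
            "submission_time": datetime.fromisoformat(
                entry.findtext("submission/submission_time")),
            "verified": entry.findtext("verification/verified") == "yes",
            "verification_time": datetime.fromisoformat(
                entry.findtext("verification/verification_time")),
            "online": entry.findtext("status/online") == "yes",
        })
    return meta, entries

meta, entries = parse_feed(SAMPLE_XML)
```

Comparing meta["total_entries"] with len(entries) is a cheap sanity check that the download was not truncated.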