PhishTank is operated by OpenDNS, a free service that makes your Internet safer, faster, and smarter. Get started today!

XML data file of online, valid phishes from PhishTank

posted by John Roberts on October 17th, 2006 in PhishTank, API, Data, XML

The best judgment of the PhishTank community is represented by the ever-changing list of suspected phishes that are both online and valid, meaning “verified as a phish” by the members of PhishTank. You can page through this list on the site, in reverse chronological order (by submission time). The PhishTank API is more powerful and more granular, but it does not offer a way to get a bulk list.

However, the PhishTank data is also effective when distributed and available for use in local applications, whether “local” means your personal router, the gateway of your ISP, a corporate firewall or elsewhere. Now the data is available as a regularly updated XML data file.

Basic file details

  • Format: XML
  • Update frequency: Hourly.
    I encourage you to fetch it no more often than once an hour.
  • File size: Varies. Edited: April 11, 2007: It can be as large as 10 MB, so it may not open easily in a browser. Right now, with 1125 verified, online phishes, the file is a bit over 600 KB.
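One way to stay within the hourly cadence is to remember when you last fetched the file and skip the request until an hour has passed. A minimal sketch in Python; the name should_refetch and the MIN_FETCH_INTERVAL constant are illustrative choices, not anything PhishTank provides.

```python
from datetime import datetime, timedelta, timezone

# Assumption: one hour, matching the update frequency stated above.
MIN_FETCH_INTERVAL = timedelta(hours=1)


def should_refetch(last_fetch, now=None):
    """Return True if at least an hour has passed since last_fetch.

    last_fetch is a timezone-aware datetime, or None if the file
    has never been fetched.
    """
    if last_fetch is None:
        return True
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_fetch >= MIN_FETCH_INTERVAL
```

A cron job or a long-running daemon could call this before each download and simply do nothing when it returns False.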

File location

http://data.phishtank.com/data/online-valid/

Edited: January 25, 2007: If your script needs a filename, the filename is index.xml in that directory. Otherwise, please do not use the filename; it interferes with our mirroring of the data file in multiple locations.

Considerations

Phish sites go up and down at various times. Usually, a single phish URL doesn’t stay online for very long, so it’s important to consider not only the timestamp of the data file, but also the time elapsed since both the submission of the phish to PhishTank and its verification by the PhishTank community. There are exceptions, but if a phish URL is more than a week or two old, then the host where it’s living is probably not paying attention. Over time, PhishTank will start to provide more data about the hosts, so you can see which hosts tend to allow this kind of activity to continue.
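The elapsed times mentioned above can be computed from the timestamps in the file. This is a sketch only: the 14-day cutoff is my reading of “a week or two”, and the function names are hypothetical. It relies on datetime.fromisoformat (Python 3.7 or later), which accepts the timestamp format used in the file.

```python
from datetime import datetime, timedelta

# Assumption: "a week or two" is taken as 14 days; choose your own cutoff.
STALE_AFTER = timedelta(days=14)


def entry_ages(submission_time, verification_time, generated_at):
    """Return (time since submission, time since verification),
    both measured at the moment the data file was generated."""
    generated = datetime.fromisoformat(generated_at)
    submitted = datetime.fromisoformat(submission_time)
    verified = datetime.fromisoformat(verification_time)
    return generated - submitted, generated - verified


def is_long_lived(submission_time, generated_at):
    """True if the phish was submitted more than STALE_AFTER before the file was generated."""
    generated = datetime.fromisoformat(generated_at)
    submitted = datetime.fromisoformat(submission_time)
    return generated - submitted > STALE_AFTER
```

Measuring against the file's own timestamp rather than the local clock keeps the result independent of when you happen to process the file.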

The file has an ETag header and a Last-Modified header. Please respect these when fetching the file. We may support gzip in the future, to further reduce bandwidth for all parties.
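Respecting those headers usually means making a conditional GET: send back the ETag and Last-Modified values you saved from the previous fetch, and treat a 304 response as “nothing changed”. Here is a rough sketch using only the Python standard library. The function names are hypothetical, and it assumes the server answers conditional requests with 304 Not Modified.

```python
import urllib.error
import urllib.request

DATA_URL = "http://data.phishtank.com/data/online-valid/"


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(etag=None, last_modified=None, url=DATA_URL):
    """Return (body, etag, last_modified); body is None on a 304 response."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep using the cached copy
            return None, etag, last_modified
        raise
```

Store the returned ETag and Last-Modified values alongside your cached copy and pass them in on the next call.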

Field definitions

To help you use the data in this file, I’ve described each of the fields below.

meta is the wrapper for information about the file itself.

generated_at is the time the file was last generated as an ISO 8601 date string. The ISO standard incorporates the timezone; PhishTank uses UTC.
Sample value: 2006-10-17T00:17:02+00:00

total_entries is the count of how many valid phish URLs are in the file at that time. This will always be a positive integer.
Sample value: 1125

entries is the overall container for all the individual phish records as a collection.

entry is the container for data about each individual phish.

url is the phish URL. The value (a URL) is presented as CDATA because phishers are not polite folks, and occasionally use invalid characters in their URLs. Some browsers are more forgiving about the standards, and interpret (or ignore) the invalid characters, so the URL still works as a phish in those browsers, even though it might fail in others.
Sample value: <![CDATA[http://www.firstgenericbank.account-updateinfo.com]]>
Note: This URL is an example only. The domain is owned by OpenDNS, operators of PhishTank, for demonstration purposes.
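Most XML parsers hand CDATA content back as ordinary text, so you should not need special handling to read the url field. A small illustration with Python’s xml.etree.ElementTree; the helper name url_text is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE_FRAGMENT = (
    "<url><![CDATA[http://www.firstgenericbank.account-updateinfo.com]]></url>"
)


def url_text(fragment):
    """Parse a single <url> element and return its text content.

    Characters such as & and < inside CDATA are returned literally,
    so treat the result as an opaque string rather than a validated URL.
    """
    return ET.fromstring(fragment).text


print(url_text(SAMPLE_FRAGMENT))
```

Because the URLs may contain characters that stricter libraries reject, it is safer to compare them as plain strings than to assume every one of them will parse cleanly.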

phish_id is the PhishTank ID for the phish URL. All data in PhishTank is tied to this ID. You may or may not need this piece of information, but it’s useful for us. This will always be a positive integer.
Sample value: 19845

phish_detail_url is the PhishTank detail page for the phish URL, where you can view data about the phish, including a screenshot and the community votes. More data will be added to this page over time.
Sample value: <![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]>

submission is a container for submission_time currently, and may contain additional fields in the future.

submission_time is the time the phish was submitted to PhishTank, in UTC. Same timestamp format as generated_at.
Sample value: 2006-10-17T19:21:30+00:00

verification is a container for information about a phish URL’s verification, including verified and verification_time currently. This container may have additional fields in the future.

verified indicates whether or not a suspected phish has been judged by the PhishTank community. In this data file, of all online, valid phishes, the value will always be yes.
Sample value: yes

verification_time is the time the phish was judged by the PhishTank community, in UTC. In this file, it’s the time the phish was verified as a phish. Same timestamp format as generated_at. It may be interesting to compare verification_time and submission_time.
Sample value: 2006-10-17T23:06:28+00:00
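As one example of that comparison, the difference between the two timestamps tells you how long the community took to verify a submission. A minimal sketch, with a hypothetical function name, using the sample values from this post:

```python
from datetime import datetime


def verification_delay(submission_time, verification_time):
    """Return how long a phish waited between submission and verification."""
    submitted = datetime.fromisoformat(submission_time)
    verified = datetime.fromisoformat(verification_time)
    return verified - submitted


delay = verification_delay(
    "2006-10-17T19:21:30+00:00",  # sample submission_time above
    "2006-10-17T23:06:28+00:00",  # sample verification_time above
)
print(delay)  # 3:44:58
```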

status is the container for online, currently. This container may have additional fields in the future.

online notes whether a phish URL is live and responding. In this data file, of all online, valid phishes, the value will always be yes.
Sample value: yes

Attribution and usage

This data is free. It may be used in commercial products or non-commercial products, by organizations or individuals.

If you use the data, we would appreciate public attribution for the data to PhishTank, preferably with a link to the PhishTank home page. We will soon publish a page with some guidelines about how to use the PhishTank logo (if you want to) and otherwise attribute the data to PhishTank. For now, contact us if you have anything special… kind words and a link are the general goals! ;-)

We’re curious to learn how this data gets used, so please let us know, either in the comments or via the contact form.

Example XML

<?xml version="1.0" encoding="utf-8"?>
<output>
  <meta>
    <generated_at>2006-10-17T18:17:01+00:00</generated_at>
    <total_entries>1</total_entries>
  </meta>
  <entries>
    <entry>
      <url><![CDATA[http://www.firstgenericbank.account-updateinfo.com]]></url>
      <phish_id>19845</phish_id>
      <phish_detail_url><![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]></phish_detail_url>
      <submission>
        <submission_time>2006-10-17T03:00:18+00:00</submission_time>
      </submission>
      <verification>
        <verified>yes</verified>
        <verification_time>2006-10-17T13:13:37+00:00</verification_time>
      </verification>
      <status>
        <online>yes</online>
      </status>
    </entry>
  </entries>
</output>
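To show how the fields fit together, here is a sketch that reads a file shaped like the example above into plain Python dictionaries. The function name and the dictionary keys are my own choices, and the embedded sample is just the example from this post.

```python
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<output>
<meta>
<generated_at>2006-10-17T18:17:01+00:00</generated_at>
<total_entries>1</total_entries>
</meta>
<entries>
<entry>
<url><![CDATA[http://www.firstgenericbank.account-updateinfo.com]]></url>
<phish_id>19845</phish_id>
<phish_detail_url><![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]></phish_detail_url>
<submission>
<submission_time>2006-10-17T03:00:18+00:00</submission_time>
</submission>
<verification>
<verified>yes</verified>
<verification_time>2006-10-17T13:13:37+00:00</verification_time>
</verification>
<status>
<online>yes</online>
</status>
</entry>
</entries>
</output>
"""


def parse_phish_file(xml_bytes):
    """Return (meta, entries) parsed from the data file's XML."""
    root = ET.fromstring(xml_bytes)
    meta = {
        "generated_at": root.findtext("meta/generated_at"),
        "total_entries": int(root.findtext("meta/total_entries")),
    }
    entries = []
    for entry in root.iterfind("entries/entry"):
        entries.append({
            "url": entry.findtext("url"),
            "phish_id": int(entry.findtext("phish_id")),
            "phish_detail_url": entry.findtext("phish_detail_url"),
            "submission_time": entry.findtext("submission/submission_time"),
            "verified": entry.findtext("verification/verified"),
            "verification_time": entry.findtext("verification/verification_time"),
            "online": entry.findtext("status/online"),
        })
    return meta, entries


meta, entries = parse_phish_file(SAMPLE)
print(meta["total_entries"], entries[0]["url"])
```

This loads the whole document into memory, which should be fine at the file sizes mentioned above. Because the containers may gain fields in the future, the sketch looks up only the elements it knows about and ignores anything else.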

12 Responses

  1. funchords

    Thank you!

    One of the reasons that I considered stopping volunteering time in the PhishTank was because I felt like the information was not being shared well enough. Perhaps my time was better spent elsewhere. You have just changed my mind for the better!

    OpenDNS and others who use the API use the data to warn off victims — this is very cool! But, the volunteers on the PIRT can use this data to notify domain/IP owners to remove the offending sites.

    I think this is great! Thanks PhishTank!

    –Robb

  2. Jeff Chan

    Thanks for making this available, guys! I’ve added the data to our phishing SURBL, with some appropriate munging:

    http://www.surbl.org/lists.html#ph

    Cheers,

    Jeff C.

  3. MASA

    I am using this data for a firefox extension that protects the user from phishpages based on data from phishtank.

    The extension is in the translation stage right now.

  4. MASA

    It’s out,

    http://phishtank.com/sitechecker (redirects to the extension’s homepage)

  5. Ian

    Perhaps you ought to make the server serve the RSS feed as static data, and kick out a 304 header if it is not modified? That should help bandwidth-wise.

  6. phishthis

    Thank You OpenDNS for having this data. Hopefully, as one who finds phishing and pharming as totally repulsive, this will help the yet-unwashed of the internet to learn what phishing is, and perhaps our efforts will keep them safer and spare them the time in repairing their good name(s).

    Mark
    Founder, President, and “Bottle Warsher” of the
    London Antiphishing Society,
    near Arkansas Nuclear One.

  7. Andres Riancho

    Great work, the database is really amazing! Only one simple question, why isnt the xml file compressed ?

    dz0@sock3t:/tmp$ du -sh index.xml
    6.4M index.xml
    dz0@sock3t:/tmp$ bzip2 index.xml
    dz0@sock3t:/tmp$ du -sh index.xml.bz2
    292K index.xml.bz2
    dz0@sock3t:/tmp$

    The size reduction is amazing, you would save a lot of bandwidth !
    Maybe the compression algorithm should be “zip” instead of bzip2 to make it easy to decompress in all programming languages and operating systems.


    Andres Riancho

  8. John Roberts

    Andres, the file is transparently gzipped across the wire as long as the client requesting the file supports gzip.

  9. Steve Garvey

    The file is NOT sent compressed to my gzip capable browser (Firefox)

    Me: Accept-Encoding: gzip, deflate

    Your server: Content-Length: 6584660

    What if someone writes a script to retrieve it without an “Accept-Encoding” header? As stated previously, you should be serving a gzip’d file.

  10. Tom

    http://data.phishtank.com/data/online-valid/index.xml has been returning 0 entries since yesterday some time.

    Please advise

  11. John Nagle

    Something has gone very wrong with the XML file of PhishTank data at “http://data.phishtank.com/data/online-valid”. Today, it reads:


    <generated_at>2007-10-12T04:30:01+00:00</generated_at>
    <total_entries>0</total_entries>

    That’s the entire file. Valid XML, no entries. Something is very broken.

  12. John Nagle

    The XML file started containing useful data again on Friday, October 12th. Thanks.

    Incidentally, it would help if the file was updated as an atomic operation. Occasionally, we see a partially written file, if we happen to read it while it’s being rewritten. We have to read the file twice at 30 second intervals and compare, rereading until we get the same contents twice in a row. It would be better to write a new file on each update, then move or link it to the name of the distributed file.
