XML data file of online, valid phishes from PhishTank

posted by John Roberts on October 17th, 2006 in API, Data, PhishTank, XML

The best judgment of the PhishTank community is represented by the ever-changing list of suspected phishes that are both online and valid, meaning “verified as a phish” by the members of PhishTank. You can page through this list on the site, in reverse chronological order (by submission time). The PhishTank API is more powerful and more granular, but it does not offer a way to get a bulk list.

However, the PhishTank data is also effective when distributed and available for use in local applications, whether “local” means your personal router, the gateway of your ISP, a corporate firewall or elsewhere. Now the data is available as a regularly updated XML data file.

Basic file details

  • Format: XML
  • Update frequency: Hourly.
    I encourage you to fetch it no more often than once an hour.
  • File size: Varies. (Edited April 11, 2007: it can be as large as 10 MB, so it may not open easily in a browser.) Right now, with 1125 verified, online phishes, the file is a bit over 600 KB.

File location

http://data.phishtank.com/data/online-valid/

Considerations

Phish sites go up and down at various times. Usually, a single phish URL doesn’t stay online for very long, so it’s important to consider not only the timestamp of the data file, but the time elapsed since both the submission of the phish to PhishTank and its verification by the PhishTank community. There are exceptions, but if a phish URL is more than a week or two old, then the host where it’s living is not paying attention. Over time, PhishTank will start to provide more data about the hosts, so you can see which hosts tend to allow this kind of activity to continue.

The file has an ETag header and a Last-Modified header. Please respect these when fetching the file. We may support gzip in the future, to further reduce bandwidth for all parties.
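As a rough illustration of respecting those headers, here is a minimal Python sketch of a conditional fetch. The function names (`conditional_headers`, `fetch_if_changed`) are hypothetical, and the sketch assumes the server answers a matching `If-None-Match` or `If-Modified-Since` request with `304 Not Modified`, which is standard HTTP behavior but is not confirmed in this post.

```python
import urllib.error
import urllib.request

FEED_URL = "http://data.phishtank.com/data/online-valid/"


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on a previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url=FEED_URL, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None if the file is unchanged.

    Assumption: the server replies 304 when the validators still match.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep using the cached copy and its validators.
            return None, etag, last_modified
        raise
```

A caller would store the returned `etag` and `last_modified` next to the cached file and pass them back on the next hourly run.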

Field definitions

To help you use the data in this file, I’ve described each of the fields below.

meta is the wrapper for information about the file itself.

generated_at is the time the file was last generated as an ISO 8601 date string. The ISO standard incorporates the timezone; PhishTank uses UTC.
Sample value: 2006-10-17T00:17:02+00:00

total_entries is the count of how many valid phish URLs are in the file at that time. This will always be a positive integer.
Sample value: 1125

entries is the overall container for all the individual phish records as a collection.

entry is the container for data about each individual phish.

url is the phish URL. The value (a URL) is presented as CDATA because phishers are not polite folks, and occasionally use non-valid characters in their URLs. Some browsers are more forgiving about the standards, and interpret (or ignore) the non-valid characters, so the URL is a phish, even though it might fail in other browsers.
Sample value: <![CDATA[http://www.firstgenericbank.account-updateinfo.com]]>
Note: This URL is an example only. The domain is owned by OpenDNS, operators of PhishTank, for demonstration purposes.

phish_id is the PhishTank ID for the phish URL. All data in PhishTank is tied to this ID. You may or may not need this piece of information, but it’s useful for us. This will always be a positive integer.
Sample value: 19845

phish_detail_url is the PhishTank detail page for the phish URL, where you can view data about the phish, including a screenshot and the community votes. More data will be added to this page over time.
Sample value: <![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]>

submission is a container for submission_time currently, and may contain additional fields in the future.

submission_time is the time the phish was submitted to PhishTank, in UTC. Same timestamp format as generated_at.
Sample value: 2006-10-17T19:21:30+00:00

verification is a container for information about a phish URL’s verification, including verified and verification_time currently. This container may have additional fields in the future.

verified indicates whether or not a suspected phish has been judged by the PhishTank community. In this data file, of all online, valid phishes, the value will always be yes.
Sample value: yes

verification_time is the time the phish was judged by the PhishTank community, in UTC. In this file, it’s the time the phish was verified as a phish. Same timestamp format as generated_at. It may be interesting to compare verification_time and submission_time.
Sample value: 2006-10-17T23:06:28+00:00
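For instance, the comparison could be sketched in Python as follows, using the two sample values above. This assumes Python 3.7 or later, where `datetime.fromisoformat` accepts the `+00:00` offset; the helper name `parse_timestamp` is only illustrative.

```python
from datetime import datetime


def parse_timestamp(value):
    """Parse a PhishTank-style ISO 8601 timestamp into an aware datetime."""
    return datetime.fromisoformat(value)


# Sample values taken from the field definitions above.
submitted = parse_timestamp("2006-10-17T19:21:30+00:00")
verified = parse_timestamp("2006-10-17T23:06:28+00:00")

# How long the community took to verify this phish.
time_to_verify = verified - submitted
print(time_to_verify)  # 3:44:58
```

The same subtraction against generated_at gives the age of an entry at the time the file was built.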

status is the container for online, currently. This container may have additional fields in the future.

online notes whether a phish URL is live and responding. In this data file, of all online, valid phishes, the value will always be yes.
Sample value: yes

Attribution and usage

This data is free. It may be used in commercial products or non-commercial products, by organizations or individuals.

If you use the data, we would appreciate public attribution for the data to PhishTank, preferably with a link to the PhishTank home page. We will soon publish a page with some guidelines about how to use the PhishTank logo (if you want to) and otherwise attribute the data to PhishTank. For now, contact us if you have anything special… kind words and a link are the general goals! ;-)

We’re curious to learn how this data gets used, so please let us know, either in the comments or via the contact form.

Example XML

<?xml version="1.0" encoding="utf-8"?>
<output>
<meta>
<generated_at>2006-10-17T18:17:01+00:00</generated_at>
<total_entries>1</total_entries>
</meta>
<entries>
<entry>
<url><![CDATA[http://www.firstgenericbank.account-updateinfo.com]]></url>
<phish_id>19845</phish_id>
<phish_detail_url><![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]></phish_detail_url>
<submission>
<submission_time>2006-10-17T03:00:18+00:00</submission_time>
</submission>
<verification>
<verified>yes</verified>
<verification_time>2006-10-17T13:13:37+00:00</verification_time>
</verification>
<status>
<online>yes</online>
</status>
</entry>
</entries>
</output>
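
One possible way to read this structure, sketched in Python with the standard library's `xml.etree.ElementTree`. The function name `parse_feed` and the dictionary layout are illustrative choices, and comparing total_entries against the number of entry elements is an assumed sanity check, not something the feed requires.

```python
import xml.etree.ElementTree as ET

# The example document from this post, with one entry.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<output>
<meta>
<generated_at>2006-10-17T18:17:01+00:00</generated_at>
<total_entries>1</total_entries>
</meta>
<entries>
<entry>
<url><![CDATA[http://www.firstgenericbank.account-updateinfo.com]]></url>
<phish_id>19845</phish_id>
<phish_detail_url><![CDATA[http://www.phishtank.com/phish_detail.php?phish_id=19845]]></phish_detail_url>
<submission>
<submission_time>2006-10-17T03:00:18+00:00</submission_time>
</submission>
<verification>
<verified>yes</verified>
<verification_time>2006-10-17T13:13:37+00:00</verification_time>
</verification>
<status>
<online>yes</online>
</status>
</entry>
</entries>
</output>
"""


def parse_feed(xml_bytes):
    """Return a list of dicts, one per entry element in the data file."""
    root = ET.fromstring(xml_bytes)
    declared = int(root.findtext("meta/total_entries"))
    entries = []
    for entry in root.iterfind("entries/entry"):
        entries.append({
            "url": entry.findtext("url"),  # CDATA is returned as plain text
            "phish_id": int(entry.findtext("phish_id")),
            "submission_time": entry.findtext("submission/submission_time"),
            "verification_time": entry.findtext("verification/verification_time"),
            "online": entry.findtext("status/online") == "yes",
        })
    # Assumed sanity check: a mismatch suggests an incomplete download.
    if declared != len(entries):
        raise ValueError(f"expected {declared} entries, found {len(entries)}")
    return entries


phishes = parse_feed(SAMPLE.encode("utf-8"))
print(phishes[0]["phish_id"], phishes[0]["url"])
```

For a file approaching 10 MB, a streaming approach such as `ET.iterparse` may be preferable to loading the whole tree at once.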

20 Responses to “XML data file of online, valid phishes from PhishTank”

  1. funchords says:

    Thank you!

    One of the reasons that I considered stopping volunteering time in the PhishTank was because I felt like the information was not being shared well enough. Perhaps my time was better spent elsewhere. You have just changed my mind for the better!

    OpenDNS and others who use the API use the data to warn off victims — this is very cool! But, the volunteers on the PIRT can use this data to notify domain/IP owners to remove the offending sites.

    I think this is great! Thanks PhishTank!

    –Robb

  2. Jeff Chan says:

    Thanks for making this available, guys! I’ve added the data to our phishing SURBL, with some appropriate munging:

    http://www.surbl.org/lists.html#ph

    Cheers,

    Jeff C.

  3. MASA says:

    I am using this data for a firefox extension that protects the user from phishpages based on data from phishtank.

    The extension is in the translation stage right now.

  4. MASA says:

    It’s out,

    http://phishtank.com/sitechecker (redirects to the extension’s homepage)

  5. Ian says:

    Perhaps you ought to make the server serve the RSS feed as static data, and kick out a 304 header if it is not modified? That should help bandwidth-wise.

  6. phishthis says:

    Thank You OpenDNS for having this data. Hopefully, as one who finds phishing and pharming as totally repulsive, this will help the yet-unwashed of the internet to learn what phishing is, and perhaps our efforts will keep them safer and spare them the time in repairing their good name(s).

    Mark
    Founder, President, and “Bottle Warsher” of the
    London Antiphishing Society,
    near Arkansas Nuclear One.

  7. Great work, the database is really amazing! Only one simple question: why isn’t the XML file compressed?

    dz0@sock3t:/tmp$ du -sh index.xml
    6.4M index.xml
    dz0@sock3t:/tmp$ bzip2 index.xml
    dz0@sock3t:/tmp$ du -sh index.xml.bz2
    292K index.xml.bz2
    dz0@sock3t:/tmp$

    The size reduction is amazing; you would save a lot of bandwidth!
    Maybe the compression algorithm should be “zip” instead of bzip2, to make it easy to decompress in all programming languages and operating systems.


    Andres Riancho

  8. John Roberts says:

    Andres, the file is transparently gzipped across the wire as long as the client requesting the file supports gzip.

  9. Steve Garvey says:

    The file is NOT sent compressed to my gzip capable browser (Firefox)

    Me: Accept-Encoding: gzip, deflate

    Your server: Content-Length: 6584660

    What if someone writes a script to retrieve it without an “Accept-Encoding” header? As stated previously, you should be serving a gzip’d file.

  10. Tom says:

    http://data.phishtank.com/data/online-valid/index.xml has been returning 0 entries since yesterday some time.

    Please advise

  11. John Nagle says:

    Something has gone very wrong with the XML file of PhishTank data at “http://data.phishtank.com/data/online-valid”. Today, it reads:


    2007-10-12T04:30:01+00:00
    0

    That’s the entire file. Valid XML, no entries. Something is very broken.

  12. John Nagle says:

    The XML file started containing useful data again on Friday, October 12th. Thanks.

    Incidentally, it would help if the file was updated as an atomic operation. Occasionally, we see a partially written file, if we happen to read it while it’s being rewritten. We have to read the file twice at 30 second intervals and compare, rereading until we get the same contents twice in a row. It would be better to write a new file on each update, then move or link it to the name of the distributed file.

  13. iluxan says:

    I really like your feed. One thing I do have a problem with, though, is the lack of some kind of “delta” or “versioned” retrieval mechanism.

    I have no safe way to know which entries were removed from the list. I toyed with the option of assuming that whatever is missing from the current file but was sent in a previous file is deleted, but since sometimes I may get a truncated or incomplete file (network issues, etc), I run the risk of accidentally deleting all the items that were sent previously.

    Is there any way to make this more explicit – some kind of versioning mechanism like the Google Safe Browsing API uses?

    Thanks.

  14. Sorin says:

    Hi,

    There are some invalid characters in the URLs.
    Have a look at phish IDs: 405396 and 360933

  15. What is the purpose of the phish_detail_url, when you only need to take “http://www.phishtank.com/phish_detail.php?phish_id=” and put on the end?
    The phish_detail_url entry adds about 100 unnecessary bytes to each entry. In the current file, with about 5752 entries, that makes the file about 562 KB (roughly 0.5 MB) bigger.

    I guess its a safeguard if Phishtank decides to change the format of the Phish detail url, but then it could be made that the current phish detail url format is specified in the metadata, only once per XML file, like:
    and then the application developer only needs to replace %1 with the phish ID that the application developer wants to link/send the user to.

    I think the “verified” and the “online” tag can be removed too, since they are constant.

  16. I wrote a nice Perl parser for PhishTank XML data. It parses out all URL data, keeping IPs intact, and lists all second-level domains that are in the PhishTank database. The whole thing is then put into the “domains” category of DansGuardian’s filter category “phishing”.

    It also reads in the “domains” at start, so if some phish site goes temporarily offline and gets removed from online-valid.xml, it will still be listed in the DansGuardian file. (This is to prevent a phish site from temporarily dropping all its connections just to get removed from PhishTank, and then reopening the phish site.)

    By second-level domains, I mean that a phishing URL like:
    http://adsl-75-11-237-21.dsl.rcsntx.sbcglobal.net/.irs/stimulus.refund/0,,id=181665,00.html

    is converted to:

    sbcglobal.net

    The reason I’m doing that is that a phisher can set up millions of sites, from aaaaaaa.host.com to zzzzzzz.host.com. He only needs to point *.host.com to his IP in his DNS, set up a *.host.com virtual host, and then send a random one to each recipient. This makes the listing at PhishTank ineffective if I don’t block the whole second-level domain. The only parts the phisher doesn’t have control over are the second-level domain (which he has to purchase) and the TLD.

    And here comes the script: http://pastebin.com/f2f4f0f27
    You can then use wget to pull down online-valid.xml from phishtank, and then run the script that I have posted. Then you need to restart DG (/etc/rc.d/dansguardian restart) to reload the blacklist. Three lines of code (wget fetching, running perl script and then restarting DG) can be done from cron.hourly or cron.daily

  17. [...] PhishTank data file we announced two days ago is already seeing [...]

  18. [...] developer, MASA has built several extensions. With SiteChecker, MASA used the PhishTank data file (details) to bring PhishTank’s judgments right into the [...]

  19. [...] WOT uses data from lots of sources, including its users. PhishTank is now part of the mix, via the downloadable data file. [...]

  20. [...] PhishTank works is because the data is freely available to all, from the free, open API to the XML data file or the lightweight [...]
