いももちのきもち

Technical notes from a novice programmer

About the WormBase FTP site

In an earlier post I showed how to fetch WormBase information through the RESTful API:
Fetching the information you want with the WormBase RESTful API - toricor’s memo

To get bulk data as files, such as the complete data for each WormBase release or gene lists for species other than C. elegans, use this site instead:
ftp://ftp.wormbase.org/pub/wormbase/

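The site hosts tar.gz archives and similar files, so decompress and parse them as needed.
I don't fully understand the whole layout myself, so please read the README file in each directory carefully before using the data ^^;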
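As a rough illustration of the "decompress and parse" step, here is a minimal Python sketch. It assumes the downloaded file is a gzip-compressed, tab-separated GFF annotation file (the README below mentions GFF2/GFF3 annotations). The function name `iter_gff_records` and the sample record are made up for this example; a real tar.gz archive would first need to be unpacked, for instance with the standard `tarfile` module.

```python
import gzip
import io


def iter_gff_records(gz_bytes):
    """Yield one list of tab-separated columns per feature line of gzipped GFF text."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and '#' comment/directive lines.
            if not line or line.startswith("#"):
                continue
            yield line.split("\t")


# Small in-memory sample standing in for a downloaded file (made-up values).
sample = gzip.compress(
    b"##gff-version 3\n"
    b"I\tWormBase\tgene\t100\t200\t.\t+\t.\tID=Gene:hypothetical1\n"
)
records = list(iter_gff_records(sample))
print(records[0][2])  # -> gene
```

For a file on disk, the same loop works with `gzip.open(path, "rt")` directly, so the whole file never has to be held in memory.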

README:
ftp://ftp.wormbase.org/pub/wormbase/README

Site Contents
--------------------------------------

species/
   Core files and annotations for all species available
   at WormBase (or of possible interest to WormBase users)
   organized by species. Files previously available in genomes/
   can be found here.  File names, paths, and contents are 
   standardized and computable. Please see species/README for
   details.

      Look here for the most current and archival versions of:
        - genomic fasta sequence
        - genomic annotations in GFF2 or GFF3
        - assembly versions
        - commonly requested data sets by species
     
releases/
   Core files for each WormBase release organized by WS release ID.

      Check here if you are interested in downloading all the files
      that comprise the current WormBase release, or any other
      older releases.

datasets-published/
   Published datasets submitted to WormBase for distribution.

datasets-wormbase/
   WormBase-generated datasets and data dumps. Includes non-species
   specific, cross-species, and general WormBase information. See
   "/pub/wormbase/species/*/annotations" for species-specific datasets.

software/
   The software that drives WormBase, related libraries, and installation
   documents.


Computable Filenames/Paths/Contents
--------------------------------------

Doing large scale analyses across a large number of species? 
Filenames and their locations are easily computable, and you
won't be left scratching your head trying to figure out what
all the "genome.seq" files are in your Downloads/ folder.

Each filename has 
     - the g_species of the source (if appropriate; eg c_elegans)
     - the WS version (eg WS225)
     - a brief content description (eg genomic_masked)
     - the filetype as a suffix (eg .fa or .gff3)

This structure makes processing en masse all the species hosted at
WormBase easy.

Fetching the most current version
--------------------------------------
We use extensive symlinking to make it easy to fetch the most 
current version of a file. For example:

The most current production release is always available at:

    ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release

The most current development release is always available at:

    ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release

In any directory, a symlink of the form:

 G_SPECIES.canonical_bioproject.current.FILETYPE.FILETYPE.COMPRESSION 
   eg c_elegans.canonical_bioproject.current.annotations.gff2.gz

will lead to the most current version of the file.

--

Need help? Contact:
Todd Harris (todd@wormbase.org)
25 May 2011
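
Since the README says the file names and paths are computable, here is a small Python sketch of how they might be assembled. The helper names (`release_url`, `versioned_filename`, `current_symlink_name`) are hypothetical. Joining the components with dots in the order the README lists them is my assumption, so real file names on the server may contain extra parts (a compression suffix, for example). Only the symlink example is taken directly from the README.

```python
BASE_URL = "ftp://ftp.wormbase.org/pub/wormbase"


def release_url(kind="production"):
    """Return the URL of the current production or development release."""
    if kind not in ("production", "development"):
        raise ValueError("kind must be 'production' or 'development'")
    return f"{BASE_URL}/releases/current-{kind}-release"


def versioned_filename(g_species, ws_version, description, filetype):
    """Join the four components listed in the README.

    Assumption: dot-separated, in the listed order.
    """
    return ".".join([g_species, ws_version, description, filetype])


def current_symlink_name(g_species, description, filetype, compression):
    """Build a 'current' symlink name following the README's example."""
    return ".".join(
        [g_species, "canonical_bioproject", "current",
         description, filetype, compression]
    )


print(release_url())
# -> ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release
print(current_symlink_name("c_elegans", "annotations", "gff2", "gz"))
# -> c_elegans.canonical_bioproject.current.annotations.gff2.gz
```

Looping these helpers over a list of species names gives one path per species, which is the kind of bulk processing the README describes. Check the results against the actual directory listing before relying on them.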