Ocean Color Forum
Topic: Frequently Asked Questions / Data Access FAQ / Bulk data downloads via HTTP (locked)
- By sean Date 2009-09-15 10:50 Edited 2013-12-10 15:35
Can I download data in bulk via HTTP?

Yes.  It is possible to mimic FTP bulk data downloads using the HTTP-based data distribution server.

CAVEATS
1) The following examples are provided for informational purposes only.
2) No product endorsement is implied. 
3) There is no guarantee that these options will work for all situations.
4) The examples below are not an exhaustive description of the possibilities.

Using command-line utilities:

wget:

1) "mget *SST4* from /MODISA/L2/2006/005

wget -q -O - http://oceandata.sci.gsfc.nasa.gov/MODISA/L2/2006/005/ |grep SST4|wget -N --wait=0.5 --random-wait --force-html -i -
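
If you would rather hand wget a plain list of URLs instead of raw HTML, the same listing can be filtered down to just the getfile links first.  This is only a sketch - it assumes the directory listing wraps each link in double quotes, as the curl examples below do:

wget -q -O - http://oceandata.sci.gsfc.nasa.gov/MODISA/L2/2006/005/ | grep SST4 | grep getfile | cut -d '"' -f 2 | wget -N --wait=0.5 --random-wait -i -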

2) Use the file search utility to find and download OCTS daily L3 binned data from November 1, 1996 through December 31, 1996

wget -q --post-data="sensor=octs&sdate=1996-11-01&edate=1997-01-01&dtype=L3b&addurl=1&results_as_file=1&search=*DAY*" -O - http://oceandata.sci.gsfc.nasa.gov/search/file_search.cgi |wget -i -

file_search.cgi options:
    sensor : mission name.  valid options include: aquarius, seawifs, aqua, terra, meris, octs, czcs, hico, viirs
    sdate : start date for a search
    edate : end date for a search
    dtype : data type (i.e. level). valid options: L0, L1, L2, L3b (for binned data), L3m (for mapped data), MET (for ancillary data), misc (for sundry products)
    addurl : include full url in search result (boolean, 1=yes, 0=no)
    results_as_file : return results as a text file listing (boolean, 1=yes, 0=no; 0 returns an HTML page)
    search : text string search
    subID: non-extracted subscription ID to search
    std_only : restrict results to standard products (i.e. ignore extracts, regional processings, etc.; boolean)
    cksum: return a checksum file for search results (boolean; sha1sums except for Aquarius soil moisture products which are md5sums; forces results_as_file; ignores addurl)
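
As an illustration of combining several of the options above (a sketch, not verified against the server - adjust the sensor, dates, and output filename to your needs), the following asks for standard SeaWiFS mapped products for March 1998 and saves the result as a checksum file:

wget -q --post-data="sensor=seawifs&sdate=1998-03-01&edate=1998-04-01&dtype=L3m&std_only=1&cksum=1" -O seawifs_L3m.cksums http://oceandata.sci.gsfc.nasa.gov/search/file_search.cgi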

3) Grab MERIS L1B data (which requires a username and password)

wget --user=username --password=passwd http://oceandata.sci.gsfc.nasa.gov/echo/getfile/MER_RR__1PRLRA20120330_112205_000026183113_00138_52738_8486.N1.bz2
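
If you would rather not put the password on the command line (where it shows up in your shell history and process listing), wget can usually pick the credentials up from a ~/.netrc file instead.  A sketch - substitute your own username and password, and make the file readable only by you (chmod 600 ~/.netrc).  Put a line like this in ~/.netrc:

machine oceandata.sci.gsfc.nasa.gov login username password passwd

then the download needs no --user/--password options:

wget http://oceandata.sci.gsfc.nasa.gov/echo/getfile/MER_RR__1PRLRA20120330_112205_000026183113_00138_52738_8486.N1.bz2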

4) Retrieve recent files for a non-extracted subscription and check them against the sha1sums:

wget --post-data="subID=###&cksum=1" -q -O - http://oceandata.sci.gsfc.nasa.gov/search/file_search.cgi > search.cksums && awk '{print $2}' search.cksums | wget -N --base="http://oceandata.sci.gsfc.nasa.gov/cgi/getfile/" -i - && sha1sum -c search.cksums


Useful wget options:
    --timeout=10 : sets timeout to 10 seconds (by default wget will retry after timeout)
    --wait=0.5 : tells wget to pause for 0.5 seconds between attempts
    --random-wait : causes the time between requests to vary between 0.5 and 1.5 times the value given with --wait
    -N, --timestamping : prevents wget from downloading files already retrieved if a local copy exists and the remote copy is not newer
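
Putting several of these options together, a (hypothetical) urls.txt containing one URL per line could be fetched politely with:

wget -N --timeout=10 --wait=0.5 --random-wait -i urls.txt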

cURL:

Unlike wget, cURL has no built-in way to download a list of URLs (although it can download multiple URLs given on the command line).
However, a shell or scripting-language (Perl, Python, etc.) loop is easy to write (the examples below use a bash for loop):

1) Grab MODIS L2 files for 2006 day 005 (Jan 5, 2006)

for file in $(curl http://oceandata.sci.gsfc.nasa.gov/MODISA/L2/2006/005/ | grep getfile | cut -d '"' -f 2);
do
  curl -L -O $file;
done;
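
If you want pacing similar to wget's --wait between downloads (our servers limit concurrent connections, as noted in the browser section below), one simple approach is a sleep inside the loop; a minimal sketch:

for file in $(curl http://oceandata.sci.gsfc.nasa.gov/MODISA/L2/2006/005/ | grep getfile | cut -d '"' -f 2);
do
  curl -L -O $file;
  sleep 1;
done;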


2) Use the file search utility to find and download OCTS daily L3 binned data from November 1, 1996 through December 31, 1996

for file in $(curl -d "sensor=octs&sdate=1996-11-01&edate=1997-01-01&dtype=L3b&addurl=1&results_as_file=1&search=*DAY*" http://oceandata.sci.gsfc.nasa.gov/search/file_search.cgi |grep getfile);
do 
  curl -L -O $file;
done;


3) Grab MERIS L1B data (which requires a username and password)

curl -u username:passwd -L -O  http://oceandata.sci.gsfc.nasa.gov/echo/getfile/MER_RR__1PRLRA20120330_112205_000026183113_00138_52738_8486.N1.bz2
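
As with wget, the credentials can be kept out of the command line by putting them in ~/.netrc (see the sketch in the wget section above) and using curl's -n (--netrc) option:

curl -n -L -O http://oceandata.sci.gsfc.nasa.gov/echo/getfile/MER_RR__1PRLRA20120330_112205_000026183113_00138_52738_8486.N1.bz2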

Useful curl options:
    --retry 10 : sets the number of retries to 10 (by default curl does not retry)
    --max-time 10 : limits the total time allowed for the transfer to 10 seconds
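
For example, the download line inside any of the loops above could become (a sketch):

  curl --retry 10 -L -O $file;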

Web Browser options:

Firefox add-on 'DownThemAll'

If you prefer a GUI-based option, there is an add-on for the Firefox web browser called 'DownThemAll'.  It is easy to configure to download only
what you want from a page (it even has a default filter for archived products - gz, tar, bz2, etc.).  It also allows limiting the number of concurrent
downloads, which is important for downloading from our servers, as we limit connections to one connection per file and 3 concurrent files per IP - so
don't use the "accelerate" features, as your IP may get blocked.
Recommended preference settings:
    1) Set the concurrent downloads to 1.
    2) There is an option under the 'Advanced' tab called 'Multipart download'.  Set the 'Max. number of segments per download' to 1.
    3) Since this download manager does not efficiently close connection states, you may find that file downloads time out.  You may want to
set the Auto Retry setting to retry every 1 minute with Max. Retries set to 10.

Another alternative - one that works with more than just Firefox, but isn't free - is "Internet Download Manager".

Like 'DownThemAll', it has features to grab all the links on a page, as well as to limit the number of concurrent downloads.  It also advertises download
acceleration - do NOT use this feature with our servers, as your IP may get blocked.