Download a Google Drive file from Matlab
Here is a story for today. We have some public files stored on Google Drive that we would like to download automatically. The files I was interested in are relatively big multi-frame TIFF files (231 MB each).
I assume that you have the file IDs. If you do not know what a file ID is, please refer to one of the numerous online pages on the subject, for example on Lifehacker.
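As a small aside, if all you have is a sharing link rather than a bare ID, the ID can usually be pulled out of the link itself. The sketch below assumes the two common link shapes, .../file/d/<ID>/... and ...?id=<ID>. The sharing link, the variable names and the regular expression are illustrative assumptions, not part of any Google or Matlab API.

```matlab
% Hypothetical sharing link, built here from the file ID used later in this post.
shareUrl = 'https://drive.google.com/file/d/0B649boZqpYG1OEZnV21ncDVNcVk/view?usp=sharing';

% Capture the ID that follows either "/d/" or "id=" (assumed link formats).
tok = regexp(shareUrl, '(?:/d/|[?&]id=)([A-Za-z0-9_-]+)', 'tokens', 'once');
if isempty(tok)
    error('gdrive:noFileId', 'Could not find a file ID in the link.');
end
fileId = tok{1};

% Build the plain download link from the ID.
fileUrl = sprintf('https://drive.google.com/uc?export=download&id=%s', fileId);
disp(fileUrl)
```

If the link has a different shape, the pattern simply does not match and the snippet stops with an error instead of guessing.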
Having an ID, one can easily construct a download link. The problem comes with big files. If you try to download a big file, Google redirects you to a special page informing you that it cannot scan the file for viruses. There is a download link on that page. It wouldn't be Google if you could just parse the response, get the new link and use it directly. No! Everything is very dynamic: the new link contains a confirmation code, and it only works together with the proper cookie. If you are on a Linux machine, there is a script for you on StackOverflow. We, however, are on whatever OS, using Matlab.
Alright, Matlab has several ways to get web content. The webread function is a very convenient high-level option, but you won't be able to use it here because you need to preserve cookies between requests. Matlab provides an example function that sends arbitrary HTTP requests with cookie support. Here is the corrected version of that function for reference:
function [response, retInfos, history] = sendRequest(uri, request)
    % uri: matlab.net.URI
    % request: matlab.net.http.RequestMessage
    % response: matlab.net.http.ResponseMessage
    % retInfos: the containers.Map of per-host info structs described below
    % history: vector of matlab.net.http.LogRecord for this transaction

    % matlab.net.http.HTTPOptions persists across requests to reuse previous
    % Credentials in it for subsequent authentications
    persistent options

    % infos is a containers.Map object where:
    %   key is uri.Host;
    %   value is "info" struct containing:
    %     cookies: vector of matlab.net.http.Cookie or empty
    %     uri: target matlab.net.URI if redirect, or empty
    persistent infos

    if isempty(options)
        options = matlab.net.http.HTTPOptions('ConnectTimeout',20);
    end
    if isempty(infos)
        infos = containers.Map;
    end
    host = string(uri.Host); % get Host from URI
    try
        % get info struct for host in map
        info = infos(char(host));
        if ~isempty(info.uri)
            % If it has a uri field, it means a redirect previously
            % took place, so replace requested URI with redirect URI.
            uri = info.uri;
        end
        if ~isempty(info.cookies)
            % If it has cookies, it means we previously received cookies from this host.
            % Add Cookie header field containing all of them.
            request = request.addFields(matlab.net.http.field.CookieField(info.cookies));
        end
    catch
        % no previous redirect or cookies for this host
        info = [];
    end

    % Send request and get response and history of transaction.
    [response, ~, history] = request.send(uri, options);
    if response.StatusCode ~= matlab.net.http.StatusCode.OK
        % Assign the second output before the early return so that callers
        % asking for it do not get an "output not assigned" error.
        retInfos = infos;
        return
    end

    % Get the Set-Cookie header fields from response message in
    % each history record and save them in the map.
    arrayfun(@addCookies, history)

    % If the last URI in the history is different from the URI sent in the original
    % request, then this was a redirect. Save the new target URI in the host info struct.
    targetURI = history(end).URI;
    if ~isequal(targetURI, uri)
        if isempty(info)
            % no previous info for this host in map, create new one
            infos(char(host)) = struct('cookies',[],'uri',targetURI);
        else
            % change URI in info for this host and put it back in map
            info.uri = targetURI;
            infos(char(host)) = info;
        end
    end
    retInfos = infos;

    function addCookies(record)
        % Add cookies in Response message in history record
        % to the map entry for the host to which the request was directed.
        %
        ahost = record.URI.Host; % the host the request was sent to
        cookieFields = record.Response.getFields('Set-Cookie');
        if isempty(cookieFields)
            return
        end
        cookieData = cookieFields.convert(); % get array of Set-Cookie structs
        cookies = [cookieData.Cookie]; % get array of Cookies from all structs
        try
            % If info for this host was already in the map, add its cookies to it.
            % The map uses char keys, so convert the host the same way as on write.
            ainfo = infos(char(ahost));
            ainfo.cookies = [ainfo.cookies cookies];
            infos(char(ahost)) = ainfo;
        catch
            % Not yet in map, so add new info struct.
            infos(char(ahost)) = struct('cookies',cookies,'uri',[]);
        end
    end
end
Note that I also return some additional variables from the function: retInfos is used to get the confirmation code from a cookie, and history is used to obtain a direct link to the file.
Why do we need an additional direct link after we have got the confirmation code? Because we want to download multi-frame TIFF files. Matlab tries to be smart: it decodes the response as an image and returns only a single frame by default.
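To see why a single decoded frame is not enough, here is a small self-contained sketch. It assumes that the automatic image conversion behaves like a default imread call, and it uses a tiny two-frame TIFF written to a temporary file as stand-in data (the file and variable names are made up for the illustration).

```matlab
% Build a small two-frame TIFF in a temporary file (stand-in data).
tmpFile = [tempname '.tif'];
frame1 = zeros(4, 4, 'uint8');
frame2 = 255 * ones(4, 4, 'uint8');
imwrite(frame1, tmpFile);
imwrite(frame2, tmpFile, 'WriteMode', 'append');

% imfinfo returns one struct per frame stored in the file.
info = imfinfo(tmpFile);
nFrames = numel(info);

% A default imread call returns only the first frame;
% other frames have to be requested by index.
first = imread(tmpFile);
second = imread(tmpFile, 2);

fprintf('Frames in file: %d, size of default read: %s\n', ...
    nFrames, mat2str(size(first)));
delete(tmpFile);
```

The same imfinfo call is a handy way to check that a downloaded file really contains all of its frames.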
Let's look at the code that downloads the file and saves it to disk:
fileName = 'file_00002_00002.tif';
fileId = '0B649boZqpYG1OEZnV21ncDVNcVk';
fileUrl = sprintf('https://drive.google.com/uc?export=download&id=%s', fileId);
request = matlab.net.http.RequestMessage();
% The first request is redirected to the information page about virus scanning.
% We can get the confirmation code from the associated cookie.
[~, infos] = sendRequest(matlab.net.URI(fileUrl), request);
confirmCode = '';
for j = 1:length(infos('drive.google.com').cookies)
    if ~isempty(strfind(infos('drive.google.com').cookies(j).Name, 'download'))
        confirmCode = infos('drive.google.com').cookies(j).Value;
        break;
    end
end
newUrl = strcat(fileUrl, sprintf('&confirm=%s', confirmCode));
% We now need to send another request to get the file.
% However, Matlab doesn't download the whole Tiff file, but only one frame.
[~, ~, history] = sendRequest(matlab.net.URI(newUrl), request);
% Thus we must use the history to find the
% direct link and download it as a raw file
ind = arrayfun(@(x) ~isempty(strfind(x.URI.Host, 'googleusercontent')), history);
ind = find(ind, 1);
% we need the raw type in order to download the whole file and not just a single frame
options = weboptions('ContentType', 'raw');
imgData = webread(history(ind).URI.EncodedURI, options);
fid = fopen(fileName, 'wb');
fwrite(fid, imgData);
fclose(fid);
Finally, we have the whole file saved at the location given by fileName. Note that there are no error checks in the code!
Here is a Gist with the same code.
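As for the missing error checks, a minimal sketch of what they could look like is shown below. It is not part of the original script: the stand-in values at the top only make the snippet run on its own, and the error identifiers are hypothetical names chosen for the illustration.

```matlab
% Stand-in values so the sketch runs on its own; in the real script these
% come from the two sendRequest calls and from webread.
confirmCode = 'abcd';            % hypothetical confirmation code
ind = 1;                         % hypothetical index into history
imgData = uint8(0:255)';         % hypothetical raw bytes
fileName = [tempname '.tif'];

% No cookie with "download" in its name was found.
if isempty(confirmCode)
    error('gdrive:noConfirmCode', 'No confirmation cookie was received.');
end
% No redirect to a googleusercontent host was found in the history.
if isempty(ind)
    error('gdrive:noDirectLink', 'No direct link was found in the history.');
end

% fopen returns -1 when the file cannot be opened.
fid = fopen(fileName, 'w');
if fid == -1
    error('gdrive:cannotOpen', 'Cannot open %s for writing.', fileName);
end
closer = onCleanup(@() fclose(fid));   % close the file even if fwrite fails

% fwrite returns the number of elements actually written.
nWritten = fwrite(fid, imgData);
if nWritten ~= numel(imgData)
    error('gdrive:shortWrite', 'Wrote %d of %d bytes.', nWritten, numel(imgData));
end
clear closer                           % destroys the onCleanup object, which closes the file
```

Whether an empty confirmation code should be a hard error is a judgment call; I treat it as one here only to keep the sketch short.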