#789 File upload support for web pod

tactics Fri 16 Oct 2009

I'm designing a web site and I want to use Fan to implement it. Users need to be able to upload files from their hard drive through good ol' multipart/form-data. Currently, the only way to do this is to grab the file data straight from web::WebReq.in. Is it possible to add some support for this? It'd be nice to have that data cleanly wrapped up somewhere so I could access the uploads through a File[] field, sort of like how $_FILES works in PHP.

brian Fri 16 Oct 2009

Promoted to ticket #789 and assigned to brian

tactics Mon 19 Oct 2009

I want to get this functionality as soon as possible for my project. I don't mind throwing it together, but I need some input on how it might work.

Handling the weblet's input stream

When the content type is multipart/form-data, the input stream must be parsed for the user's data. What guarantees are asserted by this input stream? It looks like the existing code for handling application/x-www-form-urlencoded clobbers the stream. Does that mean that I can just consume the entire stream into a Buf in order to break it up into its various parts? If I do that, should I leave it open? Or close it?

MultipartFormData class

I will create a MultipartFormData class to represent each "part" of the multipart data. (Any ideas for a better name?)

const class MultipartFormData
{
  const Str:Str headers
  const Buf content
  const MultipartFormData[]? parts
}
  • The headers field holds the headers specific to that part.
  • The data content for each part is stored in content.
  • If the content type is multipart/mixed, the parts field is a list of all "subparts". (This occurs when multiple files are uploaded in a single entry form).

Changes to the WebReq class

I will be adding a new field to the WebReq class called parts, typed as const MultipartFormData[]?. When the content type of the web request is multipart/form-data, it is generated from the input stream.

All non-file parts are loaded into the form field appropriately. If no non-file parts are sent, it still sets form to an empty map (so overall, if the content type is either application/x-www-form-urlencoded or multipart/form-data, you can be guaranteed form is non-null).

All file parts are stored on the file system into a temporary directory. For each of these, a File object is created. I'm thinking we need a special subclass of File to encapsulate the metadata (content-type, local filename) provided by the client. These files are deleted on exit.

The WebReq class will also receive a few new methods:

  • Str[] listFiles(). Returns the list of form names associated with files.
  • File[] getFiles(Str name). Returns the list of Files uploaded with the given name.
  • File getFile(Str name). An alias for getFiles(name).first.

Upload limit per IP

Each server implementation should provide a way to limit the amount of data that can be uploaded. For example, WispService might take optional arguments to set the maximum size of any file upload and a total limit for each IP.

Any thoughts? Comments?

brian Tue 20 Oct 2009

I haven't given this a lot of thought yet, but a couple of things I would note:

  • Check out the email APIs; they have existing classes for MultiPart and for encoding them (I haven't gotten around to decoding)
  • As a general rule, every API in Fan always uses stream IO at the lowest level so that you can efficiently handle huge data streams without requiring memory (or file buffers)

I don't know exactly how the PHP routine you mentioned works, but I seem to recall that it actually stores files on the local disk temporarily (which is something I sort of consider an anathema).

But in terms of this problem, I think the trick will be how to elegantly design an API which can pipe data from the multi-part files to their destinations without creating temporary buffers or files.

tactics Mon 23 Nov 2009

I'm finally at the point in my project where I need file upload support to continue.

Like I said, I don't mind grinding through the HTTP spec and throwing together a solution. It would be best if I could write something that both serves my own needs and that can be easily assimilated into the core Fantom libraries. That said, I need some input on design.

Looking back at the design I came up with last month, I can see a few points where it does not fit the feel of the rest of the web API. Saving the files to disk, hardened file upload limits by IP, even storing the uploaded data in a special Multipart class are too high level for the web pod, it seems.

There is no way to randomly access parts of a multipart/form-data transfer without creating temporary buffers. As I see it, that leaves two options with regards to reading the uploaded data:

  • Read each part of the data one at a time.
  • Read in all parts upon request.

For the first option, you would have a nextPart() method on the WebReq class that would parse the next "part" of the input stream, parsing out the headers and rolling up the data in a buffer.

In most cases, if you're going to have people uploading files, you want all the parts anyway. The second option, then, would be to have a parseParts() method on WebReq. It would essentially iterate over all the parts using nextPart() and return the result as a List or a Map (keyed by the HTML name attribute) for easy consumption.

There is still a lot of interpretation to how the API would look. Do we want these methods built into WebReq itself? Do we want to delegate responsibility of parsing the input stream to a helper class? Maybe we would have a WebMultipart which does the parsing and stores the resulting data instead. Do we still populate the form attribute on WebReq as nextPart() is called?

Speaking of WebReq#form, I have a tangentially related question. What are the official semantics of form data when multiple controls share the same name? Is a Map sufficient to capture this? It feels like it should be a multimap or a Str:Str[] instead of a Str:Str. Or better yet, that it should be handled by a special container class that is aware of its "usually-single-valued, but sometimes-multi-valued, usually-string-valued, but sometimes-file-buffer-valued-with-header-data" nature.
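To make the multi-valued idea concrete, here is a minimal sketch of such a container. It is purely illustrative: the class name MultiForm and its methods are hypothetical and not part of the web pod, and it only covers the string-valued case.

```fantom
// Hypothetical container, not part of the web pod: every control
// name maps to the list of values submitted under that name.
class MultiForm
{
  Str:Str[] vals := [:]

  // Record one more value for the given control name.
  Void add(Str name, Str val)
  {
    list := vals[name]
    if (list == null)
    {
      list = Str[,]
      vals[name] = list
    }
    list.add(val)
  }

  // First value for the name, or null if the control was not sent.
  Str? first(Str name)
  {
    return vals[name]?.first
  }

  // All values for the name, or an empty list.
  Str[] all(Str name)
  {
    return vals[name] ?: Str[,]
  }

  static Void main()
  {
    form := MultiForm()
    form.add("tag", "red")
    form.add("tag", "blue")
    echo(form.first("tag"))
    echo(form.all("tag"))
  }
}
```

Code that only cares about the usual single-valued case would call first(), while multi-select controls and repeated names remain reachable through all().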

tcolar Mon 23 Nov 2009

Several years back I made my own Multipart request parser (Java).

Basically implementing RFC 2388. That RFC covers a lot and I only bothered with the file upload part.

What I had done is added a request.parseMultiPartContent() to request.

By default, multipart parsing is not done unless the user asks for it, because it's better for performance to do it only when needed (that also made it possible to disable it for performance/security reasons).

The parse method would parse all parts; I never had any use for parsing only specific parts.

I would store the content in temporary files (with a max upload size configured).

I don't know that the code is all that great, but if that helps, here is a page and link to the code I had done:

http://www.javaontracks.net/file_uploading_multipart_form_api

http://www.colar.net/jotdoc/javaontracks/index.html?page=net/jot/web/multipart/package-summary.html

tcolar Mon 23 Nov 2009

Also wanted to add that I'm not sure you "can pipe data from the multi-part files to their destinations without creating temporary buffers".

Part of the issue is that all the parts are encoded "together" in the browser stream, and oftentimes you need some of the other parameters, the uploaded file name, etc. just to make a decision on when to store the file.

You don't know which order the fields are coming in, so you have to read the stream to find a specific one, and you can't "backtrack" the browser stream... so I don't think you have a choice but to buffer all the parts as they come.

BTW: One annoyance is that you don't always know how much data you will get either; the Content-Length often seems to be inaccurate/wrong when using multipart.

I was a little leery about using temp files (security reasons)... but at the same time a file upload can be large, so I feel that's better than trying to do it all in memory.

brian Mon 23 Nov 2009

I definitely want the lowest level of the API to be stream based. That will give developers control for reading into memory, to a file, or skipping. Andy and I have had a few conversations about it, and will post some ideas.

Parsing multi-part forms will also be applicable for email - so not sure where something like this belongs (web, email, or some new mime pod).

tactics Mon 23 Nov 2009

I definitely want the lowest level of the API to be stream based.

I have a few ideas of what you mean by "stream based", but could you elaborate? It sounds like you want it so that for each file part, you can specify an OutStream to pipe the file to. That seems to make more sense than putting it into a Buf.

Parsing multi-part forms will also be applicable for email - so not sure where something like this belongs (web, email, or some new mime pod).

Good point. I'll check around and see how other languages package this.

brian Mon 23 Nov 2009

I have a few ideas of what you mean by "stream based", but could you elaborate?

The goal would be to provide an API which provided access to the file content as an InStream as it was being parsed. Similar to how web::WebUtil provides InStream wrappers on top of chunked HTTP streams.

katox Thu 7 Jan 2010

I've done some basic search on this topic. It seems that

  • There are several reasonable stream based APIs, for instance Jersey API - see an example on decoding/encoding.
  • All implementations I've found (regardless of the programming language) use temporary or assignable files (though mostly hidden unlike PHP explicit temporary files).

Some reasoning can be found in the Warning section of the Perl implementation...

I'd suggest using a Jersey-like API, which could be backed by Jersey at first and rewritten in Fantom later...

kaushik Fri 15 Jan 2010

tactics, do you have example code for pulling files directly out of WebReq.in? I need this for one of my projects, and I don't mind putting in a hack until the actual fix arrives.

tactics Fri 15 Jan 2010

I had a partial solution. But it was having issues with binary files, and then my computer had a hardware failure, and I haven't had the time to work on it.

I might be able to dig it up if you want it.

brian Fri 22 Jan 2010

Ticket resolved in 1.0.49

I've added a new method to WebUtil which implements this feature:

**
** Parse a multipart/form-data input stream.  For each part in the
** stream call the given callback function with the part's headers
** and an input stream used to read the part's body.  Each callback
** must completely drain the input stream to prepare for the next
** part.
**
static Void parseMultiPart(InStream in, Str boundary, |Str:Str headers, InStream in| cb)

This method makes it easy to work with the multi-part data efficiently off the socket input stream.

I've also added an upload.fan script to the web demo which illustrates how to work with this API.
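For readers without the demo at hand, the following is a rough sketch of how parseMultiPart might be called from a weblet. It is not the upload.fan script itself. The class name UploadWeblet is hypothetical, and the rule "a part with its own Content-Type header is a file" is an assumption made to keep the example short.

```fantom
using web

// Hypothetical weblet: file parts are piped to temp files, other
// parts are read as plain strings.
class UploadWeblet : Weblet
{
  override Void onPost()
  {
    // Content-Type looks like: multipart/form-data; boundary=xyz
    mime := MimeType(req.headers["Content-Type"])
    boundary := mime.params["boundary"] ?: throw IOErr("Missing boundary param: $mime")

    res.headers["Content-Type"] = "text/plain"
    out := res.out

    WebUtil.parseMultiPart(req.in, boundary) |Str:Str headers, InStream in|
    {
      dis := headers["Content-Disposition"]
      if (headers.containsKey("Content-Type"))
      {
        // assumption: only file parts carry their own Content-Type
        file := File.createTemp("upload", ".tmp")
        fout := file.out
        try
        {
          // drain the part straight to disk; false leaves the part stream open
          in.pipe(fout, null, false)
        }
        finally
        {
          fout.close
        }
        size := file.size
        out.printLine("$dis -> $size bytes on disk")
      }
      else
      {
        // small text field: read the whole part as a string
        val := in.readAllStr
        out.printLine("$dis = $val")
      }
    }
  }
}
```

Both branches read the part's stream to its end, which is what the doc comment above requires before the next part can be parsed. A callback that wants to skip a part still has to drain it, for example by piping it to a discarded buffer.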

kaushik Sat 23 Jan 2010

Great! thanks
