Friday, October 19, 2012

Where the internet lives

Google releases photos of its data centers. Data centers homepage

Friday, October 5, 2012

"using" keyword in C#, and large file processing

1. Using

The "using" keyword in C# is used either to import a namespace (the using directive) or to ensure that a local variable is disposed as soon as its enclosing block ends (the using statement).

To design a class that can be used in a using(...) clause, the class needs to implement the IDisposable interface. This mostly means implementing the Dispose() and Dispose(bool) methods, and releasing resources in the Dispose(bool) method. See http://msdn.microsoft.com/en-us/library/system.idisposable.aspx.
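
The code samples later in this post are in Java, so here is a rough sketch of the same idea in Java rather than C#. Java's try-with-resources and AutoCloseable play approximately the role of using(...) and IDisposable; this is an analogy, not the C# Dispose pattern itself. The class name TempResource and its log are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical resource; the name and the log exist only for this sketch.
class TempResource implements AutoCloseable {
    static final List<String> log = new ArrayList<>();
    private final String name;
    private boolean closed = false;

    TempResource(String name) {
        this.name = name;
        log.add("open " + name);
    }

    void use() {
        if (closed) throw new IllegalStateException(name + " is already closed");
        log.add("use " + name);
    }

    // Called automatically at the end of the try block, much like Dispose() in C#.
    @Override
    public void close() {
        if (!closed) {          // make close() safe to call more than once
            closed = true;
            log.add("close " + name);
        }
    }

    public static void main(String[] args) {
        try (TempResource r = new TempResource("a")) {
            r.use();
        } // r.close() runs here, even if use() throws
        System.out.println(log); // prints [open a, use a, close a]
    }
}
```

The point in both languages is the same: the cleanup call is tied to the block, so it cannot be forgotten on an early return or an exception.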

2. Processing large data files

Processing a large data file may run out of memory if everything is done in memory; XmlSerializer, for example, may do this. The solution is to process the file chunk by chunk (e.g., line by line, or block by block if there is no line separator).

For example, processing a 13GB file entirely in memory exhausted almost 16GB of memory, caused the machine to hang for 30 minutes, and then failed. With line-by-line processing, it took 15 minutes and succeeded. Of course, with line-by-line processing the output should be buffered to avoid too many IO operations, which can also be slow.
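
A minimal Java sketch of line-by-line processing with buffered output might look like the following. The class name LineByLine, the file names, and the trim() call (a stand-in for the real per-line work) are assumptions for illustration, not the code used for the 13GB file.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class LineByLine {
    // Copies 'in' to 'out' one line at a time and returns the line count.
    // Only the current line is held in memory.
    static long process(Path in, Path out) throws IOException {
        long lines = 0;
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line.trim());   // placeholder for the real per-line work
                writer.newLine();            // goes to the buffer, not straight to disk
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default file names; pass real paths as arguments.
        Path in = Paths.get(args.length > 0 ? args[0] : "input.txt");
        Path out = Paths.get(args.length > 1 ? args[1] : "output.txt");
        System.out.println(process(in, out) + " lines processed");
    }
}
```

Memory use here depends on the longest line, not on the file size, so this only helps when the file actually contains line separators.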

Another example: when reading a large file in C/C++, reading by line is much faster than reading by char. For a binary file, however, you will not be able to read by line.

So processing large files requires careful handling of memory and IO.
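
When a file has no line separators, the block-by-block approach can be sketched roughly as below. The class name BlockByBlock, the 64KB block size, and the task itself (counting '<' bytes, as one might in a single-line XML file) are assumptions chosen only to keep the example small.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class BlockByBlock {
    // Counts occurrences of 'target' by scanning the stream in fixed-size blocks.
    // Memory use stays at one block regardless of the file size.
    static long countByte(InputStream in, byte target) throws IOException {
        byte[] block = new byte[64 * 1024];   // assumed block size
        long count = 0;
        int read;
        while ((read = in.read(block)) != -1) {
            for (int i = 0; i < read; i++) {  // only the first 'read' bytes are valid
                if (block[i] == target) count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // "filename" is a placeholder; pass a real path as the first argument.
        String name = args.length > 0 ? args[0] : "filename";
        try (InputStream in = Files.newInputStream(Paths.get(name))) {
            System.out.println(countByte(in, (byte) '<') + " '<' bytes found");
        }
    }
}
```

Anything that spans a block boundary (a multi-byte character, a tag, a record) needs extra care, which a single-byte count like this one conveniently avoids.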

Also, a large file such as the 13GB file, which does not contain any newline character (so reading by line does not work), can't be opened by any common editor on Windows, including Notepad, WordPad or Visual Studio .NET; it also can't be opened on Linux by vi. When vi is used to open it, it waits, seemingly forever. A Google search shows that vi has difficulty opening a file with more than 9070000 characters or a file of size 2GB. For opening large files under 2GB, vi is faster with the swap file, syntax parsing and undo history disabled; see "How to open big size file using vi editor" or "Faster loading of large files".

Opening the file with Perl also waits forever, although Perl should work too if the file is read as a byte stream. Reading it as a byte stream in C or Java works immediately.

Below is Java code to read as a byte stream:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadByteFile {
  public static void main(String[] args) {
    byte[] b = new byte[100];      // byte buffer.

    // try-with-resources closes the stream even if read() throws.
    try (InputStream is = new FileInputStream("filename")) {
      int read;                    // bytes returned by the last read() call.
      int len = 0;                 // total bytes read so far.
      while ( (read = is.read(b)) != -1 ) {
        // Print only the bytes actually read; the rest of b is stale.
        System.out.write(b, 0, read);
        len += read;
        if (len > 1000) break;     // preview only the start of the file.
      }
      System.out.println();
    } catch (IOException e) {
      System.err.println("Cannot read file: " + e.getMessage());
    }
  }
}

Or read in C using fgetc():

#include <stdio.h>

int main() {
  FILE * f = fopen("filename", "rb");  // binary mode: no newline translation.
  int ch;           // int, not char, so EOF can be told apart from data.
  long ct = 0;      // char count.
  long line_ct = 0; // line count.

  if (f == NULL) {
    perror("fopen");
    return 1;
  }

  while ((ch = fgetc(f)) != EOF) {
    ct ++;
    if (ch == '\r' || ch == '\n') line_ct ++;

    putchar(ch);

    // Preview only: stop after 1000 chars or 5 line breaks.
    if (ct > 1000) break;
    if (line_ct > 5) break;
  }
  printf("\n");
  fclose(f);
  return 0;
}
