Homework Three
2-Way Merge Sort

The purpose of this assignment is to learn how to sort large data files by implementing a 2-way merge sort. As described in class, the 2-way merge sort is similiar to the N-Way Merge sort, except we know in advance that half the data file will fit into memory.

Submit your source code via email to dannellys@winthrop.edu with the subject line CSCI325 - HW 03 by 5:00pm on February 11.


The Problem

Write a program that will sort a large data file. Your program should sort any size data file. You should assume that only half the file will fit into memory.

Your program should read and write data files with the following format:

   N
   record 1
   record 2
   ...
   record N-1
   record N
In other words, the first bytes of the data file are an integer that indicates the number of records in the file. Note, this header value makes it a lot easier process the file.

The format of each record is:

   struct faculty_type {
      char  fname[15],lname[15];
      int   phone, office;
      char  dept[5];
   };

A sample data file of 47 faculty records is available at "/home/ACC.dannellys2/faculty_unsorted.dat". Since that is a data file, not a text file, Dannelly cannot put this sample on the web. You need to copy that file from his linux home directory into your account.

Your program should sort a file of faculty records by last name.

Instead of hardcoding the names of the input and output files like in the first two assignments, your program should read these from the command line (see "code hints" section below). A couple of example runs are below. In the first example the user did not indicate the file to read or the file to write, so the program just output an error message and quit. The second run read the file "faculty_unsort.dat" and created the sorted file "faculty_sorted.dat".

   > merge_sort
       Usage Error: merge_sort inputfile outputfile

   > merge_sort faculty_unsorted.dat faculty_sorted.dat


Algorithm

As discussed in class, the following is the general algorithm for a 2-way merge sort:
Create 2 sorted files
Read 1st half of input file into an array in main memory
sort the array
write array to file temp1

Read 2nd half of input file into memory
sort the array
write to file  temp2

Merge the 2 files
Read record x from temp1
Read record y from temp2
While both temp1 and temp2 contain records
   if x < y
      write x to outfile
      read x from temp1
   else
      write y to outfile
      read y from temp2

Copy the file that did not empty
If temp1 is empty
   write remainder of temp2 to outfile
else
   write remainder of temp1 to outfile


Data Viewer

To complete this assignment you really need at least two different programs: one that reads an unsorted file and writes a sorted file; and a second program to print the contents of a data file to the screen. Note, a third useful program would be one that can create new data files of this format.

I will share with you my C program to view the contents of files that have this format. The program expects the name of the file to read as a command line argument. In other words, run this program like this:
      > viewer faculty_unsorted.dat

/***************************************
Program to display to stdout the
contents of a file of faculty records.

Dannelly - January 2011
***************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// a structure to read and write
struct faculty_type {
   char  fname[15],lname[15];
   int   phone, office;
   char  dept[5];
};

/**************************************/
void main (int argc, char *argv[])
{
   FILE *infile;
   struct faculty_type input;
   int i, count;

   // error check command line
   if (argc != 2)
     {
      fprintf(stderr, "Error: Usage: %s recordsfile\n",argv[0]);
      exit (1);
     }

   // open data file for reading
   infile = fopen (argv[1],"r");
   if (infile == NULL)
     {
      fprintf(stderr, "\nError opening %s for reading\n\n", argv[1]);
      exit (1);
     }

   // read record count from top of the file
   fread (&count, sizeof(int), 1, infile);

   // loop throught the file and print all records
   for (i=0; i<count; i++)
     {
      fread (&input, sizeof(struct faculty_type), 1, infile);
      printf ("%-15s", input.fname);
      printf ("%-15s", input.lname);
      printf (" %d ", input.phone);
      printf (" %d ", input.office);
      printf (" %s ", input.dept);
      printf ("\n");
     }
}


Code Hints

There are lots of places to get confused on this assignment. Here are some likely sticking points.

Don't forget that the first thing in the data file is an integer, not a record. Also, be sure your output file starts with a count, not a record.
Also, be sure you use read or fread (not >> or fscanf) to read that integer. Your input is a byte file, not a text file.
To make your debugging easier, I highly recommend that your two temporary files use the same format as the input and output files. That way you can use the viewer program that I provided above.
A major problem is where to put the file data inside main memory. I suggest an array because it is easiest to sort arrays. The problem is how big does it need to be? Array size of 500 will be way too big for the sample data I provide, yet too small for the data I will use to test your program. Try this C code:
   int run_size;                         // how many records to process
   struct faculty_type *records = NULL;  // a variable-size array of records
   . . .
   free(records);
   records = (struct faculty_type *) malloc (sizeof (struct faculty_type) * run_size);
   fread (records, sizeof(struct faculty_type), run_size, infile);
The free function returns the array back to the operating system for memory clean-up. If free is sent a NULL, then nothing happens. The malloc function does a memory allocation of run_size number of faculty records. The fread then reads run_size number of records into the array named "records". You can now do things like "records[20].phone = 1234;".
When sorting an array of records, don't forget to swap each element of the structure. This will not work:
   temp = records[j];
   records[j] = records[j+1];
   records[j+1] = temp;
Most importantly, consider that is a long and sometimes tricky program. My version was only 178 lines of code, but it took me a few hours. Homework 1 and 2 took me about 20 minutes each.