FINAL REPORT ON MACHINE
METHODS FOR INFORMATION SEARCHING
Williamina A. Himwich
Helen O. Field
John M. Whittock
Sanford V. Larkey*
*The preparation and writing of this report was the responsibility of Mr. Whittock and Dr. Larkey, based on the work of all of the authors listed above.
WELCH MEDICAL LIBRARY INDEXING PROJECT
Sponsored by the Armed Forces Medical Library
The Johns Hopkins University
Baltimore, Maryland 1955
For a number of years there has been a good deal of talk of the possibilities of using machine methods and other techniques to aid in searches through the scientific literature. While some of the ideas were quite speculative and glossed over the many difficult problems involved, there were also many practical studies and applications to certain rather restricted areas. The general picture of this work has been summed up in the Terminal Report ... 1 November 1948 to 31 January 1951 and in the project working paper appended to it, Punched Cards for Information Files, Review of the Literature, dated 18 April 1950.
One of the objectives of the Welch Library Project was to study the application of machine methods to the specific problems of medical bibliography. In any such study we must first face the basic questions: why do we think of using machine methods for such purposes; are they really necessary; and what do we expect to get from them? These questions are not easy to answer on theoretical grounds or even when we know something of the operating practicabilities of machines. Probably the first consideration in the desire for machine methods is the tremendous extent of periodical literature in the medical sciences and the ensuing difficulty in retrieval of information contained in these periodicals. The extent and nature of these periodicals have been discussed in another Project report, Survey of World Medical Serials and Coverage by Indexing and Abstracting Services.
Another common assumption is that our present indexing and abstracting services are not adequate for the demands of modern scientific research, either as to their coverage of the thousands of periodicals or as to the quality and depth of indexing. It is true that, as was shown in the above Survey, there is a great deal of over-lapping of coverage of some journals but at the same time there is very limited or no coverage of many others. We have also found from our interviews (as contained in the previous Terminal Report) that many scientists are not satisfied with or have difficulties in using our present services. We have pointed out some of the problems of subject indexing, particularly for printed indexes, in the Final Report on Subject Headings and Subject Indexing. But certainly the better of our present indexes and abstract journals would seem to meet the needs of the greater part of medical scientists and before we embark on any large-scale use of machine methods we must find out just what more they could provide and for whom? We need to know a great deal more about what is required in various areas of science, particularly in medical science. It is obvious that the needs of a physician in practice are quite different from those of the man in highly specialized research. There are probably different demands as to thorough literature searches for the more general areas of clinical medicine than for technical and commercial research and development, as in pharmaceutical chemistry. We have a range, then, from the desire for a rather cursory picture of modern developments to the exacting demands of patent searching.
What might we expect from machine methods over and beyond what we can get from current indexing practices, as employed, for instance, by the Current List, The Quarterly Cumulative Index Medicus or Chemical Abstracts? First we would want more complete and more detailed indexing and greater correlation of the items of information. Second we would hope for more rapid and more complete retrieval of information. As can be seen, the first is largely in the intellectual realm and depends very little on machines, except insofar as they can accomplish certain aspects of correlation if proper coding has been done. It involves the extension and modification of present indexing practices, adapted, of course, to machine operations, the aim being to furnish the machine with more data than is required for printed indexes and in different ways. The following is an example of what items of information might be coded for machine searching from a given article:
"Chemical changes in the brain produced by injury and by anoxia", Wm. E. Stone, Clyde Marshall and Leslie F. Nims. Am. J. Physiol. 132: 770-775, 1941.
This paper can be located under the following headings in:
|QCIM||CHEMICAL ABSTRACTS||BIOLOGICAL ABSTRACTS|
|Brain, wounds and injury||Injury, brain changes from||Injury, effect on brain|
|Brain, chemistry||Brain, in anoxia and injury||Brain, changes, chemical injury|
|Oxygen, deficiency||Oxygen, deficiency brain changes from||Oxygen, lack, effect on brain|
For machine coding we would want to include the following additional headings:
Electroencephalography; Lactic acid; Hexose diphosphate; Brain, anatomy, cerebral cortex; Phosphorus; Hydrogen ion concentration; Phosphocreatine; Adenosine triphosphate; Cats; Age - not specified. (From the 1951 Terminal Report.)
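The gain in depth can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the headings are those listed above for QCIM and for machine coding, while the set representation and the function name `answers` are hypothetical and not part of the Project's procedure.

```python
# Headings under which QCIM indexes the Stone, Marshall and Nims paper.
PRINTED_HEADINGS = {
    "Brain, wounds and injury",
    "Brain, chemistry",
    "Oxygen, deficiency",
}

# Additional headings proposed above for machine coding.
ADDITIONAL_HEADINGS = {
    "Electroencephalography",
    "Lactic acid",
    "Hexose diphosphate",
    "Brain, anatomy, cerebral cortex",
    "Phosphorus",
    "Hydrogen ion concentration",
    "Phosphocreatine",
    "Adenosine triphosphate",
    "Cats",
    "Age - not specified",
}

MACHINE_HEADINGS = PRINTED_HEADINGS | ADDITIONAL_HEADINGS


def answers(question, headings):
    """Return True when every heading asked for is coded for the article."""
    return set(question) <= headings


# A correlated question that the printed headings alone cannot satisfy.
question = {"Oxygen, deficiency", "Lactic acid", "Cats"}
print(answers(question, PRINTED_HEADINGS))  # False
print(answers(question, MACHINE_HEADINGS))  # True
```

The article is found only when the additional headings have been coded, which is the sense in which the first desideratum is an indexing matter rather than a machine matter.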
The second desideratum depends much more on the machines themselves although the intellectual element is again an important one in the proper programming of the machine operations. All in all it may be said that machine methods in bibliography would require much greater rather than less intellectual effort than is needed for present indexing methods. The spectacular role of machines appears only in the later stages of the operation.
In our discussions we are speaking generally of searching through the periodical literature, since this would seem to be the area for the greatest use of machine methods. There might very well be a possible use for machine methods in certain large research or bibliographical operations for organizing and selecting unpublished data. We have spoken of retrieval of information but in what form will we get this information, whether it is from a periodical article or from some other source? We would, of course, like to have immediately the original document or at least a detailed abstract of it, either preferably translated into our own language. This is not as easy to achieve with machine methods as has sometimes been prophesied. With many present machines the best we would get would be a card or tape with the serial number, or a tabulated list of such serial numbers, leading us to a bibliographical reference or possibly to a file of the original papers. With some machines it may be possible to supply the bibliographical reference directly. The Rapid Selector furnishes a microfilm of an abstract of the original article. Microfilm can be inserted into IBM cards and there are possibilities of using microcard or microprint techniques on IBM type cards. Any of these, though, would encroach on the needed punching areas of the card. In respect then to the final physical product, we may not be able to get from the machine much more than we would from an index or abstract journal. This is an aspect that will need much more thought after the basic details of machine searching methods are worked out.
2. Approaches to the Problem
With the above assumptions in mind, the first step was to lay out the broad lines of approach on an ideal basis, without particular consideration of the practical possibilities of existing and available machines. It was felt that too great familiarity with the capabilities of the machines might result in shaping the problem to the machine rather than adapting the machine to the problem. What we wanted to do was to work out a system for indexing in great detail and depth and then by coding to put this material in form for machine operations. Therefore a good part of the studies on coding were carried on before we were very familiar with the details of the machines and actually before some of them were even on hand. Of course, when we came to the practical application of specific IBM machines to the various aspects of the problem, certain members of the staff became quite expert, by intensive study of the operating manuals and other literature and by attendance at IBM schools. We were greatly assisted by members of the IBM office staff from Baltimore and New York. It should perhaps be pointed out that machine methods were used in many other phases of the Project's work.
All of the practical work on searching was done on IBM equipment although other machines were studied. The first work was carried out on an ordinary Sorter. It was obvious that such a machine would be useful only for very simple searches, but it served to give us an idea of the basic principles of such machines and was valuable in testing certain features of coding, particularly those involving multiple punches in a single column of the card.
The Sorter with multiple column selector and the Collator were the next machines used. Here category codes were employed with fixed fields on the cards. The details of these features will be explained later. We had the opportunity of a trial run of material on the IBM photoelectric scanner, developed by Dr. Luhn of IBM and then available only in one experimental model. For this trial a classified code was prepared for Brain Physiology and Biochemistry. One hundred and forty articles were indexed and coded. The necessary additional codes were set up for Brain Anatomy and Histology, Proteins, Lipids, Amino Acids, Chemical Elements, Enzymes, Carbohydrates, and Hormones on the basis of existing classifications. Codes for drugs, diseases and pathological processes were developed as the need for them arose, to cover only the material indexed. The results of the trial run were highly satisfactory. This machine seemed to have great possibilities and we would have liked to have had more experience with it. Since there were no models available, IBM suggested in the Fall of 1951 that we work on a relatively new machine, the IBM 101 Electronic Statistical Machine, which had been used by the Patent Office in somewhat similar problems. The 101 was delivered early in 1952. All of the later work was done with this machine. It offered a great range for experimentation and enabled us to get away from the rigid requirements of fixed fields on the cards.
3. Coding for Machine Operations
In any machine operation for information searching it is necessary to have some way of translating the indexable items in the material being analysed into some sort of code or system of symbols that can be "read" by the machine. The exact way in which a code would be expressed would vary with the type of machine being used, but, in general, the basic principles are the same. The coding system has first to be developed with regard to the material to be analysed and then modified and adjusted to the specific machine operation.
Before describing our work on developing coding systems, it might be well to say something of the methods and problems of feeding information to machines, particularly those used in our studies and described above. For most IBM machines, the items of information are represented by punches on the card, standing either for numbers or alphabetical characters. So coding here can be alphabetical, numerical or a combination of both. Such codes are simple to punch since they can be read directly by the punch operator and punched directly by using a keyboard very similar to the ordinary typewriter keyboard. The cards can also be verified directly. This was the method used in all of our studies.
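The way numbers and letters are carried as punches can be sketched as follows. The sketch assumes the standard IBM card code of the period (a single punch in rows 0 to 9 for a numeral, and a zone punch in row 12, 11 or 0 combined with a digit punch for a letter). The report does not spell this code out, and the function names are hypothetical.

```python
# Zone punch for each group of letters in the standard IBM card code
# (an assumption; the report itself does not list the code).
ZONES = {12: "ABCDEFGHI", 11: "JKLMNOPQR", 0: "STUVWXYZ"}
DIGITS = "0123456789"


def punches(char):
    """Return the set of card rows punched in one column for one character."""
    if char in DIGITS:
        return {int(char)}  # numerals: one punch per column
    for zone, letters in ZONES.items():
        if char in letters:
            # A-I and J-R pair the zone with digits 1-9; S-Z with digits 2-9.
            first_digit = 2 if zone == 0 else 1
            return {zone, first_digit + letters.index(char)}
    raise ValueError(f"no card code assumed for {char!r}")


def punch_field(text):
    """Encode a short code, one column per character."""
    return [punches(char) for char in text.upper()]


print(punch_field("B27"))  # [{2, 12}, {2}, {7}]
```

An alphabetical character therefore occupies two punching positions in its column where a numeral occupies one, which is the distinction between the numerical and alphabetical machines.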
For those machines using photoelectric cells as the searching device, such as the IBM photoelectric scanner or the Shaw Rapid Selector, the codes are represented by patterns; in the former by patterns of punches on IBM cards and in the latter by patterns of clear and opaque areas on microfilm. Here there is an added step in transforming the basic code into patterns. But since there are greater possible combinations than with ordinary methods of punching, the code can be more complicated and thus, for IBM cards, requires fewer columns per subject item. Information is fed to most electronic giant calculators by means of tape, with the code represented by magnetized spots on the tape. This requires a binary system with again the step of transforming the basic code into binary symbols. In spite of the very different technical details of these methods the basic principles of searching are very much the same for all of them and there are many common problems.
In the first place, the really important phase, that of analyzing or indexing the material and coding it, is the same, no matter what searching method is used. In the second place, there are problems of storage. It must be realized that the first purpose of putting information, in code, on to the various media for machine operations is to preserve this information permanently or for a long time and in such a form that it can be sent through the machines many, many times. From the point of view of space requirements alone, cards are at great disadvantage and microfilm or tape would seem to be the most efficient forms for storage. Cards have one advantage though in that they are separate units and so can be pre-arranged or handled individually or in small groups. The relation of this feature to certain aspects of programming will be discussed later.
When we turn to questions of retrieval we find great differences depending on the media of input, particularly if speed of searching is one of our stated requirements. Machines actuated by punched cards are much slower than those using microfilm or magnetized tape. This slowness may be offset to some extent by the apparently greater correlative features and by programming. In any event, if we are thinking of machine methods as a means of searching all of the literature for the whole field of science, or even of medical science, over a considerable period of time, we must realize that, regardless of the media, a huge mass of coded data must go through a machine for the answer to a single question or at best a limited number of questions. As far as we can see, there is no realization yet of that often expressed hope of almost immediate complete answers to complicated questions.
The great electronic calculators, such as Univac, may be the answer to the demand for speed, but it should be emphasized that they have been designed and built for very different purposes and so have certain relative drawbacks for our purposes. Hence any discussion of their possible use must be largely theoretical. The speed of input of information by means of magnetized tape, while extremely rapid by ordinary standards, is relatively slow compared to the speed of the internal operations of these machines. This is not a serious problem for the purposes for which the machines were designed, but might possibly be for our purposes, where the mass of information to be fed in is so great. Internally our questions would require fairly simple matching operations seemingly well within the capabilities of the "memory" devices. The output, or the answers to our questions, should be extremely small compared to the input. In many ways, our requirements are almost the opposite of what is normally asked of the machine. We suggest that input might be speeded up by feeding the information file of coded data in binary code, directly into the machine by a television scanning device, scanning synchronized moving picture film, on which the binary code for the necessary items of information was represented by spots. We have no idea as to the practicality of this suggestion or whether the method would actually be faster than tape.
While we are speaking of the relative merits of the various machines and the media for information, we might mention another disadvantage of punched card actuated machines. By their very nature, searching by such machines involves sorting, which means picking out those cards which meet the requirements of our question and putting them together. This is no problem when we ask only one question at a time but may be a limiting factor when we want to ask a number of different questions at one run. It is possible that one card may answer more than one question. We will see later the implications of this difficulty in the operations of the IBM 101 machine.
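The difficulty can be sketched in Python under one simplifying assumption, namely that a card falls into the pocket of the first question it satisfies and so is no longer available to any other question in the same run. The card contents, question names and function name are invented for illustration.

```python
def sort_into_pockets(cards, questions):
    """Drop each card into the pocket of the first question it satisfies."""
    pockets = {name: [] for name in questions}
    pockets["reject"] = []
    for serial, codes in cards.items():
        for name, required in questions.items():
            if required <= codes:
                pockets[name].append(serial)
                break  # a physical card can land in only one pocket
        else:
            pockets["reject"].append(serial)
    return pockets


# Invented cards: serial number -> subject codes punched on the card.
cards = {
    1: {"brain", "anoxia"},
    2: {"brain", "cats"},
    3: {"brain", "anoxia", "cats"},
}
questions = {"Q1": {"brain", "anoxia"}, "Q2": {"brain", "cats"}}

print(sort_into_pockets(cards, questions))
# {'Q1': [1, 3], 'Q2': [2], 'reject': []}
```

Card 3 answers both questions but appears only under Q1, so the answer to Q2 is incomplete unless the cards are run again or duplicated.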
In the above discussions, we have considered the problems of storage and retrieval of information from files of various media set up in a certain way, that is with the original document as the unit for the file. Thus for IBM cards, for every document (periodical article, report, etc.) there would be a card or a series of cards, on which would be punched codes for all the desired indexable items contained in that article and some code for the identification of the article bibliographically. The cumulative deck of such cards, covering documents from many sources and over periods of time or the equivalent rolls of tape or microfilm constitute the information file, all or part of which must be sent through the machine for any search. Perhaps it should be pointed out that there is another way of storing information which is the opposite of this. Here the unit for the basic deck of cards is the individual subject concept. On these cards is registered, in various ways, the serial numbers for those articles which contain material pertaining to these subject concepts. Searching for articles which contain information answering a given question involves matching, by machine methods or otherwise, those subject cards which add up to the question. The next step, then, is to pick out the common, identical serial numbers for only those articles which answer the combined requirements of the question. This is the method that has been used, for instance, by Batten, Cordonnier and Taube*. In our own work we have confined our considerations to the first method, although certain general principles might well apply to the other.
[*The methods of Batten and Cordonnier are described in the previously cited paper "Punched Cards for Information Files" and that of Taube in "Studies in Coordinate Indexing" by Mortimer Taube and Associates, 1953]
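The two ways of setting up the file can be contrasted in a short Python sketch. The serial numbers, subject terms and function names are invented for illustration, and the sketch says nothing about how either file would be realized on cards, tape or film.

```python
from collections import defaultdict

# Document as the unit: serial number -> subject codes for that article.
DOCUMENT_FILE = {
    1: {"brain", "anoxia", "lactic acid"},
    2: {"brain", "injury"},
    3: {"brain", "anoxia", "cats"},
}


def search_document_file(document_file, question):
    """Pass every document through and keep those carrying all the codes asked for."""
    return sorted(s for s, codes in document_file.items() if question <= codes)


def build_subject_file(document_file):
    """Subject concept as the unit: subject code -> serial numbers of articles."""
    subject_file = defaultdict(set)
    for serial, codes in document_file.items():
        for code in codes:
            subject_file[code].add(serial)
    return dict(subject_file)


def search_subject_file(subject_file, question):
    """Pull the subject cards for the question and keep the common serial numbers."""
    serial_sets = [subject_file.get(code, set()) for code in question]
    if not serial_sets:
        return []
    return sorted(set.intersection(*serial_sets))


question = {"brain", "anoxia"}
subject_file = build_subject_file(DOCUMENT_FILE)
print(search_document_file(DOCUMENT_FILE, question))  # [1, 3]
print(search_subject_file(subject_file, question))    # [1, 3]
```

Both searches give the same answer. The first must send the whole file through for each question, while the second handles only the subject cards named in the question and then matches serial numbers.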
Since any code for machine operations represents actually only a shortened symbolic form of the various terms used to describe the important features of the contents of a written document or parts of it, the background for working out such a code is similar to that of any method of organizing large areas of knowledge. We have to set up an all-inclusive system that will contain any possible subject concept that may occur in the literature being analyzed. The usual question arises as to whether the best way to do this is by some system of classification or by an alphabetical dictionary arrangement of subject entries. We have worked with both methods and have found that each has advantages and disadvantages when applied to coding for machine operations.
In either case, a system for machine operations must be made more rigid, specific and detailed than those used in traditional bibliographical approaches. Such a system must forego all cross references, since selection by machine is on an all-or-none basis. But there must be something to take their place. The design of the code itself can meet some of the problems raised by cross references but in some instances they can only be obviated by more thorough indexing. The usual see reference, usually from one synonym or closely related term to another, can be taken care of easily by using the same code number for all such related terms, as they appear in the index to the code. The need for something like see under references and for many of the see also references, those from the general to the specific, will be met by some degree of classification in the code, or as it might be called generic coding.
Since, as has been pointed out, a well thought out and detailed indexing system must be the background for machine coding, most of the findings described in the Final Report on Subject Headings and Subject Indexing are applicable to the machine problem, modified, of course, to fit the special requirements of machine operations. This is true not only of the studies on subject headings, and particularly the category method, but also of the studies on cross references, subdivisions and on the principles of subject indexing. The basis for both indexing systems is the establishment of a list of standard subject headings. But just as a subject index needs additional features such as cross references, subdivisions and "modifications", so does a machine indexing code need some way of meeting the same requirements. We have already spoken of the function of generic coding, and of what a code can do to take the place of some cross references. This is an important element not only in coding information but also in formulating questions for retrieval of information. Since we hope that one way in which a machine searching system would be more effective than a printed index is in the greater possibilities of correlation and depth of indexing, we do not want to complicate the problem by excessive, unnecessary coding that would add to our difficulties in encoding information and in asking questions.
The need for generic coding might be exemplified by the problems involved in searching for information about a class of drugs, such as antibiotics, and the individual specific drugs of that class. In a printed subject index there would probably be a main heading antibiotics under which would be indexed articles dealing with this class of drugs in general while articles on any aspect of the specific drugs would be under their separate headings, with a see also reference from antibiotics. Thus if a reader was interested in a certain aspect relating to any antibiotic he would have to look under all the headings for all the specific drugs. We would hope to obviate this in machine searching by a code or a feature of the code that would indicate this generic relationship.
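A minimal sketch of how a code might take the place of see and see also references follows. All code numbers, serial numbers and drug assignments are invented for illustration. The sketch assumes a code in which each added digit marks a more specific member of the class, so that a generic search reduces to matching the leading digits, and it gives synonyms the same code number.

```python
# Hypothetical index to the code. Synonyms share one code number, which
# stands in for a "see" reference; the leading digits "31" stand in for
# the "see also" reference from antibiotics to the specific drugs.
INDEX_TO_CODE = {
    "antibiotics": "31",
    "penicillin": "311",
    "streptomycin": "312",
    "aureomycin": "313",
    "chlortetracycline": "313",
    "digitalis": "32",
}

# Invented information file: serial number -> codes punched for the article.
CARDS = {
    101: {"31"},   # antibiotics in general
    102: {"311"},  # penicillin
    103: {"313"},  # aureomycin
    104: {"32"},   # not an antibiotic
}


def search(cards, term, generic=True):
    """Find articles coded under a term or, generically, under anything below it."""
    code = INDEX_TO_CODE[term.lower()]
    if generic:
        def match(candidate):
            return candidate.startswith(code)
    else:
        def match(candidate):
            return candidate == code
    return sorted(s for s, codes in cards.items() if any(match(c) for c in codes))


print(search(CARDS, "antibiotics"))                 # [101, 102, 103]
print(search(CARDS, "antibiotics", generic=False))  # [101]
print(search(CARDS, "Chlortetracycline"))           # [103]
```

One question on the generic code retrieves the articles on every specific drug of the class, where the reader of a printed index would have to look under each drug separately.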
Another feature, of probably greater importance, is the need to show correlations, that is the relations between certain subject entries in a given document or section of a document. These would include the action of one organism on another, or on a system of the human body and the possible disease condition involved; the effect of a drug on a system of the body or on a disease; or the reaction of one chemical substance on another; and the mode, direction and degree of such interreactions. In a printed subject index these relations are brought out sometimes by subheadings but mainly in the "modification" or descriptive title of the article. In a machine searching system these relations would have to be included as part of the overall code. It might be possible to have single code numbers or symbols for these combined concepts but this would result in a very complicated coding system and we hope that the problem can be solved by simpler means. In effect, we have one aim working against another here. The greater the depth of indexing, that is the more subject entries used, for a single article, the greater the need, usually, for indication of correlation between these subject entries.
Many of these general principles were discussed at length with detailed examples in the paper "Categorization as a Basis for Machine Coding", August 28, 1951, presented by Dr. Himwich before the Division of Chemical Literature, American Chemical Society, in September of that year. Since this paper has been previously distributed the conclusions are given here only in summary with some discussion of the application of the coding system to fixed field cards.
This paper set up certain criteria for a code, as follows, with slight modification from the original:
1. It should be simple to encode and to decode.
2. Each specific or individual concept should be represented by a single code number.
3. It should provide that a concept can be located as a specific concept or as part of a generic concept or concepts.
4. It should provide that a specific concept may be approached from as many axes as appear desirable; e.g., the concept of rabbit from the point of view of a food, of a pest, of an experimental animal, of a carrier of disease, of a rodent, and of a fur-bearing animal, etc.
5. It should allow grouping of any specific concept in any possible relation to any other concept or concepts. For example, the use of any drug in any given disease in any given organism could be searched; the fact that the drug had never been used for that disease would not prevent the search. Such a search would give a negative answer.
6. It should permit the establishment and recognition of basic relationships between the items coded; e.g., there must be no doubt as to which is the affected and which the affector, whether it be a chemical reaction, a disease condition, or a social state that is being coded.
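Criteria 5 and 6 can be illustrated with a small Python sketch in which each coded item records which concept is the affector and which the affected. The record layout, the sample entries and the function name are all hypothetical; the paper cited does not prescribe this form.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    affector: str
    affected: str
    organism: str


# Invented information file: serial number -> coded relations in the article.
FILE = {
    1: [Relation("penicillin", "pneumonia", "man")],
    2: [Relation("streptomycin", "tuberculosis", "guinea pig")],
    3: [Relation("anoxia", "brain", "cat")],
}


def search(file, affector=None, affected=None, organism=None):
    """Return serial numbers of articles with a relation matching every stated role."""
    wanted = (affector, affected, organism)
    hits = []
    for serial, relations in file.items():
        for relation in relations:
            if all(w is None or w == got for w, got in zip(wanted, relation)):
                hits.append(serial)
                break
    return hits


print(search(FILE, affector="anoxia", affected="brain"))             # [3]
print(search(FILE, affector="brain", affected="anoxia"))             # []
print(search(FILE, affector="penicillin", affected="tuberculosis"))  # []
```

The second search shows that direction is preserved, as criterion 6 requires. The third shows the negative answer of criterion 5: the combination can be asked for even though no article in the file contains it.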
From certain of the requirements as set forth here and from what has already been said about generic coding, it will be seen that some degree of classification was assumed to be needed in a code for machines. The paper proposed two possible solutions:
"to have a very simple code with little or no classification accompanied by extensive indexes both alphabetical and classified -- or to have a complex code allowing for multiple classification and a relatively simple index. In evaluating these two types of codes and the many possible compromises between the two, not only the problems of actual coding and finding of material but also those of cost, of time, of personnel, must all be considered."
We had made a study of a number of the universal classification systems currently used, but for purposes, of course, different from ours. We had hoped that features of them might be adapted to our uses. Among these was the Universal Decimal Classification. At first sight a decimal system seems ideally suited for machine coding. It is the simplest way to express step-by-step relationships, therefore particularly useful in generic coding and the straight numerical codes should be easy to punch on cards. The great disadvantage is that in some areas the need for specificity results in very long codes, since one has to keep advancing to additional decimal places to allow for the needed entries. At the same time there are many other areas that do not require as great a number of entries for the same or equivalent degree of breakdown. The result is an overuse of punching areas for certain fields and a great waste in others. Since there are only 80 columns for punching on an IBM card, one tries to keep codes as short as possible and to utilize to the maximum the available combinations. For these and other reasons we came to the conclusion that no overall classification system, and particularly a decimal one on a mnemonic basis, would meet our needs. As will be seen, though, we have tried to keep the decimal feature wherever it is practicable.
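The waste of punching area can be put in rough numbers. The sketch below assumes that every code must be given a column area as wide as the longest code in use, which is one way the difficulty would arise on a card with fixed areas. The code numbers are invented and are not taken from the Universal Decimal Classification.

```python
def field_usage(codes):
    """Return (field width, columns punched, columns left blank) for a set of codes."""
    width = max(len(code) for code in codes)  # area must hold the longest code
    punched = sum(len(code) for code in codes)
    blank = width * len(codes) - punched
    return width, punched, blank


# Invented decimal codes: one deeply subdivided area and three shallow ones.
codes = ["61", "615", "6153294", "57"]
print(field_usage(codes))  # (7, 14, 14)
```

Here half of the 28 columns set aside are never punched, a serious loss when only 80 columns are available.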
In our early study of subject headings we had found that the great majority of headings fell into certain rather obvious or natural groups, which often were at variance with the grouping of the same headings according to any universal classification system. We termed these groups "categories". The further development and elaboration of these categories has been described in great detail in the Final Report on Subject Headings and Subject Indexing. The category method of breakdown had proved of great value in the work on subject headings and we felt it might be applied to similar breakdowns for machine coding.
At this time we had set up 16 major categories which are outlined in the accompanying table.
For category coding on fixed field cards most of the codes were the same as or simple expansions of the category codes used for the subject heading study with added correlative features. However a very elaborate code was set up by Dr. Himwich for the field of Chemistry. This is described in detail in the paper on categorization previously cited.
One of the values of categories as a basis of coding is that it has many of the advantages of classification without some of the limiting disadvantages. Another point is that it facilitates the description of the article as a whole or of the broad concepts within a given article. A high proportion of medical articles are about combinations of aspects from different categories, for instance, the action of a drug on a certain disease in various categories of individuals. Of course, an article may be concerned with interrelations between items of the same category. In either case, the category method would be of assistance in coding, particularly in indicating correlations, and also in formulating questions for machine searching.
As has been said, no single overall system seems to meet the requirements for all areas of knowledge, as applied to machine operations. This observation is true of the category method. In coding for fixed field cards we used both classified category coding and unclassified coding, and in various combinations to bring out generic and other relationships. It is probable that any code for machine operations would have to be so designed on the basis of expediency and the need for provision of correlating indications. No matter what kind of code, or combination of codes, is used, there must be a detailed alphabetical index, which will usually be the basic tool for the coder.
Categories for Subject Headings
Organism - e.g. dogs, Escherichia coli
4. Trials on IBM Sorter with Multiple Column Selector and IBM Collator, using Fixed Field Cards
The first major trial of machine searching operations was on two standard IBM machines that were then available to us. These were a Sorter with Multiple Column Selector and the Collator. Whereas the ordinary Sorter can select cards on only one column at a time, sorting them into any one of twelve pockets, depending on the punch or punches in that column, a Sorter with an attached Multiple Column Selector can select over a range of as many as ten columns at once provided the ten columns are adjacent. This appliance then would make possible searches for two items with 5 digit code numbers or for more with 3 or 4 digit codes. It is, though, restricted to a ten column area of the 80 column card.
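The ten column restriction can be sketched in Python by treating a card as a string of 80 characters, one per column. This simplification cannot show multiple punches in a column, and the helper names, code numbers and column positions are invented for illustration.

```python
CARD_COLUMNS = 80
SELECTOR_RANGE = 10  # adjacent columns the Multiple Column Selector can read


def make_card(fields):
    """Build an 80-column card image from {start_column: code}, columns 1-based."""
    columns = [" "] * CARD_COLUMNS
    for start, code in fields.items():
        columns[start - 1:start - 1 + len(code)] = code
    return "".join(columns)


def select(cards, start_column, pattern):
    """Select cards whose adjacent columns from start_column match the pattern."""
    if len(pattern) > SELECTOR_RANGE:
        raise ValueError("the selector covers at most ten adjacent columns")
    begin = start_column - 1
    return [card for card in cards if card[begin:begin + len(pattern)] == pattern]


# Two invented cards, each with two 5 digit codes in columns 9-13 and 14-18.
card_a = make_card({9: "12345", 14: "67890"})
card_b = make_card({9: "12345", 14: "11111"})

both_items = select([card_a, card_b], 9, "1234567890")
first_item = select([card_a, card_b], 9, "12345")
print(len(both_items), len(first_item))  # 1 2
```

Two 5 digit codes just fill the ten columns. A third code, or a code lying elsewhere on the card, would need another pass.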
The Collator is a more complicated machine, in that two decks of cards are fed into it at the same time and the selecting or sorting procedure is based on matching punches in certain columns of the cards in one deck with similar punches in the cards in the other deck. There are two types of Collators. The numerical Collator matches only the single punches in a column representing numerals, while the alphabetical Collator matches the two punches per column for alphabetical characters. The Collator can be wired to select or match on any of 15 columns anywhere on the card. It thus has a greater range as to the number of columns to be searched at any one time and as to their position on the card. It can be seen, though, that with either machine, searches would be limited to at most one-fifth of the punching capacity of the standard IBM card. This means that one could search only a certain area of the cards at any one run through the machines and that a search of all the punches in a card or of those areas used for coding would require a number of runs.
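The matching of two decks can be sketched along the same lines, again treating a card as a string of 80 characters. The sketch ignores the mechanics of feeding and sequencing the decks and shows only the comparison on wired columns; the helper names, code numbers and column positions are invented.

```python
CARD_COLUMNS = 80
COLLATOR_COLUMNS = 15  # columns the Collator can be wired to compare


def make_card(fields):
    """Build an 80-column card image from {start_column: code}, columns 1-based."""
    columns = [" "] * CARD_COLUMNS
    for start, code in fields.items():
        columns[start - 1:start - 1 + len(code)] = code
    return "".join(columns)


def collate(question_deck, file_deck, columns):
    """Select file cards that match some question card on every wired column."""
    if len(columns) > COLLATOR_COLUMNS:
        raise ValueError("the Collator compares at most 15 columns")

    def key(card):
        return tuple(card[column - 1] for column in columns)

    wanted = {key(card) for card in question_deck}
    return [card for card in file_deck if key(card) in wanted]


# Question card and file cards with 4 digit codes in two separated fields.
question_card = make_card({9: "1234", 30: "5678"})
file_a = make_card({1: "0000001", 9: "1234", 30: "5678"})
file_b = make_card({1: "0000002", 9: "1234", 30: "9999"})

wired = [9, 10, 11, 12, 30, 31, 32, 33]
selected = collate([question_card], [file_a, file_b], wired)
print([card[:7] for card in selected])  # ['0000001']
```

Unlike the Multiple Column Selector, the columns compared need not be adjacent, but the limit on their number still confines each run to a part of the card.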
It seemed, for our purposes, that the best way to make the greatest use of the possibilities of these machines was to use a fixed field card. This is a card where certain items of coded information are punched only in certain designated columns of the card. This is a common principle in IBM operations and is satisfactory for most of the uses to which the machines are put. An example of this is the card designed for the Project's study on periodicals, where such items of information about them as country, language, contents and subject coverage were assigned to definite column positions on the card. In the present problem the 16 categories were the basis of assignment.
With the varied potentialities of the two machines in mind, a card was designed that provided for varied types of coding for different categories. For most categories, column areas were assigned that allowed both for coding of specific subject entries, usually on a category breakdown basis, and for correlative or, as it was called, "function" coding, to take care of generic relationships and in some instances directions of action and states.
The first seven columns of the card were reserved for a serial number to identify the source bibliographically and the eighth column was used to show how many cards had been used for coding that given article. There remained then 72 columns for actual coding. In relation to the searching capacity of these two machines, it can be seen that a complete search of a card would require from five to eight passes through the machine. Those categories in which there are the greatest number of subject headings, namely 1-Organisms, 2-Anatomical Terms, 3-Chemical Terms and 9-Pathologic Conditions, were each allotted the largest possible number of columns, thus allowing for longer codes, and were assigned to the left of the card, beginning with column 9, in the order listed.
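The figure of five to eight passes can be checked with simple arithmetic. The sketch below is one plausible reading, assuming that each pass covers a fresh block of the 72 coding columns, that the Collator covers up to 16 columns per run and the Multiple Column Selector 10 adjacent columns; the function name is illustrative only.

```python
import math

# Columns 1-7 hold the serial number and column 8 the card count,
# leaving columns 9-80 for coding.
CODING_COLUMNS = 80 - 8

def passes_needed(columns_to_search: int, columns_per_pass: int) -> int:
    """Minimum number of runs needed to cover every coding column once."""
    return math.ceil(columns_to_search / columns_per_pass)

collator_passes = passes_needed(CODING_COLUMNS, 16)  # Collator: up to 16 columns per run
selector_passes = passes_needed(CODING_COLUMNS, 10)  # Selector: 10 adjacent columns per run
print(collator_passes, selector_passes)  # 5 8
```

Taking the Collator at 15 columns per run instead of 16 gives the same lower figure of five passes.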
None of the areas for coding, that is, the number of columns, whether for specific subject codes or for "function" codes, exceeded four columns. This meant that the Collator could search for specific subject codes in four different fields at one time or up to a total of 16 columns. The Collator could be used only for these kinds of searches, since all of the "function" codes were based on multiple punches in columns. For this type of punching only the Sorter with Multiple Column Selector could be used. It is true that this machine could also search for specific subject codes in two different fields if the columns for these respective codes were within the range of ten adjacent columns. This was taken care of for some categories by placing together or within the ten column range the specific subject coding areas for adjacent fields. These features emphasize some of the limitations of these machines and the need for careful card design and programming for any effective use of them.
An example of searching in the field of Chemistry might help in explaining some of these points. For Category 3-Chemistry, columns 24 through 32 were assigned; columns 24-27 for function coding, columns 28-31 for specific subject coding and column 32 for direction of action. Thus in columns 28-31 specific chemical compounds were coded using a combined alphabetical and numerical category code. These columns could be searched either by the Collator or the Multiple Column Selector. The Collator could combine a search for any one chemical substance coded in these columns with other aspects in up to three other separate categories. For searches on generic features, such as the action of the drug as described in the given article, coded as a "function" in columns 24-27, or for direction of action, column 32, only the Sorter with Multiple Column Selector is useful. This latter machine could also search for the effect of any single drug in any single specific disease, since, as it happens, specific diseases are coded in columns 33-36, or within the ten-column range. It will be noted that the fixed field card allows for coding only one item per category per card. Since we might very well expect a paper to mention more than one drug, organism or disease, extra cards would be needed for coding them.
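The fixed field principle can be sketched in a few lines. This is a minimal illustration, assuming that a card is modelled as an 80-character string with one character per column; the column ranges are those given above, while the sample codes (C417, 0052 and so on) and all function names are invented for the example. Multiple punches per column, as used for "function" coding, are not modelled.

```python
# Column ranges (1-indexed, inclusive) as given in the text for the fixed field card.
FIELDS = {
    "chemical": (28, 31),  # Category 3, specific subject code
    "disease": (33, 36),   # specific disease code
}

def make_card(serial: str, chemical: str, disease: str) -> str:
    """Build an 80-column card image; unused columns are blank."""
    columns = [" "] * 80
    columns[0:7] = serial.rjust(7, "0")  # columns 1-7: serial number
    columns[27:31] = chemical            # columns 28-31 (4 characters)
    columns[32:36] = disease             # columns 33-36 (4 characters)
    return "".join(columns)

def read_field(card: str, name: str) -> str:
    start, end = FIELDS[name]
    return card[start - 1:end]

def select(deck, chemical, disease):
    """Serial numbers of cards carrying both codes in their fixed fields."""
    return [card[0:7] for card in deck
            if read_field(card, "chemical") == chemical
            and read_field(card, "disease") == disease]

deck = [
    make_card("1", "C417", "0052"),
    make_card("2", "C417", "0911"),
    make_card("3", "B200", "0052"),
]
hits = select(deck, "C417", "0052")
print(hits)  # ['0000001']
```

Because each field holds exactly one code, a paper dealing with two drugs would need two such cards, which is the limitation noted above.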
The details of coding with examples and the relation to card design and machine operations are given in full in the 1951 Terminal Report and the paper Categorization as a Basis for Machine Coding. It is believed that the summary as here given illustrates the possibilities, such as they are, of fixed field cards and of these specific machines and above all the serious faults and limitations.
While the techniques described above represented certain advances, particularly as to application of coding principles to machine operations, and above all for correlation, the total picture was discouraging, especially for any large scale searching operations. The fixed field requirement presented many drawbacks, particularly the limitation on the number of entries for a specific subject in a field and the difficulties imposed on searching. With these two machines the procedures for searching are slow and cumbersome, requiring many sorts and combination of sorts with two different machines for answers to any detailed questions. It was realized that we would have to find means to get away from the fixed field requirement and to simplify and speed up the searching procedures.
5. Trials of IBM 101 Electronic Statistical Machine.
As we began to work with the IBM 101 Electronic Statistical Machine, our first aim was to get beyond the requirements of fixed field searching. That is, we wanted to be able to search for a given code anywhere on the card. More realistically, what we wanted was to search for a certain combination of punches making up a code of a certain length, which might be represented by alphabetical or numerical punches, and to search for this code in a series of areas on the card equivalent in length, in terms of columns, to the length of the desired code.
The IBM 101 machine is much more complicated and therefore more flexible in use than any of the other machines we had used and apparently offered great possibilities for our purposes. Because the sensing mechanism (the brushes acting through the punches in the cards to close an electric circuit) is connected by a very elaborate wiring board to an equally complicated system of electronic tubes and relays, it is possible to ask for cards to be selected even on the basis of complicated patterns of punches anywhere on the card. The details of the wiring board are shown in the exhibits of the various wiring diagrams developed by the Project. These intricate matching operations are performed while the card is passing through the first reading station, at the rate of 450 cards per minute. If the matching requirements are met, the next step of the machine operation takes place. The machine is capable either of sorting or counting. It can sort a card on the basis of any given combination of punches to a designated pocket or it can count the cards passing through or any given items of information punched on the card. It can perform these operations simultaneously. It is the sorting feature that was made use of primarily in our searching operations.
In a search with this machine, the question (actually the desired punches in given columns) is put into the wiring board. In a sense, then, the board might be considered as a small prototype of, or at least as acting in a similar way to, the "memory" devices of the larger electronic machines. Our problem was to work out wiring systems that would best answer our particular needs.
As a starting point it was decided to set up certain fixed arbitrary requirements and then to see how well the machine would meet them under varying conditions. As a beginning we set up an arbitrary numerical code of five digits, that would be represented by single numerical punches in five columns. Then we would require the machine to search for this code or codes in the maximum possible number of five-column areas on the card. It should be pointed out that the numerical codes mentioned in this part of the study are purely for the purpose of testing the capabilities of this machine and so are not directly related to any of the general problems of coding as set forth in preceding sections.
Our first step was to wire the board for searches for a single 5-digit code in a series of 5-column areas on the card. Our problem was to ask for any card which contained this code number anywhere in a given number of areas on the card. Since the 101 has only 60 selectors we were automatically limited to twelve such 5-column areas or a total of 60 columns out of the 80 on the usual card.
The wiring diagram for this Board "A" is given in Exhibit No. 1. Columns 1 through 60 on the cards were used for these trial runs.
Following is an illustration of how this wiring system works and what its results are. Assuming that the code for the item desired is 59673, we wire the board so that matching requirements are set up for a 5 punch in column 1, plus a 9 punch in column 2, plus a 6 punch in column 3, plus a 7 punch in column 4, plus a 3 punch in column 5. This wiring procedure is then continued for each of the other eleven 5-column areas, for instance, 5-9-6-7-3 successively in columns 6 to 10, and then in columns 11-15 and so on. Thus any card which contained the given code number of 59673 in this order and in the proper columns of any of the twelve areas would be selected.
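The matching requirement of Board "A" can be modelled in software as an exact comparison against each aligned 5-column area. This is only a sketch of the logic, not of the wiring itself; it assumes a card is represented as an 80-character string, and the function names and the filler digits on the sample cards are invented for illustration.

```python
AREA_WIDTH = 5
N_AREAS = 12  # 60 selectors divided by 5 columns per area

def areas(card: str) -> list:
    """Split columns 1-60 of a card image into twelve 5-column areas."""
    return [card[i:i + AREA_WIDTH]
            for i in range(0, AREA_WIDTH * N_AREAS, AREA_WIDTH)]

def board_a_selects(card: str, code: str) -> bool:
    """True if the code fills any one aligned 5-column area exactly."""
    return code in areas(card)

wanted = "59673"
aligned = ("11111" + "22222" + wanted).ljust(80, "0")  # code sits in columns 11-15
straddling = ("12596" + "73000").ljust(80, "0")        # same digits, but in columns 3-7

print(board_a_selects(aligned, wanted))     # True
print(board_a_selects(straddling, wanted))  # False
```

The second card shows why the code must fall "in the proper columns": the digits 5-9-6-7-3 occur in sequence but span two areas, so no single area matches.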
It can be seen that this method represents a considerable advance over what we had obtained with the other machines, in that it was now possible to search for a given code number over a much greater area at one run and also that cards punched on this basis could have as many subject entries in the same field as desired, within the limitation of twelve 5-column areas. On the other hand one can search for only one aspect of a question at a time and so a fairly complicated question would require a number of runs for a complete answer, thus increasing the total searching time. It is true that the number of cards involved becomes successively less and that the changes in wiring to ask for different code numbers can be done very quickly. It would probably be the method of choice for answering simple questions of one or two facets but it would seem that machine methods are hardly necessary for such questions. It will be noted that the system has possibilities for generic searching in that one could ask for only part of a code in the given areas, say for only the punches in the first three or four columns of each area.
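The generic search mentioned at the end of the preceding paragraph amounts to matching only the leading columns of each area. The sketch below assumes, purely for illustration, a classified code in which the first three digits denote a class; the sample cards and function names are hypothetical.

```python
def areas(card: str, width: int = 5, n_areas: int = 12) -> list:
    """Split columns 1-60 of an 80-character card image into 5-column areas."""
    return [card[i:i + width] for i in range(0, width * n_areas, width)]

def board_a_generic(card: str, prefix: str) -> bool:
    """True if any area begins with the given leading digits."""
    return any(area.startswith(prefix) for area in areas(card))

card_1 = ("59673" + "10442").ljust(80)  # carries a code in the 596 class
card_2 = ("20311" + "59601").ljust(80)  # a different code in the same class
card_3 = ("25967" + "30000").ljust(80)  # 5-9-6 occurs, but not at the start of an area

for card in (card_1, card_2, card_3):
    print(board_a_generic(card, "596"))
# True, True, False
```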
The basic principle of wiring is sound and could be extended if more selectors were available for the 101 or in some other machine. As for principle, it is also useful as a research tool since it represents in a small and relatively simple way what would go on, in a much more complicated way, within the "memory" devices of the larger electronic machines. We will see later that the wiring system of Board "A" also can serve as a most important supplement in searches using other wiring systems. It demonstrated what could be done with the 101 and led to the next more complicated systems.
We still wanted more from this machine and we felt that there might be ways of getting it without an improvement in the machine itself. We wanted to be able to ask all the elements or facets of a question at one time and if possible to ask more than one question at a time. Using the same arbitrary requirements of five-digit codes and for searches over twelve 5-column areas, other wiring possibilities of the machine were explored. Actually, for some runs the area requirement was cut to 10 areas, so that the cards involved could be studied more easily.
A system of superimposed wiring was developed by Mr. Garfield of the Project staff, that met many of the points we desired. It represented quite a revolutionary point of view as to wiring for such machines and because of this is much more difficult to explain and diagram. The Wiring Diagram for this system, Board "B" is shown in Exhibit No. 2.
In effect what we are asking for here is not an exact matching requirement for a specific code number, as with Board "A", but patterns of punches that would meet the requirements for this code number but might also meet other requirements not desired. In this way the capabilities of the 101 machine are greatly extended but at the same time undesired items are selected (false sorts), and certain other features of specific selection are modified or lost.
Before going on to describe this system in the fullest details with all of its ramifications and resultant possibilities and with the set-up of and findings from trial runs, it might be well to try to illustrate certain of the basic principles in the simplest form possible. The following illustration is based on actual experimental trials but has been somewhat condensed and simplified in order to bring out the essential elements of the system. First let us assume that we are asking only for the answer to a single question, which has three separate subject elements in it. Actually the trial run, of which this was a part, involved nine separate questions, five with five facets or subject elements and four with four facets. It can be realized, though, that in setting up even a question for three facets at once we are asking much more than we could do with Board "A".
Our question asks for the three following 5-digit codes: 49576, 45081, 80832. We want any card that has all three of these code numbers, no matter where they might appear in the designated five-column areas of the card. For purposes of convenience in reading the cards, the trial deck used for this run had random code numbers in only ten 5-column fields, but they could have been in more. The board was wired to search 60 columns. In actual indexing, of course, the number of subject entries would vary per article. So we are searching for a total combination of these three code numbers in any of the ten designated areas on the card.
In this system we ask first (that is, by appropriate wiring of the board, to be described in detail later and as set forth in the diagram) for all those numbers we want in the first columns of all the ten 5-column fields. That is, we say we want any card that, in the first instance, has a 4 and an 8 in columns 1, 6, 11, 16, etc. up to 46. In this example we ask for a single 4 to take care of the initial digit of two codes, since the first and second codes both start with 4. Such duplication happens, of course, frequently and helps to cut down the demands on wiring positions on the board. Next, then, we ask for cards having a 9 and a 5 and a 0 in columns 2, 7, 12, 17, etc., and then for cards having a 5 and a 0 and an 8 in columns 3, 8, 13, 18, etc., and then for cards having a 7 and an 8 and a 3 in columns 4, 9, 14, 19, etc., and then for cards having a 6 and a 1 and a 2 in columns 5, 10, 15, 20, etc. It can be seen that we have asked for more than one digit in each of these series of columns, which is why this is called superimposed wiring, for this is what it involves on the board. But we have demanded that all of these digits appear in one or another (out of ten) of the columns where the digit might be coded. For the proper answer to our question the digits of the three facets of our question, 49576, 45081 and 80832, should each appear in this order in three separate 5-column areas. For instance they might be placed like this on a card with ten 5-column areas used for coding:
In this event the card chosen and any like this, no matter where the given code numbers appear, would answer our requirements and would be the exact answer to our question. It has all the codes we have asked for and there are no combinations of digits in any of the groups of columns that would add up to an artificial match or a false sort. In other words, if one of the required code numbers should not be present, the card would not be selected.
On the other hand, we might have a card that answered the requirements as to certain digits in one or other of the series of five columns, but which actually did not have all or any of the desired code numbers in these positions. The following is an example of such a card that actually was selected from a deck of 1000 random numbered cards:
It will be noted that not one of our demanded code numbers appears above. But the card was selected and is a striking example of a complete false sort. As a matter of fact, in the actual trial run, it answered the requirement for four desired code numbers, none of which appeared on the card. It also was the only false sort in the entire run of 2000 cards. Let us see how this happened. The first number we want is 49576. We see at once that the first number on the unwanted card gives us a 4 in column 1 and a 9 in column 2. We here will designate all columns 1, 6, 11, etc. as column 1 and so on. The fourth number 18(5)89 gives us a 5 in column 3, etc. The ninth number 517(7)14 gives us a 7 in column 4 and the fifth number 2503(6) gives us the 6 in column 5. So we have built up the total requirements for the first code number 49576, although most of the digits are in different 5-column fields. The second demanded code number 45081 can be accounted for, in short, as follows:
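|Digit of code 45081|4|5|0|8|1|
|Area Position on Card|1|10|5|3 and 4|10|

The third demanded code number 80832 is accounted for in the same way:

|Digit of code 80832|8|0|8|3|2|
|Area Position on Card|3|2|3|5 and 8|3|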
Thus we have a card selected which answers all the requirements for matching that have been set up but does not have a single item we really want. Such an occurrence might seem to limit severely such a system of wiring. But as we will see the statistical chances of such an occurrence would seem to be low if the question contains four or more facets and there are means of selecting out the false sorts. There are, though, other adverse features and all of these points will be discussed in detail after we have described the complete trial runs and their results.
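The logic of superimposed wiring, and the way it produces false sorts, can be sketched as a per-column test. This is a model of the matching rule only, not of the wiring of Board "B". A card is represented here as a list of its 5-digit areas; the three wanted codes are those of the illustration above, but both sample cards are invented for the example and are not the cards from the trial run.

```python
def board_b_selects(card_areas, codes) -> bool:
    """Superimposed matching: for each of the five column positions, every
    digit demanded by any wanted code must occur in that position in at
    least one area of the card."""
    for pos in range(5):
        present = {area[pos] for area in card_areas}
        required = {code[pos] for code in codes}
        if not required <= present:
            return False
    return True

codes = ["49576", "45081", "80832"]

# Hypothetical card that really carries all three codes.
true_hit = ["80832", "13579", "49576", "24680", "45081"]

# Hypothetical card carrying none of the codes, yet supplying every
# demanded digit in the right column position.
false_sort = ["49586", "85072", "10831"]

# Hypothetical card lacking an 8 in the first column position.
rejected = ["49576", "45081", "11111"]

print(board_b_selects(true_hit, codes))    # True
print(board_b_selects(false_sort, codes))  # True, although no wanted code is present
print(board_b_selects(rejected, codes))    # False
```

The second card is selected because the test never checks that the digits of one code come from the same area, which is exactly the situation described for the unwanted card above.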
Having established that this wiring system would work, it was then put to fairly large-scale tests in order to determine its maximum potentialities and, at the same time, the nature and degree of adverse results. At this stage the testing was of the machine operation itself and not of the combination of the operation and any coding system. The tests now to be described were carried out on the basis of the arbitrary standards outlined earlier and did not involve searching for actual indexed material.
A trial deck of 2000 cards was produced with 5-digit codes of random numbers in ten 5-column areas, beginning with column 1 and extending through column 50. As will be seen, some other cards extended through twelve 5-column areas or 60 columns. It was felt that testing against a random-numbered deck would set up minimum basic conditions for false sorts. It was realized that similar groups of code numbers based on any indexing coding system and particularly a classified one would modify these conditions. We felt that the random-numbered study would serve as a base line for evaluation of any such future problems.
A number of trial runs were conducted on this deck, some to get statistical findings on false sorts and some to test the potentialities for searching for a number of complicated questions. Following is the description of one of the latter tests. For this test a series of nine arbitrary questions was set up. They were arbitrary in the sense that the desired code numbers were simply numbers and did not have any subject connotation. But they could have had: each number might have been a code for a specific subject item and the total combination might represent a correlated question, such as a drug or two drugs in the treatment of a specific disease in persons of a certain age group. Five of the questions contained five facets or five 5-digit codes and four contained four facets. The questions and the code numbers in them are given below.
Question No. 1  42101 - 86891 - 12126 - 56322 - 01501
Question No. 2  86070 - 34768 - 23556 - 11131 - 84737
Question No. 3  73543 - 43483 - 36642 - 42041 - 22074
Question No. 4  88973 - 46286 - 51550 - 59605 - 47955
Question No. 5  65476 - 42617 - 52811 - 87064 - 97574
Question No. 6  49576 - 45081 - 80832 - 89399
Question No. 7  39485 - 42402 - 83599 - 55215
Question No. 8  261473 - 56929 - 30976 - 61166
To test the searching for these nine questions another deck of cards was produced. For each question ten cards were punched which contained in various areas the code numbers demanded in the question. These code numbers were in different 5-column areas and in different order for each card. The remaining five-column areas were then filled in with 5-digit random numbers, generally through 50 columns, but in a few instances through 60 columns, and for some cards one of the desired code numbers was placed within the 51-55 and 56-60 column areas. A distinguishing punch was put in column 80, so that there could be a double check after the run to be sure that all desired cards had been selected. The layout of one of the test cards for question No. 1 was as below with the position of the code numbers of the question underlined.
It will be noted that the question code numbers are in a different order than that originally given. On all the other cards they appear in different places. All the random numbers are different. The 90 cards of this deck were then interspersed indiscriminately in a deck of 1000 random numbered cards. We should expect then that all of these 90 cards would be selected, ten for each of the nine questions.
The board was wired for a search through 60 columns. The principle is the same as that described above for three facets of one question but of course there is much more wiring to be done. For the 5-facet question one must wire in up to 5 digits for each column of the twelve 5-column areas. For example, in the first question given above, we ask for 4, 8, 1, 5 and 0 in columns 1, 6, 11 etc. As has been pointed out there are times when we do not have to ask for the maximum number of digits for any one column of the 5-column area. It will be noted that, in this question, three of the question codes end with 1- 42101, 86891, and 01501. This means that wiring in one "1" in columns 5, 10, 15, etc. serves for the final digit of all three codes and then only a 6 for 12126 and a 2 for 56322 are required for this series of columns, or the use of only three wiring positions instead of five as was required for the first series of columns. All of the other questions are wired for in a similar way. If a card meets the stated requirements, it is sorted out, those answering question No.1 to pocket 1 and so on. It should be pointed out that the machine is searching the cards for the requirements to the answers to all nine questions at one operation.
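The saving in wiring positions from shared digits can be counted directly. The sketch below takes the five codes of question No. 1 and lists, for each of the five column positions, the distinct digits that must be wired in; the function name is illustrative.

```python
def wiring_digits(codes):
    """For each of the five column positions, the distinct digits to wire in."""
    return [sorted({code[pos] for code in codes}) for pos in range(5)]

question_1 = ["42101", "86891", "12126", "56322", "01501"]

per_position = wiring_digits(question_1)
counts = [len(digits) for digits in per_position]

print(per_position[0])  # ['0', '1', '4', '5', '8']  first column of each area
print(per_position[4])  # ['1', '2', '6']            fifth column of each area
print(counts)           # [5, 3, 4, 3, 3]
print(sum(counts))      # 18 digits per area, against a possible 25
```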
As can be realized and also as can be seen in the series of wiring diagrams, the wiring of the board for this procedure is very complicated and takes a long time and a lot of thought and planning. The board has to be carefully tested by trial cards before starting a long run.
In this first trial run all of the 90 desired cards were selected and sorted to the proper pockets. In addition there was one unwanted card, a false sort. This card, and the reason for it, has been described earlier. There we showed how three of the desired code numbers were accounted for. The fourth code number is 89399. If one refers back to the numbers on the false sort card it will be seen that the third 5-digit number (8)68(8)2 supplies the 8 for the column 1 position, the first number 4(9)6(9)4 the 9 for the column 2 and the column 4 positions, the eighth number 33(3)34 the 3 for the column 3 position and the fourth number 1858(9) the 9 for the column 5 position.
The 90 cards were interspersed again with another deck of 1000 random numbered cards and a second run made, using the same board. Again all of the desired cards were selected and this time there were no false sorts from among the 1000 random numbered cards.
There is another sorting difficulty that may be encountered but is not very likely. This occurs in the event that the same card would contain code numbers that would answer two questions. This is known as a "sort compare". Naturally the machine cannot send the same card to two pockets and it literally signals its dilemma. It stops, a red light goes on and by a special device a red mark is printed on the questioned card. We did not have any "sort compares" in our trial runs but purposely made up a card to test the machine for this point. The card contained all the desired code numbers for question No.1 and also all those for question No.9. When this card was "read", the "sort compare" signals were given. This occurrence emphasizes again the point that the machine searches the card for all questions.
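The pocket routing, including the "sort compare" condition, can be sketched as follows. The superimposed matching rule is used for each question; questions No. 1 and No. 6 from the list above stand in for the pair used on the test card, and the filler areas and function names are invented for illustration.

```python
def matches(card_areas, codes) -> bool:
    """Superimposed matching rule: each demanded digit must occur in its
    column position in at least one area of the card."""
    return all({c[p] for c in codes} <= {a[p] for a in card_areas}
               for p in range(5))

QUESTIONS = {
    1: ["42101", "86891", "12126", "56322", "01501"],
    6: ["49576", "45081", "80832", "89399"],
}

def route(card_areas) -> str:
    pockets = [q for q, codes in QUESTIONS.items() if matches(card_areas, codes)]
    if len(pockets) > 1:
        return "sort compare"  # machine stops, red light, card is marked
    return f"pocket {pockets[0]}" if pockets else "reject"

only_q1 = QUESTIONS[1] + ["77777"] * 5
both = QUESTIONS[1] + QUESTIONS[6]
neither = ["77777"] * 10

print(route(only_q1))  # pocket 1
print(route(both))     # sort compare
print(route(neither))  # reject
```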
Obviously the false sorts were the most serious drawback but they were expected, as an implicit consequence of the method of wiring. The next problem was to find out when and how often they would appear. From a series of trial runs it was found that with two-facet questions 1.5% false sorts resulted, with three-facet questions 0.1%, and with four-facet questions 0.025%. We did not have any false sorts, in our runs, with five-facet questions, but, of course, there is a possibility for them, though very slight. These probabilities were almost better than we had expected but they still presented problems for the general use of this method. These experimental findings were substantiated to some extent by probability analyses of this specific situation.
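A rough estimate of the false-sort rate can be made by inclusion-exclusion. This is not the Project's own probability analysis, which is not reproduced here, but a back-of-envelope model under stated assumptions: every digit on a random card is uniform and independent, each card has ten 5-column areas, the question's codes demand distinct digits in each column position, and a card is selected whenever all demanded digits occur in every position.

```python
from math import comb

def p_position(facets: int, n_areas: int = 10) -> float:
    """P(all `facets` distinct demanded digits occur among the n_areas random
    digits found in one column position), by inclusion-exclusion."""
    return sum((-1) ** k * comb(facets, k) * ((10 - k) / 10) ** n_areas
               for k in range(facets + 1))

def p_false_sort(facets: int, n_areas: int = 10) -> float:
    """P(a random card meets the requirement in all five column positions)."""
    return p_position(facets, n_areas) ** 5

for facets in (2, 3, 4, 5):
    print(facets, f"{100 * p_false_sort(facets):.4f}%")
```

Under these assumptions the model gives roughly 1.2% for two facets, 0.09% for three and 0.006% for four. The first two are of the same order as the observed 1.5% and 0.1%; the four-facet estimate falls below the observed 0.025%, which rested on very few false sorts.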
In evaluating this method in comparison to the others that had been tried it is obvious that it has many very valuable features. In the first place, with any of the techniques using the IBM 101 machine, we can search for a given code over a large area of the card, so that we are rid of the fixed field requirement. This is an essential step in the right direction. In the second place, we can ask a number of questions at once, and questions that have multiple aspects. This means that the speed of searching, as compared to the actual speed of operation of these machines, as expressed in terms of number of cards per minute, is really greatly increased. The margin between the effective speeds of these machines and of those using microfilm or tape is considerably cut down. The card speed of 450-600 per minute which is the standard for most IBM machines seems very slow compared to the reported speeds of other machines. But if we realize that we can ask up to 9 questions at once and questions that contain up to over 40 code numbers, it is seen that we are increasing the effective speed of this machine by from 9 to 40 times in comparison with the other machines. One still, though, has to look at this question of speed in relation to actual searching situations. This proviso applies, as well, to other machines. For instance, if we think of the body of periodical literature indexed in one year by the Current List, some 100,000 articles, we can see that we would need at least 100,000 cards for the machine indexing of this number of articles. Assuming that the entire deck has to be sent through the machine for the answers to nine questions, it can be seen that the actual machine operation would take almost four hours. This is a minimum since additional time would be needed for testing the board, for card handling and for elimination of false sorts.
Only experience in actual practice over a period of time could tell whether this is an excessive length of time required to answer nine complicated questions. It is possible that careful programming based on pre-decking could cut the searching time markedly.
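The running time quoted above follows from the card speed. The sketch below assumes the lower figure of 450 cards per minute from the 450-600 range given in the text and one card per article; the function name is illustrative.

```python
def run_hours(cards: int, cards_per_minute: float) -> float:
    """Machine time, in hours, to pass a deck through once."""
    return cards / cards_per_minute / 60

deck_size = 100_000  # about one year of the Current List, one card per article

hours_at_450 = run_hours(deck_size, 450)
hours_at_600 = run_hours(deck_size, 600)
minutes_per_question = hours_at_450 * 60 / 9  # nine questions answered in one run

print(round(hours_at_450, 2))          # 3.7
print(round(hours_at_600, 2))          # 2.78
print(round(minutes_per_question, 1))  # 24.7
```

At 450 cards per minute a single pass of the deck takes about 3.7 hours, which agrees with "almost four hours"; spread over nine questions this is under 25 minutes of machine time per question, before board testing, card handling and removal of false sorts.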
The method of superimposed wiring for searching for code numbers in itself imposes serious limitations on what we have considered desirable or essential features of a machine searching system, namely, generic coding and searching and correlation between specific subject items. Generic searching would be difficult not only because of the problems of wiring for such codes but also because of the probability of a great increase in the number of false sorts that might result if a classified code on some sort of a decimal or mnemonic basis were used.
With the type of codes required for this wiring system, any indications of correlation would have to be within the code itself. Another possibility would be to use additional separate "function" columns for these indications, somewhat as on the fixed field card described earlier. The first run using superimposed wiring would bring out all the subject items desired, but regardless of correlation. The groups of cards so selected could then be searched on the "function" columns for the desired correlations. Since only 60 columns have been used for code punches, there are ample additional columns for punching for "function" codes and for a serial number for bibliographical identification.
6. Programming for Machine Searching
Careful and detailed programming is essential for the most effective utilization of any machine method, and particularly for a method like the one using superimposed wiring, which in itself presents special problems. Programming involves planning for the entire operation from the formulation of the code to the final searching operation. Since, in the methods we have described, the basic unit is an IBM card, the design of the card is all-important. It must be designed to fit not only the requirements of the code but also of the machines to be employed. The production and organization of the card file must be planned carefully for the most efficient use of the file. Then there is the choice of the machine to be used for different situations, the method of using the machine and how it is to be used, in relation to the features and organization of the card file.
We will discuss now certain features of programming in relation to the machine methods here described, and particularly those using the IBM 101 machine. In the first place we face the problem of false sorts. If this matter is a serious one, as is likely, some way must be found to eliminate the cards so selected, without leaving it to the person asking a question to find that he has a number of unwanted references. A solution would be to use Board "A" to weed out these unwanted cards. After the total deck has been searched by Board "B", the cards in each pocket, containing the wanted and unwanted answers to a single question, could then be sent through the machine, with Board "A" wired for a direct search for one facet of that question and, then, if necessary, for other runs for the other facets. Since, as has been pointed out, the necessary wiring changes, for this board, can be done very quickly and the number of cards involved would be relatively small, the additional searching time would be minimal.
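The two-stage procedure just described, a broad run with Board "B" followed by exact runs with Board "A" on each pocket, can be sketched as below. Cards are modelled as lists of their 5-digit areas; the sample deck, including the false-sort card, is invented for illustration and the function names are not the Project's.

```python
def board_b(card_areas, codes) -> bool:
    """Superimposed test: demanded digits present, position by position."""
    return all({c[p] for c in codes} <= {a[p] for a in card_areas}
               for p in range(5))

def board_a(card_areas, code) -> bool:
    """Exact test: the code fills one whole area."""
    return code in card_areas

def search(deck, codes):
    # First run: the whole deck against Board "B".
    pocket = [card for card in deck if board_b(card, codes)]
    first_run = len(pocket)
    # Further runs: only the pocket, one facet at a time, against Board "A".
    for code in codes:
        pocket = [card for card in pocket if board_a(card, code)]
    return pocket, first_run

codes = ["49576", "45081", "80832"]
true_hit = ["80832", "13579", "49576", "45081"]
false_sort = ["49586", "85072", "10831"]  # passes Board "B" only
deck = [true_hit, false_sort, ["11111", "22222"]]

verified, first_run = search(deck, codes)
print(first_run)  # 2 cards reach the pocket after the Board "B" run
print(verified)   # [['80832', '13579', '49576', '45081']]
```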
The fact that there are no false sorts with Board "A" suggests the possibility of using it for searches of 2-facet and even 3-facet questions, where we get a higher proportion of false sorts with Board "B". In such searches one would wire in for the first run the code for that subject item with the lowest expectancy of occurrence as an entry on the total number of cards in the deck. Thus if we were searching for the use of any antibiotic or other group of drugs, in the treatment of a certain rather rare disease, we would search first for the code number for that disease, since we could well expect that fewer articles would be written on the disease than on all antibiotics and so we would have the smallest possible number of cards for the next searches for the second or third facets of the question.
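The advantage of searching first for the rarest facet can be shown by counting the cards read in successive Board "A" runs. The deck below is a small invented example (100 cards, a common drug code on 40 of them and a rare disease code on 3); the codes and function names are hypothetical.

```python
def successive_search(deck, ordered_codes):
    """Run one exact search per code, each on the survivors of the last.
    Returns the final cards and the total number of cards read."""
    remaining, cards_read = deck, 0
    for code in ordered_codes:
        cards_read += len(remaining)
        remaining = [card for card in remaining if code in card]
    return remaining, cards_read

DRUG, DISEASE = "30000", "90001"  # hypothetical codes

deck = []
for i in range(100):
    card = []
    if i < 40:
        card.append(DRUG)     # common entry
    if i in (0, 1, 50):
        card.append(DISEASE)  # rare entry
    deck.append(card)

rare_first, read_rare_first = successive_search(deck, [DISEASE, DRUG])
common_first, read_common_first = successive_search(deck, [DRUG, DISEASE])

print(len(rare_first), read_rare_first)      # 2 103
print(len(common_first), read_common_first)  # 2 140
```

Both orders give the same two cards, but starting with the rare disease code means only 3 cards go through the second run instead of 40.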
As a corollary, in a sense, we might organize our deck of cards on the basis of the number of subject entries punched in them. This pre-decking might be done at the time the cards are produced or later by sorting on a column punched to indicate number of entries or by using the 101 to sort on the basis of the 5-column punching areas. If we had such a pre-sorted deck, it might then be possible to use a modification of Board "A" to search completely, at one run, those cards with only a few subject entries, that is having codes punched in a limited number of the 5-column areas. This method would probably only be applicable to questions with two or three facets and even then the rest of the deck would have to be searched by one or other of the methods described above. It might, though, save considerable time.
The principle of pre-decking has many other applications. It illustrates one of the advantages of cards over film and tape. Many of these procedures would be almost impossible with film or tape as would be any changes or corrections. One way a file of cards could and probably should be pre-decked is by date. If one should want only the most recent developments in a phase of research it would be wasteful to send through a deck of 500,000 cards when one of 100,000 cards would give the answers.
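Pre-decking by date can be sketched as a grouping step followed by a search restricted to the recent groups. The text does not say how the date would be recorded on the card, so the "year" field, the card records and the function names below are all assumptions made for the example.

```python
from collections import defaultdict

def pre_deck_by_year(cards):
    """Group cards into separate decks, one per year."""
    decks = defaultdict(list)
    for card in cards:
        decks[card["year"]].append(card)
    return decks

def search_recent(decks, since_year, code):
    """Search only the decks from since_year onward; count the cards read."""
    hits, cards_read = [], 0
    for year in sorted(decks):
        if year < since_year:
            continue
        cards_read += len(decks[year])
        hits += [c["serial"] for c in decks[year] if code in c["codes"]]
    return hits, cards_read

cards = [
    {"serial": 1, "year": 1950, "codes": ["59673"]},
    {"serial": 2, "year": 1953, "codes": ["59673", "10442"]},
    {"serial": 3, "year": 1954, "codes": ["20311"]},
    {"serial": 4, "year": 1954, "codes": ["59673"]},
]

decks = pre_deck_by_year(cards)
hits, cards_read = search_recent(decks, 1953, "59673")
print(hits, cards_read)  # [2, 4] 3
```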
We have spoken, heretofore, only of subject coding, but we also have the problem of searching for articles by author. It is likely that a deck of cards, coded for authors, would be kept separate and searched separately.
Another way of pre-decking that might be very useful is that on the basis of major categories. Considering the nature of medical literature it could well be that a high proportion of the subject entries might be confined to one or two categories. If we could separate out those cards that had entries only for one or two of the major categories, say drugs and disease, it would facilitate the searches for questions involving only the other categories. Conversely, we could facilitate searches for questions involving the major categories if we had duplicate decks of all cards with entries for subjects in those categories. Reproduction of cards that have already been punched is relatively simple, but there would be additional problems of storage.
All of these features of the organization of the card file would have to be carefully coordinated in the final programming for the machine operation. The total result of all these phases of programming should be more efficient and faster searching of the literature, as represented in the file of IBM cards coded for the desired indexed information.
7. Practical Applications with Indexed Material.
It had always been the plan to make practical tests of the machine methods developed, by having pilot runs using material specifically indexed and coded for machine searching. The original idea had been to re-index in greater detail and depth a large number of the articles previously indexed by the Project for the Cumulative Index to the Bulletin of the Johns Hopkins Hospital and to code them for machine operations. The imminent termination of the Project precluded any extensive work along these particular lines. We were fortunate, though, in having at hand, as a by-product of two other aspects of the Project's program, a fairly complete coding system and a large body of material already indexed and coded for machine operations. In the work on the subject heading authority for the Current List, we had assigned consecutive serial numbers for specific subject headings (and cross references) on the basis of a straight alphabetical list and also had assigned category codes for the subject headings, which were used for grouping the headings in rather detailed subcategories. These codes were then adapted for the machine codes used in the studies on the preparation of the subject index to the Current List by machine methods. There was also a 2-digit code for the standard subheadings used by the Current List. The details of both of these studies have been described in full in separate reports and we will be concerned here only with the use and application of this material, so coded, for machine searching studies.
Since, in the study of the preparation of the indexes to the Current List, we wished to duplicate by machine methods what the Current List did by other methods, the operations were based on material as actually indexed by the Current List staff. The Editor of the Current List, Mr. Seymour I. Taine, very kindly made available to us the indexing slips prepared for the April 1952 number of the Current List. On these slips the indexer puts down the author's name, the title of the article in the original language with a translation into English if necessary, the indicated subject entries, subheadings and the "modification" or descriptive phrase for that specific article, in relation to the chosen subject headings.
Below is an example of such a slip, prepared for an article appearing in Medicine et Hygiene (Geneva) for 15 November 1951 and having the register number 29501 in the April 1952 Current List. It has been slightly edited to conform to the subject entries as they actually appeared in the subject index of that number.
Indications respectives des antihistaminiques de synthese et de certaines vitamins dans le traitement des dermatoses allergiques.
Synthetic antihistamins and certain vitamins in allergic dermatosis therapy.
Allergy, antihistamin and vitamin ther.
Antihistamin and vitamin, in allergic
VITAMINS, therapeutic use,
ANTIHISTAMINICS, therapeutic use
It might be pointed out that the above is an example of rather detailed indexing for the Current List.
Code numbers were assigned for each subject heading and subheading, and category codes assigned where appropriate. Thus for the heading SKIN, diseases, the specific serial number for SKIN, 76130, and the subheading code 42 for diseases are assigned. From the category code for anatomical terms, the subcategory code 02295, integument, is chosen. Similarly:
DERMATITIS, therapy 09570
VITAMINS, therapeutic use 0310
ANTIHISTAMINICS, therapeutic use
There could also have been a category code for ANTIHISTAMINICS, 03310. It will be seen that we have a 5-digit code for specific subject headings, which is based on an alphabetical list, a 2-digit code for subheadings, and a generic category code. It should be noted that there is no coding for additional information that might appear in the "modification".
For the machine searching study all of the code numbers for a given article plus the register number, for identification, were punched on one IBM card. The set up of the card for the above article is as below:
Specific serial codes: | Skin | diseases | Dermatitis | Therapy | Vitamin | Ther. use | Antihistaminics | Ther. use |
Category codes: | Skin | Skin Diseases | Vitamins |
With the cards as punched in the manner shown above, one could search for the specific subject, in each 5-column area, or one could search for this specific code plus the subheading, over seven columns, or for the subheading code alone over two columns. It should be pointed out that almost one-half of the Current List standard subheadings are also main subject headings and so one might want to search for them in either of the two codes. In addition, one could use the category codes for combinations of generic searching with the specific code numbers.
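The several widths of search described above can be illustrated with a short sketch. It assumes, for illustration only, that each subject entry occupies a 7-column field made up of the 5-digit serial code followed by the 2-digit subheading code. The codes 76130 (SKIN), 42 (diseases) and 02295 (integument) and the register number 29501 are taken from the example in the text; the second subject field, 11111 with subheading 22, is a hypothetical placeholder, and the function names are introduced here.

```python
def make_card(register, subject_fields, category_codes):
    """Model a card as a register number, 7-column subject fields and category codes."""
    for field in subject_fields:
        if len(field) != 7 or not field.isdigit():
            raise ValueError(f"subject field must be 7 digits: {field!r}")
    return {
        "register": register,
        "subjects": list(subject_fields),
        "categories": list(category_codes),
    }

def has_serial(card, serial):
    """Search the 5-column serial code of each subject field."""
    return any(field[:5] == serial for field in card["subjects"])

def has_serial_with_subheading(card, serial, subheading):
    """Search all seven columns: serial code plus subheading code."""
    return any(field == serial + subheading for field in card["subjects"])

def has_subheading(card, subheading):
    """Search only the 2-column subheading code of each subject field."""
    return any(field[5:] == subheading for field in card["subjects"])

def has_category(card, category):
    """Generic search on the category codes."""
    return category in card["categories"]

# "7613042" is SKIN (76130) with subheading diseases (42), from the text.
# "1111122" is a hypothetical second entry, for illustration only.
card = make_card("29501", ["7613042", "1111122"], ["02295"])

print(has_serial(card, "76130"))                        # 5-column search
print(has_serial_with_subheading(card, "76130", "42"))  # 7-column search
print(has_subheading(card, "22"))                       # 2-column search
print(has_category(card, "02295") and has_serial(card, "11111"))  # generic plus specific
```

Strings rather than integers are used for the codes so that leading zeros, as in 02295, are preserved.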
In one trial run we asked for all cards which answered to the requirement for codes for Neoplasms and Surgery, either as a main subject heading, the specific 5-digit code, or as a subheading, the 2-digit code. When we formulated the question we had in mind the concept surgery of neoplasms. All but one of the cards were for articles related to this question, and it was our opinion that the machine search gave us answers that would have taken a good deal more time to work out by searching the printed index. But the one card pointed out a serious difficulty. It was for an article on a neoplastic condition following surgery. In other words, the coding gave no indication of the correlation between the two subjects. It might be said, and actually has been said, that one undesired reference among 50 or 60 correct ones was of no consequence. On the other hand, though, if one had asked the question "neoplasms following surgery", the 50 or 60 wrong answers would be a serious problem. We feel that this example points out, rather strikingly, the need for indications, by the code, of correlations.
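The difficulty can be made concrete with a small sketch. The report does not say how correlations would be coded, so the representation below is purely an assumption for illustration: each card carries, besides its set of codes, a set of hypothetical (subject, relation, subject) triples. The code names, relation names and the three sample cards are likewise invented.

```python
# Hypothetical stand-ins; the actual code numbers are not given in the text.
NEOPLASMS = "NEOPLASMS"
SURGERY = "SURGERY"

cards = [
    # Article on surgery of neoplasms.
    {"register": 1, "codes": {NEOPLASMS, SURGERY},
     "correlations": {(SURGERY, "applied_to", NEOPLASMS)}},
    # Article on a neoplastic condition following surgery.
    {"register": 2, "codes": {NEOPLASMS, SURGERY},
     "correlations": {(NEOPLASMS, "following", SURGERY)}},
    # Article on neoplasms with no surgical aspect.
    {"register": 3, "codes": {NEOPLASMS}, "correlations": set()},
]

def coordinate_search(deck, *codes):
    """Select cards punched with all of the codes, in any relation to one another."""
    wanted = set(codes)
    return [card["register"] for card in deck if wanted <= card["codes"]]

def correlated_search(deck, correlation):
    """Select cards on which the stated correlation itself is coded."""
    return [card["register"] for card in deck if correlation in card["correlations"]]

plain = coordinate_search(cards, NEOPLASMS, SURGERY)
surgery_of = correlated_search(cards, (SURGERY, "applied_to", NEOPLASMS))
after_surgery = correlated_search(cards, (NEOPLASMS, "following", SURGERY))
print(plain, surgery_of, after_surgery)
```

The plain coordinate search cannot tell cards 1 and 2 apart, which is the false answer met in the trial run; only the coded correlation separates the two questions.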
The major conclusion, though, from these trial runs was the realization that there were possibilities of combining the usual indexing procedures for a printed index with coding for machine operations. We would then have at least the beginning of a set up for machine searching. In the example shown above, which is typical of all the articles coded, there is no indexing beyond that done normally by the Current List. If the indexes were produced by machine methods, most of the coding also would be done in advance. One more desirable step would be the coding of additional items in the "modifications".
These trial runs demonstrated what the IBM 101 machine could do with a file of cards based on actual indexed material. The results also suggested that a certain degree of specificity of selection would be attained by the very nature of the question itself, that is, by the combined focusing effect of the several facets of a question. It might be that one could ask a question containing one or two specific facets and one generic one and yet get a fairly specific answer. It might be compared to shooting at a target with a rifle or a choke-bore shot gun.
On the basis of what has just been said, it is suggested that the first step in setting up a large-scale machine information searching system for actual reference purposes might very well be the combination of machine operations for assembling the material for a printed index and for information searching. Thus, if the Armed Forces Medical Library wished to have a practical trial of machine methods for its own reference work, such a system could be tied up with the indexing procedures of the Current List of Medical Literature. If the indexes to the Current List were assembled by machine methods, as described in another Project report, we would have, as we have seen in the above description of the practical trial runs, a fairly adequate basis for setting up a machine searching system, at least as a beginning.
The actual subject indexing of articles would be done in the same way as at present, with the subsequent steps of arranging and assembling the subject entries performed by machine methods. The coding required for these machine operations could probably be applied without much change to machine searching methods. As has been suggested, it might be advisable to code for any additional information in the "modifications"; this information is already set forth there, so that no further indexing would be required. In actual operation it might be found necessary to make some modifications in the codes to take care of special features of the machines, such as correlation or superimposed wiring. The greater part, though, of these two essential and time-consuming steps, indexing and coding, would be the same for the two quite different operations.
Aside from these advantages of expediency in saving time and effort, such a combined system has other advantages. As reference questions are asked and answered, it could be determined how satisfactory such indexing and coding was for machine searching and what more might be required. Such studies might be useful in evaluating the indexing methods and principles for the printed index itself.
It is our opinion that such a practical trial would be the best way, and perhaps the only way, to get final answers to the unanswered questions raised in this report: such questions as those of coding, of what machines and methods of use are best for specific situations, and of programming. Above all we would need such a trial to answer the fundamental question: just what are those needs of scientific research that might be met by machine methods and that may or may not be met by our present indexing and abstracting services. Here we would have the two systems working together on the same material and thus would have a sound basis for comparison.
Such a set up could, of course, be only the beginning. One can envisage a central information center for all science, using machine methods. Here we could have complete coverage of all of the important scientific literature, so that all of the really valuable articles for any field, no matter where and how published, would be available to the worker in that field. We know that this is a problem of some importance to medical research today, with the demand for research information from widely scattered fields.
We believe that, for any of these purposes, there are very definite possibilities in the machine methods developed by the Project, and by others, particularly with the expected future improvements in these machines or by the application of the principles of these methods to other machines. Machine methods might have other applications in the field of medicine, such as the compilation of bibliographies on special subjects, the correlation and organization of basic factual information from many sources, and the analysis of various forms of medical records.