Intro: Proteins are large biological molecules consisting of amino acids. In general, the genetic code specifies 20 standard amino acids. This assignment is based on the systematic exploration of the distribution of certain amino acids in proteins’ structures.
3-letter codes of the 20 standard amino acids:
ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL
Your task
Implement a program invoked like:
program_name configuration_file output_file
The command line always contains the configuration file name and may also contain an output file name. If no output file name is given, standard output should be used.
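A minimal sketch of the command-line handling described above, in Python. The function name `parse_args` is hypothetical, and raising `ValueError` when the configuration file name is missing is an assumption, since the assignment states that it is always present.

```python
def parse_args(argv):
    """Split argv into (config_path, output_path).

    argv[0] is the program name, argv[1] the configuration file and
    argv[2], when present, the output file. output_path is None when
    standard output should be used.
    """
    if len(argv) < 2:
        # Assumption: the assignment says this never happens.
        raise ValueError("missing configuration file name")
    config_path = argv[1]
    output_path = argv[2] if len(argv) > 2 else None
    return config_path, output_path
```

A caller would typically pass `sys.argv` and open `output_path` for writing only when it is not `None`.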
Both the configuration file and the data files are row-oriented.
Data file structure
One file describes one protein and contains information about all its amino acids and their spatial coordinates (x, y, z – discrete values). Each row begins with the 3-letter amino acid code and continues with the spatial coordinates for that amino acid.
Keep in mind: coordinates can be negative, and the fields on a line are separated by one or more whitespace characters.
(Note: This is a simplification of a real PDB file describing protein structures).
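One possible way to parse and validate a single data row, sketched in Python. The name `parse_row` is hypothetical. The sketch assumes that codes must be uppercase, that a coordinate is an optional minus sign followed by ASCII digits (so a leading `+` is rejected), and that leading zeros such as `0298` are acceptable because they occur in the sample data file.

```python
import re

AMINO_ACIDS = frozenset(
    "ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE "
    "LEU LYS MET PHE PRO SER THR TRP TYR VAL".split()
)
# Assumption: optional minus sign, then ASCII digits only.
_INTEGER = re.compile(r"-?[0-9]+")


def parse_row(line):
    """Return (code, (x, y, z)) for one data row or raise ValueError."""
    fields = line.split()  # splits on any run of whitespace
    if len(fields) != 4:
        raise ValueError("expected a code and three coordinates")
    code, *raw = fields
    if code not in AMINO_ACIDS:
        raise ValueError("unknown amino acid code")
    if not all(_INTEGER.fullmatch(value) for value in raw):
        raise ValueError("non-numeric coordinate")
    x, y, z = (int(value) for value in raw)
    return code, (x, y, z)
```

The regular expression is used instead of calling `int()` directly because `int()` also accepts forms such as `1_000` or `+5`, which are probably not intended here.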
Configuration file structure
R-neighborhood .. is an integer
Pattern ......... is a sequence of one or more amino acids separated by one or more whitespaces
protein_1 ....... is the name of the first data file
...
protein_N ....... is the name of the N-th data file
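A hedged sketch of reading this layout in Python. The name `parse_config` is hypothetical. The sketch assumes that R is non-negative, that each data file name occupies its own row, and that blank rows in the file list are ignored. None of these points is stated explicitly in the assignment.

```python
import re

AMINO_ACIDS = frozenset(
    "ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE "
    "LEU LYS MET PHE PRO SER THR TRP TYR VAL".split()
)


def parse_config(text):
    """Return (radius, pattern, file_names) or raise ValueError."""
    rows = text.splitlines()
    if len(rows) < 2:
        raise ValueError("configuration needs R and a pattern")
    r_field = rows[0].strip()
    # Assumption: R is a non-negative integer written with ASCII digits.
    if not re.fullmatch(r"[0-9]+", r_field):
        raise ValueError("R is not an integer")
    pattern = rows[1].split()
    if not pattern or any(code not in AMINO_ACIDS for code in pattern):
        raise ValueError("invalid pattern")
    # Assumption: one file name per row, blank rows skipped.
    file_names = [row.strip() for row in rows[2:] if row.strip()]
    return int(r_field), pattern, file_names
```

Whether a configuration with no data file names should count as a syntax error is left open here; the sketch accepts it and returns an empty list.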
Histogram
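The R-neighborhood represents the neighborhood of a certain amino acid at a distance less than or equal to R. The R-neighborhood of an amino acid with coordinates [x,y,z] is defined as the set of points whose coordinates all lie in the range [x-R..x+R, y-R..y+R, z-R..z+R], bounds included. It is therefore an axis-aligned cube with edge 2R centered on the amino acid, not a sphere of radius R.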
A histogram is constructed over the points in discrete 3D space at which an amino acid from the set of specified proteins is located. For each such point, we count, for each amino acid type listed in the pattern, how many amino acids of that type lie in its R-neighborhood (in the example below, the amino acid at the point itself is counted as well). Let these numbers be (in the order given by the pattern) [c1..cn]. Then the record corresponding to the values [c1..cn] is incremented. The resulting histogram is created by incrementing the records in this way for the R-neighborhoods of all points corresponding to the amino acids of all input proteins.
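The definition can be illustrated with a deliberately naive Python sketch that compares every amino acid with every other one. It is meant as a reference for checking results, not as the efficient solution the assignment asks for. The name `histogram_bruteforce` and the `(code, (x, y, z))` tuple layout are assumptions of this sketch.

```python
from collections import Counter


def histogram_bruteforce(residues, pattern, radius):
    """Count, for every residue, the pattern types in its R-neighborhood.

    residues is a list of (code, (x, y, z)) tuples from all proteins.
    Returns a Counter mapping (c1, ..., cn) to its number of occurrences.
    """
    hist = Counter()
    for _, (x, y, z) in residues:
        # The residue itself satisfies the test, so it is counted too.
        seen = Counter(
            code
            for code, (nx, ny, nz) in residues
            if abs(nx - x) <= radius
            and abs(ny - y) <= radius
            and abs(nz - z) <= radius
        )
        hist[tuple(seen[code] for code in pattern)] += 1
    return hist


# First four rows of the example data file (simple.pdb).
sample = [
    ("ARG", (14872, -18107, 30327)),
    ("LYS", (16112, -17325, 26790)),
    ("HIS", (17615, -20594, 25563)),
    ("ILE", (18797, -24042, 26472)),
]
print(histogram_bruteforce(sample, ["ARG", "LYS"], 6000))
```

Run on all nine rows of simple.pdb, this sketch yields the five records of the sample output plus one `[0 0]` record for the final ALA. The sample output does not list that record, which suggests that all-zero records are dropped, but this is an inference from the example.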
Output
The output format is row-oriented; each line has the form
[c1 c2 ... cn]: number of occurrences
The output is sorted lexicographically, i.e.
[0 0 0 1]: xxx
[0 0 0 2]: xxx
...
[0 0 1 0]: xxx
[0 0 1 2]: xxx
...
[0 0 2 1]: xxx
Only non-zero occurrences are included in the output.
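A short Python sketch of the output step. The name `format_histogram` is hypothetical. It assumes that the histogram is a mapping from count tuples to occurrence counts, that "lexicographically" means element-wise numeric comparison of the count vectors, and that the all-zero vector is omitted. The last point is inferred from the example output, not stated outright.

```python
def format_histogram(hist):
    """Return the output lines for a {(c1, ..., cn): occurrences} mapping."""
    lines = []
    # Tuples of ints compare element by element, i.e. lexicographically.
    for key in sorted(hist):
        occurrences = hist[key]
        # Assumption: skip empty records and the all-zero vector.
        if occurrences > 0 and any(key):
            lines.append("[" + " ".join(map(str, key)) + "]: " + str(occurrences))
    return lines


example = {(2, 0): 3, (0, 0): 1, (1, 1): 2, (1, 0): 1, (3, 0): 1, (2, 1): 1}
print("\n".join(format_histogram(example)))
```

Sorting the tuples numerically places `[0 2]` before `[0 10]`, whereas sorting the formatted strings would not.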
Example
Configuration file:
6000
ARG LYS
simple.pdb
Data file (simple.pdb):
ARG 14872 -18107 30327
LYS 16112 -17325 26790
HIS 17615 -20594 25563
ILE 18797 -24042 26472
ARG 21860 -24523 24296
ARG 24156 -21734 23132
GLY 27393 -22378 21345
HIS 29225 -19697 19391
ALA 32741 -18808 18304
Output:
[1 0]: 1
[1 1]: 2
[2 0]: 3
[2 1]: 1
[3 0]: 1
One of the data files used in tests was provided through an attached link. Since the link will probably die soon, the file contents are copied at the bottom.
Assumptions and efficiency requirements
The discrete 3D space where all the amino acids are located is large, think on the order of 100000^3. It is therefore not possible to store in memory a map with data for every point in this space.
Space filling with amino acids is very sparse. Assume tens to small hundreds of amino acids (occupied points in space). Therefore, choose a suitable data representation so that the necessary operations are as efficient as possible.
It is certainly not efficient to search every point of the entire space for each amino acid, nor to go through all other amino acids entered.
You may find it useful to observe that for each amino acid in each dimension there are sufficiently few other amino acids in the range of R-neighborhoods (i.e., in the subspace [x-R..x+R, *, *]) that one can already search sequentially.
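The hint above could be realized, for instance, by sorting the amino acids by x once, locating the slab [x-R..x+R] with binary search and scanning only that slab sequentially. The following Python sketch shows one such approach under that reading of the hint. It is not necessarily the intended solution, and the name `histogram_sorted_x` and the `(code, (x, y, z))` layout are assumptions.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def histogram_sorted_x(residues, pattern, radius):
    """Histogram of pattern counts, scanning only the x-slab of each residue."""
    by_x = sorted(residues, key=lambda residue: residue[1][0])
    xs = [coords[0] for _, coords in by_x]
    hist = Counter()
    for _, (x, y, z) in by_x:
        # Binary search for the residues with x in [x - R, x + R].
        low = bisect_left(xs, x - radius)
        high = bisect_right(xs, x + radius)
        # Sequential filter of the slab on the remaining two axes.
        seen = Counter(
            code
            for code, (_, ny, nz) in by_x[low:high]
            if abs(ny - y) <= radius and abs(nz - z) <= radius
        )
        hist[tuple(seen[code] for code in pattern)] += 1
    return hist


rows = [
    ("ARG", (0, 0, 0)),
    ("LYS", (5, 5, 5)),
    ("ARG", (5, 50, 5)),
    ("HIS", (100, 0, 0)),
]
print(histogram_sorted_x(rows, ["ARG", "LYS"], 10))
```

Sorting costs O(n log n) and each query costs O(log n) plus the size of its slab, which stays small when the data are as sparse as the assignment describes.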
Configuration and data file syntax checking requirements
The primary evaluation criterion is functional correctness and efficiency on correctly entered data. The program must be stable (i.e. not perform any undefined operations, have unhandled exceptions, exit uncontrollably, etc.) on any (i.e. arbitrarily corrupted) data.
To receive the full number of points, the program must check the syntax of the configuration and data files. If the syntax is violated, the program writes the string "error" (to the output file or to standard output, according to the command-line parameters) and ends with a return code of 0. Syntax violations include anything other than a valid 3-letter amino acid code, a wrong number of coordinates, non-numeric characters at coordinate positions, etc.
If any data file specified in the configuration file cannot be opened (e.g. because it does not exist), this is not considered an error; simply skip the file. Being unable to open the configuration file is an error.
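The error rules above might be wired together as in the following Python sketch. The names `read_rows_or_skip`, `run` and the `build_report` callback are hypothetical. The callback stands for whatever code parses the configuration text, loads the data files and returns the output lines, raising `ValueError` on a syntax violation. What should happen when the output file itself cannot be opened is not specified, and this sketch does not handle that case.

```python
import sys


def read_rows_or_skip(path):
    """Return the rows of a data file, or None when it cannot be opened."""
    try:
        with open(path, encoding="ascii") as handle:
            return handle.read().splitlines()
    except OSError:
        return None  # an unopenable data file is skipped, not an error


def run(config_path, output_path, build_report):
    """Write the report, or "error", and always return exit code 0."""
    try:
        with open(config_path, encoding="ascii") as handle:
            config_text = handle.read()
        lines = build_report(config_text)
    except (OSError, ValueError):
        # OSError: the configuration file could not be opened.
        # ValueError: syntax violation (also covers non-ASCII bytes).
        lines = ["error"]
    text = "".join(line + "\n" for line in lines)
    if output_path is None:
        sys.stdout.write(text)
    else:
        with open(output_path, "w") as out:
            out.write(text)
    return 0
```

Collecting all output lines before writing anything means that an error found in a later data file cannot leave partial results followed by "error" in the output.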
Data file downloadable from the link:
GLY -5902 73707 44647 PRO -6264 73743 40764 TYR -3705 71988 38542 LEU -3494 70898 34880 VAL -2843 67241 34000 ILE -2073 65571 30665 VAL -5000 63138 30106 GLU -3371 61798 26891 GLN 0298 62526 26068 PRO 1548 63200 22516 LYS 2971 60182 20620 GLN 6732 60114 21213 ARG 7588 58664 17782 GLY 6245 58612 14266 PHE 5232 62279 13870 ARG 6704 64364 11105 PHE 7776 67992 11705 ARG 7017 70083 8594 TYR 9026 72923 7063 GLY 7212 76167 6207 CYS 7470 75226 2529 GLU 5463 71982 2791 GLY 2067 73278 3865 PRO -0149 73817 6887 SER -2581 70903 6312 HIS -0429 67805 6799 GLY -1985 66380 9962 GLY -2112 67101 13681 LEU 0271 65758 16340 PRO -1704 62677 17591 GLY -2080 61172 21099 ALA -0097 58364 22710 SER -2608 55734 21533 SER -3856 56280 17981 GLU -4190 52984 16070 LYS -6365 52493 12940 GLY -9823 54116 12842 ARG -9728 55799 16290 LYS -7067 58485 16685 THR -6530 60460 19863 TYR -4810 63883 20210 PRO -2865 65278 23197 THR -5445 66082 25941 VAL -5149 67919 29242 LYS -7630 68753 31976 ILE -7906 71378 34717 CYS -8719 69194 37749 ASN -10404 70905 40716 TYR -11209 73822 38373 GLU -11458 77453 39415 GLY -13284 79824 37052 PRO -13557 79329 33267 ALA -9954 79073 32030 LYS -7299 79768 29439 ILE -4861 77005 28304 GLU -1846 77730 26095 VAL 0660 75301 24561 ASP 4077 75922 23036 LEU 7228 74017 22190 VAL 9966 73747 24855 THR 13386 72129 24601 HIS 14112 68506 25486 SER 16709 69869 27923 ASP 15438 70035 31507 PRO 14855 73696 32442 PRO 12178 73987 29700 ARG 13172 76853 27501 ALA 11308 77859 24303 HIS 12261 75907 21143 ALA 13204 77563 17826 HIS 10407 75400 16314 SER 6815 76712 16480 LEU 3416 75054 16816 VAL 1340 75471 13671 GLY -2446 75104 13090 LYS -5815 75395 14808 GLN -5961 78357 17279 CYS -2120 78878 17123 SER -0589 82386 16999 GLU 2146 83357 14620 LEU 4788 82755 17230 GLY 3830 79151 17958 ILE 1670 79732 21043 CYS -1672 77951 21035 ALA -4276 79584 23267 VAL -7716 78245 24030 SER -10487 78742 26569 VAL -11923 75715 28467 GLY -15648 76296 29200 PRO -17220 76803 32657 LYS -18205 73099 32658 ASP -16010 70721 30632 
MET -12476 71383 31856 THR -10707 69209 29289 ALA -8884 70494 26198 GLN -8229 68097 23312 PHE -5664 69825 21044 ASN -7002 68392 17826 ASN -4686 70039 15291 LEU -1173 70852 16446 GLY 1745 70832 14001 VAL 5462 71549 14380 LEU 6882 74031 11870 HIS 10501 73041 11348 VAL 13044 75904 11296 THR 15901 75530 8833 LYS 19433 75655 10285 LYS 19805 78943 8383 ASN 16548 80073 10015 MET 17317 79109 13580 MET 19235 82218 14690 GLY 16743 84782 13422 THR 13784 82759 14770 MET 15698 82476 18034 ILE 16634 86225 18473 GLN 12996 87040 17579 LYS 11470 84834 20294 LEU 14178 86005 22727 GLN 13369 89704 22123 ARG 9699 88988 22587 GLN 10682 87286 25841 ARG 12753 90306 26946 LEU 10048 92859 26526 ARG 7623 90445 28155 SER 7480 90991 31936 ARG 10089 93811 31835 PRO 10058 97252 29870 GLN 9652 98057 26142 GLY 12949 98079 24275 LEU 16442 96818 23593 THR 19556 98978 23660 GLU 22375 97823 21343 ALA 24145 96236 24282 GLU 20912 94388 25119 GLN 20607 93331 21481 ARG 24262 92206 21301 GLU 23787 89912 24374 LEU 20410 88680 23083 GLU 22215 87556 19909 GLN 24750 85902 22229 GLU 22082 83603 23681 ALA 20996 82677 20179 LYS 24443 81409 19128 GLU 25101 79517 22353 LEU 21496 78247 22381 LYS 21766 76912 18827 LYS 24900 74822 19619 VAL 23010 72824 22323 MET 19482 72631 20887 ASP 18136 69379 19421 LEU 16045 70184 16340 SER 14729 66589 16289 ILE 13135 66618 19726 VAL 10525 68948 21257 ARG 8191 68845 24255 LEU 4723 70383 24547 ARG 4344 72756 27479 PHE 0787 73383 28638 SER 0296 76558 30645 ALA -3150 76794 32224 PHE -4588 80131 33465 LEU -7519 80693 35772 ARG -9333 83809 34513 SER -4554 86056 33994 LEU -3785 83785 37002 PRO -1175 81039 36386 LEU -1311 77353 37450 LYS 1867 75230 37378 PRO 3065 74417 33818 VAL 2985 70745 32807 ILE 5618 69450 30341 SER 4877 66389 28188 GLN 7291 63642 27201 PRO 9670 64213 24228 ILE 8346 64193 20665 HIS 10762 62906 18060 ASP 10791 64095 14467 SER 10061 61036 12283 LYS 12033 62736 9520 SER 15216 62997 11542 PRO 17245 59950 10317 GLY 17438 58727 13939 ALA 13952 59279 15418 SER 11119 57328 
13782 ASN 9027 54290 14618 LEU 11099 51143 14313 LYS 9026 49054 11907 ILE 9506 45573 10446 SER 7956 45698 6957 ARG 8694 42101 6048 MET 11300 39415 6458 ASP 12873 36586 4550 LYS 12076 33624 6745 THR 10020 33044 9879 ALA 11498 29718 10916 GLY 14891 28175 11319 SER 16964 25704 13281 VAL 17143 25680 17052 ARG 20963 25794 16510 GLY 20824 29465 15507 GLY 22848 30896 12662 ASP 20178 31069 9967 GLU 20493 34243 7847 VAL 17354 36452 7654 TYR 16848 39532 5394 LEU 14658 42018 7295 LEU 13119 45024 5485 CYS 12551 47987 7815 ASP 11835 51741 7949 LYS 14999 53877 8126 VAL 17326 52835 10966 GLN 20796 53806 12137 LYS 23533 51175 11547 ASP 25093 51639 14954 ASP 22001 52243 17053 ILE 19859 49238 16143 GLU 19426 45655 17326 VAL 17308 42563 16717 ARG 15951 41370 20044 PHE 14592 37820 20221 TYR 12631 36842 23322 GLU 9749 34819 24787 ASP 8051 36240 27899 ASP 6886 33394 30254 GLU 7554 33506 34056 ASN 11262 33452 33047 GLY 12252 35701 30103 TRP 14890 34456 27603 GLN 16442 37019 25278 ALA 19102 36908 22576 PHE 20204 38978 19590 GLY 20627 38977 15814 ASP 24156 38799 14470 PHE 25101 41451 11922 SER 27787 44113 11384 PRO 26997 47664 10160 THR 27811 46422 6610 ASP 24814 44143 7078 VAL 22504 47113 7550 HIS 21818 48009 3911 LYS 21168 51714 3043 GLN 19246 52071 6376 TYR 16281 50029 5037 ALA 17352 46377 5349 ILE 19238 44272 7870 VAL 20851 40933 6960 PHE 21363 39049 10191 ARG 22067 35534 11465 THR 19508 34280 14044 PRO 20741 33513 17538 PRO 20765 29990 19008 TYR 17734 28967 21094 HIS 18353 28429 24837 LYS 17559 24665 24766 MET 19471 22662 22095 LYS 17386 19628 23051 ILE 13946 20695 21789 GLU 11678 17961 20507 ARG 8822 20158 19311 PRO 9074 23447 17403 VAL 8841 26555 19640 THR 7655 30009 18645 VAL 9550 32969 20076 PHE 9433 36593 18957 LEU 11879 39031 17452 GLN 11742 42784 17347 LEU 13867 45695 16363 LYS 15128 47655 19342 ARG 17014 50927 19625 LYS 20262 50606 21626 ARG 20462 54056 23188 GLY 16657 54294 23601 GLY 14906 50997 24262 ASP 11870 51519 22010 VAL 10922 48525 19942 SER 9069 47812 16756 ASP 5930 
45668 16633 SER 7200 42113 17169 LYS 7071 39374 14520 GLN 7145 35655 15455 PHE 9903 33186 14637 THR 9762 29446 15181 TYR 12667 27357 16209 TYR 12735 23834 14772 PRO 14682 20629 15825 GLY 37829 72937 -44895 PRO 39423 72794 -41437 TYR 37085 70892 -39150 LEU 37150 69964 -35506 VAL 36522 66345 -34541 ILE 35992 64789 -31118 VAL 38760 62171 -30602 GLU 37605 60944 -27176 GLN 33947 61358 -26113 PRO 32931 62292 -22574 LYS 31774 59246 -20589 GLN 27986 58978 -20925 ARG 27254 57588 -17370 GLY 29050 57545 -14042 PHE 29709 61297 -13617 ARG 27942 63593 -11120 PHE 26922 67248 -11435 ARG 27902 69285 -8385 TYR 26066 72303 -6889 GLY 27852 75521 -5996 CYS 27794 74536 -2296 GLU 29922 71490 -2950 GLY 33186 73050 -4097 PRO 35508 73107 -7144 SER 37760 70053 -6401 HIS 35373 67158 -7101 GLY 36772 65325 -10144 GLY 36952 66091 -13863 LEU 34560 64640 -16456 PRO 36434 61522 -17763 GLY 36657 60146 -21346 ALA 34448 57382 -22743 SER 37231 54895 -21834 SER 38620 55456 -18366 GLU 39077 52164 -16546 LYS 41392 51549 -13499 GLY 44617 53604 -13698 ARG 44243 54637 -17369 LYS 42031 57687 -17194 THR 41083 59663 -20266 TYR 39365 63079 -20564 PRO 37316 64423 -23507 THR 39574 65366 -26395 VAL 39122 67040 -29740 LYS 41373 68063 -32609 ILE 41324 70539 -35486 CYS 41946 68584 -38686 ASN 43403 69977 -41943 TYR 44839 72761 -39735 GLU 44771 76441 -40494 GLY 46301 79189 -38229 PRO 47095 78906 -34391 ALA 43648 78184 -32947 LYS 41316 78587 -30054 ILE 38732 76075 -28770 GLU 35780 77030 -26562 VAL 33632 74468 -24686 ASP 30188 75142 -23224 LEU 27199 73091 -22019 VAL 24297 72664 -24407 THR 21014 70880 -24011 HIS 20088 67351 -24992 SER 17220 68465 -27306 ASP 18198 69046 -30976 PRO 18611 72685 -31991 PRO 21409 72986 -29380 ARG 20731 75777 -26945 ALA 22892 76815 -23963 HIS 22058 74953 -20771 ALA 21474 76500 -17328 HIS 24519 74535 -16100 SER 28005 75924 -16368 LEU 31391 74300 -16789 VAL 33790 74805 -13893 GLY 37519 74517 -13335 LYS 40668 74478 -15402 GLN 40425 77592 -17526 CYS 36675 78042 -17053 SER 35058 81522 -17152 GLU 
32580 82666 -14533 LEU 29870 82035 -17146 GLY 30647 78441 -17876 ILE 32746 78936 -21037 CYS 35977 76948 -21311 ALA 38437 78503 -23737 VAL 41905 77243 -24548 SER 44516 77856 -27256 VAL 45636 74954 -29506 GLY 49323 75501 -30280 PRO 50759 76164 -33752 LYS 51624 72523 -34351 ASP 49767 70304 -31866 MET 46101 70914 -32799 THR 44598 68639 -30157 ALA 43079 69783 -26877 GLN 42566 67450 -23924 PHE 39931 68678 -21442 ASN 41585 67449 -18259 ASN 39319 69052 -15668 LEU 35681 69631 -16569 GLY 33250 70024 -13721 VAL 29536 70474 -14319 LEU 27859 73088 -12154 HIS 24307 72231 -11125 VAL 22013 75266 -11054 THR 19235 75040 -8477 LYS 15623 75391 -9703 LYS 15398 78571 -7664 ASN 18379 79795 -9696 MET 17636 78545 -13167 MET 15411 81499 -14058 GLY 18078 83957 -12924 THR 21083 82429 -14703 MET 19053 81406 -17729 ILE 18243 85108 -18162 GLN 21916 86068 -17595 LYS 22788 83669 -20436 LEU 20118 84884 -22882 GLN 21005 88549 -22110 ARG 24672 87693 -22616 GLN 23671 85807 -25723 ARG 21295 88476 -27022 LEU 23322 91574 -26330 ARG 26019 89432 -27930 SER 25805 89806 -31728 ARG 23461 92803 -31639 PRO 23802 95989 -29396 GLN 24416 96566 -25613 GLY 21159 96897 -23672 LEU 17588 95752 -23213 THR 14599 98104 -23056 GLU 11835 97171 -20554 ALA 9898 95329 -23247 GLU 13115 93418 -24002 GLN 13797 92428 -20363 ARG 10196 91204 -20045 GLU 10419 88862 -23062 LEU 13685 87547 -21613 GLU 12001 86316 -18456 GLN 9461 84426 -20610 GLU 11916 82414 -22732 ALA 13423 81462 -19352 LYS 10147 80698 -17671 GLU 9112 78601 -20684 LEU 12647 77405 -21077 LYS 12712 76076 -17490 LYS 9600 73963 -18084 VAL 11260 72077 -20982 MET 14891 71883 -19786 ASP 16243 68527 -18513 LEU 18654 69339 -15747 SER 20064 65778 -15666 ILE 21517 65642 -19203 VAL 23951 68034 -20892 ARG 26031 67834 -24067 LEU 29538 69313 -24601 ARG 29970 71840 -27425 PHE 33360 72571 -28962 SER 33548 75789 -30986 ALA 36966 76344 -32621 PHE 38292 79719 -33893 LEU 41297 80162 -36201 ARG 43042 83517 -35791 SER 38086 85130 -34988 LEU 37045 82737 -37790 PRO 34565 80041 -36628 LEU 34793 
76331 -37577 LYS 31477 74455 -37338 PRO 30630 73532 -33676 VAL 30917 69785 -32897 ILE 28544 68398 -30256 SER 29373 65369 -28130 GLN 27045 62567 -27060 PRO 24822 63154 -23987 ILE 26287 63144 -20432 HIS 23973 61973 -17642 ASP 23935 63028 -13955 SER 24990 60041 -11793 LYS 23222 61538 -8765 SER 19909 62085 -10494 PRO 17509 59227 -9469 GLY 16966 58164 -13095 ALA 20657 57955 -14074 SER 22205 56453 -10944 ASN 24644 53523 -11154 LEU 22663 50243 -10992 LYS 24099 47657 -8536 ILE 22975 44300 -7088 SER 24336 44221 -3560 ARG 23105 40924 -2045 MET 20579 38271 -2955 ASP 18332 36099 -0843 LYS 18801 32885 -2669 THR 20739 32013 -5772 ALA 18894 28851 -6708 GLY 15322 27664 -7134 SER 13047 25395 -9187 VAL 12766 25518 -12968 ARG 9037 26422 -12579 GLY 10151 29791 -11316 GLY 7869 31477 -8797 ASP 10454 31473 -6040 GLU 10393 34614 -3941 VAL 13711 36507 -3830 TYR 14814 39437 -1636 LEU 17392 41375 -3661 LEU 19020 44375 -2038 CYS 20329 46965 -4451 ASP 21480 50568 -4929 LYS 18665 53141 -5412 VAL 16190 52357 -8208 GLN 12876 53968 -9091 LYS 9999 51542 -8573 ASP 8331 52370 -11854 ASP 11563 52273 -13922 ILE 13324 49080 -12825 GLU 13381 45447 -13927 VAL 15164 42232 -12928 ARG 16189 40548 -16147 PHE 17234 36834 -16128 TYR 19061 35699 -19259 GLU 21626 33484 -20983 ASP 23772 34090 -24092 ASP 24587 31202 -26397 GLU 24101 31390 -30186 ASN 20364 31765 -29327 GLY 19543 33559 -26099 TRP 16677 33700 -23613 GLN 15344 36329 -21221 ALA 12637 36581 -18586 PHE 11740 39050 -15875 GLY 11248 38679 -12138 ASP 7632 39425 -11182 PHE 6800 42104 -8657 SER 4558 45119 -8051 PRO 5610 48719 -7102 THR 4730 47736 -3511 ASP 7510 45118 -3736 VAL 10258 47658 -4256 HIS 11094 48572 -0658 LYS 12111 52249 -0233 GLN 14117 52266 -3519 TYR 16778 49927 -2025 ALA 15286 46475 -2137 ILE 13155 44447 -4511 VAL 11069 41418 -3389 PHE 10212 39404 -6508 ARG 9213 35982 -7791 THR 11689 34430 -10225 PRO 10526 33672 -13764 PRO 10450 30155 -15221 TYR 13435 28807 -17212 HIS 12592 27953 -20867 LYS 13119 24168 -20635 MET 11163 22511 -17793 LYS 12513 
19051 -18392 ILE 16171 19712 -17420 GLU 17906 16675 -16002 ARG 20739 18749 -14442 PRO 20795 22205 -12656 VAL 21420 25231 -14950 THR 23035 28549 -13979 VAL 21787 31726 -15708 PHE 22609 35461 -15284 LEU 20442 38158 -13807 GLN 20811 41936 -13908 LEU 18869 45057 -12875 LYS 17767 47105 -15890 ARG 15886 50384 -16147 LYS 12986 49924 -18601 ARG 13828 53217 -20348 GLY 17599 53680 -20335 GLY 18712 50174 -21345 ASP 21222 50776 -18587 VAL 22137 47629 -16586 SER 23898 46535 -13407 ASP 26794 44083 -13346 SER 25489 40525 -13460 LYS 24824 38022 -10674 GLN 23829 34354 -11076 PHE 20884 32126 -10268 THR 20616 28329 -10690 TYR 17440 26526 -11742 TYR 16906 23050 -10274 PRO 15210 20064 -12063