# Exam 17.1.2023: Protein histogram

<{ForumPost(poster="dalsineznamymatfyzak", timestamp=2023-01-17 18:17:41)}>
Intro: Proteins are large biological molecules consisting of amino acids. In general, the genetic code specifies 20 standard amino acids. This assignment is based on the systematic exploration of the distribution of certain amino acids in proteins’ structures.  
  
3-letter codes of the 20 standard amino acids:  

    ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL

Your task  
  
Implement a program invoked like:  
  
program_name configuration_file output_file  
  
The command line always contains the configuration file name and may also contain an output file name. If no output file name is given, standard output should be used.  
  
Both the configuration file and the data files are row-oriented.  
Data file structure  
  
One file describes one protein and contains information about all its amino acids and their spatial coordinates (x, y, z – discrete values). Each row begins with the 3-letter amino acid code and continues with the spatial coordinates for that amino acid.  
  
Keep in mind: coordinates can also be negative numbers, and the fields on one line are separated by one or more whitespace characters.  
  
(Note: This is a simplification of a real PDB file describing protein structures).  
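
The row format described above can be parsed and checked in one step. The following is only a sketch in Python: the names `parse_row`, `AMINO_ACIDS` and `INTEGER` are made up for illustration, and it assumes that codes are upper-case only and that a coordinate is an optional minus sign followed by ASCII digits (the assignment does not say whether a leading `+` is allowed).

```python
import re

# The 20 standard 3-letter codes listed in the assignment.
AMINO_ACIDS = frozenset(
    "ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE "
    "LEU LYS MET PHE PRO SER THR TRP TYR VAL".split()
)
# Assumption: optional minus sign, then ASCII digits only.
INTEGER = re.compile(r"-?[0-9]+")


def parse_row(line):
    """Parse one data-file row into (code, (x, y, z)); raise ValueError if malformed."""
    fields = line.split()  # splits on any run of whitespace
    if len(fields) != 4:
        raise ValueError("expected a code and three coordinates")
    code, *coords = fields
    if code not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code: {code!r}")
    if not all(INTEGER.fullmatch(c) for c in coords):
        raise ValueError("coordinates must be integers")
    x, y, z = (int(c) for c in coords)
    return code, (x, y, z)
```

Values with leading zeros, such as `0298` or `-0149` in the attached data file, are accepted by `int()` and read as 298 and -149.
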
Configuration file structure

    R-neighborhood .. is an integer
    Pattern ......... is a sequence of one or more amino acids separated by one or more whitespaces
    protein_1 ....... is the name of the first data file
    ...
    protein_N ....... is the name of the N-th data file
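
A possible way to read this structure, sketched in Python. The function name `parse_config` is hypothetical, and several details are assumptions because the assignment does not spell them out: R is taken to be non-negative, blank lines among the data file names are ignored, and surrounding whitespace is stripped from file names.

```python
AMINO_ACIDS = frozenset(
    "ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE "
    "LEU LYS MET PHE PRO SER THR TRP TYR VAL".split()
)


def parse_config(text):
    """Return (r, pattern, data_files) from the configuration file contents."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("configuration needs at least R and a pattern")
    r_field = lines[0].strip()
    # Assumption: R is a non-negative integer written with ASCII digits.
    if not (r_field.isascii() and r_field.isdigit()):
        raise ValueError("R must be a non-negative integer")
    pattern = lines[1].split()  # one or more whitespace-separated codes
    if not pattern or any(code not in AMINO_ACIDS for code in pattern):
        raise ValueError("pattern must list valid amino acid codes")
    # Assumption: blank lines between file names are skipped.
    data_files = [line.strip() for line in lines[2:] if line.strip()]
    return int(r_field), pattern, data_files
```
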
    

Histogram  
  
The R-neighborhood of an amino acid is the set of points at a distance less than or equal to R from it. For an amino acid with coordinates \[x,y,z], it is defined as the points with all coordinates in the ranges \[x-R..x+R, y-R..y+R, z-R..z+R].  
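
Read literally, this definition is a per-coordinate test: a point belongs to the R-neighborhood when none of its coordinates differs from the center's by more than R, i.e. the neighborhood is an axis-aligned cube rather than a sphere. A minimal Python sketch; the name `in_neighborhood` is made up for illustration, and the sample coordinates come from the example data file further down.

```python
def in_neighborhood(center, point, r):
    """True if `point` lies in the R-neighborhood (axis-aligned cube) of `center`."""
    return all(abs(c - p) <= r for c, p in zip(center, point))


# Rows 1, 2 and 4 of the example data file, with R = 6000.
arg = (14872, -18107, 30327)
lys = (16112, -17325, 26790)
ile = (18797, -24042, 26472)
print(in_neighborhood(ile, arg, 6000))  # largest difference is 5935 (in y)
print(in_neighborhood(ile, lys, 6000))  # y differs by 6717
```
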
  
A histogram is constructed over the points of the discrete 3D space at which an amino acid from the set of specified proteins is located. For each such point, we count, for every amino acid type listed in the pattern, how many amino acids of that type lie in its R-neighborhood. Let these counts be (in the order given by the pattern) \[c1..cn]. Then the record corresponding to the values \[c1..cn] is incremented. The resulting histogram is built by incrementing the records in this way for the points of all amino acids of all input proteins.  
Output  
  
The output format is row-oriented; each line has the form \[c1 ... cn]: xxx, where xxx is the value of the record for \[c1..cn].  
  
The output is sorted lexicographically, i.e.

    [0 0 0 1]: xxx
    [0 0 0 2]: xxx
    ...
    [0 0 1 0]: xxx
    [0 0 1 2]: xxx
    ...
    [0 0 2 1]: xxx
    

Only non-zero occurrences are included in the output.  
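
One way to produce these lines, sketched in Python with a hypothetical `format_histogram` helper. It assumes the histogram is held as a dict from count tuples to record values and that "lexicographically" means comparing the counts as numbers, position by position, which is how Python orders tuples of integers. It also drops the all-zero vector; that is an assumption drawn from the example below, whose expected output has no \[0 0] line even though the ALA row has no ARG or LYS in its neighborhood.

```python
def format_histogram(histogram):
    """Return the output lines for a dict mapping count tuples to record values."""
    lines = []
    for counts in sorted(histogram):  # tuples of ints compare element by element
        value = histogram[counts]
        if value == 0:
            continue  # only non-zero records are printed
        if not any(counts):
            continue  # assumption: the all-zero vector is omitted as well
        lines.append("[" + " ".join(map(str, counts)) + "]: " + str(value))
    return lines
```

With numeric comparison, `[0 2]` sorts before `[0 10]`; comparing the printed strings instead would reverse that pair.
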
Example  
  
Configuration file:

    6000
    ARG LYS
    simple.pdb
    

Data file (simple.pdb):  

    ARG 14872 -18107 30327
    LYS 16112 -17325 26790
    HIS 17615 -20594 25563
    ILE 18797 -24042 26472
    ARG 21860 -24523 24296
    ARG 24156 -21734 23132
    GLY 27393 -22378 21345
    HIS 29225 -19697 19391
    ALA 32741 -18808 18304
    

Output:

    [1 0]: 1
    [1 1]: 2
    [2 0]: 3
    [2 1]: 1
    [3 0]: 1
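
The example can be checked with a brute-force version of the histogram construction. This Python sketch is a reference for small inputs only, not the efficient solution the assignment asks for. The names are made up for illustration, and two readings of the text are built in as assumptions: every row counts as its own point, and the amino acid at the point itself is counted in its own neighborhood. With both, the result matches the output above.

```python
from collections import Counter

R = 6000
PATTERN = ["ARG", "LYS"]
ROWS = [  # contents of simple.pdb
    ("ARG", (14872, -18107, 30327)),
    ("LYS", (16112, -17325, 26790)),
    ("HIS", (17615, -20594, 25563)),
    ("ILE", (18797, -24042, 26472)),
    ("ARG", (21860, -24523, 24296)),
    ("ARG", (24156, -21734, 23132)),
    ("GLY", (27393, -22378, 21345)),
    ("HIS", (29225, -19697, 19391)),
    ("ALA", (32741, -18808, 18304)),
]


def brute_force_histogram(rows, pattern, r):
    """Quadratic reference version: compare every amino acid with every other."""
    histogram = Counter()
    for _, center in rows:
        counts = [0] * len(pattern)
        for code, point in rows:  # includes the amino acid at `center` itself
            inside = all(abs(a - b) <= r for a, b in zip(center, point))
            if inside and code in pattern:
                counts[pattern.index(code)] += 1
        histogram[tuple(counts)] += 1
    return histogram


histogram = brute_force_histogram(ROWS, PATTERN, R)
for counts in sorted(histogram):
    if any(counts):  # assumption: the all-zero vector is not printed
        print(f"[{' '.join(map(str, counts))}]: {histogram[counts]}")
```

The last row (ALA) has no ARG or LYS within its neighborhood, so it lands in the \[0 0] record, which the expected output does not list.
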
    

One of the data files used in tests:  
  
*The attached link will probably die soon, so the file is copied at the bottom.*  
  
Assumptions and efficiency requirements  
  
The discrete 3D space where all the amino acids are located is large, think on the order of 100000^3. It is therefore not possible to store in memory a map with data for every point in this space.  
  
Space filling with amino acids is very sparse. Assume tens to small hundreds of amino acids (occupied points in space). Therefore, choose a suitable data representation so that the necessary operations are as efficient as possible.  
  
It is certainly not efficient to search every point of the entire space for each amino acid, nor to go through all other amino acids entered.  
  
You may find it useful to observe that for each amino acid, in each dimension, there are few enough other amino acids within the range of its R-neighborhood (i.e., in the subspace \[x-R..x+R, *, *]) that they can simply be searched sequentially.  
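
One possible reading of this hint, sketched in Python: keep the amino acids named in the pattern sorted by x, find the slab \[x-R..x+R] with two binary searches, and scan only that slab for y and z. The name `slab_histogram` is hypothetical, the codes in the pattern are assumed to be distinct, and the amino acid at the point itself is counted, as the example output suggests.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def slab_histogram(rows, pattern, r):
    """Histogram of pattern counts, scanning only the x-slab around each amino acid."""
    slot_of = {code: i for i, code in enumerate(pattern)}
    # Keep only amino acids named in the pattern, sorted by x coordinate.
    relevant = sorted((point, slot_of[code]) for code, point in rows if code in slot_of)
    xs = [point[0] for point, _ in relevant]
    histogram = Counter()
    for _, (x, y, z) in rows:
        counts = [0] * len(pattern)
        lo = bisect_left(xs, x - r)   # first entry with px >= x - r
        hi = bisect_right(xs, x + r)  # one past the last entry with px <= x + r
        for (_, py, pz), slot in relevant[lo:hi]:
            if abs(py - y) <= r and abs(pz - z) <= r:
                counts[slot] += 1
        histogram[tuple(counts)] += 1
    return histogram
```

For N amino acids this costs one sort plus, per amino acid, two binary searches and a scan of its slab, instead of a pass over all N.
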
Configuration and data file syntax checking requirements  
  
The primary evaluation criterion is functional correctness and efficiency on correctly entered data. The program must be stable (i.e. not perform any undefined operations, have unhandled exceptions, exit uncontrollably, etc.) on any (i.e. arbitrarily corrupted) data.  
  
To get the full number of points, the program must check the syntax of the configuration and data files. If the syntax is violated, the program writes the string "error" (to the output file or to standard output, according to the command-line parameters) and ends with a return code of 0. Treat as a syntax violation anything other than a valid 3-letter amino acid code, a wrong number of coordinates, non-numeric characters at coordinate positions, etc.  
  
If a data file specified in the configuration file cannot be opened (e.g. because it does not exist), this is not considered an error; simply skip the file. Failing to open the configuration file is an error.  
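
The error rules above could be arranged as follows. This is a Python sketch with made-up names: `read_data_lines` skips data files that cannot be opened, and `run` takes a hypothetical callback `build_output(config_path)` that returns the output lines, raising `OSError` when the configuration file cannot be opened and `ValueError` on a syntax violation. What should happen when the output file itself cannot be opened is not specified and is not handled here.

```python
import sys


def read_data_lines(paths):
    """Collect the lines of every data file that can be opened; skip the rest."""
    lines = []
    for path in paths:
        try:
            # errors="replace" keeps arbitrary bytes from raising a decode error.
            with open(path, encoding="ascii", errors="replace") as handle:
                lines.extend(handle.read().splitlines())
        except OSError:
            continue  # an unopenable data file is not an error
    return lines


def run(argv, build_output):
    """Write build_output(config_path), or "error", then return exit code 0."""
    out = open(argv[2], "w") if len(argv) > 2 else sys.stdout
    try:
        try:
            lines = build_output(argv[1])
        except (OSError, ValueError):  # config unopenable or syntax violated
            lines = ["error"]
        for line in lines:
            print(line, file=out)
    finally:
        if out is not sys.stdout:
            out.close()
    return 0
```

Because all lines are computed before anything is written, a syntax error found late in the input still produces the single line "error" rather than a partial histogram followed by it.
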
  
----  
File downloadable from the link:

    GLY  -5902  73707  44647
    PRO  -6264  73743  40764
    TYR  -3705  71988  38542
    LEU  -3494  70898  34880
    VAL  -2843  67241  34000
    ILE  -2073  65571  30665
    VAL  -5000  63138  30106
    GLU  -3371  61798  26891
    GLN   0298  62526  26068
    PRO   1548  63200  22516
    LYS   2971  60182  20620
    GLN   6732  60114  21213
    ARG   7588  58664  17782
    GLY   6245  58612  14266
    PHE   5232  62279  13870
    ARG   6704  64364  11105
    PHE   7776  67992  11705
    ARG   7017  70083   8594
    TYR   9026  72923   7063
    GLY   7212  76167   6207
    CYS   7470  75226   2529
    GLU   5463  71982   2791
    GLY   2067  73278   3865
    PRO  -0149  73817   6887
    SER  -2581  70903   6312
    HIS  -0429  67805   6799
    GLY  -1985  66380   9962
    GLY  -2112  67101  13681
    LEU   0271  65758  16340
    PRO  -1704  62677  17591
    GLY  -2080  61172  21099
    ALA  -0097  58364  22710
    SER  -2608  55734  21533
    SER  -3856  56280  17981
    GLU  -4190  52984  16070
    LYS  -6365  52493  12940
    GLY  -9823  54116  12842
    ARG  -9728  55799  16290
    LYS  -7067  58485  16685
    THR  -6530  60460  19863
    TYR  -4810  63883  20210
    PRO  -2865  65278  23197
    THR  -5445  66082  25941
    VAL  -5149  67919  29242
    LYS  -7630  68753  31976
    ILE  -7906  71378  34717
    CYS  -8719  69194  37749
    ASN -10404  70905  40716
    TYR -11209  73822  38373
    GLU -11458  77453  39415
    GLY -13284  79824  37052
    PRO -13557  79329  33267
    ALA  -9954  79073  32030
    LYS  -7299  79768  29439
    ILE  -4861  77005  28304
    GLU  -1846  77730  26095
    VAL   0660  75301  24561
    ASP   4077  75922  23036
    LEU   7228  74017  22190
    VAL   9966  73747  24855
    THR  13386  72129  24601
    HIS  14112  68506  25486
    SER  16709  69869  27923
    ASP  15438  70035  31507
    PRO  14855  73696  32442
    PRO  12178  73987  29700
    ARG  13172  76853  27501
    ALA  11308  77859  24303
    HIS  12261  75907  21143
    ALA  13204  77563  17826
    HIS  10407  75400  16314
    SER   6815  76712  16480
    LEU   3416  75054  16816
    VAL   1340  75471  13671
    GLY  -2446  75104  13090
    LYS  -5815  75395  14808
    GLN  -5961  78357  17279
    CYS  -2120  78878  17123
    SER  -0589  82386  16999
    GLU   2146  83357  14620
    LEU   4788  82755  17230
    GLY   3830  79151  17958
    ILE   1670  79732  21043
    CYS  -1672  77951  21035
    ALA  -4276  79584  23267
    VAL  -7716  78245  24030
    SER -10487  78742  26569
    VAL -11923  75715  28467
    GLY -15648  76296  29200
    PRO -17220  76803  32657
    LYS -18205  73099  32658
    ASP -16010  70721  30632
    MET -12476  71383  31856
    THR -10707  69209  29289
    ALA  -8884  70494  26198
    GLN  -8229  68097  23312
    PHE  -5664  69825  21044
    ASN  -7002  68392  17826
    ASN  -4686  70039  15291
    LEU  -1173  70852  16446
    GLY   1745  70832  14001
    VAL   5462  71549  14380
    LEU   6882  74031  11870
    HIS  10501  73041  11348
    VAL  13044  75904  11296
    THR  15901  75530   8833
    LYS  19433  75655  10285
    LYS  19805  78943   8383
    ASN  16548  80073  10015
    MET  17317  79109  13580
    MET  19235  82218  14690
    GLY  16743  84782  13422
    THR  13784  82759  14770
    MET  15698  82476  18034
    ILE  16634  86225  18473
    GLN  12996  87040  17579
    LYS  11470  84834  20294
    LEU  14178  86005  22727
    GLN  13369  89704  22123
    ARG   9699  88988  22587
    GLN  10682  87286  25841
    ARG  12753  90306  26946
    LEU  10048  92859  26526
    ARG   7623  90445  28155
    SER   7480  90991  31936
    ARG  10089  93811  31835
    PRO  10058  97252  29870
    GLN   9652  98057  26142
    GLY  12949  98079  24275
    LEU  16442  96818  23593
    THR  19556  98978  23660
    GLU  22375  97823  21343
    ALA  24145  96236  24282
    GLU  20912  94388  25119
    GLN  20607  93331  21481
    ARG  24262  92206  21301
    GLU  23787  89912  24374
    LEU  20410  88680  23083
    GLU  22215  87556  19909
    GLN  24750  85902  22229
    GLU  22082  83603  23681
    ALA  20996  82677  20179
    LYS  24443  81409  19128
    GLU  25101  79517  22353
    LEU  21496  78247  22381
    LYS  21766  76912  18827
    LYS  24900  74822  19619
    VAL  23010  72824  22323
    MET  19482  72631  20887
    ASP  18136  69379  19421
    LEU  16045  70184  16340
    SER  14729  66589  16289
    ILE  13135  66618  19726
    VAL  10525  68948  21257
    ARG   8191  68845  24255
    LEU   4723  70383  24547
    ARG   4344  72756  27479
    PHE   0787  73383  28638
    SER   0296  76558  30645
    ALA  -3150  76794  32224
    PHE  -4588  80131  33465
    LEU  -7519  80693  35772
    ARG  -9333  83809  34513
    SER  -4554  86056  33994
    LEU  -3785  83785  37002
    PRO  -1175  81039  36386
    LEU  -1311  77353  37450
    LYS   1867  75230  37378
    PRO   3065  74417  33818
    VAL   2985  70745  32807
    ILE   5618  69450  30341
    SER   4877  66389  28188
    GLN   7291  63642  27201
    PRO   9670  64213  24228
    ILE   8346  64193  20665
    HIS  10762  62906  18060
    ASP  10791  64095  14467
    SER  10061  61036  12283
    LYS  12033  62736   9520
    SER  15216  62997  11542
    PRO  17245  59950  10317
    GLY  17438  58727  13939
    ALA  13952  59279  15418
    SER  11119  57328  13782
    ASN   9027  54290  14618
    LEU  11099  51143  14313
    LYS   9026  49054  11907
    ILE   9506  45573  10446
    SER   7956  45698   6957
    ARG   8694  42101   6048
    MET  11300  39415   6458
    ASP  12873  36586   4550
    LYS  12076  33624   6745
    THR  10020  33044   9879
    ALA  11498  29718  10916
    GLY  14891  28175  11319
    SER  16964  25704  13281
    VAL  17143  25680  17052
    ARG  20963  25794  16510
    GLY  20824  29465  15507
    GLY  22848  30896  12662
    ASP  20178  31069   9967
    GLU  20493  34243   7847
    VAL  17354  36452   7654
    TYR  16848  39532   5394
    LEU  14658  42018   7295
    LEU  13119  45024   5485
    CYS  12551  47987   7815
    ASP  11835  51741   7949
    LYS  14999  53877   8126
    VAL  17326  52835  10966
    GLN  20796  53806  12137
    LYS  23533  51175  11547
    ASP  25093  51639  14954
    ASP  22001  52243  17053
    ILE  19859  49238  16143
    GLU  19426  45655  17326
    VAL  17308  42563  16717
    ARG  15951  41370  20044
    PHE  14592  37820  20221
    TYR  12631  36842  23322
    GLU   9749  34819  24787
    ASP   8051  36240  27899
    ASP   6886  33394  30254
    GLU   7554  33506  34056
    ASN  11262  33452  33047
    GLY  12252  35701  30103
    TRP  14890  34456  27603
    GLN  16442  37019  25278
    ALA  19102  36908  22576
    PHE  20204  38978  19590
    GLY  20627  38977  15814
    ASP  24156  38799  14470
    PHE  25101  41451  11922
    SER  27787  44113  11384
    PRO  26997  47664  10160
    THR  27811  46422   6610
    ASP  24814  44143   7078
    VAL  22504  47113   7550
    HIS  21818  48009   3911
    LYS  21168  51714   3043
    GLN  19246  52071   6376
    TYR  16281  50029   5037
    ALA  17352  46377   5349
    ILE  19238  44272   7870
    VAL  20851  40933   6960
    PHE  21363  39049  10191
    ARG  22067  35534  11465
    THR  19508  34280  14044
    PRO  20741  33513  17538
    PRO  20765  29990  19008
    TYR  17734  28967  21094
    HIS  18353  28429  24837
    LYS  17559  24665  24766
    MET  19471  22662  22095
    LYS  17386  19628  23051
    ILE  13946  20695  21789
    GLU  11678  17961  20507
    ARG   8822  20158  19311
    PRO   9074  23447  17403
    VAL   8841  26555  19640
    THR   7655  30009  18645
    VAL   9550  32969  20076
    PHE   9433  36593  18957
    LEU  11879  39031  17452
    GLN  11742  42784  17347
    LEU  13867  45695  16363
    LYS  15128  47655  19342
    ARG  17014  50927  19625
    LYS  20262  50606  21626
    ARG  20462  54056  23188
    GLY  16657  54294  23601
    GLY  14906  50997  24262
    ASP  11870  51519  22010
    VAL  10922  48525  19942
    SER   9069  47812  16756
    ASP   5930  45668  16633
    SER   7200  42113  17169
    LYS   7071  39374  14520
    GLN   7145  35655  15455
    PHE   9903  33186  14637
    THR   9762  29446  15181
    TYR  12667  27357  16209
    TYR  12735  23834  14772
    PRO  14682  20629  15825
    GLY  37829  72937 -44895
    PRO  39423  72794 -41437
    TYR  37085  70892 -39150
    LEU  37150  69964 -35506
    VAL  36522  66345 -34541
    ILE  35992  64789 -31118
    VAL  38760  62171 -30602
    GLU  37605  60944 -27176
    GLN  33947  61358 -26113
    PRO  32931  62292 -22574
    LYS  31774  59246 -20589
    GLN  27986  58978 -20925
    ARG  27254  57588 -17370
    GLY  29050  57545 -14042
    PHE  29709  61297 -13617
    ARG  27942  63593 -11120
    PHE  26922  67248 -11435
    ARG  27902  69285  -8385
    TYR  26066  72303  -6889
    GLY  27852  75521  -5996
    CYS  27794  74536  -2296
    GLU  29922  71490  -2950
    GLY  33186  73050  -4097
    PRO  35508  73107  -7144
    SER  37760  70053  -6401
    HIS  35373  67158  -7101
    GLY  36772  65325 -10144
    GLY  36952  66091 -13863
    LEU  34560  64640 -16456
    PRO  36434  61522 -17763
    GLY  36657  60146 -21346
    ALA  34448  57382 -22743
    SER  37231  54895 -21834
    SER  38620  55456 -18366
    GLU  39077  52164 -16546
    LYS  41392  51549 -13499
    GLY  44617  53604 -13698
    ARG  44243  54637 -17369
    LYS  42031  57687 -17194
    THR  41083  59663 -20266
    TYR  39365  63079 -20564
    PRO  37316  64423 -23507
    THR  39574  65366 -26395
    VAL  39122  67040 -29740
    LYS  41373  68063 -32609
    ILE  41324  70539 -35486
    CYS  41946  68584 -38686
    ASN  43403  69977 -41943
    TYR  44839  72761 -39735
    GLU  44771  76441 -40494
    GLY  46301  79189 -38229
    PRO  47095  78906 -34391
    ALA  43648  78184 -32947
    LYS  41316  78587 -30054
    ILE  38732  76075 -28770
    GLU  35780  77030 -26562
    VAL  33632  74468 -24686
    ASP  30188  75142 -23224
    LEU  27199  73091 -22019
    VAL  24297  72664 -24407
    THR  21014  70880 -24011
    HIS  20088  67351 -24992
    SER  17220  68465 -27306
    ASP  18198  69046 -30976
    PRO  18611  72685 -31991
    PRO  21409  72986 -29380
    ARG  20731  75777 -26945
    ALA  22892  76815 -23963
    HIS  22058  74953 -20771
    ALA  21474  76500 -17328
    HIS  24519  74535 -16100
    SER  28005  75924 -16368
    LEU  31391  74300 -16789
    VAL  33790  74805 -13893
    GLY  37519  74517 -13335
    LYS  40668  74478 -15402
    GLN  40425  77592 -17526
    CYS  36675  78042 -17053
    SER  35058  81522 -17152
    GLU  32580  82666 -14533
    LEU  29870  82035 -17146
    GLY  30647  78441 -17876
    ILE  32746  78936 -21037
    CYS  35977  76948 -21311
    ALA  38437  78503 -23737
    VAL  41905  77243 -24548
    SER  44516  77856 -27256
    VAL  45636  74954 -29506
    GLY  49323  75501 -30280
    PRO  50759  76164 -33752
    LYS  51624  72523 -34351
    ASP  49767  70304 -31866
    MET  46101  70914 -32799
    THR  44598  68639 -30157
    ALA  43079  69783 -26877
    GLN  42566  67450 -23924
    PHE  39931  68678 -21442
    ASN  41585  67449 -18259
    ASN  39319  69052 -15668
    LEU  35681  69631 -16569
    GLY  33250  70024 -13721
    VAL  29536  70474 -14319
    LEU  27859  73088 -12154
    HIS  24307  72231 -11125
    VAL  22013  75266 -11054
    THR  19235  75040  -8477
    LYS  15623  75391  -9703
    LYS  15398  78571  -7664
    ASN  18379  79795  -9696
    MET  17636  78545 -13167
    MET  15411  81499 -14058
    GLY  18078  83957 -12924
    THR  21083  82429 -14703
    MET  19053  81406 -17729
    ILE  18243  85108 -18162
    GLN  21916  86068 -17595
    LYS  22788  83669 -20436
    LEU  20118  84884 -22882
    GLN  21005  88549 -22110
    ARG  24672  87693 -22616
    GLN  23671  85807 -25723
    ARG  21295  88476 -27022
    LEU  23322  91574 -26330
    ARG  26019  89432 -27930
    SER  25805  89806 -31728
    ARG  23461  92803 -31639
    PRO  23802  95989 -29396
    GLN  24416  96566 -25613
    GLY  21159  96897 -23672
    LEU  17588  95752 -23213
    THR  14599  98104 -23056
    GLU  11835  97171 -20554
    ALA   9898  95329 -23247
    GLU  13115  93418 -24002
    GLN  13797  92428 -20363
    ARG  10196  91204 -20045
    GLU  10419  88862 -23062
    LEU  13685  87547 -21613
    GLU  12001  86316 -18456
    GLN   9461  84426 -20610
    GLU  11916  82414 -22732
    ALA  13423  81462 -19352
    LYS  10147  80698 -17671
    GLU   9112  78601 -20684
    LEU  12647  77405 -21077
    LYS  12712  76076 -17490
    LYS   9600  73963 -18084
    VAL  11260  72077 -20982
    MET  14891  71883 -19786
    ASP  16243  68527 -18513
    LEU  18654  69339 -15747
    SER  20064  65778 -15666
    ILE  21517  65642 -19203
    VAL  23951  68034 -20892
    ARG  26031  67834 -24067
    LEU  29538  69313 -24601
    ARG  29970  71840 -27425
    PHE  33360  72571 -28962
    SER  33548  75789 -30986
    ALA  36966  76344 -32621
    PHE  38292  79719 -33893
    LEU  41297  80162 -36201
    ARG  43042  83517 -35791
    SER  38086  85130 -34988
    LEU  37045  82737 -37790
    PRO  34565  80041 -36628
    LEU  34793  76331 -37577
    LYS  31477  74455 -37338
    PRO  30630  73532 -33676
    VAL  30917  69785 -32897
    ILE  28544  68398 -30256
    SER  29373  65369 -28130
    GLN  27045  62567 -27060
    PRO  24822  63154 -23987
    ILE  26287  63144 -20432
    HIS  23973  61973 -17642
    ASP  23935  63028 -13955
    SER  24990  60041 -11793
    LYS  23222  61538  -8765
    SER  19909  62085 -10494
    PRO  17509  59227  -9469
    GLY  16966  58164 -13095
    ALA  20657  57955 -14074
    SER  22205  56453 -10944
    ASN  24644  53523 -11154
    LEU  22663  50243 -10992
    LYS  24099  47657  -8536
    ILE  22975  44300  -7088
    SER  24336  44221  -3560
    ARG  23105  40924  -2045
    MET  20579  38271  -2955
    ASP  18332  36099  -0843
    LYS  18801  32885  -2669
    THR  20739  32013  -5772
    ALA  18894  28851  -6708
    GLY  15322  27664  -7134
    SER  13047  25395  -9187
    VAL  12766  25518 -12968
    ARG   9037  26422 -12579
    GLY  10151  29791 -11316
    GLY   7869  31477  -8797
    ASP  10454  31473  -6040
    GLU  10393  34614  -3941
    VAL  13711  36507  -3830
    TYR  14814  39437  -1636
    LEU  17392  41375  -3661
    LEU  19020  44375  -2038
    CYS  20329  46965  -4451
    ASP  21480  50568  -4929
    LYS  18665  53141  -5412
    VAL  16190  52357  -8208
    GLN  12876  53968  -9091
    LYS   9999  51542  -8573
    ASP   8331  52370 -11854
    ASP  11563  52273 -13922
    ILE  13324  49080 -12825
    GLU  13381  45447 -13927
    VAL  15164  42232 -12928
    ARG  16189  40548 -16147
    PHE  17234  36834 -16128
    TYR  19061  35699 -19259
    GLU  21626  33484 -20983
    ASP  23772  34090 -24092
    ASP  24587  31202 -26397
    GLU  24101  31390 -30186
    ASN  20364  31765 -29327
    GLY  19543  33559 -26099
    TRP  16677  33700 -23613
    GLN  15344  36329 -21221
    ALA  12637  36581 -18586
    PHE  11740  39050 -15875
    GLY  11248  38679 -12138
    ASP   7632  39425 -11182
    PHE   6800  42104  -8657
    SER   4558  45119  -8051
    PRO   5610  48719  -7102
    THR   4730  47736  -3511
    ASP   7510  45118  -3736
    VAL  10258  47658  -4256
    HIS  11094  48572  -0658
    LYS  12111  52249  -0233
    GLN  14117  52266  -3519
    TYR  16778  49927  -2025
    ALA  15286  46475  -2137
    ILE  13155  44447  -4511
    VAL  11069  41418  -3389
    PHE  10212  39404  -6508
    ARG   9213  35982  -7791
    THR  11689  34430 -10225
    PRO  10526  33672 -13764
    PRO  10450  30155 -15221
    TYR  13435  28807 -17212
    HIS  12592  27953 -20867
    LYS  13119  24168 -20635
    MET  11163  22511 -17793
    LYS  12513  19051 -18392
    ILE  16171  19712 -17420
    GLU  17906  16675 -16002
    ARG  20739  18749 -14442
    PRO  20795  22205 -12656
    VAL  21420  25231 -14950
    THR  23035  28549 -13979
    VAL  21787  31726 -15708
    PHE  22609  35461 -15284
    LEU  20442  38158 -13807
    GLN  20811  41936 -13908
    LEU  18869  45057 -12875
    LYS  17767  47105 -15890
    ARG  15886  50384 -16147
    LYS  12986  49924 -18601
    ARG  13828  53217 -20348
    GLY  17599  53680 -20335
    GLY  18712  50174 -21345
    ASP  21222  50776 -18587
    VAL  22137  47629 -16586
    SER  23898  46535 -13407
    ASP  26794  44083 -13346
    SER  25489  40525 -13460
    LYS  24824  38022 -10674
    GLN  23829  34354 -11076
    PHE  20884  32126 -10268
    THR  20616  28329 -10690
    TYR  17440  26526 -11742
    TYR  16906  23050 -10274
    PRO  15210  20064 -12063
    


<{/ForumPost}>

