Data Formats

Omic reads standard cheminformatics and bioinformatics formats for molecular, sequence, and expression data, and writes its results as JSON.

Molecular Data

SMILES

SMILES (Simplified Molecular Input Line Entry System) encodes a chemical structure as a single line of text.

# Example SMILES strings
CC(=O)Oc1ccccc1C(=O)O    # Aspirin
CN1C=NC2=C1C(=O)N(C(=O)N2C)C    # Caffeine
CC(C)Cc1ccc(cc1)C(C)C(=O)O    # Ibuprofen
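The example lines above pair each SMILES string with a trailing `#` comment naming the compound. A minimal sketch of reading that layout in Python follows. The function name `parse_smiles_lines` is hypothetical and not part of Omic, and the "SMILES, whitespace, `#` name" layout is assumed from the examples rather than from a documented input specification.

```python
def parse_smiles_lines(text):
    """Return (smiles, name) pairs from lines like 'CCO    # Ethanol'."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith("#"):
            continue
        # A SMILES string contains no whitespace, so the first token is the
        # structure. Do not split on '#': it is also the triple-bond symbol.
        parts = line.split(None, 1)
        smiles = parts[0]
        name = parts[1].lstrip("#").strip() if len(parts) > 1 else ""
        records.append((smiles, name))
    return records


EXAMPLE = """\
# Example SMILES strings
CC(=O)Oc1ccccc1C(=O)O    # Aspirin
CN1C=NC2=C1C(=O)N(C(=O)N2C)C    # Caffeine
CC(C)Cc1ccc(cc1)C(C)C(=O)O    # Ibuprofen
"""

compounds = parse_smiles_lines(EXAMPLE)
for smiles, name in compounds:
    print(f"{name}: {smiles}")
```

Splitting on whitespace rather than on `#` matters for structures such as `CC#N`, where `#` is part of the SMILES itself. The sketch does not validate chemistry; a cheminformatics toolkit would be needed for that.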

SDF/MOL Files

Structure Data Files (SDF) hold one or more molecules, each with 2D or 3D atom coordinates, bonds, and named properties. A MOL file holds a single molecule.

compounds.sdf
├── Molecule 1
│   ├── Atoms and coordinates
│   ├── Bonds
│   └── Properties (MW, logP, etc.)
├── Molecule 2
└── ...
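The layout above maps onto the SDF text format: each record is a molfile block ending in `M  END`, followed by `>  <NAME>` data items, and closed by a `$$$$` line. The following is a rough standard-library sketch of splitting a file into records. The names `parse_sdf` and `_parse_record` are hypothetical, the two sample molecules are illustrative only, and nothing here reflects how Omic itself reads SDF files.

```python
def _parse_record(lines):
    """Turn the lines of one SDF record into a small dictionary."""
    # The molfile block runs through the 'M  END' line.
    end = next(
        (i for i, l in enumerate(lines) if l.startswith("M  END")),
        len(lines) - 1,
    )
    molblock = lines[: end + 1]
    props, key = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # Data header such as '>  <MW>': the name sits between < and >.
            start, stop = line.find("<"), line.rfind(">")
            key = line[start + 1:stop] if 0 <= start < stop else None
            if key is not None:
                props[key] = []
        elif not line.strip():
            key = None  # a blank line ends the current data item
        elif key is not None:
            props[key].append(line)
    return {
        "title": molblock[0].strip() if molblock else "",
        "molblock": "\n".join(molblock),
        "properties": {k: "\n".join(v) for k, v in props.items()},
    }


def parse_sdf(text):
    """Split SDF text on '$$$$' delimiter lines and parse each record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(_parse_record(current))
            current = []
        else:
            current.append(line)
    if any(l.strip() for l in current):  # final record without '$$$$'
        records.append(_parse_record(current))
    return records


ATOM = "    0.0000    0.0000    0.0000 {}   0  0  0  0  0  0  0  0  0  0  0  0"
COUNTS = "  1  0  0  0  0  0  0  0  0  0999 V2000"
SAMPLE_SDF = "\n".join([
    "Methane", "  example", "", COUNTS, ATOM.format("C"), "M  END",
    ">  <MW>", "16.04", "", "$$$$",
    "Water", "  example", "", COUNTS, ATOM.format("O"), "M  END",
    ">  <MW>", "18.02", "", "$$$$",
])

records = parse_sdf(SAMPLE_SDF)
for rec in records:
    print(rec["title"], rec["properties"])
```

Property values are returned as strings; converting `MW` or `logP` to numbers is left to the caller because SDF does not type its data items.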

Sequence Data

FASTA

Standard format for nucleotide and protein sequences.

>sp|P04637|P53_HUMAN Cellular tumor antigen p53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
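A FASTA record is a `>` header line followed by one or more sequence lines, which are concatenated. The sketch below reads records with the standard library and splits a UniProt-style header of the form `db|accession|entry_name description`, as in the example above. The function names are hypothetical and are not Omic's API.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_uniprot_header(header):
    """Split 'db|accession|entry_name description'; None if it does not fit."""
    ident, _, description = header.partition(" ")
    fields = ident.split("|")
    if len(fields) != 3:
        return None
    db, accession, entry_name = fields
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }


EXAMPLE = """\
>sp|P04637|P53_HUMAN Cellular tumor antigen p53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
"""

fasta_records = parse_fasta(EXAMPLE)
header, sequence = fasta_records[0]
info = parse_uniprot_header(header)
print(info["accession"], len(sequence))
```

Headers from other databases use different conventions, so `parse_uniprot_header` returns `None` rather than guessing when the identifier does not have three `|`-separated fields.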

Expression Data

Count Matrix (CSV/TSV)

Gene expression count matrix with genes as rows and samples as columns.

gene_id,sample_1,sample_2,sample_3,sample_4
ENSG00000141510,1523,1821,892,1102
ENSG00000171862,4521,4102,5821,5012
ENSG00000134086,892,1021,723,812
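As a sketch of how a matrix in this layout can be read, the snippet below uses Python's `csv` module and returns the sample names plus a gene-to-counts mapping. The function name `read_count_matrix` is hypothetical, and treating every value as an integer is an assumption based on the example (raw counts), not a documented Omic requirement.

```python
import csv
import io


def read_count_matrix(handle, delimiter=","):
    """Read a genes-by-samples matrix; pass delimiter='\\t' for TSV."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    samples = header[1:]  # the first column holds the gene identifier
    counts = {}
    for row in reader:
        if not row:
            continue
        gene_id, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(
                f"{gene_id}: expected {len(samples)} values, got {len(values)}"
            )
        counts[gene_id] = [int(v) for v in values]
    return samples, counts


EXAMPLE = """\
gene_id,sample_1,sample_2,sample_3,sample_4
ENSG00000141510,1523,1821,892,1102
ENSG00000171862,4521,4102,5821,5012
ENSG00000134086,892,1021,723,812
"""

samples, counts = read_count_matrix(io.StringIO(EXAMPLE))

# Per-sample totals (library sizes), summed down each column.
totals = {
    sample: sum(row[i] for row in counts.values())
    for i, sample in enumerate(samples)
}
print(totals)
```

Raising on a short or long row catches truncated uploads early, before a ragged matrix reaches any downstream analysis.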

Sample Metadata

Sample annotations including condition labels and covariates.

sample_id,condition,age,sex,batch
sample_1,disease,45,M,batch1
sample_2,disease,52,F,batch1
sample_3,control,48,M,batch2
sample_4,control,51,F,batch2
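The `sample_id` values in the metadata are expected to line up with the sample columns of the count matrix. A small sketch of loading the metadata and checking that correspondence is given below. The function names are hypothetical, and the assumption that IDs must match exactly is inferred from the two examples rather than stated by Omic.

```python
import csv
import io


def read_sample_metadata(handle):
    """Return {sample_id: row_dict}; all values are kept as strings."""
    return {row["sample_id"]: row for row in csv.DictReader(handle)}


def check_samples(matrix_samples, metadata):
    """Return (missing, unused) sample IDs between matrix and metadata."""
    # In the matrix header but without a metadata row.
    missing = [s for s in matrix_samples if s not in metadata]
    # Annotated in the metadata but absent from the matrix.
    unused = [s for s in metadata if s not in matrix_samples]
    return missing, unused


EXAMPLE = """\
sample_id,condition,age,sex,batch
sample_1,disease,45,M,batch1
sample_2,disease,52,F,batch1
sample_3,control,48,M,batch2
sample_4,control,51,F,batch2
"""

metadata = read_sample_metadata(io.StringIO(EXAMPLE))
matrix_samples = ["sample_1", "sample_2", "sample_3", "sample_4"]
missing, unused = check_samples(matrix_samples, metadata)
print("missing:", missing, "unused:", unused)
```

Numeric covariates such as `age` arrive as text from `csv.DictReader` and need an explicit `int(...)` conversion before use.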

Output Formats

Target Discovery Results (JSON)

Ranked target list with scores and supporting evidence.

{
  "targets": [
    {
      "gene_symbol": "EGFR",
      "ensembl_id": "ENSG00000146648",
      "druggability_score": 0.92,
      "expression_fc": 3.21,
      "network_centrality": 0.85,
      "literature_evidence": 127,
      "existing_drugs": ["Erlotinib", "Gefitinib"]
    }
  ],
  "patient_clusters": [...],
  "pathways": [...]
}
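A result file in this shape can be consumed with Python's `json` module. The sketch below filters and orders targets by `druggability_score`. The helper `top_targets` and its `min_druggability` threshold are hypothetical, and the sample string replaces the elided `patient_clusters` and `pathways` contents with empty lists because their structure is not shown here.

```python
import json

RESULTS_JSON = """
{
  "targets": [
    {
      "gene_symbol": "EGFR",
      "ensembl_id": "ENSG00000146648",
      "druggability_score": 0.92,
      "expression_fc": 3.21,
      "network_centrality": 0.85,
      "literature_evidence": 127,
      "existing_drugs": ["Erlotinib", "Gefitinib"]
    }
  ],
  "patient_clusters": [],
  "pathways": []
}
"""


def top_targets(results, min_druggability=0.5):
    """Targets at or above the threshold, highest druggability first."""
    selected = [
        t for t in results.get("targets", [])
        if t.get("druggability_score", 0.0) >= min_druggability
    ]
    return sorted(
        selected, key=lambda t: t["druggability_score"], reverse=True
    )


results = json.loads(RESULTS_JSON)
for target in top_targets(results):
    drugs = ", ".join(target["existing_drugs"]) or "none"
    print(target["gene_symbol"], target["druggability_score"], drugs)
```

Using `.get` with defaults keeps the helper tolerant of entries that omit a score, which are then filtered out rather than raising a `KeyError`.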

File Size Limits

File Type            Max Size    Notes
Expression Matrix    500 MB      Up to 100,000 genes × 10,000 samples
SDF File             1 GB        Up to 10M compounds
FASTA                100 MB      Protein or nucleotide
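These limits can be checked locally before an upload. The sketch below encodes the three size limits in a dictionary and compares a file's size against them. The keys, the function names, and the reading of MB and GB as binary units (1 MB = 1024 × 1024 bytes) are all assumptions; Omic may count decimal units instead.

```python
import os

MB = 1024 * 1024  # assumption: binary megabytes
GB = 1024 * MB

# Maximum sizes from the table above, keyed by hypothetical type names.
SIZE_LIMITS = {
    "expression_matrix": 500 * MB,
    "sdf": 1 * GB,
    "fasta": 100 * MB,
}


def within_limit(size_bytes, file_type):
    """True if a file of size_bytes fits the limit for file_type."""
    if file_type not in SIZE_LIMITS:
        raise KeyError(f"unknown file type: {file_type!r}")
    return size_bytes <= SIZE_LIMITS[file_type]


def check_file(path, file_type):
    """Check an on-disk file against the limit for its type."""
    return within_limit(os.path.getsize(path), file_type)


print(within_limit(250 * MB, "expression_matrix"))
print(within_limit(2 * GB, "sdf"))
```

The check covers file size only; the row, column, and compound counts in the Notes column would need a pass over the file contents.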