This project aimed to develop a knowledge graph system that extracts genetic sequence information from unstructured biotech patent documents. As the team leader, I was primarily responsible for project planning and research. I also tracked my team members' progress and reviewed their project designs.

POS tagging genetic extraction pipeline
Using POS tagging to extract genetic sequences from biotech patent documents

The team developed a data pipeline to extract gene names, sequences, and organism names from biotech patent documents. We integrated the ETL pipeline with BLAST+ to classify each extracted sequence by taxonomic category. This let us combine the acquired intellectual property data with other information sources and increase its business value.
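To make the sequence extraction and taxonomy steps more concrete, here is a minimal sketch under several assumptions. It is not the project's actual code. It uses a simple regular expression in place of POS tagging to find candidate nucleotide sequences, builds a `blastn` command line, and turns tab-separated BLAST+ output into knowledge graph edges. All function names, the `HAS_TAXON` relation, and the chosen output columns are hypothetical, and the `staxids` column is only populated when the BLAST database carries taxonomy information.

```python
import re
from typing import Dict, List, Tuple

# Assumption: a sequence appears as one unbroken run of 20+ nucleotide letters.
# This regex stands in for the POS-tagging step, which is not shown here.
NUCLEOTIDE_RUN = re.compile(r"\b[ACGTU]{20,}\b", re.IGNORECASE)

# Columns requested from BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "staxids", "pident", "evalue"]


def extract_sequences(text: str) -> List[str]:
    """Return candidate nucleotide sequences found in the text, upper-cased."""
    return [m.group(0).upper() for m in NUCLEOTIDE_RUN.finditer(text)]


def to_fasta(sequences: List[str]) -> str:
    """Format sequences as FASTA records named seq1, seq2, ..."""
    return "".join(f">seq{i}\n{s}\n" for i, s in enumerate(sequences, start=1))


def blast_command(query_path: str, db_name: str) -> List[str]:
    """Build (but do not run) a blastn command that emits BLAST_COLUMNS."""
    return [
        "blastn",
        "-query", query_path,
        "-db", db_name,
        "-outfmt", "6 " + " ".join(BLAST_COLUMNS),
    ]


def parse_blast_tabular(output: str) -> List[Dict[str, str]]:
    """Parse tab-separated BLAST+ output into one dict per hit."""
    rows = []
    for line in output.splitlines():
        if line.strip():
            rows.append(dict(zip(BLAST_COLUMNS, line.split("\t"))))
    return rows


def taxonomy_edges(rows: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Keep the first listed hit per query as a (sequence, relation, taxid) edge."""
    seen = set()
    edges = []
    for row in rows:
        if row["qseqid"] not in seen:
            seen.add(row["qseqid"])
            edges.append((row["qseqid"], "HAS_TAXON", row["staxids"]))
    return edges


if __name__ == "__main__":
    patent_text = "SEQ ID NO: 1 is atgaccatgattacggattcactggcc from E. coli."
    print(to_fasta(extract_sequences(patent_text)))
    print(blast_command("query.fasta", "nt"))
```

In a real pipeline the command list would be passed to something like `subprocess.run`, and its output fed to `parse_blast_tabular`. Taking only the first hit per query is a simplification; a production system would more likely filter on identity and e-value thresholds.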

The PoC project showed promising results, demonstrating the potential for knowledge graph systems to extract valuable information from unstructured data sources. As the project leader, I was pleased to have contributed to the development of this innovative solution.