Software Practicals Summer Semester 2019
Database Systems Research Group, Heidelberg University
April 17, 2019
● Overview of topics (today)
  ○ send application for a topic by Tuesday, April 23, 13:00
  ○ assignment of topics by April 26
● First milestone (mid/end May)
  ○ prototype/part of software
  ○ summary of research (literature and related systems/tools)
  ○ further milestones in agreement with supervisor
● End of practical (mid/end July)
  ○ code in local GitLab
  ○ report / documentation as local Wiki document
  ○ presentation/demo of practical and software (10-15 minutes)
● Application
  ○ by email directly to supervisor
  ○ brief list of relevant courses / prior knowledge
  ○ schedule and milestones for the practical
  ○ group work is not possible
  ○ application is binding (don’t apply if you don’t want to do the practical)
● Deadlines
  ○ presentation: planned for third week in July 2019
  ○ report & GitLab upload: by August 10, 2019
  ○ no extension possible
  ○ not finished = failed (grade 5.0)
● Credit points (Leistungspunkte)
  ○ Beginners Practical (IAP, 6 ECTS) [Bachelor students]
    ■ workload: 180 h (~1 ½ days/week)
  ○ Advanced Practical (IFP, 8 ECTS / 6 ECTS)
    ■ workload: 240 h (~2 days/week)
● Grading based on
  ○ code (readability, structure, functionality)
  ○ documentation (README, comments)
  ○ commitment and self-reliance
  ○ cool ideas!!
● IMPORTANT
  ○ talk to / communicate with your advisor
● Michael Gertz (MG)
● Sebastian Lackner (SL)
● Satya Almasian (SA)
● Dennis Aumiller (DA)

Given:
  1. Doctoral letters written in German (semi-structured)
  2. Medical vocabulary (e.g., Unified Medical Language System, UMLS)
Tasks:
  • Build pipeline that identifies and manages medical named entities
  • Manage and allow querying named entities in database
Subtasks:
  • Extend existing information extraction pipeline
  • Develop GUI components for querying medical named entities
Languages / Tools:
  • Python; MongoDB; Django/Flask
  • UMLS, https://www.nlm.nih.gov/research/umls/
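
The entity identification step can be illustrated with a simple dictionary lookup. The following is only a minimal sketch, not the existing pipeline: the vocabulary, the concept identifiers, and the two sample letters are invented placeholders (they are not real UMLS entries), and the in-memory index stands in for what would presumably be stored in MongoDB.

```python
import re
from collections import defaultdict

# Hypothetical mini-vocabulary: surface form -> concept identifier.
# The identifiers are invented placeholders, not real UMLS CUIs.
VOCABULARY = {
    "diabetes mellitus": "CONCEPT_0001",
    "hypertonie": "CONCEPT_0002",
    "metformin": "CONCEPT_0003",
}


def find_entities(text, vocabulary):
    """Return (start, end, surface form, concept id) for every vocabulary match."""
    matches = []
    lowered = text.lower()
    for term, concept_id in vocabulary.items():
        # Word boundaries avoid matching inside longer words.
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            matches.append((m.start(), m.end(), text[m.start():m.end()], concept_id))
    return sorted(matches)


def build_index(documents, vocabulary):
    """Map each concept id to the set of document ids that mention it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for _, _, _, concept_id in find_entities(text, vocabulary):
            index[concept_id].add(doc_id)
    return dict(index)


# Invented sample letters, for illustration only.
letters = {
    "letter-1": "Diagnose: Diabetes mellitus Typ 2, Therapie mit Metformin.",
    "letter-2": "Bekannte arterielle Hypertonie, kein Diabetes mellitus.",
}
index = build_index(letters, VOCABULARY)
print(index)
```

Note that the second letter negates the diagnosis ("kein Diabetes mellitus") and is still indexed; plain dictionary matching does not handle negation, inflection, or abbreviations.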

Given:
  1. Medical term co-occurrence network extracted from doctoral letters written in German (and managed in MongoDB)
  2. Medical named entities and medical vocabulary
Tasks:
  • Adapt and extend construction of co-occurrence networks
  • Web-based querying and visualization of co-occurrence networks
Subtasks:
  • Consolidate extraction pipeline for co-occurrence networks
  • Develop GUI components for graph querying
Languages / Tools:
  • Python; MongoDB; Django/Flask
  • UMLS, https://www.nlm.nih.gov/research/umls/
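
As a rough illustration of what constructing and querying a co-occurrence network involves, the sketch below counts how often two entities appear in the same document and looks up the neighbors of one entity. It assumes document-level co-occurrence and uses invented entity lists; the existing pipeline may use a different window (for example, sentences) and a different weighting.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(entity_lists):
    """Count how often two distinct entities appear in the same document."""
    counts = Counter()
    for entities in entity_lists:
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts


def neighbors(counts, entity, min_weight=1):
    """Return (neighbor, weight) pairs for one entity, heaviest edge first."""
    result = []
    for (a, b), weight in counts.items():
        if weight >= min_weight and entity in (a, b):
            result.append((b if a == entity else a, weight))
    return sorted(result, key=lambda pair: (-pair[1], pair[0]))


# Invented entity lists, one per document, for illustration only.
docs = [
    ["diabetes", "metformin", "hypertension"],
    ["diabetes", "metformin"],
    ["hypertension", "ramipril"],
]
edges = cooccurrence_counts(docs)
print(neighbors(edges, "diabetes"))
```

The edge dictionary maps directly onto one database document per edge (two endpoints plus a weight), which is one possible way to serve neighbor queries from a web front end.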

Given:
  1. Website with information about voting behavior
  2. Lists of politicians and parties
Tasks:
  • Extract information about politicians, topics, and votes from https://www.bundestag.de/abstimmung
  • Develop Web-based visualization and query framework
Subtasks:
  • Extract information from the website and manage it in a database
  • Develop GUI components for politician/topic centric querying
Languages / Tools:
  • Python; MongoDB/Solr; Django/Flask

Given:
  1. German legal texts (as XML files for, e.g., BGB, StPO, ZPO)
  2. Machine Learning frameworks
Tasks:
  • Develop pipeline(s) to compute and manage language models for collections of legal texts
  • Evaluation and comparison with existing word embeddings
Subtasks:
  • Extract legal texts from www.gesetze-im-internet.de/
  • Apply Machine Learning pipeline on collections of legal texts
Languages / Tools:
  • Python; scikit-learn/TensorFlow
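
One possible way to compare a domain-specific embedding with an existing one is to check how many nearest neighbors the two models share for each word. The sketch below is an assumption about how such a comparison could look, not a prescribed evaluation method; the two toy models and their 2-dimensional vectors are invented, and real embeddings would be loaded from trained models.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def nearest(embeddings, word, k):
    """Return the k words most similar to `word`, ties broken alphabetically."""
    others = [w for w in embeddings if w != word]
    others.sort(key=lambda w: (-cosine(embeddings[word], embeddings[w]), w))
    return others[:k]


def neighbor_overlap(model_a, model_b, k=2):
    """Average Jaccard overlap of the k nearest neighbors over the shared vocabulary."""
    shared = sorted(set(model_a) & set(model_b))
    a = {w: model_a[w] for w in shared}
    b = {w: model_b[w] for w in shared}
    scores = []
    for word in shared:
        na, nb = set(nearest(a, word, k)), set(nearest(b, word, k))
        union = na | nb
        scores.append(len(na & nb) / len(union) if union else 1.0)
    return sum(scores) / len(scores) if scores else 0.0


# Invented toy vectors, for illustration only.
legal_model = {
    "vertrag": [1.0, 0.0], "kaufvertrag": [0.9, 0.1],
    "strafe": [0.0, 1.0], "haft": [0.1, 0.9],
}
general_model = {
    "vertrag": [1.0, 0.0], "kaufvertrag": [0.8, 0.2],
    "strafe": [0.7, 0.3], "haft": [0.0, 1.0],
}
print(neighbor_overlap(legal_model, general_model, k=1))
```

A score of 1.0 means both models agree on every neighborhood; lower values point to words whose usage differs between the two corpora.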

Given:
  1. German doctoral letters (semi-structured)
  2. Machine Learning frameworks
Tasks:
  • Develop pipeline(s) to compute and manage language models for collections of doctoral letters
  • Evaluation and comparison with existing word embeddings
Subtasks:
  • Develop and apply Machine Learning pipeline on collections of medical texts
Languages / Tools:
  • Python; scikit-learn/TensorFlow

Given:
  1. Hypergraph/Graph Document Model
  2. Relational Implementation (PostgreSQL) as reference
Tasks:
  • Propose and implement a schematic model in a graph database
  • Evaluate performance on a set of predefined query types
Subtasks:
  • “Translate” queries from SQL to product-specific query languages
  • BP: Neo4j only, AP: Can extend this to other frameworks as well
Languages / Tools:
  • SQL; Neo4j, OrientDB, ArangoDB, Dgraph, MongoDB, ...
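
To give an idea of what "translating" a query means, the sketch below puts a relational join next to a Cypher pattern that asks the same question. Everything about the schema is hypothetical (tables `documents`, `entities`, `mentions`; node labels `Document`, `Entity`; relationship `MENTIONS`), since the actual document model is not described here. The SQL runs against an in-memory SQLite database as a stand-in for PostgreSQL; the Cypher string is shown for comparison only and is not executed.

```python
import sqlite3

# Relational formulation: two joins through a link table.
SQL_QUERY = """
SELECT d.title
FROM documents AS d
JOIN mentions AS m ON m.doc_id = d.id
JOIN entities AS e ON e.id = m.entity_id
WHERE e.name = ?
ORDER BY d.title
"""

# Same question as a Cypher pattern; not executed here (needs a running Neo4j).
CYPHER_QUERY = """
MATCH (d:Document)-[:MENTIONS]->(e:Entity {name: $name})
RETURN d.title AS title
ORDER BY title
"""


def documents_mentioning(conn, name):
    """Return the titles of all documents that mention the named entity."""
    return [row[0] for row in conn.execute(SQL_QUERY, (name,))]


# Hypothetical schema and invented rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE mentions (doc_id INTEGER, entity_id INTEGER);
INSERT INTO documents VALUES (1, 'Doc A'), (2, 'Doc B'), (3, 'Doc C');
INSERT INTO entities VALUES (1, 'Heidelberg'), (2, 'Berlin');
INSERT INTO mentions VALUES (1, 1), (2, 2), (3, 1);
""")
titles = documents_mentioning(conn, "Heidelberg")
print(titles)
```

The link table in the relational version becomes a relationship in the graph version, which is typically where the two models differ most in both query shape and performance.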

Given:
  1. Relational Document Model in PostgreSQL
  2. Set of “Standard Queries”
Tasks:
  • Find optimal execution pattern and potential improvements (including PostgreSQL settings)
  • Investigate optimal SQL execution plan
Subtasks:
  • Learn details about PostgreSQL internals and query planner
  • Find bottlenecks in execution plans
Languages / Tools:
  • SQL/PostgreSQL (at low level, this is C code)
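
Finding bottlenecks in an execution plan can be partly automated. The sketch below walks the tree that PostgreSQL returns for `EXPLAIN (ANALYZE, FORMAT JSON)` and estimates the time spent in each node itself, that is, excluding its children. The sample plan is hand-written with invented timings, and the calculation is a simplification: it ignores parallel workers and other cases where per-node times do not add up cleanly.

```python
def inclusive_ms(node):
    """Total time of a node including children, in milliseconds."""
    # "Actual Total Time" is reported per loop, so scale by the loop count.
    return node.get("Actual Total Time", 0.0) * node.get("Actual Loops", 1)


def self_times(node):
    """Yield (node type, time spent in the node itself in ms) for a plan tree."""
    children = node.get("Plans", [])
    own = inclusive_ms(node) - sum(inclusive_ms(child) for child in children)
    yield node["Node Type"], max(own, 0.0)
    for child in children:
        yield from self_times(child)


def slowest_nodes(explain_json, top=3):
    """Return the `top` plan nodes with the largest self time."""
    plan = explain_json[0]["Plan"]
    return sorted(self_times(plan), key=lambda item: -item[1])[:top]


# Hand-written plan with invented timings, for illustration only.
sample = [{
    "Plan": {
        "Node Type": "Hash Join", "Actual Total Time": 120.0, "Actual Loops": 1,
        "Plans": [
            {"Node Type": "Seq Scan", "Actual Total Time": 90.0, "Actual Loops": 1},
            {"Node Type": "Hash", "Actual Total Time": 10.0, "Actual Loops": 1,
             "Plans": [
                 {"Node Type": "Index Scan", "Actual Total Time": 2.0, "Actual Loops": 4},
             ]},
        ],
    }
}]
print(slowest_nodes(sample, top=2))
```

In this invented plan the sequential scan dominates, which would suggest checking for a missing index before touching planner settings.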

Given:
  1. News Extraction Pipeline
  2. Time-Varying Graph Explorer
Tasks:
  • Extract articles (adapt sample code)
  • Use Ambiverse to link entities
  • Implement live view in TVG Explorer
Subtasks:
  • Decide on intermediate representation (DB / in-memory?)
Languages / Tools:
  • Python, HTML, JavaScript, (MongoDB), ...

Given:
  1. News Extraction Pipeline
  2. Measures to rate importance of entities in news articles [1]
Tasks:
  • Implement browser-based visualization of current news based on the News Extraction Pipeline
Subtasks:
  • Get familiar with the paper and existing code
  • Decide on suitable graph visualization framework (vis.js?)
Languages / Tools:
  • Java or Python, HTML, JavaScript, ...

Given:
  1. Existing code to track communities [1] over multiple snapshots
  2. Time-Varying Graph Explorer
Tasks:
  • Fix performance bottlenecks in existing code
  • Port / Reimplement visualization in TVG Explorer
Subtasks:
  • Measure performance bottlenecks
  • Decide on suitable replacement algorithms
Languages / Tools:
  • Python, HTML, JavaScript
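
Measuring bottlenecks in Python code usually starts with the standard-library profiler. The sketch below wraps `cProfile` and `pstats` in a small helper that returns a report sorted by cumulative time. The function `count_shared_members` and the two snapshots are hypothetical stand-ins for a slow step; they are not taken from the existing community tracking code or from [1].

```python
import cProfile
import io
import pstats


def profile_report(func, *args, top=5):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


# Hypothetical stand-in for a slow step: compares every community pair
# and rebuilds the sets on each iteration.
def count_shared_members(communities_a, communities_b):
    total = 0
    for a in communities_a:
        for b in communities_b:
            total += len(set(a) & set(b))
    return total


# Invented snapshots: ten communities of 50 members each, shifted by 25.
snapshot_1 = [list(range(i, i + 50)) for i in range(0, 500, 50)]
snapshot_2 = [list(range(i + 25, i + 75)) for i in range(0, 500, 50)]
result, report = profile_report(count_shared_members, snapshot_1, snapshot_2)
print(report)
```

The report lists call counts and cumulative times per function, which helps decide whether a replacement algorithm is needed or whether removing repeated work (here, the repeated `set` construction) is enough.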