Web-Scale Knowledge Inference Using Markov Logic Networks

Yang Chen, Daisy Zhe Wang
University of Florida, Dept. of Computer Science, Gainesville, FL 32611, USA

Abstract

In this paper, we present our on-going work on ProbKB, a PROBabilistic Knowledge Base constructed from web-scale extracted entities, facts, and rules represented as a Markov logic network (MLN). We aim at web-scale MLN inference by designing a novel relational model to represent MLNs and algorithms that apply rules in batches. Errors are handled in a principled and elegant manner to avoid error propagation and unnecessary resource consumption. Grounding the MLN over the input produces a factor graph that encodes a probability distribution over the extracted and inferred facts. We run a parallel Gibbs sampling algorithm on GraphLab to query this distribution. Initial experimental results show promising scalability of our approach.

1. Introduction

With the rapid growth of machine learning, statistical inference techniques, and big-data analytics frameworks, recent years have seen tremendous research interest in automatic information extraction and knowledge base construction. A knowledge base stores entities, facts, and their relationships in a machine-readable form so as to help machines understand information from humans. Currently, the most popular techniques for acquiring knowledge are automatic information extraction (IE) from text corpora (Carlson et al., 2010; Poon et al., 2010; Schoenmackers et al., 2010) and massive human collaboration (Wikipedia, Freebase, etc.). Though these approaches have proved successful in a broad range of applications, much can still be gained by performing inference over the acquired knowledge. For example, if Wikipedia pages state that kale is very high in calcium and that calcium helps prevent osteoporosis, then we can infer that kale helps prevent osteoporosis. However, this fact is stated on neither page and can only be discovered by inference.

Existing IE systems extract entities, facts, and rules automatically from the web (Fader et al., 2011; Schoenmackers et al., 2010), but due to corpus noise and the inherent probabilistic nature of the learning algorithms, most of these extractions are uncertain. Our goal is to facilitate inference over such noisy, uncertain extractions in a principled, probabilistic, and scalable manner. To achieve this, we use Markov logic networks (MLNs) (Richardson & Domingos, 2006), an extension of first-order logic that augments each clause with a weight. Clauses with finite weight are allowed to be violated, but with a penalty determined by the weight. Together with all extracted entities and facts, the MLN defines a Markov network (factor graph) that encodes a probability distribution over all inferred facts. Probabilistic inference is thus supported by querying this distribution.
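For reference, the distribution defined by an MLN over possible worlds x has the standard log-linear form given by Richardson & Domingos (2006):

\[
P(X = x) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_i w_i\, n_i(x)\Big),
\]

where w_i is the weight of the i-th clause, n_i(x) is the number of its groundings satisfied in world x, and Z is the normalization constant. Querying the marginal probability of a fact under this distribution is what the inference engine in Section 2.5 computes.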

The main challenge with MLNs is scalability. The state-of-the-art implementations, Tuffy (Niu et al., 2011) and Felix (Niu et al., 2012), partially solve this problem by using relational databases and task specialization. However, these systems work well only on small sets of relations and hand-crafted MLNs, and they are not able to scale up to web-scale datasets like ReVerb and Sherlock (see Section 3). To gain this scalability, we design a relational model that pushes all facts and rules into the database. Grounding is then reduced to a few joins among these tables, and the rules are applied in batches. This achieves much greater scalability than Tuffy and Felix.

The second challenge stems from extraction errors. Most IE systems assign a score to each extraction to indicate its fidelity, but they typically do no further cleaning. This poses a significant challenge to inference engines: without constraints, errors propagate rapidly.


Figure 1 shows an example of such propagation: starting from a single erroneous fact stating that Paris is the capital of Japan (extracted from a Wikipedia page about logical statements; see http://en.wikipedia.org/w/index.php?title=Statement_(logic)&diff=545338546&oldid=269607611), a whole set of incorrect results is produced. Errors come from diverse sources: incorrectly extracted entities and facts, wrong rules, ambiguity, the inference process itself, etc. In each rule application, if even a single erroneous fact participates, the result is likely to be erroneous. Worse, even if the extractions are correct, errors may arise due to word ambiguity. Removing errors early is a crucial task in knowledge base construction; it improves knowledge quality and saves computation resources for high-quality inferences. This paper introduces the ProbKB system, which aims at tackling these challenges, and presents our initial results over a web-scale extraction dataset.

2. The ProbKB System

This section presents our on-going work on ProbKB. We designed a relational model to represent the extractions and the MLN, and grounding algorithms in SQL that apply MLN clauses in batches. We implemented the system on Greenplum, a massively parallel processing (MPP) database system. Inference is done by a parallel Gibbs sampler (Gonzalez et al., 2011) on GraphLab (Low et al., 2010). An overview of the system architecture is shown in Figure 2.


Figure 2. ProbKB architecture.

The remainder of this section is structured as follows: Section 2.1 justifies an important assumption regarding Horn clauses. Section 2.2 introduces our relational model for MLNs. Section 2.3 presents the grounding algorithms in terms of this relational model. Section 2.4 describes our initial work on maintaining knowledge integrity. Section 2.5 describes our inference engine built on GraphLab. Experimental results show that ProbKB scales much better than the state of the art.

2.1. First-Order Horn Clauses

A first-order Horn clause is a clause with at most one positive literal (see http://en.wikipedia.org/wiki/Horn_clause). In this paper we focus on this class of clauses only, though Markov logic supports arbitrary clauses in general. This restriction is reasonable in our context since our goal is to discover implicit knowledge from explicit statements in the text corpus; this type of inference is mostly expressed as "if ... then ..." statements in human language, which correspond to Horn clauses. Moreover, due to their simple structure, Horn clauses are easier to learn and to represent in a structured form than general clauses. We use the set of Horn clauses extracted by Sherlock (Schoenmackers et al., 2010).

2.2. The Relational MLN Model

We represent the knowledge base as a relational model. Though previous approaches already used relational databases to perform grounding (Niu et al., 2011; 2012), they store the MLN model and the associated weights in an external file; at runtime, a SQL query is constructed for each individual rule using a host language. This approach is inefficient when there are a large number of rules. Our approach, in contrast, stores the MLN in the database so that rules are applied in batches using joins. The only component that the database does not support efficiently is the probabilistic inference engine, which needs many random accesses to the input data; this component is discussed in Section 2.5. Based on the assumption made in Section 2.1, we consider Horn clauses only.


Figure 1. Error propagation: how a single error source generates multiple errors and how they propagate further. Errors come from different sources, including incorrect extractions, wrong rules, and previously propagated errors. The fact that Paris is the capital of Japan is extracted from a Wikipedia page describing logical statements. All base and derived errors are shown in red.

We classified the Sherlock rules according to their sizes and argument orders and identified six rule patterns in the dataset:

p(x, y) ← q(x, y)                (1)
p(x, y) ← q(y, x)                (2)
p(x, y) ← q(x, z), r(y, z)       (3)
p(x, y) ← q(x, z), r(z, y)       (4)
p(x, y) ← q(z, x), r(y, z)       (5)
p(x, y) ← q(z, x), r(z, y)       (6)

where p, q, and r are predicates and x, y, and z are variables. Each rule type i has a table Mi recording the predicates involved in the rules of that type. For each rule p(x, y) ← q(x, y) of type 1, we have a tuple (p, q) in M1; for each rule p(x, y) ← q(x, z), r(y, z) of type 3, we have a tuple (p, q, r) in M3. We construct M2, M4, M5, and M6 similarly. These tables record the predicates only; the argument orders are implied by the rule types.
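As a concrete illustration (the predicate names are taken from the kale example in the introduction and are hypothetical, not actual Sherlock relations), the rule prevents(x, y) ← isHighIn(x, z), helpsPrevent(z, y) matches pattern (4) and is stored as a single tuple in M4:

  -- Hypothetical type-4 rule from the kale example; argument order is implied by the table.
  INSERT INTO M4 (p, q, r) VALUES ('prevents', 'isHighIn', 'helpsPrevent');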

Longer rules pose a problem: the number of argument-order patterns grows exponentially with the rule size, making it impractical to create a table for each pattern. We leave this extension as future work; our intuition is to record the arguments explicitly and use UDFs to construct the SQL queries. We have one more table, R, for relationships: for each relationship p(x, y) stated in the text corpus, we have a tuple (p, x, y) in R. The grounding algorithm is then easily expressed as equi-joins between the R table and the Mi tables, as discussed next.
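A minimal sketch of these two tables in SQL (the weight column and the exact column types are our assumptions; the id column corresponds to the fact identifiers used when factors are generated in Section 2.3, and the remaining columns follow the text):

  -- Facts p(x, y), both extracted and inferred.
  CREATE TABLE R (
    id SERIAL PRIMARY KEY,  -- fact identifier, referenced when generating factors
    p  TEXT,                -- predicate (relation) name
    x  TEXT,                -- first argument
    y  TEXT                 -- second argument
  );

  -- Type-3 rules p(x, y) <- q(x, z), r(y, z); other rule types have analogous tables.
  CREATE TABLE M3 (
    p TEXT,                 -- head predicate
    q TEXT,                 -- first body predicate
    r TEXT,                 -- second body predicate
    weight FLOAT            -- MLN weight of the clause (assumed column)
  );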

2.3. Grounding

We use type 3 rules to illustrate the grounding algorithm in SQL:

p(x, y) ← q(x, z), r(y, z)       (3)

Assume these rules are stored in table M3(p, q, r) and relationships p(x, y) are stored in R(p, x, y). Then the following SQL query infers new facts given a set of extracted and already inferred facts:

-- For each type-3 rule (p, q, r) and each pair of facts q(x, z), r(y, z), derive p(x, y).
SELECT DISTINCT M3.p AS p, R1.x AS x, R2.x AS y
FROM M3
JOIN R R1 ON M3.q = R1.p   -- body atom q(x, z)
JOIN R R2 ON M3.r = R2.p   -- body atom r(y, z)
WHERE R1.y = R2.y;         -- shared variable z
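One way to fold the newly derived tuples back into R so that the rule tables can be joined again (this exact INSERT formulation is our sketch, not code taken from the system):

  -- Insert newly inferred type-3 facts, skipping facts that are already present in R.
  INSERT INTO R (p, x, y)
  SELECT DISTINCT M3.p, R1.x, R2.x
  FROM M3
  JOIN R R1 ON M3.q = R1.p
  JOIN R R2 ON M3.r = R2.p
  WHERE R1.y = R2.y
    AND NOT EXISTS (
      SELECT 1 FROM R R0
      WHERE R0.p = M3.p AND R0.x = R1.x AND R0.y = R2.x
    );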

This process is repeated until convergence, i.e., until no more facts can be inferred. The following SQL query then generates the factors:

-- For each ground instance of a type-3 rule, emit the ids of the head fact and the two body facts.
SELECT DISTINCT R.id AS id1, R1.id AS id2, R2.id AS id3
FROM M3
JOIN R R  ON M3.p = R.p    -- head atom p(x, y)
JOIN R R1 ON M3.q = R1.p   -- body atom q(x, z)
JOIN R R2 ON M3.r = R2.p   -- body atom r(y, z)
WHERE R.x = R1.x AND R.y = R2.x AND R1.y = R2.y;
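Each result tuple (id1, id2, id3) identifies one ground instance of a rule and becomes a factor over the three Boolean fact variables x_{id1}, x_{id2}, x_{id3}. For a Horn clause with weight w, the factor takes the usual MLN form (Richardson & Domingos, 2006):

\[
\phi(x_{id1}, x_{id2}, x_{id3}) \;=\; \exp\big(w \cdot \mathbf{1}[\,x_{id2} \wedge x_{id3} \Rightarrow x_{id1}\,]\big),
\]

that is, the factor contributes weight w to the log-probability whenever the ground clause is satisfied.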

To see why this is more efficient than Tuffy and Alchemy, consider the first query. Suppose we first compute RM3 := M3 ⋈_{M3.q = R.p} R. Since the Mi tables are often small, we can use a one-pass join algorithm and hash M3 on q; then, as each tuple (p, x, y) in R is read, it is matched against all rules in M3 whose first body predicate is p. For the second join, RM3 ⋈_{RM3.r = R.p AND RM3.y = R.y} R, RM3 is typically much larger than M3, so we assume a two-pass hash-join algorithm, which starts by hashing RM3 and R into buckets using the keys (r, y) and (p, y), respectively. Then, for any tuple (p, x, y) in R, all join results can be formed in one pass by considering the tuples of RM3 in the corresponding bucket. As a result, each tuple is read into memory at most three times, and all rules are applied simultaneously. This is in sharp contrast with Tuffy, where tuples have to be read into main memory as many times as the relation appears in the rules.


Our performance gain over Tuffy also owes to the simple syntax of Horn clauses. In order to support general first-order clauses, Tuffy materializes each possible grounding of all clauses to identify the set of active clauses (Singla & Domingos, 2006). (For illustration, a non-Horn clause such as ∀x∀y (p(x, y) ∨ q(x, y)), or ∃x p(x), can be made unsatisfied only by considering all possible assignments of x and y.) This process quickly exhausts disk space when grounding a dataset with large numbers of entities, relations, and clauses like Sherlock. Our algorithms avoid this time- and space-consuming process by taking advantage of the simplicity of Horn clauses.

The result of grounding is a factor graph (Markov network). This graph encodes a probability distribution over its variable nodes, which can be used to answer user queries. However, marginal inference in Markov networks is #P-complete (Roth, 1996), so we turn to approximate inference methods. The state-of-the-art marginal inference algorithm is MC-SAT (Poon & Domingos, 2006), but given the absence of deterministic rules in our setting and the availability of efficient parallel implementations of the widely adopted Gibbs sampler, we use the latter and present an initial evaluation in Section 3.2.

2.4. Knowledge Integrity

Schoenmackers et al. (2010) observed a few common error patterns (ambiguity, meaningless relations, etc.) that might affect learning quality and tried to remove these errors to obtain a cleaner corpus to work on. This manual pruning strategy is useful, but it is not enough for an inference engine, since errors often arise and propagate in unexpected ways that are hard to describe systematically. For example, the fact "Paris is the capital of Japan" shown in Figure 1 is accidentally extracted from a Wikipedia page that describes logical statements, and no heuristic in (Schoenmackers et al., 2010) is able to filter it out. Unfortunately, even a single, infrequent error like this propagates rapidly along the inference chain. Such errors hamper knowledge quality, waste computation resources, and are hard to catch.

Instead of trying to enumerate common error patterns, we find it much easier to identify correct inferences: facts that are extracted from multiple sources, or inferred from facts from multiple sources, are more likely to be correct than others, and new errors are less likely to arise when we apply rules to this subset of facts. Thus, we split the fact table, moving the qualified facts to another table called beliefs; we call the remaining facts candidates. Candidates are promoted to beliefs once we become confident about their fidelity. The terminology is borrowed from NELL (Carlson et al., 2010), but we are solving a different problem: we are trying to identify a computationally safe subset of our knowledge base so that we can safely apply the rules without propagating errors.

To save computation even further, we adapt this workflow to a semi-naive query evaluation algorithm, which we call robust semi-naive evaluation to emphasize that inferences only occur among the facts most likely to be correct, so errors are unlikely to arise. Semi-naive query evaluation originates in the Datalog literature (Balbin & Ramamohanarao, 1987); the basic idea is to avoid repeated rule applications by restricting one operand to the delta records between two iterations.

Algorithm 1 Robust Semi-Naive Evaluation
  candidates = all facts
  beliefs = promote(∅, candidates)
  delta = ∅
  repeat
    promoters = infer(beliefs, delta)
    beliefs = beliefs ∪ delta
    delta = promote(promoters, candidates)
  until delta = ∅

In Algorithm 1, infer is almost the same as the grounding queries discussed in Section 2.3, except that the operands are replaced by beliefs and delta, which are potentially much smaller than the original R (a sketch is given below). The promote algorithm is what we are still working on; it takes a new set of inference results and uses them to promote candidates to beliefs. Our intuition is to exploit lineage and promote facts implied by multiple external sources, or to learn a set of constraints to detect erroneous rules, facts, and join keys.
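A minimal sketch of the delta-restricted join behind infer, using the type-3 rule table M3 from Section 2.3 (the table names beliefs and delta follow Algorithm 1; this exact query is our illustration rather than the system's code):

  -- Join rules against the newly promoted facts (delta) and the established beliefs.
  -- The symmetric case, with delta matching the second body atom, is handled analogously.
  SELECT DISTINCT M3.p AS p, D.x AS x, B.x AS y
  FROM M3
  JOIN delta   D ON M3.q = D.p
  JOIN beliefs B ON M3.r = B.p
  WHERE D.y = B.y;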

2.5. The GraphLab Inference Engine

The state-of-the-art MLN inference algorithm is MC-SAT (Poon & Domingos, 2006). For an initial evaluation and for access to existing libraries, though, our prototype uses a parallel Gibbs sampler (Gonzalez et al., 2011) implemented on the GraphLab (Low et al., 2010; 2012) framework. GraphLab is a distributed framework that improves upon abstractions like MapReduce for asynchronous iterative algorithms with sparse computational dependencies, while ensuring data consistency and achieving a high degree of parallel performance. We ran the parallel Gibbs algorithm on our grounded factor graph and obtained better results than Tuffy.
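The Gibbs update itself is the standard one for Markov networks: each fact variable X_l is resampled from its conditional distribution given its Markov blanket MB(X_l), which for an MLN reduces to (Richardson & Domingos, 2006)

\[
P\big(X_l = 1 \mid MB(X_l)\big) \;=\;
\frac{\exp\big(\sum_{i \in F_l} w_i f_i(X_l = 1)\big)}
     {\exp\big(\sum_{i \in F_l} w_i f_i(X_l = 0)\big) + \exp\big(\sum_{i \in F_l} w_i f_i(X_l = 1)\big)},
\]

where F_l is the set of ground clauses containing X_l, w_i is the clause weight, and f_i(X_l = v) is 1 if clause i is satisfied when X_l = v (with the rest of the blanket fixed) and 0 otherwise. Marginal probabilities are then estimated by counting how often each fact is true across samples.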


Though the performance looks good, one concern we have is coordinating the two systems. Our initial goal is to build a coherent system for the MLN inference problem.


Getting two independent systems to work together is hard and error-prone, and synchronization becomes especially troublesome: to query even a single atom, we need to obtain the output from GraphLab, write the results to the database, and run a database query to get the answer. Motivated by this, we are working on a shared-memory model that pushes external operators directly into the database. In Section 3.2, we present results from GraphLab only.

3. Experiments


We used extracted entities and facts from ReVerb (Fader et al., 2011) and applied Sherlock rules (Schoenmackers et al., 2010) to discover implicit knowledge. Sherlock learned its rules from TextRunner extractions, an older version of ReVerb, so there is a schema mismatch between the two. We are working on resolving this issue, either by mapping the ReVerb schema to Sherlock's or by implementing our own rule learner for ReVerb; for now we use 50K out of the 400K extracted facts to present initial results. The statistics of our dataset are shown in Table 1.

  #entities   #relations   #facts   #rules
  480K        10K          50K      31K

Table 1. ReVerb-Sherlock data statistics.

Experiment Setup. We ran all the experiments on a 32-core machine at 1400MHz with 64GB of RAM, running Red Hat Linux 4. ProbKB, Tuffy, and GraphLab are implemented in SQL, SQL (using Java as a host language), and C++, respectively. The database system we use is Greenplum 4.2.

3.1. Grounding

We used Tuffy as a baseline comparison. Before grounding, we removed some ambiguous mentions, including common first or last names and general class references (Schoenmackers et al., 2010). The resulting dataset has 7K relationships. We ran ProbKB and Tuffy on this cleaner dataset and learned 100K new facts. Performance results are shown in Table 2.

  System    Time/s
  ProbKB    85
  Tuffy     Crash

Table 2. Grounding time for ProbKB and Tuffy.

As discussed in Section 2.3, the reason for our performance improvement is the Horn clause assumption and batch rule application. In order to support general MLN inference, Tuffy materializes all ground atoms for all predicates. This makes the active closure algorithm efficient but consumes too much disk space: for a dataset with large numbers of entities and relations like ReVerb-Sherlock, disk space is exhausted even before grounding starts.

3.2. Inference

This section reports ProbKB's inference performance using GraphLab, again with Tuffy as the comparison. Since Tuffy cannot ground the whole ReVerb-Sherlock dataset, we sampled a subset with 700 facts and compared the time spent generating 200 joint samples for ProbKB and Tuffy. The result is shown in Table 3.

  System    Time/min
  ProbKB    0.47
  Tuffy     55

Table 3. Time to generate 200 joint samples.

The performance boost owes mostly to the GraphLab parallel engine. The experimental results presented in this section are preliminary, but they show both the need for better grounding and inference systems to achieve web scale and the promise of our proposed techniques for addressing the scalability and integrity challenges.

4. Conclusion and Future Work

This paper presents our on-going work on ProbKB. We built an initial prototype that stores MLNs in a relational form and designed an efficient grounding algorithm that applies rules in batches. We maintain a computationally safe set of "beliefs" to which rules are applied with minimal errors occurring, and we connect to GraphLab for a parallel Gibbs sampling inference engine. Future work is discussed throughout the paper and is summarized below:

• Develop techniques for the robust semi-naive algorithm to maintain knowledge integrity.
• Tightly integrate grounding and inference over MLNs using a shared-memory model.
• Port the SQL implementation of grounding to other frameworks, such as Hive and Shark.


References

Balbin, Isaac and Ramamohanarao, Kotagiri. A generalization of the differential approach to recursive query evaluation. The Journal of Logic Programming, 4(3):259–262, 1987.

Carlson, Andrew, Betteridge, Justin, Kisiel, Bryan, Settles, Burr, Hruschka Jr., Estevam R., and Mitchell, Tom M. Toward an architecture for never-ending language learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2010), volume 2, pp. 3–3, 2010.

Fader, Anthony, Soderland, Stephen, and Etzioni, Oren. Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1535–1545. Association for Computational Linguistics, 2011.

Gonzalez, Joseph, Low, Yucheng, Gretton, Arthur, and Guestrin, Carlos. Parallel Gibbs sampling: From colored fields to thin junction trees. Journal of Machine Learning Research, 2011.

Low, Yucheng, Gonzalez, Joseph, Kyrola, Aapo, Bickson, Danny, Guestrin, Carlos, and Hellerstein, Joseph M. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.

Low, Yucheng, Bickson, Danny, Gonzalez, Joseph, Guestrin, Carlos, Kyrola, Aapo, and Hellerstein, Joseph M. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716–727, 2012.

Niu, Feng, Ré, Christopher, Doan, AnHai, and Shavlik, Jude. Tuffy: Scaling up statistical inference in Markov logic networks using an RDBMS. Proceedings of the VLDB Endowment, 4(6):373–384, 2011.

Niu, Feng, Zhang, Ce, Ré, Christopher, and Shavlik, Jude. Scaling inference for Markov logic via dual decomposition. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pp. 1032–1037. IEEE, 2012.

Poon, Hoifung and Domingos, Pedro. Sound and efficient inference with probabilistic and deterministic dependencies. In Proceedings of the National Conference on Artificial Intelligence (AAAI), volume 21, pp. 458. AAAI Press, 2006.

Poon, Hoifung, Christensen, Janara, Domingos, Pedro, Etzioni, Oren, Hoffmann, Raphael, Kiddon, Chloe, Lin, Thomas, Ling, Xiao, Ritter, Alan, Schoenmackers, Stefan, et al. Machine reading at the University of Washington. In Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading, pp. 87–95. Association for Computational Linguistics, 2010.

Richardson, Matthew and Domingos, Pedro. Markov logic networks. Machine Learning, 62(1):107–136, 2006.

Roth, Dan. On the hardness of approximate reasoning. Artificial Intelligence, 82(1):273–302, 1996.

Schoenmackers, Stefan, Etzioni, Oren, Weld, Daniel S., and Davis, Jesse. Learning first-order Horn clauses from web text. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1088–1098. Association for Computational Linguistics, 2010.

Singla, Parag and Domingos, Pedro. Memory-efficient inference in relational domains. In Proceedings of the National Conference on Artificial Intelligence (AAAI), volume 21, pp. 488. AAAI Press, 2006.