If you saw my previous post you could have seen that I use some of the components from the GettingStarted project. Now the 3.1 has been released on OTN (so you do not have to change those 3.0 components, but just copy the new ones into your project):
A new feature with OEID v2.4 is to load data directly from an Oracle Business Intelligence (OBI) Server. For companies who are using OBI and have modeled their common enterprise information model in it this can be a very nice feature. One place to get the structured data from!
How do we load data from OBI into an index with Integrator? According to the documentation we go in the Integrator to the menu bar and choose: File > New > Project. Integrator will display the New Project dialog. Then expand the Information Discovery Node and select Load Data from OBI Server.
Unfortunately this does not appear in my screen:
I can only see the Load Metadata from a Record Store ...
This perhaps has to do with my quickstart installation? It appears that not all plug-ins are present. To fix this download the Integrator v2.4 from edelivery.oracle.com. You will find it in the Media Pack: Oracle Endeca Information Discovery Integrator (2.4) Media Pack for Microsoft Windows x64 (64-bit) .
Unzip the file. You will see the plug-in appear:
Copy this file to the plug-ins directory of Integrator:
<Endeca Install Dir>\Endeca\Discovery\2.4.0\Integrator\plugins
Restart Integrator and the Wizard will appear as stated in the documentation:
You can now follow the instructions in the Wizard documentation to load data from the OBI server. I suggest the first time to choose the option: 'Create new project'. Then it becomes clear how it works (no magic there) and we can use it in other projects.
When you choose the option 'Create new project' a complete project will be generated. If you run the graph Baseline a new index will be created and data loaded from the OBI Server. The 2 important and re-usable information pieces are: How the format of the BI Server query should look (the QueryStatement.sql in the Navigator pane) and what connection information is needed (Connections in the Outline pane):
First have a look at the connection. If you double click the connection in the Outline pane the connection pane opens:
As you can see it uses the Oracle BI JDBC driver which is embedded in the Plug-in. So if we want to create a connection in another project we have to choose the driver
oracle.bi.jdbc.AnaJdbcDriver and copy (or type in) the URL
jdbc:oraclebi://wvillano-nl:9703/ (where wvillano-nl should be your OBI Server name) upon creation of the connection.
What should be the format of a query against the OBI Server? Double click on the QueryStatement.sql:
It looks like the logical query OBI is normally producing. But there is a difference. It starts with:
SELECT_BUSINESS_MODEL. And that is also what it does! So the entities should match the Business Model. Here a piece of my rpd:
As you can see in my presentation layer the
Department is still called
DEPT and in my Business Model layer it is called
Department. The Integrator query uses the Business Model naming:
So be aware, if your presentation layer has different names for attributes/tables then in your business model, you cannot copy/paste the logical query created in the OBI Server log. Otherwise: just copy/paste the logical query put the command SELECT_BUSINESS_MODEL before it and it will run.
We can also create any query we want manually of course e.g. add filters to the query:
In the generated project the query is stored in a seperate file, but you can also enter it as
'SQL Query' in the DBInputTable Reader component in the graph.
For the edge metadata we can use the same query since we are not able (yet) to create the metadata via the graphical interface.
That should be enough information to use OBI Server as a source in your projects. I hope you like it too.
One of the unstructured sources for analyses with Oracle Endeca could be web content. Especially when doing some social media analyses for a customer the forum(s) definitely is/are one of the sources. This blog will show you how to crawl a forum with the CAS software.
When you install the CAS software (for installation instructions see the blog entry here), you also have a web crawler tool. The configuration and usage is different then crawling documents as described in the just mentioned blog entry, so here I will give a short example how to configure this. More options and information about the Web Crawler can be found in the Oracle documentation: here.
After the installation of the CAS software there are already sample scripts provided to configure a web content crawl. You will find such an example if you go to the directory: <CAS Install Dir>\3.0.2\sample\webcrawler-to-recordstore.
There you will find all necessary files to configure a web crawl. We take this as a starting point for our forum crawl. So make a copy of the directory "webcrawler-to-recordstore" (in the same directory: <CAS Install Dir>\3.0.2\sample) and give it a different name e.g.: "myfirstcrawl".
Go to the directory "myfirstcrawl". We will make modifications to the following files (details below):
In the file endeca.lst you can list all the URLs that you want to crawl. In my case I thought it would be nice to crawl the Dutch forum of the Liferay portal (because the number of posts can easily be handled :-) ):
But you can put there any (number of) URL(s).
In this file you can define what to do with URLs found on the page: you can specify to follow a URL and crawl it or skip certain URLs and not follow them.
When you hoover hyperlinks on the Liferay forum page you can see the URLs. We do not want to follow all URLs. For example: we do not want to follow the URL which is under the Statistics option (
www.liferay.com/community/forums/-/message_boards/statistics), but we do want to follow the category hyperlinks (e.g.:
We also want to follow the threads. So if you click on category "Algemeen" (which means in English: "General") you can see the threads. The threads have a different URL structure:
So, to instruct the crawler to follow the following URLs:
We replace the following code in the file crawl-urlfilter.txt:
The rest of the file we do not touch for now.
Here we can specify the record store to write to (among other things). So at the tag
<property> with the name tag
we can specify the record store name. If not existing it will be created automatically when running the crawl. My record store will be called rs-myfirstrs:
The last configuration has to be done in the run-sample.bat. Here we have to point to the right record store (in my case
rs-myfirstrs) and we can configure the depth of the crawl (so how deep we want to follow the URLs, starting with level 0). The wanted depth in my case is
2. Because Level 0 is: URL as mentioned in
endeca.lst, Level 1 is: All Category Level and level 2 is: All Thread content level:
These are the (minimum) configuration steps. We are now ready to run web crawl. Execute the
run-sample.bat. Be sure that the Endeca CAS Service is running. The crawl can take up to a few minutes. The result will be something like this:
To see the web content what has been crawled you can create a simple graph as also explained in the CAS installation blog here.
Now it is up to us to create better information out of the crawled data with all functionality available in CloverETL like text tagging, regular expressions, etc.
One last remark:
If you need a proxy to get access to the internet (most companies have one), you have to configure the file conf/default.xml. Change the
<!-- Proxy properties -->.
Maps are an attractive way of visualizing your data. The 'Map' component of EID is surprisingly easy to configure. The prerequisite is that the data in your datastore contains a longitude/latitude attribute. This is also not very difficult to achieve. First, let's add a map to the EID portal and see what happens...
Select 'Preferences' to observe that your Endeca Server data source must contain a valid geospatial attribute in order to use the map.
The geospatial attribute is a data type in Endeca Server that is designated to storing longitude/latitude combinations. See the Data Loading Guide for the list of data types (property types) that are supported. In this case we are interested in the 'mdex:geocode' type.
Now, let's go to Integrator where I for example have data that contains attributes called 'X' and 'Y', representing the latitude and longitude as a number. Here's a snap of the metadata.
Before we can feed this data to Endeca Server as a geospatial attribute, we have to make sure that it is transformed to a string with format '<latitude> <longitude>' (the two decimal numbers with a space in between). In order to achieve this we insert a transformation to our graph. The whole graph looks like this.
Suppose we call the new attribute 'LATLONG', the code for the transformation could look like this.
The expressions map the input fields X and Y onto the output field LATLONG. The example contains some checks for null values, but the actual conversion is done in the following expression.
$out.0.LATLONG = num2str($in.0.X,"#.#####") + " " + num2str($in.0.Y,"#.#####");
The 'num2str' function converts the decimal numbers to a 5-digit string. The '+' sign is the concatenation operator. Finally, note the space string that is put between the numbers.
So, how does Endeca Server know that the LATLONG attribute is of type 'mdex:geocode'? It doesn't. By default, it will be interpreted as text.
In order to set the property type, edit the metadata of the second edge, navigate and click on the LATLONG field. (If the field does not yet exist, add it.) Add a custom property by clicking on the green + sign on the right-hand side of the dialog.
In the next dialog enter the following values.
Confirm all your changes, run your graph. Go back to the front-end and observe the results in the map component. After pressing the 'Update data source' button, your map component will no longer show the error message and you are able to select the LATLONG attribute as the 'Geospatial filtering attribute'.
A mindblowing feature of Endeca (at least the first 50 times you see it) is the spelling correction and the so-called 'Did You Mean?' functionality. It comes out of the box and operates based on the data you've loaded into Endeca Server. Yes, this means no messing around with dictionary files!
So what is it?
- Spelling Correction - corrects typos when you use the search box. This results in an auto-correction of your search term. If your search term does not give any results and there is a term available in your data that looks like your term, the look-a-like will be used.
- Did you mean? - suggests terms that will have more resulting records. A click on the suggested term will change your search.
A prerequisite for these features is that you have defined a search interface, but then enabling them is very simple. After you've loaded your data, just navigate to the following url:
http://localhost:7770/admin/<your datastore name here>?op=updateaspell
Check the results, it's fun! Here are some examples (hint: one of these results is fake ;))
The spelling dictionary must be updated after each full load...You don't want to navigate to the url manually every time after an etl run, so let's embed it in a graph.
Add a 'HTTP connector' component from the palette to your graph (don't forget to set the phase correctly if needed to make sure the call is done after data loading into Endeca Server).
Edit its properties and enter the url in the URL property.
Variables are allowed as well, so something copy in something like this:
Wow, your EID app has just become even more amaaaaazing!
I've just loaded some twitter data into my EID environment and it's really funny to see the results. One of the obvious things to do with twitter feeds is to extract the #hashtags (and urls). This can be done using Integrator's 'Text Tagger - Regex' component.
What regular expression to use to exactly match a hashtag (or a url) on twitter is not as easy as you'd think. The folks at twitter have given some thought to it and published the API they are using to do it. Some information can be found here. The Java class that holds all their regular expressions is found here. Give it a quick glance and decide (like me) that for starters you will go for something simpler. Here's the expression I use to match the hashtags, it's not perfect but gives acceptable results because it just matches any word preceded by a '#'.
Now let's add the 'Text Tagger - Regex' component to a graph. Make sure you connect your source data to the input. In the following image I have my tweets stored in a database table and one of the fields contains the full text of the tweet.
Now, it's time to set the properties of the tagger. You have to specify the 'Source Field Name' obviously. The 'Target Field Name' can be any name of your choice. I'll get back to that. The 'Search Pattern' is your regex and you can use so-called capturing groups here to determine what you want to see as the outcome of the component. The capturing group that contains the whole match is group 0, so I am using this one as the value for the 'Render Pattern' field. Groups are referenced by the $ sign followed by the capturing group number. Finally, I've set the 'Multi-assign delimiter' field to the '|' symbol. This makes sure that multiple matches (hashtags) in the tweets are properly separated.
That's most of the work, we can now insert the results into Endeca Server. The usual steps should be applied for this, however I am doing something different now. For debug purposes Integrator (or in this case CloverETL) provides a component called 'Trash'. So instead of adding the data to Endeca Server, I am dumping it to the trashbin. This is what the graph looks like.
That leaves us with the metadata between the text tagger and the trashbin. The text tagger simply propagates all its input fields, along with the new target field which we previously named 'HASHTAGS'. This field must be added to the outgoing metadata. We can do that by duplicating the original metadata and editing it.
Because the tweets are separated by the '|', we can insert them into Endeca Server as a multi value attribute.
Note: the regex tagger is case-sensitive and there is no option to pass the CASE_INSENSITIVE flag like with the Pattern.compile method. In order to increase your matches, add a transformer to your graph before the tagger. This transformer can turn the whole tweet to lower case. Or is there any other option that can for example be embedded in the regular expression? Yeah, that's a question, because I don't know :). Any suggestions to improve the expression are welcome...
Some of the elements in Integrator will output multiple values in a single field. These values are separated by a special character. Examples are the 'Text Tagger' and the 'Text Enrichment' components. But it may also be present in your source data. Endeca Server is able to handle this and split the field into multiple fields.
First, an example. The following image shows a graph with a database table that is directly inserted into Endeca Server. (If you don't know how to set this all up, look here.) The data is also shown, notice that the field 'SKILLS' contains multiple values.
However, the nice thing about Endeca is that it is able to store the same attribute multiple times on the same record. The solution is easy, just tell Integrator what the delimiter is by setting it as a property of the 'Add/Update Records' object.
The 'Record Details' component shows the results somewhat better.
In a lot of cases it is probably useful to add some extra metadata to this column. For instance, setting its property 'system-navigation_Select' to 'multi-or' can be useful. More on setting attribute metadata is posted here.
Endeca Server requires a unique record identifier in its record stores. This identifier is called the 'Spec Attribute' of the store and it is the element that determines the granularity of the data store. There are as many records in a data store as there are distinct spec attributes. Supplying the Spec Attribute is the only data modeling that Endeca Server enforces on the user. You can do more (group attributes, assign display names, set sort orders, etc.) but you don't strictly have to. The image below shows the 'Add/Update Records' element in Integrator and its property screen with the Spec Attribute element.
One of the great things about Endeca is that you can integrate different datasets with different structures but this contradicts somewhat with the need for a unique record identifier. If a datasource contains a unique property, you might be able to use it but this will surely not be available across multiple sources and especially not be available across sources like webpages and twitterfeeds. Fortunately, even in the revolutionary and changing world of information discovery, some solutions from the 'old world' are still valid.... Allow me to introduce: the sequence for generating a surrogate key :).
Adding a sequence in Integrator is done by first creating it in the Project Outline pane.
Now, the best place to add the sequence is just before the element that inserts your data into the server. First add a 'Reformat' object to your graph and make sure your data flows into it. In the image below, I've got a simple database input table connected to the Reformat object and added the appropriate metadata to the edge between them. After that I've connected the Reformat object to the 'Add/Update Records' object. The dotted line shows that there is no metadata assigned to this second edge yet.
Go to the Edit screen of the Reformat object and edit the 'Transform' property by clicking on the dots button '...'. In the next screen, make sure you map your source columns to pass through the transformation as you desire. The easiest way to do this is by dragging all the columns from the input pane at the left to the output pane on the right.
The result will look like the image below, all columns are mapped to columns in the output pane on the right-hand side. In order to add the sequence, navigate to the 'Sequences' tab in the middle lower pane. Select your sequence and drag it into the Transformations pane.
Create a new output column by pressing the top-right '+' icon. Rename it to a suitable name and drag the sequence expression from the Transformation pane onto the new output column. This is all that is needed to create the transformation that uses the sequence. You can inspect the Java code that was generated by these steps by navigating to the 'Source' tab, the code should be self-explanatory. Finally press 'OK'. The Integrator will complain about some syntax errors, but you can ignore these.
Your graph is finished, the metadata for the second edge is automatically generated, as you can tell from the fact that the edge is no longer drawn with a dotted line. You can inspect the metadata to see if it really contains the table columns and the new sequence column.
Items created in the [Outline] of a project are bound to a graph. If you want to reuse those items (like metadata of a table or connection to a database) you can either copy/paste it or Externalize it. If you Externalize it you can reuse the item throughout the complete project.
To externalize an item right mouse click on the item:
On the appearing screen click <ok>. The externalized connection will now appear in the [Navigator] pane (in folder [conn]) and can be reused in the project i.e. by drag/drop Scott.cfg on a DBInputTable of a graph :
After installing the Quick Start and running through the provided content, we are ready to create some analyses ourselves. This blog entry is not meant to be a reference approach, but just to show the basic steps to get a quick result.
What I want to do is read data from an Oracle database table, load this into the Endeca Server and examine the results in Latitude Studio. For simplicity all done with a minimum of configuration. In next blog entries we will load more data from a different source and shape and do more configuration.
So, let's get started.
First start Latitude Data Integrator Designer and except the default Workspace. You will probably see one project: Latitude_QuickStart. Create a new project: Go to [File] -> [New] -> [CloverETL Project]
Give the project a name like 'FirstSteps' and click <Finish>.
Then we want to create an ETL process to load data into a MDEX. Such a process is called a 'graph'. So right click on 'graph' and select [New] and then [ETL Graph]:
Type a name e.g.: Load_Emp click <next> and then <finish>. You should now see in your graph folder: Load_Emp.grf.
Double click on Load_Emp.grf this will open the graph in the right canvas (it will be empty) and also show the Palette on the right with the ETL options.
The table I want to read is EMP in the SCOTT schema in the Oracle database. If you have not the schema in your database here is the DDL and data for the table: emp .
Create a connection to the database by going to the lower left pane [Outline] and right click on [Connections] then hoover [Connections] and select [Create DB Connection].
The following screen will appear (without the filled in details):
Fill in the [name] of the connection (can be any, mine is Scott), then [user] and [password]. If you then select Oracle at [JDBC Specific] and also Oracle from the [Available drivers] it will fill the [URL] with the generic syntax. Replace the hostname, port and database with your environment data. Press <Validate connection> when successful click <Finish>.
In the Outline pane you will now see under Connections the newly created connection with the given name. Now we will import the metadata of the table EMP. Right click in the Outline pane on [Metadata] select [New metadata] and then [Extract from database]:
In the appearing screen select the newly created connection at [Connection] and all details will be filled in:
Click <Next>. Select [Database schema] SCOTT and then click on the table name [EMP] and then [Generate query]. In the query pane will appear:
select * from "EMP" :
Click <Next> and <Finish>.
This will result in the metadata definition of EMP under Metadata in the Outline pane:
Now we will add a table reader to the canvas via drag/drop of [DBInputTable] from the [Readers] component.
Add a [Bulk Add/Replace Records] from the Latitude component.
Draw a line (called [Edge] in the upper part of the Palette component) from the [DBInputTable] to the [Bulk Add/Replace Records] component on the canvas:
This is the basic graph (ETL process) where we still have to apply the source information and metadata on.
For the source information Double click the upper [DBInputTable]. Input should be:
[DB connection]: Scott
[SQL query]: select the EMP table (the same way as creating the metadata step)
[Component name]: EMP
To apply the metadata drag/drop EMP from the [Metadata] part of the [Outline] pane on the dashed arrow going from the [DBInputTable reader] called EMP to the connector of the [Bulk Add/Replace Records] component. The dashed line will be replaced by a solid line.
Now we have to tell Endeca what uniquely identifies a record (primary key). In this case it is column EMPNO: double click on the [Bulk Add/Replace Records] component and type EMPNO at the row [Spec Attribute]. Click <ok>.
Init the MDEX and then Reset the MDEX (as stated in this blog post) so that we are sure we start with a clean MDEX. Double click again on the graph Load_Emp.grf to make sure it is active. Start the graph by clicking on the green 'play' button:
After succesful execution of the graph you will see, in blue, how many records have been processed (in this case 14):
Create Analysis/Result Page
Now we want to see the result in Latitude Studio. Start Latitude Studio Server if you have not done so (via the Windows Start Menu). Login to Latutide Studio with userid/password: firstname.lastname@example.org/test. Click on [Add Page] on the right side of the screen and name the page FirstSteps. Click on the just created tab FirstSteps. Click in the upper right corner on the control panel and select [Add component]:
In the popped up box Click on [Latitude], drag/drop [Breadcrumbs] to the appearing left canvas, drag/drop [Guided Navigation] below the [Breadcrumbs] and drag/drop [Results Table] to the right canvas. Now we will set the preferences for the component [Results Table]: click the button with the dots and click on [Preferences]:
Select columns by checking the [Lock Column] rows:
Click on <Save Preferences> on the bottom of the component and then <Return To Full Page> at the top of the component.
The result should be like this:
That's it! These are the base steps to go from a database table to an Endeca analysis page.