- CAS is now called IAS
- default port was 8500 and is now 8510
- A base for a crawl configuration file can now be found with the basic install: <IAS Install Dir>\3.0.0\sample\crawlConfigFiles\fileSystemCrawl.xml. The changes to this file are as defined in the blog.
In this blog entry we will do a basic install and configuration of the Content Acquisition System (CAS) that is available with Oracle Endeca. With the CAS you can extract, enrich and integrate unstructured content from network file systems, web sites and content management system (CMS) repositories. Separate software licenses may be needed to use the available connectors.
You can download the software (Oracle Endeca Content Acquisition System 3.0.2 for Microsoft Windows x64 (64-bit)) from edelivery.oracle.com. The documentation can be found here.
Before you can install the CAS you need a user with Administrator privileges and a Policy setting 'Log on as a service'. You can create a new user according to the (CAS) documentation or (the easy way) add the Policy setting to the current user (assuming you are logged in as an Administrator). To add the Policy go to: Control Panel/Administrative Tools/Local Security Policy/Local Policies/User Rights Assignment. Then go to 'Log on as a service' and add the user (who will install the CAS) to the Policy.
Start the downloaded CAS installer. Click <next> and <next>. At this point we don't want to use the Console as a Workbench extension. It requires more installation and configuration which will be covered in a future post. The Console would give a graphical user interface. In this post we will use the command line interface.
Exclude the extension:
Click <Next>. Give a path to where you want to install the CAS software.
Click <Next>. Give the credentials of the account which will be used to run the CAS server:
Click <Next>. Then leave the default ports as is (default ports are CAS Server port: 8500 and Shutdown port: 8506).
Click <Next> and then <Finish>.
After installation you will notice the Endeca CAS Service in the Windows Services pane:
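If you prefer to verify from a script that the service is actually listening, a plain TCP connect is enough. The snippet below is only a minimal sketch: it assumes the CAS server runs on localhost with the default port (8500 in this post, 8510 for IAS), and the helper name `port_is_open` is made up for this example.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError when nothing accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 8500 is the CAS default used in this post, 8510 the newer IAS default
    for port in (8500, 8510):
        print(port, port_is_open("localhost", port))
```

An open port only tells you that something is listening there; it does not prove that it is the CAS service.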
Now we are ready to crawl some content. Let us start with an easy one: crawl a local directory with documents. When a document is crawled, the output (the content of that document) can be an XML file on the file system or a record in a so-called Record Store Instance. This Record Store is created and maintained by the CAS system. A Record Store can be queried with the Integrator, which can then feed Endeca. Below you will see the components of the CAS system (taken from the documentation).
We now have the CAS Service in place and the Web Crawler. We did not install/configure the CAS Console at this moment.
To crawl a directory for the contents of the documents we have to define a crawl and then run the crawl. A crawl is defined via a configuration file. An example which we will now use can be found here: Output Record Store. With this configuration file the extracted content of documents will be stored in a Record Store Instance.
Some changes you might want to make in the configuration file:
The crawl id is used to identify the crawl. In this case I have named the crawl: rs_crawl.
Then you have to point to the directory which contains the documents. This is done in the moduleProperty key seeds tag:
In my case the directory "d:\temp\crawl" contains some documents (.pdf, .xls and .doc).
Define the host of the record store. This is done in the moduleProperty key host tag:
In my case my local machine is wvillano-nl.
The last configuration which is important to know is the name of the record store. This is set in the moduleProperty key instance tag:
In this example the record store is called rs_output. The record store will be created automatically by just defining it here.
All other properties can be found in the documentation, but for now we leave them all to the default values.
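Before defining the crawl it can help to check that the seed directory really contains what you expect. The sketch below just lists the documents in a folder; it does not talk to the CAS at all. The function name `list_seed_documents`, the extension filter and the recursive walk are my own choices for this example and say nothing about what the CAS itself will pick up.

```python
from pathlib import Path

# Extensions of the sample documents mentioned in this post
DOC_EXTENSIONS = {".pdf", ".xls", ".doc"}


def list_seed_documents(seed_dir, extensions=DOC_EXTENSIONS):
    """Return a sorted list of files under seed_dir with one of the given extensions."""
    root = Path(seed_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


if __name__ == "__main__":
    # d:\temp\crawl is the seed directory used in this post
    for doc in list_seed_documents(r"d:\temp\crawl"):
        print(doc)
```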
Save the configuration file (e.g. in d:\temp\config\crawlconfig_rs.txt). Now we can define a crawl with a command line option. Be sure the CAS service is running and be aware that the command line options are case sensitive:
<your local directory of CAS installation>\bin\cas-cmd.bat createCrawls -f <your saved configuration file including path>. So for me that is:
d:\oracle\OEID\CAS\3.0.2\bin\cas-cmd.bat createCrawls -f d:\temp\config\crawlconfig_rs.txt
To check if the crawl has been created you can use the command listCrawls: <your local directory of CAS installation>\bin\cas-cmd.bat listCrawls
Everything is ready to start crawling the directory with contents. Give the command startCrawl:
<your local directory of CAS installation>\bin\cas-cmd.bat startCrawl -id <crawl identifier>
In our case the crawl identifier is rs_crawl, so the command would be: d:\oracle\OEID\CAS\3.0.2\bin\cas-cmd.bat startCrawl -id rs_crawl. To stop crawling the directory use the stopCrawl command. Otherwise the CAS service will continue to scan the directory for updates.
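If you run these commands often, you could wrap them in a small script. The following is a sketch, not part of the CAS tooling: the helper names are invented here, and only the createCrawls, listCrawls and startCrawl calls with the options shown above are built. I left stopCrawl out because its arguments are not shown in this post.

```python
import subprocess
from pathlib import PureWindowsPath


def cas_cmd(cas_home, task, *args):
    """Build the argument list for one cas-cmd.bat call (task names are case sensitive)."""
    exe = PureWindowsPath(cas_home) / "bin" / "cas-cmd.bat"
    return [str(exe), task, *args]


def create_crawl_cmd(cas_home, config_file):
    return cas_cmd(cas_home, "createCrawls", "-f", config_file)


def list_crawls_cmd(cas_home):
    return cas_cmd(cas_home, "listCrawls")


def start_crawl_cmd(cas_home, crawl_id):
    return cas_cmd(cas_home, "startCrawl", "-id", crawl_id)


def run(cmd):
    """Execute a command built above; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cas_home = r"d:\oracle\OEID\CAS\3.0.2"  # install directory used in this post
    # Only print the commands here; pass them to run() on the machine where CAS is installed
    print(create_crawl_cmd(cas_home, r"d:\temp\config\crawlconfig_rs.txt"))
    print(list_crawls_cmd(cas_home))
    print(start_crawl_cmd(cas_home, "rs_crawl"))
```

Building the commands as lists instead of one long string avoids quoting problems when a path contains spaces.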
The contents of the documents should now be in the Record Store Instance. Let's have a look how to use them in Oracle Endeca Integrator. So start up Integrator. Open an existing project or create a new one. We need to add the Record Store Instance metadata to that project. To do so go to [File]->[New]->[Other]. Then pick: [Load Metadata from a Record Store]:
You only need to point to the right Record Store Instance. In this case: rs_output and leave the rest defaulted:
To see the results of the crawl we create a graph with the following components:
From the [Discovery] pane the component [Record Store Reader], from the [Writers] pane the component [Trash]. Draw an edge between the components. Drag/drop the newly created metadata on the edge. You will find the crawl metadata in the project under [meta]:
Double click on the component [Record Store Reader] and fill in the Record Store Instance. In our case: rs_output. Do a right mouse click on the edge and select [Enable debug].
Now run the graph. After successful completion right click on the edge and select [View data]. Then click <OK>. You can now see the results of the crawl.
The results are:
Where you can see the recognized file types (Endeca_Document_Type) and the extracted content (Endeca_Document_Text). You will also see other metadata about the documents like: file system extension, document name, path, file size, etc.
Of course these are the basic steps. Perhaps you want to create a CAS Manipulator (see CASDevGuide documentation) to manipulate the contents or create additional Integrator graphs to work with the content ... This sounds like a future blog.
While all EID downloads are still only available through edelivery.oracle.com, I just noticed that the documentation is available on OTN. This is useful because some - but not all - documentation is available on edelivery.oracle.com.
So, for documentation go to:
For software downloads go to:
Just after Wim posted this morning that 2.3 will be released soon, soon became now :)
Go to https://edelivery.oracle.com and Sign In with a valid Oracle account.
The most important new features are:
- New version of CloverETL with corresponding improvements
- A new Text Tagger component that allows text tagging based on white lists and regular expressions
- A new Text Enrichment with Sentiment Analysis component to interact with Lexalytics/Salience
- New CAS Record Store Reader component
- Shipped with a central hub for managing 'data stores' (aka indexes/mdexes/dgraphs).
- Improvements to LQL (supports more functions, joins)
- Allows you to create LQL-based views that can be reused between components
- Maps are back in Studio, now based on maps.oracle.com
Oracle has put some very interesting "Getting Started" screencasts on Youtube about Oracle Endeca: http://www.youtube.com/playlist?list=PLF23635ACA47F1E6D&feature=plcp.
They are based on the soon-to-be-released Oracle Endeca Information Discovery v2.3.